Vessel segmentation for X-ray coronary angiography using ensemble methods with deep learning and filter-based features

Background: Automated segmentation of coronary arteries is a crucial step for computer-aided coronary artery disease (CAD) diagnosis and treatment planning. Correct delineation of the coronary artery is challenging in X-ray coronary angiography (XCA) due to the low signal-to-noise ratio and confounding background structures.

Methods: A novel ensemble framework for coronary artery segmentation in XCA images is proposed, which utilizes deep learning and filter-based features to construct models using the gradient boosting decision tree (GBDT) and deep forest classifiers. The proposed method was trained and tested on 130 XCA images. For each pixel of interest in the XCA images, a 37-dimensional feature vector was constructed based on (1) the statistics of multi-scale filtering responses in the morphological, spatial, and frequency domains; and (2) the feature maps obtained from trained deep neural networks. The performance of these models was compared with those of common deep neural networks on metrics including precision, sensitivity, specificity, F1 score, AUROC (the area under the receiver operating characteristic curve), and IoU (intersection over union).

Results: With hybrid under-sampling methods, the best performing GBDT model achieved a mean F1 score of 0.874, AUROC of 0.947, sensitivity of 0.902, and specificity of 0.992; while the best performing deep forest model obtained a mean F1 score of 0.867, AUROC of 0.95, sensitivity of 0.867, and specificity of 0.993. Compared with the evaluated deep neural networks, both models had better or comparable performance for all evaluated metrics with lower standard deviations over the test images.

Conclusions: The proposed feature-based ensemble method outperformed common deep convolutional neural networks in most performance metrics while yielding more consistent results. Such a method can be used to facilitate the assessment of stenosis and improve the quality of care in patients with CAD.

oxygen and nutrient supplements, ultimately leading to myocardial ischemia and infarction [4].
X-ray coronary angiography (XCA) is the gold standard for CAD diagnosis [5]. By releasing dye into the coronary vessels and inspecting its flow through the vessel structure via 2D projections, XCA helps clinicians locate potential stenoses, visually measure their severity, and determine the appropriate interventional therapies [6].
Visual stenosis assessment, however, is often unreliable: it tends to overestimate severe blockages while underestimating mild ones [7,8] and has high intra- and inter-observer variability [9,10]. To evaluate the lumen diameter more objectively, quantitative coronary angiography (QCA) was introduced [11] to offer a (semi-)automatic analysis of XCA. QCA analysis involves frame selection, vessel segmentation, stenosis positioning, and quantitative measurement [12,13]. The vessel segmentation step of QCA is a prerequisite for calculating the percentage of arterial stenosis. Moreover, the correct delineation of coronary arteries plays an important role in center-line extraction, which is used for 3D reconstruction of blood vessels [14], vessel tracking [15], and cardiac dynamics assessment [16].
Due to the nature of XCA images, segmenting vessels accurately is challenging. First, XCA images usually are of low resolution, have low signal-to-noise ratios, and exhibit low contrast between the vessel structure and background region [17][18][19]. Second, the presence of irrelevant structures such as the catheter, diaphragm, and the spine is confounding and leads to non-uniform illumination within the images [20]. Third, the various angles from which the 3D vessel structure is projected to form 2D XCA images create twisted and overlapping vessels, making the segmentation even more challenging [21].
To overcome these difficulties and aid in the quantitative diagnosis of CAD, efforts have been made to develop both supervised and unsupervised methods for automatic coronary vessel segmentation.
Unsupervised methods can be primarily categorized as tracking-based, model-based, or filter-based [22]. Tracking-based methods [23] choose seed points on the edges and the center-lines of vessels, then take a small step in the direction of the vessel to look for the vessel edges or the center-lines nearby. When new edges are found, an estimate of vessel direction is made to take the next step in this search direction. Model-based methods [24][25][26][27][28][29] use deformable models or region growing to evolve the segmentation towards the vessel-background boundaries based on the forces and constraints defined by energy functions. They may also apply growing conditions defined by similarity functions together with a threshold parameter. Both tracking-based and model-based methods require initial seeds for segmentation and are therefore sensitive to initialization. Although they tend to maintain good segmentation continuity for the vessel tree structure, they may fail to handle confounding elements in the background that are adjacent to the vessels. Filter-based methods [17,30] apply a variety of filters for non-uniform background intensity balancing, suppression of irrelevant structures, noise reduction, and vessel enhancement. The filtered images can be later processed with thresholding techniques for segmentation mask generation. Due to their ease of implementation and their ability to mitigate illumination problems, filter-based methods have also been employed extensively as preprocessing steps in both supervised and unsupervised methods for automated coronary vessel segmentation [29,[31][32][33][34][35][36]. However, they are usually insufficient for use on their own, as they are sensitive to background structures and may not perform well on vessel junctions and bifurcations [37].
Coronary artery segmentation using supervised methods can be considered as a pixel-wise classification problem, with most current methods utilizing neural networks. Cervantes-Sanchez et al. [31] trained a multilayer perceptron with XCA images enhanced by Gaussian matched filters and Gabor filters. Nasr-Esfahani et al. [32] presented a multi-stage model in which a convolutional neural network (CNN) extracted local, contextual, and edge-based information that was then combined via a final fully connected layer. Recently, deep learning approaches have gained popularity in segmenting both major arteries and full artery trees from XCA images. Samuel and Veeramalai [38] proposed a Vessel Specific Skip Chain Network by adding two vessel-specific layers to the VGG-16 network [39]. Jo et al. [33] developed a two-stage CNN specifically for left anterior descending artery segmentation, where the first stage located candidate areas of interest and the second stage generated the segmentation mask. Iyer et al. [40] designed an angiographic processing network that learned how to preprocess the XCA images with the most suitable filters for local contrast enhancement. The preprocessed images were then fed into DeeplabV3+ [41] for segmentation. Shi et al. [42] developed a generative adversarial network for major branch segmentation with a U-Net generator and a pyramid-structure discriminator, reporting improved connectivity for the segmented mask. Yang et al. [43] replaced the backbone of U-Net with an ImageNet pre-trained ResNet [44], InceptionResNetv2 [45], or DenseNet [46] for main branch segmentation. Fan et al. [47] modified U-Net so that the proposed structure can receive both the target and registered background images before dye release as inputs for generating segmentation masks.
The network structure proposed by [48] receives multi-channel inputs by adding a 3D convolution layer to the U-Net encoder, exploiting the temporal information in three consecutive frames from angiographic image sequences to produce a segmentation mask for the middle frame. Zhu et al. [49] applied the Pyramid Scene Parsing Network, a network proposed by [50], for coronary vessel segmentation. They took advantage of the network structure to incorporate features from multiple scales by pyramid pooling and used transfer learning to avoid overfitting on a small training set. Supervised methods for coronary artery segmentation may focus on the major coronary arteries, with which clinicians are most concerned, instead of the entire arterial tree. Network-based supervised methods have a number of drawbacks, including overfitting when the training set is small, weaker interpretability compared to unsupervised filter-based methods, and an inability to ensure connectivity within their prediction masks. However, supervised methods require less manual input and are more robust in discriminating background structures such as the catheter and spine than unsupervised methods.
In this paper, a novel ensemble framework for coronary artery segmentation is proposed that employs gradient-boosting decision tree (GBDT) [51] and Deep Forest classifiers [52]. The GBDT is a popular machine learning technique that combines weak decision tree learners for loss function minimization. When constructing a GBDT model, a series of trees is built wherein each new weak decision tree attempts to correct errors from the previous stage. The Deep Forest classifier, on the other hand, is a deep ensemble model that uses non-differentiable modules to form deep learning structures. Unlike deep neural networks, it does not apply back-propagation for training, but it still uses multiple layers (with cascade structures) for processing and applies in-model feature transformation. However, GBDT boosts the performance of weak learners gradually in a sequential and additive way, while in Deep Forest, random forests composed of decision trees are considered as a subroutine stacked by layers, with layer outputs feeding into another layer to create depth. Although neither GBDT nor Deep Forest has been applied to XCA image segmentation, both have recently been employed in medical image analysis in different image modalities [53][54][55][56][57][58]. Ensemble methods have produced promising results in retinal vessel segmentation [59][60][61][62][63][64] but have not, as far as we know, been applied to coronary artery segmentation yet. In this study, 16 deep learning features obtained from the last layer of the DenseNet-backbone U-Net decoder were combined with 21 multi-scale statistics on responses to a diverse range of filters to construct a 37-dimensional feature vector for each pixel in the input XCA image for training coronary artery segmentation models with GBDT and Deep Forest.
The proposed work takes advantage of both decades of classical computer vision research along with contemporary machine learning and deep learning techniques by employing a diverse set of reliable, well-established, hand-crafted features together with features from a deep structure for ensemble model training. Additional novelties come from the extraction of multiple statistics from the scale-space profile of a filter response and the adoption of a deep ensemble model on coronary artery segmentation.
The remainder of the paper is organized as follows. "Methods" section introduces the datasets used in this study and describes the methods employed for feature extraction, the under-sampling of imbalanced training classes, and model training, testing, and evaluation. "Results" section reports the effect of under-sampling on the training set, the performance of models constructed using different classifiers, and the analysis of feature importance, while "Discussion" section provides interpretations of the results and describes limitations and future directions of the current work.

Methods
In the following subsections, the datasets used for the study, the feature extraction techniques (using filter-based and deep learning methods), and the under-sampling methods employed are first introduced, after which the training of ensemble classifiers is explained. A schematic diagram of the proposed method for coronary artery segmentation is depicted in Fig. 1.

Dataset
The study was conducted with de-identified angiograms from two sources: "Dataset 1", collected from the University of Michigan Hospital, Ann Arbor, MI and "Dataset 2", collected from a hospital in the United Kingdom. Both datasets comprise angiograms from patients suspected of having coronary artery disease who underwent invasive coronary angiography. Dataset 1 contains three subsets: 1-17, 1-19, and 1-AVI. Angiograms within subsets 1-17 and 1-19 were collected in 2017 and 2019, respectively, and stored in DICOM (Digital Imaging and Communications in Medicine) format, while angiograms in 1-AVI were stored in AVI (Audio Video Interleave) format.
Patients were excluded if any of the following occurred: incomplete injection of contrast dye, percutaneous coronary intervention, an implanted pacemaker or cardioverter defibrillator, or the presence of artificial objects other than the dye injection catheter. Ultimately, 130 angiogram sequences from 130 patients were included in this study. Eighty of them visualize the left coronary artery (LCA) while the remaining 50 depict the right coronary artery (RCA). The number of frames in each sequence ranges from 43 to 150, with an average of 86 frames per sequence.
As the entire vascular tree is not always visible in all frames, frames were selected from XCA videos according to three criteria: (1) the selected frame contains the full injection of contrast agent; (2) there is minimal cardiac motion between adjacent frames; and (3) the full coronary artery is visualized in the frame. The frames are gray-scale images with a resolution of 512 × 512 pixels. Segmentation masks used for training and testing were first generated manually using Adobe Photoshop CS and later validated by experienced cardiologists. Catheter diameter size (measured by pixel number and referred to as "Cath" in later sections) was recorded along with the annotation. Only those vessels whose diameters were greater than or equal to 0.75× Cath were annotated. The mean diameter of Cath was 8.0 (± 1.2) pixels. A summary of the dataset information is listed in Table 1. For angiograms whose acquisition angles are available, the information is depicted in Fig. 2. "Caudal" and "Cranial" refer to the caudal and cranial angulation of the X-ray beam.

Feature extraction with filters Scale-space theory and the Z-profile of filter responses
Structures that a filter can extract from an XCA image depend on the scale of observation. A single scale is not always sufficient for capturing vessel structures of varying sizes. The scale-space theory [65] provides a framework for automatic scale selection in image filtering by applying multiple scales for image representation and summarizing filter responses across scales [66]. Based on this theory, the Z-profile of pixel-wise filter responses is constructed with four summary statistics: the maximum, mean, variance, and interquartile range of multi-scale responses. For example, Fig. 3 illustrates the filter response of Frangi filters [67] over ten different scales and the Z-profile thus obtained.
Scale ranges (σ ∈ Σ) were selected to be relative to the physical constraints of coronary arteries, ranging from 0.66×Cath to 6.33×Cath on a logarithmic scale.
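The Z-profile construction described above can be sketched as follows. This is a minimal NumPy/SciPy illustration, not the paper's implementation: Gaussian smoothing stands in for any scale-dependent vessel filter (the paper uses Frangi, matched, and other filters), and the function and variable names are hypothetical.

```python
import numpy as np
from scipy import ndimage

def z_profile(image, filter_fn, scales):
    """Summarize per-pixel filter responses across scales with the four
    Z-profile statistics: maximum, mean, variance, and interquartile range."""
    stack = np.stack([filter_fn(image, s) for s in scales])  # shape (S, H, W)
    q75, q25 = np.percentile(stack, [75, 25], axis=0)
    return {
        "max": stack.max(axis=0),
        "mean": stack.mean(axis=0),
        "var": stack.var(axis=0),
        "iqr": q75 - q25,
    }

# Scales spaced logarithmically relative to the catheter diameter (Cath),
# from 0.66*Cath to 6.33*Cath, as described in the text.
cath = 8.0
scales = np.geomspace(0.66 * cath, 6.33 * cath, num=10)

img = np.random.rand(64, 64)
# Gaussian smoothing is a stand-in here for any multi-scale vessel filter.
profile = z_profile(img, lambda im, s: ndimage.gaussian_filter(im, s), scales)
```

Each of the four maps then contributes one feature per pixel, so a single filter family yields a 4-dimensional Z-profile.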

Preprocessing
XCAs are often of poor quality due to image noise. To enhance the visibility of vessels within the frame, standard computer vision and image processing techniques were used to construct the preprocessing pipeline. First, metadata from the DICOM files was used to exclude the border regions (pixels outside of the imaging window) from analysis and to obtain catheter size information. Then, the filter scales were set as described in "Scale-space theory and the Z-profile of filter responses" section. In cases where the aforementioned metadata was unavailable, the mean Cath value ("Dataset" section) was used. After that, a non-local mean filter [68] was applied for noise reduction. Following this step, contrast adjustment (Algorithm 1) using Top-bottom-hat filtering [69] was employed to reconstruct the image. Let I : Ω → R be the H × W image with pixel coordinates given by (x, y) ∈ Ω = {1, 2, . . . , H} × {1, 2, . . . , W} and SE_σ denote the structuring element (SE) with scale σ. The Top-hat filtered image I_top is defined as the maximum of the differences between the input image I and its SE opening over σ ∈ Σ, while the Bottom-hat output I_bottom is defined as the maximum of the differences between the SE closing and I over σ ∈ Σ, that is,

I_top(x, y) = max_{σ ∈ Σ} [ I(x, y) − (I ∘ SE_σ)(x, y) ],
I_bottom(x, y) = max_{σ ∈ Σ} [ (I • SE_σ)(x, y) − I(x, y) ],

with (∘) and (•) denoting morphological opening and closing respectively (see Appendix 6.1 for the definitions of these operations). The Top-bottom-hat enhanced image is then generated as

I_enhanced(x, y) = I(x, y) + m · I_top(x, y) − n · I_bottom(x, y),    (1)

where m and n are the strengths of the Top-hat and Bottom-hat transformations. Vesselness-enhancing diffusion filtering [70] (see Appendix 1 for details) was also performed on the denoised images to enhance vascular structures, as utilized in "Filter-based feature extraction" section.
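A minimal sketch of the multi-scale Top-bottom-hat enhancement follows, using SciPy's grayscale morphology. The square structuring element and the scale list are assumptions for illustration; the paper's SE shape and scale set may differ.

```python
import numpy as np
from scipy import ndimage

def top_bottom_hat_enhance(image, scales, m=1.0, n=1.0):
    """Multi-scale top-bottom-hat contrast enhancement:

    I_top    = max over scales of (I - opening(I))
    I_bottom = max over scales of (closing(I) - I)
    Enhanced = I + m * I_top - n * I_bottom
    """
    tops, bottoms = [], []
    for s in scales:
        size = (s, s)  # square SE; the SE shape here is an assumption
        tops.append(image - ndimage.grey_opening(image, size=size))
        bottoms.append(ndimage.grey_closing(image, size=size) - image)
    i_top = np.max(tops, axis=0)
    i_bottom = np.max(bottoms, axis=0)
    return image + m * i_top - n * i_bottom

img = np.random.rand(64, 64)
enhanced = top_bottom_hat_enhance(img, scales=[3, 5, 9])
```

Since opening never exceeds the input and closing never falls below it, both hat images are non-negative, so the enhancement brightens thin bright structures and darkens thin dark ones.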

Filter-based feature extraction
A number of common vessel enhancement and segmentation filters were used to extract features that can be categorized into differentiable, spatial, and Gabor features.
In terms of differentiable features, the Z-profile of the Frangi filter [67], the Z-profile of the matched filter [71], the Gaussian-filter-smoothed Z-profile of the gradient magnitude, and the vessel confidence measure [72] were extracted, resulting in a total of 13 features. For spatial features, the granular decomposition of the top-bottom-hat image using the method given in [73] was obtained, producing a Z-profile with 4 features. The Gabor features were extracted from the Z-profile of the Gabor filter [17] responses on the complement of the contrast enhancement output image.

Feature extraction with deep learning
The deep learning networks described in this section were implemented in Python 3.7 using PyTorch 1.10 and the segmentation model package [74]. Each network was trained on a single NVIDIA Tesla V100 GPU.

Data partitioning for model construction
The dataset containing 130 XCA images was split into training, validation, and test sets in 3:1:1 ratio. The partitions were stratified to ensure different subsets had approximately the same percentage of samples of RCA and LCA angiograms from different sources.

Network structure
The network structures adopted in this study are common deep learning models developed for medical image segmentation, such as U-Net, DeepLabV3+, InceptionResNet-v2-backbone U-Net, ResNet101-backbone U-Net, and DenseNet121-backbone U-Net. The latter three were first applied for main branch segmentation in XCA images in [43] and achieved the best performance in terms of F1 score thus far. These structures were employed in this work for major branch segmentation with modified training logic to serve as comparison to the ensemble models, and to obtain deep learning features for ensemble model training. For the DeepLabV3+ model, we used a DeepLabV3 encoder as mentioned in [41] and adapted the ImageNet pre-trained ResNet101 [44] for dense feature extraction in the encoder [75]. Details on model parameters can be found in Appendix 3. For the modified U-Net models, the encoder and bottleneck sections of the U-Net were replaced with ImageNet pre-trained ResNet [44], InceptionResNet-v2 [45], or DenseNet [46], respectively, except for their average pooling layers and the fully connected layers at the end. Skip connections were retained between the encoder and the decoder at different spatial resolutions.

Data processing and training setting
Gray-scale XCA images were first preprocessed with 2-D min/max normalization. To increase the diversity of the training samples, data augmentation was employed at each training iteration before feeding data into the networks. Specifically, XCA images were randomly augmented by affine transformations (−20° to 20° rotation, 0–10% of image size translation shift on the horizontal and vertical axes, or 0–10% zoom) with a probability of 0.7. The same augmentations were also applied to the corresponding ground-truth masks. The network was trained using a default-setting Adam optimizer with an initial learning rate of 10⁻³ and a mini-batch size of 8 images for up to 100 epochs. An early-stop mechanism was triggered if validation loss did not improve for 15 epochs.

Loss function
To take into consideration class imbalance and class importance, the deep-learning models were trained using the Generalized Dice loss (GD) [76]:

GD = 1 − 2 · ( Σ_{c=1}^{t} w_c Σ_{p=1}^{n} G_cp M_cp ) / ( Σ_{c=1}^{t} w_c Σ_{p=1}^{n} (G_cp + M_cp) ),

where w_c = ( (Σ_{p=1}^{n} G_cp)² + ǫ )⁻¹ is the weight for class c, t is the total class number, p is the pixel location, and n the total number of pixels in an image. G_cp and M_cp are the ground truth image pixel and predicted mask pixel values for class c respectively.
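The loss can be sketched in a few lines of NumPy. This follows the standard Generalized Dice formulation of Sudre et al. with the inverse-squared-volume class weights; the epsilon placement is an assumption, and a real training setup would implement this differentiably in PyTorch.

```python
import numpy as np

def generalized_dice_loss(G, M, eps=1e-6):
    """Generalized Dice loss.

    G, M: (C, N) arrays of ground-truth and predicted per-pixel values
    for each class c and pixel p. w_c = 1 / ((sum_p G_cp)^2 + eps).
    """
    w = 1.0 / (G.sum(axis=1) ** 2 + eps)
    numer = 2.0 * np.sum(w * np.sum(G * M, axis=1))
    denom = np.sum(w * np.sum(G + M, axis=1))
    return 1.0 - numer / (denom + eps)

# Two classes (vessel / background) over four pixels.
G = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
loss_perfect = generalized_dice_loss(G, G.copy())  # near 0 for a perfect mask
loss_worst = generalized_dice_loss(G, 1.0 - G)     # near 1 for an inverted mask
```

The inverse-squared-volume weighting keeps the rare vessel class from being drowned out by the abundant background class.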

Testing and model evaluation
For each network structure, the model that achieved the lowest validation loss was applied to the test set. The final layer of the network output was passed through an element-wise sigmoid activation function to generate the probability of each pixel belonging to either the vessel region or the background region. The same post-processing described in "Post-processing and model evaluation" section was applied to generate the final binary segmentation masks. The quality of the generated masks was evaluated using precision, sensitivity, specificity, F1 score, AUROC, and IoU.
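The pixel-wise evaluation metrics used throughout the paper can be computed from the confusion-matrix counts of a binary mask against its ground truth, as in this small sketch (AUROC is omitted since it requires the pre-threshold probabilities):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Pixel-wise metrics for a binary predicted mask vs. ground truth."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall
    specificity = tn / (tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    iou = tp / (tp + fp + fn)         # intersection over union
    return dict(precision=precision, sensitivity=sensitivity,
                specificity=specificity, f1=f1, iou=iou)

m = segmentation_metrics(np.array([[1, 1, 0, 0]]), np.array([[1, 0, 1, 0]]))
```

In the paper these metrics are computed per test image and then averaged, which is why standard deviations over images can also be reported.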

Deep feature extraction
The network structure that achieved the best test performance with respect to F1 score and AUROC was adopted for deep feature extraction. After normalization, the XCA images of dimension 1 × 512 × 512 were fed into the network and the activation maps of the final decoder layer were extracted as the deep features with dimension of 16 × 512 × 512.

Feature standardization and training samples
After feature extraction as described in "Feature extraction with filters" and "Feature extraction with deep learning" sections, all features (Table 2) were concatenated (Fig. 4). Each feature map output from a filter or a deep learning layer is standardized individually by subtracting its mean pixel value and dividing the result by the standard deviation. For each pixel p_i within an XCA image, a 37-dimensional feature vector X_i was extracted that yielded a training sample (X_i, y_i), where the label y_i is 1 if the pixel is part of a vessel in the annotated XCA image and 0 otherwise. The 130 XCA images resulted in a total of 512 × 512 × 130 = 34,078,720 samples before border removal and any under-sampling.
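The standardization and per-pixel sample construction can be sketched as below. The function name is hypothetical; a smaller image size is used here for brevity.

```python
import numpy as np

def build_samples(feature_maps, mask):
    """Standardize each feature map and flatten to per-pixel samples.

    feature_maps: (F, H, W) stack of filter/deep feature maps.
    mask: (H, W) binary vessel annotation.
    Returns X of shape (H*W, F) and labels y of shape (H*W,).
    """
    fm = np.asarray(feature_maps, dtype=float)
    # Standardize each map individually over its own pixels.
    mean = fm.mean(axis=(1, 2), keepdims=True)
    std = fm.std(axis=(1, 2), keepdims=True)
    fm = (fm - mean) / (std + 1e-12)
    X = fm.reshape(fm.shape[0], -1).T      # one F-dimensional vector per pixel
    y = np.asarray(mask).reshape(-1).astype(int)
    return X, y

maps = np.random.rand(37, 32, 32)                    # 37 feature maps
mask = (np.random.rand(32, 32) > 0.9).astype(int)    # toy vessel annotation
X, y = build_samples(maps, mask)
```

In the actual pipeline the standardization parameters would be estimated on the training set and reused on the test set, as noted in "Data partitioning for model construction" section.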

Under-sampling of non-vessel pixels
If vessel and background pixels are denoted as positive and negative classes respectively, the samples are highly imbalanced, with vessel pixels constituting only a small fraction of each image. Moreover, the features extracted between neighboring pixels are highly correlated. Given these two facts, hybrid under-sampling of pixels at the image level was performed to (1) avoid over-fitting due to redundant information across pixels; (2) ensure that the classifiers do not ignore the minority class; and (3) reduce training time.

Uniform and unsupervised under-sampling
The majority class consists of the background, non-vessel areas in the XCA images. Pixels from the majority class were first uniformly under-sampled via a mask (Fig. 5) such that the 8-neighborhood of each sampled pixel was not sampled. Then, an intensity-based unsupervised under-sampling method was employed to further reduce the majority class based on the contrast enhancement output (Fig. 6). To effect this under-sampling, the histogram of the pixel intensity of the contrast enhanced image was created with 256 bins, after which the discrete pdf (probability density function) was obtained and passed through a one-dimensional median filter. The smoothed pdf was compared with the original one to identify the over-saturation peak generated by contrast enhancement. When these over-saturated pixels were excluded, the median value of pixel intensities was calculated as the binarization threshold for bright pixel removal (Fig. 6, right panel). This step removes pixels that clearly belong to the background based solely on their intensity.
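The two unsupervised steps can be sketched as follows. This is a simplified illustration: a stride-2 grid satisfies the 8-neighborhood condition, and the saturation-peak search is reduced to a plain median threshold; the function name and details are assumptions.

```python
import numpy as np
from scipy.signal import medfilt

def undersample_background(enhanced, vessel_mask):
    """Uniformly sub-sample background pixels on a stride-2 grid (so no two
    sampled pixels are 8-neighbors), then drop bright pixels above the
    median intensity of the contrast-enhanced image."""
    H, W = enhanced.shape
    grid = np.zeros((H, W), dtype=bool)
    grid[::2, ::2] = True                    # uniform under-sampling mask

    # The paper smooths the 256-bin intensity pdf with a 1-D median filter
    # and compares it with the original to locate the saturation peak;
    # here the smoothed pdf is computed only for illustration.
    hist, _ = np.histogram(enhanced, bins=256, range=(0, 1))
    smoothed_pdf = medfilt(hist.astype(float), kernel_size=5)
    threshold = np.median(enhanced)          # simplified brightness cutoff

    return grid & (~vessel_mask) & (enhanced <= threshold)

img = np.random.rand(64, 64)
vessels = np.zeros((64, 64), dtype=bool)
kept = undersample_background(img, vessels)
```

The stride-2 grid alone keeps at most a quarter of the background pixels, which matches the ~24% retention reported in "Under-sampling" section.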

Supervised under-sampling
Supervised under-sampling includes Tomek Links and Cluster Centroid under-sampling for both the positive and the negative classes. Tomek Links [77] are defined as pairs of pixels from opposite classes that are the nearest neighbors of each other. As removing overlapping pixels between classes yields more well-defined boundaries for the classifiers [78], the Tomek Links of both classes were removed so that the minimally distanced nearest-neighbor pairs of vessel pixels belong to the same class. Figure 7 illustrates Tomek Link under-sampling. Following the Tomek Links, the Cluster Centroid [79] under-sampling method was employed. This method first applies a clustering algorithm such as k-means to generate cluster centroids and then uses these centroids to replace the original sample points. This method can reduce the number of pixels within each class to a fixed number, e.g., 4000, which is much smaller than the sample number in the original image. The Python implementations [80] of the Tomek Links and Cluster Centroid methods were employed.
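The Tomek link definition can be made concrete with a small NumPy/scikit-learn sketch (the paper itself uses the imbalanced-learn implementations [80]; this toy version only illustrates the mutual-nearest-neighbor criterion):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_link_mask(X, y):
    """Return a boolean keep-mask that is False for samples in a Tomek link:
    pairs of opposite-class points that are each other's nearest neighbor."""
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    neighbor = nn.kneighbors(X, return_distance=False)[:, 1]  # skip self
    keep = np.ones(len(X), dtype=bool)
    for i, j in enumerate(neighbor):
        if neighbor[j] == i and y[i] != y[j]:  # mutual NN, opposite classes
            keep[i] = keep[j] = False
    return keep

# 1-D toy data: the points at 0.9 and 1.0 are mutual nearest neighbors
# from opposite classes, so they form a Tomek link and are both removed.
X = np.array([[0.0], [0.9], [1.0], [5.0]])
y = np.array([0, 0, 1, 1])
keep = tomek_link_mask(X, y)
```

Removing both members of each link cleans up the class boundary before the Cluster Centroid step shrinks each class to a fixed sample count.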

Ensemble learning for coronary vessel segmentation
Two ensemble methods, Gradient Boosting Decision Tree (GBDT) [51] and Deep Forest [52] were trained on the extracted 37-dimensional feature vectors with samples obtained by under-sampling. To explore how Tomek Links and Cluster Centroid affected the segmentation performance, models with and without these two under-sampling methods were constructed for comparison.

Data partitioning for model construction
The dataset was split into training and test sets with a 4:1 ratio. The test set images are exactly the same as the deep learning test set mentioned in "Data partitioning for model construction" section and were not utilized until the test stage. All the partitions (including the cross-validation partitions in "Training and testing strategy" section) were stratified to ensure different sets have approximately the same percentage of LCA and RCA images, and pixels from the same image were not split across sets. Moreover, standardization was applied to the training set with parameters being saved to transform the test set, and this was also true for cross-validation.

Training and testing strategy
For the Deep Forest model, default hyper-parameter settings were used, since hyper-parameters had minimal effect on the model performance [52]. For the GBDT model, candidate hyper-parameter combinations were evaluated with cross-validation on the training set; for the remaining hyper-parameters, default values in the Sklearn package were applied. For example, the loss function to be optimized was the deviance, and the split quality of trees was measured by the mean squared error with improvement score by Friedman (Friedman MSE) [51]. Finally, the hyper-parameter combination that achieved the highest mean AUROC score in cross-validation was used to train the final GBDT model on the training set and applied to the test set for evaluation.
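This cross-validated selection can be expressed with scikit-learn's `GridSearchCV` scored by AUROC, as in the sketch below. The parameter grid and the synthetic data are illustrative assumptions, not the paper's actual search space.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search grid; the paper's candidate combinations may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

# Synthetic stand-in for the 37-dimensional pixel feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # deviance loss, Friedman MSE by default
    param_grid,
    scoring="roc_auc",   # keep the combination with the highest mean AUROC
    cv=3,
)
search.fit(X, y)
best_model = search.best_estimator_  # retrained on the full training set
```

`GridSearchCV` refits the best combination on the whole training set automatically, mirroring the final-model training described above.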

Post-processing and model evaluation
Features extracted from test images were passed into the trained ensemble models to generate masks. The masks were then binarized using Otsu's method [81] and post-processed by (1) removing border regions, (2) adding back the unsupervised background masks, and (3) removing artifacts whose areas were less than 50 pixels. The same evaluation metrics listed in "Testing and model evaluation" section were calculated to evaluate the quality of the predicted masks across different models. The metrics were calculated image-wise and reported as the mean and standard deviation over all tested images.
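The binarization and artifact-removal steps can be sketched with NumPy and SciPy; the Otsu threshold is implemented directly here (maximizing between-class variance over a 256-bin histogram), and the border-region and background-mask steps are omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def otsu_threshold(prob, bins=256):
    """Otsu's method on a probability map: maximize between-class variance."""
    hist, edges = np.histogram(prob, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                       # class-0 pixel counts
    w1 = hist.sum() - w0                       # class-1 pixel counts
    csum = np.cumsum(hist * centers)
    mu0 = csum / np.maximum(w0, 1e-12)
    mu1 = (csum[-1] - csum) / np.maximum(w1, 1e-12)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return centers[np.argmax(between)]

def postprocess(prob, min_area=50):
    """Binarize with Otsu's threshold, then drop components under min_area."""
    mask = prob > otsu_threshold(prob)
    labels, n = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
    for lbl, size in enumerate(sizes, start=1):
        if size < min_area:
            mask[labels == lbl] = False
    return mask

prob = np.zeros((64, 64))
prob[10:30, 10:14] = 0.9    # an 80-pixel "vessel" segment
prob[50, 50] = 0.9          # a 1-pixel artifact, removed by the area filter
mask = postprocess(prob)
```

Dropping connected components below the area threshold is what removes the speckle artifacts that pixel-wise classifiers tend to produce.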

Feature importance
The permutation importance [82] of the GBDT models was computed on the holdout test set for feature evaluation. The importance of a feature was calculated as the decrease in Friedman MSE (mentioned in "Training and testing strategy" section) evaluated on the test set when the feature column was permuted 10 times.
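A minimal version of this procedure using scikit-learn's `permutation_importance` is shown below. Note the simplifications: the default scorer is used in place of the paper's Friedman MSE decrease, and the data are synthetic with only the first feature being informative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)   # only feature 0 determines the label

# Fit on a training split, evaluate importances on held-out data,
# shuffling each feature column 10 times as in the paper.
model = GradientBoostingClassifier(random_state=0).fit(X[:200], y[:200])
result = permutation_importance(model, X[200:], y[200:],
                                n_repeats=10, random_state=0)
importances = result.importances_mean
```

Because the held-out score collapses only when the informative column is shuffled, `importances[0]` dominates the other two, which stay near zero.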

Under-sampling
The counts of positive and negative pixels retained for model training after different under-sampling steps are listed in Table 3. Before under-sampling, the positive class comprised only 5.56% of the total samples after the border regions were removed (see "Preprocessing" section). The uniform under-sampling selected 24.13% of the background pixels in the original labeled image, followed by the unsupervised under-sampling that further kept 45.98% of the negative pixel samples. These two steps resulted in a 34.78% share of vessel pixels in the overall samples and 4,000,602 samples of vessel and background pixels for model training.
Tomek Links only altered this percentage level slightly, while Cluster Centroid completely balanced the positive and negative classes, reducing the training samples to 832,000 in total.

Table 4 lists the performances of five deep learning models, six ensemble models, and the state-of-the-art deep learning model on the test set in terms of their precision, sensitivity, specificity, F1 score, AUROC, and IoU. For the ensemble models, "Unsupervised" indicates that the uniform and unsupervised under-samplings were applied on the training samples, while "Tomek Links" means that all the Tomek Links were also removed from the sample pixels. Moreover, "Cluster Centroid" indicates that the Cluster Centroid under-sampling method was further applied for reducing sample numbers. For the deep learning models, Inception-ResNet-v2 U-Net achieved the highest precision and specificity, DeepLabV3+ obtained the best sensitivity and AUROC, while DenseNet121 U-Net had the best F1 score, AUROC, and IoU scores. For ensemble learning models that use Deep Forest as the classifier, the samples after the Tomek Links under-sampling method yielded the highest scores in precision, specificity, F1 score, and IoU, while the samples after unsupervised under-sampling gave the best test performance in terms of sensitivity and AUROC, achieving the highest AUROC of 0.95 among all models tested. For ensemble learning models that use GBDT as the classifier, the samples after Tomek Links under-sampling had the best precision and specificity. Moreover, the further application of the Cluster Centroid under-sampling method generated the highest sensitivity, F1 score, AUROC, and IoU scores. It also yielded the best F1 score and IoU of all models tested. The state-of-the-art method [43] achieved higher sensitivity, F1 score, AUROC, and IoU than the five deep learning models.
However, it did not outperform the proposed ensemble models, which generally performed better than the deep learning methods in all metrics except specificity. Moreover, GBDT classifiers performed better than Deep Forest, with higher mean values and lower standard deviations in terms of F1 score and IoU, regardless of the under-sampling method employed.

Discussion
In this paper, a novel ensemble framework for automatic segmentation of coronary arteries in XCA was developed. The best performing model utilized a GBDT classifier trained on samples generated by the Cluster Centroid under-sampling method.

From a clinical perspective, as more than 80% of percutaneous coronary interventions (PCIs) are performed at the time of angiography [83], an accurate vessel segmentation method that improves the quality of QCA can greatly facilitate the assessment of stenosis; improve the quality of patient care; and avoid unnecessary PCIs, yielding billions of dollars in savings at the national level [84]. In addition, correct delineation of the coronary vascular structures would be valuable in many types of CAD such as coronary endothelial dysfunction, where XCA serves as a testing technique [85].
From a technical perspective, this is the first time to our knowledge that ensemble methods, especially deep ensemble methods, have been applied to coronary artery segmentation. Ensemble methods are known for reducing the variance of predictions by gathering weak learners. In this study, we specifically used them for better predictive performance and robustness with limited data. In terms of the ensemble methods applied, GBDT is one of the leading boosting algorithms that employ decision tree weak learners. Compared with other decision tree boosting algorithms (e.g. [86]), it has more flexibility to handle various losses defined in different forms. Deep Forest, on the other hand, is an emerging ensemble method that creates stacked layers in ensemble training. Although the Deep Forest classifier did not achieve performance that was significantly better than GBDT, it still outperformed the deep learning models and was more consistent when evaluated over all test images. This suggests that the ensemble learning method and the training framework in which various features were extracted from both classic and deep learning filters is more suitable for a relatively small dataset where training samples are limited. The significant increase in sensitivity indicates that the ensemble models have better recognition of the vessel area. This could be attributed to the employment of domain knowledge and a reduction in overfitting via the pixel-wise training scheme.
The proposed method deliberately introduces some redundancy within the feature vectors so that a wider spectrum of possible situations can be handled. In terms of future work, it is important to analyze the most discriminating features, both for a more explainable model and for a more representative feature set. A review of the permutation importance of the features indicates that some of them contribute little to the final prediction. However, since permutation importance can be biased towards features that are correlated with one another [87], a more substantial feature analysis is required.
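The permutation-importance review mentioned above can be sketched as follows. This is a toy illustration using scikit-learn, with synthetic data and a stand-in classifier, not the paper's model; it also shows the correlation bias discussed above by duplicating an informative feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Toy stand-in features: only columns 0 and 1 carry signal; column 2 is
# a near-copy of column 0 to mimic the correlated features that bias
# permutation importance; the remaining columns are noise.
X = rng.normal(size=(500, 5))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=500)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Permutation importance: shuffle one feature at a time and measure the
# drop in score; correlated features (0 and 2) end up sharing importance.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
print(result.importances_mean)
```

In the real pipeline, features whose importance remains near zero across repeats would be candidates for removal, subject to the correlation caveat noted above.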
Different from the state-of-the-art paper on vessel segmentation [43], in which the network was trained for 400 epochs with the learning rate decreased upon training-loss saturation, we used an updated training logic to train the deep neural network model for feature extraction. Specifically, we reduced the upper bound on training epochs and introduced an early-stop mechanism. Although the deep models trained in this way have inferior performance compared to those obtained in [43], we believe prolonged training is unnecessary for two reasons. First, with the training logic currently applied, the state-of-the-art method by itself did not outperform our ensemble model in all metrics. Second, the trained deep neural networks are only used for feature extraction. Given that our dataset is relatively small, prolonged training of the deep feature extraction model may hinder the generalizability of the ensemble model built on top of it. To summarize, the proposed ensemble methods have strengths in predictive power compared to the deep-learning state of the art on our datasets. Comparatively, their weakness lies in the complexity of the preprocessing, feature extraction, and under-sampling pipeline required prior to model training.
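The early-stop mechanism described above can be sketched framework-agnostically as follows. The patience value, improvement threshold, and toy loss sequence are illustrative assumptions, not the settings used in this study.

```python
class EarlyStopper:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def should_stop(self, val_loss):
        # Count epochs since the last meaningful improvement.
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience

# Usage inside a hypothetical training loop with a reduced epoch cap:
stopper = EarlyStopper(patience=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63]  # toy validation losses
for epoch, loss in enumerate(losses):
    if stopper.should_stop(loss):
        print(f"early stop at epoch {epoch}")
        break
```

Combined with a lower epoch upper bound, such a stopper cuts off training once the validation loss plateaus, which is the behavior relied on for the feature-extraction networks.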
Recently, vision transformer networks [88] have been introduced to tackle segmentation tasks [89] for their capability to capture long-range dependencies in images with the self-attention mechanism. However, most transformer-based networks cannot be trained properly on a small-scale dataset. Current transformer-based structures designed for medical image segmentation either need thousands of annotated images for training [90] or require a large amount of computational resources [91]. Considering that we have limited training samples, and that a model requiring substantial computational resources is less operational at the point of care in the cardiac catheterization lab, transformer-based networks may not be a practical or optimal choice.
The proposed method has several limitations. First, the current Python implementation of Cluster Centroid under-sampling requires much more computational time than the Tomek Links and the unsupervised under-sampling methods (see Appendix 2 for details). Although a large reduction in training samples should expedite the training process and yield better-performing prediction models given the relatively small size of the datasets utilized in this study, the prolonged under-sampling process itself increases total training time and may hinder the method's efficiency and scalability in practice. Moreover, the number of output pixels per class in the Cluster Centroid under-sampling method was determined through a number of trial experiments instead of a more comprehensive cross-validated grid search. Since this number affects both the time required for under-sampling and the final performance, better choices may exist that reduce training time while maintaining good performance.

Top-bottom-hat filtering
Top-hat and bottom-hat filters are morphological filters that combine dilation and erosion operations with a structuring element (SE). For a gray-scale image, the dilation and erosion operations (Fig. 9) return, for each pixel, the maximum and the minimum, respectively, of the pixel intensities in its neighborhood defined by the SE; hence, the former is often used for gap filling and region connection, while the latter is applied for detail elimination. Denoting the image matrix as I and the SE at scale s as SE_s, the morphological opening operation ( ∘ ) is defined as an erosion ( ⊖ ) followed by a dilation ( ⊕ ), I ∘ SE_s = (I ⊖ SE_s) ⊕ SE_s, while the morphological closing operation ( • ) first performs dilation and then erosion, I • SE_s = (I ⊕ SE_s) ⊖ SE_s.
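The opening/closing compositions above, and the resulting top-hat (image minus opening) and bottom-hat (closing minus image) responses, can be sketched with SciPy's gray-scale morphology. The toy image and 3×3 structuring element are illustrative assumptions; on a dark vessel-like stripe the bottom-hat response fires, which is why it is useful for dark vessels on a bright XCA background.

```python
import numpy as np
from scipy import ndimage

# Toy gray-scale image: a dark vessel-like stripe on a bright background,
# mimicking how vessels appear in XCA frames.
img = np.full((7, 7), 200, dtype=np.int32)
img[:, 3] = 50  # dark 1-pixel-wide "vessel"

se = np.ones((3, 3))  # structuring element (scale chosen for illustration)

# Opening = erosion followed by dilation; closing = dilation then erosion.
opened = ndimage.grey_opening(img, footprint=se)
closed = ndimage.grey_closing(img, footprint=se)

# Top-hat keeps bright details smaller than the SE; bottom-hat keeps dark
# details, so it responds strongly on the dark vessel pixels.
top_hat = img - opened
bottom_hat = closed - img
print(bottom_hat[:, 3])  # strong response along the vessel column
```

Varying the SE scale s yields the multi-scale morphological responses used as features.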

Diffusion filtering
Diffusion is a time-dependent process from physics that models concentration change. This process evolves the input frame, I, by introducing a 'time' variable and generating images, I(t), that evolve with respect to time t via the diffusion equation

∂I/∂t = ∇ · (D ∇I).

Here, ∇· represents the divergence operator; ∇I = (I_x, I_y) denotes the image gradient; and D is a diffusion tensor that describes the diffusion process. The diffusion tensor is constructed to enhance the vascular structures in the image using a variant of Frangi's vesselness filter with continuous derivatives. The filter V_σ at scale σ is given by

V_σ = 0, if λ_2 > 0; V_σ = exp(−R²/(2ω²)) (1 − exp(−S²/(2ε²))), otherwise,

where λ_1 and λ_2 are the eigenvalues of the Hessian matrix, |λ_2| > |λ_1|, R = λ_1/λ_2, and S = √(λ_1² + λ_2²). In this definition, ω, ε, and s denote additional tuning parameters. Vessel-enhancing diffusion is performed on each image using scale-space theory for scale invariance [94], with scaling parameters selected for the LCA and RCA based on expected vessel width ranges [95]. The differential filters are optimized for rotation invariance [96]. Finally, the vesselness response of the diffused image I′ is given by the maximum vesselness response over all scales, V(I′) = max_σ V_σ(I′).

Table 5 shows the mean and standard deviation of the computational times when applying different under-sampling methods over all images in our dataset. Unsupervised under-sampling is implemented in Matlab, while Tomek Links and Cluster Centroid are carried out with the Python Imbalanced-learn library [80]. Cluster Centroid under-sampling takes 67312 times more computational time than the unsupervised method, and uses 163