
An evaluation of performance measures for arterial brain vessel segmentation



Arterial brain vessel segmentation allows utilising clinically relevant information contained within the cerebral vascular tree. Currently, however, no standardised performance measure is available to evaluate the quality of cerebral vessel segmentations. Thus, we developed a performance measure selection framework based on manual visual scoring of simulated segmentation variations to find the most suitable measure for cerebral vessel segmentation.


To simulate segmentation variations, we manually created non-overlapping segmentation errors common in magnetic resonance angiography cerebral vessel segmentation. In 10 patients, we generated a set of approximately 300 simulated segmentation variations for each ground truth image. Each segmentation was visually scored based on a predefined scoring system and segmentations were ranked based on 22 performance measures common in the literature. The correlation of visual scores with performance measure rankings was calculated using the Spearman correlation coefficient.


The distance-based performance measures balanced average Hausdorff distance (rank = 1) and average Hausdorff distance (rank = 2) provided the segmentation rankings with the highest average correlation with manual rankings. They were followed by overlap-based measures such as the Dice coefficient (rank = 7), a standard performance measure in medical image segmentation.


Average Hausdorff distance-based measures should be used as a standard performance measure in evaluating cerebral vessel segmentation quality. They can identify more relevant segmentation errors, especially in high-quality segmentations. Our findings have the potential to accelerate the validation and development of novel vessel segmentation approaches.


Background

Stroke is a leading cause of mortality and disability, affecting 15 million people worldwide [1]. As a cerebrovascular disease, it is characterised by arterial brain vessel changes, e.g. narrowing and occlusion. Thus, the status of the cerebral arteries is routinely utilised in the clinical setting for the understanding, treatment and prevention of stroke [2]. For example, quantified parameters such as arterial diameters can serve as biomarkers for predicting future strokes [3]. Additionally, incompleteness of intracranial vessel structures, such as the circle of Willis, has been associated with a higher risk of anterior circulation stroke [4]. In addition, other diseases such as vessel inflammations or aneurysms can lead to changes in the vasculature. Therefore, accurate visualisation and quantification of the status of the arterial vessel tree are of high clinical relevance.

Recently, advances in deep neural network architectures, a particular type of artificial intelligence (AI), made fully automated and clinically applicable cerebral vessel segmentation approaches feasible [5,6,7]. Once deployed, these methods do not rely on human intervention and can provide high-quality binary segmentations of the arterial vessels in less than a minute [5]. However, a severe obstacle to developing and validating improved vessel segmentation approaches is accurate segmentation performance assessment. In other words, how do we know which model provides better segmentations?

Usually, the performance assessment of a given segmentation result encompasses a qualitative and quantitative analysis. Qualitative analysis is done visually; however, its inter-rater variability, susceptibility to human error and time-consuming nature limit its broader use [8, 9]. The quantitative analysis comprises the comparison of a given segmentation to a reference image via a computed measure. The reference image—also called the ground truth—is usually a manual segmentation performed by at least one human expert. The comparison is performed via specific performance measures. Taha et al. provide an extensive overview of the existing measures [10]. In brief, many measures exist, and they can be divided into distinct families: overlap-based, volume-based, pair-counting-based, information-theoretic, probabilistic and spatial-distance-based measures [10]. Each type of performance measure is sensitive to different types of errors present in a segmentation. Also, each measure has different biases depending on the characteristics of the segmented structures. Therefore, to assess segmentation performance, measures should be selected that best fit the given segmentation task.

For arterial brain vessel segmentation specifically, various performance measures are in widespread use to evaluate segmentation quality [11].

The most commonly used measure is the Dice coefficient [12, 13]. It is popular because it is easily interpretable and allows comparisons with other studies [14]. Less often, other performance measures such as the average Hausdorff distance [15], the area under the receiver operating characteristic curve [16], sensitivity [17, 18], specificity [18], or accuracy [16,17,18] are used.

Importantly, however, there is no scientific evidence supporting that the Dice coefficient—or any other measure—is the best choice for arterial brain vessel segmentation. While theoretical considerations argue heavily in favour of distance-based measures [10], an empirical assessment corroborating or refuting these theoretical assumptions is lacking to date.

Therefore, in the present work, we aimed to fill this scientific gap. To find the most suitable performance measures for cerebral vessel segmentation, we first simulated segmentation variations containing various manually created errors. We then visually scored these segmentations using a predefined scoring system. Finally, we correlated these visual scores with the segmentation rankings provided by 22 different performance measures to find the most suitable measure.


Methods

Time-of-flight MR-Angiography (TOF MRA) images of 10 patients from the 1000Plus study were randomly selected. The 1000Plus study included patients with the clinical diagnosis of an acute cerebrovascular event within the last 24 h. For our analysis, the only inclusion criterion was a complete circle of Willis without any occlusion in its vessel segments. This criterion was chosen because patients with occlusions in the arteries of the circle of Willis would not allow the creation of errors in these arteries. The 1000Plus study was carried out with approval from the local Ethics Committee of the Charité University Hospital Berlin (EA4/026/08). Details about the study have been previously published [19].

Imaging parameters

Time-of-flight MR-Angiography (TOF MRA) was performed on a 3T MRI scanner (Tim Trio; Siemens AG, Erlangen, Germany) with the following parameters: voxel size = (0.53 × 0.53 × 0.65) mm³; matrix: 364 × 268; averages: 1; TR/TE = 22 ms/3.86 ms; gap: −7.2; FOV: 200 mm; duration: 3:50 min; flip angle = 18 degrees.

Ground truth creation

To create a ground truth image of the cerebral arterial vessels, the 3D TOF MRA was pre-segmented using a U-net deep learning framework [8] and manually corrected by OUA (4 years of experience in stroke imaging) using ITK-SNAP [20]. The results were checked by VIM (11 years of experience in stroke imaging). The resulting binary ground truth was manually annotated voxel-wise into the following arteries and their corresponding segments: internal carotid artery (ICA), the sphenoidal segment of the middle cerebral artery (M1) and the posterior communicating artery (Pcom). All other segmented arteries were classified as small vessels (Fig. 1).

Fig. 1

Binary ground truth (a) and voxel-wise annotated ground truth (b). White: M1 segment of the middle cerebral artery, Yellow: Posterior communicating artery, Purple: Internal carotid artery, Red: Other arteries and artery segments classified as small vessels

Error creation

To systematically explore the properties of performance measures for quality assessment of cerebral vessel segmentations, a framework to simulate segmentation variations was developed. To simulate segmentation variations for ranking, a set of 48 non-overlapping segmentation errors commonly encountered in a vessel segmentation task was manually created. In this context, an error means that the ground truth was manipulated manually by introducing false-negative or false-positive voxels. The errors were selected based on the experience of our group in developing and optimising vessel segmentation algorithms. These errors were regularly encountered in segmentations produced by state-of-the-art deep learning models [5, 8] as well as by traditional methods like region growing or graph cut algorithms [8]. Additionally, these errors are also described in the literature [21,22,23,24,25]. The errors included, for example, boundary errors of various vessel segments, false-positively labelled vessel and non-vessel anatomical structures such as the sagittal sinus, middle meningeal artery, fat and muscle tissue, and omitted parts of the vessel tree. Three intensity levels (subtle, moderate, severe) were generated for each error where possible. Error groups and individual errors created in the framework are listed in Table 1. Example illustrations of errors belonging to different error groups can be found in Fig. 2 and visualisations of all errors in Additional File 1.
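The voxel-level manipulation described above can be sketched with toy arrays; this is an illustrative example only (the actual framework operates on NIfTI volumes, and all array contents here are invented for demonstration):

```python
import numpy as np

# Toy ground truth: a short "vessel" of four voxels in a small volume.
ground_truth = np.zeros((8, 8, 8), dtype=np.uint8)
ground_truth[2:6, 4, 4] = 1

# A false-positive error adds voxels outside the vessel tree;
# a false-negative error removes voxels that belong to it.
fp_error = np.zeros_like(ground_truth)
fp_error[0, 0, 0] = 1
fn_error = np.zeros_like(ground_truth)
fn_error[2, 4, 4] = 1

# A simulated segmentation variation is the ground truth with errors
# added (false positives) and subtracted (false negatives).
variation = np.clip(ground_truth.astype(int) + fp_error - fn_error, 0, 1)
```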

Table 1 Manually created errors for simulation of segmentation variations
Fig. 2

Examples of manually created errors of various intensity levels that were introduced to the ground truth. Examples of false-positive segmentation of structures in green (a–c): a moderate skull error, b severe sigmoid sinus error, c severe orbit error. Examples of false-negative segmentation of vessels in blue (d–f): d omission of the internal carotid artery, e severe small vessel error, f omission of the posterior communicating arteries. Radius manipulation of segments (g, h): g subtle boundary error of the M1 segment of the middle cerebral artery, h severe boundary error of the internal carotid artery. Red: True positive voxels, Green: False-positive voxels, Blue: False-negative voxels

Simulation of segmentation variations

In real-world segmentation of cerebral arteries, errors regularly occur in combination. The simulation framework therefore allows combinations of errors. Example error combinations are shown in Fig. 3. To ensure an equal representation of errors in the created sets, the simulated segmentation variations were generated by selecting errors randomly from the pool of 48 errors, with each error having an equal probability of being selected. However, some errors are mutually exclusive because they manipulate overlapping voxels of the same segment or location within the arterial vessel tree volume. This could lead to an unbalanced representation of errors in the analysis, with some errors unintentionally appearing more frequently. This unwanted effect was compensated for by defining boundary conditions for the segmentation sets: First, for each patient, a set was required to contain 295 to 305 simulated segmentation variations. Second, in each set, the simulated segmentation variations were required to contain a minimum of 2 and a maximum of 7 errors per segmentation, leading to a total of 6 segmentation groups per set. Third, we also balanced how often these error groups appeared per patient set; each group was allowed to appear 45–60 times. Finally, to prevent an over-representation of specific errors, each manually created error occurred a minimum of 25 and a maximum of 30 times in total in each set.
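A minimal sketch of the random error sampling under the first two constraints might look like the following; the error IDs, the exclusivity pairs and the helper names are invented for illustration, and the group- and per-error frequency balancing is omitted for brevity:

```python
import random

ERROR_POOL = list(range(48))                        # stand-ins for the 48 errors
EXCLUSIVE = {frozenset({0, 1}), frozenset({5, 6})}  # assumed mutually exclusive pairs

def is_valid(combo):
    """A combination is valid if no mutually exclusive pair co-occurs."""
    return not any(pair <= set(combo) for pair in EXCLUSIVE)

def simulate_set(n_variations=300, seed=42):
    """Draw unique combinations of 2-7 errors until the set size is reached."""
    rng = random.Random(seed)
    variations = []
    while len(variations) < n_variations:
        k = rng.randint(2, 7)                       # 2-7 errors per segmentation
        combo = tuple(sorted(rng.sample(ERROR_POOL, k)))
        if is_valid(combo) and combo not in variations:
            variations.append(combo)
    return variations

variations = simulate_set()
```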

Fig. 3

Example simulated segmentation variations containing error combinations and corresponding visual scores. a This simulated segmentation variation contains 6 errors: severe orbit error, severe skull error, subtle merge/separation error, omission of the internal carotid artery, severe boundary error of the M1 segment of the middle cerebral artery and posterior communicating artery. Due to the high number and severity of errors, a visual score of 10 is assigned to this segmentation, indicating low quality. b This simulated segmentation variation contains 2 errors: a severe omission error of the small vessels and a subtle false-positive segmentation of parts of the superior sagittal sinus. This segmentation gets a visual score of 3, corresponding to moderate quality. Please see Table 2 for the subjective scoring system and Table 1 for a detailed description of errors. Red: True positive voxels, Green: False-positive voxels, Blue: False-negative voxels

Software environment

Our framework was written in the Python programming language. For the introduction of errors to the ground truth, we used the Python library NiBabel to add or subtract images in NIfTI data format. Random combinations were generated with the combinations function from the itertools module in Python. Error combinations that were not allowed are specified within the code. The ranking was performed using the min method of the rank function in the pandas library. The code is available under the following GitHub repository:

Visual scoring

Each simulated segmentation variation was visually scored based on a predefined scoring system designed for this study, with scores ranging from 1 to 10. Higher visual scores denote a higher severity of errors in the simulated segmentation variations and lower segmentation quality. For example, a score of 10 was assigned to segmentations containing multiple severe errors, whereas a score of 1 was assigned to segmentations with subtle errors not affecting segmentation quality. The visual scoring system is described in Table 2. The scoring was performed by OUA, who has 4 years of experience in cerebral vessel segmentation. A total of 2984 segmentations were scored, approximately 300 from each of the 10 patients. Example visualisations of two simulated segmentation variations with their corresponding visual scores can be found in Fig. 3.

Table 2 Criteria of the predefined visual scoring system for simulated segmentation variations

A senior rater (VIM) validated a random subset of 50 simulated segmentations by performing an independent visual scoring. We assessed differences between the scorings by VIM and OUA by calculating the median score deviation, the interquartile range, the exact score overlap, and the percentage of cases where the raters chose the same subcategory of the scoring scheme (i.e. low/moderate/high quality).
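The agreement statistics named above are straightforward to compute; in this sketch the rater scores and the mapping of 1-10 scores to low/moderate/high subcategories are invented for illustration (the actual cut-offs are defined in Table 2):

```python
import numpy as np

def subcategory(score):
    """Assumed mapping of a 1-10 visual score to a quality subcategory."""
    return "high" if score <= 3 else ("moderate" if score <= 6 else "low")

# Made-up scores from two raters for eight segmentations.
rater_a = np.array([1, 4, 7, 10, 3, 6, 8, 2])
rater_b = np.array([2, 4, 6, 10, 3, 7, 8, 1])

deviation = np.abs(rater_a - rater_b)
median_dev = float(np.median(deviation))            # median score deviation
iqr = float(np.percentile(deviation, 75) - np.percentile(deviation, 25))
exact_overlap = float((rater_a == rater_b).mean())  # identical scores
same_subcat = float(np.mean([subcategory(a) == subcategory(b)
                             for a, b in zip(rater_a, rater_b)]))
```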

Performance measures analysis

The simulated segmentation variations were compared against the ground truth using the EvaluateSegmentation software tool [10]. EvaluateSegmentation is an evaluation framework for medical image segmentation comprising implementations of various performance measures from the literature. In addition to the average Hausdorff distance, the tool also includes a recently introduced, improved version called the balanced average Hausdorff distance [26]. The 95th quantile of the Hausdorff distance was utilised to handle outliers [27]. All distance-based measures were calculated in voxels. Complementary to the measures available in the evaluation framework, we added further performance measures used in the literature, namely Conformity and Sensibility [28]. In total, we thus analysed 22 performance measures belonging to the following categories: overlap-based, volume-based, pair-counting-based, information-theoretic, probabilistic and spatial-distance-based. Details and calculations of the implemented performance measures can be found in the publication of Taha et al. [10] and Table 3.

Table 3 Overview of performance measures analysed in this study

Simulated segmentation variations were ranked by ordering segmentations according to their performance measure values. Each performance measure provided a score for each analysed simulated segmentation variation denoting how similar or different segmentations were compared with the ground truth. The segmentation with the highest similarity with the ground truth ranked first, and the one with the lowest similarity ranked last within that segmentation set. Each performance measure assigns different scores to segmentations thus producing different rankings. Therefore, one can compare performance measures by comparing the segmentation rankings produced by them. We produced and analysed rankings of segmentations by all 22 performance measures.
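The ranking step can be illustrated with pandas; the segmentation names and measure values below are invented, and tied values share the lowest rank via method="min" (standard competition ranking):

```python
import pandas as pd

# Invented measure values for three simulated variations.
dice = pd.Series({"seg_a": 0.96, "seg_b": 0.88, "seg_c": 0.96})  # higher = better
ahd = pd.Series({"seg_a": 0.65, "seg_b": 1.20, "seg_c": 0.04})   # lower = better

# Rank so that rank 1 is the segmentation most similar to the ground truth;
# tied values receive the lowest rank of their group (competition ranking).
dice_rank = dice.rank(method="min", ascending=False)
ahd_rank = ahd.rank(method="min", ascending=True)
```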

Then, we aimed to select the most suitable performance measure by measuring the correlation of the performance measure rankings with the ranking assigned by the visual scores. This is a modified version of the method described by Taha et al. [9]. The visual scores can be thought of as manually assigned ranks. The Spearman correlation coefficient was calculated for the simulated segmentation variation set of each patient individually, yielding 10 correlation coefficients per measure. For each measure, the median correlation coefficient was reported. Performance measures were ranked from the highest correlation to the lowest (Table 4). Ranking results of performance measures are reported in standard competition ranking.
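The selection step might be sketched as follows; the scores here are synthetic stand-ins for the real data, purely to show the per-patient Spearman correlation and the median across patients:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
correlations = []
for _ in range(10):                                        # one value per patient
    visual_scores = rng.integers(1, 11, size=300)          # visual scores 1-10
    measure_values = visual_scores + rng.normal(0, 1, 300) # a well-behaved measure
    rho, _ = spearmanr(visual_scores, measure_values)
    correlations.append(rho)

median_rho = float(np.median(correlations))                # reported per measure
```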

Table 4 Median Spearman correlation coefficients of visual scores and performance measure rankings

Subgroup analysis

We repeated the above-described analysis steps in two subsets to analyse the difference in performance measure rankings based on segmentation quality. The first subset consisted of segmentations of high and moderate quality (visual scores 1 to 5) and the second of segmentations of moderate to low quality (visual scores 6 to 10).

Sensitivity analysis of performance measures

In a second subanalysis, we assessed the sensitivity of the applied performance measures to the created errors. An ideal performance measure should have a wide score range and reflect the difference in quality of the assessed segmentations in its values. The extent of the score range shows the sensitivity of a performance measure to the created errors and can be measured by the index of dispersion (IoD). The index of dispersion is calculated by dividing the variance by the mean. We calculated the index of dispersion for each performance measure over the values they assigned to all 2984 simulated segmentation variations.
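The index of dispersion is simple to compute; the two value sets below are invented to mimic a wide-ranged measure and a saturated one:

```python
import numpy as np

def index_of_dispersion(values):
    """Variance divided by the mean, computed over a measure's values."""
    values = np.asarray(values, dtype=float)
    return float(values.var() / values.mean())

wide_measure = np.array([0.10, 0.40, 0.70, 0.90])        # Conformity-like spread
saturated_measure = np.array([0.998, 0.999, 0.999, 1.0]) # accuracy-like spread

iod_wide = index_of_dispersion(wide_measure)
iod_sat = index_of_dispersion(saturated_measure)
```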

In addition, it can be challenging to compare the absolute values of performance measures [29]. Comparison becomes easier when, for each visual score, the corresponding performance measure values are provided. Therefore, across all patients and for each visual score from 1 to 10, we calculated the median values of the performance measures over all simulated segmentation variations receiving that score.

Results

In our analysis of 2984 simulated segmentation variations, average distance-based performance measures performed best. The balanced average Hausdorff distance (bAHD, rank 1) and the average Hausdorff distance (AHD, rank 2) provided the segmentation rankings with the highest median correlation with visual scores. Overlap-based measures such as the Dice coefficient, the Jaccard index and Conformity performed worse (rank 7). Other popular measures such as volumetric similarity (rank 19) and the 95% Hausdorff distance (rank 20) showed considerably lower correlations than the aforementioned performance measures. In 8 out of the 10 tested patients, an average distance-based performance measure, either the bAHD or the classic AHD, led the rankings (see Additional File 2). The rankings of all performance measures can be found in Table 4.

In the subgroup analysis, bAHD and AHD were also the best performing measures for both good and bad quality groups. We provide, as an example, two errors in Fig. 4 with their corresponding Dice and bAHD values.

Fig. 4

Comparison of bAHD and Dice values for two examples of manually created errors. a A severe omission of small vessels, which received a Dice score of 0.960 and a bAHD score of 0.65. b A subtle boundary error of the internal carotid artery, which received a similar Dice value of 0.963 but a lower bAHD value of 0.039. Please note that distance-based measures assign lower values to better segmentations. bAHD is sensitive to the error in subpanel a and penalises the omission of small vessels because it considers voxel localisation. In contrast, Dice, which measures only the overlap, cannot distinguish between the two errors. Red: True positive voxels, Green: False-positive voxels, Blue: False-negative voxels

In our second subanalysis, performance measures exhibited different score ranges, as evidenced by the index of dispersion (Table 5). The highest IoDs, indicating a beneficially wide spread, were found for the three Hausdorff distance-based measures. Generally, the IoDs exhibited large differences, e.g. Conformity (IoD of 0.336) vs. accuracy (IoD of < 0.000002). The balanced average Hausdorff distance consistently had higher IoD values than its counterpart, the traditional average Hausdorff distance.

Table 5 Index of dispersion and median performance measure values of performance measures

The validation analysis of visual scores resulted in a median score deviation of 1 (interquartile range 2), the exact score overlap was 26%, and the raters chose the same subcategory of the scoring scheme (i.e. low/moderate/high) in 78% of cases.

Discussion

In the present work, we developed a performance measure selection framework based on visual scoring to find the most suitable measure for cerebral arterial vessel segmentation from TOF images. We showed that the average Hausdorff distance, especially its balanced version, is best suited for quality assessment of cerebral vessel segmentations. The ranking performance of average distance-based measures was superior in comparison to overlap-based measures, especially in ranking segmentations of good quality. We corroborated the theoretical assumptions that distance-based measures identify more relevant segmentation errors in complex structures like vessel trees due to their consideration of voxel localisation.

Finding a suitable performance measure for a specific segmentation task requires analysing the features of the anatomical structures that are segmented [10]. Cerebral vessel trees have complex boundaries, especially when considering pathologies like the stenosis of a vessel. Cerebral vessel tree segments are markedly smaller than the background, since only around 1% of brain voxels are vessels [8]. Outliers, i.e. small false-positive segments far away from the true vessels, are also harmful in cerebral vessel segmentation because they often represent false-positive anatomical structures. On theoretical grounds, Taha and colleagues suggested favouring distance-based performance measures for small segments with complex boundaries where outliers are also considered important [10]. Our empirical results, with bAHD and AHD as the top-performing measures, confirm these theoretical considerations.

Why average distance-based measures outperformed other measures can be explained by specific measure properties. For example, overlap-based performance measures such as Dice or Sensitivity do not take voxel localisation into consideration. Voxel localisation, however, is of paramount importance in cerebral vessel segmentation. Distance-based performance measures penalise voxels and surfaces that are further away from the ground truth more severely. This allows distance-based performance measures to recognise a false-positive structure, for example the superior sagittal sinus, and penalise the error accordingly.

The lack of sensitivity of the Dice coefficient towards specific errors becomes evident when looking at Fig. 4. Here, the severe omission of small vessels leads to a Dice coefficient of 0.960, which is almost identical to that of a minor boundary error of the internal carotid artery with a Dice coefficient of 0.963. bAHD, however, takes voxel localisation into consideration and penalises the severe small vessel error adequately. This shows that the application of measures like the Dice coefficient is problematic. As long as many errors or severe errors are present, both the Dice coefficient and distance-based measures will be sufficient to identify a bad segmentation. When only a few errors are left, i.e. the best segmentation out of a group of good segmentations must be chosen, the Dice coefficient cannot correctly rank the segmentations anymore. The work of Hilbert et al. also corroborates this. They found no significant differences in Dice values when comparing different high-performing architectures but did find significant differences in the average Hausdorff distance values [5].
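The figure's point can be reproduced on a toy 2D example; the geometry below is invented, but it shows how an omission of a distant small vessel and a boundary error of the same voxel count yield identical Dice values while the average Hausdorff distance separates them clearly (a sketch using the standard AHD definition, not the balanced variant):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice(a, b):
    """Dice coefficient between two binary masks."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def average_hausdorff(a, b):
    """Average Hausdorff distance (in voxels) between two binary masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    d = cdist(pa, pb)                       # pairwise Euclidean distances
    return (d.min(axis=1).mean() + d.min(axis=0).mean()) / 2.0

gt = np.zeros((32, 32), dtype=bool)
gt[16, 2:30] = True          # main vessel (28 voxels)
gt[2, 2:6] = True            # distant small vessel (4 voxels)

omit_small = gt.copy()
omit_small[2, 2:6] = False   # severe omission of the small vessel

boundary = gt.copy()
boundary[16, 26:30] = False  # subtle boundary shortening (also 4 voxels)

dice_omit, dice_bound = dice(gt, omit_small), dice(gt, boundary)
ahd_omit, ahd_bound = average_hausdorff(gt, omit_small), average_hausdorff(gt, boundary)
```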

These considerations have direct implications for the further development of novel vessel segmentation algorithms.

On one hand, research has focused on developing completely new [30], modified [26] or combined [14] performance measures that are more sensitive to errors and have wider score ranges to distinguish subtle differences between ground truth and segmentation. For example, Chang et al. proposed Conformity instead of Dice and Sensibility instead of Specificity. These two performance measures promised better performance in recognising errors and detecting minor variabilities in segmentations due to their wider score range [28]. The wider score ranges were also confirmed in our analysis by the index of dispersion (Table 5). Conformity and Sensibility should thus be preferred over Dice and Specificity, respectively.

On the other hand, our results have direct implications for the training process of deep learning applications. During training, the algorithm must be given a mathematical formula according to which it can decide how erroneous the current model's segmentations are. This error definition, called a loss function in deep learning terminology, is minimised during training and consequently used for model adaptation. Currently, Dice coefficient-based loss functions are in widespread use [8, 31,32,33]. Based on the previous considerations, it is evident that such a loss function will experience a ceiling effect and will not identify the optimal segmentation. Thus, we recommend loss functions based on average Hausdorff distance measures as the default for arterial brain vessel segmentation [34, 35].
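A soft Dice loss of the kind referred to above can be sketched in a few lines; this is a framework-agnostic numpy illustration (in practice it would be written with a deep learning framework's differentiable tensors), and the epsilon smoothing term is a common convention, not taken from the cited works:

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-7):
    """1 minus the soft Dice coefficient; pred holds probabilities in [0, 1]."""
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

target = np.array([0.0, 1.0, 1.0, 0.0])
loss_perfect = soft_dice_loss(target, target)                       # near 0
loss_poor = soft_dice_loss(np.array([0.9, 0.1, 0.2, 0.8]), target)  # near 1
```

A distance-based alternative, as recommended above, would replace or complement this overlap term with an average-Hausdorff-style penalty [34, 35].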

Our results also argue against the utilisation of single measures. Simultaneous usage of multiple measures for performance assessment may reveal aspects of the tested segmentations, which may be overlooked by relying solely on one performance measure [36]. In this sense, using an additional distance-based performance measure may reveal contour errors or outliers that may compromise the segmentation quality. The 95% Hausdorff Distance, for example, quantifies the largest error of a segmentation as the longest distance one has to travel from a point in one of the two sets to its closest point in the other set [27]. Thus, the 95% Hausdorff Distance provides a different perspective on the quality of the segmentation at hand. We argue that reporting Dice for comparability and overlap-based evaluation, reporting bAHD for capturing more relevant errors, and reporting 95% Hausdorff distance for quantifying the largest segmentation error is a suitable protocol to assess segmentation quality of cerebral vessel segmentations.

Our study has limitations. First, the predefined visual scoring was only performed by one rater due to the highly time-consuming nature of scoring nearly 3000 segmentations. To mitigate this limitation, we performed a validation analysis of visual scores in a random subset, which showed a high similarity of scores assigned by two independent raters. This high similarity argues in favour of the robustness of our results. Second, we analysed a large set of 22 measures but could not analyse all existing performance measures due to availability constraints in the analysis software. Thus, it cannot be ruled out that other measures might outperform the ones identified in the current work. Third, the different types of technically designed errors were not weighted according to their clinical impact on treatment decisions. Fourth, our work was performed in 3D TOF MRA images only. However, it is likely that the results are transferable to other 3D neuroimaging modalities such as computed tomography (CT). Fifth, our study included a limited number of 10 patients. Time-intensive manual error creation and subsequent visual scoring were the main factors limiting the number of patients. However, it is important to note that our analysis mainly depends on the large number and variable selection of different errors and less on the number of patients, because the variability of changes in the vasculature introduced by the errors is far larger than the anatomical variation between patients.

Conclusions

Out of all performance measures analysed in this work, average distance-based measures are most suited to identify the optimal segmentations for arterial brain vessel segmentation from 3D TOF MRA. Our work has the potential to accelerate the validation and development of novel vessel segmentation approaches.

Availability of data and materials

At the current time point, the imaging data cannot be made publicly accessible due to data protection regulations; the authors will make efforts to change this in the future. The code for the performance measure selection framework is available under the following GitHub repository:


Abbreviations

TOF MRA: Time-of-flight MR-Angiography

ICA: Internal carotid artery

Pcom: Posterior communicating artery

M1: The sphenoidal segment of the middle cerebral artery

IoD: Index of dispersion

PM: Performance measure

References

  1. WHO EMRO | Stroke, Cerebrovascular accident | Health topics [Internet]. [cited 2021 Jan 17]. Available from:

  2. Turc G, Bhogal P, Fischer U, Khatri P, Lobotesis K, Mazighi M, et al. European Stroke Organisation (ESO) - European Society for Minimally Invasive Neurological Therapy (ESMINT) guidelines on mechanical thrombectomy in acute ischaemic stroke endorsed by Stroke Alliance for Europe (SAFE). Eur Stroke J. 2019;4(1):6–12.

  3. Gutierrez J, Cheung K, Bagci A, Rundek T, Alperin N, Sacco RL, et al. Brain arterial diameters as a risk factor for vascular events. J Am Heart Assoc. 2015;4(8):e002289.

  4. van Seeters T, Hendrikse J, Biessels GJ, Velthuis BK, Mali WPTM, Kappelle LJ, et al. Completeness of the circle of Willis and risk of ischemic stroke in patients without cerebrovascular disease. Neuroradiology. 2015;57(12):1247–51.

  5. Hilbert A, Madai VI, Akay EM, Aydin OU, Behland J, Sobesky J, et al. BRAVE-NET: fully automated arterial brain vessel segmentation in patients with cerebrovascular disease. Front Artif Intell. 2020.

  6. Patel TR, Paliwal N, Jaiswal P, Waqas M, Mokin M, Siddiqui AH, et al. Multi-resolution CNN for brain vessel segmentation from cerebrovascular images of intracranial aneurysm: a comparison of U-Net and DeepMedic. In: Medical Imaging 2020: Computer-Aided Diagnosis [Internet]. International Society for Optics and Photonics; 2020 [cited 2021 Feb 2]. p. 113142W. Available from:

  7. Ni J, Wu J, Wang H, Tong J, Chen Z, Wong KKL, et al. Global channel attention networks for intracranial vessel segmentation. Comput Biol Med. 2020;118:103639.

  8. Livne M, Rieger J, Aydin OU, Taha AA, Akay EM, Kossen T, et al. A U-net deep learning framework for high performance vessel segmentation in patients with cerebrovascular disease. Front Neurosci. 2019.

  9. Taha AA, Hanbury A. Evaluation metrics for medical organ segmentation and lesion detection. In: Hanbury A, Müller H, Langs G, editors. Cloud-based benchmarking of medical image analysis [Internet]. Cham: Springer International Publishing; 2017 [cited 2020 Apr 19]. p. 87–105.

  10. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Med Imaging. 2015.

  11. Moccia S, De Momi E, El Hadji S, Mattos LS. Blood vessel segmentation algorithms—review of methods, datasets and evaluation metrics. Comput Methods Programs Biomed. 2018;158:71–91.

  12. Zou KH, Warfield SK, Bharatha A, Tempany CMC, Kaus MR, Haker SJ, et al. Statistical validation of image segmentation quality based on a spatial overlap index. Acad Radiol. 2004;11(2):178–89.

  13. Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26(3):297–302.

  14. Yeghiazaryan V, Voiculescu I. Family of boundary overlap metrics for the evaluation of medical image segmentation. J Med Imaging. 2018;5(01):1.

  15. Nazir A, Cheema MN, Sheng B, Li H, Li P, Yang P, et al. OFF-eNET: an optimally fused fully end-to-end network for automatic dense volumetric 3D intracranial blood vessels segmentation. IEEE Trans Image Process. 2020.

  16. Huang D, Yin L, Guo H, Tang W, Wan TR. FAU-Net: fixup initialization channel attention neural network for complex blood vessel segmentation. Appl Sci. 2020;10(18):6280.

  17. Zhang B, Liu S, Zhou S, Yang J, Wang C, Li N, et al. Cerebrovascular segmentation from TOF-MRA using model- and data-driven method via sparse labels. Neurocomputing. 2020;380:162–79.

  18. Meijs M, Patel A, van de Leemput SC, Prokop M, van Dijk EJ, de Leeuw F-E, et al. Robust segmentation of the full cerebral vasculature in 4D CT of suspected stroke patients. Sci Rep. 2017;7(1):15622.

  19. Hotter B, Pittl S, Ebinger M, Oepen G, Jegzentis K, Kudo K, et al. Prospective study on the mismatch concept in acute stroke patients within the first 24 h after symptom onset - 1000Plus study. BMC Neurol. 2009;8(9):60.

  20. Yushkevich PA, Piven J, Hazlett HC, Smith RG, Ho S, Gee JC, et al. User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability. Neuroimage. 2006;31(3):1116–28.

  21. Deshpande A, Jamilpour N, Jiang B, Michel P, Eskandari A, Kidwell C, et al. Automatic segmentation, feature extraction and comparison of healthy and stroke cerebral vasculature. NeuroImage Clin. 2021;30:102573.

  22. Gao X, Uchiyama Y, Zhou X, Hara T, Asano T, Fujita H. A fast and fully automatic method for cerebrovascular segmentation on time-of-flight (TOF) MRA image. J Digit Imaging. 2011;24(4):609–25.

  23. Chen L, Mossa-Basha M, Balu N, Canton G, Sun J, Pimentel K, et al. Development of a quantitative intracranial vascular features extraction tool on 3D MRA using semi-automated open-curve active contour vessel tracing. Magn Reson Med. 2018;79(6):3229–38.

  24. Hsu C-Y, Schneller B, Alaraj A, Flannery M, Zhou XJ, Linninger A. Automatic recognition of subject-specific cerebrovascular trees. Magn Reson Med. 2017;77(1):398–410.

  25. Wang R, Li C, Wang J, Wei X, Li Y, Zhu Y, et al. Threshold segmentation algorithm for automatic extraction of cerebral vessels from brain magnetic resonance angiography images. J Neurosci Methods. 2015;241:30–6.

  26. Aydin OU, Taha AA, Hilbert A, Khalil AA, Galinovic I, Fiebach JB, et al. On the usage of average Hausdorff distance for segmentation performance assessment: hidden error when used for ranking. Eur Radiol Exp. 2021;5(1):4.

  27. Huttenlocher DP, Klanderman GA, Rucklidge W. Comparing images using the Hausdorff distance. IEEE Trans Pattern Anal Mach Intell. 1993;15:850–63.

  28. Chang H-H, Zhuang AH, Valentino DJ, Chu W-C. Performance measure characterization for evaluating neuroimage segmentation algorithms. Neuroimage. 2009;47(1):122–35.

  29. Li J, Udupa JK, Tong Y, Wang L, Torigian DA. LinSEM: linearizing segmentation evaluation metrics for medical images. Med Image Anal. 2020;1(60):101.

  30. Gegundez-Arias ME, Aquino A, Bravo JM, Marin D. A function for quality evaluation of retinal vessel segmentations. IEEE Trans Med Imaging. 2012;31(2):231–9.

  31. Kitrungrotsakul T, Han X-H, Iwamoto Y, Lin L, Foruzan AH, Xiong W, et al. VesselNet: a deep convolutional neural network with multi pathways for robust hepatic vessel segmentation. Comput Med Imaging Graph Off J Comput Med Imaging Soc. 2019;75:74–83.

  32. Milletari F, Navab N, Ahmadi S-A. V-Net: fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV) [Internet]. Stanford, CA, USA: IEEE; 2016 [cited 2019 Jun 18]. p. 565–71.

  33. Jia D, Zhuang X. Learning-based algorithms for vessel tracking: a review. arXiv:2012.08929 [cs, eess] [Internet]. 2020 Dec 16 [cited 2021 Jan 25].

  34. Karimi D, Salcudean SE. Reducing the Hausdorff distance in medical image segmentation with convolutional neural networks. arXiv:1904.10030 [cs, eess, stat] [Internet]. 2019 Apr 22 [cited 2019 Jul 2].

  35. Ribera J, Güera D, Chen Y, Delp EJ. Locating objects without bounding boxes. arXiv:1806.07564 [cs] [Internet]. 2019 Apr 3 [cited 2020 Jul 23].

  36. Renard F, Guedria S, Palma ND, Vuillerme N. Variability and reproducibility in deep learning for medical image segmentation. Sci Rep. 2020;10(1):13724.




Funding

Open Access funding enabled and organized by Projekt DEAL. This work received funding from the German Federal Ministry of Education and Research through a GO-Bio grant for the research group PREDICTioN2020 (lead: DF), and from the European Commission via the Horizon 2020 programme for PRECISE4Q (No. 777107, lead: DF).

Author information

Authors and Affiliations



Contributions

OUA, AAT, AH, DF, and VIM: concept and design; VIM, AAK, IG, and JBF: acquisition of data; OUA, AAT, AH, and VIM: code; OUA, AAT, and VIM: data analysis; OUA, AAT, AH, AAK, IG, JBF, DF, and VIM: data interpretation; OUA, AAT, AH, AAK, IG, JBF, DF, and VIM: manuscript drafting and approval. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Orhun Utku Aydin.

Ethics declarations

Ethics approval and consent to participate

The 1000Plus study was carried out with approval from the local Ethics Committee of Charité University Hospital Berlin (EA4/026/08). The study protocol was carried out in accordance with the Declaration of Helsinki.

Consent for publication

The study was carried out with written informed consent from all subjects in accordance with the Declaration of Helsinki.

Competing interests

Dr. Madai reported receiving personal fees from ai4medicine outside the submitted work. Adam Hilbert reported receiving personal fees from ai4medicine outside the submitted work. Dr. Frey reported receiving grants from the European Commission and personal fees from, and holding an equity interest in, ai4medicine outside the submitted work. There is no connection, commercial exploitation, transfer or association between the projects of ai4medicine and the results presented in this work. JBF reported personal fees from Abbvie, AC Immune, Artemida, Bioclinica, Biogen, BMS, Brainomix, Cerevast, Daiichi-Sankyo, Eisai, F. Hoffmann-La Roche AG, Eli Lilly, Guerbet, Ionis Pharmaceuticals, IQVIA, Janssen, Julius Clinical, jung diagnostics, Lysogene, Merck, Nicolab, Premier Research, and Tau Rx, outside the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1. Visualisations of manually created segmentation errors.

Additional file 2. Performance measure rankings of individual patients.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Aydin, O.U., Taha, A.A., Hilbert, A. et al. An evaluation of performance measures for arterial brain vessel segmentation. BMC Med Imaging 21, 113 (2021).
