Reliability of the freehand region-of-interest method in quantitative cerebral diffusion tensor imaging

Background Diffusion tensor imaging (DTI) is a magnetic resonance imaging (MRI) technique used for evaluating changes in the white matter of the brain parenchyma. The reliability of quantitative DTI analysis is influenced by several factors, such as the imaging protocol, pre-processing and post-processing methods, and selected diffusion parameters. The region-of-interest (ROI) method is the most widely used of the post-processing methods because it is available in commercial software. The focus of our research was to study the reliability of the freehand ROI method using various intra- and inter-observer analyses.
Methods This study included 40 neurologically healthy participants who underwent diffusion MRI of the brain with a 3 T scanner. The measurements were performed at nine different anatomical locations using a freehand ROI method. The data extracted from the ROIs included the regional mean values, intra- and inter-observer variability and reliability. The DTI parameters used were fractional anisotropy (FA), the apparent diffusion coefficient (ADC), and axial (AD) and radial (RD) diffusivity.
Results The average intra-observer intra-class correlation coefficient (ICC) was found to be 0.9 (excellent). The single ICC results were excellent (> 0.8) or adequate (> 0.69) in eight out of the nine regions in terms of FA and ADC. The most reliable results were found in the frontobasal regions. Significant differences between age groups were also found in the frontobasal regions. Specifically, the FA and AD values were significantly higher and the RD values lower in the youngest age group (18–30 years) compared to the other age groups.
Conclusions The quantitative freehand ROI method can be considered highly reliable for the average ICC and mostly adequate for the single ICC. The freehand method is suitable for research work with a well-experienced observer. Measurements should be performed at least twice in the same region to ensure that the results are sufficiently reliable. In our study, reliability was slightly undermined by artifacts in some regions such as the cerebral peduncle and centrum semiovale. From a clinical point of view, the results are most reliable in adults under the age of 30, when age-related changes in brain white matter have not yet occurred.

based on the diffusion characteristics of water molecules, which, in turn, reflect the histological structure of the tissue [3]. Diffusion data can be used to calculate several quantitative parameters, such as fractional anisotropy (FA), the apparent diffusion coefficient (ADC), and axial (AD) and radial (RD) diffusivity. FA indicates the degree of diffusion anisotropy. The diffusion is generally strongest in the orientation parallel to the nerve tracts. The ADC expresses the mean diffusion in each direction. AD can be considered to be modulated by the axonal integrity [4,5], and its changes can thus reflect the degree of axonal degeneration [6]. RD, on the other hand, is modulated by axonal myelination [4,5].
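As an illustration of how the four parameters relate to the diffusion tensor, the following minimal sketch computes them from the three tensor eigenvalues using the standard textbook definitions. It is not taken from the software used in this study. The function name `dti_scalars`, the example eigenvalues, and the choice to treat the ADC as the mean of the three eigenvalues (mean diffusivity) are assumptions made for illustration.

```python
import math


def dti_scalars(l1, l2, l3):
    """Return FA, ADC, AD and RD from diffusion tensor eigenvalues.

    Assumes the eigenvalues are sorted so that l1 >= l2 >= l3 and are
    given in mm^2/s. ADC is taken here as the mean diffusivity.
    """
    adc = (l1 + l2 + l3) / 3.0   # mean of the three eigenvalues
    ad = l1                      # diffusivity along the principal axis
    rd = (l2 + l3) / 2.0         # mean diffusivity perpendicular to it
    num = (l1 - adc) ** 2 + (l2 - adc) ** 2 + (l3 - adc) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    # FA is 0 for isotropic diffusion and approaches 1 for a single direction.
    fa = math.sqrt(1.5 * num / den) if den > 0 else 0.0
    return {"FA": fa, "ADC": adc, "AD": ad, "RD": rd}


# Hypothetical eigenvalues for a tightly packed white matter tract.
example = dti_scalars(1.8e-3, 0.3e-3, 0.2e-3)
```

With these hypothetical eigenvalues the FA is high, because most of the diffusion occurs along the principal axis. Equal eigenvalues would give an FA of zero.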
Chronic white matter diseases, as well as normal aging, cause a decrease in FA values, while RD values tend to increase [18,19,20,21,22,23,24,25]. A strong relationship has also been found between the changes in AD and axonal injury [4]. Moreover, ADC values may temporarily decrease in the acute phase of cerebrovascular accidents, but, in the chronic phase, they usually increase [26,27].
The imaging process includes several steps between acquisition and the final parametric result, and each step is susceptible to different pitfalls [28,29]. Specifically, low resolution, a low signal-to-noise ratio (SNR), and a variety of artifacts can reduce the image quality [30,31,32,33]. In particular, the single-shot echo-planar technique used in diffusion imaging can cause severe image distortions because of the long echo trains that are used in the sequence. The resulting susceptibility artifacts appear as geometric distortions at the interfaces between soft tissue and air at the base of the skull [34]. In addition, B0 inhomogeneities decrease the efficiency of fat-saturation pulses [34]. Protons in water and fat have different Larmor frequencies, which leads to fat misregistration in single-shot echo-planar imaging. All of the above-mentioned pitfalls and artifacts also have a detrimental effect on the reliability of parametric results.
Post-processing and analysis methods can be selected according to whether individual or group results are required. The histogram [35], region-of-interest (ROI), and quantitative tractography methods [36] are suitable for both individual- and group-level analysis. In addition, the tract-based spatial statistics (TBSS) method [37] is an option for group analysis. Nowadays, different methods are often used concomitantly, which adds to the accuracy of the results [38,39].
The ROI method is still a highly valid method for measuring individual subjects. Although laborious, time-consuming, and observer-dependent, it is the most readily available method in commercial, clinically approved software. The method can be used to evaluate focal areas of the brain parenchyma in a single subject, and it allows artifacts to be left outside the area of measurement. Its low or moderate repeatability and its high intra- and inter-observer variation have been considered its drawbacks [40].
The main objective of this study was to investigate the reliability of the freehand ROI method using intra- and inter-observer variation and repeatability measurements. The aim was also to examine the effects of different parameters (FA, ADC, AD and RD) and artifacts on the reliability of the results. In addition, the effects of age on white matter changes were studied in group comparisons.

Subjects
Participants included 40 healthy adult volunteers consisting of 20 women and 20 men with an age range of 18-60 years and a mean age of 40.6 (SD 12.2) years [41,42]. The age groups were: (i) 18-30, (ii) 31-40, (iii) 41-50, and (iv) 51-60 years. Each age group included five men and five women. Thirty-nine of the subjects were right-handed, and one was left-handed. MRI scans were performed within a year (2010-2011). The exclusion criteria consisted of the following: (i) neurological problems (including abnormalities upon neuroimaging), (ii) psychiatric problems, (iii) history of traumatic brain injury, (iv) former neurosurgical procedure, (v) problems with hearing or vision, (vi) first language other than Finnish, (vii) MRI contraindications, and (viii) refusal to participate. No indications of significant structural abnormalities were found in any of the subjects in conventional clinical sequences. Ethics approval was obtained from the Ethical Committee of the Pirkanmaa Hospital District, and written consent was obtained from each volunteer.

MRI acquisition
The subjects were scanned with a 3 T Siemens Trio (Siemens Healthcare, Erlangen, Germany) MRI scanner. The MRI protocol included sagittal T1-weighted 3D IR-prepared gradient echo, axial T2-weighted turbo spin echo, conventional axial and high-resolution sagittal fluid attenuation inversion recovery (FLAIR), axial T2*-weighted, and axial susceptibility weighted imaging (SWI) series. The DTI data were collected with a single-shot, spin echo-based, diffusion-weighted echo planar imaging sequence. The parameters for the DTI sequence were: repetition time (TR) 5144 ms, echo time (TE) 92 ms, field-of-view (FOV) 230 mm, matrix 128 × 128, 3 averages, slice/gap 3.0/0.9 mm, voxel dimension 1.8 × 1.8 × 3.0 mm³, b-factors 0 and 1000 s/mm², and 20 diffusion gradient orientations. A 12-channel head coil and a four-channel neck coil were used simultaneously. The coils were subjected to regular quality tests throughout the study to verify that they were intact and of high quality.

Data analysis
The multidirectional diffusion data were visually analyzed for distortions and artifacts. The eddy current distortion was qualitatively estimated by drawing the brain contours on the b0 image and copying the contours to the diffusion-weighted images. Susceptibility and phase artifacts were verified by reviewing the FA, ADC, AD, RD, and b0 maps slice-by-slice.
The SNR was determined according to the National Electrical Manufacturers Association (NEMA) standards 1-2008 with the expression SNR = S/N, where S is the signal and N is the noise of the image. The noise was estimated assuming a Rayleigh distribution as N = SD/0.66, where SD is the standard deviation. SNR values were measured from the b0 images (b = 0 s/mm²) in each region.
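The SNR expression above can be sketched in a few lines. This is a minimal illustration and not the measurement procedure of the study. The function name `snr_nema` and the assumption that SD is measured from a background (air) region are hypothetical, and only the 0.66 factor comes from the text.

```python
import statistics

RAYLEIGH_FACTOR = 0.66  # correction factor quoted in the text


def snr_nema(signal_pixels, background_pixels):
    """Return SNR = S / N with N = SD / 0.66.

    S is the mean pixel value inside the measured region. SD is assumed
    here to be the standard deviation of a background (air) region.
    """
    signal = statistics.fmean(signal_pixels)
    sd = statistics.stdev(background_pixels)
    noise = sd / RAYLEIGH_FACTOR
    return signal / noise
```

For example, a region with a mean signal of 100 and a background standard deviation of 2 gives N = 2/0.66 ≈ 3.03 and an SNR of 33.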
Two experienced observers, a medical physicist (UH) and a neuroradiologist (AB), performed the freehand measurements on a workstation using the commercially available software Neuro3D (Siemens Healthcare, Malvern, USA). The freehand ROIs were manually placed on the axial images of the color-coded FA maps and automatically transferred to the ADC, AD, and RD maps as well as the non-diffusion-weighted b0 images. The ROIs were centered in the region using the color-coded directions. The ROIs were drawn to avoid border areas, such as areas overlapping with cerebrospinal fluid spaces, partial volume effects, and neighboring tracts. The thalamus was drawn on the grayscale FA map, because the border areas were more clearly distinguishable in this manner than in the color map.
Slices containing artifacts were avoided. If this was not possible, the artifact areas were excluded by omitting them from the ROI regions (Figs. 1 and 2). The sizes of the ROIs were chosen using the anatomical knowledge of brain regions and a tract-based atlas of human white matter anatomy [43]. The ROI size ranged from 10 mm² (min, cerebral peduncle) to 430 mm² (max, centrum semiovale). The time between the first and repeated freehand ROI measurements was at least four weeks.
Intra-observer measurements were performed for all volunteers (n = 40) and inter-observer measurements for 15 volunteers (n = 15). Nine regions were measured, eight of which were in the white matter (Fig. 3). Two observers analyzed each distinct region. The first observer (UH) analyzed the images of 40 subjects twice, and the second observer (AB) measured the images of 15 subjects. For the inter-observer analysis, the first measurements of observer 1 for the same 15 subjects were used. The regions in the pyramidal tracts included the cerebral peduncle, posterior limb of the internal capsule, corona radiata, and centrum semiovale. In the frontobasal area, these included the uncinate fasciculus and forceps minor, while, in the corpus callosum, these included the genu and splenium. One region, the thalamus, was in the gray matter. The FA, ADC, AD, and RD values were calculated for each region. The left and right hemispheres were measured separately for seven regions. The ROIs for the genu and splenium of the corpus callosum were drawn in the center of the axial image with one ROI per region.

Statistical analyses
The statistical analyses were performed using the SPSS software package (IBM SPSS Statistics versions 22 and 26, Chicago, IL). Means and standard deviations were calculated for each region and parameter, and asymmetries between hemispheres were evaluated using a paired samples t-test. The statistical significance was set to p < 0.007, with a Bonferroni correction for seven regions, according to the regions measured in each hemisphere of the brain. The normality of distributions was tested using the Shapiro-Wilk test (n < 50). The differences among all the age group means were analyzed using an analysis of variance (ANOVA) for the normally distributed data and Welch's test in inhomogeneous cases, where the variance of the variable differed between the age groups. The Kruskal-Wallis test was used for non-normally distributed data. A correlation analysis between FA, ADC and age from the same data has been published in our previous study [41]. In that study, we mostly used a small circle ROI, including a freehand ROI in three regions for better repeatability.
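The Bonferroni-corrected thresholds used in this section (p < 0.007 for seven regions here, and p < 0.006 for nine regions in the ICC analysis) can be reproduced with simple arithmetic. The sketch below assumes a family-wise significance level of 0.05, which the text does not state explicitly, and the function name `bonferroni_alpha` is hypothetical.

```python
def bonferroni_alpha(family_alpha, n_tests):
    """Per-test significance threshold after a Bonferroni correction."""
    return family_alpha / n_tests


# Assumed family-wise level of 0.05 (not stated in the text).
hemisphere_alpha = bonferroni_alpha(0.05, 7)  # seven regions per hemisphere
icc_alpha = bonferroni_alpha(0.05, 9)         # nine regions in the ICC analysis
```

Under this assumption, the two thresholds round to 0.007 and 0.006, which matches the reported values.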
The samples that showed statistically significant differences among the age groups were analyzed by a group comparison between the different age groups. The independent-samples t-test was used with the normally distributed samples, and the Mann-Whitney U test with the non-normal distributions.
To show the relative variability of each measurement, the percent coefficients of variation (CV%) were calculated according to the following equation (with SD = standard deviation and M = mean): CV% = (SD/M) × 100% [44]. The variability was considered acceptable when the CV% was less than 10% [45]. The results between 11 and 20% were considered to be moderate but still adequate. CV% values over 21% were considered too high and inadequate.
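The CV% calculation and its interpretation bands can be sketched as follows. This is an illustration only. The use of the sample standard deviation (n − 1 denominator) and the handling of values that fall exactly on or between the stated band limits are assumptions, and the function names are hypothetical.

```python
import statistics


def cv_percent(values):
    """Percent coefficient of variation: (SD / M) x 100."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD; an assumption
    return sd / mean * 100.0


def classify_cv(cv):
    """Map a CV% to the bands in the text.

    The boundary handling (<= 10 and <= 20) is an assumption, because
    the text leaves small gaps between the bands.
    """
    if cv <= 10.0:
        return "acceptable"
    if cv <= 20.0:
        return "moderate"
    return "inadequate"
```

For instance, hypothetical FA values of 0.60, 0.66 and 0.72 have a mean of 0.66 and an SD of 0.06, giving a CV% of about 9.1, which falls in the acceptable band.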
Bland-Altman plots were used as graphical representations for intra- and inter-observer repeatability [44]. The 95% limits of agreement and ± 2 standard deviations of the differences were calculated. The better the consistency between the first and repeated measurements, the smaller the difference between the two limits. Intra- and inter-observer repeatability was also assessed using intra-class correlation coefficients (ICCs) with absolute agreement. The two-way mixed model was chosen because the aim was to investigate the repeatability of these specific observers. In this study, the average ICC refers to the repeatability (test-retest) when the same region is measured twice and the final score is the average of the two measurements. The single ICC approximates a situation where the measurement would only be made once, as is usually the case in clinical situations. The cerebral hemispheres were analyzed separately, but the results are presented as the mean of the left and right hemispheres of the brain. The ICC values were considered to indicate excellent agreement if they were greater than 0.8. ICC results between 0.70 and 0.79 were considered adequate [45], and values below 0.69 were considered inadequate for clinical work. The statistical significance was set to p < 0.006, with a Bonferroni correction for nine regions.
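The two repeatability measures described above can be sketched for the case of two measurements per subject. This is a minimal illustration and not the SPSS implementation used in the study. The function names are hypothetical, and the ICC formulas are the standard two-way, absolute-agreement expressions for a single measure and for the average of k = 2 measures.

```python
import statistics


def limits_of_agreement(first, second):
    """Bland-Altman mean difference and the mean +/- 2 SD limits."""
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return mean_diff, mean_diff - 2 * sd, mean_diff + 2 * sd


def icc_absolute_agreement(first, second):
    """Return (single, average) ICC for two measurements per subject."""
    n, k = len(first), 2
    rows = list(zip(first, second))
    grand = statistics.fmean(list(first) + list(second))
    row_means = [statistics.fmean(r) for r in rows]
    col_means = [statistics.fmean(first), statistics.fmean(second)]
    # Two-way ANOVA sums of squares: subjects (rows), sessions (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Hypothetical FA values for four subjects, measured twice.
first = [0.60, 0.70, 0.80, 0.90]
second = [0.62, 0.69, 0.83, 0.88]
single_icc, average_icc = icc_absolute_agreement(first, second)
```

With these formulas the average ICC always equals 2 × single / (1 + single) for two measurements, which is why repeating a measurement raises the reported reliability.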

Results
The data quality was excellent in most cases. In some of the cases, artifacts were detected in the cerebral peduncle, corona radiata, and centrum semiovale (Table 1 and Fig. 2). Significant eddy current artefacts did not occur.

Mean values
In the Shapiro-Wilk test, 90% of the means were normally distributed (p > 0.05). The intra-observer mean values for the FA, ADC, AD, and RD of the sample (n = 40) are shown in Table 2.
In white matter ROIs, the mean FA value was 0.67. The lowest value was found in the corona radiata (0.50), and the highest in the genu of the corpus callosum (0.86). The mean ADC value was 0.74 × 10⁻³ mm²/s, with the lowest value being found in the corona radiata (0.70 × 10⁻³ mm²/s) and the highest in the uncinate fasciculus (0.78 × 10⁻³ mm²/s). The mean AD value was 1.44 × 10⁻³ mm²/s, with the lowest value being found in the corona radiata (1.10 × 10⁻³ mm²/s) and the highest in the genu of the corpus callosum (1.82 × 10⁻³ mm²/s). The mean RD value was 0.39 × 10⁻³ mm²/s, with the lowest value being found in the genu of the corpus callosum (0.26 × 10⁻³ mm²/s) and the highest in the forceps minor (0.53 × 10⁻³ mm²/s). In the gray matter (the thalamus), the corresponding mean values were 0.32 for the FA, 0.76 × 10⁻³ mm²/s for ADC, 1.00 × 10⁻³ mm²/s for AD, and 0.64 × 10⁻³ mm²/s for RD.
Statistically significant differences between the right and left hemispheres (paired t-test, p < 0.007) are expressed in Table 2, and the absolute mean values can be found in the table footnotes. In the pyramidal tract, more precisely in the posterior limb of the internal capsule and corona radiata, the FA values were significantly higher and RD values lower in the left hemisphere. The ADC values were lower in the left hemisphere in all four regions of the pyramidal tract. In the cerebral peduncle, the AD value was also lower in the left hemisphere. In both frontobasal regions, the FA values were significantly higher in the right hemisphere.
Significant differences between age groups were found in the frontobasal regions (Fig. 4). The FA and AD values were significantly higher and the RD values significantly lower in the youngest age group (18-30 years) compared to the other age groups (31-40, 41-50 and 51-60 years) (Fig. 4A, B). Specifically, the FA and RD differences were  Table 3.

Variation
The intra-observer variations (CV%) are shown in Table 2 (n = 40) (Fig. 5A). In the pyramidal tract, the variation for the FA measurements was 8%. The lowest variation was in the posterior limb of the capsula interna (5%), and the highest in the centrum semiovale (12%). The variation was 11% in the frontobasal area and 5% in the corpus callosum. In the gray matter (thalamus), the variation for [...] to 8% with all white matter and gray matter regions. For the RD measurements, the variation in the pyramidal tract was 12%. The lowest variation was in the posterior limb of the capsula interna (8%) and the highest in the cerebral peduncle (18%). The RD variation was 9% in the frontobasal area and 26% in the corpus callosum. In the gray matter (thalamus), the variation was 5%. The inter-observer variation results (CV%) are shown in Table 3 (Fig. 5B).
Table 2 The intra-observer (observer 1) regional mean FA (0-1, unitless), ADC (10⁻³ mm²/s), AD (10⁻³ mm²/s) and RD (10⁻³ mm²/s) values ± standard deviation (mean ± SD), variation (the percent coefficients of variation = CV%) and repeatability (the intra-class correlation coefficients (ICC) and mean difference ± 2SD) (N = 40)

Reliability
The intra-observer results of the limits of agreement are shown in Table 2. In the white matter, the best intra-observer agreement was found in the posterior limb of the capsula interna with all diffusion parameters. For the ADC, good agreement was also found in the corona radiata, centrum semiovale, uncinate fasciculus, and forceps minor. The largest range between the limits was found in the centrum semiovale for the FA and in the cerebral peduncle for the ADC, AD and RD measurements. The smallest and largest ranges between the 95% limits of agreement for each DTI parameter are presented in the Bland-Altman plots (Figs. 6, 7). For the gray matter, the agreement was very good with all DTI parameters (Fig. 8). On average, the 2 SD of the limit of agreement for the intra-observer results was 0.06.
The inter-observer limits of agreement are shown in Table 3, and the smallest ranges between limits are presented in the Bland-Altman plots for each DTI parameter (Fig. 9). In white matter regions, the best agreement was found in the uncinate fasciculus for FA and RD and in the corona radiata for ADC and AD. On average, the 2 SD of the limit of agreement for the inter-observer results was 0.08.
The intra-observer repeatability results (ICC) are shown in Table 2. For the FA, the mean was 0.87 for the average ICC and 0.78 for the single ICC. The highest average ICC was found in the uncinate fasciculus (0.95), and the lowest in the cerebral peduncle (0.75). The average ICC results for the FA were above 0.8, and the single ICCs were above 0.7, in eight of the nine regions. Only one region, the cerebral peduncle, had coefficients below these values (average 0.75 and single 0.60). For the ADC, the mean value was 0.91 for the average ICC and 0.85 for the single ICC. The highest ICC values were found in the centrum semiovale for both the average (0.98) and single (0.95) ICC. The lowest ICC was observed in the cerebral peduncle for both the average (0.80) and single (0.67) ICC. For AD, the mean average ICC result was 0.87, and the single ICC result was 0.78. The highest ICC values of AD were found in the splenium of the corpus callosum for both the average (0.94) and single (0.89) ICC. The lowest results of AD were in the centrum semiovale for the average (0.76) and single (0.62) ICC. For RD, the ICC results were 0.90 for the average and 0.82 for the single measurement. The best ICC values for the average (0.96) and single (0.93) measurements were both found in the frontobasal area, in the uncinate fasciculus. For RD, the lowest value was found in the cerebral peduncle for both the average result (0.76) and the single measurement (0.61).
Of the inter-observer ICC results, 70% were statistically significant (p < 0.006). Only significant results are presented. The means of the average ICCs were 0.84 for FA, 0.88 for ADC, 0.81 for AD, and 0.88 for RD, and the means of the single ICCs were 0.72, 0.79, 0.69 and 0.78, respectively. The highest ICCs were found in the corona radiata, where the average ICC values were 0.94 for FA, 0.95 for ADC and 0.97 for RD, and the single ICCs were 0.89, 0.90 and 0.94, respectively. For AD, the highest ICCs were found

Discussion
Table 3 Inter-observer regional mean FA (0-1, unitless), ADC (10⁻³ mm²/s), AD (10⁻³ mm²/s) and RD (10⁻³ mm²/s) values ± standard deviation (mean ± SD), variation (the percent coefficients of variation = CV%) and repeatability (mean difference ± 2SD) (observer 1 & 2) (N = 15). FA, fractional anisotropy; ADC, apparent diffusion coefficient; AD, axial diffusivity; RD, radial diffusivity; CV, coefficient of variation; CC, corpus callosum; diff, difference; SD, standard deviation
FA values are considered to reflect the integrity of the white matter. Although FA is not in itself a specific parameter in a diagnostic sense, it provides indirect information about myelination, fiber packing density, and fiber orientation [46]. It is well known that FA values vary widely at different anatomic levels of the brain [12,13,40,45,47]. Specifically, Lee et al. [12] reported that regional FA values varied from 0.21 in deep gray matter (putamen) to 0.81 in tightly packed parallel white matter tract bundles, such as the genu of the corpus callosum. The corresponding results in this study were 0.32 for deep gray matter (thalamus) and 0.86 for the genu of the corpus callosum. Regions with coherently oriented fibers, such as the cerebral peduncle, internal capsule, and corpus callosum, exhibited higher anisotropy than regions with less coherence, such as the centrum semiovale and other subcortical regions [48]. Due to the vast regional variability of FA, possible anatomical mismatches should be taken into account in inter-observer and intergroup comparisons [47]. The ADC values, on the other hand, exhibit less regional variation [13]. In our study, the ADC mean values varied between 0.7 and 0.8 × 10⁻³ mm²/s, and in other similar studies the variation was 0.7 to 0.9 × 10⁻³ mm²/s [45,49,50,51] [...] [53] were also similar to ours in the uncinate fasciculus. The FA and ADC values were almost identical to those found in our previous study (30 subjects) in most of the regions [40].
The biggest difference (14%) between our present and previous study was found in the genu of the corpus callosum. In this region, measurements were previously made on sagittal [40] instead of axial images, like in the present study. In general, the measured quantitative diffusion metrics were well in line with previous studies. Asymmetry between the hemispheres was found in some of the regions. Pyramidal tracts, such as the posterior limb of the capsula interna and corona radiata, expressed higher FA values and lower ADC and RD values in the left hemisphere. The present results are well in agreement with previous studies [13,40,54]. In addition, in the centrum semiovale, asymmetry of the cerebral hemispheres was observed in the ADC value, which was also lower on the left. Some of the observed asymmetry in our study may be attributed to handedness of the volunteers; 39 of the 40 volunteers in our study were right-handed. Corresponding hemispheric differences were obtained for right-handers in another study [54]. Phase artifacts (fat misregistration) could also be a possible explanation in the regions of the corona radiata and centrum semiovale. In the corona radiata, phase artifacts were present in 55% of cases in the left hemisphere but were not present at all in the right hemisphere. Similarly, the centrum semiovale included artifacts in 25% of cases in the left hemisphere and only in 5% in the right hemisphere. The fat misregistration generally raises FA values locally and decreases ADC and RD values. Artifacts can affect the ROIs in the vicinity, even if the visible part of the artifact is cropped out. Hemispheric differences were also found in the frontobasal area. In those regions, the FA values were found to be higher in the right hemisphere, which is in agreement with previous findings [40,55]. Jahanshad et al. [55] found that the variance in the asymmetry of the frontal lobe is strongly due to genetic factors. 
In our study, higher FA values were usually found in the right hemisphere of the frontobasal area. Bonekamp et al. [56] reported that small hemispheric differences could be due to slight slice angulation. Therefore, keeping the same slice position and orientation in longitudinal studies is essential [47].
In terms of age-related changes, we found significant differences between the youngest age group (18-30 years) and other age groups (31-40, 41-50, and 51-60 years). Specifically, the FA values were higher and the RD values lower in the frontobasal area in both hemispheres in the youngest age group when compared to the other age groups. For FA, this result has already been published in our previous study [41]. Other studies have also found changes in the frontal regions of the brain caused by aging [16,17]. In general, several studies have found a negative correlation between age and FA and a positive correlation between age and RD in white matter [21,22,57,59]. These variations may be related to changes in myelination and axon density [17,58,60].
In the present study, acceptable intra-observer variability (≤ 10%) was found in six out of nine regions for FA, while three regions had moderate but adequate variation. For ADC and AD, all regions had acceptable variability. For RD, seven out of nine regions had an acceptable or moderate variation and two had high variation (genu and splenium of the corpus callosum). The percent variation of the RD values in the corpus callosum is naturally high, because the mean value is clearly lower than in the other regions. Low RD values are due to the fact that the fibers are tightly packed and parallel to each other. In this case, the variation was not a good indicator for assessing reliability. Overall, the variation results were in line with our previous study [40]. It is noteworthy that the freehand method gives, on average, 4% lower variation in the pyramidal regions compared to the circle method [13,41]. In contrast, in our study, the freehand method gave a slightly higher variation in the corpus callosum than the circle method in previous studies [13,41]. This may be due to the fact that in our study, ROIs were plotted on the axial image, whereas in previous studies they were plotted on the sagittal image [13,41]. Thus, in this particular region, it would be better to use the circle method for a sagittal image than the freehand method for an axial image. The inter-observer (n = 15) variability was acceptable or moderate in seven out of nine regions. The inter-observer variabilities are in line with our previous study [40]. The intra-observer repeatability was at a very good level according to the 95% limits of agreement. The results varied according to region, and, with tightly packed white matter tracts, such as the posterior limb of the capsula interna, the difference between the limits was small. Also, the only region of gray matter, the thalamus, was found to be reliable in this analysis. 
Furthermore, the difference between the limits was greater in regions containing crossing fibers, such as the centrum semiovale. Overall, the results were consistent with our previous research [40]. The inter-observer agreement was lower than the intra-observer agreement in all regions, and others have reported similar results [13,40,59,60]. Several studies have shown that inter-observer agreement results have been one-third lower than intra-observer results [59,60]. Our study further confirms the trend between inter-observer and intra-observer agreements. The uncinate fasciculus was found to be the most reliable region in the inter-observer analyses for FA and RD, while the corona radiata was the most reliable region for ADC and AD.
The intra-observer reliability was high according to the average measures of the ICC analysis. In our study, average ICC refers to the repeatability obtained as the average of two measurements from a single region. Overall, the average ICC results were excellent for all four parameters. The repeatability result was also excellent (above 0.8) in eight out of nine regions for FA and all regions for the ADC. The repeatability of the freehand method was significantly improved compared to our previous study [40]. The average ICC increase was 0.4 (37%) in terms of the FA and ADC parameters.
The higher ICC values were probably due to increased observer experience in selecting a slice and in avoiding artifacts and the partial volume effect of border areas. The single intra-observer ICC analysis was, on average, excellent in terms of the ADC and RD parameters and moderate in terms of the FA and AD parameters. Single ICC in our study refers to the repeatability of a single measurement, which can be considered normal practice in clinical measurements. The results showed excellent or moderate repeatability in seven out of nine regions for all DTI parameters. The region with the highest single ICC values was the forceps minor, with excellent reliability for each parameter. Good reliability was also found in the following regions: the uncinate fasciculus, thalamus, and the genu and splenium of the corpus callosum. High reliability in the corpus callosum is consistent with previous studies with the ROI method [45,61,62] but also with the TBSS method [38]. Inadequate results (ICC < 0.69) were found in the cerebral peduncle (FA, ADC and RD) and centrum semiovale (AD). The reason for the inferior reliability of the cerebral peduncle was the susceptibility artifact, more specifically the air cavity. This artifact causes local changes in the results of the parameters. Although efforts were made to avoid distorted areas in the ROI, the effects of the artifact were also reflected in the surrounding areas. The low reliability of the centrum semiovale in the AD values can be explained by the multitude of crossing fibers in the subcortical white matter. Also, the statistically significant inter-observer results were highly similar to the intra-observer results. The differences between intra- and inter-observer ICC results averaged less than 5% for the average ICC and less than 10% for the single ICC. The most reliable inter-observer region was found to be the corona radiata, which had the highest value for three different parameters (FA, ADC, and RD). 
For AD, the highest value was obtained in the splenium of the corpus callosum. The reliability of the measurements is greatly improved if the measurement is repeated at least once or if the result is taken as a mean of the measurements from two different observers.
The SNR measurements showed that the image quality was sufficient for reliable quantitative measurements. In general, the SNR of the b = 0 s/mm² images should be at least 20 in order to derive reliable FA values [36]. In our study, the SNR was well above 20 in all regions, and the measured SNR values were comparable to other studies [63,64].
A limitation of this study was that the commercial program did not include eddy current and subject motion corrections. In addition, the imaging parameters used may not have been optimal, especially compared to more recent diffusion imaging techniques, e.g., high angular resolution diffusion imaging (HARDI) using isotropic voxels. Acquisition with higher-resolution isotropic voxels and possibly HARDI may give more accurate results [36]. Furthermore, it has been shown that using near 1 mm isotropic voxels gives excellent repeatability [65]. In addition, only 70% of the inter-observer ICC results were statistically significant, which was a consequence of the small number of samples. The schedule of measurements was limited.
In general, the regions with high reliability and low variation possess some common features. These regions have low anatomical variation and tightly packed fibers with a common orientation [66]. These areas also often have a better SNR and fewer partial volume effects, and they are less affected by crossing fibers. In addition, a larger ROI size increases the SNR value and improves the repeatability [66]. When a larger ROI is used in a limited region, a larger percentage of the voxels is likely to be shared between the two measurements than with a smaller ROI. The results of the repeat measurements are thus close to each other.
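The effect of ROI size on the share of common voxels can be illustrated with a small sketch. Overlap was not measured in this study. The Dice coefficient is used here only as one common way to express the shared fraction, and the square ROIs, the one-voxel shift and the function names are hypothetical.

```python
def square_roi(x0, y0, side):
    """Voxel coordinates of a hypothetical square ROI on one slice."""
    return {(x, y) for x in range(x0, x0 + side) for y in range(y0, y0 + side)}


def dice_overlap(roi_a, roi_b):
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    return 2 * len(roi_a & roi_b) / (len(roi_a) + len(roi_b))


# The repeated ROI is shifted by one voxel in both cases.
small = dice_overlap(square_roi(0, 0, 3), square_roi(1, 0, 3))
large = dice_overlap(square_roi(0, 0, 10), square_roi(1, 0, 10))
```

In this toy case, the same one-voxel placement error leaves about two-thirds of the voxels shared for the 3 × 3 ROI and 90% for the 10 × 10 ROI.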
In future studies, larger samples of carefully collected high-spatial and -angular resolution DTI normal data should be acquired. In those studies, more subjects should be recruited for each age group in order to perform a reliable analysis of the effect of age. In addition, it would be interesting to study how much the reliability of the measurements improves when different methods, such as the ROI, tractography, and TBSS, are used simultaneously.

Conclusions
According to our results, the intra-observer repeatability of the quantitative freehand ROI method can be considered at least adequate. The quantitative freehand ROI method can be considered highly reliable for the average ICC and mostly adequate for the single ICC. The reliability of the single measurements was excellent or moderate in 80% of the regions, including all DTI parameters. In the comparison of parameters, for the single ICCs, most of the repeatability results were excellent in terms of the ADC and RD while only moderate in terms of the FA and AD parameters.
According to our results, the freehand method can be considered highly suitable for research and clinical applications, assuming a well-experienced observer. Measurements should be repeated at least once in each region to ensure sufficient reliability of the results. The frontobasal regions, such as the uncinate fasciculus and forceps minor, as well as the internal capsule and corona radiata regions of the pyramidal tracts, were found to be reliable regions in the repeatability analysis. In addition, the only region of gray matter, the thalamus, was found to be reliable. Therefore, these regions could be considered the ones that yield the most accurate quantitative ROI measurements in clinical settings. In general, it would be highly beneficial to favor regions with high reliability and repeatability in ROI measurements, if possible. Additionally, special care should be taken in ROI delineation in subjects with image artifacts.
When using the results of healthy adults as a control for patient groups, it should be noted that the results are most reliable in adults under 30 years of age, whose brain white matter does not yet have age-related changes.