Although combining structural MRI scans from different centers provides an opportunity to increase the statistical power of brain morphometric analyses in neurological and neuropsychiatric disorders, one important confound is the potential for scanner and MRI protocol effects to introduce systematic errors, thus making the interpretation of results difficult. In this study, we describe a methodical approach to choosing a T1-weighted volume, based on five steps. We demonstrated our approach for a two-center study, but believe that it can be generalized to larger multi-center studies. We acquired five datasets at two different centers (equipped with scanners from the same manufacturer with the same field strength): two within-subject within-center datasets for an initial comparison of T1-weighted volumes and subsequent optimization of the best performing of these protocols, two between-subject between-center datasets (short-term and long-term comparisons), and one between-subject, within-center dataset after a scanner upgrade in one of the centers. By analysing the summary data quality and quantitative measures extracted from FreeSurfer and VBM (as implemented within SPM5) we aimed to determine an optimised MRI protocol which gave high contrast to noise ratio/image quality (as evidenced by FreeSurfer measures), had minimal image artefacts and was reproducible across centers and over time.
For our initial assessment of image quality, after visual inspection for artefacts we extracted and examined the data quality measures (Euler number and contrast to noise ratio) and used these to compare the scans within the initial T1-weighted volume comparison with each other and with the examplar dataset from FreeSurfer. These quality measures were suggested by Dr. Fischl (http://firstname.lastname@example.org/msg11456.html) and, while they are inherently inter-related (as the quality of the segmentation of the cortical surface depends in contrast and contrast to noise) it is important that measures are included that assess the “down stream” effects of image quality on image analysis as well as simple acquisition-related measures. The results indicated that the performance of FreeSurfer for reconstructing the surfaces was higher for MPRAGE sequences than the FLASH sequences. This is consistent with the results of Tardif et al.  and Deichmann et al.  that showed that because of their higher CNR images, MPRAGE sequences improve the accuracy of tissue classification. Therefore MPRAGE sequences are better options for segmentation tools such as FreeSurfer and VBM. Consequently, the FLASH sequence (protocol A) was rejected from further analysis, and from the remaining MPRAGE sequences we selected the two with the highest total summary data quality ranks (protocols C and F).
The top performing T1-weighted volumes were optimized in order to generate artefact-free protocols while retaining high image quality as assessed by FreeSurfer. We did this by applying a combination of changes in the parameter settings of the protocols including changing the phase encoding direction and/or adding flow compensation and/or saturation band. The extracted data quality measures indicated that image quality remained high enough (compared to “Bert” dataset and to the initial T1-weighted protocols) for all the 7 protocols in this dataset and we therefore visually inspected the images in order to select the best three artefact-free protocols to take forward for further testing. This dataset demonstrated that relatively minor changes to the protocol could have measurable effects on overall data quality, and emphasised the necessity for optimisation of protocols for the particular analysis to be performed.
Within-center comparisons of the quantitative measures from the short-term two-center and long-term two-center datasets with a scan interval of 1.5 year (Figure 2 and 3) indicated that for Center 2 cortical thickness comparisons, protocols F2 (with right-left phase encoding direction) and F3 (with right-left phase encoding direction and saturation band) showed low reproducibility in frontal, cingulate and parietal regions whereas all the regions were highly reproducible when using protocol C3 (with right-left phase encoding direction and flow compensation). In Center 1, on the other hand, all the protocols showed a high reproducibility for within-center cortical thickness comparisons. Within-center comparisons of a selection of subcortical structure volumes also indicated differences in the performance of the protocols, with protocol C3 showing higher reproducibility in most regions for measurements from both centers. VBM total volume results also showed lower reproducibility for GM measurements acquired using F2 and F3 than C3 and WM measurements acquired using C3 than F2 and F3. Whole brain volume measurements found to be more reproducible using C3 than the other protocols for both centers. These findings not only suggest that, for this study, protocol C3 gave the highest reproducible with respect to quantitative measures after a 1.5 year scan interval, but also highlights the importance of evaluating reproducibility at all sites in multi-center studies. While the Center 2 and Center 1 scanners are nominally identical, performance varied between these sites. Takao et al.  have reported similar findings, showing that even with scanners of exactly the same model (3 T General Electric scanners in their case) scanner drift and inter-scanner variability could cancel out the effects of genuine longitudinal brain volume changes. Since the PIQT data of the QA protocol did not show any significant difference between the two centers, the fact that all the protocols performed better at Center1 than at Center2 might reflect a subtle bias due to the fact that the optimization steps were performed for Center1 and the protocols then simply transferred to the other site. In this study, because of considerations of cost and time we chose to perform the initial comparison of T1-weighted volumes at one center and to use the best performing protocols from this center at both centers. However the ideal procedure for a multicenter study would be to: 1) perform and compare the QA information from the different centers, 2) acquire the first two steps at all sites and chose the best performing artefact-free protocols at all centers and then 3) perform the reproducibility tests. Furthermore, within-center variations were found comparing test and retest measurements at each of the centers which were highest for some of the protocols. One of the goals of the present study was thus to highlight the importance of reproducibility studies such as these, since this kind of within-center variation is not predictable a-priori. Han et al.  and Jovicich et al.  reported similar small variations in cortical and subcortical measurements, in their case comparing the scans from two different sessions with a 2-week scan interval (short-term scan interval). In the current study we compared the measurements after a long-term scan interval and showed that although some of the measurements showed lower reproducibility, it was possible to find a protocol and set of scan parameters that gave reproducible measures even over this longer interval.
Between-center comparison of cortical thickness from the short-term two-center comparison indicated that, with protocols F2 and F3 the reproducibility was lower for frontal, cingulate and parietal regions. However, with these two protocols, the differences in these regions were less in the long-term two-center retest comparisons. The reproducibility of all the regions in both the between-center test and in the between-center retest was higher using protocol C3. In the subcortical comparisons, the reproducibility of several volumes were low for protocols F2 and F3 in the initial between-center test scans while protocol C3 showed highly reproducible results. For the between-center retest scans also low reproducibility was found in several volumes using protocols F2 and F3 and protocol C3 found to be the most reproducible one. With respect to VBM total volume measurements, GM and WM measurements of the retest scans acquired using all the protocols were highly reproducible and for the GM and WM of the test scans and also WBV of both test and retest scans, protocol C3 found to be the most reproducible protocol. Again, these findings are important since they indicate that although both centers were equipped with the scanners of the same field strength and from the same manufacturer, the reproducibility was not always high and can be improved by carefully selecting the acquisition protocol.
The reproducibility of the quantitative measures was also examined after a scanner upgrade at Center 1. The reproducibility of all the protocols was relatively high for all the quantitative measures, even in problematic regions and structures, for both the within-center and between-center comparisons. These results are in line with reports from Han et al. , Stonnington et al.  and Jovicich et al.  in which the morphometric brain measurements did not significantly vary after even major scanner upgrades. Our results of between-center comparisons also indicate that the cortical thickness and VBM measurements difference across the centers were reduced after scanner upgrade at Center1. We speculate that this may be because of general servicing and tuning of the scanner during the upgrade but were not able to specifically assess this.
Ideally in large-scale multi-center and longitudinal studies which involve scanning very large number of subjects in several different centers, such as the Alzheimer's Disease Neuroimaging Initiative (ADNI) study  and Schizophrenia Twin and Relatives (STAR) study , using several human subjects for calibration procedure is preferred to get more precise results. In relatively small studies like ours (http://www.neuroimaging-did.com), however, which only involves scanning 50 subjects in each of the two centers, scanning a large number of calibration subjects is not feasible in terms of cost or time. Therefore, we decided to perform the calibration study using a small number of volunteers which is a potential limitation of this study. An additional potential confound is the impact of subject motion on the Euler number and other assessment measures; in the current study we control this by visual assessment of scans, but with a larger number of calibration subjects and scans, more quantitative methods could be applied to assess the effects.
Another important issue to consider is that of longitudinal studies. As was shown in the current study, within-center long-term reproducibility precision may vary to some extent depending on the MR protocol and these variations could be confounding factors. Systematic differences between scans acquired at different times could be mis-interpreted as real brain volumetric changes. Therefore, longitudinal studies need to consider performing reproducibility tests with a larger sample size and on a regular basis, and need to be appropriately powered. They also need more robust techniques than cross sectional measurements, which is particularly important when more than one center is involved in a longitudinal study. The results of such studies may also allow inter-site differences in accuracy to be assessed and if necessary allowed for by calibrations .
To summarize, our results showed that while several of the protocols showed promise, with high FreeSurfer performance/image quality and being artefact-free, one of the protocols (C3) gave the highest reproducibility. Determining this a-priori would have been impossible without acquiring and assessing these datasets. Since we scanned only three participants in the current study, we chose not to statistically compare the quantitative measures, but were nevertheless able to draw useful conclusions from the results.
The approach described here could be applied to protocol optimization across centers for other multi-center studies. These findings suggest researchers planning to perform multicenter studies should consider performing assessments such as these to ensure that by pooling data from different centers they are not unnecessarily reducing the power of their study due to variance from unexpected inter- or intra-site differences.