CTR derived from CXR is a valuable index for the evaluation of heart diseases, especially cardiomegaly [1,2,3,4]. Measuring it, however, still requires manual operations that are user dependent and time consuming, and despite its utility the measurement process is a burden in clinical practice. Recently, AI methods have successfully provided automatic calculation of this index and have been validated technically in various studies [9,10,11,12]. Before AI can be used in the clinical setting, a clinical evaluation is needed to assess its measurement agreement with the manual method. However, only two published pilot studies [9, 11], both with small datasets, have addressed this issue.
To our knowledge, this study is the first report of observer and method variations validating CTR measurement by AI on a large dataset (n = 7,517). Using a modified U-Net deep-learning model (i.e., a 2D VGG-16 U-Net) for CTR calculation, we found that the AI-only method is not suitable for use as an automated method of CTR measurement because of its high variation compared with the manual method. Its CTR calculations can, however, assist the user in obtaining better results. Furthermore, the coefficient of determination (R²) and classification performance tests (e.g., AUC) should not be employed, because they may lead the investigator to conclude falsely that the AI-only method can be employed on an automated basis. A Bland–Altman plot with coefficient-of-variation (CV) parameters, evaluated on a large dataset, should be used instead to indicate agreement between the methods.
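As a concrete illustration of the recommended agreement analysis, the sketch below computes the Bland–Altman bias, 95% limits of agreement, and a within-subject CV for paired CTR measurements. This is a minimal sketch, not the study's code; in particular, the exact CV formula used in the paper is not reproduced here, so the within-subject formulation below is an assumption.

```python
import numpy as np

def bland_altman_cv(a, b):
    """Bland-Altman bias, 95% limits of agreement, and within-subject CV
    for two paired sets of CTR measurements (e.g., manual vs AI-only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a - b
    bias = diff.mean()                      # systematic offset between methods
    half = 1.96 * diff.std(ddof=1)          # half-width of the 95% limits
    # Within-subject SD for paired measurements is sqrt(mean(diff^2) / 2);
    # the CV expresses it as a percentage of the grand mean (an assumed
    # formulation, not necessarily the one used in the paper).
    ws_sd = np.sqrt(np.mean(diff ** 2) / 2.0)
    cv = 100.0 * ws_sd / np.mean((a + b) / 2.0)
    return bias, (bias - half, bias + half), cv
```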
We found that the AI-only method provided excellent outcomes in about 40% of the data, a desirable result for an automated method. However, about 56% of outcomes required adjustment by the user (i.e., good outcomes), a proportion that must be improved before AI-only can be used automatically. Specifically, the AI-only method needs to improve its heart-diameter calculation, which is difficult because the heart's pixel values are low and its edges are fused with the lung borders or the thoracic spine [19]. In addition, the AI-only method had about a 4% failure rate (i.e., poor outcomes), most of which occurred in the normal data group (97%: 290/299). In routine clinical usage, where CTR is measured only in suspected cardiomegaly cases, such failure is infrequent (9 failures in 2,517 cardiomegaly cases). Nevertheless, most segmentation failures occurred on hearts with quite short diameters (e.g., Fig. 3j). This may be due to inadequate representation of such heart shapes in the training dataset. Fine-tuning the model on a local heart-shape dataset should further reduce such failures.
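For reference, CTR is derived from the segmentation output as the ratio of the maximal transverse cardiac diameter to the maximal internal thoracic diameter. The sketch below shows one way to compute this from binary heart and lung masks; it is illustrative only, and the function names are ours, not the authors' implementation.

```python
import numpy as np

def ctr_from_masks(heart_mask, lung_mask):
    """CTR from binary (H x W) segmentation masks.

    The overall horizontal extent of the heart equals the sum of the
    maximal right and left cardiac diameters measured from the midline,
    so a simple column-extent computation suffices.
    """
    def horizontal_extent(mask):
        cols = np.flatnonzero(mask.any(axis=0))  # columns containing the structure
        return 0 if cols.size == 0 else int(cols[-1] - cols[0] + 1)

    heart_w = horizontal_extent(heart_mask)
    thorax_w = horizontal_extent(lung_mask)      # outer extent across both lungs
    if thorax_w == 0:
        raise ValueError("empty lung mask: segmentation failed")
    return heart_w / thorax_w
```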
We found that the AI-assisted method had lower inter-observer bias and variation than the manual method (CV and bias: 1.72% vs 2.13% and −0.61 vs −1.62). This may be due to the AI's excellent outcome in about 40% of the data, which helps to improve measurement agreement. Furthermore, the AI-assisted method is almost five times faster than the manual method and increases F1 from 0.866 to 0.872 at the standard CTR cutoff of 0.5. This clearly demonstrates the usefulness of the AI method in assisting CTR measurement. Our AI-assisted time performance also agrees with a recent study by Bercean et al. [9], which found a similar magnitude of time reduction (22.5 vs 5.1 s, or 4.4 times). Even on a small dataset (n = 200), that study found that the model-assisted method improved the individual radiologist's cardiomegaly F1 score (0.845 to 0.851) compared with the manual method.
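The F1 comparison above reduces to thresholding each method's CTR values at 0.5 and scoring the resulting labels against the reference. A minimal sketch, with placeholder data standing in for the study's measurements:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)   # 1 = cardiomegaly by the reference method
# Placeholder CTR values from the method under test (manual or AI-assisted).
ctr = np.clip(0.46 + 0.08 * y_true + rng.normal(0, 0.04, size=1000), 0.30, 0.90)

y_pred = (ctr >= 0.5).astype(int)        # standard CTR cutoff for cardiomegaly
print(f"F1 at CTR cutoff 0.5: {f1_score(y_true, y_pred):.3f}")
```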
We concluded that the classification performance of the AI-only method was not better than that of the manual method, a finding at odds with a report by Li et al. [11], which found that the sensitivity and negative predictive value of the AI-only method were significantly better than those of the manual method. This may be due to two factors. First, the performance of deep-learning algorithms in automated CTR measurement depends on their ability to locate heart and lung boundaries correctly. The algorithm in Li et al. [11] may have achieved more precise anatomical segmentations, although the authors did not provide precision metrics on an open dataset for comparison with the model we used [10]. Second, the algorithm in Li et al. [11] was trained and tested on the same dataset, whereas the model used in this paper was trained on an open dataset and tested in an out-of-sample fashion. It would be useful to validate their finding by performing the classification test with their model on our dataset.
CTRs measured by the manual and AI-assisted methods were in substantial agreement with the reference method (CVs of 2.0% and 2.2%, respectively). The AI-only method, in contrast, had almost three times higher CVs in all comparisons. This strongly suggests that the AI-only method is not yet suitable for use as an automated method. However, its R² over all data (normal and cardiomegaly groups) and its classification performance at the standard or optimum cutoffs were similar to those of the other methods. This is because R² measures linear association rather than agreement [20, 21], and highly correlated measurements may still agree poorly [21], as in our case. Furthermore, correlation typically depends on the range of the measurements. This is why the R² between the manual and AI-only methods was good over the normal and cardiomegaly groups combined (R² = 0.79; CTR range = 0.35–0.85) but poor in the cardiomegaly group alone (R² = 0.34; CTR range = 0.52–0.85) (Fig. 5a, c). On the other hand, if the Bland–Altman plot and CV show good agreement (Fig. 6b, d), the data are likely to be highly correlated [21], as shown in Fig. 6a, c. Thus, agreement analysis should be employed to evaluate the compatibility of the AI method with the manual method in CTR measurement studies.
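The range dependence of R² is easy to reproduce synthetically. In the sketch below (placeholder numbers, not the study's data), an AI-like measurement carries a constant bias and random error; R² drops when the range is restricted to a cardiomegaly-like subgroup even though the error structure is unchanged, while the Bland–Altman statistics expose the bias directly.

```python
import numpy as np

rng = np.random.default_rng(1)
manual = rng.uniform(0.35, 0.85, size=2000)            # full CTR range
ai = manual + 0.03 + rng.normal(0, 0.06, size=2000)    # biased, noisy AI-only CTR

def r2(x, y):
    return np.corrcoef(x, y)[0, 1] ** 2

print(f"R^2, full range:       {r2(manual, ai):.2f}")
sub = manual >= 0.52                                   # cardiomegaly-like subgroup
print(f"R^2, restricted range: {r2(manual[sub], ai[sub]):.2f}")

# The agreement view, by contrast, exposes the constant bias directly.
diff = ai - manual
print(f"Bland-Altman bias: {diff.mean():.3f}, SD: {diff.std(ddof=1):.3f}")
```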
Classification performance tests may also be misleading because they only describe performance on the normal and cardiomegaly groups, not how well the methods agree. For example, Fig. 3d, f present cases in which the AI-only method gave a false-positive and a false-negative result, respectively. These two cases affect the classification test, but most of the AI results did not (i.e., the AI's CTR values did not change the classification), as shown in Fig. 3e and Table 4. Consequently, we still obtained excellent classification performance at the standard CTR cutoff (e.g., AUC = 0.902). However, if the AI-only method were employed to rule out cardiomegaly (i.e., using the CTR cutoff at maximum sensitivity), it would perform poorly (e.g., an accuracy of 34.8%) and should not replace the manual approach. An agreement test is necessary to evaluate the AI-only method if it is to be implemented as an automated method, and its agreement should be comparable to that of the manual method (CV = 2.1%).
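The rule-out scenario corresponds to choosing the threshold on the ROC curve that still achieves full sensitivity and then scoring accuracy at that threshold. A minimal sketch with placeholder data (the variable names and numbers are ours, not the study's):

```python
import numpy as np
from sklearn.metrics import roc_curve, accuracy_score, roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=1000)  # 1 = cardiomegaly by the reference method
ctr = np.clip(0.46 + 0.08 * y_true + rng.normal(0, 0.05, size=1000), 0.30, 0.90)

print(f"AUC: {roc_auc_score(y_true, ctr):.3f}")

# Rule-out use: the highest CTR cutoff that still classifies every
# cardiomegaly case as positive (sensitivity = 1.0).
fpr, tpr, thresholds = roc_curve(y_true, ctr)
cutoff = thresholds[np.argmax(tpr >= 1.0)]
y_pred = (ctr >= cutoff).astype(int)
print(f"Accuracy at max-sensitivity cutoff {cutoff:.2f}: "
      f"{accuracy_score(y_true, y_pred):.3f}")
```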
We performed observer and method variation tests on a large dataset using only a modified U-Net deep-learning model because we wished to obtain baseline AI performance data. Our results, especially the manual measurements of 7,517 CXRs, will serve as a reference for evaluating other state-of-the-art AI models [22]. Our plan is to test these models on our dataset and accept an AI outcome only if it differs from our manual result by less than ±1.8% (i.e., the excellent category, where the user can accept the outcome without adjustment). Any model with a > 70% acceptance rate will be studied prospectively in a clinical setting and evaluated by our radiologists. Furthermore, at such an acceptance rate, we will perform another retrospective study on our PACS data (around one million CXR images). Such a pioneering study would provide more insight into CTR values and useful information for clinicians.
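The acceptance criterion can be stated as a one-line check. The sketch below assumes the ±1.8% is a relative difference from the manual CTR; whether it is relative or in absolute percentage points is left as an assumption, noted in the code.

```python
import numpy as np

def acceptance_rate(ai_ctr, manual_ctr, tol_pct=1.8):
    """Fraction of cases where the AI CTR falls within +/- tol_pct percent
    of the manual CTR (the 'excellent' category, accepted without adjustment).

    Assumes a relative tolerance; an absolute-percentage-point reading of
    the +/-1.8% criterion would use np.abs(ai - man) < tol_pct / 100 instead.
    """
    ai = np.asarray(ai_ctr, dtype=float)
    man = np.asarray(manual_ctr, dtype=float)
    rel_diff_pct = 100.0 * np.abs(ai - man) / man
    return float(np.mean(rel_diff_pct < tol_pct))

# A model would proceed to prospective clinical evaluation only if
# acceptance_rate(...) exceeds 0.70.
```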
There were some limitations in our dataset and methods. We used only normal and cardiomegaly data; there were no data from other pathologies, such as a pericardial fat pad or pleural effusion. These conditions may limit the DL model's ability to segment the heart and lungs and may lower CTR measurement performance, so such data should be included in future studies to better evaluate the model. Furthermore, we investigated only adult cases; evaluation of AI-based CTR measurement in pediatric cases is also needed. Next, we used only a publicly available dataset; future studies using local datasets are needed to improve the model's performance. Finally, unlike most deep-learning studies of CXR analysis, this study did not address how AI can be trained to match human performance in CTR measurement, but instead assessed the extent to which deep-learning methods can benefit radiologists' practice in a clinical setting. Future studies may focus more on the patterns of errors generated by the algorithms and suggest ways to improve their accuracy.