The efficacy of deep learning models in the diagnosis of endometrial cancer using MRI: A comparison with radiologists

Purpose: To compare the diagnostic performance of deep learning models using convolutional neural networks (CNNs) with that of radiologists in diagnosing endometrial cancer and to verify suitable imaging conditions. Methods: This retrospective study included patients with endometrial cancer or non-cancerous lesions who underwent MRI between 2015 and 2020. In Experiment 1, single and combined image sets of several sequences from 204 patients with cancer and 184 patients with non-cancerous lesions were used to train CNNs. Subsequently, testing was performed using 97 images from 51 patients with cancer and 46 patients with non-cancerous lesions. The test image sets were independently interpreted by three blinded radiologists. Experiment 2 investigated whether the addition of different types of images for training using single image sets improved the diagnostic performance of CNNs. Results: The AUCs of the CNNs for the single and combined image sets were 0.88–0.95 and 0.87–0.93, respectively, indicating diagnostic performance equivalent to or better than that of the radiologists. The AUC of the CNNs trained with the addition of other types of images was 0.88–0.95. Conclusion: CNNs demonstrated high diagnostic performance for the diagnosis of endometrial cancer using MRI. Although there were no significant differences, adding other types of images improved the diagnostic performance for some image sets.


Background
Endometrial cancer is the sixth most common malignant disorder in women worldwide (1). About 417,000 new cases of endometrial cancer were diagnosed worldwide in 2020, and about 97,000 people died from this disease (1). The incidence of endometrial cancer is on the rise (2). Surgery and biopsy are the standards for staging endometrial cancer, and MRI can assist in preoperative evaluation and surgical planning by accurately predicting the depth of invasion into the myometrium, invasion of the cervical stroma and surrounding organs, and the presence of lymph node metastases (3,4). Recently, multiparametric MRI has been introduced to improve diagnosis (5). When biopsy is not possible because of closure of the internal uterine ostium or because the patient has no history of sexual intercourse, MRI is also used to diagnose the presence of endometrial cancer (3). Although MRI has not been formally incorporated into the FIGO staging system, it is already widely accepted as the most reliable imaging technique for diagnosing, staging, treatment planning, and follow-up of endometrial cancer. Moreover, MRI is said to minimize costs by eliminating the need for expensive diagnostic and surgical procedures (3).
In recent years, deep learning methods based on convolutional neural networks (CNNs) have achieved remarkable performance in image pattern recognition (6, 7). Moreover, a wide variety of computer vision tasks have been reported in the literature, including deep learning-based segmentation (8-10), lesion detection (11,12), and classification (13,14). The diagnostic modalities that were investigated include ultrasound, radiography, CT, and MRI. The application of CNNs to tumor images has the potential to support not only assisted image interpretation but also screening, prognosis estimation, and selection of optimal treatment methods, and we believe that tumor detection is the first step. However, to the best of our knowledge, no previous study has developed a CNN for diagnosing the presence of endometrial cancer. In addition, few studies have investigated the optimal image conditions for MR image classification using deep learning with several sequences and cross-sections.
The present study constructed CNNs for diagnosing endometrial cancer using several types of MR images and their combinations in order to identify the optimal imaging conditions for CNNs, and compared their diagnostic performance with that of experienced radiologists. Furthermore, we verified whether the diagnostic performance could be improved by adding sequences and cross-sections other than the type used in the test image set to the training data.

Study design
The current retrospective study was approved by the Institutional Review Board of our institution, and the requirement for written informed consent was waived (approval number: R02-054). The inclusion criteria were as follows: (A) women over 20 years of age, (B) pelvic MRI scan obtained as per the protocol followed at our hospital between January 2015 and May 2020, (C) hysterectomy with pathologically confirmed endometrial cancer (cancer group), and (D) pathologically or clinically definite benign lesions (non-cancer group). The exclusion criteria were as follows: (A) history of treatment for uterine diseases and (B) macroscopically non-mass-forming cancers according to pathological reports. A flowchart of the patient selection process is presented in Figure 1.
Figure 2 shows a flow diagram of the study design. As shown in Figure 2a, Experiment 1 constructed CNNs for diagnosing the presence of endometrial cancer. Single and combined image sets of T2-weighted images (T2WI), apparent diffusion coefficient of water (ADC) maps, and contrast-enhanced T1-weighted images (CE-T1WI) were used to identify the optimal imaging conditions for the CNNs, and their diagnostic performance was compared with that of experienced radiologists. As shown in Figure 2b, Experiment 2 verified whether the diagnostic performance could be improved by adding sequences and cross-sections other than the type used in the test image set to the training data.

MRI acquisition
The MRI scans were performed using 3T or 1.5T equipment (Ingenia®, Achieva®; Philips Medical Systems, Netherlands) with a 32-channel phased-array body coil. The protocol employed to image the entire uterus along the uterine axis included T2WIs, diffusion-weighted images (DWIs) (b-value: 0, 1000), and CE-T1WIs of the equilibrium phase (Table 1). Gadopentetate dimeglumine 5 mmol (Magnevist® 0.5 mol/L or Gadovist® 1.0 mol/L; Bayer, Germany) was used for CE-T1WIs. The gadolinium dose varied according to the patient's weight, as recommended (0.2 mL/kg). The bolus intravenous contrast injection rate was 4 mL (2 mmol)/sec (Gadovist was diluted with saline solution and injected at 4 mL/sec).

Data set
The image slices comprising the endometrium were extracted to create a dataset. In the cancer group, the sequences and pathological findings were considered, and only the image slices depicting the tumor were visualized and extracted, as per the consensus of two radiologists (A.U., T.S.).

The current study compared the diagnostic performance of the CNNs and three board-certified radiologists with 27, 26, and 9 years of experience in pelvic MRI interpretation (T.M., K.M., and T.I.) using five single image sets and four combined image sets. The same types of single or combined image sets were used for training and testing. The radiologists were blinded to the clinical and pathological findings, independently reviewed the 97 randomly ordered test images in each image set, and reported the presence or absence of cancer. The interpretation commenced with single image sets (ADC map first), followed by combined image sets. A time interval of one week was maintained between the interpretation sessions.

Experiment 2: CNN in testing single image sets using different image sets for training
Experiment 2 investigated whether the addition of different types of image sets for training improved the diagnostic performance of CNNs. The CNN was trained using images of the same sequence regardless of the cross-sections, same cross-sectional images regardless of the sequences, and all images regardless of the sequences and cross-sections, in order to test five single image sets; only single image sets were used for training and testing.
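The three ways of enlarging the training data described above can be illustrated with a minimal Python sketch. The list of five single image sets is an assumption inferred from the sets named in the Results (axial and sagittal T2WI, axial ADC map, axial and sagittal CE-T1WI), and the function name `training_pools` is hypothetical; the sketch only shows which image types would enter each training pool, not how the CNNs were actually trained.

```python
# Assumed list of the five single image sets as (sequence, cross-section);
# inferred from the image sets named in the Results, not stated in Methods.
SINGLE_SETS = [
    ("T2WI", "axial"),
    ("T2WI", "sagittal"),
    ("ADC", "axial"),
    ("CE-T1WI", "axial"),
    ("CE-T1WI", "sagittal"),
]


def training_pools(target):
    """Return the image types used for training under each strategy.

    target is the (sequence, cross-section) pair of the test image set.
    """
    sequence, plane = target
    return {
        # Baseline: train on the same type as the test image set.
        "single": [target],
        # Same sequence, regardless of cross-section.
        "same_sequence": [s for s in SINGLE_SETS if s[0] == sequence],
        # Same cross-section, regardless of sequence.
        "same_cross_section": [s for s in SINGLE_SETS if s[1] == plane],
        # All images, regardless of sequence and cross-section.
        "all": list(SINGLE_SETS),
    }


pools = training_pools(("T2WI", "sagittal"))
```

Under this assumed list, the same-sequence pool for the axial ADC map contains only the axial ADC map itself, because no sagittal ADC set is named in the text.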

Deep learning with convolutional neural networks
Deep learning was conducted on Deep Station Entry (UEI, Tokyo, Japan) with a GeForce RTX 2080Ti graphics processing unit (NVIDIA, Calif, USA), a Core i7-8700 central processing unit (Intel, Calif, USA), and the graphical deep learning software Deep Analyzer (GHELIA, Tokyo, Japan). The conditions, optimized on the basis of the ablation and comparative studies of the previous research, were as follows: a CNN with the Xception architecture (16) was used for deep learning, and ImageNet (17), which consists of natural images, was used for pre-training. The optimization parameters were as follows: optimizer algorithm = Adam (learning rate = 0.0001, β1 = 0.9, β2 = 0.999, eps = 1e-7, decay = 0, AMSGrad = false). The batch size was automatically selected. Horizontal flip, rotation (±4.5°), shearing (0.05), and zooming (0.05) were automatically used as the data augmentation techniques. The CNNs were generated by setting the training/validation split ratio to 9:1, 8:2, or 7:3 and the epochs to 50, 100, 200, 500, or 1000, and the diagnostic results of each were validated. The training/validation split ratio and epochs were selected for each image set on the basis of the best performance among the CNNs with sensitivity and specificity above 0.75 (Table 2).
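The selection rule described above (try every combination of split ratio and epoch count, keep only CNNs with sensitivity and specificity above 0.75, then take the best performer) can be sketched in plain Python. This is an illustration under stated assumptions, not the Deep Analyzer workflow: the text does not say which metric defines "best performance", so AUC is assumed here, and the function names and the example numbers are hypothetical.

```python
from itertools import product

# Values quoted from the text.
SPLIT_RATIOS = ["9:1", "8:2", "7:3"]
EPOCHS = [50, 100, 200, 500, 1000]
MIN_SENS_SPEC = 0.75
ADAM = {"learning_rate": 1e-4, "beta_1": 0.9, "beta_2": 0.999,
        "epsilon": 1e-7, "decay": 0.0, "amsgrad": False}


def candidate_settings():
    """All (split ratio, epochs) combinations that were tried."""
    return list(product(SPLIT_RATIOS, EPOCHS))


def select_setting(results, metric="auc"):
    """Pick the best setting among those passing the 0.75 filter.

    results maps (split ratio, epochs) to a dict of measured values.
    Ranking by AUC is an assumption; the source only says "best performance".
    """
    eligible = {
        setting: m for setting, m in results.items()
        if m["sensitivity"] > MIN_SENS_SPEC and m["specificity"] > MIN_SENS_SPEC
    }
    if not eligible:
        return None
    return max(eligible, key=lambda setting: eligible[setting][metric])


# Made-up numbers for illustration only; these are not results from the study.
EXAMPLE_RESULTS = {
    ("9:1", 50): {"sensitivity": 0.70, "specificity": 0.95, "auc": 0.93},
    ("8:2", 200): {"sensitivity": 0.82, "specificity": 0.80, "auc": 0.90},
    ("7:3", 500): {"sensitivity": 0.78, "specificity": 0.85, "auc": 0.88},
}
best = select_setting(EXAMPLE_RESULTS)
```

In the made-up example, the first setting has the highest AUC but is excluded because its sensitivity is below 0.75, so the second setting is selected.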

Experiment 1
The results of Experiment 1 are presented in Table 4 and Figure 3. Table 4 shows the diagnostic performance of the CNNs and radiologists for single and combined image sets, and Figure 3 shows the ROC curves comparing the performance of the CNNs for single and combined image sets with the area under the receiver operating characteristic curve (AUC) of the radiologists. The sensitivity, specificity, accuracy, and AUC of the CNNs using both the single and combined image sets were comparable to those of the three radiologists. The AUC of the CNN was significantly higher than that of all three radiologists for axial ADC map and axial CE-T1WI, higher than that of reader 2 for axial T2WI, and higher than that of reader 1 for combined axial T2WI + ADC map. The present study did not observe any other significant difference between the CNNs and the three radiologists. The CNN showed the highest diagnostic performance with single axial ADC map, with an AUC of 0.95; the graphs of accuracy and loss for the training data of single ADC map are shown in Figure 4. The AUC of the CNNs for combined axial T2WI + ADC map + CE-T1WI was 0.87, which was the lowest among the CNNs' results for all the single and combined image sets.

Experiment 2
The results of Experiment 2 are presented in Table 5 and Figure 7. Table 5 shows the diagnostic performance of the CNNs in the testing using single image sets and the addition of various types of image sets of different sequences and/or cross-sections to the training data. The AUC increased when any type of image set was added for training for sagittal T2WI and sagittal CE-T1WI, and when all T2WI or all image sets were used for training for axial T2WI, although the differences were not significant. Conversely, for axial ADC map and axial CE-T1WI, the addition of any image set for training did not improve the AUC.

Discussion
The CNNs displayed better diagnostic performance in interpreting all five single image sets and significantly better results with single axial ADC map and axial CE-T1WI, compared to the radiologists. Although there were no significant differences, the diagnostic performance was improved by adding other types of image sets to the training data, except for axial ADC map and axial CE-T1WI. The improvement in the interpretation of the combined image sets was not equivalent to that of the radiologists.
Several CNNs using MRI have been constructed for the diagnosis of uterine tumors to date (19,20). As the number of images to be combined increases, the variation in information also increases. Consequently, increasing the number of images used for training may be warranted.
Adding other types of image sets to the training data improved the diagnostic performance, except for the axial ADC map and axial CE-T1WI in Experiment 2. This result is similar to the recent report by Lee et al. that training with all available MRI sequences of the same cross-section improves the diagnostic performance of CNNs in distinguishing between pseudoprogression and true tumor progression (29). The present study observed that the addition of other cross-sections of the same sequence was especially beneficial. The amount of training data for the sagittal sections was smaller than that for the axial sections. Hence, the impact of the improvement may be greater. It is presumed that similar signal information is included in the same sequence even in different cross-sections, and similar morphological information is included in the same cross-section even in different sequences. The potential for improved diagnostic performance by adding different sequences and cross-sections is an important result for deep learning studies of tumor diagnosis, in which it is difficult to obtain a large number of images. To establish the optimum image conditions for deep learning using MR images with various sequences and cross-sections, further verification is needed using different combinations of images in different anatomical regions.
The current study has several limitations. First, only one selected image was evaluated, which differs from the clinical practice of diagnosis using a series of images. It also differs from a clinical setting in that JPEG images, which contain less information than DICOM images, were used. Second, the non-cancer group included lesions that were not pathologically confirmed. However, we considered it important to distinguish cancer from benign lesions that do not warrant treatment. Third, it is controversial whether atypical endometrial hyperplasia should be classified as benign because it is not cancerous or as malignant because it is a precursor lesion. However, it would be unreasonable to exclude only atypical endometrial hyperplasia from this study. Therefore, we classified atypical endometrial hyperplasia as benign because the purpose was to detect endometrial cancer. Fourth, we did not examine dynamic studies, to avoid study complexity. Although a dynamic study is useful to determine the degree of myometrial invasion, contrast between the tumor and the myometrium is greatest during the equilibrium phase (3). This study targeted the presence of cancer, so only images of the equilibrium phase were used as contrast images. The following can be considered future improvements: the superiority of combined images may be demonstrated using more training data. The performance may be improved using three-dimensional images instead of two-dimensional images, as reported by Mehrtash et al., who used three-dimensional prostate images for convolutional neural networks (30). Evaluation with DICOM data and learning with clinical data such as tumor markers may also improve diagnostic performance. Further versatility can be achieved using images obtained with other MRI equipment.

Conclusions
In conclusion, deep learning demonstrated high diagnostic performance in diagnosing the presence of endometrial cancer on MRI. In particular, a deep learning model using convolutional neural networks showed significantly better results with single axial apparent diffusion coefficient of water maps and axial contrast-enhanced T1-weighted images, compared to expert radiologists. Moreover, although there were no significant differences, the addition of other types of images to the training data improved the diagnostic performance for some of the single image sets.

Acc., accuracy.
Figure 5
Three cases of false negatives observed in the single image set of axial ADC: (a) A 55-year-old woman with grade 1 endometrioid carcinoma, in which the CNN was able to diagnose the cancer, but readers 1, 2, and 3 were not (the CNN confidence, cancer = 99.9%). The image shows a tiny tumor filling the uterine cavity (arrow). (b) A 34-year-old woman with grade 1 endometrioid carcinoma, in which all three readers could diagnose cancer, but the CNN could not (the CNN confidence, cancer = 18.8%). The image shows a massive tumor protruding into the myometrium of the posterior wall of the uterus (arrow). (c) A 31-year-old woman with grade 2 endometrioid carcinoma, in which neither the CNN nor the three readers could diagnose the presence of cancer (the CNN confidence, cancer = 22.5%). The image shows the tumor filling the uterine cavity (arrow). The only slight decrease on the ADC map might have made it difficult for radiologists to diagnose the tumor from a single image, without reference to the other images.
Three cases of false negatives observed in the combined image set of axial T2WI + ADC + CE-T1WI: (a) A 56-year-old woman with grade 1 endometrioid carcinoma, in which the CNN was able to detect the cancer, but the three readers were not (the CNN confidence, cancer = 100%). (b) A 30-year-old woman with grade 1 endometrioid carcinoma, in which the three readers could diagnose the presence of cancer, but the CNN could not (the CNN confidence, cancer = 0.5%). The image shows a tumor displaying the typical appearance of endometrial cancer and filling the right side of the uterine cavity (arrow). (c) A 45-year-old woman with grade 1 endometrioid carcinoma, in which neither the CNN nor the three readers could diagnose the presence of cancer (the CNN confidence, cancer = 0.5%). The image shows a massive tumor filling the uterine cavity (arrow) and a hemorrhage at the center of the lesion. Non-uniform signal intensities of the tumor mass may have made the diagnosis difficult for radiologists.

Figures

Figure 1
Flowchart of the patient selection process

Figure 2
Flow diagram of the study design: (a) Experiment 1; (b) Experiment 2

Figure 3
Experiment 1 -

Figure 4
Accuracy

Table 2
The best settings for training/validation split ratio and epochs in Experiments 1 and 2

Japan), a graphical user interface for R (The R Foundation for Statistical Computing, Vienna, Austria), and SPSS software (SPSS Statistics 27.0; IBM, New York, NY, USA). The clinical values for each group were compared using the Mann-Whitney U test and the chi-square test. The test data set was used to evaluate the sensitivity, specificity, and accuracy in cancer diagnosis. The receiver operating characteristic (ROC) analysis was performed to evaluate the diagnostic performance (18). For statistics, 95% confidence intervals (CIs) and significant differences were estimated. P < 0.05 was considered to be significant.
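The evaluation metrics named in the statistical analysis (sensitivity, specificity, accuracy, and the area under the ROC curve) can be written out as a short Python sketch. The study itself used R-based software and SPSS; the functions below are a hypothetical standard-library illustration, the label convention (1 = cancer, 0 = non-cancer) is an assumption, and the AUC is computed through its rank interpretation (the probability that a cancer case receives a higher score than a non-cancer case, with ties counted as one half).

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and accuracy for binary labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


def auc_from_scores(y_true, scores):
    """AUC as the fraction of (cancer, non-cancer) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data for illustration only; not data from the study.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]
sens, spec, acc = confusion_metrics(labels, predictions)
auc = auc_from_scores(labels, scores)
```

Confidence intervals and significance tests between ROC curves, which the study also reports, are not covered by this sketch.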
T2WI, T2-weighted image; ADC, apparent diffusion coefficient; CE, contrast-enhanced.

A total of 485 women (mean age, 52 years; age range, 21-91 years) were evaluated across the datasets. Table 3 shows the characteristics pertaining to the patients, the pathological types, and the number of each image. Although the patients in the cancer group were substantially older than those in the non-cancer group (P < 0.001), the present study did not observe any significant difference between the training and test data with respect to the age of the patients (P = 0.817). In the cancer group, 194

SD, standard deviation; EC, endometrioid carcinoma; T2WI, T2-weighted image; ADC, apparent diffusion coefficient; CE, contrast-enhanced.

Table 4
Experiment 1 - Diagnostic performance of the CNNs and radiologists
Diagnostic performance of the CNNs and radiologists in the test using single and combined image sets.

Table 5
Experiment 2 - Diagnostic performance of the CNNs
Diagnostic performance of the CNNs in the testing using single image sets with the addition of other image sets for training. † vs. the CNN trained with single image set

Urushibara et al. recently developed a CNN that can differentiate between cervical cancer and non-cancerous lesions on T2WI (21). Chen et al. and Dong et al. evaluated the myometrial infiltration of endometrial cancer using CNN and T2WI (22), and T2WI + CE-T1WI (23). As far as we know, this is the first study to diagnose the presence of endometrial cancer and to assess the effects of adding other types of images to the training data and the conditions suitable for the application of deep learning in tumor classification. It is also noteworthy that the entire pelvic images were used, not just cropped images of the uterus. CE-T1WI and DWI are important sequences that allow the functional evaluation of endometrial cancer and are clinically used as an adjunct to T2WI. The degree of tumor enhancement depends on the tumor vascularity; most endometrial cancers are hypovascular, while quite a few are isovascular or hypervascular, compared to the myometrium (24). ADC values are inversely correlated with tumor cellularity (25), and ADC values of endometrial cancer are significantly lower than those of endometrial polyps and normal endometrium (26, 27). Hence, referencing CE-T1WI and ADC map together with T2WI improves the diagnosis of cancer. The present study observed that the CNNs displayed the best performance with single axial ADC map in Experiment 1, which is consistent with a previous study regarding the diagnosis of prostate cancer. The perception of anatomical structures using ADC map alone is challenging for radiologists. In contrast, ADC maps are considered suitable for cancer detection using CNNs, and the ability to achieve high diagnostic performance on ADC maps alone, despite their low spatial resolution, may be one of the CNN's strengths. Contrary to the current results, Aldoj et al. reported that the best diagnostic performance of the CNN was attained by combining ADC map + DWI + perfusion + T2WI (28). That research differs from the present study in that a large number of images (approximately 120,000) were used for training.