Study subjects
All procedures performed in this study involving human participants were in accordance with the ethical standards of the institutional and national research committees and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. Informed consent was obtained from all individual participants included in the study. All MR images were collected retrospectively from routine clinical scans at the Children’s Hospital of Zhejiang University School of Medicine between 2009 and 2018. A total of 79 HB patients (ABE/non-ABE = 47/32, male/female = 52/27), who underwent MRI examinations at chronological ages of 1 to 18 days during their hospitalization, were selected. ABE-positive cases met either of the following clinical diagnostic criteria: (1) severe hyperbilirubinemia; or (2) at least one ABE-related clinical symptom with a bilirubin-induced neurologic dysfunction (BIND) score ≥ 1 point, where 1, 2, or 3 points correspond to mild, moderate, or severe symptoms based on the severity of the crying pattern, behavioral and mental status, and muscle tone (9 points in total). Overall BIND scores of 1–3, 4–6, and 7–9 points represented subtle signs of mild ABE, moderate ABE, and advanced ABE, respectively. Finally, 47 ABE and 32 non-ABE (HB) cases were confirmed by 2 experienced pediatric radiologists (S.XX and C.L.).
MRI Acquisition
T1WI was acquired on a 3.0 T Achieva scanner (Philips Healthcare, Best, the Netherlands) using a 2D multislice T1-weighted fast field-echo sequence in the axial direction with the following parameters: echo time of 2.14 ms, repetition time of 200 ms, flip angle of 80°, field-of-view of 330 × 330 mm, in-plane resolution of 0.45 mm, and 18 slices with a thickness of 4.5 mm. All slices were visually examined by the radiologists and confirmed to be of high image quality; none were excluded.
Visual inspection of the MR images
We invited three pediatric radiologists, a senior radiologist with 12 years of experience (C.L.) and 2 fellows (L.Y. and Z.S.) with 7 years of experience, to independently review the MR images. Since there are currently no radiological standards for diagnosing ABE, the raters were first trained by reviewing all the MR images with the true labels (ABE or non-ABE) and the corresponding clinical information, such as age, sex, and TSB. One week after the training, they were asked to review the images again, without the labels or the clinical information, and make a diagnostic decision based on T1WI only. The images were shuffled and re-assigned identification numbers between the training and testing sessions.
Semi-quantitative diagnosis with normalized T1 Intensity
We chose the slice from the T1WI of each HB neonate covering the largest area of the GP and STN for analysis. Regions of interest (ROIs) were manually delineated on the selected slice, including the anterior subcortical white matter (WM), GP, and STN, as shown in Fig. 1a. WM was used as the reference region to normalize the T1-signals in the GP and STN, because there are no known T1-signal changes in WM between ABE and non-ABE patients. The normalized T1 intensities of GP and STN were defined as
$$\mathrm{GP}_{\mathrm{norm}} = \frac{\overline{GP}}{\overline{WM}}$$

(1)

$$\mathrm{STN}_{\mathrm{norm}} = \frac{\overline{STN}}{\overline{WM}}$$

(2)
where \(\overline{GP}\), \(\overline{STN}\), and \(\overline{WM}\) denote the mean T1WI intensities in the GP, STN, and WM ROIs, respectively. We then applied the Youden index [27,28,29] in IBM SPSS Statistics 21 to determine the optimal cut-off thresholds of GPnorm and STNnorm for separating ABE and non-ABE patients.
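The study determined the cut-off in SPSS; as a minimal sketch of the same idea, the Youden index simply picks the threshold maximizing sensitivity + specificity − 1 over the observed values. The intensity values below are illustrative, hypothetical numbers, not study data, and the assumption that ABE cases have higher normalized signal is for illustration only.

```python
def youden_cutoff(values, labels):
    """Return the threshold maximizing Youden's J = sensitivity + specificity - 1.

    values: normalized T1 intensities (e.g., GP_norm); labels: 1 = ABE, 0 = non-ABE.
    Candidate thresholds are the observed values themselves.
    """
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    best_j, best_t = -1.0, None
    for t in sorted(set(values)):
        sens = sum(v >= t for v in pos) / len(pos)  # positives assumed higher signal
        spec = sum(v < t for v in neg) / len(neg)
        j = sens + spec - 1
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

# Illustrative synthetic data (not study values):
gp_norm = [1.30, 1.25, 1.28, 1.32, 1.10, 1.08, 1.12, 1.15]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
cutoff, j = youden_cutoff(gp_norm, labels)  # here 1.25 separates the groups perfectly
```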
Deep learning framework
In this section, we describe the deep learning procedure for the classification task. Two or three contiguous T1WI slices covering the GP were selected from each patient as inputs to the CNN. A total of 190 slices were collected, including 95 randomly selected slices from ABE patients (approximately 2 contiguous slices per patient) and 95 slices from non-ABE patients (approximately 3 contiguous slices per patient). All selected images were normalized to the range 0–1 with a min–max normalization algorithm.
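Min–max normalization rescales each slice linearly so that its minimum maps to 0 and its maximum to 1; a minimal sketch on a toy 2 × 2 "slice" (the values are illustrative, not image data):

```python
def min_max_normalize(img):
    """Scale image intensities to [0, 1] with per-slice min-max normalization."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    return [[(v - lo) / (hi - lo) for v in row] for row in img]

slice_ = [[10.0, 20.0], [30.0, 50.0]]
norm = min_max_normalize(slice_)  # -> [[0.0, 0.25], [0.5, 1.0]]
```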
Our classifier was built on a CNN pre-trained in Matlab 2019a (https://www.mathworks.com), which requires a 3-channel image input of size 224 × 224 × 3 pixels. Consequently, each normalized image was resized to 224 × 224 pixels and then replicated across 3 channels to form a 224 × 224 × 3 input to the network. The dataset was randomly split into training and testing sets with an 80%/20% ratio. Fivefold cross-validation was then performed to estimate the model’s performance.
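The channel replication and fold partitioning can be sketched as follows; this is an illustrative Python re-sketch (the study used Matlab), and the seed and helper names are arbitrary:

```python
import random

def replicate_channels(img):
    """Stack a single-channel slice into 3 identical channels (H, W, 3)."""
    return [[[v, v, v] for v in row] for row in img]

def five_fold_indices(n, seed=0):
    """Shuffle sample indices and partition them into 5 disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

rgb = replicate_channels([[0.5]])      # one pixel -> three identical channels
folds = five_fold_indices(190)         # 190 slices split across 5 folds
```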
We employed ResNet18 [26], a CNN with 18 learnable layers built from residual blocks, where each residual block is defined as:
$$y = F\left(x, \{W_i\}\right) + x$$
(3)
where \(x\) and \(y\) are the input and output, and \(F(x, \{W_i\})\) represents the residual mapping to be learned. ResNet18 applies residual learning to every few stacked layers. A residual block differs from a conventional CNN architecture in the existence of a shortcut connection between the input and output, which serves as an identity projection that alleviates the vanishing-gradient issue in deep networks [26], as shown in Fig. 2a. The mapping function H(x) = F(x) + x is realized as a residual shortcut connection in a feedforward neural network and performs element-wise addition.
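Eq. (3) can be sketched in a few lines: the shortcut adds the block's input back onto the learned mapping, so if F learns the zero mapping the block reduces to an identity. This toy sketch treats F as an arbitrary function on a vector, not an actual convolutional stack:

```python
def residual_block(x, F):
    """Eq. (3): y = F(x, {W_i}) + x -- the shortcut adds the input element-wise."""
    fx = F(x)
    return [a + b for a, b in zip(fx, x)]

# If F is the zero mapping, the block passes its input through unchanged.
y = residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))  # -> [1.0, -2.0, 3.0]
```

This identity behavior is why residual blocks ease optimization: extra layers can default to a no-op rather than having to learn an identity from scratch.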
The ResNet18 architecture is shown in Fig. 2b, containing 18 learnable layers. The convolutional layers use 3 × 3 filters, and downsampling is performed every 4 layers after the input layer by convolutional layers with a stride of 2. Note that the number of filters doubles whenever downsampling takes place. At the end of ResNet18, average pooling is applied, followed by a fully-connected layer and a softmax layer. Residual shortcut connections, denoted as the curves in Fig. 2b, are added throughout the network. The solid curves are used when the input and output have the same dimensions, while the dotted curves are used when the dimension increases; in that case the shortcut performs identity mapping with zero padding for the increased dimension, using a stride of 2.
Since the size of our dataset was limited, we applied a transfer learning approach for our classification model [30]. The weights of ResNet18 were initialized by pre-training on ImageNet [31] and then fine-tuned with our datasets. Data augmentation was also applied to the training datasets to enhance the model’s performance, including image rotation by a random angle in the range of −30° to 30°, vertical flipping with 50% probability, zooming by a random scale within the range of 0.9 to 1.1, and horizontal and vertical translation by random distances in the range of −30 to 30 pixels. The learning rate was initialized to 0.0003, the maximum number of epochs was 6, and a stochastic gradient descent with momentum (SGDM) solver was used with a mini-batch size of 10 images for training.
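The augmentation was configured in Matlab; as a hedged sketch of what each training image sees, the sampler below draws one set of transform parameters from the stated ranges (the function name and dictionary keys are ours, not from the original pipeline):

```python
import random

def sample_augmentation(rng):
    """Draw one set of augmentation parameters matching the ranges used in training."""
    return {
        "rotation_deg": rng.uniform(-30, 30),          # random rotation angle
        "vertical_flip": rng.random() < 0.5,           # flip with 50% probability
        "zoom": rng.uniform(0.9, 1.1),                 # random scale factor
        "translate_px": (rng.uniform(-30, 30),         # horizontal shift
                         rng.uniform(-30, 30)),        # vertical shift
    }

params = sample_augmentation(random.Random(42))
```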
To investigate which brain areas influenced our classification results most, we applied class activation mapping (CAM) to each testing subject. CAM is a technique for obtaining visual interpretations of the regional contributions to the predictions of a CNN [32].
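At its core, CAM weights each feature map of the final convolutional layer by the fully-connected weight of the predicted class and sums them into a spatial heat map. A minimal sketch on tiny hypothetical 2 × 2 feature maps (real maps would come from the network's last convolutional layer):

```python
def class_activation_map(feature_maps, fc_weights):
    """Sum the final-conv feature maps, each weighted by the predicted class's
    fully-connected weight, yielding a spatial map of regional contributions."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wgt in zip(feature_maps, fc_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wgt * fmap[i][j]
    return cam

maps = [[[1.0, 0.0], [0.0, 0.0]],   # hypothetical feature map 1
        [[0.0, 0.0], [0.0, 2.0]]]   # hypothetical feature map 2
cam = class_activation_map(maps, [0.5, 1.0])  # -> [[0.5, 0.0], [0.0, 2.0]]
```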
The model was implemented in Matlab 2019a on a computer with 16 GB RAM, an Intel® Core™ i7-8700 CPU @ 3.20 GHz, and an NVIDIA GeForce GT 730 GPU.
Statistical analysis
Group differences in sex distribution were evaluated using the chi-squared test, while other clinical features were compared using t-tests.
Group differences in GPnorm and STNnorm between the ABE and non-ABE patients were assessed using analysis of covariance (ANCOVA) with age, sex, gestational age at birth, and PMA as covariates.
To evaluate the classification performance of the different methods, several performance metrics were applied in this study, including sensitivity, specificity, precision, F1-score, and the area under the curve (AUC) of the receiver operating characteristic (ROC) curve [33, 34]. The χ2-test was applied to determine significant differences in classification accuracy between methods.
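For reference, the threshold-based metrics follow directly from the 2 × 2 confusion matrix; a minimal sketch with made-up counts (not study results):

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision, and F1 from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)                      # true positive rate
    spec = tn / (tn + fp)                      # true negative rate
    prec = tp / (tp + fp)                      # positive predictive value
    f1 = 2 * prec * sens / (prec + sens)       # harmonic mean of precision/recall
    return {"sensitivity": sens, "specificity": spec, "precision": prec, "f1": f1}

# Hypothetical counts for illustration only:
m = classification_metrics(tp=18, fp=2, tn=17, fn=1)
```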
All statistical analysis was performed using IBM SPSS Statistics 21 (https://www.ibm.com/products/spss-statistics).