Model development
We previously developed a general pre-training strategy that used radiology reports as weak supervision to pre-train a CNN model and improve its performance on a given task [30]. This work extended the previous method to build an automatic enteral feeding tube positioning assessment network using a small training dataset. We assume two datasets, \(X_P\) and \(X_L\) (\(|X_P| \gg |X_L|\)), exist, where \(X_P\) contains pairs of radiographs and their associated radiology reports and \(X_L\) consists of labeled radiographs for enteral feeding tube positioning assessment. Our proposed network pre-trained the feature extractor of the enteral feeding tube positioning assessment model on \(X_P\) directly, without requiring manually annotated labels. The feature extractor was then fine-tuned on \(X_L\) for enteral feeding tube positioning assessment. Figure 1 shows an example of a radiograph and the corresponding radiology report.
Pre-training feature extractor via radiograph-report matching
We pre-trained the feature extractor of the enteral feeding tube positioning assessment model through a radiograph-report matching network (Fig. 2), which contained a textual report processing branch (Fig. 2a), a radiograph processing branch (Fig. 2b), and a contrastive learning module (Fig. 2c). The two branches worked simultaneously in parallel. The network took a radiology report and radiograph pair as input and predicted whether they were a natural match. Since the label (i.e., match or no match) is known from the data itself, no manual annotation is required. This weakly supervised pre-training approach transfers the rich information in reports to the radiograph feature extractor without requiring manually labeled data.
Specifically, the textual report processing branch (Fig. 2a) took a radiology report as input and (1) passed the report through a pre-trained BERT (Bidirectional Encoder Representations from Transformers) [32] encoder and a \(1\times 1\) convolutional (Conv) layer to convert the natural language in the report to numerical embeddings, i.e., a sequence of numbers that can be processed by computer algorithms, (2) reduced the dimensionality of the embeddings by applying a global average pooling (GAP) operation, and (3) projected the embeddings to a latent feature space by a fully connected (FC) layer. The output of the textual report processing branch was a feature vector that represented the report in the latent space. Meanwhile, the radiograph processing branch (Fig. 2b) took a radiograph as input and passed it through a ResNet-18 [15] feature extractor. The generated feature map was then passed through a Conv layer with \(1\times 1\) kernels to transfer the pre-trained features to task-specific features. After that, an FC layer was used to embed the radiograph feature map in the same latent space as the textual report features. The output of the radiograph processing branch was a feature vector in the latent space that represented the input radiograph. Next, the radiograph-report matching network was trained in a contrastive manner via the contrastive learning module (Fig. 2c): a shallow CNN classifier was added on top of the two branches that took the absolute difference between the two feature vectors as input and predicted whether the two feature vectors belonged to the same example.
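The following PyTorch sketch illustrates the two branches described above, assuming a Hugging Face BERT encoder and a torchvision ResNet-18 backbone; the latent dimension, intermediate channel widths, 3-channel image input, and the spatial pooling before the radiograph FC layer are illustrative assumptions rather than the exact configuration used in the study.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from torchvision.models import resnet18

class ReportBranch(nn.Module):
    """Textual report branch: BERT -> 1x1 Conv -> GAP -> FC (Fig. 2a)."""
    def __init__(self, latent_dim=128):                    # latent_dim is an assumed value
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.conv = nn.Conv1d(768, 256, kernel_size=1)     # 1x1 conv over token embeddings
        self.fc = nn.Linear(256, latent_dim)

    def forward(self, input_ids, attention_mask):
        tokens = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state  # (B, T, 768)
        x = self.conv(tokens.transpose(1, 2))              # (B, 256, T)
        x = x.mean(dim=2)                                  # global average pooling over tokens
        return self.fc(x)                                  # report feature vector in latent space

class RadiographBranch(nn.Module):
    """Radiograph branch: ResNet-18 -> 1x1 Conv -> FC (Fig. 2b); assumes 3-channel input."""
    def __init__(self, latent_dim=128):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
        self.conv = nn.Conv2d(512, 256, kernel_size=1)     # task-specific 1x1 conv
        self.fc = nn.Linear(256, latent_dim)

    def forward(self, image):
        x = self.features(image)                           # (B, 512, H', W')
        x = self.conv(x)                                   # (B, 256, H', W')
        x = x.mean(dim=(2, 3))                             # spatial pooling (an assumption) before FC
        return self.fc(x)                                  # radiograph feature vector
```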
Mathematically, the radiograph-report matching network could be written as:
$$\begin{aligned} h_{\theta _p}(x^{i}) = h_{\theta _{cls}}(|h_{\theta _t}(x_t^{i}) - h_{\theta _r}(x_r^{i})|). \end{aligned}$$
(1)
where \(x^{i}=\{x_t^{i}, x_r^{i}\}\) was a pair of a textual radiology report, \(x_t^{i}\), and a radiograph, \(x_r^{i}\), from \(X_P\). Note that \(x_t^{i}\) and \(x_r^{i}\) may or may not match. The network \(h_{\theta _p}(\cdot )\) predicted the probability of the input pair being a natural match, where \(h_{\theta _{cls}}(\cdot )\) was the contrastive learning module, \(h_{\theta _t}(\cdot )\) was the textual report processing branch, and \(h_{\theta _r}(\cdot )\) was the radiograph processing branch. Binary cross-entropy loss was used to train the radiograph-report matching network.
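Building on the branch sketches above, a minimal sketch of Eq. (1) and its loss could look as follows; the shallow classifier \(h_{\theta _{cls}}\) is sketched here with fully connected layers, whereas the paper describes a shallow CNN whose exact architecture is not specified here.

```python
class MatchingNetwork(nn.Module):
    """Radiograph-report matching network h_{theta_p} of Eq. (1)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.report_branch = ReportBranch(latent_dim)      # h_{theta_t}
        self.image_branch = RadiographBranch(latent_dim)   # h_{theta_r}
        # shallow classifier h_{theta_cls}; depth and widths are assumptions
        self.classifier = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, input_ids, attention_mask, image):
        f_t = self.report_branch(input_ids, attention_mask)
        f_r = self.image_branch(image)
        diff = torch.abs(f_t - f_r)                        # |h_t(x_t) - h_r(x_r)|
        return self.classifier(diff).squeeze(1)            # logit of "natural match"

net = MatchingNetwork()
criterion = nn.BCEWithLogitsLoss()                         # binary cross-entropy on the match logit
```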
The input to the radiograph-report matching network was a radiology report and radiograph pair. A label was naturally assigned to each radiograph-report pair showing whether the report and the radiograph came from the same imaging event. A true pair meant the report described the radiograph naturally; otherwise, it was a false pair.
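One simple way to realize these naturally labeled pairs is sketched below: matched pairs keep each radiograph with its own report, and mismatched pairs are created by rolling the reports within a batch. How the original work sampled false pairs is an assumption here.

```python
import torch

def make_training_pairs(images, report_ids, report_mask):
    """Build matched (label 1) and mismatched (label 0) radiograph-report pairs."""
    b = images.size(0)
    # true pairs: radiograph i with its own report i
    pos_labels = torch.ones(b)
    # false pairs: radiograph i with the report of study (i + 1) mod b
    neg_ids = torch.roll(report_ids, shifts=1, dims=0)
    neg_mask = torch.roll(report_mask, shifts=1, dims=0)
    neg_labels = torch.zeros(b)
    all_images = torch.cat([images, images], dim=0)
    all_ids = torch.cat([report_ids, neg_ids], dim=0)
    all_mask = torch.cat([report_mask, neg_mask], dim=0)
    labels = torch.cat([pos_labels, neg_labels], dim=0)
    return all_images, all_ids, all_mask, labels
```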
CNN for enteral feeding tube positioning assessment
The enteral feeding tube positioning assessment model was trained by fine-tuning the feature extractor of the radiograph processing branch, \(h_{\theta _r}(\cdot )\), of the pre-trained network. The process was straightforward and is illustrated in Fig. 3. The Conv layers in \(h_{\theta _r}(\cdot )\) were used as the feature extractor of the enteral feeding tube positioning assessment model. A Conv layer and two FC layers were added on top of the feature extractor to build the classification model for the enteral feeding tube positioning assessment network, \(h_{\theta }(\cdot )\). The \(h_{\theta }(\cdot )\) took radiographs from \(X_L\) and predicted the probability of the enteral feeding tube position being satisfactory. Since the feature extractor was pre-trained on a larger dataset from the same domain, only the newly added layers of \(h_{\theta }(\cdot )\) needed to be optimized from scratch, which may significantly reduce the number of training instances required.
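A minimal sketch of \(h_{\theta }(\cdot )\), reusing the RadiographBranch sketch above as the pre-trained feature extractor; the added layer widths and the pooling step are illustrative assumptions.

```python
class TubePositionClassifier(nn.Module):
    """h_theta (Fig. 3): pre-trained Conv feature extractor + one Conv layer + two FC layers."""
    def __init__(self, pretrained_branch: RadiographBranch):
        super().__init__()
        self.features = pretrained_branch.features          # pre-trained ResNet-18 Conv layers
        self.conv = nn.Conv2d(512, 256, kernel_size=1)       # added Conv layer
        self.fc1 = nn.Linear(256, 64)                        # hidden size is an assumed value
        self.fc2 = nn.Linear(64, 1)

    def forward(self, image):
        x = self.features(image)
        x = torch.relu(self.conv(x))
        x = x.mean(dim=(2, 3))                               # pool the feature map to a vector
        x = torch.relu(self.fc1(x))
        return self.fc2(x)                                   # logit of "satisfactory position"
```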
Enteral feeding tube positioning dataset
A dataset containing plain radiographs of 175 patients was retrospectively retrieved at a comprehensive tertiary academic medical center. All the images were inspected by a board-certified abdominal radiologist with more than 10 years of experience and a trainee. The dataset included 63 images in which the enteral feeding tube position was unsatisfactory and 112 images in which the position was satisfactory. This retrospective study was approved by the Institutional Review Boards of the University of Kentucky.
The pixel values of the radiographs were converted to the range of 0-255 using a window of 0-2750. The images were resized to \(256\times 256\) and split equally into five folds for fivefold cross-testing. Real-time data augmentation, consisting of a random horizontal flip and a random rotation between 0 and 20 degrees, was applied to the training data.
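The preprocessing and augmentation described above could be implemented roughly as follows with torchvision transforms; the flip probability and the exact windowing implementation are assumptions.

```python
import numpy as np
from PIL import Image
from torchvision import transforms

def window_to_uint8(pixels, low=0.0, high=2750.0):
    """Map raw pixel values into 0-255 using the 0-2750 window described above."""
    scaled = np.clip((np.asarray(pixels, dtype=np.float32) - low) / (high - low), 0.0, 1.0)
    return (scaled * 255.0).astype(np.uint8)

# training-time augmentation: random horizontal flip and a 0-20 degree rotation
train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=(0, 20)),
    transforms.ToTensor(),
])

# example usage on one windowed radiograph array (raw_pixels is a placeholder)
# image = train_transform(Image.fromarray(window_to_uint8(raw_pixels)))
```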
Model evaluation
Compared models
We compared the proposed model with CNN models trained using four different pre-training strategies: (a) a CNN model without pre-training (denoted as No Pre-Train), (b) a CNN model pre-trained on ImageNet [19] (denoted as ImageNet), (c) a CNN model pre-trained using Comparing to Learn [29], a state-of-the-art self-supervised pre-training method for 2D medical images (denoted as C2L), and (d) a truly random CNN model (denoted as Random). All models had the same architecture.
The No Pre-Train model was a typical CNN model trained using the enteral feeding tube dataset only. No pre-training strategy was applied. All the weights of this model were randomly initialized before the training.
The ImageNet model was a CNN model pre-trained on the ImageNet dataset, a natural image dataset containing over one million images from 1000 classes. Such a pre-training method is well accepted and widely used in the medical imaging domain [26,27,28] and was also used in [24], an early study of enteral feeding tube positioning assessment using CNN models. The model was trained on the ImageNet dataset for a classification task and was then fine-tuned using the enteral feeding tube dataset following the same approach as in Sect. 2.1.2.
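For this baseline, the ImageNet weights can be loaded directly from torchvision, for example as below; the torchvision version and weight enum are assumptions (older releases use the `pretrained=True` argument instead).

```python
from torchvision.models import resnet18, ResNet18_Weights

# ImageNet baseline: initialize the ResNet-18 backbone from torchvision's ImageNet weights
backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
```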
The C2L model was pre-trained using Comparing to Learn [29], a self-supervised pre-training method proposed for medical image analysis, on the MIMIC-CXR dataset [31]. The method pre-trained a feature extractor on MIMIC-CXR, which contains 227,835 radiographic studies of 64,588 patients, including 368,948 chest radiographs and the associated radiology reports. The model was then fine-tuned using the enteral feeding tube dataset following the same approach as in Sect. 2.1.2.
The Random model was a CNN model with randomly initialized weights. The model was not trained on any data samples and therefore performed random guessing for any input example.
The proposed model was pre-trained using the method proposed in our previous study [30]. Specifically, the feature extractor of the proposed method was pre-trained on MIMIC-CXR for the radiograph-report matching task. The detailed pre-training setup of the proposed method is described in [30]. After the feature extractor was pre-trained, the network was fine-tuned using the enteral feeding tube dataset following the same approach as in Sect. 2.1.2. No radiology reports were needed for fine-tuning or testing the enteral feeding tube positioning assessment model.
All the compared models were trained for five trials with a fivefold cross-testing strategy. We used three folds for training, one for validation, and one for testing, and repeated this process until all folds had been tested. The validation fold was used to select the best checkpoint of the model; the selected checkpoint was then evaluated on the testing fold. A cyclic learning rate [33] between \(10^{-4}\) and \(10^{-2}\), the Adam optimizer [34], and binary cross-entropy loss were used for training or fine-tuning on the enteral feeding tube dataset. All the models were trained for 100 epochs. We used Python as the programming language and PyTorch [35] as the scientific computing library to conduct the evaluation. For the ImageNet pre-trained model, we loaded the PyTorch pre-trained weights directly into the model. The training was performed on a GPU cluster with a combination of 120 Nvidia P100 and V100 GPU cards; however, only one GPU card was used at a time for training.
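The training configuration stated above could be set up roughly as follows, reusing the TubePositionClassifier and pre-trained branch from the earlier sketches; `train_loader`, the cycle length, and the per-batch scheduler stepping are assumptions.

```python
import torch
from torch import nn, optim

model = TubePositionClassifier(pretrained_branch)           # pretrained_branch from the sketch above
criterion = nn.BCEWithLogitsLoss()                          # binary cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# cyclic learning rate between 1e-4 and 1e-2; step_size_up is an assumed value
scheduler = optim.lr_scheduler.CyclicLR(
    optimizer, base_lr=1e-4, max_lr=1e-2,
    step_size_up=200, cycle_momentum=False,                 # Adam has no momentum buffer to cycle
)

for epoch in range(100):                                    # 100 epochs, as stated above
    for images, labels in train_loader:                     # train_loader is assumed to exist
        optimizer.zero_grad()
        logits = model(images).squeeze(1)
        loss = criterion(logits, labels.float())
        loss.backward()
        optimizer.step()
        scheduler.step()                                    # CyclicLR steps once per batch
```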
Evaluation metrics
Four evaluation metrics were used in this study, namely the AUC, F1 score, accuracy, and the expected calibration error (ECE) [36]. The AUC, F1 score, and accuracy were used to evaluate the models' performance in making accurate predictions. All three metrics are bounded between 0 and 1, with a higher value indicating better performance. The ECE was used to measure the neural network calibration error, i.e., how accurately the network estimates its prediction confidence, with a smaller value indicating a more accurate representation of its prediction confidence. A perfectly calibrated neural network has an ECE of 0.
We defined the accuracy, AUC, and F1 score following common practice. The ECE was defined as in [36, 37] by partitioning predictions into M bins and taking a weighted average of the difference between accuracy and confidence for each bin. More specifically, we first grouped all the samples into M interval bins according to the predicted probability. Then, let \(B_m\) be the set of indices of samples whose predicted confidence falls into the interval \(I_m=(\frac{m-1}{M}, \frac{m}{M}]\), \(m \in \{1, \ldots , M\}\). The ECE can be calculated as:
$$\begin{aligned} \text {ECE} = \sum _{m=1}^{M} \frac{|B_m|}{n} \left| \frac{1}{|B_m|} \sum _{i\in B_m} \mathbb {1}({\hat{y}}^i = y^i) - \frac{1}{|B_m|} \sum _{i\in B_m} {\hat{p}}^i\right| , \end{aligned}$$
(2)
where n was the number of samples, \({\hat{y}}^i\) and \(y^i\) were the predicted and ground-truth labels for sample i, \({\hat{p}}^i\) was the predicted confidence for sample i, \(\frac{1}{|B_m|} \sum _{i\in B_m} \mathbb {1}({\hat{y}}^i = y^i)\) was the accuracy of \(B_m\), and \(\frac{1}{|B_m|} \sum _{i\in B_m} {\hat{p}}^i\) was the average predicted confidence of \(B_m\); \(\mathbb {1}(\cdot )\) denotes the indicator function.
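Eq. (2) can be computed directly from the predicted confidences, predicted labels, and ground-truth labels, as in the sketch below; the number of bins M is an assumed setting.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE as in Eq. (2): a weighted average over M bins of |accuracy - mean confidence|."""
    confidences = np.asarray(confidences)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    n = len(labels)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for m in range(n_bins):
        # bin I_m = ((m-1)/M, m/M]
        in_bin = (confidences > bin_edges[m]) & (confidences <= bin_edges[m + 1])
        if in_bin.sum() == 0:
            continue
        acc = (predictions[in_bin] == labels[in_bin]).mean()    # accuracy of B_m
        conf = confidences[in_bin].mean()                       # average confidence of B_m
        ece += (in_bin.sum() / n) * abs(acc - conf)
    return ece
```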
Model interpretation
Integrated Gradients attribution masks (IG) [38] and occlusion sensitivity testing maps (OCC) [39] were used as visualization methods to understand how the proposed model makes its predictions. IG is an interpretability technique for CNN models that visualizes the features that contribute most to the model's prediction; higher values in an IG attribution mask indicate features that are more important in the decision-making process. OCC is a technique for understanding which parts of an image are most important for a CNN classification; higher values in an OCC map indicate areas of the image that are more important during the CNN classification procedure.
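A minimal sketch of producing both visualizations with the Captum library is shown below; the paper does not state which implementation was used, and the stand-in model, baseline, patch size, and stride are assumptions.

```python
import torch
from torchvision.models import resnet18
from captum.attr import IntegratedGradients, Occlusion

# stand-in classifier; in practice this would be the fine-tuned tube-positioning model
model = resnet18(weights=None)
model.fc = torch.nn.Linear(512, 1)
model.eval()

image = torch.rand(1, 3, 256, 256)                     # placeholder radiograph tensor

# Integrated Gradients attribution mask (IG): integrate gradients from a zero baseline
ig = IntegratedGradients(model)
ig_mask = ig.attribute(image, baselines=torch.zeros_like(image), target=0)

# Occlusion sensitivity map (OCC): slide a patch over the image and record the change
# in the predicted score
occ = Occlusion(model)
occ_map = occ.attribute(image, sliding_window_shapes=(3, 16, 16),
                        strides=(3, 8, 8), target=0)
```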