Evaluation of the models generated from clinical features and deep learning-based segmentations: Can thoracic CT on admission help us to predict hospitalized COVID-19 patients who will require intensive care?

Background: The aim of the study was to predict the probability of intensive care unit (ICU) care for inpatient COVID-19 cases using clinical and artificial intelligence segmentation-based volumetric and CT-radiomics parameters on admission.
Methods: Twenty-eight clinical/laboratory features, 21 volumetric parameters, and 74 radiomics parameters obtained by deep learning (DL)-based segmentations from CT examinations of 191 severe COVID-19 inpatients admitted between March 2020 and March 2021 were collected. Patients were divided into Group 1 (117 patients discharged from the inpatient service) and Group 2 (74 patients transferred to the ICU), and the differences between the groups were evaluated with the T-test and Mann–Whitney test. The sensitivities and specificities of significantly different parameters were evaluated by ROC analysis. Subsequently, 152 (79.6%) patients were assigned to the training/cross-validation set, and 39 (20.4%) patients were assigned to the test set. Clinical, radiological, and combined logit-fit models were generated by using the Bayesian information criterion from the training set and optimized via tenfold cross-validation. To use all of the clinical, volumetric, and radiomics parameters simultaneously, a random forest model was produced, and this model was trained on a balanced training set created by adding synthetic data to the existing training/cross-validation set. The results of the models in predicting ICU patients were evaluated with the test set.
Results: No parameter individually created a reliable classifier. When the test set was evaluated with the final models, the AUC values were 0.736, 0.708, and 0.794, the specificity values were 79.17%, 79.17%, and 87.50%, the sensitivity values were 66.67%, 60%, and 73.33%, and the F1 values were 0.67, 0.62, and 0.76 for the clinical, radiological, and combined logit-fit models, respectively. The random forest model that was trained with the balanced training/cross-validation set was the most successful model, achieving an AUC of 0.837, specificity of 87.50%, sensitivity of 80%, and F1 value of 0.80 in the test set.
Conclusion: By using a machine learning algorithm that was composed of clinical and DL-segmentation-based radiological parameters and that was trained with a balanced data set, COVID-19 patients who may require intensive care could be successfully predicted.


Background
Severe COVID-19 patients who are admitted to the inpatient ward because they need supplemental oxygen or show evidence of systemic inflammation must be monitored for the development of critical illness, a rapid increase in oxygen needs and/or increasing systemic deterioration [1]. Patients who progress to critical illness must be transferred to the intensive care unit (ICU); depending on the severity of the condition, the patient may also need oxygen delivery through a high-flow device, noninvasive ventilation, invasive mechanical ventilation, or extracorporeal membrane oxygenation [1]. Planning the ICU bed capacity is of primary importance during pandemic surges [2] since limitations in the ICU bed capacity have been reported to have an effect on mortality [3]. Thus, it is important to predict the need for ICU care, especially for patients with a severe clinical condition that requires inpatient treatment [4,5]. Additionally, starting remdesivir in the ward was recommended if disease progression was predicted [1].
Although models for identifying ICU candidate patients have been reported, most of these models are based only on clinical data [6-11]. In a study where the candidate parameters included the presence or absence of chest X-ray findings, this parameter was not included in the final model [12]. Promising results were achieved by combining the clinical data with semiquantitative visual severity scores (VSS) based on the volume, type, and extent of the infiltration measured on chest X-ray and CT [13-15].
Radiomics analysis extracts different quantitative data from medical images with various algorithms, and these data are used in further analyses for decision support [16]. Studies using radiomics models and machine learning methods have shown that these methods can diagnose COVID-19 [17,18] and can determine its prognosis [19]. Although combined models of the clinical and radiomics parameters in RT-PCR-positive cohorts were reported [20], studies evaluating the efficacy of models that include clinical, quantitative volumetric and radiomics parameters for predicting disease progression in hospitalized COVID-19 patients are lacking.
Deep learning (DL) has previously been applied to COVID-19 diagnosis by using chest X-ray and CT data in pretrained or customized models, and the results were successful [21]. DL networks are also used for automated segmentation, and the U-Net architecture has shown high accuracy on CT images of COVID-19 patients [22].
The aim of this study was to generate and compare models that predict the need for ICU care in hospitalized COVID-19 patients using clinical features and volumetric and radiomics data that were calculated from automated segmentations.

Materials and methods
This retrospective, cross-sectional, single-center study was approved by our institution's review board (EK-E1- , and written informed consent was waived. All of the procedures that were performed in this study were in accordance with the 1964 Helsinki Declaration and its later amendments.

Study population
A total of 268 RT-PCR-positive severe COVID-19 patients hospitalized consecutively in our inpatient ward between March 2020 and March 2021 were evaluated (Fig. 1). All of these patients met one or more of the criteria for severe illness [1]: an SpO2 < 94% when breathing room air, a ratio of arterial partial pressure of oxygen to fraction of inspired oxygen (PaO2/FiO2) < 300 mmHg, a respiratory rate > 30 per minute, or infection involving more than 50% of the lung parenchyma. Patients were transferred to the ICU when one or more of the signs of critical illness [1], including acute respiratory distress syndrome, septic shock, and multiorgan failure, had developed.
Patients were included if they were older than 18 years of age, had a thoracic CT scan in our hospital, had not received any steroid or antiviral treatment before the CT study, had no interstitial pulmonary disease, and had not undergone pulmonary surgery; 222 patients met these criteria. From this group, patients with contrast-enhanced CT examinations (n = 7, all performed for suspected embolism), respiratory artifacts (n = 14), massive pleural effusion (over two-thirds of the hemithorax) (n = 4), pneumothorax (n = 1), and cystic lung disease (n = 1) were excluded from the study.

CT protocol
Nonenhanced CT studies were performed with a 128-detector system (GE Revolution, General Electric, Milwaukee, WI) from the first rib to the adrenal glands by using the following parameters: 100 kV, 110 mAs, body filter, a 1.25 mm slice thickness, a 512 × 512 reconstruction matrix, a spiral pitch factor of 1.375:1, the BonePlus convolution kernel, and adaptive statistical iterative reconstruction of 70%.

Deep learning segmentation and radiomics feature calculation
The entire lung parenchyma and pneumonic lesions were segmented by using Quibim's U-Net model, a convolutional neural network architecture with a ResNet-34 backbone that was developed for the 'A European initiative for automated diagnosis and quantitative analysis of COVID-19 on imaging' project.
Slices of the studies were preprocessed as the segmentation model input by applying a constant lung window level (WW = 1600, WL = − 600), normalization in range [0, 1], as well as by using the Balance Contrast Enhancement Technique (BCET preprocessing). Thus, the basic shapes of the image histograms were maintained.
Fig. 1 Flowchart of the study. Clinical, Radiological and Combined models are the final models in cross-validation. LR is logistic regression.
Several metrics were used to evaluate the segmentation model. Dice Similarity Coefficient (DSC) and Intersection over Union (IoU) values were calculated on all of the scans with at least 1000 voxels in the ground truth segmentation. Because DSC and IoU are zero in cases without a ground truth mask, the final test set for these metrics was determined by using a histogram-based threshold of more than 1000 positive voxels. Average false positive and false negative volumes were calculated for all of the scans. In addition, Pearson's correlation coefficients of positive prediction and ground truth were determined.
Two authors (MG, 16 years of experience and EO, last year of residency training) checked whether all ground glass opacities (GGO), consolidation or crazy paving areas in the CT studies were segmented by the DL algorithm. It was noted that DL did not segment GGO that were smaller than 1 mL, and patients who only had such lesions were excluded from the study (n = 4). As a result, the final study population consisted of 191 patients.
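The two overlap metrics named above can be sketched as follows. This is a minimal illustration on flattened binary masks, not the evaluation code used for the segmentation model; the function name `dice_and_iou` and the `min_truth_voxels` parameter are hypothetical, and treating the threshold as "at least 1000" is an assumption, since the text gives both "at least" and "more than".

```python
def dice_and_iou(pred, truth, min_truth_voxels=1000):
    """Return (DSC, IoU) for two flat binary masks.

    Returns None when the ground truth has fewer positive voxels than
    min_truth_voxels, mirroring the exclusion of near-empty masks
    (the exact threshold convention is an assumption).
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    truth_pos = sum(1 for t in truth if t)
    if truth_pos < min_truth_voxels:
        return None
    pred_pos = sum(1 for p in pred if p)
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    union = pred_pos + truth_pos - overlap
    dsc = 2.0 * overlap / (pred_pos + truth_pos)  # Dice Similarity Coefficient
    iou = overlap / union                          # Intersection over Union
    return dsc, iou
```

For any pair of masks the two values are linked by DSC = 2·IoU / (1 + IoU), so they rank segmentations identically and differ only in scale.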
The radiomics features were calculated using Quibim Texture Analysis software (Quibim SL, Valencia, Spain) from the obtained segmentations with the following parameters: (1) Resampled voxel size of 1 × 1 × 1 mm³ by using bicubic interpolation. (2) Fixed bin width of 25 for gray value discretization. (3) Density normalization according to Eq. (1):
$$f(x) = \frac{S\,(x - \mu_x)}{\sigma_x} \tag{1}$$
where f(x) is the normalized voxel density, x is the original density, μ_x is the mean density, σ_x is the standard deviation and S is the scaling factor (set to 500). (4) A voxel array shift of 1024 was added to prevent the negative values from being squared. (5) Second-order matrices were calculated using a distance of 1 voxel and 13 isotropic displacement vectors at angles of 0°, 45°, 90° and 135°.
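Steps (2) to (4) can be illustrated with a short sketch. It assumes the normalization has the scaled z-score form f(x) = S(x − μ)/σ implied by the symbol definitions, uses the population standard deviation, and uses the fixed-bin-width rule floor(v/W) − floor(min/W) + 1 that is common in open-source radiomics toolkits. None of these details is confirmed for the Quibim software, and the function name is hypothetical.

```python
import math
import statistics


def normalize_shift_discretize(densities, scale=500.0, shift=1024.0, bin_width=25.0):
    """Apply scaled z-score normalization, a constant shift, and fixed-width binning.

    Assumptions (not confirmed by the source): population standard deviation,
    and bin index = floor(v / W) - floor(min(v) / W) + 1.
    """
    mu = statistics.fmean(densities)
    sigma = statistics.pstdev(densities)
    if sigma == 0:
        raise ValueError("all densities are identical; cannot normalize")
    # Step (3): normalized density f(x) = S * (x - mu) / sigma
    normalized = [scale * (x - mu) / sigma for x in densities]
    # Step (4): shift so that typical values are non-negative before squaring
    shifted = [v + shift for v in normalized]
    # Step (2): fixed bin width discretization, bins numbered from 1
    lowest_bin = math.floor(min(shifted) / bin_width)
    bins = [math.floor(v / bin_width) - lowest_bin + 1 for v in shifted]
    return normalized, shifted, bins
```

With a fixed bin width, the number of gray levels depends on the range of the normalized densities rather than being set in advance, which is why the scaling factor and the bin width have to be chosen together.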

Statistical analysis
Patients were categorized into Group 1 (patients who recovered with treatment in the inpatient ward) and Group 2 (patients transferred to the ICU for progressive disease from the inpatient ward).
The data obtained from the patients were divided into (1) clinical data consisting of the demographic data of the patients, comorbid disease history, therapeutics given to the patient in the inpatient service, oxygen saturation, complete blood count, biochemical parameters, and acute phase reactants obtained at admission and (2) radiological data consisting of the volumetric data of the whole lung, the inflamed lung parenchyma as segmented by DL and the first- and second-order radiomics parameters calculated from the segmented lesions.
Comparison of the nominal data between the two groups was performed with the chi-squared test or Fisher's exact test. For continuous data, the values of normally distributed parameters were given as the mean ± SD, and the values of nonnormally distributed parameters were provided as the median (IQR). Comparisons of the groups were conducted with the T-test or Mann-Whitney test, accordingly. If a parameter differed significantly between the two groups, the area under the curve (AUC) was calculated with receiver operating characteristic (ROC) analysis, and the cutoff value, optimal sensitivity, and specificity were determined by using the Youden index. Logistic regression was used for the univariate nominal parameters to calculate the sensitivity and specificity.
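The Youden index selects the cutoff that maximizes J = sensitivity + specificity − 1 over the candidate thresholds of the ROC curve. A minimal sketch follows; the function name is hypothetical, and calling a case positive when its score is greater than or equal to the cutoff is an assumed convention.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity, J) maximizing Youden's J.

    labels are 1 (positive) or 0 (negative); a case is called positive
    when score >= cutoff (assumed convention). Ties keep the lowest cutoff.
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sens = tp / pos
        spec = tn / neg
        j = sens + spec - 1.0
        if best is None or j > best[3]:
            best = (cutoff, sens, spec, j)
    return best
```

Because J weights sensitivity and specificity equally, the chosen cutoff does not account for the different clinical costs of missing an ICU candidate versus raising a false alarm.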
After the patient population was randomly divided into a training and cross-validation set (n = 152, 79.6%) and a test set (n = 39, 20.4%), logit-fit models were created by using the Bayesian information criterion (BIC) from the training set. The clinical model was selected from the clinical data, and the radiological model was selected from the radiological data. A combined model from both clinical and radiological data was also constructed. The adequacy of each model's parameters in predicting the categorical outcomes was evaluated with the Hosmer-Lemeshow goodness-of-fit test. Multicollinearity was evaluated by calculating the variance inflation factor (VIF).
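For reference, the BIC of a fitted model is k·ln(n) − 2·ln(L), where k is the number of estimated parameters, n the number of observations and L the maximized likelihood; lower values are preferred. The sketch below only illustrates the penalty term. The log-likelihood values are made up for the example and are not taken from the study.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


N_TRAIN = 152  # size of the training/cross-validation set reported in the text

# Hypothetical fits: the larger model fits slightly better but uses 4 more parameters.
small_model = bic(log_likelihood=-80.0, n_params=5, n_obs=N_TRAIN)
large_model = bic(log_likelihood=-78.0, n_params=9, n_obs=N_TRAIN)
```

In this made-up comparison the smaller model has the lower BIC, because with n = 152 each extra parameter must improve the log-likelihood by about 2.5 to pay for itself.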
The models were optimized by minimizing the cost function (log loss) with the gradient descent algorithm, and the initial theta vectors were replaced with the optimized ones. By using tenfold cross-validation, the mean sensitivity, specificity, and accuracy values of each model were calculated by averaging all of the cross-validation results, and the model-specific cutoff values were calculated via the Youden index of ROC analyses [23]. The test set results were obtained by using the optimized models and model-specific cutoff values. The C-index and 95% CI values of the models were further calculated separately for the training and cross-validation sets via 1000 bootstrap resamples.
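The optimization step can be sketched as plain batch gradient descent on the mean log loss of a logistic model. This is a generic illustration, not the authors' code: the learning rate, iteration count and function names are assumptions, and each row of `X` is expected to start with a constant 1 for the intercept.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def log_loss(theta, X, y, eps=1e-12):
    """Mean negative log-likelihood of labels y under coefficients theta."""
    total = 0.0
    for row, target in zip(X, y):
        p = sigmoid(sum(t * v for t, v in zip(theta, row)))
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return total / len(y)


def gradient_descent(X, y, theta, lr=0.1, n_iter=500):
    """Return coefficients after n_iter batch gradient steps on the log loss."""
    n = len(y)
    for _ in range(n_iter):
        grad = [0.0] * len(theta)
        for row, target in zip(X, y):
            err = sigmoid(sum(t * v for t, v in zip(theta, row))) - target
            for j, v in enumerate(row):
                grad[j] += err * v / n
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta
```

In practice the features would be standardized first, since gradient descent converges slowly when predictors such as platelet count and procalcitonin differ by orders of magnitude.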
To address the class imbalance problem, the Synthetic Minority Oversampling Technique (SMOTE) algorithm from the "smotefamily" package in the R statistical computing environment (R Foundation for Statistical Computing, Vienna, Austria) was used [24]. During the generation of the synthetic data, k = 3 was selected for the K-nearest neighbor algorithm.
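The core idea of SMOTE is to create each synthetic minority case by interpolating between a real minority case and one of its k nearest minority neighbors. The sketch below is a simplified Python illustration of that idea with k = 3; it is not the "smotefamily" implementation used in the study, and the function name, Euclidean distance and seeding are assumptions.

```python
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Generate n_synthetic points by interpolating between minority samples.

    minority is a list of feature vectors (lists of floats). Each synthetic
    point lies on the segment between a randomly chosen sample and one of
    its k nearest minority neighbours (squared Euclidean distance).
    """
    if len(minority) <= k:
        raise ValueError("need more than k minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

Synthetic cases should be generated from the training data only, as described in the text, so that the test set contains real patients exclusively.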
Another model including all of the clinical and radiological parameters in the study was created via the random forest classification algorithm, and this model was trained with the balanced training set containing the synthetic data. The effectiveness of the final random forest model was evaluated with the same test set that was used for the logit-fit models.

Group features, demographics, symptoms and findings
There were 117 patients in Group 1 (61.3%) and 74 in Group 2 (38.7%). The mean age of the patients was 65.45 ± 14.02 (26-96 years), 57.6% of the patients were male, and 42.4% were female. ICU patients were followed in the ward for an average of 3.2 days (1-12 days) prior to transfer to the ICU. Whereas Group 1 patients were discharged after a mean stay of 8.8 ± 4.7 days (2-29 days), Group 2 patients had a mean hospitalization of 19.2 ± 13.8 days (7-41 days, including ICU stay) that resulted in either death (n = 3) or discharge (n = 71).
The mean age of the patients and the number of males were higher in Group 2, and the differences were significant (Table 1). Regarding the symptoms and findings, only fever was significantly different between the two groups. Among the comorbidities, patients diagnosed with chronic renal failure or coronary heart disease required significantly more ICU admissions (Table 1). Patients who needed corticosteroids in the ward were more frequently transferred to the ICU (Table 1).

Laboratory findings
Most of the laboratory findings differed significantly between the two groups (Table 2). However, at the time of admission, it was observed that patients who needed ICU care did not have a lower oxygen saturation value. Among the blood tests, procalcitonin was the most effective univariate classifier (Table 2).

DL segmentation findings and radiomics
The time between the onset of symptoms and the CT examination was 6 (7) days in Group 1 and 5 (5) days in Group 2 (p = 0.775, Mann-Whitney test). The RT-PCR test that yielded the positive result and the CT study were performed on the same day.
The DL algorithm segmented both the whole lung tissue and the pneumonic areas of COVID-19 infection in the patients. In Group 2 patients, both the percentage and volume of pneumonic tissue secondary to COVID-19 were significantly higher (Table 3) than those in Group 1 patients. Additionally, in Group 2, the mean total lung volume was decreased by 11.2% compared to that in Group 1. Eighteen first-order and 58 second-order radiomics parameters were calculated from the segmentations (Table 4). The skewness was higher in Group 1, and the mean density was higher in Group 2 (Table 4).

Predictive logit-fit models
None of the clinical, volumetric or radiomics parameters provided a dependable univariate classifier. Therefore, logit-fit models were created (Table 5).
The clinical model's PP was calculated using Eq. (2), where PLT is the platelet count, GFR is the estimated glomerular filtration rate, AST is aspartate aminotransferase, LDH is lactate dehydrogenase and PCT is procalcitonin. Although the clinical model had good specificity, its sensitivity was limited in the training and validation sets (Table 5).
The radiological model's PP was calculated using Eq. (3), where PIL is the percent of infected lung, RMAD is Robust Mean Absolute Deviation, and LDLGLE is GLDM-LargeDependenceLowGrayLevelEmphasis. This model showed a better sensitivity but a worse specificity than the clinical model (Table 5).
The combined model's PP was calculated using Eq. (4), where PIL is the percent of infected lung, S is the skewness, C is GLCM-Clustershade, GLE is GLSZM-LargeAreaLowGrayLevelEmphasis, NLR is the neutrophil-to-lymphocyte ratio, PLT is the platelet count, GFR is the estimated glomerular filtration rate, and AST is aspartate aminotransferase. This model had the highest AUC and specificity (Table 5).
The calculated p values for the clinical, radiological, and combined models were 0.376, 0.399, and 0.631, respectively, in the Hosmer-Lemeshow test. Radiologic and combined models showed better calibration than the clinical model (Fig. 2). The VIF value was less than 3.0 for all parameters in the models; thus, there was no significant multicollinearity.
The optimal cutoff values were calculated for the clinical, radiological, and combined models as 0.565, 0.444, and 0.429, respectively (Fig. 3).
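How a logit-fit model and its cutoff turn patient data into a prediction can be sketched as follows, assuming that PP denotes the predicted probability from the standard logistic form. The fitted coefficients are not reproduced in this text, so the feature name and coefficient in any call are placeholders; only the two cutoff constants come from the results above. Treating PP greater than or equal to the cutoff as a positive call is also an assumption.

```python
import math

# Model-specific cutoffs reported in the text
CLINICAL_CUTOFF = 0.565
COMBINED_CUTOFF = 0.429


def predicted_probability(intercept, coefficients, features):
    """Logistic model: PP = 1 / (1 + exp(-(b0 + sum(b_i * x_i)))).

    coefficients and features are dicts keyed by (hypothetical) feature name.
    """
    z = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    return 1.0 / (1.0 + math.exp(-z))


def predicts_icu(pp, cutoff):
    """Return True when the predicted probability reaches the model's cutoff."""
    return pp >= cutoff
```

The same probability can lead to different calls under different models: a PP of 0.50 exceeds the combined model's cutoff but not the clinical model's, which is why each model is applied with its own threshold.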

Test set features and test set results of logit-fit models
Fifty-nine patients in the training set and 15 in the test set were transferred to the ICU (p = 0.908, chi-squared test). The median age of the patients was 65 (22.8) years in the training set and 61 (20) years in the test set (p = 0.056, Mann-Whitney test). The training set included 88 males and 64 females, and the test set included 22 males and 17 females (p = 0.867, chi-squared test).
In the test set, the combined model produced the best AUC, followed by the clinical model (Table 5).

Synthetic data generation and random forest algorithm
Although the logit-fit models were well calibrated, they were limited by the study population: the sample size was small, and the data were affected by class imbalance. Because of the small sample size, the BIC method, which penalizes complex models with more parameters [25], was preferred in the model selection for logit-fit models to avoid overfitting [26]. In addition to the logit-fit models, a random forest classification algorithm, a method that is resistant to overfitting, was used to generate a model that uses all of the study parameters.
While 93 (61%) patients in the training set did not require intensive care, 59 (39%) patients were transferred to the ICU. The use of an unbalanced training set, especially for high-dimensional data, was reported as a reason for model bias in favor of the majority class [27]. To solve this problem, the number of instances of the minority class (patients transferred to the ICU) was increased from 59 to 93 by using the SMOTE algorithm. When the random forest model trained on this balanced training set was applied to the test set, specificity (87.5%) did not increase. On the other hand, sensitivity (80%) increased, accompanied by increases in accuracy (84.6%), AUC (0.837), precision (0.80), and F1 score (0.80).
A feature importance study was also conducted for the random forest model (Fig. 4). Overall importance was the highest for PCT, followed by Skewness, LDH, PIL, CK, and GLDM-LargeDependenceLowGrayLevelEmphasis. Most of the parameters that were included in the logit-fit models also had high mean decrease accuracy values in the random forest model. When the ROC curves of the models were evaluated by pairwise comparison [28], no significant difference was found between the RF model and the combined logit-fit model (Fig. 5).
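The reported test-set metrics are mutually consistent, which can be checked from a confusion matrix. The counts below are not given in the text; they are inferred here from the reported rates together with the test set composition (39 patients, 15 of whom were transferred to the ICU), so they should be read as a reconstruction.

```python
def classification_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Inferred counts for the random forest model: 12 of 15 ICU patients and
# 21 of 24 non-ICU patients classified correctly (reconstruction, not reported).
rf_metrics = classification_metrics(tp=12, fn=3, tn=21, fp=3)
```

With only 15 ICU patients in the test set, each additional correctly identified patient changes sensitivity by about 6.7 percentage points, so differences between models of this size rest on one or two patients.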
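Mean decrease accuracy measures how much a model's accuracy drops when the values of one feature are randomly permuted. The sketch below is a simplified, model-agnostic version of that idea; random forest implementations typically compute it per tree on out-of-bag samples, which is not reproduced here, and the function names and the `predict` callable are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the label."""
    return sum(1 for row, t in zip(X, y) if predict(row) == t) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=20, seed=0):
    """Average accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data with only column j permuted
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, permuted, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature that the model never uses gets an importance of exactly zero, while correlated features can share importance, so rankings such as the one in Fig. 4 are best read as relative rather than absolute.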

Power analysis
In the post hoc power analysis for a difference between the means of two independent groups of 117 and 74 patients, with a medium effect size (Cohen's d = 0.5), a two-tailed test, and alpha = 0.05 as input parameters, the calculated power was 0.91.
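The order of magnitude of this power value can be checked with a normal approximation to the two-sample test. This is a rough sketch and not the software used by the authors; a calculation based on the noncentral t distribution would be expected to give a slightly lower value than this approximation. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_power_two_sample(n1, n2, effect_size, z_crit=1.959964):
    """Approximate two-tailed power for a difference in means (normal approximation).

    z_crit is the two-tailed critical value for alpha = 0.05.
    """
    ncp = effect_size * math.sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    return normal_cdf(ncp - z_crit) + normal_cdf(-ncp - z_crit)


power = approx_power_two_sample(117, 74, 0.5)
```

For the group sizes in this study the approximation gives a power of roughly 0.92, in line with the reported 0.91.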

Discussion
COVID-19 surges in the United States over the past two years were assessed by the CDC as three periods [29]: the Winter 2020-2021 period, the Delta period from July to November 2021, and the Omicron period, which started in December 2021 and is ongoing at the time of writing. The maximum 7-day moving average of ICU beds in use for COVID-19 was reported as 27,958 (January 9-16, 2021) in the Winter 2020-2021 period, 24,775 (September 6-13, 2021) in the Delta period, and 24,776 (January 15, 2022) in the Omicron period [29]. It has been reported that the number of patients who required intensive care due to the Omicron variant is one-fourth of that for the Delta variant [30]. The similar ICU bed use in the Omicron and Delta periods is explained by the difference in the number of COVID-19 cases: the maximum 7-day moving average of cases was 164,249 in the Delta period and 798,976 in the Omicron period [29].
Since none of the 28 clinical, 21 volumetric, and 74 radiomics parameters could reliably predict patients who would require ICU admission, clinical, radiological, and combined models were built, and the combined model provided the best predictions.
Models based solely on clinical data have the advantage of being easily accessible and usable. High fever, older age, elevated LDH, increased acute phase reactants and a decreased lymphocyte count were frequently reported in ICU candidates [4-11]. In our study, procalcitonin was distinguished as the parameter with the highest odds ratio value, and patients with chronic renal failure showed a significantly higher need for ICU care, which is consistent with previous publications [12]. In a study involving cases from 100 hospitals in South Korea, the presence or absence of chest X-ray findings did not significantly improve the clinical model outcomes when used as a parameter [12]. On the other hand, it has been reported that a better prediction was obtained when a deep-learning model, which was trained to discriminate critical and noncritical chest X-ray findings, was combined with a clinical model [31].
In prognostic studies that have evaluated CT findings, both increased volumes in pulmonary involvement [32] and an increased ratio of consolidation in pulmonary lesions [33] were associated with an unfavorable prognosis. In our study, the percentage of infected lung parenchyma was included in the radiological and combined models. The mean and median densities of the lesions were significantly higher in ICU patients, suggesting a higher frequency of consolidation. However, this parameter was not included, and skewness was selected for the models by BIC. It has been previously shown that as the GGO areas in the lesions increase, the skewness value of the lesion also increases [18]. We showed that the skewness of the lesions was significantly higher in patients who did not require ICU admission.
(Figure caption, partial: "... and combined (c) models. In the ROC analysis of the cross-validation sets, the optimal cutoff values of the models were determined and marked by using the Youden index.")
VSS, which includes the classification of lesions as GGO or consolidation, has been reported as an effective method in the evaluation of COVID-19 prognosis [34,35]. However, this method is reported to have reliability and reproducibility problems due to issues such as difficulty classifying lesions containing both areas of consolidation and GGO, and radiomics models were found to be more useful for predicting prognosis [36].
The total lung volume was 12% lower in the ICU group. Alveolar collapse is known to occur in patients with SARS-CoV-2 infection [37,38], and surfactant reduction that results from the loss of alveolar type 2 cells, increased inflammatory cell migration to the interstitial space, and microvascular thrombosis may be responsible for this outcome [39]. Although the total lung volume was not directly entered into the models in the parameter [...]
Although segmentation can be performed manually in radiomics modeling studies related to COVID-19 [18,40], this method takes considerable time due to the large number of lesions per patient, and the reproducibility problem needs to be overcome. Methods such as the segmentation of the entire lung (healthy and diseased), rather than individual lesions, have been suggested [36].
Automated segmentation addresses these problems. While the use of AI in CT sections is not recommended as a screening test, its use as a predictive and prognostic decision support system in hospitalized patients has been suggested [41]. In a study examining clinical data and the radiomic features calculated from automated segmentations, a combination model produced the best predictions [19]. In our study, apart from radiomic parameters, the percent of infected lung parameter was included in the models. Thus, the subjective calculation of critical parameters in the VSS method, such as the lesion classification and the ratio of the diseased parenchyma, was avoided, and the models were based on objective criteria. Additionally, volumetric parameters produced by automated segmentations are reportedly more accurate than human semiquantitative estimates [42].
The model that we propose for predicting the risk of ICU transfer in COVID-19 patients has two important features. First, it does not solely use clinical data. Models that are solely based on laboratory parameters did not consider lung parenchyma involvement as a parameter, whereas all of the combined models in our study included more than one radiological parameter, regardless of the machine learning algorithm that was used. Second, the reproducibility problem of VSS methods has been resolved by using the segmentations of a deep learning algorithm that was trained with CT studies from multiple hospitals in affected countries across Europe. Thus, we believe that models based on non-subjective clinical and radiological data that require no parameter calculation effort and that provide reproducible results could be more widely used in the field and can help healthcare providers to make decisions and better organize hospitals' resources.
This study has some limitations. First, these are the results of a single center. However, we used automated segmentation, and CT data were resampled during the radiomics parameter calculation. Second, the patient population was retrospectively selected from patients who had an indication for hospitalization, which could introduce selection bias. Third, the relationship between the antiviral treatment efficacy and the ICU requirements was not evaluated in this study since there is no definitively proven antiviral treatment for COVID-19. Fourth, we used an unbalanced data set; however, we increased the sensitivity by adding synthetic data to the training set. Finally, patients with a contrast-enhanced examination were not included in this study since the radiomics parameters would be affected. We believe that a different model is required for embolism cases.

Conclusion
The model that was created by combining the radiological parameters obtained by automated segmentation and the clinical parameters in COVID-19 patients requiring hospitalization was found to be useful as an objective method in predicting the risk of developing critical illness.