A methodical exploration of imaging modalities from dataset to detection through machine learning paradigms in prominent lung disease diagnosis: a review

Kumar, Sunil; Kumar, Harish; Kumar, Gyanendra; Singh, Shailendra Pratap; Bijalwan, Anchit; Diwakar, Manoj

doi:10.1186/s12880-024-01192-w

Research
Open access
Published: 01 February 2024

Sunil Kumar¹,²,
Harish Kumar¹,
Gyanendra Kumar³,
Shailendra Pratap Singh⁴,
Anchit Bijalwan⁵ &
Manoj Diwakar⁶

BMC Medical Imaging volume 24, Article number: 30 (2024)

Abstract

Background

Lung diseases, both infectious and non-infectious, are among the most prevalent causes of mortality worldwide. Medical research has identified pneumonia, lung cancer, and coronavirus disease 2019 (COVID-19) as prominent lung diseases prioritized over others. Imaging modalities, including X-rays, computed tomography (CT) scans, magnetic resonance imaging (MRI), positron emission tomography (PET) scans, and others, are primarily employed in medical assessments because they provide computed data that can be utilized as input datasets for computer-assisted diagnostic systems. Imaging datasets are used to develop and evaluate machine learning (ML) methods to analyze and predict prominent lung diseases.

Objective

This review analyzes ML paradigms, the utilization of imaging modalities, and recent developments for prominent lung diseases. Furthermore, it explores various publicly available datasets that are being used for prominent lung diseases.

Methods

Well-known databases of peer-reviewed academic studies, namely ScienceDirect, arXiv, IEEE Xplore, MDPI, and many more, were used to search for relevant articles. The search procedure applied keywords and keyword combinations reflecting the primary considerations of the review, such as pneumonia, lung cancer, COVID-19, various imaging modalities, ML, convolutional neural networks (CNNs), transfer learning, and ensemble learning.

Results

The findings indicate that X-ray datasets are preferred for detecting pneumonia, while CT scan datasets are predominantly favored for detecting lung cancer. Furthermore, in COVID-19 detection, X-ray datasets are prioritized over CT scan datasets. The analysis reveals that X-rays and CT scans are used more than all other imaging techniques. It has been observed that using CNNs yields a high degree of accuracy and practicability in identifying prominent lung diseases. Transfer learning and ensemble learning are complementary techniques to CNNs that facilitate analysis. Furthermore, accuracy is the most favored metric for assessment.

Introduction

Lung diseases are medically abnormal conditions that impair the functionality of the lungs. Typically, the abnormal status of the lung is accompanied by a few specific signs and symptoms, and some intrinsic malfunction of the lungs drives the progression of the disease. The World Health Organization (WHO) reported the top ten fatal diseases from 2000 to 2019. Notably, several of these are lung-related, including chronic obstructive pulmonary disease (COPD) ranking third, lower respiratory infections ranking fourth, and trachea, bronchus, and lung cancer ranking sixth among causes of mortality [1]. Among the ailments that affect the lower respiratory tract, the most common ones are pneumonia, bronchitis, and influenza [2]. Chronic respiratory diseases (CRDs) are incurable conditions that disrupt the delicate balance of the lungs. They mainly appear as COPD and asthma.

Surprisingly, most deaths related to COPD occur in people under 70 years old. The impact is striking, with COPD claiming about 3 million lives yearly, accounting for 6% of mortality. Asthma is also widespread, affecting children and adults, with around 262 million individuals affected [3]. A further lung disease of pandemic scale is the novel COVID-19, caused by the SARS-CoV-2 virus. As of 2023, the WHO estimates that the virus has infected over 663 million individuals and caused around 7 million fatalities [4]. A considerable number of people die worldwide as a result of lung diseases and their various prominent forms.

Traditional diagnostic procedures focus on manual symptom analysis to diagnose lung illnesses, with clinicians basing subsequent prescription choices on the disease features evaluated [5]. However, the convergence of interdisciplinary fields has led to technology being coupled with manual analysis for computer-aided diagnosis. As a result, the healthcare sector relies on technology such as medical imaging and ML. Medical imaging refers to the techniques and technologies used to produce visual representations of the interior of a body. In recent years, it has been widely applied to healthcare. It plays a significant role in modern medicine and is used in almost every aspect of patient care, such as diagnosis, therapy, and surgery. It helps clinicians identify and pinpoint disease progression more precisely. Numerous imaging modalities have been utilized to detect and analyze lung diseases, including chest X-rays [3], CT scans [6], MRI [7], PET [6], sputum smear microscopy images (SSMI) [8], and molecular imaging [9]. X-rays and CT scans are the most commonly used anatomic imaging modalities for detecting and diagnosing various lung diseases [6].

ML has significantly impacted medical imaging, and there has been substantial progress in applying ML-based detection approaches and algorithms. ML can diagnose lung disorders using images from medical or radiological procedures [10]. ML, a subfield of artificial intelligence (AI), enables computers to learn from data [11]. Consequently, ML offers an automated framework that may be utilized to detect or anticipate lung illnesses at an earlier stage than manual methods [12].

Identifying prominent lung conditions such as Pneumonia, Lung cancer, and COVID-19 using imaging and ML encounters some impediments:

The intricate characteristics of lung structures and the overlapping patterns of diseases might result in misinterpretations.
Various imaging methods may lead to differences in the quality and consistency of data.
The scarcity of labeled datasets impedes the training of accurate models, particularly for rare illnesses.
The progressive characteristics of disorders such as COVID-19 provide difficulty for pre-existing models.

Several solutions can be adopted to address these impediments:

Model generalization may be improved by supplementing datasets with diversified samples and assuring uniform imaging techniques.
Continuous model adaption via real-time data updates is critical, particularly with changing features.
Using ML approaches may improve model interpretability and decision-making. ML systems in lung disease diagnosis benefit from regular validation based on real-world clinical results [10,11,12].

This review analyzes ML approaches for diagnosing lung diseases. The main contributions of the research are:

It investigates and addresses prominent lung diseases such as pneumonia, lung cancer, and COVID-19.
It investigates and addresses the publicly accessible imaging modalities datasets for each prominent lung disease.
It explores and addresses existing challenges and issues in diagnosing prominent lung diseases using ML and its associated novel solutions.
It examines ML and its subfield approaches for identifying prominent lung diseases based on radiographic images and their significance.
It qualitatively assesses ML approaches, emphasizing their efficiency in identifying, classifying, and forecasting prominent lung diseases while outlining essential considerations for enhancing the diagnosis.

The distinguishing feature of this investigation is that it offers a conceptual context for the issues. Furthermore, the analysis emphasizes the techniques and primary methods used in the published findings.

The structure of the review is as follows: Section 2 explains the approach utilized to conduct this review and addresses the necessity of a study in light of recent research. Lung diseases and their classifications, following the most prevalent and well-researched trends, are described, as are the challenges in diagnosing lung diseases, in Section 3. In Section 4, the imaging modalities, both conventional and other types, are described. Section 5 discusses machine learning, its trends, prominent sub-fields, and the initial steps for applying machine learning to diagnosing pulmonary diseases. Section 6 presents the diagnosis of prominent lung diseases using ML and imaging and also comprises publicly accessible datasets for each one, along with extensive analysis and narratives. Section 7 provides observations and discussions. Section 8 concludes the review.

Necessity

Multiple reviews/surveys/studies were examined, contrasted, and presented in Table 1 because of the tremendous relevance of correctly identifying prominent lung diseases using imaging modalities and ML.

Table 1 Comparative analysis of the review with recent research

As far as we know, previous research has yet to provide a combined, comprehensive examination of identifying prominent lung diseases with ML and imaging modality datasets. This research examines and highlights the methodology, procedures, and techniques of ML and imaging modalities, which reduces the time a reader needs to understand them.

Methodology

The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flowchart is depicted in Fig. 1, illustrating the approach taken. Establishing a suitable pre-existing research repository was essential for accessing scholarly research articles.

Scopus and Web of Science were preferred due to their prominence as widely used research databases for academic, peer-reviewed scientific papers. In addition, the well-known databases of academic studies that have been subjected to peer review, namely ScienceDirect [23], arXiv [24], IEEE Xplore [25], and MDPI [26], were also used for the search of articles. Only relevant published articles that are related to the issues are taken into consideration.

Identification

Databases were searched using pertinent keywords to explore all feasible publications on machine learning-assisted lung disease diagnosis. The keywords and keyword combinations applied in the search procedure, reflecting the primary considerations of the review (lung diseases, imaging modalities, and ML), are presented in Table 2.

Table 2 Applied keywords for searching procedure

Studies were limited to articles written in English. Only studies employing ML and its prominent subfields to diagnose lung diseases utilizing specific imaging modalities are included in this review. Studies deemed irrelevant are excluded. In this round, 151 publications from the Scopus database and 92 articles and reports from Google Scholar, websites, and additional databases, including ScienceDirect, MDPI, and IEEE Xplore, were selected.

Screening

The screening process ensured the selection of only relevant research. The review included only substantial titles and abstracts, not requiring a full-text assessment.

We manually eliminated 22 duplicate titles, leaving 221 publications. During screening of these 221 publications, we excluded 40 as irrelevant. All remaining publications proceeded to an entitlement review.

Inclusion

To conduct an entitlement review, we analyzed every research publication that passed screening. We evaluated each piece of research before considering it for assessment. At the end of this round, we found 181 viable studies/resources through manual investigation.
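The counts reported across the Identification, Screening, and Inclusion steps can be cross-checked with a few lines of arithmetic. The sketch below assumes that the figure of 22 refers to duplicate titles that were removed and that the 40 irrelevant publications were excluded from the remaining set; the variable names are illustrative only.

```python
# Counts taken from the Identification, Screening, and Inclusion steps.
scopus_records = 151
other_records = 92        # Google Scholar, websites, and additional databases
duplicates_removed = 22   # assumption: "22" is the number of duplicates removed
excluded_irrelevant = 40

identified = scopus_records + other_records   # records found in the search
screened = identified - duplicates_removed    # records left after de-duplication
included = screened - excluded_irrelevant     # records carried into the review

print(identified, screened, included)  # 243 221 181
```

Under these assumptions the totals agree with the 221 screened publications and the 181 included studies reported in the text.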

Lung diseases

Humans breathe by expanding and contracting their lungs to take in oxygen, which is then circulated through the blood to generate energy for the body, and to expel carbon dioxide [27]. Lung diseases include a variety of ailments that influence lung function. These include obstructive, restrictive, and infectious diseases affecting lung structure and function. Lung diseases can be categorized as depicted in Fig. 2.

Airways-Related Lung Diseases: The lung's windpipe, or trachea, is split into bronchi, branching into smaller tubes that extend throughout the lungs. Some conditions that might affect these airways include asthma, COPD, acute bronchitis, chronic bronchitis, emphysema, and cystic fibrosis.
Air Sacs-Related Lung Diseases: The airways branch into bronchioles, narrow passageways inside the lungs that terminate in clusters of alveoli, also called air sacs. These air sacs make up most of the lung tissue. Pneumonia, tuberculosis (TB), emphysema, pulmonary edema, COVID-19, and lung cancer are among the ailments affecting the air sacs.
Interstitium-Related Lung Diseases: The narrow, tiny membrane between the lung's alveoli is known as the interstitium. The interstitium is filled with tiny blood capillaries that facilitate the exchange of gases between alveoli and blood. A few lung conditions that impact the interstitium are interstitial lung disease (ILD), pneumonia, and pulmonary edema.
Blood-Vessels-Related Lung Diseases: Low-oxygen blood returns to the right side of the heart through veins. The heart then pumps this blood into the lungs through the pulmonary arteries. These blood vessels can also become diseased. Pulmonary embolism and pulmonary hypertension are two lung disorders that impact blood vessels.
Pleura-Related Lung Diseases: The pleura is a thin membrane surrounding the lungs and lining the chest wall. A thin layer of fluid allows the pleura on the lung surface to slide along the chest wall with each breath. Pleural effusion and pneumothorax are pleural lung disorders.
Chest Wall-Related Lung Diseases: The chest wall is essential to the respiratory process. The ribs are connected by muscles, enabling the lungs to expand. The diaphragm descends with each breath, which allows the lungs to enlarge. Neuromuscular disorders, obesity, and hypoventilation disorders are all conditions that disrupt the chest wall [28].

Because these categories cover numerous kinds of lung disease, explaining each one in depth is difficult. Our review focuses on the most debilitating and catastrophic prominent lung diseases.

Prominent lung diseases

As noted in the Introduction, the WHO recently produced research outlining the top 10 diseases responsible for the most fatalities worldwide. Lung illnesses, in all of their many facets, are accountable for the deaths of a disproportionately high number of individuals all over the globe. According to the WHO, lung infections like pneumonia are responsible for an estimated 16% of all deaths of children under the age of 5 worldwide. Pneumonia is also a top reason for hospitalization of children under 5 in the United States [2]. According to the WHO, about 1.8 million fatalities a year may be attributed to lung cancer, making it the leading cause of cancer mortality globally. It is responsible for more deaths than breast, prostate, and colorectal cancers combined. Most lung cancer cases are caused by tobacco use, with tobacco smoke being the primary risk factor for the disease [1]. COVID-19 is a well-known type of lung disease caused by the coronavirus. The WHO is closely monitoring the ongoing outbreak of COVID-19, a pandemic that has affected almost every nation. The WHO reports showed that pneumonia, lung cancer, and COVID-19 are the three conditions that account for most fatalities. As long as COVID-19 persists, further investigation is needed.

The most frequent lung conditions that may be identified using medical imaging are pneumonia, lung cancer, and COVID-19, and these are the prominent lung diseases considered in this research. Each is described in depth below:

Pneumonia

Pneumonia is a leading cause of morbidity and mortality worldwide, surpassing other prevalent illnesses such as cancer, diabetes, HIV/AIDS, malaria, and several others. It is a severe lung condition with serious medical consequences and a high fatality rate in the short and long term. It is a common respiratory illness affecting the airways and alveoli. The development of pneumonia also depends on the response of the patient's immune system to the infecting pathogen. Patients who suffer from pneumonia exhibit pulmonary abnormalities [29]. A diverse array of microbes is capable of causing pneumonia, including bacteria, viruses, and fungi. Pneumonia is caused by viruses such as coronavirus, rhinovirus, influenza, parainfluenza, and metapneumovirus, and by bacteria such as pneumococcus, mycoplasma, legionella, Enterobacteriaceae, Haemophilus, and mycobacteria [30].

Lung cancer

Lung cancer arises from the growth of cancerous cells within lung tissues, exhibiting uncontrolled proliferation that may spread to distant organs or lymph nodes. Lung tumors are divided into three groups from a histopathological perspective: small-cell lung cancer (SCLC), which also includes small-cell carcinoma; non-small-cell lung cancer (NSCLC); and other uncommon forms of tumors, which include sarcoma and lymphoma. Adenocarcinoma, squamous cell carcinoma, and large-cell lung cancer are the three subtypes of NSCLC [31]. Smoking history is an important factor in identifying lung cancer because smoking plays a critical role in the disease [32].

COVID-19

COVID-19 is a contagious lung disease that spreads rapidly between people. COVID-19 symptoms include flu-like illness, cough, and shortness of breath. Less common symptoms include headache, decreased smell (hyposmia), decreased taste sensation (hypogeusia), throat infection, runny nose (rhinorrhea), muscle cramps, diarrhea, and vomiting. The main complications include acute respiratory distress syndrome (ARDS), multiple organ failure, and death [29]. A real-time reverse transcriptase polymerase chain reaction (RT-PCR) test is the principal laboratory method for detecting COVID-19. COVID-19 cases may be classified as follows.

Mild cases

A COVID-19 infection that is either asymptomatic or characterized by mild symptoms such as coughing, fever, and headache.

Moderate cases

Patients experience some shortness of breath as well as pulmonary issues such as hypoxia.

Complex cases

The patient is suffering from hypoxia as well as shock. This category accounts for the great majority of life-threatening cases.

COVID-19 put the entire world in a dire situation, bringing daily life to a halt and claiming millions of lives. As the pandemic showed, healthcare systems can collapse when they are unable to satisfy all the demands placed on them. The COVID-19 pandemic has significantly impacted medical microbiology labs. "Long COVID-19" or "post COVID-19 syndrome" refers to signs that may affect a person's health after recovering from the COVID-19 virus. These symptoms have been reported in many patients who have recovered from the COVID-19 virus [33].

Developmental analysis of prominent lung diseases over the internet

Google is the most widely used search engine, so it is helpful to know how people search online for the most common lung diseases. A popular and publicly available big data analytics tool called "Google Trends" has been extensively utilized to examine perceptions in several studies. Google Trends' tracking of internet search queries may offer some helpful insight. The searches for lung diseases from 2019 to 2023 were analyzed for this study (Fig. 3) [34].

Lung cancer, acute lower respiratory tract infections (which include pneumonia), asthma, COPD, and TB are the five primary lung illnesses addressed by the Forum of International Respiratory Societies. Pneumonia is the top relative search term on Google Trends, according to Barbosa et al., who also noted that there has been an increase in COVID-19 pneumonia cases [35]. Since lung cancer is a fatal disease affecting individuals worldwide, it is commonly searched for online, mainly through research searches. Before 2020, there was a lower volume of COVID-19 searches, but during the pandemic, COVID-19 searches online increased exponentially. Search comparisons are necessary in the context of all lung diseases (Fig. 4).

The Y-axis in Fig. 4 displays the Google Trends search-interest score for each query, which illustrates the term's level of popularity [34].
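Google Trends reports search interest on a relative scale in which the peak of a series is set to 100, rather than as raw query counts. The sketch below illustrates only that final rescaling step, using made-up monthly counts; the function name and data are hypothetical, and the additional normalization Google applies by total search volume is not reproduced here.

```python
def relative_interest(counts):
    """Rescale a series so that its peak equals 100 (simplified illustration)."""
    peak = max(counts)
    if peak == 0:
        return [0 for _ in counts]
    return [round(100 * c / peak) for c in counts]

# Hypothetical raw counts for one search term over four periods.
raw_counts = [120, 300, 1500, 900]
print(relative_interest(raw_counts))  # [8, 20, 100, 60]
```

Because the scale is relative, values from separately rescaled series are not directly comparable unless the terms are queried together.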

Challenges and issues

Many lung disorders are avoidable but may go untreated due to a lack of diagnosis. Lung illness and other diseases, such as cardiovascular disease, sometimes coexist, yet combined diseases are usually misdiagnosed due to the significant overlap in symptoms [36]. When determining the presence of lung illnesses, there are several challenges to surmount. Some of them are as follows:

Selection of Efficient Imaging Modality: Various imaging modalities, including X-ray, CT scan, SSMI, PET, and MRI, have been chosen based on clinical requirements [6,7,8,9]. Medical image analysis requires the selection of an efficient imaging modality for the detection [15, 19].
Scarcity of Useful Datasets: To handle and analyze medical images, an environment that supports access to medical data, data analysis, and processing is required [17]. Various imaging modalities datasets are available for public access [6,7,8,9,10,11,12,13,14, 23,24,25,26].
Effectiveness of Models Derived from ML: The efficacy of models is crucial for identifying lung illnesses. If ML models are used, real-time diagnosis is essential. Thus, research on model training efficiency is necessary [30,31,32,33].
To Address Multiple Pulmonary Disorders Simultaneously: It is expected that the trained ML model would be able to identify multiple lung diseases appropriately, such as COVID-19, pneumonia, and others [19,20,21,22].
Medical Experts' Opinions: Although ML algorithms may be effective in classifying lung illnesses, medical expert evaluations and validations are required to confirm that the identification is correct [28,29,30].

Imaging modalities

Diagnostic imaging is widely acknowledged to have a significant role in clinical evaluation. The processing of diagnostic imaging requires practitioners with extensive expertise. Because different practitioners may assess the same images differently, producing varying findings through a tedious process that can incur significant expense and error, healthcare practitioners may benefit from computer-assisted solutions. Moreover, the manual diagnosis of lung disorders using radiographic scans often takes a substantial amount of time and is prone to error. The prompt and precise identification of lung disorders has a crucial role in enhancing the prognosis, thereby increasing the patient's likelihood of survival. The radiographic findings might be of assistance [37]. When a radiological image of a patient is produced, it is processed in many phases, including image annotation and segmentation. After the images are stored in databases, radiologists annotate them with pertinent information to help readers interpret them. Image segmentation is one of the most critical aspects of image processing. Images are divided around regions of interest (ROIs) to segment them [38].
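The idea of dividing an image around a region of interest can be illustrated with a very simple intensity threshold. This is only a minimal sketch on a toy grayscale grid, assuming the ROI is brighter than its surroundings; it is not the segmentation method of any study reviewed here, and the function names and threshold value are hypothetical.

```python
def threshold_mask(image, threshold):
    """Binary mask: 1 where the pixel intensity exceeds the threshold."""
    return [[1 if px > threshold else 0 for px in row] for row in image]

def bounding_box(mask):
    """Return (top, left, bottom, right) of the masked region, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]

def crop(image, box):
    """Cut the ROI out of the image using an inclusive bounding box."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]

# Toy 4x5 grayscale image with a bright blob standing in for a lesion.
image = [
    [10, 12, 11, 10, 9],
    [11, 80, 95, 12, 10],
    [10, 90, 99, 85, 11],
    [9, 11, 12, 10, 10],
]
mask = threshold_mask(image, threshold=50)
box = bounding_box(mask)
roi = crop(image, box)
print(box)  # (1, 1, 2, 3)
print(roi)  # [[80, 95, 12], [90, 99, 85]]
```

Real lung images usually need more robust approaches, since intensity alone rarely separates a lesion from surrounding tissue.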

For ethical reasons, a patient's clinical and radiological imaging data must be processed while maintaining the subject's privacy. After ethical consent is received, patient data must be obtained, de-identified appropriately, and stored securely. Pseudonymization is the technique of choice for de-identification since it replaces information that may be used to infer the identity of a subject with identifiers. Once images are pseudonymized, this information can no longer be used to determine who a patient is [39].
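Pseudonymization, as described above, can be sketched by replacing direct identifiers with a keyed hash. The example below is a minimal illustration using Python's standard `hmac` and `hashlib` modules; the record field names, the `PSN-` prefix, and the key are hypothetical, and a real workflow would also need to handle identifiers embedded in image headers or pixel data.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored securely, apart from the data.
SECRET_KEY = b"replace-with-a-securely-stored-key"

def pseudonymize(patient_id, key=SECRET_KEY):
    """Map a patient ID to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PSN-" + digest[:16]

def deidentify(record):
    """Drop direct identifiers and attach a pseudonym in their place."""
    direct_identifiers = {"patient_name", "patient_id", "birth_date"}
    clean = {k: v for k, v in record.items() if k not in direct_identifiers}
    clean["pseudonym"] = pseudonymize(record["patient_id"])
    return clean

record = {
    "patient_name": "Jane Doe",
    "patient_id": "12345",
    "birth_date": "1980-01-01",
    "modality": "CT",
}
print(deidentify(record))
```

The same patient ID always maps to the same pseudonym, so images from one subject can still be linked without revealing who the subject is.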

The scarcity of labeled imaging data is commonly cited as a challenge for machine learning in the context of expanding medical imaging datasets. Therefore, various strategies that allow for learning with less or different sorts of supervision are necessary [40]. An overview of each imaging modality is presented below for a better understanding.

Conventional imaging modalities

X-ray

The chest X-ray (CXR) is the diagnostic imaging method used most often in treating lung ailments. The availability, mobility, and cost-effectiveness of chest X-rays make them well suited to the initial evaluation of individuals exhibiting lung problems [3]. Historically, medical X-ray images were captured on photographic film, which had to be developed before being examined. Digital X-rays solve this issue. The most popular medical X-ray examination for diagnosing lung disorders is the digital chest X-ray [41]. The vast majority of the analyzed studies used chest X-rays in their investigations. For instance, X-ray datasets were used for the diagnosis of pneumonia [42,43,44,45,46,47,48,49,50,51,52,53,54,55], lung cancer [44, 46, 47, 52, 56], and COVID-19 [47, 48, 53,54,55, 57,58,59,60]. Figure 5 depicts several chest X-ray examples of diverse lung diseases collected from publicly accessible datasets.

CT scan

In patients with severe lung disorders, a chest CT is frequently recommended. CT imaging is more precise than CXR imaging and is employed when radiography reveals anything unclear [3]. By rotating the X-ray tube around the chest, CT merges several X-ray projections recorded from various angles to generate cross-sectional images of regions within the chest [6]. Chest CT scans were used in many of the studies reviewed for this study. For instance, the diagnosis of pneumonia [63], lung cancer [64,65,66,67,68,69,70,71,72,73], and COVID-19 [57, 59, 60, 74,75,76,77,78] relied on datasets that were acquired from CT scans. Figure 6 depicts several chest CT scan examples of diverse lung diseases collected from distinct publicly accessible datasets.

Positron emission tomography

Nuclear imaging technology, such as PET, enables the monitoring of metabolic activity. It works by injecting the patient with radiolabeled tracers and then tracking their distribution in the body.

The most commonly used PET tracer is 18F-fluorodeoxyglucose (FDG). The lack of recognizable anatomical features is a defining characteristic of the PET imaging technique [6]. Lung disorders and nodules may be effectively evaluated with PET. It has an outstanding capacity for detecting metastases [81].

Figure 7 displays a chest CT scan of a lung nodule compared to a PET image, which provides an improved view. The image was obtained from the Openi website, which provides access to publicly available images.

Magnetic resonance imaging

Comparing MRI to other imaging modalities like CT and PET, it becomes evident that MRI has little clinical use for patients with lung illnesses. MRI generates images of the chosen region and presents them as narrow slices that together make up the entire volume of the area. It works because nuclei absorb radio-frequency energy when powerful magnetic fields are present. MRI employs a magnetic field and radio waves to obtain numerous images of the lung region from various angles. Combining these images may generate crisp and accurate portrayals of areas [81]. Lung MRI is an excellent technique for sequential follow-ups [7]. MRI procedures like three-dimensional gradient sequences and acceleration techniques, among others, have improved MRI's ability to detect small lesions [83]. Also, research has shown that MRI might be a better way to screen for lung cancer than low-dose CT [84].

Figure 8 displays the chest radiograph of a lung nodule compared to an MRI image. The image was obtained from the Openi website, which provides access to publicly available images.

Sputum smear microscopy images

Sputum is a viscous fluid produced in the lungs and air passages, and it is a crucial factor in the progression of certain lung disorders. Sputum smear microscopy has generally been considered the most effective approach for diagnosing lung diseases like TB. Specimens of sputum expectorated by patients with symptoms are smeared and chemically stained on plain glass microscope slides [8]. Then, they are analyzed by laboratory procedures that identify acid-fast bacteria (AFB), like Mycobacterium tuberculosis cells [86]. The images from a sputum smear test are often obtained via fluorescence microscopy or conventional microscopy. SSMI are captured using a digital microscope and a digital camera. The captured images have a specific size and resolution depending on the magnification. The "pixel pitch," which refers to the physical size of each image pixel, is measured in micrometers [87]. Figure 9 displays SSMI examples. The image was obtained from the open-access dataset [88], which provides access to publicly available images.
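The relationship between pixel pitch, magnification, and the physical area covered by a microscopy image can be shown with a short calculation. The sketch below assumes a simple optical path in which the size a pixel represents on the specimen equals the camera sensor's pixel pitch divided by the total magnification; the sensor pitch, magnification, and image dimensions are hypothetical values, not taken from any dataset discussed here.

```python
def specimen_pixel_size_um(sensor_pixel_pitch_um, magnification):
    """Size that one image pixel represents on the specimen, in micrometers."""
    return sensor_pixel_pitch_um / magnification

def field_of_view_um(width_px, height_px, pixel_size_um):
    """Physical width and height covered by the image, in micrometers."""
    return width_px * pixel_size_um, height_px * pixel_size_um

# Hypothetical setup: 5.0 um sensor pitch, 100x total magnification, 1000x800 image.
pixel_size = specimen_pixel_size_um(sensor_pixel_pitch_um=5.0, magnification=100)
fov_width, fov_height = field_of_view_um(1000, 800, pixel_size)
print(f"{pixel_size:.3f} um per pixel")
print(f"field of view: {fov_width:.1f} um x {fov_height:.1f} um")
```

Knowing the physical pixel size matters when comparing images taken at different magnifications, because the same bacillus spans a different number of pixels in each.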

Molecular imaging

Molecular imaging methods not previously used are also being studied to learn more about lung diseases. It is a specific type of imaging technique that combines the two fields of molecular biology and medical imaging. Recent research has been conducted on several methods of molecular imaging that have the potential to differentiate between the cellular and molecular components of respiratory illnesses. Alternative imaging techniques like single photon emission computed tomography (SPECT) can offer pertinent data at the molecular level because of their remarkable sensitivity and resolution. When it comes to the exactness of a lung diagnosis, the stage of the disease, or monitoring after treatment, molecular imaging may be a great addition to traditional imaging methods [9].

At-bedside imaging modalities

Evolving methods can assess, monitor, or measure lung disorders at the bedside. Bedside methods, including lung ultrasonography (LUS) and electrical impedance tomography (EIT), are gaining prominence alongside conventional imaging modalities. Since they do not require ionizing radiation and are relatively simple to use, these approaches are being intensively explored as an addition to traditional procedures and, in the case of specific lung problems, as a substitute for them [89].

The preceding subsections give an overview of the numerous imaging modalities. Each has characteristics that set it apart from the others. Every imaging modality collects its own specific set of images, enabling radiologists to identify a variety of lung illnesses more accurately.

Machine learning

ML is a crucial component that can add resiliency to the medical decision-assistance systems. To better understand ML-based lung disease diagnosis, we provide a new analysis viewpoint on the different machine-learning strategies. The strategies for ML include supervised, unsupervised, and semi-supervised learning. Each method has benefits and drawbacks, and the selection of ML methodology hinges on the nature of the need [90] and the virtues and limitations listed in Table 3.

Table 3 Virtues and limitations of the various ML strategies

In supervised learning, the ML model is given input–output pairs, that is, labeled data [91], whereas in unsupervised learning, the model only has the input data without any labels. Unsupervised learning examines the data without feedback mechanisms. This strategy extracts features to cluster input data into groups, and it can find unusual patterns in the input data [93]. On the other hand, semi-supervised learning can work with both labeled and unlabeled data [11]. Because it uses both, this strategy can operate on massive amounts of data even when labeled data are limited.

The general assumption is that performance measures acquired from labeled data will perform better than those obtained from unlabeled data. This assumption, however, is only sometimes accurate since the researchers demonstrated that unlabeled data may also provide remarkable performance measures [94].

Machine learning developmental analysis on the internet

Since the turn of the decade, people worldwide have searched the internet using the term "machine learning." The Y-axis in Fig. 10 displays the Google Trends search-interest values for the term from 2012 to 2023, which illustrate its level of popularity [95]. Such statistics motivate research on machine learning for the detection of lung diseases. The popularity of ML is growing rapidly.

Introductory steps for employing machine learning to diagnose lung diseases

ML has the potential to diagnose and prognosticate lung illnesses. To make a diagnosis using imaging modalities, ML executes a series of actions, including acquiring an image dataset, preprocessing the image data contained within the dataset, performing feature extraction and selection, training an ML model using specific ML algorithms, and evaluating performance metrics and classification [96]. The lung disease diagnostic process using ML is shown in Fig. 11.

The above-described introductory steps for employing ML to diagnose lung diseases act as the training phase of the ML model, which develops an ML diagnostic model. However, this ML diagnostic model must be validated using new or test data that the model has never seen before. Machine learning advances the lung disease diagnostic pathway. The fundamental framework of an ML-based diagnostic model is shown in Fig. 12, in which the model is trained using a training dataset and evaluated using new test data.

Many imaging modalities make it possible to record data about a patient's lungs from various angles and viewpoints, which may then be annotated and stored for later use [97].

Collecting these images produces an image dataset that can be preprocessed and employed as an input for the ML to operate on [98]. The necessary features must be extracted and selected, manually or automatically, from the preprocessed image dataset to train the model using a particular machine learning algorithm [99]. The trained model can then be used for prediction or classification [100]. This is the conventional approach to ML for diagnosing lung diseases using imaging modalities.

Publicly accessible datasets

In the modern world, data is highly valuable. One study of digital health records found that around 25 million images were subject to cyberattacks [101]. The European Union (EU), for example, has enacted special regulations for data protection. The General Data Protection Regulation (GDPR) is legislation that updates and unifies data privacy rules across the EU and its associated businesses. Because of the GDPR, hospitals and other healthcare organizations in the EU are restricted in sharing data [102]. Data sharing for research and other specific purposes is therefore limited, which encourages the use of private or commercial data.

In contrast to private or commercially supplied datasets, which are not openly accessible to the research community, publicly available datasets are preferable since they are accessible to all researchers and can be used for their studies. The imaging modality appropriate for the particular lung disease must be ascertained first. Certain lung disorders are diagnosed using imaging techniques such as X-rays, CT scans, SSMI, PET scans, MRIs, and others as specified earlier [103]. A dataset must be compiled based on specific images, which may be either public or private. A researcher may collect or create private datasets depending on the research demands. However, a researcher or organization may also provide publicly available datasets if they wish to make their findings public. Researchers developing ML models must access such a vast dataset of these modalities [104].

Preprocessing

Preprocessing the dataset is essential after choosing a particular image dataset. An image dataset's description, visualization, and other attributes can all be used for analysis. Exploring the dataset is necessary to collect relevant image data for the ML model of lung illness, because the ML model depends heavily on image quality for training. Dealing with real-world imaging data requires a more in-depth examination of the data collection process. Several images in the obtained dataset may be problematic, with incomplete annotations, anomalies, or nonsensical image data. It is challenging to clean and preprocess image data received from databases correctly. Hence, adapting or implementing appropriate preprocessing techniques is necessary [105].

Image enhancement and optimization may be done using ML-based image processing [106]. AI-based approaches to image processing can reduce the time needed for the process while improving image processing techniques. When preprocessing an image, it can be converted to grayscale and cleaned up with Gaussian blur, median filters, morphological smoothing, and numerous other methods [107]. Contrast Limited Adaptive Histogram Equalization (CLAHE) is one of the well-known techniques that can be employed to improve the image's contrast [108]. Image processing techniques like lung segmentation, which necessitates the exclusion of bone, might be used to locate the region of interest, after which lung disease detection could be carried out in that region [109].
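The preprocessing steps named above can be sketched in a few lines. The following is only a minimal illustration in plain Python, not the pipeline of any cited study: it converts an RGB image to grayscale, applies a 3 × 3 median filter, and stretches contrast with global histogram equalization. Global equalization is used here as a simpler stand-in for CLAHE, which instead equalizes local tiles under a clip limit. The function names and the 8-bit nested-list image representation are assumptions made for illustration.

```python
from statistics import median

def to_grayscale(rgb):
    """Convert rows of (R, G, B) tuples to 8-bit luminance (ITU-R BT.601 weights)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for r, g, b in row]
            for row in rgb]

def median_filter3(img):
    """3x3 median filter; border pixels are left unchanged for simplicity."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out

def equalize_histogram(img, levels=256):
    """Global histogram equalization (a simplified stand-in for CLAHE)."""
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in img]
    scale = (levels - 1) / (total - cdf_min)
    return [[int(round((cdf[v] - cdf_min) * scale)) for v in row] for row in img]
```

In practice, a library implementation of CLAHE and of the smoothing filters would normally be preferred; the sketch only makes the order of operations concrete.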

Feature extraction and relevant feature selection

Feature engineering consists of two segments: the first extracts features from an existing image dataset, and the second selects among the extracted features. Certain extracted features may be valuable, while others are not, so selection ultimately identifies the relevant ones, which ML algorithms or classifiers then process. Methods like Gabor, Zernike, Haralick, and Tamura were used to extract features [110]. Features may be selected using techniques like the gray level co-occurrence matrices (GLCM), local binary pattern (LBP), and CNN. Bio-inspired algorithms such as the improvised crow search algorithm (ICSA), the improvised grey wolf algorithm (IGWA), and the improvised cuttlefish algorithm (ICFA) are examples of feature selection algorithms that can narrow down a large number of acquired features to only the most desirable ones. Genetic algorithms can also choose diagnostic imaging features [111].
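To make the GLCM named above concrete, the following minimal sketch computes a normalized gray-level co-occurrence matrix for a single pixel offset and derives two common texture statistics from it (contrast and energy). It is an illustration under stated assumptions, not the procedure of the cited works: the image is a nested list of integer gray levels in the range 0 to `levels - 1`, it contains at least one valid pixel pair for the chosen offset, and the function names are hypothetical.

```python
def glcm(img, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[img[y][x]][img[ny][nx]] += 1
    total = sum(map(sum, counts))  # assumed > 0 (at least one pixel pair)
    return [[c / total for c in row] for row in counts]

def contrast(p):
    """Sum of p(i, j) * (i - j)^2: large when neighbouring gray levels differ."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

def energy(p):
    """Sum of squared matrix entries: large for uniform textures."""
    return sum(v * v for row in p for v in row)
```

Statistics of this kind form a feature vector per image, and a selection step can then keep only the most informative entries.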

Training of the machine learning model

ML model training is the primary process of the ML pathway, providing an effective model for assessment, verification, and deployment. Once trained on the relevant available data, the ML model can be used to analyze newly collected data and provide predictions [10].

Following the partitioning of the image database, one segment is set aside for the training phase of the ML model and another for the testing phase. The test data consists of novel data that will be employed to assess the effectiveness of the ML model. Understanding the significance of training makes it possible to collect training data of appropriate volume and quality and then to choose the optimal algorithm based on the availability and suitability of the training data set [112].
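The partitioning described above can be sketched as a simple shuffled hold-out split. This is a minimal illustration; the 80/20 ratio, the fixed seed, and the function name are assumptions rather than values taken from the reviewed studies.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out a fraction as unseen test data."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

For medical images, splitting is commonly done at the patient level so that images of the same patient do not appear on both sides of the split; the sketch would then be applied to patient identifiers instead of individual images.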

Machine learning and its algorithms

The ML algorithm enables the ML model to perceive the input data in a particular manner. Training is the process through which ML algorithms allow ML models to extract meaningful information from learning data. It might take time to find an algorithm that works well and is set up to meet the needs of the intended use in a particular domain. Distinct learning algorithms have different objectives, and their results may vary based on data features. So, it is essential to know about machine learning algorithms and how they work in the real world, such as in medicine and other fields [113].

There are many different kinds of ML algorithms. Some are based on regression, decision trees, the Bayesian method, the kernel method, the clustering method, the ensemble method, and artificial neural networks (ANNs) [105].

Regression is a common technique for reducing model-based uncertainty by iteratively adjusting the model in response to the errors it produces. Some types are linear, logistic, stepwise, and multivariate adaptive regression splines (MARS).
Decision tree algorithms predict the target variable from the input variables through a tree of decision rules. Some examples are random forest and classification and regression tree (CART).
Bayesian algorithms apply Bayes' theorem and make it easier to use subjective probability in model development. The significant algorithms used for classification and regression problems are Naïve Bayes and Bayesian Belief Network.
Pattern analysis is the basis of the kernel approach, which incorporates a wide range of mapping methods. Support vector machines (SVM) and linear discriminant analysis (LDA) are essential kernel approaches in ML modeling.
Clustering, which groups data points according to their similarities, is the most widely used unsupervised learning approach. K-Means, partitioning-based, hierarchical, and density-based clustering are just a few examples of clustering techniques.
Ensemble methods are strategies that work on several models and unite them to obtain more accurate outcomes. Compared to relying on a single model, the results of ensemble techniques are often more reliable. Bagging, boosting, AdaBoost, gradient boosting machine, and random forest are prominent ensemble techniques.
ANNs are computer simulations based on biological principles that are used for various purposes, including clustering and classification. Examples include the perceptron, the Hopfield network, and networks trained with backpropagation.
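As one concrete instance of the families listed above, the sketch below fits a logistic regression classifier by batch gradient descent, which shows the idea of iteratively adjusting a model in response to its errors. It is a minimal illustration only; the learning rate, epoch count, and function names are assumptions, and real studies would typically rely on an established library.

```python
import math

def train_logistic_regression(xs, ys, lr=0.5, epochs=2000):
    """Fit weights w and bias b on feature vectors xs with 0/1 labels ys."""
    n_features = len(xs[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the mean gradient of the log-loss.
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(xs)
    return w, b

def predict(w, b, x):
    """Return class 1 when the linear score is non-negative, else class 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0
```

For example, trained on the one-dimensional points 0, 1, 3, and 4 with labels 0, 0, 1, and 1, the model places its decision boundary between the two groups.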

Performance metrics

Building an ML model is not sufficient; the built model must also be evaluated to ensure its reliability and forecasting ability. Performance metrics are a set of statistics used to assess an ML model's overall efficacy and efficiency. These metrics can be quantitative or qualitative, and they can evaluate many aspects of performance. Typically, they are used to track improvement and progression over time [114]. The majority of researchers, while conducting their studies, make use of a range of vital metrics, some of which are as follows:

Accuracy: The accuracy of an ML model is measured as the proportion of correctly classified samples to the total samples. It is the most common metric used to measure the performance of an ML model. It can be expressed as (Eq. 1):
$$\mathrm{Accuracy}=\frac{\text{Correctly classified samples}}{\text{Total samples}}$$

The correctly classified samples can be expressed as follows:

$$\text{Correctly classified samples}=\text{True Positive }(TP)+\text{True Negative }(TN)$$

The total samples can be expressed as follows:

$$\text{Total samples}=TP+\text{False Positive }(FP)+TN+\text{False Negative }(FN)$$

Sensitivity: This metric measures how many relevant samples an ML model can identify by calculating the proportion of true positives to all actual positives, and it is presented through Eq. 2. It is often called the "true positive rate" or "recall."
Precision: This metric measures how accurate a model's positive predictions are by calculating the ratio of true positives to all positive predictions made by the model. It is often referred to as "positive predictive value" and is presented through Eq. 3.
Specificity: It measures how well a model can correctly identify negative samples. It is the ratio of correctly identified true negatives to all actual negatives and is presented through Eq. 4. An ML model with high specificity has a low false-positive rate, meaning it will rarely classify negative examples as positive.
F1 Score: This score combines precision and recall, as their harmonic mean, into an overall score for model evaluation. The F1 score is presented in Eq. 5.
AUC: AUC stands for Area Under the Receiver Operating Characteristic (ROC) Curve. The ROC curve plots the true positive rate against the false positive rate at varied thresholds, and the area under it is used to evaluate a model's ability. The AUC represents the degree of discrimination between classes [115]. Some of the performance metrics are presented in Table 4.
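The confusion-matrix metrics described above follow directly from the four counts TP, FP, TN, and FN. The sketch below uses the standard textbook definitions (sensitivity = TP / (TP + FN), precision = TP / (TP + FP), specificity = TN / (TN + FP), F1 = 2 · precision · sensitivity / (precision + sensitivity)). The function name and the choice to return 0.0 for an undefined ratio are assumptions made for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics; undefined ratios are returned as 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)   # recall / true positive rate
    precision = ratio(tp, tp + fp)     # positive predictive value
    specificity = ratio(tn, tn + fp)   # true negative rate
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }
```

With 40 true positives, 10 false positives, 45 true negatives, and 5 false negatives, for instance, the accuracy is 0.85 and the precision is 0.80, while the sensitivity is 40/45.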

Classification of lung diseases

Classification identifies, comprehends, and groups objects and concepts into predetermined categories. The act of classifying something is pattern recognition. Classification is a specific type of pattern recognition that predicts a class label for a given sample.

Table 4 Performance metrics


Classification approximates a mapping function from input variables to an output variable, which is a target, label, or class. "Binary classification" describes classification tasks with just two possible class labels. Classification problems with more than two categories are called "multiclass classification." Some of the algorithms developed for binary classification can also address multiclass problems [105].

ML sub-fields

Numerous prominent sub-fields of ML, including deep learning (DL), CNN, ensemble techniques, and transfer learning, may be utilized to diagnose lung diseases. Many more sub-fields of ML can also be employed. The focus here is on elaborating on a few particularly notable sub-fields.

Deep learning

A popular and rapidly developing area of ML is DL. Learning from massive datasets is the focus of DL, a subfield of ML that employs neural networks. DL enables the creation of diagnostic models by performing all the processing steps typically associated with the construction of standard ML models, such as feature extraction and selection, in an automated manner. The word "deep" signifies that the neural network comprises many hidden layers. Each processing layer of a deep neural network contains its own set of neurons. The first layer in a network is known as the input layer, the final layer is known as the output layer, and the layers in between are known as the hidden layers [116]. DL has been influential in diagnostic imaging for feature engineering and image classification [117] and can resolve data-related problems with minimal supervision. It has consequently prompted researchers to investigate DL approaches at deeper levels. DL algorithms do exceptionally well compared to conventional differential diagnosis screening processes that rely solely on radiologists [118].

Consequently, DL offers novel models for classification tasks and medical image diagnostics [119], which achieve excellent results. In particular, DL approaches are anticipated to aid physicians in the examination and diagnosis processes [120]. DL leverages ANN to examine raw data directly. Multilayer perceptrons (MLP) are also among the most prevalent deep learning algorithms.

The three primary groups of DL approaches are supervised, unsupervised, and semi-supervised. Supervised learning approaches include CNN, deep neural networks (DNN), and recurrent neural networks (RNN). In unsupervised learning, DL excels at non-linear dimensionality reduction and clustering problems; this group comprises restricted Boltzmann machines, auto-encoders, and generative adversarial networks (GANs). Semi-supervised deep learning also includes GANs. In addition, RNNs, which include GRU and LSTM techniques, can be applied under all ML strategies, such as supervised and unsupervised learning [121].

Figure 13 depicts a decade-long comparison of the Google Trends search volumes for "Machine Learning" vs. "Deep Learning" between 2012 and 2023. Results indicate that ML searches predominate over DL searches due to the use of ML as an umbrella term [122].

Convolutional neural network

CNNs have been implemented in several domains, including computer vision and medical imaging. In particular, CNNs have been effective at producing outputs in previously unattainable settings [123]. This is because CNNs can detect and learn crucial traits that radiologists cannot readily observe with visual inspection [124]. CNN's primary advantage over its predecessors is that it automatically recognizes pertinent features. There are many advantages to utilizing CNNs, including weight sharing, simultaneous learning of both the feature extraction and the classification, and the capability to create large-scale networks [121]. The basic architecture of CNN is represented in Fig. 14.

Convolutional layer

The convolutional layer comprises a procedure that slides a specific filter over the whole image. The incoming image (i) of every layer in the CNN model is presented in three dimensions: height, width, and depth, represented as a × a × b in the dimensional form, in which the height (a) is the same as the width (a). A different name for depth (b) is the channel number.

Filters may have a variety of sizes, including 3 × 3, 5 × 5, 11 × 11, etc. Filters convolutionally transform the preceding layer's inputs into the corresponding layer's output. A feature map is produced as a result of this convolution procedure.

Every convolutional layer contains k kernels, also known as filters, with the same dimensional form as the input image, represented as c × c × d, with the following conditions: c < a and d ≤ b. A dot product is computed between the inputs of the convolution layer and the weights of that layer. To generate k feature maps (h^k) as presented in Eq. 6, the input is convolved with these kernels, each of which has a bias (b^k, not to be confused with the depth b) and weights (w^k) that are shared across all spatial positions, and an activation function f is then applied [121, 125].

$${h}^{k}=f\left({w}^{k}*i+{b}^{k}\right)$$

(6)
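Eq. 6 can be illustrated for a single kernel on a single-channel image. The sketch below is a minimal illustration under stated assumptions: depth is 1, there is no padding, the stride is 1, and f is taken to be ReLU. As in most deep learning frameworks, the kernel is applied without flipping (strictly a cross-correlation). The function name is hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Compute one feature map h = f(w * i + b) with f = ReLU (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    feature_map = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Dot product between the kernel weights and the image patch at (y, x).
            s = sum(kernel[i][j] * image[y + i][x + j]
                    for i in range(kh) for j in range(kw))
            row.append(max(0.0, s + bias))
        feature_map.append(row)
    return feature_map
```

A layer with k kernels would repeat this computation k times, once per kernel and bias, to obtain the k feature maps.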

Activation functions

Activation functions introduce non-linearity into neural networks by mapping a neuron's input to its output. The input value is calculated by weighting the neuron inputs and adding a bias. CNNs and other types of deep neural networks often use the ReLU, Leaky ReLU, and Noisy ReLU, as well as the Sigmoid and Tanh activation functions. The rectified linear unit (ReLU) is an activation function that may help prevent vanishing gradients; it passes the positive part of its argument and outputs zero otherwise [121]. Some of the prominent activation functions that are widely used are presented in Table 5.
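For reference, four of the activation functions named above can be written directly from their standard definitions. This is a minimal scalar sketch; the leaky slope `alpha=0.01` is a common default and is an assumption here, not a value taken from the reviewed studies.

```python
import math

def relu(x):
    """max(0, x): passes positive inputs and zeroes the rest."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope alpha."""
    return x if x > 0 else alpha * x

def sigmoid(x):
    """1 / (1 + e^(-x)): squashes the input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: squashes the input into (-1, 1)."""
    return math.tanh(x)
```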

Table 5 Prominent activation function


Pooling layer

A pooling or subsampling layer performs a down-sampling operation on each feature map. It reduces the image size while preserving the most important image features. A pooling function, such as maximum, global, or average, is applied to each feature map with a preset kernel size or pool size [125].
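The down-sampling step can be sketched as non-overlapping pooling over one feature map. This is a minimal illustration assuming a stride equal to the pool size; the function name and the `mode` parameter are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling (stride == size); incomplete edge windows are dropped."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, h - size + 1, size):
        row = []
        for x in range(0, w - size + 1, size):
            window = [feature_map[y + i][x + j]
                      for i in range(size) for j in range(size)]
            # Keep the largest value (max pooling) or the mean (average pooling).
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        pooled.append(row)
    return pooled
```

A 4 × 4 feature map pooled with a 2 × 2 window becomes 2 × 2, so the spatial size is quartered while the strongest (or average) response in each window is retained.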

Optimizers

Updating the weights in the CNN architecture requires applying an optimization algorithm at each training step until learning converges. Each optimizer carries out the update using its own rule. Some of the best-known optimizers are Gradient Descent, Stochastic Gradient Descent, and Adam [125].
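The difference between update rules can be seen on a one-parameter toy problem. The sketch below minimizes the assumed example loss (w − 3)² with plain gradient descent and with Adam; the hyperparameters are commonly used defaults chosen for illustration, not values from the reviewed studies. Stochastic gradient descent applies the same rule as `gradient_descent`, but with the gradient estimated on a mini-batch of samples.

```python
import math

def gradient_descent(grad, w, lr=0.1, steps=100):
    """Plain gradient descent: step against the gradient with a fixed learning rate."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def adam(grad, w, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Adam: step scaled by running estimates of the gradient mean and mean square."""
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def toy_gradient(w):
    """Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)
```

Starting from w = 0, both `gradient_descent(toy_gradient, 0.0)` and `adam(toy_gradient, 0.0)` approach the minimizer w = 3.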

Fully connected layer

A fully connected layer is one in which every input node is coupled to every output node. It is used to make predictions at the network's end. This layer connects each neuron of the preceding layer to each neuron of the current layer. The previous layer's output is flattened and delivered to a fully connected layer, which applies a linear transformation to the data before sending it to a nonlinear activation function [128].
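The flatten-then-transform behaviour described above can be sketched as follows. This is a minimal illustration; the function names are hypothetical, and ReLU is assumed as the default non-linearity.

```python
def flatten(feature_maps):
    """Concatenate a list of 2-D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]

def dense(inputs, weights, biases, activation=lambda z: max(0.0, z)):
    """Fully connected layer: out_j = f(sum_i weights[j][i] * inputs[i] + biases[j])."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]
```

Each row of `weights` holds the connections from every input to one output neuron, which is what makes the layer fully connected.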

CNN architectures

Various CNN architectures carry out classification tasks, including ResNet, VGG Net, Inception, Xception, DenseNet, EfficientNet, MobileNetV2, and many more. Segmentation tasks, on the other hand, are carried out by U-Net, V-Net, FCN, SegNet, DRUNET, and many other architectures [129]. With the aid of CNN, the number of parameters can be significantly reduced, overfitting can be mitigated, and the information gleaned from an image may be preserved.

Ensemble learning

Ensemble learning aims to improve overall performance by integrating different models into a single one. It was initially proposed for classification tasks. The benefits of both deep learning and ensemble learning are combined in deep ensemble learning models to provide a model with enhanced performance [130]. An ensemble of learned models may be created by taking the training data, deriving many training sets from it, learning a model from each, and then combining them. Bagging, boosting, and stacking are all well-known ensemble learning methods. Combining the model outputs yields a single prediction: a weighted vote is used for classification, whereas a weighted average is used for numerical prediction. Bagging and boosting both use this approach; however, they generate their individual models in different ways [131]. Stacking enables the combination of different base learning algorithms. Diversified base models allow the stacked ensemble to learn from various perspectives, producing heterogeneous features. The super learner approach is called "layered ensemble learning" [132].
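The two combination rules mentioned above, a weighted vote for class labels and a weighted average for numerical outputs, can be sketched as follows. The function names, class labels, and weights are illustrative assumptions.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the class label whose supporting models carry the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def weighted_average(predictions, weights):
    """Combine numerical predictions into one weighted mean."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)
```

For instance, if one model with weight 0.6 predicts "pneumonia" and two models with weight 0.25 each predict "normal", the weighted vote returns "pneumonia" even though it is the minority label.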

Transfer learning

Many ML approaches function well only when the training and testing data are drawn from the same feature space and distribution. Statistical models must be rebuilt with fresh training data when the distribution changes. In many real-world instances, collecting new training data and rebuilding models is either impractical or too expensive, so it would be helpful to reduce the training data collection work. In such circumstances, transfer learning across task domains is advantageous. Whenever there is inadequate training data for a given task, one solution is to use transfer learning methods to bring the knowledge acquired from previously learned tasks to the target task [133]. Inductive [134] and transductive kinds of transfer learning are preferred for classification or regression studies. On the other hand, unsupervised types of transfer learning are selected for tasks involving clustering and dimensionality reduction [135]. Transfer learning can make a DL model even more accurate by fine-tuning it with more training data and adjusting the parameters.

Detection of prominent lung diseases using machine learning and imaging

The backbone of ML models is input data, which comes in the form of datasets, together with ML diagnostic methods. Therefore, this review first focuses on the datasets available for the prominent lung diseases, and the subsequent section discusses the ML approaches for diagnosis in more depth.