Comparison of six machine learning methods for differentiating benign and malignant thyroid nodules using ultrasonographic characteristics

Liang, Jianguang; Pang, Tiantian; Liu, Weixiang; Li, Xiaogang; Huang, Leidan; Gong, Xuehao; Diao, Xianfen

doi:10.1186/s12880-023-01117-z

Research
Open access
Published: 12 October 2023

BMC Medical Imaging volume 23, Article number: 154 (2023)

Abstract

Background

Several machine learning (ML) classifiers for thyroid nodule diagnosis were compared in terms of their accuracy, sensitivity, specificity, negative predictive value (NPV), positive predictive value (PPV), and area under the receiver operating characteristic curve (AUC). A total of 525 patients with thyroid nodules (malignant, n = 228; benign, n = 297) underwent conventional ultrasonography, strain elastography, and contrast-enhanced ultrasound. Six algorithms were compared: support vector machine (SVM), linear discriminant analysis (LDA), random forest (RF), logistic regression (LG), GlmNet, and K-nearest neighbors (K-NN). The diagnostic performance of 13 suspicious sonographic features for discriminating benign and malignant thyroid nodules was assessed using the different ML algorithms. To compare these algorithms, a 10-fold cross-validation paired t-test was applied to the algorithm performance differences.

Results

The logistic regression algorithm had better diagnostic performance than the other ML algorithms. However, its performance was only slightly higher than that of GlmNet, LDA, and RF. The accuracy, sensitivity, specificity, NPV, PPV, and AUC obtained by running logistic regression were 86.48%, 83.33%, 88.89%, 87.42%, 85.20%, and 92.84%, respectively.

Conclusions

The experimental results indicate that GlmNet, SVM, LDA, LG, K-NN, and RF exhibit slight differences in classification performance.

Background

There is a high incidence of thyroid nodules following the widespread use of high-resolution ultrasound in clinical practice. Ultrasonography plays an important role in the diagnosis of thyroid nodules because it is noninvasive, economical, and convenient. Most thyroid nodules are benign; however, differentiating malignant nodules from benign nodules remains challenging because their early clinical symptoms are hidden [1, 2]. Known suspicious US features for differentiating thyroid nodules are margins, borders, calcification, and shape [3, 4]. In this paper, we chose 13 features, including conventional US features and features based on new imaging techniques, such as strain elastosonography (SE) and contrast-enhanced ultrasound (CEUS); see the Materials section for details.

Machine learning (ML) is one of the fastest-developing fields in computer science. With the development of artificial intelligence, ML serves as a useful reference tool for classification.

Several types of classifiers are used in ML. The support vector machine (SVM), random forest (RF), logistic regression, GlmNet, linear discriminant analysis (LDA), and K-NN are among the most common classifiers.

The original SVM was proposed by Vapnik and Chervonenkis in 1963. The current standard formulation originated in 1993 and was proposed by Cortes and Vapnik. SVM is a core machine-learning technique for solving a variety of classification and regression problems; it produces nonlinear boundaries by constructing a linear boundary in a large, transformed version of the feature space [5]. SVM has been applied to many types of problems, such as object and handwritten digit recognition and image and text classification. The general form of the decision function f(x) for SVM is:

$$f(x)=\sum\nolimits_{i=1}^{n}{\alpha }_{i}{y}_{i}k(x, {x}_{i}) + b$$

(1)

where k(x, x_i) is the kernel function, b is the bias, 0 ≤ α_i ≤ C, and Σ(α_i y_i) = 0. The coefficients α_i are obtained through training, and C is a penalty parameter set by the user [5,6,7]. In this study, the Gaussian kernel function ${k}_{\gamma }(X, X^{\prime}) = \exp\left(-\gamma {\Vert X - X^{\prime}\Vert }^{2}\right)$ was used to address nonlinear classification [5]. The SVM with a Gaussian kernel was implemented in MATLAB using the LIBSVM toolkit, which is a publicly available library for SVMs.

Figure 1 shows the architecture of an SVM, where x = [x₁, x₂,… x_n] is an n-dimensional input feature vector and y is the decision value:

$$y=\mathrm{sgn}\left(\sum\nolimits_{i=1}^{n}{\alpha }_{i}{y}_{i}k\left(x,{x}_{i}\right)+b\right)$$

(2)
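As an illustration of Eqs. (1) and (2), the following minimal Python sketch evaluates the decision function with a Gaussian kernel. It is not the LIBSVM/MATLAB implementation used in the study; the function names and the toy support vectors, coefficients, bias, and γ are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math


def gaussian_kernel(x, z, gamma):
    """Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decision(x, support_vectors, alphas, labels, b, gamma):
    """Return (f(x), sgn(f(x))) as in Eqs. (1) and (2).

    sgn(0) is mapped to +1 here, which is a convention of this sketch.
    """
    f = sum(a * y * gaussian_kernel(x, sv, gamma)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return f, (1 if f >= 0 else -1)


# Hypothetical trained model: two support vectors with labels -1 and +1.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
alphas = [0.5, 0.5]      # satisfies sum(alpha_i * y_i) = 0
labels = [-1, 1]
bias = 0.0
gamma = 1.0

f_value, y_pred = svm_decision([1.0, 1.0], support_vectors, alphas,
                               labels, bias, gamma)
```

For the query point [1, 1], the second support vector contributes 0.5 and the first contributes −0.5·e^(−2), so f(x) is positive and the predicted class is +1.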

RFs were first proposed by Breiman and Cutler. RF is a versatile machine-learning algorithm that can implement regression, classification, and dimensionality reduction. Random forests are a combination of decision trees, where each decision tree depends on the values of a random vector sampled independently [8]. The performance of random forests is quite similar to that of the bootstrap aggregating algorithm for many problems, which depends on the strength of the individual trees in the forest and the correlation between trees [5]. The steps of the algorithm are as follows:

1. N samples are randomly drawn with replacement from the data set.
2. m features are randomly sampled from all the features, and a splitting strategy (CART) is used to select one of the m features as the split attribute of the node.
3. The above two steps are repeated n times to generate n decision trees, which together form a random forest.
4. For new data, each tree makes a decision, and the majority vote over all trees is taken as the final category.
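The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the study: each "tree" is a single-split stump on a binary feature chosen by Gini impurity rather than a full CART tree, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def majority(labels, default=0):
    """Most frequent label, or `default` for an empty list."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, feature_ids):
    """Among feature_ids, pick the binary feature with the lowest weighted Gini."""
    best = None
    for j in feature_ids:
        left = [yi for xi, yi in zip(X, y) if xi[j] == 0]
        right = [yi for xi, yi in zip(X, y) if xi[j] == 1]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            fallback = majority(y)
            best = (score, j, majority(left, fallback), majority(right, fallback))
    _, j, label_if_0, label_if_1 = best
    return j, label_if_0, label_if_1


def fit_forest(X, y, n_trees, m, rng):
    """Steps 1 to 3: bootstrap the rows, sample m features, fit, repeat."""
    n_samples, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
        feats = rng.sample(range(n_features), m)
        forest.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return forest


def predict_forest(forest, x):
    """Step 4: every stump votes and the majority decides."""
    votes = [label_if_1 if x[j] == 1 else label_if_0
             for j, label_if_0, label_if_1 in forest]
    return majority(votes)


# Hypothetical toy data: feature 0 equals the label, features 1 and 2 are noisy.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0, 1, 0]
forest = fit_forest(X, y, n_trees=25, m=2, rng=random.Random(0))
prediction = predict_forest(forest, [1, 0, 0])
```

Here `n_trees` plays the role of n (ntree) and `m` is the number of features sampled at each split, matching the two RF parameters described in the Method section.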

K-nearest neighbors (K-NN) is memory-based and requires no preprocessing of the sample and no model to fit [5, 9]. Given a query point x₀, the k training points closest to x₀ are found, and x₀ is then classified by majority vote among these k points [5]. The decision rule is defined as follows.

$$\widehat{f}(X)=\frac{1}{k}\sum\nolimits_{{x}_{i}\in N_k (\mathbf{X})} {y}_{i}$$

(3)

where N_k(X) is the neighborhood of X.
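A minimal Python sketch of Eq. (3) for 0/1 labels follows. The squared Euclidean distance and the 0.5 cut-off for turning the neighborhood average into a class label are assumptions of this sketch, as are the function name and the toy data; for binary features the squared Euclidean distance equals the number of differing features.

```python
def knn_predict(x0, X, y, k):
    """Average the labels of the k nearest training points (Eq. 3)
    and convert the average to a 0/1 label by majority vote."""
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(x0, xi)), yi)
        for xi, yi in zip(X, y)
    )
    neighbor_labels = [yi for _, yi in by_distance[:k]]
    f_hat = sum(neighbor_labels) / k
    return f_hat, (1 if f_hat >= 0.5 else 0)


# Hypothetical toy data with two binary features.
X_train = [[0, 0], [0, 1], [1, 1], [1, 0]]
y_train = [0, 0, 1, 1]
f_hat, label = knn_predict([1, 1], X_train, y_train, k=3)
```

With k = 3, the neighbors of [1, 1] carry the labels 1, 0, and 1, so the average is 2/3 and the predicted class is 1. An odd k avoids ties in the two-class case.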

Logistic regression is a generalized linear regression model and is the most common algorithm used in binary classification problems. The decision function of the logistic regression is

$$Z=\mathrm{sigmoid}\left({\theta }^{T}x\right)=\frac{1}{1+{e}^{-{\theta }^{T}x}}$$

(4)

where sigmoid(·) is the activation function, θ is the parameter vector, and x is the input data (the feature vector of a sample). The predicted label is set to 1 if Z ≥ 0.5 and to 0 if Z < 0.5.
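The decision rule in Eq. (4) can be written directly in Python. This is a sketch only: the coefficient values are hypothetical, and the convention of placing a constant 1 in the first position of x to carry the intercept is an assumption not stated in the text.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def logistic_predict(theta, x, threshold=0.5):
    """Return (Z, label) with Z = sigmoid(theta^T x), as in Eq. (4)."""
    z = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return z, (1 if z >= threshold else 0)


# Hypothetical coefficients; x[0] = 1 is the intercept term.
theta = [-1.0, 2.0, 1.5]
z_pos, label_pos = logistic_predict(theta, [1, 1, 0])   # theta^T x = 1.0
z_neg, label_neg = logistic_predict(theta, [1, 0, 0])   # theta^T x = -1.0
```

The first sample has θᵀx = 1, giving Z ≈ 0.73 and label 1; the second has θᵀx = −1, giving Z ≈ 0.27 and label 0.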

GlmNet fits a generalized linear model by penalized maximum likelihood. For binary classification, GlmNet solves the following penalized binomial likelihood problem:

$${\mathrm{min}}_{{\beta }_{0},\beta }\left\{-\frac{1}{N}\sum\nolimits_{i=1}^{N}\left[{y}_{i}\left({\beta }_{0}+{x}_{i}^{T}\beta \right)-\mathrm{log}\left(1+{e}^{{\beta }_{0}+{x}_{i}^{T}\beta }\right)\right]+\lambda {P}_{\alpha }\left(\beta \right)\right\}$$

(5)

where

$${P}_{\alpha }\left(\beta \right)=\left(1-\alpha \right)\frac{1}{2}{\Vert \beta \Vert }_{{l}_{2}}^{2}+\alpha {\Vert \beta \Vert }_{{l}_{1}}$$

Here, α is the mixing factor, λ is the regularization parameter, and P_α(β) is the elastic net penalty. The model is a ridge regression model when α = 0 and a lasso regression model when α = 1.
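To make the roles of α and λ concrete, the sketch below evaluates the elastic net penalty and the penalized objective for given coefficients. It uses the standard binomial log-likelihood term y_i·η_i − log(1 + e^{η_i}) with η_i = β₀ + x_iᵀβ, and it only evaluates the objective; it does not reproduce the optimizer of the GlmNet package. Function names and numbers are hypothetical.

```python
import math


def elastic_net_penalty(beta, alpha):
    """P_alpha(beta) = (1 - alpha) * 0.5 * ||beta||_2^2 + alpha * ||beta||_1."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return (1.0 - alpha) * 0.5 * l2_squared + alpha * l1


def penalized_objective(beta0, beta, X, y, lam, alpha):
    """Negative mean binomial log-likelihood plus lam * P_alpha(beta)."""
    log_lik = 0.0
    for xi, yi in zip(X, y):
        eta = beta0 + sum(b * v for b, v in zip(beta, xi))
        log_lik += yi * eta - math.log(1.0 + math.exp(eta))
    return -log_lik / len(y) + lam * elastic_net_penalty(beta, alpha)


beta = [3.0, -4.0]                                # hypothetical coefficients
ridge_part = elastic_net_penalty(beta, alpha=0.0)  # 0.5 * (9 + 16) = 12.5
lasso_part = elastic_net_penalty(beta, alpha=1.0)  # 3 + 4 = 7
mixed = elastic_net_penalty(beta, alpha=0.5)       # 0.5 * 12.5 + 0.5 * 7 = 9.75

# With all coefficients at zero, every sample contributes log(2).
null_objective = penalized_objective(0.0, [0.0, 0.0],
                                     [[1, 0], [0, 1]], [1, 0],
                                     lam=0.1, alpha=0.5)
```

Setting α = 0 leaves only the squared l2 term (ridge), α = 1 leaves only the l1 term (lasso), and intermediate values blend the two.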

LDA is widely used for dimensionality reduction and data classification. The principle of LDA is to project the labeled data into a lower-dimensional space such that the projected points can be easily distinguished, with points of the same category lying closer together in the projected space. In other words, LDA maximizes the between-class distance and minimizes the within-class distance [10]. The mapping function is

$$Y = {W}^{T} X$$

(6)

where X is the dataset to be categorized. The original central point of category i is

$${m}_{i}=\frac{1}{n}\sum\nolimits_{x\in {D}_{i}}x$$

(7)

where D_i represents the set of points belonging to category i and n is the number of points in D_i.

The variance before the projection of category i is

$${S}_{i}^{2}=\sum\nolimits_{x\in {D}_{i}}(x - {m}_{i}){(x - {m}_{i})}^{T}$$

(8)

The central point after the projection of category i is:

$${\widehat{m}}_{i}={W}^{T}{m}_{i}$$

(9)

The variance after the projection of category i is

$$\begin{aligned}{\widehat{S}}_{i}^{2}&=\sum\limits_{y\in {Y}_{i}}{\left(y-{\widehat{m}}_{i}\right)}^{2}\\ &=\sum\limits_{x\in {D}_{i}}{\left({W}^{T}x-{W}^{T}{m}_{i}\right)}^{2}\\ &=\sum\limits_{x\in {D}_{i}}{W}^{T}\left(x-{m}_{i}\right){\left(x-{m}_{i}\right)}^{T}W\\ &={W}^{T}{S}_{i}^{2}W\end{aligned}$$

(10)

where Y_i is the set of points obtained by projecting D_i.

Assuming that there are two categories in the dataset, the objective function is

$$\begin{aligned}J\left(W\right)&=\frac{{\left({\widehat{m}}_{1}-{\widehat{m}}_{2}\right)}^{2}}{{\widehat{S}}_{1}^{2}+{\widehat{S}}_{2}^{2}}\\ &=\frac{{\left({W}^{T}{m}_{1}-{W}^{T}{m}_{2}\right)}^{2}}{{W}^{T}{S}_{1}^{2}W+{W}^{T}{S}_{2}^{2}W}\\ &=\frac{{W}^{T}\left({m}_{1}-{m}_{2}\right){\left({m}_{1}-{m}_{2}\right)}^{T}W}{{W}^{T}\left({S}_{1}^{2}+{S}_{2}^{2}\right)W}\\ &=\frac{{W}^{T}{S}_{B}^{2}W}{{W}^{T}{S}_{W}^{2}W}\end{aligned}$$

(11)

where ${S}_{B}^{2}=\left({m}_{1}-{m}_{2}\right){\left({m}_{1}-{m}_{2}\right)}^{T}$ and ${S}_{W}^{2}={S}_{1}^{2}+{S}_{2}^{2}$.

The goal is to find the W that maximizes J(W).
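For two classes, the maximizer of J(W) is known in closed form up to scale: W is proportional to the inverse of the within-class scatter applied to (m₁ − m₂). The sketch below computes this direction in pure Python. To keep the matrix inverse explicit it is restricted to two features, which is a simplification of this example; the function names and the toy points are hypothetical.

```python
def mean_point(points):
    """Component-wise mean of a list of 2-D points (Eq. 7)."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def scatter_matrix(points, m):
    """2x2 scatter matrix sum (x - m)(x - m)^T (Eq. 8)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s


def fisher_direction(class1, class2):
    """Return W proportional to S_W^{-1} (m1 - m2) for two 2-D classes."""
    m1, m2 = mean_point(class1), mean_point(class2)
    s1, s2 = scatter_matrix(class1, m1), scatter_matrix(class2, m2)
    sw = [[s1[r][c] + s2[r][c] for c in range(2)] for r in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def project(w, x):
    """Scalar projection W^T x (Eq. 6) of one point."""
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical toy classes, mirrored about the vertical axis.
class1 = [[2, 0], [3, 1], [4, 0], [3, -1]]
class2 = [[-2, 0], [-3, 1], [-4, 0], [-3, -1]]
w = fisher_direction(class1, class2)
```

For these points the class means are (3, 0) and (−3, 0) and the within-class scatter is 4 times the identity, so W = (1.5, 0): the projection keeps only the horizontal coordinate, along which the two classes are fully separated.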

The motivation behind this study is to develop a better understanding of the classification process, to evaluate it in terms of accuracy, sensitivity, specificity, NPV, PPV, and AUC, and to analyze the weaknesses and strengths of known classifiers in differentiating malignant from benign nodules. These issues are important for the application of ML classifiers in thyroid research and for clinicians and researchers who would like to gain an understanding of the classification process and analysis.

Results

The performance of these classifiers is summarized in Table 1. Based on the results in Table 1, logistic regression achieved the highest accuracy (86.48%) and thus the best classification performance. However, there are only slight differences in the performances of the six classifiers.
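As a worked example of how the threshold-based measures relate to a confusion matrix, the sketch below computes accuracy, sensitivity, specificity, PPV, and NPV, treating malignant as the positive class. The four counts are not reported in the article; they are an assumption, chosen so that with 228 malignant and 297 benign nodules the formulas reproduce the percentages reported for logistic regression (86.48%, 83.33%, 88.89%, 85.20%, and 87.42%). AUC is omitted because it requires the continuous classifier scores.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Threshold-based diagnostic measures from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # malignant nodules correctly flagged
        "specificity": tn / (tn + fp),   # benign nodules correctly cleared
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Assumed counts (not from the article): 190 + 38 = 228 malignant,
# 264 + 33 = 297 benign.
metrics = diagnostic_metrics(tp=190, fn=38, tn=264, fp=33)
percentages = {name: round(100 * value, 2) for name, value in metrics.items()}
```

Under these assumed counts, for example, sensitivity is 190/228 ≈ 83.33% and PPV is 190/223 ≈ 85.20%.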

Table 1 Six performance measures for the different classifiers

A statistical test was applied to the classifier performance differences to compare the classifiers quantitatively [11]. The 10-fold cross-validation paired t-test was applied to each pair of classifiers at a significance level of 0.05; two classifiers were considered significantly different when the p-value was < 0.05. Table 2 shows the p-values of the paired t-tests. The results indicate that the six classifiers have no significant differences.
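The paired comparison can be sketched as follows. The code computes the standard paired t statistic, t = mean(d) / (s_d / √k), from the per-fold differences d between two classifiers; with k = 10 folds there are 9 degrees of freedom, for which the two-sided critical value at the 0.05 level is 2.262. The per-fold accuracies below are hypothetical and are not results from the study.

```python
import math

T_CRITICAL_DF9 = 2.262  # two-sided 0.05 critical value of Student's t, 9 d.f.


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-fold scores of two classifiers."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    mean_d = sum(diffs) / k
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)
    if var_d == 0:
        raise ValueError("all fold differences are identical")
    return mean_d / math.sqrt(var_d / k)


# Hypothetical accuracies of two classifiers on the same 10 folds.
acc_a = [0.87, 0.85, 0.88, 0.84, 0.86, 0.89, 0.85, 0.87, 0.86, 0.88]
acc_b = [0.86, 0.86, 0.87, 0.85, 0.86, 0.88, 0.86, 0.86, 0.87, 0.87]

t_stat = paired_t_statistic(acc_a, acc_b)
significant = abs(t_stat) > T_CRITICAL_DF9
```

Comparing |t| with the critical value is equivalent to checking whether the p-value is below 0.05.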

Table 2 Results of the paired t-tests of classifier differences

Discussion

In this analysis, the cross-validation technique and paired t-test method were applied to tune parameters and assess classifier performance differences, respectively. The experimental results indicate that GlmNet, SVM, LDA, logistic regression, K-NN, and random forests exhibit slight differences in classification performance. This result may be explained by our data, as all variables and labels are binary.

In clinical research, many classifiers are available for real applications, and it is useful for clinicians to be able to select an optimal classifier. Our comprehensive comparison study is an effort to help clinicians with such real problems.

Conclusions

The strength of this study is that 13 features (gender, SE, and CEUS, in combination with 10 conventional US features) were used to compare different classifiers in the diagnosis of malignant and benign disease. This study had a few limitations. First, the sample size was small, and this was a retrospective study; the established model requires further research to validate and support it, and large-sample studies are expected to be performed in the future. Second, the data in this study were binary. Finally, applying new methods such as deep learning to data from other modalities is a promising direction for thyroid nodule diagnosis [12, 13].

Materials and method

Materials

A database of 525 patients (396 females and 129 males) who underwent conventional US, SE, and CEUS at Shenzhen Second People’s Hospital was retrospectively reviewed. The patients were subdivided into two groups based on the final pathology results: those with benign thyroid nodules (n = 297) and those with malignancy (n = 228). We chose 13 features based on our clinical experience, using as much data as our current imaging equipment could provide; all features are listed in Table 3. The 10 conventional US features of malignancy in this study were: irregular margins, ill-defined borders, taller-than-wide shapes, hypoechogenicity or marked hypoechogenicity, microcalcification, posterior echo attenuation, peripheral acoustic halo, interrupted thyroid capsule, central vascularity, and suspected cervical lymph node metastasis. We chose the images according to clinical experience.

Table 3 The 13 features used for comparison

SE is an advanced technology used to evaluate tissue elasticity through the action of an external force. Under the same conditions, soft materials are more distorted than hard materials [2]. The degree of distortion under an external force was used to evaluate tissue hardness. Based on the fact that benign thyroid nodules are softer than malignant nodules, SE is used to differentiate benign from malignant nodules [2].

The SE score was based on Xu’s scoring system [14] as follows: Score 1: the nodule is predominantly white; Score 2: the nodule is predominantly white with few black portions; Score 3: the nodule is equally white and black; Score 4: the nodule is predominantly black with a few white spots; Score 5: the nodule is almost completely black; and Score 6: the nodule is completely black without white spots. A nodule was considered malignant if the score was greater than 4.

CEUS is a new technique in which microbubbles, which are smaller than erythrocytes, are infused into the blood capillaries. Owing to the ultrasound scattering effect produced by the microbubbles, CEUS can estimate the blood perfusion features of thyroid nodules to evaluate angiogenesis [2].

By comparing the echogenicity brightness between the thyroid nodule and surrounding parenchyma at peak enhancement, the degree of enhancement was classified as hypo, iso, hyper, or no enhancement. According to the echogenicity intensity within the thyroid nodule, the enhancement pattern was classified as homogeneous or heterogeneous. The nodule was regarded as malignant if the pattern of enhancement was heterogeneous hypoenhancement.
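The SE and CEUS rules above amount to converting each examination result into a binary feature. A minimal sketch of that encoding follows; the function names and the string labels for the enhancement categories are hypothetical, since the article does not describe how the data were coded.

```python
def se_suspicious(score):
    """1 if the SE score (1 to 6, Xu's system) is greater than 4, else 0."""
    if score not in range(1, 7):
        raise ValueError("SE score must be an integer from 1 to 6")
    return 1 if score > 4 else 0


def ceus_suspicious(degree, pattern):
    """1 for heterogeneous hypoenhancement, else 0.

    degree:  "hypo", "iso", "hyper", or "none" (hypothetical labels)
    pattern: "homogeneous" or "heterogeneous"
    """
    return 1 if (degree == "hypo" and pattern == "heterogeneous") else 0


# Hypothetical nodule: SE score 5 with heterogeneous hypoenhancement on CEUS.
features = [se_suspicious(5), ceus_suspicious("hypo", "heterogeneous")]
```

Scores of 5 and 6 map to 1 and scores of 1 to 4 map to 0; only the combination of hypoenhancement and a heterogeneous pattern is flagged on CEUS.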

Method

All statistical analysis in this study was conducted using MATLAB software, version R2015a.

Different classifiers had different tuning parameters. There were no tunable parameters for the LDA and logistic regression classifiers. RF had two parameters: the number of randomly selected variables m and the number of decision trees ntree. Because ntree was fixed at its default value of 500, m was the only tunable parameter for RF in this study. The tunable parameter of K-NN was the number of neighbors K. SVM and GlmNet each had two tunable parameters: the Gaussian kernel parameter (γ) and the penalty coefficient (C) for SVM, and the mixing factor (α) and the regularization parameter (λ) for GlmNet.

In this study, a five-fold cross-validation technique was used to tune the parameters of the classifiers. In each fold of the 10-fold cross-validation, the optimal tunable parameters of the classifier were selected from a grid of parameter values by five-fold cross-validation on the training data, choosing the values that maximized classification accuracy. Table 4 provides the grid of parameter values from which the optimal parameters were chosen. Classifier performance, including sensitivity, specificity, accuracy, PPV, NPV, and AUC, was evaluated using 10-fold cross-validation.
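The tuning scheme described above is a nested cross-validation: an outer 10-fold loop for evaluation and an inner five-fold loop for choosing parameters on the training part only. The sketch below shows the control flow in pure Python. It is not the MATLAB code used in the study; the function names are hypothetical, the folds are contiguous and unshuffled for simplicity, and `threshold_rule` is a stand-in classifier that ignores its training data and exists only to make the example runnable.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def split(X, y, test_idx):
    """Return (X_train, y_train, X_test, y_test) for one fold."""
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])


def cv_accuracy(X, y, k, param, fit_predict):
    """Pooled k-fold accuracy for one parameter value."""
    correct = 0
    for test_idx in k_fold_indices(len(y), k):
        X_tr, y_tr, X_te, y_te = split(X, y, test_idx)
        preds = fit_predict(X_tr, y_tr, X_te, param)
        correct += sum(p == t for p, t in zip(preds, y_te))
    return correct / len(y)


def nested_cv(X, y, grid, fit_predict, outer_k=10, inner_k=5):
    """Outer loop estimates accuracy; inner loop picks the parameter."""
    correct, chosen = 0, []
    for test_idx in k_fold_indices(len(y), outer_k):
        X_tr, y_tr, X_te, y_te = split(X, y, test_idx)
        # Tune on the training part only, so the test fold stays unseen.
        best = max(grid, key=lambda p: cv_accuracy(X_tr, y_tr, inner_k,
                                                   p, fit_predict))
        chosen.append(best)
        preds = fit_predict(X_tr, y_tr, X_te, best)
        correct += sum(p == t for p, t in zip(preds, y_te))
    return correct / len(y), chosen


def threshold_rule(X_train, y_train, X_test, threshold):
    """Hypothetical stand-in classifier: 1 if the first feature >= threshold."""
    return [1 if x[0] >= threshold else 0 for x in X_test]


# Hypothetical toy data: the label is 1 exactly when the feature is >= 10.
X = [[i] for i in range(20)]
y = [1 if i >= 10 else 0 for i in range(20)]
accuracy, chosen = nested_cv(X, y, grid=[5, 10, 15], fit_predict=threshold_rule)
```

On this toy data the inner loop selects the threshold 10 in every outer fold. In practice, `fit_predict` would wrap a real classifier, `grid` would hold the parameter values of Table 4, and the samples would normally be shuffled before splitting.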

Table 4 A grid of parameter values for different classifiers


Availability of data and materials

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

ML: Machine learning
US: Conventional ultrasonography
SE: Strain elastosonography
CEUS: Contrast-enhanced ultrasound
SVM: Support vector machine
LDA: Linear discriminant analysis
RF: Random forest
LG: Logistic regression
K-NN: K-nearest neighbors
NPV: Negative predictive value
PPV: Positive predictive value
AUC: Area under the receiver operating characteristic curve

References

1. Batawil N, Alkordy T. Ultrasonographic features associated with malignancy in cytologically indeterminate thyroid nodules. Eur J Surg Oncol. 2014;40(2):182–6.
2. Pang T, Huang L, Deng Y, Wang T, Chen S, Gong X, Liu W. Logistic regression analysis of conventional ultrasonography, strain elastosonography, and contrast-enhanced ultrasound characteristics for the differentiation of benign and malignant thyroid nodules. PLoS One. 2017;12(12):0188987.
3. Zhao RN, Zhang B, Yang X, Jiang YX, Lai XJ, Zhang XY. Logistic regression analysis of contrast-enhanced ultrasound and conventional ultrasound characteristics of sub-centimeter thyroid nodules. Ultrasound Med Biol. 2015;41(12):3102–8.
4. Chng CL, Kurzawinski TR, Beale T. Value of sonographic features in predicting malignancy in thyroid nodules diagnosed as follicular neoplasm on cytology. Clin Endocrinol. 2015;83(5):711.
5. Franklin J. The elements of statistical learning: data mining, inference and prediction. Publ Am Stat Assoc. 2010;99(466):567–567.
6. Drucker H, Burges CJC, Kaufman L, Smola AJ, Vapnik V. Support vector regression machines. Adv Neural Inf Process Syst. 1997;28(7):779–84.
7. Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
8. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
9. Cover T, Hart P. Nearest neighbor pattern classification. IEEE Trans Inf Theory. 1967;13(1):21–7.
10. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. New York: Springer; 2009.
11. Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Graph Stat. 2006;15(3):651–74.
12. Zhu YC, AlZoubi A, Jassim S, Jiang Q, Zhang Y, Wang YB, Ye XD, Hongbo DU. A generic deep learning framework to classify thyroid and breast lesions in ultrasound images. Ultrasonics. 2021;110:106300. https://doi.org/10.1016/j.ultras.2020.106300.
13. Zhu YC, Jin PF, Bao J, Jiang Q, Wang X. Thyroid ultrasound image classification using a convolutional neural network. Ann Transl Med. 2021;9(20):1526. https://doi.org/10.21037/atm-21-4328.
14. Zhang YF, He Y, Xu HX, Xu XH, Liu C, Guo LH, Liu LN, Xu JM. Virtual touch tissue imaging on acoustic radiation force impulse elastography. J Ultrasound Med. 2014;33(4):585–95.

Acknowledgements

Not applicable.

Funding

This work was supported in part by Grants JCYJ20140414170821285 and JCYJ20150529164154046 from Shenzhen Science and Technology Innovation Committee and JCYJ20160422113119640 from Shenzhen Fundamental Research Project.

Author information

Jianguang Liang and Tiantian Pang contributed equally to this work.

Authors and Affiliations

School of Pharmacy & School of Biological and Food Engineering, Changzhou University, Changzhou, Jiangsu, 213164, China
Jianguang Liang
Health Science Center, Shenzhen University, Shenzhen, 518060, China
Tiantian Pang, Weixiang Liu, Xiaogang Li & Xianfen Diao
School of Biomedical Engineering, Shenzhen University, Shenzhen, 518060, China
Tiantian Pang, Weixiang Liu, Xiaogang Li & Xianfen Diao
Guangdong Key Laboratory for Biomedical Measurements and Ultrasound Imaging, Shenzhen, 518060, China
Tiantian Pang, Weixiang Liu, Xiaogang Li & Xianfen Diao
National-Regional Key Technology Engineering Laboratory for Medical Ultrasound, Shenzhen, 518060, China
Tiantian Pang, Weixiang Liu, Xiaogang Li & Xianfen Diao
College of Computer Science and Technology, Jilin University, Changchun, 130012, China
Tiantian Pang
Guangzhou Medical University, Guangzhou, 510182, China
Leidan Huang
Department of Ultrasound, First Affiliated Hospital of Shenzhen University, Second People’s Hospital of Shenzhen, Shenzhen, 518035, China
Leidan Huang & Xuehao Gong

Contributions

Conceptualization, Weixiang Liu, Jianguang Liang and Xianfen Diao; Methodology, Tiantian Pang and Xiaogang Li; Validation, Tiantian Pang, Xianfen Diao, Jianguang Liang and Xiaogang Li; Formal Analysis, Tiantian Pang; Investigation, Leidan Huang and Xuehao Gong; Data Curation, Xuehao Gong and Leidan Huang; Writing–Original Draft Preparation, Tiantian Pang.

Corresponding authors

Correspondence to Jianguang Liang, Xuehao Gong or Xianfen Diao.

Ethics declarations

Ethics approval and consent to participate

All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki, and the protocol was approved by the Ethics Committee of Shenzhen Second People’s Hospital.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Liang, J., Pang, T., Liu, W. et al. Comparison of six machine learning methods for differentiating benign and malignant thyroid nodules using ultrasonographic characteristics. BMC Med Imaging 23, 154 (2023). https://doi.org/10.1186/s12880-023-01117-z

Received: 25 December 2022
Accepted: 02 October 2023
Published: 12 October 2023
DOI: https://doi.org/10.1186/s12880-023-01117-z
