Weighing features of lung and heart regions for thoracic disease classification

Background Chest X-rays are the most commonly available and affordable radiological examination for screening thoracic diseases. According to domain knowledge of chest X-ray screening, the pathological information usually lies in the lung and heart regions. However, region-level annotation is costly to acquire in practice, so model training mainly relies on image-level class labels in a weakly supervised manner, which is highly challenging for computer-aided chest X-ray screening. To address this issue, several methods have recently been proposed to identify local regions containing pathological information, which is vital for thoracic disease classification. Inspired by this, we propose a novel deep learning framework to explore discriminative information from the lung and heart regions.
Results We design a feature extractor equipped with a multi-scale attention module to learn global attention maps from global images. To exploit disease-specific cues effectively, we locate the lung and heart regions containing pathological information with a well-trained pixel-wise segmentation model and generate binarization masks. By applying an element-wise logical AND operator to the learned global attention maps and the binarization masks, we obtain local attention maps in which features are kept for the lung and heart regions and set to 0 for other regions. By zeroing the features of non-lung and non-heart regions in the attention maps, we can effectively exploit the disease-specific cues in the lung and heart regions. Compared to existing methods that fuse global and local features, we adopt feature weighting to avoid weakening visual cues unique to the lung and heart regions. Pixel-wise segmentation also helps overcome the deviation in locating local regions. Evaluated on the benchmark split of the publicly available chest X-ray14 dataset, comprehensive experiments show that our method achieves superior performance compared to state-of-the-art methods.
Conclusion We propose a novel deep framework for the multi-label classification of thoracic diseases in chest X-ray images. The proposed network aims to effectively exploit pathological regions containing the main cues for chest X-ray screening. Our proposed network has been used in clinical screening to assist radiologists. Chest X-rays account for a significant proportion of radiological examinations, so it is valuable to explore further methods for improving performance.

Fang et al. BMC Med Imaging (2021) 21:99
to subjective assessment errors [1]. Hence, it is highly desirable to develop a computer-aided diagnosis system to support clinical practitioners. In recent years, many deep learning methods have been proposed to automatically diagnose thoracic diseases from chest X-ray images and have achieved remarkable progress in tasks such as disease classification [2,3], abnormality detection [4,5], chest X-ray segmentation [6,7], and disease prediction [8,9]. Among these computer-aided diagnosis tasks, our work addresses disease classification. This task is highly challenging for computer-aided screening due to the low resolution and poor specificity of chest X-ray images.
Early works using convolutional neural networks (CNN) [10][11][12] for thoracic disease classification of chest X-ray images typically employ the global image for model training. However, this global learning strategy may suffer from interference from normal regions. As shown in Fig. 1, each image contains two parts: pathological regions (red bounding box) and normal regions. The pathological regions provide the main cues for chest X-ray screening, but these cues may be drowned out in the global image during model learning because of the normal regions. For example, a nodule occupies a small area, and its visual cues are difficult to preserve in the final features because a large number of convolution layers reduce fine detail. It is therefore vital to enhance the visual features of pathological regions and suppress the interference of normal regions during model training. However, although several large chest X-ray datasets [13][14][15] have been published, region-level annotations are still scarce and expensive to acquire. With only image-level annotations (class labels), many existing methods have explored strategies for locating and learning pathological regions [16,17].
The performance of region learning heavily relies on the accuracy of locating pathological regions from class labels. Existing methods for locating pathological regions for thoracic disease classification in chest X-rays include region proposals [18,19] and saliency maps [18,19]. However, without region-level annotations, they cannot precisely identify pathological regions by predicting bounding boxes, as shown by the blue rectangle in the nodule image of Fig. 1. According to results reported in existing work [17] on the chest X-ray14 dataset [20], the best bounding-box prediction performance is 0.29 average intersection over union (IoU) and 0.37 average continuous Dice. To avoid the deviation in locating pathological regions, some works [16,21] proposed deep fusion networks that integrate global features to compensate for the lost discriminative cues of local features. However, fusion methods must be carefully tuned to prevent the local features from being smoothed out by the global features. The local features have learned pathological information, but their differentiating role is weakened in the fusion process. Considering these issues, our work designs a novel deep learning framework to explore discriminative information from local regions and enhance their differentiating role for thoracic disease classification. Observing the pathological regions in Fig. 1 supports the domain knowledge that the pathological regions of thoracic diseases are typically confined to the lung and heart. Inspired by this prior knowledge, we locate the lung and heart regions by pixel-wise segmentation. Although the lung and heart regions still contain large non-pathological areas, they are smaller than the entire image and effectively cover the pathological information. In effect, our method makes a trade-off between suppressing normal regions and identifying pathological regions accurately.
Based on the global attention maps, only the local features of the lung and heart regions, selected by pixel-wise segmentation, are used for class-probability prediction. Without region-level annotations, it is difficult to locate pathological regions accurately; our solution is to narrow down, as far as possible, the regions containing pathological information. The main contributions of this work are summarized as follows:
1. To effectively learn the discriminative information from pathological regions and avoid interference from normal regions, we propose a novel deep learning framework for thoracic disease classification in chest X-rays. The proposed framework combines a feature extractor equipped with a multi-scale attention module and a well-trained pixel-level segmentation model for the lung and heart regions.
2. The multi-scale attention module learns the discriminative information from chest X-ray images to generate global attention maps. We apply a feature weighting strategy to the lung and heart regions containing pathological information to exploit their disease-specific cues effectively.
3. Evaluated on the benchmark split of the publicly available chest X-ray14 dataset, comprehensive experiments show that our method achieves the best performance compared to state-of-the-art methods. The multi-scale attention module can be embedded into any off-the-shelf network to help improve classification performance.

Related works
Chest X-ray datasets. Chest X-ray imaging is one of the most widely available modalities for assessing thoracic diseases, and computer-aided screening of chest X-ray images has long been explored extensively in medical image analysis. Several released hospital-scale chest X-ray datasets have greatly fostered multi-label classification research on thoracic diseases and especially benefit data-hungry deep learning models. For example, the MIMIC-CXR dataset [13] contains 377,110 chest X-rays associated with 14 labels, the CheXpert dataset [14] provides 224,316 chest X-rays associated with 14 labels, and the PadChest dataset [15] includes more than 160,000 images labeled with 19 differential diagnoses. Among the larger publicly available chest X-ray datasets, the Chest X-ray14 dataset [20] has attracted more research owing to its earlier release and higher quality, and strong baselines have been established on it [16,17]. Because of these comparable strong baselines, we adopt this dataset to demonstrate the advantage of our proposed method. To automatically extract the lung and heart regions from the global images, we use the JSRT dataset [22] to train the lung and heart segmentation model. It provides 154 nodule and 93 non-nodule chest X-ray images. Detailed pixel-level delineations for these images are publicly available for training lung and heart segmentation [23]. The annotation images for segmentation tasks are binary images in which pixels are 255 for the foreground and 0 for the background.
Attention mechanisms for medical image analysis. Recently, attention mechanisms applied in CNNs have significantly enhanced the performance of various tasks in medical image analysis [24][25][26]. For instance, a novel Attention Gate (AG) [27] can be easily integrated into standard CNN models to leverage salient regions in medical images for various medical image analysis tasks, including fetal ultrasound classification and 3D computed tomography (CT) abdominal segmentation. Attention mechanisms can help detect subtle differences between diseases by guiding the model activations to focus on salient regions. This property is particularly suitable for analyzing chest X-ray images because of their low resolution and poor specificity [28,29]. For example, a contrast-induced attention network [30] is proposed to exploit the highly structured property of chest X-ray images and localize diseases via contrastive learning on aligned positive and negative samples. For the multi-label classification of thoracic diseases, an attention-guided mask inference process is designed to locate salient regions and learn discriminative features for classification [16]. Inspired by this work, we improve the spatial-attention module in CBAM [31] to design a multi-scale attention module, which helps explore discriminative cues and advances classification performance by detecting subtle differences.
Local learning for chest X-ray classification. Due to the relative scarcity of region-level annotations, region localization and learning are gaining increasing attention in chest X-ray image analysis [32,33]. A thoracic disease is largely characterized by a pathological region, which contains critical cues for classification. With only image-level class labels, previous works [2,10,11] on thoracic disease classification typically learn the discriminative information from the global image by supervised training. However, this approach is prone to interference from normal regions. To address the problems caused by relying solely on the global image, recent approaches have shifted to learning the discriminative information from local regions containing pathological information. For example, a deep learning framework (SENet) [12] equipped with the squeeze-and-excitation block [34] reinforces the sensitivity to subtle differences between normal and pathological regions by explicitly modeling the channel interdependence. Other methods for region localization rely on region proposals or saliency maps [17][18][19]. For instance, in SalNet [17], the Gumbel-softmax function [35] is used to combine the region proposal and saliency map detector to sample discrete regions from a set of proposed regions differentiably. However, without region-level annotations, they cannot precisely identify pathological regions by selecting local regions.
To avoid the loss of discriminative information caused by deviation in locating pathological regions, some methods fuse global image training with local region learning. Deep fusion networks unifying global and local features have become increasingly popular in computer vision tasks [36,37]. For thoracic disease classification in chest X-ray images, the representative fusion methods are the segmentation-based deep fusion network (SDFN) [21] and the three-branch attention-guided network (AGCNN) [16]. In SDFN, a global classifier is used as a feature extractor to obtain discriminative features from the entire chest X-ray image, and the cropped lung regions generated by the segmentation model are learned by a local classifier. The features obtained from the global and local classifiers are fused by the feature fusion module for disease classification. Both our method and SDFN use the JSRT dataset [23] to train a pixel-wise segmentation model. However, fusion methods must be carefully tuned to prevent the local features containing pathological information from being drowned out by the global features. Hence, we apply feature weighting rather than fusion to enhance the visual cues unique to the lung and heart regions, based on the learned global attention maps and the segmented masks.
Based on the above discussion of related works, our proposed method is novel in two respects: (1) a feature extractor equipped with the multi-scale attention module is used to learn the global discriminative information; (2) a feature weighting strategy is applied to enhance the features of the lung and heart regions containing pathological information. Extensive experiments on the chest X-ray14 dataset demonstrate the effectiveness of our method.

Methods
Based on image-level class labels, our method addresses the multi-label classification of thoracic diseases by effectively learning the discriminative information from chest X-ray images. This section elaborates on our method, including the problem statement, the feature extractor, and feature weighting.

Problem statement
Thoracic disease classification is a multi-label classification problem that detects whether one or multiple diseases are present in each chest X-ray image. We define a 14-dimensional label vector Y = {y_1, ..., y_i, ..., y_c} for each image, where c = 14 and y_i ∈ {0, 1}. Each y_i indicates the presence of the corresponding disease in the image (i.e., 1 for presence and 0 for absence), and an all-zero vector of 14 dimensions represents the status of "No Finding" (no disease is found among the 14 listed disease categories). The diseases in Y are in the order of Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening, and Hernia. We address this classification problem by training our classification model, presented in Fig. 2, with the binary cross-entropy (BCE) loss function defined in Eq. 1.
In Eq. 1, c is the number of diseases (classes), Y is the ground truth, and Ŷ denotes the predicted probability.
Our proposed deep framework has three parts: a feature extractor, a pixel-wise segmentation model, and a feature weighting module. The feature extractor embeds the global discriminative information into a global attention map by applying a multi-scale attention module, which helps the feature extractor focus on salient regions and detect subtle texture abnormalities. In parallel, the well-trained pixel-wise segmentation model identifies the lung and heart areas, which are then binarized into a global mask in which pixels are 1 for the lung and heart regions and 0 for other regions. We then apply an element-wise multiplication, acting as a logical AND, to the global attention map and the global mask to generate a local attention map. By weighing the lung and heart region features, the local attention map contains only the visual cues unique to the lung and heart regions containing pathological information and discards the features of non-lung and non-heart regions by zeroing them. Following the local attention map, an average pooling layer and a fully-connected layer are used to predict disease-specific probabilities, trained with the binary cross-entropy loss.
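To make the label encoding and loss concrete, the following is a minimal sketch in plain Python. It assumes the standard BCE form, L(Y, Ŷ) = -(1/c) Σ_i [y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i)]; the averaging over the c classes, the clamping constant `eps`, and the helper names `encode_labels` and `bce_loss` are illustrative assumptions rather than details taken from the paper.

```python
import math

# Disease order as listed in the problem statement.
DISEASES = [
    "Atelectasis", "Cardiomegaly", "Effusion", "Infiltration", "Mass",
    "Nodule", "Pneumonia", "Pneumothorax", "Consolidation", "Edema",
    "Emphysema", "Fibrosis", "Pleural Thickening", "Hernia",
]


def encode_labels(findings):
    """Return the 14-dimensional multi-hot vector Y; all zeros means "No Finding"."""
    present = set(findings)
    return [1 if name in present else 0 for name in DISEASES]


def bce_loss(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy averaged over the c classes (averaging is an assumption)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# Hypothetical image with two findings and made-up predicted probabilities.
y = encode_labels(["Cardiomegaly", "Effusion"])
y_hat = [0.05] * 14
y_hat[1], y_hat[2] = 0.9, 0.8
loss = bce_loss(y, y_hat)
```

Because each class contributes an independent binary term, an image can be positive for several diseases at once, which is what distinguishes this multi-label setup from a softmax-based single-label classifier.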

Feature extractor
The feature extractor consists of a multi-scale attention module and a backbone. Each chest X-ray image X is resized to 3 × 224 × 224 and first fed into the multi-scale attention module. The multi-scale attention module computes a spatial feature hierarchy consisting of two convolutional layers with a kernel step of 2 and three blocks that calculate the maximum and average across channels. The spatial feature hierarchy is convolved into a feature map of dimension 1 × 224 × 224. After a sigmoid activation function, this feature map is merged into the original chest X-ray image by element-wise multiplication, and the result is fed into the backbone. Through this operation, each element of the global image is spatially weighted by computing the maximum value at different scales. The multi-scale spatial attention module can thus detect subtle differences at different scales and enhance multi-label classification performance by exploiting the visual cues effectively. We use the pre-trained 121-layer DenseNet [38] as the backbone. We take the last convolutional feature map from the backbone as the global attention map F_g with dimensions c × h × w. The global attention map learns the discriminative information from the chest X-ray image. However, the disease-specific features may be drowned out in the global features and fail to play a differentiating role in classification.

Feature weighting
We apply U-Net [39] to train a segmentation model for the left lung, right lung, and heart on the JSRT dataset using the dice loss, which is computed between the ground truth mask M_gt and the predicted mask M_prob. The dice loss is minimized during optimization, and the model with the smallest loss is saved. The image pre-processing for U-Net follows the same pipeline as the feature extractor to enable automatic region segmentation on the chest X-ray14 dataset. We first input the chest X-ray image X into the well-trained segmentation model to generate three pixel-wise masks for the left lung, right lung, and heart. Then we merge the three masks by pixel-wise summation into a single pixel-wise mask M_g in which pixels are 1 for the lung and heart regions and 0 for other regions. The mask M_g is further resized by adaptive average pooling to 1 × h × w, matching the height and width of the global attention map F_g of size c × h × w taken from the backbone of the image classifier. Finally, we generate a local attention map F_l of size c × h × w from the global attention map and the pixel-wise mask by element-wise multiplication, which acts as a logical AND operator between the two.
Fig. 2 The framework of our proposed method. A feature extractor equipped with a multi-scale attention module aims to learn the discriminative information from a chest X-ray image to generate a global attention map. A well-trained pixel segmentation model locates the lung and heart regions to binarize a mask in which pixels are 1 for lung and heart regions and 0 for other regions. A local attention map focusing on the lung and heart regions is formed by introducing a logical AND operator on the mask and the global attention map. This local attention map contains features of the pathological region and suppresses the normal region
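As an illustration of the dice loss used to train the segmentation model, here is a small sketch under stated assumptions. It uses the common soft Dice form, 1 - 2|M_gt ∩ M_prob| / (|M_gt| + |M_prob|), on masks stored as nested lists; the exact variant used by the authors, the `smooth` constant, and the helper `to_binary` (which maps the 255/0 annotation pixels to 1/0) are assumptions made for this example.

```python
def to_binary(mask_255):
    """Map annotation pixels (255 foreground, 0 background) to 1/0."""
    return [[1 if v > 0 else 0 for v in row] for row in mask_255]


def dice_loss(m_gt, m_prob, smooth=1e-6):
    """Soft Dice loss: 1 - 2 * sum(gt * prob) / (sum(gt) + sum(prob))."""
    gt = [v for row in m_gt for v in row]
    prob = [v for row in m_prob for v in row]
    intersection = sum(g * p for g, p in zip(gt, prob))
    denom = sum(gt) + sum(prob)
    # `smooth` avoids division by zero when both masks are empty.
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


gt_mask = to_binary([[255, 255], [0, 0]])
perfect = dice_loss(gt_mask, [[1.0, 1.0], [0.0, 0.0]])   # full overlap
partial = dice_loss(gt_mask, [[1.0, 0.0], [0.0, 0.0]])   # half of the region found
disjoint = dice_loss(gt_mask, [[0.0, 0.0], [1.0, 1.0]])  # no overlap
```

The loss falls to 0 as the predicted mask matches the ground truth and approaches 1 when the two do not overlap, so minimizing it directly rewards overlap rather than per-pixel accuracy.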
The local attention map contains zero pixels in the non-lung and non-heart regions and non-zero pixels in the lung and heart regions. Hence, only the pixel values of the lung and heart regions containing pathological information are passed to the average pooling layer for label prediction by a channel-wise average operation, while the pixel values of other regions in the attention map are zeroed. The feature weighting for the global attention map F_g and the pixel-wise mask M_g is defined as:
$$F_l = F_g \odot M_g$$
where $\odot$ denotes element-wise multiplication, with M_g applied to each of the c channels. With the help of the multi-scale attention module, the global attention map effectively learns the salient information from the chest X-ray image, including the discriminative information in the lung and heart. Because the pathological regions are typically located in the lung and heart, we apply the binary masks to the global attention map to generate the local attention map. The generated local attention map suppresses the information of other regions and retains the information of the lung and heart regions. Through this logical AND operation, we locate the features of the lung and heart regions containing pathological information.
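The feature weighting steps described in this section can be sketched as follows in plain Python on toy-sized inputs. The function names, the clipping of summed masks to 1, and the requirement that the mask size be divisible by the target size are assumptions for illustration; in a PyTorch implementation these steps would typically be tensor operations instead.

```python
def merge_masks(*masks):
    """Pixel-wise summation of binary masks, clipped to {0, 1} (clipping is an assumption)."""
    rows, cols = len(masks[0]), len(masks[0][0])
    return [[min(1, sum(m[r][c] for m in masks)) for c in range(cols)]
            for r in range(rows)]


def adaptive_avg_pool(mask, out_h, out_w):
    """Average-pool a 2D mask to out_h x out_w (input size must be divisible)."""
    bh, bw = len(mask) // out_h, len(mask[0]) // out_w
    return [[sum(mask[i * bh + di][j * bw + dj]
                 for di in range(bh) for dj in range(bw)) / (bh * bw)
             for j in range(out_w)]
            for i in range(out_h)]


def weight_features(f_g, m_g):
    """F_l = F_g * M_g, with the h x w mask applied to every channel."""
    return [[[v * m for v, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, m_g)]
            for channel in f_g]


def global_avg_pool(f):
    """Channel-wise average over all h x w positions, giving one value per channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in f]


# Toy 4 x 4 masks; the heart mask overlaps the first lung mask at one pixel.
lung_a = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
lung_b = [[0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
heart = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

m_g = adaptive_avg_pool(merge_masks(lung_a, lung_b, heart), 2, 2)
f_g = [[[2, 4], [6, 8]], [[1, 1], [1, 1]]]  # toy global attention map, c=2, h=w=2
f_l = weight_features(f_g, m_g)
pooled = global_avg_pool(f_l)
```

Note that average pooling produces fractional mask values at region borders (0.5 in the toy example), so the multiplication behaves as a soft weighting there and as a strict logical AND only where the pooled mask is exactly 0 or 1. Whether the pooled mask is re-binarized is not stated in the text.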

Experimental setups
To test the performance of our proposed framework, we conduct extensive experiments on the public chest X-ray14 dataset. In this section, we describe the experimental details. The chest X-ray14 dataset consists of 112,120 frontal-view X-ray images of 30,805 unique patients [20]. Each image is labeled with one or more of 14 common thoracic diseases: Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia, Pneumothorax, Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening, and Hernia. Besides, the dataset also contains 984 bounding boxes, labeled by board-certified radiologists, for 880 images related to 8 different diseases. In our experiments, we use the disease labels as ground truth for model training and utilize the bounding boxes for qualitative observation of pathological region localization on chest X-rays. As Table 1 shows, the benchmark split of this dataset [20] contains a train set of 86,524 images for model training, a test set of 25,596 images for model evaluation, and a box set of 984 images for model visualization. We randomly select 10% of each disease in the train set as the validation set for model validation. There is no patient overlap between the three splits. Because some images carry multiple labels, the total number of labels is greater than the number of images with findings.
Comparative methods. Prior research on the multi-label classification of thoracic diseases has established strong baselines on the benchmark split of the chest X-ray14 dataset.
• DCNN [20]. This work [20] first released the benchmark split of the chest X-ray14 dataset and presented a deep convolutional neural network (DCNN) to tackle thoracic disease classification. We reproduce this method using the pre-trained ResNet-50 [40], which achieved the best performance in that work.
• CheXNet [11]. CheXNet [11] is a 121-layer DenseNet [38] trained on the chest X-ray14 dataset. This work demonstrated that the performance of CheXNet is statistically significantly higher than radiologist performance.
• SENet [12]. To deal with the challenge that thoracic diseases usually occur in localized disease-specific areas, Yan et al. [12] presented a weakly-supervised deep learning framework equipped with squeeze-and-excitation blocks (SENet) to classify thoracic diseases. This work was based on the CheXNet model using DenseNet as the backbone and first explored the problem of learning disease-specific areas.
• SDFN [21]. Liu et al. [21] provided a segmentation-based deep fusion network (SDFN) to leverage the discriminative information of local regions. SDFN adopted pixel-level segmentation to detect local regions and applied a deep fusion framework to unify the global and local features. Our method also identifies the lung and heart regions by pixel segmentation. However, we argue that the deep fusion method cannot effectively tackle the problem of local features being drowned out by the global features. Hence, we use a feature weighting strategy to focus on the local features.
• AGCNN [16]. Guan et al. proposed a three-branch attention-guided convolutional neural network (AGCNN) [16] for thoracic disease classification on chest X-ray images. This work located salient regions from the global attention map and then cropped the corresponding regions from the chest X-ray image.
• SalNet [17]. Hermoza et al. [17] designed a three-stage deep learning framework (SalNet) for weakly-supervised disease classification by combining region proposal and saliency detection. This work obtained the local regions from saliency maps based on region proposals and achieved the best performance on the benchmark split of the chest X-ray14 dataset.
Implementation details and evaluation protocol. We implement CXR-IRNet with the PyTorch framework and use the pre-trained 121-layer DenseNet as the backbone of the feature extractor. We extract the last convolutional feature map of DenseNet as the global attention map. The single output is used for class-probability prediction after a sigmoid non-linearity. For the multi-scale attention module, apart from the original image as one feature, we adopt two convolutions with kernel sizes 5 and 9 to generate the other two features; these three-scale features are used in the subsequent operations. We resize each chest X-ray image to 256 × 256 and then perform center cropping to obtain an image of size 224 × 224 for training. Each cropped image is normalized with the same mean and standard deviation. We use the Adam optimizer with a learning rate of 0.001 and weight decay of 0.0001. Our network is trained for 50 epochs from scratch with a batch size of 512. For the comparative methods, we directly report the published performance of SDFN and SalNet without reproducing them. The other methods are implemented with the same experimental setup for a fair comparison. For evaluation, we report the area under the receiver operating characteristic curve (AUROC) and the ROC curve. Both are widely used for performance assessment of multi-label classification. The ROC curve involves two evaluation criteria, sensitivity (true positive rate) and specificity (true negative rate). For detection visualization, we evaluate in terms of the intersection over union (IoU) on the box set.
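For readers unfamiliar with the metric, the per-class and average AUROC can be sketched as below. This is a didactic pairwise implementation in plain Python, not the evaluation code used in the experiments; the function names and the toy labels and scores are hypothetical, and for a test set of this size a rank-based routine from an established library would be preferable to this quadratic-time version.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc(labels, scores):
    """Per-class AUROC over the columns of (n_images x c) lists, plus their average."""
    n_classes = len(labels[0])
    per_class = [auroc([row[k] for row in labels], [row[k] for row in scores])
                 for k in range(n_classes)]
    return per_class, sum(per_class) / n_classes


# Toy example: 4 images, 2 disease classes, made-up sigmoid outputs.
labels = [[0, 1], [0, 0], [1, 1], [1, 0]]
scores = [[0.1, 0.9], [0.4, 0.2], [0.35, 0.8], [0.8, 0.3]]
per_class, average = mean_auroc(labels, scores)
```

Since each disease is scored independently, the average AUROC reported for a method is the unweighted mean over the 14 per-disease values.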

Results and discussions
The following research questions are answered by analyzing the experimental results: RQ1 Can feature weighting of the lung and heart regions help improve the performance? RQ2 How effective is the multi-scale attention module at learning pathological information?

Classification performance (RQ1)
In Table 2, we report the classification performances of the proposed method and comparative methods in terms of AUROC scores, evaluated by the test set of the benchmark split. Our method achieves the best performance (boldface font) over 4 diseases, including Infiltration, Nodule, Fibrosis, and Pleural Thickening. In terms of the average AUROC, our method is superior to comparative methods. The overall results show that our method establishes a new state-of-the-art on the benchmark split of the chest X-ray14 dataset. Methods (SENet, SDFN, AGCNN, SalNet) unifying the global and local features obtain better performance than methods (DCNN, CheXNet) only employing the global image. To overcome location deviation of methods (AGCNN, SalNet) relying on saliency maps and region proposal, our method identifies the lung and heart regions containing pathological information by using pixel-wise segmentation same as SDFN. However, we argue that unifying the global and local features can not prevent local discriminative information from smoothing out in the global features. Hence, we consider the feature weighting strategy but not fusion like SDFN and AGCNN. SENet locates suspicious lesion regions by using a multi-map transfer layer to encode activations associated with each disease class. Such feature weighting strategy makes it more capable of discriminating the appearance of multiple thoracic diseases on the same chest X-ray, then helps it yields good performance. Different from SENet, Our method conduct feature weighting on the global attention map by using segmentation masks. Benefit from the segmentation locating the lung and heart regions containing pathological information precisely and the feature weighting strategy zeroing features of non-lung and heart regions in attention maps, our method establish a new baseline on the chest X-ray14 dataset.
As Table 2 show, without feature weighting (Ours w/o F l ), our deep framework equipped with the multiscale attention module can achieve the competitive performance, including the highest performance of 4 diseases and the second-highest average performance.
Without feature weighting, our deep framework is equal to ChXNet employing the global image. The improved performance demonstrates that the multiscale attention module can effectively learn the discriminative information from the global image. Only the discriminative information is learned into the global attention maps, the feature weighting on the global attention maps can locate the lung and heart regions containing pathological information. The multiscale attention module exploits the salient information from chest X-ray at three scales, then detects visual cues unique to pathological regions. The performances of Ours w/o F l confirm the contribution of the multiscale attention module. With the help of the multi-scale attention module, our method can further improve the performance by applying the feature weighting strategy to enhance the visual cues unique to the lung and heart regions. We argue that local features containing pathological information maybe drown in the global feature by applying the fusion framework like AGCNN and SDFN. Hence, based on locating the lung and heart regions containing pathological information by segmentation, we directly zero features of non-lung and heart regions.
To further demonstrate the advantage of the feature weighting strategy, we can observe the performance of some diseases. The Infiltration AUROC of our method is significantly improved compared to other methods. The improvement ratio reaches 12.87% compared to the second-highest performance yielded by SDFN. As Table 1 show, the number of Infiltration image is the most among the diseases. However, other methods can not achieve better performance due to the poor specificity of Infiltration. At the same time, we can observe that the pathological region of Infiltration occupies a relatively large area in the left lung, as shown in Fig. 1. This demonstrates that the effectiveness of the feature weighting strategy. The pathological region of Infiltration covers a large area of the left lung, and features of the left lung are enhanced after feature weighting. Hence, the class-probability prediction mainly relies on the learned pathological information of Infiltration. In other words, the pathological information of Infiltration is not weakened or even lost in the pipeline of our deep framework, while the non-pathological regions are suppressed. Benefiting from weighing features of the lung and heart regions, the performance of Nodule is up to 0.8377 obtained by our method. The pathological region of Nodule is usually small and easily drowned in the global image. The characteristics and performance of Nodule also demonstrate the effectiveness of the feature weighting strategy zeroing features of non-pathological regions. Based on the above discussion, we can infer that the feature weighting strategy can help improve classification performance and is superior to the fusion method. Figure 3 shows the ROC curves of our method on the 14 diseases of the benchmark split. According to the ROC curve trained on the chest X-ray14 dataset, we set the class threshold for each disease to classify a new chest X-ray image. 
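The text does not say how the per-disease threshold is read off the ROC curve. One common choice, shown here purely as an assumption, is to pick the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1). The function names and the toy scores below are hypothetical.

```python
def sens_spec(y_true, y_score, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, y_score) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, y_score) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def youden_threshold(y_true, y_score):
    """Scan every observed score and keep the one maximizing Youden's J."""
    best_t, best_j = None, -1.0
    for t in sorted(set(y_score)):
        sens, spec = sens_spec(y_true, y_score, t)
        j = sens + spec - 1.0
        if j > best_j:  # strict comparison keeps the lowest threshold on ties
            best_t, best_j = t, j
    return best_t, best_j


# Toy validation scores for one disease (both classes must be present).
y_true = [0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.45, 0.4, 0.6, 0.7, 0.8]
threshold, j_stat = youden_threshold(y_true, y_score)
```

In a screening setting one might instead fix a minimum sensitivity and take the highest threshold that satisfies it, which trades specificity for fewer missed findings.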
Owing to its reliable performance, our model has been successfully applied in routine clinical screening to assist radiologists. We automatically output the screening results of our method before the radiologists read the chest X-ray images in the picture archiving and communication system (PACS). On the PACS user interface, the radiologists can view the pre-screened result to make a further diagnosis. For automatic chest X-ray screening, the underlying idea is to effectively suppress non-pathological regions and learn the visual cues of pathological regions. In this work, we focus on locating the lung and heart regions containing pathological information by designing the multi-scale attention module and the feature weighting strategy. Our proposed framework avoids deviation in locating pathological regions by using pixel-wise segmentation, and avoids drowning local features in the global features by using feature weighting. In the future, we will try to improve our model by applying region-wise detection to learn visual cues unique to pathological regions.

Learning capability (RQ2)
The capability of learning pathological information determines the final classification performance. Even if the lung and heart regions are located accurately by pixel-wise segmentation, the feature weighting strategy cannot improve performance if the feature extractor fails to learn the pathological information in those regions. We therefore need to analyze the effectiveness of the multi-scale attention module in learning pathological information. The best average AUROC in Table demonstrates that our proposed method has reliable learning capability. Apart from this evidence, we further use the box set with ground-truth bounding boxes to evaluate how well the feature extractor equipped with the multi-scale attention module learns pathological information. We apply class activation maps (CAM) [41] to locate regions containing pathological information. Then we use IoU to evaluate the located regions against the ground-truth bounding boxes. Some images with higher IoU are shown in Fig. 4. This qualitative visualization demonstrates that the feature extractor can detect pathological regions with some probability. The detection performance reflects the learning capability of the feature extractor equipped with the multi-scale attention module. The detection performance for Nodule (green rectangle) is lower than for other diseases because of its small area, but the detected pathological region lies in the left lung. By filtering out the regions outside the lung and heart, the pathological information in the left lung can be used for label prediction. The detected pathological region of Cardiomegaly almost overlaps with the heart region, and the detected pathological region of Pneumonia almost covers the left lung region. However, the pathological region of Mass deviates severely from the lung and heart regions. Although the feature extractor has learned the pathological information, the pathological region will be filtered out in the process of feature weighting.
Such cases affect the classification performance of our method and can be overcome by using region-level annotations, whereas our proposed method aims to improve performance with image-level class labels. Further, we present an ablation study to demonstrate the contribution of the multi-scale attention module (MA) in the feature extractor (FE). In Table 3, the average IoU of FE outperforms that of FE without MA by 23.67%, rising from 0.2437 to 0.3014. This IoU is competitive with SalNet, which reports an average IoU of 0.29. Our feature extractor adopts the same backbone as DCNN, and the AUROC of our method without feature weighting (Ours w/o F_l) is superior to that of DCNN in Table 2. Based on these observations, we conclude that the multi-scale attention module contributes to both pathological information learning and classification performance.
A chest X-ray image can typically be divided into two parts: the pathological region and the non-pathological region. Our method aims to filter out the information of the non-pathological region. However, it is difficult to locate the pathological region without region-level annotations, and current works relying on saliency maps or region proposals suffer from location deviation. To overcome this issue, we apply pixel-wise segmentation to locate the lung and heart regions containing pathological information. Although the lung and heart regions cover the pathological region in most cases, the non-pathological parts inside the lung and heart regions cannot be filtered out by feature weighting, which can only remove the regions outside the lung and heart. Despite this, our method with the feature weighting strategy achieves better performance than methods using a fusion strategy. With image-level class labels, we design two tricks to improve the performance of multi-label classification for screening chest X-rays. Based on the above experimental results and discussion, we have demonstrated the effectiveness of these two tricks.
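The CAM-based localization check used in this subsection can be sketched as follows. This is a simplified illustration, not the evaluation code of this work: thresholding the CAM at a fraction of its maximum (`rel_threshold`) and taking the tight bounding box are assumptions, and the helper names `cam_to_box` and `box_iou` are hypothetical. The last line recomputes the relative IoU gain of the ablation from its two reported endpoints.

```python
import numpy as np

def cam_to_box(cam, rel_threshold=0.5):
    """Tight box (x1, y1, x2, y2) around CAM pixels >= rel_threshold * max.

    x2 and y2 are exclusive. rel_threshold is an assumed hyperparameter.
    """
    ys, xs = np.nonzero(cam >= rel_threshold * cam.max())
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy CAM with one activated 4x4 block, compared with a shifted ground-truth box.
cam = np.zeros((8, 8))
cam[2:6, 2:6] = 1.0
pred_box = cam_to_box(cam)        # (2, 2, 6, 6)
gt_box = (4, 4, 8, 8)
iou = box_iou(pred_box, gt_box)   # 4 / 28, about 0.143

# Relative gain from 0.2437 to 0.3014: roughly 23.7%, in line with the reported figure.
relative_gain = (0.3014 - 0.2437) / 0.2437 * 100.0
```

Averaging `box_iou` over all annotated images of a disease gives a per-class IoU comparable in spirit to the values in Table 3, provided the same thresholding rule is used for every model being compared.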

Conclusions
In this work, we propose a novel deep framework for the multi-label classification of thoracic diseases in chest X-ray images. The proposed network aims to effectively exploit the pathological regions that contain the main cues for chest X-ray screening. We present a feature extractor equipped with a multi-scale attention module to effectively learn pathological information from chest X-ray images. At the same time, we apply pixel-level segmentation to identify the lung and heart regions containing pathological information and thereby overcome location deviation. We then adopt the feature weighting strategy to filter out the regions outside the lung and heart. With our deep framework, the class-probability layer mainly relies on the information in the lung and heart regions. Evaluated on the benchmark split of the chest X-ray14 dataset, our method establishes a new state-of-the-art baseline. Our proposed network has been used in clinical screening to assist radiologists. Chest X-rays account for a significant proportion of radiological examinations, so it is valuable to explore more methods for improving performance.