Automatic detection of anomalies in screening mammograms
© Kendall et al.; licensee BioMed Central Ltd. 2013
Received: 10 May 2013
Accepted: 9 December 2013
Published: 13 December 2013
Diagnostic performance in breast screening programs may be influenced by the prior probability of disease. Since breast cancer incidence is roughly half a percent in the general population, there is a large probability that the screening exam will be normal. That factor may contribute to false negatives. Screening programs typically exhibit about 83% sensitivity and 91% specificity. This investigation was undertaken to determine if a system could be developed to pre-sort screening images into normal and suspicious bins based on their likelihood to contain disease. Wavelets were investigated as a method to parse the image data, potentially removing confounding information. The development of a classification system based on features extracted from wavelet transformed mammograms is reported.
In the multi-step procedure images were processed using 2D discrete wavelet transforms to create a set of maps at different size scales. Next, statistical features were computed from each map, and a subset of these features was the input for a concerted-effort set of naïve Bayesian classifiers. The classifier network was constructed to calculate the probability that the parent mammography image contained an abnormality. The abnormalities were not identified, nor were they regionalized.
The algorithm was tested on two publicly available databases: the Digital Database for Screening Mammography (DDSM) and the Mammographic Images Analysis Society’s database (MIAS). These databases contain radiologist-verified images and feature common abnormalities including: spiculations, masses, geometric deformations and fibroid tissues.
The classifier-network designs tested achieved sensitivities and specificities sufficient to be potentially useful in a clinical setting. This first series of tests identified networks with 100% sensitivity and up to 79% specificity for abnormalities. This performance significantly exceeds the mean sensitivity reported in literature for the unaided human expert.
Classifiers based on wavelet-derived features proved to be highly sensitive to a range of pathologies; as a result, Type II errors were nearly eliminated. Pre-sorting the images changed the prior probability in the sorted database from 37% to 74%.
Breast cancer is the most common form of cancer among Canadian women, and is second only to lung cancer in mortality [1–3]. Women in higher risk groups are encouraged to receive a screening x-ray mammogram every two years, with further screening for very high risk patients, such as those with familial history or genetic predisposition.
Treatment efficacy is linked to early detection of tumors. The challenge in x-ray mammography is that features associated with pathology may be patent or subtly represented in the image. For example, micro-calcifications sometimes signal the presence of cancer. Due to calcium’s relatively high absorption of x-ray photons, they appear as small bright regions in the mammogram and are readily detected by computer-aided detection (CAD) and human reviewers [4–8]. On the other hand, masses are evident in an x-ray only if their density differs from that of the surrounding tissue, and this is often not the case. Masses may have almost any size, shape or structure [4, 7, 9–17]. Occasionally, masses are evident only by inducing deformation of adjacent tissue. These architectural distortions are difficult to detect, thereby limiting the sensitivity of the screening procedure.
In response to these challenges, a range of software tools has been developed to help radiologists recognize subtle abnormalities in mammograms [7, 19–23]. These tools typically use a common second reader model: the radiologist first examines the raw image and notes suspicious regions. The tool then processes the image, marking potentially suspicious regions, and the results are compared.
Such systems have a significant drawback: they tend to have low specificity and so require nearly every image to be examined twice: once unaided, and then again to compare to the regions marked as suspicious by the software. This is impractical for screening mammography, where fewer than 1% of the images will have tumors. In that setting, the unintended consequence of CAD search routines is an increase in the time required to report normal findings. In addition, increasing the number of prompts for review apparently does not guarantee an increase in accuracy.
Here we report the performance of a wavelet-map feature classifier (WFC), designed as a pre-sorting tool. The WFC identifies and removes normal images from the radiologist’s review queue, leaving those images with a higher probability of showing abnormalities. For this technique to be optimally safe, the algorithm is designed to perform at high sensitivity, detecting all or nearly all abnormalities; for it to be effective, it has sufficient specificity to remove enough normal images to usefully increase the relative frequency of suspicious images in the product queue.
The pre-screening algorithm was developed using the Digital Database for Screening Mammography (DDSM) (http://marathon.csee.usf.edu/Mammography/Database.html) [26–28], a publicly available resource. A smaller, unrelated database from the Mammographic Images Analysis Society (MIAS) (http://peipa.essex.ac.uk/info/mias.html) provided a confidence check that the algorithm was not over-specified. These data provided a useful proving ground for testing various incarnations of the algorithm.
The DDSM data set consisted of 1714 images, 1065 of which were classified as normal, in that they showed no abnormalities. The other 649 images showed some type of abnormality that would merit further study. These included: 119 benign calcifications, 120 cancerous calcifications, 213 benign masses and 197 cancerous masses. There was a range of tissue composition and breast size in the DDSM data set, making it representative of the variety of images that may be seen in a clinical setting.
The MIAS data set contained 303 images. There were 205 normal images and 98 images that showed some type of abnormality that included: 11 benign calcifications, 12 cancer-associated calcifications, 38 benign masses, 18 cancerous masses, and 19 architectural distortions (9 from benign masses and 10 from cancerous masses that were not directly visible). Again, the images were from a wide variety of patients, such that the tissues imaged varied widely in terms of breast size and tissue composition.
The Wavelet Filter Classifier (WFC) proceeds in several discrete stages: regularizing the raw digital x-ray image, transforming it to produce scale maps, extracting features from the maps, classifying the features and generating the probability that the image contains some abnormality as an output.
Digital mammograms were pre-processed to reduce non-pathological variations between images, such as background noise, artifacts, and tissue orientation. All images were rescaled to 200 micron pixel resolution, and were padded or cropped to be 1024 × 1024 pixels, or 20.48 × 20.48 cm. The analysis presented here was restricted to medial-lateral views and the presentation of both breasts was adjusted to a single orientation.
The DDSM (and MIAS) mammograms were scanned from film images. As a result they contained labels, noise and other artifacts that are not present in direct digital images. These artifacts were removed using a threshold and segmentation procedure. Otsu’s method was used to determine the optimal pixel intensity threshold for distinguishing background and foreground (tissue) pixels. The segmented non-tissue regions were set to zero without changing pixel values within the tissue region. The processed images were then rescaled to maximum pixel intensity.
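The thresholding step can be illustrated with a minimal sketch. It assumes 8-bit integer intensities and uses plain pixel-wise masking in place of the segmentation procedure, so it would not remove a bright film label the way the full pipeline does. The function names `otsu_threshold` and `clean_image` are hypothetical and are not taken from the original implementation.

```python
def otsu_threshold(pixels, levels=256):
    """Return the intensity t that maximizes the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0
    for t in range(levels):
        w_bg += hist[t]              # number of pixels with intensity <= t
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0 or w_fg == 0:   # one class is empty, no valid split
            continue
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def clean_image(image, levels=256):
    """Zero the background, then stretch so the brightest pixel is levels - 1."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat, levels)
    masked = [[p if p > t else 0 for p in row] for row in image]
    peak = max(max(row) for row in masked)
    if peak == 0:
        return masked, t
    stretched = [[round(p * (levels - 1) / peak) for p in row] for row in masked]
    return stretched, t


# Toy image: dark film background on the left, brighter "tissue" on the right.
toy = [[10, 12, 100, 200],
       [11, 9, 150, 120]]
cleaned, threshold = clean_image(toy)
print(threshold, cleaned)
```

Here the threshold falls between the two intensity clusters, the background pixels become zero, and the brightest tissue pixel is stretched to full scale.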
Eight decomposition levels were created in a serial process, applying the transformation to the approximation map to create four more maps. Since the approximation map had half the resolution of the input image, the wavelet sampled structures that were twice as large as in the original image. The set of all maps derived from a single original image formed a decomposition tree. The highest levels of the tree had the highest resolution and were most sensitive to structures with small spatial extent, while the lowest levels of the tree had the lowest resolution and were most sensitive to structures with large spatial extent.
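The serial decomposition can be sketched for the Haar basis, the simplest of the wavelets considered. This is an illustration only: it uses the orthonormal Haar filters on nested Python lists, assumes even image dimensions at every level, and the names `haar_step` and `decompose` are hypothetical. The original work used the Matlab wavelet toolbox and a C++ port, neither of which is reproduced here.

```python
def haar_step(img):
    """One 2D Haar level: returns (approximation, row detail, column detail, diagonal detail)."""
    rows, cols = len(img) // 2, len(img[0]) // 2
    approx = [[0.0] * cols for _ in range(rows)]
    d_rows = [[0.0] * cols for _ in range(rows)]
    d_cols = [[0.0] * cols for _ in range(rows)]
    d_diag = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            approx[i][j] = (a + b + c + d) / 2   # local average (half resolution)
            d_rows[i][j] = (a + b - c - d) / 2   # change between the two rows
            d_cols[i][j] = (a - b + c - d) / 2   # change between the two columns
            d_diag[i][j] = (a - b - c + d) / 2   # diagonal change
    return approx, d_rows, d_cols, d_diag


def decompose(image, levels):
    """Re-apply the transform to each approximation map; four maps per level."""
    tree, current = [], image
    for _ in range(levels):
        maps = haar_step(current)
        tree.append(maps)
        current = maps[0]
    return tree


flat_image = [[1.0] * 4 for _ in range(4)]
tree = decompose(flat_image, 2)
print(len(tree), tree[1][0])
```

Each pass halves the resolution, so a 1024 × 1024 input supports the eight levels described above, with four maps at each level.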
Many wavelet bases are available, each with unique sampling characteristics. Several, including the Biorthogonal, Daubechies and Haar families, appeared promising for detecting subsets of the broad range of shapes and intensity gradients potentially associated with pathology [4, 31–33]. Eleven wavelets were selected from these families: Haar, Db2, Db4, Db8, Bior1.5, Bior2.2, Bior2.8, Bior3.7, Bior4.4, Bior5.5 and Bior6.8. The Haar wavelet is a square function that usefully interrogates sharp discontinuities. The other wavelets are more complex. The notation used suggests some of the features. For example, Db2 (Daubechies 2) is an orthogonal function that samples polynomials that have constant and linear scaling regions. The Bior1.5 describes a bi-orthogonal fifth order sampling function that requires a first order reconstruction algorithm.
The decompositions were initially performed in Matlab using the wavelet toolbox and later ported to C++ to improve computational efficiency. Moments of the mean generated from the output maps formed the input features for classification.
Four whole-image statistical features (mean, standard deviation, skewness and kurtosis of pixel intensity) were computed for each of the four wavelet-maps at each of the eight decomposition levels. This produced 132 scalar features for each of the eleven wavelet-bases applied to an x-ray image. The classification trials were restricted to using a combination of one, two or three features to avoid over-specifying the final classifier to the training set. Every combination of one, two or three features from the 132 member set was tested for every wavelet basis. The feature sets with the highest sensitivity for finding the images with known abnormalities were selected.
Mean and standard deviation are familiar metrics, skewness and kurtosis less so. The skewness value provides a measure of the asymmetry of a data distribution. Thus, the presence of a small number of unusually dark or bright pixels may alter skewness even when the mean and standard deviation values are not significantly affected. Here the skewness value may be sensitive to the representation of microcalcifications in an image. While these are only a few pixels in size they are unusually bright. Similarly, skewness may report the presence of bright (dense) masses. Since skewness measures the imbalance between the parts of the distribution above and below the mean, the presence of a dense mass will raise the skewness value relative to that found for a normal image.
Kurtosis reports the sharpness of the central peak of a distribution. Since it depends on the fourth power of the difference from the mean, it is highly sensitive to the addition of distant-valued points. Here, increasing numbers of bright microcalcification-containing pixels may be expected to raise kurtosis values. Interestingly, in some cases the kurtosis measure also detected masses. The post hoc rationale developed was based on the observation that when masses appear brighter than normal stromal tissue, they produced additional structure in wavelet maps at several scales. Adding intensity to normally dark pixels altered the kurtosis value sufficiently to distinguish it from the normal range. Of course for any feature, selection of wavelet bases and scale levels that correlate well with the shape of the anomaly was expected to provide the best differential. This was examined using eleven wavelet bases.
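The behaviour of skewness and kurtosis described above can be checked with a small sketch. The estimator is an assumption: the source does not say which definitions were used, so this version uses population moments and non-excess kurtosis (a normal distribution scores 3). The function name `moments` and the toy pixel values are hypothetical.

```python
import math


def moments(values):
    """Return (mean, standard deviation, skewness, kurtosis) of a flat pixel list."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:                       # constant map: shape statistics undefined
        return mean, 0.0, 0.0, 0.0
    return mean, math.sqrt(m2), m3 / m2 ** 1.5, m4 / m2 ** 2


# 100 "tissue" pixels with a symmetric intensity distribution.
background = [10, 11, 12, 13, 14] * 20
# The same map with three pixels replaced by bright, calcification-like values.
spiked = background[:-3] + [250] * 3

_, _, skew_bg, kurt_bg = moments(background)
_, _, skew_sp, kurt_sp = moments(spiked)
print(skew_bg, kurt_bg)
print(skew_sp, kurt_sp)
```

Three bright pixels out of one hundred are enough to push both skewness and kurtosis well above their values for the symmetric background, which is the effect the classifier features rely on.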
Selecting a subset of the candidate features added flexibility to the design of each individual classifier: for example, one classifier could use a feature subset sensitive to micro-calcifications while another could use a feature subset sensitive to masses.
Each classifier was limited to one wavelet basis and two of the four types of parameters generated from the maps. This reduced the feature pool size to 64. Combinations of these features were searched exhaustively to select the most effective combination.
Feature subsets were scored by weighting the number of true positives (NTP) against the number of true negatives (NTN), where w is a weighting factor that varies between zero and one. A high weighting factor favors a more sensitive classifier while a low weighting factor favors a more specific classifier. Since in this work normal images were not subject to further analysis, the true positive fraction was maximized with a 0.995 weighting factor. When two feature subsets produced the same number of true positives, the feature subset with the higher true negative fraction was selected.
The individual classifiers could also be designed to maximize detection of a specific abnormality (e.g. masses). To search for a single abnormality, NTP was replaced with the number of correctly classified images containing the specified abnormality, and NTN was replaced with the number of correctly classified images of all other types. To ensure that other abnormalities were not missed by the complete system, the outputs of the individual classifiers were combined.
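The exhaustive search can be sketched as follows. The scoring expression is an assumption chosen to match the description: a weighted sum of the true positive fraction and the true negative fraction with weight w = 0.995. The exact form used by the authors is not shown in this text. The helper names and the toy results table are hypothetical.

```python
from itertools import combinations
from math import comb


def weighted_score(n_tp, n_tn, n_pos, n_neg, w=0.995):
    """Assumed score: w * (true positive fraction) + (1 - w) * (true negative fraction)."""
    return w * n_tp / n_pos + (1 - w) * n_tn / n_neg


def best_subset(feature_names, evaluate, n_pos, n_neg, max_size=3, w=0.995):
    """Try every subset of up to max_size features; evaluate(subset) -> (NTP, NTN)."""
    best_score, best = -1.0, None
    for k in range(1, max_size + 1):
        for subset in combinations(feature_names, k):
            n_tp, n_tn = evaluate(subset)
            score = weighted_score(n_tp, n_tn, n_pos, n_neg, w)
            if score > best_score:
                best_score, best = score, subset
    return best, best_score


def n_candidates(pool_size, max_size=3):
    """Number of subsets of size 1..max_size drawn from the feature pool."""
    return sum(comb(pool_size, k) for k in range(1, max_size + 1))


# Toy results: "b" and ("a", "b") find the same positives; the pair keeps more normals.
toy_results = {("a",): (8, 5), ("b",): (9, 2), ("a", "b"): (9, 4)}
chosen, score = best_subset(["a", "b"], lambda s: toy_results[s], 10, 10, max_size=2)
print(chosen, n_candidates(132), n_candidates(64))
```

With a weight this close to one, the true negative term acts mainly as a tie-breaker between subsets that detect the same number of abnormal images. The subset counts also show why restricting the pool from 132 to 64 features makes the exhaustive search much cheaper.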
The goal was to help a reviewing physician make an informed decision in selecting images for further study. To do this, the classification scheme must provide a measure of the confidence that an image contains an abnormality. Since single naïve Bayesian classifiers do not generate confidence measures, a naïve Bayesian classifier network was constructed. The network’s performance classifying known images was used to calculate a classification confidence statistic. Training and testing were carried out using the leave-one-out cross-validation approach. Here, all but one of the samples were used to train the classifier, and the classifier was tested on the lone remaining sample. The overall performance of the classifier was measured by averaging the classification results when each sample in the data set was used as the test sample.
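Leave-one-out cross-validation of a naïve Bayes classifier can be sketched in a few lines. The Gaussian likelihood for each feature is an assumption, since the source does not state which likelihood model was used, and the feature values below are invented for illustration. The function names are hypothetical.

```python
import math


def fit(samples, labels):
    """Per class: prior plus (mean, variance) of each feature, treated as independent."""
    model = {}
    for cls in set(labels):
        rows = [s for s, lab in zip(samples, labels) if lab == cls]
        stats = []
        for column in zip(*rows):
            mu = sum(column) / len(column)
            var = sum((v - mu) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mu, var))
        model[cls] = (len(rows) / len(samples), stats)
    return model


def predict(model, x):
    """Return the class with the largest log posterior."""
    def log_posterior(cls):
        prior, stats = model[cls]
        total = math.log(prior)
        for v, (mu, var) in zip(x, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


def leave_one_out(samples, labels):
    """Train on all samples but one, test on the held-out sample, and average."""
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        hits += predict(fit(train_x, train_y), samples[i]) == labels[i]
    return hits / len(samples)


# Invented two-feature data: four "normal" (N) and four "suspicious" (S) images.
features = [(0.10, 1.00), (0.20, 1.10), (0.00, 0.90), (0.15, 1.05),
            (2.00, 3.00), (2.10, 3.20), (1.90, 2.90), (2.05, 3.10)]
classes = ["N"] * 4 + ["S"] * 4
accuracy = leave_one_out(features, classes)
print(accuracy)
```

Because every image serves once as the test sample, the averaged result uses the whole data set for testing without ever testing on an image that was used for training.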
In all cases the selected scalar features calculated from an image’s wavelet maps formed the inputs. The network of classifiers was constructed by passing the normal and suspicious output images from one classifier into additional classifiers for further analysis; several network configurations were evaluated.
The predicted confidence levels for a realistic distribution of normal and suspicious images were inferred from the results from a small data set after correcting for its inherent bias. In the DDSM data set [26–28], for example, 649 of the 1714 images were abnormal; this was a higher relative frequency than typically found in a screening clinic (1 in 20). To correct for this, the relative probabilities that a given input image was normal or suspicious, $P_i(N)$ and $P_i(S)$, respectively, were rescaled.
where $P_{real}(N)$ was the probability of an image from the realistic distribution being normal and $T_{exp}(N)$ was the total number of normal images used in the experimental data set.
The realistic fraction of suspicious images in a normal bin, $F_{real}(n,S)$, was similarly found from the experimentally counted number of suspicious images in the bin, $\eta_{exp}(n,S)$.
$\alpha$ characterizes the relative frequencies of normal and suspicious images in the experimental data set and in a realistic data set. For the DDSM data set [26–29] with 649 suspicious and 1065 normal images and for a clinic where 1 in 20 images are suspicious, $\alpha = 11.57$. For the MIAS data set with 98 suspicious and 205 normal images and for a clinic where 1 in 20 images are suspicious, $\alpha = 9.08$. A similar argument was used to calculate the confidence level, $C_{real}(S)$, for an image from a realistic distribution to be correctly placed into a certain suspicious bin.
Confidence levels were calculated for the various classifier networks by counting the number of normal and suspicious images assigned to each output bin of the classifier network and using the value of α appropriate for the data set in question.
The relatively low number of suspicious images that occur in practice dominates the realistic confidence levels and makes all bins have a large confidence for containing normal images. To facilitate feature comparisons, the theoretical case of an equal chance for an image to be normal or suspicious was also calculated. Thus, $C_{real}(N)$ gave the realistic likelihood that an image in a bin was normal, while $C_{even}(N)$ was useful for comparing the relative confidence levels of different bins when deciding which images are most likely normal. The mapping is monotonic, so bin ranking is the same using either a realistic or equal chance measure.
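The prior correction can be sketched numerically. The expressions below are a reconstruction and should be read as an assumption: α is taken as the ratio of the realistic normal-to-suspicious odds to the experimental odds, and the confidence for a bin is the α-weighted share of normal images in it. This form reproduces the α values quoted for both data sets and the equal-chance confidence levels reported for the first Haar stage on DDSM (390 normal and 5 suspicious images in the normal bin). The function names are hypothetical.

```python
def alpha(p_real_susp, n_exp_norm, n_exp_susp):
    """Realistic normal:suspicious odds divided by the experimental odds."""
    real_odds = (1 - p_real_susp) / p_real_susp
    exp_odds = n_exp_norm / n_exp_susp
    return real_odds / exp_odds


def confidence(n_norm_in_bin, n_susp_in_bin, a):
    """Return (C(N), C(S)) for one output bin under weighting a."""
    weighted_norm = a * n_norm_in_bin
    total = weighted_norm + n_susp_in_bin
    return weighted_norm / total, n_susp_in_bin / total


ddsm_alpha = alpha(1 / 20, 1065, 649)   # clinic with 1 suspicious image in 20
mias_alpha = alpha(1 / 20, 205, 98)
even_alpha = alpha(1 / 2, 1065, 649)    # equal chance of normal or suspicious

# First (Haar) stage on DDSM: 390 normal and 5 suspicious images called normal.
c_even_n, _ = confidence(390, 5, even_alpha)
_, c_even_s = confidence(1065 - 390, 649 - 5, even_alpha)
c_real_n, _ = confidence(390, 5, ddsm_alpha)

print(round(ddsm_alpha, 2), round(mias_alpha, 2))
print(round(100 * c_even_n, 1), round(100 * c_even_s, 1), round(100 * c_real_n, 1))
```

Under the realistic prior the normal bin is almost certainly normal, which illustrates why the equal-chance measure is the more useful one for comparing bins.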
In summary, images were subjected to wavelet decomposition using a variety of bases and producing 32 scale maps per basis per image. Moments of the mean were calculated for each of the maps resulting in a total of 132 features per image per basis. A Bayesian classifier using leave-one-out cross validation was used to segregate the images into two groups: normal or suspicious. To enhance classification accuracy combinations of up to three features were evaluated. Where classifier networks were employed, a confidence level for the final classification was calculated.
Table: Mean performance of statistical features across all 11 wavelet bases tested. Rows: feature pairs M + σ, M + S, M + K, σ + S, σ + K, S + K (M, mean; σ, standard deviation; S, skewness; K, kurtosis). Column: classification rate (%).
Table: Comparison of the performance of wavelet bases on the DDSM dataset. Column: best feature combination.
Table: Comparison of the performance of wavelet bases on the MIAS dataset. Columns: best feature combination; overall classification rate (%).
The data had been normalized to 1024 × 1024 pixels, leaving open the possibility that the interpolation process may have influenced classification rates. However, this was found not to be the case. A subset of the data was re-sampled to 256 × 256 and to 512 × 512 pixels and classified using mean features from the Haar wavelet. The lower resolution images provided classification rates indistinguishable from those at 1024 × 1024 resolution (not shown).
The results obtained (Tables 1, 2 and 3) suggested that no single combination of wavelet basis and feature would correctly classify all the images. Therefore, a network of classifiers was conceived in an attempt to achieve an acceptable classification rate.
In the sequential configuration (Figure 2), an image’s wavelet map features (best set) were passed to the first classifier. Images deemed normal were removed from the queue, while images classified as suspicious were passed on to the next classifier for re-analysis. Thus, the further an image passed along the chain before being found normal, the higher was its “suspicious” probability.
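The routing logic of the sequential configuration can be sketched as below. The stage classifiers here are hypothetical threshold rules standing in for the trained naïve Bayes classifiers, and the feature names and cut-off values are invented for illustration.

```python
def run_sequential(features, stages):
    """Return the index of the stage that called the image normal.

    An image that every stage flags as suspicious returns len(stages),
    which corresponds to the most suspicious output bin.
    """
    for depth, looks_suspicious in enumerate(stages):
        if not looks_suspicious(features):
            return depth              # removed from the queue at this stage
    return len(stages)


# Hypothetical stand-ins for trained classifiers (one feature each for brevity).
stages = [
    lambda f: f["skew"] > 0.5,
    lambda f: f["kurt"] > 3.0,
    lambda f: f["std"] > 20.0,
]

clearly_normal = {"skew": 0.1, "kurt": 2.0, "std": 5.0}
borderline = {"skew": 1.0, "kurt": 2.0, "std": 30.0}
suspicious = {"skew": 1.0, "kurt": 5.0, "std": 30.0}

depths = [run_sequential(f, stages) for f in (clearly_normal, borderline, suspicious)]
print(depths)
```

The returned depth identifies the output bin, so counting the normal and suspicious training images that land at each depth gives the per-bin confidence levels.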
Table: Performance of sequential classifiers using the DDSM database. Columns: confidence level (%), $C_{even}(N)$ and $C_{even}(S)$.
Table: Performance of sequential classifiers on the MIAS database. Columns: confidence level (%), $C_{even}(N)$ and $C_{even}(S)$.
For the DDSM data, the Haar based classifier correctly identified 390 of the 1065 normal images in the set and misidentified 5 of the 649 suspicious images as normal. This provided a confidence level, using an equal prior probability of normal or suspicious, of 97.9% for normal and 61% for suspicious. Images classified as suspicious were passed down the chain configured with Biorthogonal and Daubechies based classifiers. After stage five in the chain, the confidence that an image classified as normal was in fact normal declined sharply. This implied that the incidence of type II error (false negative) rose at this stage and beyond. Considering the emphasis placed on detection in this study, the data suggested that this eight member sequential network might be terminated at stage 5 to maintain high sensitivity. Overall, the DDSM-trained sequential network achieved 91.8% sensitivity for abnormal images with a specificity of 97.2%. Eight percent of the positive images escaped detection.
Re-evaluation on the MIAS-trained network (using different features) achieved 88.8% sensitivity to abnormal images with a specificity of 67.3% at stage 5. These results were very encouraging and led to the second network approach.
The alternative embodiment used classifiers that were tuned to detect specific types of abnormalities, either masses or calcifications. The goal was to determine if performance might be improved by deploying specialized classifiers. Images were first passed through several classifiers looking for one type of abnormality; if they were not suspicious for it, they were passed on to several classifiers looking for the other type of abnormality. These classifiers, with more specific targets, had potentially higher sensitivities. Figures 3 and 4 show two networks designed in this way. The numbers of images that are normal (n), show calcifications (c), or show masses (m) are listed at each stage of the network with the appropriate letter label. The wavelet features selected were those that had best identified the anomaly as a single feature classifier.
The tuned classifier networks were configured with four or six output taps. This offered the additional potential to distinguish among normal images, calcifications and masses. The network selected calcifications first. The four-tap network (Figure 3) used the Db2 wavelet feature tuned for calcifications. Suspicious images were passed to a Db8 classifier also tuned for calcifications. Normal images from the Db8 classifier went to the queue with normal images from the Db2 classifier, to be re-examined using Bior5.5 and Haar classifiers tuned for masses. The Db8 classifier output on the calcifications leg was a bin with all the images containing calcifications, a few masses and some normal images. On the masses leg of this network the suspicious tap contained most of the masses (all but one), no calcifications and some normal images. The normal output bins on this leg contained 814 of the 1065 normal images, one mass and no calcifications. This configuration achieved 99.8% sensitivity, a specificity of 76.4% and a classification rate of 86.1%.
For the six-tap network, classification began with the Bior1.5 classifier (Figure 4). Suspicious images from this were passed successively to Haar and Bior2.2 classifiers, both tuned for calcifications. The suspicious output on this leg was a bin containing all the calcifications, 2 masses and 117 normal images. There were two normal output bins on this leg; these bins contained only normal images. The normal output from the Bior1.5 classifier was passed to Bior5.5 and Haar classifiers tuned for masses. On this leg the classifiers identified all but one of the masses. The two normal bins contained no calcifications and a single mass. This configuration provided a sensitivity of 99.8%, a specificity of 78.9% and an overall classification rate of 86.8%. This configuration successfully removed 840 of the 1065 normal images from the suspicious bins. To achieve this result, the penalty was one incorrectly classified mass-containing image.
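The six-tap figures can be cross-checked from the reported counts. The sketch below assumes the DDSM totals given earlier (649 suspicious and 1065 normal images), one missed suspicious image and 840 normal images removed. The function name `screening_metrics` is hypothetical.

```python
def screening_metrics(n_susp, n_norm, missed_susp, removed_norm):
    """Summarize a pre-sorting result from raw image counts."""
    tp = n_susp - missed_susp          # suspicious images kept in the review queue
    tn = removed_norm                  # normal images removed from the queue
    remaining = tp + (n_norm - tn)     # size of the sorted review queue
    return {
        "sensitivity": tp / n_susp,
        "specificity": tn / n_norm,
        "classification_rate": (tp + tn) / (n_susp + n_norm),
        "prior_before": n_susp / (n_susp + n_norm),
        "prior_after": tp / remaining,
    }


six_tap = screening_metrics(n_susp=649, n_norm=1065, missed_susp=1, removed_norm=840)
for name, value in six_tap.items():
    print(f"{name}: {100 * value:.1f}%")
```

The computed sensitivity, specificity and classification rate agree with the values quoted for this configuration. The fraction of suspicious images in the queue rises from just under 38% to about 74%, which is consistent with the change in prior probability reported for the sorted database.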
When similar networks were evaluated on the MIAS data set, equivalent results were obtained (not shown). Here again, in the four-tap configuration calcifications were identified first, then masses. The six-tap configuration searched for masses first, then calcifications. Using this smaller dataset, the four-tap configuration achieved 100% sensitivity and 46.3% specificity with an overall classification rate of 63.7%. The six-tap network also achieved 100% sensitivity, with 65% specificity. Here the overall classification rate achieved was 76.6%.
Table: Performance of branched network classification. Columns: $C_{real}(N)$ and $C_{real}(S)$.
Table: Segmentation of mammograms containing masses from those containing calcifications. Columns: $C_{even}(Norm)$, $C_{even}(Mass)$ and $C_{even}(Calc)$.
The networks were designed conservatively; each wavelet classifier was configured for maximum sensitivity. A more aggressive design could have removed more normal images, but may have sacrificed overall sensitivity; that was not considered an acceptable tradeoff.
The classifiers developed in this paper offer a useful approach for binary classification of mammographic x-ray images. In practice, an analyst could use the WFC, tuned to a confidence threshold of their choosing, to remove or re-prioritize normal images. This pre-screening technique should improve subsequent detection of those few images showing abnormalities that merit further analysis [5, 36].
For an algorithm to be effective and optimally safe as a preliminary screening tool, it must be able to correctly identify a significant number of normal images while minimizing the number of suspicious images that are incorrectly identified as normal. That is, the algorithm must offer sensitivity higher than current clinical levels, which have been estimated to be between 75% and 90% [2, 3, 5, 34, 36–38], while offering a non-negligible specificity. Both branched networks tested in this study achieved sensitivity superior to current clinical performance.
An x-ray mammogram image analysis system was tested on two independent data sets to measure its ability to identify suspicious images that may merit further study by a human expert. The system operated in several steps: first, an image was pre-processed to reduce background noise and artifacts; second, the image was decomposed into a set of maps at different scale levels using a 2D discrete wavelet transform; third, whole-image statistical features were measured from each map and the best triplet of these features was input into naïve Bayes classifiers to determine if an image is normal or suspicious; fourth, several classifiers were chained together to calculate confidence levels from the normally hard classifiers.
Three network designs were tested here: a sequential series of classifiers, a vote-taking scheme of classifiers, and networks where individual classifiers were tuned to detect only calcifications or only masses. All of the networks were designed with sensitivity as the top priority over specificity. Because the system is designed to be a first pass for images, any abnormal images missed by the algorithm would be unlikely to be re-examined by a human expert. All the networks tested provided higher sensitivity than is typically achieved in the screening clinic. Removing a large fraction of normal images from the review queue will reduce the volume of cases that must be examined and, at least statistically, should improve detection of pathology. In the best-case scenario reported here, pre-sorting the images doubled the prior probability of disease in the sorted database.
Once sensitivity is maximized, the effectiveness of a system is governed by its specificity. Here the expert reader excels, typically achieving greater than 95% specificity. The combination of a highly sensitive pre-screening tool and an expert breast screener promises to significantly enhance the overall performance of the typical screening program.
Research supported by:
Canadian Breast Cancer Foundation - Atlantic Region operating grant to EK
Natural Sciences and Engineering Research Council studentships to MB, KC
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.