Breast cancer is the most common form of cancer among Canadian women and is second only to lung cancer in mortality [1–3]. Women in higher-risk groups are encouraged to receive a screening x-ray mammogram every two years, with additional screening for very high-risk patients, such as those with a familial history or genetic predisposition.
Treatment efficacy is linked to early detection of tumors. The challenge in x-ray mammography is that features associated with pathology may be either obvious or only subtly represented in the image. For example, micro-calcifications sometimes signal the presence of cancer. Because calcium absorbs x-ray photons relatively strongly, they appear as small bright regions in the mammogram and are readily detected by both CAD systems and human reviewers [4–8]. Masses, on the other hand, are evident in an x-ray only if their density differs from that of the surrounding tissue, which is often not the case. Masses may have almost any size, shape or structure [4, 7, 9–17]. Occasionally, masses are evident only through the deformation they induce in adjacent tissue. These architectural distortions are difficult to detect, thereby limiting the sensitivity of the screening procedure [18].
In response to these challenges, a range of software tools has been developed to help radiologists recognize subtle abnormalities in mammograms [7, 19–23]. These tools typically follow a second-reader model: the radiologist first examines the raw image and notes suspicious regions [24]. The tool then processes the image, marking potentially suspicious regions, and the two sets of findings are compared.
Such systems have a significant drawback: they tend to have low specificity and so require nearly every image to be examined twice, once unaided and then again to compare with the regions marked as suspicious by the software [25]. This is impractical for screening mammography, where fewer than 1% of the images will show tumors. In that setting, the unintended consequence of CAD search routines is an increase in the time required to report normal findings. In addition, increasing the number of prompts for review does not appear to guarantee an increase in accuracy [25].
Here we report the performance of a wavelet-map feature classifier (WFC) designed as a pre-sorting tool. The WFC identifies and removes normal images from the radiologist's review queue, leaving images with a higher probability of showing abnormalities. For this technique to be optimally safe, the algorithm must perform at high sensitivity, detecting all or nearly all abnormalities; for it to be effective, it must have sufficient specificity to remove enough normal images to usefully increase the relative frequency of suspicious images in the resulting queue.
The pre-screening algorithm was developed using the publicly available Digital Database for Screening Mammography (DDSM) (http://marathon.csee.usf.edu/Mammography/Database.html) [26–28]. A smaller, unrelated database from the Mammographic Image Analysis Society (MIAS) (http://peipa.essex.ac.uk/info/mias.html) [29] provided a confidence check that the algorithm was not over-specified. Together these data provided a useful proving ground for testing various incarnations of the algorithm.
The DDSM data set consisted of 1714 images, 1065 of which were classified as normal, in that they showed no abnormalities. The other 649 images showed some type of abnormality that would merit further study. These included: 119 benign calcifications, 120 cancerous calcifications, 213 benign masses and 197 cancerous masses. There was a range of tissue composition and breast size in the DDSM data set, making it representative of the variety of images that may be seen in a clinical setting.
The MIAS data set contained 303 images. There were 205 normal images and 98 images that showed some type of abnormality: 11 benign calcifications, 12 cancer-associated calcifications, 38 benign masses, 18 cancerous masses, and 19 architectural distortions (9 from benign masses and 10 from cancerous masses that were not directly visible). Again, the images came from a wide variety of patients, so breast size and tissue composition varied considerably.
The wavelet filter classifier process
The wavelet filter classifier (WFC) proceeds in several discrete stages: regularizing the raw digital x-ray image, transforming it to produce scale maps, extracting features from the maps, classifying the features, and generating as output the probability that the image contains an abnormality.
Regularizing the images
Digital mammograms were pre-processed [30] to reduce non-pathological variations between images, such as background noise, artifacts, and tissue orientation. All images were rescaled to 200 micron pixel resolution, and were padded or cropped to be 1024 × 1024 pixels, or 20.48 × 20.48 cm. The analysis presented here was restricted to medial-lateral views and the presentation of both breasts was adjusted to a single orientation.
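As an illustration only (the authors' implementation was written in Matlab and later C++; this Python sketch and its function name are assumptions), the padding/cropping step can be realized by placing the already-resampled image onto a fixed 1024 × 1024 canvas:

import numpy as np

def pad_or_crop(img, size=1024):
    # Pad with zeros or crop to a size x size canvas, keeping the centre.
    # Assumes img has already been resampled to 200 micron pixels; the
    # authors' exact padding convention (e.g. keeping the chest wall at one
    # edge) may differ from the centred placement used here.
    out = np.zeros((size, size), dtype=img.dtype)
    h, w = img.shape
    src_y = slice(max((h - size) // 2, 0), max((h - size) // 2, 0) + min(h, size))
    src_x = slice(max((w - size) // 2, 0), max((w - size) // 2, 0) + min(w, size))
    dst_y = slice(max((size - h) // 2, 0), max((size - h) // 2, 0) + min(h, size))
    dst_x = slice(max((size - w) // 2, 0), max((size - w) // 2, 0) + min(w, size))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out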
The DDSM (and MIAS) mammograms were scanned from film. As a result, they contained labels, noise and other artifacts that are not present in direct digital images. These artifacts were removed using a threshold-and-segmentation procedure. Otsu's method [30] was used to determine the optimal pixel-intensity threshold for distinguishing background from foreground (tissue) pixels. The segmented non-tissue regions were set to zero without changing pixel values within the tissue region, and the processed images were rescaled to maximum pixel intensity.
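A minimal sketch of this step, assuming Otsu thresholding followed by retention of the largest connected foreground region (the use of scikit-image and the largest-component rule are assumptions, not the authors' stated segmentation procedure):

import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label

def mask_non_tissue(img):
    # Otsu's method separates dark background from bright tissue/label pixels.
    t = threshold_otsu(img)
    regions = label(img > t)
    # Keep only the largest connected foreground region (the breast); film
    # labels and other artifacts form smaller regions and are zeroed out.
    counts = np.bincount(regions.ravel())
    largest = counts[1:].argmax() + 1
    cleaned = np.where(regions == largest, img, 0.0)
    # Rescale so the maximum pixel intensity is 1.0, as described above.
    return cleaned / cleaned.max()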
Wavelet decomposition
The normalized mammograms were decomposed using the 2D discrete wavelet transformation. This filtering process created four outputs: horizontal, vertical and diagonal detail maps from the high-pass component, and an approximation map from the low-pass component applied vertically and horizontally. After each filter pass, every second output pixel was kept. Figure 1 provides an example image and its feature maps generated at the second scale level using the Daubechies 2 wavelet. The high-pass component produces scale maps at half the resolution of the parent image; they further emphasize features according to how the image is sampled (horizontal, diagonal, vertical). The approximation image is the low-pass component and forms the input for the next scale transformation.
Eight decomposition levels were created serially by applying the transformation to the previous approximation map to create four more maps. Since each approximation map had half the resolution of its input, each successive level sampled structures twice as large as the previous one. The set of all maps derived from a single original image formed a decomposition tree. The highest levels of the tree had the highest resolution and were most sensitive to structures with small spatial extent, while the lowest levels had the lowest resolution and were most sensitive to structures with large spatial extent.
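The decomposition tree can be sketched with PyWavelets, which mirrors the Matlab Wavelet Toolbox calls described below; the Daubechies 2 basis and the function name are used here purely for illustration:

import pywt

def decomposition_tree(img, wavelet="db2", levels=8):
    # Each pass of pywt.dwt2 returns the low-pass approximation map and the
    # horizontal, vertical and diagonal detail maps, each at half resolution.
    # The approximation map is fed back in as the input to the next level.
    tree = []
    approx = img
    for level in range(1, levels + 1):
        approx, (horiz, vert, diag) = pywt.dwt2(approx, wavelet)
        tree.append({"level": level, "horizontal": horiz,
                     "vertical": vert, "diagonal": diag, "approx": approx})
    return tree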
Many wavelet bases are available, each with unique sampling characteristics. Several, including the Biorthogonal, Daubechies and Haar families, appeared promising for detecting subsets of the broad range of shapes and intensity gradients potentially associated with pathology [4, 31–33]. Eleven wavelets were selected from these families: Haar, Db2, Db4, Db8, Bior1.5, Bior2.2, Bior2.8, Bior3.7, Bior4.4, Bior5.5 and Bior6.8. The Haar wavelet is a square function that usefully interrogates sharp discontinuities. The other wavelets are more complex, and the notation suggests some of their features. For example, Db2 (Daubechies 2) is an orthogonal function that samples polynomials with constant and linear scaling regions. Bior1.5 describes a bi-orthogonal fifth-order sampling function that requires a first-order reconstruction algorithm.
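For reference, the eleven bases can be enumerated via their PyWavelets identifiers; the naming equivalence (e.g. Db2 → "db2") is an assumption of this illustration:

import pywt

BASES = ["haar", "db2", "db4", "db8",
         "bior1.5", "bior2.2", "bior2.8", "bior3.7",
         "bior4.4", "bior5.5", "bior6.8"]

for name in BASES:
    w = pywt.Wavelet(name)
    # dec_len is the length of the decomposition filter for each basis
    print(name, w.family_name, w.dec_len)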
The decompositions were initially performed in Matlab using the Wavelet Toolbox and later ported to C++ to improve computational efficiency. Statistical moments of the map pixel intensities, described below, formed the input features for classification.
Feature extraction
Four whole-image statistical features, the mean, standard deviation, skewness and kurtosis of pixel intensity, were computed for each of the four wavelet maps at each of the eight decomposition levels. This produced 132 scalar features for each of the eleven wavelet bases applied to an x-ray image. Classification trials were restricted to combinations of one, two or three features to avoid over-specifying the final classifier to the training set. Every combination of one, two or three features from the 132-member set was tested for every wavelet basis, and the feature sets with the highest sensitivity for finding the images with known abnormalities were selected.
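A sketch of the feature computation for one basis, using the decomposition_tree helper above and scipy.stats; whether the authors used Pearson or excess kurtosis is not stated, so the Pearson form is assumed here:

import numpy as np
from scipy.stats import skew, kurtosis

def whole_map_features(tree):
    # Mean, standard deviation, skewness and kurtosis of the pixel
    # intensities of every map at every decomposition level, flattened
    # into one feature vector.
    feats = []
    for node in tree:
        for key in ("horizontal", "vertical", "diagonal", "approx"):
            pixels = node[key].ravel()
            feats.extend([pixels.mean(),
                          pixels.std(),
                          skew(pixels),
                          kurtosis(pixels, fisher=False)])  # fourth standardized moment
    return np.asarray(feats)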
Mean and standard deviation are familiar metrics; skewness and kurtosis are less so. The skewness value provides a measure of the asymmetry of a data distribution. Thus, the presence of a small number of unusually dark or bright pixels may alter the skewness even when the mean and standard deviation are not significantly affected. Here the skewness value may be sensitive to the representation of microcalcifications in an image: while these are only a few pixels in size, they are unusually bright. Similarly, skewness may report the presence of bright (dense) masses. Since skewness measures the imbalance between the parts of the distribution above and below the mean, the presence of a dense mass will raise the skewness value relative to that of a normal image.
Kurtosis reports the sharpness of the central peak of a distribution. Since it depends on the fourth power of the difference from the mean, it is highly sensitive to the addition of distant-valued points. Here, increasing numbers of bright microcalcification-containing pixels may be expected to raise kurtosis values. Interestingly, in some cases the kurtosis measure also detected masses. The post hoc rationale developed was that when masses appear brighter than normal stromal tissue, they produce additional structure in the wavelet maps at several scales; adding intensity to normally dark pixels altered the kurtosis value enough to distinguish it from the normal range. Of course, for any feature, selecting wavelet bases and scale levels that correlate well with the shape of the anomaly was expected to provide the best discrimination; this was examined using the eleven wavelet bases.
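For reference, the standardized-moment definitions underlying these two features are the standard ones (whether 3 was subtracted to give excess kurtosis is not stated in the text):

skewness = [ (1/N) Σᵢ (xᵢ − μ)³ ] / σ³
kurtosis = [ (1/N) Σᵢ (xᵢ − μ)⁴ ] / σ⁴

where the xᵢ are the pixel intensities of a map, μ is their mean, σ is their standard deviation and N is the number of pixels in the map.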
Selecting a subset of the candidate features added flexibility to the design of each individual classifier: for example, one classifier could use a feature subset sensitive to micro-calcifications while another could use a feature subset sensitive to masses.
Each classifier was limited to one wavelet basis and two of the four types of parameters generated from the maps, reducing the feature pool to 64. These features were then searched exhaustively to select the most effective combination.
The performance metric used to select the most effective feature subset was a weighted sum of the number of true positive classifications, N_TP, and the number of true negative classifications, N_TN. This score S was calculated as:

S = w·N_TP + (1 − w)·N_TN,
where w is a weighting factor that varies between zero and one. A high weighting factor favors a more sensitive classifier, while a low weighting factor favors a more specific classifier. Since normal images were not subject to further analysis in this work, the true positive fraction was maximized by using a weighting factor of 0.995. When two feature subsets produced the same number of true positives, the subset with the higher true negative fraction was selected.
The individual classifiers could also be designed to maximize detection of a specific abnormality (e.g. masses). To search for a single abnormality, N_TP was replaced with the number of correctly classified images containing the specified abnormality, and N_TN was replaced with the number of correctly classified images of all other types. To ensure that other abnormalities were not missed by the complete system, the outputs of the individual classifiers were combined.
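A sketch of the exhaustive subset search using the score defined above; the evaluate() callback, which must return the true-positive and true-negative counts for a candidate subset, is a placeholder rather than part of the original method:

from itertools import combinations

def score(n_tp, n_tn, w=0.995):
    # Weighted selection score S = w*N_TP + (1 - w)*N_TN
    return w * n_tp + (1.0 - w) * n_tn

def best_feature_subset(evaluate, n_features, max_size=3, w=0.995):
    # Exhaustively test every combination of 1, 2 or 3 features; ties in S
    # are broken in favour of the subset with more true negatives.
    best_key, best_subset = None, None
    for size in range(1, max_size + 1):
        for subset in combinations(range(n_features), size):
            n_tp, n_tn = evaluate(subset)
            key = (score(n_tp, n_tn, w), n_tn)
            if best_key is None or key > best_key:
                best_key, best_subset = key, subset
    return best_subset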
Classification using feature subset and naïve Bayesian classifiers
The goal was to assist a reviewing physician in making an informed decision when selecting images for further study. To do this, the classification scheme must provide a measure of the confidence that an image contains an abnormality. Since a single naïve Bayesian classifier does not generate a confidence measure, a network of naïve Bayesian classifiers was constructed, and the network's performance in classifying known images was used to calculate a classification confidence statistic. Training and testing used the leave-one-out cross-validation approach: all but one of the samples were used to train the classifier, which was then tested on the lone remaining sample. The overall performance of the classifier was measured by averaging the classification results obtained when each sample in the data set in turn served as the test sample.
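A minimal sketch of one such classifier evaluated with leave-one-out cross-validation, assuming a Gaussian naïve Bayes model from scikit-learn (the likelihood model used by the authors is not specified):

import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.naive_bayes import GaussianNB

def leave_one_out_counts(features, labels):
    # features: (n_images, n_selected_features); labels: 1 = suspicious, 0 = normal.
    # Returns the true-positive and true-negative counts accumulated over folds.
    n_tp = n_tn = 0
    for train_idx, test_idx in LeaveOneOut().split(features):
        clf = GaussianNB().fit(features[train_idx], labels[train_idx])
        pred = int(clf.predict(features[test_idx])[0])
        truth = int(labels[test_idx][0])
        n_tp += int(pred == 1 and truth == 1)
        n_tn += int(pred == 0 and truth == 0)
    return n_tp, n_tn

A function of this form could also serve as the evaluate() callback in the subset search sketched earlier.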
In all cases the selected scalar features calculated from an image’s wavelet maps formed the inputs. The network of classifiers was constructed by passing the normal and suspicious output images from one classifier into additional classifiers for further analysis; several network configurations were evaluated.
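One possible two-stage configuration, shown only to make the routing concrete (the actual topologies evaluated are not reproduced here), passes images labelled normal by a mass-oriented classifier on to a calcification-oriented classifier, so that every image ends in one of three output bins:

def two_stage_network(clf_mass, clf_calc, mass_features, calc_features):
    # Images the first classifier flags as suspicious stop there; the rest
    # are re-examined by the second classifier using its own feature subset.
    # The resulting bin populations feed the confidence calculation below.
    bins = {"suspicious_by_mass": [], "suspicious_by_calc": [], "normal": []}
    for i in range(len(mass_features)):
        if clf_mass.predict(mass_features[i:i + 1])[0] == 1:
            bins["suspicious_by_mass"].append(i)
        elif clf_calc.predict(calc_features[i:i + 1])[0] == 1:
            bins["suspicious_by_calc"].append(i)
        else:
            bins["normal"].append(i)
    return bins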
Determining classification confidence from classifier network
The predicted confidence levels for a realistic distribution of normal and suspicious images were inferred from the results obtained with a small data set, after correcting for its inherent bias. In the DDSM data set [26–28], for example, 649 of the 1714 images were abnormal, a higher relative frequency than is typically found in a screening clinic (roughly 1 in 20) [34]. To correct for this, the relative probabilities that a given input image i was normal or suspicious, P_i(N) and P_i(S) respectively, were rescaled.
In the following discussion, lower case n and s refer to images classified as normal or suspicious (i.e. the bin an image is assigned to), while upper case N and S refer to images that are actually normal or suspicious. If the number of normal images experimentally counted in a normal bin is η_exp(n,N), then the expected fraction of all images from a realistic distribution that are normal and fall in the same bin, F_real(n,N), was calculated as:

F_real(n,N) = [η_exp(n,N) / T_exp(N)] · P_real(N),

where P_real(N) is the probability that an image from the realistic distribution is normal and T_exp(N) is the total number of normal images in the experimental data set.
The realistic fraction of suspicious images in a normal bin, F_real(n,S), was found in the same way from the experimentally counted number of suspicious images in that bin, η_exp(n,S).
The predicted confidence level that an image from a realistic distribution placed in a given normal bin is in fact normal, C_real(N), was calculated for each bin from the counts measured with the small data set:

C_real(N) = α·η_exp(n,N) / [α·η_exp(n,N) + η_exp(n,S)],

where α is a constant defined by:

α = [P_real(N) / P_real(S)] · [T_exp(S) / T_exp(N)].

α characterizes the relative frequencies of normal and suspicious images in the experimental data set and in a realistic data set. For the DDSM data set [26–28], with 649 suspicious and 1065 normal images, and for a clinic where 1 in 20 images is suspicious, α = 11.57. For the MIAS data set, with 98 suspicious and 205 normal images and the same clinical prevalence, α = 9.08. A similar argument was used to calculate the confidence level C_real(S) that an image from a realistic distribution placed in a given suspicious bin is in fact suspicious.
Confidence levels were calculated for the various classifier networks by counting the number of normal and suspicious images assigned to each output bin of the classifier network and using the value of α appropriate for the data set in question.
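As a check on the quoted values and the bin-level calculation, the following sketch evaluates α and C_real(N) from the formulas above for a hypothetical bin; the bin counts of 400 normal and 5 suspicious images are invented for illustration:

def alpha(p_suspicious_real, t_exp_normal, t_exp_suspicious):
    # alpha = [P_real(N) / P_real(S)] * [T_exp(S) / T_exp(N)]
    return ((1.0 - p_suspicious_real) / p_suspicious_real) * (t_exp_suspicious / t_exp_normal)

def confidence_normal(n_normal_in_bin, n_suspicious_in_bin, a):
    # C_real(N) = a * eta_exp(n,N) / (a * eta_exp(n,N) + eta_exp(n,S))
    return a * n_normal_in_bin / (a * n_normal_in_bin + n_suspicious_in_bin)

a_ddsm = alpha(1.0 / 20.0, 1065, 649)    # ~11.6, matching the DDSM value above
a_mias = alpha(1.0 / 20.0, 205, 98)      # ~9.1, matching the MIAS value above
print(confidence_normal(400, 5, a_ddsm)) # ~0.999 for this hypothetical bin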
The relatively low number of suspicious images encountered in practice dominates the realistic confidence levels and gives every bin a large confidence of containing normal images. To facilitate comparisons between features, the theoretical case in which an image has an equal chance of being normal or suspicious was also calculated. Thus C_real(N) gave the realistic likelihood that an image in a bin was normal, while C_even(N) was useful for comparing the relative confidence levels of different bins when deciding which images are most likely normal. The mapping between the two is monotonic, so bin ranking is the same using either the realistic or the equal-chance measure.
In summary, images were subjected to wavelet decomposition using a variety of bases, producing 32 scale maps per basis per image. Statistical moments (mean, standard deviation, skewness and kurtosis) were calculated for each of the maps, resulting in a total of 132 features per image per basis. A naïve Bayesian classifier using leave-one-out cross-validation was used to segregate the images into two groups, normal or suspicious. To enhance classification accuracy, combinations of up to three features were evaluated. Where classifier networks were employed, a confidence level for the final classification was calculated.