Quantification of heterogeneity observed in medical images
© Brooks and Grigsby; licensee BioMed Central Ltd. 2013
Received: 27 July 2012
Accepted: 25 January 2013
Published: 2 March 2013
Skip to main content
© Brooks and Grigsby; licensee BioMed Central Ltd. 2013
Received: 27 July 2012
Accepted: 25 January 2013
Published: 2 March 2013
There has been much recent interest in the quantification of visually evident heterogeneity within functional grayscale medical images, such as those obtained via magnetic resonance or positron emission tomography. In the case of images of cancerous tumors, variations in grayscale intensity imply variations in crucial tumor biology. Despite these considerable clinical implications, there is as yet no standardized method for measuring the heterogeneity observed via these imaging modalities.
In this work, we motivate and derive a statistical measure of image heterogeneity. This statistic measures the distance-dependent average deviation from the smoothest intensity gradation feasible. We show how this statistic may be used to automatically rank images of in vivo human tumors in order of increasing heterogeneity. We test this method against the current practice of ranking images via expert visual inspection.
We find that this statistic provides a means of heterogeneity quantification beyond that given by other statistics traditionally used for the same purpose. We demonstrate the effect of tumor shape upon our ranking method and find the method applicable to a wide variety of clinically relevant tumor images. We find that the automated heterogeneity rankings agree very closely with those performed visually by experts.
These results indicate that our automated method may be used reliably to rank, in order of increasing heterogeneity, tumor images whether or not object shape is considered to contribute to that heterogeneity. Automated heterogeneity ranking yields objective results which are more consistent than visual rankings. Reducing variability in image interpretation will enable more researchers to better study potential clinical implications of observed tumor heterogeneity.
There are hundreds of papers in the medical literature about the importance of heterogeneity within various types and degrees of cancerous tumors. In oncology parlance, “tumor heterogeneity” generally means that a whole tumor comprises distinct cellular sub-populations . These cell groups vary in morphology, histology and growth rate, for example. The interactions of different tumor cells with each other and with the microenvironment are complex and not well understood . In order to understand intra-tumor biology, potentially subtle differences near and within tumors must be quantified. The ultimate goal of tumor heterogeneity studies is to determine the implications and prognostic value of observed variations in a host of clinically relevant tumor properties such as physical size, shape, cellular density, cellular metabolism, hypoxia and vascularization. Each of these biologically interesting properties generally are assayed via an imaging modality—such as magnetic resonance (MR), computed tomography (CT) or positron emission tomography (PET)—which outputs grayscale images. Thus, the challenge of measuring biological heterogeneity manifests as the quantification of visually evident intensity variations in grayscale images.
Several attempts to objectively quantify tumor heterogeneity have been published previously [6–10]. These attempts largely have yielded mixed results. The more rudimentary measures are non-spatial. That is, they depend only upon the distribution of intensities, not the spatial arrangement of those intensities within the tumor. For example, although the standard deviation, skewness and kurtosis clearly quantify variation of an intensity distribution, skewness and kurtosis have been shown not to vary sufficiently to separate some cervical cancers into groups of differing disease outcome . Another non-spatial measure of heterogeneity which was shown to distinguish patient groups  was later demonstrated to actually be a surrogate for tumor volume . It has been claimed that the image texture metrics introduced in the seminal work of Haralick et al.  can be used as spatial measures of heterogeneity [8, 9]. However, many of those more sophisticated metrics have been shown later to be statistically irreproducible on tumor data . In general, even in cases where enough image data exists to yield consistent inter-image comparison, it remains unclear which texture metric corresponds to which subjectively perceived image quality . In short, there is no standard for measuring tumor heterogeneity in a manner consistent with expert visual inspection.
In this study, we develop and test a novel statistic which may be used to quantify spatial variations in tumors imaged in vivo which were identified and segmented via independent analysis. In other words, we quantify the spatial heterogeneity within definitively bounded objects against a uniform background. Here, we do not seek to attach any clinical or prognostic meaning whatsoever to a given image or statistical value. Instead, we seek to give the clinician a quantitative mechanism of declaring one image to be x times more heterogeneous than another, such that a set of similarly imaged objects may be ranked and compared objectively. We compute our statistic on real PET images of tumors exhibiting visually manifest variations in size, shape and intra-tumoral intensity and find that our numerical heterogeneity comparisons agree well with the subjective comparisons made by experienced experts.
Heterogeneity has application-specific connotation. Common to all applications, however, is that heterogeneity is necessarily defined by the scale of interest. For example, at scales much smaller than a single square, a chessboard image could be interpreted as homogeneous. This is due to the fact that at very small scales, neighboring pixels are much more likely to be in the same square than they are in one of differing color. At a scale on the same order of magnitude as the squares, the chessboard is more heterogenous because any pixel about one square size away from an arbitrary pixel has approximately equal chance of being either color. This is greater average variation than in the previous case of usually being confined to a single square. If the scale of interest becomes much larger than the size of a single square, the image may again be considered homogeneous since, on average, anywhere in the image, the pattern is precisely the same. When attempting to quantify the heterogeneity in an image, one must make clear the intended scale of interest.
Contrast the scale-dependence of heterogeneity to the definitiveness of homogeneity. Ultimate homogeneity is unambiguously defined as the state being the same at all scales, e.g., an image of only a single color. We therefore assume that heterogeneity may be clearly defined as difference from homogeneity. Of course, a perfectly homogeneous image contains no information and is therefore of little practical use. The real images we study will contain pixels of varying grayscale intensity. We thus seek to first find the most homogeneous intensity transition between given pixels with differing intensities. Consider two distinct image pixels where the grayscale intensity is I 1 at pixel one and I 2 at pixel two. Assume that I 1>I 2. Let r be the distance measured from pixel one to an arbitrary point along the straight line segment between pixels one and two. The gradation between pixels one and two which is as locally homogeneous as possible, for every distance r, occurs when the whole intensity change is spread evenly across the entire line, i.e, when the derivative d I/d r is constant. We consider any deviation from that smoothest possible transition to be heterogeneity.
the average absolute intensity difference along a line connecting a pair of object pixels.
For each non-repeating pair of object pixels, we compute the discrete distance L and . The result is a pool of values where any one value of L has associated with it many values of . Those multiple values are then ensemble-averaged for each L such that each discrete pair separation possible then has associated with it only a single value, . The discrete distances are normalized to the largest discrete distance ( ) observed amongst any object pixel pair. The are then plotted versus .
Because the distances plotted are relative to the size of the object, objects of greatly varying size may be compared to one another. This enables the use of ζ as a ranking statistic where increased ζ implies increased heterogeneity.
Computer code to automatically compute ζ on grayscale images was written in Python v2.6.1 ( http://www.python.org). Input images consist of an 8-bit grayscale object, where each object pixel has intensity I>0, against a uniformly black background (I=0). The images were read into the program using the Python Imaging Library v1.17 ( http://www.pythonware.com/products/pil/). The xy-coordinate at each nonzero pixel (an object pixel) along with the intensity at that pixel was recorded. For every unique pixel pair, the discrete line between the pair was calculated using Bresenham’s algorithm . Here, we do not consider the pairs directionally. Thus, once a particular line from pixel m to pixel n has been considered, we do not consider again the line from n to m. We take adjacent pixels to have zero separation and thus there can be no intensity difference from the gradation between them. For line endpoints separated by more than one pixel, the smoothest gradient possible was computed using Equation 1. For each pixel between the endpoints, the average absolute difference between the measured intensity and that given by Equation 1 was computed via Equation 2. These averages were binned together for each integer value of the discrete line length. These binned averages then were themselves averaged, resulting in a single value at each discrete line length ( ). Each integer line length was divided by the maximum integer pair-separation observed for the image object. Using Equation 3, ζ was computed via the trapezoidal rule as implemented in Numpy v1.4 ( http://www.numpy.org).
Images were obtained as described in Ref. . We briefly recapitulate the process here. Patients with cancers of the uterine cervix underwent a pre-treatment hybrid PET/CT scan using the 18F-fluorodeoxyglucose (FDG) radiotracer assay of glucose uptake by cells. As is common throughout the field of nuclear medicine, the raw FDG-PET data are scatter and attenuation corrected via the proprietary software native to the PET machine. Images were reconstructed using ordered subset expectation maximization. A Gaussian smoothing filter with 4-mm full width at half maximum was applied post-reconstruction. No additional processing was implemented. In order to objectively distinguish tumor from background, we employed the rule-of-thumb that, for a visually selected region of interest (ROI), any pixel brighter than 40% of the maximum ROI pixel brightness is to be considered part of the tumor . Any remaining objects that are obviously (for sound anatomical reasons) not tumors are removed and the final ROI is exported as an 8-bit grayscale region against a uniformly black background. Here, variations in grayscale intensity ostensibly imply variations in intra-tumoral metabolic activity . To quantify these variations, we apply the computer code earlier described to the largest cross-sectional tumor image for each patient.
The whole-body images created by our PET scanner are only 168 ×168 pixels in size. The segmented tumor region within these images ranges from only 16 to 291 total pixels with a median of 63 pixels. These regions are generally too small to be inspected visually. Computer code using the Python Imaging Library (PIL) was used to automatically extract the pre-defined tumor region and paste that at the center of a new image such that the tumor is centered within a border which is half of the diameter of the tumor. This new image was expanded to 162 ×162 pixels using the resize function of the PIL with the filter option set to nearest-neighbor. At 72 dots per inch, this corresponds to squares of side length 2.25 inches. Each image was printed and then pasted onto a separate piece of stiff card stock such that a group of images could be readily seen at once. The expert can then manipulate the cards into order of increasing image heterogeneity, i.e., visually evident intensity variations within an isolated tumor region.
In the rescaled images, very small tumor regions will appear manifestly “blockier” than very large ones. This is the typical pixelation seen whenever low-information images are expanded to greater than the original image size. The result is that a highly pixelated image may appear innately more homogeneous than a less-pixelated image simply due the relative sizes of the original pixels in the rescaled images. Furthermore, the blocky appearance of the rescaled tumor regions could give a clever physician some indication of tumor size, even without anatomical reference. Since we do not want the physicians to be biased by innate clinical knowledge regarding tumor size, we binned tumors of similar size into distinct groups which were independently inspected by the experts.
The test image sets earlier described were given to: an experienced clinical oncologist, a senior medical resident in the oncology department and a professional image processing expert with no medical background. Each expert was asked to rank each image set individually in order of increasing heterogeneity. Only the experienced oncologist, however, considered tumor shape as a heterogeneity-defining quality. The other two experts specifically ignored tumor shape and focused only upon intra-tumoral pixel variations. The computer code earlier described was used to compute an objective ranking of each image set based solely upon increasing values of the ζ statistic. The ζ-ranking for each image set was compared to each expert ranking via the Spearman rank test of statistical correlation. As described in many textbooks, a p-value associated with an upper-tailed test of significance may be obtained from the Spearman correlation coefficient. Our null hypothesis is that the computer/expert rankings are totally uncorrelated. Our alternative hypothesis is that larger ζ values tend to pair with higher expert-ranked-heterogeneity.
To explore the dependence of ζ upon object shape, we performed the following test. An 8-bit grayscale image of a circle with radius 16 pixels and origin at the image center was created. The circle was shaded via a two-dimensional Gaussian distribution such that the brightest intensity (255) is at the circle center and the dimmest intensity (128) occurs at the circle circumference. The result is a smoothly varying, symmetric “tumor” object. The object was then decimated by setting all pixels to zero within circular holes of radius 4 pixels centered at random locations within the object. As decimation is performed repeatedly upon the same object, the shape of the object becomes increasingly irregular. To test the effect of shape upon the ζ statistic, we modified the calculation given in Equation 3 to consider contributions from only the nonzero pixels along the connecting Bresenham lines and replacing the pair-separation L in Equation 2 with the number of pixels considered. This way, if the line between a pair of object pixels crosses the background, the differences of the background pixels from the smoothest gradation are ignored. This effectively renders any two background-separated pixels to be neighbors, thus closing the shape while maintaining the relative orientation of one endpoint to the other.
Automated, shape-aware ranking versus ranking by human experts
Shape aware automated ranking
Automated, shape-unware ranking versus ranking by human experts
Shape unaware automated ranking
We further investigated this sensitivity to shape as follows. Inspection of Figure 4 shows that images D5, D6, E5 and F6 each have the distinctive shape feature of a hole at the center. For these images of ring-like objects, 29≤ζ≤37. If one scans the entire image set, one finds five other images with a ζ value in this range. These images—such as B7, E6 and F5—are clearly not shaped similarly to the ring-like objects. It is thus seen that although the ring-like objects differ in ζ value by only 10% of their mean value, all ζ within this range do not correspond to similar objects. This demonstrates that while ζ is sensitive to object shape, it is not slaved to it. That is, one particular tumor shape does not uniquely correspond to a particular ζ value. This may be seen concisely in images D4–D6 where ζ D4=ζ D5, but have totally different shapes, while D5 and D6 have very similar shapes but ζ values which differ by about 10%.
Despite the ordered appearance of the rescaled images shown in Figure 4, clinically relevant tumor images vary greatly in size. This may be also seen in Tables 1 and 2 where the range of cross-sectional tumor sizes varies tenfold. Since it is the relative arrangement of pixel variations that ultimately defines heterogeneity, a robust quantifier of heterogeneity should not scale with image size. For example, doubling the size of a chessboard does not change the size of the squares, instead, there simply are more of them. Thus, the heterogeneity is the same for both sizes of chessboard. With this in mind, we explored the correlation between ζ and tumor size. Since the ζ values for the pool of images shown in Figure 4 do not follow a normal distribution, the traditional Pearson product moment is not applicable. We therefore employ the Spearman rank test of correlation and find ρ S =0.044 (p=0.39), which implies no appreciable correlation between ζ and object size. This is expected because the effective diameter of the object ( ) is divided out of Equation 3.
The values of the heterogeneity metrics are given for each of the images shown in Figure 4
The correlation between the heterogeneity rankings computed via various statistics and that done independently by the veteran oncologist was quantified via the Spearman rank test
As discussed in earlier sections, while the tumor images we study vary greatly in size relative to one another, even the largest tumor yields a fairly small number of pixel pair combinations. Thus, the isolated tumor objects are readily processed on a desktop computer. However, the computational order of ζ is where is the average distance between two pixels and N is the number of pixels. Therefore, the exhaustive pairing we employ on the relatively small tumor regions is unlikely to be feasible on images containing a large number of object pixels. For such cases, we suggest that one take as large a random subset of all possible pixel pairings as is computationally accessible and proceed to calculate ζ as before.
During computation of ζ, one must take the difference between a measured intensity and that predicted by the smoothest gradient. However, the smoothest gradient possible is determined by the number of shades available to transition between pixel pairs. Thus, ζ must depend upon image bit depth. This could be problematic when comparing image sets from different institutions which likely employ very different imaging and data storage protocols. One solution to this is the use of a common bit depth. For example, in our test images, a threshold of 40% of the maximum observed intensity was used to define the tumorous regions. Thus, only the top 60% of the grayscale range is employed to shade all intra-tumoral variations. On average, the brightest pixels corresponded to 80 kBq/mL of radioactivity (data not shown). This means that only 48 kBq/mL separate the brightest tumor regions from the dimmest. The noise associated with the FDG-PET process may be estimated from Ref.  to be ∼1 kBq/mL. Thus, only ≈48 shades of gray are required to shade our example objects (the tumors) in such a manner that differing shades correspond to genuinely differing amounts of radioactivity. A similar downscaling of image bit depth could be employed on a patient-by-patient basis. Statistics derived from these images (such as ζ) could then be meaningfully compared across patients since these variations are those common to the physics of the imaging modality, not those distinct to an arbitrary choice of bit depth.
A directional heterogeneity quantifier is desirable for cases where the grain of the image has physical meaning. This could occur, for example, for stained histology slides  or for magnetic resonance images of tumors . We suggest that the vectorization of ζ may be accomplished as follows. Again, we begin with a pair of arbitrarily labeled object pixels at coordinates (x m ,y m ) and (x n ,y n ). Instead of following the Bresenham line directly from pixel m to pixel n, one first could compute Δ I x along a purely horizontal direction—say from (x m ,y m ) to (x n ,y m )—then compute a purely vertical contribution Δ I y from (x n ,y m ) to (x n ,y n ). We note that does not equal the Δ I computed directly between (x m ,y m ) and (x n ,y n ), as was described previously. For both Δ I x and Δ I y , the ensemble-averaged, average absolute difference versus relative length curves may be constructed and corresponding ζ computed independently for each direction.
Consider, for example, an image consisting of uniquely shaded horizontal stripes. The directional ζ x =0 since, for any pixel m, the intensity at all horizontal distances from m is identical to that of m. That same image, however, will have a ζ y >0 since any vertical line must cross a stripe boundary at some distance away from m.
Two important caveats accompany the computation of directional heterogeneity quantifiers within an image. First, the choice of moving horizontally then vertically is completely arbitrary. As one proceeds horizontally from a given pixel, one encounters different intensity values than would be encountered had one first proceeded vertically. This ambiguity also arises when considering the order in which pixels are paired. Proceeding from the top-left of the image, the horizontal direction from m to n is defined by the upper pixel while pairing from the bottom-right of the image, it is defined by the lower pixel. Thus, even decreeing that one always proceeds horizontally first does not guarantee consistent directional ζ. It is for this reason that we chose to employ a purely scalar ζ; one which is unambiguously defined by the pixel pairs. We suggest that anyone reporting directional ζ also clearly report and motivate their particular procedural choices.
The second concern lies in the fact that there is no reason to presume that directional components within the image have physical meaning within the image. Therefore, what is readily computed as horizontal in the image data may not be horizontal in the real object. For example, there is no guarantee that the stripes of some striped object align with the vertical or horizontal directions of the image. Therefore, before directional quantifiers may be compared across images, some common reference frame must be implemented. We suggest that one such frame is that determined by the quadrupole moment of the image object. The origin of such a frame is given by the object dipole. The object may then be translated such that the dipole is at the center of the image then rotated such that the strongest quadrupole eigenvector aligns with (arbitrarily) the x-axis. This way, directional variations are measured relative to objectively determined reference frames.
We present two dimensional (2D) images test images because we were interested specifically in measuring correspondence with human experts. Two dimensional tumor images pasted onto equally-sized cards served as a ready means of presenting and manipulating the images into subjective order. Of course, an extension of ζ into three dimensions (3D) may be desired for a given clinical application. The Bresenham algorithm is extensible into 3D  and is available in many commercial software packages currently in common use. Using those 3D line data, ζ may be computed as described for the 2D case. A second means of extension into 3D might be as follows. Instead of drawing the Bresenham line along discrete pixels, a Euclidean line may be drawn directly to pixel pairs. The intensity value anywhere along that line may be determined by averaging and/or interpolation of the neighboring pixels. This gives I m and I n in Equation 1. The absolute intensity difference seen in Equation 2 may then be averaged along the Euclidean line using arbitrary intervals. That average may then be used in Equation 3 to compute ζ as was done in the 2D case. While we expect that a 3D ζ thus computed can be used to objectively rank 3D objects in order of increasing heterogeneity, the merits of one computational method over another are not clear and are a subject for further study. Additionally, if a comparison to expert opinion is to be done for 3D virtual objects, the rendering technique will also have to be scrutinized. For example, perception of 3D virtual objects varies greatly from person-to-person. Also, the construction technique—e.g., brightest intensity versus average intensity along a given line of sight—adds variation to the human-computer comparison that likely requires many repeated trials to overcome. In the present work, we sought only an introduction to our automated ranking technique and purposely avoided the complications involved with extension into three dimensions.
There is increasing interest in the role of heterogeneity within the tumor microenvironment as an indicator of disease prognosis . Therefore, one application is to investigate whether the ζ measure has any prognostic value. Cancer patients where the treatment response and/or long-term survival is known a priori could be ranked objectively via ζ heterogeneity. This could be done for some subset of two-dimensional images (such as the largest tumor cross-sections) or for three-dimensional, whole tumor, virtual objects as described above. The correspondence between heterogeneity and disease outcome could then be checked via rank testing, survival analysis or a Bayesian test of predictiveness.
Another interesting application of ζ lies in the association of a heterogeneity score with intra-tumoral diffusion. Diffusion tensor imaging (DTI) is a magnetic resonance modality which can measure the directional diffusion of water within a cancerous tumor [19, 20]. A three dimensional tumor is imaged via two-dimensional cross sections. Each image pixel then corresponds to a three-dimensional vector representing the flow of water at the coordinate of the pixel. Unlike FDG-PET imaging—where the relative contributions of intra-cellular metabolism and inter-cellular density and arrangement are not known—the images resulting from DTI result directly from the tumor microenvironment . Thus, anisotropies in the diffusion imply variations in the initiation and maintenance of tumor growth. With this in mind, it would be very interesting to investigate the relation between the bulk intra-tumoral heterogeneity and diffusion anisotropy. One means of doing this is to parse the DT images into three sets: one for each vector component of the diffusion. For inter-patient comparisons, these directions could be embedded directions relative to the tumor itself, for example, the same principal eigenvectors from which the fractional anisotropy is commonly computed . The result is three sets of three-dimensional coordinates with a grayscale value at each coordinate which represents the magnitude of diffusion in a given direction at that coordinate. A three-dimensional ζ may be computed via Equation 3 for each set of coordinates. Those ζ values, each between zero and unity, may then be combined to form a single vector, . This is very different from the possible vectorization discussed earlier. There, arbitrary directions were imposed upon an image to create a vector quantity within that image. Here, each image set represents diffusion measured in a distinct physical direction. Thus, is a vector where each component measures the heterogeneity of an entire three-dimensional image set, and that set measures diffusion in only one physical direction. If one then maps onto an RGB colorspace, the bulk heterogeneity (which could be indicative of directional diffusion) throughout the tumor bulk may then be reported as a single color. For example, it could be the case that tumors with predominately trans-axial diffusion are less treatable than those with predominately planar diffusion. These differing diffusion scenarios will yield different -colors since two of the similar magnitude can still point in different colorspace directions. The -color is one objective means of quantifying visually perceived image heterogeneity and relating it to a directional, physical quantity which itself feasibly relates to tumor growth. Thus, is another means of studying the relation between image heterogeneity and clinical prognosis.
We have demonstrated a method for automatically ranking grayscale medical images in order of increasing heterogeneity. We have done this not only in a fashion where shape is considered to contribute to overall object heterogeneity, but also under conditions where shape is ignored. In both cases, the automatic ranking is found to agree very well with the rankings done visually by experts. The example images we analyze, specifically, those of a grayscale object against a uniformly black background, are precisely the type of images available after a clinician delineates tumors via standard image segmentation techniques. For these example cases, our shape-sensitive ranking statistic was shown to yield heterogeneity rankings which almost perfectly parallel those given by a veteran radiation oncologist. Automated ranking via heterogeneity offers a new avenue for objectively studying the clinically crucial relation between disease outcome and some tumor properties observable in the images obtained at diagnosis.
Written informed consent for use of the data in this report was waived by the Washington University Institutional Review Board.
aWe note for clarity that the “image energy” and “local homogeneity” monikers employed by some authors actually refer to the “angular second moment” and “inverse difference moment”, respectively, as originally given in Ref. .bWe note again that this is the trans-axial direction relative to the tumor—as determined by the three-dimensional principal components of the image set—which is not necessarily the trans-axial direction of the clinical imaging process.
We thank Bruce E. Davis for illuminating discussions and careful review of the manuscript. We thank Richard Laforest for providing technical advice about the PET process. We thank Jeffery Olsen, M.D. and Shalini Priti, M.Sc. for their expert assistance in ranking the test images. We also thank the reviewers for providing some specific, extraordinarily helpful comments. This work was supported by the National Institutes of Health under Grant 1R01-CA136931-01A2.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.