An automatic restoration framework based on GPU-accelerated collateral filtering in brain MR images

Background Image restoration is one of the fundamental and essential tasks within image processing. In medical imaging, developing an effective algorithm that can automatically remove random noise in brain magnetic resonance (MR) images is challenging. The collateral filter has been shown a more powerful algorithm than many existing methods. However, the computation of the collateral filter is more time-consuming and the selection of the filter parameters is also laborious. This paper proposes an automatic noise removal system based on the accelerated collateral filter for brain MR images. Methods To solve these problems, we first accelerated the collateral filter with parallel computing using the graphics processing unit (GPU) architecture. We adopted the compute unified device architecture (CUDA), an application programming interface for the GPU by NVIDIA, to hasten the computation. Subsequently, the optimal filter parameters were selected and the automation was achieved by artificial neural networks. Specifically, an artificial neural network system associated with image feature analysis was adopted to establish the automatic image restoration framework. The best feature combination was selected by the paired t-test and the sequential forward floating selection (SFFS) methods. Results Experimental results indicated that not only did the proposed automatic image restoration algorithm perform dramatically faster than the traditional collateral filter, but it also effectively removed the noise in a wide variety of brain MR images. A speed up gain of 34 was attained to process an MR image, which completed within 0.1 s. Representative illustrations of brain tumor images demonstrated the capability of identifying lesion boundaries, which outperformed many existing methods. Conclusions We believe that our accelerated and automated restoration framework is promising for achieving robust filtering in many brain MR image restoration applications. Electronic supplementary material The online version of this article (10.1186/s12880-019-0305-9) contains supplementary material, which is available to authorized users.


Background
Since its invention in the 1970s, magnetic resonance imaging (MRI) has been an important imaging technique for noninvasive diagnosis of the human body. The applications of MRI in the brain result in diversified images for further processes such as tissue classification, segmentation and registration [1][2][3]. However, the random noise in MRI scanners inevitably causes deterioration of brain MR images [4]. Consequently, noise removal has become a crucial issue in brain MR image post-processing [5]. The Gaussian filter has been extensively adopted in many image processing applications for its simplicity. However, anatomical boundaries are inevitably blurred by this lowpass filter. In addition to the Gaussian filter, one of the well-known techniques for noise removal has been the bilateral filter, which was originally proposed by Tomasi and Manduci [6]. It has been widely utilized in many medical image denoising applications for decades. The bilateral filter is an effective filtering algorithm that can both remove the random noise and preserve edges rather than blurring the lines in images. An iterative version of the bilateral filter for MR image restoration was recently introduced [7].
One remarkable technique of using statistics strategies is the linear minimum mean squared error (LMMSE) estimator [8], which computes the local mean, the local variance, and the local mean square value of the input image. Another famous method is the anisotropic diffusion filter (ADF) [9], in which pixel intensities are averaged from neighbors in a prescribed window, whose dimension and shape are measured at every location. An appropriate function of the image gradient is constructed in accordance with the diffusion coefficient to encourage filtering within a region of interest in preference not to filtering across the boundaries. It is noted that the quality of the denoised image is greatly relevant to the number of iterations. Subsequently, Ferrari [10] proposed a mathematical framework to automatically determine the best number of iterations of the ADF method by utilizing the maximization of some evaluation index. Nevertheless, it has been adopted in many medical image restoration applications [11].
In contrast to many methods that mainly reply on local pixels within a small neighbourhood, the non-local means (NLM) filter detects repeated structures in the global domain [12]. The similarity between pixels is region-based comparison in that pixels far from the kernel center are not penalized due to the distance to the center. Based on an enhanced sparse representation in the transform domain, the block-matching and 3D filtering (BM3D) strategy [13] extended the nonlocal filtering techniques. The enhancement of the sparsity is accomplished by aggregating similar 2D fragments of the image into 3D arrays. Subsequently, the Wiener filter is employed in the collaborative filtering procedure to remove noise.
Motivated by the bilateral filter, the collateral filter [14] was recently proposed that introduced a median filter and an entropy function to the filtering framework. The median filter compensated the Gaussian filter for random noise reduction. It has been shown that the collateral filter is more powerful than the Gaussian filtering, bilateral filtering, and ADF methods in many scenarios [14]. However, comparing to these two filters, the collateral filter is more complicated and time consuming. A contemporary technique called the tensor processing unit (TPU) has shown its excellent computation and energy efficiency over existing frameworks [15]. However, this domain specific architecture is exclusively deployed in Google data centers. One realistic approach to speed up the denoising procedure is through the parallel programming using the compute unified device architecture (CUDA) [16,17] with the graphics processing unit (GPU).
A GPU is a specialized processor that accelerates graphic operations on personal computers, workstations, and mobile devices. The GPU has more processor cores than a traditional central processing unit (CPU) and each core can process data simultaneously. Therefore, the GPU is well suited to handle tasks that can be parallelized. For example, GPU-based acceleration has been shown effective in segmentation and classification applications [18,19]. Additionally, CUDA is a programming platform developed by NVIDIA for general purpose computation on GPU (GPGPU) on NVIDIA GPUs [20]. Many of NVIDIA graphics cards support CUDA (e.g., GeForce, Quadro, and Tesla). With CUDA, programmers can manage memory on the CPU as well as the GPU from the host. Once memory spaces on both host and device sides are allocated and data are transferred, the program can launch kernels for execution on the device.
A kernel is a function that can be called from the host and executed on the device. Programmers can design the code in the kernel function for parallel computing. The kernel function defines the operations for a single thread and thousands of threads execute the function code synchronously. A thread is the basic execution unit for parallel computing. After the thread completes its task in the kernel function, the data are transferred back to the host.
Based on the GPU architecture, one pixel in an image can be handled by one thread and thousands of threads can run synchronously. Each thread can access data from different types of memories such as local, global and shared memory. While local memory belongs to one specific thread, global memory is shared by all threads. Indeed, local memory is part of global memory and both is slower than shared memory. Threads are organized in blocks and a grid is a group of blocks. Each block has its shared memory that can be accessed by its own threads. Although shared memory's size is smaller than global memory, the access time of shared memory is faster like cache [21].
The objective of this paper is two-fold. First, to accelerate the computation of the collateral filter through the adoption of the CUDA strategy. Second, to automate the collateral filtering process by the aid of image texture features associated with artificial neural networks [22]. We will show that a fast and automatic denoising system based on CUDA-based collateral filtering provides robust and accurate restoration results over many existing denoising methods.

CUDA-based collateral filtering
As described earlier, the collateral filter is a more effective denoising method than the bilateral filter in that it contains three different Gaussian functions [14]. Let (θ x , θ y ) be the pixel location under consideration in the filtering image and Ψ θ x ;θ y be the neighborhood of (θ x , θ y ): The collateral filtering of image I at location(θ x , θ y ) is finally computed using where β is adopted to adjust the weight between the input image I and the median filtered image I M . A number of techniques have been proposed to accelerate image filter computation. Some researchers used approximation methods to reduce the computation time [23]. A straightforward approach is to take advantage of an additional memory space. Inspired by the strategy in [24], we introduce a new scheme not only saving computation time but also achieving the same restoration results as using the traditional collateral filter.
In the traditional collateral filter, the spatial component is repeatedly computed for every pixel, which takes a lot of time in computing the same pixel distances. Therefore, we redefine (2) as where d x and d y are the spatial distances with 0 ≤ d x ≤ N and 0 ≤ d y ≤ N. In our approach, we create a spatial weight buffer WSB, which is defined as We then compute all the spatial weights and store them in this memory buffer WSB according to the filter radius N. Let WSB(x, y) be an element of WSB, (2) can be expressed as WSB(μ x − θ x , μ y − θ y ).
Subsequently, we implement the complex parts of the collateral filter using CUDA shared memory, which is an efficient way to reduce the reading and writing time comparing to the usage of global memory. We split the image pixels into separate blocks, each denoted as block(i, j), where i = 1, 2, …, k and j = 1, 2, …, k. For entropy computation, each pixel and its neighbors are required to sum up as shown in (5). To hasten the computation, the entropy of the pixel at (x, y) is computed and stored in the shared memory buffer in advance using and S × S represents the block size. Once HPB i, j is created, the entropy is defined as: where (δ x , δ y ) represents a thread in block(i, j) with 1 ≤ δ x ≤ S and 1 ≤ δ y ≤ S. Since accessing shared memory is faster than accessing local memory, each thread only needs to sum up its own value from its neighbors. Consequently, we redefine (7) where WSB i, j represents the spatial weight buffer in block(i, j).
In order to more effectively utilize shared memory, the computation of β in (8) is reformulated so that it is calculated at the beginning in every block(i, j) before applying the filter to each pixel. Finally, the accelerated version of the collateral filter is defined as: where BETA i, j (μ x , μ y ) is the value of β at (μ x , μ y ). The pseudo code of our GPU-based collateral filter is depicted in Fig. 1. Note that each pixel being processed is assigned to a thread since the process is a slice-byslice approach, i.e., one 2D image at a time.

Feature extraction
Although the execution time of the traditional collateral filter is hastened using the GPU strategy, it is still laborious to manually adjust the parameters for achieving the best restoration results in a particular brain MR image.
Using an artificial neural network system is suitable as it can learn to estimate the optimal parameters in the filtering process. However, neural networks require appropriate input features to obtain the best performance of the predictable model. In our experience, image texture features play an important role in the neural network automation process. The candidate image texture features adopted in this study are described as follows.

Statistical features
There are several basic statistical features that are computed directly from the image intensity: mean intensity (Mean), standard deviation (SD), variance (VAR), entropy (ENT), and histogram features (skewness, kurtosis, variance, entropy and energy).

Tamura texture features
Tamura features [29] were developed based on the human visual and psychological perception. The first three features, which are correlated more closely with human perception, are computed: Their formulas are briefly provided in Additional file 1.

Noise estimation features
Since the collateral filter parameters are sensitive to noise, noise related features are required. Many noise variance estimation methods that are based on the Rician and Rayleigh distributions are complicated. An easier way to estimate the noise level is to employ a simpler Laplacian operator [30]. A more efficient algorithm [31] with a generic transfer function improved the Laplacian mask for better estimation. Aja-Fernández et al. [32] presented a set of estimators based on the Rayleigh distribution in the background. Moreover, the peak value of the autocorrelation function computed by an image with itself provides the information of noise [33]. Based on the above methods, the following seven features are adopted: (1) Laplacian mask (LAP)

Feature selection
If a large amount of features are selected as the input arguments, the training of neural networks will be an extremely time-consuming procedure. Consequently, we utilize the sequential forward floating selection (SFFS) method [34] to choose the best combination of features. Before applying the SFFS, we make use of the paired-samples t-test [35,36] to rank the significance of candidate features. The t-test is applied to each image feature to evaluate the ability for discriminating differences between noise levels, which results in an average p-value. The smaller the p-value, the better the discrimination. Statistically, those features with an average p < 0.05 are selected as candidates for the SFFS process.

Neural networks
The popular back propagation neural network (BPN), proposed by McClelland et al. [37], is utilized for training our neural network system. The main architecture includes input, hidden and output layers. The back propagation process is employed to optimize the weights by minimizing the target error so that the neural network is able to learn the input-output mapping perfectly. In our approach, the inputs are the features selected by the SFFS algorithm and the outputs are the collateral filter parameters. In addition, a brute-force approach is conducted to obtain the optimal filter parameters based on the peak signal-to-noise ratio (PSNR): where F is the m × n noise-free image, R is the restored image, and MAX I is the maximum possible intensity of F. The optimal filter parameters are decided when the highest PSNR value is obtained. Finally, a two stage neural network structure is developed with the first stage for classification and the second stage for prediction. At the first stage, the neural network classifier distinguishes inputs into three different noise levels. Multiple-layer (input, hidden, and output) neural networks with different noise level models at the second stage predict the optimal filter parameters for image restoration. The Levenberg-Marquardt learning algorithm [38] is exploited to independently train each of the three individual BPN models. For each BPN model, the transfer function between the input layer and the hidden layer is the hyperbolic tangent function, whereas the linear transfer function is employed between the hidden layer and the output layer. The overall flowchart of the proposed methods is depicted in Fig. 2.

Performance evaluation
In addition to the PSNR metric in (15), the structural similarity (SSIM) metric [39] is utilized to evaluate the restored images using where σ represents the standard deviation, μ represents the mean intensity, C 1 = (K 1 MAX I ) 2 and C 2 = (K 2 MAX I ) 2 with K 1 = 0.01 and K 2 = 0.03. The higher the PSNR and SSIM scores the better the restoration results. To more thoroughly evaluate the image restoration ability of the proposed system, the restoration results obtained using the automatic framework are compared with the results obtained using the brute-force method, which is considered as the optimal outcome. Let PSNR A be the restoration score of the proposed automatic system and   PSNR T be the restoration score of the brute-force method. The relative error ε r is then defined as

Results
We first conducted the experiments on the BrainWeb database [40] Table 1 presents the execution performance of the traditional collateral filter and our accelerated version on the same 217 × 181 brain MR images. The performance of using the shared memory (SM) or not is also indicated. The block size for the shared memory was 16 × 16 threads and the filter size was 3 × 3. When the number of images was increasing, the proposed framework executed the filtering process more effectively and a remarkable acceleration was attained. It was noted that  the speed up gain raised from 34 for one single image to 541 for processing 100 images. Our proposed GPU-based collateral filter is much faster than the original CPU version. The corresponding running times with respect to different image pixel sizes are depicted in Fig. 3. Obviously, the GPU implementation ran less than 0.1 s for each scenario, whereas the computation time of the CPU version was roughly linearly proportional to the pixel number. As presented in Table 2, there were 28 selected features from different image feature classes for the input candidates of the BPN system. The order of significance of these adopted image features was based on the average p-value of the t-test outcome with p < 0.05. For understanding the optimal features, T1-weighted brain MR images with three different slice thicknesses (1 mm, 3 mm and 5 mm), five noise levels (1, 3, 5, 7 and 9%), and three intensity non-uniformities (0, 20 and 40%) acquired from the BrainWeb database were utilized to evaluate the proposed automatic restoration framework. Each noise level had 1476 images, which were divided into two groups: 984 slices for training and 492 slices for testing. After the SFFS procedure, five optimal features of RLN(0°), RP(135°), RP(90°), AJA, and kurtosis were obtained and they were further employed in the BPN training phase.
In the testing phase, the first stage BPN system classified the input image into three major categories based on the noise level: low (1 and 3%), median (5 and 7%) and high (9%) estimators. The correct classification rate was up to 97.6% overall. The computed five features of the input image were fed into the corresponding second stage BPN model for producing the filter parameter values for denoising. Table 3 summarizes the PSNR scores and the relative error ε r of the proposed automatic filtering framework and the brute-force method on randomly selected slices of the 5 mm normal MR images with 5 and 7% noise. It is indicated that the PSNR scores of the proposed restoration algorithm were extremely close to the optimal PSNR scores of the brute-force method with negligible errors.
In Fig. 4, we illustrate our automatic restoration outcome of the 10th slice image corrupted by 5% noise level in the 5 mm MS dataset in comparison with the ADF [9], LMMSE [8], NLM [12], iterative bilateral filtering (IBF) [7], and BM3D [13] methods. Comparing to other techniques, the proposed GPU-based collateral filter revealed more uniform intensity profiles and better sharpened edges in the white matter (WM) and gray matter (GM) regions. Quantitative analyses also indicated that our automatic collateral filter provided the highest score of PSNR = 34.32 dB and the second highest SSIM = 0.8803 than other tested methods. Figure 5 depicts the visual restoration results of the 26th slice image corrupted by 7% noise level in the 3 mm MS dataset. It was obvious that, in contrast to other methods, the proposed denoising algorithm achieved a sharper and cleaner representation for the GM and WM with the best PSNR = 31.95 dB and SSIM = 0.8479, which was closer to the intact image. Figure 6 delineates the restoration results of the 1 mm normal image volume with 9% noise level and 40% intensity non-uniformity in 3D visualization. After applying the ADF, LMMSE, and IBF methods, the brain structures still, more or less, exhibited grainy noise influences. On the other hand, the proposed automatic filter decently wiped out the heavy noise while maintaining noticeable cortical structures as illustrated in Fig. 6f. Table 4 presents quantitative analyses on massive 1 mm image volume restoration with various scenarios of anatomical models and noise levels. It was noted that the LMMSE method performed much worse than other methods in the image volumes with 1% noise. Our automatic denoising framework produced the best restoration results in terms of average PSNR and SSIM in all noise levels. As the noise level was increasing, the advantage of our collateral filter over other methods was more compelling.
For completeness, we demonstrate the capability of our automatic filtering algorithm in restoring clinical brain MR images with diseases. Figure 7 depicts the restoration results of denoising T1-weighted MR images with brain tumors using different methods. While the restoration results obtained using the ADF, LMMSE, and IBF methods were somewhat grainy in the tumor area, the NLM, BM3D and proposed algorithms achieved more suitable smoothness while maintaining sharpness at the tumor boundaries. Another example of filtering distorted PD-weighted MR images with brain tumors is illustrated in Fig. 8. It was indicated that, comparing to other methods, our automatic collateral filter more effectively removed the artefact while disclosing Fig. 6 Qualitative analyses on the restoration results of the 1 mm normal image dataset with 9% noise and 40% intensity non-uniformity in 3D visualization. a Noisy image volume. b Intact image volume. c ADF. d LMMSE. e IBF. f Proposed distinct edges between anatomical structures. Finally, in Fig. 9, we show the visual improvement of noisy MR images with low resolution and severe grains using all tested methods. It was obvious that the proposed filtering algorithm more adequately eliminated the noise and identified the lesion in contrast to other techniques.

Discussion
Stimulated by the recent advance in artificial intelligence and parallel computing, we have developed an efficient filtering framework for brain MR images. An essential property of the proposed automatic restoration algorithm is the two-stage mechanism of the neural network concatenation. A benefit of this strategy is to divide the broad noise range into three narrower classes corresponding to three individual BPN models, where the prediction of the filter parameters can be further clarified. To correlate the image with the BPN models for automation, a wide variety of image texture features were investigated based on the SFFS strategy, from which five best texture features were obtained. Three features of RLN(0°), RP(135°), and RP(90°) belong to the GLRLM category, AJA is one of the noise estimation features, and kurtosis makes use of the histogram information. These five features play an essential role in the effective discrimination between images with different noise levels, anatomical structures, and intensity non-uniformities.
A variety of T1-weighted brain MR images acquired from the BrainWeb database [40] with various scenarios were utilized in the proposed BPN models to obtain the optimal filter parameters. Due to the facility limitations, other MR modalities such as T2-weighted and PD-weighted images were not included in the training dataset in the current study. From the viewpoint of the clinical demand, the proposed automatic filter adequately restored various brain MR images with distinct anatomical structures and modalities such as PD-weighted images (see Figs. 7,8,9), which were quite  different from the training BrainWeb dataset. This suggested the propriety of the proposed automation strategy and the robustness of the developed restoration scheme. Nevertheless, to achieve optimal restoration results on massive T2-weighted and PD-weighted brain images, retraining involving these types of images is recommended.

Conclusions
We have described an automatic image restoration algorithm based on the accelerated collateral filter that has been implemented on the GPU architecture. The proposed framework reduced the execution time of the filtering process by taking advantage of shared memory in