Self-relabeling for noise-tolerant retina vessel segmentation through label reliability estimation

Background Retinal vessel segmentation benefits significantly from deep learning. Its performance relies on sufficient training images with accurate ground-truth segmentation, which are usually manually annotated in the form of binary pixel-wise label maps. Manually annotated ground-truth label maps, more or less, contain errors for part of the pixels. Due to the thin structure of retina vessels, such errors are more frequent and serious in manual annotations, which negatively affect deep learning performance. Methods In this paper, we develop a new method to automatically and iteratively identify and correct such noisy segmentation labels in the process of network training. We consider historical predicted label maps of network-in-training from different epochs and jointly use them to self-supervise the predicted labels during training and dynamically correct the supervised labels with noises. Results We conducted experiments on the three datasets of DRIVE, STARE and CHASE-DB1 with synthetic noises, pseudo-labeled noises, and manually labeled noises. For synthetic noise, the proposed method corrects the original noisy label maps to a more accurate label map by 4.0–9.8% on F1 and 10.7–16.8% on PR on three testing datasets. 
For the other two types of noise, the method could also improve the label map quality. Conclusions Experiment results verified that the proposed method could achieve better retinal image segmentation performance than many existing methods by simultaneously correcting the noise in the initial label map. Supplementary Information The online version contains supplementary material available at 10.1186/s12880-021-00732-y.


Background
Retinal fundus images, an essential kind of medical image, are widely used in the early screening and diagnosis of ophthalmologic diseases. Segmenting blood vessels from the retinal fundus image is important for the automatic detection of fundus retinopathy and has drawn much interest in recent years. With the development of deep learning in medical image analysis, researchers have proposed many effective deep learning-based methods such as [1][2][3]. Most of them rely on supervised learning strategies that require a large number of training samples with accurate annotations to obtain a well-learned model. However, because of the thin structure of retina vessels and the high accuracy requirements of dense pixel labels, retina vessel segmentation labels rely on professional clinical ophthalmologists annotating the retinal fundus images pixel by pixel, which is time-consuming, laborious, and expensive work. This severely limits the wide application of deep learning models in actual auxiliary diagnosis. To tackle this bottleneck, researchers try to relax the restrictions on label accuracy. They adopt more economical methods of obtaining labels, such as hiring junior medical staff to annotate, crowdsourcing, or pseudo labeling. All the above methods for obtaining cheap yet noisy label maps on a new unlabeled dataset face the same problem: how can the correct labels in the noisy label maps be fully utilized to train the model while shielding the training from the bad effect of the noisy labels?
*Correspondence: songwang@cec.sc.edu. 1 College of Intelligence and Computing, Tianjin University, Tianjin, China. 2 College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China. Full list of author information is available at the end of the article.
This problem is known as learning with noisy labels (LNL) in many works [4,5]. Existing LNL methods are mainly designed for classification tasks on natural images [4][5][6][7][8]. Among them, Co-teaching [7] is a simple yet effective strategy that uses the agreement of the predictions from two differently initialized networks to select potentially correct labels from the low-quality label sets to train the model. Tanaka et al. [5] proposed an LNL framework that jointly optimizes the network parameters and estimates the true labels. Though most of these methods cannot be directly applied to semantic segmentation because of its dense prediction pattern, they have inspired many LNL methods for segmentation tasks [9][10][11][12]. Among these, Li et al. [12] proposed a robust framework that progressively improves the quality of both the labels and the learned models. It corrects the noisy labels by iteratively aggregating the current network prediction with the initial noisy labels through a moving average strategy. Nevertheless, the framework of Li et al. [12] directly uses the smoothed prediction values to modify the labels. It may therefore also change labels mistakenly, leading to further accumulation of errors in the subsequent training process. To avoid accumulating errors, Liu et al. [13] utilized a mutual learning strategy to estimate the reliability of the labels. In medical image segmentation, Xue et al. [11] and Zhang et al. [14] proposed two similar mutual learning frameworks that train three networks simultaneously and treat the agreement of two networks as clean labels to train the third. Though the mutual learning strategy can fully exploit the random initialization of different network parameters, training multiple networks at the same time costs considerable GPU memory and computation. 
In real applications, a more flexible and lightweight noise-tolerant solution is desired for medical image segmentation.
The critical problem in designing such a method is judging the relative accuracy of the labels predicted by a network trained on noisy labels and the given noisy labels themselves. One basic assumption in many studies based on consistency and regularization [15,16] is that the training of a deep model goes through multiple periods of random exploration, and that the correct label stays more steadily close to the predicted value across these periods. Inspired by this point of view, we propose a joint framework for the noise-tolerant retinal vessel segmentation task that simultaneously trains the network and corrects the noisy labels. The framework adopts the advantage of Li et al. [12] in updating annotations efficiently and iteratively. In contrast to that work, we propose an estimation method for the reliability of both labels and predictions. Based on this estimation, we construct a temporal memory loss for robust training and a label correction compensation mechanism for more accurate label correction. To verify the proposed method, we conduct experiments on three public retinal blood vessel datasets and analyze the model's accuracy under three different types of noise: synthetic noises, pseudo-labeled noises, and manually labeled noises. The results show that the proposed method can still effectively maintain the accuracy of blood vessel segmentation under a large proportion of noise without the help of additional true labels.
In summary, we make the following contributions in this paper:
• an efficient framework for noise-tolerant retinal vessel segmentation that can estimate the reliability of both the labels and the predictions;
• a temporal memory loss for robust training;
• a label correction compensation mechanism for more accurate label correction.

Related works
Retina vessel segmentation is a task with a long research history [17] and many mature methods [18]. Benefiting from the development of deep learning, the current SOTA methods [19,20] have achieved fairly accurate prediction results on the widely used public datasets, such as DRIVE [17], STARE [21], and CHASE [22]. However, few of them focus on how to eliminate the noise in label maps caused by factors such as observer variability, which can degrade the segmentation accuracy [23]. In this work, we aim to rectify the noisy label maps and improve the segmentation accuracy at the same time. Rectifying segmentation label maps is a branch of the study of learning from noisy labels (LNL) [24]. Since datasets with both noisy labels and carefully checked clean labels, e.g., WebVision [25], only provide data and evaluation for LNL on the classification task, existing LNL studies mainly focus on classification. Some of them reduce the bad effect of noisy labels on the network by reweighting the noisy labels in the loss function [4,7,26] or by dropping the noisily labeled samples during data sampling [27,28]. To distinguish the noisy labels from all the labels, strategies such as generative learning [29,30], contrastive learning [31], entropy minimization [32], consistency regularization [33,34], and pseudo labeling [35] are widely used and have been developed into many variants. These strategies have also inspired many recent works on LNL for segmentation tasks. Unlike classification, segmentation is a dense prediction task. Even noisy pixel-wise labels share contextual information with their neighboring pixels, so reweighting or dropping them independently is not suitable. In recent years, many studies [36,37] have focused on semi-supervised LNL for segmentation. However, they still need clean labels to provide essential information for distinguishing noisy labels. 
In this work, we target the task of rectifying noisy label maps without supervision in retina vessel image segmentation, where only noisy label maps are available and the positions of the clean labels within them are unknown.
Existing unsupervised methods for rectifying segmentation label maps are mainly based on strategies such as consistency regularization [11] and pseudo labeling [11,12]. Xue et al. [11] proposed a framework that can correct noisy boundary annotations on chest X-ray images without knowing the clean annotations. Inspired by the ideas of Co-teaching [7], they jointly trained three independent networks and treated the agreement of each pair of networks as correct annotations for training the remaining one. However, since the three networks share the same architecture and input, they may end up learning homogeneous knowledge and suffer from coupled noise that hinders further improvement of the label maps [36]. Li et al. [12] studied the same task on natural image datasets. They proposed a framework that directly uses the network's predicted label maps to change the supervised label maps iteratively. However, the training of the network is still affected by the noisy label maps, and the correctness of the label map changes is hard to guarantee because it relies heavily on the accuracy of the network's predicted label maps. Our work is based on Li et al. [12] but makes important improvements in both training with noisy label maps and distinguishing incorrect label map changes.

Overview
Given the retina vessel images and segmentation label maps with erroneous pixel-wise labels, we aim to train a segmentation model with them and simultaneously correct the errors in the noisy label maps. We illustrate the pipeline of our method in Fig. 1, which contains two modules.
• Segmentation training module (STM): $G$ denotes the segmentation network, which is trained for $C$ cycles (each cycle contains $E$ epochs) on the training set with the loss
$$\mathcal{E}(S, L), \tag{1}$$
where $\mathcal{E}$ is the criterion loss function, and $S$ and $L$ denote the predicted segmentation label maps generated by $G$ and the supervised label maps, respectively.
• Label correction module (LCM): After each cycle of training, we correct the given (noisy) label maps. Specifically, inspired by [12], we use the current label correction compensation $Q^{j}$ of cycle $j$ together with the initial label maps $L^{0}$ to update the corrected label maps in cycle $j$ (Eq. (2)), which are used for training $G$ in the $(j+1)$-th cycle. In particular, the label maps of cycle 1 are equal to $L^{0}$.
The details of the above two modules are discussed in the following.

Temporal memory loss (TML) for training
Since the initial label map $L^{0}$ is noisy, we aim to find a more accurate label map as supervision for training $G$. The key problem lies in estimating the current label maps $L$ in Eq. (1) in each cycle. A straightforward idea is to use the updated label map $L^{j}$ given by Eq. (2) directly, as in previous work [12]. However, the updated label map cannot be considered completely accurate, especially in the early training cycles. In this work, we propose a temporal memory mechanism to improve the robustness of the supervision during training. Specifically, while training the network $G$ in cycle $j$, we record the historical segmentation prediction $S^{e}$ at each epoch $e$ and calculate the best pixel-wise predictions of this cycle. For example, at the $e$-th epoch in cycle $j$, the best historical prediction at each pixel $(x, y)$ is defined as
$$\hat{S}^{e}_{x,y} = \mathop{\arg\min}_{S^{u}_{x,y}} \left| S^{u}_{x,y} - L^{j}_{x,y} \right|, \tag{3}$$
where $u \in \{1, 2, \ldots, e\}$ denotes the epoch index in cycle $j$, and $S^{u}_{x,y}$ and $L^{j}_{x,y}$ denote the values at pixel $(x, y)$ of $S^{u}$ and $L^{j}$, respectively. We then combine $\hat{S}^{e}_{x,y}$ over all pixels to get the best historical prediction $\hat{S}^{e}$. For the next epoch $e + 1$ in this cycle, we replace the loss function in Eq. (1) with
$$\mathcal{E}(S^{e+1}, \hat{S}^{e}) + \lambda\, \mathcal{E}(S^{e+1}, L^{j}), \tag{4}$$
where $\lambda$ is a preset weight, set to 0.1.
We explain the rationale of the proposed unsupervised loss. On the one hand, if the given label $L^{j}_{x,y}$ at pixel $(x, y)$ is correct, the historical best prediction $\hat{S}^{e}_{x,y}$ will always be better than the current prediction $S^{e}_{x,y}$ and will guide the optimization in the ideal direction. On the other hand, if the label $L^{j}_{x,y}$ is incorrect, the historically learned $\hat{S}^{e}_{x,y}$ is less noisy than the label $L^{j}_{x,y}$, so supervising with it reduces the bad effect of the noisy label. This is because the network tends to learn simple patterns first [23], and here the correct (pixel-wise) labels often have more consistent and simpler patterns to learn than the various noisy labels.
In the following, we discuss the details of the training, as illustrated in Fig. 2. We first train the network with the initial noisy label maps $L^{0}$ for several epochs as initialization, followed by multiple cycles of training. At the beginning of each cycle, we train the network for $T$ epochs using only the second term of Eq. (4), without its weight, as the loss function. This is because the recorded historical best prediction used in the first term of Eq. (4) needs several epochs to accumulate. After that, we train the network for the next $E - T$ epochs using the loss defined in Eq. (4).
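The two-phase loss within a cycle can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: "best" is taken to mean the per-pixel prediction closest to the current label, predictions are probability maps in [0, 1], and the names `bce`, `update_best`, `tml_loss`, `warmup_epochs` (standing in for T) and `weight` (the 0.1 weight) are hypothetical.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross entropy between two probability maps in [0, 1]."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred)
                           + (1.0 - target) * np.log(1.0 - pred))))

def update_best(best, pred, label):
    """Keep, per pixel, whichever prediction is closer to the current label."""
    if best is None:  # first epoch of a cycle: nothing recorded yet
        return pred.copy()
    closer = np.abs(pred - label) < np.abs(best - label)
    return np.where(closer, pred, best)

def tml_loss(pred, best, label, epoch_in_cycle, warmup_epochs, weight=0.1):
    """Loss for one epoch inside a cycle (epoch_in_cycle counts from 1)."""
    if best is None or epoch_in_cycle <= warmup_epochs:
        # Warm-up phase: supervised term only, without the weight.
        return bce(pred, label)
    # Afterwards: self-supervision from the best historical prediction
    # plus the down-weighted supervised term.
    return bce(pred, best) + weight * bce(pred, label)
```

In a training loop, `update_best` would be called once per epoch after the forward pass, and the stored map would be reset at the start of each cycle, when the label maps change.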

Spatial confidence aware label correction
In this section, we discuss the label map correction strategy in Eq. (2), particularly the label correction compensation $Q^{j}$. Previous works [12] directly use the final predicted segmentation map of cycle $j$, namely $S^{j}$, as $Q^{j}$, which may be unreliable because of under-fitted training and noisy-label supervision. Using only the historical best prediction $\hat{S}^{j}$ as $Q^{j}$ is not always the best choice either, since $\hat{S}^{j}_{x,y}$ will be worse than $S^{j}_{x,y}$ at the pixels guided by an incorrect label $L^{j}_{x,y}$. In this work, we propose a spatial confidence aware label correction strategy to obtain a more reliable $Q^{j}$ from the predicted segmentation maps. Specifically, we estimate the uncertainty of the prediction by the difference between its historical best and worst predictions, which can be formulated as
$$d^{j}_{x,y} = \left| \hat{S}^{j}_{x,y} - \check{S}^{j}_{x,y} \right|,$$
where $\check{S}^{j}_{x,y}$ denotes the historical worst prediction. $d^{j}_{x,y}$ can be taken as the range of variation of the historical prediction results, which inversely reflects the confidence at each pixel. Based on this, we blend the final prediction $S^{j}$ with $\hat{S}^{j}$ using $d^{j}_{x,y}$ as a soft weight. The proposed label correction compensation is
$$Q^{j} = (1 - D^{j}) \odot S^{j} + D^{j} \odot \hat{S}^{j},$$
where $\odot$ denotes element-wise multiplication and $D^{j}$ is composed of $d^{j}_{x,y}$, reflecting the pixel-level confidence of the segmentation results. We take the segmentation results from $S^{j}$ where the prediction confidence is high. Otherwise, when the confidence is low, we use the historical best prediction $\hat{S}^{j}$, which is more stable.
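The soft blending described above can be illustrated with a short NumPy sketch. It assumes all maps hold probabilities in [0, 1], so that the best/worst gap is itself a valid weight, and it assumes the blend is linear in that gap; the function name `correction_compensation` is hypothetical.

```python
import numpy as np

def correction_compensation(final_pred, best, worst):
    """Blend the final prediction with the historical best one, pixel by pixel.

    A large best/worst gap means the prediction fluctuated during the cycle
    (low confidence), so the historical best prediction gets more weight.
    """
    d = np.abs(best - worst)  # per-pixel range of variation, in [0, 1]
    return (1.0 - d) * final_pred + d * best
```

For a pixel whose predictions never changed during the cycle, the gap is zero and the final prediction is used unchanged; for a pixel whose predictions swung across the whole range, the historical best prediction takes over.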

Implementation details
In this work, we choose the classical binary cross entropy loss as $\mathcal{E}$ in Eq. (4) and use U-Net [2] as the network $G$. To efficiently store the historical best prediction $\hat{S}^{j}_{x,y}$ and worst prediction $\check{S}^{j}_{x,y}$ at each pixel, we employ a memory bank, shown in Fig. 1, that records $\hat{S}^{j}_{x,y}$ and $\check{S}^{j}_{x,y}$ according to the image index and the $(x, y)$ coordinates. During training, for each image, we perform horizontal flipping, vertical flipping, and both of them, respectively, to construct three augmented images. The memory bank first reverses the augmentation operations on the predicted label maps of the augmented images, then calculates and records $\hat{S}^{j}_{x,y}$ and $\check{S}^{j}_{x,y}$. We use the Adam [38] optimizer with a learning rate of $7 \times 10^{-3}$. Following the setting in [12], we also use the stochastic weight averaging method [16] to train the network.
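A memory bank of this kind can be sketched as below. The sketch assumes predictions are 2-D probability maps, that "best" and "worst" mean closest to and farthest from the current label, and that flips are undone with `numpy.flip`; the class `MemoryBank`, the table `FLIPS`, and the helper names are hypothetical and not taken from the authors' code.

```python
import numpy as np

# Hypothetical augmentation table: name -> axes to flip
# (axis 0 = vertical flip, axis 1 = horizontal flip).
FLIPS = {"h": (1,), "v": (0,), "hv": (0, 1)}

def augment(image, kind):
    """Apply one of the three flip augmentations to a 2-D map."""
    return np.flip(image, axis=FLIPS[kind])

def undo(pred, kind):
    """Map a prediction back to the original frame (a flip is its own inverse)."""
    return np.flip(pred, axis=FLIPS[kind])

class MemoryBank:
    """Per-image store of the best and worst historical predictions."""

    def __init__(self):
        self.best = {}
        self.worst = {}

    def record(self, image_index, pred, kind, label):
        pred = undo(pred, kind)  # align with the un-augmented label map
        err = np.abs(pred - label)
        if image_index not in self.best:
            self.best[image_index] = pred.copy()
            self.worst[image_index] = pred.copy()
            return
        b, w = self.best[image_index], self.worst[image_index]
        self.best[image_index] = np.where(err < np.abs(b - label), pred, b)
        self.worst[image_index] = np.where(err > np.abs(w - label), pred, w)
```

Keying the store by image index keeps all augmented views of one image in a single aligned pair of maps, so memory grows with the number of training images rather than with the number of epochs.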
We run our method for 100 epochs in total: the first 50 epochs serve as initialization and are followed by 5 cycles of 10 epochs each. We apply the SGDR [39] learning rate scheduler to adjust the learning rate dynamically. The scheduler starts at epoch 40 and uses a cycle length of 10 epochs.
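The schedule above can be written down as a small sketch. It assumes 0-based epoch indices (so the first correction cycle starts at epoch 50), a constant learning rate before the scheduler starts, a minimum learning rate of 0, and the standard cosine form of SGDR without period doubling; the function names `phase` and `sgdr_lr` are hypothetical.

```python
import math

def phase(epoch, init_epochs=50, cycle_len=10):
    """Return ('init', None) or ('cycle', j) for a 0-based epoch index."""
    if epoch < init_epochs:
        return ("init", None)
    return ("cycle", (epoch - init_epochs) // cycle_len + 1)

def sgdr_lr(epoch, base_lr=7e-3, start_epoch=40, period=10, min_lr=0.0):
    """Cosine annealing with warm restarts, constant before start_epoch."""
    if epoch < start_epoch:
        return base_lr
    t = (epoch - start_epoch) % period  # position inside the current period
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t / period))
```

With these settings the learning rate restarts at epochs 40, 50, 60, and so on, so each restart coincides with the first epoch of a label correction cycle.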

Setup
We evaluate two tasks in the experiments: 1) We train the network on the training dataset using only the noisy label (as the initial label) and evaluate its segmentation results on the testing dataset with the correct labels. 2) We evaluate the noisy label correction on the training dataset using the correct labels.
Datasets We evaluate our methods on 3 public benchmarks.
• DRIVE [17] contains 40 retina images of size 565 × 584, with 20 images in the training set and 20 images in the testing set. Each image in the training set has a label map annotated by an expert (taken as the golden standard, i.e., the correct label). Besides the correct label maps, each image in the testing set has a label map annotated by another annotator (taken as the noisy label). To fit our task in this work, we exchange the data in the training set and testing set and denote the new dataset as DRIVE (R).
• STARE (VK) [21] contains 20 images with a resolution of 605 × 700: the first 10 form the training set and the other 10 the testing set.
• CHASE [22] contains 28 retina images with a resolution of 999 × 960: the first 14 are used for training and the other 14 for testing.
In the latter two datasets, each image has two label maps annotated by two annotators. According to the official description, the label maps from one expert are taken as the golden standard.
Comparison methods We include the following three methods for comparison.
• U-Net: We select U-Net [2], a well-known network architecture for image segmentation, as the baseline. It uses the same backbone network and training settings as ours.
• Cas [11]: A method for the chest X-ray image segmentation task, which also provides noisy label correction results.
• SF [12]: A state-of-the-art method for noisy-label-based human parsing and label correction.
Pollution sources We use three types of pollution sources, i.e., (1) synthetic noisy label maps, (2) label maps generated by pseudo labeling, and (3) manually labeled noisy label maps to evaluate the label correction performance of our method and the comparison methods. Examples of them are shown in Fig. 3b-d respectively. The original label map is shown in Fig. 3a for comparison.
• We apply the method in [9] to generate the synthetic noisy label maps. Using OpenCV, we approximate the contour of the retina vessels with a combination of line segments. This results in pixel label deletion, shifting, and inaccurate contours, which simulates the noise introduced by rough annotation of retina vessel images. We control the approximation accuracy parameter and generate noisy label maps at three increasing pollution levels, named LV-1, LV-2, and LV-3.
• For unlabeled segmentation datasets in practical scenarios, pseudo label maps generated by models trained on other, similar labeled datasets are often used as low-cost noisy supervision. We therefore also collect pseudo label maps of the DRIVE (R) and STARE (VK) datasets generated by an existing published work [40], as shown in Fig. 3c.
• The manually labeled noisy label maps are the manual label maps (other than the golden standard) provided by the above three datasets.
All of the noisy labels for the three datasets used in this work are provided in Additional file 1 (see the Additional Files section).
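The effect of the approximation accuracy parameter on the synthetic noise can be illustrated with the Ramer-Douglas-Peucker algorithm, the polyline simplification that OpenCV's `cv2.approxPolyDP` implements. The pure-NumPy function below is a stand-in for illustration only: it works on an open polyline of points, whereas the pipeline described above operates on vessel contours and then rasterizes them back into a label map, which is omitted here. The function name `rdp` and the sample points are hypothetical.

```python
import numpy as np

def rdp(points, epsilon):
    """Simplify an open polyline (N x 2 array) with Ramer-Douglas-Peucker.

    A larger epsilon tolerates a larger deviation from the original curve,
    giving fewer line segments and therefore a coarser, noisier contour.
    """
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.hypot(chord[0], chord[1])
    rel = points[1:-1] - start
    if norm == 0.0:
        # Degenerate chord: fall back to the distance from the start point.
        dist = np.hypot(rel[:, 0], rel[:, 1])
    else:
        # Perpendicular distance of each interior point to the chord.
        dist = np.abs(chord[0] * rel[:, 1] - chord[1] * rel[:, 0]) / norm
    k = int(np.argmax(dist)) + 1  # index of the farthest interior point
    if dist[k - 1] <= epsilon:
        return np.vstack([start, end])
    left = rdp(points[: k + 1], epsilon)
    right = rdp(points[k:], epsilon)
    return np.vstack([left[:-1], right])  # drop the duplicated split point
```

Running the function on the same curve with increasing `epsilon` yields progressively fewer vertices, which mirrors how a single accuracy parameter can produce graded pollution levels such as LV-1 to LV-3.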

Results of label correction
We first evaluate the noisy label correction performance in Table 1. Specifically, we compare the original noisy labels with the corrected labels generated by different methods using standard segmentation metrics, including the F1 score and the area under the precision-recall curve (PR score). In Table 1, the 'Baseline' results denote the accuracy of the original labels under the different pollution sources. The proposed method consistently outperforms all the other methods on all the benchmarks for the synthetic noises, especially in the LV-3 groups. It corrects the original noisy label maps to a more accurate label map by 4.0-9.8% on F1 and 10.7-16.8% on PR on three testing datasets.
For the pseudo labeling noise, the proposed method could also improve the quality of the pseudo label map by a small margin.
For the manually labeled noise, the proposed method shows better accuracy than the other methods, especially on the STARE (VK) dataset, where it outperforms the SF method and the Cas method by 1.1% and 4.4% on F1 score, respectively. Compared to the original noisy label maps, it obtains an improvement of 4.1-7.0% on F1 score on three datasets.
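For reference, the F1 comparison between a noisy and a corrected label map can be computed as in the sketch below. It assumes binary label maps and the usual pixel-wise definition of F1; the PR score additionally needs soft prediction scores and is not covered. The function name `f1_score` and the toy arrays are hypothetical.

```python
import numpy as np

def f1_score(pred, truth):
    """Pixel-wise F1 between two binary label maps."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = 2 * tp + fp + fn
    # Two empty maps agree perfectly, so return 1.0 in that case.
    return 2.0 * tp / denom if denom else 1.0

# Toy example: one vessel pixel missing in the noisy map,
# one spurious pixel in the corrected map.
truth = np.array([[1, 1, 0, 0]])
noisy = np.array([[1, 0, 0, 0]])
corrected = np.array([[1, 1, 1, 0]])
gain = f1_score(corrected, truth) - f1_score(noisy, truth)
```

Both maps are scored against the golden-standard labels, so the gain measures how much closer the corrected map is to the correct annotation than the original noisy one.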

Testing performance boost of segmentation
We further evaluate the segmentation performance boost of our method and the comparison methods on the testing set, using the same initial noisy labels for training. The results are shown in Table 2, and the 'Baseline' here denotes the U-Net described in the Setup section. We can see that the segmentation performance improvement of the proposed method is also superior to that of the others in most experiments. Notably, when the level of synthetic noise is severe, e.g., LV-3, the proposed method can still boost the segmentation performance of the network, while the other two methods fail in some cases, e.g., those on DRIVE (R) and STARE (VK).
Table 1 Comparative results of prediction on the testing set (%). The values in bold denote the best performance in each group.

Cross-datasets validation
To evaluate the generalization ability of the proposed method and the compared methods, we use the models trained on the DRIVE (R) dataset to predict segmentation label maps on the test set of the STARE dataset, and vice versa, for cross-dataset validation. The results are shown in Table 3. From Table 3, we can see a performance decrease for all the methods on both datasets, especially the STARE dataset. This is because there is a domain gap between the images and annotations of the two datasets, which come from different capturing devices and different human annotators. However, the proposed method still achieves considerably high performance in the cross-dataset validation and outperforms the other compared methods on all the metrics across different noise settings. Even in the high synthetic noise groups like LV-3, the proposed method still gets an F1 score over 70.0 on both the

Qualitative study
We show examples of corrected label maps for the different types of noise in Fig. 4. We can see that the proposed method tends to correct the noisy labels carefully while leaving the correct labels unchanged. The compared methods either could not correct the noise or failed to preserve the correct labels, as in the cases shown in lines 1, 4, and 5 of Fig. 4. Besides, the proposed method generates more accurate vessel boundaries and thickness than the compared methods, as in the cases shown in lines 2 and 3 of Fig. 4. This can be explained by the proposed method considering both the noisy labels in training and the noisy predictions in label correction. Thus, for example, if the labels of a vessel are thicker than its correct labels at the boundary, the network in the proposed framework will not be directly influenced by the noisy labels, which otherwise may result in thicker vessel predictions. The full corrected label maps are shown in Fig. 6.

Training-testing curve
We further show the training-testing loss curves and the F1 curve of label map correction in Fig. 5 to better understand the training procedure. As mentioned in the Implementation details section, we train the whole framework for 100 epochs and start the first cycle of label correction and testing at epoch 50. From Fig. 5 we can see that the testing loss curves decrease continuously over the training cycles. The training loss curves also decrease almost constantly, except for a small peak at the first epoch of each cycle. This is because the label map is corrected at the end of each cycle, and the SGDR learning rate scheduler restarts at the beginning of each cycle. The two curves support that the proposed method is not over-fitted to the evaluated datasets. Besides, the F1 curve of label map correction also increases continuously. The progress of network training and label map correction promote each other and further boost the performance of the whole framework (Fig. 6).

Ablation study
In this section, we apply the ablation study to the label map correction task. We consider the following variations of the proposed method.
• w/o TML: We remove the proposed temporal memory loss, i.e., we only use $\mathcal{E}(S^{e}, L^{j})$ as the loss function in Eq. (4).
• w $S^{j}$: Replacing $Q^{j}$ with the final prediction $S^{j}$ in Eq. (5).
• w $\hat{S}^{j}$: Replacing $Q^{j}$ with the historical best prediction $\hat{S}^{j}$ in Eq. (5).
The results are shown in Table 4. Training without the proposed TML decreases the performance of the proposed framework on all the benchmarks, by 0.3-1.6% on F1 score and 0.2-2.4% on PR score. Notably, the performance drop from removing TML grows as the degree of synthetic noise increases. For example, on the DRIVE (R) dataset with LV-1 synthetic noise, removing TML brings a 1.1% decrease on PR score, while with LV-3 synthetic noise the corresponding decrease is 2.4%. Using $S^{j}$ as $Q^{j}$ in Eq. (5) also consistently decreases the performance on all the benchmarks, by 0.5-2.0% on F1 score and 0.4-1.2% on PR score. Using $\hat{S}^{j}$ to replace $Q^{j}$ in Eq. (5) is slightly better than the proposed $Q^{j}$ in some cases, especially with low-level synthetic noise, such as LV-1 and LV-2 on CHASE. However, on most of the benchmarks the proposed $Q^{j}$ is superior to $\hat{S}^{j}$ in label correction.