Joint optic disc and cup segmentation based on densely connected depthwise separable convolution deep network

Background Glaucoma is an eye disease that causes vision loss and even blindness. The cup to disc ratio (CDR) is an important indicator for glaucoma screening and diagnosis, and accurate segmentation of the optic disc and cup is needed to obtain it. Although many deep learning-based methods have been proposed to segment the disc and cup in fundus images, achieving highly accurate segmentation is still a great challenge because of the heavy overlap between the optic disc and cup. Methods In this paper, we propose a two-stage method in which the optic disc is first located and the optic disc and cup are then segmented jointly within the region of interest. We treat joint optic disc and cup segmentation as a multi-category semantic segmentation task, for which we propose a deep learning-based model named DDSC-Net (densely connected depthwise separable convolution network). Specifically, we employ depthwise separable convolutional layers and an image pyramid input to form a deeper and wider network that improves segmentation performance. Finally, we evaluate our method on two publicly available datasets, DRISHTI-GS and REFUGE. Results The experimental results show that the proposed method outperforms state-of-the-art methods such as pOSAL, GL-Net, M-Net and Stack-U-Net in terms of Dice coefficients, with scores of 0.9780 (optic disc) and 0.9123 (optic cup) on the DRISHTI-GS dataset, and scores of 0.9601 (optic disc) and 0.8903 (optic cup) on the REFUGE dataset. In particular, in the more challenging optic cup segmentation task, our method outperforms GL-Net by 0.7% in terms of Dice coefficients on the DRISHTI-GS dataset and outperforms pOSAL by 0.79% on the REFUGE dataset. Conclusions The promising segmentation performance shows that our method has the potential to assist in the screening and diagnosis of glaucoma.


Open Access. *Correspondence: pandr@scnu.edu.cn. South China Normal University, Guangzhou 510006, China

Background
Glaucoma is an eye disease that damages the optic nerves and causes irreversible vision loss [1]. It has been estimated that 60.5 million people globally were affected by glaucoma in 2010, and this number was predicted to reach almost 80 million by 2020 [2]. Since the vision loss is irreversible, early detection and diagnosis are very important and have been shown to decrease the rate of blindness by around 50% [3]. Hence it is essential to have a glaucoma screening technique that distinguishes glaucomatous from healthy eyes. Intraocular pressure (IOP) assessment, the visual field test and optic nerve head (ONH) assessment are the three main techniques to detect glaucoma, of which ONH evaluation is the most clinically significant screening technique. For ONH evaluation, the cup to disc ratio (CDR), the ratio of the optic cup diameter to the optic disc diameter, is one of the most important indicators for glaucoma screening and diagnosis. Accurate segmentation of the optic disc (OD) and optic cup (OC) is essential for the calculation of the CDR. However, manual calculation of the CDR by experienced clinicians is time-consuming and expensive, and is not suitable for population screening. Therefore, computer-aided diagnosis (CAD) methods for large-scale fundus image screening are needed. Segmenting the optic disc and optic cup is the preliminary step in CDR measurement and glaucoma assessment, and many works have been proposed to segment the optic disc and cup from the fundus image to help clinicians diagnose glaucoma more effectively.
The main segmentation techniques include template-based methods [4,5], boundary detection [6,7], hand-crafted visual feature approaches, and deep learning segmentation methods [8][9][10][11]. Among these, template-based methods model the OD as a circular or elliptical object and employ a circular Hough transform [4,5] or a sliding band filter [6] to obtain an approximate boundary of the optic disc. Methods based on boundary detection need multiple landmark points to be marked [6], or require each pixel to have a direct edge in its 15-pixel neighborhood, and do not consider depth information [7]. Hand-crafted visual feature approaches convert the boundary problem into a pixel classification problem and obtain satisfactory results. However, because of variations in fundus image quality and the presence of internal and surrounding blood vessels, these approaches cannot robustly obtain the precise boundary of the optic disc and cup.
It has been shown that deep learning-based techniques, represented by the convolutional neural network (CNN) [12], achieve promising results for optic disc and optic cup segmentation [8][9][10][11]. Compared with the aforementioned hand-crafted feature extraction methods, deep networks based on CNNs can automatically extract complex features from the input data. In [8], an optic disc and optic cup segmentation method based on a CNN is proposed, which uses an entropy-based sampling technique to reduce computational complexity. Although this approach segments better than methods using hand-crafted features, it is still time-consuming. In [9], a fully convolutional neural network [13] based on the VGG-16 net [14] is proposed to segment the optic disc. This method can segment retinal vessels and the optic disc simultaneously, but it does not segment the more challenging OC. In 2015, Ronneberger et al. [15] proposed an architecture called U-Net, which has been widely used in biomedical image segmentation and has achieved good results, and many biomedical image segmentation networks are modifications of it. In [10], a modified version of the U-Net convolutional network is presented for automatic optic disc and cup segmentation. Although this method has the advantages of fast processing speed and fewer parameters, it fails to make full use of the context information of the network, which results in poor OC segmentation. In [11], the authors utilize fully convolutional networks with adversarial training to jointly segment the OD and OC, but assume that the region of interest (ROI) can be accurately extracted by preprocessing. Furthermore, Fu et al. [16] propose a deep learning architecture named M-Net, which combines a polar transformation with the multi-label deep learning concept. Although M-Net uses the same network architecture to segment the optic disc and the cup, it still treats these two segmentation tasks as two independent problems.
At present, most segmentation methods follow a detection-then-segmentation mechanism. However, many location methods are susceptible to image quality and pathological areas [8,10], which leads to wrong locations. Furthermore, accurate segmentation of the optic cup is the more challenging part of the disc and cup segmentation task. Although most deep learning-based disc and cup segmentation networks obtain satisfactory OD segmentation results, they do not produce equally accurate optic cup segmentation results [8][9][10][11][16].
To address the above problems, the purpose of this study is to explore a novel CAD model for joint OD and OC segmentation that can assist clinicians in large-scale glaucoma screening. The aims of this paper are as follows: (1) first, we explore an OD location method based on deep learning to overcome the unstable location of traditional hand-crafted feature methods, which is caused by changes in fundus image quality and by pathological areas; (2) second, we explore a deeper and wider CNN structure that extracts richer and more complex fine-grained features from fundus images, so that the model achieves better OD and OC segmentation results, especially in the more difficult OC segmentation task. With correct OD and OC segmentation results, accurate CDR values can be calculated to assist clinicians in screening and diagnosis.

Methods
The method proposed in this paper employs a two-stage approach to segment the optic disc and cup. In the first stage, a CNN and Hough circle detection are used to obtain the center coordinates of the optic disc and extract the ROI. In the second stage, the ROI is fed into a high-precision segmentation network to obtain accurate segmentation results for the optic disc and cup. The proposed method is trained and evaluated on the DRISHTI-GS [17] and REFUGE [18] datasets, respectively. The overall flowchart of our proposed method is shown in Fig. 1. The details of the datasets and the framework are explained in the following subsections.

Dataset
The DRISHTI-GS dataset contains 101 retinal fundus images that were collected at Aravind Eye Hospital, Madurai. The images have a resolution of 2047 × 1759 and are stored in uncompressed PNG format. The ground truth was annotated by four ophthalmologists with different levels of clinical experience, and the dataset is divided into 50 training and 51 testing images. The Retinal Fundus Glaucoma Challenge (REFUGE) dataset contains 1200 images, of which 120 are glaucomatous and 1080 are non-glaucomatous. The REFUGE dataset is divided into three parts: 400 training images, 400 validation images and 400 testing images, where the validation and testing images are acquired with the same cameras. Brief information about these two datasets is shown in Table 1.
In our proposed method, the 50 training images of the DRISHTI-GS dataset are used to train the proposed model, and the other 51 testing images are used to evaluate the performance of the final trained model. Similarly, the 800 images from the training and validation sets of the REFUGE dataset are used for training, and the 400 images from the testing set are used to evaluate the final model trained on the REFUGE dataset.

Image processing and data augmentation
The datasets used for training contain few images (for example, only 50 training images in the DRISHTI-GS dataset), and training a network on too little data may lead to overfitting. We therefore use data augmentation to expand the training images. The augmentation methods include translation, rotation, noise addition, and brightness adjustment. The training images of the DRISHTI-GS dataset are expanded to 5250, and the training images of the REFUGE dataset are expanded to 30,000. Specifically, 90% of the augmented training images are randomly selected to train the proposed model, and the remaining 10% are used to evaluate the model during training. For example, when using the DRISHTI-GS dataset to train the segmentation network, 4725 of the 5250 images are used for training, and the other 525 images are used to evaluate the model during the training process.
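The random 90%/10% partition of the augmented images can be sketched as follows. This is only a minimal illustration: the function name `split_augmented`, the fixed seed and the use of image indices instead of files are assumptions made for the example and are not taken from the released code.

```python
import random


def split_augmented(num_images, train_fraction=0.9, seed=0):
    """Randomly split image indices into a training part and an in-training evaluation part."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)  # fixed seed only to make the sketch reproducible
    # round() guards against floating-point error in the product
    num_train = round(num_images * train_fraction)
    return indices[:num_train], indices[num_train:]


# 5250 augmented DRISHTI-GS images, as reported in the text
train_idx, eval_idx = split_augmented(5250)
print(len(train_idx), len(eval_idx))  # 4725 525
```

The same call with 30,000 images gives the corresponding split for the augmented REFUGE data.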

ROI extraction network
A complete fundus image taken by a professional camera generally has a large resolution, while the region of interest occupies only a small area of it. Locating and cropping out the region of interest therefore reduces the interference of unnecessary background information, improves the segmentation accuracy and reduces the amount of computation. However, methods that employ green channel images [8] or morphological operations [19] to detect the optic disc are susceptible to differences between imaging devices, fundus image quality, brightness, internal blood vessels, and lesions in the fundus images, which results in low location accuracy. In our work, we use a CNN-based method to extract features and solve this problem. The model and segmentation process are shown in Fig. 2. At this stage, we design a simple convolutional neural network to coarsely segment the optic disc, and then use the Circular Hough Transform (CHT) [20] to calculate the center of the optic disc. With this method, we can locate the optic disc with 100% accuracy and crop the ROI. The location result is shown in Table 2.
CHT is an extension of the Hough transform (HT) [21] and is mainly used to detect circular objects in an image. For circle detection, the HT is based on the equation of a circle, defined as:

$$(x - a)^2 + (y - b)^2 = r^2$$

where (a, b) represents the coordinates of the circle center and r is the radius. The center coordinates can be obtained by performing the CHT on the image. The CHT can be defined as:

$$(p_c, r) = \mathrm{CHT}(I, r_{\min}, r_{\max})$$

where $p_c = (i_c, j_c)$ and $r$ are the center position and the radius, respectively, of the circle with the highest score in the CHT, and $I$ is the input image. The radius $r$ is restricted to lie between $r_{\min}$ and $r_{\max}$. In our method we set $r_{\min}$ and $r_{\max}$ to 40 and 160, respectively. After obtaining the coordinates of the center of the disc, we use it as the center point to crop the original image into a smaller image with a resolution of 480 × 480 on the REFUGE dataset and 560 × 560 on the DRISHTI-GS dataset. The cropped image contains the optic disc, the optic cup and some background information. Visual examples of the ROI extraction are shown in Fig. 3.

Fig. 1 Overview of our proposed method. First, a simple DDSC-Net without multi-scale input is used to roughly segment the optic disc, and the optic disc is located by CHT. The cropped image is then fed into the segmentation network to jointly segment the OD and OC.
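The circle voting and the ROI cropping described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names `circular_hough` and `crop_box`, the angular sampling step, the toy image size and the rule of shifting the crop window to keep it inside the image are all assumptions made for this example, and the toy radii are far smaller than the 40 to 160 range used in the paper.

```python
import math


def circular_hough(edge_points, width, height, r_min, r_max, angle_step=4):
    """Return (a, b, r) of the circle that collects the most votes.

    Every edge point (x, y) votes for each centre (a, b) that lies at
    distance r from it, following (x - a)^2 + (y - b)^2 = r^2.
    """
    votes = {}
    for x, y in edge_points:
        cells = set()  # an edge point votes at most once per (a, b, r) cell
        for r in range(r_min, r_max + 1):
            for deg in range(0, 360, angle_step):
                theta = math.radians(deg)
                a = round(x - r * math.cos(theta))
                b = round(y - r * math.sin(theta))
                if 0 <= a < width and 0 <= b < height:
                    cells.add((a, b, r))
        for cell in cells:
            votes[cell] = votes.get(cell, 0) + 1
    return max(votes, key=votes.get)


def crop_box(center, size, width, height):
    """Square window of side `size` centred on `center` (assumes size <= width, height).

    The window is shifted, not shrunk, when it would leave the image
    (an assumption of this sketch).
    """
    cx, cy = center
    left = min(max(cx - size // 2, 0), width - size)
    top = min(max(cy - size // 2, 0), height - size)
    return left, top, left + size, top + size


# Toy "optic disc" boundary: a circle of radius 12 centred at (50, 40)
edge_points = {
    (round(50 + 12 * math.cos(math.radians(d))),
     round(40 + 12 * math.sin(math.radians(d))))
    for d in range(0, 360, 4)
}
a, b, r = circular_hough(edge_points, width=100, height=80, r_min=8, r_max=16)
print((a, b, r))                                       # (50, 40, 12)
print(crop_box((a, b), size=48, width=100, height=80))  # (26, 16, 74, 64)
```

In practice an optimized routine such as OpenCV's `cv2.HoughCircles` would normally replace the explicit voting loop; the loop is written out here only to make the accumulator idea visible.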

DDSC network architecture
In object detection networks [22][23][24], features extracted from shallow layers can be used to detect small objects, while features extracted from deep layers can be used to detect large objects. We adopt these ideas to design our network structure for the optic disc and optic cup segmentation task. Considering the prior knowledge that the optic cup is located inside the optic disc, we use dense and skip connections to make full use of the context semantics of the shallow and deep layers. The proposed network structure is shown in detail in Fig. 4. The proposed deep network, named DDSC-Net, consists of three main parts. The first part is the image pyramid [25], which is used as the multi-scale input of the network so that the network receives image information at different scales. Multi-scale input mitigates the loss of image information as the network becomes deeper. The second part of DDSC-Net is a U-shaped fully convolutional network, which includes an encoder module on the left and a decoder module on the right. Finally, the output map is activated by the softmax activation function, and the cross-entropy loss function is used to calculate the difference between the segmentation result and the ground truth.

Image pyramid multi-scale input
The input of the DDSC-Net is an image pyramid, which can effectively improve the segmentation quality of the network. We employ average pooling layers to build the image pyramid, whose levels are then introduced into different layers of the encoder module. The advantages of this are as follows: (1) it avoids a large increase in network parameters; (2) it increases the network width of the decoder path; and (3) it reduces the loss of information caused by the deepening of the network.
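As an illustration of how an image pyramid is built by average pooling, the sketch below repeatedly halves the resolution of a single-channel image with 2 × 2 average pooling. The helper names, the 2 × 2 window and the number of levels are assumptions of this example; the text does not specify how many pyramid levels DDSC-Net uses.

```python
def avg_pool_2x2(image):
    """Halve the resolution of a 2-D image (list of rows) by 2x2 average pooling."""
    return [
        [
            (image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, len(image[0]) - 1, 2)
        ]
        for r in range(0, len(image) - 1, 2)
    ]


def image_pyramid(image, levels):
    """Return [full resolution, 1/2 resolution, 1/4 resolution, ...] with `levels` entries."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(avg_pool_2x2(pyramid[-1]))
    return pyramid


# Toy 8 x 8 single-channel image with values 0..63
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
pyramid = image_pyramid(image, levels=3)
print([(len(p), len(p[0])) for p in pyramid])  # [(8, 8), (4, 4), (2, 2)]
```

Each level would then be fed to the encoder stage whose feature maps have the same spatial size, which is what keeps the extra parameter cost small.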

DDSC network
Inspired by U-Net [15], a fully convolutional network with a U-shaped structure that uses skip connections for feature fusion at each stage, we design the DDSC network based on the U-Net structure, as shown in Fig. 4. The DDSC network consists of an encoder and a decoder connected by skip connections. Specifically, the encoder extracts the high-level semantic features of the input image, and the decoder restores the semantic features extracted by the encoder to the resolution of the original image. Skip connections fuse multi-scale features between the encoder and the decoder. Unlike the original U-Net, our network replaces most of the standard convolutional layers with depthwise separable convolution layers, which significantly reduces the amount of computation. This allows us to design a deeper network that learns more feature information from the input data, especially the semantics of the optic cup. In addition, we use more skip connections between the encoder and the decoder to enhance the transfer of contextual feature information in our model. The DDSC network is composed of three kinds of components: densely connected depthwise separable convolution blocks, subsampling layers and upsampling layers. A dense depthwise separable convolution (DDSC) block contains five densely connected layers, each of which consists of a batch normalization layer, a rectified linear unit (ReLU) activation function, and a depthwise separable convolution layer with a kernel size of 3 × 3. The subsampling layer is a max pooling layer with a kernel size of 2 and a stride of 2, and the upsampling layer is a 3 × 3 transposed convolution layer.
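For standard convolution, the output feature map F, assuming a stride and padding of one, is computed as:

$$F_{x,y,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot I_{x+i-1,\,y+j-1,\,m}$$

The number of parameters and the computational cost of the standard convolution are respectively computed as:

$$k \times k \times M \times N$$

and

$$k \times k \times M \times N \times H \times W$$

where I is the input feature map or input image, K is the convolution kernel of size k × k, M is the number of input channels, N is the number of output channels, and H and W are the height and width of the input feature map or input image, respectively. Depthwise separable convolution, in contrast, is made up of a depthwise and a pointwise convolution [26]. The output feature map F of the depthwise separable convolution is computed as:

$$\hat{F}_{x,y,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot I_{x+i-1,\,y+j-1,\,m}, \qquad F_{x,y,n} = \sum_{m} P_{m,n} \cdot \hat{F}_{x,y,m}$$

where $\hat{K}$ denotes the k × k depthwise kernels (one per input channel) and $P$ the 1 × 1 pointwise kernels. The number of parameters and the computational cost of the depthwise separable convolution are respectively computed as:

$$k \times k \times M + M \times N$$

and

$$k \times k \times M \times H \times W + M \times N \times H \times W$$

Comparing the parameters of the depthwise separable convolution with those of the standard convolution gives:

$$\frac{k \times k \times M + M \times N}{k \times k \times M \times N} = \frac{1}{N} + \frac{1}{k^{2}}$$

It can be seen that, for the 3 × 3 kernels used here, the depthwise separable convolution uses about 8 to 9 times fewer parameters than the standard convolution. Therefore, we can deepen and widen the network without causing an explosive increase in the number of parameters, and also enable the network to learn more contextual information.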
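The parameter comparison can be checked numerically with a few lines of arithmetic. The channel counts and feature-map size below are hypothetical values chosen only for illustration, and bias terms are ignored, as in the formulas above.

```python
def standard_conv_cost(k, M, N, H, W):
    """Parameters and multiplications of a standard k x k convolution (no bias)."""
    params = k * k * M * N
    return params, params * H * W


def separable_conv_cost(k, M, N, H, W):
    """Parameters and multiplications of a depthwise (k x k) plus pointwise (1 x 1) convolution."""
    params = k * k * M + M * N
    return params, params * H * W


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 240 x 240 feature map
k, M, N, H, W = 3, 64, 128, 240, 240
std_params, std_mults = standard_conv_cost(k, M, N, H, W)
sep_params, sep_mults = separable_conv_cost(k, M, N, H, W)
ratio = sep_params / std_params

print(std_params, sep_params)  # 73728 8768
print(round(1 / ratio, 2))     # 8.41, i.e. between 8 and 9 times fewer parameters
```

Because the ratio is 1/N + 1/k², the saving approaches a factor of k² = 9 as the number of output channels grows.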

Post-processing
The output of the network is a map with a resolution of 240 × 240. We use cubic interpolation to restore it to the ROI resolution of 480 × 480 (REFUGE) or 560 × 560 (DRISHTI-GS), and then apply morphological operations to smooth the edges. There are four basic operations in image morphology: erosion, dilation, opening and closing. Based on the prior knowledge that most optic discs and cups have an elliptical structure, we use the closing operation to join pixels that are connected only by thin boundaries and to fill the concave corners of the segmented region, so that the boundary of the segmented image becomes smoother. The closing operation can be expressed as follows:

$$f \bullet s = (f \oplus s) \ominus s$$

where f is the image, s is the structuring element, and ⊕ and ⊖ represent dilation and erosion, respectively. In our work, s is a 7 × 7 circular structuring element.
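The closing operation, dilation followed by erosion with the same structuring element, can be sketched on a small binary mask as follows. This is a simplified illustration and not the authors' code: the helper names are hypothetical, pixels outside the image are treated as background during erosion, and a radius-1 disk is used for the toy mask, whereas the 7 × 7 circular element of the paper would correspond to `disk(3)` under this discretization.

```python
def disk(radius):
    """Offsets (dr, dc) of a discrete circular structuring element."""
    return [(dr, dc)
            for dr in range(-radius, radius + 1)
            for dc in range(-radius, radius + 1)
            if dr * dr + dc * dc <= radius * radius]


def dilate(mask, offsets):
    """A pixel becomes 1 if any pixel under the structuring element is 1."""
    h, w = len(mask), len(mask[0])
    return [[int(any(0 <= r + dr < h and 0 <= c + dc < w and mask[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)]
            for r in range(h)]


def erode(mask, offsets):
    """A pixel stays 1 only if all pixels under the structuring element are 1."""
    h, w = len(mask), len(mask[0])
    # Positions outside the image count as background (assumption of this sketch)
    return [[int(all(0 <= r + dr < h and 0 <= c + dc < w and mask[r + dr][c + dc]
                     for dr, dc in offsets))
             for c in range(w)]
            for r in range(h)]


def closing(mask, offsets):
    """Morphological closing: (f dilated by s) eroded by s."""
    return erode(dilate(mask, offsets), offsets)


# 9 x 9 mask: a 5 x 5 square with a one-pixel hole in its centre
mask = [[1 if 2 <= r <= 6 and 2 <= c <= 6 else 0 for c in range(9)] for r in range(9)]
mask[4][4] = 0
closed = closing(mask, disk(1))
print(closed[4][4])  # 1: the hole has been filled, the outer boundary is unchanged
```

For full-size masks a library routine such as OpenCV's `cv2.morphologyEx` with `cv2.MORPH_CLOSE` is the usual choice; the explicit loops are only meant to show what the operation does.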

Loss function
In our work, we regard optic disc and optic cup segmentation as a multi-category segmentation task and use one-hot encoding to process the labels. Let $x \in \mathbb{R}^{C \times H \times W}$ be the input image and $y \in \{y_0, \ldots, y_i\}^{i \times H \times W}$ the one-hot representation of the ground truth label: $y_i = 1$ when the pixel belongs to category $i$, and $y_i = 0$ otherwise. We treat the output as three categories ($i = 3$), and the output of our model is a map $f_i(x, v) = y' \in \{p_0, \ldots, p_i\}^{i \times H \times W}$. We use the multi-class cross-entropy loss function to measure the difference between the output of the model and the ground truth label. The loss function Loss is defined as:

$$\mathrm{Loss} = -\sum_{i} y_i \log(p_i)$$

The output map $f_i(x, v)$ is a probability distribution, and each element of $\{p_0, \ldots, p_i\}^{i \times H \times W}$ represents the probability that the pixel belongs to the $i$-th category.
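A minimal numerical sketch of this loss for a handful of pixels is given below. The helper names, the small `eps` added for numerical safety, the averaging over pixels and the class order (0 = background, 1 = optic disc, 2 = optic cup) are assumptions of the example and are not stated in the text.

```python
import math


def softmax(logits):
    """Turn raw network outputs for one pixel into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs, eps=1e-12):
    """Loss = -sum_i y_i * log(p_i) for a single pixel."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))


def mean_pixel_loss(labels, logits_per_pixel, num_classes=3):
    """Average the per-pixel cross-entropy over all pixels."""
    losses = []
    for label, logits in zip(labels, logits_per_pixel):
        one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
        losses.append(cross_entropy(one_hot, softmax(logits)))
    return sum(losses) / len(losses)


# Three pixels, one per class, each predicted with the correct class favoured
labels = [0, 1, 2]
logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
loss = mean_pixel_loss(labels, logits)
print(round(loss, 4))  # 0.2395
```

In PyTorch, `torch.nn.CrossEntropyLoss` combines the softmax and the logarithm in a single call and takes class indices rather than one-hot vectors.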

Experiments and results
In this section, we first introduce the implementation details and then state the evaluation metrics. Finally, the experimental results are given and discussed.

Implementation detail
All experiments are implemented in Python with the PyTorch framework on a workstation with an Intel i7-8700K CPU, 16 GB RAM, an Nvidia 1080Ti GPU and Ubuntu 16.04. The Adam [27] optimizer and back-propagation are employed to train our model. The initial learning rate is set to 1e-4 and decreased by a factor of 10 every 4 epochs. We train for 30 epochs with a batch size of 8. Early stopping is adopted during training, and the best performing model is taken as the final model. We adopt a cross-validation scheme in which, after data augmentation, the training images are divided into 90% for training and 10% for validation. Both training and validation images are resized to a resolution of 240 × 240 and then fed to the network. Figure 5 shows the training process of our model on the DRISHTI-GS dataset. As can be seen from Fig. 5, our model has converged after 10,000 training iterations. The losses on the training set and the validation set are basically the same, indicating that our model converges easily. We have released our code on GitHub: https://github.com/iceyanGG/DDSC-NET.
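The step-wise learning-rate schedule described above can be written as a small function. The function name and the zero-based epoch counting are assumptions of this sketch; in PyTorch the same behaviour is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=4, gamma=0.1)`, although the text does not state which scheduler was actually used.

```python
def learning_rate(epoch, initial_lr=1e-4, step=4, factor=10.0):
    """Learning rate for a zero-based epoch: divided by `factor` every `step` epochs."""
    return initial_lr / (factor ** (epoch // step))


for epoch in (0, 3, 4, 8):
    print(epoch, learning_rate(epoch))
```

With these settings the rate stays at 1e-4 for epochs 0 to 3, drops to about 1e-5 for epochs 4 to 7, and so on for the remaining epochs.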

Evaluation metrics
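In our paper, we adopt the Dice coefficient (DC), Jaccard index (JAC), Sensitivity (SEN) and Precision (PRE) to evaluate the segmentation performance of the presented method. The criteria are defined as:

$$DC = \frac{2 N_{tp}}{2 N_{tp} + N_{fp} + N_{fn}}, \qquad JAC = \frac{N_{tp}}{N_{tp} + N_{fp} + N_{fn}}$$

$$SEN = \frac{N_{tp}}{N_{tp} + N_{fn}}, \qquad PRE = \frac{N_{tp}}{N_{tp} + N_{fp}}$$

where $N_{tp}$, $N_{fp}$ and $N_{fn}$ represent the number of true positive, false positive and false negative pixels, respectively.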
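The four metrics can be computed from a predicted and a ground-truth binary mask as in the following sketch. The function name and the toy masks are illustrative only, and the sketch assumes that neither mask is empty, since the ratios are otherwise undefined.

```python
def segmentation_metrics(pred, truth):
    """Dice, Jaccard, sensitivity and precision for two binary masks of equal size."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted foreground, really foreground
            elif p:
                fp += 1      # predicted foreground, really background
            elif t:
                fn += 1      # missed foreground pixel
    return {
        "DC": 2 * tp / (2 * tp + fp + fn),
        "JAC": tp / (tp + fp + fn),
        "SEN": tp / (tp + fn),
        "PRE": tp / (tp + fp),
    }


truth = [[1, 1, 0],
         [1, 1, 0],
         [0, 0, 0]]
pred = [[1, 1, 1],
        [1, 0, 0],
        [0, 0, 0]]
scores = segmentation_metrics(pred, truth)
print(scores)  # {'DC': 0.75, 'JAC': 0.6, 'SEN': 0.75, 'PRE': 0.75}
```

In the toy example there are 3 true positives, 1 false positive and 1 false negative, which gives the values shown in the comment.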

Experimental results
To verify the effectiveness of our algorithm, we perform a series of comparative experiments. Ablation experiments are first conducted to compare the performance with and without image pyramid input, multiple DDSC blocks, and post-processing in the model. Then, we compare the performance of our method with some state-of-the-art deep learning-based methods. Finally, we compare our segmentation results with the results of the REFUGE challenge. The testing images of the DRISHTI-GS and REFUGE datasets are used to evaluate the performance of our model, and the final evaluation scores are the averages over all testing images of the respective dataset. The final experimental results of the proposed method are shown in Table 3.

Ablation experiment results
In the ablation experiments, four comparative experiments were conducted to verify the effectiveness of our proposed method: DDSC-Net without image pyramid input, a simple network with only one DDSC block in each layer, the full DDSC-Net, and the DDSC-Net with post-processing. Table 4 summarizes the ablation results of the OD and OC segmentation on the DRISHTI-GS and REFUGE datasets. From Table 4 we can see that both the image pyramid input and the multiple DDSC blocks improve the performance of the model, and that the proposed post-processing method further improves the segmentation results. Specifically, on the DRISHTI-GS dataset, when the image pyramid is employed as the input of the DDSC-Net, the DC scores of the model are 0.36% (OD) and 0.47% (OC) higher than those of DDSC-Net without image pyramid input, and they also exceed those of the simple network, which has only one DDSC block in each layer, by 0.33% (OD) and 1.8% (OC). When post-processing is applied, the segmentation performance improves further. For example, on the REFUGE dataset, the DC scores increase by 0.09% (OD) and 0.09% (OC).

Fig. 6 Visual examples of OD and OC segmentation on the DRISHTI-GS dataset. The first row is the input image, the second and fourth rows are the ground truth of the OD and the OC, respectively, and the third and fifth rows are the OD and OC segmentation results of our model. Black denotes the background, and white denotes the OD and OC segmentations.

Fig. 7 Some ablation experiment results on the REFUGE dataset. The first row is the input image, the second row is the ground truth, the third row is the segmentation result of the simple DDSC-Net, the fourth row is the result of the DDSC-Net without image pyramid input, the fifth row is the result of DDSC-Net, and the sixth row is the post-processed result. The white region denotes the background, and the gray and black regions denote the OD and OC segmentations.
Some visual examples are shown in Fig. 7. We selected 8 representative images among the 400 REFUGE test images to show the segmentation results. The first two rows are the cropped fundus images and ground truth, and the remaining 4 rows are the experimental results of 4 comparative experiments. As can be seen from Fig. 7, using image pyramid input to replace standard input can effectively improve model performance, and using multiple DDSC blocks can also improve model performance.

Compared with deep learning-based methods
To verify that our method performs better than other deep learning-based methods, we compare it with state-of-the-art approaches such as the pOSAL framework [28], GL-Net [29], M-Net [16], Stack-U-Net [30], WGAN [31], two-stage Mask R-CNN [32], a multi-modal self-supervised pre-training network [33], Yu et al. [34] and Sevastopolsky [10]. Additionally, we compare with the fully convolutional network U-Net [15]. The segmentation results of these deep learning-based methods on the testing sets of the DRISHTI-GS and REFUGE datasets are shown in Tables 5 and 6, respectively.
Since our network structure is a modification of the U-Net architecture, we first compare our model with other methods that also use a U-shaped network, starting with the original U-Net on the DRISHTI-GS dataset. From Table 5, we can see that the Dice coefficients of the proposed method are 2.8% and 9.23% higher than those of the original U-Net for the optic disc and the optic cup, respectively. Among the other methods that are also based on the U-Net structure [10,16,30,34], our method outperforms the best performing one, proposed by Yu et al. [34], by 0.42% in OD segmentation and by 2.46% in the more difficult OC segmentation task. On the REFUGE dataset, our method is also better than M-Net [16]. From Table 6 we can see that our method is 1.65% higher for the optic disc and 5.88% higher for the optic cup than M-Net. GAN-based methods and other deep learning networks have also achieved satisfactory results in the OD and OC segmentation task. Therefore, we then compare with methods that use generative adversarial ideas, namely the pOSAL framework [28], GL-Net [29] and WGAN [31], and with other deep learning methods [32,33]. From Tables 5 and 6, we can see that our proposed method still performs better than these methods. In particular, in the optic disc and cup segmentation results on the DRISHTI-GS dataset, our model outperforms the state-of-the-art method GL-Net by 0.7% and 0.73% in terms of Dice coefficients, respectively. These comparative results demonstrate that our proposed method can effectively improve the segmentation accuracy of the OC and obtain competitive results in the segmentation of the OD.

Compared with REFUGE challenge
We also compared our segmentation results with the results of the REFUGE challenge. The segmentation results of the 12 participating teams are shown in Table 7. In terms of the DC metric, our method achieves the best result in the segmentation of the cup, which is 0.66% higher than the best challenge result of 0.8837. In the segmentation of the optic disc, we achieve the second-best result, showing strong competitiveness.
From the above, it can be concluded that our method can effectively segment the optic disc and the optic cup, especially in the OC segmentation task, and achieves state-of-the-art segmentation performance on the DRISHTI-GS and REFUGE datasets.

Discussion
In this paper, a densely connected depthwise separable convolution deep network for joint OD and OC segmentation is proposed. From the experimental results and comparative experiments above, we can conclude that the multi-scale image pyramid input and the densely connected depthwise separable network perform OD and OC segmentation effectively. The correct segmentation of the optic disc and optic cup is essential for calculating the CDR, which helps clinicians diagnose glaucoma more effectively. Because the OC is located inside the OD and, unlike the OD, has no obvious boundary, OC segmentation has always been the more challenging part of the OD and OC segmentation task. Previously proposed methods could not obtain satisfactory OC segmentation results, whereas the network we designed effectively improves them. According to our experimental results, using DDSC blocks to deepen the network and enhancing the fusion of contextual semantic information through dense connections and skip connections effectively improves the segmentation. Compared with the current best OD and OC segmentation networks, our method obtains state-of-the-art segmentation results on both public datasets. Our DDSC module is formed by densely connected depthwise separable convolutions, which have 8 to 9 times fewer parameters than standard convolutions. Therefore, the advantage of the DDSC module is that it can deepen the network and learn deeper semantic information for feature fusion without an explosive increase in the number of network parameters. Although our method has achieved the best OD and OC segmentation results on the testing images of two public datasets, it still has limitations. For fundus images acquired with different equipment or at different institutions, the trained model parameters cannot provide stable segmentation. To apply our method to the fundus images of a new dataset, the network parameters need to be retrained. Improving the generalization of the model so that it stably segments fundus images from different datasets is one of our future research directions.

Conclusion
Research on joint OD and OC segmentation is an important part of the field of medical image processing and is of great importance to computer-aided glaucoma technologies. In this paper, we explore a novel automatic OD and OC segmentation method based on deep learning techniques to improve the performance of CAD systems for diagnosing glaucoma, which can be applied to assist clinicians in the diagnosis of glaucoma and in population screening. The proposed network employs densely connected depthwise separable convolutions as the backbone and adds a multi-scale image pyramid at the input end to widen the network. Finally, image morphology is employed to post-process the segmentation results. To verify the effectiveness of our proposed DDSC-Net, we conduct ablation experiments and compare our method with previous methods on the DRISHTI-GS and REFUGE datasets. The experimental results show that our model outperforms the state-of-the-art methods on both datasets, which indicates its potential for improving the accuracy of CAD systems in diagnosing glaucoma. In future work, we will try to apply our method to other medical segmentation tasks, such as retinal vessel segmentation and liver lesion segmentation. In addition, we will pay more attention to the problem of domain shift between different datasets, so as to improve the generalization performance of the proposed method.