The proposed method uses a two-stage approach to segment the optic disc and cup. In the first stage, a CNN and Hough circle detection are used to obtain the center coordinates of the optic disc and extract the ROI. In the second stage, the ROI is fed into a high-precision segmentation network to obtain accurate segmentation results for the optic disc and cup. The proposed method is trained and evaluated separately on the DRISHTI-GS [17] and REFUGE [18] datasets. The overall flowchart of the proposed method is shown in Fig. 1. The datasets and the framework are described in the following subsections.

### Dataset

The DRISHTI-GS dataset contains 101 retinal fundus images collected at Aravind Eye Hospital, Madurai. The images have a resolution of \(2047\times 1759\) and are stored in uncompressed PNG format. The ground truth was marked by 4 ophthalmologists with different levels of clinical experience, and the dataset is divided into 50 training and 51 testing images. The Retinal Fundus Glaucoma Challenge (REFUGE) dataset contains 1200 images, comprising 120 glaucomatous and 1080 non-glaucomatous images. It is divided into three parts: 400 training images, 400 validation images and 400 testing images; the validation and testing images are acquired with the same cameras. Brief information about the two datasets is given in Table 1.

In our method, the 50 training images of the DRISHTI-GS dataset are used to train the proposed model, and the other 51 testing images are used to evaluate the final trained model. Similarly, the 800 images from the training and validation sets of the REFUGE dataset are used for training, and the 400 images of the testing set are used to evaluate the final model trained on REFUGE.

### Image processing and data augmentation

The training sets are small (for example, the DRISHTI-GS dataset has only 50 training images), and training a network on too little data may lead to overfitting, so we use data augmentation to expand the training images. The augmentation methods include translation, rotation, noise addition, and brightness adjustment. The training images of the DRISHTI-GS dataset are expanded to 5250, and those of the REFUGE dataset to 30,000. Specifically, 90\(\%\) of the augmented training images are randomly selected to train the proposed model, and the remaining 10\(\%\) are used to evaluate the model during training. For example, when the DRISHTI-GS dataset is used to train the segmentation network, 4725 of the 5250 images are used for training and the other 525 for evaluation during the training process.
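The random 90/10 split of the augmented images can be sketched as follows. This is only an illustration: the function name `split_train_val`, the fixed seed, and the use of index lists are assumptions and are not specified in the paper.

```python
import random

def split_train_val(n_images, train_fraction=0.9, seed=0):
    """Randomly split image indices into a training part and an evaluation part."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)      # reproducible shuffle (seed is an assumption)
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]

# 5250 augmented DRISHTI-GS images, as stated in the text
train_idx, val_idx = split_train_val(5250)
print(len(train_idx), len(val_idx))  # 4725 525
```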

### ROI extraction network

A complete fundus image taken by a professional camera generally has a high resolution, while the region of interest occupies only a small part of it. Locating and cropping the region of interest therefore reduces the interference of unnecessary background information, improves the segmentation accuracy, and reduces the amount of computation. However, methods that rely on green-channel images [8] or morphological operations [19] to detect the optic disc are sensitive to differences between imaging devices, image quality, brightness, internal blood vessels, and lesions, which results in low localization accuracy. In our work, we use a CNN to extract features to solve this problem. The model and the segmentation process are shown in Fig. 2. At this stage, we design a simple convolutional neural network to coarsely segment the optic disc, and then use the Circular Hough Transform (CHT) [20] to calculate the center of the optic disc. With this method, we locate the optic disc with 100\(\%\) accuracy and crop the ROI. The localization results are shown in Table 2.

CHT is an extension of the Hough transform (HT) [21] that is mainly used to detect circular objects in an image. For circle detection, the HT is based on the equation of a circle, defined as:

$$\begin{aligned} \left( x-a\right) ^{2}+\left( y-b\right) ^{2}=r^{2} \end{aligned}$$

where (*a*, *b*) are the coordinates of the circle center and *r* is the radius. The center coordinates can be obtained by performing the CHT on the image, which can be defined as:

$$\begin{aligned} \left( P_{c}, r\right) =C H T\left( I, r_{\min }, r_{\max }\right) \end{aligned}$$

where \(P_{c}=(i_c,j_c)\) and *r* are the center position and the radius, respectively, of the circle with the highest score in the CHT accumulator, and *I* is the input image. The radius *r* is restricted to lie between \(r_{\min }\) and \(r_{\max }\); in our method we set \(r_{\min }=40\) and \(r_{\max }=160\). After obtaining the coordinates of the disc center, we use it as the center point to crop the original image to a patch with a resolution of \(480\times 480\) on the REFUGE dataset and \(560\times 560\) on the DRISHTI-GS dataset. The cropped image contains the optic disc, the optic cup and some background. Examples of ROI extraction results are shown in Fig. 3.
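To make the voting idea behind the CHT and the subsequent cropping concrete, the sketch below implements a brute-force accumulator in NumPy and a clamped square crop. It is an illustration under stated assumptions, not the implementation used in this work: the function names `circular_hough` and `crop_roi`, the half-pixel voting tolerance, the clamping of the crop window to the image border, and the small synthetic ring are all hypothetical. The brute-force search is practical only for small images; a library routine would normally be used at full resolution.

```python
import numpy as np

def circular_hough(edge_mask, r_min, r_max):
    """Return ((a, b), r) for the circle with the most votes.

    Every pixel is a candidate centre (a, b). An edge pixel votes for
    (a, b, r) when it lies within half a pixel of that circle.
    """
    h, w = edge_mask.shape
    ys, xs = np.nonzero(edge_mask)
    bb, aa = np.mgrid[0:h, 0:w]               # bb: row (y), aa: column (x)
    # dist[c, p] = distance from candidate centre c to edge pixel p
    dist = np.hypot(aa.reshape(-1, 1) - xs[None, :],
                    bb.reshape(-1, 1) - ys[None, :])
    best_votes, best_center, best_r = -1, (0, 0), r_min
    for r in range(r_min, r_max + 1):
        votes = (np.abs(dist - r) < 0.5).sum(axis=1)
        idx = int(np.argmax(votes))
        if votes[idx] > best_votes:
            best_votes = int(votes[idx])
            best_center = (int(aa.flat[idx]), int(bb.flat[idx]))
            best_r = r
    return best_center, best_r

def crop_roi(image, center, size):
    """Crop a size x size window around center=(x, y), clamped to the image."""
    cx, cy = center
    h, w = image.shape[:2]
    half = size // 2
    x0 = min(max(cx - half, 0), max(w - size, 0))
    y0 = min(max(cy - half, 0), max(h - size, 0))
    return image[y0:y0 + size, x0:x0 + size]

# Synthetic test: a one-pixel-wide ring of radius 10 centred at x=22, y=18
yy, xx = np.mgrid[0:41, 0:41]
ring = np.abs(np.hypot(xx - 22, yy - 18) - 10) < 0.5
center, radius = circular_hough(ring, r_min=8, r_max=12)
roi = crop_roi(ring, center, size=24)
print(center, radius, roi.shape)  # (22, 18) 10 (24, 24)
```

In the setting described above, the input would be the coarse optic disc segmentation (or its edge map), with `r_min=40`, `r_max=160`, and a crop size of 480 or 560.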

### DDSC network architecture

In object detection networks [22,23,24], features extracted from shallow layers can be used to detect small objects, while features extracted from deep layers can be used to detect large objects. We adopt these ideas to design our network for optic disc and optic cup segmentation. Considering the prior knowledge that the optic cup is located inside the optic disc, we use dense and skip connections to make full use of the context semantics of both shallow and deep layers. The proposed network structure is shown in detail in Fig. 4. The proposed deep network, named DDSC-Net, consists of three main parts. The first part is the image pyramid [25], which serves as the multi-scale input so that the network receives image information at different scales. Multi-scale input mitigates the loss of image information as the network gets deeper. The second part is a U-shaped fully convolutional network with an encoder module on the left and a decoder module on the right. Finally, the output map is activated by the softmax function, and the cross-entropy loss is used to measure the difference between the segmentation result and the ground truth.

#### Image pyramid multi-scale input

The input of the DDSC-Net is an image pyramid, which can effectively improve the segmentation quality of the network. We use average pooling layers to build the image pyramid, whose levels are then fed into different layers of the encoder module. The advantages are as follows: (1) it avoids a large increase in network parameters; (2) it increases the network width of the decoder path; and (3) it reduces the loss of information caused by deepening the network.
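Building an image pyramid by repeated average pooling can be sketched in NumPy as below. The helper names `avg_pool2x2` and `build_pyramid`, the \(2\times 2\) pooling window, and the choice of three levels are assumptions made for illustration; the text does not state the number of pyramid levels.

```python
import numpy as np

def avg_pool2x2(img):
    """2x2 average pooling with stride 2 on an (H, W) or (H, W, C) array."""
    h2, w2 = img.shape[0] // 2, img.shape[1] // 2
    img = img[:h2 * 2, :w2 * 2]               # drop an odd last row/column
    blocks = img.reshape(h2, 2, w2, 2, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

def build_pyramid(img, levels=3):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(avg_pool2x2(pyramid[-1]))
    return pyramid

image = np.arange(64, dtype=float).reshape(8, 8)
pyramid = build_pyramid(image, levels=3)
print([level.shape for level in pyramid])  # [(8, 8), (4, 4), (2, 2)]
```

Each level would be passed to the encoder stage whose feature map has the same spatial size.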

#### DDSC network

Inspired by U-net [15], a fully convolutional network with a U-shaped structure that uses skip connections for feature fusion at each stage, we design the DDSC network based on the U-net structure (see Fig. 4). The DDSC network consists of an encoder and a decoder connected by skip connections. Specifically, the encoder extracts the high-level semantic features of the input image, and the decoder restores the features extracted by the encoder to the resolution of the original image. Skip connections fuse multi-scale features between the encoder and the decoder. Unlike the original U-net, our network replaces most of the standard convolutional layers with depthwise separable convolution layers, which significantly reduces the amount of computation. This saving allows us to design a deeper network that learns more feature information from the input data, especially the semantics of the optic cup. In addition, we add more skip connections between the encoder and the decoder to enhance the transfer of contextual feature information. The DDSC network is composed of three kinds of components: densely connected depthwise separable convolution blocks, subsampling layers and upsampling layers. A dense depthwise separable convolution (DDSC) block contains five densely connected layers, each consisting of a batch normalization layer, a rectified linear unit (ReLU) activation function, and a depthwise separable convolution layer with a kernel size of \(3\times 3\). The subsampling layer is a max pooling layer with a kernel size of 2 and a stride of 2, and the upsampling layer is a \(3\times 3\) transposed convolution layer.

For a standard convolution with a stride of one and a padding of one, the output feature map *F* is computed as:

$$\begin{aligned} F_{k, l, n}=\sum _{i, j, m} K_{i, j, m, n} \cdot I_{k+i-1, l+j-1, m} \end{aligned}$$

The parameters and computational cost of the standard convolutions are respectively computed as:

$$\begin{aligned} k \times k \times M \times N \end{aligned}$$

and

$$\begin{aligned} k \times k \times M \times N \times H \times W \end{aligned}$$

where *I* is the input feature map or input image, *K* is the convolution kernel of size \(k \times k\), *M* is the number of input channels, *N* is the number of output channels, and *H* and *W* are the height and width of the input feature map or input image, respectively. A depthwise separable convolution, in contrast, is composed of a depthwise and a pointwise convolution [26]. The depthwise step applies one \(k \times k\) filter \(K^{\prime }\) to each input channel *m*, giving the intermediate feature map:

$$\begin{aligned} F_{k, l, m}^{\prime }=\sum _{i, j} K_{i, j, m}^{\prime } \cdot I_{k+i-1, l+j-1, m} \end{aligned}$$

The pointwise step then combines the *M* channels of \(F^{\prime }\) with a \(1 \times 1\) convolution to produce the *N* output channels.

The number of parameters and the computational cost of the depthwise separable convolution are, respectively:

$$\begin{aligned} k \times k \times M + M \times N \end{aligned}$$

and

$$\begin{aligned} k \times k \times M \times H \times W+M \times N \times H \times W \end{aligned}$$

The ratio of the number of parameters of the depthwise separable convolution to that of the standard convolution is:

$$\begin{aligned} \frac{k \times k \times M + M \times N }{ k \times k \times M \times N }=\frac{1}{N}+\frac{1}{k^{2}} \end{aligned}$$

With the \(3\times 3\) kernels used here (\(k^{2}=9\)), the ratio is close to \(1/9\) plus a small \(1/N\) term, so the depthwise separable convolution uses about 8 to 9 times fewer parameters than the standard convolution. Therefore, we can deepen and widen the network without an explosive increase in the number of parameters, which also enables the network to learn more contextual information.
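The relation between the two convolution types and the parameter ratio can be checked numerically with the NumPy sketch below. It is a minimal illustration, not the network code: the function names, the channel-last array layout, zero padding, and the example sizes (\(M=64\), \(N=128\)) are assumptions. The check uses the fact that a depthwise filter followed by a pointwise \(1\times 1\) filter equals a standard convolution whose kernel factorizes as \(K_{i,j,m,n}=K^{\prime }_{i,j,m}\,P_{m,n}\), where *P* denotes the pointwise weights.

```python
import numpy as np

def standard_conv(I, K):
    """Standard convolution, stride 1, zero padding. I: (H, W, M), K: (k, k, M, N)."""
    k = K.shape[0]
    H, W, _ = I.shape
    p = k // 2
    Ip = np.pad(I, ((p, p), (p, p), (0, 0)))
    F = np.zeros((H, W, K.shape[3]))
    for i in range(k):
        for j in range(k):
            F += Ip[i:i + H, j:j + W, :] @ K[i, j]    # sum over input channels m
    return F

def depthwise_separable_conv(I, Kd, P):
    """Depthwise step with Kd: (k, k, M), then pointwise step with P: (M, N)."""
    k = Kd.shape[0]
    H, W, M = I.shape
    p = k // 2
    Ip = np.pad(I, ((p, p), (p, p), (0, 0)))
    Fd = np.zeros((H, W, M))
    for i in range(k):
        for j in range(k):
            Fd += Ip[i:i + H, j:j + W, :] * Kd[i, j]  # one filter per channel
    return Fd @ P                                     # 1x1 convolution across channels

def standard_params(k, M, N):
    return k * k * M * N

def separable_params(k, M, N):
    return k * k * M + M * N

rng = np.random.default_rng(0)
I = rng.normal(size=(5, 6, 3))
Kd = rng.normal(size=(3, 3, 3))
P = rng.normal(size=(3, 4))
K = Kd[:, :, :, None] * P[None, None, :, :]           # factorized standard kernel
same = np.allclose(standard_conv(I, K), depthwise_separable_conv(I, Kd, P))
ratio = separable_params(3, 64, 128) / standard_params(3, 64, 128)
print(same, round(ratio, 4))  # True 0.1189
```

For these example sizes the ratio equals \(1/128 + 1/9\), that is, roughly 8.4 times fewer parameters.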

### Post-processing

The output of the network is a map with a resolution of \(240\times 240\). We use cubic interpolation to restore it to \(480\times 480\) on the REFUGE dataset and \(560 \times 560\) on the DRISHTI-GS dataset, and then apply morphological operations to smooth the edges. Image morphology has four basic operations: erosion, dilation, opening and closing. Based on the prior knowledge that most optic discs and cups are elliptical, we use the closing operation to merge pixels joined by thin boundary connections and to fill concave corners, which makes the boundary of the segmented region smoother. The closing operation can be expressed as:

$$\begin{aligned} F=(f \oplus s) \ominus s \end{aligned}$$

where *f* is the image, *s* is the structuring element, and \(\oplus\) and \(\ominus\) denote dilation and erosion, respectively. In our work, *s* is a \(7\times 7\) circular structuring element.
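The closing operation, a dilation followed by an erosion with the same structuring element, can be sketched for a binary mask in pure NumPy as below. The helper names, the exact shape of the \(7\times 7\) disk, and the border handling (background padding for dilation, foreground padding for erosion) are assumptions for illustration; an image processing library would normally be used in practice.

```python
import numpy as np

def disk(size=7):
    """Boolean size x size circular structuring element (exact shape is an assumption)."""
    r = size // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    return xx ** 2 + yy ** 2 <= r ** 2

def dilate(mask, se):
    r = se.shape[0] // 2
    h, w = mask.shape
    padded = np.pad(mask, r, constant_values=False)
    out = np.zeros_like(mask)
    for dy, dx in zip(*np.nonzero(se)):
        out |= padded[dy:dy + h, dx:dx + w]   # OR over all offsets in the element
    return out

def erode(mask, se):
    r = se.shape[0] // 2
    h, w = mask.shape
    padded = np.pad(mask, r, constant_values=True)
    out = np.ones_like(mask)
    for dy, dx in zip(*np.nonzero(se)):
        out &= padded[dy:dy + h, dx:dx + w]   # AND over all offsets in the element
    return out

def closing(mask, se):
    """F = (f dilated by s) eroded by s."""
    return erode(dilate(mask, se), se)

# A filled square with a one-pixel hole; closing fills the hole
mask = np.zeros((21, 21), dtype=bool)
mask[5:16, 5:16] = True
mask[10, 10] = False
closed = closing(mask, disk(7))
print(closed[10, 10], closed[0, 0])  # True False
```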

### Loss function

In our work, we regard optic disc and optic cup segmentation as a multi-class segmentation task and use one-hot encoding to process the labels. Let \(x \in R^{C \times H \times W}\) be the input image and \(y \in \left\{ y_{0}, \ldots , y_{i}\right\} ^{i \times H \times W}\) be the one-hot representation of the ground truth label: when a pixel belongs to category *i*, \(y_{i}=1\); otherwise, \(y_{i}=0\). We treat the output as 3 categories (\(i=3\)), and the output of our model is a map \(f_{i}(x, v)=y^{\prime } \in \left\{ p_{0}, \ldots , p_{i}\right\} ^{i \times H \times W}\). We use the multi-class cross-entropy loss function to measure the difference between the output of the model and the ground truth label. The loss function is defined as:

$$\begin{aligned} {Loss}=-\sum _{i=0}^{2} y_{i} \log \left( f_{i}(x, v)\right) \end{aligned}$$

The output map \(f_{i}(x, v)\) is a probability distribution, and each element of \(\left\{ p_{0}, \ldots , p_{i}\right\} ^{i \times H \times W}\) represents the probability that the pixel belongs to the \(i\)-th category.
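A NumPy sketch of the softmax output and the multi-class cross-entropy over three categories is given below. Several details are assumptions, since the text does not specify them: the class order (0 for background, 1 for optic disc, 2 for optic cup), the averaging of the per-pixel loss over the image, the small constant `eps` added for numerical safety, and all function names.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def one_hot(labels, n_classes=3):
    """Integer label map (H, W) -> one-hot array (n_classes, H, W)."""
    return (np.arange(n_classes)[:, None, None] == labels[None]).astype(float)

def cross_entropy(probs, y, eps=1e-12):
    """Per-pixel -sum_i y_i log(p_i), averaged over pixels (averaging is an assumption)."""
    per_pixel = -(y * np.log(probs + eps)).sum(axis=0)
    return per_pixel.mean()

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=(4, 4))      # assumed: 0 background, 1 disc, 2 cup
y = one_hot(labels)

uniform_loss = cross_entropy(softmax(np.zeros((3, 4, 4))), y)
confident_loss = cross_entropy(softmax(20.0 * y), y)
print(np.isclose(uniform_loss, np.log(3)), confident_loss < 1e-6)  # True True
```

An uninformative prediction gives a loss of \(\log 3\) per pixel, while a confident correct prediction drives the loss toward zero.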