Hepatic vessel segmentation based on 3D swin-transformer with inductive biased multi-head self-attention

Purpose Segmentation of liver vessels from CT images is indispensable prior to surgical planning and has aroused broad interest in the medical image analysis community. Due to the complex structure and low-contrast background, automatic liver vessel segmentation remains particularly challenging. Most related research adopts FCN, U-net, and V-net variants as a backbone. However, these methods mainly focus on capturing multi-scale local features, which may produce misclassified voxels due to the convolutional operator's limited receptive field. Methods We propose a robust end-to-end vessel segmentation network called Inductive BIased Multi-Head Attention Vessel Net (IBIMHAV-Net) by expanding the swin transformer to 3D and employing an effective combination of convolution and self-attention. In practice, we introduce voxel-wise embedding rather than patch-wise embedding to locate precise liver vessel voxels and adopt multi-scale convolutional operators to gain local spatial information. In addition, we propose the inductive biased multi-head self-attention, which learns an inductively biased relative positional embedding from an initialized absolute positional embedding, yielding more reliable query and key matrices. Results We conducted experiments on the 3DIRCADb dataset. The average dice and sensitivity of the four tested cases were 74.8% and 77.5%, which exceed the results of existing deep learning methods and an improved graph cuts method. The Branches Detected (BD) / Tree-length Detected (TD) indexes also demonstrated better global/local feature capture than other methods. Conclusion The proposed model IBIMHAV-Net provides automatic, accurate 3D liver vessel segmentation with an interleaved architecture that better utilizes both global and local spatial features in CT volumes. It can be further extended to other clinical data.


Background
CT liver vessel segmentation is essential for 3D visualization, path planning, and guidance in interventional liver surgery [27,26]. However, vessels and the liver background show similar intensity values on CT images due to their similar enhancement characteristics. Vessels are curvy, twisted, occlude one another, and are sometimes seriously distorted by liver tumors. Due to this intensity similarity and the complex structure of liver vessels, accurate liver vessel segmentation remains challenging. Nowadays, accurate liver vessel segmentation relies heavily on doctors' manual segmentation, which is hugely time-consuming and subject to the experience and skills of the experts [5].
Therefore, automatic vessel segmentation has triggered broad discussion in the community. Even though some deep learning methods have achieved great success on organ segmentation tasks, they do not perform well in vessel segmentation due to the considerable variation of vessel structures and the imbalance between background and vessels. Most recent works are designed based on FCN [19], U-net [24], and V-net [21] variants. They rely heavily on convolution layers, which integrate multi-scale local information to obtain passable results. Yet the convolution's limited receptive field lacks long-range dependencies and sufficient global features, so it can hardly distinguish variant vessel margins accurately or segment minor vessels. Therefore, developing a liver vessel segmentation method that adds long-range dependencies and utilizes global spatial features is necessary.

Related work
Current liver vessel segmentation methods can be roughly classified into traditional region-based methods, edge-based segmentation methods, and deep learning-based methods. As region-based methods do not perform well in vessel segmentation, we review the most related work in the latter two categories. Since we use a transformer as our backbone, we also review the newest transformer-related work. For a more comprehensive literature survey, refer to [7].

Traditional methods
Edge-based methods can be further classified into image filtering and enhancement algorithms and tracking-based algorithms [22]. Filtering and enhancement algorithms first reduce noise with a common filtering process, then enhance the vessels by applying image gradients or multi-scale high-order derivatives, particularly the second derivatives of the angiographic images, to extract high-frequency information [20,16]. Besides, Pamulapati et al. [23] introduced a vessel segmentation method based on a medial-axis enhancement filter. Tracking-based algorithms focus on predefined vessel models and track the minimum-cost path. Friman et al. [9] proposed tracking many hypothetical vessel trajectories at the same time, which improved results in low-contrast conditions. Cetin et al. [3] and Cetin and Unal [2] presented tubular structure segmentation methods, which utilized a second-order tensor from directional intensity measurements and employed a higher-order, cylindrical flux-based tensor to construct the vascular structure.

Deep learning-based methods
Most deep learning-based liver vessel segmentation works rely on CNN-based architectures, specifically U-net [24] and its variants, with a few attempts using FCN [19] and V-net [21]. Chronologically, early vessel segmentation methods, such as retinal vessel segmentation, were based on 2D approaches. Later, as segmentation targets moved to 3D images, 3D methods became the mainstream. Fu et al. [10] and Li et al. [17] proposed segmentation methods for retinal vessels in 2D images. These methods can handle small objects in a 2D slice; however, vessel segmentation of the liver, brain, or lung is a volumetric task. Most 2D methods cannot transfer to 3D images directly because they omit the essential spatial continuity along the Z-axis. Therefore, current state-of-the-art solutions for liver vessel segmentation focus on 2D multi-path (2.5D) and 3D methods. Kitrungrotsakul et al. [15] proposed three DenseNets with shared kernels that fit patches resampled from three planes (sagittal, coronal, and transverse) of the IRCADb dataset, a so-called 2.5D method. Çiçek et al. [6] extended U-Net from 2D images to volumes, fusing multi-scale 3D convolution features (3D-UNet). To employ the 3D representation of liver vessel features, Huang et al. [12] proposed a 3D-UNet variant that fit the problem well; their evaluation on incomplete IRCADb annotations further improved results. Yu et al. [31] added residual modules into the 3D-UNet to provide more residual features. Xu et al. [29] employed a 3D-FCN frame for this task. However, a reasonable supervised deep network model has to be trained on a large dataset with high-quality labels, and noisy labels in current datasets hurt model performance. Lately, Yan et al. [30] proposed fusing self-attention into 3D U-net, a notable attempt that improved segmentation details.

Vision transformers and 2D swin transformer
The self-attention mechanism allows transformers to dynamically extract the important features of word sequences and learn their long-range dependencies. This notion has recently been extended to computer vision by the vision transformer (ViT) [8], which targets image recognition. By taking 2D image patches with positional embeddings as input and pre-training on large classical datasets, ViT achieved results comparable with CNN-based methods. In medical image tasks, more recent methods [32,4] enjoy the benefits of both CNNs and transformers. Chen et al. [4] first utilized CNNs to extract low-level local features and transformers to capture global interactions. Based on the shifted-window mechanism, Liu et al. [18] proposed the Swin transformer, which learns hierarchical object concepts at different scales by applying appropriate downsampling to feature maps, achieving state-of-the-art semantic segmentation. Inspired by the swin transformer, Swin-Unet [1] first employed hierarchical transformer blocks with integrated encoder and decoder in a U-shaped architecture, improving TransUNet's results on medical multi-organ segmentation tasks. For 3D segmentation, Karimi et al. [14] tentatively replaced 3D convolutional operators with transformers as the model backbone: they split a local volume block into 3D patches, embedded them into a 1D sequence, and processed them through ViT's self-attention design. Compared to these methods, our IBIMHAV-Net inherits the advantages of convolution in encoding precise spatial information and uses inductive biased self-attention in hierarchical representations, which helps to overcome the connectivity and variance problems of liver vessel segmentation.

Proposed method
Motivated by the existing 2D swin transformer [1,18] and past vision transformer attempts [4,8,11], we propose a transformer-based architecture for volumetric liver vessel segmentation that better utilizes global features and long-range dependencies. The main advantages and contributions of the proposed method are as follows: 1. We propose a network architecture that expands the swin transformer to 3D and combines convolution and self-attention to play to their strengths. For self-attention, global spatial information is encoded by the embedding, and long-range dependencies are entangled by our designed 3D transformer block. For convolution, multi-scale convolutions in the local feature path and the downsampling/upsampling layers help encode precise local information and capture hierarchical resolution features.
2. We introduce voxel-wise rather than patch-wise embedding as the initial transformer input to fully utilize volumetric information, turning volumetric prediction into sequence-to-sequence prediction over hierarchical resolution features.
3. We propose Inductive Biased Multi-head Self-Attention (IB-MSA), which changes the positional embedding: it learns a biased relative positional embedding initialized from an absolute 1-dimensional embedding in the transformer blocks. This dramatically improves liver vessel segmentation results.

Methodology
The proposed method starts with dataset preprocessing. Then we introduce the architecture of our framework, the Inductive BIased Multi-Head Attention Vessel Net (IBIMHAV-Net), including the details of our 3D transformer design and the inductive biased multi-head attention mechanism. Finally, we describe post-processing, which removes some discrete inaccurate results.

Preprocessing
Preprocessing plays an essential role and affects the segmentation results significantly [12,13], for example by lowering background noise and increasing image contrast. We therefore arrange preprocessing in four steps: (1) 3DIRCADb provides 20 groups of CT images, liver vessel masks, and liver masks. We crop the CT images and liver vessel masks to the liver region boundary as the ROI, then adjust the size to 256x256x192 to unify the model input. (2) We truncate the intensity of all voxels in the volumes to the range [-50, 250] HU to remove irrelevant details and increase image contrast. (3) To retain enough vessel continuity features, we add the vessel mask outside the liver as a supplement of vessel information (e.g., Fig. 1). (4) Images are normalized to zero mean and unit variance. Because most liver vessels are quite small, keeping images at their original resolution prevents artifact errors caused by resampling.
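As a rough illustration, the four steps can be sketched as follows (NumPy/SciPy); the cropping and resizing details here are our reading of the description, not released code:

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(ct, liver_mask, vessel_mask, target=(256, 256, 192)):
    # (1) Crop CT and vessel mask to the liver bounding box (ROI), then
    #     resize to the fixed model input shape.
    idx = np.argwhere(liver_mask > 0)
    lo, hi = idx.min(axis=0), idx.max(axis=0) + 1
    sl = tuple(slice(a, b) for a, b in zip(lo, hi))
    roi, lab = ct[sl].astype(np.float32), (vessel_mask[sl] > 0).astype(np.float32)
    f = [t / s for t, s in zip(target, roi.shape)]
    roi, lab = zoom(roi, f, order=1), zoom(lab, f, order=0)
    # (2) Truncate intensities to [-50, 250] HU.
    roi = np.clip(roi, -50, 250)
    # (3) Supplementary extra-hepatic vessel voxels are assumed to have been
    #     merged into `vessel_mask` upstream (cf. Fig. 1).
    # (4) Normalize to zero mean and unit variance.
    roi = (roi - roi.mean()) / (roi.std() + 1e-8)
    return roi, lab.astype(np.uint8)
```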

Overview of the architecture
The overall proposed architecture is shown in Fig. 2 (left), which illustrates its U-shaped form with encoder and decoder. We introduce the U-shaped end-to-end transformer network IBIMHAV-Net, which employs voxel-wise embedding for the transformers. The model's long-range contextual interactions and precise spatial dependencies are provided by the inductive biased multi-head self-attention (IB-MSA) modules. The U-shaped structure combines a feature extraction path with three skip connections between the multi-scale feature pyramids of encoder and decoder in a symmetrical manner, which helps keep fine-grained details between transformer blocks. The feature extraction block and the interleaved convolutional up/downsampling layers gain accurate local spatial information and abundant local features.

Encoder
Past vision transformer works [18,8,4] have complete encoder parts, yet they did not design a 3D encoder. Our architecture builds an encoder path that combines a 3D embedding block and downsampling layers with our transformer block design. In the encoder, the input is a 3D volume patch randomly cropped from the full volume. We represent each 3D patch as H×W×D, where H, W, and D denote the height, width, and depth of the input patch, respectively. The 3D convolutional embedding layer then obtains tokens, each consisting of a 128-dimensional feature. A linear embedding layer is applied to project the features of each token to a 1D sequence of feature length C. The outputs of the patch embedding block are connected to five 3D swin transformer blocks interspersed with down-sampling blocks.
The patch embedding block The linear embedding part is essential in the original swin transformer design [18]: the Swin-T version first splits the one-channel patch into non-overlapping patches flattened into a 1D sequence, followed by large convolutional kernels in the linear embedding layer to extract small patch features. However, our task needs more precise spatial information with a larger input volume. Our embedding layer first tokenizes the vessel volume patch X ∈ R^{H×W×D} into a high-dimensional tensor T ∈ R^{L×C}, where L is the number of patch tokens and C = 128 is the feature length of the sequence (discussed in Sect. 3.3). Due to the variant and complex vessel structure, we design successive large-kernel convolutional combinations for voxel-wise sequence encoding instead of patch-wise encoding. Moreover, this setting reduces computational complexity at the same receptive field range to accommodate long sequences. Every convolutional layer is followed by one GELU and one LayerNorm layer to fully embed the volume as a 1D sequence. The kernels and strides are set as in Fig. 2 (right), since the input volumes are nearly square.
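A minimal PyTorch sketch of such an embedding block is given below. The exact kernel and stride configuration is specified in Fig. 2 (right); the sizes used here are illustrative placeholders, not the authors' released values:

```python
import torch
import torch.nn as nn

class VoxelEmbedding(nn.Module):
    """Successive large-kernel convolutions, each followed by GELU and
    LayerNorm, encoding the volume voxel-wise into C = 128 features."""
    def __init__(self, in_ch=1, embed_dim=128):
        super().__init__()
        self.conv1 = nn.Conv3d(in_ch, embed_dim // 2, kernel_size=7, stride=2, padding=3)
        self.conv2 = nn.Conv3d(embed_dim // 2, embed_dim, kernel_size=7, stride=2, padding=3)
        self.act = nn.GELU()
        self.norm1 = nn.LayerNorm(embed_dim // 2)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 1, H, W, D)
        x = self.act(self.conv1(x))
        x = self.norm1(x.permute(0, 2, 3, 4, 1)).permute(0, 4, 1, 2, 3)
        x = self.act(self.conv2(x))
        x = self.norm2(x.permute(0, 2, 3, 4, 1))  # (B, H', W', D', C)
        return x.flatten(1, 3)                    # (B, L, C) token sequence
```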

Down-sampling layer
The swin transformer blocks used a neighboring-concatenation operation in past 2D tasks [1,18]. However, we find that a simple convolution with small strides works better. It is followed by a GELU layer and a LayerNorm, which keep the processing normalized and refine the feature map mapping to [0, 1], preserving the sensitivity of the model. This works better than Batch Normalization (BN) and the ReLU activation function in our architecture.
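A minimal sketch of such a down-sampling layer follows (PyTorch); the 2x2x2 kernel matches the best setting found in the ablation studies below, while the remaining details are assumptions:

```python
import torch.nn as nn

class ConvDownsample(nn.Module):
    """Small-stride Conv3d replacing Swin's patch merging, followed by
    GELU and LayerNorm rather than BN/ReLU."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.conv = nn.Conv3d(dim_in, dim_out, kernel_size=2, stride=2)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim_out)

    def forward(self, x):                       # x: (B, C, H, W, D)
        x = self.act(self.conv(x))              # halve spatial resolution
        x = self.norm(x.permute(0, 2, 3, 4, 1)) # normalize feature channels
        return x.permute(0, 4, 1, 2, 3)
```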

3D swin transformer block with Inductive Biased MSA Module
After the patch embedding block, the high-dimensional sequence tensor T is fed into the transformer blocks. Compared to the original swin transformer, our method conducts self-attention along a hierarchical path and computes self-attention within 3D patch volumes, with a bias focusing on block edge segmentation (i.e., IB-MSA, biased positional multi-head self-attention), instead of 2D shifted windows.
3D transformer block At the tail of the embedding block, the sequence is transformed into a high-dimensional tensor for the swin transformer blocks. The main idea is to fully mix the captured long-term dependencies with hierarchical object concepts at various scales from the following down-sampling convolutions and the global spatial information from the initial embedding block.
To describe the workflow of our design, the high-dimensional tensor T ∈ R^{L×C} is reshaped to T ∈ R^{N×P×C} when passing through IB-MSA, where N is the number of tiny local volumes and P = S_H × S_W × S_D denotes the number of patch tokens in each volume; {S_H, S_W, S_D} stands for the size of a tiny local volume. To fit the various shapes of vessel CT scans in our task, this setting must cover all patch tokens of the last transformer block in the encoder. Because of different sampling quality between datasets, it may not be reasonable to brute-force pad the data to satisfy a fixed {S_H, S_W, S_D}. Therefore, the cropped patch X is adaptively adjusted to fit the size of the local volumes, and we set {S_H, S_W, S_D} on IRCADb to {4, 4, 4}.
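The partitioning itself is a pure reshape. The following sketch (PyTorch, our illustration rather than the authors' code) shows how the token tensor is regrouped into non-overlapping 4x4x4 local volumes before attention:

```python
import torch

def volume_partition(t, s=(4, 4, 4)):
    # t: (B, H, W, D, C) feature map with H, W, D divisible by the local
    # volume size s = (S_H, S_W, S_D).
    B, H, W, D, C = t.shape
    t = t.view(B, H // s[0], s[0], W // s[1], s[1], D // s[2], s[2], C)
    t = t.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    # Result: (B * N, P, C), N local volumes of P = S_H * S_W * S_D tokens.
    return t.view(-1, s[0] * s[1] * s[2], C)
```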
Following the baseline [1], we present two successive transformer blocks. The main difference is that our computational unit is built for 3D volumes rather than 2D windows. Based on the above volume partitioning, consecutive swin transformer blocks can be formulated as:

$$\hat{z}^{l} = \text{IB-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1},$$
$$z^{l} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l},$$
$$\hat{z}^{l+1} = \text{SIB-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l},$$
$$z^{l+1} = \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1},$$

where l expresses the layer number, MLP the multi-layer perceptron, LN layer normalization, IB-MSA our biased multi-head attention, and SIB-MSA its 3D shifted version. The computational complexity of IB-MSA on a volume of H×W×D patches is

$$\Omega(\text{IB-MSA}) = 4HWDC^{2} + 2M^{3}HWDC,$$

whereas in the original swin transformer [18] and ViT [8] the complexity of global multi-head self-attention (MSA) is

$$\Omega(\text{MSA}) = 4HWDC^{2} + 2(HWD)^{2}C.$$

The shifted window scheme reduces the computational complexity by about 56% on the IRCADb dataset. Fig. 3 shows how shifted IB-MSA reduces computational complexity by using smaller tiny-volume IB-MSA together with a relative position bias matrix. Different from the first attempt in ViT [8], recent research [18,1] has shown that a bias in the self-attention computation brings many advantages. Here, we intuitively shift the bias to focus on the edges of the segmentation volume by introducing a 3D relative position bias B ∈ R^{M^2×M^2×M^2} for each head:

$$\text{Attention}(Q, K, V) = \text{SoftMax}\left(\frac{QK^{T}}{\sqrt{d}} + B\right)V,$$

where Q, K, V ∈ R^{P×d} are the query, key, and value matrices; d is the dimension of the query and key features; and P is the number of patch tokens in a 3D window. In the shifted configuration, attention entries that connect tokens across volume boundaries are masked with a large value instead of a B item. Since the relative position along each axis lies in the range [−M+1, M−1], we parameterize a smaller-sized bias matrix B̂ ∈ R^{(2M−1)×(2M−1)×(2M−1)}, and
values in B are taken from B̂. We know from ViT [8] that, for patch-level embedding inputs, the choice of positional embedding is less important. 2D swin transformer studies [1,18] find that using only the relative position bias is better than using only an absolute positional embedding. However, a purely relative bias may lose some inductive biases such as locality and translation equivariance, and spatial invariance is crucial in our transformer-based design, which is interleaved with convolutions in the upsampling/downsampling layers. Therefore, we add a 1-dimensional absolute positional embedding at the beginning of the self-attention computation as an inductive bias and then learn to compute the new bias matrix. This specific setting improves liver vessel edge segmentation (Fig. 5), and we observe a slight improvement from complementing the bias with the absolute position. The comparison with other methods is shown in Table 2.
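To make the computation concrete, here is a minimal PyTorch sketch of one plausible IB-MSA implementation: a learnable 1D absolute positional embedding is added to the window tokens as the inductive initialization, and a relative position bias gathered from the (2M−1)³ table B̂ is added to the attention logits. Layer sizes, initialization scales, and the absence of dropout are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class IBMSA(nn.Module):
    """Sketch of inductive biased multi-head self-attention: tokens first
    receive a learnable 1D absolute positional embedding, then attention
    adds a learned 3D relative position bias B per head."""
    def __init__(self, dim, num_heads, window=(4, 4, 4)):
        super().__init__()
        self.num_heads, self.scale = num_heads, (dim // num_heads) ** -0.5
        P = window[0] * window[1] * window[2]
        self.abs_pos = nn.Parameter(torch.zeros(1, P, dim))   # absolute init
        self.bias_table = nn.Parameter(torch.zeros(           # \hat{B}
            (2 * window[0] - 1) * (2 * window[1] - 1) * (2 * window[2] - 1),
            num_heads))
        self.register_buffer("rel_index", self._rel_index(window))  # (P, P)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    @staticmethod
    def _rel_index(w):
        # Flat index of each token pair's relative offset into bias_table.
        coords = torch.stack(torch.meshgrid(
            torch.arange(w[0]), torch.arange(w[1]), torch.arange(w[2]),
            indexing="ij")).flatten(1)                 # (3, P)
        rel = coords[:, :, None] - coords[:, None, :]  # (3, P, P)
        rel = rel.permute(1, 2, 0) + torch.tensor([w[0] - 1, w[1] - 1, w[2] - 1])
        rel[:, :, 0] *= (2 * w[1] - 1) * (2 * w[2] - 1)
        rel[:, :, 1] *= (2 * w[2] - 1)
        return rel.sum(-1)

    def forward(self, x):                              # x: (B*N, P, C)
        x = x + self.abs_pos                           # inductive bias
        B_, P, C = x.shape
        q, k, v = self.qkv(x).reshape(B_, P, 3, self.num_heads,
                                      C // self.num_heads).permute(2, 0, 3, 1, 4)
        attn = (q * self.scale) @ k.transpose(-2, -1)  # (B*N, heads, P, P)
        bias = self.bias_table[self.rel_index].permute(2, 0, 1)  # the B term
        attn = (attn + bias).softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B_, P, C))
```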

Decoder
In the decoder, the transformer blocks mirror those of the encoder in the opposite direction. Moreover, the up-sampling blocks use deconvolution operators with small kernels and strides, which, combined with the skip connections, quickly recover low-level features to high-resolution details. In the final stage, the transformer output is combined with the local extraction block to produce the end-to-end result.
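A matching sketch of the decoder's up-sampling step follows (PyTorch); the concatenation-based skip fusion is one plausible choice, as the paper does not spell out the fusion operator:

```python
import torch
import torch.nn as nn

class ConvUpsample(nn.Module):
    """Small-kernel transposed convolution doubling spatial resolution,
    fused with the matching encoder feature from the skip connection."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.up = nn.ConvTranspose3d(dim_in, dim_out, kernel_size=2, stride=2)
        self.fuse = nn.Conv3d(dim_out * 2, dim_out, kernel_size=1)

    def forward(self, x, skip):        # x: low-res feature, skip: encoder feature
        x = self.up(x)                 # double spatial resolution
        return self.fuse(torch.cat([x, skip], dim=1))
```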

Weighted Loss Function
Liver vessels occupy only a small region of the liver, and the unbalanced foreground (hepatic vessels) and background (liver) classes often cause predictive deviation, biasing the classification toward the background with more voxels. It is hard to achieve the desired segmentation results on vessel edges and small branches. A dice-coefficient similarity with a special penalty weight parameter, M(P, G, β), has been proposed to design the loss function [12]:

$$M(P, G, \beta) = \frac{|P \cap G|}{|P \cap G| + \frac{1}{\beta}\,|P \setminus G| + \beta\,|G \setminus P|}, \tag{5}$$

where P and G are the predicted and ground-truth foreground voxel sets and β determines the relative weight between correctly classified foreground voxels and misclassified voxels.
Since our task has two class labels, we take foreground and background as the first and second classes, respectively. Then equation (5) becomes:

$$M(P, G, \beta) = \frac{\sum_{i} p_{0i}\, g_{0i}}{\sum_{i} p_{0i}\, g_{0i} + \frac{1}{\beta}\sum_{i} p_{0i}\, g_{1i} + \beta\sum_{i} p_{1i}\, g_{0i}}, \tag{6}$$

where p_{0i} and p_{1i} are the probabilities in the softmax output that voxel i belongs to the foreground (liver vessel) and the background (liver), respectively, and g_{0i} and g_{1i} are the corresponding labels of voxel i in the annotated data, with value 0 or 1.
Following Huang et al. [12], the gradient of the similarity in equation (6) with respect to its two variables shows that weighting the liver (background) and liver vessel (foreground) does not require a pre-trained model, unlike Chen et al. [4], which initializes training weights from other models or datasets. Moreover, the proposed algorithm adjusts the penalty for misclassified voxels; selecting β = 6 optimizes both the dice value and the sensitivity of our model.
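Under the reconstruction above, a minimal PyTorch sketch of the penalty-weighted dice (Tversky-style) loss with β = 6 might look as follows; the exact weighting of the two misclassification terms in [12] may differ from this reading:

```python
import torch

def weighted_dice_loss(p0, g0, beta=6.0, eps=1e-6):
    # p0: softmax foreground probabilities; g0: binary foreground labels.
    # beta > 1 penalizes false negatives (missed vessel voxels) more
    # heavily, trading precision for sensitivity on thin branches.
    p0 = p0.flatten()
    g0 = g0.flatten().float()
    tp = (p0 * g0).sum()              # sum_i p_0i * g_0i
    fp = (p0 * (1.0 - g0)).sum()      # sum_i p_0i * g_1i
    fn = ((1.0 - p0) * g0).sum()      # sum_i p_1i * g_0i
    m = tp / (tp + fp / beta + beta * fn + eps)
    return 1.0 - m
```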

Post-processing
Due to GPU memory limitations, we cannot feed full-size volumes into our model, which can cause residual errors at patch edges. Therefore, connected-component analysis is performed on the vessels predicted by IBIMHAV-Net. To remove noise caused by misclassification, regions with small partitions (less than 180 mm³) are removed.
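A minimal sketch of this step (SciPy; the connectivity choice and the spacing handling are assumptions):

```python
import numpy as np
from scipy import ndimage

def remove_small_components(pred, spacing, min_mm3=180.0):
    # pred: binary segmentation volume; spacing: (sx, sy, sz) in mm.
    voxel_mm3 = float(np.prod(spacing))
    labels, n = ndimage.label(pred)               # 6-connectivity by default
    sizes = ndimage.sum(pred > 0, labels, range(1, n + 1))
    keep_ids = np.where(sizes * voxel_mm3 >= min_mm3)[0] + 1
    out = pred.copy()
    out[~np.isin(labels, keep_ids)] = 0           # drop components < 180 mm^3
    return out
```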

Datasets augmentation and experiment material
The 3Dircadb datasets (https://www.ircad.fr/research/3d-ircadb-01/) are currently available with liver and liver vessel contours suited to training and evaluating liver vessel segmentation algorithms. The datasets include 20 contrast-enhanced CT volumes with various image resolutions, vessel structures, intensity distributions, and contrasts between liver and liver vessels. To preserve the accuracy, transform invariance, and robustness of our network, the training and test sets should contain clear, abundant hepatic vessel structures with different intensity ranges and background contrasts, and liver vessel appearances should be similar in both sets; we therefore ran some selection experiments. Judging by voxel counts and other statistics, 3DIRCADb includes 6 simple samples and 14 challenging samples. Finally, we hand-selected 16 volumes for training and 4 for testing in each experiment (both including simple and challenging samples). For the 16 training volumes, we applied image amplification methods to enlarge the training set: for each training case, fixed rotations of 60° and 270° plus random translations from −25 to +25 pixels triple the data as an augmentation strategy. In both the training and testing datasets, the original pixel spacing varies from 0.56 mm to 0.87 mm, the slice thickness from 1.25 mm to 2 mm, and the slice number from 113 to 225.
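A sketch of this augmentation (SciPy; the interpolation orders and the in-plane rotation axes are our assumptions):

```python
import numpy as np
from scipy.ndimage import rotate, shift

def augment(volume, mask, rng=np.random.default_rng()):
    # Fixed rotations of 60 and 270 degrees plus random in-plane
    # translations in [-25, +25] pixels, tripling each training case.
    out = [(volume, mask)]
    for angle in (60, 270):
        dx, dy = rng.integers(-25, 26, size=2)
        v = rotate(volume, angle, axes=(0, 1), reshape=False, order=1)
        m = rotate(mask, angle, axes=(0, 1), reshape=False, order=0)
        v = shift(v, (dx, dy, 0), order=1)   # linear for intensities
        m = shift(m, (dx, dy, 0), order=0)   # nearest for labels
        out.append((v, m))
    return out  # original + two augmented copies = 3x data
```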
Our proposed method was implemented in Python 3.8 and PyTorch 1.9.0. All experiments were conducted on an Nvidia A6000 GPU with 48 GB of memory. The input image size after preprocessing is 256x256x192. The crop size for our network is 128x128x96, with an overlapping stride of 24 at test time. The batch size is set to 2 and the learning rate to 3e-5; as observed in the initial work [18], the swin transformer can hardly converge in the first 20-30 epochs. We set the number of training epochs to 750. The default optimizer with momentum 0.9 and weight decay 2e-3 is used for back-propagation. We employ three indexes to evaluate the results: precision, dice, and sensitivity.
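Reading "overlapping stride 24" as the sliding step of the crop window (one plausible interpretation), test-time inference can be sketched as follows (PyTorch):

```python
import torch

@torch.no_grad()
def sliding_window_predict(model, vol, patch=(128, 128, 96), stride=24):
    # vol: (1, 1, H, W, D) preprocessed volume; logits from overlapping
    # crops are accumulated and averaged before the final argmax.
    _, _, H, W, D = vol.shape
    logits = torch.zeros((1, 2, H, W, D), device=vol.device)
    count = torch.zeros_like(logits)
    starts = lambda full, p: sorted(set(range(0, full - p + 1, stride)) | {full - p})
    for x in starts(H, patch[0]):
        for y in starts(W, patch[1]):
            for z in starts(D, patch[2]):
                sl = (slice(None), slice(None), slice(x, x + patch[0]),
                      slice(y, y + patch[1]), slice(z, z + patch[2]))
                logits[sl] += model(vol[sl])
                count[sl] += 1
    return (logits / count).argmax(dim=1)   # (1, H, W, D) hard labels
```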

Experiments
In this subsection, we compare the proposed model with other state-of-the-art methods on 3DIRCADb. The CNN-based methods include UNet [6], VNet [21], Huang et al. [12] (an optimized U-net variant), and ResUnet [31]. We also compare with the improved graph cuts method proposed by Sangsefidi et al. [25], a recent practical improvement of a traditional method with good performance in liver vessel segmentation. In addition, some methods apply data refinement [12] or specific data augmentation strategies such as filters [15]; we do not compare with those results.

Quantitative Results
To compare with other state-of-the-art methods equitably, we focus only on the original 3DIRCADb volumes. Our results are reported in Table 1, where the numerical results on each metric are higher than those of the other methods. Thanks to the swin transformer's shifted windows and the IB-MSA mechanism, our model accepts larger inputs to capture global relationships and obtain better segmentation results. Our method exceeds the other methods on all three indexes.
Visualization Results Fig. 4 shows the visualization of our experiment on one complex sample. After a 3D morphological closing operation and post-processing, the vessel surface becomes smoother and some noise blocks are removed. To compare the results visually, we use 3D Slicer's toolbox and zoomed-in patches. The full results are shown in Fig. 4. This sample is long and curvy; the segmentation results of FCN, 3D U-Net, and 3D V-Net on the hepatic veins are not satisfactory, with some regions over-segmented and some minor vessels missed. The reason could be that convolutional operators limit the capability of learning long-range dependencies. In addition, in the third row, Huang et al. and ResUnet perform fairly well on the whole vessel structure, yet make many errors at the vessel edges, as seen in the zoomed-in views.
By utilizing the inductive biased multi-head attention and the transformer, our method performs relatively closer to the ground truth in vessel edges and overall structure.

Evaluation of results
To validate the generalization of our method, we evaluate 4 test cases covering both hard and simple cases, with results shown in Table 2. In simple cases, our network performs very well. As the minor vessels become more complex and variable, even though the results contain some errors and over-segmentation, the edge segmentation remains steady across situations. The dice coefficients in these 4 cases are 84.3, 71.6, 75.9, and 67.4, respectively. From Fig. 5 we can see that in the simple case (a), the vessel edge segmentation may be over-segmented. In the complex cases (c) and (d), the green arrows point to some misclassified voxels, caused by missing labels in the ground truth. The red arrow points to a discontinuous vessel net, caused by a tumor at that position. Specifically, tumor effects and unlabeled liver vessels in the expert manual annotation both lower the segmentation accuracy.

Ablation studies
To explore the influence of our design choices on model performance, we conducted a series of ablation studies on the 3Dircadb dataset.

Influences of inductive biased positional embedding and IB-MSA
Table 2 shows the comparison of different position embedding approaches for our network. IBIMHAV-Net with a general relative position bias yields a 2.5% accuracy improvement over absolute position embedding, indicating the effectiveness of the relative position bias. In addition, our proposed biased attention yields better results than the other positional embedding approaches.

Influence of more skip connections and 3D swin-transformer blocks (bottleneck) In our network architecture, the skip connections are placed after each down-sampling block and before the corresponding up-sampling block to unify the feature dimensions. The transformer has a different convergence rule than CNNs, which needs more discussion [1,28]. In our model, only two successive swin-transformer blocks are used to learn deep feature representations. Our experiments set 6, 10, 14, 18, and 22 transformer blocks with corresponding upsampling/downsampling layers to study the convergence pattern of the model, as shown in Fig. 7 (the settings for kernel size and model bottleneck are illustrated in Fig. 6). Notably, when the number of transformer blocks is 6, neither smaller nor larger up/down-sampling kernels lead to convergence. Across the 3 groups of ablation experiments, the 2x2x2 kernel with 10 blocks gains the best performance for our model.
Effect of downsampling strategies Patch merging is the down-sampling strategy used in the original swin transformer; its main idea is to concatenate neighboring patches [18,1]. We expand it to 3D by first concatenating 2x2x2 neighboring patches and then applying a linear layer on the features to downsample to twice the original feature dimension. We instead choose small-kernel convolution layers to achieve the same operation, with better results, as shown in Table 3.

Effect of up-sampling strategies The original swin transformer uses a resize-based patch expanding layer [1], which relies on resizing the patches to upsample the features. We design a small-kernel transposed convolution layer in the decoder to perform up-sampling as the feature dimension increases. To explore the effectiveness of the proposed strategy, we conduct experiments with IBIMHAV-Net using trilinear interpolation, 3D transposed convolution, and the patch expanding layer [18] on the IRCADb dataset. The experimental results in Table 4 indicate that the proposed model combined with the transposed convolution layer obtains better segmentation accuracy.

Effect of local feature extraction block The local feature extraction block includes several large-kernel convolution layers. We tried adding other feature-extraction residual blocks at deeper positions, but the experiments show that CNN blocks only perform well in the high-resolution part; the main reason may be that deeper CNN blocks do not retain enough spatially precise information to supplement local features for the swin-transformer path. When we dropped this design, the accuracy and dice coefficient dropped by 12% and 7.5%, respectively.

Role of cropping patch size We test the proposed IBIMHAV-Net with 224x224x96 and 128x128x96 input resolutions. As the input size increases from 128x128x96 to 224x224x96 while the patch size remains 2, the input token sequence of the transformer becomes larger, improving the segmentation performance of the model. However, although the segmentation accuracy improves slightly (±0.3% DSC), the computational load of the whole network increases significantly. To balance the running efficiency of the algorithm, the experiments in this paper use the 128x128x96 resolution scale as the input.

Conclusions
This paper presents a liver vessel segmentation method for CT images using a transformer-based network. The swin transformer is expanded to 3D as the backbone and interleaved with convolutions for 3D volumes. Specifically, small-stride convolutions in both the local feature path and the up/down-sampling blocks keep spatial information hierarchically for the two successive swin transformer blocks. A new voxel-wise embedding method is used for our few-sample task with variant structures, and a new type of biased positional embedding is proposed for our transformer. Numerical evaluation and visualization on different benchmarks proved the validity of this deep learning method. Our method has been trained and tested on the 3DIRCADb dataset. In the future, we will further improve segmentation accuracy by introducing more precise datasets and trying multi-task methods to reduce the negative effects of liver tumors.

Fig. 1
Fig. 1 Supplement of the vessel mask used in the training set

Fig. 2
Fig. 2 Left: architecture of IBIMHAV-Net. Right: composition of the convolutional embedding, feature extraction block, up-sampling layers, and down-sampling layers.

Fig. 3
Fig. 3 An illustrated example of 3D shifted windows. The input size H × W × D is 8 × 8 × 8, and the 3D window size M × M × M is 4 × 4 × 4. As layer l adopts regular window partitioning, the number of windows in layer l is 2 × 2 × 2 = 8. For layer l + 1, as the windows are shifted by (S_H/2, S_W/2, S_D/2) = (2, 2, 2) tokens, the number of windows becomes 3 × 3 × 3 = 27. Though the number of windows is increased, the efficient batch computation of [28] for the shifted configuration can be followed, so the final number of windows for computation is still 8.

Fig. 4
Fig. 4 Visualization and comparison of the proposed deep learning method and state-of-the-art machine learning-based methods using the raw volume as input, with post-processing. The three rows show different method genres. First row: (a) the ground truth, which is most similar to our result. Second row: (b), (c), (d) traditional 3D medical image methods. Third row: (e), (f), (g) modern deep learning methods from the literature and our method.

Fig. 5
Fig. 5 The first column lists the ground truth for different cases. The second column lists our network's results; (a), (b), (c), (d) represent the different cases.

Table 1
Quantitative comparison of segmentation performance by three evaluation metrics on 3DIRCADb.

Table 2
Relative position bias

Table 3
Ablation study on the impact of down-sampling, DSC (%)

Table 4
Ablation study on the impact of up-sampling, DSC (%)