
BPI-MVQA: a bi-branch model for medical visual question answering

Abstract

Background

Visual question answering in the medical domain (VQA-Med) exhibits great potential for enhancing confidence in diagnosing diseases and helping patients better understand their medical conditions. One of the challenges in VQA-Med is how to better understand and combine the semantic features of medical images (e.g., X-rays, Magnetic Resonance Imaging (MRI)) and answer the corresponding questions accurately on unlabeled medical datasets.

Method

We propose a novel Bi-branched model based on Parallel networks and Image retrieval for Medical Visual Question Answering (BPI-MVQA). The first branch of BPI-MVQA is a transformer structure built on a parallel network, which achieves complementary advantages in extracting image sequence features and spatial features, and implicitly fuses the multi-modal features with the multi-head self-attention mechanism. The second branch retrieves images by the similarity of the features generated by the VGG16 network and takes the text descriptions of similar images as labels.

Result

The BPI-MVQA model achieves state-of-the-art results on three VQA-Med datasets, and the main metric scores exceed the previous best results by 0.2%, 1.4%, and 1.1%, respectively.

Conclusion

The evaluation results support the effectiveness of the BPI-MVQA model in VQA-Med. The design of the bi-branch structure helps the model answer different types of visual questions. The parallel network allows for multi-angle image feature extraction, a unique feature extraction method that helps the model better understand the semantic information of the image and achieve greater accuracy in the multi-classification of VQA-Med. In addition, image retrieval helps the model answer irregular, open-ended type questions from the perspective of understanding the information provided by images. The comparison of our method with state-of-the-art methods on three datasets also shows that our method can bring substantial improvement to the VQA-Med system.


Background

The visual question answering in the medical domain (VQA-Med) system has great potential in medical applications, but it is not yet well developed. The original medical question and answer (QA) systems were developed prior to VQA-Med and were mainly based on information retrieval, databases, and other technologies; representative works are the MedQA [1], MiPACQ [2], and AskHERMES [3] systems. Current medical QA systems are generally based on knowledge mapping technology, which stores medical information in the form of entity-relationships in a non-relational database, and they provide medical advice by searching and reasoning. Aarthi [4] enumerates the traditional subtasks of QA, covering almost all MedQA questions. For example, Izcovich [5] developed a GRADE-based medical question answering system. However, the ability to analyze medical test cases is not sufficient for clinical adoption; analyzing medical images is also a necessary skill of an auxiliary medical system. A VQA system [6] can meet this requirement. Such a system utilizes computer vision (CV) and natural language processing (NLP) to systematically learn the features of given images and questions and then generates answers to the questions. At first, VQA technology was widely used for fine-grained recognition, object recognition, and behavior recognition in random scenes including people or objects. These tasks require the VQA system not only to classify images and detect targets, but also to extract semantic features and have a certain degree of common sense. When the ImageCLEF2018 competition [7] proposed a VQA-Med task in 2018, VQA was applied to the medical field for the first time. Similar to VQA, the questions of VQA-Med include organ type recognition (e.g., what organ is this?), abnormality identification (e.g., is the lung abnormal?), and classification of medical images (e.g., what is the imaging modality of the given medical image?). Due to the lack of annotation information in medical datasets, such as labels for organ lesions and the center point, length, and width of the bounding box around a lesion's location, we cannot use the effective target detection methods of general VQA to help extract medical image features, which makes it difficult to apply VQA in this specific field. Visual Question Generation (VQG) from images is also a rising research topic in both natural language processing and computer vision [8]. Although there are only some recent efforts towards generating questions from images in the open domain, VQG represents another meaningful approach to the VQA-Med task.

With the emergence and popularization of new digital medical imaging equipment [9], clinicians can use both knowledge and medical equipment to diagnose diseases. In some cases of non-obvious trauma, medical imaging is much more informative than patient-reported symptoms. However, interpreting medical imaging is challenging for inexperienced interns and medical students. A well-established VQA-Med system can help them practice and judge whether their conclusions are correct. Traditional computer-aided diagnosis technology is usually aimed at one disease; for example, judging the probability of lung cancer based on the presence of pulmonary nodules in Computed Tomography (CT) images of the chest [10], detecting tuberculosis and classifying its severity [11], or detecting breast cancer based on chest radiographs [12]. A major limitation of auxiliary diagnosis technology based on analyzing a single type of medical imaging is its inability to provide a complicated, specific description of a patient's condition similar to a clinician's diagnosis. The VQA-Med system can realize this function. However, current VQA-Med datasets generally contain substandard samples, which need to be improved, because even with a large amount of training data, erroneous data will increase classification errors. VQA-Med has great research significance. First of all, it is in its infancy, and there are still many technologies to be explored. Secondly, because of the lack of standardized datasets, the model needs good adaptability to the data. Building on existing VQA-Med model research, this paper proposes a series of methods for VQA-Med that make the VQA-Med system convenient for patients' consultation and doctors' research. In addition, VQA-Med faces many challenges, such as the special processing required for medical-specific vocabulary in medical texts and for medical images, the combination of multi-modal features of medical images and medical texts at different levels, and the often-overlooked interaction between the question and the visual information extracted from the text semantics.

We propose a novel bi-branched model based on parallel network and image retrieval for medical visual question answering (BPI-MVQA). The main contributions of this work can be summarized as follows:

We propose a bi-branched neural network model that can apply different classification methods to different types of training data for VQA-Med. The first branch uses a model similar to a transformer [13] that extracts image features in parallel for classification. The second branch retrieves similar images by feature similarity and outputs the labels of the most similar images as text descriptions. Our model achieves state-of-the-art results on three datasets, which shows that our model is effective for VQA-Med.

We propose a novel method that uses the pre-trained VGG16 network [14] with the fully connected layers removed to output image features, and then selects the answer labels of similar images by calculating the cosine similarity between the feature matrices of two images. This method significantly improves the accuracy on part of the test set data.

We propose the ResNet152 [15, 16] and Gated Recurrent Unit (GRU) [17] parallel structure to extract both full-scale image features and local features; its purpose is to preserve the spatial feature information of images in different dimensions. In addition, the original three-channel images are processed into single-channel grayscale images and input into the stacked GRU network to retain the sequence feature information of the images. Finally, the features extracted from each layer of ResNet152 and the output features of the GRU network are concatenated as the complete features of the images.

We apply the transformer structure model as the main part of the multi-classification model. In the NLP tasks of biomedicine, Biobert [18] performs much better than Bidirectional Encoder Representations from Transformers (Bert) [19] on many biomedical text mining tasks and is more suitable for training on biomedical data because it utilizes the biomedical corpus from PubMed to understand complex biomedical literature. Unlike the traditional Bert model input, we take the concatenated image features and question features as the input of the transformer. The multi-head self-attention mechanism fuses the input features, and then the model outputs the answers.

Related work

The development of VQA-Med is a very interesting challenge, and many new solutions have emerged to handle VQA tasks; some of them are also applicable to the VQA-Med field. A classical convolutional neural network (CNN) pre-trained on ImageNet is usually selected as the image feature extractor, and a recurrent neural network (RNN) or a transformer-structure model is usually selected as the question feature extractor. Peng et al. [20] proposed a deep network model based on ResNet152 and long short-term memory (LSTM) that uses the multi-modal factorized bilinear pooling model (MFB) [21] with a ‘co-attention’ mechanism to fuse features. This end-to-end deep learning network can learn from images and questions at the same time, and it won first place in the VQA-Med task of the ImageCLEF2018 competition. Zhou et al. [22] put forward a model based on Inception-Resnet-v2 [23] and BiLSTM [24], which won second place in the competition. Abacha et al. [25] put forward a model combined with a stacked attention network (SAN) [26] capable of obtaining local attention information of image regions through multiple iterations, which won third place in the competition. The following year, Zhejiang University's team [27] proposed a novel model that extracts image features from the middle layers of VGG16 and extracts question features using Bert, which won first place in the ImageCLEF2019 VQA-Med task. Kornuta et al. [28] proposed a modular pipeline architecture that utilized transfer learning and multi-task learning. Liao et al. [29] used a knowledge inference methodology called Skeleton-based Sentence Mapping (SSM) and won first place in the ImageCLEF2020 VQA-Med task. Al-Sadi et al. [30] used an effective data augmentation technique and won second place in the ImageCLEF2020 VQA-Med task. Zhan et al. [31] proposed a novel conditional reasoning framework for Med-VQA, aiming to automatically learn effective reasoning skills for various Med-VQA tasks. Gong et al. designed a hierarchical feature extraction structure to capture multi-scale features of medical images and won first place in the ImageCLEF2021 VQA-Med task. Xiao et al. [32] fused the semantic features and image features by Multi-modal Factorized High-order (MFH) pooling and won second place in the ImageCLEF2021 VQA-Med task. Gupta et al. [33] proposed a hierarchical deep multi-modal network that analyzes and classifies end-user questions and then incorporates a query-specific approach for answer prediction. Do et al. [34] present a multiple meta-model quantifying method that effectively learns meta-annotation and leverages meaningful features for the VQA-Med task. Lin et al. [35] give a detailed survey of the current state of medical visual question answering.

There are many VQA datasets widely used in general fields, such as COCO-QA [36], VQA-dataset [6], FM-IQA [37], Visual Genome [38], Visual7W [39], and Clevr [40]. Early VQA datasets mainly asked questions about the location, color, and quantity of objects in images. Later, in addition to simple attributes of the images, some reasoning problems based on common sense were added. At present, the main VQA-Med datasets are ImageCLEF2018 VQA-Med, ImageCLEF2019 VQA-Med, and VQA-Rad. These three are all radiology datasets, and each is divided into several question types. The number of question and answer pairs (QA pairs) and the number of candidate answers (i.e., the number of different answers contained in all QA pairs) corresponding to each question type are shown in Fig. 1.

Fig. 1 Number of candidate answers on a ImageCLEF2018 VQA-Med and b ImageCLEF2019 VQA-Med datasets

In addition, data augmentation plays an important role in training on small samples. Regarding data augmentation in the field of VQA, Kafle et al. [41] used an LSTM to generate new question sequences corresponding to the original images, and others translate the questions into other languages and then back into English; these are methods of question data expansion. For image data augmentation, image flipping, rotation by a certain angle, translation, and random cropping are frequently used in image processing. Data augmentation [42] can be used to expand a dataset to prevent model overfitting, as illustrated by the sketch below.
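As a rough illustration of such an augmentation pipeline, the snippet below uses torchvision transforms; the operations mirror those listed above, but the parameter values are our own assumptions rather than settings reported in the cited works.

```python
# Illustrative image-augmentation pipeline (flip, rotation, translation,
# random cropping). Parameter values are assumptions for demonstration only.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                       # image flipping
    transforms.RandomRotation(degrees=10),                        # rotation by a small angle
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),   # translation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # random cropping
    transforms.ToTensor(),
])
```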

Methods

Overview of BPI-MVQA

BPI-MVQA is composed of two branches. As the first step, we count the number of candidate answers for each question type in the training set. According to Fig. 1, if the current type of data has few candidate answers and is easy to classify, it is routed to the first branch (the parallel structure model); otherwise, it is routed to the second branch (the image retrieval model). In the first branch, a transformer-structure model fed by the parallel image feature extractor is used for classification. In the second branch, the pre-trained VGG16 network is used to retrieve similar images and output their labels as text descriptions. The whole process is shown in Fig. 2.

Fig. 2 Brief representation of the proposed BPI-MVQA
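To make the routing rule concrete, the sketch below reflects our reading of this overview rather than the authors' released code: candidate answers are counted per question type, and each type is assigned to a branch. The threshold max_candidates is an illustrative assumption; the paper decides per type from the statistics in Fig. 1.

```python
# Hypothetical routing helper: few candidate answers -> classification branch,
# many candidate answers -> image-retrieval branch. The threshold is illustrative.
def route_by_candidate_answers(qa_pairs, max_candidates=50):
    """qa_pairs: list of (question_type, answer) tuples from the training set."""
    answers_per_type = {}
    for q_type, answer in qa_pairs:
        answers_per_type.setdefault(q_type, set()).add(answer)
    return {
        q_type: "parallel_branch" if len(answers) <= max_candidates
        else "retrieval_branch"
        for q_type, answers in answers_per_type.items()
    }
```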

Parallel structure model

In the first branch of BPI-MVQA, we choose the transformer structure as the main framework of our parallel structure model. Different from the VilBert [43] and LXMERT [44] models, which input the questions and images into two independent transformers to process the features of the two parts separately, our model takes the features of the two parts as the input of a single transformer. The idea of the parallel network structure is embodied in the image feature extraction. As shown in the visual features part of Fig. 3, the feature blocks \(V_{i}\) are produced by the parallel network structure composed of ResNet152 and GRU. As shown in the question features part of Fig. 3, the feature blocks \(E_{i}\) are produced by three-layer word embedding based on a biomedical corpus. Subsequently, \(V_{i}\) and \(E_{i}\) are concatenated into a complete input feature and fused by the multi-head self-attention module of the transformer framework. In addition, the special symbols [CLS] and [SEP] are used to mark and separate the sentences. Figure 3 shows the overall structure of our parallel structure model.

Fig. 3 Overall structure of the parallel structure model

Image feature extraction

In this parallel structure model, we adopt a parallel network to extract the image features. Firstly, we use an improved CNN model to extract the spatial features of the medical images. Secondly, we use an RNN model to extract the sequence features of the medical images. The following two sections introduce these parts of the parallel network model.

CNN part In the CNN portion of the parallel network model, we use the pre-trained ResNet152 model. It is well known that the deeper a network is, the more difficult it is to train, because of vanishing and exploding gradients. A skip connection takes the activation of one layer and feeds it directly to a later, deeper layer, and a residual network built from such skip connections can train much deeper models. We input the image into ResNet152 after image preprocessing such as rotation, random resizing, brightness adjustment, and contrast adjustment, and retain the features of the image from each intermediate layer. Then, the features are passed through a fully connected layer to project the image features into the same dimensional space as the question features, and a global average pooling (GAP) [45] operation is performed. Compared with a fully connected operation, GAP removes most of the parameters while unifying the dimensions, which helps prevent overfitting. The structure of the CNN part of the parallel network model is shown in Fig. 4.

Fig. 4 CNN part of the parallel structure model
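The following PyTorch sketch shows one way to realize this step under our own assumptions about layer choice and output dimension (four residual stages, a 312-dimensional projection); it is not the authors' released implementation.

```python
# Sketch: intermediate-layer ResNet152 features -> GAP -> linear projection.
import torch
import torch.nn as nn
from torchvision import models

class ResNetMultiLayerFeatures(nn.Module):
    def __init__(self, out_dim=312):
        super().__init__()
        backbone = models.resnet152(pretrained=True)   # newer torchvision uses weights=...
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.gap = nn.AdaptiveAvgPool2d(1)             # global average pooling
        # one projection per stage into the shared feature space
        self.proj = nn.ModuleList([nn.Linear(c, out_dim)
                                   for c in (256, 512, 1024, 2048)])

    def forward(self, x):                              # x: (B, 3, H, W)
        feats = []
        x = self.stem(x)
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            pooled = self.gap(x).flatten(1)            # (B, C)
            feats.append(proj(pooled))                 # (B, out_dim)
        return torch.stack(feats, dim=1)               # (B, 4, out_dim)
```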

RNN part It is widely accepted that images have spatial features. However, at the pixel level, there is also a sequential relationship between the pixels of an image. For example, if we regard the width of an image as the feature dimension and the height as the time step, we can consider that the rows of pixels in an image follow one another like steps of a sequence. Therefore, when designing the model, we should consider not only the spatial relationship of the image but also the sequential relationship between pixels. We use a two-layer stacked GRU as the RNN module, in which the original three-channel images are processed into single-channel grayscale images and input into the stacked GRU network to retain the sequence feature information of the images. The structure of the RNN part of the parallel network model is shown in Fig. 5.

Fig. 5 RNN part of the parallel structure model
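A minimal sketch of this idea is given below; the image width, hidden size, and the use of the last hidden state as the sequence feature are our assumptions.

```python
# Sketch: each grayscale row is one time step of a two-layer stacked GRU.
import torch.nn as nn

class RowSequenceGRU(nn.Module):
    def __init__(self, img_width=224, hidden=312, num_layers=2):
        super().__init__()
        self.gru = nn.GRU(input_size=img_width, hidden_size=hidden,
                          num_layers=num_layers, batch_first=True)

    def forward(self, gray):        # gray: (B, 1, H, W) single-channel image
        rows = gray.squeeze(1)      # (B, H, W): H time steps of W-dim features
        _, h_n = self.gru(rows)     # h_n: (num_layers, B, hidden)
        return h_n[-1]              # (B, hidden) sequence feature of the image
```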

Parallel feature fusion If the image features of different layers are concatenated together and the image sequence information is extracted again through the GRU network, the dimension of the feature matrix will be reduced from a large dimension to a very small dimension, which may result in the loss of many useful image features. Therefore, we combine the image features of different middle layers with the sequence features of the image extracted by GRU to get the feature matrix of the visual portion. The structure of the image feature fusion module is shown in Fig. 6. We take the final output \(f_{1}... f_{5}\) as the image features, combine them with the text features S, and then input them into the model of transformer structure.

Fig. 6 Structure of the image feature fusion module
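The concatenation itself can be sketched as below, where the number of layer features, the extra GRU token, and all shapes are our assumptions about how \(f_{1}...f_{5}\) are assembled before entering the transformer.

```python
# Sketch: keep per-layer CNN features and the GRU feature as separate tokens,
# then prepend the visual block to the question embeddings.
import torch

def fuse_parallel_features(resnet_feats, gru_feat, question_emb):
    """resnet_feats: (B, 4, d), gru_feat: (B, d), question_emb: (B, L, d)."""
    visual = torch.cat([resnet_feats, gru_feat.unsqueeze(1)], dim=1)  # (B, 5, d) = f1..f5
    return torch.cat([visual, question_emb], dim=1)                   # (B, 5 + L, d)
```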

Text feature extraction

We convert all questions and answers into lowercase letters to prevent two candidate answers with the same meaning from being treated as different because of letter case. Our model adopts the three embedding methods of the transformer structure and uses the biomedical corpus based on PubMed. In order to input question features into the transformer structure model, we first use token embedding to transform each word into a fixed-dimensional vector. In this process, two special tokens, [CLS] and [SEP], are inserted at the beginning and the end of the input text, respectively, to segment the sentence. We then use segment embedding to help the transformer distinguish the vector representations of two adjacent sentences. Finally, we use position embedding to introduce the coding information of the sequence order with the following formula.

$$\begin{aligned} \left\{ \begin{array}{l} PE_{2i}=\sin \left( pos/10000^{2i/d_{pos}}\right) \\ PE_{2i+1}=\cos \left( pos/10000^{2i/d_{pos}}\right) \end{array} \right. \end{aligned}$$
(1)

where pos stands for position and i stands for dimension. This formula means that, for each word vector, a sine value is used at each even dimension and a cosine value at each odd dimension to fill the whole PE matrix. It can be seen from Eq. (1) that each dimension i corresponds to a sine or cosine curve of a different period. When i = 0, it is a sin function with a period of 2\(\pi\), and when i = 1, it is a cos function with a period of 2\(\pi\). For two different positions \(pos_{1}\) and \(pos_{2}\), if they have the same coding value on a certain dimension 2i, the difference between the two positions is equal to the period of the curve of that dimension, that is, \(|pos_{2}-pos_{1}|\) = \(T_{2i}\). For another dimension \(2i+1\), since \(T_{2i}\) \(\ne\) \(T_{2i+1}\), the coded values of \(pos_{1}\) and \(pos_{2}\) on different dimensions will not all be equal. This coding method ensures that different positions are never coded to exactly the same values in all dimensions.
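A direct implementation of Eq. (1) is shown below; the embedding dimension d_pos = 312 is an assumption matching the feature size used later.

```python
# Sinusoidal position embedding of Eq. (1): sin on even dimensions, cos on odd ones.
import torch

def position_encoding(max_len, d_pos=312):
    pe = torch.zeros(max_len, d_pos)
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    i = torch.arange(0, d_pos, 2, dtype=torch.float32)             # even dimensions 2i
    angle = pos / (10000.0 ** (i / d_pos))                         # (max_len, d_pos/2)
    pe[:, 0::2] = torch.sin(angle)                                 # PE_{2i}
    pe[:, 1::2] = torch.cos(angle)                                 # PE_{2i+1}
    return pe
```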

Fusion of question features and image features

Fig. 7 Structure of simple transformer

The transformer is composed of a self-attention module and a feed-forward neural network (FFN). It uses the attention mechanism to solve the problem of information loss in sequential computing, as shown in Fig. 7. We prepend the above-mentioned image features to the question features, integrate the two parts into one feature matrix, and input it into a stacked four-layer transformer structure. As a result, the model can learn the dependency between image features and question features and capture the internal structure of the input feature sequence. For example, if the input is a sentence, we use position embedding to encode the order of the input feature sequence, i.e., the position of each word in the sentence. However, because our medical images lack the annotations needed for target detection, we cannot match local image regions to the question information by position. We can only input the overall image features of different layers extracted by the parallel structure model, together with the word vectors, into the transformer for the attention operation. An advantage of this method, though, is that it pays more attention to the dependence between image and question features than the traditional way of inputting features. We input the joint feature \(X_{e}\in R^{n\times d_{model}}\) of the images and the questions into the model. First, a linear transformation is carried out: the weight matrices \(W^{Q}\), \(W^{K}\), and \(W^{V}\) are applied to \(X_{e}\) to generate the Q, K, and V matrices, i.e., \(Q=X_{e}W^{Q}\), \(K=X_{e}W^{K}\), and \(V=X_{e}W^{V}\). Because the weights are different, the resulting Q, K, and V matrices are different. We then apply scaled dot-product attention to Q, K, and V to compute their similarity, and the resulting softmax score determines how much attention each position pays to every other position in the sequence. The following is the formula of the attention mechanism.

$$\begin{aligned} Attention(Q,K,V)=softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V. \end{aligned}$$
(2)

The essence of the multi-head attention mechanism is to independently compute multiple self-attention heads and then concatenate them, as shown in Eqs. 3 and 4. Equation 5 represents the principle of multi-head self-attention. Each head learns features in a different representation space; for example, two heads may notice slightly different emphases, which gives the model more capacity for feature information. We divide the 312-dimensional feature vector into h heads; note that h must be a factor of 312. Here, we use h = 8 and h = 12 to learn the feature differences of 8 and 12 representation spaces, respectively.

$$\begin{aligned} head_{i}=Attention(QW^{Q}_{i},KW^{K}_{i},VW^{V}_{i}). \end{aligned}$$
(3)
$$\begin{aligned} MultiHead(Q,K,V)=Concat(head_{1},...,head_{h})W^{o}. \end{aligned}$$
(4)
$$\begin{aligned} head_{i}=Attention(X_{e}W^{Q}_{i},X_{e}W^{K}_{i},X_{e}W^{V}_{i}). \end{aligned}$$
(5)

After the multi-head self-attention computation, residual connection and layer normalization are carried out, and the result is sent to a fully connected layer for a nonlinear transformation to obtain the final output. The activation function we use is ReLU. Finally, we perform classification based on the output at the position of the special symbol [CLS].
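To tie Eqs. (2)-(5) together, the following condensed sketch computes multi-head self-attention over the joint image-question feature \(X_{e}\); in practice this is what a standard transformer encoder layer (or nn.MultiheadAttention) provides, and the sizes here are assumptions.

```python
# Sketch of Eqs. (2)-(5): project X_e to Q, K, V, split into h heads, attend,
# concatenate, and project with W_o.
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=312, h=12):
        super().__init__()
        assert d_model % h == 0, "h must be a factor of d_model"
        self.h, self.d_k = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x_e):                              # x_e: (B, n, d_model)
        B, n, _ = x_e.shape
        def split(t):                                    # (B, n, d) -> (B, h, n, d_k)
            return t.view(B, n, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x_e)), split(self.w_k(x_e)), split(self.w_v(x_e))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # Eq. (2)
        heads = torch.softmax(scores, dim=-1) @ v               # per-head attention
        concat = heads.transpose(1, 2).reshape(B, n, -1)        # Concat(head_1..head_h)
        return self.w_o(concat)                                 # Eq. (4)
```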

Image retrieval model

Fig. 8 Overall structure of the image retrieval model

Fig. 9 VGG16 structure of the image retrieval model

Fig. 10 Visual graph and formula of cosine similarity

Inspired by [46], we use the answers of the training set as the labels of the corresponding images, ignoring the influence of question features on the classification results. The main idea is to retrieve the training set images, and their labels, that are most similar to the image to be tested. Figure 8 shows the overall structure of our image retrieval model. The VGG16 network is selected as the feature extractor because the number of channels in its first convolutional block is 64 and doubles in each subsequent block; as the number of channels increases, more information can be extracted. We remove the fully connected layers and obtain the image feature from the last convolutional layer. The VGG16 structure used in the image retrieval model is shown in Fig. 9. We use this method to predict the answers to irregular, open-ended type questions, which have many different answers in the training sets of the three datasets. Because it is difficult to distinguish effective candidate answers, we start from the analysis of medical image features. First, we input the images of the training set into VGG16, and then we divide the output feature matrix by its own norm to obtain the image features to be compared. The reason for doing this is that the dot product of two normalized matrices is proportional to the cosine of the angle between them: the closer they are in direction, the larger the dot product, and the higher the similarity between the two images. The mathematical principle is shown in Fig. 10. As shown in Eq. 6, \(X_{1}\) and \(X_{2}\) represent the vectors of the two features corresponding to Fig. 10, where the \(x_{1}\) and \(x_{2}\) sets are the elements of the vectors, and \(\theta\) is the angle between them.

$$\begin{aligned} \cos (\theta )=\frac{X_{1}X_{2}}{|X_{1}||X_{2}|}=\frac{\sum \limits _{k=1}^{n}x_{1k}x_{2k}}{\sqrt{\sum \limits _{k=1}^{n}x^{2}_{1k}}\sqrt{\sum \limits _{k=1}^{n}x^{2}_{2k}}}. \end{aligned}$$
(6)

If the feature matrix of the image to be tested is \(A\in R^{n\times n}\) and the feature matrix of an image in the training set is \(B\in R^{n\times n}\), where the \(a_{ij}\) and \(b_{ij}\) are the elements of the matrices, then the inner product of the two matrices is computed as shown in Eq. 7.

$$\begin{aligned} A\bullet B=<A,B>=Tr(A^{T}B)={\sum \limits _{i=1}^{n}\sum \limits _{j=1}^{n}a_{ij}b_{ij}}=(vecA)^{T}vecB. \end{aligned}$$
(7)

The similarity of the two matrices is obtained, and then we can output the text description of the image with the highest similarity. Table 1 shows similar images and their corresponding text descriptions. On the left side of the table is the image to be tested, as well as its question and true answer, and on the right side is the image of the training set retrieved with the image retrieval model, as well as its question and answer pair.

Table 1 An example of using the image retrieval model
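The retrieval step can be sketched as follows; the input resolution, the use of torchvision's VGG16 convolutional features, and the flattening choice are our assumptions, but the normalization and inner-product comparison follow Eqs. (6) and (7).

```python
# Sketch: VGG16 convolutional features, L2-normalized, compared by cosine
# similarity; the answer of the most similar training image is returned.
import torch
from torchvision import models

vgg = models.vgg16(pretrained=True).features.eval()   # fully connected layers removed

@torch.no_grad()
def vgg_feature(img_batch):                            # img_batch: (B, 3, 224, 224)
    f = vgg(img_batch).flatten(1)                      # (B, 512*7*7)
    return f / f.norm(dim=1, keepdim=True)             # divide by the feature norm

def retrieve_answer(test_img, train_feats, train_answers):
    """train_feats: pre-computed normalized features of all training images."""
    q = vgg_feature(test_img)                          # (1, D)
    sims = train_feats @ q.t()                         # inner product = cosine similarity
    best = sims.squeeze(1).argmax().item()
    return train_answers[best]                         # label of the most similar image
```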

Evaluation metrics

There are many evaluation methods for VQA-Med tasks. Compared with the long text descriptions of the image captioning task, the answers in VQA-Med are usually short sentences. In principle, the evaluation methods generally used for QA can be used for VQA to compare the similarity between predicted answers and true answers. Based on the datasets we use, and in order to make our experimental results comparable and meaningful, we follow the ImageCLEF VQA-Med competition and adopt three evaluation methods for the predicted answers. For ImageCLEF2018 VQA-Med, ImageCLEF2019 VQA-Med, and VQA-Rad, the evaluation methods are accuracy, WBSS, and BLEU. The three following sections introduce these evaluation methods.

Accuracy

This evaluation standard is both the simplest and the strictest. The algorithm considers an answer correct only if every word, and the order of every word, matches between the predicted answer and the true answer. Therefore, the score for a single answer is either 1 or 0, and the overall accuracy is computed as shown in Eq. 8.

$$\begin{aligned} Accuracy=\frac{|\{\text {Correctly predicted answers}\}|}{|\{\text {Answers}\}|} \end{aligned}$$
(8)
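A minimal sketch of this strict matching (our own helper, after lowercasing and trimming whitespace) is:

```python
# Strict accuracy of Eq. (8): an answer counts only if it matches exactly.
def strict_accuracy(predicted_answers, true_answers):
    correct = sum(p.lower().strip() == t.lower().strip()
                  for p, t in zip(predicted_answers, true_answers))
    return correct / len(true_answers)
```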

Word-based semantic similarity (WBSS)

WBSS is a method to calculate semantic similarity in the medical field. The ImageCLEF2018 VQA-Med competition created a word-level semantic similarity measurement method based on Wu-Palmer similarity (WUPS) [47]. It relies on the notion of a concept node, which is the collection of a word and its synonyms, and the similarity is computed as shown in Eq. 9.

$$\begin{aligned} Sim(x_{1},x_{2})_{Wu\& P}=\frac{2\times Depth(LCS(x_{1},x_{2}))}{Distance(x_{1},x_{2})+2\times Depth(LCS(x_{1},x_{2}))}, \end{aligned}$$
(9)

where \(Distance(x_{1},x_{2})\) represents the distance between concept nodes \(x_{1}\) and \(x_{2}\), and \(LCS(x_{1},x_{2})\) represents the least common subsumer of concept nodes \(x_{1}\) and \(x_{2}\), that is, the smallest common parent node of the two groups of synonyms: the common synonym finally found by cascading their respective synonyms. \(Depth(LCS(x_{1},x_{2}))\) represents the depth of the least common subsumer of \(x_{1}\) and \(x_{2}\), that is, the depth of the synonym concatenation. There is also a method similar to WBSS called concept-based semantic similarity (CBSS). As the name implies, CBSS also calculates semantic similarity. The difference between CBSS and WBSS is that CBSS uses MetaMap [48] to extract biomedical concepts from the answers through the pymetamap wrapper [7]; dictionaries of the concepts in the true and predicted answers are built, and the semantic similarity between them is calculated by cosine similarity, similar to the principle of the similarity calculation in Eq. 6. This is also an evaluation method commonly used in VQA.

BLEU

The full name of BLEU is ‘bilingual evaluation understudy’, and it is a tool originally used to evaluate the quality of machine translation. The evaluation uses the Snowball stemmer and the ‘english’ stop-word list of NLTK to remove stop words before calculating the BLEU score. BLEU uses n-gram matching to compare the similarity of groups of n words between the predicted answer and the true answer. Generally speaking, the 1-gram score reflects how many words are correctly predicted, whereas higher-order n-grams focus more on the readability and fluency of the predicted answer. The BLEU score is essentially an ‘improved n-gram precision’ multiplied by a brevity penalty for answers that are too short. The brevity penalty (BP) is computed as follows:

$$\begin{aligned} BP=\left\{ \begin{array}{ll} 1 & \text {if}\ l_{c}>l_{s} \\ e^{1-\frac{l_{s}}{l_{c}}} & \text {if}\ l_{c}\le l_{s} \end{array} \right. \end{aligned}$$
(10)

where \(l_{c}\) represents the length of the predicted answer and \(l_{s}\) represents the effective length of the true answer. When the length of the predicted answer is larger than that of the true answer, the penalty coefficient is 1, which means no punishment. Only when the length of the predicted answer is less than that of the true answer is the penalty factor applied.
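A hedged sketch of such a BLEU computation with NLTK is shown below; the exact preprocessing of the official evaluation script may differ, and the smoothing function is our own choice to avoid zero scores on very short answers.

```python
# BLEU in the spirit described above: lowercase, remove English stop words,
# stem with the Snowball stemmer, then score with sentence-level BLEU
# (which already applies the brevity penalty BP).
from nltk.corpus import stopwords                      # requires nltk.download('stopwords')
from nltk.stem.snowball import SnowballStemmer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

stemmer = SnowballStemmer("english")
stops = set(stopwords.words("english"))

def preprocess(answer):
    return [stemmer.stem(w) for w in answer.lower().split() if w not in stops]

def bleu(predicted_answer, true_answer):
    hyp, ref = preprocess(predicted_answer), preprocess(true_answer)
    return sentence_bleu([ref], hyp,
                         smoothing_function=SmoothingFunction().method1)
```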

Analysis of datasets

ImageCLEF2018 VQA-Med

We retrieve the ImageCLEF2018 VQA-Med dataset from the official website. The dataset is divided into a training set, validation set, and test set. We classify this dataset into three types of questions: ‘what’, ‘where’, and ‘yes/no’. The ‘what’ and ‘where’ questions are irregular, open-ended questions, whereas the ‘yes/no’ questions are close-ended questions. The numbers of question-answer pairs and corresponding images are listed below.

  • The training set consists of 5413 question-answer pairs with about 2278 images.

  • The validation set consists of 500 question-answer pairs with about 324 images.

  • The test set consists of 500 question-answer pairs with about 264 images.

The word frequency in the sequences of question and answer pairs in the ImageCLEF2018 VQA-Med dataset is shown in Fig. 11, and the x-axis represents the interval of word frequency.

Fig. 11 a Question word-frequency distribution. b Answer word-frequency distribution in the ImageCLEF2018 VQA-Med dataset

The word frequency of the questions is concentrated in the interval [3, 12], and the word frequency of the answers is concentrated in the interval [1, 13]. We set the maximum sentence length to 15. Since this dataset is automatically collected from PubMed Central papers, one problem is that, in addition to radiology images, there are also some 3D reconstruction composite images in the dataset, as shown in Table 2.

Table 2 Three examples of the Q&A pair of 3D reconstruction composite images

As a result, some meaningless images are mixed into the planar medical radiology image dataset. Another problem is that part of the dataset is automatically generated from the titles of the papers that include the images, which leads to some questions being inconsistent with the images. These two problems are among the reasons why almost all experiments based on the ImageCLEF2018 VQA-Med dataset are unsatisfactory. For the reconstructed 3D images, we still process them into gray-scale images and input them into the model in the conventional way, but this loses important features of these images and leads to prediction errors.

ImageCLEF2019 VQA-Med

The ImageCLEF2019 VQA-Med dataset was officially released in 2019, and we registered on the official website to retrieve it. The dataset is divided into a training set, validation set, and test set. The numbers of question-answer pairs and corresponding images are listed below:

  • The training set consists of 12792 question-answer pairs with about 3200 images.

  • The validation set consists of 2000 question-answer pairs with about 500 images.

  • The test set consists of 500 question-answer pairs with about 500 images.

Medical images in the dataset are balanced samples from MedPix [52], which is an open-access case report and teaching case radiology archive. Compared with the ImageCLEF2018 VQA-Med dataset, the ImageCLEF2019 VQA-Med dataset is more standardized and has greater learning value. The official website classifies questions in the ImageCLEF2019 VQA-Med dataset into four types: ‘modality’, ‘plane’, ‘organ’, and ‘abnormality’. Each type of question contains open-ended and close-ended questions, except for the ‘abnormality’ type questions, which only contain irregular, open-ended questions. The word frequency in the sequences of question and answer pairs in the ImageCLEF2019 VQA-Med dataset is shown in Fig. 12, and the x-axis represents the interval of word frequency. The word frequency of the questions is concentrated in the interval [5, 8], and the word frequency of the answers is concentrated in the interval [1, 4]. We set the maximum sentence length to 11.

Fig. 12 a Question word-frequency distribution and b answer word-frequency distribution in the ImageCLEF2019 VQA-Med dataset

VQA-RAD

VQA-RAD is a public medical dataset, which has the same sample source as the ImageCLEF2019 VQA-Med dataset. It only contains a training set and a test set. The numbers of question-answer pairs and corresponding images are listed below:

  • The training set consists of 3064 question-answer pairs with about 315 images.

  • The test set consists of 451 question-answer pairs with about 315 images.

The VQA-RAD dataset is officially divided into 11 types of questions: ‘modality’, ‘plane’, ‘organ’, ‘abnormality’, ‘pos’, ‘color’, ‘size’, ‘count’, ‘attribute’, ‘other’, and ‘pres’. Each type contains open-ended and close-ended type questions, except for the ‘other’ type questions, which only contain irregular, open-ended questions. The word frequency in the sequences of question and answer pairs in the VQA-Rad dataset is presented in Fig. 13, and the x-axis represents the interval of word frequency.

Fig. 13 a Question word-frequency distribution and b answer word-frequency distribution in the VQA-RAD dataset

Because there are many types of data in this dataset, we examine the predicted answers and true answers for each type of question. We find that if the same candidate answers correspond to multiple questions in the training set, and the test images are similar to the images in the training set, the model can probably accurately predict this sample. A further point is that there are many labels that have the same meaning, but they are written differently. For example, there are several ways to write ‘x-ray’, such as ‘Plain film x-ray’ and ‘Xray’, and there are several ways to write ‘T2-MRI’, such as ‘T2 weighted MRI’, ‘MRI - T2 weighted’, and ‘T2’. However, for the sake of rigorous prediction, we have not unified these answers.

Analysis of experimental results

The training and experiments of the model are based on the PyTorch machine learning platform and an NVIDIA GeForce RTX 2080Ti graphics card with 11 GB of memory. The model hyper-parameters are shown in Table 3.

Table 3 The model hyper-parameters settings

According to the statistics of the datasets, we set the dictionary size to 3000 and the longest sentence length to 20. In deep neural network training, the batch size is the number of samples taken in each training step; it is generally set to a power of 2 to facilitate the gradient descent computation. Limited by the number of samples and the hardware computing power, 64 is an appropriate value. In this paper, different epoch values are determined according to the convergence of the loss function trained on the different types of data in each dataset. For example, when training on the ‘plane’ type questions of the ImageCLEF2019 VQA-Med dataset, the number of epochs is 54, so iterations = number of samples \(\times\) epochs / batch size = 3200 \(\times\) 54 / 64 = 2700. Because we train the different types of questions separately, the epoch and iteration values in the table are averages. We set the learning rate to 0.0001 and use the Adam optimizer to adjust it dynamically so that the optimal point is not missed.
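Under our reading of these settings, a minimal training configuration looks as follows; the placeholder classifier head stands in for the first-branch network defined elsewhere.

```python
# Sketch of the training setup: batch size 64, Adam with lr 1e-4, cross-entropy
# loss; the iteration count follows the formula given in the text.
import torch
import torch.nn as nn

batch_size, epochs, num_samples = 64, 54, 3200             # 'plane' type example
iterations = num_samples * epochs // batch_size            # 3200 * 54 / 64 = 2700

model = nn.Linear(312, 16)                                 # placeholder classifier head
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate 0.0001
criterion = nn.CrossEntropyLoss()                          # loss function used in Fig. 14
```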

The experimental results of the proposed BPI-MVQA on the ImageCLEF2018 VQA-Med dataset are shown in Table 4. For the ‘yes/no’ close-ended type questions, the parallel structure model is used for classification. For the irregular, open-ended question types of this dataset, the image retrieval model is used to provide the best text description answers to the questions. In order to evaluate the results more comprehensively, we include the conventional recall and precision metrics in the table.

Table 4 The experimental results of the proposed BPI-MVQA on the ImageCLEF2018 VQA-Med dataset

Since no candidate answers in the training set are exactly the same as the test set answers, our model can still generate predicted answers with similar meanings and reference value based on the image features, as shown in Table 5.

Table 5 Three examples of using image retrieval on the ImageCLEF2018 VQA-Med dataset

The experimental results of BPI-MVQA on the ImageCLEF2019 VQA-Med dataset are shown in Table 6. The parallel structure model is used to predict the answers to the ‘modality’, ‘plane’, and ‘organ’ type questions, and the image retrieval model is used to predict the answers to the irregular, open-ended ‘abnormality’ type questions.

Table 6 The experimental results of the proposed BPI-MVQA on the ImageCLEF2019 VQA-Med dataset

We find that the accuracy of predicted answers to irregular, open-ended ‘abnormality’ type questions has increased significantly. Even if the predicted answer does not match the true answer exactly, our image retrieval method can generate a text description closely related to the true answer, as shown in Table 7. We believe the predicted answer has a certain reference value.

Table 7 Three examples of using image retrieval for abnormality type questions in the ImageCLEF2019 VQA-Med dataset

In the process of our experiments on VQA-Rad, we encounter the problem of data imbalance. There are only 315 medical images in this dataset, which unsurprisingly leads to serious underfitting, a common problem with small datasets. We perform image augmentation for this dataset, which expands the number of images and enables us to extract image features from different angles of the same image. Then, we use the image retrieval model to predict the answers to irregular, open-ended ‘other’ type questions. The experimental results show that the predictions of our model on the answers to ‘color’, ‘count’, and ‘attribute’ type questions are completely correct, as shown in Table 8.

Table 8 The experimental results of the proposed BPI-MVQA on the VQA-RAD dataset

We set different epochs according to different convergence rates, and we use the cross-entropy function as our loss function. Fig. 14 presents the training and validation loss curves of the first branch of BPI-MVQA on the three datasets.

Fig. 14 a The loss curves of the ImageCLEF2018 VQA-Med dataset. b The loss curves of the ImageCLEF2019 VQA-Med dataset. c The loss curves of the VQA-Rad dataset

We respectively spend 3 h 10 min, 15 h 25 min, and 6 h 52 min training BPI-MVQA on the ImageCLEF2018 VQA-Med, ImageCLEF2019 VQA-Med, and VQA-RAD datasets. In the real world, the purpose of VQA-Med is to accurately answer clinical problems presented by medical images. We can spend less time on training basic questions, such as questions related to ‘color’ and ‘count’. For example, we only spend 15 minutes on training BPI-MVQA on the type of ‘count’ data, yet it achieves high accuracy. This indicates that we can dedicate more training time to irregular, open-ended type questions. A piece of medical image data contains rich health information about the patient. The VQA-Med system can retrieve this information, which is the key basis for early screening, differential diagnosis, and treatment of various diseases in clinics. The VQA-Med system can also generate health reports to help patients fully understand their condition. In short, VQA-Med has great potential in the medical care industry and service field.

In addition, we conduct ablation experiments to evaluate the effects of innovation points on the performance of BPI-MVQA. Hereafter, W., A., and B. stand for WBSS, accuracy, and BLEU, respectively.

Table 9 shows the effect of using the parallel structure model to extract image features. We compare the effect of including the RNN module (BPI-MVQA) and excluding the RNN module (BPI-MVQA without parallel structure) on the prediction accuracy. The method of combining the CNN and RNN structures enriches the information of the feature matrix of each image. In addition, combining the two kinds of features differently may not only disrupt the regular pattern of the spatial features but also fail to reflect the regular pattern of the sequence features. The experimental results show that completely retaining the two kinds of feature information is effective. We can see that the performance of the model with the parallel structure is better than that of an ordinary model based on ResNet152 and the transformer structure.

Table 9 Comparison of the model with and without RNN module

In the case of irregular, open-ended type questions, the image retrieval model is used to assist model prediction. Table 10 shows the effect of using the bi-branched structure to predict answers. We can see that using this method on a small portion of unbalanced data can improve the accuracy of prediction. We focus this method on the irregular, open-ended type questions of the three datasets, including the ‘abnormality’ type questions of the ImageCLEF2019 VQA-Med dataset and the ‘other’ type questions of the VQA-Rad dataset.

Table 10 Comparison of the model with and without image retrieval structure

We keep the extraction of image and question features and the image retrieval part unchanged, and fuse the two kinds of features by traditional point-wise multiplication instead. The results in Table 11 show that using the transformer to fuse the two kinds of features works better.

Table 11 Comparison of the fusion of two parts of features with and without transformer structure

As shown in Table 12, we find that the number of heads of the attention mechanism needs to be adjusted for different datasets. When the number of attention heads is 8, the results of BPI-MVQA on the ImageCLEF2019 VQA-Med dataset are better, but when the number of attention heads is 12, the results of BPI-MVQA on the VQA-Rad dataset are better. Since the ImageCLEF2018 VQA-Med dataset only uses the transformer structure for its close-ended questions, its results change little.

Table 12 Comparison of 8-head and 12-head attention mechanism

We compare our experimental results with those of several well-known models with strong results. As can be seen from Tables 13, 14, and 15, our model performs best.

Table 13 Experimental comparison of BPI-MVQA with other state-of-the-art methods on the ImageCLEF2018 VQA-Med dataset
Table 14 Experimental comparison of BPI-MVQA with other state-of-the-art methods on the ImageCLEF2019 VQA-Med dataset
Table 15 Experimental comparison of BPI-MVQA with other state-of-the-art methods on the VQA-RAD dataset

The experimental results show that BPI-MVQA establishes new state-of-the-art results on the three VQA-Med datasets, namely ImageCLEF2018 VQA-Med, ImageCLEF2019 VQA-Med, and VQA-RAD. The main metric scores exceed the previous best results by 0.2%, 1.4%, and 1.1%, respectively. The comparative experiments on the three datasets and the ablation experiments for each innovation point of the model show that the ideas proposed in this paper for the VQA-Med task effectively improve performance. In addition, as shown in Fig. 14, the convergence trends of the loss curves show no over-fitting or under-fitting.

Conclusion

Our lives are inseparable from medical care, and its progress will rely on the accumulation of rich medical cases and doctors' clinical experience, as well as the assistance of artificial intelligence. VQA-Med is a computer-assisted medical system that can help both doctors and patients understand a patient's current medical status. It can be used to interpret medical images of different organs taken with different imaging modalities and provide information on a variety of diseases. The model we propose herein has a bi-branched structure. For the regular classification problems, the model uses the hierarchical features extracted by the parallel structure of ResNet152 and GRU as the image features, combines them with three embedding methods, and inputs them into the transformer structure with special segmentation symbols. For irregular, open-ended type questions with no effective candidate answers, the model uses image retrieval to give the text description most similar to the test image as the answer. However, the accuracy of the predicted answers obtained by the image retrieval model is not ideal. Although our model achieves state-of-the-art performance on three datasets, the results may not be as good on other kinds of datasets. In the first branch of the BPI-MVQA model, image features and text features are simply concatenated and then input into the transformer structure model, which indicates that our multi-modal feature fusion is still not adequate and needs to be improved in the future. In future work, we may investigate integrating our proposed model with current innovative and effective methods and models, such as VisualBert [54] and ImageBert [55], which are also transformer structure models that support single-stream input. Furthermore, if we can build a medical image dataset with the same conditions for target detection as ordinary VQA datasets, such as target detection boxes, our VQA-Med system is likely to better align and fuse multi-modal features.

Availability of data and materials

The ImageCLEF2018 VQA-Med and ImageCLEF2019 VQA-Med datasets are available at https://www.imageclef.org. The VQA-Rad dataset is available at https://www.nature.com/sdata/. Our proposed model is available at https://github.com/liushengyan/BPI-MVQA.

Abbreviations

VQA-Med:

Visual question answering in medical domain

QA:

Question and answer

CV:

Computer vision

NLP:

Natural language processing

CNN:

Convolution neural network

RNN:

Recurrent neural network

LSTM:

Long short-term memory

MFB:

Multi-modal factorized bilinear

GRU:

Gated recurrent unit

SAN:

Stacked attention network

BPI-MVQA:

Bi-branch model based on parallel network and image retrieval for medical visual question answering

GAP:

Global average pooling

FFN:

Feed forward neural network

WBSS:

Word-based semantic similarity

BLEU:

Bilingual evaluation understudy

CBSS:

Concept-based semantic similarity

References

  1. Weston J, Bordes A, Chopra S, Rush AM, van Merriënboer B, Joulin A, Mikolov T. Towards ai-complete question answering: A set of prerequisite toy tasks. 2015. arXiv preprint arXiv:1502.05698.

  2. Hii P-C, Chung W-Y. A comprehensive ubiquitous healthcare solution on an android mobile device. Sensors. 2011;11(7):6799–815.


  3. Cao Y, Liu F, Simpson P, Antieau L, Bennett A, Cimino JJ, Ely J, Hong Yu. Askhermes: an online question answering system for complex clinical questions. J Biomed Inform. 2011;44(2):277–88.


  4. Paramasivam A, Jaya NS. A survey on textual entailment based question answering. J King Saud Univ-Comput Inform Sci. 2021.

  5. Izcovich A, Criniti JM, Ruiz JI, Catalano HN. Impact of a grade-based medical question answering system on physician behaviour: a randomised controlled trial. BMJ Evid-Based Med. 2015;20(3):81–7.


  6. Antol S, Agrawal A, Lu J, Mitchell M, Batra D, Lawrence ZC, Parikh D. Vqa: Visual question answering. In 2015 IEEE International Conference on Computer Vision (ICCV), 2016.

  7. Hasan Sadid A, Yuan L, Farri O, Liu J, Müller H. Overview of imageclef 2018 medical domain visual question answering task. In: CLEF working Notes, 2018.

  8. Sarrouti M, Ben Abacha A, Demner-Fushman D. Goal-driven visual question generation from radiology images. Information. 2021;12(8):334.


  9. Thompson T, Grove L, Brown J, Buchan J, Burge S. Cogconnect: a new visual resource for teaching and learning effective consulting. Patient Educ Counsel. 2021.

  10. Sheng-Dong N, Bin Z, Wen L. Design of computer-aided detection and classification of lung nodules using ct images. J Syst Simul. 2007.

  11. Cid YD, Liauchuk V, Kovalev V, Müller H. Overview of image cleftuberculosis 2018-detecting multi-drug resistance, classifying tuberculosis types and assessing severity scores. In CLEF (Working Notes). 2018.

  12. Nawaz M, Sewissy AA, Soliman THA. Multi-class breast cancer classification using deep learning convolutional neural network. Int J Adv Comput Sci Appl. 2018;9(6):316–32.


  13. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.


  14. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

  15. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;770–778.

  16. Srinivasan K, Garg L, Datta D, Alaboudi AA, Jhanjhi NZ, Agarwal R, Thomas AG. Performance comparison of deep cnn models for detecting driver’s distraction. CMC-Comput Mater Continua. 2021;68(3):4109–24.


  17. Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.

  18. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.


  19. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

  20. Peng Y, Liu F, Rosen MP. Umass at imageclef medical visual question answering (med-vqa) 2018 task. In CLEF (Working Notes), 2018.

  21. Yu Z, Yu J, Fan J, Tao D. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE international conference on computer vision, 2017;1821–1830.

  22. Zhou Y, Kang X, Ren F. Employing inception-resnet-v2 and bi-lstm for medical domain visual question answering. In CLEF (Working Notes), 2018.

  23. Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.

  24. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997;45(11):2673–81.


  25. Abacha AB, Gayen S, Lau JJ, Rajaraman S, Demner-Fushman D. Nlm at imageclef 2018 visual question answering in the medical domain. In CLEF (Working Notes), 2018.

  26. Yang Z, He X, Gao J, Deng L, Smola A. Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;21–29.

  27. Zhejiang University at ImageCLEF 2019 Visual Question Answering in the Medical Domain. 2019.

  28. Kornuta T, Rajan D, Shivade C, Asseman A, Ozcan AS. Leveraging medical visual question answering with supporting facts. arXiv preprint arXiv:1905.12008, 2019.

  29. Liao Z, Wu Q, Shen C, Van Den Hengel A, Verjans J. Aiml at vqa-med 2020: Knowledge inference via a skeleton-based sentence mapping approach for medical domain visual question answering. 2020.

  30. Al-Sadi A, Hana’Al-Theiabat, Al-Ayyoub M. The inception team at vqa-med 2020: Pretrained vgg with data augmentation for medical vqa and vqg. In CLEF (Working Notes), 2020.

  31. Zhan L-M, Liu B, Fan L, Chen J, Wu X-M. Medical visual question answering via conditional reasoning. In Proceedings of the 28th ACM International Conference on Multimedia, 2020;2345–2354.

  32. Xiao Q, Zhou X, Xiao Y, Zhao K. Yunnan university at vqa-med 2021: Pretrained biobert for medical domain visual question answering. In CLEF (Working Notes), 2021.

  33. Gupta D, Suman S, Ekbal A. Hierarchical deep multi-modal network for medical visual question answering. Expert Syst Appl. 2021;164:113993.


  34. Do T, Nguyen BX, Tjiputra E, Tran M, Tran QD, Nguyen Anh. Multiple meta-model quantifying for medical visual question answering. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 64–74. Springer, 2021.

  35. Lin Z, Zhang D, Tac Q, Shi D, Haffari G, Wu Q, He M, Ge Z. Medical visual question answering: A survey. arXiv preprint arXiv:2111.10056, 2021.

  36. Ren M, Kiros R, Zemel R. Exploring models and data for image question answering. Adv Neural Inf Process Syst. 2015;28:2953–61.


  37. Gao H, Mao J, Zhou J, Huang Z, Wang L, Wei X. Are you talking to a machine? dataset and methods for multilingual image question. Adv Neural Inf Process Syst. 2015;28:2296–304.


  38. Krishna R, Zhu Y, Groth O, Johnson J, Hata K, Kravitz J, Chen S, Kalantidis Y, Li L-J, Shamma DA, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int J Comput Vision. 2017;123(1):32–73.


  39. Zhu Y, Groth O, Bernstein M, Fei-Fei L. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016;4995–5004.

  40. Johnson J, Hariharan B, van der Maaten L, Fei-Fei L, Lawrence Zitnick C, Girshick R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017;2901–2910.

  41. Kafle K, Yousefhussien M, Kanan C. Data augmentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, 2017;198–202.

  42. Li Q, Tao Q, Joty S, Cai J, Luo J. Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions. In Proceedings of the European Conference on Computer Vision (ECCV), 2018;552–567.

  43. Lu J, Batra D, Parikh D, Lee S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in neural information processing systems, 2019;13–23.

  44. Tan H, Bansal M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.

  45. Lin M, Chen Q, Yan S. Network in network. arXiv preprint arXiv:1312.4400, 2013.

  46. Kougia V, Pavlopoulos J, Androutsopoulos I. Aueb nlp group at imageclefmed caption 2019. In CLEF (Working Notes), 2019.

  47. Malinowski M, Fritz M. A multi-world approach to question answering about real-world scenes based on uncertain input. Adv Neural Inf Process Syst. 2014;27:1682–90.


  48. Aronson AR. Metamap: Mapping text to the umls metathesaurus. Bethesda, MD: NLM, NIH, DHHS. 2006;1:26.


  49. Allaouzi I, Ahmed MB. Deep neural networks and decision tree classifier for visual question answering in the medical domain. In CLEF (Working Notes), 2018.

  50. Vu M, Sznitman R, Nyholm T, Löfstedt T. Ensemble of streamlined bilinear visual question answering models for the imageclef 2019 challenge in the medical domain. In CLEF 2019-Conference and Labs of the Evaluation Forum, Lugano, Switzerland, Sept 9-12, 2019, volume 2380, 2019.

  51. Shi L, Liu F, Rosen MP. Deep multimodal learning for medical visual question answering. In CLEF (Working Notes), 2019.

  52. Ren F, Zhou Y. Cgmvqa: A new classification and generative model for medical visual question answering. IEEE Access. 2020;8:50626–36.


  53. Nguyen BD, Do T-T, Nguyen BX, Do T, Tjiputra E, Tran QD. Overcoming data limitation in medical visual question answering. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 522–530. Springer, 2019.

  54. Li LH, Yatskar M, Yin D, Hsieh C-J, Chang K-W. Visualbert: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.

  55. Qi D, Su L, Song J, Cui E, Bharti T, Sacheti A. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.


Acknowledgements

Not applicable.

Funding

This work was supported by the Natural Science Foundation of China under Grant 61463050, Grant 11601474, Grant 61702443, and Grant 61762091.

Author information

Authors and Affiliations

Authors

Contributions

Writing–Original Draft Preparation, LSY; Methodology and Funding Acquisition, ZXB; Project administration, ZXJ; Writing–Review and Editing, YJ. All authors reviewed the manuscript.

Corresponding author

Correspondence to Xiaobing Zhou.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

There are no conflicting interests known to the authors.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Liu, S., Zhang, X., Zhou, X. et al. BPI-MVQA: a bi-branch model for medical visual question answering. BMC Med Imaging 22, 79 (2022). https://doi.org/10.1186/s12880-022-00800-x

