Data
This was an algorithm development and evaluation study involving simulated and clinical radiology report datasets. The AC corpus (n = 205 topic documents) was extracted from the ACR website. The text of each topic document was then tokenized and lemmatized using the Python NLTK library [14] to create a set of document bodies. Separately, the titles and variants of each document were extracted to create a set of document headers. The simulated evaluation dataset of 410 indications (Testing Dataset 1) was created by a medical student and a PGY-5 radiology resident under the supervision of a board-certified radiologist. This testing dataset contained two indications for each AC corpus topic document, including pediatric topics. The radiology reports used for evaluation in Testing Dataset 2 were retrospectively collected from a single tertiary academic medical institution following Institutional Review Board approval and a waiver of consent.
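As an illustration, a minimal sketch of this document-body preparation step, assuming NLTK's word_tokenize and WordNetLemmatizer (the specific tokenizer and lemmatizer are not stated above):

```python
# Sketch of AC document body preprocessing (tokenize + lemmatize) with NLTK.
# word_tokenize and WordNetLemmatizer are assumptions; the text states only
# that the NLTK library was used.
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def preprocess_body(text: str) -> list[str]:
    """Lowercase, tokenize, and lemmatize one AC topic document body."""
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(preprocess_body("Acute chest pain with suspected aortic dissection"))
```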
Text preprocessing
In all cases, the raw query was preprocessed by expanding abbreviations, tokenizing, and removing stop words and punctuation. To expand abbreviations, the Radiopaedia list of ~ 3000 abbreviations [15] was extracted, processed, and edited to discard irrelevant and ambiguous entries. Expanded forms were appended to the query.
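A minimal sketch of this query-preprocessing pipeline follows; the two-entry abbreviation dictionary is an illustrative stand-in for the curated Radiopaedia list:

```python
# Sketch of query preprocessing: expand abbreviations, tokenize, and drop
# stop words and punctuation. ABBREVIATIONS is a tiny illustrative stand-in
# for the curated ~3000-entry Radiopaedia list.
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

ABBREVIATIONS = {"sob": "shortness of breath", "r/o": "rule out"}
STOP_WORDS = set(stopwords.words("english"))

def preprocess_query(query: str) -> str:
    tokens = word_tokenize(query.lower())
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        if tok in ABBREVIATIONS:
            # Expanded forms are appended to the query, as described above.
            expanded.extend(word_tokenize(ABBREVIATIONS[tok]))
    kept = [t for t in expanded
            if t not in STOP_WORDS and t not in string.punctuation]
    return " ".join(kept)

print(preprocess_query("65 y/o with SOB, r/o PE"))
```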
Algorithm development
An overview of our algorithm’s backend is outlined in Fig. 1. All code is available at https://bit.ly/3giZwSa. The algorithm’s overall complexity is O(n), where n is the number of words in the search query.
AC document ranking score
The semantic similarity aspect of our algorithm uses sent2vec [11], an extension of word2vec [16]. We implemented sent2vec with unigrams and bigrams. Our model was trained on the open-source PubMed and MIMIC-III datasets [17], mimicking the approach of the BioSentVec model (https://bit.ly/2X7ZB1W) [18]. After training, the model was used to embed each AC document into three vectors, one for the document’s body, one for its header, and one for its top 50 TF-IDF features (see below). Each document’s ranking score, \(S_{i}\), for a given query, \(q\), was calculated by the following:
$$S_{i} = H_{q,i} + B_{q,i} + \beta T_{q,i}$$
where \(\beta\) is the weight given to the TF-IDF score, and \(H_{q,i}\), \(B_{q,i}\), and \(T_{q,i}\) are the cosine similarities between the query’s embedding vector and document \(i\)’s header, body, and TF-IDF feature vectors, respectively.
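As a sketch of how this score could be computed, assuming the open-source sent2vec Python bindings (the model filename and the default \(\beta\) below are placeholders, not values from this study):

```python
# Sketch of the ranking score S_i = H + B + beta * T using the sent2vec
# Python bindings (github.com/epfml/sent2vec). The model filename and the
# default beta are placeholders.
import numpy as np
import sent2vec

model = sent2vec.Sent2vecModel()
model.load_model("pubmed_mimiciii_sent2vec.bin")  # hypothetical filename

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def ranking_score(query: str, header_vec: np.ndarray, body_vec: np.ndarray,
                  tfidf_vec: np.ndarray, beta: float = 1.0) -> float:
    """S_i = H_{q,i} + B_{q,i} + beta * T_{q,i} for one AC document."""
    q = model.embed_sentence(query)[0]  # embed_sentence returns shape (1, dim)
    return (cosine(q, header_vec) + cosine(q, body_vec)
            + beta * cosine(q, tfidf_vec))
```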
Term frequency-inverse document frequency features
A term frequency-inverse document frequency (TF-IDF) model [19, 20] was created from the raw AC documents using scikit-learn’s TfidfVectorizer in Python, with unigrams, bigrams, and trigrams. This model scores each word or phrase by its frequency in a document relative to the corpus. Each document’s top 50 features were embedded into a single vector using the sent2vec model.
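A minimal sketch of this step, with two short stand-in documents in place of the 205 AC corpus texts:

```python
# Sketch of the TF-IDF feature step: fit TfidfVectorizer with uni/bi/trigrams
# and take each document's 50 highest-scoring features. raw_documents is an
# illustrative stand-in for the AC corpus.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

raw_documents = [
    "acute chest pain suspected aortic dissection",
    "head trauma with loss of consciousness",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 3))
tfidf_matrix = vectorizer.fit_transform(raw_documents)  # documents x features
feature_names = np.array(vectorizer.get_feature_names_out())

def top_features(doc_index: int, k: int = 50) -> list[str]:
    """Return the k highest-scoring TF-IDF features of one document."""
    scores = tfidf_matrix[doc_index].toarray().ravel()
    return feature_names[np.argsort(scores)[::-1][:k]].tolist()

# The top-50 feature strings are then joined and embedded with sent2vec
# into the single TF-IDF feature vector used in the ranking score.
print(top_features(0))
```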
Testing dataset 1: simulated radiology indications dataset
To comprehensively evaluate retrieval of each AC document, we generated a query dataset of two clinical indications for each of the 205 AC documents: one simple indication and one complex indication containing distractors and synonymous wording similar to that found in real clinical indications (examples in Additional file 1: Table S1). To quantify the quality of search result ranking for our simulated queries, we used normalized discounted cumulative gain (NDCG) [21]:
$$NDCG = \frac{\sum_{i = 1}^{n} \frac{rel_{i}}{\log_{2}(i + 1)}}{\sum_{i = 1}^{n} \frac{REL_{i}}{\log_{2}(i + 1)}}$$
where \(n\) is the number of unique AC documents, \(i\) is the search result rank, \(rel_{i}\) is the relevance of result \(i\), and \(REL_{i}\) is the relevance at rank \(i\) under the ideal (sorted) ordering.
Relevance was calculated by first tagging each AC document with one or more of the following tags: vascular disease, infection/inflammation, neoplasm, congenital, trauma, surgical, and many etiologies/topics (e.g., chest pain). Then, \(rel_{i}\) was calculated as the number of matching tags between the query and search result \(i\). The maximum possible relevance (\(REL_{i}\)) was obtained by sorting all results by their relevance to the query. An NDCG of 1 indicates perfect search result ranking.
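A minimal sketch of the NDCG computation from the equation above, where `relevances[i]` holds \(rel_{i}\) for the result at rank \(i+1\) and the ideal ordering supplies \(REL_{i}\):

```python
# Sketch of NDCG: DCG of the observed ranking divided by DCG of the ideal
# (relevance-sorted) ranking, per the equation above.
import numpy as np

def ndcg(relevances: list[float]) -> float:
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1), i = 1..n
    dcg = np.sum(rel / discounts)
    ideal_dcg = np.sum(np.sort(rel)[::-1] / discounts)  # best possible ordering
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# e.g. three matching tags at rank 1, one matching tag at rank 3:
print(ndcg([3, 0, 1, 0]))  # ~0.96
```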
Testing dataset 2: radiology report clinical indications dataset
To test the algorithm’s performance in a clinical workflow, we extracted a dataset (n = 3731) of de-identified radiology reports from our department of radiology spanning 01/11/2020 to 01/18/2020 (Fig. 2). Diagnostic radiology reports from all study types except chest X-rays were extracted consecutively and comprehensively, with the limited exclusion criteria specified below, to minimize selection bias and simulate a real clinical workflow. Chest X-ray reports were not collected because their indications are frequently too simple (e.g., “fever”) or not clinically relevant. The text of the clinical indications section was automatically extracted from this dataset using pattern matching. Some reports (n = 291; 7.8%) were excluded because the indication text was blank or the report did not follow our institution’s standard format. The resulting n = 3440 radiology report clinical indications were run through our algorithm, and the top 10 predictions were aggregated. A random subset of 100 indications and algorithm outputs was evaluated by a radiologist, who clinically determined whether each indication had none, one, or multiple appropriate AC documents and ranked which (if any) of the algorithm outputs were correct.
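An illustrative sketch of such pattern matching follows; the section header and report layout are assumptions, since actual report formats are institution-specific:

```python
# Illustrative extraction of the clinical indication section via regex.
# The "INDICATION:" header and the next-all-caps-header delimiter are
# assumptions about the report format, not this institution's actual layout.
import re

INDICATION_RE = re.compile(
    r"INDICATIONS?:\s*(.+?)(?=\n[A-Z][A-Z /]+:|\Z)",
    re.DOTALL,
)

def extract_indication(report_text: str) -> str | None:
    """Return the indication text, or None (report excluded) if absent."""
    match = INDICATION_RE.search(report_text)
    return match.group(1).strip() if match else None

report = ("EXAM: CT abdomen/pelvis\n"
          "INDICATION: RLQ pain, rule out appendicitis.\n"
          "TECHNIQUE: ...")
print(extract_indication(report))  # -> "RLQ pain, rule out appendicitis."
```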
Custom Google search engine
Using Google’s Programmable Search feature, we created a custom Google search engine constrained to the AC documents. The engine was restricted to web pages with the prefix “https://acsearch.acr.org/docs/”, which corresponds to the 205 AC topic document pages. We ran a randomly chosen subset of Testing Dataset 1 (n = 100 indications) through both the custom Google search and our algorithm.
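A sketch of querying such an engine programmatically via Google’s Custom Search JSON API; the API key and engine ID are placeholders for credentials created in the Google developer console:

```python
# Sketch of querying a Programmable Search engine via Google's Custom Search
# JSON API. API_KEY and ENGINE_ID are placeholders; the engine itself is
# configured (in the console) to search only acsearch.acr.org/docs/ pages.
import requests

API_KEY = "YOUR_API_KEY"      # placeholder credential
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder engine identifier

def google_ac_search(indication: str, n: int = 10) -> list[str]:
    """Return the result URLs for one clinical indication query."""
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": indication, "num": n},
    )
    resp.raise_for_status()
    return [item["link"] for item in resp.json().get("items", [])]
```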
Statistical analysis
All reported values are means unless otherwise noted. To compare performance between simple and complex simulated indications, the Mann–Whitney U test was used for NDCG values and ground-truth rank values, a chi-squared test for Top 3 values, and a two-sample Kolmogorov–Smirnov test for the cumulative frequency curves. For the NDCG analysis, the non-parametric Kruskal–Wallis H test was used to compare performance among AC categories. The Friedman rank test was used to compare performance between our proposed algorithm and the custom Google search. All statistical analyses were conducted in Python using the SciPy package. Statistical significance was defined as p < 0.05.
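A minimal sketch of these comparisons with SciPy; the arrays and counts below are random placeholders, not study data:

```python
# Sketch of the statistical tests with SciPy. All inputs are placeholder
# data standing in for per-indication NDCG values, ranks, and Top 3 counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ndcg_simple = rng.uniform(0.7, 1.0, 205)   # placeholder per-query NDCG values
ndcg_complex = rng.uniform(0.6, 1.0, 205)

# Mann-Whitney U: NDCG (and ground-truth rank) for simple vs. complex queries
u_stat, p = stats.mannwhitneyu(ndcg_simple, ndcg_complex, alternative="two-sided")

# Chi-squared: Top 3 hit/miss counts per indication type (placeholder 2x2 table)
chi2, p, dof, _ = stats.chi2_contingency([[180, 25], [160, 45]])

# Two-sample Kolmogorov-Smirnov: cumulative frequency curves
ks_stat, p = stats.ks_2samp(ndcg_simple, ndcg_complex)

# Kruskal-Wallis H: NDCG across AC categories (placeholder category groups)
h_stat, p = stats.kruskal(ndcg_simple[:70], ndcg_simple[70:140], ndcg_simple[140:])

# The Friedman rank test (stats.friedmanchisquare) applies to the paired
# per-query measurements; note SciPy's implementation requires at least
# three related samples.
```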