1. Introduction
The release of chat generative pre-trained transformer (ChatGPT) in November 2022, followed by GPT-4 in March 2023, highlighted the use of large language models (LLMs) as powerful tools across a broad range of applications
[1],
[2]. When trained on vast corpora containing billions of tokens, LLMs exhibit impressive text generation and interpretation capabilities, which can parallel human-level understanding
[3]. LLMs have not only revolutionized creative and technical writing tasks for the public but also achieved state-of-the-art performance in diverse scientific fields. A keyword search for “large language models” or “ChatGPT” in the Web of Science returned 105 722 articles by the end of November 2023, covering major topics such as engineering, computer science, and medicine.
LLMs excel as statistical models, leveraging conditional probabilities to predict word sequences. They have consistently set new benchmarks in natural language processing (NLP) tasks, catalyzing the development of specialized biomedical LLMs. Notable models, including BioMedLM
[4], BioGPT
[5], and PMC-Llama
[6], have been trained or fine-tuned on domain-specific datasets, such as PubMed citations and full texts, to increase their utility in biomedical applications. For example, BioGPT leverages a vast corpus of medical literature to empower researchers and healthcare professionals with tools that facilitate insight extraction and decision-making tasks. These models underscore the leading-edge advancements that LLMs bring to biomedical research, enhancing the efficiency and precision of data analyses while opening new pathways for innovations in information extraction
[7], healthcare delivery, and medical education
[8].
Despite their impressive benefits, LLMs sometimes suffer from hallucination and confabulation issues, which could be particularly problematic in applications requiring high accuracy and reliability, such as medical advice provision. Inaccuracies and biased responses from biomedical LLMs may delay optimal treatment, cause psychological or physical harm, or even endanger lives. Therefore, ensuring that the responses obtained from biomedical LLMs are rigorously validated and presented with adequate transparency is crucial
[9]. Strategies for mitigating these issues include improving the quality of training data and providing sufficient evidence for model inferences.
The outstanding performance of LLMs in NLP tasks positions them as promising engines for medical knowledge retrieval and question-answering (QA) systems. In this study, we aim to construct a novel medical knowledge retrieval and QA framework that leverages an external database and provides comprehensive evaluations from both qualitative and quantitative perspectives.
2. Related work
LLMs hold significant promise across a wide array of biomedical and healthcare applications, spanning clinical decision-making, medical education, and beyond. Recent studies have continued to expand their capabilities, including applications such as data-augmented LLMs that are used to infer cancer responses from radiology reports
[10]. Given the focus of this study on medical knowledge retrieval and QA, the NLP research that is pertinent to these domains is primarily reviewed in the following section.
2.1. Knowledge extraction
Knowledge extraction is a core task in NLP that transforms unstructured text into structured knowledge; it primarily consists of two subtasks: named entity recognition (NER) and relation extraction (RE). While the early approaches relied heavily on handcrafted features and rule-based systems, advances in deep learning and transformer models have driven substantial improvements.
Classic knowledge extraction systems, such as the Mayo clinical text analysis and knowledge extraction system (cTAKES)
[11], have been foundational in medical information extraction work. This open-source framework offers a modular architecture for analyzing clinical text, integrating machine learning with rule-based methods to support extraction tasks. Similarly, knowledge guided distance supervision (KGDS)
[12] enhances the ability to extract relations from electronic medical records by introducing biomedical knowledge, which is particularly valuable when entities in text do not align with standard knowledge bases.
Recent progress in knowledge extraction has come from encoding medical knowledge directly into pretrained language models. For example, Roy and Pan
[13] integrated Unified Medical Language System (UMLS) concepts into bidirectional encoder representations from transformers (BERT) embeddings, improving the ability of their model to understand and extract complex medical relationships. Semantic repository (SemRep), an RE tool that was developed by the US National Library of Medicine (NLM), also leverages UMLS-based rules to capture relations in biomedical texts
[14]. However, a stringent evaluation by the NLM revealed that SemRep achieved a precision of only 0.55 and an F1 score of 0.42, with biomedical entity recognition and normalization contributing significantly to the error rate of 0.27
[15]. Additionally, specialized frameworks such as knowledge-enhanced medical relation extraction (KeMRE)
[16] have been created to handle Chinese medical texts, enhancing the BERT-based RE process through knowledge embeddings derived from clinical guidelines.
LLMs have been extensively evaluated on biomedical NER and RE tasks via benchmark datasets. For example, GPT-3 and GPT-4 achieved F1 scores of 0.73 and 0.82, respectively, on the BC5CDR-chemical dataset, which is a benchmark corpus for chemical–disease RE. Despite the power of LLMs, fine-tuning them on domain-specific knowledge remains essential to ensure contextually accurate information extraction. BioGPT exemplifies this approach, achieving high performance in chemical–disease, drug–target, and drug–drug RE tasks.
2.2. Information retrieval
Traditional information retrieval (IR) models, such as Okapi BM25
[17], were foundational with respect to ranking documents based on term frequency-inverse document frequency (TF-IDF), setting the stage for precision-focused retrieval tasks. These early models remain the backbone of many IR systems that are still in use today.
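The BM25 ranking referenced above can be sketched in a few lines. This is a minimal illustration only: the toy corpus and the default parameters k1 = 1.5 and b = 0.75 are common textbook choices, not values taken from any of the cited systems.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Okapi BM25 score for each tokenized document against the query."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the collection
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)  # non-negative idf variant
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)       # term-frequency saturation + length normalization
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

docs = [
    "covid transmission in hospitals".split(),
    "treatment of hypertension".split(),
    "covid covid vaccine efficacy".split(),
]
print(bm25_scores(["covid", "vaccine"], docs))
```

The saturation term keeps repeated occurrences of a query word from dominating the score, while the `b`-weighted length normalization prevents long documents from being favored merely for containing more terms.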
Specialized medical IR systems have since evolved, with notable examples including MedSearch
[18], which has enhanced traditional web search capabilities by accommodating long medical queries. The query reformulation techniques and diversified search results of MedSearch proved particularly beneficial for users who were unfamiliar with complex medical terminology. Similarly, PubMed
[19], which employs Medical Subject Headings (MeSH), has become the gold standard for biomedical literature retrieval, helping medical professionals access the latest advancements that are essential for evidence-based practice
[20]. Recent advancements in medical IR have focused on integrating artificial intelligence to better handle complex queries and synonyms
[21],
[22]. For example, a novel NLP-based keyword enhancement and screening method was introduced to assist scientists in optimizing keywords
[23]. This method uses prior knowledge to extract meaningful candidate keywords from initial search titles and abstracts and has exhibited efficacy in studies on atrial fibrillation topics. Additionally, Jin et al.
[24] developed a plug-in biomedical literature search module that enhances dense retrieval models with click logs, providing improved retrieval results on the basis of related queries.
While the traditional methods excel at performing structured searches, they often fall short in terms of addressing the contextual and semantic intricacies of complex medical queries. The introduction of LLMs has brought a new dimension to medical IR, enabling enhanced user experiences through sophisticated contextual understanding. By combining the robustness of classic IR models with the advanced semantic capabilities of LLMs, novel IR systems are becoming increasingly effective and intuitive for use in medical IR scenarios.
2.3. Question answering
Neural sequence-to-sequence models
[25] have significantly advanced the ability to generate context-aware responses. Traditionally, QA systems relying on retrieval have used predefined templates
[26] or search algorithms
[27] to extract answers from structured databases or unstructured text corpora. Recent work has shown that integrating knowledge graphs into neural dialog systems can substantially improve semantic understanding and the response accuracy achieved in scenarios involving medical dialogs
[28]. Similarly, models such as commonsense knowledge-aware dialogue generation model (ConKADI)
[29] and channel-aware knowledge fusing network (CAKF)
[30] leverage external medical knowledge to enable reasoning processes with personalized logic, enhancing the quality of medical QA.
The development of neural transformer architectures has further transformed the medical QA field
[31],
[32]. Numerous biomedical QA datasets, including MedQA (based on the US medical licensing exam) and PubMedQA (based on PubMed citation abstracts)
[33], have been curated to rigorously evaluate these models. These datasets have been instrumental in the advancement of QA-oriented LLMs. Notably, GPT-4 and medical pathways language model (Med-PaLM) 2
[34] have demonstrated high performance on MedQA, with accuracy scores of 86.1 and 86.5, respectively, compared with an average of 87.0 for human experts. BioMedLM, BioGPT, and Med-PaLM 2 have also yielded strong results on PubMedQA, with scores of 74.4, 81.0, and 81.8, respectively. Although these LLMs perform comparably to humans on standard QA datasets, they require comprehensive evaluations for effectively addressing practical biomedical inquiries.
Text summarization, which can be regarded as a subset of QA tasks, was pioneered in the 1950s and includes two primary approaches: extractive and abstractive summarization
[35]. Extractive summarization identifies key sentences within a document via statistical methods such as TF-IDF for word weighting
[36] or graph-based algorithms for sentence importance ranking
[37]. Abstractive summarization, on the other hand, interprets the main ideas of a text and rephrases them into a more concise, clear summary
[38]. Transformer-based models have recently dominated abstractive methods, with GPT-4 becoming widely used for medical literature reviews by condensing lengthy articles into manageable summaries, making it easier for researchers to access principal findings
[39],
[40]. In addition, summarization methods can be used for clinical note generation
[41] or diagnostic report extraction
[42].
3. Materials and methods
3.1. Data Collection
The data collection process for this study centered on retrieving coronavirus disease 2019 (COVID-19)-related literature from the PubMed platform as the primary data source. We established precise search criteria to capture relevant literature via keywords such as “novel coronavirus,” “2019-nCoV,” “COVID-19 virus,” “SARS-CoV-2,” and “SARS2.” These terms allowed us to effectively filter out irrelevant content and ensure that the retrieved articles directly pertained to our research objectives. Essential metadata, including titles, authors, abstracts, keywords, MeSH terms, and digital object identifiers (DOIs), were saved for secondary review and model training purposes. The collected articles were organized into seven categories, namely, mechanism, transmission, diagnosis, treatment, prevention, case report, and forecasting work, as shown in
Table 1. To maintain high accuracy and reliability, the potential limitations or biases within studies were carefully considered, resulting in a dataset consisting of 426 541 articles.
To further minimize the dataset biases, we utilized the TripClick public dataset
[43] as an additional benchmark. TripClick is a large-scale dataset derived from user interactions on the Trip Database health search engine, including approximately 5.2 million user interactions from 2013 to 2020. This dataset, along with an IR evaluation benchmark and accompanying metadata, supported the training processes of deep learning-based IR models.
For the semantic retrieval and fine-tuning steps of the tested LLMs, minimal preprocessing was applied to the articles through two key steps: ① standardizing the text to unicode transformation format-8-bit (UTF-8) encodings and removing any garbled or illegal characters and ② structuring the article content in a hierarchical JavaScript object notation (JSON) format. A thorough review process was subsequently undertaken by two annotators to confirm the accuracy and completeness of the data. Any discrepancies were resolved by a third reviewer. The review process was divided into the following four stages.
• Relevance review: In this stage, annotators independently reviewed each article for its alignment with the research focus, assessing its relevance based on critical keywords and ensuring that its content matched the predefined categories.
• Duplication review: During this review, given the potential for redundant entries (e.g., multiple versions of the same article) in large-scale PubMed retrieval, the annotators ensured the removal of all duplicates.
• Completeness review: In this step, the annotators verified that each article contained the essential metadata (e.g., titles, authors, abstracts, MeSH terms, and DOIs) and performed checks to ensure correct preprocessing (e.g., UTF-8 encoding and the removal of special or illegal characters). Missing data points were addressed by consulting supplementary sources.
• Repeatability review: To evaluate the consistency of the annotation process, a random 10% sample of the reviewed articles was reannotated by the same annotators after a set interval.
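The two preprocessing steps applied before the review stages, ① Unicode/UTF-8 cleanup and ② hierarchical JSON structuring, can be sketched as follows. The record schema and field names are hypothetical, since the exact JSON layout is not given in the text.

```python
import json
import unicodedata

def clean_text(text):
    """Step 1: normalize Unicode (stored as UTF-8) and drop garbled or
    illegal characters such as control bytes."""
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

def to_record(title, abstract, doi, mesh_terms):
    """Step 2: arrange article content into a hierarchical JSON record
    (field names here are illustrative, not ERQA's actual schema)."""
    record = {
        "metadata": {"title": clean_text(title), "doi": doi, "mesh": mesh_terms},
        "content": {"abstract": clean_text(abstract)},
    }
    return json.dumps(record, ensure_ascii=False)

print(to_record("SARS-CoV-2 entry\x00 mechanisms",
                "Spike protein binds ACE2.",
                "10.1000/example-doi", ["COVID-19"]))
```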
3.2. Framework design
In this work, we propose an LLM-driven mEdical literature Retrieval and QA framework called ERQA. As shown in
Fig. 1, ERQA integrates an enhanced LLM, a literature database, and a semantic vector database into a cohesive system. We anticipate that ERQA will advance the traditional retrieval methods, providing a sophisticated knowledge acquisition scheme that is directly linked to the medical literature.
The
enhanced LLM is based on Llama2
[44], leveraging a two-step process consisting of incremental pretraining and fine-tuning. Llama2, which was initially trained on diverse, general-purpose text corpora, lacks the nuanced understanding needed for medical literature retrieval and QA. Conducting incremental pretraining on collected biomedical text allows the utilized model to gradually incorporate domain-specific knowledge while preserving its original language generation capabilities
Fine-tuning further refines the model so that it can handle question classification, question reconstruction, abstract summarization, and literature-based QA tasks, using manually curated prompts to ensure high-quality outputs.
The literature database serves as a comprehensive repository of scholarly works, maintaining both the integrity and accessibility of the original textual content. Each entry operates at the article level, capturing essential metadata such as titles, authors, institutions, abstracts, keywords, and structured text, facilitating the seamless tracking of the original content after applying semantic queries.
The
semantic vector database is designed to support semantic-based retrieval, storing text embeddings at the paragraph level
[46]. The specified text inputs are processed by the enhanced language model, with the outputs yielded by the final transformer layer used to generate query embeddings. Efficient semantic retrieval is achieved through approximate nearest-neighbor search technology
[47],
[48]. Utilizing a
K-means clustering algorithm, all embeddings are preclustered into multiple subregions, with inversion files generated for rapid matching. When a query is received, initial similarity calculations are performed against the subregion centers, followed by secondary matching within the closest subregions, avoiding the computational load imposed by full enumeration. Mapping between these embeddings and unique article identifiers enables a transition from vector-based matching to readable text retrieval.
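The precluster-then-probe scheme described above (K-means subregions plus inversion files, an IVF-style index) can be sketched in pure Python. A production system would use an optimized ANN library; the two-dimensional toy vectors below stand in for paragraph embeddings.

```python
import math
import random

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20, seed=0):
    """Precluster embeddings into k subregions with plain Lloyd iterations."""
    centroids = random.Random(seed).sample(vectors, k)
    for _ in range(iters):
        buckets = {i: [] for i in range(k)}
        for v in vectors:
            buckets[min(range(k), key=lambda i: l2(v, centroids[i]))].append(v)
        for i, members in buckets.items():
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def build_ivf(vectors, centroids):
    """Inversion files: map each subregion center to its member vector ids."""
    ivf = {i: [] for i in range(len(centroids))}
    for vid, v in enumerate(vectors):
        ivf[min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))].append(vid)
    return ivf

def search(query, vectors, centroids, ivf, nprobe=1):
    """Coarse match on subregion centers first, then exact match only within
    the nprobe closest subregions, avoiding full enumeration."""
    order = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))
    candidates = [vid for c in order[:nprobe] for vid in ivf[c]]
    return min(candidates, key=lambda vid: l2(query, vectors[vid]))

# Toy embeddings forming two well-separated subregions
vecs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
cents = kmeans(vecs, 2)
ivf = build_ivf(vecs, cents)
print(search([9, 9], vecs, cents, ivf))
```

The `nprobe` parameter trades recall for speed: probing more subregions recovers neighbors that fall near cluster boundaries at the cost of more distance computations.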
The integration of these components provides ERQA with an innovative medical knowledge retrieval solution. The workflow, which is illustrated in
Fig. 2, begins with a researcher formulating a question, such as “What are the research hotspots regarding how severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viruses regulate host immune responses after 2021?” ERQA categorizes this question as a literature retrieval query. In scenarios involving specific bibliographic fields (e.g., publication dates, authors, and institutions), the enhanced LLM identifies these constraints (e.g., 2021–2024) and adjusts the question accordingly (e.g., “How do SARS-CoV-2 viruses regulate host immune responses?”); the adjusted question is then processed by the semantic vector database.
The framework retrieves the top N relevant semantic vectors that satisfy the extracted constraints, along with unique identifiers that link these vectors back to the full bibliographic information contained in the literature database. The enhanced LLM then generates summaries from the abstracts and titles of the top N articles, which are presented to the researcher in a list format. If the researcher seeks more detailed information, they may pose a more specific follow-up question (e.g., “How can cross-immune responses to SARS-CoV-2 infections be determined through T cell detection methods?”). This query, which is related to a specific retrieval result (e.g., DOI: 10.1038/s41467-021-21856-3), is transformed into an instruction-based question (e.g., “Based on the information in the article titled ‘...,’ please address the question ‘...’”). This retrieval-augmented generation approach helps mitigate the hallucination tendencies that are common in LLMs.
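The transformation of a follow-up question into an instruction-based question might look like the following sketch. The template mirrors the example quoted in the text and is an assumption, not ERQA's exact prompt.

```python
def build_instruction_question(article_title, question):
    """Rewrite a follow-up question as an instruction grounded in a specific
    retrieved article, so the LLM answers from the source text rather than
    from parametric memory (mitigating hallucination)."""
    return (
        f"Based on the information in the article titled '{article_title},' "
        f"please address the question '{question}'"
    )

prompt = build_instruction_question(
    "T cell responses to SARS-CoV-2",
    "How can cross-immune responses to SARS-CoV-2 infections be determined "
    "through T cell detection methods?",
)
print(prompt)
```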
3.3. Implementation details
The enhanced LLM was developed through incremental pretraining and fine-tuning based on the foundational LLM, that is, Llama2
[44], which includes 32 decoder layers with root mean square layer normalization (RMSNorm) replacing LayerNorm, multihead attention with grouped query attention (GQA), and rotary embedding for positional encoding. After training on a dataset consisting of 2 trillion tokens with a context window of 4096, we selected Llama-7B with 7 billion parameters and Llama-13B with 13 billion parameters as the foundational models for ERQA.
In the incremental pretraining stage, byte-pair encoding was applied to tokenize the acquired medical texts. This approach allows complex terminologies (e.g., “angiotensin-converting enzymes”) to be broken into meaningful subword units, enabling the model to efficiently handle medical terms. The pretraining process focused on next-token prediction to imbue the model with domain-specific knowledge via an unsupervised learning approach. Through exposure to biomedical texts, the model learned the relationships and dependencies within the medical literature. For example, recognizing that “ACE inhibitors” are related to “hypertension” or that “PCR testing” is linked to “COVID-19 diagnosis” allows the model to generate precise medical responses. The performance of the model was evaluated after each epoch on a held-out validation set derived from the COVID-19 and TripClick datasets.
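The byte-pair encoding step above can be illustrated with a minimal merge-learning loop. This is the textbook algorithm, not Llama2's actual tokenizer; the toy word list is illustrative.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent
    symbol pair, so frequent subword units emerge from character sequences."""
    # Start from characters, with an end-of-word marker
    vocab = {tuple(w) + ("</w>",): c for w, c in Counter(words).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge everywhere in the vocabulary
        new_vocab = {}
        for symbols, count in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = count
        vocab = new_vocab
    return merges

# Frequent fragments of a medical term are fused first
print(learn_bpe(["enzyme", "enzymes", "enzymes"], 3))
```

Applied to a large biomedical corpus, the same procedure yields subword units that decompose rare terminology (e.g., "angiotensin-converting") into reusable pieces.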
The fine-tuning stage was designed to align the model with the key tasks that are relevant to medical knowledge retrieval and QA, including question classification, question reconstruction, abstract summarization, and literature-based QA. These tasks are described in Section 3.2 and illustrated in
Fig. 2. Examples of the utilized fine-tuning prompts are provided in
Table 2. For the “question classification” and “question reconstruction” components, we initially targeted medical researchers, capturing practical retrieval queries. Restrictions such as publication date, author, and institution constraints were removed to reconstruct the retrieval questions. The “abstract summarization” segment drew inspiration from prior research
[49],
[50], where the literature citing each original article served as its summary, followed by a manual review. The “literature-based QA” dataset included QA pairs derived from manual abstract readings and the PubMedQA methodology
[33]. During fine-tuning, we applied low-rank adaptation (LORA), which froze the base LLM parameters while updating a low-rank matrix
[51]. After experimenting with various rank sizes, we selected a rank of
to balance computational efficiency with fine-tuning precision, applying a scaling factor of
to adjust the influence of the low-rank matrices. This approach preserved the general language capabilities of the base LLM while efficiently adapting it to domain-specific tasks.
The dataset was divided into training, validation, and testing sets with 80%, 10%, and 10% splits, respectively, as shown in
Table 1,
Table 3. A temperature setting of 0.15 was chosen to improve the reliability of the model outputs, particularly for critical medical QA tasks, as a lower temperature reduces the randomness exhibited by the output responses. We further enhanced the generation process with top
P (nucleus) sampling, which was set to a cumulative probability of
, thereby balancing comprehensiveness and hallucination minimization.
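Nucleus (top-P) sampling keeps the smallest set of highest-probability tokens whose cumulative mass reaches the threshold, renormalizes, and samples only within that set. A minimal sketch, with an illustrative next-token distribution:

```python
import random

def nucleus_sample(probs, top_p, rng=random.random):
    """Sample a token id from the smallest high-probability set whose
    cumulative probability reaches top_p (the 'nucleus')."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    r = rng() * total                 # renormalize by sampling within the kept mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

probs = [0.6, 0.3, 0.05, 0.05]        # illustrative next-token distribution
print(nucleus_sample(probs, top_p=0.8))
```

With top_p = 0.8 here, only the two most probable tokens survive; the low-probability tail, where hallucinated continuations tend to live, is never sampled.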
The weighted adaptive moment estimation (AdamW) optimizer was used with a weight decay rate of 0.01 and beta coefficients of
. The learning rate schedule included a 1500-step warmup phase, after which the rate decayed to 10% of its maximum value. The learning rate was set at
during pretraining and
for fine-tuning, with batch sizes of 64 and 32, respectively. To prevent overfitting, early stopping was implemented, halting the training process after five epochs without validation loss improvements. Training was conducted on a high-performance cluster with six NVIDIA A100 graphics processing units (GPUs), with the longest duration extending up to 1440 hours.
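The learning rate schedule (a 1500-step warmup followed by decay to 10% of the maximum) can be sketched as below. The text does not state the decay shape, so the cosine form and the total step count here are assumptions.

```python
import math

def lr_schedule(step, max_lr, warmup_steps=1500, total_steps=10000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_ratio * max_lr.
    The warmup length and the 10% floor follow the text; the cosine shape
    and total_steps are illustrative assumptions."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return max_lr * (min_ratio + (1.0 - min_ratio) * cosine)

print(lr_schedule(750, 1.0))     # mid-warmup
print(lr_schedule(1500, 1.0))    # peak learning rate
print(lr_schedule(10000, 1.0))   # decayed to the 10% floor
```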
Fig. 3 illustrates the loss and perplexity trends observed over 1000 fine-tuning iterations.
4. Evaluation and discussion
By deconstructing practical scenarios, we evaluated the performance of the proposed framework from three main perspectives in this section, namely, literature retrieval, abstract summarization, and literature-based QA.
4.1. Comparison models
To assess the generative performance of the proposed framework, we compared it with several semantic baselines, including BERT
[52], BioBERT
[53], and BioClinicalBERT
[54], as well as more recent medical LLMs such as BioMedLM
[4], Meditron-7B
[55], and ChatDoctor
[56], to provide a comprehensive evaluation.
BERT, which is a bidirectional transformer model with 110 million parameters, captures long-range dependencies by considering both the left and right contexts within a sentence. BioBERT and BioClinicalBERT are variants of BERT that were fine-tuned specifically on medical literature and clinical case data, respectively, to improve their performance in medical NLP tasks.
In contrast, BioMedLM is an LLM that was trained exclusively on biomedical abstracts and papers, utilizing a standard transformer stacking architecture with a context window of 1024 and a hidden size of 2560, yielding robust results across a wide range of biomedical NLP applications. Meditron-7B, which is a more generalized medical LLM built on Llama-2, was trained with NVIDIA’s Megatron-LM distributed trainer. This extensive training process equipped Meditron-7B to effectively handle a variety of medical reasoning tasks. ChatDoctor, on the other hand, was designed specifically for doctor–patient dialogs. Built upon the Llama architecture, ChatDoctor was fine-tuned on large-scale doctor–patient interaction datasets, including over 100 000 dialogs. It incorporates real-time knowledge retrieved from curated offline databases and external sources, such as Wikipedia, enabling it to effectively handle real-world clinical queries. All hyperparameters for these comparison models were fine-tuned through grid searches on the relevant datasets to ensure that each model attained optimal performance on the collected dataset.
4.2. Literature retrieval
As discussed in Section 3.2, vector databases offer promising solutions for effectively implementing semantic retrieval, with vector embeddings playing a critical role in maximizing retrieval performance. In this section, we evaluated the impact of embeddings across various literature retrieval models. For the COVID-19 dataset, we used two types of gold standards: article categories and manual feedback. The gold standard for the TripClick dataset was based on click log entries. All the selected models were evaluated in terms of metrics, including their normalized discounted cumulative gain (NDCG), recall, and mean reciprocal rank (MRR) values.
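The NDCG@k and MRR metrics used throughout this section follow their standard definitions, sketched below; the example relevance lists are illustrative.

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k over graded relevances listed in ranked order: discounted
    cumulative gain divided by the gain of the ideal (sorted) ranking."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mrr(rankings):
    """Mean reciprocal rank over binary relevance rankings: average of
    1/(rank of the first relevant result) across queries."""
    total = 0.0
    for ranking in rankings:
        for i, rel in enumerate(ranking):
            if rel:
                total += 1.0 / (i + 1)
                break
    return total / len(rankings)

print(ndcg_at_k([3, 2, 1], 3))          # a perfectly ordered list scores 1.0
print(mrr([[0, 1, 0], [1, 0, 0]]))
```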
When assessing the retrieval performance achieved on the COVID-19 dataset using article categories as the gold standard, ERQA-7B performed comparably to Meditron, with both models significantly outperforming the ChatDoctor and BERT-based models. As shown in
Fig. 4, ERQA-7B achieved an NDCG@10 score of 0.897, which was slightly lower than the 0.899 achieved by Meditron but notably higher than the 0.893 and 0.885 achieved by ChatDoctor and BioMedLM, respectively. A similar trend was observed in terms of Recall@10, with ERQA-7B scoring 0.906, finishing marginally below the 0.907 achieved by Meditron yet outperforming ChatDoctor at 0.902 and BioMedLM at 0.894. These results highlight the retrieval accuracy of ERQA-7B and Meditron on the COVID-19 dataset, both of which have clear advantages over the BERT-based models. Additionally, we evaluated the retrieval performance achieved with human feedback as the ground truth for this dataset.
Fig. 5(a) shows that ERQA-7B again outperformed ChatDoctor and BioMedLM, achieving an NDCG@10 score of 0.264 and a recall@10 score of 0.289, whereas the NDCG@10 score and recall@10 score of ChatDoctor were 0.257 and 0.279, respectively. Although Meditron slightly underperformed compared with ERQA-7B, with an NDCG@10 score of 0.261 and a recall@10 score of 0.287, both models exhibited similar overall performance. In contrast, the BERT-based models, such as BioClinicalBERT, yielded significantly weaker results, with an NDCG@10 score of 0.221 and a recall@10 score of 0.237, underscoring the limitations of smaller-scale models when human feedback is the evaluation criterion.
On the TripClick dataset (
Fig. 5(b)), ERQA-7B demonstrated strong performance, closely matching Meditron and outperforming ChatDoctor. ERQA-7B achieved an NDCG@10 score of 0.337 and a recall@10 score of 0.279, which were nearly on par with those of Meditron, which produced NDCG@10 and recall@10 scores of 0.341 and 0.276, respectively, and ahead of those yielded by ChatDoctor, which achieved an NDCG@10 score of 0.332 and a recall@10 score of 0.272. ERQA-13B, however, demonstrated the most substantial improvement on the TripClick dataset, achieving an NDCG@20 score of 0.428 and a recall@50 score of 0.391, which were well above those of the next-best model, ERQA-7B, which achieved an NDCG@20 of 0.348 and a recall@50 of 0.376. This marked improvement highlights the efficacy of larger-scale models and sophisticated fine-tuning, particularly for handling diverse and noisy real-world retrieval data.
4.3. Abstract summarization
Abstract summarization, as a specialized subset of text summarization, demands high content generation accuracy, especially within scientific and medical research contexts. To evaluate the performance of our model and the baselines, we employed widely used recall-oriented understudy for gisting evaluation (ROUGE) metrics: ROUGE-1, ROUGE-2, and ROUGE-L. These metrics measure the overlap between generated summaries and reference summaries, with ROUGE-1 capturing unigram overlap, ROUGE-2 assessing bigram overlap, and ROUGE-L focusing on the longest common subsequence
[35].
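The ROUGE-N scores used here can be sketched as a recall-oriented n-gram overlap; full ROUGE implementations also report precision and F-measure, and ROUGE-L uses the longest common subsequence rather than fixed n-grams.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented ROUGE-N: clipped n-gram matches divided by the
    number of n-grams in the reference summary."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

print(rouge_n("the cat sat", "the cat sat on the mat", n=1))  # ROUGE-1
print(rouge_n("the cat sat", "the cat sat on the mat", n=2))  # ROUGE-2
```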
The results obtained on the COVID-19 dataset, shown in
Fig. 6(a), indicate that the LLMs generally outperformed the traditional BERT-based models. Notably, ERQA-13B achieved the highest scores across all the metrics, with a ROUGE-1 score of 0.434, followed closely by ERQA-7B, which produced a score of 0.420. Compared with the traditional models, ERQA-13B yielded improvements of 28.4% over BERT, 33.95% over BioBERT, and 19.89% over BioClinicalBERT. BioMedLM, Meditron, and ChatDoctor achieved ROUGE-1 scores of 0.409, 0.413, and 0.411, respectively, although they did not surpass the ERQA models in terms of performance. In terms of ROUGE-2, ERQA-13B maintained the lead with a score of 0.203, representing a 16.67% improvement over BioMedLM. ERQA-7B and Meditron also performed well, with scores of 0.184 and 0.181, respectively. With respect to ROUGE-L, which assesses the longest common subsequence, ERQA-13B again led with a score of 0.345, followed by ERQA-7B and Meditron with scores of 0.329 and 0.320, respectively.
On the TripClick dataset, as shown in
Fig. 6(b), similar trends emerged. ERQA-13B led across all the metrics, achieving a ROUGE-1 score of 0.421. Meditron and ChatDoctor performed competitively, with scores of 0.400 and 0.387, respectively, whereas ERQA-7B reached a score of 0.403. In terms of ROUGE-2, ERQA-7B and ERQA-13B achieved scores of 0.294 and 0.303, respectively, surpassing Meditron, which had a score of 0.286, and ChatDoctor, which had a score of 0.275. For the ROUGE-L metric, ERQA-13B achieved a score of 0.367, which was the highest among all the models, followed by ERQA-7B with 0.331, Meditron with 0.327, and ChatDoctor with 0.316.
Table 4 presents relevant examples of the proposed ERQA model in the summarization task. The ERQA models provided more accurate and contextually relevant summaries, further demonstrating the advantages of the proposed framework in medical knowledge extraction scenarios.
4.4. Literature-based QA
The recently proposed biomedical LLM, Med-PaLM 2, which was trained on the MedQA and MedMCQA datasets, has achieved scores comparable to those of professional doctors on the United States Medical Licensing Examination (USMLE). These datasets, which are structured in a multiple-choice format, allow the performance of models to be assessed primarily in terms of accuracy. However, unlike Med-PaLM 2, the ERQA model is tailored for medical knowledge retrieval and QA tasks, which primarily involve context-based reading comprehension, where traditional accuracy metrics are ill-suited. Instead, we used the bilingual evaluation understudy (BLEU) metric
[57], which evaluates fluency and contextual relevance and is essential for assessing the quality of generated responses in context-based comprehension tasks.
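A simplified sentence-level BLEU sketch is given below: the geometric mean of clipped n-gram precisions multiplied by a brevity penalty. Production evaluations typically use corpus-level BLEU with smoothing; this unsmoothed version is for illustration only.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cg = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        rg = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, rg[g]) for g, c in cg.items())
        total = max(1, sum(cg.values()))
        if clipped == 0:
            return 0.0          # any empty n-gram match zeroes the unsmoothed score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty discourages overly short candidates
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))
```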
As shown in
Table 5, the LLM-based models substantially outperformed the BERT-based models on both the COVID-19 and TripClick datasets. Specifically, the BERT-based models (BERT, BioBERT, and BioClinicalBERT) exhibited relatively low BLEU scores, indicating limitations in terms of generating high-quality, context-relevant responses for QA tasks.
In contrast, the LLM-based models demonstrated markedly higher performance. For example, BioMedLM achieved BLEU-1 and BLEU-4 scores of 5.843 and 0.672 on the COVID-19 dataset and scores of 5.703 and 0.417 on the TripClick dataset. Meditron and ChatDoctor further highlighted the superiority of LLM-based models, with Meditron achieving BLEU-1 and BLEU-4 scores of 6.278 and 0.725 on the COVID-19 dataset and scores of 6.004 and 0.432 on the TripClick dataset, respectively. The proposed ERQA models, which were specifically fine-tuned for medical knowledge retrieval and QA tasks, outperformed these baselines, with ERQA-7B achieving BLEU-1 and BLEU-4 scores of 6.467 and 0.722 on the COVID-19 dataset and 6.284 and 0.447 on TripClick. The larger ERQA-13B model further enhanced these results.
To deepen our understanding of the QA capabilities of LLMs, we conducted a human evaluation of model-generated answers, using coherence, consistency, and satisfaction as scoring criteria (inspired by prior research [34]). Each metric was scored on a 0–100 scale, with scores segmented into four levels: 1–25 (poor), 26–50 (fair), 51–75 (good), and 76–100 (excellent). This structure provides nuanced insight into response quality across the different dimensions.
• Coherence assesses the logical flow of a response. A high coherence score (76–100) indicates that the generated sentences are conceptually accurate and logically sound, strongly supporting the associated argument, whereas a low score (1–25) reflects fragmented or difficult-to-follow responses.
• Consistency measures alignment with the source material, ensuring that responses accurately reflect the source without hallucinations or errors. High scores indicate strong source fidelity, whereas low scores reflect deviations or inaccuracies.
• Satisfaction gauges how well a response satisfies the user’s information needs, assessing completeness and informativeness. High satisfaction scores signify thorough, relevant answers, whereas low scores indicate unmet expectations.
During the human evaluation, three expert reviewers independently assessed the QA pairs generated by the ERQA model, scoring each response for coherence, consistency, and satisfaction on a 0–100 scale. To ensure reliable evaluations, Krippendorff’s alpha was calculated for each QA pair to assess interrater reliability on each of the three metrics. A threshold of 0.75 was set for Krippendorff’s alpha, as this is widely regarded as the minimum acceptable level for good interrater reliability. If the alpha value for a QA pair met or exceeded this threshold, the reviewers’ scores were considered consistent, and the average of the three reviewers’ scores was used as the final score for that QA pair. If Krippendorff’s alpha for a QA pair fell below 0.75, however, the QA pair was flagged for re-evaluation: the reviewers reassessed the response and discussed any discrepancies, repeating the process until the alpha value reached the threshold. Once all the QA pairs had been evaluated and the discrepancies resolved, the final scores for coherence, consistency, and satisfaction were calculated as averages across all the QA pairs, ensuring reliable evaluations across the dataset.
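The agreement computation behind this procedure can be sketched as follows. This is a minimal interval-scale formulation of Krippendorff's alpha for complete data (every rater scores every item); treating the 0–100 ratings as interval data is our assumption, as the study does not state the metric type.

```python
from itertools import combinations

def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for interval data with no missing ratings.

    `ratings` is a list of units (e.g., QA pairs), each a list of scores,
    one per rater. alpha = 1 - D_o / D_e, where D_o is the mean squared
    difference between rater pairs within the same unit and D_e is the
    mean squared difference over all pairs of pooled scores. For complete
    data this pairwise form matches the coincidence-matrix definition.
    """
    # Observed disagreement: pairwise squared differences within each unit.
    within = []
    for unit in ratings:
        within.extend((a - b) ** 2 for a, b in combinations(unit, 2))
    d_o = sum(within) / len(within)

    # Expected disagreement: pairwise squared differences over pooled scores.
    pooled = [score for unit in ratings for score in unit]
    between = [(a - b) ** 2 for a, b in combinations(pooled, 2)]
    d_e = sum(between) / len(between)
    return 1.0 if d_e == 0 else 1.0 - d_o / d_e
```

With this helper, the acceptance rule is simply `krippendorff_alpha_interval(scores) >= 0.75`; values below the threshold would trigger the re-evaluation loop described above.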
As depicted in Fig. 7, increasing the model size from 7 billion to 13 billion parameters improved the performance of ERQA with respect to coherence, consistency, and satisfaction. The larger model particularly excelled in terms of user satisfaction, demonstrating its enhanced ability to deliver accurate, relevant information from the literature. Nevertheless, the performance gains did not fully justify the additional computational demands, suggesting that ERQA-7B may serve as a more practical option for medical knowledge retrieval and QA tasks.
To assess the ability of ERQA to understand medical texts, extract relevant information, and produce contextually accurate answers, we used a diverse set of abstracts acquired from the COVID-19 literature as input queries. In some instances, the model produced responses with logical structures and clarity but introduced minor inconsistencies with the source material, likely due to the inherent hallucinations of LLMs. For example, in one response concerning COVID-19 immunity, the generated text misstated the duration of antibody protection, which lowered its consistency score. Occasionally, the responses met the information needs of users but lacked logical flow, impacting the resulting coherence scores, particularly when addressing complex medical concepts. Table 6 highlights cases in which ERQA struggled with rigorous reasoning requirements, suggesting avenues for future refinement of the model.
Medical knowledge retrieval and QA often involve multiple rounds of context-dependent interaction. While ERQA can process extended context inputs, its maximum context length is limited primarily by the underlying model architecture. For example, the Llama-7B model used in our study supports a context window of up to 4096 tokens, which is sufficient for most medical queries. However, the length of the retrievable context is also influenced by the preprocessed abstracts and key literature sections stored in vector databases. This structure enables ERQA to deliver detailed responses under inherent token constraints, ensuring comprehensive answers within the context limits of the model.
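The interplay between the retrieved literature sections and the fixed context window can be sketched as a simple token-budgeting step. The helper below is illustrative only: it approximates token counts by whitespace splitting (a real deployment would use the model's own tokenizer), and the `reserve` parameter and greedy packing strategy are our assumptions, not details from the ERQA implementation.

```python
def pack_context(chunks, question, max_tokens=4096, reserve=512):
    """Greedily pack retrieved literature chunks into the context window.

    `chunks` are assumed to be ordered by retrieval relevance. Token
    counts are approximated by whitespace splitting; `reserve` holds back
    budget for the prompt template and the generated answer.
    """
    budget = max_tokens - reserve - len(question.split())
    selected = []
    for chunk in chunks:
        cost = len(chunk.split())
        if cost > budget:
            break  # stop at the first chunk that no longer fits
        selected.append(chunk)
        budget -= cost
    return "\n\n".join(selected)
```

Stopping at the first oversized chunk (rather than skipping it) keeps only the contiguous top-ranked evidence, a conservative choice that avoids mixing in lower-ranked passages.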
4.5. Ablation study
We conducted ablation studies on the COVID-19 and TripClick datasets to evaluate the contributions of the individual components of the ERQA framework. The original Llama2 model served as the baseline, whereas Llama2 w/VD refers to the Llama2 model enhanced with a literature vector database for QA tasks. ERQA represents the complete model with vector database support, and ERQA w/o VD denotes the ERQA model without the vector database.
For article retrieval, which relies on an embedding-assisted vector database, we compared the performance of Llama2 w/VD with that of ERQA. As shown in Table 7, ERQA achieved notable improvements over Llama2 w/VD, demonstrating the impact of the fine-tuning process of ERQA on retrieval tasks. In the abstract summarization task, we focused on Llama2 and ERQA w/o VD to evaluate whether the models could effectively distill the key topics of the given articles. The results in Fig. 8 indicate that ERQA w/o VD achieved a substantial ROUGE score improvement. Incremental pretraining and fine-tuning enabled the proposed model to outperform the original baseline in terms of medical knowledge comprehension and task compliance, respectively.
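For reference, ROUGE-N for the summarization comparison reduces to clipped n-gram overlap between a generated summary and a reference. The sketch below is a minimal pure-Python illustration; reported scores would in practice come from a standard ROUGE toolkit, which additionally applies stemming and bootstrap resampling.

```python
from collections import Counter

def rouge_n(reference, hypothesis, n=1):
    """ROUGE-N as clipped n-gram overlap: recall, precision, and F1.

    `reference` and `hypothesis` are token lists. ROUGE is
    recall-oriented, but precision and F1 are returned as well,
    as toolkits typically report all three.
    """
    def grams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    ref, hyp = grams(reference), grams(hypothesis)
    overlap = sum((ref & hyp).values())  # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(hyp.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}
```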
Compared with the downstream tasks of article retrieval and abstract summarization, literature-based QA is the most direct manifestation of LLM-driven medical knowledge mining. Although both the fine-tuning process and the vector database were tailored to support the proposed ERQA framework in completing the QA task, the enhanced LLM (i.e., ERQA w/o VD) and Llama2 w/VD were still capable of handling QA tasks, albeit with decreased performance. In terms of the BLEU metric, the performance of ERQA w/o VD was similar to that of Llama2 w/VD but fell short of that of the full ERQA model, as shown in Table 8. For example, when faced with the question (PMID: 37069722) shown in Table 6, ERQA w/o VD generated the following response: “Despite the administration of vaccines, several challenges continue to hinder the control of the COVID-19 pandemic. Some of these challenges include vaccine hesitancy, limited access to vaccines in resource-poor areas, difficulty in reaching vulnerable populations, and inadequate public health infrastructure and resources.” This response overlooks important content, such as “viral mutations,” and introduces a hallucination by mentioning “vaccine hesitancy,” which does not align with the question. Notably, the complete ERQA framework outperformed all other versions across the different BLEU scores on both datasets. The ablation comparison indicates that the combined effect of fine-tuning and vector database integration significantly enhances QA performance.
4.6. Limitations
When evaluating the performance of the ERQA model across literature retrieval, abstract summarization, and literature-based QA tasks, we observed the exceptional advantages that LLMs offer over traditional language models in terms of comprehending and leveraging medical knowledge. This insight opens new avenues for advancing literature-based medical knowledge mining and underscores the transformative potential of LLMs in this field. However, several practical challenges remain with respect to LLM-driven medical knowledge retrieval and QA systems.
One possible source of error in these tasks is the difficulty of retrieving articles that adequately satisfy user requirements. Developing an auxiliary retrieval strategy tailored to LLMs is a complex challenge. The key considerations include the granularity of text segmentation and the selection of layers for generating embeddings within the vector database. Given the complexity and diverse contents of medical literature, choosing the optimal text segmentation granularity is essential for precise retrieval. If segmentation is too coarse, retrieved chunks may carry excess, irrelevant information and overload the context; if it is too fine, critical information may be split apart and its surrounding context diluted. Additionally, since LLMs comprise multiple hidden layers that capture information at varying levels of abstraction, selecting the appropriate layer for embedding generation significantly affects retrieval quality. While vector-based retrieval strategies enhance semantic relevance, metadata-based matching often yields greater precision and interpretability. Balancing vector embeddings with metadata features requires careful design to maximize the accuracy and effectiveness of retrieval.
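The balance between vector embeddings and metadata features described above can be sketched as a weighted hybrid score. Everything here is illustrative: the `weight` value, the toy cosine similarity over plain lists, and the metadata match ratio are our assumptions rather than details of the ERQA retrieval pipeline.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 0.0 if norm == 0 else dot / norm

def hybrid_score(query_vec, doc_vec, query_terms, doc_metadata, weight=0.7):
    """Blend dense-vector similarity with a metadata match ratio.

    `weight` trades off semantic relevance (embeddings) against exact
    metadata matching (e.g., indexed keywords or publication type); its
    value here is purely illustrative.
    """
    semantic = cosine(query_vec, doc_vec)
    matched = sum(1 for term in query_terms if term in doc_metadata)
    lexical = matched / max(len(query_terms), 1)
    return weight * semantic + (1 - weight) * lexical
```

Because the metadata term carries part of the score, two documents with identical embeddings are still ranked apart when only one matches the query's metadata, which is the interpretability benefit noted above.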
In literature-based QA performance evaluations, ERQA achieved notably higher coherence scores than consistency and satisfaction scores. This gap was due primarily to the phenomenon of hallucination in LLMs, whereby model-generated responses diverge from reality owing to data biases or inconsistencies encountered during training. Such inaccuracies or contradictions in generated answers can compromise the reliability and practical utility of the system. In future work, incorporating external knowledge sources and verification mechanisms to validate and refine model-generated responses could mitigate the hallucination problem and improve response accuracy.
5. Conclusions
In this study, we proposed an LLM-driven framework for medical knowledge retrieval and QA tasks that integrates three key components into a cohesive workflow. Multiple scenarios were chosen to evaluate the proposed framework, including literature retrieval, abstract summarization, and literature-based QA. Both the qualitative and quantitative results of this study highlight the promise of the developed approach in terms of advancing biomedical knowledge discovery. Moving forward, we plan to incorporate larger-scale biomedical literature datasets and adopt additional evaluation metrics to further enhance the performance of the model.
CRediT authorship contribution statement
Yuyang Liu: Writing – review & editing, Writing – original draft, Validation, Methodology, Data curation, Conceptualization. Xiaoying Li: Writing – original draft, Methodology, Conceptualization. Yan Luo: Investigation, Formal analysis, Data curation. Jinhua Du: Software, Methodology. Ying Zhang: Formal analysis, Data curation. Tingyu Lv: Formal analysis, Data curation. Hao Yin: Writing – review & editing, Supervision. Xiaoli Tang: Writing – review & editing, Supervision, Funding acquisition. Hui Liu: Writing – review & editing, Supervision, Funding acquisition.
Acknowledgments
This work was supported by the Innovation Fund for Medical Sciences of the Chinese Academy of Medical Sciences (2021-I2M-1-033) and the National Key Research and Development Program of China (2022YFF0711900).
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.