Toward a Large Language Model-Driven Medical Knowledge Retrieval and QA System: Framework Design and Evaluation

Yuyang Liu , Xiaoying Li , Yan Luo , Jinhua Du , Ying Zhang , Tingyu Lv , Hao Yin , Xiaoli Tang , Hui Liu

Engineering ›› 2025, Vol. 50 ›› Issue (7): 270–282. DOI: 10.1016/j.eng.2025.02.010

Research Article

Abstract

Recent advancements in large language models (LLMs) have driven remarkable progress in text processing, opening new avenues for medical knowledge discovery. In this study, we present ERQA, a mEdical knowledge Retrieval and Question-Answering framework powered by an enhanced LLM that integrates a semantic vector database and a curated literature repository. The ERQA framework leverages domain-specific incremental pretraining and conducts supervised fine-tuning on medical literature, enabling retrieval and question-answering (QA) tasks to be completed with high precision. Performance evaluations implemented on the coronavirus disease 2019 (COVID-19) and TripClick datasets demonstrate the robust capabilities of ERQA across multiple tasks. On the COVID-19 dataset, ERQA-13B achieves state-of-the-art retrieval metrics, with a normalized discounted cumulative gain at top 10 (NDCG@10) of 0.297, a recall at top 10 (Recall@10) of 0.347, and a mean reciprocal rank (MRR) of 0.370; it also attains strong abstract summarization performance, with a recall-oriented understudy for gisting evaluation (ROUGE)-1 score of 0.434, and QA performance, with a bilingual evaluation understudy (BLEU)-1 score of 7.851. The comparable performance achieved on the TripClick dataset further underscores the adaptability of ERQA across diverse medical topics. These findings suggest that ERQA represents a significant step toward efficient biomedical knowledge retrieval and QA.

Graphical abstract

Keywords

Large language models / Medical knowledge / Information retrieval / Vector database

Cite this article

Yuyang Liu, Xiaoying Li, Yan Luo, Jinhua Du, Ying Zhang, Tingyu Lv, Hao Yin, Xiaoli Tang, Hui Liu. Toward a Large Language Model-Driven Medical Knowledge Retrieval and QA System: Framework Design and Evaluation. Engineering, 2025, 50(7): 270–282. DOI: 10.1016/j.eng.2025.02.010


1. Introduction

The release of chat generative pre-trained transformer (ChatGPT) in November 2022, followed by GPT-4 in March 2023, highlighted the use of large language models (LLMs) as powerful tools across a broad range of applications [1], [2]. When trained on vast corpora containing billions of tokens, LLMs exhibit impressive text generation and interpretation capabilities, which can parallel human-level understanding [3]. LLMs have not only revolutionized creative and technical writing tasks for the public but also achieved state-of-the-art performance in diverse scientific fields. A keyword search for “large language models” or “ChatGPT” in the Web of Science returned 105 722 articles by the end of November 2023, covering major topics such as engineering, computer science, and medicine.

LLMs excel as statistical models, leveraging conditional probabilities to predict word sequences. They have consistently set new benchmarks in natural language processing (NLP) tasks, catalyzing the development of specialized biomedical LLMs. Notable models, including BioMedLM [4], BioGPT [5], and PMC-Llama [6], have been trained or fine-tuned on domain-specific datasets, such as PubMed citations and full texts, to increase their utility in biomedical applications. For example, BioGPT leverages a vast corpus of medical literature to empower researchers and healthcare professionals with tools that facilitate insight extraction and decision-making tasks. These models underscore the leading-edge advancements that LLMs bring to biomedical research, enhancing the efficiency and precision of data analyses while opening new pathways for innovations in information extraction [7], healthcare delivery, and medical education [8].

Despite their impressive benefits, LLMs sometimes suffer from hallucination and confabulation issues, which could be particularly problematic in applications requiring high accuracy and reliability, such as medical advice provision. Inaccuracies and biased responses from biomedical LLMs may cause delays before providing optimal treatments, bring psychological or physical harm, or even endanger lives. Therefore, ensuring that the responses obtained from biomedical LLMs are rigorously validated and presented with adequate transparency is crucial [9]. Strategies for mitigating these issues include improving the quality of training data and providing sufficient evidence for model inferences.

The outstanding performance of LLMs in NLP tasks positions them as promising engines for medical knowledge retrieval and question-answering (QA) systems. In this study, we aim to construct a novel medical knowledge retrieval and QA framework that leverages an external database and provides comprehensive evaluations from both qualitative and quantitative perspectives.

2. Related work

LLMs hold significant promise across a wide array of biomedical and healthcare applications, spanning clinical decision-making, medical education, and beyond. Recent studies have continued to expand their capabilities, including applications such as data-augmented LLMs that are used to infer cancer responses from radiology reports [10]. Given the focus of this study on medical knowledge retrieval and QA, the NLP research that is pertinent to these domains is primarily reviewed in the following section.

2.1. Knowledge extraction

Knowledge extraction is a core task in NLP that transforms unstructured text into structured knowledge; it primarily consists of two subtasks: named entity recognition (NER) and relation extraction (RE). While the early approaches relied heavily on handcrafted features and rule-based systems, advances in deep learning and transformer models have driven substantial improvements.

Classic knowledge extraction systems, such as the Mayo clinical text analysis and knowledge extraction system (cTAKES) [11], have been foundational in medical information extraction work. This open-source framework offers a modular architecture for analyzing clinical text, integrating machine learning with rule-based methods to support extraction tasks. Similarly, knowledge guided distance supervision (KGDS) [12] enhances the ability to extract relations from electronic medical records by introducing biomedical knowledge, which is particularly valuable when entities in text do not align with standard knowledge bases.

Recent progress in knowledge extraction has come from encoding medical knowledge directly into pretrained language models. For example, Roy and Pan [13] integrated Unified Medical Language System (UMLS) concepts into bidirectional encoder representations from transformers (BERT) embeddings, improving the ability of their model to understand and extract complex medical relationships. Semantic repository (SemRep), an RE tool that was developed by the US National Library of Medicine (NLM), also leverages UMLS-based rules to capture relations in biomedical texts [14]. However, a stringent evaluation by the NLM revealed that SemRep achieved a precision of only 0.55 and an F1 score of 0.42, with biomedical entity recognition and normalization contributing significantly to the error rate of 0.27 [15]. Additionally, specialized frameworks such as knowledge-enhanced medical relation extraction (KeMRE) [16] have been created to handle Chinese medical texts, enhancing the BERT-based RE process through knowledge embeddings derived from clinical guidelines.

LLMs have been extensively evaluated in biomedical NER and RE tasks via benchmark datasets. For example, GPT-3 and GPT-4 achieved F1 scores of 0.73 and 0.82, respectively, on the BC5CDR-chemical dataset, which is a benchmark corpus for chemical–disease RE. Despite the power of LLMs, fine-tuning them on domain-specific knowledge remains essential to ensure contextually accurate information extraction processes. BioGPT exemplifies this approach, achieving high performance in chemical–disease, drug–target, and drug–drug RE tasks.

2.2. Information retrieval

Traditional information retrieval (IR) models, such as Okapi BM25 [17], were foundational with respect to ranking documents based on term frequency-inverse document frequency (TF-IDF), setting the stage for precision-focused retrieval tasks. These early models remain the backbone of many IR systems that are still in use today.

Specialized medical IR systems have since evolved, with notable examples including MedSearch [18], which has enhanced traditional web search capabilities by accommodating long medical queries. The query reformulation techniques and diversified search results of MedSearch proved particularly beneficial for users who were unfamiliar with complex medical terminology. Similarly, PubMed [19], which employs Medical Subject Headings (MeSH), has become the gold standard for biomedical literature retrieval, helping medical professionals access the latest advancements that are essential for evidence-based practice [20]. Recent advancements in medical IR have focused on integrating artificial intelligence to better handle complex queries and synonyms [21], [22]. For example, a novel NLP-based keyword enhancement and screening method was introduced to assist scientists in optimizing keywords [23]. This method uses prior knowledge to extract meaningful candidate keywords from initial search titles and abstracts and has exhibited efficacy in studies on atrial fibrillation topics. Additionally, Jin et al. [24] developed a plug-in biomedical literature search module that enhances dense retrieval models with click logs, providing improved retrieval results on the basis of related queries.

While the traditional methods excel at performing structured searches, they often fall short in terms of addressing the contextual and semantic intricacies of complex medical queries. The introduction of LLMs has brought a new dimension to medical IR, enabling enhanced user experiences through sophisticated contextual understanding. By combining the robustness of classic IR models with the advanced semantic capabilities of LLMs, novel IR systems are becoming increasingly effective and intuitive for use in medical IR scenarios.

2.3. Question answering

Neural sequence-to-sequence models [25] have significantly advanced the ability to generate context-aware responses. Traditionally, QA systems relying on retrieval have used predefined templates [26] or search algorithms [27] to extract answers from structured databases or unstructured text corpora. Recent work has shown that integrating knowledge graphs into neural dialog systems can substantially improve semantic understanding and the response accuracy achieved in scenarios involving medical dialogs [28]. Similarly, models such as commonsense knowledge-aware dialogue generation model (ConKADI) [29] and channel-aware knowledge fusing network (CAKF) [30] leverage external medical knowledge to enable reasoning processes with personalized logic, enhancing the quality of medical QA.

The development of neural transformer architectures has further transformed the medical QA field [31], [32]. Numerous biomedical QA datasets, including MedQA (based on the US medical licensing exam) and PubMedQA (based on PubMed citation abstracts) [33], have been curated to rigorously evaluate these models. These datasets have been instrumental in the advancement of QA-oriented LLMs. Notably, GPT-4 and medical pathways language model (Med-PaLM) 2 [34] have demonstrated high performance on MedQA, with accuracy scores of 86.1 and 86.5, respectively, compared with an average of 87.0 for human experts. BioMedLM, BioGPT, and Med-PaLM 2 have also yielded strong results on PubMedQA, with scores of 74.4, 81.0, and 81.8, respectively. Although these LLMs perform comparably to humans on standard QA datasets, they require comprehensive evaluations for effectively addressing practical biomedical inquiries.

Text summarization, which can be regarded as a subset of QA tasks, was pioneered in the 1950s and includes two primary approaches: extractive and abstractive summarization [35]. Extractive summarization identifies key sentences within a document via statistical methods such as TF-IDF for word weighting [36] or graph-based algorithms for sentence importance ranking [37]. Abstractive summarization, on the other hand, interprets the main ideas of a text and rephrases them into a more concise, clear summary [38]. Transformer-based models have recently dominated abstractive methods, with GPT-4 becoming widely used for medical literature reviews by condensing lengthy articles into manageable summaries, making it easier for researchers to access principal findings [39], [40]. In addition, summarization methods can be used for clinical note generation [41] or diagnostic report extraction [42].

3. Materials and methods

3.1. Data collection

The data collection process for this study centered on retrieving coronavirus disease 2019 (COVID-19)-related literature from the PubMed platform as the primary data source. We established precise search criteria to capture relevant literature via keywords such as “novel coronavirus,” “2019-nCoV,” “COVID-19 virus,” “SARS-CoV-2,” and “SARS2.” These terms allowed us to effectively filter out irrelevant content and ensure that the retrieved articles directly pertained to our research objectives. Essential metadata, including titles, authors, abstracts, keywords, MeSH terms, and digital object identifiers (DOIs), were saved for secondary review and model training purposes. The collected articles were organized into seven categories, namely, mechanism, transmission, diagnosis, treatment, prevention, case report, and forecasting work, as shown in Table 1. To maintain high accuracy and reliability, the potential limitations or biases within studies were carefully considered, resulting in a dataset consisting of 426 541 articles.

To further minimize the dataset biases, we utilized the TripClick public dataset [43] as an additional benchmark. TripClick is a large-scale dataset derived from user interactions on the Trip Database health search engine, including approximately 5.2 million user interactions from 2013 to 2020. This dataset, along with an IR evaluation benchmark and accompanying metadata, supported the training processes of deep learning-based IR models.

For the semantic retrieval and fine-tuning steps of the tested LLMs, minimal preprocessing was applied to the articles through two key steps: ① standardizing the text to unicode transformation format-8-bit (UTF-8) encodings and removing any garbled or illegal characters and ② structuring the article content in a hierarchical JavaScript object notation (JSON) format. A thorough review process was subsequently undertaken by two annotators to confirm the accuracy and completeness of the data. Any discrepancies were resolved by a third reviewer. The review process was divided into the following four stages.

• Relevance review: In this stage, annotators independently reviewed each article in terms of its alignment with the research focus, assessing its relevance based on critical keywords and ensuring that the content matched the predefined categories.

• Duplication review: During this review, given the potential for redundant entries (e.g., multiple versions of the same article) in large-scale PubMed retrieval, the annotators ensured the removal of all duplicates.

• Completeness review: In this step, the annotators verified that each article contained the essential metadata (e.g., titles, authors, abstracts, MeSH terms, and DOIs) and performed checks to ensure correct preprocessing (e.g., UTF-8 encoding and the removal of special or illegal characters). Missing data points were addressed by consulting supplementary sources.

• Repeatability review: To evaluate the consistency of the annotation process, a random 10% sample of the reviewed articles was reannotated by the same annotators after a set interval.
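For illustration, the two preprocessing steps described above (UTF-8 standardization with removal of garbled or illegal characters, and hierarchical JSON structuring) can be sketched as follows; the function names and record fields are our assumptions, not part of a released pipeline:

```python
import json
import unicodedata

def clean_text(raw: bytes) -> str:
    """Step 1: decode to UTF-8, dropping undecodable bytes, normalize,
    and strip control characters (newlines are preserved)."""
    text = raw.decode("utf-8", errors="ignore")
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text
                   if ch == "\n" or unicodedata.category(ch)[0] != "C")

def to_record(title: str, abstract: str, doi: str, sections: dict) -> str:
    """Step 2: structure one article hierarchically as a JSON record."""
    record = {
        "title": title,
        "doi": doi,
        "abstract": abstract,
        "sections": sections,  # e.g. {"Methods": "...", "Results": "..."}
    }
    return json.dumps(record, ensure_ascii=False)
```

The cleaned, structured records then feed both the semantic retrieval index and the fine-tuning corpus.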

3.2. Framework design

In this work, we propose an LLM-driven mEdical knowledge Retrieval and QA framework called ERQA. As shown in Fig. 1, ERQA integrates an enhanced LLM, a literature database, and a semantic vector database into a cohesive system. We anticipate that ERQA will advance the traditional retrieval methods, providing a sophisticated knowledge acquisition scheme that is directly linked to the medical literature.

The enhanced LLM is based on Llama2 [44], leveraging a two-step process consisting of incremental pretraining and fine-tuning. Llama2, which was initially trained on diverse, general-purpose text corpora, lacks the nuanced understanding needed for medical literature retrieval and QA. Conducting incremental pretraining on collected biomedical text allows the utilized model to gradually incorporate domain-specific knowledge while preserving its original language generation capabilities [45]. Fine-tuning based on sophisticated prompts further refines the model so that it can handle question classification, question reconstruction, abstract summarization, and literature-based QA tasks by using manually curated prompts to ensure high-quality outputs.

The literature database serves as a comprehensive repository of scholarly works, maintaining both the integrity and accessibility of the original textual content. Each entry operates at the article level, capturing essential metadata such as titles, authors, institutions, abstracts, keywords, and structured text, facilitating the seamless tracking of the original content after applying semantic queries.

The semantic vector database is designed to support semantic-based retrieval, storing text embeddings at the paragraph level [46]. The specified text inputs are processed by the enhanced language model, with the outputs yielded by the final transformer layer used to generate query embeddings. Efficient semantic retrieval is achieved through approximate nearest-neighbor search technology [47], [48]. Utilizing a K-means clustering algorithm, all embeddings undergo preclustering into multiple subregions, with inversion files generated for rapid matching. When a query is received, initial similarity calculations are performed on the subregion center, and this is followed by secondary matching within the subregion, avoiding the computational load imposed by full enumeration. Mapping between these embeddings and unique article identifiers enables a transition from vector-based matching to readable text retrieval.
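A minimal NumPy sketch of the preclustering and two-stage matching scheme described above, standing in for a production vector database (the cluster count, probe count, and function names are illustrative choices, not the deployed configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(x, k, iters=10):
    """Lloyd's algorithm: precluster embeddings into k subregions."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = x[assign == j].mean(axis=0)
    return centers, assign

def build_index(embeddings, k=8):
    centers, assign = kmeans(embeddings, k)
    # Inversion files: subregion id -> paragraph/article ids inside it.
    invlists = {j: np.where(assign == j)[0] for j in range(k)}
    return centers, invlists

def search(query, embeddings, centers, invlists, nprobe=2, topn=5):
    """Stage 1: similarity against subregion centers only; stage 2:
    exact matching inside the nprobe closest subregions, avoiding
    a full enumeration of the database."""
    d2c = ((centers - query) ** 2).sum(-1)
    probe = np.argsort(d2c)[:nprobe]
    cand = np.concatenate([invlists[j] for j in probe])
    d2x = ((embeddings[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(d2x)[:topn]]
```

The returned candidate ids would then be mapped to unique article identifiers to recover readable text from the literature database.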

The integration of these components provides ERQA with an innovative medical knowledge retrieval solution. The workflow, which is illustrated in Fig. 2, begins with a researcher formulating a question, such as “What are the research hotspots regarding how severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) viruses regulate host immune responses after 2021?” ERQA categorizes this question as a literature retrieval query. In scenarios involving specific bibliographic fields (e.g., publication dates, authors, and institutions), the enhanced LLM identifies these constraints (e.g., 2021–2024) and adjusts the question accordingly (e.g., “How do SARS-CoV-2 viruses regulate host immune responses?”); the adjusted question is then processed by the semantic vector database.

The framework retrieves the top N relevant semantic vectors that satisfy the extracted constraints, along with unique identifiers that link these vectors back to the full bibliographic information contained in the literature database. The enhanced LLM then generates summaries from the abstracts and titles of the top N articles, which are presented to the researcher in a list format. If the researcher seeks more detailed information, they may pose a more specific follow-up question (e.g., “How can cross-immune responses to SARS-CoV-2 infections be determined through T cell detection methods?”). This query, which is related to a specific retrieval result (e.g., DOI: 10.1038/s41467-021-21856-3), is transformed into an instruction-based question (e.g., “Based on the information in the article titled ‘...,’ please address the question ‘...’”). This retrieval-augmented generation approach helps mitigate the hallucination tendencies that are common in LLMs.

3.3. Implementation details

The enhanced LLM was developed through incremental pretraining and fine-tuning based on the foundational LLM, that is, Llama2 [44], which includes 32 decoder layers with root mean square layer normalization (RMSNorm) replacing LayerNorm, multihead attention with grouped query attention (GQA), and rotary embedding for positional encoding. Llama2 was trained on a dataset consisting of 2 trillion tokens with a context window of 4096; we selected Llama2-7B with 7 billion parameters and Llama2-13B with 13 billion parameters as the foundational models for ERQA.

In the incremental pretraining stage, byte-pair encoding was applied to tokenize the acquired medical texts. This approach allows complex terminologies (e.g., “angiotensin-converting enzymes”) to be broken into meaningful subword units, enabling the model to efficiently handle medical terms. The pretraining process focused on next-token prediction to imbue the model with domain-specific knowledge via an unsupervised learning approach. Through exposure to biomedical texts, the model learned the relationships and dependencies within the medical literature. For example, recognizing that “ACE inhibitors” are related to “hypertension” or that “PCR testing” is linked to “COVID-19 diagnosis” allows the model to generate precise medical responses. The performance of the model was evaluated after each epoch on a held-out validation set derived from the COVID-19 and TripClick datasets.
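As a toy illustration of how byte-pair encoding splits a complex term into subword units, the following applies an ordered list of merge rules; the rules here are invented for the example, whereas a real tokenizer such as Llama2's learns tens of thousands of merges from the corpus:

```python
def apply_bpe(word, merges):
    """Greedily apply learned merge rules (ordered pairs of symbols)
    to split a word into subword units."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair
            else:
                i += 1
    return symbols
```

With merges learned from biomedical text, a term like “angiotensin” would decompose into frequent medical subwords rather than raw characters.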

The fine-tuning stage was designed to align the model with the key tasks that are relevant to medical knowledge retrieval and QA, including question classification, question reconstruction, abstract summarization, and literature-based QA. These tasks are described in Section 3.2 and illustrated in Fig. 2. Examples of the utilized fine-tuning prompts are provided in Table 2. For the “question classification” and “question reconstruction” components, we initially targeted medical researchers, capturing practical retrieval queries. Restrictions such as publication date, author, and institution constraints were removed to reconstruct the retrieval questions. The “abstract summarization” segment drew inspiration from prior research [49], [50], where literature citing each original article served as its summarization, followed by a manual review. The “literature-based QA” dataset included QA pairs derived from manual abstract readings and the PubMedQA methodology [33]. During fine-tuning, we applied low-rank adaptation (LoRA), which froze the base LLM parameters while updating a low-rank matrix [51]. After experimenting with various rank sizes, we selected a rank of r=4 to balance computational efficiency with fine-tuning precision, applying a scaling factor of α=16 to adjust the influence of the low-rank matrices. This approach preserved the general language capabilities of the base LLM while efficiently adapting it to domain-specific tasks.
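The LoRA update can be sketched as a linear layer whose frozen base weight W is augmented by a trainable low-rank product scaled by α/r, with r=4 and α=16 as stated above (the initialization details below are conventional LoRA choices, not taken from the paper):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight plus trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, w, r=4, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w                                        # frozen, (d_out, d_in)
        d_out, d_in = w.shape
        self.a = rng.normal(scale=0.01, size=(r, d_in))   # trainable
        self.b = np.zeros((d_out, r))                     # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # With b initialized to zero, the adapted layer initially
        # reproduces the base layer exactly.
        return x @ (self.w + self.scale * self.b @ self.a).T
```

Only `a` and `b` receive gradient updates during fine-tuning; the full weight `w` stays fixed, which is what keeps the adaptation memory-efficient.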

The dataset was divided into training, validation, and testing sets with 80%, 10%, and 10% splits, respectively, as shown in Table 1, Table 3. A temperature setting of 0.15 was chosen to improve the reliability of the model outputs, particularly for critical medical QA tasks, as a lower temperature reduces the randomness exhibited by the output responses. We further enhanced the generation process with top P (nucleus) sampling, which was set to a cumulative probability of P=0.9, thereby balancing comprehensiveness and hallucination minimization.
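The decoding configuration above (temperature 0.15 followed by top-P sampling with P = 0.9) can be sketched as follows; the function is a standard formulation of nucleus sampling, not the paper's exact implementation:

```python
import numpy as np

def sample_token(logits, temperature=0.15, top_p=0.9, rng=None):
    """Temperature-scaled softmax, then nucleus (top-p) truncation:
    keep the smallest set of tokens whose cumulative probability
    reaches top_p, renormalize, and sample."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = logits / temperature
    p = np.exp(z - z.max())          # stable softmax
    p /= p.sum()
    order = np.argsort(p)[::-1]      # tokens by descending probability
    cum = np.cumsum(p[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # the nucleus
    q = np.zeros_like(p)
    q[keep] = p[keep] / p[keep].sum()
    return rng.choice(len(p), p=q)
```

The low temperature sharpens the distribution toward the most likely tokens, while the top-P cutoff discards the long, unreliable tail.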

The weighted adaptive moment estimation (AdamW) optimizer was used with a weight decay rate of 0.01 and beta coefficients of (0.9, 0.999). The learning rate schedule included a 1500-step warmup phase, after which the rate decayed to 10% of its maximum value. The learning rate was set at 1 × 10⁻⁴ during pretraining and 1 × 10⁻⁵ for fine-tuning, with batch sizes of 64 and 32, respectively. To prevent overfitting, early stopping was implemented, halting the training process after five epochs without validation loss improvements. Training was conducted on a high-performance cluster with six NVIDIA A100 graphics processing units (GPUs), with the longest duration extending up to 1440 hours. Fig. 3 illustrates the loss and perplexity trends observed over 1000 fine-tuning iterations.
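A sketch of the stated schedule, assuming linear warmup and linear decay (the text specifies the 1500-step warmup and the 10% floor, but not the decay shape or total step count, so those are our assumptions):

```python
def learning_rate(step, max_lr=1e-4, warmup=1500, total=20000):
    """Linear warmup over `warmup` steps, then linear decay to 10%
    of the maximum rate by step `total`."""
    if step < warmup:
        return max_lr * step / warmup
    frac = (step - warmup) / (total - warmup)     # 0 -> 1 over decay phase
    return max_lr * (1.0 - 0.9 * min(frac, 1.0))  # ends at 0.1 * max_lr
```

Swapping `max_lr` between 1 × 10⁻⁴ and 1 × 10⁻⁵ covers the pretraining and fine-tuning settings, respectively.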

4. Evaluation and discussion

By deconstructing practical scenarios, we evaluated the performance of the proposed framework from three main perspectives, namely, literature retrieval, abstract summarization, and literature-based QA.

4.1. Comparison models

To assess the generative performance of the proposed framework, we compared it with several semantic baselines, including BERT [52], BioBERT [53], and BioClinicalBERT [54], as well as more recent medical LLMs such as BioMedLM [4], Meditron-7B [55], and ChatDoctor [56], to provide a comprehensive evaluation.

BERT, which is a bidirectional transformer model with 110 million parameters, captures long-range dependencies by considering both the left and right contexts within a sentence. BioBERT and BioClinicalBERT are variants of BERT that were fine-tuned specifically on medical literature and clinical case data, respectively, to improve their performance in medical NLP tasks.

In contrast, BioMedLM is an LLM that was trained exclusively on biomedical abstracts and papers, utilizing a standard transformer stacking architecture with a context window of 1024 and a hidden size of 2560, yielding robust results across a wide range of biomedical NLP applications. Meditron-7B, which is a more generalized medical LLM built on Llama-2, was trained with NVIDIA’s Megatron-LM distributed trainer. This extensive training process equipped Meditron-7B to effectively handle a variety of medical reasoning tasks. ChatDoctor, on the other hand, was designed specifically for doctor–patient dialogs. Built upon the Llama architecture, ChatDoctor was fine-tuned on large-scale doctor–patient interaction datasets, including over 100 000 dialogs. It incorporates real-time knowledge retrieved from curated offline databases and external sources, such as Wikipedia, enabling it to effectively handle real-world clinical queries. All hyperparameters for these comparison models were fine-tuned through grid searches on the relevant datasets to ensure that each model attained optimal performance on the collected dataset.

4.2. Literature retrieval

As discussed in Section 3.2, vector databases offer promising solutions for effectively implementing semantic retrieval, with vector embeddings playing a critical role in maximizing retrieval performance. In this section, we evaluated the impact of embeddings across various literature retrieval models. For the COVID-19 dataset, we used two types of gold standards: article categories and manual feedback. The gold standard for the TripClick dataset was based on click log entries. All the selected models were evaluated in terms of metrics, including their normalized discounted cumulative gain (NDCG), recall, and mean reciprocal rank (MRR) values.
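The three retrieval metrics follow their standard definitions; straightforward implementations for a single ranked list (NDCG, Recall) and a set of queries (MRR) look like this:

```python
import numpy as np

def ndcg_at_k(rels, k):
    """rels: graded relevance of the ranked results. NDCG@k = DCG@k / IDCG@k,
    where the discount for rank i (1-based) is 1 / log2(i + 1)."""
    rels = np.asarray(rels, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (rels[:k] * discounts[: len(rels[:k])]).sum()
    ideal = np.sort(rels)[::-1]
    idcg = (ideal[:k] * discounts[: len(ideal[:k])]).sum()
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents found within the top k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(ranked_lists, relevant_sets):
    """Mean over queries of 1 / rank of the first relevant result."""
    rr = []
    for ranked, rel in zip(ranked_lists, relevant_sets):
        rr.append(next((1.0 / (i + 1) for i, d in enumerate(ranked) if d in rel), 0.0))
    return sum(rr) / len(rr)
```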

When assessing the retrieval performance achieved on the COVID-19 dataset using article categories as the gold standard, ERQA-7B performed comparably to Meditron, with both models significantly outperforming the ChatDoctor and BERT-based models. As shown in Fig. 4, ERQA-7B achieved an NDCG@10 score of 0.897, which was slightly lower than the 0.899 achieved by Meditron but notably higher than the 0.893 and 0.885 achieved by ChatDoctor and BioMedLM, respectively. A similar trend was observed in terms of Recall@10, with ERQA-7B scoring 0.906, finishing marginally below the 0.907 achieved by Meditron yet outperforming ChatDoctor at 0.902 and BioMedLM at 0.894. These results highlight the retrieval accuracy of ERQA-7B and Meditron on the COVID-19 dataset, both of which have clear advantages over the BERT-based models. Additionally, we evaluated the retrieval performance achieved with human feedback as the ground truth for this dataset. Fig. 5(a) shows that ERQA-7B again outperformed ChatDoctor and BioMedLM, achieving an NDCG@10 score of 0.264 and a Recall@10 score of 0.289, whereas the NDCG@10 score and Recall@10 score of ChatDoctor were 0.257 and 0.279, respectively. Although Meditron slightly underperformed compared to ERQA-7B, with an NDCG@10 score of 0.261 and a Recall@10 score of 0.287, both models exhibited similar overall performance. In contrast, the BERT-based models, such as BioClinicalBERT, yielded significantly weaker results, with an NDCG@10 score of 0.221 and a Recall@10 score of 0.237, underscoring the limitations of smaller-scale models when human feedback is the evaluation criterion.

On the TripClick dataset (Fig. 5(b)), ERQA-7B demonstrated strong performance, closely matching Meditron and outperforming ChatDoctor. ERQA-7B achieved an NDCG@10 score of 0.337 and a Recall@10 score of 0.279, which were nearly on par with those of Meditron, which produced NDCG@10 and Recall@10 scores of 0.341 and 0.276, respectively, and ahead of those yielded by ChatDoctor, which achieved an NDCG@10 score of 0.332 and a Recall@10 score of 0.272. ERQA-13B, however, demonstrated the most substantial improvement on the TripClick dataset, achieving an NDCG@20 score of 0.428 and a Recall@50 score of 0.391, which were well above those of the next-best model, ERQA-7B, which achieved an NDCG@20 of 0.348 and a Recall@50 of 0.376. This marked improvement highlights the efficacy of larger-scale models and sophisticated fine-tuning, particularly for handling diverse and noisy real-world retrieval data.

4.3. Abstract summarization

Abstract summarization, as a specialized subset of text summarization, demands high content generation accuracy, especially within scientific and medical research contexts. To evaluate the performance of our model and the baselines, we employed widely used recall-oriented understudy for gisting evaluation (ROUGE) metrics: ROUGE-1, ROUGE-2, and ROUGE-L. These metrics measure the overlap between generated summaries and reference summaries, with ROUGE-1 capturing unigram overlap, ROUGE-2 assessing bigram overlap, and ROUGE-L focusing on the longest common subsequence [35].
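The three ROUGE variants can be computed as follows, in their recall-oriented formulations (published toolkits additionally report precision and F1; the simple whitespace tokenization below is an illustrative choice):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Recall-oriented n-gram overlap: clipped matched n-grams divided
    by the total n-grams in the reference (ROUGE-1 for n=1, ROUGE-2 for n=2)."""
    def grams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    c, r = grams(candidate.split()), grams(reference.split())
    overlap = sum(min(c[g], r[g]) for g in r)
    return overlap / max(sum(r.values()), 1)

def rouge_l(candidate, reference):
    """Recall based on the longest common subsequence of the token streams."""
    a, b = candidate.split(), reference.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / max(len(b), 1)
```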

The results obtained on the COVID-19 dataset, shown in Fig. 6(a), indicate that the LLMs generally outperformed the traditional BERT-based models. Notably, ERQA-13B achieved the highest scores across all the metrics, with a ROUGE-1 score of 0.434, followed closely by ERQA-7B, which produced a score of 0.420. Compared with the traditional models, ERQA-13B yielded improvements of 28.4% over BERT, 33.95% over BioBERT, and 19.89% over BioClinicalBERT. BioMedLM, Meditron, and ChatDoctor achieved ROUGE-1 scores of 0.409, 0.413, and 0.411, respectively, although they did not surpass the ERQA models in terms of performance. In terms of ROUGE-2, ERQA-13B maintained the lead with a score of 0.203, representing a 16.67% improvement over BioMedLM. ERQA-7B and Meditron also performed well, with scores of 0.184 and 0.181, respectively. With respect to ROUGE-L, which assesses the longest common subsequence, ERQA-13B again led with a score of 0.345, followed by ERQA-7B and Meditron with scores of 0.329 and 0.320, respectively.

On the TripClick dataset, as shown in Fig. 6(b), similar trends emerged. ERQA-13B led across all the metrics, achieving a ROUGE-1 score of 0.421. Meditron and ChatDoctor performed competitively, with scores of 0.400 and 0.387, respectively, whereas ERQA-7B reached a score of 0.403. In terms of ROUGE-2, ERQA-7B and ERQA-13B achieved scores of 0.294 and 0.303, respectively, surpassing Meditron, which had a score of 0.286, and ChatDoctor, which had a score of 0.275. For the ROUGE-L metric, ERQA-13B achieved a score of 0.367, which was the highest among all the models, followed by ERQA-7B with 0.331, Meditron with 0.327, and ChatDoctor with 0.316. Table 4 presents relevant examples of the proposed ERQA model in the summarization task. The ERQA models provided more accurate and contextually relevant summaries, further demonstrating the advantages of the proposed framework in medical knowledge extraction scenarios.

4.4. Literature-based QA

The recently proposed biomedical LLM, Med-PaLM 2, which was trained on the MedQA and MedMCQA datasets, has achieved scores comparable to those of professional doctors on the United States Medical Licensing Examination (USMLE). These datasets are structured in a multiple-choice format, so model performance on them is assessed primarily in terms of accuracy. Unlike Med-PaLM 2, however, the ERQA model is tailored for medical knowledge retrieval and QA tasks that primarily involve context-based reading comprehension, where traditional accuracy metrics are ill-suited. Instead, we used the bilingual evaluation understudy (BLEU) metric [57], which evaluates fluency and contextual relevance and is essential for assessing the quality of generated responses in context-based comprehension tasks.
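The BLEU-1 through BLEU-4 scores reported below follow the standard formulation [57]; a minimal unsmoothed sentence-level sketch looks as follows (whitespace tokenization is an illustrative simplification).

```python
# Minimal unsmoothed sentence-level BLEU sketch; production evaluations
# typically use an established implementation with smoothing.
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity
    penalty. Unsmoothed, so any zero precision yields a score of 0."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        clipped = sum((c & r).values())  # candidate n-grams, clipped to reference counts
        total = sum(c.values())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)
```

BLEU-1 corresponds to `max_n=1`; the brevity penalty keeps overly short but precise answers from scoring highly.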

As shown in Table 5, the LLM-based models substantially outperformed the BERT-based models on both the COVID-19 and TripClick datasets. Specifically, the BERT-based models (BERT, BioBERT, and BioClinicalBERT) exhibited relatively low BLEU scores, indicating limitations in terms of generating high-quality, context-relevant responses for QA tasks.

In contrast, the LLM-based models demonstrated markedly higher performance. For example, BioMedLM achieved BLEU-1 and BLEU-4 scores of 5.843 and 0.672 on the COVID-19 dataset and 5.703 and 0.417 on the TripClick dataset, respectively. Meditron and ChatDoctor further highlighted the superiority of LLM-based models, with Meditron achieving BLEU-1 and BLEU-4 scores of 6.278 and 0.725 on the COVID-19 dataset and 6.004 and 0.432 on TripClick. The proposed ERQA models, which were specifically fine-tuned for medical knowledge retrieval and QA tasks, outperformed these baselines: ERQA-7B achieved BLEU-1 and BLEU-4 scores of 6.467 and 0.722 on the COVID-19 dataset and 6.284 and 0.447 on TripClick. The larger ERQA-13B model further improved on these results.

To deepen our understanding of the QA capabilities of LLMs, we conducted a human evaluation of model-generated answers, using coherence, consistency, and satisfaction as scoring criteria (inspired by prior research [34]). Each metric was scored on a 0–100 scale, with scores segmented into four levels: 1–25 (poor), 26–50 (fair), 51–75 (good), and 76–100 (excellent). This structure provides nuanced insights into response quality across different dimensions.

• Coherence assesses the logical flow of a response. A high coherence score (76–100) indicates that the generated sentences are conceptually accurate and logically sound, strongly supporting the associated argument, whereas a low score (1–25) reflects fragmented or difficult-to-follow responses.

• Consistency measures alignment with the source material, ensuring that responses accurately reflect the source without hallucinations or errors. High scores indicate strong source fidelity, whereas low scores reflect deviations or inaccuracies.

• Satisfaction gauges how well a response satisfies the user’s information needs, assessing completeness and informativeness. High satisfaction scores signify thorough, relevant answers, whereas low scores indicate unmet expectations.

During the human evaluation, three expert reviewers independently assessed the QA pairs generated by the ERQA model, scoring its responses for coherence, consistency, and satisfaction on a 0–100 scale. To ensure reliable evaluations, Krippendorff’s alpha was calculated for each QA pair to assess the interrater reliability of each of the three metrics. A threshold of 0.75 was set for Krippendorff’s alpha, as this is widely regarded as the minimum acceptable level for good interrater reliability. If the alpha value for a QA pair met or exceeded this threshold, the reviewers’ scores were considered consistent, and the average score of the three reviewers was used as the final score for that QA pair. However, if Krippendorff’s alpha for a QA pair fell below 0.75, the QA was flagged for re-evaluation. In such cases, reviewers reassessed the response and discussed any discrepancies, repeating the process until the alpha value exceeded the threshold. Once all the QA pairs had been evaluated and the discrepancies were resolved, the final scores of the examined metrics (coherence, consistency, and satisfaction) were calculated as averages across all the QA pairs, ensuring consistently reliable evaluations across the dataset.
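This acceptance rule can be sketched as follows. The sketch computes interval-scale Krippendorff's alpha over units of reviewer scores and applies the 0.75 threshold; treating each unit as one item's three reviewer scores is our illustrative reading, not necessarily the exact grouping used in the study.

```python
# Illustrative sketch of the alpha-gated score aggregation; the unit
# grouping is an assumption for demonstration purposes.
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data. Each unit is the list of
    reviewer scores for one item (>= 2 scores per unit, none missing).
    A single unit always yields alpha <= 0, so meaningful estimates
    require several units."""
    values = [v for unit in units for v in unit]
    n_ordered = sum(len(u) * (len(u) - 1) for u in units)
    d_obs = sum((a - b) ** 2 for u in units for a, b in permutations(u, 2)) / n_ordered
    n = len(values)
    d_exp = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    return 1.0 if d_exp == 0 else 1.0 - d_obs / d_exp

def final_scores(units, threshold=0.75):
    """Average the reviewers' scores when alpha meets the threshold;
    otherwise return None to flag the items for re-evaluation."""
    if krippendorff_alpha_interval(units) < threshold:
        return None
    return [sum(u) / len(u) for u in units]
```

With closely agreeing scores such as `[[80, 82, 81], [70, 69, 71], [90, 88, 91]]`, alpha exceeds 0.75 and the per-unit means are returned; widely divergent scores fall below the threshold and are flagged for re-review.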

As depicted in Fig. 7, increasing the model size from 7 billion to 13 billion parameters improved the performance of ERQA with respect to coherence, consistency, and satisfaction. The larger model particularly excelled in terms of user satisfaction, demonstrating its enhanced ability to deliver accurate, relevant information from the literature. Nevertheless, the performance gains did not fully justify the additional computational demands, suggesting that ERQA-7B may serve as a more practical option for medical knowledge retrieval and QA tasks.

In some instances, the model produced responses with logical structure and clarity but introduced minor inconsistencies with the source material, likely due to the inherent hallucination tendency of LLMs. For example, in one response concerning COVID-19 immunity, the generated text misstated the duration of antibody protection, which lowered its consistency score. Occasionally, responses met users' information needs but lacked logical flow, which reduced their coherence scores, particularly when complex medical concepts were addressed. To assess the ability of ERQA to understand medical texts, extract relevant information, and produce contextually accurate answers, we used a diverse set of abstracts from the COVID-19 literature as input queries. Table 6 highlights cases in which ERQA struggled with rigorous reasoning requirements, suggesting avenues for future refinement of the model.

Medical knowledge retrieval and QA often involve multiple rounds of context-dependent interaction. Although ERQA can process extended context inputs, its maximum context length is limited primarily by the underlying model architecture. For example, Llama-7B, as used in our study, supports a context window of up to 4096 tokens, which is sufficient for most medical queries. However, the length of the retrievable context is also shaped by the preprocessed abstracts and key literature sections stored in the vector database. This structure enables ERQA to deliver detailed responses under inherent token constraints, ensuring comprehensive answers within the context limits of the model.
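Under such token constraints, delivering detailed responses amounts to packing the best-ranked stored sections into the available prompt budget. A minimal sketch follows; the 4096/512 budget split and whitespace token counting are illustrative assumptions, not ERQA implementation details.

```python
# Illustrative greedy context packing under a fixed token budget.
def build_context(ranked_chunks, question, budget=4096, reserve=512):
    """Pack the highest-ranked literature chunks into the prompt while
    leaving `reserve` tokens for the generated answer. Whitespace tokens
    approximate the model tokenizer; chunk order is assumed to reflect
    descending retrieval score."""
    remaining = budget - reserve - len(question.split())
    picked = []
    for chunk in ranked_chunks:
        cost = len(chunk.split())
        if cost > remaining:
            break  # next chunk no longer fits within the context window
        picked.append(chunk)
        remaining -= cost
    return "\n\n".join(picked)
```

A real system would count tokens with the model's own tokenizer and might truncate the last chunk rather than drop it.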

4.5. Ablation study

We conducted ablation studies on the COVID-19 and TripClick datasets to evaluate the contributions of various components contained in the ERQA framework. The original Llama2 model served as the baseline, whereas Llama2 w/VD refers to the Llama2 model enhanced with a literature vector database for QA tasks. ERQA represents the complete model with vector database support, and ERQA w/o VD denotes the ERQA model without the vector database.

For article retrieval, which relies on an embedding-assisted vector database, we compared the performance of Llama2 w/VD with that of ERQA. As shown in Table 7, ERQA achieved notable improvements over Llama2 w/VD, demonstrating the impact of the fine-tuning process of ERQA on retrieval tasks. In the abstract summarization task, we focused on Llama2 and ERQA w/o VD to evaluate whether the models could effectively distill the key topics of the given articles. The results in Fig. 8 indicate that a substantial ROUGE score improvement was achieved by ERQA w/o VD. Incremental pretraining and fine-tuning enabled the proposed model to outperform the original baseline in terms of medical knowledge comprehension and task compliance, respectively.

Compared with the downstream tasks of article retrieval and abstract summarization, literature-based QA is the most direct manifestation of LLM-driven medical knowledge mining. Although both the fine-tuning process and the database were tailored to support the proposed ERQA framework in completing the QA task, the enhanced LLM (i.e., ERQA w/o VD) and Llama2 w/VD were still capable of handling QA tasks, albeit with decreased performance. In terms of the BLEU metric, the performance of ERQA w/o VD was similar to that of Llama2 w/VD but fell short of that of ERQA, as shown in Table 8. For example, when faced with the question (PMID: 37069722) shown in Table 6, ERQA w/o VD responded as follows: “Despite the administration of vaccines, several challenges continue to hinder the control of the COVID-19 pandemic. Some of these challenges include vaccine hesitancy, limited access to vaccines in resource-poor areas, difficulty in reaching vulnerable populations, and inadequate public health infrastructure and resources.” This response overlooks important content, such as “viral mutations,” and introduces a hallucination by mentioning “vaccine hesitancy,” which does not align with the question. Notably, the complete ERQA framework outperformed all other versions across the different BLEU scores on both datasets. The ablation comparison indicates that the combined effect of fine-tuning and vector database integration significantly enhances QA performance.

4.6. Limitations

When evaluating the performance of the ERQA model across literature retrieval, abstract summarization, and literature-based QA tasks, we observed the exceptional advantages that LLMs offer over traditional language models in terms of comprehending and leveraging medical knowledge. This insight opens new avenues for advancing literature-based medical knowledge mining and underscores the transformative potential of LLMs in this field. However, several practical challenges remain with respect to LLM-driven medical knowledge retrieval and QA systems.

One possible source of error in these tasks is the difficulty of retrieving articles that adequately satisfy user requirements. Developing an auxiliary retrieval strategy tailored to LLMs is a complex challenge; the key considerations include the granularity of text segmentation and the layer selected for generating embeddings within the vector database. Given the complexity and diverse contents of medical literature, choosing the optimal text segmentation granularity is essential for precise retrieval: if the segmentation is too coarse, it may lead to information overload or context loss; if it is too fine, it risks omitting critical information or diluting context. Additionally, since LLMs comprise multiple hidden layers that capture information at varying levels of abstraction, selecting the appropriate layer for embedding generation significantly affects retrieval quality. While vector-based retrieval strategies enhance semantic relevance, metadata-based matching often yields greater precision and interpretability. Balancing vector embeddings with metadata features therefore requires careful design to maximize retrieval accuracy and effectiveness.
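One simple way to balance the two signals is a weighted combination of embedding similarity and metadata overlap. The sketch below is illustrative only: the weight, the cosine choice, and the `embedding`/`mesh_terms` fields are assumptions for demonstration, not details of the ERQA implementation.

```python
# Illustrative hybrid scoring: semantic similarity blended with exact
# metadata matches (e.g., controlled vocabulary terms).
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_score(query_vec, doc, query_terms, w_meta=0.3):
    """Blend semantic relevance with the fraction of query metadata
    terms matched exactly; `w_meta` is a tunable assumed weight."""
    semantic = cosine(query_vec, doc["embedding"])
    overlap = len(query_terms & set(doc["mesh_terms"])) / max(len(query_terms), 1)
    return (1 - w_meta) * semantic + w_meta * overlap
```

Documents are then ranked by `hybrid_score`; raising `w_meta` trades semantic recall for the precision and interpretability of exact metadata matches.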

In literature-based QA performance evaluations, ERQA achieved notably higher coherence scores than consistency and satisfaction scores. This gap was due primarily to hallucination in LLMs, whereby model-generated responses diverge from reality owing to data biases or inconsistencies encountered during training. Such inaccuracies or contradictions in generated answers can compromise the reliability and practical utility of the system. In future work, incorporating external knowledge sources and verification mechanisms to validate and refine model-generated responses could mitigate the hallucination problem and improve response accuracy.

5. Conclusions

In this study, we proposed an LLM-driven framework for medical knowledge retrieval and QA tasks that integrates three key components into a cohesive workflow. Multiple scenarios were chosen to evaluate the proposed framework, including literature retrieval, abstract summarization, and literature-based QA. Both the qualitative and quantitative results of this study highlight the promise of the developed approach in terms of advancing biomedical knowledge discovery. Moving forward, we plan to incorporate larger-scale biomedical literature datasets and adopt additional evaluation metrics to further enhance the performance of the model.

CRediT authorship contribution statement

Yuyang Liu: Writing – review & editing, Writing – original draft, Validation, Methodology, Data curation, Conceptualization. Xiaoying Li: Writing – original draft, Methodology, Conceptualization. Yan Luo: Investigation, Formal analysis, Data curation. Jinhua Du: Software, Methodology. Ying Zhang: Formal analysis, Data curation. Tingyu Lv: Formal analysis, Data curation. Hao Yin: Writing – review & editing, Supervision. Xiaoli Tang: Writing – review & editing, Supervision, Funding acquisition. Hui Liu: Writing – review & editing, Supervision, Funding acquisition.

Acknowledgments

This work was supported by the Innovation Fund for Medical Sciences of the Chinese Academy of Medical Sciences (2021-I2M-1-033) and the National Key Research and Development Program of China (2022YFF0711900).

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. 2023. arXiv:2303.08774.

[2] OpenAI. ChatGPT [Internet]. San Francisco: OpenAI; undated [cited 2024 May 5]. Available from: https://openai.com/blog/chatgpt/.

[3] Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med 2023; 29(8):1930-1940.

[4] Bolton E, Hall D, Yasunaga M, Lee T, Manning C, Liang P. BioMedLM: a domain-specific large language model for biomedical text. Stanford: Stanford Center for Research on Foundation Models; 2022.

[5] Luo R, Sun L, Xia Y, Qin T, Zhang S, Poon H, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 2022; 23(6):bbac409.

[6] Wu C, Lin W, Zhang X, Zhang Y, Xie W, Wang Y. PMC-LLaMA: toward building open-source language models for medicine. J Am Med Inform Assoc 2024; 31(9):1833-1843.

[7] Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2024; 25(1):bbad493.

[8] Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023; 9(1):e45312.

[9] Jin Q, Leaman R, Lu Z. Retrieve, summarize, and verify: how will ChatGPT affect information seeking from the medical literature? J Am Soc Nephrol 2023; 34(8):1302-1304.

[10] Tan RSYC, Lin Q, Low GH, Lin R, Goh TC, Chang CCE, et al. Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. J Am Med Inform Assoc 2023; 30(10):1657-1664.

[11] Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc 2010; 17(5):507-513.

[12] Zhao Q, Xu D, Li J, Zhao L, Akhtar RF. Knowledge guided distance supervision for biomedical relation extraction in Chinese electronic medical records. Expert Syst Appl 2022; 204:117606.

[13] Roy A, Pan S. Incorporating medical knowledge in BERT for clinical relation extraction. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; 2021 Nov 7–11; online. Stroudsburg: Association for Computational Linguistics; 2021.

[14] Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21(1):188.

[15] Hristovski D, Dinevski D, Kastrin A, Rindflesch TC. Biomedical question answering using semantic relations. BMC Bioinformatics 2015; 16(1):6.

[16] Qi T, Qiu S, Shen X, Chen H, Yang S, Wen H, et al. KeMRE: knowledge-enhanced medical relation extraction for Chinese medicine instructions. J Biomed Inform 2021; 120:103834.

[17] Whissell JS, Clarke CL. Improving document clustering using Okapi BM25 feature weighting. Inf Retrieval 2011; 14(5):466-487.

[18] Luo G, Tang C, Yang H, Wei X. MedSearch: a specialized search engine for medical information retrieval. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management; 2008 Oct 26–30; Napa Valley, CA, USA. New York City: Association for Computing Machinery; 2008.

[19] Canese K, Weis S. PubMed: the bibliographic database. In: The NCBI handbook. 2nd ed. Bethesda: National Center for Biotechnology Information (US); 2013.

[20] Vanopstal K, Buysschaert J, Laureys G, Vander Stichele R. Lost in PubMed. Factors influencing the success of medical information retrieval. Expert Syst Appl 2013; 40(10):4106-4114.

[21] Jin Q, Leaman R, Lu Z. PubMed and beyond: biomedical literature search in the age of artificial intelligence. EBioMedicine 2024; 100:104988.

[22] Mourão A, Martins F, Magalhães J. Multimodal medical information retrieval with unsupervised rank fusion. Comput Med Imaging Graph 2015; 39:35-45.

[23] Ma J, Wu X, Huang L. The use of artificial intelligence in literature search and selection of the PubMed database. Sci Program 2022; 16(1):1-9.

[24] Jin Q, Shin A, Lu Z. LADER: log-augmented dense retrieval for biomedical literature search. In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2023 Jul 23–27; Taipei, China. New York City: Association for Computing Machinery; 2023.

[25] Zeng H, Liu J, Wang M, Wei B. A sequence to sequence model for dialogue generation with gated mixture of topics. Neurocomputing 2021; 437:282-288.

[26] Gomes J Jr, de Mello RC, Ströele V, de Souza JF. A hereditary attentive template-based approach for complex knowledge base question answering systems. Expert Syst Appl 2022; 205:117725.

[27] Guu K, Lee K, Tung Z, Pasupat P, Chang M. REALM: retrieval augmented language model pre-training. In: Proceedings of the International Conference on Machine Learning; 2020 Feb 15–17; Shenzhen, China. New York City: Association for Computing Machinery; 2020.

[28] Varshney D, Zafar A, Behera NK, Ekbal A. Knowledge graph assisted end-to-end medical dialog generation. Artif Intell Med 2023; 139:102535.

[29] Wu S, Li Y, Zhang D, Zhou Y, Wu Z. Diverse and informative dialogue generation with context-specific commonsense knowledge awareness. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; 2020 Jul 5–10; online. Stroudsburg: Association for Computational Linguistics; 2020.

[30] Wu S, Li Y, Zhang D, Wu Z. Generating rational commonsense knowledge-aware dialogue responses with channel-aware knowledge fusing network. IEEE/ACM Trans Audio Speech Lang Process 2022; 30:3230-3239.

[31] Pereira J, Fidalgo R, Lotufo R, Nogueira R. Visconde: multi-document QA with GPT-3 and neural reranking. In: Proceedings of the European Conference on Information Retrieval; 2023 Apr 2–6; Dublin, Ireland. Berlin: Springer; 2023.

[32] Huang D, Wei Z, Yue A, Zhao X, Chen Z, Li R, et al. DSQA-LLM: domain-specific intelligent question answering based on large language model. In: Proceedings of the International Conference on AI-Generated Content; 2023 Aug 25–26; Shanghai, China. Berlin: Springer; 2023.

[33] Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X. PubMedQA: a dataset for biomedical research question answering. 2019. arXiv:1909.06146.

[34] Singhal K, Tu T, Gottweis J, Sayres R, Wulczyn E, Amin M, et al. Toward expert-level medical question answering with large language models. Nat Med 2025; 31:943-950.

[35] El-Kassas WS, Salama CR, Rafea AA, Mohamed HK. Automatic text summarization: a comprehensive survey. Expert Syst Appl 2021; 165:113679.

[36] Khan R, Qian Y, Naeem S. Extractive based text summarization using K-means and TF-IDF. Int J Electron Bus 2019; 12(3):33-44.

[37] Parveen D, Ramsl HM, Strube M. Topical coherence for graph-based extractive summarization. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; 2015 Sep 17–21; Lisbon, Portugal. Stroudsburg: Association for Computational Linguistics; 2015. p. 1949–54.

[38] Gehrmann S, Deng Y, Rush AM. Bottom-up abstractive summarization. 2018. arXiv:1808.10792.

[39] Liu Y, Han T, Ma S, Zhang J, Yang Y, Tian J, et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology 2023; 1(2):100017.

[40] Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, et al. Evaluating large language models on medical evidence summarization. npj Digit Med 2023; 6(1):158.

[41] Abacha AB, Yim WW, Adams G, Snider N, Yetisgen-Yildiz M. Overview of the MEDIQA-Chat 2023 shared tasks on the summarization & generation of doctor-patient conversations. In: Proceedings of the 5th Clinical Natural Language Processing Workshop; 2023 Jul 9–14; Toronto, ON, Canada. Stroudsburg: Association for Computational Linguistics; 2023.

[42] Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stüber AT, Topalis J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2024; 34(5):2817-2825.

[43] Rekabsaz N, Lesota O, Schedl M, Brassey J, Eickhoff C. TripClick: the log files of a large health web search engine. In: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval; 2021 Jul 11–15; virtual event. New York City: Association for Computing Machinery; 2021. p. 2507–13.

[44] Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y, et al. Llama 2: open foundation and fine-tuned chat models. 2023. arXiv:2307.09288.

[45] Ma G, Wu X, Wang P, Lin Z, Hu S. Pre-training with large language model-based document expansion for dense passage retrieval. 2023. arXiv:2308.08285.

[46] Esteva A, Kale A, Paulus R, Hashimoto K, Yin W, Radev D, et al. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. npj Digit Med 2021; 4(1):68.

[47] Cunningham P, Delany SJ. K-nearest neighbour classifiers: a tutorial. ACM Comput Surv 2021; 54(6):1-25.

[48] Hezel N, Barthel KU, Schall K, Jung K. Fast approximate nearest neighbor search with a dynamic exploration graph using continuous refinement. 2023. arXiv:2307.10479.

[49] Yasunaga M, Kasai J, Zhang R, Fabbri AR, Li I, Friedman D, et al. ScisummNet: a large annotated corpus and content-impact models for scientific paper summarization with citation networks. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2019 Jan 27–Feb 1; Honolulu, HI, USA. Washington, DC: Association for the Advancement of Artificial Intelligence (AAAI) Press; 2019.

[50] Chen Y, Polajnar T, Batchelor C, Teufel S. A corpus of very short scientific summaries. In: Proceedings of the 24th Conference on Computational Natural Language Learning; 2020 Nov 19–20; online. Stroudsburg: Association for Computational Linguistics; 2020.

[51] Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: low-rank adaptation of large language models. 2021. arXiv:2106.09685.

[52] Devlin J. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv:1810.04805.

[53] Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36(4):1234-1240.

[54] Alsentzer E, Murphy JR, Boag W, Weng WH, Jin D, Naumann T, et al. Publicly available clinical BERT embeddings. 2019. arXiv:1904.03323.

[55] Chen Z, Cano AH, Romanou A, Bonnet A, Matoba K, Salvi F, et al. Meditron-70B: scaling medical pretraining for large language models. 2023. arXiv:2311.16079.

[56] Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus 2023; 15(6):e40895.

[57] Papineni K, Roukos S, Ward T, Zhu WJ. BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; 2002 Jul 7–12; Philadelphia, PA, USA. Stroudsburg: Association for Computational Linguistics; 2002.
