Developing a Predictive Platform for Salmonella Antimicrobial Resistance Based on a Large Language Model and Quantum Computing

Yujie You , Kan Tan , Zekun Jiang , Le Zhang

Engineering ›› 2025, Vol. 48 ›› Issue (5) : 184 -195.

Engineering ›› 2025, Vol. 48 ›› Issue (5) :184 -195. DOI: 10.1016/j.eng.2025.01.013
Research Artificial Intelligence—Article

Abstract

As a common foodborne pathogen, Salmonella poses risks to public health safety, particularly given the emergence of antimicrobial-resistant strains. However, there is currently a lack of systematic platforms based on large language models (LLMs) for Salmonella resistance prediction, data presentation, and data sharing. To overcome this issue, we first propose a two-step feature-selection process based on the chi-square test and conditional mutual information maximization to find the key Salmonella resistance genes in a pan-genomics analysis, and we develop an LLM-based Salmonella antimicrobial-resistance predictive (SARPLLM) algorithm, built on the Qwen2 LLM and low-rank adaptation, to achieve accurate antimicrobial-resistance prediction. Second, we reduce the time complexity of computing the sample distance from the linear to the logarithmic level by constructing a quantum data-augmentation algorithm denoted as QSMOTEN. Third, we build a user-friendly Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs, which not only facilitates online resistance prediction for users but also visualizes the pan-genomics analysis results of the Salmonella datasets.

Keywords

Salmonella resistance prediction / Pan-genomics / Large language model / Quantum computing / Bioinformatics

Cite this article

Yujie You, Kan Tan, Zekun Jiang, Le Zhang. Developing a Predictive Platform for Salmonella Antimicrobial Resistance Based on a Large Language Model and Quantum Computing. Engineering, 2025, 48(5): 184-195 DOI:10.1016/j.eng.2025.01.013

1. Introduction

Salmonella is a common foodborne pathogen and the third leading cause of death due to foodborne diseases [1]. Although antimicrobials are an effective clinical treatment for the diseases caused by Salmonella, their efficacy is affected by gene mutations and antimicrobial abuse [2]. These factors have resulted in several Salmonella strains gradually evolving into antimicrobial-resistant strains, weakening the therapeutic effects of antimicrobials. Thus, to decrease the impacts of antimicrobial-resistant Salmonella strains on food safety and public health, there is an urgent need to develop targeted antimicrobial therapy for infected patients.

As it is both time-consuming and difficult to investigate mechanisms of antimicrobial resistance (AMR) [3], the bacterial antimicrobial susceptibility test (AST) commonly used to detect bacterial AMR is inefficient for predicting AMR. Based on the close association between AMR and genes, previous studies [4], [5], [6] have employed whole-genome sequencing (WGS) data instead of ASTs to predict Salmonella resistance. Due to the curse of high dimensionality introduced by WGS data, however, current machine-learning- and early deep-learning-based predictive models [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] are subject to overfitting during training, resulting in unsatisfactory predictive results. Given that current fine-tuning models based on large language models (LLMs) are powerful tools that can achieve high performance and robust stability with a small sample size, we pose our first scientific question: How can an efficient feature-selection process be established to mine the key Salmonella resistance genes with Salmonella WGS data and to build up an LLM-based Salmonella resistance predictive model in order to alleviate the curse of high dimensionality caused by small samples?

Salmonella resistance brings another challenge as well: Because of the significant imbalance between the number of antimicrobial-resistant samples and the number of sensitive samples in the Salmonella WGS data, the performance of the antimicrobial-resistance prediction model will be significantly lowered. Data augmentation is typically employed for such sample-imbalance problems, as it can generate pseudo samples based on the original minority class samples to balance the sample size. SMOTEN [18] is an oversampling data-augmentation algorithm that looks for k-nearest neighbors of the original minority class samples and then interpolates between the samples to generate new ones. However, SMOTEN and its derived algorithms [18], [19], [20] have such high computational complexity that they are unsuitable for high-dimensional WGS data. Since quantum computing holds the possibility of speeding up the SMOTEN algorithm, we pose our second scientific question: How can quantum computing be used to speed up the SMOTEN algorithm in order to balance the numbers of antimicrobial-resistant and sensitive samples for the Salmonella WGS data?

Although several online AMR gene analysis websites have already been established based on ResFinder [21] and CARD [22], these sites do not provide a predictive functionality for AMR. In addition, existing Salmonella resistance prediction models [23], [24], [25], [26], [27] are often presented as source code, toolkits, or experimental flowcharts, making resistance prediction and analysis difficult. For these reasons, we pose our third scientific question: How can a user-friendly and convenient platform be built for Salmonella resistance prediction, data presentation, and sharing?

In this article, we propose the following innovative works to solve the scientific questions. Firstly, we propose a two-step feature-selection process based on the chi-square test and conditional mutual information maximization to investigate the key Salmonella resistance genes in pan-genomics analysis and develop an LLM-based Salmonella antimicrobial-resistance predictive (SARPLLM) algorithm to achieve accurate antimicrobial-resistance prediction based on the Qwen2 LLM [28] and low-rank adaptation (LoRA) [29]. Secondly, we optimize the time complexity for the computation of the sample distance from the linear to logarithmic level by constructing a quantum data-augmentation algorithm, QSMOTEN. Thirdly, we build up a user-friendly Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs [30], [31], which not only facilitates online resistance prediction for users but also visualizes the pan-genomics analysis results of the Salmonella datasets.

To assess our algorithm, we carry out a comparative experiment between SARPLLM and other antimicrobial-resistance prediction models, which demonstrates that SARPLLM not only outperforms other antimicrobial-resistance predictive models but also shows high robustness in terms of Salmonella antimicrobial-resistance prediction. We then simulate the proposed QSMOTEN algorithm on both a virtual quantum machine and a physical quantum machine. The simulation results indicate that the QSMOTEN algorithm can accurately and quickly compute the distance between Salmonella antimicrobial samples, demonstrating the potential of quantum computing methods to accelerate the SMOTEN algorithm. Finally, we construct an LLM-based predictive platform for Salmonella resistance that provides users with four key modules to facilitate their study of Salmonella resistance.

2. Materials and methods

A pipeline of the study’s methodology is illustrated in Fig. 1.

2.1. Data acquisition

Antimicrobial-resistant Salmonella data samples were obtained from the National Antimicrobial Resistance Monitoring System Now website, provided by the US Centers for Disease Control and Prevention [32]. All samples were Salmonella typhimurium, isolated from patients in the United States. A total of 1167 Salmonella samples were available with AST and WGS results. Based on the AST results, we used all 1167 samples to form an AMR matrix indicating whether each sample was resistant or not. The sample sizes for each antimicrobial and the AST results are listed in Appendix A Table S1. The sample sizes and AST results for the five Salmonella antimicrobials studied here are provided in Appendix A Table S2.

Based on the sample numbers of the 1167 Salmonella samples with AMR, we obtained the WGS data of the Salmonella samples from the US National Center for Biotechnology Information (NCBI) database [33]. The gene annotation data for the 1167 Salmonella samples were generated from the raw WGS data through a hierarchical genome assembly process [34] and the NCBI prokaryotic genome annotation pipeline [35]. Data acquisition is detailed in Appendix A Section S1.

2.2. The pan-genomics analysis and two-step feature-selection process

Fig. 2(a) describes the pan-genomics analysis procedure. First, we took the Salmonella gene annotation data as input and carried out a pan-genomic analysis using the pan-genome pipeline Roary [36] to generate the gene existence matrix. Then, to reflect the differences among genomes, we removed the core genes shared by all Salmonella genomes from the gene existence matrix to obtain the accessory gene existence matrix.
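The core-gene removal step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the gene existence matrix is held as a mapping from gene name to a list of 0/1 flags (one per sample); the gene names and the helper `accessory_matrix` are hypothetical and are not part of Roary.

```python
def accessory_matrix(presence):
    """Drop core genes (present in every genome) from a gene existence matrix.

    `presence` maps a gene name to a list of 0/1 flags, one flag per sample.
    """
    return {gene: row for gene, row in presence.items() if not all(row)}


# Toy gene existence matrix for three samples (hypothetical gene names).
presence = {
    "geneA": [1, 1, 1],  # core gene: present in all samples, so it is removed
    "geneB": [1, 0, 1],  # accessory gene
    "geneC": [0, 0, 1],  # accessory gene
}

accessory = accessory_matrix(presence)
print(sorted(accessory))
```

Only genes whose presence varies across samples remain, which is what lets the matrix reflect differences among genomes.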

Second, we used the gene annotation data as input and carried out a multiple sequence alignment of the Salmonella WGS data by means of the multiple sequence alignment based on fast Fourier transform (MAFFT) program [37] to obtain the core gene alignment data. Subsequently, we used the single nucleotide polymorphism (SNP)-sites [38] tool to detect SNPs from the core gene alignment data and ultimately output the core SNP matrix.

As shown in Fig. 2(b), to address inaccurate AMR predictions caused by the curse of high dimensionality, we propose a two-step feature-selection process based on the chi-square test [39] and the conditional mutual information maximization algorithm [27] to quickly screen out the genes highly correlated with AMR. First, based on the resistance to five Salmonella antimicrobials (i.e., augmentin (AUG), ceftriaxone (AXO), chloramphenicol (CHL), ampicillin (AMP), and cefoxitin (FOX)), we carried out the chi-square test (Eq. (1)) to screen the Salmonella resistance genes with p values of less than 0.05 from the accessory genes and the core SNPs. Then, we employed the conditional mutual information maximization algorithm (Eq. (2)) to evaluate the relative importance of the key Salmonella resistance genes for each antimicrobial. Based on the relative importance, we further screened the Salmonella resistance genes that were highly correlated with the five antimicrobials. The conditional mutual information maximization algorithm is detailed in Appendix A Section S2.

$$\chi^2(x,y)=\frac{N(AD-BC)^2}{(A+C)(A+B)(B+D)(C+D)} \tag{1}$$

$$I(x;y\mid z)=IE(x,z)+IE(y,z)-IE(x,y,z)-IE(z) \tag{2}$$

where x and y indicate a gene from a Salmonella sample and the Salmonella AMR label, respectively. N represents the total number of Salmonella samples, A represents the number of Salmonella AMR samples with gene x, B represents the number of Salmonella antimicrobial sensitivity samples with gene x, C represents the number of Salmonella antimicrobial-resistance samples without gene x, and D represents the number of Salmonella antimicrobial sensitivity samples without gene x. IE represents the information entropy [40], and I represents the conditional mutual information of the number of Salmonella samples with gene x and Salmonella AMR label y under the condition of gene z.
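As a minimal sketch of how Eqs. (1) and (2) can be evaluated, the following Python code computes the chi-square statistic from the four counts A, B, C, and D and the conditional mutual information from joint entropies. The binary encoding of genes and labels, the toy data, and the function names are assumptions made for illustration; the greedy selection loop of the conditional mutual information maximization algorithm (Appendix A Section S2) is not reproduced here.

```python
import math
from collections import Counter


def chi_square(gene, label):
    """Eq. (1) for one gene; `gene` and `label` are equal-length 0/1 lists."""
    n = len(gene)
    pairs = list(zip(gene, label))
    a = sum(1 for g, y in pairs if g == 1 and y == 1)  # resistant, with gene
    b = sum(1 for g, y in pairs if g == 1 and y == 0)  # sensitive, with gene
    c = sum(1 for g, y in pairs if g == 0 and y == 1)  # resistant, without gene
    d = sum(1 for g, y in pairs if g == 0 and y == 0)  # sensitive, without gene
    denom = (a + c) * (a + b) * (b + d) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def entropy(*cols):
    """Joint information entropy (in bits) of one or more discrete columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum(k / n * math.log2(k / n) for k in counts.values())


def cond_mutual_info(x, y, z):
    """Eq. (2): I(x; y | z) written as a sum of joint entropies."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Toy data: eight samples, one gene, one resistance label.
gene = [1, 1, 1, 0, 0, 0, 1, 0]
label = [1, 1, 1, 0, 0, 0, 0, 1]

stat = chi_square(gene, label)
# With one degree of freedom, p < 0.05 corresponds to a statistic above about 3.84.
print(stat, cond_mutual_info(gene, label, [0] * 8))
```

Conditioning on a gene z that duplicates x drives the conditional mutual information to zero, which is how the second step discounts features that are redundant with those already selected.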

2.3. SARPLLM

Here, we propose SARPLLM, which uses the Salmonella accessory gene features and the SNP features obtained from the two-step feature-selection process as inputs and learns potential AMR relationships to predict the AMR of Salmonella samples. As shown in Fig. 3, SARPLLM consists of three steps: data consolidation and prompt engineering, modeling and fine-tuning, and SARPLLM prediction.

2.3.1. Data consolidation and prompt engineering

A prompt is the input text for an LLM. Role-play is a popular prompt engineering technique in which a description is included in the prompt about the person the LLM should portray as it completes a task. In this task, we provide the following prompt to make SARPLLM understand the meaning of input data and the expected output: “Prompt”: “You are an expert in Salmonella antimicrobial-resistance prediction, and you will receive gene feature sequences. Please output the prediction results.”

Since LLM performance is very sensitive to the precise details of the natural-language input, and because the core SNP features and accessory gene existence features differ in their schema, data consolidation is implemented by converting the schema of these two types of features into natural language descriptions. Here, we convert the value of each element of the core SNP features into *, A, G, C, or T, indicating that the core SNP locus of the Salmonella sample is missing or is adenine, guanine, cytosine, or thymine, respectively. We also set the value of each element in the accessory gene existence features to 1 or 0, indicating whether or not a gene appears in the genome of the Salmonella sample.

Since previous studies [41], [42] have reported that the predictive performance of LLMs relies more on correct values than on feature names, we first list the names of the predicted antimicrobials and then list the values of the Salmonella resistance features in the order of relative importance evaluated by Eq. (2). Here, we use the space notation to denote the string conversion of the corresponding resistance feature; for example, “Input”: “AMP: A C * 1 ...”. In this way, we convert the datasets into sentences that SARPLLM can recognize.

2.3.2. Modeling and fine-tuning

SARPLLM is modeled based on the Qwen2 LLM [28] and LoRA [29]. More specifically, SARPLLM takes a pretrained transformer LLM named Qwen2 [28] as its base classifier. We then train SARPLLM on Salmonella resistance datasets to specialize its knowledge and increase its predictive performance by means of LoRA, a parameter-efficient method that constrains the weight matrix updates to be low-rank [29]. SARPLLM’s efficacy is highlighted by its ability to leverage the extensive knowledge encoded in the pretrained Qwen2, such that it requires only a small amount of labeled Salmonella AMR data.
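The low-rank constraint used by LoRA can be illustrated with a minimal plain-Python sketch: the pretrained weight matrix W0 stays frozen, and only two small matrices B and A are trained, giving an effective weight W0 + (alpha / r) * B A. The matrix sizes, the rank r, and the scaling alpha below are illustrative assumptions and are not the SARPLLM settings, which are given in Appendix A Section S6.3.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, b, a, alpha, r):
    """Effective weight W0 + (alpha / r) * B @ A; W0 itself is never modified."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


# Illustrative sizes only (assumptions, not the SARPLLM configuration).
d_out, d_in, r, alpha = 6, 8, 2, 4
random.seed(0)
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero-initialized

w = lora_weight(w0, b, a, alpha, r)
full_params = d_out * d_in        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by LoRA
print(full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the pretrained one, and the number of trainable parameters grows with r rather than with the full matrix size.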

Since a language model can be fine-tuned to work on non-language tasks without any change to its architecture or loss function [41], we use the default cross-entropy loss [32] to fine-tune SARPLLM. For each training sample used for fine-tuning, we define the training template as follows:

“Prompt”: “You are an expert in Salmonella antimicrobial-resistance prediction, and you will receive gene feature sequences. Please output the prediction results.”

“Input”: “AMP: A C * 1 ... 0”, “Output”: “1”.
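A minimal sketch of how one sample could be serialized into this template is shown below. The helper `build_record` and the use of a Python dictionary are assumptions made for illustration; only the prompt text, the field names, and the space-separated input format come from the description above.

```python
import json

PROMPT = (
    "You are an expert in Salmonella antimicrobial-resistance prediction, "
    "and you will receive gene feature sequences. "
    "Please output the prediction results."
)


def build_record(antimicrobial, features, label=None):
    """Serialize one sample; `features` must already be sorted by relative importance."""
    record = {
        "Prompt": PROMPT,
        "Input": f"{antimicrobial}: " + " ".join(str(v) for v in features),
    }
    if label is not None:  # labels are attached only to training samples
        record["Output"] = str(label)
    return record


# SNP features use *, A, G, C, T; accessory gene features use 1 or 0.
record = build_record("AMP", ["A", "C", "*", 1, 0], label=1)
print(json.dumps(record, indent=2))
```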

2.3.3. SARPLLM prediction

After we fine-tune SARPLLM, we parse the predicted output of SARPLLM for each input sample. Since the predictive performance of an LLM relies more on correct values than on feature names [41], [42], SARPLLM simply outputs “1” or “0” to indicate whether the sample is resistant or sensitive, instead of outputting the text strings “resistance” or “sensitivity.” For example, if SARPLLM outputs “1,” we parse the final prediction result for this sample as resistant.
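The parsing step can be sketched as follows. The function name and the choice to return `None` for any unexpected output are assumptions for illustration; the source only specifies that “1” denotes resistance and “0” denotes sensitivity.

```python
def parse_prediction(text):
    """Map the raw model output to a label; return None for unexpected output."""
    token = text.strip()
    if token == "1":
        return "resistant"
    if token == "0":
        return "sensitive"
    return None


print(parse_prediction(" 1\n"))
```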

2.4. The QSMOTEN algorithm

In this section, we construct the optimization algorithm QSMOTEN based on the SMOTEN algorithm and provide a circuit simplification and a circuit mapping method to implement the QSMOTEN algorithm, thereby providing a solution to alleviate the significant difference in the numbers of resistant and sensitive Salmonella samples for AMR prediction.

The SMOTEN algorithm [18] is an improvement of the SMOTE algorithm for unordered data. To optimize the time complexity of the SMOTEN algorithm for large-scale Salmonella samples, this study proposes the quantum-computing-based QSMOTEN algorithm, which uses similarity as a distance metric for k-nearest neighbor computation, encodes feature names and values into quantum states, and uses the SWAP-test [33] circuit to compute the distance between samples. The QSMOTEN algorithm and its related quantum circuit are described by Algorithm 1 and Fig. 4, respectively.

2.4.1. Quantum state encoding

For a sample set $S=\{\phi_i \mid i=0,1,\ldots,N-1\}$, each sample consists of F features, and each feature takes one of at most T values: $\phi_i=\{x_{ij} \mid j=0,1,\ldots,F-1;\ x_{ij}=0,1,\ldots,T-1;\ F>1,\ T>1\}$. Here, N represents the total sample size, $\phi_i$ is the ith element in sample set S, and $x_{ij}$ is the jth feature in $\phi_i$. QSMOTEN encodes the features into the following quantum states by means of Eq. (3):

$$|\phi_i\rangle=\frac{1}{\sqrt{2^a}}\sum_{v=0}^{2^a-1}|v\rangle|x_{iv}\rangle \tag{3}$$

where $a=\log_2 F$, $b=\log_2 T$, and $x_{iv}\in\{t \mid t=0,1,2,\ldots,(b-1)\}$. Here, a is the number of qubits that encode the feature position, b is the number of qubits that encode the feature value, and v represents the position of the vth feature; when $v\ge F$, $|x_{iv}\rangle=|0\rangle^{\otimes b}$. $x_{iv}$ denotes the unordered sample features, and $|\phi_i\rangle$ is the quantum state of the sample $\phi_i$.

Here, we show the preparation process of the quantum state of ϕi:

(1) We initialize the quantum state as $|0\rangle^{\otimes(a+b)}$.

(2) We apply the Hadamard gate (H) to the a-qubits (Eq. (4)) to represent the encoding of the feature name. For example, 0110 represents the sixth (in binary, 0110) feature.

$$H^{\otimes a}|0\rangle^{\otimes(a+b)}=\frac{1}{\sqrt{2^a}}\sum_{v=0}^{2^a-1}|v\rangle|0\rangle^{\otimes b} \tag{4}$$

(3) We apply a multiple controlled NOT (MCX) gate with the a-qubits as the control bits. Then, we apply the Pauli-X gate to the b-qubits to invert their values. The resulting quantum state can be expressed by Eq. (5). For example, when the a-qubits are $|v\rangle$, the b-qubits are set to $|x_{iv}\rangle$.

$$|\phi_i\rangle=\mathrm{MCX}\left(\frac{1}{\sqrt{2^a}}\sum_{v=0}^{2^a-1}|v\rangle|0\rangle^{\otimes b}\right)=\frac{1}{\sqrt{2^a}}\sum_{v=0}^{2^a-1}|v\rangle|x_{iv}\rangle \tag{5}$$

2.4.2. The similarity between Salmonella samples

The QSMOTEN algorithm finds the top-k most similar neighbor samples by computing the similarity between samples. The similarity is defined as follows:

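$$\mathrm{Sim}(\phi_i,\phi_j)=\sum_{v=0}^{F-1}x_{iv}\odot x_{jv} \tag{6}$$

where $x_{jv}$ are unordered sample features and $\odot$ represents the equivalence operation (1 when the two values are equal and 0 otherwise). $\phi_j$ is the jth element in sample set S.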
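For reference, the classical counterpart of this similarity and the top-k neighbor search can be sketched as below. The tie-breaking rule (lower index first) and the function names are assumptions; the four toy samples are the ones used later in the virtual quantum machine experiment.

```python
def similarity(phi_i, phi_j):
    """Classical form of the similarity: count positions whose values are equal."""
    return sum(1 for x, y in zip(phi_i, phi_j) if x == y)


def top_k_neighbors(samples, i, k):
    """Indices of the k samples most similar to samples[i], excluding i itself."""
    scored = [(similarity(samples[i], s), j) for j, s in enumerate(samples) if j != i]
    scored.sort(key=lambda t: (-t[0], t[1]))  # highest similarity first, then lowest index
    return [j for _, j in scored[:k]]


samples = ["ATCG", "ATGC", "AAGT", "TATA"]
print(similarity(samples[0], samples[1]), top_k_neighbors(samples, 0, 2))
```

Each call to `similarity` performs F comparisons, which is the per-pair cost that the quantum circuit aims to reduce.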

As shown in Fig. 4, we adopt the SWAP-test circuit to compute the similarity (Eq. (6)) in the following three steps:

Step 1: The circuit takes two quantum states, $|\phi_i\rangle$ and $|\phi_j\rangle$, with the same number of qubits as input, as well as an auxiliary qubit initialized to $|0\rangle$.

Step 2: The circuit applies a Hadamard gate to the auxiliary qubit and applies a Fredkin gate (CSWAP gate) to swap $|\phi_i\rangle$ and $|\phi_j\rangle$ when the auxiliary qubit is $|1\rangle$.

Step 3: The circuit applies a Hadamard gate to the auxiliary qubit again and then measures the auxiliary qubit.

As shown in Appendix A Section S3, the relationship between the probability of the measurement result 1 and the similarity can be described by Eq. (7):

$$\mathrm{Sim}(\phi_i,\phi_j)=F-2^a\left(1-\sqrt{1-2P_1}\right) \tag{7}$$

This indicates that the similarity between the two samples increases as the probability $P_1$ decreases. Therefore, the similarity between two samples can be determined from the probability $P_1$. Here, $P_1$ is the probability that the measurement result of the auxiliary qubit is 1.
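The relationship between the measurement probability and the similarity can be checked numerically with a small classical sketch. It relies on the standard ideal SWAP-test result that the auxiliary qubit yields 1 with probability 1/2 − |⟨ϕi|ϕj⟩|²/2. Rounding a up to the next integer and clamping a slightly negative value under the square root to zero (which can occur with noisy measurements) are assumptions of this sketch, not steps stated in the text.

```python
import math


def swap_test_p1(phi_i, phi_j):
    """Ideal probability of measuring 1 on the auxiliary qubit for two encoded samples."""
    f = len(phi_i)
    a = math.ceil(math.log2(f))             # position qubits (assumed rounded up)
    matches = sum(1 for x, y in zip(phi_i, phi_j) if x == y)
    padded = 2 ** a - f                     # padded positions hold |0...0> in both states
    overlap = (matches + padded) / 2 ** a   # inner product of the two encoded states
    return 0.5 - 0.5 * overlap ** 2


def similarity_from_p1(p1, f):
    """Recover the similarity from P1, clamping noise-induced negative values to zero."""
    a = math.ceil(math.log2(f))
    root = math.sqrt(max(0.0, 1.0 - 2.0 * p1))
    return f - 2 ** a * (1.0 - root)


p1 = swap_test_p1("ATCG", "ATGC")
print(p1, similarity_from_p1(p1, 4))
```

For the pair ATCG and ATGC, two of the four positions agree, and the value recovered from the ideal probability equals that count; on real hardware, P1 is estimated from repeated measurements, so the recovered similarity is approximate.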

2.4.3. Time complexity analysis of the QSMOTEN algorithm

Assume that there are a total of N samples, that each sample consists of F sequence features, that each sequence feature takes one of T types of values, and that a total of M new samples need to be generated. The SMOTEN algorithm needs to compute the distance between each pair of samples, for a total of $N^2/2$ computations, and each computation requires F comparisons. Therefore, the time complexity of the distance computation is $O(N^2F)$.

The QSMOTEN algorithm first encodes each sample into a quantum state of length $\log_2 F+\log_2 T$ with a time complexity of $O(N^2(\log F+\log T))$. Then, the QSMOTEN algorithm uses the SWAP-test circuit with $\log_2 F+\log_2 T$ Fredkin gates to compute the distance between samples. The time complexity of this process is $O(N^2(\log F+\log T))$. Therefore, the time complexity of the distance computation for the QSMOTEN algorithm is $O(N^2(\log F+\log T))$.

Gene samples typically have T=4 possible values per feature (A, T, G, or C), so $\log T$ is a small constant that is usually ignored. Therefore, in comparison with the SMOTEN algorithm, the QSMOTEN algorithm can greatly decrease the time complexity of the distance computation from $O(N^2F)$ to $O(N^2(\log F+\log T))$.
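To give a feel for the gap between the two growth rates, the sketch below evaluates the leading terms of both expressions. It is an order-of-magnitude illustration only: constant factors, state preparation, and the repeated measurements needed to estimate probabilities are all ignored, and the chosen values of N, F, and T are assumptions (N and F are borrowed from the dataset size and the core SNP matrix described in this article).

```python
import math


def classical_cost(n, f):
    """Leading term of the classical distance computation, N^2 * F."""
    return n * n * f


def quantum_cost(n, f, t):
    """Leading term of the quantum distance computation, N^2 * (log2 F + log2 T)."""
    return n * n * (math.log2(f) + math.log2(t))


n, f, t = 1167, 126087, 4
ratio = classical_cost(n, f) / quantum_cost(n, f, t)
print(f"leading-term ratio: {ratio:.0f}")
```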

2.5. The Salmonella resistance predictive platform

Based on the pan-genomics analysis results and the SARPLLM model, we establish a Salmonella antimicrobial-resistance predictive online platform based on web technology and knowledge graphs [29]. The platform employs Django [43] as the back-end service architecture to monitor and respond to user requests. The front-end uses Echarts for knowledge graph visualization. The platform can perform online predictions for multiple AMRs based on the Salmonella gene feature files uploaded by users. In addition, it not only displays the pan-genomics analysis results of the Salmonella datasets but also provides a convenient way for users to download raw and analytical data. The platform has four major modules:

(1) An antimicrobial-resistance predictive module: This module provides an interface for users to upload Salmonella gene feature files; it also provides an interface to use the SARPLLM model for Salmonella antimicrobial-resistance online prediction and results visualization.

(2) A pan-genomics analysis results module: This module visualizes the statistical results obtained from the pan-genomics analysis for the Salmonella antimicrobial-resistance dataset.

(3) A gene sample antimicrobial knowledge-graph module: This module constructs a directed graph to describe the relationship among the gene, sample, and antimicrobial agent, and visualizes their relationship.

(4) A data download module: This module provides the download function for raw data and pan-genomics analysis data.
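The directed graph behind the knowledge-graph module can be pictured as a set of (head, relation, tail) triples. The following sketch is a hypothetical in-memory version: the class, the entity names, and the relation labels are invented for illustration and do not describe the platform's actual storage schema.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal directed graph of (head, relation, tail) triples."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, head, relation, tail):
        self.edges[head].append((relation, tail))

    def neighbors(self, head, relation=None):
        """Tails reachable from `head`, optionally filtered by relation type."""
        return [t for r, t in self.edges[head] if relation is None or r == relation]


kg = KnowledgeGraph()
# Hypothetical entities and relation labels.
kg.add("sample_001", "contains", "geneB")
kg.add("sample_001", "resistant_to", "AMP")
kg.add("geneB", "associated_with", "AMP")

print(kg.neighbors("sample_001"))
print(kg.neighbors("sample_001", relation="resistant_to"))
```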

3. Results

3.1. Experimental results for the antimicrobial-resistance prediction model

To answer our first scientific question, we present the results of the pan-genomics analysis process, the two-step feature selection, and the comparative experiments on antimicrobial-resistance prediction models. First, through the pan-genomics analysis process described in Section 2.2, we obtained an 18125 (gene) × 1167 (sample) accessory gene existence matrix, which is provided in Appendix A Section S4.1. In addition, we obtained a 126087 (gene) × 1167 (sample) core SNP matrix, which is provided in Appendix A Section S4.2. Second, through the two-step feature selection, we screened the strongly correlated features of the Salmonella resistance genes corresponding to each antimicrobial (i.e., AUG, AXO, CHL, AMP, and FOX). These correlated features are listed in Appendix A Sections S4.3 and S4.4. The top five gene features selected by the two-step feature selection are listed in Appendix A Table S3. Third, we present the AMR predictions for the five antimicrobials (AUG, AXO, CHL, AMP, and FOX) using SARPLLM, logistic regression (LR), random forest (RF) [44], eXtreme gradient boosting (XGBoost) [45], support vector machine (SVM) [46], multilayer perceptron (MLP), and a resistance prediction neural network (RPNN) model (detailed in Appendix A Section S5). The configuration parameters for the LR, RF, XGBoost, SVM, and RPNN models are provided in Appendix A Sections S6.1 and S6.2.

SARPLLM adopts batch learning, with a batch size of four and four training epochs. The training uses the AdamW optimizer [47] with a learning rate of $10^{-5}$. The learning-rate scheduler type is set to polynomial. More architecture parameters of the SARPLLM model are provided in Appendix A Section S6.3. Considering the imbalance between resistant and sensitive labels in the Salmonella datasets, as well as the comprehensive evaluation indicators commonly used in LLM predictive tasks, we use the F1-score to evaluate the predictive performance.

For each antimicrobial, we take the Salmonella samples with the top N=10,20,30,40,50 strongly correlated features as input, respectively. We then carry out a three-fold cross validation with four repeated tests and statistically determine the mean value of the predictive indicators. Fig. 5 presents the predictive results for AUG, AXO, CHL, AMP, and FOX.
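The evaluation protocol can be sketched in plain Python as follows: an F1-score for the resistant class and a generator that yields three folds for each of four repetitions. Shuffling with a fixed seed and splitting without stratification are assumptions of this sketch; the source does not state how the folds were drawn.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """F1-score of the positive (resistant) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def repeated_kfold(n_samples, k=3, repeats=4, seed=0):
    """Yield (train, test) index lists for k folds, repeated with fresh shuffles."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        for fold in range(k):
            test = idx[fold::k]
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            yield train, test


splits = list(repeated_kfold(n_samples=9))
score = f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(len(splits), score)
```

The reported indicator is then the mean of the per-split scores over all twelve train/test splits.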

Fig. 5 shows that the predictive performances of all the models are greater than 85% with only ten resistance features. Furthermore, as the number of resistance features gradually increases from 10 to 50, the evaluation indicators of all predictive models show fluctuations and then tend to stabilize, indicating that our proposed two-step feature-selection method can accurately screen and retain the features related to AMR, thereby reducing the size of the input resistance features while maintaining a high-performance predictive ability for all the models.

To further analyze the performance differences between SARPLLM and the other models, we carried out t-tests [48], [49], [50] between SARPLLM and the other models for the five antimicrobials with 50 features. The results are provided in Table 1 and Appendix A Table S4. The records of the predicted results are listed in Appendix A Section S7. Table 1 shows that the F1-score of SARPLLM is significantly better (p value < 0.05) than those of the LR, RF, XGBoost, SVM, MLP, and RPNN models in most cases, indicating that our proposed SARPLLM model exhibits significant performance advantages in antimicrobial-resistance prediction for multiple antimicrobial datasets and thus generalizes well. In addition, although there are missing values in the input data, SARPLLM still has the best predictive performance, which implies that SARPLLM handles missing data well and is therefore robust.

3.2. Experimental results for QSMOTEN

To answer our second scientific question, we present the experimental results using a virtual quantum machine and a physical quantum machine, respectively.

3.2.1. Experimental results on a virtual quantum machine

We assume that there are four samples, $\phi_i$ ($i\in[0,3]$). Each sample has four features, and the value of each feature is A, T, G, or C. This experiment takes $\phi_0=\mathrm{ATCG}$, $\phi_1=\mathrm{ATGC}$, $\phi_2=\mathrm{AAGT}$, and $\phi_3=\mathrm{TATA}$ as examples to validate the effectiveness of the QSMOTEN algorithm.

Based on the above assumptions, we provide a quantum circuit (Fig. 6) to compute the similarity between ϕ0 and ϕ1. The circuits to compute the similarity between the remaining sample pairs (ϕ0ϕ2, ϕ0ϕ3, ϕ1ϕ2, ϕ1ϕ3, and ϕ2ϕ3) are provided in Appendix A Section S8. For each sample pair, the measurement is carried out $10^4$ times in each experiment. The experiments were simulated under ideal conditions with Qiskit [51] using the “aer_simulator” simulator. We computed the similarity by means of Eq. (7); the simulation results are recorded in Table 2.

Table 2 shows that the similarity computed by QSMOTEN is close to the actual similarity of the sample pairs, which indicates that our proposed QSMOTEN algorithm can correctly compute the similarity between samples and provides an efficient approach for data augmentation on a quantum computer.

3.2.2. Experimental results on a physical quantum machine

Due to the limited hardware support available for quantum computers, the current fidelity of physical quantum machines is low. Therefore, we employ single-feature samples ϕ0=A, ϕ1=T, and ϕ2=A as examples to validate the effectiveness of the QSMOTEN algorithm. The three samples are encoded in the following form: $|\phi_0\rangle=|\phi_2\rangle=|0\rangle$ and $|\phi_1\rangle=|1\rangle$. Based on the above assumptions, we develop the quantum circuits shown in Figs. 7(a) and (b) to compute the similarity between ϕ0 and ϕ1 and the similarity between ϕ0 and ϕ2, respectively.

For each sample pair, we carry out $10^4$ measurements in each experiment. The experiments were run on the “Xiaohong” quantum computer. The quantum processor parameters of the “Xiaohong” quantum computer are detailed in Appendix A Section S9. Figs. 7(c) and (d) show the measurement results for samples ϕ0=A and ϕ1=T and for samples ϕ0=A and ϕ2=A, respectively. Fig. 7(c) shows that the measurement results of the quantum circuit (Fig. 7(a)) are P0=0.4808 and P1=0.5192. By substituting the measurement results into Eq. (7), the similarity between ϕ0=A and ϕ1=T is found to be 0, which is the same as the actual similarity of 0. Fig. 7(d) shows that the measurement results of the quantum circuit (Fig. 7(b)) are P0=0.8018 and P1=0.1982. By substituting the measurement results into Eq. (7), the similarity between ϕ0=A and ϕ2=A is found to be 0.7769, which is close to the actual similarity of 1. Therefore, the experimental results on the Xiaohong quantum computer indicate that the QSMOTEN algorithm can correctly compute the similarity between sample pairs, demonstrating the potential of quantum computing methods for accelerating the SMOTEN algorithm on a physical quantum machine.

3.3. The Salmonella antimicrobial-resistance prediction online platform

To answer the third scientific question, we establish a Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs, which provides four online services: an antimicrobial-resistance prediction module, a pan-genomics analysis results module, a gene-sample-antimicrobial knowledge-graph module, and a data download module. Fig. 8(a) shows the antimicrobial-resistance prediction module, which has two functions: “Select file” uploads the Salmonella gene feature file in the specified format, and “Predict” predicts and visualizes the AMR results. Fig. 8(b) shows the pan-genomics analysis results module, which displays the number and distribution of the genes in the Salmonella pan-genome. Users can move the mouse over a type of gene in the pie chart to obtain detailed information on the gene. In addition, users can move the mouse over the histogram to determine how many y genes in the pan-genome exist in x genomes. Fig. 8(c) displays the gene-sample-antimicrobial knowledge-graph module. The blue dots in the figure represent the strain sample entities, the green dots represent the antimicrobial entities, the yellow dots represent the gene entities, the lines represent the relationships between entities, and the text labels on the lines describe the type of relationship. Users can move the mouse over an entity or a relationship to highlight that entity or adjacent relationship. In addition, the properties of the entity are displayed when selected by the mouse. Fig. 8(d) shows the data download module, which allows users to obtain the corresponding data by clicking the “download” button.

4. Discussion

Due to genetic mutations and the misuse of antimicrobials, the number of antimicrobial-resistant Salmonella strains continues to increase, challenging public health safety. In order to quickly and accurately predict Salmonella AMR, this study integrated LLM and quantum technology to alleviate the curse of high dimensionality and the sample imbalance problem of WGS data for Salmonella AMR. To address the curse of high dimensionality in Salmonella antimicrobial-resistance prediction, we chose genes that are highly associated with antimicrobials through a pan-genomic analysis and two-step feature selection based on AST data and WGS data for Salmonella. Next, we constructed SARPLLM for Salmonella antimicrobial-resistance prediction based on the Qwen2 LLM [28]; it converts Salmonella samples into sentences and uses LoRA to fine-tune the pretrained Qwen2. As shown in Fig. 5, as the number of input resistance features changes from 10 to 50, the evaluation indexes of all the compared predictive models first increase and then stabilize or decrease. This is because the performance of the predictive models initially increases with the number of resistance features. However, inputting too many features unrelated to AMR into prediction models is like introducing a large amount of noise. These irrelevant features not only increase the structural complexity of the predictive model but also result in a significant decrease in the predictive performance. Since Fig. 5 and Table 1 show that SARPLLM performs relatively well on the five antimicrobial datasets, we consider it to generalize well. Since SARPLLM retains the best predictive performance when the input contains missing data, we also consider it to be robust.

Benefiting from the development of quantum technology, quantum computing can, for certain problems, accelerate classical algorithms by up to an exponential factor, which has motivated numerous studies on quantum-computing-based algorithm acceleration [52], [53], [54]. For example, Zha et al. [52] introduced grid point matching and feature atom matching to accelerate pose sampling in molecular docking by encoding the problem into a quadratic unconstrained binary optimization model, resulting in a 1000-fold increase in the efficiency of molecular docking in drug discovery. Shu et al. [53] proposed a quantum integration algorithm suitable for any continuous function, which exhibits quadratic acceleration in comparison with classical integration algorithms by reducing the computational complexity from $O(N)$ to $O(\sqrt{N})$. Liu et al. [54] designed a quantum method for classical information compression that exploits the hidden subgroup quantum algorithm, demonstrating that data with a given group structure can be compressed with the same query complexity as the hidden subgroup problem while being exponentially faster than the best-known classical algorithms.

Since current data-augmentation algorithms for balancing data categories are inefficient for large-scale high-dimensional WGS data, this study proposed QSMOTEN based on the SMOTEN algorithm with a SWAP test quantum circuit. A time complexity analysis of the QSMOTEN algorithm showed that the algorithm reduces the time complexity of the key step of computing the distance between samples from a linear level, $O(N^2 F)$, to a logarithmic level, $O(N^2 \log F + \log T)$. Moreover, simulation experiments on virtual machines (Table 2) and physical machines (Figs. 7(c) and (d)) showed that the QSMOTEN algorithm can both accurately compute the distance between samples and accelerate the SMOTEN algorithm via quantum computing.
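To make the role of the SWAP test concrete, the sketch below simulates it classically with NumPy. It relies on the standard SWAP-test identity, in which the ancilla qubit is measured as 0 with probability 1/2 + |⟨a|b⟩|²/2. Everything else is an assumption made for illustration: the amplitude encoding of samples as normalized real vectors, the conversion from overlap to Euclidean distance (valid only for a non-negative inner product), and the function names. The actual QSMOTEN circuit and sample encoding are those given in the paper and its appendix.

```python
import numpy as np

def swap_test_p0(a, b):
    """Ideal probability of measuring the ancilla in |0> for states |a>, |b>."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a = a / np.linalg.norm(a)  # amplitude encoding needs unit vectors
    b = b / np.linalg.norm(b)
    p0 = 0.5 + 0.5 * abs(np.dot(a, b)) ** 2
    return float(min(1.0, p0))  # guard against floating-point overshoot

def estimate_overlap(a, b, shots=4096, seed=0):
    """Estimate |<a|b>|^2 from a finite number of simulated measurements."""
    rng = np.random.default_rng(seed)
    zeros = rng.binomial(shots, swap_test_p0(a, b))
    return max(0.0, 2.0 * zeros / shots - 1.0)

def distance_from_overlap(overlap_sq):
    """Euclidean distance between unit vectors, assuming <a|b> >= 0."""
    return float(np.sqrt(max(0.0, 2.0 - 2.0 * np.sqrt(overlap_sq))))
```

In this picture, a vector with F entries is stored in the amplitudes of about log2(F) qubits, which is the usual source of a logarithmic dependence on the feature dimension, while the number of shots controls the statistical accuracy of the estimated distance.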

To address the current lack of specialized online platforms for Salmonella antimicrobial-resistance prediction, we constructed an LLM-based platform in this study by integrating web technology and knowledge graph technology. The platform consists of four modules: an antimicrobial-resistance prediction module, a pan-genomics analysis results module, a gene-sample-antimicrobial knowledge-graph module, and a data download module. The platform not only provides users with a convenient Salmonella resistance prediction service but also visualizes the pan-genomics analysis results. In addition, the platform employs knowledge graph technology to store Salmonella genome data and AMR data in a highly scalable way.

Although we developed a predictive platform for Salmonella AMR in this study based on an LLM and quantum computing (the code is listed in Appendix A Section S10), this work has two shortcomings. First, Salmonella antimicrobial-resistance prediction involves complex biological and genetic knowledge, and it is difficult for current LLMs to fully understand and accurately represent this knowledge, which limits the accuracy of the predictions. In addition, the performance of LLMs strongly depends on the quality and quantity of the training dataset. High-quality and diverse datasets remain scarce in the field of Salmonella resistance prediction, which limits LLM predictive performance. Second, quantum computing technology is still in its early stages of development, and quantum computing is limited by the capability of quantum hardware processors. Most quantum computing algorithms can only be analyzed mathematically or run on high-performance simulators of quantum computers. Therefore, there is still a long way to go before these algorithms can be validated on physical machines and applied in engineering.

In summary, our future research will focus on integrating multi-source data and domain knowledge to increase the accuracy of the predictive platform for Salmonella AMR based on an LLM. We will also attempt to develop more stable and reliable quantum hardware to increase the application of quantum computers in Salmonella resistance data augmentation.

CRediT authorship contribution statement

Yujie You: Writing – review & editing, Writing – original draft, Validation, Methodology. Kan Tan: Writing – original draft, Visualization, Investigation. Zekun Jiang: Supervision, Resources, Investigation. Le Zhang: Writing – review & editing, Resources, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Science and Technology Major Project (2021YFF1201200), the National Natural Science Foundation of China (62372316), and the Sichuan Science and Technology Program key project (2024YFHZ0091). We thank QuantumCTek for supporting the quantum computer Xiaohong.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2025.01.013.

References

[1]

Ferrari RG, Rosario DKA, Cunha-Neto A, Mano SB, Figueiredo EES, Conte CA, et al. Worldwide epidemiology of Salmonella serovars in animal-based foods: a meta-analysis. Appl Environ Microbiol 2019; 85(14):e00591-19.

[2]

Qin XJ, Yang MZ, Cai H, Liu YT, Gorris L, Aslam MZ, et al. Antibiotic resistance of Salmonella typhimurium monophasic variant 1, 4, 5, 12:i:- in China: a systematic review and meta-analysis. Antibiotics 2022; 11(4):532.

[3]

Anahtar MN, Yang JH, Kanjilal S. Applications of machine learning to the problem of antimicrobial resistance: an emerging model for translational research. J Clin Microbiol 2021; 59(7):e01260-20.

[4]

Botelho J, Schulenburg H. The role of integrative and conjugative elements in antibiotic resistance evolution. Trends Microbiol 2021; 29(1):8-18.

[5]

Soon WW, Hariharan M, Snyder MP. High-throughput sequencing for biology and medicine. Mol Syst Biol 2013; 9:640.

[6]

Su M, Satola SW, Read TD. Genome-based prediction of bacterial antibiotic resistance. J Clin Microbiol 2019; 57(3):e01405-e01418.

[7]

Wang CC, Hung YT, Chou CY, Hsuan SL, Chen ZW, Chang PY, et al. Using random forest to predict antimicrobial minimum inhibitory concentrations of nontyphoidal Salmonella in Taiwan. Vet Res 2023; 54(1):11.

[8]

Ren Y, Chakraborty T, Doijad S, Falgenhauer L, Falgenhauer J, Goesmann A, et al. Deep transfer learning enables robust prediction of antimicrobial resistance for novel antibiotics. Antibiotics 2022; 11(11):1611.

[9]

Gao J, Lao QH, Liu P, Yi HH, Kang QB, Jiang ZK, et al. Anatomically guided cross-domain repair and screening for ultrasound fetal biometry. IEEE J Biomed Health Inform 2023; 27(10):4914-4925.

[10]

Lai X, Zhou J, Wessely A, Heppt M, Maier A, Berking C, et al. A disease network-based deep learning approach for characterizing melanoma. Int J Cancer 2022; 150(6):1029-1044.

[11]

Song H, Chen L, Cui Y, Li Q, Wang Q, Fan J, et al. Denoising of MR and CT images using cascaded multi-supervision convolutional neural networks with progressive training. Neurocomputing 2022; 469:354-365.

[12]

Zhang Q, Zhang H, Zhou K, Zhang L. Developing a physiological signal-based, mean threshold and decision-level fusion algorithm (PMD) for emotion recognition. Tsinghua Sci Technol 2023; 28(4):673-685.

[13]

Zhang L, Song W, Zhu T, Liu Y, Chen W, Cao Y, et al. ConvNeXt-MHC: improving MHC-peptide affinity prediction by structure-derived degenerate coding and the ConvNeXt model. Brief Bioinform 2024; 25(3):bbae133.

[14]

Jiang Z, Cheng D, Qin Z, Gao J, Lao Q, Li K, et al. TV-SAM: increasing zero-shot segmentation performance on multimodal medical images using GPT-4 generated descriptive prompts without human annotation. Big Data Min Anal 2024; 7(4):1199-1211.

[15]

Gao J, Lao Q, Kang Q, Liu P, Du C, Li K, et al. Boosting your context by dual similarity checkup for in-context learning medical image segmentation. IEEE Trans Med Imaging 2025; 44(1):310-319.

[16]

You Y, Zhou F, Yue Y. The classical iterative HHL-based hemodynamic simulation quantum linear equation algorithm for abdominal aortic aneurysm. Eur Phys J Spec Top. In press.

[17]

Xiao M, Wei R, Yu J, Gao C, Yang F, Zhang L, et al. CpG island definition and methylation mapping of the T2T-YAO genome. Genom Proteom Bioinform 2024; 22(2):qzae009.

[18]

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16:321-357.

[19]

He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008 Jun 1–8; Hong Kong, China; 2008. p. 1322–8.

[20]

Han H, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the International Conference on Intelligent Computing; 2005 Aug 23–26; Hefei, China. Berlin: Springer Nature; 2005. p. 878–87.

[21]

Zankari E, Hasman H, Cosentino S, Vestergaard A, Rasmussen S, Lund O, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 2012; 67:2640-2644.

[22]

McArthur AG, Waglechner N, Nizam F, Yan A, Azad MA, Baylay AJ, et al. The comprehensive antibiotic resistance database. Antimicrob Agents Chemother 2013; 57:3348-3357.

[23]

Moradigaravand D, Palm M, Farewell A, Mustonen V, Warringer J, Parts L, et al. Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. PLoS Comput Biol 2018; 14(12):e1006258.

[24]

Ha SM, Lin EY, Klausner JD, Adamson PC. Machine learning to predict ceftriaxone resistance using single nucleotide polymorphisms within a global database of Neisseria gonorrhoeae genomes. Microbiol Spectr 2023; 11(6):e0170323.

[25]

Yang Y, Walker TM, Kouchaki S, Wang C, Peto TEA, Crook DW, et al. An end-to-end heterogeneous graph attention network for Mycobacterium tuberculosis drug-resistance prediction. Brief Bioinform 2021; 22(6):bbab29.

[26]

Jiang Z, Lu Y, Liu Z, Wu W, Xu X, Dinnyés A, et al. Drug resistance prediction and resistance genes identification in Mycobacterium tuberculosis based on a hierarchical attentive neural network utilizing genome-wide variants. Brief Bioinform 2022; 23(3):bbac041.

[27]

Shi JH, Yan Y, Links MG. Antimicrobial resistance genetic factor identification from whole-genome sequence data using deep feature selection. BMC Bioinformatics 2019; 20(Suppl 15):535.

[28]

Bai J, Bai S, Chu Y, Cui Z, Dang K, Deng X, et al. Qwen technical report. 2023. arXiv: 2309.16609.

[29]

Ma F, Xiao M, Zhu L, Jiang W, Jiang J, Zhang PF, et al. An integrated platform for Brucella with knowledge graph technology: from genomic analysis to epidemiological projection. Front Genet 2022; 13:981633.

[30]

Zhang L, Dai Z, Yu J, Xiao M. CpG-island-based annotation and analysis of human housekeeping genes. Brief Bioinform 2021; 22(1):515-525.

[31]

Zhang L, Zhang L, Guo Y, Xiao M, Feng L, Yang C, et al. MCDB: a comprehensive curated mitotic catastrophe database for retrieval, protein sequence alignment, and target prediction. Acta Pharm Sin B 2021; 11(10):3092-3104.

[32]

Kline DM, Berardi VL. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput Appl 2005; 14(4):310-318.

[33]

Barenco A, Berthiaume A, Deutsch D, Ekert AK, Jozsa R, Macchiavello C, et al. Stabilization of quantum computations by symmetrization. SIAM J Comput 1997; 26:1541-1557.

[34]

Chin CS, Alexander D, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013; 10(6):563-569.

[35]

Tatusova T. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016; 44(14):6614-6624.

[36]

Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 2015; 31(22):3691-3693.

[37]

Katoh K, Misawa K, Kuma K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002; 30(14):3059-3066.

[38]

Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom 2016; 2(4):e000056.

[39]

Dong X, Sun F, Han X, Hou R. Study of positive and negative association rules based on multi-confidence and chi-squared test. In: Li X, Zaïane OR, Li Z, editors. Advanced data mining and applications. Berlin: Springer Nature; 2006. p. 100–9.

[40]

Liang J, Shi Z, Li D, Wierman MJ. Information entropy, rough entropy and knowledge granulation in incomplete information systems. Int J Gen System 2006; 35:641-654.

[41]

Dinh T, Zeng Y, Zhang R, Lin Z, Gira M, Rajput S, et al. LIFT: language-interfaced fine-tuning for non-language machine learning tasks. In: Proceedings of the 36th International Conference on Neural Information Processing Systems; 2022 Nov 28–Dec 9; New Orleans, LA, USA. Red Hook: Curran Associates Inc.; 2022. p. 11763–84.

[42]

Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D, et al. TabLLM: few-shot classification of tabular data with large language models. In: Proceedings of the International Conference on Artificial Intelligence and Statistics; 2023 Apr 25–27; Valencia, Spain. PMLR. p. 5549–58.

[43]

Putnam J. Python Web development with Django. Comput Rev 2010; 51(6):330.

[44]

Breiman L. Random forests. Mach Learn 2001; 45:5-32.

[45]

Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13–17; San Francisco, CA, USA. New York City: Association for Computing Machinery (ACM); 2016. p. 785–94.

[46]

Cortes C, Vapnik VN. Support-vector networks. Mach Learn 1995; 20:273-297.

[47]

Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proceedings of the International Conference on Learning Representations; 2019 May 6–9; New Orleans, LA, USA. Wadern: dblp; 2019.

[48]

Xia Y, Yang C, Hu N, Yang Z, He X, Li T, et al. Exploring the key genes and signaling transduction pathways related to the survival time of glioblastoma multiforme patients by a novel survival analysis model. BMC Genomics 2017; 18(Suppl 1):950.

[49]

Zhang L, Liu G, Kong M, Li T, Wu D, Zhou X, et al. Revealing dynamic regulations and the related key proteins of myeloma-initiating cells by integrating experimental data into a systems biological model. Bioinformatics 2021; 37(11):1554-1561.

[50]

You Y, Lai X, Pan Y, Zheng H, Vera J, Liu S, et al. Artificial intelligence in cancer target identification and drug discovery. Signal Transduct Target Ther 2022; 7(1):156.

[51]

Aleksandrowicz G, Alexander T, Barkoutsos P, Bello L, Ben-Haim Y, Bucher D, et al. Qiskit: an open-source framework for quantum computing [Internet]. Genève: Zenodo; 2019 Jan 23 [cited 2024 Jan 22]. Available from: https://zenodo.org/records/2562111.

[52]

Zha J, Su J, Li T, Cao C, Ma Y, Wei H, et al. Encoding molecular docking for quantum computers. J Chem Theory Comput 2023; 19(24):9018-9024.

[53]

Shu G, Shan Z, Xu J, Zhao J, Wang S. A general quantum algorithm for numerical integration. Sci Rep 2024; 14:10432.

[54]

Liu F, Bian K, Meng F, Zhang W, Dahlsten O. Information compression via hidden subgroup quantum autoencoders. npj Quantum Inf 2024; 10:74.
