As a common foodborne pathogen, Salmonella poses risks to public health safety, particularly given the emergence of antimicrobial-resistant strains. However, there is currently a lack of systematic platforms based on large language models (LLMs) for Salmonella resistance prediction, data presentation, and data sharing. To overcome this issue, we first propose a two-step feature-selection process based on the chi-square test and conditional mutual information maximization to find the key Salmonella resistance genes in a pan-genomics analysis, and we develop an LLM-based Salmonella antimicrobial-resistance predictive (SARPLLM) algorithm, based on the Qwen2 LLM and low-rank adaptation, to achieve accurate antimicrobial-resistance prediction. Second, we reduce the time complexity of computing the sample distance from the linear to the logarithmic level by constructing a quantum data augmentation algorithm denoted as QSMOTEN. Third, we build a user-friendly Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs, which not only facilitates online resistance prediction for users but also visualizes the pan-genomics analysis results of the Salmonella datasets.
Yujie You, Kan Tan, Zekun Jiang, Le Zhang.
Developing a Predictive Platform for Salmonella Antimicrobial Resistance Based on a Large Language Model and Quantum Computing.
Engineering, 2025, 48(5): 184-195 DOI:10.1016/j.eng.2025.01.013
Salmonella is a common foodborne pathogen and the third leading cause of death due to foodborne diseases [1]. Although antimicrobials are an effective clinical treatment for the diseases caused by Salmonella, their efficacy is affected by gene mutations and antimicrobial abuse [2]. These factors have resulted in several Salmonella strains gradually evolving into antimicrobial-resistant strains, weakening the therapeutic effects of antimicrobials. Thus, to decrease the impacts of antimicrobial-resistant Salmonella strains on food safety and public health, there is an urgent need to develop targeted antimicrobial therapy for infected patients.
As it is both time-consuming and difficult to investigate mechanisms of antimicrobial resistance (AMR) [3], the bacterial antimicrobial susceptibility test (AST) commonly used to detect bacterial AMR is inefficient for predicting AMR. Based on the close association between AMR and genes, previous studies [4], [5], [6] have employed whole-genome sequencing (WGS) data instead of ASTs to predict Salmonella resistance. Due to the curse of high dimensionality introduced by WGS data, however, current machine-learning- and early deep-learning-based predictive models [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17] are subject to overfitting during training, resulting in unsatisfactory predictive results. Given that current fine-tuning models based on large language models (LLMs) are powerful tools that can achieve high performance and robust stability with a small sample size, we pose our first scientific question: How can an efficient feature-selection process be established to mine the key Salmonella resistance genes with Salmonella WGS data and to build up an LLM-based Salmonella resistance predictive model in order to alleviate the curse of high dimensionality caused by small samples?
Salmonella resistance brings another challenge as well: Because of the significant imbalance between the number of antimicrobial-resistant samples and the number of sensitive samples in the Salmonella WGS data, the performance of the antimicrobial-resistance prediction model will be significantly lowered. Data augmentation is typically employed for such sample-imbalance problems, as it can generate pseudo samples based on the original minority class samples to balance the sample size. SMOTEN [18] is an oversampling data-augmentation algorithm that looks for k-nearest neighbors of the original minority class samples and then interpolates between the samples to generate new ones. However, SMOTEN and its derived algorithms [18], [19], [20] have such high computational complexity that they are unsuitable for high-dimensional WGS data. Since quantum computing holds the possibility of speeding up the SMOTEN algorithm, we pose our second scientific question: How can quantum computing be used to speed up the SMOTEN algorithm in order to balance the numbers of antimicrobial-resistant and sensitive samples for the Salmonella WGS data?
Although several online AMR gene analysis websites have already been established based on ResFinder [21] and CARD [22], these sites do not provide a predictive functionality for AMR. In addition, existing Salmonella resistance prediction models [23], [24], [25], [26], [27] are often presented as source code, toolkits, or experimental flowcharts, making resistance prediction and analysis difficult. For these reasons, we pose our third scientific question: How can a user-friendly and convenient platform be built for Salmonella resistance prediction, data presentation, and sharing?
In this article, we propose the following innovative works to solve the scientific questions. Firstly, we propose a two-step feature-selection process based on the chi-square test and conditional mutual information maximization to investigate the key Salmonella resistance genes in pan-genomics analysis and develop an LLM-based Salmonella antimicrobial-resistance predictive (SARPLLM) algorithm to achieve accurate antimicrobial-resistance prediction based on the Qwen2 LLM [28] and low-rank adaptation (LoRA) [29]. Secondly, we optimize the time complexity for the computation of the sample distance from the linear to logarithmic level by constructing a quantum data-augmentation algorithm, QSMOTEN. Thirdly, we build up a user-friendly Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs [30], [31], which not only facilitates online resistance prediction for users but also visualizes the pan-genomics analysis results of the Salmonella datasets.
To assess our algorithm, we carry out a comparative experiment between SARPLLM and other antimicrobial-resistance prediction models, which demonstrates that SARPLLM not only outperforms other antimicrobial-resistance predictive models but also shows high robustness in terms of Salmonella antimicrobial-resistance prediction. We then simulate the proposed QSMOTEN algorithm on both a virtual quantum machine and a physical quantum machine. The simulation results indicate that the QSMOTEN algorithm can accurately and quickly compute the distance between Salmonella antimicrobial samples, demonstrating the potential of quantum computing methods to accelerate the SMOTEN algorithm. Finally, we construct an LLM-based predictive platform for Salmonella resistance that provides users with four key modules to facilitate their study of Salmonella resistance.
2. Materials and methods
A pipeline of the study’s methodology is illustrated in Fig. 1.
2.1. Data acquisition
Antimicrobial-resistant Salmonella data samples were obtained from the National Antimicrobial Resistance Monitoring System Now website, provided by the US Centers for Disease Control and Prevention [32]. All samples were Salmonella typhimurium, isolated from patients in the United States. A total of 1167 Salmonella samples were available with AST and WGS results. Based on the AST results, we chose all 1167 samples to form an AMR matrix indicating whether a sample was resistant or not. The sample sizes per antimicrobial and the AST results are listed in Appendix A Table S1. The sample sizes and AST results for the five Salmonella antimicrobials are provided in Appendix A Table S2.
Based on the sample numbers of the 1167 Salmonella samples with AMR, we obtained the WGS data of the Salmonella samples from the US National Center for Biotechnology Information (NCBI) database [33]. The gene annotation data for the 1167 Salmonella samples were generated from the raw WGS data through a hierarchical genome assembly process [34] and the NCBI prokaryotic genome annotation pipeline [35]. Details of the data acquisition are given in Appendix A Section S1.
2.2. The pan-genomics analysis and two-step feature-selection process
Fig. 2(a) describes the pan-genomics analysis procedure. First, we took the Salmonella gene annotation data as input and carried out a pan-genomic analysis using the pan-genome pipeline Roary [36] to generate the gene existence matrix. Then, to reflect the differences among genomes, we removed the core genes shared by all Salmonella genomes from the gene existence matrix to obtain the accessory gene existence matrix.
Second, we used the gene annotation data as input and carried out a multiple sequence alignment of the Salmonella WGS data by means of the multiple sequence alignment based on fast Fourier transform (MAFFT) program [37] to obtain the core gene alignment data. Subsequently, we used the single nucleotide polymorphism (SNP)-sites [38] tool to detect SNPs from the core gene alignment data and ultimately output the core SNP matrix.
As shown in Fig. 2(b), to address inaccurate AMR predictions caused by the curse of high dimensionality, we propose a two-step feature-selection process based on the chi-square test [39] and the conditional mutual information maximization algorithm [27] to quickly screen out the genes highly correlated with AMR. First, based on the resistance to five Salmonella antimicrobials (i.e., augmentin (AUG), ceftriaxone (AXO), chloramphenicol (CHL), ampicillin (AMP), and cefoxitin (FOX)), we carried out the chi-square test (Eq. (1)) to screen the Salmonella resistance genes with p values of less than 0.05 from the accessory genes and the core SNPs. Then, we employed the conditional mutual information maximization algorithm (Eq. (2)) to evaluate the relative importance of the key Salmonella resistance genes for each antimicrobial. Based on the relative importance, we further screened the Salmonella resistance genes that were highly correlated with the five antimicrobials. The conditional mutual information maximization algorithm is detailed in Appendix A Section S2.
$$\chi^2(x, y) = \frac{N\,(AD - BC)^2}{(A + B)(C + D)(A + C)(B + D)} \tag{1}$$

$$J(x) = \min_{x_j}\, I(x; y \mid x_j), \qquad I(x; y \mid x_j) = H(x \mid x_j) - H(x \mid y, x_j) \tag{2}$$

where $x$ and $y$ indicate a gene from a Salmonella sample and the Salmonella AMR label, respectively. $N$ represents the total number of Salmonella samples, $A$ represents the number of Salmonella AMR samples with gene $x$, $B$ represents the number of Salmonella antimicrobial sensitivity samples with gene $x$, $C$ represents the number of Salmonella antimicrobial-resistance samples without gene $x$, and $D$ represents the number of Salmonella antimicrobial sensitivity samples without gene $x$. $H(\cdot)$ represents the information entropy [40], and $I(x; y \mid x_j)$ represents the conditional mutual information between the Salmonella samples with gene $x$ and the Salmonella AMR label $y$ under the condition of an already selected gene $x_j$.
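As an illustration only, the two-step screening can be sketched in plain Python. The function names (`chi_square_2x2`, `chi_square_p_value`, `cmim_rank`) and the toy data layout are hypothetical and are not taken from the authors' code. The sketch assumes binary features and labels and uses the standard greedy form of conditional mutual information maximization, in which a candidate's score is its smallest conditional mutual information with the label given any gene already selected.

```python
import math
from collections import Counter

def chi_square_2x2(feature, labels):
    """Chi-square statistic of a binary feature against binary labels."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)  # resistant, gene present
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)  # sensitive, gene present
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)  # resistant, gene absent
    d = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 0)  # sensitive, gene absent
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def chi_square_p_value(stat):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

def entropy(*columns):
    """Joint Shannon entropy (in bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def mutual_info(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def cond_mutual_info(x, y, z):
    """I(x; y | z) = H(x, z) + H(y, z) - H(x, y, z) - H(z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)

def cmim_rank(features, labels, k):
    """Greedy CMIM ranking of the columns in `features` (name -> list)."""
    score = {name: mutual_info(col, labels) for name, col in features.items()}
    selected = []
    while score and len(selected) < k:
        best = max(score, key=score.get)
        selected.append(best)
        del score[best]
        for name in score:
            # Keep the worst-case information a candidate adds beyond the selected genes.
            score[name] = min(score[name],
                              cond_mutual_info(features[name], labels, features[best]))
    return selected

# Toy data: g2 duplicates g1, so CMIM prefers the weaker but complementary g3.
labels = [1, 1, 1, 1, 1, 0, 0, 0]
features = {
    "g1": [1, 1, 1, 1, 0, 0, 0, 0],
    "g2": [1, 1, 1, 1, 0, 0, 0, 0],
    "g3": [1, 0, 1, 0, 1, 0, 1, 0],
}
stat = chi_square_2x2(features["g1"], labels)
ranking = cmim_rank(features, labels, 2)
```

In this toy example the redundant copy `g2` is skipped in favour of `g3`, which is the behaviour that motivates running a redundancy-aware second step after the per-gene chi-square filter.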
2.3. SARPLLM
Here, we propose SARPLLM, which uses the Salmonella accessory gene features and the SNP features obtained from the two-step feature-selection process as inputs and learns potential AMR relationships to predict the AMR of Salmonella samples. As shown in Fig. 3, SARPLLM consists of three steps: data consolidation and prompt engineering, modeling and fine-tuning, and SARPLLM prediction.
2.3.1. Data consolidation and prompt engineering
A prompt is the input text for an LLM. Role-play is a popular prompt engineering technique in which a description is included in the prompt about the person the LLM should portray as it completes a task. In this task, we provide the following prompt to make SARPLLM understand the meaning of input data and the expected output: “Prompt”: “You are an expert in Salmonella antimicrobial-resistance prediction, and you will receive gene feature sequences. Please output the prediction results.”
Since LLM performance is very sensitive to the precise details of the natural-language input, and because the core SNP features and accessory gene existence features differ in their schema, data consolidation is implemented by converting the schema of these two types of features into natural language descriptions. Here, we convert the value of each element of the core SNP features into *, A, G, C, or T, indicating that the core SNP locus of the Salmonella sample is missing or is adenine, guanine, cytosine, or thymine, respectively. We also set the value of each element in the accessory gene existence features to 1 or 0, indicating whether a gene appears in the genome of the Salmonella sample or not.
Since previous studies [41], [42] have reported that the predictive performance of LLMs relies more on correct values than on feature names, we first list the name of the predicted antimicrobial and then list the values of the Salmonella resistance features in the order of relative importance evaluated by Eq. (2). Here, we use a space to separate the string representations of consecutive resistance features; for example, “Input”: “AMP: A C * 1 ...”. In this way, we convert the datasets into sentences that SARPLLM can recognize.
2.3.2. Modeling and fine-tuning
SARPLLM is modeled based on the Qwen2 LLM [28] and LoRA [29]. More specifically, SARPLLM takes a pretrained transformer LLM named Qwen2 [28] as its base classifier. We then train SARPLLM on Salmonella resistance datasets to specialize its knowledge and increase its predictive performance by means of LoRA, a parameter-efficient method that constrains the weight matrix updates to be low-rank [29]. SARPLLM’s efficacy is highlighted by its ability to leverage the extensive knowledge encoded in the pretrained Qwen2, such that it requires only a small amount of labeled Salmonella AMR data.
Since a language model can be fine-tuned for and applied to non-language tasks without changing the architecture or loss function at all [41], we use the default cross-entropy loss [32] to fine-tune SARPLLM. For each training sample used for fine-tuning, we define the training template as follows:
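To make the low-rank constraint concrete, the following sketch shows the LoRA forward pass for a single weight matrix using plain Python lists. It is not the SARPLLM implementation: the helper names (`lora_forward`, `trainable_params`) and the toy matrices are hypothetical, and the only thing taken from LoRA [29] is the update rule, in which a frozen weight `W` is combined with a trainable product `B A` scaled by `alpha / r`.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A)."""
    r = len(A)
    base = matvec(W, x)             # frozen pretrained projection
    low = matvec(B, matvec(A, x))   # trainable low-rank correction
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low)]

def trainable_params(d_out, d_in, r):
    """Number of parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen weight, 2 x 2
A = [[1.0, 0.0]]               # rank r = 1, shape 1 x 2
B_zero = [[0.0], [0.0]]        # B starts at zero, so training begins from the base model
B_trained = [[1.0], [2.0]]     # an arbitrary "trained" value for illustration
x = [1.0, 1.0]

before = lora_forward(W, A, B_zero, x, alpha=2.0)
after = lora_forward(W, A, B_trained, x, alpha=2.0)
```

For a 1024 × 1024 projection and rank 8, `trainable_params(1024, 1024, 8)` gives 16 384 trainable values against 1 048 576 in the full matrix, which is why fine-tuning of this kind can work with a small labeled dataset.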
“Prompt”: “You are an expert in Salmonella antimicrobial-resistance prediction, and you will receive gene feature sequences. Please output the prediction results.”
“Input”: “AMP: A C * 1 ... 0”, “Output”: “1”.
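A minimal sketch of how one sample could be turned into such a record is given below. The helper names (`encode_value`, `build_record`) and the convention that a missing SNP call is passed as `None` are assumptions made for illustration; only the prompt text and the space-separated `Input` format come from the template above.

```python
SYSTEM_PROMPT = (
    "You are an expert in Salmonella antimicrobial-resistance prediction, "
    "and you will receive gene feature sequences. "
    "Please output the prediction results."
)

def encode_value(value):
    """Map one feature value to its token: nucleotide, '*', '1', or '0'."""
    if value is None:
        return "*"                      # missing SNP call
    if isinstance(value, str):
        base = value.upper()
        return base if base in ("A", "C", "G", "T") else "*"
    return "1" if value else "0"        # accessory gene present / absent

def build_record(antimicrobial, ranked_values, label=None):
    """Build one record; `ranked_values` must already be sorted by importance."""
    text = f"{antimicrobial}: " + " ".join(encode_value(v) for v in ranked_values)
    record = {"Prompt": SYSTEM_PROMPT, "Input": text}
    if label is not None:               # labels are only known for training samples
        record["Output"] = "1" if label else "0"
    return record

record = build_record("AMP", ["A", "C", None, 1, 0], label=1)
```

Here `record["Input"]` is `"AMP: A C * 1 0"` and `record["Output"]` is `"1"`; omitting `label` yields an inference-time record without an `Output` field.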
2.3.3. SARPLLM prediction
After we fine-tune SARPLLM, we parse the predicted output of SARPLLM for each input sample. Since the predictive performance of an LLM relies more on correct values than on feature names [41], [42], SARPLLM simply outputs “1” or “0” to indicate whether the sample is resistant or sensitive, instead of outputting the text strings “resistance” or “sensitivity.” For example, if SARPLLM outputs “1,” we parse the final prediction result of this sample as resistant.
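The parsing step can be sketched as follows. The function name `parse_prediction` and the choice to take the first "0" or "1" character in the generated text are assumptions; the source only states that the model answers with "1" or "0".

```python
def parse_prediction(generated_text):
    """Return 'resistant', 'sensitive', or None if no 0/1 answer is found."""
    for ch in generated_text.strip():
        if ch == "1":
            return "resistant"
        if ch == "0":
            return "sensitive"
    return None  # leave malformed generations for the caller to handle

results = [parse_prediction(t) for t in ["1", " 0\n", "n/a"]]
```

Returning `None` for unparseable text keeps malformed generations visible instead of silently counting them as one of the two classes.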
2.4. The QSMOTEN algorithm
In this section, we construct the optimization algorithm QSMOTEN based on the SMOTEN algorithm and provide a circuit simplification and a circuit mapping method to implement the QSMOTEN algorithm, thereby providing a solution to alleviate the significant difference in the numbers of resistant and sensitive Salmonella samples for AMR prediction.
The SMOTEN algorithm [18] is an improvement of the SMOTE algorithm for unordered data. To optimize the time complexity of the SMOTEN algorithm for large-scale Salmonella samples, this study proposes the quantum-computing-based QSMOTEN algorithm, which uses similarity as a distance metric for k-nearest neighbor computation, encodes feature names and values into quantum states, and uses the SWAP-test [33] circuit to compute the distance between samples. The QSMOTEN algorithm and its related quantum circuit are described by Algorithm 1 and Fig. 4, respectively.
2.4.1. Quantum state encoding
For a sample set $S = \{\phi_1, \phi_2, \ldots, \phi_N\}$, each sample $\phi_i = (x_{i1}, x_{i2}, \ldots, x_{im})$ consists of $m$ features, and each feature takes one of a maximum of $n$ values. Here, $N$ represents the total sample size, $\phi_i$ is the $i$th element in sample set $S$, and $x_{ij}$ is the $j$th feature of sample $\phi_i$. QSMOTEN encodes the features into the following quantum states by means of Eq. (3):

$$|\phi\rangle_i = \frac{1}{\sqrt{m}} \sum_{v=1}^{m} |v\rangle\,|x_{iv}\rangle \tag{3}$$

where $a = \lceil \log_2 m \rceil$ is the quantity of $|v\rangle$ qubits, $b = \lceil \log_2 n \rceil$ is the quantity of $|x_{iv}\rangle$ qubits, and $|v\rangle$ represents the position of the $v$th feature. $x_{iv}$ are the unordered sample features, and $|\phi\rangle_i$ is the quantum state of the sample $\phi_i$.
Here, we show the preparation process of the quantum state $|\phi\rangle_i$:
(1) We initialize the quantum state as $|0\rangle^{\otimes (a+b)}$.
(2) We apply the Hadamard gate (H) to the $|v\rangle$-qubits (Eq. (4)) to represent the encoding of the feature name. For example, $|0110\rangle$ represents the sixth (in binary, 0110) feature.
(3) We apply a multiple controlled NOT (MCX) gate with the $|v\rangle$-qubits set as the control bits, which applies the Pauli-X gate to the $|x_{iv}\rangle$-qubits to invert their value. The resulting quantum state can be expressed by Eq. (5). For example, when the $|v\rangle$-qubits are in state $|v\rangle$, the $|x_{iv}\rangle$-qubits are set to $|x_{iv}\rangle$.
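The result of this preparation can be illustrated by writing the amplitudes of the encoded state directly, without simulating the individual gates. The sketch below rests on stated assumptions: features are indexed from 0, values are integers in `range(n_values)`, the index register holds ⌈log2 m⌉ qubits, the value register holds ⌈log2 n⌉ qubits, and each of the m index-value pairs receives amplitude 1/√m. The function name `encode_sample` is hypothetical.

```python
import math

def encode_sample(features, n_values):
    """Return (statevector, a, b) for an index register followed by a value register."""
    m = len(features)
    a = max(1, math.ceil(math.log2(m)))         # qubits holding the feature position
    b = max(1, math.ceil(math.log2(n_values)))  # qubits holding the feature value
    state = [0.0] * (2 ** (a + b))
    amplitude = 1.0 / math.sqrt(m)
    for v, x in enumerate(features):
        # Basis state |v>|x>: index bits are the high bits, value bits the low bits.
        state[(v << b) | x] = amplitude
    return state, a, b

state, a, b = encode_sample([0, 1, 1, 0], n_values=2)
nonzero = [i for i, amp in enumerate(state) if amp != 0.0]
```

For the four binary features `[0, 1, 1, 0]`, the state lives on 2 + 1 = 3 qubits and has amplitude 0.5 on the basis states 0, 3, 5, and 6, that is, |00⟩|0⟩, |01⟩|1⟩, |10⟩|1⟩, and |11⟩|0⟩.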
2.4.2. The similarity between Salmonella samples
The QSMOTEN algorithm finds the top-k most similar neighbor samples by computing the similarity between samples. The similarity is defined as follows:
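$$\mathrm{Sim}(\phi_i, \phi_j) = \frac{1}{m} \sum_{v=1}^{m} \left( x_{iv} \odot x_{jv} \right) \tag{6}$$

where $x_{iv}$ and $x_{jv}$ are unordered sample features, $m$ is the number of features per sample, and $\odot$ represents the equivalence operation, which returns 1 when the two values are equal and 0 otherwise. $\phi_j$ is the $j$th element in sample set $S$.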
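For reference, the classical counterpart of this neighbor search can be sketched as below. It assumes that the similarity is the fraction of positions at which two samples carry the same value; the function names (`similarity`, `top_k_neighbors`) are hypothetical, and ties are broken by sample index purely for reproducibility.

```python
def similarity(sample_a, sample_b):
    """Fraction of feature positions at which the two samples agree."""
    matches = sum(1 for x, y in zip(sample_a, sample_b) if x == y)
    return matches / len(sample_a)

def top_k_neighbors(samples, index, k):
    """Indices of the k samples most similar to samples[index]."""
    scored = [(similarity(samples[index], s), j)
              for j, s in enumerate(samples) if j != index]
    scored.sort(key=lambda item: (-item[0], item[1]))  # most similar first
    return [j for _, j in scored[:k]]

samples = [
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
]
neighbors = top_k_neighbors(samples, index=0, k=2)
```

Each call to `similarity` scans all features once, which is the linear per-pair cost that the quantum distance computation aims to reduce.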
As shown in Fig. 4, we adopt the SWAP-test circuit to compute the similarity (Eq. (6)) in the following three steps:
Step 1: The circuit takes two quantum states, $|\phi\rangle_i$ and $|\phi\rangle_j$, with the same number of qubits as input, as well as an auxiliary qubit initialized to $|0\rangle$.
Step 2: The circuit applies a Hadamard gate to the auxiliary qubit and applies a Fredkin gate (CSWAP gate) to swap $|\phi\rangle_i$ and $|\phi\rangle_j$ when the auxiliary qubit is $|1\rangle$.
Step 3: The circuit applies a Hadamard gate to the auxiliary qubit again and then measures the auxiliary qubit.
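As shown in Appendix A Section S3, the relationship between the probability of the measurement result and the similarity can be described by Eq. (7):

$$P(|1\rangle) = \frac{1}{2} - \frac{1}{2}\,\bigl|\langle \phi_i | \phi_j \rangle\bigr|^2 \tag{7}$$

where the overlap $|\langle \phi_i | \phi_j \rangle|$ measures the similarity of the two encoded samples. This indicates that the similarity between the two samples increases as the probability $P(|1\rangle)$ decreases. Therefore, the similarity between two samples can be determined from $P(|1\rangle)$, the probability that the measurement of the auxiliary qubit yields $|1\rangle$.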
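The three steps can be checked numerically with a small statevector calculation. After the second Hadamard gate, the branch in which the auxiliary qubit reads 1 carries the amplitudes ½(ψ⊗φ − φ⊗ψ), so the probability of reading 1 is the squared norm of that vector. The sketch below computes this quantity for real-valued states; the helper names are hypothetical, and the encoding (0-based feature index, amplitude 1/√m per feature) is an assumption rather than the authors' exact circuit.

```python
import math

def encode_sample(features, n_values):
    """Amplitude vector with 1/sqrt(m) on each basis state |v>|x_v>."""
    m = len(features)
    a = max(1, math.ceil(math.log2(m)))
    b = max(1, math.ceil(math.log2(n_values)))
    state = [0.0] * (2 ** (a + b))
    for v, x in enumerate(features):
        state[(v << b) | x] = 1.0 / math.sqrt(m)
    return state

def swap_test_p1(psi, phi):
    """Probability that the auxiliary qubit of a SWAP test is measured as 1."""
    p1 = 0.0
    for i in range(len(psi)):
        for j in range(len(phi)):
            amplitude = 0.5 * (psi[i] * phi[j] - phi[i] * psi[j])
            p1 += amplitude * amplitude
    return p1

def overlap_from_p1(p1):
    """Invert P(1) = 1/2 - 1/2 * overlap**2; negative estimates are clipped to 0."""
    return math.sqrt(max(0.0, 1.0 - 2.0 * p1))

s1 = encode_sample([0, 1, 1, 0], 2)
s2 = encode_sample([0, 1, 1, 1], 2)   # agrees with s1 on 3 of 4 features
s3 = encode_sample([1, 0, 0, 1], 2)   # agrees with s1 on no feature

p_close = swap_test_p1(s1, s2)
p_far = swap_test_p1(s1, s3)
```

Under this encoding the recovered overlap equals the fraction of matching features: 0.75 for the first pair (P(1) = 0.21875) and 0 for the second (P(1) = 0.5).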
2.4.3. Time complexity analysis of the QSMOTEN algorithm
Assume that there are a total of $N$ samples, that each sample consists of $m$ sequence features, that each sequence feature takes one of $n$ types of values, and that a given number of new samples need to be generated. The SMOTEN algorithm needs to compute the distance between each pair of samples, and each computation requires a feature-by-feature comparison of cost $O(m)$. Therefore, the time complexity of each distance computation grows linearly with the number of features.
The QSMOTEN algorithm first encodes each sample into a quantum state of $a + b$ qubits. Then, the QSMOTEN algorithm uses the SWAP-test circuit with Fredkin gates to compute the distance between samples. Because one Fredkin gate is needed for each qubit of the encoded state, the cost of this step grows only logarithmically with $m$ and $n$.
Gene samples typically have a large number of features $m$, whereas the number of values per feature, $n$, is a constant with a small value, which is usually ignored. Therefore, in comparison with the SMOTEN algorithm, the QSMOTEN algorithm can greatly decrease the time complexity of the distance computation from the linear level $O(m)$ to the logarithmic level $O(\log m)$.
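A back-of-the-envelope comparison makes the gap tangible. The sketch below counts, for one pair of samples, the feature comparisons of the classical distance and the Fredkin gates of one SWAP-test run, assuming ⌈log2 m⌉ index qubits and ⌈log2 n⌉ value qubits per encoded sample. The function names are hypothetical, and the count deliberately ignores the cost of state preparation and of repeating the measurement, neither of which is quantified here.

```python
import math

def classical_comparisons(m):
    """Feature comparisons needed for one classical pairwise distance."""
    return m

def swap_test_fredkin_gates(m, n):
    """Fredkin gates in one SWAP test: one per qubit of an encoded sample."""
    a = max(1, math.ceil(math.log2(m)))
    b = max(1, math.ceil(math.log2(n)))
    return a + b

counts = {m: (classical_comparisons(m), swap_test_fredkin_gates(m, 5))
          for m in (50, 1024, 100000)}
```

With five possible values per feature, 50 features require 50 comparisons but only 6 + 3 = 9 Fredkin gates, and 100 000 features require 17 + 3 = 20.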
2.5. The Salmonella resistance predictive platform
Based on the pan-genomics analysis results and the SARPLLM model, we establish a Salmonella antimicrobial-resistance predictive online platform based on web technology and knowledge graphs [29]. The platform employs Django [43] as the back-end service architecture to monitor and respond to user requests. The front-end uses ECharts for knowledge graph visualization. The platform can perform online predictions for multiple AMRs based on the Salmonella gene feature files uploaded by users. In addition, the platform not only displays the pan-genomics analysis results of the Salmonella datasets but also provides a convenient way for users to download raw and analytical data. The platform has four major modules:
(1) An antimicrobial-resistance predictive module: This module provides an interface for users to upload Salmonella gene feature files; it also provides an interface to use the SARPLLM model for Salmonella antimicrobial-resistance online prediction and results visualization.
(2) A pan-genomics analysis results module: This module visualizes the statistical results obtained from the pan-genomics analysis for the Salmonella antimicrobial-resistance dataset.
(3) A gene sample antimicrobial knowledge-graph module: This module constructs a directed graph to describe the relationship among the gene, sample, and antimicrobial agent, and visualizes their relationship.
(4) A data download module: This module provides the download function for raw data and pan-genomics analysis data.
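As a rough illustration of the knowledge-graph module, the sketch below assembles gene, sample, and antimicrobial entities into a dictionary shaped like an ECharts `graph` series option (nodes under `data`, directed edges under `links`). The function name `build_graph_option`, the input layout, the sample and gene identifiers, and the relationship labels (`carries`, `resistant_to`, `sensitive_to`) are all hypothetical; the platform's actual schema is not described in the text.

```python
def build_graph_option(gene_presence, resistance):
    """gene_presence: {sample: [genes]}; resistance: {sample: {antimicrobial: bool}}."""
    categories = ["gene", "sample", "antimicrobial"]
    nodes, links = {}, []

    def add_node(name, kind):
        nodes.setdefault(name, {"name": name, "category": categories.index(kind)})

    for sample, genes in gene_presence.items():
        add_node(sample, "sample")
        for gene in genes:
            add_node(gene, "gene")
            links.append({"source": sample, "target": gene, "value": "carries"})
    for sample, profile in resistance.items():
        add_node(sample, "sample")
        for drug, is_resistant in profile.items():
            add_node(drug, "antimicrobial")
            relation = "resistant_to" if is_resistant else "sensitive_to"
            links.append({"source": sample, "target": drug, "value": relation})

    return {"series": [{
        "type": "graph",
        "layout": "force",
        "edgeSymbol": ["none", "arrow"],   # draw edges as directed
        "categories": [{"name": c} for c in categories],
        "data": list(nodes.values()),
        "links": links,
    }]}

option = build_graph_option(
    gene_presence={"sample_1": ["gene_a", "gene_b"], "sample_2": ["gene_a"]},
    resistance={"sample_1": {"AMP": True, "FOX": False}, "sample_2": {"AMP": False}},
)
```

A back-end view could serialize such a dictionary as JSON and hand it to the front-end chart; that wiring is omitted here.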
3. Results
3.1. Experimental results for the antimicrobial-resistance prediction model
To answer our first scientific question, we present the results of the pan-genomics analysis process, the two-step feature-selection process, and the comparative experiments on antimicrobial-resistance prediction models. First, through the pan-genomics analysis process described in Section 2.2, we obtained an accessory gene existence matrix, which is provided in Appendix A Section S4.1. In addition, we obtained a core SNP matrix, which is provided in Appendix A Section S4.2. Second, through the two-step feature-selection process, we screened the strongly correlated Salmonella resistance gene features for each antimicrobial (i.e., AUG, AXO, CHL, AMP, and FOX). These correlated features are listed in Appendix A Sections S4.3 and S4.4. The top five gene features selected after two-step feature selection are listed in Appendix A Table S3. Third, we present the AMR predictions for the five antimicrobials (AUG, AXO, CHL, AMP, and FOX) using SARPLLM, logistic regression (LR), random forest (RF) [44], eXtreme gradient boosting (XGBoost) [45], support vector machine (SVM) [46], multilayer perceptron (MLP), and the resistance prediction neural network (RPNN) model (detailed in Appendix A Section S5). The configuration parameters for the LR, RF, XGBoost, SVM, and RPNN models are provided in Appendix A Sections S6.1 and S6.2.
The SARPLLM adopts batch learning, with a batch size of four and four training epochs. The training uses the AdamW optimizer [47] with a learning rate of . The learning scheduler type is set to polynomial. More architecture parameters of the SARPLLM model are provided in Appendix A Section S6.3. Considering the imbalanced resistant and sensitive antimicrobial labels in the Salmonella datasets, as well as the commonly comprehensive evaluation indicators used in predictive tasks by LLMs, we use the F1-score to evaluate the predictive performance.
For each antimicrobial, we take the Salmonella samples with the top strongly correlated features as input, respectively. We then carry out a three-fold cross validation with four repeated tests and statistically determine the mean value of the predictive indicators. Fig. 5 presents the predictive results for AUG, AXO, CHL, AMP, and FOX.
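The evaluation protocol (three folds, four repetitions, mean F1-score) can be sketched with the standard library alone. Everything below is illustrative: the function names, the stratified splitting, the seed, and the `fit_predict` callback interface are assumptions, and the F1-score is computed for the resistant class only because the text does not say which averaging was used.

```python
import random

def f1_score(y_true, y_pred, positive=1):
    """F1-score of the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def stratified_folds(labels, k, rng):
    """Split sample indices into k folds with similar class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    return folds

def repeated_cv_f1(features, labels, fit_predict, k=3, repeats=4, seed=0):
    """Mean F1 over `repeats` rounds of k-fold cross validation."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in stratified_folds(labels, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in held_out]
            predictions = fit_predict(
                [features[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [features[i] for i in test_idx],
            )
            scores.append(f1_score([labels[i] for i in test_idx], predictions))
    return sum(scores) / len(scores)

# Toy check: a "model" that simply reads the first feature, which equals the label.
labels = [1, 0] * 6
features = [[y, 0] for y in labels]
mean_f1 = repeated_cv_f1(features, labels, lambda Xtr, ytr, Xte: [x[0] for x in Xte])
```

Any model can be plugged in through `fit_predict`, which receives the training features, training labels, and test features and returns the test predictions.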
Fig. 5 shows that the predictive performances of all the models are greater than 85% with only ten resistance features. Furthermore, as the number of resistance features gradually increases from 10 to 50, the evaluation indicators of all predictive models show fluctuations and then tend to stabilize, indicating that our proposed two-step feature-selection method can accurately screen and retain the features related to AMR, thereby reducing the size of the input resistance features while maintaining a high-performance predictive ability for all the models.
To further analyze the performance differences between SARPLLM and the other models, we carried out t-tests [48], [49], [50] between SARPLLM and the other models for the five antimicrobials with 50 features. The results are provided in Table 1 and Appendix A Table S4. The records of the predicted results are listed in Appendix A Section S7. Table 1 shows that the F1-score of SARPLLM is significantly better (p value < 0.05) than those of the LR, RF, XGBoost, SVM, MLP, and RPNN models in most cases, indicating that our proposed SARPLLM model exhibits significant performance advantages in antimicrobial-resistance prediction for multiple antimicrobial datasets and thus possesses excellent generalization. In addition, although there are missing values in the input data, SARPLLM still has the best predictive performance, which implies that SARPLLM is good at dealing with missing data and therefore possesses excellent robustness.
3.2. Experimental results for QSMOTEN
To answer our second scientific question, we present the experimental results using a virtual quantum machine and a physical quantum machine, respectively.
3.2.1. Experimental results on a virtual quantum machine
We assume that there are four samples, $\phi_i$ ($i = 1, 2, 3, 4$). Each sample has four features, and the value of each feature is 0 or 1. This experiment takes one pair of samples as an example to validate the effectiveness of the QSMOTEN algorithm.
Based on the above assumptions, we provide a quantum circuit (Fig. 6) to compute the similarity between the two example samples. The circuits to compute the similarity between the remaining five sample pairs are provided in Appendix A Section S8. For each sample pair, we set the measurements to be carried out $10^4$ times for each experiment. The experiments were simulated under ideal conditions by Qiskit [51] using the “aer_simulator” simulator. We computed the similarity by means of Eq. (7); the simulation results are recorded in Table 2.
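The effect of a finite number of measurements can be illustrated without a quantum simulator by drawing the auxiliary-qubit outcomes from a Bernoulli distribution. The sketch below is not the Qiskit experiment: the function name `estimate_overlap`, the seed, and the assumption that the overlap is recovered as the square root of 1 − 2·P(1), with negative values clipped to zero, are illustrative choices.

```python
import math
import random

def estimate_overlap(p1_true, shots=10**4, seed=0):
    """Estimate the state overlap from `shots` simulated SWAP-test outcomes."""
    rng = random.Random(seed)
    ones = sum(1 for _ in range(shots) if rng.random() < p1_true)
    p1_hat = ones / shots
    return math.sqrt(max(0.0, 1.0 - 2.0 * p1_hat))

close_pair = estimate_overlap(0.21875)   # true overlap 0.75
identical = estimate_overlap(0.0)        # true overlap 1
disjoint = estimate_overlap(0.5)         # true overlap 0
```

With 10⁴ shots the estimate for the first case lands within a few hundredths of 0.75. The estimate for fully dissimilar samples is the least stable, because small fluctuations of the measured probability around 0.5 are magnified by the square root.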
Table 2 shows that the similarity computed by QSMOTEN is close to the actual similarity of the sample pairs, which indicates that our proposed QSMOTEN algorithm can correctly compute the similarity between samples and provides an efficient approach for data augmentation via quantum computer.
3.2.2. Experimental results on a physical quantum machine
Due to the limited hardware support available for quantum computers, the current fidelity of physical quantum machine is low. Therefore, we employ and as examples to validate the effectiveness of the QSMOTEN algorithm. The three samples are encoded in the following form: and . Based on the above assumptions, we develop the quantum circuits shown in Figs. 7(a) and (b) to compute the similarity between and and the similarity between and , respectively.
For each sample pair, the measurement is repeated multiple times in each experiment. The experiments were run on the “Xiaohong” quantum computer, whose quantum processor parameters are detailed in Appendix A Section S9. Figs. 7(c) and (d) show the measurement results for the two sample pairs, respectively. Substituting the measurement results of the quantum circuit in Fig. 7(a), shown in Fig. 7(c), into Eq. (7) gives a similarity of 0 for the first pair, which is the same as the actual similarity of 0. Substituting the measurement results of the quantum circuit in Fig. 7(b), shown in Fig. 7(d), into Eq. (7) gives a similarity of 0.7769 for the second pair, which is close to the actual similarity of 1. Therefore, the experimental results on the Xiaohong quantum computer indicate that the QSMOTEN algorithm can correctly compute the similarity between sample pairs, demonstrating the potential of quantum computing methods for accelerating the SMOTEN algorithm on a physical quantum machine.
3.3. The Salmonella antimicrobial-resistance prediction online platform
To answer the third scientific question, we establish a Salmonella antimicrobial-resistance predictive online platform based on knowledge graphs, which provides four online services: an antimicrobial-resistance prediction module, a pan-genomics analysis results module, a gene-sample-antimicrobial knowledge-graph module, and a data download module. Fig. 8(a) shows the antimicrobial-resistance prediction module, which has two functions: “Select file” uploads the Salmonella gene feature file in the specified format, and “Predict” predicts and visualizes the AMR results. Fig. 8(b) shows the pan-genomics analysis results module, which displays the number and distribution of the genes in the Salmonella pan-genome. Users can move the mouse over a type of gene in the pie chart to obtain detailed information on the gene. In addition, users can move the mouse over the histogram to determine how many genes in the pan-genome exist in genomes. Fig. 8(c) displays the gene-sample-antimicrobial knowledge-graph module. The blue dots in the figure represent the strain sample entities, the green dots represent the antimicrobial entities, the yellow dots represent the gene entities, the lines represent the relationships between entities, and the text labels on the lines describe the type of relationship. Users can move the mouse over an entity or a relationship to highlight that entity or adjacent relationship. In addition, the properties of the entity are displayed when selected by the mouse. Fig. 8(d) shows the data download module, which allows users to obtain the corresponding data by clicking the “download” button.
4. Discussion
Due to genetic mutations and the misuse of antimicrobials, the number of antimicrobial-resistant Salmonella strains continues to increase, challenging public health safety. In order to quickly and accurately predict Salmonella AMR, this study integrated LLM and quantum technology to alleviate the curse of high dimensionality and the sample imbalance problem of WGS data for Salmonella AMR. To address the curse of high dimensionality in Salmonella antimicrobial-resistance prediction, we chose genes that are highly associated with antimicrobials through a pan-genomic analysis and two-step feature selection based on AST data and WGS data for Salmonella. Next, we constructed SARPLLM for Salmonella antimicrobial-resistance prediction based on the Qwen2 LLM [28], which converts Salmonella samples into sentences and uses LoRA to fine-tune the pretrained Qwen2. As shown in Fig. 5, as the number of input resistance features changes from 10 to 50, the evaluation indexes of all the compared predictive models first increase and then stabilize or decrease. This is because the performance of the predictive models initially increases with the number of resistance features. However, inputting too many features unrelated to AMR into prediction models is like introducing a large amount of noise. These irrelevant features not only increase the structural complexity of the predictive model but also result in a significant decrease in the predictive performance. Since Fig. 5 and Table 1 show that SARPLLM performs relatively well on the five antimicrobial datasets, we consider it to possess excellent generalization. Since SARPLLM retains the best predictive performance when the input contains missing data, we also consider it to be very robust.
Benefiting from the development of quantum technology, quantum computing can greatly increase the capability of classical algorithms for certain problems, resulting in the emergence of numerous works on quantum-computing-based algorithm acceleration [52], [53], [54]. For example, Zha et al. [52] introduced grid point matching and feature atom matching to accelerate pose sampling in molecular docking by encoding the problem into a quadratic unconstrained binary optimization model, resulting in a 1000-fold increase in the efficiency of molecular docking in drug discovery. Shu et al. [53] proposed a quantum integration algorithm suitable for any continuous function, which exhibits quadratic acceleration in comparison with classical integration algorithms. Liu et al. [54] designed a quantum method for classical information compression that exploits the hidden subgroup quantum algorithm, demonstrating that data with a given group structure can be compressed with the same query complexity as the hidden subgroup problem while being exponentially faster than the best-known classical algorithms.
Since current data-augmentation algorithms for balancing data categories are inefficient for large-scale high-dimensional WGS data, this study proposed QSMOTEN based on the SMOTEN algorithm with a SWAP-test quantum circuit. A time complexity analysis of the QSMOTEN algorithm showed that the algorithm reduces the time complexity of the key step of computing the distance between samples from a linear level to a logarithmic level. Moreover, simulation experiments on virtual (Table 2) and physical machines (Figs. 7(c) and (d)) showed that the QSMOTEN algorithm can accurately compute the distance between samples and thus has the potential to accelerate the SMOTEN algorithm via quantum computing.
To address the current lack of specialized online platforms for Salmonella antimicrobial-resistance prediction, we constructed an LLM-based platform in this study by integrating web technology and knowledge graph technology. The platform consists of four modules: an antimicrobial-resistance prediction module, a pan-genomics analysis results module, a gene-sample-antimicrobial knowledge-graph module, and a data download module. The platform not only provides users with a convenient Salmonella resistance-prediction service but also visualizes the pan-genomics analysis results. In addition, the platform employs knowledge graph technology to store Salmonella genome data and AMR data with high scalability.
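As an illustration of how a gene-sample-antimicrobial knowledge graph can be queried, the following Python sketch stores the relationships as subject-relation-object triples in memory. The class, the relation names `carries` and `resistant_to`, and all identifiers are hypothetical placeholders; the platform's actual graph schema and storage backend are not described in this section.

```python
from collections import defaultdict


class TripleStore:
    """Minimal in-memory store of (subject, relation, object) triples."""

    def __init__(self):
        self._out = defaultdict(set)  # (subject, relation) -> objects
        self._in = defaultdict(set)   # (relation, object) -> subjects

    def add(self, subject, relation, obj):
        self._out[(subject, relation)].add(obj)
        self._in[(relation, obj)].add(subject)

    def objects(self, subject, relation):
        return set(self._out.get((subject, relation), ()))

    def subjects(self, relation, obj):
        return set(self._in.get((relation, obj), ()))


def genes_linked_to(store, antimicrobial):
    """Count, per gene, the resistant samples that carry it."""
    counts = {}
    for sample in store.subjects("resistant_to", antimicrobial):
        for gene in store.objects(sample, "carries"):
            counts[gene] = counts.get(gene, 0) + 1
    return counts


# Hypothetical placeholder data, not taken from the Salmonella datasets.
kg = TripleStore()
kg.add("sample_001", "carries", "gene_1")
kg.add("sample_001", "carries", "gene_2")
kg.add("sample_002", "carries", "gene_1")
kg.add("sample_001", "resistant_to", "antimicrobial_X")
kg.add("sample_002", "resistant_to", "antimicrobial_X")
print(genes_linked_to(kg, "antimicrobial_X"))
```

New samples, genes, or antimicrobials are added as further triples without changing the structure, which is one reason a graph representation scales well as datasets grow.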
Although we developed a predictive platform for Salmonella AMR in this study based on an LLM and quantum computing (the code is listed in Appendix A Section S10), this work has two shortcomings. First, Salmonella antimicrobial-resistance prediction involves complex biological and genetic knowledge, and it is difficult for current LLMs to fully understand and accurately represent this knowledge, which decreases the accuracy of the predictions. In addition, the performance of LLMs strongly depends on the quality and quantity of the training data, and too few high-quality, diverse datasets are available in the field of Salmonella resistance prediction, which limits LLM predictive performance. Second, quantum computing technology is still in its early stages of development and is limited by the capability of current quantum hardware processors. Most quantum computing algorithms can only be analyzed mathematically or run on high-performance simulators of quantum computers. Therefore, there is still a long way to go before these algorithms can be validated on physical machines and applied in engineering.
In summary, our future research will focus on integrating multi-source data and domain knowledge to increase the accuracy of the predictive platform for Salmonella AMR based on an LLM. We will also attempt to develop more stable and reliable quantum hardware to increase the application of quantum computers in Salmonella resistance data augmentation.
CRediT authorship contribution statement
Yujie You: Writing – review & editing, Writing – original draft, Validation, Methodology. Kan Tan: Writing – original draft, Visualization, Investigation. Zekun Jiang: Supervision, Resources, Investigation. Le Zhang: Writing – review & editing, Resources, Project administration, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Science and Technology Major Project (2021YFF1201200), the National Natural Science Foundation of China (62372316), and the Sichuan Science and Technology Program key project (2024YFHZ0091). We thank QuantumCTek for supporting the quantum computer Xiaohong.
[1]
Ferrari RG, Rosario DKA, Cunha-Neto A, Mano SB, Figueiredo EES, Conte CA, et al. Worldwide epidemiology of Salmonella serovars in animal-based foods: a meta-analysis. Appl Environ Microbiol 2019; 85(14):e00591-19.
[2]
Qin XJ, Yang MZ, Cai H, Liu YT, Gorris L, Aslam MZ, et al. Antibiotic resistance of Salmonella typhimurium monophasic variant 1, 4, 5, 12:i:- in China: a systematic review and meta-analysis. Antibiotics 2022; 11(4):532.
[3]
Anahtar MN, Yang JH, Kanjilal S. Applications of machine learning to the problem of antimicrobial resistance: an emerging model for translational research. J Clin Microbiol 2021; 59(7):e01260-20.
[4]
Botelho J, Schulenburg H. The role of integrative and conjugative elements in antibiotic resistance evolution. Trends Microbiol 2021; 29(1):8-18.
[5]
Soon WW, Hariharan M, Snyder MP. High-throughput sequencing for biology and medicine. Mol Syst Biol 2013; 9:640.
Wang CC, Hung YT, Chou CY, Hsuan SL, Chen ZW, Chang PY, et al. Using random forest to predict antimicrobial minimum inhibitory concentrations of nontyphoidal Salmonella in Taiwan. Vet Res 2023; 54(1):11.
[8]
Ren Y, Chakraborty T, Doijad S, Falgenhauer L, Falgenhauer J, Goesmann A, et al. Deep transfer learning enables robust prediction of antimicrobial resistance for novel antibiotics. Antibiotics 2022; 11(11):1611.
[9]
Gao J, Lao QH, Liu P, Yi HH, Kang QB, Jiang ZK, et al. Anatomically guided cross-domain repair and screening for ultrasound fetal biometry. IEEE J Biomed Health Inform 2023; 27(10):4914-4925.
[10]
Lai X, Zhou J, Wessely A, Heppt M, Maier A, Berking C, et al. A disease network-based deep learning approach for characterizing melanoma. Int J Cancer 2022; 150(6):1029-1044.
[11]
Song H, Chen L, Cui Y, Li Q, Wang Q, Fan J, et al. Denoising of MR and CT images using cascaded multi-supervision convolutional neural networks with progressive training. Neurocomputing 2022; 469:354-365.
[12]
Zhang Q, Zhang H, Zhou K, Zhang L. Developing a physiological signal-based, mean threshold and decision-level fusion algorithm (PMD) for emotion recognition. Tsinghua Sci Technol 2023; 28(4):673-685.
[13]
Zhang L, Song W, Zhu T, Liu Y, Chen W, Cao Y, et al. ConvNeXt-MHC: improving MHC-peptide affinity prediction by structure-derived degenerate coding and the ConvNeXt model. Brief Bioinform 2024; 25(3):bbae133.
[14]
Jiang Z, Cheng D, Qin Z, Gao J, Lao Q, Li K, et al. TV-SAM: increasing zero-shot segmentation performance on multimodal medical images using GPT-4 generated descriptive prompts without human annotation. Big Data Min Anal 2024; 7(4):1199-1211.
[15]
Gao J, Lao Q, Kang Q, Liu P, Du C, Li K, et al. Boosting your context by dual similarity checkup for in-context learning medical image segmentation. IEEE Trans Med Imaging 2025; 44(1):310-319.
[16]
You Y, Zhou F, Yue Y. The classical iterative HHL-based hemodynamic simulation quantum linear equation algorithm for abdominal aortic aneurysm. Eur Phys J Spec Top. In press.
[17]
Xiao M, Wei R, Yu J, Gao C, Yang F, Zhang L, et al. CpG island definition and methylation mapping of the T2T-YAO genome. Genom Proteom Bioinform 2024; 22(2):qzae009.
He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence); 2008 Jun 1–8; Hong Kong, China; 2008. p. 1322–8.
[20]
Han H, Wang WY, Mao BH. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Proceedings of the International Conference on Intelligent Computing; 2005 Aug 23–26; Hefei, China. Berlin: Springer Nature; 2005. p. 878–87.
[21]
Zankari E, Hasman H, Cosentino S, Vestergaard A, Rasmussen S, Lund O, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother 2012; 67:2640-2644.
Moradigaravand D, Palm M, Farewell A, Mustonen V, Warringer J, Parts L, et al. Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data. PLoS Comput Biol 2018; 14(12):e1006258.
[24]
Ha SM, Lin EY, Klausner JD, Adamson PC. Machine learning to predict ceftriaxone resistance using single nucleotide polymorphisms within a global database of Neisseria gonorrhoeae genomes. Microbiol Spectr 2023; 11(6):e0170323.
[25]
Yang Y, Walker TM, Kouchaki S, Wang C, Peto TEA, Crook DW, et al. An end-to-end heterogeneous graph attention network for Mycobacterium tuberculosis drug-resistance prediction. Brief Bioinform 2021; 22(6):bbab29.
[26]
Jiang Z, Lu Y, Liu Z, Wu W, Xu X, Dinnyés A, et al. Drug resistance prediction and resistance genes identification in Mycobacterium tuberculosis based on a hierarchical attentive neural network utilizing genome-wide variants. Brief Bioinform 2022; 23(3):bbac041.
[27]
Shi JH, Yan Y, Links MG. Antimicrobial resistance genetic factor identification from whole-genome sequence data using deep feature selection. BMC Bioinformatics 2019; 20(Suppl 15):535.
[29]
Ma F, Xiao M, Zhu L, Jiang W, Jiang J, Zhang PF, et al. An integrated platform for Brucella with knowledge graph technology: from genomic analysis to epidemiological projection. Front Genet 2022; 13:981633.
[30]
Zhang L, Dai Z, Yu J, Xiao M. CpG-island-based annotation and analysis of human housekeeping genes. Brief Bioinform 2021; 22(1):515-525.
[31]
Zhang L, Zhang L, Guo Y, Xiao M, Feng L, Yang C, et al. MCDB: a comprehensive curated mitotic catastrophe database for retrieval, protein sequence alignment, and target prediction. Acta Pharm Sin B 2021; 11(10):3092-3104.
[32]
Kline DM, Berardi VL. Revisiting squared-error and cross-entropy functions for training neural network classifiers. Neural Comput Appl 2005; 14(4):310-318.
[33]
Barenco A, Berthiaume A, Deutsch D, Ekert AK, Jozsa R, Macchiavello C, et al. Stabilization of quantum computations by symmetrization. SIAM J Comput 1997; 26:1541-1557.
[34]
Chin CS, Alexander D, Marks P, Klammer AA, Drake J, Heiner C, et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat Methods 2013; 10(6):563-569.
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, et al. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 2015; 31(22):3691-3693.
[37]
Katoh K, Misawa K, Ki K, Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002; 30(14):3059-3066.
[38]
Page AJ, Taylor B, Delaney AJ, Soares J, Seemann T, Keane JA, et al. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genom 2016; 2(4):e000056.
[39]
Dong X, Sun F, Han X, Hou R. Study of positive and negative association rules based on multi-confidence and chi-squared test. In: Li X, Zaïane OR, Li Z, editors. Advanced data mining and applications. Berlin: Springer Nature; 2006. p. 100-109.
[40]
Liang J, Shi Z, Li D, Wierman MJ. Information entropy, rough entropy and knowledge granulation in incomplete information systems. Int J Gen System 2006; 35:641-654.
[41]
Dinh T, Zeng Y, Zhang R, Lin Z, Gira M, Rajput S, et al. LIFT: language-interfaced fine-tuning for non-language machine learning tasks. In: Proceedings of the 36th International Conference on Neural Information Processing Systems; 2022 Nov 28–Dec 9; New Orleans, LA, USA. Red Hook: Curran Associates Inc.; 2022. p. 11763–84.
[42]
Hegselmann S, Buendia A, Lang H, Agrawal M, Jiang X, Sontag D, et al. TabLLM: few-shot classification of tabular data with large language models. In: Proceedings of the International Conference on Artificial Intelligence and Statistics; 2023 Apr 25–27; Valencia, Spain. PMLR. p. 5549–58.
[43]
Putnam J. Python Web development with Django. Comput Rev 2010; 51(6):330.
[44]
Breiman L. Random forests. Mach Learn 2001; 45:5-32.
[45]
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016 Aug 13–17; San Francisco, CA, USA. New York City: Association for Computing Machinery (ACM); 2016. p. 785–94.
Loshchilov I, Hutter F. Decoupled weight decay regularization. In: Proceedings of the International Conference on Learning Representations; 2019 May 6–9; New Orleans, LA, USA. Wadern: dblp; 2019.
[48]
Xia Y, Yang C, Hu N, Yang Z, He X, Li T, et al. Exploring the key genes and signaling transduction pathways related to the survival time of glioblastoma multiforme patients by a novel survival analysis model. BMC Genomics 2017; 18(Suppl 1):950.
[49]
Zhang L, Liu G, Kong M, Li T, Wu D, Zhou X, et al. Revealing dynamic regulations and the related key proteins of myeloma-initiating cells by integrating experimental data into a systems biological model. Bioinformatics 2021; 37(11):1554-1561.
[50]
You Y, Lai X, Pan Y, Zheng H, Vera J, Liu S, et al. Artificial intelligence in cancer target identification and drug discovery. Signal Transduct Target Ther 2022; 7(1):156.
[51]
Aleksandrowicz G, Alexander T, Barkoutsos P, Bello L, Ben-Haim Y, Bucher D, et al. Qiskit: an open-source framework for quantum computing [Internet]. Genève: Zenodo; 2019 Jan 23 [cited 2024 Jan 22]. Available from: https://zenodo.org/records/2562111.
[52]
Zha J, Su J, Li T, Cao C, Ma Y, Wei H, et al. Encoding molecular docking for quantum computers. J Chem Theory Comput 2023; 19(24):9018-9024.
[53]
Shu G, Shan Z, Xu J, Zhao J, Wang S. A general quantum algorithm for numerical integration. Sci Rep 2024; 14:10432.