1. Introduction
In the past few decades, the pharmaceutical industry has been limited by the extent of cutting-edge research in pharmaceutical sciences, because the development of new drugs is a long and complex process accompanied by high risks and high costs [1], [2]. In other words, the current field of drug research and development (R&D) requires significant productivity improvements to shorten the cycle time and reduce the cost of drug development [3]. Technologies such as network pharmacology, RNA-sequencing (RNA-seq), high-throughput screening (HTS), and virtual screening (VS) have all accelerated the discovery of new targets and new drugs to some extent [4], [5], [6], [7], [8], [9]. Nevertheless, these technologies have rarely been significant contributors to the current process of new drug discovery. Thus, there is an urgent need for new technology to drive the development of new drugs.
As the computing power of devices grows, artificial intelligence (AI) has been used in many real-world applications, such as image classification and speech recognition, due to its ability to learn from, process, and make predictions from massive amounts of information [10], [11], [12]. At present, after a long period of data accumulation, in combination with the development of high-throughput RNA-seq technology, massive amounts of biomedical data have been collected [13], [14], [15], [16], [17], [18]. Biomedical data, which has a high level of heterogeneity and complexity, comes from a variety of sources, including omics data from different platforms, experimental data from biological or chemical laboratories, data generated by pharmaceutical companies, publicly disclosed textual information, and manually collated data from publicly available databases [19], [20], [21], [22]. AI can be used to learn the potential patterns in these vast amounts of biomedical data, thereby bringing new opportunities and challenges to the pharmaceutical sciences and industries.
The AlphaFold2 system used AI in the 14th round of the Critical Assessment of Protein Structure Prediction (CASP14) competition and outperformed others in accurately predicting the three-dimensional (3D) structures of proteins [
23]. Similarly, in the Open-Graph Benchmark Large-Scale Challenge (OGB-LSC) competition, a graph neural network (GNN) combined with a transformer model won the top rank in predicting the molecular properties calculated by means of density functional theory (DFT), which is difficult and highly time-consuming using traditional methods [
24]. These competitions demonstrated the strong ability of AI to analyze biological or chemical data. Due to its powerful capability to utilize related biomedical data to understand complex biological systems and chemical reaction spaces [
25], [
26], AI has had a revolutionary impact on all stages of drug R&D, including not only research on proteins and small molecules but also the assisted design of clinical trials and post-market surveillance [
27]. Furthermore, in pharmaceutical companies, many state-of-the-art (SOTA) AI models have been adopted in diverse pipelines to shorten the R&D cycle time and decrease costs [
28], [
29], [
30].
AI techniques in this context mainly involve machine learning (ML) and deep learning (DL). Both ML and DL algorithms are involved in target discovery and validation [
31], drug discovery and design [
32], and preclinical drug research [
33], where they are used to analyze different data characteristics in different formats. After a drug candidate is enrolled in a clinical trial [
34], DL plays a pivotal role in assisting with the design of the clinical trial and in monitoring and analyzing data from phase IV clinical studies [
33]. Approved drugs have a strong impact on manufacturing [
35] and the market economy, and DL can play a part in these areas as well. Therefore, in this review, we present a comprehensive overview of most aspects of the use of AI in the pharmaceutical sciences. We focus on how AI can be used to promote target discovery and drug discovery (as shown in Fig. 1) and reflect on how to further accelerate the development of this field.
2. Basic concepts of AI and its scope of application
AI was first proposed at the Dartmouth Conference in 1956 and was defined as an algorithm that gives machines the ability to reason and perform functions [
36]. From perceptrons to support vector machines (SVMs) and artificial neural networks (ANNs), the development of AI has gone through several ups and downs, and it is currently flourishing thanks to the hardware support that is now available. Both ML and DL fall under the category of AI; strictly speaking, DL is a subset of ML. However, our discussion of ML in this review concentrates only on traditional ML methods, such as random forest (RF) and SVMs.
2.1. The big data era
In the current big data era, vast amounts of biological and clinical data have laid a foundation for the application of AI in the field of medical and pharmaceutical research. Although AI has been successfully and effectively applied in multiple aspects of the drug R&D process, the limited quantity and quality of medical data have become one of the main obstacles to the development of AI in the pharmaceutical sciences. Consequently, pharmaceutical databases containing detailed and structured data, built by medicinal researchers worldwide, play a key role in promoting AI applications in medical and pharmaceutical research.
For example, the Therapeutic Target Database (TTD) includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information, and the corresponding drugs directed at each of these targets. It provides detailed knowledge of the functions of targets, as well as their sequence, 3D structures, ligand-binding properties, relevant enzymes, and corresponding drug information [
37]. PubChem [
17] provides collective information of chemical molecules and their activities in response to biological assays, including molecular structure, identifiers, physicochemical properties, patent information, and molecular toxicity. Some popular databases aimed at various pharmaceutical issues have been proposed and are frequently used; these play significant roles in promoting the application of AI in medical and pharmaceutical research [
38], [
39], [
40], [
41], [
42]. Table 1 [17], [18], [37], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62] provides brief information on popular pharmaceutical databases, categorized into protein-related, gene-related, drug-related, and disease-related databases.
2.2. ML and DL
Unlike conventional computer programs, ML and DL can learn latent patterns from the input data without being explicitly programmed. They are not limited to a particular input format: the input can include text, images, sound, and any other type of data that can be encoded [
63]. Similar to the human learning model, ML and DL can gradually recognize different features of the data, infer the patterns lying within, and update their model parameters through continuous iterations until a valid model is formed.
According to the application scenarios, the models can be categorized into regression models and classification models. The difference between regression and classification tasks lies mainly in whether the type of output variable is continuous or discrete. Cheng and Ng [
64] applied ML approaches to predict the biological activity of per- and polyfluorinated alkyl substances (PFAS) with an output of continuous values, and this study is a typical regression task. Hong et al. [
65] built a DL model to predict whether a protein in a bacterium is of the type IV secreted effectors (T4SE), with an output of discrete values (e.g., 0/1), and this study is a typical classification task.
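The distinction between the two task types can be made concrete with a minimal sketch. The descriptor values and labels below are hypothetical toy data rather than data from the cited studies, and the two fitting functions are deliberately simple stand-ins (a least-squares line and a midpoint threshold) for the ML and DL models discussed in the text.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (regression: continuous output)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def fit_threshold(xs, labels):
    """Midpoint between the two class means (classification: discrete 0/1 output)."""
    m0 = mean(x for x, label in zip(xs, labels) if label == 0)
    m1 = mean(x for x, label in zip(xs, labels) if label == 1)
    return (m0 + m1) / 2, m1 > m0

def classify(x, threshold, one_is_high):
    """Return 1 or 0 depending on which side of the threshold x falls."""
    return int((x > threshold) == one_is_high)

# Hypothetical one-dimensional descriptor with two kinds of target.
xs = [1.0, 2.0, 3.0, 4.0]
activity = [2.1, 3.9, 6.2, 7.8]   # continuous target -> regression
is_effector = [0, 0, 1, 1]        # discrete target -> classification

slope, intercept = fit_line(xs, activity)
threshold, one_is_high = fit_threshold(xs, is_effector)
```

The regression model returns a real number for any new descriptor value (`slope * x + intercept`), whereas `classify` can only return 0 or 1.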
Depending on the type of learning algorithm required to solve the problem, models are grouped into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is a labeled-data-driven process that trains a model on the relationship between input and its prespecified output in order to predict the categories or continuous variables of future input. In comparison, unsupervised methods are used for identifying patterns in unlabeled datasets and exploring a dataset’s potential structures to allow clustering of the data for further analysis. In addition, semi-supervised learning is part-way between supervised and unsupervised learning; it trains a model on data of which only a part is labeled and is a potential solution for problems that lack high-quality labeled data [
66]. Reinforcement learning performs model construction through constant interactive learning, relying on penalties for failure or rewards for success.
2.3. Introduction to different types of ML/DL-based algorithms
ML and DL methods have been successfully applied to solve relevant biomedical problems, with the adopted modeling approach varying between problems and even within the same problem. For example, small molecules were traditionally characterized by engineered features that were fed directly into ML methods to predict their properties; more recently, GNNs have also been used to represent small molecules for property prediction [
67]. Determining the function annotations of proteins is essential for the selection of druggable proteins as potential targets. Kulmanov et al. [
68] built a convolutional neural network (CNN) to predict the gene ontology annotations (GOA) of proteins. Gligorijević et al. [
69] built a recurrent neural network (RNN) for protein function annotations, and Xia et al. [
70] combined both a CNN and RNN to predict the gene ontology (GO) label of proteins.
Rather than hand-coding a task-specific procedure, ML fits a general-purpose algorithm to the features of the data and transforms them into knowledge that machines can read, providing humans with new insights. Various common algorithms exist for researchers to choose from. The naïve Bayes (NB) algorithm is a probabilistic classifier based on Bayes’ theorem and independence assumptions between features; it is a simple and intuitive algorithm [71]. An RF algorithm constructs a set of largely uncorrelated decision trees, each of which is trained separately and produces its own prediction [72]. The final decision is based on the majority vote of the decision trees. Models that make decisions based on this approach are commonly referred to as ensemble models. Extreme gradient boosting (XGBoost) is a scalable ML algorithm based on gradient boosting, which is also an ensemble model [73]. A multi-layer perceptron (MLP) can be viewed as a directed graph consisting of multiple node layers, each fully connected to the next layer, so that it maps a set of input vectors to a set of output vectors. SVM is one of the most widely applied ML algorithms. It classifies samples using an optimal hyperplane, which is obtained by maximizing the margin between different classes in a feature space whose dimensionality is determined by the number of features [74]. The k-nearest neighbor (KNN) algorithm is regarded as a “lazy learning” method that classifies a sample according to the labels of only its few nearest neighboring samples [
75]. In addition to the above methods, several other ML methods such as principal component analysis (PCA), partial least-squares (PLS), linear discriminant analysis (LDA), and logistic regression (LR) have been applied in biomedical data processes [
76], [
77].
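Two of the decision rules described above, neighbor-based classification and majority voting in an ensemble, are short enough to sketch directly. The following is a minimal illustration using only the Python standard library; the two-dimensional feature vectors and the "active"/"inactive" labels are hypothetical, and a practical study would normally rely on an established ML library instead.

```python
import math
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label, as an ensemble such as RF does across its trees."""
    return Counter(predictions).most_common(1)[0][0]

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` from the labels of its k nearest training samples (Euclidean distance)."""
    by_distance = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    return majority_vote(label for _, label in by_distance[:k])

# Hypothetical 2D descriptors for six compounds.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["inactive", "inactive", "inactive", "active", "active", "active"]

near_origin = knn_predict(train_x, train_y, (0.5, 0.5))
near_cluster = knn_predict(train_x, train_y, (5.5, 5.5))
ensemble_decision = majority_vote(["active", "inactive", "active"])
```

KNN is "lazy" in the sense visible here: there is no training step, and all of the work happens when a query arrives.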
DL is popular due to its powerful generalization and feature-extraction capabilities; its learning and prediction process is end-to-end. Unlike the traditional ML process (which often consists of multiple independent modules), DL obtains the output data (output-end) directly from the input data (input-end) during the model training process and continuously adjusts and optimizes the model based on the error between the output and the true value, until it meets the expected result. A deep neural network (DNN) is a feed-forward neural network consisting of densely connected input, hidden, and output layers. It achieves the feature learning of input data by simulating nonlinear transformations between neurons, with each layer consisting of various neurons [
78]. A CNN is a feed-forward neural network that consists of convolutional (feature extraction) and pooling (dimensionality reduction) layers. The convolutional and pooling layers help to extract all the information in a dataset without consuming too much time and computational resources [
79]. An RNN is a class of ANN in which linked nodes form a directed or undirected graph along a temporal sequence. An RNN includes a feedback component that allows signals from one layer to be fed back to the previous layer. This feedback gives the network an internal memory, which helps to address the difficulty of learning and storing long-term information [
80]. A GNN is a connectivity model that derives the dependencies in a graph by means of information transfer between nodes in the network [
81], [
82]. A GNN updates the state of each node according to its neighbors, which may lie at any depth from the node; the resulting state represents the node information. The neural network architectures of the four networks described above are shown in Fig. 2.
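The way a GNN propagates information between nodes can be sketched without any learned parameters. The example below is an assumption-laden simplification: node states are single numbers, the update is a plain average over a node and its neighbors, and the four-node path graph is hypothetical. Real GNNs use vector states and trainable transformations, but the reach of information grows in the same way, by one hop per round.

```python
def message_passing_round(states, neighbors):
    """One round: each node's new state is the mean of its own and its neighbors' states."""
    new_states = {}
    for node, state in states.items():
        pool = [state] + [states[n] for n in neighbors[node]]
        new_states[node] = sum(pool) / len(pool)
    return new_states

# Hypothetical path graph a - b - c - d; only node "a" starts with a signal.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
history = [{"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}]
for _ in range(3):
    history.append(message_passing_round(history[-1], neighbors))
```

After one round only `b` has received information from `a`; node `d`, three hops away, is reached only in the third round. Stacking more rounds (layers) is therefore what lets a node's state depend on neighbors at greater depth.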
An autoencoder (AE), which consists of an encoder and a decoder, is used to learn efficient encodings of input data. The encoder maps the input to an encoding, from which the decoder reconstructs the input. An AE is usually used for data compression and dimensionality reduction through the representation methods (i.e., the encoding) of a set of data [83]. A generative adversarial network (GAN) is composed of two underlying neural networks: a generator neural network and a discriminator neural network. The former is used to generate content, while the latter is used to distinguish the generated content from real data [
84]. Models can also be used in combination to solve a wider range of problems. For example, a graph convolution network (GCN) extends convolutional operations from traditional data (e.g., images) to graph data [
85].
When a model fails to learn the underlying patterns in the data effectively, and therefore cannot generalize to new data, the problem is called underfitting [86]. In contrast, overfitting occurs when, during training, the model fits noise in the data as if it were a representative feature, resulting in poor predictions for new data [87]. Compared with underfitting, model overfitting is more difficult to deal with. Models often become overfitted because they are overly complex or because the training data are insufficient or unrepresentative. A dataset used for a model is often divided into a training set, validation set, and test set. These sets are respectively used for model training, model adjustment, and model evaluation. To put it simply, a model that works badly on both the training and test sets is an underfitted model, while a model that works well on the training set but badly on the test set is an overfitted model. Typical ways to suppress overfitting include regularization, data augmentation [88], dropout [89], early stopping, and ensemble learning, among other methods.
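The dataset split and the rule of thumb for telling the two failure modes apart can be written down as a short sketch. The split fractions, the random seed, and the `acceptable` error threshold below are illustrative assumptions, not values taken from the literature; what counts as an acceptable error depends entirely on the task.

```python
import random

def split_dataset(items, train=0.7, valid=0.15, seed=0):
    """Shuffle and split into training, validation, and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_valid = round(len(items) * valid)
    return (
        items[:n_train],                     # used for model training
        items[n_train:n_train + n_valid],    # used for model adjustment
        items[n_train + n_valid:],           # used for model evaluation
    )

def diagnose(train_error, test_error, acceptable=0.1):
    """Apply the rule of thumb from the text to a pair of error rates."""
    train_ok = train_error <= acceptable
    test_ok = test_error <= acceptable
    if not train_ok and not test_ok:
        return "underfitted"
    if train_ok and not test_ok:
        return "overfitted"
    if train_ok and test_ok:
        return "reasonable fit"
    return "inconclusive"

train_set, valid_set, test_set = split_dataset(range(20))
```

For example, `diagnose(0.02, 0.35)` returns `"overfitted"`, because the model does well on the data it has seen and badly on the data it has not.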
When predicting the long-term trends of the coronavirus disease 2019 (COVID-19) pandemic, researchers encountered underfitting and overfitting problems when they relied on a single traditional epidemic model or a single ML model. To address these issues, Sun et al. [
90] proposed a new model called dynamic-susceptible-exposed-infective-quarantined (D-SEIQ). The D-SEIQ model can accurately predict the long-term trends of COVID-19 outbreaks by appropriately modifying the susceptible-exposed-infective-recovered (SEIR) model and integrating ML-based parameter optimization under reasonable epidemiology constraints.
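For readers unfamiliar with compartmental models, the textbook SEIR model on which D-SEIQ builds can be sketched as follows. This is not the D-SEIQ model itself: the quarantine compartment, the dynamic parameters, and the ML-based parameter optimization are all omitted, and the parameter values (`beta`, `sigma`, `gamma`) and population size are hypothetical. The equations are integrated with a simple forward Euler step.

```python
def seir_step(s, e, i, r, beta, sigma, gamma, dt):
    """Advance the standard SEIR equations by one Euler step of length dt."""
    n = s + e + i + r
    new_exposed = beta * s * i / n * dt      # S -> E (infection)
    new_infectious = sigma * e * dt          # E -> I (end of incubation)
    new_recovered = gamma * i * dt           # I -> R (recovery)
    return (
        s - new_exposed,
        e + new_exposed - new_infectious,
        i + new_infectious - new_recovered,
        r + new_recovered,
    )

def simulate(state, beta, sigma, gamma, days, dt=0.1):
    """Return the list of (S, E, I, R) states over the simulated period."""
    trajectory = [state]
    for _ in range(round(days / dt)):
        state = seir_step(*state, beta, sigma, gamma, dt)
        trajectory.append(state)
    return trajectory

# Hypothetical outbreak: 10 infectious people in a population of 10 000.
trajectory = simulate((9990.0, 0.0, 10.0, 0.0), beta=0.5, sigma=0.2, gamma=0.1, days=100)
```

With fixed parameters such a model is easy to underfit to real data, which is the motivation the text gives for letting an ML procedure adjust the parameters under epidemiological constraints.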
Different models have different evaluation criteria. In regression models, commonly used evaluation criteria include mean squared error (MSE), root mean squared error (RMSE), and R-squared. In classification models, commonly used criteria are recall, precision, and F1-score. The receiver operating characteristic (ROC) curve and precision-recall curve (PRC) are also widely used to evaluate classification models: ROC curves take into account both positive and negative cases to assess the overall performance of the model, while PRCs focus more on positive cases [
91].
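The scalar criteria named above follow directly from their standard definitions, as the following minimal sketch shows. The function names and the toy values are illustrative; binary labels are assumed to be encoded as 0 and 1, and precision or recall is reported as 0.0 when its denominator is zero, which is one common convention among several.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy values for illustration only.
reg_true, reg_pred = [3, 5, 7, 9], [2, 5, 8, 9]
cls_true, cls_pred = [1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(cls_true, cls_pred)
```

Neither precision nor recall counts true negatives, which is why criteria built on them, like the PRC, emphasize the positive class.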
2.4. A brief description of molecule representation as model input
Over time, the accumulation of data on small molecules and proteins has resulted in an extremely large data resource. Databases of molecular sequences, structures, physicochemical properties, and so forth have been collected and organized by different organizations and contain a great deal of knowledge and information. However, the different sources and formats of the data make it difficult to integrate the correlated data from multiple heterogeneous sources. Therefore, it is particularly important to adopt suitable methods to represent molecules in an appropriate way and to mine the crucial information in the data on molecules by means of AI [
92]. Current AI algorithms are highly dependent on the quality of the data; thus, when performing model construction, it is necessary to unify the input format of molecules, such as by representing small molecules and proteins as model-readable vectors or matrices.
At present, the representation of small molecules is generally done using one of four main approaches. The first approach involves knowledge-based representation. Molecular descriptors and molecular fingerprints based on human a priori knowledge are widely used in various ML or DL algorithms [
93]. The second approach involves direct representation based on images. CNNs have now been used to learn rules from two-dimensional (2D) digital images. A 2D chemical digital grid of a molecule can be directly used as input to allow a CNN model to learn the properties of the molecule [
94]. The third approach is string-based representation. For example, the canonical simplified molecular-input line-entry system (SMILES) notation represents small molecules in the form of strings. CNNs and RNNs can then be used to learn molecular embeddings from the string representations of chemical structures [
95], [
96], [
97]. The fourth approach involves graph-based feature representation. Representation methods based on graph convolution or graph attention have been widely used to explore the feature representation of small molecules. In these methods, atoms and bonds are considered to be nodes and edges, respectively, while new molecular representations are obtained during the continuous updating of information at individual nodes. Graph-based representations have achieved outstanding performance in a variety of pharmaceutical learning tasks [
98], [
99].
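The string-based and graph-based views of the same molecule can be contrasted with a small sketch. It makes several simplifying assumptions: the SMILES string is tokenized one character at a time (which would mishandle two-letter atoms such as Cl or Br and bracketed atoms), the alphabet is a hypothetical six-symbol vocabulary, and the graph for ethanol is written out by hand rather than parsed by a cheminformatics toolkit.

```python
def one_hot_smiles(smiles, alphabet):
    """Encode a SMILES string as a (length x alphabet size) 0/1 matrix."""
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    return [
        [1 if index[ch] == j else 0 for j in range(len(alphabet))]
        for ch in smiles
    ]

def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric adjacency matrix: atoms are nodes, bonds are edges."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for a, b in bonds:
        matrix[a][b] = 1
        matrix[b][a] = 1
    return matrix

# String view of ethanol (hydrogens implicit).
alphabet = ["C", "O", "N", "=", "(", ")"]
smiles_matrix = one_hot_smiles("CCO", alphabet)

# Graph view of the same molecule, written by hand: C0 - C1 - O2.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
graph_matrix = adjacency_matrix(len(atoms), bonds)
```

The one-hot matrix is the kind of input a CNN or RNN would consume, whereas the atom list and adjacency matrix are what a graph-based model would update node by node.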
Protein representation methods can be basically classified into four categories: representation based on intrinsic properties of sequences, representation based on physicochemical properties, representation based on protein structure, and graph-based representation. Sequence-based protein representation methods include amino acid composition (AAC), dipeptide composition, autocorrelation descriptors, position-specific scoring matrices (PSSMs), and one-hot encoding [
100], [
101], [
102], [
103], [
104], [
105], [
106], [
107]; these methods reflect the content of various amino acids, dipeptide content, and the distribution of amino acids on the sequence. Physicochemical property-based protein representation methods include composition, transition, and distribution (CTD), pseudo-amino acid composition (PAAC), and amphiphilic pseudo-amino acid composition (APAAC) [
108], [
109], [
110], which reflect the properties of each amino acid and the distribution of these properties on the sequences. The two feature representation methods described above are widely used in various models, because they can obtain protein feature representations by knowing only the sequence information. It is well known that the high-level structure of a protein determines the function of that protein, so it is sometimes preferable to represent the structure of a protein directly. Protein representation methods based on structural properties include topological molecular structure and protein secondary structure and solvent accessibility (PSSSA) [
111], [
112], [
113], which reflect the structural properties of each amino acid in a protein and the structural type of a protein. Proteins can also be represented as graphs. In the simplest graph, each node corresponds to a residue, while the edges connect pairs of residues within a certain distance [
114]. Structure-based and graph-based protein representation methods can effectively represent the structure of a protein and the relationships between amino acid residues in the structure, and can be applied to a variety of novel model architectures, such as GNNs, transformer models, and GANs [
114], [
115], [
116], [
117].
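Two of the representations described above, sequence composition and the simplest residue graph, can be sketched in a few lines. The four-residue sequence, the coordinates, and the 6.0 distance cutoff are hypothetical values chosen for illustration; in practice the coordinates would come from an experimental or predicted structure, and the cutoff is a modeling choice.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def amino_acid_composition(sequence):
    """AAC: fraction of each of the 20 amino acids in the sequence."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]

def dipeptide_composition(sequence):
    """Fraction of each adjacent residue pair that occurs in the sequence."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = len(sequence) - 1
    return {pair: n / total for pair, n in pairs.items()}

def contact_edges(coords, cutoff):
    """Graph edges: pairs of residues whose coordinates lie within `cutoff`."""
    return [
        (i, j)
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
        if math.dist(coords[i], coords[j]) <= cutoff
    ]

sequence = "ACCA"
aac = amino_acid_composition(sequence)
dipeptides = dipeptide_composition(sequence)

# Hypothetical per-residue coordinates (one point per residue).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
edges = contact_edges(coords, cutoff=6.0)
```

The composition vectors need only the sequence, which is why they are so widely applicable, while the edge list requires a three-dimensional structure but preserves spatial relationships that the sequence alone does not show.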
In recent years, novel molecular representation methods have been emerging, such as knowledge-graph-based and large-scale pretrained-based representation methods [
118], [
119]; these methods also excel in suitable downstream tasks. Overall, representing the raw data of a molecule using a vector or matrix that captures the molecule’s key features is critical for subsequent data exploration and analysis.
2.5. The study of drug research and disease with distinct AI algorithms
When studying different types of drugs and performing disease research, choosing a suitable model can maximize the potential information of the data. Given classification or regression problems with small datasets, ML can often achieve a satisfactory performance in a short time. For example, a drug-protein affinity prediction study based on quantitative structure-activity relationship (QSAR) models could choose to use SVM or RF models (see Section 5 for more detail) [
120], [
121]. As the amount of data grows, DL algorithms are often more appropriate. For example, in protein-folding prediction, CNN models can make better residue-level predictions [122]. In the research area of de novo drug design, generative models such as variational autoencoders (VAEs) can help to design molecules that meet the design objectives [
123], [
124] (see Section 4 for more detail). Instead of selecting models from the perspective of the tasks, studies often use the data representation form to select an appropriate algorithm. Therefore, researchers can often choose from different AI algorithms that are available for the same task. When predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of molecules, CNNs, RNNs, and multi-task learning can achieve outstanding results [
125] (see Section 5 for more detail). By starting from the relationships between data, graph-based AI algorithms allow the modeling of unstructured data. In the pharmaceutical sciences, there is never a lack of complex relationships. Therefore, modeling complex interactions such as drug-drug interactions, drug-protein interactions, protein-protein interactions (PPIs), and so forth enhances the learning capability of the models [
126] (see Section 3 for more detail). When combined with representations of these entities themselves, key information about the entities can be learned at a deeper level to aid in making predictions, while providing a more explanatory model.
Therefore, the boundaries between the use of distinct algorithms have become increasingly blurred when such methods are applied to the actual drug and disease problems to be studied. Considering the type of data available, together with its biological significance, can inform model selection and construction.
3. Target identification and validation
From a conventional standpoint, there are two paradigms for discovering new (first-in-class) drugs [
127]: phenotypic drug discovery (PDD) and target-based drug discovery (TDD). Early biological research techniques relied on microscopy, imaging, and cellular techniques to observe the phenotypic changes in living systems. PDD is used to screen a library of compounds or antibodies by constructing an animal model or experiment that is highly relevant to the disease. Next, the responses of cells or experimental animals to these compounds are observed, with the aim of identifying molecules with a certain level of efficacy for further structural modification and optimization [
128]. With the development of molecular biology and various sequencing techniques, research on biological macromolecules has reached a new height. Drug discovery research has entered the TDD era [
129], and TDD has gradually replaced PDD as the mainstream drug discovery paradigm. TDD is centered on a “one gene, one drug, and one disease” concept [
4]. This approach relies on a highly disease-relevant target, which could be an enzyme, protein, or other gene product, along with an elaborate and meticulous small-molecule design for this target, which is used to modulate the target to act as a therapeutic agent for the disease. Although the drug discovery paradigm of PDD has been re-emerging in recent years [
128], the screened drugs often require further target validation and mechanistic studies. Therefore, target discovery is often the first, critical step in the drug development phase [
129]. The target discovery process involves multifaceted research, including the study of disease-related genes, signaling pathways, protein interactions, and small molecule-protein interactions. Of particular interest is the fact that target discovery based on experimental means is difficult to carry out quickly and widely, due to limitations in throughput, accuracy, and cost, whereas AI-based discovery can efficiently and effectively identify biomolecules with the potential to become drug targets.
3.1. Target identification based on omics techniques
With the advancement of high-throughput sequencing technologies, huge amounts of omics data are continuously being generated. The processing and analysis of such large-scale omics data (genomics, transcriptomics, proteomics, metabolomics, etc.) [
130], [
131], [
132], [
133], [
134], [
135], [
136], [
137], [
138] have been revolutionary to biology, medicine, and pharmacology, especially in facilitating researchers’ understanding of complex biological systems and processes. Many genes or proteins playing important roles in biological processes that may be associated with specific diseases have been identified based on omics data [
135], [
139], [
140], [
141], thereby facilitating research on drug target discovery. For example, new candidate disease targets such as SETD2 and VGLL4 have been uncovered using omics data. However, processing and analyzing these complex and high-dimensional omics data is extremely challenging; thus, ML and DL approaches can be used to learn potential knowledge from large-scale omics datasets, which can help in the discovery of genes or pathways critical to biological processes [
142].
Table 2 [18], [44], [53], [48], [49], [50], [143], [144], [145], [146], [147], [148], [149], [150], [151] provides examples of omics projects for drug, protein, and disease analysis.
Potential targets are molecules that are associated with a specific disease and have the smallest possible degree of association with other diseases. Complex diseases such as oncological, cardiovascular, and immune diseases are often regulated by multiple key genes, molecules, or signaling pathways, so it is often necessary to unravel the connection between multiple molecules and the disease. Omics data are essential for discovering and assessing the biological effects or toxicity of potential targets. For example, cancer stem cells (CSCs) cause great resistance to the treatment of lung adenocarcinoma (LUAD). Studying the expression of stem-cell-related genes in LUAD could provide new insights into the treatment of LUAD. Zhang et al. [
152] applied an unsupervised ML algorithm known as one-class LR (OCLR) to the molecular datasets of normal stem cells and their progeny to obtain the messenger RNA (mRNA) expression-based stemness index (mRNAsi), the DNA methylation-based stemness index (mDNAsi), and the epigenetic regulation-based mRNAsi (EREG-mRNAsi). These indices were then used to score the stemness of the LUAD cases in The Cancer Genome Atlas (TCGA). In this process, weighted gene co-expression network analysis (WGCNA) was used to find key genes associated with LUAD. In the end, 13 previously overlooked key genes with an overall association were identified, which could be used as potential targets for the treatment of LUAD by suppressing the stemness features.
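The first step of a weighted co-expression analysis can be illustrated with a short sketch. It computes the commonly used unsigned soft-threshold adjacency, in which the absolute Pearson correlation between two genes is raised to a power `beta`. The gene names, expression values, and the choice `beta=6` are hypothetical, and this is not a reconstruction of the settings used in the cited study; the later WGCNA steps (topological overlap, module detection) are omitted.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_threshold_adjacency(expression, beta=6):
    """Unsigned weighted adjacency |cor|^beta for every pair of genes."""
    genes = sorted(expression)
    return {
        (g, h): abs(pearson(expression[g], expression[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

# Hypothetical expression profiles across four samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.5],   # closely tracks gene_a
    "gene_c": [4.0, 1.0, 3.0, 2.0],   # weakly related to gene_a
}
adjacency = soft_threshold_adjacency(expression)
```

Raising the correlation to a power keeps strong co-expression close to 1 while pushing weak correlations toward 0, so the resulting network emphasizes tightly co-expressed gene groups.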
Since their release, the connectivity map (CMAP) and Library of Integrated Network-based Cellular Signature (LINCS)-L1000 databases, which contain a large amount of transcriptomic data following drug perturbations and various other environmental disturbances, have been widely used in research to identify the mechanism of action and targets of small molecule compounds, with the aim of discovering potential drugs for diseases or potential targets for drugs [
153], [
154], [
155]. The web service PharmMapper [
156], [
157], [
158] gathered 52 431 pharmacophore models from TargetBank, DrugBank, BindingDB, and the potential drug target database (PDTD), and used them to identify potential target candidates for the given probe small molecules by means of a fast pharmacophore mapping approach. ChemMapper [
159] is another web service that aims to predict polypharmacology effects, potential protein targets, and modes of action for small molecules based on 3D similarity computation, using a database containing 4 350 000 chemical structures with bioactivities and associated target annotations. The iDrug [
160] platform provides a versatile, user-friendly, and efficient online tool for computer-aided drug design (CADD) based on pharmacophore and 3D molecular similarity searching, enabling binding sites detection, VS, and drug target prediction in an interactive manner through a seamless interface. DeltaNet was designed by Noh and Gunawan [
161] based on an ordinary differential equation (ODE) model for analyzing gene transcription processes and predicting potential targets of compounds. There are two versions of DeltaNet, namely DeltaNet-LAR and DeltaNet-LASSO, which use least angle regression (LAR) and least absolute shrinkage and selection operator (LASSO) regularization to solve linear regression problems, respectively. DeltaNet outputs a predicted ranking of gene targets for further enrichment analysis to find other key molecular targets. Zhu et al. [
162] constructed a DL-based efficacy prediction system (DLEPS) to identify new drug candidates and discover targets. Trained on transcriptional profile data, mainly from the L1000 project, DLEPS uses changes in gene expression profiles in the state of disease as input. In addition to the discovery of three new drug candidates, DLEPS also demonstrated that mitogen-activated protein kinase kinase (MEK)-extracellular-signal-regulated kinase (ERK) was a critical signaling pathway in nonalcoholic steatohepatitis, knowledge that can be used to develop specific targets. The data mining analysis of such transcriptomes through ML and DL can help not only to find drug targets but also to elucidate the mode of action of drugs and disease mechanisms [
163].
The analysis of omics data has helped researchers to identify many overlooked disease candidate targets [
164]. With the advancement of sequencing technology and deeper research, the drawbacks of the deeper mining of only single omics data are becoming increasingly obvious, as such mining can neither reflect the relevance and variability of biological processes (e.g., simple gene expression levels do not reflect true protein expression levels) nor reveal complex biological systems and disease mechanisms (e.g., glycolytic processes are associated with genomics, proteomics, and metabolomics). In particular, disease onset often involves multiple pathways and requires the integration of multimodal data. For example, genes with increased DNA copy numbers have been found to be involved in important cancer pathways, and somatic mutation frequency and expression levels are also important factors in cancer drivers [
143], [
165], [
166]. By integrating information at multiple omics levels and mining the linear or nonlinear associations through AI approaches, candidate key factors can be identified at a more in-depth level, which is crucial for discovering candidate targets for diseases.
Complex diseases such as cardiovascular disease, schizophrenia, cancer, and Alzheimer’s disease (AD) have many therapeutic targets, and multiple potential causative genes can be discovered through the multi-omics features of individual patients. Jeon et al. [
31] used an SVM algorithm with a radial basis function (RBF) kernel to construct three models to predict potential targets specific to breast cancer (BrCa), pancreatic cancer (PaCa), and ovarian cancer (OvCa), respectively. Gene essentiality, gene expression, DNA copy number variation, somatic mutation, and PPI network topology were the main input features, and the SVM was able to exploit the associations and differences among these features to distinguish potential drug targets from non-target proteins. The model was evaluated with ten-fold cross-validation and had a high area under the ROC curve (AUROC) value and a low false-positive rate. By using the trained model to score 15 663 human proteins, a total of 122 global cancer targets were identified for all cancers (69 of which corresponded to the 116 known targets that were rigorously validated). In addition, a large number of potential targets specific to BrCa, PaCa, and OvCa were identified. These predictions are intended only as guidance; the identified proteins are candidate targets rather than validated drug targets.
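As a rough illustration of the AUROC metric mentioned above, the following sketch computes it directly from prediction scores. It is not taken from the cited study: the function name `auroc` and the toy scores and labels are hypothetical. It relies on the standard interpretation of AUROC as the probability that a randomly chosen positive is scored above a randomly chosen negative.

```python
def auroc(scores, labels):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for six proteins (1 = known target).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auroc(toy_scores, toy_labels)
```

In a cross-validation setting, a value of this kind would be computed on each held-out fold and then averaged.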
Moreover, using multi-omics data with PPI networks, a group developed a network-based Bayesian algorithm framework [
167] to infer loci for an AD genome-wide association study (GWAS) and revealed 103 AD risk genes (ARGs). This study included gene expression data from single cell transcriptomics, gene expression data from microarrays, and proteomics, fully demonstrating the ability of AI approaches to integrate multi-source and multimodal data to discover potential therapeutic targets.
ML has been instrumental in driving the learning process of multi-omics data, but it can be overwhelmed by larger multi-omics data and more complex problems. However, DL can handle much larger amounts of multi-omics data and unearth deeper associations. On the assumption that the drug inhibition of targets and target gene knockdown (KD) should lead to the occurrence of similar biological processes, resulting in similar mRNA expression profiles, Pabon et al. [
168] explored the direct feature correlation and indirect feature correlation between compound-induced features and gene KD in CMAP, and combined these features with other features such as PPIs as inputs into the RF model to predict drug targets. To better mine the correlation between chemical perturbation (CP) features and KD genetic perturbation features, Zhong et al. [
169] proposed a GCN model known as Siamese spectral-based graph convolution network (SSGCN) to mine transcriptomic data to predict compound-protein interactions (CPIs). SSGCN constructed two parallel GCN models for the feature extraction of CP profiles and KD profiles, respectively, where CP profiles and KD profiles were integrated with a PPI network (the attribute values of the network nodes were gene differential expression values, and if there was an interaction between two nodes, these two nodes were connected by an edge). Two sets of graph embedding vectors were obtained after feature extraction, and the degree of correlation between the CP features and KD features was obtained by means of a simple linear regression layer. The correlation was expressed as Pearson’s coefficient
R² and was fed to the classifier as a feature, along with cell line, CP time, dosage, and KD time, to determine whether compounds interact with the corresponding proteins. This model was subsequently validated externally and shown to be effective in identifying potential drug targets and facilitating drug repositioning studies.
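To make the correlation feature concrete, the sketch below computes Pearson's coefficient and its square for two embedding vectors. This is only an illustration under stated assumptions: SSGCN obtains its correlation through a learned regression layer, whereas this example applies the textbook formula to made-up vectors, and the names `pearson_r`, `cp_embedding`, and `kd_embedding` are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Made-up embeddings: the KD vector is exactly twice the CP vector.
cp_embedding = [0.2, 0.5, 0.1, 0.9]
kd_embedding = [0.4, 1.0, 0.2, 1.8]
r = pearson_r(cp_embedding, kd_embedding)
r_squared = r ** 2  # scalar that could be passed on as a classifier feature
```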
Most of these target discovery models use end-to-end models to directly discover druggable proteins. DL can also perform key roles in multiple specific steps in the target discovery process, such as predicting splicing from pre-mRNA transcript sequence using SpliceAI [
170], using scVI to predict and analyze gene expression probabilities in single cells from transcriptomic data [
171], and using PEDLA to predict enhancers [
172]. Some studies have performed a GWAS of COVID-19, with results suggesting a possible association with COVID-19 susceptibility in the 3p21.31 region of the chromosome. Building on these studies, Downes et al. [
173] used multiple DL approaches combined with multi-omics data to discover that the gain-of-function risk A allele of a single-nucleotide polymorphism (SNP), rs17713054G>A, may be a variant that can cause disease. Further analysis revealed that leucine zipper transcription factor like 1 (
LZTFL1), a gene regulated by rs17713054, was a critical gene for the development of epithelial-mesenchymal transition (EMT). EMT is a developmental pathway associated with lung inflammation that is frequently induced by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in lung cancer cell lines (CCLs) and the respiratory tract. As a key gene in this series of biological processes,
LZTFL1 could serve as a potential therapeutic target.
The use of AI approaches can help effectively predict drug responses in cancer cells to advance precision medicine [
174], [
175], [
176]. One group used elastic net regression and RF to identify how multi-omics data affect drug response prediction [
177]. In this study, 265 drugs across 990 CCLs were screened to construct pharmacogenomic datasets. To comprehensively investigate the influence of different combinations of molecular data, linear and nonlinear ML models were built. Among the genome-wide gene expression, DNA methylation, gene copy number, and somatic mutation data, gene expression data was the most predictive data type in pan-cancer analysis, and genomic data (i.e., driver mutations, copy number alterations, or DNA methylation data) was the most predictive data type in cancer-specific analysis.
The importance of multi-omics data in drug response prediction has also been demonstrated. However, most methods do not take drug/cell line specificity, drug/cell line, or drug-protein associations into consideration. To address this issue, Peng et al. [
178] combined multi-omics data with a GCN to construct an end-to-end model known as MOFGCN. Drug/cell line associations were used to initially construct a heterogeneous network in which the nodes were drugs or cell lines. The properties of the drugs were obtained by calculating the similarity of molecular fingerprints, and the properties of the cell lines were obtained by fusing multi-omics data (i.e., gene expression, copy number variation, and somatic mutation data) and calculating their similarity. The completely constructed heterogeneous network served as the input to a graph convolutional network, and the final features were obtained by passing messages between nodes to further learn the potential associations of drugs and cell lines. To predict drug sensitivity, a CCL-drug correlation matrix required further reconstruction based on a linear correlation matrix that was calculated from the updated features of drug and cell lines. The DL framework of predicting drug sensitivity, DeepDRK [
179], integrated mutations, copy number variation, DNA methylation, gene expression, and drug screening as cell line features and extracted molecular-protein information as drug features. Then, the two features were spliced as the features of a CCL-drug pair and were fed into the DNN to predict the drug sensitivity.
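The fingerprint-similarity step used to derive drug properties can be sketched as follows. This is a minimal illustration, assuming the Tanimoto (Jaccard) coefficient as the similarity measure, which the text above does not specify; the fingerprints are made-up sets of on-bit indices, and `tanimoto` and the drug names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Made-up fingerprints, each given as the set of bits that are switched on.
fingerprints = {
    "drug_A": {1, 4, 7, 9},
    "drug_B": {1, 4, 8},
    "drug_C": {2, 3},
}

# Pairwise drug-drug similarity, usable as node attributes in a drug network.
names = sorted(fingerprints)
similarity = {
    (m, n): tanimoto(fingerprints[m], fingerprints[n])
    for m in names
    for n in names
}
```

Cell line similarities could be tabulated in the same way once the fused multi-omics profiles have been reduced to comparable vectors.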
The combination of omics data and AI methods can help researchers quickly obtain the information they need at the molecular scale, as the various levels of omics data reflect the various processes of life activity. Integrating and analyzing this information can aid in the understanding of complex biological systems and thus assist in the discovery of new drug targets.
3.2. Drug-target interactions (DTIs) discovered using chemogenomics
The identification of DTIs is currently contributing to research in drug discovery. Newly discovered DTIs can be used to find new targets that interact with existing drugs or to discover new compounds that interact with a disease-related target. Therefore, research results on DTIs are widely used in the fields of lead compound discovery, new target discovery, drug repositioning, and drug side-effect prediction [
3], [
180], [
181]. Although HTS methods have been developed to determine the activity of thousands of compounds at once, they cannot match AI methods in terms of either cost or the number of compounds assessed. In general, methods for predicting DTIs are divided into three main approaches: ligand-based methods, structure-based methods, and chemogenomic methods. Each of these three approaches has its own advantages and disadvantages, with the third being the most widely applicable and popular. Therefore, this section focuses on reviewing chemogenomic methods, while the other two are covered in Section 4.
The chemogenomic approach not only uses drug-related and target-related information but also connects this information to multiple sources of biomedical information in order to better predict DTIs. Publicly accessible database resources contain a large amount of structured and unstructured biomedical data to support access to information. ML and DL can extract relevant functional information and reduce the noise from this large amount of heterogeneous data in order to discover new protein targets precisely and efficiently.
Table 3 [
37], [
54], [
55], [
57], [
58], [
182], [
183], [
184], [
185], [
186], [
187], [
188], [
189], [
190], [
191] lists some currently high-quality public databases.
Prediction of DTIs is usually regarded as a binary classification problem. It is very convenient to use an ML approach to predict DTIs, which usually only requires obtaining the SMILES of small molecules and the sequences of target proteins. These sequences are converted into feature vectors via different rules and are later used as inputs to a model to predict their final classification. These molecules and proteins are characterized in a variety of ways and often contain information about the physicochemical properties of the molecules and proteins, as well as their structure. A number of toolkits and libraries for molecule and protein representations have been developed and are listed in
Table 4 [
192], [
193], [
194], [
195], [
196], [
197], [
198], [
199], [
200], [
201], [
202], [
203], [
204], [
205], [
206], [
207], [
208], [
209], [
210], [
211], [
212], [
213], [
214], [
215]. For example, small molecules characterized using MACCS fingerprints were spliced with protein vectors characterized by CTD descriptors and used as inputs to an SVM to predict DTIs [
216]. The occurrence of a DTI is influenced by numerous factors and corresponds to multidimensional features that represent the structure and properties of the molecule and protein. It is hoped that the model can find out more about the mechanism of DTI from these features and then give classification judgments based on information. Such problems have also been treated as regression problems; DeepDTA is a CNN model that used the SMILES of small molecules with sequences of proteins to predict the affinity of small molecules with proteins [
217]. Using only single-feature representation does not fully characterize small molecules or proteins, so some studies have used multiple descriptors to characterize small molecules and proteins and have integrated these features as vectors of inputs to predict DTIs. This improves the classification performance of the model to a certain extent [
218]. In order to enable researchers to more conveniently use DL to make predictions about DTIs, Huang et al. [
219] proposed DeepPurpose, which implements more than 50 DL models (including CNN, MLP, RNN, etc.). DeepPurpose can encode proteins in seven distinct ways, including MLP on AAC, PAAC, conjoint triad, quasi-sequence descriptors, CNN on amino acid sequences, RNN on top of CNN, and transformer encoder on substructure fingerprints. For compounds, there are eight encoders, including MLP on Morgan, PubChem, Daylight fingerprint, RDKit 2D fingerprint, CNN on SMILES strings, RNN on top of CNN, transformer encoders on substructure fingerprints, and a message-passing GNN on a molecular graph. These encoding methods require only SMILES strings and amino acid sequences as input. In this way, researchers can conveniently predict DTIs using different encoding methods on different models.
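The feature-based setup described in this paragraph, in which a drug vector and a protein vector are concatenated and passed to a classifier, can be sketched with the amino acid composition (AAC) descriptor. This is an illustrative sketch only and does not use DeepPurpose or any cheminformatics toolkit: the 8-bit drug fingerprint is made up, and the names `aac` and `pair_features` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: fraction of each standard residue (20 values)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    valid = 0
    for residue in sequence.upper():
        if residue in counts:  # non-standard symbols are skipped
            counts[residue] += 1
            valid += 1
    if valid == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / valid for aa in AMINO_ACIDS]


def pair_features(drug_bits, protein_sequence):
    """Concatenate a drug fingerprint with the protein AAC vector."""
    return list(drug_bits) + aac(protein_sequence)


toy_drug_bits = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up fingerprint, not MACCS
toy_protein = "MKTAYIAKQR"                # made-up sequence
features = pair_features(toy_drug_bits, toy_protein)  # 8 + 20 = 28 values
```

A vector of this kind would be the input row for a binary classifier such as an SVM, with the label indicating whether the pair interacts.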
The abovementioned studies obtained good performance using only the SMILES strings of small molecules and the amino acid sequences of proteins. At the same time, it is important to integrate various data sources, such as drug-drug interactions, PPIs, and drug-disease associations, to predict DTIs. Bleakley and Yamanishi [
220] constructed a bipartite graph on DTI [
221], [
222] and applied an SVM model for DTI prediction in a later work. The four datasets constructed in this work have become the gold standard datasets for later DTI prediction models. Inspired by this work, there has been a proliferation of network-based approaches to predict DTIs. A computational pipeline called DTINet was then developed that integrated multiple heterogeneous data sources to construct networks on DTI [
223]. In this study, four drug similarity networks were constructed based on ① drug-drug interaction networks, ② drug-disease association information, ③ drug side-effect association information, and ④ chemical structure information. Similarly, three protein similarity networks were constructed based on ① PPIs, ② protein-disease associations, and ③ genomic sequences. Using these similarity networks, a network diffusion algorithm (random walk with restart (RWR)) was first applied on individual networks separately, and the feature vectors were optimized. The low-dimensional vector representations obtained after this learning process contained information derived from various heterogeneous data sources and were able to better represent the drug/protein-specific properties. The obtained vectors were then used to discover new DTIs according to their spatial correspondence with drugs and proteins.
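The network diffusion step can be illustrated with a small random walk with restart. This sketch is an assumption-laden toy, not the DTINet implementation: it uses the common update p ← (1 − r)·W·p + r·p0 with a column-normalized adjacency matrix W, a made-up four-node path network, and hypothetical names and parameter values (`restart=0.5`).

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * p0 until convergence."""
    n = len(adjacency)
    # Column-normalize so each column of W sums to 1 (isolated nodes stay 0).
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    w = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    p0[seed] = 1.0  # the walk starts from, and restarts at, the seed node
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy similarity network: a path 0 - 1 - 2 - 3 with unit edge weights.
toy_adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
diffusion_state = random_walk_with_restart(toy_adjacency, seed=0)
```

Each node's diffusion state summarizes its proximity to every other node; in a pipeline of the kind described above, such states would then be compressed into low-dimensional vectors.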
The use of DL models allows for the integration of heterogeneous data from multiple sources while providing a comprehensive characterization of drugs or biomolecules. Zeng et al. [
224] proposed a framework called deepDTnet to integrate heterogeneous data sources for the prediction of DTI. In this study, 15 networks—including genomics, GOA, protein-related similarity, and drug-related similarity—were integrated to construct a heterogeneous network connecting drug targets and disease information. A DNN for graph representation (DNGR) algorithm was developed to obtain the informative vector of both drugs and targets based on the constructed network. However, the lack of negative samples in public databases led to difficulties in the model training process; thus, a positive-unlabeled (PU)-matrix completion algorithm was employed to infer whether two drugs shared a target. The results showed that combining the heterogeneous data to re-represent the drug and target without a descriptor or fingerprint achieved an excellent performance.
As mentioned before, the emergence of large-scale knowledge of omics data, systems biology, chemistry, pharmacology, and so forth is providing new perspectives for DTI prediction. However, the integration of heterogeneous data from multiple sources undoubtedly introduces a huge amount of noise and does not solve the “cold-start” problem well. Here, knowledge graphs (KGs) stand out with their powerful ability to integrate heterogeneous information. By leveraging the interactions of phenotype, drug, target, and gene, a KG can help to further understand the molecular mechanism of a disease and to explore potential drug targets. Recent studies have integrated resources from several databases (DrugBank, TTD, ChEMBL, BindingDB, SIDER, GO, etc.) to construct KG such as BioKG, PharmKG, Hetionet, and drug-repurposing KG (DRKG) [
30], [
225]. A KG usually represents knowledge as a triple, which is composed of a head entity, relation, and tail entity. In the field of DTI recognition, the KG embedding (KGE) model is often used to represent entities and relations by means of low-rank vectors, in what is also known as the representation learning of KGs. The representation vectors obtained by a KG can be further used for link prediction to discover drug-target relationships [
30]. A KG typically integrates a huge amount of data with dozens or even hundreds of relationships. The vector obtained for a drug via a KG often captures the position and relationships of that entity in the biological network, but not its own structure or physicochemical properties. The same is true for proteins. To address this issue, Ye et al. [
118] developed a framework called KGE_NFM, which performs DTI prediction by combining a KGE technique with a recommendation system technique, the neural factorization machine (NFM). In this process, an accurate entity vector is first obtained from the potential information learned from the heterogeneous network via KGE. Next, the structural information of the drug and target is obtained from molecular fingerprints and protein descriptors. Finally, multimodal information is extracted using an NFM, and the DTIs are predicted using DL methods. This approach was tested for “cold-start” scenarios of drugs or proteins and achieved SOTA performance, particularly for protein “cold-start” scenarios.
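To show how triples and embedding vectors support link prediction, the sketch below scores (head, relation, tail) triples with a TransE-style function. TransE is chosen here purely as a familiar example of a KGE model and is not claimed to be the one used in the studies above; the two-dimensional embeddings are hand-picked rather than learned, and all entity and relation names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative L2 distance of (head + relation) from tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hand-picked toy embeddings (a trained model would learn these).
entity = {
    "drug_X": [0.1, 0.3],
    "protein_P": [0.6, 0.4],
    "protein_Q": [-0.8, 0.9],
}
relation = {"targets": [0.5, 0.1]}

# Link prediction: rank candidate tails for the query (drug_X, targets, ?).
candidates = ["protein_P", "protein_Q"]
ranked = sorted(
    candidates,
    key=lambda name: transe_score(entity["drug_X"], relation["targets"], entity[name]),
    reverse=True,
)
```

Here `protein_P` ranks first because its vector lies almost exactly at `drug_X + targets`. Such vectors carry network context only, which is why structural descriptors are added in the approach described above.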
In addition to the aforementioned methods for predicting DTIs, similarity-based [
226] and matrix decomposition-based methods [
227] can be used, among others, and have contributed greatly to DTI prediction in the past. With the development of DL, network-based methods, feature-based methods, and so forth are now being used in combination, bringing the advantages of each method into play to better predict DTIs and discover new targets [
228], [
229]. Based on recent studies in the field, DTI research methods can be roughly classified into six groups;
Table 5 [
217], [
221], [
223], [
226], [
227], [
230], [
231], [
232], [
233], [
234], [
235], [
236], [
237], [
238], [
239], [
240], [
241], [
242], [
243], [
244], [
245], [
246], [
247] provides a brief summary of the relevant strategies.
Future research should integrate omics data more closely with biomedical data networks for a more accurate characterization of drugs or proteins. Moreover, similarity approaches have a crucial effect on DTI prediction, and combining multiple similarity results may improve model performance. One common problem in model training is the unavailability of accurate negative datasets. Accurate DTI data in publicly available data sources are rigorously experimentally validated, and the experimental validation process for each one is exhaustive; however, most failed experiments are not reported. Furthermore, manual validation of data is time-consuming, and a large amount of data has not been validated for exact interactions. Therefore, DTI datasets should always be built from the latest and most comprehensive drug-target databases, such as TTD and DrugBank, and additional inactive experimental data should be made openly available to improve the current DTI data system.
4. SOTA application of AI to modern drug design
Drug discovery is a long-term and painstaking process. In the past decades, techniques such as HTS and combinatorial chemistry, as well as other techniques, played an important role in the discovery of lead compounds. Further structural modifications of the obtained lead compounds were then developed to reduce toxicities and improve efficacy. As these techniques gradually increased in popularity, however, their various disadvantages were gradually revealed. Similarly, in the 1980s, CADD was no less popular than today’s AI. For example, QSAR models were widely used as soon as they were proposed. However, in those days, QSAR-based models were limited by the available computing power, dataset size, and other issues, and their predictive performances were never satisfactory [
248], [
249], [
250].
In recent years, the advancement of computing power has driven the rapid development of AI, while positively promoting the development of computational chemistry and pharmacology. For example, various ML and DL methods were used in various Kaggle competitions to improve the predictive performance of QSAR methods, all of which achieved high performance [
78]. As mentioned above, DL allows the identification of new molecular representations instead of relying solely on off-the-shelf and expert-derived chemical signatures. AI algorithms relying on rich biomedical data show promising prospects in areas such as bioactivity prediction, VS of drugs, and
de novo drug design.
Before going into details, it is necessary to briefly introduce the concepts of structure-activity relationships (SARs) and QSARs. These two concepts are frequently used in drug design using ML and DL methods and are powerful aids in the design, optimization, and development of drugs. SARs are based on the assumption that molecules with similar structures have similar activity. In drug discovery, QSARs are based on various molecular characterization methods (e.g., molecular descriptors and molecular fingerprints) and mathematical models to describe the mathematical relationship between the structure of a molecule and its specific biological activity. A QSAR model assumes that the structure of a compound determines its physicochemical properties and biological activity; therefore, quantitative relationships can be established between the structure of a compound and its physicochemical properties, biological activity, toxicological effects, and so forth. The QSAR analysis process usually includes the preparation of preliminary datasets, the calculation and selection of molecular descriptors, the establishment of relevant models, and the evaluation and validation of model results [
248], [
251].
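The QSAR workflow outlined above (dataset preparation, descriptor calculation, model building, and validation) can be sketched with the simplest possible model, a one-descriptor least-squares line. All values are made up for illustration, the descriptor and activity scales are arbitrary, and `fit_line` is a hypothetical helper; real QSAR models use many descriptors and more rigorous validation.

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (one descriptor)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("descriptor has no variance")
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


# Step 1-2: toy training set of descriptor values and measured activities.
train_descriptor = [1.0, 2.0, 3.0, 4.0]
train_activity = [3.0, 5.0, 7.0, 9.0]

# Step 3: build the model.
slope, intercept = fit_line(train_descriptor, train_activity)

# Step 4: validate on a held-out compound that was not used for fitting.
held_out_descriptor, held_out_activity = 5.0, 11.0
predicted = slope * held_out_descriptor + intercept
abs_error = abs(predicted - held_out_activity)
```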
4.1. Cutting-edge techniques facilitating VS
VS has remained in use for the past decade or so. In order to reduce the number of compounds that actually need to be measured and increase the efficiency of lead compound discovery, the
in silico approach is used to simulate the interaction between a target and a small molecule and predict the affinity between the two before a bioactivity test is performed [
252]. VS methods are often classified into structure-based VS (SBVS) or ligand-based VS (LBVS) [
253], [
254], [
255]. The combination of AI and VS has brought a new dynamism to the field. A variety of molecular characterization approaches combined with various novel model architectures have provided new insights into the discovery of new compounds [
9].
SBVS selects potential ligands based on the 3D conformation of the protein and scores the ligand’s ability to bind to the protein based on the inputted knowledge of biophysical methods, resulting in a ranking of drug candidates. Previously, simulations using various docking software were the dominant approach and resulted in many algorithms, such as Monte Carlo (MC) algorithms [
256] and molecular dynamics (MD) algorithms [
252], [
257], [
258]. A primary limitation of the simulation results is the construction of the scoring function, which must take many factors into account along with their plausibility as parameters. AI takes these many factors as features of the data, implicitly learns the relationship between the features and the experimental results, extracts useful nonlinear mapping relationships from them, and gives a final score. A VS method known as ID-Score [
120] selected nine classes of property descriptors (i.e., van der Waals interaction, hydrogen-bonding interaction, electrostatic interaction, π-system interaction, metal-ligand bonding interaction, desolvation effect, entropic loss effect, shape matching, and surface property matching) as features, used 2278 compounds as the training set, and used a support vector regression (SVR) algorithm to fit the binding affinity of small molecules to proteins. The results showed that ID-Score can correctly distinguish structurally similar ligands, demonstrating its use as a powerful tool for assessing structure-based drug-protein affinity.
In another study, a CNN was used to score protein-ligand binding. Unlike traditional methods, CNNs can accept 3D representations of protein-ligand interactions as input. During the training of the model, the CNN learns the key features affecting binding from the 3D representation, which are then used to distinguish correct from incorrect binding poses and known binders from nonbinders. Xie et al. [
259] took a different perspective to improve the efficiency of lead compound discovery by combining an SVM classification model with a docking-based VS method. More specifically, they developed an SVM model to distinguish inhibitors of the target from non-inhibitors and performed a docking-based VS on this basis. This combination greatly improved the hit rate and enrichment factor of the VS. In contrast to the work by Xie et al. [
259], Pereira et al. [
260] developed DeepVS, which uses a DL approach to optimize docking-based VS. In this study, the Directory of Useful Decoys (DUD) [
261] was used as the benchmark dataset to evaluate the method. DOCK [
262] and AutoDock Vina 1.1.2 [
263] were used as docking programs to generate protein-compound complexes. The protein-compound complexes were then preprocessed, and the results were fed into the CNN model as input. The CNN model extracted the key features from these data and used them to score the ligands. The results showed that the proposed DeepVS achieved advanced performance on VS.
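The enrichment factor mentioned above can be computed as in the sketch below. It assumes the common definition, namely the hit rate among the top-ranked fraction divided by the hit rate in the whole library, and it assumes that higher scores are better (many docking programs report the opposite, in which case scores would be negated first). The scores, labels, and the name `enrichment_factor` are hypothetical.

```python
def enrichment_factor(scores, is_active, top_fraction=0.1):
    """Hit rate in the top-ranked fraction divided by the overall hit rate."""
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need a non-empty library with at least one active")
    n_top = max(1, int(round(n * top_fraction)))
    # Rank compounds from best to worst score (assumption: higher is better).
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(active for _, active in ranked[:n_top])
    return (hits_in_top / n_top) / (total_actives / n)


# Toy screen: ten compounds, two of which are true actives.
toy_scores = [9.1, 8.7, 7.5, 7.2, 6.8, 6.1, 5.5, 4.9, 4.0, 3.2]
toy_actives = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(toy_scores, toy_actives, top_fraction=0.2)
```

A value above 1 means the ranking concentrates actives near the top better than random selection would.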
In comparison with the SBVS approach, which is limited by the structural information of the target protein, LBVS can make full use of the known ligand bioactivity data and screen a large database of compounds to discover potential lead compounds. Therefore, AI-based VS tends to favor LBVS. The starting point of LBVS is the assumption that structurally similar compounds have similar biological activities; thus, the AI methods used in this field include both regression models for activity prediction and classification models based on compound similarity.
QSAR is widely used in LBVS because of its use of mathematical models to relate molecular structures to quantitative biological activities. NB, RF, and SVM are very popular algorithms in LBVS. AbdulHameed et al. [
264] screened a database with nearly 2000 compounds using a QSAR-based model with an NB algorithm and using the physicochemical properties of the molecules as features. Finally, it was found that activators of pregnane X receptor (PXR) tend to be hydrophobic, while the
in vitro and
in vivo activities are often consistent. Profile-QSAR 2.0 was presented to predict the activity of compounds [
265]. Compared with the earlier profile QSAR (pQSAR) 1.0 method, the pQSAR 2.0 method used the historical activity values of the compounds as variables. The optimized pQSAR used an RF model to predict the half-maximal inhibitory concentration (IC50) values, achieving the same accuracy as the medium-throughput four-concentration IC50 measurements. Chen and Visco [
266] created a pipeline integrating QSAR with an SVM model to identify the inhibitors of Cathepsin L. They used a signature—a descriptor based on fragments—as the model’s input. After optimizing the model, nine out of 12 screened compounds were experimentally confirmed. ANNs are another commonly used tool in QSAR studies. Myint et al. [
267] reported an ANN-based QSAR method called fingerprint-based ANN (FANN)-QSAR that uses three different molecular fingerprints: ECFP6, FP2, and MACCS. The well-trained model was used to predict the affinity of cannabinoid ligands and found compounds with a good affinity for cannabinoid receptor type 2 (CB2). In another study, the minimum inhibitory concentration (MIC) of quinolones was determined by using topological descriptors in an ANN [
268]. As more DL methods have gradually been used for QSAR-related studies, researchers have found that DL tends to outperform ML in both single-task and multi-task learning [
269], [
270], [
271].
QSAR methods are not the only tools used for LBVS [
272], [
273], [
274]. Li et al. [
275] used multiple ML methods to construct classification models to select liver X receptor (LXR) agonists. During this process, optimized property descriptors and topological fingerprints were used to characterize small molecules in the database and constitute a total of 324 models with four algorithms: NB, SVM, KNN, and recursive partitioning (RP). The top 15 models were selected for evaluation, and ten models were found to have an accuracy of more than 90%. In another study, an SVM with NB was used to identify butyrylcholinesterase (BuChE) inhibitors [
276]. Initially, 1870 descriptors were selected; after analysis, activity-related descriptors were then selected to reduce noise. A better performance was eventually achieved. There are also numerous examples of self-organizing mapping (SOM) being used in LBVS [
277]. For example, Hristozov et al. [
278] used SOM as a model to recognize and exclude compounds that are unlikely to have specific biological activity. The power of SOM has also led to its use in some software [
279].
With the rapid increase in the number of known compounds in recent years, DL architecture has been found to be more suitable for processing large compound datasets. One group trained with existing HTS data and used a molecular graph as input to a neural network to learn molecular representations [
280]. Compounds with similar representations were then assigned in the neighboring hyperdimensional feature space. After learning the features, the similarity to drug molecules in a large compound library was measured using cosine similarity, and the small molecules in the library were ranked and filtered to obtain lead compounds. Unlike the use of graph models to generate the features of small molecules, adversarial AEs (AAE) were used by Kadurin et al. [
281] to construct a small molecule feature generator. Based on the obtained features, 72 million compounds in PubChem were screened to discover potential anticancer drug molecules. CNNs are widely used in image recognition; thus, for the purpose of using CNN models in drug research, molecules or proteins are often characterized in the form of matrices. Xu et al. [
282] directly used images of molecules as input to CNN models to screen for inhibitors of cyclin-dependent kinase 4 (CDK4) and achieved better results than competing models. The use of DL for LBVS has been increasingly studied in recent years, and models such as RNN [
283] and RL [
284] have been used for drug discovery, providing more opportunities and benefits for LBVS.
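The similarity-based ranking step described above can be illustrated with a minimal sketch. The three-dimensional vectors and compound identifiers below are hypothetical stand-ins for learned molecular representations, and the helper names `cosine_similarity` and `rank_library` are introduced here for illustration only; they are not taken from the cited study.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_library(query, library, top_k=None):
    """Return (compound_id, similarity) pairs, most similar first."""
    scored = [(cid, cosine_similarity(query, vec)) for cid, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]

# Hypothetical embeddings; a real model would produce much longer vectors.
query_drug = [1.0, 0.0, 1.0]
library = {
    "cmpd_A": [2.0, 0.0, 2.0],  # same direction as the query
    "cmpd_B": [0.0, 1.0, 0.0],  # orthogonal to the query
    "cmpd_C": [1.0, 1.0, 0.0],
}
ranking = rank_library(query_drug, library)
for compound_id, similarity in ranking:
    print(compound_id, round(similarity, 3))
```

Because cosine similarity depends only on the direction of the vectors, compounds whose representations differ in magnitude but not in direction receive the same score as the query itself.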
Overall, efficient lead compound discovery through VS is still a huge challenge, as there is no satisfactory way to address issues such as the activity cliff. AI algorithms are powerful tools that can be used not only for SBVS but also for LBVS to help break through the relevant challenges and assist in de novo drug design. As the complexity of algorithms increases and high-quality data becomes available in future, bottlenecks in existing technologies will continue to be broken, facilitating the discovery of new drugs.
4.2. Recent progress in de novo drug design
The aim of drug design is to obtain drugs that satisfy specific criteria, including efficacy, safety, reasonable chemical and biological properties, and structural novelty. In recent years, de novo drug design with the help of deep generative models and reinforcement learning algorithms has been considered to be an effective means of drug discovery. This approach can bypass the drawbacks of the traditional experience-based drug design paradigm and allow computers to learn the drug targets and molecular features by themselves, generating compounds that meet specific requirements faster and at lower cost [
285], [
286], [
287].
De novo drug design according to protein structure used to be the dominant approach. In this approach, whether designing new molecules directly from protein structures or making reasonable inferences from the properties of known ligands, the corresponding ligands are designed according to the spatial and electric potential constraints of the target protein binding pocket in order to discover molecules with specific properties. A huge limitation of these early approaches was that the resulting new molecules were not chemically accessible—that is, their structures were practically impossible to synthesize or extremely difficult to produce, or the molecules had poor druggability. In addition, many
de novo drug design approaches utilize fragments of molecules with known properties for molecular assembly, and use large libraries of molecular fragments to generate and design molecules with novel structures while ensuring that the molecules can be synthesized. However, this approach relies on chemical knowledge to replace or add molecular fragments, which will restrict the search space and ignore certain potential molecular structures. The generation of new molecules with deep generative models and the targeted optimization of models with reinforcement learning algorithms can solve the problems of the above traditional methods in a more satisfactory way [
288], [
289], [
290].
Deep generative models are of great advantage in the field of de novo drug design, as they do not require explicit prior input of chemical knowledge during the generation of molecules. These models can search in a broader unknown chemical space to automatically design novel molecular scaffolds beyond the limitations of existing molecular scaffolds. Deep generative models that are widely used for de novo drug design include RNN-based generative models, variational AEs, AAEs, and GANs. The process of designing molecules with generative models is highly stochastic, and the generated molecules are highly variable in structure and uneven in quality. Reinforcement learning can enable generative models to perform targeted optimization by fine-tuning the model parameters so that the generated molecules have specific drug molecule properties.
RNN-based generative models can generate compounds with similar biochemical properties as the sample compound but with a completely new scaffold structure. The training process starts by using a large chemical database to train the RNN model so that the model can learn how to generate the correct chemical structure. Reinforcement learning algorithms are then used to fine-tune the RNN parameters so the model is capable of mapping generated chemical structures to a specified chemical space. Reinforcement learning enables the RNN-based generative model to generate new molecules with promising pharmacological properties, while ensuring the structural diversity of the generated molecules. A single reinforcement learning reward mechanism often leads to relatively simple structures of the generated molecules, so an appropriate and multi-perspective reward function must be selected to guide molecule generation. Olivecrona et al. [
123] developed a sequence-based approach to
de novo drug design called REINVENT. First, the researchers collected 1.5 million molecules from the ChEMBL database that satisfied specific requirements and used SMILES of these molecules to train the RNN model to learn the characteristics of active molecules and generate new molecules. The generated molecules were then scored using a reinforcement learning algorithm to fine-tune the RNN parameters, so that new compounds with activity against a specific target could be generated. This method was applied to several different molecule generation tasks in the study, including the generation of sulfur-free molecules, backbone expansion from a single molecule to generate celecoxib-like structures, and the generation of new inhibitor molecules for type 2 dopamine receptors.
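The need for a multi-perspective reward function, noted above, can be illustrated with a minimal sketch. It assumes that each component score has already been scaled to the range [0, 1] and combines the components with a weighted geometric mean, so that a molecule scoring zero on any single criterion receives a total reward of zero. This aggregation rule, the component names, and the numerical values are assumptions made for illustration; they are not the scoring scheme reported for REINVENT.

```python
def clamp01(x):
    """Restrict a score to the interval [0, 1]."""
    return max(0.0, min(1.0, x))

def combined_reward(scores, weights):
    """Weighted geometric mean of component scores.

    scores  : dict mapping component name -> score in [0, 1]
    weights : dict mapping component name -> non-negative weight
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    reward = 1.0
    for name, score in scores.items():
        reward *= clamp01(score) ** (weights[name] / total_weight)
    return reward

# Hypothetical component scores for two generated molecules.
weights = {"activity": 1.0, "drug_likeness": 1.0}
balanced = combined_reward({"activity": 0.81, "drug_likeness": 0.81}, weights)
one_sided = combined_reward({"activity": 1.0, "drug_likeness": 0.0}, weights)
print(balanced, one_sided)
```

A geometric mean penalizes one-sided molecules more strongly than an arithmetic mean would, which is one possible way to discourage the generator from optimizing a single criterion at the expense of the others.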
Another area in which RNN-based generative models are applied in drug design is the optimization problem of lead compounds [
291]. A new molecular generation algorithm called scaffold-constrained molecular generation (SAMOA) was proposed to solve the scaffold constraint problem in lead compound optimization. The study used an RNN generative model to generate SMILES sequences of new molecules, and then used a refined sampling procedure to implement the scaffold constraint and generate molecules. A policy-based reinforcement learning algorithm was also applied to explore the relevant chemical space and generate new molecules with the desired properties. The DeepFMPO framework proposed by Ståhl et al. [
292] started from an initial set of lead compounds and modified the structure of these lead molecules by replacing some of their fragments. This study confirmed the wide use of RNN-based generative models in the field of molecular generation.
As deep generative models, VAEs are often used in various generative tasks, including the
de novo design of small molecules and the generation of peptide sequences. A group constructed a molecular generation model based on a conditional VAE for
de novo molecular design with a three-layer RNN for both the encoder and decoder. The results demonstrated that this model can design drug-like molecules with five target properties and can also tune individual molecular properties without affecting other properties [
124].
In 2019, Insilico Medicine published a study [
28] on the rapid de novo design of potent discoidin domain receptor 1 (DDR1) kinase inhibitors using a VAE. Several new compounds with inhibitory activity against DDR1 kinase were identified, chemically synthesized, and experimentally validated in just 21 days. This study demonstrated the potential of the method to perform fast and efficient molecular design. The generative tensorial reinforcement learning (GENTRL) model used in the study consists of two main components: a VAE and a policy gradient reinforcement learning algorithm. The VAE is used to generate new molecules, while the reinforcement learning fine-tunes the model parameters to make the new molecules generated by the VAE more consistent with the expected properties. The encoder of the VAE encodes known molecules into latent vectors. The decoder samples from the latent space and decodes the sampled latent vector into a new molecule. A reinforcement learning algorithm is used to guide the directed optimization of the VAE during the training process. After model construction, Insilico Medicine used GENTRL to generate four new active compounds, two of which were validated in cellular experiments. Moreover, one of the lead compounds was tested in mice and was shown to have good pharmacokinetic properties. This study provides strong evidence that reinforcement learning combined with deep generative models can accelerate the process of and provide new insights into de novo drug design.
GANs are capable of generating new samples with a similar distribution to real data and have advantages in the fields of image recognition and natural language processing (NLP). In the pharmaceutical field, GANs are often integrated with techniques such as feature learning and reinforcement learning, and have played an important part in protein function prediction, small molecule generation, and more. Various molecular generation models have been constructed based on GANs, such as Mol-CycleGAN [
293], objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC) [
294], and reinforced adversarial neural computer (RANC) [
295]. ORGANIC is a well-known molecular generation model that has become a comparative baseline model for other models. Its combination of a GAN model and a reinforcement learning algorithm can generate novel and effective molecules. The molecule generation performance of the RANC model has surpassed ORGANIC in many aspects, including the ability to generate new molecular structures and drug-like properties of molecules, which allows the design of active new molecules for different biological targets and covers a wide chemical space.
In addition, Harel and Radinsky [
296] proposed a molecular template-driven neural network that combines a VAE, CNN, and RNN to generate chemical structures with similar properties to the template molecules while being structurally diverse. The researchers found that the proportion of effective molecules among the generated molecules was significantly enhanced by adjusting the sampling process of the VAE.
Molecules designed by computer must not only have good physicochemical properties but also be highly active and selective for the target under study; therefore, the question of how to set up an effective reward function is an important challenge in reinforcement learning. A combination of the framework of deep generative models with reinforcement learning algorithms drives the development of the drug design field and will have significant applications in the future in the de novo design of small-molecule and peptide drugs.
4.3. Application of advanced techniques in antibody design
Owing to the wide application of ML and DL in chemistry, biology, and medicine, as well as their use in basic research in various fields, researchers now have a much deeper understanding of biomolecules and systems biology. In the future, drug R&D will remain oriented toward small molecules, while innovative biologic drugs will also gain ground. Accordingly, many DL approaches already exist for the study of biological macromolecule drugs, such as oligonucleotides, monoclonal antibodies, and peptides with specific pharmacological properties. Here, we elaborate on the design of antibodies.
Since antibodies are inherently biological macromolecules, the characterization of antibodies is similar to the encoding of proteins and RNAs. There are six general strategies for encoding antibodies: “one-hot” encoding, substitution matrix, amino acid properties, learned amino acid properties, encoding of supplementary attributes, and encoding of structural features [
297]. The application of AI in antibodies is different from its application in ordinary biomolecules because antibodies are biological agents that can be used for disease treatment. Therefore, the design of antibodies has more in common with the design of drugs, since safety and efficacy of drugs must be taken into account. At present, AI-based methods are often used for antibody structure prediction, antigen-antibody binding prediction, antibody generation/design, deimmunization studies, and antibody sequence-based studies [
297].
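Of the encoding strategies listed above, one-hot encoding is the simplest and can be sketched in a few lines. The sketch assumes the 20 standard amino acids in an arbitrary fixed order and maps any other symbol (for example, a gap or an unknown residue) to an all-zero row; both choices, as well as the short example fragment, are illustrative assumptions rather than a prescribed scheme.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_encode(sequence):
    """Encode a sequence as a list of 20-dimensional 0/1 rows, one per residue."""
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        if residue in INDEX:  # non-standard symbols stay all-zero
            row[INDEX[residue]] = 1
        rows.append(row)
    return rows

# Hypothetical five-residue fragment used only to show the output shape.
encoded = one_hot_encode("CARDY")
print(len(encoded), "residues x", len(encoded[0]), "channels")
```

The resulting matrix carries no information about the similarity between residues, which is the gap that substitution matrices and learned amino acid properties are intended to fill.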
The AlphaFold2 DL system has been able to solve most of the protein structure prediction problems; however, for antibody structure prediction, as a special subfield of protein structure prediction, it is necessary to capture the subtle differences in the structure with extreme precision. Many methods have been developed to solve this problem, such as DeepAb [
298] and DeepH3 [
299]. To perform VS for the binding of antibodies to target antigens, a structure-based framework called DL for antibodies (DLAB) was proposed to improve antibody-antigen dockings [
300]. As DLAB is a structure-based approach, it can optimize the pose ranking of antibody docking experiments and select antibody-antigen pairs for which accurate poses are generated and properly ranked. This approach has also demonstrated that the SBVS of antibodies can strongly complement traditional experimental screening methods.
The search for new antibody sequences is a major research hotspot in antibody discovery. Early computational approaches attempted to use enumeration methods for new sequence discovery and subsequent prediction work. Although these methods reflect the diversity of designed antibodies, they do not explain the discoveries in a biological sense and are therefore not fully convincing. Recently, the latent features of antibodies, including the frequency of amino acids at each position and the physicochemical properties of the antibody, have been learned by GANs or VAEs [
301]. These methods provide a new way of thinking and a new approach for antibody generation and design, which can be relied upon in the future to design therapeutic antibodies via DL.
The directions for the development of antibody drugs discussed above stem from a starting point that is similar to that of the design of small molecule drugs. Antibodies can be designed differently than traditional drugs due to their large molecular weight and attributes such as biomolecular function. In designing an antibody drug, it is necessary to consider the immune response the drug elicits when it enters the body. Thus, it is critical to use ML algorithms for analysis of next-generation sequencing (NGS) data to carry out deimmunization studies of antibodies [
302]. In addition, antibodies similar to human antibodies must be designed without loss of activity during the humanization process [
303]. Novel humanization (e.g., Sapiens) and humanness evaluation methods (e.g., OASis) are two data-driven approaches to address these issues. Sapiens uses a masked language model (MLM) to learn the humanization method of antibodies, while OASis is used to evaluate the humanness of an antibody sequence. BioPhi successfully combined these two algorithms to capture the intrinsic features of antibody complexes and provide similar mutation selection to that used experimentally for humanized mutations. This achievement indicates that DL will be indispensable in the deimmunization studies of antibodies. Another major feature of DL in antibody research is its ability to use NLP to learn and encode the antibody space to reveal new insights into the biological function of antibodies. For example, antibody-specific bidirectional encoder representation from transformers (AntiBERTa) [
304] and AbLang [
305] can learn the bidirectional context of antibody sequences and, based on this context, can infer the residues in specific masked regions.
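The masking step that underlies masked language models can be sketched as follows. The sketch hides a fraction of the residues in a sequence and records the hidden residues as prediction targets; the model itself is omitted. The 15% masking fraction, the `*` mask symbol, and the example fragment are assumptions made for illustration and are not settings reported for Sapiens, AntiBERTa, or AbLang.

```python
import random

MASK = "*"  # placeholder mask symbol chosen for this sketch

def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of residues.

    Returns the masked sequence and a dict {position: original residue}
    that a masked language model would be trained to recover.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return "".join(tokens), targets

# Hypothetical 20-residue heavy-chain fragment.
original = "EVQLVESGGGLVQPGGSLRL"
masked, targets = mask_sequence(original)
print(masked)
print(targets)
```

During training, the model sees only the masked sequence and is scored on how well it recovers the residues stored in `targets`, using the residues on both sides of each masked position.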
When conducting antibody drug research, DL can be used to connect the microscopic properties of molecules with the macroscopic results of experiments and provide additional insights into the biology associated with immunoglobulins. Therefore, DL approaches are increasingly being applied in the research and design of therapeutic antibodies to enable the efficient development of new antibodies and provide a new strategy for the future pipeline of antibody design. Overall, AI has shown promising power in drug target identification and new drug discovery.
Fig. 3 depicts a generic workflow using AI for target and drug identification.
5. Application of AI to preclinical drug research
Preclinical studies focus on non-clinical pharmacology, pharmacokinetics, and toxicology studies. The physicochemical properties of a drug and its ADMET properties are essential for pharmacokinetic and toxicology studies [
33], [
306]. Unsuitable properties of drug candidates will lead to the failure of the expensive drug development phase [
307]. The failure rate and loss of clinical studies can be decreased by early evaluation of the relevant properties of drug candidates.
5.1. Prediction of physicochemical properties
The ADMET properties of a drug candidate can be directly influenced by its physicochemical properties and will have a critical impact on the success of a drug entering the market [
308], [
309]. For example, the ionization constant (pKa), which is the fundamental parameter underlying properties such as octanol-water distribution coefficient (logD) and solubility, affects the aqueous solubility of a molecule, which can in turn affect the drug formulation method. Moreover, the ADMET of compounds under different pH conditions are profoundly influenced by the charge state of the compounds [
310]. Although lead compounds with promising drug-like properties may not always be successfully marketed, promising properties are still an inspiration for drug design. However, physicochemical properties are not easily measured directly, and accurate prediction of the properties of small molecule drug candidates facilitates further structural optimization of small molecules until they are designed to meet the desired properties.
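The dependence of the charge state and logD on pKa and pH can be made concrete with a short worked sketch for a monoprotic acid. It relies on the Henderson-Hasselbalch relationship and on the common simplifying assumption that only the neutral species partitions into octanol, which gives logD = logP - log10(1 + 10^(pH - pKa)). The logP and pKa values below are hypothetical.

```python
import math

def fraction_ionized_acid(pka, ph):
    """Fraction of a monoprotic acid present in its ionized (deprotonated) form."""
    return 1.0 / (1.0 + 10 ** (pka - ph))

def log_d_monoprotic_acid(log_p, pka, ph):
    """logD of a monoprotic acid, assuming only the neutral form enters octanol."""
    return log_p - math.log10(1.0 + 10 ** (ph - pka))

# Hypothetical acidic compound with logP = 3.0 and pKa = 4.5.
log_p, pka = 3.0, 4.5
for ph in (2.0, 4.5, 7.4):
    print(
        f"pH {ph}: ionized fraction = {fraction_ionized_acid(pka, ph):.3f}, "
        f"logD = {log_d_monoprotic_acid(log_p, pka, ph):.2f}"
    )
```

At pH values well below the pKa, the compound is mostly neutral and logD approaches logP; as the pH rises above the pKa, the ionized fraction grows and logD falls, which is one reason why the same compound can behave differently in different physiological compartments.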
Some approaches for predicting the physicochemical properties of molecules focus on predicting a certain physicochemical property, such as lipophilicity [
311] or aqueous solubility [
312], while others predict several physicochemical properties together [
99]. Although molecules can be represented in a variety of ways, predictions for a single property may use certain specific features, such as the number of hydrogen bonds [
313] and the connectivity indices of various molecules [
314] correlated with solubility. To date, accurate prediction of the aqueous solubility of small molecules remains a challenge [
315], but DL methods have been found to be more effective than previous ML methods in this endeavor [
316]. In the second challenge to predict aqueous solubility, one of the models [
317] combined an NLP approach to obtain embedding vectors based on small molecule SMILES, in order to feed these vectors into the transformer model for predicting molecular aqueous solubility. Francoeur and Koes [
317] found that overly complex models did not perform as well as small DL models in this task, which may be due to overfitting of the model as a result of the complex model and the smaller amount of data.
To address the issue of simultaneously predicting several physicochemical properties of small molecules, researchers have focused on molecular feature learning and characterization; examples include molecular feature learning and representation based on a GNN architecture [
98], combining traditional molecular representation approaches with features learned by message-passing neural networks (MPNNs) [
99], and a form of graphical representation of molecular design based on extended-connectivity circular fingerprints (ECFPs) [
318]. Shen et al. [
319] proposed a new form of molecular representation that involved first calculating the distance matrices of molecular fingerprints and the molecular descriptors of eight million molecules, respectively, and then reducing the distance matrices to two dimensions via uniform manifold approximation and projection (UMAP) to form a scatter plot. Next, the dimensionality-reduced scatterplots were assigned to 2D grid maps using the Jonker-Volgenant (J-V) algorithm. Finally, the data was divided into different channels based on different molecular fingerprints or descriptors. These molecular representation forms were fed into a CNN for the prediction of molecular properties, achieving a SOTA performance on multiple datasets.
5.2. Prediction of ADMET-related properties
Clinical trial failures are often attributed to inadequate ADMET studies of the drug, rather than to a lack of efficacy. The “absorption, distribution, metabolism, excretion (ADME)” portion of ADMET often determines whether a drug molecule will reach the target protein in vivo, which proteins will transport or metabolize the drug [
47], [
320], how long it will stay in the blood, and when it will be inactivated, while the “T” portion (i.e., toxicity assessment) is a major concern in phase I clinical trials. If the risk of clinical trial failure can be reduced via thorough preliminary ADMET studies, significant money and time costs will be avoided [
321], [
322]. With hundreds of compounds waiting to be evaluated for their ADMET properties in the early drug discovery phase, it would be time-consuming and expensive to validate each one through extensive animal studies. Therefore, the use of AI to rapidly and accurately predict the ADMET properties of drugs has been widely adopted [
323].
QSAR and quantitative structure-property relationship (QSPR) models play pivotal roles in the ADMET prediction of small molecules. Many ML methods, in combination with QSAR or QSPR models, have performed well in ADMET prediction [
324]. Most of these ML methods focus on several ADMET properties [
325], such as human ether-a-go-go related gene (hERG)-mediated cardiotoxicity [
326], blood-brain barrier penetration [
327], permeability glycoprotein (P-gp) [
328], cytochrome P450 (CYP) enzyme family [
329], acute oral toxicity [
330], carcinogenicity [
331], mutagenicity [
332], respiratory toxicity [
333], or irritation/corrosion [
333]. Zhu et al. [
334] used a QSPR model to predict the blood-brain partition coefficient (logBB). The researchers used four ML methods (SVM, multivariate linear regression, multivariate adaptive regression splines, and RF) to predict this property for 287 compounds and found that the polar surface area and octanol-water partition coefficient were strongly relevant to the blood-brain partitioning. A CYP enzyme inhibition prediction model based on the C5.0 algorithm (a decision tree algorithm) was constructed using several molecular fingerprints or molecular descriptors as inputs to predict the inhibition of five CYP enzymes involved in drug oxidation or hydrolysis [
335].
Most ADMET datasets are imbalanced and high-dimensional, and ensemble learning has been applied to deal with both problems. The processing of imbalanced data, the combination of multiple models, and optimization steps have been integrated to form an adaptive ensemble classification framework (AECF) [
336]. Yang et al. [
336] used AECF to predict a variety of ADME properties using multiple ML methods; their results all had satisfactory AUROC values ranging from 0.78 to 0.91. This ensemble approach was demonstrated to be a very useful multi-classification system through validation with the DrugBank database.
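Since AUROC is the metric quoted above and is commonly used for imbalanced classification, a minimal sketch of its computation may be helpful. It uses the rank-based definition, in which AUROC is the probability that a randomly chosen positive compound receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are hypothetical and do not come from the cited study.

```python
def auroc(labels, scores):
    """Area under the ROC curve from binary labels (1/0) and predicted scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical imbalanced set: 2 positives and 6 negatives.
labels = [1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1, 0.35, 0.05]
print(round(auroc(labels, scores), 3))
```

Because the metric depends only on how positives are ranked relative to negatives, it is not inflated by a classifier that simply predicts the majority class, which is not true of plain accuracy.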
DL approaches are also widely applied to the prediction of ADMET properties. For example, a classical feed-forward back-propagation neural network (BPNN) architecture and a repeated double cross-validation (rdCV) approach were combined to estimate the blood-brain barrier penetration [
337]. DL allows a model to be trained using a larger and more representative dataset, ensuring that a wider variety of compounds are covered than is possible with conventional ML. Validated with external datasets, this method predicts values that are in good agreement with many experimentally derived logBB values. Another work similarly demonstrated that neural networks generally outperform conventional ML methods for ADMET property prediction. Montanari et al. [
121] predicted seven different ADMET properties corresponding to each of the following endpoints: logD, solubility, melting point, membrane affinity, and human serum albumin binding. Moreover, Wang et al. [
338] developed a DL model to predict drug metabolites with an accuracy superior to that of the popular rule-based method systematic generation of potential metabolites (SyGMa). In a comparison of a multi-task graph convolutional model, a fully connected neural network, and an RF model, the multi-task graph convolutional model performed the best. However, for more complex tasks, such as the prediction of Caco2 permeation or in vitro metabolic stability, multi-task graph convolutional networks were unable to achieve good results, probably because the simplicity of the model constructed in this study hindered it from learning deeper features. In addition, the construction of the multi-task model in this study was a trial-and-error exercise, as there were no established experience or rules about which tasks should be combined.
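The repeated double cross-validation (rdCV) scheme mentioned above for the BPNN model can be sketched as a generator of index splits. In this sketch, an outer loop sets aside a test fold, an inner cross-validation on the remaining samples is available for model selection, and the whole procedure is repeated with different random partitions. The fold counts, the function names, and the interleaved fold assignment are illustrative assumptions, not the settings of the cited study, and model fitting is omitted.

```python
import random

def k_fold_indices(indices, k):
    """Split a list of indices into k interleaved folds."""
    return [indices[i::k] for i in range(k)]

def repeated_double_cv(n_samples, n_repeats=2, outer_k=3, inner_k=2, seed=0):
    """Yield (repeat, test_idx, inner_splits) for repeated double cross-validation.

    inner_splits is a list of (train_idx, valid_idx) pairs built only from
    samples outside the outer test fold.
    """
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        shuffled = list(range(n_samples))
        rng.shuffle(shuffled)  # new random partition for every repeat
        outer_folds = k_fold_indices(shuffled, outer_k)
        for o, test_idx in enumerate(outer_folds):
            calibration = [i for j, fold in enumerate(outer_folds) if j != o for i in fold]
            inner_folds = k_fold_indices(calibration, inner_k)
            inner_splits = []
            for v, valid_idx in enumerate(inner_folds):
                train_idx = [i for j, fold in enumerate(inner_folds) if j != v for i in fold]
                inner_splits.append((train_idx, valid_idx))
            yield repeat, test_idx, inner_splits

for repeat, test_idx, inner_splits in repeated_double_cv(12):
    print(repeat, sorted(test_idx), len(inner_splits))
```

Keeping the outer test fold entirely out of the inner loop means that hyperparameters are never tuned on the samples used for the final performance estimate, and repeating the procedure shows how much that estimate varies with the random partition.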
Other recent work has similarly demonstrated the potential of multitasking models for ADMET properties prediction. Various user-friendly ADMET software and web servers have been developed for predicting the ADMET properties of molecules [
125], [
339], [
340], [
341], [
342]; among these, ADMETlab 2.0 [
125] is widely praised. ADMETlab 2.0 is based on a multi-task graph attention (MGA) framework and can predict multiple ADMET properties of drugs (it covers a total of 88 relevant parameters, including 23 ADME properties, 27 toxicity endpoints, and eight toxicophore rules). Most of the data used for training was derived from bioactivity data in open-access databases, relevant literature, and toxicity prediction software (Toxicity Estimation Software Tool (TEST)). Based on these training sets and the novel model architecture, some of the properties predicted by ADMETlab 2.0 are unique in comparison with the results of similar tools. It is a convenient tool for non-expert users, while also providing medicinal chemists with comprehensive and accurate ADMET properties for target molecules.
6. AI-assisted clinical trial design, post-market surveillance, and prognosis prediction
A drug candidate can enter clinical studies only after it has passed through target identification, drug design, synthesis, and optimization, followed by preclinical studies of ADMET-related properties, which provide initial confirmation of the safety and efficacy of the compound. The clinical trial phase consumes most of the time and investment during drug R&D. Although AI cannot be used to directly predict the clinical trial results of drug candidates, it can be used to assist in the design of clinical trials, making them more rational and safer so that the trial results reflect the true effects of a drug more faithfully. After phase III clinical trials, drugs also require long-term surveillance to identify toxic effects that were not documented in previous studies and thereby prevent serious adverse events.
6.1. AI-assisted clinical trial design
The high failure rate of clinical trials makes this the most difficult step in the new drug development pipeline, with about 90% of drug candidates being eliminated in clinical trials [
343], where each failed clinical trial costs approximately 0.8 billion to 1.4 billion USD. To overcome these shortcomings, a number of AI-based approaches are now available to assist in crucial steps in clinical trial design, such as helping to improve patient recruitment and enhance patient monitoring [
344]. To address the issue of patient selection, AI can be used to explore the association of patient biomarkers with external indications to predict the likely treatment response of patients, which can help in selecting patients with a high likelihood of clinical success [
345]. In addition, e-phenotyping can be used to reduce patient population heterogeneity [
346] and to aid patient selection through prognostic or predictive enrichment [
347], [
348].
Patient monitoring in clinical trials is also a critical process. By incorporating wearable technology, AI can be used to help automate and personalize real-time patient monitoring, thereby reducing patient workload and improving medication adherence issues. Accurate medication adherence data can better reflect the results of clinical trials, and AiCure [
349], an AI platform used to measure medication adherence, has shown a 25% improvement in adherence compared with traditional therapies in a phase II trial for schizophrenia. In addition, AI has been used to optimize dosing to reduce adverse effects, improve the safety of trial protocols, and reduce patient dropouts due to safety concerns [
350].
6.2. AI-assisted post-market surveillance and prognosis prediction
After a drug is approved and enters the market, it undergoes long-term monitoring to further evaluate its safety. Electronic health records (EHRs) are an important data source for AI applications in post-market surveillance, and the use of structured data can simplify data pre-processing. Existing methods applied to EHR data include the self-controlled case series (SCCS) model [
351], cohort and case-control methods [
352], and temporal pattern-discovery algorithms [
353].
Convolutional SCCS (ConvSCCS) is a scalable model for predicting longitudinal features using SCCS. Morel et al. [
354] used step functions and exposures to avoid the problem of classical SCCS models that require a precisely defined risk window. The results showed a significant improvement in the computational speed and accuracy of the method and enabled its application to adverse drug reactions (ADRs) detection in a cohort of diabetic patients. Aside from the application of structured data, unstructured data from biomedical and clinical corpora can be used for NLP methods for drug-drug interaction (DDI) detection and classification [
355] and the prediction of ADR [
356]. Systems pharmacology, which is based on systems biology, studies the effect of drugs on the system as a whole; it is a rich source of data and is a common approach for AI in ADR mining. Lorberbaum et al. [
357] proposed a network-based algorithm involving the modular assembly of drug safety subnets (MADSS). They combined systems pharmacology models with pharmacovigilance statistics to validate the algorithm, and the results showed a significant improvement in the prediction of adverse effects for four drugs.
Disease prognosis is the prediction of the course and outcome of the future development of a disease. In the past, clinicians usually relied on professional experience and traditional statistical analysis for clinical prognosis prediction, making it difficult to provide accurate results. Now, through the introduction of AI technology, multi-patient and multi-factor data can be analyzed to improve the accuracy of prediction results. In cancer prognosis, patient survival and disease recurrence are usually predicted. Enshaei et al. [
358] compared the prediction accuracy of an ANN with that of traditional statistical methods (e.g., LR); the results showed that the ANN was more accurate in predicting the prognosis of OvCa patients. Nowadays, there are many ML and DL methods for the prognosis of various cancers, such as BrCa [
359], [
360], [
361], [
362], [
363], lung cancer [
364], [
365], gastric cancer [
366], [
367], [
368], bladder cancer [
369], [
370], and prostate cancer [
371], [
372], illustrating the potential of AI technology in cancer prognosis.
7. Automation of drug synthesis with AI
The development of a new drug usually involves four stages: design, make, test, and analyze (DMTA). The application of AI is particularly important in the stage of drug synthesis, as it can effectively shorten the cycle of new drug R&D by speeding up the discovery of a new synthetic route for target molecules and reducing the rate of synthetic failure when the structure of the target molecule is known.
7.1. Automated exploration of reaction spaces with AI
In the 1960s, Corey and Wipke [
373] proposed computer-aided synthesis planning (CASP), the earliest form of AI-assisted drug synthesis design. However, owing to the limited computing power available at that time, the concept could not be developed further. With the development of ML methods in recent years, CASP has come back into the limelight. CASP mainly consists of three aspects: retrosynthetic planning, reaction condition recommendation, and forward reaction prediction [
374]. Retrosynthetic planning, which involves the stepwise splitting of the target molecule into commercially available starting materials, is an important approach in the design of drug synthesis reactions. Monte Carlo tree search (MCTS) is a general search technique for sequential decision-making problems with large branching factors. Segler et al. [
375] combined MCTS with three different neural networks trained on essentially all published reactions to predict the best retrosynthetic routes. In comparison with conventional algorithms, the model is 30 times faster and doubles the number of molecules solved.
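The search idea behind MCTS-based retrosynthesis can be illustrated with a minimal sketch. It is not the model of Segler et al.: the neural networks that guide their search are replaced here by random rollouts, and the molecule names, the `DISCONNECTIONS` table, and the `PURCHASABLE` set are hypothetical toy data introduced only for illustration.

```python
import math
import random

# Hypothetical toy data: each molecule maps to alternative precursor sets.
DISCONNECTIONS = {
    "target": [("amide_acid", "amine"), ("dead_end",)],
    "amide_acid": [("acid_chloride",), ("ester",)],
    "ester": [("acid", "alcohol")],
}
PURCHASABLE = {"amine", "acid", "alcohol"}


class Node:
    def __init__(self, state, parent=None):
        self.state = state      # tuple of molecules that still need a route
        self.parent = parent
        self.children = {}      # index of applied disconnection -> child Node
        self.visits = 0
        self.value = 0.0


def actions(state):
    """Disconnections available for the first unsolved molecule."""
    return DISCONNECTIONS.get(state[0], []) if state else []


def step(state, precursors):
    """Replace the first molecule by its non-purchasable precursors."""
    todo = tuple(m for m in precursors if m not in PURCHASABLE)
    return todo + state[1:]


def rollout(state, rng, max_depth=6):
    """Random playout: 1.0 if everything becomes purchasable, else 0.0."""
    for _ in range(max_depth):
        options = actions(state)
        if not options:
            break
        state = step(state, rng.choice(options))
    return 1.0 if not state else 0.0


def ucb(parent, child, c=1.4):
    """Upper confidence bound: mean reward plus an exploration bonus."""
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(target, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node((target,))
    for _ in range(iterations):
        node = root
        # Selection: descend while the current node is fully expanded.
        while node.children and len(node.children) == len(actions(node.state)):
            parent = node
            node = max(parent.children.values(), key=lambda ch: ucb(parent, ch))
        # Expansion: add one untried disconnection, if any remain.
        options = actions(node.state)
        untried = [i for i in range(len(options)) if i not in node.children]
        if untried:
            i = rng.choice(untried)
            node.children[i] = Node(step(node.state, options[i]), parent=node)
            node = node.children[i]
        # Simulation and backpropagation.
        reward = rollout(node.state, rng)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return root


def best_route(root):
    """Follow the most-visited child at each level of the tree."""
    route, node = [], root
    while node.children:
        i, child = max(node.children.items(), key=lambda kv: kv[1].visits)
        route.append((node.state[0], actions(node.state)[i]))
        node = child
    return route, node.state
```

Calling `best_route(search("target"))` returns the chosen disconnections and the molecules that remain unsolved; an empty remainder means every leaf of the route is purchasable. In a realistic system, learned policy and value networks would replace both the uniform choice of disconnections and the random rollout.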
After designing the synthetic route, the rationality of each step in the synthesis process must also be considered. Researchers have also used AI for the prediction of reaction conditions in order to reduce the time spent on screening reaction conditions. Gao et al. [
376] proposed a neural network model to predict appropriate reaction conditions, including the reaction temperature. They trained the model using ten million examples from Reaxys and tested it on one million reactions outside the training set. The model predicted reaction conditions that matched those in the record in 69.6% of the test cases. The computational framework DeepReac+ [
377] adopts an active learning strategy to explore the reaction space more efficiently, thereby reducing the time needed for model learning and prediction.
Forward reaction prediction verifies the feasibility of the designed route by predicting the products. The starting material, which is predicted by retrosynthetic planning, can be replaced by many other compounds, and forward reaction prediction can be used to rank these compounds in order to select the best solution. For example, Coley et al. [
378] proposed a neural network model for predicting reaction outcomes. They trained the model with 15 000 reaction examples from the United States Patent and Trademark Office (USPTO) literature and ranked all the generated candidate compounds to select the product that matched the record. The model used an edit-based representation of the candidate reactions and achieved an accuracy of 71.8%.
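The ranking step can be made concrete with a small sketch. It assumes that some model has already assigned a score to each candidate product; the scores and product names below are hypothetical, and the neural network of Coley et al. is not reproduced. The sketch only shows how scores are turned into a ranking and how top-k accuracy against the recorded product is computed.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(candidates):
    """candidates: list of (product, score); returns (product, prob) sorted best first."""
    probs = softmax([score for _, score in candidates])
    names = [name for name, _ in candidates]
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)


def top_k_accuracy(examples, k=1):
    """Fraction of reactions whose recorded product is among the k best candidates."""
    hits = 0
    for candidates, recorded in examples:
        top = [name for name, _ in rank_candidates(candidates)[:k]]
        hits += recorded in top
    return hits / len(examples)


# Hypothetical model scores for four reactions, each with its recorded product.
EXAMPLES = [
    ([("ester", 2.1), ("ether", 0.3), ("no_reaction", -1.0)], "ester"),
    ([("amide", 0.2), ("imine", 1.5)], "amide"),
    ([("biaryl", 3.0), ("homocoupled", 1.0)], "biaryl"),
    ([("alcohol", 0.9), ("alkene", 1.1), ("ketone", -0.5)], "alcohol"),
]

top1 = top_k_accuracy(EXAMPLES, k=1)
top2 = top_k_accuracy(EXAMPLES, k=2)
```

An accuracy figure such as the 71.8% reported above corresponds to `top_k_accuracy` with `k=1` evaluated on held-out reactions.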
In addition to designing new reaction routes based on target molecules, unknown chemical spaces can be explored by synthetic robots based on AI. Recently, a synthetic robot proposed by Granda et al. [
26] not only analyzed chemical reactions faster than manual analysis but was also able to predict the reactivity of various reaction combinations on its own and explore the unknown reaction space. The robot model’s analysis of samples by nuclear magnetic resonance and infrared spectroscopy is coupled with ML for decision-making, allowing reactions to be evaluated in real time. The outcomes showed that the model can predict the reactivity of about 1000 reaction combinations with over 80% accuracy. Four entirely new reactions were discovered by chemists using real-time data from this robot for prediction. In addition, Caramelli et al. [
379] proposed an inexpensive synthetic robot with the ability to network and coordinate multiple reactions in addition to performing chemical reactions autonomously. The robot can also explore new chemical spaces to search for new reaction results and can evaluate the reproducibility of reactions. In conclusion, the invention of intelligent synthesis robots is an important step toward an automated synthesis approach with AI.
7.2. AI usage in automatic drug synthesis
AI-based automated chemical synthesis technologies are freeing researchers from a great deal of manual work by automating experimental processes. Many reactions can already be performed on automated synthetic systems, such as the synthesis of peptides [
380], oligonucleotides [
381], natural products [
382], and various drug molecules [
383], as reported earlier. To establish a common standard for automated chemical synthesis, Steiner et al. [
35] proposed the Chemputer system and used it to synthesize three drug compounds (diphenhydramine hydrochloride, rufinamide, and sildenafil) in yields comparable to those from manual synthesis. The program they developed, called Chempiler, compiles synthesis procedures into low-level instructions that are executed on a modular robotic platform. Moreover, the synthesis process is captured to generate digital code that can be shared between platforms, thereby driving the spread of automated chemical synthesis in the laboratory.
In parallel to increasing the automation of reactions, improving the reaction throughput is a goal of automated synthesis, causing high-throughput experiments (HTEs) to receive much attention in recent years. HTEs with 24- or 96-well reactors are capable of performing dozens of reactions in a single experiment [
384], [
385]. In contrast, ultra-high-throughput reactions on the nanoscale can even perform thousands of reactions at a time [
386], [
387]. Of the limited types of reactions that high-throughput platforms can currently handle, homogeneous reactions in low-volatility solvents, whether run at room temperature or with heating, are relatively easy to achieve [
388]. Moreover, among the reactions commonly used in HTE, metal-catalyzed cross-coupling reactions in which many reaction variables are observed during development are a hot research topic. Ahneman et al. [
389] proposed an RF algorithm trained by a high-throughput dataset to predict the tolerance of palladium catalysts to isoxazole during C-N bond formation. The performance of the algorithm was shown to be significantly improved compared with conventional linear regression analysis, and the model was also useful for analyzing the inhibition mechanism of metal catalysts.
As an increasing number of algorithms related to reaction prediction are developed, scientists can identify optimal reaction conditions faster and more accurately, obtain optimal reaction routes, and further explore the reaction space. The integration of these novel and effective algorithms can facilitate the development of automated chemical synthesis platforms, freeing researchers from repetitive tasks [
377].
8. Application of AI in other areas related to drug discovery
AI technology has been widely used in the whole process of drug R&D, including target identification, drug design, synthesis, and property evaluation. It has undoubtedly shortened the drug R&D cycle and saved a great deal of experimental cost compared with the traditional experimental process. Scientists are continuing to explore the application of AI technology, as they attempt to use AI in more fields to promote the development of pharmaceutical sciences.
8.1. Facilitating knowledge discovery through literature mining
Every year, numerous papers are published in the fields of medicine, pharmacy, biology, chemistry, materials, and so forth. There is a great deal of relevant expertise in these papers. Mining the literature and linking information with relevant knowledge quickly and purposefully is very important. NLP algorithms can extract the required knowledge from unstructured information in a large number of papers, patents, and published documents. Further analysis of the extracted knowledge can reveal the knowledge associations hidden in many documents and can thereby reduce the workload of researchers in analyzing documents one by one [
390]. Long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional encoder representations from transformers (BERT), and transformers, which are commonly used in NLP research, have made their mark in this field [
391], [
392].
MEDLINE is a commonly used corpus in the biomedical field and is an important part of PubMed. For decades, there has been extensive work on text mining this corpus for screening key genes, targets, and drugs and for drug side-effect discovery, drug repositioning, and other research. Researchers have focused on five main areas of text mining in biomedicine—namely, biomedical named entity recognition (NER) and normalization, biomedical text classification, relation extraction (RE), pathway extraction, and hypothesis generation [
393]—which has led to many new discoveries. For example, hypothesis generation studies on biomedicine have driven research on drug repositioning [
394], [
395], drug development [
396], [
397], and pharmacovigilance [
398], [
399].
Hundreds of papers are published every day on COVID-19 research, and text mining can help researchers find useful knowledge in the vast literature produced by this research boom. The COVID-19 Open Research Dataset (CORD-19, https://www.semanticscholar.org/cord19) is a corpus containing a large amount of information related to COVID-19, and most text mining models use this corpus for information extraction. COVID-19 text mining models apply NLP methods to the corpus to implement applications such as the following: a question-answering (QA) system (to answer questions asked by users, the system extracts relevant answers from the corpus), a summarization system (for long texts, the main points are automatically inferred to provide users with a quick overview), visualization (the information in the text is visualized to make it easier for users to understand), and others [
400]. These tools have greatly helped researchers to cope with the challenge of information overload and to obtain valuable information in a short period of time.
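The retrieval step of such a QA system can be sketched in a few lines. The example below is a deliberately simple stand-in: it uses TF-IDF weighting and cosine similarity rather than the neural NLP models discussed above, and the three passages are a hypothetical miniature corpus written for illustration, not records taken from CORD-19.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(passages):
    """Return smoothed IDF weights and one TF-IDF vector per passage."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer(question, passages, idf, vectors):
    """Return the passage most similar to the question, with its score."""
    counts = Counter(tokenize(question))
    query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scores = [cosine(query, vec) for vec in vectors]
    best = max(range(len(passages)), key=scores.__getitem__)
    return passages[best], scores[best]


# Hypothetical miniature corpus (illustrative sentences only).
PASSAGES = [
    "Baricitinib was identified by a knowledge graph as a possible treatment for COVID-19.",
    "Text mining of MEDLINE supports drug repositioning and pharmacovigilance.",
    "A summarization system condenses long texts into their main points.",
]

IDF, VECTORS = build_index(PASSAGES)
best_passage, best_score = answer(
    "Which drug was identified as a possible treatment?", PASSAGES, IDF, VECTORS
)
```

A production system would typically replace the TF-IDF vectors with contextual embeddings and add a reader model that extracts the answer span from the retrieved passage.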
Aside from the examples given above, text mining models driven by DL will have applications in many more scenarios. As time progresses, advances in NLP technology will make it easier for models to understand human language. Then the model will be able to extract knowledge from this unstructured information by relying on contextual associations to extract the focus of the full text. In this way, thousands of related documents will be processed into a knowledge network to provide a rich knowledge base for drug development. For example, the web service—explorer for target significance and novelty (e-TSN) [
401]—constructed the world’s largest relation map using drug targets and diseases extracted by means of NLP-based text mining. The service aims to visualize target-disease KG and provide approved drugs and associated bioactivity information to assist in prioritizing candidate disease-related proteins. Furthermore, Wang et al. [
402] developed a multimodal chemical information reconstruction system (CIRS) that automatically processes, extracts, and aligns heterogeneous structure information from text descriptions and structural images of chemical documents. CIRS is a powerful tool for constructing a structured molecular database based on chemical patents to enrich the near-drug space.
8.2. Advancing the development of precision medicine
Precision medicine usually involves the adoption of different treatment plans for the diseases or symptoms of different people. This approach is the opposite of simplifying (or over-simplifying) the classification method of diseases such that all individuals with certain symptoms use the same treatment plan [
403]. In society today, the causes of patients’ illness are affected by more factors than before, so more accurate diagnosis and treatment plans are required for each patient. The specific concept of precision medicine has been defined as a process [
404]. First, information on the patient is needed at different levels, such as the patient’s medical history, lifestyle, physical examination results, basic laboratory results, imaging, functional diagnostics, immunology, and omics. This data is then preprocessed to build a relevant model that reflects the patient’s situation. Among the data collected, omics data is recognized as the largest and most complex data [
404] and has been widely used in the discovery of biomarkers, the identification of disease subgroups, and prognosis prediction [
405], [
406], [
407], [
408]. In the current era of big data, AI has rapidly advanced the development of precision medicine—especially precision medicine based on omics.
The extensive use of second-generation sequencing technologies has enabled complex diseases to be finely characterized at the molecular scale, especially in the field of tumor research. The global tumor genome sequencing program, represented by the TCGA project, has laid an essential foundation for the molecular typing and precision treatment of tumors. Based on the mRNA expression data of a TCGA dataset through the analysis of differentially expressed genes, Zhao et al. [
409] selected the top 40 differentially expressed genes from each type of tumor, merged them to form a feature subset containing 791 distinct genes, and established a DL model named cancer of unknown primary (CUP)-AI-Dx for predicting the tissue origin and tumor subtype of tumor samples. Yeh et al. [
410] studied the transcriptome of patients with severe asthma using the highly variable expressed gene profile of patients’ peripheral blood mononuclear cells (PBMCs); their k-means clustering analysis of 2048 genes revealed that the genetic characteristics of the transcriptome clusters in patients with asthma determine specific asthma subtypes. In comparison with transcriptomics, the in-depth study of proteomics can help uncover biomarkers and drug targets for different diseases. Rolland et al. [
411] used a hierarchical clustering approach to analyze proteomic data from lymphoma patients to reveal specific N-glycoprotein biomarkers in different lymphoma subtypes, thereby providing potential therapeutic targets for precision medicine in lymphoma. Niu et al. [
412] identified a combination of protein biomarkers for predicting liver fibrosis, hepatitis, and hepatic steatosis with satisfactory performance using mass spectrometry-based proteomic assays and ML models.
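The clustering step used in such subtype analyses can be sketched as follows. This is a generic k-means implementation, not the pipeline of Yeh et al.; the six expression profiles over three genes are hypothetical values chosen so that two groups are clearly separated.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sqdist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical expression profiles (rows: patients, columns: genes).
PROFILES = [
    (1.0, 0.9, 0.1),
    (1.1, 1.0, 0.0),
    (0.9, 1.2, 0.2),
    (0.1, 0.0, 1.0),
    (0.0, 0.2, 1.1),
    (0.2, 0.1, 0.9),
]

labels, centroids = kmeans(PROFILES, k=2)
```

With real omics data, the profiles would first be normalized and restricted to highly variable genes, and the number of clusters would be chosen with a criterion such as the silhouette score rather than fixed in advance.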
Of course, as mentioned in Section 3, multi-omics technologies are more promising for application than single omics. Many published works explore the molecular mechanisms of disease and the discovery of reliable biomarkers to serve in the diagnosis and treatment of diseases through multi-omics technology. The growing scale of omics data and the increasing development of AI technology will greatly advance the development of precision medicine.
8.3. Utilization of AI in drug formulation and release
With advances in new drug discovery methods, advanced drug delivery systems have expanded rapidly, promoting clinical translation and improving safety, efficiency, and patient compliance [
413], [
414]. A drug delivery system can be visualized as a “cart” (i.e., a carrier) that transports “goods” (i.e., therapeutics) to the appropriate destination. With the advancement of materials, engineering, and biology technologies, the term “carrier” has expanded to include nanocarriers, cells, eluting devices, and micro-nano robots [
415], [
416]. Compared with conventional drug carriers, nanocarriers can improve drug solubility and mitigate the adverse effects of conventional solubilizers. In addition to protecting the drug from deterioration, nanocarriers can endow the drug with a targeting function [
417].
Nevertheless, preparing a suitable nanocarrier is extraordinarily complicated, as it depends on the drugs, excipients, and reaction conditions (including temperature, time, and stirring speed). Experiments alone cannot screen all of these parameters. In addition to determining a drug’s molecular target and biological activity [
418], [
419], AI can accurately predict its optimal nano-forming conditions (
Fig. 4) [
420], [
421], [
422].
Shamay et al. [
422] predicted particle self-assembly via computational methods. Using quantitative structure-nanoparticle assembly prediction (QSNAP) calculations, they discovered two molecular descriptors for predicting which drugs will form nanoparticles with indocyanine. This method also revealed crucial molecular structural characteristics that permit the self-assembly and the formation of nanoparticles. With the aid of indocyanine sulfate, these drugs were assembled into nanoparticles with a loading efficiency of 90%. The researchers also evaluated the targeted delivery properties of nanoparticles in human colon and primary liver cancer models expressing caveolin-1 (CAV1). Sorafenib- and trametinib-containing nanoparticles were able to selectively target tumors without harming healthy tissue.
In addition, Traverso et al. [
421] utilized MD simulations, ML, and an HTE co-aggregation platform to determine which drug-excipient combinations could self-assemble into stable solid drug nanoparticles without additional stabilization. The researchers isolated 100 self-assembled drug nanoparticles from 2.1 million pairs, each containing one of 788 drug candidates and one of 2686 approved excipients. Nanoparticles of sorafenib-glycyrrhizin and terbinafine-taurocholic acid were subjected to proof-of-concept studies in vitro and in vivo. Both validations suggest that this platform can produce nanoparticles with a high drug loading and enhanced bioavailability, representing a significant step toward personalized drug delivery.
The release pattern of a drug is also crucial for disease treatment. Developing drugs that are released in response to differences in the physiological signals of various organs, tissues, and organelles can enhance the drug’s efficacy, prevent toxicity and side effects caused by non-specific off-target action, and achieve safe and precise treatment. Multiple endogenous signals, including pH, active redox species, enzymes, glucose, various ions, adenosine triphosphate (ATP), and oxygen, have been incorporated into the design of responsive drug nanocarriers (
Fig. 5) [
423]. In addition to the material’s properties, the target tissue environment influences drug release. AI can facilitate the evaluation of a drug-release mode and can provide feedback for the formulation of drug carriers through ML [
424], [
425], [
426], [
427].
8.4. Promoting the economic development of the pharmaceutical market
AI has shown itself to be powerful and promising in the pharmaceutical industries, leading to a surge of interest in AI-based drug development from both the scientific and industrial communities. In the past five years, numerous AI-based pharmaceutical companies have been established and have signed collaboration agreements with many large pharmaceutical companies [
428]. These shifts have driven massive financing in the drug market, injecting new dynamics into the pharmaceutical economy.
Some of these AI-based pharmaceutical companies focus on a specific stage of the drug discovery pipeline, such as target discovery and the screening of compounds. Some are involved in multiple stages of the pipeline, while others have built end-to-end platforms for new drug discovery [
428].
BenevolentAI is a leading AI-based pharmaceutical company that focuses on drug target discovery. Founded in 2013, the company has seen rapid growth in recent years and has emerged as a leader in AI-based drug discovery, attracting significant investor attention. The company was listed in Amsterdam on 6 December 2021 and has a pre-investment valuation of 1.1 billion EUR and a post-investment valuation of up to 1.5 billion EUR. BenevolentAI identifies drug targets for complex diseases through its leading KG technology, which integrates large amounts of publicly available biopharmaceutical data with internal company data. For example, the KG identified baricitinib as a possible treatment for COVID-19 [
429]. Through this technology, BenevolentAI has entered into a long-term collaboration with AstraZeneca for target identification in chronic kidney disease, idiopathic pulmonary fibrosis, heart failure, and systemic lupus erythematosus. On 17 May 2022, AstraZeneca made a milestone payment to BenevolentAI for a new target discovery in idiopathic pulmonary fibrosis, which is the third new target identified through BenevolentAI’s R&D platform. In addition, BenevolentAI has entered into a new drug discovery collaboration with Johnson & Johnson. BenevolentAI’s core technology, the judgment-augmented cognition system (JACS), uses its NLP capabilities to process large amounts of unstructured data in a short period of time. The current market opportunity around AI-led drug discovery capabilities is over 30 billion USD [
430].
In 2019, Insilico Medicine completed a challenge to design new small molecule inhibitors of DDR1 in 21 days using the GENTRL AI system [
28]. This challenge caused a great sensation at the time, because it was unimaginable for so many new inhibitors to be discovered in such a short period of time using AI methods. The total time taken was reduced by 1-2 years compared with the traditional process. Insilico Medicine’s outstanding performance has made it a hit with investors. In June 2021, Insilico Medicine raised 225 million USD in a Series C round of funding and, in February 2022, it announced the launch of a phase I clinical trial of a small molecule inhibitor for the treatment of idiopathic pulmonary fibrosis [
430].
The company Exscientia stands out for bringing small molecules discovered using AI into clinical trials. At a time when AI-based pharmaceutical companies are competing with each other, Exscientia became the first company to advance an AI-discovered drug candidate, DSP-1181, to the clinical stage. This process took less than 12 months, compared with a historical average of about 4.5 years for this step. In 2021, Exscientia raised a total of approximately 800 million USD through Series C and Series D funding and an initial public offering (IPO). The company has also raised significant funding through partnerships, signing deals with Bristol Myers Squibb and Sanofi for potential transaction amounts of 1.2 billion and 5.2 billion USD, respectively. Both deals are focused on drug discovery in the areas of oncology and immunology. Over the decade of Exscientia’s development, a complete end-to-end AI drug development pipeline has been progressively established, from target selection to molecular screening and generation. It is this complete pipeline that continues to drive Exscientia’s growth. To date, Exscientia has three drugs in the clinical stage, and its market value is highly anticipated upon launch [
430].
Thus far, the development of AI-driven drugs is at a historical inflection point, and the average funding for pharmaceutical companies with AI as a core technology has been on the rise.
Table 6 provides some information on the core technologies of AI-based pharmaceutical companies. Investors now recognize that drug R&D based on AI technology is becoming a powerful tool to accelerate biopharmaceutical innovation. This technology can provide new insights to accelerate drug discovery by analyzing the biopharmaceutical data that is accumulated and generated on a daily basis. As a result, this field has become a strategic area of focus for pharmaceutical companies and continues to attract capital market attention.
9. Challenges
This review has elaborated on most of the applications of AI in the whole process of drug R&D. However, at the present stage, AI has not really broken down the traditional pharmaceutical system, and many research processes are still waiting for “optimization” by AI. The use of AI for more in-depth research in the field of pharmaceutical preparations is still being gradually explored. For example, some scholars have used AI technology to assist in studying the interaction of drug excipients with biomolecules [
431]. In addition to the application areas of AI in the drug development stage that still require expansion, there are limitations in the application of AI to drug discovery.
9.1. Data limitations
The development of AI algorithms cannot be separated from the drive of data. High-quality and accurate data can sometimes enable simple models to outperform complex models. There are many excellent publicly accessible databases for data research, including TTD, ChEMBL, DrugBank, CMAP, and PRIDE, but the amount of data is insufficient to support more complex research. The construction of AI algorithms relies heavily on high-quality and sufficient data. The acquisition of high-quality data is a very important issue for sophisticated and complex biological systems, due to the limitations of current technology, and it is costly to process this data into standard data with high confidence. The method, time, and place of operation of each batch of data acquisition are different, making it more difficult to process the acquired data into uniform and valid data [
432]. For example, the results obtained by current single-cell RNA-seq vary with their sequencing platforms and often tend to form doublets. Some data is obtained by in vitro assays; however, due to the lack of a thorough understanding of the response in the organism, the in vitro data often differs significantly from the actual in vivo data. Therefore, the prediction results of models trained with the data obtained from in vitro experiments are often unconvincing.
These limitations reflect the uneven quality of the data that is currently used. Data imbalance is also a major difficulty in model training. As previously mentioned, positive datasets are readily available in the pharmaceutical field, but negative datasets are often not accurately identified because data from failed experiments is often not publicly available. In addition to the problem of data quality and balance, some types of data are generally unavailable to researchers. The key core data for new drug R&D usually originates from drug companies; this part of the data is usually not open source, as drugs are commodities. Similarly, clinical data involves patient privacy and is usually not open source. Solving the problem of data quality and balance requires advances in experimental techniques that yield more accurate biomedical data than is currently available, in order to break the data bottleneck. The development of algorithms such as distributed training can be expected to solve the problem of data privacy to a certain extent. We also appeal to major institutions and companies to disclose as much high-quality data as possible without compromising their own interests.
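One way in which distributed training can ease the privacy problem is federated averaging, in which each institution trains on its own data and shares only model parameters. The sketch below assumes that this is the kind of algorithm meant above, which the text does not specify; the one-parameter model y ≈ w·x and the two sites' data are hypothetical.

```python
def local_update(w, data, lr=0.02, epochs=20):
    """Gradient descent on mean squared error for y ≈ w * x, using local data only."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(sites, rounds=10, w=0.0):
    """Each round: sites train locally, the server averages the returned weights."""
    total = sum(len(data) for data in sites)
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in sites]
        # Weight each site's parameter by its number of samples.
        w = sum(len(d) * lw for d, lw in zip(sites, local_weights)) / total
    return w


# Hypothetical private datasets held by two institutions (x, y pairs).
SITES = [
    [(1.0, 2.0), (2.0, 4.1)],
    [(3.0, 5.9), (4.0, 8.2), (5.0, 9.8)],
]

w_global = federated_average(SITES)
```

Only the scalar `w` crosses institutional boundaries in this sketch; the raw records stay on site. Sharing parameters does not by itself guarantee privacy, so practical deployments usually add safeguards such as secure aggregation or differential privacy.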
9.2. Limitations in interpretability
In addition to the limitations of data, DL methods lack interpretability. Compared with traditional ML methods, which can often be validated through rigorous mathematical reasoning, DL methods are considered to be a black box. Although DL performs better than traditional ML on most tasks, it is often impossible for researchers to understand why the results of DL are so good. When a DL model yields a new result that contradicts previous research, the lack of interpretability makes the result difficult to accept. In particular, compared with other fields, the field of drug discovery has a complete set of knowledge logic, such as the mechanisms of action of molecules, the metabolic mechanisms of molecules, and the regulatory mechanisms of biological pathways. In order to ensure the safety and efficacy of drugs, relevant biological processes must be thoroughly studied, ranging from the physicochemical properties of a drug to what proteins it binds to in the body, how it binds, what biological reactions it triggers, and how it is metabolized. DL can only accept input and give predicted output; it cannot provide sufficient explanations for how this output is derived. For example, for protein function annotation, although DL methods can predict the GOA of a specific protein [
70], the computational process is not known and most of the predictions are not accepted when the accuracy is not reliable. Even in terms of data representation methods, no uniform standards have been developed regarding which representation method is more suitable for which study and which representation methods lead to a loss of information.
In the future, the development of DL in the pharmaceutical sciences and industry should focus on improving interpretability as much as possible without compromising accuracy, and should involve the establishment of a set of well-established research methods that combine white-box models with black-box models.
10. Conclusions
In conclusion, AI is advantageous in all aspects of new drug R&D. It can be used in the discovery of drug targets, the design and development of new drugs, preclinical research, clinical trial design, and post-market surveillance to assist in the design of safe and effective drugs, while greatly reducing the cycle time and cost of drug R&D. Some limitations still remain in the AI-based drug R&D process. However, we believe that the emergence of AI is gradually assisting us in unraveling the mystery of large and complex biological systems, and that AI has become an indispensable technology in the drug R&D process. Furthermore, AI technologies will change the R&D paradigm of pharmaceutical sciences in the future, helping us to better overcome complex diseases while providing personalized medicine to patients. In this process, further research is needed to inject new energy into this field.
The authors would like to dedicate this article to Prof. Hualiang Jiang, a member of the Chinese Academy of Sciences (CAS) and a professor at the Shanghai Institute of Materia Medica and Lingang Laboratory. Prof. Jiang devoted great effort to cutting-edge research on CADD and artificial intelligence for drug discovery and made significant contributions to the development of pharmaceutical sciences. All authors would like to take this opportunity to thank him for his kind and persistent support of their research.
Acknowledgments
This work was funded by the Natural Science Foundation of Zhejiang Province (LR21H300001), National Key R&D Program of China (2022YFC3400501), National Natural Science Foundation of China (22220102001, U1909208, 81872798, and 81825020), Leading Talent of the “Ten Thousand Plan”—National High-Level Talents Special Support Plan of China, Fundamental Research Fund of Central University (2018QNA7023), Key R&D Program of Zhejiang Province (2020C03010), “Double Top-Class” University (181201*194232101), Westlake Laboratory (Westlake Laboratory of Life Sciences and Biomedicine), Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, and Alibaba Cloud, Information Technology Center of Zhejiang University.
Compliance with ethics guidelines
Mingkun Lu, Jiayi Yin, Qi Zhu, Gaole Lin, Minjie Mou, Fuyao Liu, Ziqi Pan, Nanxin You, Xichen Lian, Fengcheng Li, Hongning Zhang, Lingyan Zheng, Wei Zhang, Hanyu Zhang, Zihao Shen, Zhen Gu, Honglin Li, and Feng Zhu declare that they have no conflict of interest or financial conflicts to disclose.