Artificial Intelligence in Pharmaceutical Sciences

Mingkun Lu , Jiayi Yin , Qi Zhu , Gaole Lin , Minjie Mou , Fuyao Liu , Ziqi Pan , Nanxin You , Xichen Lian , Fengcheng Li , Hongning Zhang , Lingyan Zheng , Wei Zhang , Hanyu Zhang , Zihao Shen , Zhen Gu , Honglin Li , Feng Zhu

Engineering ›› 2023, Vol. 27 ›› Issue (8) : 37 -69.

Engineering ›› 2023, Vol. 27 ›› Issue (8) :37 -69. DOI: 10.1016/j.eng.2023.01.014
Research
Review

Abstract

Drug discovery and development affects various aspects of human health and dramatically impacts the pharmaceutical market. However, investments in a new drug often go unrewarded due to the long and complex process of drug research and development (R&D). With the advancement of experimental technology and computer hardware, artificial intelligence (AI) has recently emerged as a leading tool in analyzing abundant and high-dimensional data. Explosive growth in the size of biomedical data provides advantages in applying AI in all stages of drug R&D. Driven by big data in biomedicine, AI has led to a revolution in drug R&D, due to its ability to discover new drugs more efficiently and at lower cost. This review begins with a brief overview of common AI models in the field of drug discovery; then, it summarizes and discusses in depth their specific applications in various stages of drug R&D, such as target discovery, drug discovery and design, preclinical research, automated drug synthesis, and influences in the pharmaceutical market. Finally, the major limitations of AI in drug R&D are fully discussed and possible solutions are proposed.


Keywords

Artificial intelligence / Machine learning / Deep learning / Target identification / Target discovery / Drug design / Drug discovery

Cite this article

Mingkun Lu, Jiayi Yin, Qi Zhu, Gaole Lin, Minjie Mou, Fuyao Liu, Ziqi Pan, Nanxin You, Xichen Lian, Fengcheng Li, Hongning Zhang, Lingyan Zheng, Wei Zhang, Hanyu Zhang, Zihao Shen, Zhen Gu, Honglin Li, Feng Zhu. Artificial Intelligence in Pharmaceutical Sciences. Engineering, 2023, 27(8): 37-69 DOI:10.1016/j.eng.2023.01.014

1. Introduction

In the past few decades, the pharmaceutical industry has been limited by the extent of cutting-edge research in pharmaceutical sciences, because the development of new drugs is a long and complex process accompanied by high risks and high costs [1], [2]. In other words, the current field of drug research and development (R&D) requires significant productivity improvements to shorten the cycle time and reduce the cost of drug development [3]. Technologies such as network pharmacology, RNA-sequencing (RNA-seq), high-throughput screening (HTS), and virtual screening (VS) have all accelerated the discovery of new targets and new drugs to some extent [4], [5], [6], [7], [8], [9]. Nevertheless, these technologies have rarely been significant contributors to the current process of new drug discovery. Thus, there is an urgent need for new technology to drive the development of new drugs.

As the computing power of devices grows, artificial intelligence (AI) has been used in many real cases, such as in image classification and speech recognition, due to its ability to learn, process, and predict massive amounts of information [10], [11], [12]. At present, after a long period of data accumulation, in combination with the development of high-throughput RNA-seq technology, massive amounts of biomedical data have been collected [13], [14], [15], [16], [17], [18]. Biomedical data, which has a high level of heterogeneity and complexity, comes from a variety of sources, including omics data from different platforms, experimental data from biological or chemical laboratories, data generated by pharmaceutical companies, publicly disclosed textual information, and manually collated data from publicly available databases [19], [20], [21], [22]. AI can be used to learn the potential patterns in these vast amounts of biomedical data, thereby bringing new opportunities and challenges to the pharmaceutical sciences and industries.

The AlphaFold2 system used AI in the 14th round of the Critical Assessment of Protein Structure Prediction (CASP14) competition and outperformed others in accurately predicting the three-dimensional (3D) structures of proteins [23]. Similarly, in the Open-Graph Benchmark Large-Scale Challenge (OGB-LSC) competition, a graph neural network (GNN) combined with a transformer model won the top rank in predicting the molecular properties calculated by means of density functional theory (DFT), which is difficult and highly time-consuming using traditional methods [24]. These competitions demonstrated the strong ability of AI to analyze biological or chemical data. Due to its powerful capability to utilize related biomedical data to understand complex biological systems and chemical reaction spaces [25], [26], AI has had a revolutionary impact on all stages of drug R&D, including not only research on proteins and small molecules but also the assisted design of clinical trials and post-market surveillance [27]. Furthermore, in pharmaceutical companies, many state-of-the-art (SOTA) AI models have been adopted in diverse pipelines to shorten the R&D cycle time and decrease costs [28], [29], [30].

AI techniques in this context mainly involve machine learning (ML) and deep learning (DL). Both ML and DL algorithms are involved in target discovery and validation [31], drug discovery and design [32], and preclinical drug research [33], where they are used to analyze different data characteristics in different formats. After a drug candidate enters clinical trials [34], DL plays a pivotal role in assisting in the design of the trial and in supervising and analyzing data from phase IV clinical studies [33]. Approved drugs have a strong impact on manufacturing [35] and the market economy, and DL can play a part in these areas as well. Therefore, in this review, we present a comprehensive overview of most aspects of the use of AI in the pharmaceutical sciences. We focus on how AI can be used to promote target discovery and drug discovery (as shown in Fig. 1) and reflect on how to further accelerate the development of this field.

2. Basic concepts of AI and its scope of application

AI was first proposed at the Dartmouth Conference in 1956 and was defined as an algorithm that gives machines the ability to reason and perform functions [36]. From perceptrons to support vector machines (SVMs) and artificial neural networks (ANNs), the development of AI has gone through several ups and downs, and it is currently flourishing thanks to the hardware support that is now available. Both ML and DL fall under the category of AI; strictly speaking, DL is a subcategory of ML. However, our discussion of ML in this review concentrates only on traditional ML methods, such as random forest (RF) and SVMs.

2.1. The big data era

In the current big data era, enormous amounts of biological and clinical data have laid a foundation for the application of AI in the field of medical and pharmaceutical research. Although AI has been successfully and effectively applied in multiple aspects of the drug R&D process, the quantity and quality of medical data remain one of the main obstacles to the development of AI in the pharmaceutical sciences. Pharmaceutical databases containing detailed and structured big data, built by medicinal researchers worldwide, therefore play a key role in promoting AI applications in medical and pharmaceutical research.

For example, the Therapeutic Target Database (TTD) includes the most comprehensive information about known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information, and the corresponding drugs directed at each of these targets. It provides detailed knowledge of the functions of targets, as well as their sequence, 3D structures, ligand-binding properties, relevant enzymes, and corresponding drug information [37]. PubChem [17] provides collective information on chemical molecules and their activities in biological assays, including molecular structure, identifiers, physicochemical properties, patent information, and molecular toxicity. Other popular databases aimed at various pharmaceutical issues have also been proposed and are frequently used; these play significant roles in promoting the application of AI in medical and pharmaceutical research [38], [39], [40], [41], [42]. Table 1 [17], [18], [37], [43], [44], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56], [57], [58], [59], [60], [61], [62] provides brief information on popular pharmaceutical databases, categorized into protein-related, gene-related, drug-related, and disease-related databases.

2.2. ML and DL

Unlike traditional computer programming calculations, ML and DL can learn potential patterns from the input data without explicit programming. They are not limited by the format of the input data, which is broad and can include text, images, sound, and more (all types of data that can be encoded) [63]. Similar to the human learning model, ML and DL can gradually recognize different features of the data, infer the patterns lying within, and update their model parameters through continuous iterations until a valid model is formed.

According to the application scenarios, the models can be categorized into regression models and classification models. The difference between regression and classification tasks lies mainly in whether the type of output variable is continuous or discrete. Cheng and Ng [64] applied ML approaches to predict the biological activity of per- and polyfluorinated alkyl substances (PFAS) with an output of continuous values, and this study is a typical regression task. Hong et al. [65] built a DL model to predict whether a protein in a bacterium is of the type IV secreted effectors (T4SE), with an output of discrete values (e.g., 0/1), and this study is a typical classification task.

Depending on the type of learning algorithm required to solve the problem, models are commonly grouped into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning is a labeled-data-driven process that trains a model on the relationship between input and its prespecified output in order to predict the categories or continuous variables of future input. In comparison, unsupervised methods are used for identifying patterns in unlabeled datasets and exploring a dataset’s potential structures to allow clustering of the data for further analysis. In addition, semi-supervised learning lies part-way between supervised and unsupervised learning; it trains a model on data of which only a part is labeled and is used as a potential solution for problems that lack high-quality data [66]. Reinforcement learning performs model construction through constant interactive learning, relying on penalties for failure or rewards for success.

2.3. Introduction to different types of ML/DL-based algorithms

ML and DL methods have been successfully applied to solve relevant biomedical problems, with the adopted modeling approach varying for different problems or even for the same problem. For example, small molecules were traditionally described by engineered features that were fed directly into ML methods to predict their properties; more recently, GNNs have also been used to describe small molecules for property prediction [67]. Determining the function annotations of proteins is essential for the selection of druggable proteins as potential targets. Kulmanov et al. [68] used a convolutional neural network (CNN) to predict the gene ontology annotation (GOA) of proteins. Gligorijević et al. [69] built a recurrent neural network (RNN) for protein function annotations, and Xia et al. [70] combined both a CNN and an RNN to predict the gene ontology (GO) label of proteins.

Rather than following a fixed, hand-coded procedure, an ML algorithm focuses on the features of the data and transforms them into knowledge that machines can read, providing humans with new insights. Various common algorithms exist for researchers to choose from. The naïve Bayes (NB) algorithm is a probabilistic classifier based on Bayes’ theorem and an assumption of independence between features; it is a simple and intuitive algorithm [71]. An RF algorithm constructs a set of largely uncorrelated decision trees, each of which is built independently on its own random sample of the data [72]. The final decision is based on the majority vote of the decision trees. Models that make decisions based on this approach are also commonly referred to as ensemble models. Extreme gradient boosting (XGBoost) is a scalable ML algorithm based on gradient boosting, which is also an ensemble model [73]. A multi-layer perceptron (MLP) can be viewed as a directed graph consisting of multiple node layers, each fully connected to the next layer, so that it maps a set of input vectors to a set of output vectors. SVM is one of the most widely applied ML algorithms. It classifies samples with an optimal hyperplane, which is obtained by maximizing the margin between different classes in a space whose dimensionality is determined by the number of features [74]. The k-nearest neighbor (KNN) algorithm is regarded as a “lazy learning” method; it classifies a sample according to only a few neighboring samples [75]. In addition to the above methods, several other ML methods such as principal component analysis (PCA), partial least-squares (PLS), linear discriminant analysis (LDA), and logistic regression (LR) have been applied in biomedical data processing [76], [77].
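To make the "few neighboring samples" idea behind KNN concrete, the following is a minimal sketch in plain Python. The two-feature descriptors, the labels, and the function name `knn_predict` are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training sample
    distances = [
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    ]
    # Keep the k closest samples and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature descriptors for five labeled compounds
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = ["inactive", "inactive", "active", "active", "active"]

prediction = knn_predict(train_x, train_y, query=(0.7, 0.8), k=3)
```

No model is fitted in advance: all work happens at prediction time, which is why the method is described as lazy learning.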

DL is popular due to its powerful generalization and feature-extraction capabilities; its learning and prediction process is end-to-end. Unlike the traditional ML process (which often consists of multiple independent modules), DL obtains the output data (output-end) directly from the input data (input-end) during the model training process and continuously adjusts and optimizes the model based on the error between the output and the true value, until it meets the expected result. A deep neural network (DNN) is a feed-forward neural network consisting of densely connected input, hidden, and output layers. It achieves the feature learning of input data by simulating nonlinear transformations between neurons, with each layer consisting of various neurons [78]. A CNN is a feed-forward neural network that consists of convolutional (feature extraction) and pooling (dimensionality reduction) layers. The convolutional and pooling layers help to extract informative features from a dataset without consuming too much time and computational resources [79]. An RNN is a class of ANN in which linked nodes form a directed or undirected graph along a temporal sequence. An RNN includes a feedback component that allows signals from one layer to be fed back to the previous layer. This feedback gives the network an internal memory, which helps to address the difficulty of learning and storing long-term information [80]. A GNN is a connectivity model that derives the dependencies in a graph by means of information transfer between nodes in the network [81], [82]. A GNN updates the state of a node according to its neighbors at any depth; the resulting state represents the node information. The neural network architectures of the four networks described above are shown in Fig. 2.
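The feedback component of an RNN can be illustrated with a deliberately simplified sketch: a single scalar recurrent unit in plain Python. The weights `w_x`, `w_h`, and `b` are hypothetical fixed values (in practice they would be learned), and real RNNs use vector-valued states and weight matrices.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a scalar recurrent unit over a sequence and return all hidden states."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current input and on the previous state
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


# Hypothetical weights chosen for illustration only
states = rnn_forward([1.0, 0.0, 0.0], w_x=1.0, w_h=0.5, b=0.0)
```

Only the first input is non-zero, yet the later states stay above zero because the previous state is fed back at each step. Setting `w_h=0.0` removes the feedback, and the state returns to zero as soon as the input does.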

An autoencoder (AE), which consists of an encoder and a decoder, is used to learn efficient encodings of input data. The encoder maps the input to an encoding, from which the decoder reconstructs the input. An AE is usually used for data compression and dimensionality reduction through the representation (i.e., the encoding) of a set of data [83]. A generative adversarial network (GAN) is composed of two underlying neural networks: a generator neural network and a discriminator neural network. The former is used to generate content, while the latter is used to distinguish the generated content from real data [84]. Models can also be used in combination to solve a wider range of problems. For example, a graph convolution network (GCN) extends convolutional operations from traditional data (e.g., images) to graph data [85].

When a model fails to learn the underlying patterns in the data effectively and therefore cannot generalize to new data, the problem is called underfitting [86]. In contrast, overfitting occurs when, during training, the model fits noise in the data as if it were a representative feature, resulting in poor predictions for new data [87]. Compared with underfitting, model overfitting is more difficult to deal with. Models often become overfitted because they are overly complex or because the training data are not representative enough. A dataset used for a model is often divided into a training set, validation set, and test set. These sets are respectively used for model training, model adjustment, and model evaluation. To put it simply, a model that works badly on both the training and test sets is an underfitted model, while a model that works well on the training set but badly on the test set is an overfitted model. Typical ways to suppress overfitting include regularization, data augmentation [88], dropout [89], early stopping, and ensemble learning, among other methods.
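Of the remedies listed above, early stopping is the simplest to sketch. The snippet below assumes that a validation loss has been recorded after every training epoch; the function name `early_stopping_epoch`, the `patience` parameter, and the loss values are illustrative assumptions rather than part of any cited method.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss seen before stopping.

    Training is considered stopped once `patience` consecutive epochs pass
    without an improvement in validation loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Hypothetical validation losses recorded after each epoch
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.50]
stop_at = early_stopping_epoch(val_losses, patience=2)
```

With `patience=2`, training halts after the two non-improving epochs that follow epoch 2, so the later drop to 0.50 is never seen. A larger `patience` trades extra training time for a chance to find such late improvements.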

When only a single traditional epidemic model or a single ML model was used to predict the long-term trends of the coronavirus disease 2019 (COVID-19) pandemic, researchers encountered underfitting and overfitting problems. To address these issues, Sun et al. [90] proposed a new model called dynamic-susceptible-exposed-infective-quarantined (D-SEIQ). The D-SEIQ model can accurately predict the long-term trends of COVID-19 outbreaks by appropriately modifying the susceptible-exposed-infective-recovered (SEIR) model and integrating ML-based parameter optimization under reasonable epidemiology constraints.

Different models have different evaluation criteria. In regression models, commonly used evaluation criteria include mean squared error (MSE), root mean squared error (RMSE), and R-squared. In classification models, commonly used criteria are recall, precision, and F1-score. The receiver operating characteristic (ROC) curve and precision-recall curve (PRC) are also widely used to evaluate classification models, with ROC curves taking into account both positive and negative cases to assess the overall performance of the model, while PRCs focus more on positive cases [91].
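The criteria named above follow directly from their standard definitions. The sketch below computes them in plain Python on small hypothetical prediction vectors; the function names are illustrative, and established libraries would normally be used in practice.

```python
import math


def regression_metrics(y_true, y_pred):
    """MSE, RMSE, and R-squared for continuous predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1.0 - ss_res / ss_tot}


def classification_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1-score for binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical predictions for illustration
reg = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
clf = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

Precision and recall ignore true negatives entirely, which is the same reason PRCs emphasize the positive class more than ROC curves do.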

2.4. A brief description of molecule representation as model input

Over time, the accumulation of data on small molecules and proteins has resulted in an extremely large data resource. Databases of molecular sequences, structures, physicochemical properties, and so forth have been collected and organized by different organizations and contain a great deal of knowledge and information. However, the different sources and formats of the data make it difficult to integrate the correlated data from multiple heterogeneous sources. Therefore, it is particularly important to adopt suitable methods to represent molecules in an appropriate way and to mine the crucial information in the data on molecules by means of AI [92]. Current AI algorithms are highly dependent on the quality of the data; thus, when performing model construction, it is necessary to unify the input format of molecules, such as by representing small molecules and proteins as model-readable vectors or matrices.

At present, the representation of small molecules is generally done using one of four main approaches. The first approach involves knowledge-based representation. Molecular descriptors and molecular fingerprints based on human a priori knowledge are widely used in various ML or DL algorithms [93]. The second approach involves direct representation based on images. CNNs have now been used to learn rules from two-dimensional (2D) digital images. A 2D chemical digital grid of a molecule can be directly used as input to allow a CNN model to learn the properties of the molecule [94]. The third approach is string-based representation. For example, a typical canonical simplified molecular-input line-entry system (SMILES) represents small molecules in the form of strings. Thus, CNNs and RNNs can be further used to learn molecular embeddings from the string representations of chemical structures [95], [96], [97]. The fourth approach involves graph-based feature representation. Representation methods based on graph convolution or graph attention have been widely used to explore the feature representation of small molecules. In these methods, atoms and bonds are considered to be nodes and edges, respectively, while new molecular representations are obtained during the continuous updating of information at individual nodes. Graph-based representations have achieved outstanding performance in a variety of pharmaceutical learning tasks [98], [99].
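The graph-based approach can be sketched with a hand-written molecular graph. The example below encodes the heavy atoms of ethanol (SMILES `CCO`) as nodes and its bonds as edges, then performs one simplified update in which each atom sums its own features with those of its neighbors. This is an illustration only: the three-element vocabulary is an assumption, and real graph convolution or graph attention layers add learned weights and nonlinearities.

```python
# Ethanol (SMILES "CCO") written by hand as a graph of heavy atoms
atoms = ["C", "C", "O"]       # nodes
bonds = [(0, 1), (1, 2)]      # edges (undirected bonds)

# Small element vocabulary, assumed only for this illustration
ELEMENTS = ["C", "N", "O"]


def one_hot(symbol):
    """One-hot atom feature over the illustrative element vocabulary."""
    return [1.0 if symbol == element else 0.0 for element in ELEMENTS]


def build_adjacency(n_atoms, bonds):
    """Map each atom index to the indices of its bonded neighbors."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency


def message_passing_step(features, adjacency):
    """Update each node as the sum of its own and its neighbors' features."""
    updated = []
    for i, own in enumerate(features):
        aggregated = list(own)
        for j in adjacency[i]:
            aggregated = [a + b for a, b in zip(aggregated, features[j])]
        updated.append(aggregated)
    return updated


features = [one_hot(symbol) for symbol in atoms]
adjacency = build_adjacency(len(atoms), bonds)
updated = message_passing_step(features, adjacency)
```

After one step, the central carbon already carries information about the oxygen it is bonded to; repeating the step spreads information to atoms that are further apart.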

Protein representation methods can be broadly classified into four categories: representation based on intrinsic properties of sequences, representation based on physicochemical properties, representation based on protein structure, and graph-based representation. Sequence-based protein representation methods include amino acid composition (AAC), dipeptide composition, autocorrelation descriptors, position-specific scoring matrices (PSSMs), and one-hot encoding [100], [101], [102], [103], [104], [105], [106], [107]; these methods reflect the content of various amino acids, dipeptide content, and the distribution of amino acids on the sequence. Physicochemical property-based protein representation methods include composition, transition, and distribution (CTD), pseudo-amino acid composition (PAAC), and amphiphilic pseudo-amino acid composition (APAAC) [108], [109], [110], which reflect the properties of each amino acid and the distribution of these properties on the sequences. The two feature representation methods described above are widely used in various models, because they can obtain protein feature representations from the sequence information alone. Because the higher-order structure of a protein determines its function, the structure of a protein is sometimes represented directly. Protein representation methods based on structural properties include topological molecular structure and protein secondary structure and solvent accessibility (PSSSA) [111], [112], [113], which reflect the structural properties of each amino acid in a protein and the structural type of a protein. PSSSA is also a graph-based protein representation. In the simplest graph, each node corresponds to a residue, while the edges connect pairs of residues within a certain distance [114].
Structure-based and graph-based protein representation methods can effectively represent the structure of a protein and the relationships between amino acid residues in the structure, and can be applied to a variety of novel model architectures, such as GNNs, transformer models, and GANs [114], [115], [116], [117].
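Three of the sequence-based representations mentioned above (AAC, dipeptide composition, and one-hot encoding) need nothing beyond the sequence itself, as the following minimal sketch shows. The ten-residue peptide is hypothetical, and the code assumes that the sequence contains only the 20 standard amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in the sequence (20 values)."""
    return [sequence.count(aa) / len(sequence) for aa in AMINO_ACIDS]


def dipeptide_composition(sequence):
    """Fraction of each ordered pair of adjacent residues (400 values)."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    return [
        pairs.count(first + second) / len(pairs)
        for first in AMINO_ACIDS
        for second in AMINO_ACIDS
    ]


def one_hot_encode(sequence):
    """One row per residue, with a single 1 marking the residue type."""
    return [
        [1 if aa == residue else 0 for aa in AMINO_ACIDS]
        for residue in sequence
    ]


sequence = "MKTAYIAKQR"  # hypothetical peptide, for illustration only
aac = amino_acid_composition(sequence)
dpc = dipeptide_composition(sequence)
encoded = one_hot_encode(sequence)
```

AAC and dipeptide composition yield fixed-length vectors regardless of protein length, whereas the one-hot matrix grows with the sequence and preserves residue order.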

In recent years, novel molecular representation methods have been emerging, such as knowledge-graph-based and large-scale pretraining-based representation methods [118], [119]; these methods also excel in suitable downstream tasks. Overall, representing the raw data of a molecule using a vector or matrix that captures the molecule’s key features is critical for subsequent data exploration and analysis.

2.5. The study of drug research and disease with distinct AI algorithms

When studying different types of drugs and performing disease research, choosing a suitable model can maximize the information extracted from the data. Given classification or regression problems with small datasets, ML can often achieve a satisfactory performance in a short time. For example, a drug-protein affinity prediction study based on quantitative structure-activity relationship (QSAR) models could choose to use SVM or RF models (see Section 5 for more detail) [120], [121]. As the amount of data grows, DL algorithms are often more appropriate. For example, for protein-folding prediction problems, CNN models can better predict residues [122]. In de novo drug design, generative models and variational autoencoders (VAEs) can help to design molecules that align with the design vision [123], [124] (see Section 4 for more detail). Instead of selecting models from the perspective of the task, studies often select an appropriate algorithm according to the form of the data representation. Therefore, researchers can often choose from different AI algorithms that are available for the same task. When predicting the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of molecules, CNNs, RNNs, and multi-task learning can achieve outstanding results [125] (see Section 5 for more detail). By starting from the relationships between data, graph-based AI algorithms allow the modeling of unstructured data. The pharmaceutical sciences abound in complex relationships. Therefore, modeling complex interactions such as drug-drug interactions, drug-protein interactions, protein-protein interactions (PPIs), and so forth enhances the learning capability of the models [126] (see Section 3 for more detail).
When combined with representations of these entities themselves, key information about the entities can be learned at a deeper level to aid in making predictions, while providing a more explanatory model.

Therefore, the boundaries between the use of distinct algorithms have become increasingly blurred when such methods are applied to the actual drug and disease problems to be studied. Considering the type of data available and its biological significance can inform model selection and construction.

3. Target identification and validation

From a conventional standpoint, there are two paradigms for discovering new (first-in-class) drugs [127]: phenotypic drug discovery (PDD) and target-based drug discovery (TDD). Early biological research techniques relied on microscopy, imaging, and cellular techniques to observe the phenotypic changes in living systems. PDD is used to screen a library of compounds or antibodies by constructing an animal model or experiment that is highly relevant to the disease. Next, the responses of cells or experimental animals to these compounds are observed, with the aim of identifying molecules with a certain level of efficacy for further structural modification and optimization [128]. With the development of molecular biology and various sequencing techniques, research on biological macromolecules has reached a new height. Drug discovery research has entered the TDD era [129], and TDD has gradually replaced PDD as the mainstream drug discovery paradigm. TDD is centered on a “one gene, one drug, and one disease” concept [4]. This approach relies on a highly disease-relevant target, which could be an enzyme, protein, or other gene product, along with an elaborate and meticulous small-molecule design for this target, which is used to modulate the target to act as a therapeutic agent for the disease. Although the drug discovery paradigm of PDD has been re-emerging in recent years [128], the screened drugs often require further target validation and mechanistic studies. Therefore, target discovery is often the first, critical step in the drug development phase [129]. The target discovery process involves multifaceted research, including the study of disease-related genes, signaling pathways, protein interactions, and small molecule-protein interactions. 
Of particular interest is the fact that target discovery based on experimental means is difficult to carry out quickly and widely, due to limitations in throughput, accuracy, and cost, whereas AI-based discovery can efficiently and effectively identify biomolecules with the potential to become drug targets.

3.1. Target identification based on omics techniques

With the advancement of high-throughput sequencing technologies, huge amounts of omics data are continuously being generated. The processing and analysis of such large-scale omics data (genomics, transcriptomics, proteomics, metabolomics, etc.) [130], [131], [132], [133], [134], [135], [136], [137], [138] have been revolutionary to biology, medicine, and pharmacology, especially in facilitating researchers’ understanding of complex biological systems and processes. Many genes or proteins playing important roles in biological processes that may be associated with specific diseases have been identified based on omics data [135], [139], [140], [141], thereby facilitating research on drug target discovery. For example, new candidate disease targets such as SETD2 and VGLL4 have been uncovered using omics data. However, processing and analyzing these complex and high-dimensional omics data is extremely challenging; thus, ML and DL approaches can be used to learn potential knowledge from large-scale omics datasets, which can help in the discovery of genes or pathways critical to biological processes [142]. Table 2 [18], [44], [53], [48], [49], [50], [143], [144], [145], [146], [147], [148], [149], [150], [151] provides examples of omics projects for the analysis of drugs, proteins, and diseases.

Potential targets are molecules that are associated with a specific disease and have the smallest possible degree of association with other diseases. Complex diseases such as oncological, cardiovascular, and immune diseases are often regulated by multiple key genes, molecules, or signaling pathways, so it is often necessary to unravel the connection between multiple molecules and the disease. Omics data are essential for discovering and assessing the biological effects or toxicity of potential targets. For example, cancer stem cells (CSCs) cause great resistance to the treatment of lung adenocarcinoma (LUAD). Studying the expression of stem-cell-related genes in LUAD could provide new insights into the treatment of LUAD. Zhang et al. [152] applied an unsupervised ML algorithm known as one-class LR (OCLR) to the molecular datasets of normal stem cells and their progeny to obtain the messenger RNA (mRNA) expression-based stemness index (mRNAsi), DNA methylation-based stemness index (mDNAsi), and epigenetic regulation-based mRNAsi (EREG-mRNAsi) for analyzing the LUAD cases data in The Cancer Genome Atlas (TCGA) in order to calculate the scores of sample stemness indices. In this process, weighted gene co-expression network analysis (WGCNA) was used to find key genes associated with LUAD. In the end, 13 previously overlooked key genes with an overall association were identified, which could be used as potential targets for the treatment of LUAD by suppressing the stemness features.

Since their release, the connectivity map (CMAP) and Library of Integrated Network-based Cellular Signature (LINCS)-L1000 databases, which contain a large amount of transcriptomic data following drug perturbations and various other environmental disturbances, have been used extensively to identify the mechanisms of action and targets of small-molecule compounds, with the aim of discovering potential drugs for diseases or potential targets for drugs [153], [154], [155]. The web service PharmMapper [156], [157], [158] gathered 52 431 pharmacophore models from TargetBank, DrugBank, BindingDB, and the potential drug target database (PDTD), and used them to identify potential target candidates for the given probe small molecules by means of a fast pharmacophore mapping approach. ChemMapper [159] is another web service that aims to predict polypharmacology effects, potential protein targets, and modes of action for small molecules based on 3D similarity computation, using a database containing 4 350 000 chemical structures with bioactivities and associated target annotations. The iDrug [160] platform provides a versatile, user-friendly, and efficient online tool for computer-aided drug design (CADD) based on pharmacophore and 3D molecular similarity searching, enabling binding site detection, VS, and drug target prediction in an interactive manner through a seamless interface. DeltaNet was designed by Noh and Gunawan [161] based on the ordinary differential equation (ODE) model for analyzing gene transcription processes and predicting potential targets of compounds. There are two versions of DeltaNet, namely DeltaNet-LAR and DeltaNet-LASSO, which use least angle regression (LAR) and least absolute shrinkage and selection operator (LASSO) regularization, respectively, to solve linear regression problems. DeltaNet outputs a predicted ranking of gene targets for further enrichment analysis to find other key molecular targets. Zhu et al. [162] constructed a DL-based efficacy prediction system (DLEPS) to identify new drug candidates and discover targets. Trained on transcriptional profile data, mainly from the L1000 project, DLEPS uses changes in gene expression profiles in the state of disease as input. In addition to the discovery of three new drug candidates, DLEPS also demonstrated that mitogen-activated protein kinase kinase (MEK)-extracellular-signal-regulated kinase (ERK) was a critical signaling pathway in nonalcoholic steatohepatitis, knowledge that can be used to develop specific targets. The data mining analysis of such transcriptomes through ML and DL can help not only to find drug targets but also to elucidate the mode of action of drugs and disease mechanisms [163].

The analysis of omics data has helped researchers to identify many overlooked disease candidate targets [164]. With the advancement of sequencing technology and deeper research, the drawbacks of the deeper mining of only single omics data are becoming increasingly obvious, as such mining can neither reflect the relevance and variability of biological processes (e.g., simple gene expression levels do not reflect true protein expression levels) nor reveal complex biological systems and disease mechanisms (e.g., glycolytic processes are associated with genomics, proteomics, and metabolomics). In particular, disease onset often involves multiple pathways and requires the integration of multimodal data. For example, genes with increased DNA copy numbers have been found to be involved in important cancer pathways, and somatic mutation frequency and expression levels are also important factors in cancer drivers [143], [165], [166]. By integrating information at multiple omics levels and mining the linear or nonlinear associations through AI approaches, candidate key factors can be identified at a more in-depth level, which is crucial for discovering candidate targets for diseases.

Complex diseases such as cardiovascular disease, schizophrenia, cancer, and Alzheimer’s disease (AD) have many therapeutic targets, and multiple potential causative genes can be discovered through the multi-omics features of individual patients. Jeon et al. [31] used an SVM algorithm with a radial basis function (RBF) kernel to construct three models to predict potential targets specific to breast cancer (BrCa), pancreatic cancer (PaCa), and ovarian cancer (OvCa), respectively. Gene essentiality, gene expression, DNA copy number variation, somatic mutation, and PPI network topology were the main input features, and the SVM was able to deeply explore the association of and difference among these features to distinguish potential drug targets from non-target proteins. The model was cross-validated with ten folds and had a high area under the ROC curve (AUROC) value and a low false-positive rate. By using the trained model to predict 15 663 human proteins and score the prediction results, a total of 122 global cancer targets were identified for all cancers (69 of which corresponded to the 116 known targets that were rigorously validated). In addition, a large number of potential targets specific to BrCa, PaCa, and OvCa were identified. Of course, the identified targets were only for guidance and were not true drug targets.
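The two evaluation metrics mentioned above, AUROC and false-positive rate, can be illustrated with a minimal sketch. It is not the evaluation code of the cited study; the function names, labels, and scores are hypothetical and chosen only for illustration. The AUROC is computed here by its pairwise definition, that is, the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def false_positive_rate(labels, scores, threshold):
    """Fraction of true negatives that are scored at or above the threshold."""
    neg_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not neg_scores:
        raise ValueError("need at least one negative sample")
    false_pos = sum(1 for s in neg_scores if s >= threshold)
    return false_pos / len(neg_scores)


# Hypothetical toy data: 1 = known target, 0 = non-target protein.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)
toy_fpr = false_positive_rate(toy_labels, toy_scores, threshold=0.45)
```

In a ten-fold cross-validation, these quantities would be computed on each held-out fold and then averaged.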

Moreover, using multi-omics data with PPI networks, a group developed a network-based Bayesian algorithm framework [167] to infer loci for an AD genome-wide association study (GWAS) and revealed 103 AD risk genes (ARGs). This study included gene expression data from single cell transcriptomics, gene expression data from microarrays, and proteomics, fully demonstrating the ability of AI approaches to integrate multi-source and multimodal data to discover potential therapeutic targets.

ML has been instrumental in driving the learning process of multi-omics data, but it can be overwhelmed by larger multi-omics data and more complex problems. In contrast, DL can handle much larger amounts of multi-omics data and unearth deeper associations. On the assumption that the drug inhibition of targets and target gene knockdown (KD) should lead to the occurrence of similar biological processes, resulting in similar mRNA expression profiles, Pabon et al. [168] explored the direct and indirect feature correlations between compound-induced features and gene KD in CMAP, and combined these features with other features such as PPIs as inputs to an RF model to predict drug targets. To better mine the correlation between chemical perturbation (CP) features and KD genetic perturbation features, Zhong et al. [169] proposed a GCN model known as Siamese spectral-based graph convolution network (SSGCN) to mine transcriptomic data to predict compound-protein interactions (CPIs). SSGCN constructed two parallel GCN models for the feature extraction of CP profiles and KD profiles, respectively, where CP profiles and KD profiles were integrated with a PPI network (the attribute values of the network nodes were gene differential expression values, and if there was an interaction between two nodes, these two nodes were connected by an edge). Two sets of graph embedding vectors were obtained after feature extraction, and the degree of correlation between the CP features and KD features was obtained by means of a simple linear regression layer. The correlation was expressed as Pearson’s coefficient R² and was fed to the classifier as a feature along with cell line, CP time, dosage, and KD time to discriminate the interaction of compounds with the corresponding proteins. This model was subsequently validated externally and shown to be effective in identifying potential drug targets and facilitating drug repositioning studies.
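The underlying assumption, that a compound and the knockdown of its target yield similar expression signatures, can be sketched with a plain Pearson correlation. This is a simplification: SSGCN correlates learned graph embeddings rather than raw profiles, and the profile values below are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical differential expression values over the same five genes.
cp_profile = [1.2, -0.8, 0.3, 2.1, -1.5]   # after chemical perturbation
kd_profile = [1.0, -0.5, 0.1, 1.8, -1.2]   # after knockdown of a candidate target

r = pearson_r(cp_profile, kd_profile)
r_squared = r ** 2  # the squared coefficient, usable as a classifier feature
```

A high value for a compound-gene pair would, under this assumption, flag the gene product as a candidate target; in the published model, this feature is combined with cell line, time, and dosage information before classification.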

Most of these target discovery approaches use end-to-end models to directly discover druggable proteins. DL can also perform key roles in multiple specific steps in the target discovery process, such as predicting splicing from pre-mRNA transcript sequences using SpliceAI [170], predicting and analyzing gene expression probabilities in single cells from transcriptomic data using scVI [171], and predicting enhancers using PLEDA [172]. Some studies have performed a GWAS of COVID-19, with results suggesting a possible association with COVID-19 susceptibility in the 3p21.31 region of the chromosome. Building on these studies, Downes et al. [173] used multiple DL approaches combined with multi-omics data to discover that the gain-of-function risk A allele of a single-nucleotide polymorphism (SNP), rs17713054G>A, may be a causative variant. Further analysis revealed that leucine zipper transcription factor like 1 (LZTFL1), a gene regulated by rs17713054, was a critical gene for the development of epithelial-mesenchymal transition (EMT). EMT is a developmental pathway associated with lung inflammation that is frequently induced by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in lung cancer cell lines (CCLs) and the respiratory tract. As a key gene in this series of biological processes, LZTFL1 could serve as a potential therapeutic target.

The use of AI approaches can help effectively predict drug responses in cancer cells to advance precision medicine [174], [175], [176]. One group used elastic net regression and RF to identify how multi-omics data affect drug response prediction [177]. In this study, 265 drugs across 990 CCLs were screened to construct pharmacogenomic datasets. To comprehensively investigate the influence of different combinations of molecular data, linear and nonlinear ML models were built. Among the genome-wide gene expression, DNA methylation, gene copy number, and somatic mutation data, gene expression data was the most predictive data type in pan-cancer analysis, and genomic data (i.e., driver mutations, copy number alterations, or DNA methylation data) was the most predictive data type in cancer-specific analysis.

The importance of multi-omics data in drug response prediction has also been demonstrated. However, most methods do not take drug/cell line specificity, drug/cell line, or drug-protein associations into consideration. To address this issue, Peng et al. [178] combined multi-omics data with a GCN to construct an end-to-end model known as MOFGCN. Drug/cell line associations were used to initially construct a heterogeneous network in which the nodes were drugs or cell lines. The properties of the drugs were obtained by calculating the similarity of molecular fingerprints, and the properties of the cell lines were obtained by fusing multi-omics data (i.e., gene expression, copy number variation, and somatic mutation data) and calculating their similarity. The completely constructed heterogeneous network served as the input to a graph convolutional network, and the final features were obtained by passing messages between nodes to further learn the potential associations of drugs and cell lines. To predict drug sensitivity, a CCL-drug correlation matrix required further reconstruction based on a linear correlation matrix that was calculated from the updated features of drug and cell lines. The DL framework of predicting drug sensitivity, DeepDRK [179], integrated mutations, copy number variation, DNA methylation, gene expression, and drug screening as cell line features and extracted molecular-protein information as drug features. Then, the two features were spliced as the features of a CCL-drug pair and were fed into the DNN to predict the drug sensitivity.

The combination of omics data and AI methods can help researchers quickly obtain the information they need at the molecular scale, as the various levels of omics data reflect the various processes of life activity. Integrating and analyzing this information can aid in the understanding of complex biological systems and thus assist in the discovery of new drug targets.

3.2. Drug-target interactions (DTIs) discovered using chemogenomics

The identification of DTIs is currently contributing to research in drug discovery. Newly discovered DTIs can be used to find new targets that interact with existing drugs or to discover new compounds that interact with a disease-related target. Therefore, research results on DTIs are widely used in the fields of lead compound discovery, new target discovery, drug repositioning, and drug side-effect prediction [3], [180], [181]. Although HTS methods have been developed to determine the activity of thousands of compounds at once, they cannot match AI methods in terms of either cost or the number of compounds evaluated. In general, methods for predicting DTIs are divided into three main approaches: ligand-based methods, structure-based methods, and chemogenomic methods. Each of these approaches has its own advantages and disadvantages, with the third being the most widely applicable and popular. Therefore, this section focuses on reviewing chemogenomic methods, while the other two methods are covered in Section 4.

The chemogenomic approach not only uses drug-related and target-related information but also connects this information to multiple sources of biomedical information in order to better predict DTIs. Publicly accessible database resources contain a large amount of structured and unstructured biomedical data to support access to information. ML and DL can extract relevant functional information and reduce the noise from this large amount of heterogeneous data in order to discover new protein targets precisely and efficiently. Table 3 [37], [54], [55], [57], [58], [182], [183], [184], [185], [186], [187], [188], [189], [190], [191] lists some currently high-quality public databases.

Prediction of DTIs is usually regarded as a binary classification problem. It is very convenient to use an ML approach to predict DTIs, which usually only requires obtaining the SMILES of small molecules and the sequences of target proteins. These sequences are converted into feature vectors via different rules and are later used as inputs to a model to predict their final classification. These molecules and proteins are characterized in a variety of ways and often contain information about the physicochemical properties of the molecules and proteins, as well as their structure. A number of toolkits and libraries for molecule and protein representations have been developed and are listed in Table 4 [192], [193], [194], [195], [196], [197], [198], [199], [200], [201], [202], [203], [204], [205], [206], [207], [208], [209], [210], [211], [212], [213], [214], [215]. For example, small molecules characterized using MACCS fingerprints were spliced with protein vectors characterized by CTD descriptors and used as inputs to an SVM to predict DTIs [216]. The occurrence of a DTI is influenced by numerous factors and corresponds to multidimensional features that represent the structure and properties of the molecule and protein. It is hoped that the model can find out more about the mechanism of DTI from these features and then give classification judgments based on information. Such problems have also been treated as regression problems; DeepDTA is a CNN model that used the SMILES of small molecules with sequences of proteins to predict the affinity of small molecules with proteins [217]. Using only single-feature representation does not fully characterize small molecules or proteins, so some studies have used multiple descriptors to characterize small molecules and proteins and have integrated these features as vectors of inputs to predict DTIs. This improves the classification performance of the model to a certain extent [218]. 
In order to enable researchers to more conveniently use DL to make predictions about DTIs, Huang et al. [219] proposed DeepPurpose, which implements more than 50 DL models (including CNN, MLP, RNN, etc.). DeepPurpose can encode proteins in seven distinct ways, including MLP on AAC, PAAC, conjoint triad, quasi-sequence descriptors, CNN on amino acid sequences, RNN on top of CNN, and transformer encoder on substructure fingerprints. For compounds, there are eight encoders, including MLP on Morgan, PubChem, Daylight fingerprint, RDKit 2D fingerprint, CNN on SMILES strings, RNN on top of CNN, transformer encoders on substructure fingerprints, and a message-passing GNN on a molecular graph. Those encoding methods just use SMILES and the amino acid sequence as input. In this way, researchers can conveniently predict DTIs using different encoding methods on different models.
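The feature-splicing idea described above can be sketched as follows. This is an illustrative simplification, not the pipeline of any cited work: the protein is encoded by amino acid composition (AAC) rather than CTD descriptors, and the fingerprint bits are a hypothetical placeholder for what a cheminformatics toolkit would compute from a SMILES string.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the 20-dimensional AAC vector (residue frequencies) of a protein."""
    residues = [aa for aa in sequence.upper() if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return [residues.count(aa) / len(residues) for aa in AMINO_ACIDS]


def pair_features(fingerprint_bits, sequence):
    """Concatenate a drug fingerprint and a protein descriptor into one vector.

    The resulting vector represents one drug-target pair and would be paired
    with a binary label (interacting or not) to train a classifier.
    """
    return [float(bit) for bit in fingerprint_bits] + amino_acid_composition(sequence)


# Hypothetical inputs: a short 8-bit fingerprint and a short peptide sequence.
toy_fingerprint = [1, 0, 0, 1, 1, 0, 1, 0]
toy_sequence = "MKTAYIAKQR"
toy_vector = pair_features(toy_fingerprint, toy_sequence)  # 8 + 20 = 28 features
```

Treating the task as regression instead changes only the label, for example a measured binding affinity in place of the binary class.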

The abovementioned studies were able to obtain a good performance using only the SMILES sequence and amino-acid sequence of proteins. At the same time, it is important to integrate various data sources to predict DTI, such as drug-drug interactions, PPIs, and drug-disease associations. Bleakley and Yamanishi [220] constructed a bipartite graph on DTI [221], [222] and applied an SVM model for DTI prediction in a later work. The four datasets constructed in this work have become the gold standard datasets for later DTI prediction models. Inspired by this work, there have been a proliferation of network-based approaches to predict DTI. A computational pipeline called DTINet was then developed that integrated multiple heterogeneous data sources to construct networks on DTI [223]. In this study, four drug similarity networks were constructed based on ① drug-drug interaction networks, ② drug-disease association information, ③ drug side-effect association information, and ④ chemical structure information. Similarly, three protein similarity networks were constructed based on ① PPIs, ② protein-disease associations, and ③ genomic sequences. Using these similarity networks, a network diffusion algorithm (random walk with restart (RWR)) was first applied on individual networks separately, and the feature vectors were optimized. The low-dimensional vector representations obtained after this learning process contained information derived from various heterogeneous data sources and were able to better represent the drug/protein-specific properties. The obtained vectors were then used to discover new DTIs according to their spatial correspondence with drugs and proteins.
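The diffusion step can be illustrated with a minimal random walk with restart on a toy network. This sketch covers only the RWR iteration, not the subsequent dimensionality reduction or DTI inference of the cited pipeline; the adjacency matrix, restart probability, and function name are assumptions made for illustration.

```python
def random_walk_with_restart(adjacency, seed, restart=0.5, tol=1e-12, max_iter=1000):
    """Iterate p <- (1 - restart) * W p + restart * e until convergence.

    W is the column-normalized adjacency matrix and e is the one-hot vector of
    the seed node. The result is the diffusion state of the seed node.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    transition = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    start = [1.0 if i == seed else 0.0 for i in range(n)]
    state = start[:]
    for _ in range(max_iter):
        updated = [
            (1 - restart) * sum(transition[i][j] * state[j] for j in range(n))
            + restart * start[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(updated, state)) < tol:
            return updated
        state = updated
    return state


# Hypothetical similarity network: a three-node path 0 - 1 - 2.
toy_network = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
diffusion_state = random_walk_with_restart(toy_network, seed=0)
```

Each node's diffusion state summarizes its position in the network; nodes close to the seed receive more probability mass than distant ones.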

The use of DL models allows for the integration of heterogeneous data from multiple sources while providing a comprehensive characterization of drugs or biomolecules. Zeng et al. [224] proposed a framework called deepDTnet to integrate heterogeneous data sources for the prediction of DTI. In this study, 15 networks—including genomics, GOA, protein-related similarity, and drug-related similarity—were integrated to construct a heterogeneous network connecting drug targets and disease information. A DNN for graph representation (DNGR) algorithm was developed to obtain the informative vector of both drugs and targets based on the constructed network. However, the lack of negative samples in public databases led to difficulties in the model training process; thus, a positive-unlabeled (PU)-matrix completion algorithm was employed to infer whether two drugs shared a target. The results showed that combining the heterogeneous data to re-represent the drug and target without a descriptor or fingerprint achieved an excellent performance.

As mentioned before, the emergence of large-scale knowledge of omics data, systems biology, chemistry, pharmacology, and so forth is providing new perspectives for DTI prediction. However, the integration of heterogeneous data from multiple sources undoubtedly introduces a huge amount of noise and does not solve the “cold-start” problem well. Here, knowledge graphs (KGs) stand out with their powerful ability to integrate heterogeneous information. By leveraging the interactions of phenotype, drug, target, and gene, a KG can help to further understand the molecular mechanism of a disease and to explore potential drug targets. Recent studies have integrated resources from several databases (DrugBank, TTD, ChEMBL, BindingDB, SIDER, GO, etc.) to construct KG such as BioKG, PharmKG, Hetionet, and drug-repurposing KG (DRKG) [30], [225]. A KG usually represents knowledge as a triple, which is composed of a head entity, relation, and tail entity. In the field of DTI recognition, the KG embedding (KGE) model is often used to represent entities and relations by means of low-rank vectors, in what is also known as the representation learning of KGs. The representation vectors obtained by a KG can be further used for link prediction to discover drug-target relationships [30]. A KG typically integrates a huge amount of data with dozens or even hundreds of relationships. The vectors obtained via a KG often contain a certain exact positioning and relationship of this entity in the biological network, but not its own structure or physical and chemical properties. The same is true for proteins. To address this issue, Ye et al. [118] developed a framework called KGE_neural factorization machine (NFM) that performs DTI prediction using a KGE technique combined with a recommendation system technique. In this process, an accurate entity vector is first obtained from the potential information learned from the heterogeneous network via KGE. 
Next, the structural information of the drug and target is obtained from molecular fingerprints and protein descriptors. Finally, multimodal information is extracted using an NFM, and the DTIs are predicted using DL methods. This approach was tested for “cold-start” scenarios of drugs or proteins and achieved a SOTA performance, particularly for protein “cold-start” scenarios.
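The triple representation and embedding-based link prediction can be sketched as follows. The text does not specify which KGE model is used, so a TransE-style score (head + relation ≈ tail) is assumed here purely for illustration; all entity names and embedding values are hypothetical and set by hand rather than learned.

```python
import math

# Knowledge is stored as (head entity, relation, tail entity) triples.
TRIPLES = [
    ("drug_A", "binds", "protein_X"),
    ("protein_X", "associated_with", "disease_D"),
]

# Toy 2D embeddings; a real KGE model would learn these from the triples.
ENTITY_EMBEDDING = {
    "drug_A": [0.0, 0.0],
    "protein_X": [1.0, 0.0],
    "protein_Y": [0.0, 1.0],
    "disease_D": [1.0, 1.0],
}
RELATION_EMBEDDING = {
    "binds": [1.0, 0.0],
    "associated_with": [0.0, 1.0],
}


def transe_score(head, relation, tail):
    """TransE-style plausibility: negative distance between head + relation and tail."""
    h = ENTITY_EMBEDDING[head]
    r = RELATION_EMBEDDING[relation]
    t = ENTITY_EMBEDDING[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation, candidates):
    """Link prediction: order candidate tail entities from most to least plausible."""
    return sorted(candidates, key=lambda c: transe_score(head, relation, c), reverse=True)


ranked_targets = rank_tails("drug_A", "binds", ["protein_Y", "protein_X"])
```

Because such vectors encode only network position, a framework of the kind described above would additionally concatenate fingerprint and descriptor features before the final prediction.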

In addition to the aforementioned methods for predicting DTIs, similarity-based [226] and matrix decomposition-based methods [227] can be used, among others, and have contributed greatly to DTI prediction in the past. With the development of DL, network-based methods, feature-based methods, and so forth are now being used in combination, bringing the advantages of each method into play to better predict DTIs and discover new targets [228], [229]. Based on recent studies in the field, DTI research methods can be roughly classified into six groups; Table 5 [217], [221], [223], [226], [227], [230], [231], [232], [233], [234], [235], [236], [237], [238], [239], [240], [241], [242], [243], [244], [245], [246], [247] provides a brief summary of the relevant strategies.

Future research should integrate omics data more closely with biomedical data networks for a more accurate characterization of drugs or proteins. Moreover, similarity approaches have a crucial effect on DTI prediction, and combining multiple similarity results may improve model performance. One common problem in model training is the unavailability of accurate negative datasets. Accurate DTI data in publicly available data sources are rigorously experimentally validated, and the experimental validation process for each one is exhaustive; however, most failed experiments will not be reported. Furthermore, manually validated data is time-consuming, and a large amount of data has not been validated for exact interactions. Therefore, the dataset used for DTIs should always use the latest and most comprehensive drug-target database, such as TTD and DrugBank, and additional inactive experimental data should be open-sourced to improve the current DTI data system.

4. SOTA application of AI to modern drug design

Drug discovery is a long-term and painstaking process. In the past decades, techniques such as HTS and combinatorial chemistry, as well as other techniques, played an important role in the discovery of lead compounds. Further structural modifications of the obtained lead compounds were then developed to reduce toxicities and improve efficacy. As these techniques gradually increased in popularity, however, their various disadvantages were gradually revealed. Similarly, in the 1980s, CADD was no less popular than today’s AI. For example, QSAR models were widely used as soon as they were proposed. However, in those days, QSAR-based models were limited by the available computing power, dataset size, and other issues, and their predictive performances were never satisfactory [248], [249], [250].

In recent years, the advancement of computing power has driven the rapid development of AI, while positively promoting the development of computational chemistry and pharmacology. For example, various ML and DL methods were used in various Kaggle competitions to improve the predictive performance of QSAR methods, all of which achieved high performance [78]. As mentioned above, DL allows the identification of new molecular representations instead of relying solely on off-the-shelf and expert-derived chemical signatures. AI algorithms relying on rich biomedical data show promising prospects in areas such as bioactivity prediction, VS of drugs, and de novo drug design.

Before going into details, it is necessary to briefly introduce the concepts of structure-activity relationships (SARs) and QSARs. These two concepts are frequently used in drug design using ML and DL methods and are powerful aids in the design, optimization, and development of drugs. SARs are based on the assumption that molecules with similar structures have similar activity. In drug discovery, QSARs are based on various molecular characterization methods (e.g., molecular descriptors and molecular fingerprints) and mathematical models to describe the mathematical relationship between the structure of a molecule and its specific biological activity. A QSAR model assumes that the structure of a compound determines its physicochemical properties and biological activity; therefore, quantitative relationships can be established between the structure of a compound and its physicochemical properties, biological activity, toxicological effects, and so forth. The QSAR analysis process usually includes the preparation of preliminary datasets, the calculation and selection of molecular descriptors, the establishment of relevant models, and the evaluation and validation of model results [248], [251].
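In its simplest form, a QSAR model is a regression from a molecular descriptor to a measured activity. The following sketch fits a one-descriptor linear model by ordinary least squares; the descriptor and activity values are synthetic, and practical QSAR models use many descriptors, descriptor selection, and external validation.

```python
def fit_linear_qsar(descriptor, activity):
    """Least-squares fit of activity = slope * descriptor + intercept."""
    n = len(descriptor)
    if n != len(activity) or n < 2:
        raise ValueError("need at least two matched observations")
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_activity(slope, intercept, descriptor_value):
    """Apply the fitted model to a new compound's descriptor value."""
    return slope * descriptor_value + intercept


# Synthetic training set: one descriptor (e.g., a lipophilicity-like value)
# and one activity value (e.g., a pIC50-like value) per compound.
train_descriptor = [1.0, 2.0, 3.0, 4.0]
train_activity = [4.1, 5.0, 5.9, 7.0]

slope, intercept = fit_linear_qsar(train_descriptor, train_activity)
predicted = predict_activity(slope, intercept, 2.5)
```

The remaining steps of the QSAR workflow, evaluation and validation, would compare such predictions against held-out experimental values.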

4.1. Cutting-edge techniques facilitating VS

VS has endured for the past decade or so. In order to reduce the number of compounds that actually need to be measured and increase the efficiency of lead compound discovery, the in silico approach is used to simulate the interaction between a target and a small molecule and predict the affinity between the two before a bioactivity test is performed [252]. VS methods are often classified into structure-based VS (SBVS) or ligand-based VS (LBVS) [253], [254], [255]. The combination of AI and VS has brought a new dynamism to the field. A variety of molecular characterization approaches combined with various novel model architectures have provided new insights into the discovery of new compounds [9].

SBVS selects potential ligands based on the 3D conformation of the protein and scores the ligand’s ability to bind to the protein based on the inputted knowledge of biophysical methods, resulting in a ranking of drug candidates. Previously, simulations using various docking software were the dominant approach and resulted in many algorithms, such as Monte Carlo (MC) algorithms [256] and molecular dynamics (MD) algorithms [252], [257], [258]. A primary limitation of the simulation results is the construction of the scoring function, which must take many factors into account along with their plausibility as parameters. AI takes these many factors as features of the data, implicitly learns the relationship between the features and the experimental results, extracts useful nonlinear mapping relationships from them, and gives a final score. A VS method known as ID-Score [120] selected nine classes of property descriptors (i.e., van der Waals interaction, hydrogen-bonding interaction, electrostatic interaction, π-system interaction, metal-ligand bonding interaction, desolvation effect, entropic loss effect, shape matching, and surface property matching) as features, used 2278 compounds as the training set, and used a support vector regression (SVR) algorithm to fit the binding affinity of small molecules to proteins. The results showed that ID-Score can correctly distinguish structurally similar ligands, demonstrating its use as a powerful tool for assessing structure-based drug-protein affinity.

In another study, a CNN was used to score protein-ligand interactions. Unlike traditional methods, CNNs are powerful enough to accept 3D representations of protein-ligand interactions as input. During the training of the model, the CNN learns the key features affecting binding from the 3D representation, which are used to distinguish correct from incorrect binding poses and known binders from nonbinders. Xie et al. [259] took a different perspective to improve the efficiency of lead compound discovery by combining an SVM classification model with a docking-based VS method. More specifically, they developed an SVM model to distinguish inhibitors of the target from non-inhibitors and performed a docking-based VS on this basis. This combination greatly improved the hit rate and enrichment factor of the VS. In contrast to the work by Xie et al. [259], Pereira et al. [260] developed DeepVS, which uses a DL approach to optimize docking-based VS. In this study, the Directory of Useful Decoys (DUD) [261] was used as the benchmark dataset to evaluate the method. Dock [262] and AutoDock Vina 1.1.2 [263] were used as docking programs to generate protein-compound complexes. Then, essential processing of the protein-compound complexes was done and the results were fed into the CNN model as input. The CNN model extracted the key features from this essential data and used them to evaluate the score of the ligands. The results showed that the proposed DeepVS achieved advanced performance on VS.
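The hit rate and enrichment factor used to compare VS methods can be computed as in the following sketch. It uses the common definition of the enrichment factor (hit rate in the top-ranked fraction divided by the hit rate in the whole library); the function names and the toy screening data are hypothetical.

```python
def hit_rate(labels):
    """Fraction of compounds in a list that are active (label 1)."""
    if not labels:
        raise ValueError("empty compound list")
    return sum(labels) / len(labels)


def enrichment_factor(labels, scores, fraction):
    """Hit rate among the top-scored fraction divided by the overall hit rate."""
    n = len(labels)
    if n == 0 or n != len(scores):
        raise ValueError("labels and scores must be non-empty and of equal length")
    if sum(labels) == 0:
        raise ValueError("enrichment factor is undefined without any actives")
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_labels = [labels[i] for i in ranked[:n_top]]
    return hit_rate(top_labels) / hit_rate(labels)


# Hypothetical screen of ten compounds, two of which are active.
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
ef_top20 = enrichment_factor(toy_labels, toy_scores, fraction=0.2)
```

An enrichment factor of 1 corresponds to random selection, so values well above 1 indicate that the scoring method concentrates actives near the top of the ranking.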

In comparison with the SBVS approach, which is limited by the structural information of the target protein, LBVS can make full use of the known ligand bioactivity data and screen a large database of compounds to discover potential lead compounds. Therefore, AI-based VS tends to favor LBVS. The starting point of LBVS is the assumption that structurally similar compounds have similar biological activities; thus, the AI methods used in this field include both regression models for activity prediction and classification models based on compound similarity.

QSAR is widely used in LBVS because of its use of mathematical models to relate molecular structures to quantitative biological activities. NB, RF, and SVM are very popular algorithms in LBVS. AbdulHameed et al. [264] screened a database with nearly 2000 compounds using a QSAR-based model with an NB algorithm and using the physicochemical properties of the molecules as features. Finally, it was found that activators of pregnane X receptor (PXR) tend to be hydrophobic, while the in vitro and in vivo activities are often consistent. Profile-QSAR 2.0 was presented to predict the activity of compounds [265]. Compared with the earlier profile QSAR (pQSAR) 1.0 method, the pQSAR 2.0 method used the historical activity values of the compounds as variables. The optimized pQSAR used an RF model to predict the half-maximal inhibitory concentration (IC50) values, achieving the same accuracy as the medium-throughput four-concentration IC50 measurements. Chen and Visco [266] created a pipeline integrating QSAR with an SVM model to identify the inhibitors of Cathepsin L. They used a signature (a descriptor based on fragments) as the model’s input. After optimizing the model, nine out of 12 screened compounds were experimentally confirmed. ANNs are another commonly used tool in QSAR studies. Myint et al. [267] reported an ANN-based QSAR method called fingerprint-based ANN (FANN)-QSAR that uses three different molecular fingerprints: ECFP6, FP2, and MACCS. The well-trained model was used to predict the affinity of cannabinoid ligands and found compounds with a good affinity for cannabinoid receptor type 2 (CB2). In another study, the minimal inhibitory concentration (MIC) of quinolones was determined by using topological descriptors in an ANN [268]. As more DL methods have gradually been used for QSAR-related studies, researchers have found that DL tends to outperform ML in both single-task and multi-task learning [269], [270], [271].

QSAR methods are not the only tools used for LBVS [272], [273], [274]. Li et al. [275] used multiple ML methods to construct classification models to select liver X receptor (LXR) agonists. During this process, optimized property descriptors and topological fingerprints were used to characterize small molecules in the database and constitute a total of 324 models with four algorithms: NB, SVM, KNN, and recursive partitioning (RP). The top 15 models were selected for evaluation, and ten models were found to have an accuracy of more than 90%. In another study, an SVM with NB was used to identify butyrylcholinesterase (BuChE) inhibitors [276]. Initially, 1870 descriptors were selected; after analysis, activity-related descriptors were then selected to reduce noise. A better performance was eventually achieved. There are also numerous examples of self-organizing mapping (SOM) being used in LBVS [277]. For example, Hristozov et al. [278] used SOM as a model to recognize and exclude compounds that are unlikely to have specific biological activity. The power of SOM has also led to its use in some software [279].

With the rapid increase in the number of known compounds in recent years, DL architecture has been found to be more suitable for processing large compound datasets. One group trained a neural network on existing HTS data, using molecular graphs as input, to learn molecular representations [280]. Compounds with similar representations were then placed close to one another in the high-dimensional feature space. After learning the features, the similarity to drug molecules in a large compound library was measured using cosine similarity, and the small molecules in the library were ranked and filtered to obtain lead compounds. Unlike the use of graph models to generate the features of small molecules, adversarial AEs (AAE) were used by Kadurin et al. [281] to construct a small molecule feature generator. Based on the obtained features, 72 million compounds in PubChem were screened to discover potential anticancer drug molecules. CNNs are widely used in image recognition; thus, for the purpose of using CNN models in drug research, molecules or proteins are often characterized in the form of matrices. Xu et al. [282] directly used images of molecules as input to CNN models to screen for inhibitors of cyclin-dependent kinase 4 (CDK4) and achieved better performance than competing models. The use of DL for LBVS has been increasingly studied in recent years, and models such as RNN [283] and RL [284] have been used for drug discovery, providing more opportunities and benefits for LBVS.
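The similarity-based ranking step can be sketched as follows, assuming that each compound has already been mapped to a learned feature vector. The vectors, compound names, and function names are hypothetical; in practice the representations would come from the trained network.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_library(query_vector, library):
    """Return library compound names ordered from most to least similar to the query."""
    return sorted(
        library,
        key=lambda name: cosine_similarity(query_vector, library[name]),
        reverse=True,
    )


# Hypothetical learned representations of a known drug and three library compounds.
known_drug = [1.0, 0.0]
compound_library = {
    "compound_a": [2.0, 0.0],
    "compound_b": [0.0, 3.0],
    "compound_c": [1.0, 1.0],
}
ranking = rank_library(known_drug, compound_library)
```

Compounds at the top of the ranking would then be filtered further and prioritized for experimental testing.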

Overall, efficient lead compound discovery through VS is still a huge challenge, as there is no satisfactory way to address issues such as the activity cliff. AI algorithms are powerful tools that can be used not only for SBVS but also for LBVS to help overcome the relevant challenges and assist in de novo drug design. As the complexity of algorithms increases and high-quality data becomes available in the future, bottlenecks in existing technologies will continue to be broken, facilitating the discovery of new drugs.

4.2. Recent progress in de novo drug design

The aim of drug design is to design drugs with specific properties that satisfy specific criteria, including efficacy, safety, reasonable chemical and biological properties, and structural novelty. In recent years, de novo drug design with the help of deep generative models and reinforcement learning algorithms has been considered to be an effective means of drug discovery. This approach can bypass the drawbacks of the traditional empirical-based drug design paradigm and allow computers to learn the drug targets and molecular features by themselves to generate compounds that meet specific requirements at a faster and less costly rate [285], [286], [287].

De novo drug design according to protein structure used to be the dominant approach. In this approach, whether designing new molecules directly from protein structures or making reasonable inferences from the properties of known ligands, the corresponding ligands are designed according to the spatial and electric potential constraints of the target protein binding pocket in order to discover molecules with specific properties. A huge limitation of these early approaches was that the resulting new molecules were not chemically accessible—that is, their structures were practically impossible to synthesize or extremely difficult to produce, or the molecules had poor druggability. In addition, many de novo drug design approaches utilize fragments of molecules with known properties for molecular assembly, and use large libraries of molecular fragments to generate and design molecules with novel structures while ensuring that the molecules can be synthesized. However, this approach relies on chemical knowledge to replace or add molecular fragments, which will restrict the search space and ignore certain potential molecular structures. The generation of new molecules with deep generative models and the targeted optimization of models with reinforcement learning algorithms can solve the problems of the above traditional methods in a more satisfactory way [288], [289], [290].

Deep generative models are of great advantage in the field of de novo drug design, as they do not require explicit prior input of chemical knowledge during the generation of molecules. These models can search in a broader unknown chemical space to automatically design novel molecular scaffolds beyond the limitations of existing molecular scaffolds. Deep generative models that are widely used for de novo drug design include RNN-based generative models, variational AEs, AAEs, and GANs. The process of designing molecules with generative models is highly stochastic, and the generated molecules are highly variable in structure and uneven in quality. Reinforcement learning can enable generative models to perform targeted optimization by fine-tuning the model parameters so that the generated molecules have specific drug molecule properties.

RNN-based generative models can generate compounds with similar biochemical properties as the sample compound but with a completely new scaffold structure. The training process starts by using a large chemical database to train the RNN model so that the model can learn how to generate the correct chemical structure. Reinforcement learning algorithms are then used to fine-tune the RNN parameters so the model is capable of mapping generated chemical structures to a specified chemical space. Reinforcement learning enables the RNN-based generative model to generate new molecules with promising pharmacological properties, while ensuring the structural diversity of the generated molecules. A single reinforcement learning reward mechanism often leads to relatively simple structures of the generated molecules, so an appropriate and multi-perspective reward function must be selected to guide molecule generation. Olivecrona et al. [123] developed a sequence-based approach to de novo drug design called REINVENT. First, the researchers collected 1.5 million molecules from the ChEMBL database that satisfied specific requirements and used SMILES of these molecules to train the RNN model to learn the characteristics of active molecules and generate new molecules. The generated molecules were then scored using a reinforcement learning algorithm to fine-tune the RNN parameters, so that new compounds with activity against a specific target could be generated. This method was applied to several different molecule generation tasks in the study, including the generation of sulfur-free molecules, backbone expansion from a single molecule to generate celecoxib-like structures, and the generation of new inhibitor molecules for type 2 dopamine receptors.
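
The fine-tuning step can be made concrete with the augmented-likelihood objective described in the REINVENT publication, in which the prior likelihood of a generated sequence is shifted by a scaled score and the agent is trained to match it. The sketch below is an illustration under assumptions: the text above does not spell out this formula, and the value of sigma, the log-likelihoods, and the scores are illustrative placeholders.

```python
def augmented_log_likelihood(prior_logp, score, sigma):
    """Prior log-likelihood of a sequence shifted by the scaled score S(A)."""
    return prior_logp + sigma * score


def agent_loss(agent_logp, prior_logp, score, sigma=20.0):
    """Squared gap between the augmented likelihood and the agent likelihood.

    Minimizing this pulls the agent toward sequences with high scores while
    keeping it anchored to the prior learned from the large chemical database.
    sigma is a hyperparameter; 20.0 is an arbitrary illustrative value.
    """
    target = augmented_log_likelihood(prior_logp, score, sigma)
    return (target - agent_logp) ** 2


# Hypothetical numbers: one sampled SMILES with log-likelihood -30 under both
# the prior and the (not yet fine-tuned) agent.
loss_unrewarded = agent_loss(agent_logp=-30.0, prior_logp=-30.0, score=0.0)
loss_rewarded = agent_loss(agent_logp=-30.0, prior_logp=-30.0, score=0.5)
```

A molecule with a score of zero produces no gradient signal when agent and prior agree, whereas a well-scored molecule yields a positive loss that pushes the agent to raise its likelihood.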

Another area in which RNN-based generative models are applied in drug design is the optimization of lead compounds [291]. A new molecular generation algorithm called scaffold-constrained molecular generation (SAMOA) was proposed to solve the scaffold constraint problem in lead compound optimization. The study used an RNN generative model to generate SMILES sequences of new molecules, and then used a refined sampling procedure to implement the scaffold constraint and generate molecules. A policy-based reinforcement learning algorithm was also applied to explore the relevant chemical space and generate new molecules that meet the expected requirements. The DeepFMPO framework proposed by Ståhl et al. [292] started from an initial set of lead compounds and modified the structure of these lead molecules by replacing some of their fragments. This study confirmed the wide use of RNN-based generative models in the field of molecular generation.

As deep generative models, VAEs are often used in various generative tasks, including the de novo design of small molecules and the generation of peptide sequences. A group constructed a molecular generation model based on a conditional VAE for de novo molecular design with a three-layer RNN for both the encoder and decoder. The results demonstrated that this model can design drug-like molecules with five target properties and can also tune individual molecular properties without affecting other properties [124].

In 2019, Insilico Medicine published a study [28] on the rapid de novo design of potent discoidin domain receptor 1 (DDR1) kinase inhibitors using a VAE-based model called generative tensorial reinforcement learning (GENTRL). Several new compounds with inhibitory activity against DDR1 kinase were identified, chemically synthesized, and experimentally validated in just 21 days. This study demonstrated the potential of the method to perform fast and efficient molecular design. The GENTRL model consists of two main components: a VAE and a policy gradient reinforcement learning algorithm. The VAE is used to generate new molecules, while the reinforcement learning fine-tunes the model parameters to make the new molecules generated by the VAE more consistent with the expected properties. The encoder of the VAE encodes known molecules into hidden vectors. The decoder samples a hidden vector from the hidden vector space and decodes it into a new molecule. A reinforcement learning algorithm is used to guide the directed optimization of the VAE during the training process. After model construction, Insilico Medicine used GENTRL to generate four new active compounds, two of which were validated in cellular experiments. Moreover, one of the lead compounds was tested in mice and was shown to have good pharmacokinetic properties. This study provides strong evidence that reinforcement learning combined with deep generative models can accelerate the process of and provide new insights into de novo drug design.
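
The sampling step between the encoder and the decoder of a VAE can be sketched as follows. This shows the standard reparameterized Gaussian sampling used in generic VAEs, not the specific latent-space construction of the model in [28]; the encoder outputs (mu, log_var) and their values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), where sigma = exp(0.5 * log_var).

    mu and log_var stand in for the per-dimension outputs of a VAE encoder;
    the resulting hidden vector z would be passed to the decoder.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical encoder outputs for one molecule in a 2-dimensional latent space.
mu = [0.5, -1.0]
z_spread = sample_latent(mu, [0.0, 0.0], rng)        # sigma = 1 in each dimension
z_tight = sample_latent(mu, [-100.0, -100.0], rng)   # sigma is effectively zero
```

With a very small variance the sample collapses onto the encoded mean, so decoding reproduces the input molecule; larger variances explore neighboring regions of the hidden vector space.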

GANs are capable of generating new samples with a similar distribution to real data and have advantages in the fields of image recognition and natural language processing (NLP). In the pharmaceutical field, GANs are often integrated with techniques such as feature learning and reinforcement learning, and have played an important part in protein function prediction, small molecule generation, and more. Various molecular generation models have been constructed based on GANs, such as Mol-CycleGAN [293], objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC) [294], and reinforced adversarial neural computer (RANC) [295]. ORGANIC is a well-known molecular generation model that has become a comparative baseline model for other models. Its combination of a GAN model and a reinforcement learning algorithm can generate novel and effective molecules. The molecule generation performance of the RANC model has surpassed ORGANIC in many aspects, including the ability to generate new molecular structures and drug-like properties of molecules, which allows the design of active new molecules for different biological targets and covers a wide chemical space.

In addition, Harel and Radinsky [296] proposed a molecular template-driven neural network that combines a VAE, CNN, and RNN to generate chemical structures with similar properties to the template molecules while being structurally diverse. The researchers found that the proportion of effective molecules among the generated molecules was significantly enhanced by adjusting the sampling process of the VAE.

Molecules designed by computer must not only have good physicochemical properties but also be highly active and selective for the target under study; therefore, the question of how to set up an effective reward function is an important challenge in reinforcement learning. A combination of the framework of deep generative models with reinforcement learning algorithms drives the development of the drug design field and will have significant applications in the future in the de novo design of small-molecule and peptide drugs.

4.3. Application of advanced techniques in antibody design

Due to the wide application of ML and DL in chemistry, biology, and medicine, as well as their use in basic research in various fields, researchers now have a profound comprehension of biomolecules and systems biology. In the future, drug R&D will continue to lean toward the research of small molecules; at the same time, innovative biological drugs will gain ground. Accordingly, many DL approaches already exist, and more are expected in the near future, for the study of biological macromolecule drugs such as oligonucleotides, monoclonal antibodies, and peptides with specific pharmacological properties. Here, we will elaborate on the design of antibodies.

Since antibodies are inherently biological macromolecules, the characterization of antibodies is similar to the encoding of proteins and RNAs. There are six general strategies for encoding antibodies: “one-hot” encoding, substitution matrix, amino acid properties, learned amino acid properties, encoding of supplementary attributes, and encoding of structural features [297]. The application of AI in antibodies is different from its application in ordinary biomolecules because antibodies are biological agents that can be used for disease treatment. Therefore, the design of antibodies has more in common with the design of drugs, since safety and efficacy of drugs must be taken into account. At present, AI-based methods are often used for antibody structure prediction, antigen-antibody binding prediction, antibody generation/design, deimmunization studies, and antibody sequence-based studies [297].
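
The first of these strategies, one-hot encoding, can be sketched in a few lines. The residue ordering, the handling of unknown residues, and the example fragment are assumptions made for illustration and are not taken from [297].

```python
# The 20 standard amino acids in alphabetical one-letter order (an arbitrary
# but common convention; any fixed ordering works).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return an L x 20 matrix with a single 1 per row for each known residue.

    Residues outside the 20 standard amino acids (e.g., 'X') are encoded as
    an all-zero row, which is one of several possible conventions.
    """
    matrix = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix


# Hypothetical three-residue fragment used purely as an example.
encoded = one_hot_encode("CAR")
```

The resulting matrix carries no information about residue similarity, which is why substitution matrices or learned amino acid properties are often preferred when data are limited.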

The AlphaFold2 DL system has been able to solve most of the protein structure prediction problems; however, for antibody structure prediction, as a special subfield of protein structure prediction, it is necessary to capture the subtle differences in the structure with extreme precision. Many methods have been developed to solve this problem, such as DeepAb [298] and DeepH3 [299]. To perform VS for the binding of antibodies to target antigens, a structure-based framework called DL for antibodies (DLAB) was proposed to improve antibody-antigen dockings [300]. As DLAB is a structure-based approach, it can optimize the pose ranking of antibody docking experiments and select antibody-antigen pairs for which accurate poses are generated and properly ranked. This approach has also demonstrated that the SBVS of antibodies can strongly complement traditional experimental screening methods.

The search for new antibody sequences is a major research hotspot in antibody discovery. Early computational approaches attempted to use enumeration methods for new sequence discovery and subsequent prediction work. Although these methods reflect the diversity of designed antibodies, they do not explain these discoveries in a biological sense and lack conviction. Recently, the potential features of antibodies—including the frequency of amino acid positions and the physicochemical properties of the antibody—have been learned by GANs or VAEs [301]. These methods provide a new way of thinking and a new approach for antibody generation and design, which can be relied upon in the future to design therapeutic antibodies via DL.

The directions for the development of antibody drugs discussed above stem from a starting point that is similar to that of the design of small molecule drugs. Antibodies can be designed differently than traditional drugs due to their large molecular weight and attributes such as biomolecular function. In designing an antibody drug, it is necessary to consider the immune response the drug elicits when it enters the body. Thus, it is critical to use ML algorithms for analysis of next-generation sequencing (NGS) data to carry out deimmunization studies of antibodies [302]. In addition, antibodies similar to human antibodies must be designed without loss of activity during the humanization process [303]. Novel humanization (e.g., Sapiens) and humanness evaluation methods (e.g., OASis) are two data-driven approaches to address these issues. Sapiens uses a masked language model (MLM) to learn the humanization method of antibodies, while OASis is used to evaluate the humanness of an antibody sequence. BioPhi successfully combined these two algorithms to capture the intrinsic features of antibody complexes and provide similar mutation selection to that used experimentally for humanized mutations. This achievement indicates that DL will be indispensable in the deimmunization studies of antibodies. Another major feature of DL in antibody research is its ability to use NLP to learn and encode the antibody space to reveal new insights into the biological function of antibodies. For example, antibody-specific bidirectional encoder representation from transformers (AntiBERTa) [304] and AbLang [305] can learn the bidirectional context of antibody sequences and, based on this understanding, can infer specific masked regions.

When conducting antibody drug research, DL can be used to connect the microscopic properties of molecules with the macroscopic results of experiments and provide additional insights into the biology associated with immunoglobulins. Therefore, DL approaches are increasingly being applied in the research and design of therapeutic antibodies to enable the efficient development of new antibodies and provide a new strategy for the future pipeline of antibody design. Overall, AI has shown promising power in drug target identification and new drug discovery. Fig. 3 depicts a generic workflow using AI for target and drug identification.

5. Application of AI to preclinical drug research

Preclinical studies focus on non-clinical pharmacology, pharmacokinetics, and toxicology studies. The physicochemical properties of a drug and its ADMET properties are essential for pharmacokinetic and toxicology studies [33], [306]. Unsuitable properties of drug candidates will lead to the failure of the expensive drug development phase [307]. The failure rate and loss of clinical studies can be decreased by early evaluation of the relevant properties of drug candidates.

5.1. Prediction of physicochemical properties

The ADMET properties of a drug candidate can be directly influenced by its physicochemical properties and will have a critical impact on the success of a drug entering the market [308], [309]. For example, the ionization constant (pKa), which is the fundamental parameter underlying properties such as octanol-water distribution coefficient (logD) and solubility, affects the aqueous solubility of a molecule, which can in turn affect the drug formulation method. Moreover, the ADMET of compounds under different pH conditions are profoundly influenced by the charge state of the compounds [310]. Although lead compounds with promising drug-like properties may not always be successfully marketed, promising properties are still an inspiration for drug design. However, physicochemical properties are not easily measured directly, and accurate prediction of the properties of small molecule drug candidates facilitates further structural optimization of small molecules until they are designed to meet the desired properties.

Some approaches for predicting the physicochemical properties of molecules focus on predicting a certain physicochemical property, such as lipophilicity [311] or aqueous solubility [312], while others predict several physicochemical properties together [99]. Although molecules can be represented in a variety of ways, predictions for a single property may use certain specific features, such as the number of hydrogen bonds [313] and the connectivity indices of various molecules [314] correlated with solubility. To date, accurate prediction of the aqueous solubility of small molecules remains a challenge [315], but DL methods have been found to be more effective than previous ML methods in this endeavor [316]. In the second challenge to predict aqueous solubility, one of the models [317] combined an NLP approach to obtain embedding vectors based on small molecule SMILES, in order to feed these vectors into the transformer model for predicting molecular aqueous solubility. Francoeur and Koes [317] found that overly complex models did not perform as well as small DL models in this task, which may be due to overfitting of the model as a result of the complex model and the smaller amount of data.
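
Before SMILES strings can be embedded and passed to a transformer, they are typically split into tokens and mapped to integer indices. The sketch below shows one simple way to do this; the regular expression, the special tokens, and the function names are assumptions for illustration and do not reproduce the tokenizer of the model in [317].

```python
import re

# Bracket atoms, two-letter halogens, two-digit ring closures, single-letter
# atoms, ring-closure digits, then bond/branch symbols. A simplified pattern.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:@~*$]")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens, failing loudly on unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(token_lists):
    """Assign an integer id to each token; ids 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map tokens to ids, sending unseen tokens to the <unk> id."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
vocab = build_vocab([tokenize_smiles("CCO")])
ids = encode(tokenize_smiles("CCN"), vocab)
```

The integer ids would then index an embedding table whose rows are the vectors fed into the transformer.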

To address the issue of simultaneously predicting several physicochemical properties of small molecules, researchers have focused on molecular feature learning and characterization; examples include molecular feature learning and representation based on a GNN architecture [98], combining traditional molecular representation approaches with features learned by message-passing neural networks (MPNNs) [99], and a form of graphical representation of molecular design based on extended-connectivity circular fingerprints (ECFPs) [318]. Shen et al. [319] proposed a new form of molecular representation that involved first calculating the distance matrices of molecular fingerprints and the molecular descriptors of eight million molecules, respectively, and then reducing the distance matrices to two dimensions via uniform manifold approximation and projection (UMAP) to form a scatter plot. Next, the dimensionality-reduced scatterplots were assigned to 2D grid maps using the Jonker-Volgenant (J-V) algorithm. Finally, the data was divided into different channels based on different molecular fingerprints or descriptors. These molecular representation forms were fed into a CNN for the prediction of molecular properties, achieving a SOTA performance on multiple datasets.

5.2. Prediction of ADMET-related properties

The failure of most clinical trials is often blamed on inadequate ADMET studies of the drug, rather than on a lack of certain efficacy. The “absorption, distribution, metabolism, excretion (ADME)” portion of ADMET often determines whether a drug molecule will reach the target protein in vivo, what protein will transport or metabolize this drug [47], [320], how long it will stay in the blood, and when it will be inactivated, while the “T” portion (i.e., toxicity assessment) is a major concern in phase I clinical trials. If the risk of clinical trial failure can be reduced via thorough preliminary ADMET studies, significant money and time costs will be avoided [321], [322]. With hundreds of compounds waiting to be evaluated for their ADMET properties in the early drug discovery phase, it would be time-consuming and expensive to validate each one through extensive animal studies. Therefore, the use of AI to rapidly and accurately predict the ADMET properties of drugs has been widely adopted [323].

QSAR and quantitative structure-property relationship (QSPR) models play pivotal roles in the ADMET prediction of small molecules. Many ML methods, in combination with QSAR or QSPR models, have performed well in ADMET prediction [324]. Most of these ML methods focus on several ADMET properties [325], such as human ether-a-go-go related gene (hERG)-mediated cardiotoxicity [326], blood-brain barrier penetration [327], permeability glycoprotein (P-gp) [328], cytochrome P450 (CYP) enzyme family [329], acute oral toxicity [330], carcinogenicity [331], mutagenicity [332], respiratory toxicity [333], or irritation/corrosion [333]. Zhu et al. [334] used a QSPR model to predict the blood-brain partition coefficient (logBB). The researchers used four ML methods—namely, SVM, multivariate linear regression, multivariate adaptive regression splines, and RF—to predict this property for 287 compounds and found that the polar surface area and octanol-water partition coefficient were strongly relevant to the blood-brain partitioning. A CYP enzymes-inhibition prediction model based on the C5.0 algorithm (a decision tree model algorithm) was constructed using several molecular fingerprints or molecular descriptors as inputs to predict five CYP enzymes related to drug oxidation or hydrolysis [335].

Most ADMET datasets are imbalanced and high-dimensional, and ensemble learning approaches have been applied to deal with both problems. The processing of imbalanced data, the combination of multiple models, and optimization steps have been integrated to form an adaptive ensemble classification framework (AECF) [336]. Yang et al. [336] used AECF to predict a variety of ADME properties using multiple ML methods; their results all had satisfactory AUROC values ranging from 0.78 to 0.91. This ensemble approach was demonstrated to be a very useful multi-classification system through validation with the DrugBank database.
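
The AUROC metric reported above is well suited to imbalanced data because it depends only on how positives are ranked relative to negatives. The following minimal sketch computes it by pairwise comparison; the labels and scores are made-up illustrative values, not data from [336].

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This O(n_pos * n_neg) form is meant for clarity,
    not for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced toy set: 2 positives, 4 negatives.
labels = [1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.4]
value = auroc(labels, scores)  # 7 of 8 positive/negative pairs ordered correctly

# A trivial classifier that always predicts the majority class reaches an
# accuracy of 4/6 here while ranking nothing.
majority_accuracy = labels.count(0) / len(labels)
```

Accuracy rewards the trivial majority-class predictor, whereas AUROC reflects only the quality of the ranking.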

DL approaches are also widely applied to the prediction of ADMET properties. For example, a classical feed-forward back-propagation neural network (BPNN) architecture and a repeated double cross-validation (rdCV) approach were combined to estimate the blood-brain barrier penetration [337]. DL allows a model to be trained using a larger and more representative dataset, ensuring that a wider variety of compounds are covered than is possible with traditional ML. Validated with external datasets, this method predicts values that are in good agreement with many experimentally derived logBB values. Another work similarly demonstrated that neural networks generally outperform traditional ML methods for ADMET properties prediction. Montanari et al. [121] predicted seven different ADMET properties corresponding to each of the following endpoints: logD, solubility, melting point, membrane affinity, and human serum albumin binding. Moreover, Wang et al. [338] developed a DL model to predict drug metabolites with an accuracy superior to the popular rule-based method systematic generation of potential metabolites (SyGMa). In a comparison of a multi-task graph convolutional model, a fully connected neural network, and an RF model, it was shown that the multi-task graph convolutional model performed the best. However, for more complex tasks, such as the prediction of Caco-2 permeation or in vitro metabolic stability, multi-task graph convolutional networks were unable to achieve good results, probably due to the simplicity of the model constructed in this study, which hindered the model from learning the deeper features. In addition, the selection of tasks for the multitasking model in this study was a trial-and-error exercise, and there were no specific experiences or rules about which tasks should be combined.

Other recent work has similarly demonstrated the potential of multitasking models for ADMET properties prediction. Various user-friendly ADMET software and web servers have been developed for predicting the ADMET properties of molecules [125], [339], [340], [341], [342]; among these, ADMETlab 2.0 [125] is widely praised. ADMETlab 2.0 is based on a multi-task graph attention (MGA) framework and can predict multiple ADMET properties of drugs (it contains a total of 88 relevant parameters with 23 ADME properties, 27 toxicity endpoints, and eight toxicophore rules). Most of the data used for training was derived from bioactivity data in the open-access database, relevant literature, and toxicity prediction software (Toxicity Estimation Software Tool (TEST)). Based on these training sets and the novel model architecture, some of the properties predicted by ADMETlab 2.0 are unique in comparison with the results of similar tools. It is a convenient tool for non-expert users while being able to provide comprehensive and accurate ADMET properties for target molecules for medicinal chemists.

6. AI-assisted clinical trial design, post-market surveillance, and prognosis prediction

A drug candidate can be sent to clinical studies only after it has undergone the process from target identification to drug design, synthesis, and optimization, and then to preclinical studies of ADMET-related properties, which initially confirm the safety and efficacy of this compound. The clinical trial phase consumes most of the time and investment during drug R&D. Although AI cannot be used to directly predict the clinical trial results of drug candidates in clinical studies, it can be used to assist in the design of clinical trials to enhance the rationality and safety and ultimately provide a more realistic response to the clinical trial results of a drug. After phase III clinical trials, drugs also require long-term regulatory work to further identify undocumented toxic effects in previous studies in order to prevent malignant events.

6.1. AI-assisted clinical trial design

The high failure rate of clinical trials makes this the most difficult step in the new drug development pipeline, with about 90% of drug candidates being eliminated in clinical trials [343], where each failed clinical trial costs approximately 0.8 billion to 1.4 billion USD. To overcome these shortcomings, a number of AI-based approaches are now available to assist in crucial steps in clinical trial design, such as helping to improve patient recruitment and enhance patient monitoring [344]. To address the issue of patient selection, AI can be used to explore the association of patient biomarkers with external indications to predict the likely treatment response of patients, which can help in screening for patients with a high likelihood of clinical success [345]. In addition, e-phenotyping can be used to reduce patient population heterogeneity [346] and to aid patient selection through prognostic or predictive enrichment [347], [348].

Patient monitoring in clinical trials is also a critical process. By incorporating wearable technology, AI can be used to help automate and personalize real-time patient monitoring, thereby reducing patient workload and improving medication adherence. Accurate medication adherence data can better reflect the results of clinical trials, and AiCure [349], a new AI platform used to measure medication adherence, has shown a 25% improvement in adherence compared with traditional therapies in a phase II trial for schizophrenia. In addition, AI has been used to optimize dosing to reduce adverse effects, improve the safety of trial protocols, and reduce patient dropout due to safety concerns [350].

6.2. AI-assisted post-market surveillance and prognosis prediction

After a drug is approved and successfully enters the market after the clinical phase, it undergoes a long-term investigation to further monitor and evaluate the drug safety. Electronic health records (EHRs) are an important data source for AI applications in post-market surveillance, and the use of structured data can simplify the process of data pre-processing. Existing methods used in EHR mining include the self-controlled case series (SCCS) model [351], cohort and case-control methods [352], and temporal pattern-discovery algorithms [353].

Convolutional SCCS (ConvSCCS) is a scalable model for predicting longitudinal features using SCCS. Morel et al. [354] used step functions and exposures to avoid the problem of classical SCCS models that require a precisely defined risk window. The results showed a significant improvement in the computational speed and accuracy of the method and enabled its application to adverse drug reactions (ADRs) detection in a cohort of diabetic patients. Aside from the application of structured data, unstructured data from biomedical and clinical corpora can be used for NLP methods for drug-drug interaction (DDI) detection and classification [355] and the prediction of ADR [356]. Systems pharmacology, which is based on systems biology, studies the effect of drugs on the system as a whole; it is a rich source of data and is a common approach for AI in ADR mining. Lorberbaum et al. [357] proposed a network-based algorithm involving the modular assembly of drug safety subnets (MADSS). They combined systems pharmacology models with pharmacovigilance statistics to validate the algorithm, and the results showed a significant improvement in the prediction of adverse effects for four drugs.

Disease prognosis is the prediction of the course and outcome of the future development of a disease. In the past, clinicians usually relied on professional experience and traditional statistical analysis for clinical prognosis prediction, making it difficult to provide accurate results. Now, through the introduction of AI technology, multi-patient and multi-factor data can be analyzed to improve the accuracy of prediction results. In cancer prognosis, patient survival and disease recurrence are usually predicted. Enshaei et al. [358] compared the prediction accuracy of an ANN with that of traditional statistical methods (e.g., LR); the results showed that the AI model has higher accuracy in predicting the prognosis of OvCa patients. Nowadays, there are many ML and DL methods for the prognosis of various cancers, such as BrCa [359], [360], [361], [362], [363], lung cancer [364], [365], gastric cancer [366], [367], [368], bladder cancer [369], [370], and prostate cancer [371], [372], illustrating the potential of AI technology in cancer prognosis.

7. Automation of drug synthesis with AI

The development of a new drug usually involves four stages: design, make, test, and analyze (DMTA). The application of AI is particularly important in the stage of drug synthesis, as it can effectively shorten the cycle of new drug R&D by speeding up the discovery of a new synthetic route for target molecules and reducing the rate of synthetic failure when the structure of the target molecule is known.

7.1. Automated exploration of reaction spaces with AI

In the 1960s, Corey and Wipke [373] proposed computer-aided synthesis planning (CASP), the earliest form of AI-assisted drug synthesis design. However, due to the lack of computing power at that time, this concept could not be further developed. With the development of ML methods in recent years, CASP has come back into the limelight. CASP mainly consists of three aspects: retrosynthetic planning, reaction condition recommendation, and forward reaction prediction [374]. Retrosynthetic planning, which involves the stepwise splitting of the target molecule into commercially available chemical materials, is an important approach in the design of drug synthesis reactions. MC tree search (MCTS) is a general search technique for sequential decision-making with large branching factors. Segler et al. [375] combined MCTS with three different neural networks trained on all published reactions to predict the best retrosynthetic routes. In comparison with conventional algorithms, the model is 30 times faster and doubles the number of molecules solved.
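
The stepwise splitting that defines retrosynthetic planning can be sketched as a recursive search. The following is a minimal illustration only, not the MCTS and neural network approach of Segler et al.: it tries every disconnection in order, and the molecule names, the `TEMPLATES` table, and the `BUYABLE` set are hypothetical placeholders rather than real chemistry.

```python
from typing import Optional

# Hypothetical one-step disconnections: product -> possible precursor sets.
TEMPLATES = {
    "target": [("intermediate_A", "reagent_X"), ("intermediate_B",)],
    "intermediate_A": [("building_block_1", "building_block_2")],
    "intermediate_B": [("exotic_precursor",)],
}
# Hypothetical set of commercially available materials.
BUYABLE = {"reagent_X", "building_block_1", "building_block_2"}


def plan(molecule: str, depth: int = 5) -> Optional[list]:
    """Return a list of (product, precursors) steps, or None if no route is found."""
    if molecule in BUYABLE:
        return []                      # nothing left to make
    if depth == 0:
        return None                    # give up on overly long routes
    for precursors in TEMPLATES.get(molecule, []):
        route = [(molecule, precursors)]
        for p in precursors:
            sub = plan(p, depth - 1)
            if sub is None:
                break                  # this disconnection is a dead end
            route.extend(sub)
        else:
            return route               # every precursor was resolved
    return None


route = plan("target")
```

In a realistic setting the number of possible disconnections per molecule is far too large to enumerate in this way, which is where a guided search such as MCTS comes in: it samples promising branches instead of visiting all of them.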

After designing the synthetic route, the rationality of each step in the synthesis process must also be considered. Researchers have also used AI to predict reaction conditions in order to reduce the time spent on screening them. Gao et al. [376] proposed a neural network model to predict appropriate reaction conditions and reaction temperature. They trained the model using ten million examples from Reaxys and tested it on one million reactions outside the training set. Their results showed that the model predicted reaction conditions matching those in the record in 69.6% of those cases. The computational framework DeepReac+ [377] also adopted an active learning strategy to explore the reaction space more efficiently in order to reduce the time for model learning and prediction.
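
A match rate of this kind is usually computed by checking, for each test reaction, whether the recorded conditions appear among the model's highest-ranked suggestions. The sketch below illustrates that calculation under stated assumptions: the cutoff `k`, the function name, and the toy condition strings are hypothetical and are not taken from the cited study.

```python
def top_k_match_rate(predictions, recorded, k=10):
    """Fraction of reactions whose recorded conditions are among the top-k suggestions.

    predictions: one ranked list of suggested conditions per reaction (best first).
    recorded:    the conditions reported for each reaction.
    """
    hits = sum(1 for ranked, truth in zip(predictions, recorded) if truth in ranked[:k])
    return hits / len(recorded)


# Toy data (hypothetical condition labels).
predictions = [
    ["Pd/C, EtOH", "Pt/C, MeOH"],
    ["NaH, THF", "K2CO3, DMF"],
    ["HCl, water", "H2SO4, water"],
]
recorded = ["Pd/C, EtOH", "K2CO3, DMF", "TFA, DCM"]

rate_top1 = top_k_match_rate(predictions, recorded, k=1)
rate_top2 = top_k_match_rate(predictions, recorded, k=2)
```

Because the rate depends strongly on `k`, a reported figure is only comparable across models when the same cutoff is used.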

Forward reaction prediction verifies the feasibility of the designed route by predicting the products. The starting material, which is predicted by retrosynthetic planning, can be replaced by many other compounds, and forward reaction prediction can be used to rank these compounds in order to select the best solution. For example, Coley et al. [378] proposed a neural network model for predicting reaction outcomes. They trained the model with 15 000 reaction examples from the United States Patent and Trademark Office (USPTO) literature and ranked all the generated candidate compounds to select the product that matched the record. The model used an edit-based representation of the candidate reactions and achieved an accuracy of 71.8%.
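
An edit-based representation describes each candidate reaction by the bonds it changes rather than by the full product structure. The sketch below shows the idea on a toy bond table and then ranks candidates by score, as described above. The atom numbering, the candidate edits, and the scores are hypothetical, and the scoring model itself is not reproduced.

```python
def apply_edits(bonds, edits):
    """Apply bond edits (atom_i, atom_j, new_order) to a bond table; order 0 removes the bond."""
    product = dict(bonds)              # leave the reactant table untouched
    for i, j, order in edits:
        key = frozenset((i, j))
        if order == 0:
            product.pop(key, None)
        else:
            product[key] = order
    return product


# Toy reactants: keys are atom pairs, values are bond orders.
reactants = {frozenset((1, 2)): 1, frozenset((3, 4)): 1}

# Hypothetical candidates: (edit list, model score). The scores are made up.
candidates = [
    ([(3, 4, 0), (1, 4, 1)], 0.18),
    ([(1, 2, 0), (2, 3, 1)], 0.71),
    ([(1, 2, 2)], 0.11),
]

# Rank candidates by score and compare the top one with the recorded product.
ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
best_product = apply_edits(reactants, ranked[0][0])
recorded_product = {frozenset((2, 3)): 1, frozenset((3, 4)): 1}
top1_correct = best_product == recorded_product
```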

In addition to designing new reaction routes based on target molecules, unknown chemical spaces can be explored by synthetic robots based on AI. Recently, a synthetic robot proposed by Granda et al. [26] not only analyzed chemical reactions faster than manual analysis but was also able to predict the reactivity of various reaction combinations on its own and explore the unknown reaction space. The robot couples its analysis of samples by nuclear magnetic resonance and infrared spectroscopy with ML for decision-making, allowing reactions to be evaluated in real time. The outcomes showed that the model can predict the reactivity of about 1000 reaction combinations with over 80% accuracy. Chemists discovered four entirely new reactions by using real-time data from this robot for prediction. Caramelli et al. [379] also proposed an inexpensive synthetic robot that, besides performing chemical reactions autonomously, can network and coordinate multiple reactions. The robot can also explore new chemical spaces to search for new reaction results and can evaluate the reproducibility of reactions. In conclusion, the invention of intelligent synthesis robots is an important step toward an automated synthesis approach with AI.

7.2. AI usage in automatic drug synthesis

AI-based automated chemical synthesis technologies are freeing researchers from a great deal of manual work by automating experimental processes. Many reactions can already be performed on automated synthetic systems, such as the synthesis of peptides [380], oligonucleotides [381], natural products [382], and various drug molecules [383], as reported earlier. To establish a common standard for automated chemical synthesis, Steiner et al. [35] proposed the Chemputer system and used it to synthesize three drug compounds (diphenhydramine hydrochloride, rufinamide, and sildenafil) in yields comparable to those from manual synthesis. The program they developed, called Chempiler, allows low-level instructions to be compiled in order to synthesize compounds on a modular robotic platform. Moreover, the synthesis process is captured to generate digital code that is shared between platforms, thereby driving the spread of automated chemical synthesis in the laboratory.

In parallel to increasing the automation of reactions, improving the reaction throughput is a goal of automated synthesis, and high-throughput experiments (HTEs) have therefore received much attention in recent years. HTEs with 24- or 96-well reactors are capable of performing dozens of reactions in a single experiment [384], [385]. Ultra-high-throughput reactions on the nanoscale can even perform thousands of reactions at a time [386], [387]. Of the limited types of reactions that high throughput can currently achieve, homogeneous reactions in low-volatility solvents, whether heated or run at room temperature, are relatively easy to achieve [388]. Moreover, among the reactions commonly used in HTE, metal-catalyzed cross-coupling reactions, in which many reaction variables are observed during development, are a hot research topic. Ahneman et al. [389] proposed an RF algorithm trained on a high-throughput dataset to predict the tolerance of palladium catalysts to isoxazole during C-N bond formation. The performance of the algorithm was shown to be significantly improved compared with conventional linear regression analysis, and the model was also useful for analyzing the inhibition mechanism of metal catalysts.

As an increasing number of algorithms related to reaction prediction are developed, scientists can identify optimal reaction conditions faster and more accurately, obtain optimal reaction routes, and further explore the reaction space. The integration of these novel and effective algorithms can facilitate the development of automated chemical synthesis platforms, freeing researchers from repetitive tasks [377].

8. Application of AI in other areas related to drug discovery

AI technology has been widely used in the whole process of drug R&D, including target identification, drug design, synthesis, and property evaluation. It has undoubtedly shortened the drug R&D cycle and saved a great deal of experimental cost compared with the traditional experimental process. Scientists are continuing to explore the application of AI technology, as they attempt to use AI in more fields to promote the development of pharmaceutical sciences.

8.1. Facilitating knowledge discovery through literature mining

Every year, numerous papers are published in the fields of medicine, pharmacy, biology, chemistry, materials, and so forth, and these papers contain a great deal of relevant expertise. Mining the literature and linking information with relevant knowledge quickly and purposefully is very important. NLP algorithms can extract the required knowledge from unstructured information in a large number of papers, patents, and published documents. Further analysis of the extracted knowledge can reveal the knowledge associations hidden in many documents and can thereby reduce the workload of researchers in analyzing documents one by one [390]. Long short-term memory (LSTM), gated recurrent units (GRUs), bidirectional encoder representations from transformers (BERT), and transformers, which are commonly used in NLP research, have made their mark in this field [391], [392].

MEDLINE is a commonly used corpus in the biomedical field and is an important part of PubMed. For decades, there has been extensive work on text mining this corpus for screening key genes, targets, and drugs and for drug side-effect discovery, drug repositioning, and other research. Researchers have focused on five main areas of text mining in biomedicine—namely, biomedical named entity recognition (NER) and normalization, biomedical text classification, relation extraction (RE), pathway extraction, and hypothesis generation [393]—which has led to many new discoveries. For example, hypothesis generation studies on biomedicine have driven research on drug repositioning [394], [395], drug development [396], [397], and pharmacovigilance [398], [399].

Hundreds of papers are published every day on COVID-19 research, and text mining can help in finding useful knowledge in the vast literature of this research boom. The COVID-19 Open Research Dataset (CORD-19, https://www.semanticscholar.org/cord19) is a corpus containing a large amount of information related to COVID-19, and most text mining models are based on this corpus for information extraction. COVID-19 text mining models apply NLP-related models to the constructed corpus to implement the following applications: a question-answering (QA) system (to answer questions asked by users, the system extracts relevant answers from the corpus), a summarization system (for long texts, the main points are automatically inferred to provide users with a quick overview), visualization (the information in the text is visualized to make it easier for users to understand), and others [400]. These applications have greatly helped researchers to cope with the challenge of information overload and to obtain valuable information in a short period of time.
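
The retrieval step of a QA system, in which the passage most relevant to a question is pulled from the corpus, can be illustrated with a simple word-overlap ranking. This is a deliberately simplified sketch using bag-of-words cosine similarity, not the DL-based models used in the cited work; the function names and the three toy passages are hypothetical.

```python
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer(question, passages):
    """Return the passage whose word counts are most similar to the question."""
    q = bag_of_words(question)
    return max(passages, key=lambda p: cosine(q, bag_of_words(p)))


passages = [
    "CORD-19 is a corpus of literature related to COVID-19.",
    "A summarization system condenses long texts into their main points.",
    "Visualization presents information from text graphically.",
]
best = answer("What does a summarization system do with long texts?", passages)
```

Word overlap of this kind misses synonyms and paraphrases, which is one reason contextual language models are preferred in practice.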

Aside from the examples given above, text mining models driven by DL will have applications in many more scenarios. As time progresses, advances in NLP technology will make it easier for models to understand human language. Models will then be able to extract knowledge from unstructured information by relying on contextual associations to identify the focus of the full text. In this way, thousands of related documents will be processed into a knowledge network to provide a rich knowledge base for drug development. For example, the web service explorer for target significance and novelty (e-TSN) [401] constructed the world’s largest relation map using drug targets and diseases extracted by means of NLP-based text mining. The service aims to visualize the target-disease KG and provide approved drugs and associated bioactivity information to assist in prioritizing candidate disease-related proteins. Furthermore, Wang et al. [402] developed a multimodal chemical information reconstruction system (CIRS) that automatically processes, extracts, and aligns heterogeneous structure information from text descriptions and structural images of chemical documents. CIRS is a powerful tool for constructing a structured molecular database based on chemical patents to enrich the near-drug space.

8.2. Advancing the development of precision medicine

Precision medicine usually involves the adoption of different treatment plans for the diseases or symptoms of different people. This approach is the opposite of simplifying (or over-simplifying) the classification method of diseases such that all individuals with certain symptoms use the same treatment plan [403]. In society today, the causes of patients’ illness are affected by more factors than before, so more accurate diagnosis and treatment plans are required for each patient. The specific concept of precision medicine has been defined as a process [404]. First, information on the patient is needed at different levels, such as the patient’s medical history, lifestyle, physical examination results, basic laboratory results, imaging, functional diagnostics, immunology, and omics. This data is then preprocessed to build a relevant model that reflects the patient’s situation. Among the data collected, omics data is recognized as the largest and most complex data [404] and has been widely used in the discovery of biomarkers, the identification of disease subgroups, and prognosis prediction [405], [406], [407], [408]. In the current era of big data, AI has rapidly advanced the development of precision medicine—especially precision medicine based on omics.

The extensive use of second-generation sequencing technologies has enabled complex diseases to be finely characterized at the molecular scale, especially in the field of tumor research. The global tumor genome sequencing program, represented by the TCGA project, has laid an essential foundation for the molecular typing and precision treatment of tumors. Through differential expression analysis of the mRNA expression data of a TCGA dataset, Zhao et al. [409] selected the top 40 differentially expressed genes from each type of tumor, merged them to form a feature subset containing 791 different genes, and established a DL model named cancer of unknown primary (CUP)-AI-Dx for predicting the tissue origin and tumor subtype of tumor samples. Yeh et al. [410] studied the transcriptome of patients with severe asthma using the profile of highly variably expressed genes in patients’ peripheral blood mononuclear cells (PBMCs); their k-means clustering analysis of 2048 genes revealed that the genetic characteristics of the transcriptome clusters in patients with asthma determine specific asthma subtypes. In comparison with transcriptomics, the in-depth study of proteomics can help uncover biomarkers and drug targets for different diseases. Rolland et al. [411] used a hierarchical clustering approach to analyze proteomic data from lymphoma patients to reveal specific N-glycoprotein biomarkers in different lymphoma subtypes, thereby providing potential therapeutic targets for precision medicine in lymphoma. Niu et al. [412] identified a combination of protein biomarkers for predicting liver fibrosis, hepatitis, and hepatic steatosis with satisfactory performance using mass spectrometry-based proteomic assays and ML models.
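
The feature-selection step described for Zhao et al. [409], taking the top genes per tumor type and merging the lists, can be sketched as a set union. Genes shared between tumor types are counted once, so the merged set can be smaller than 40 times the number of tumor types. The sketch below assumes that genes are ranked by the absolute value of a differential-expression score; the ranking criterion, the function names, and the toy scores are hypothetical.

```python
def top_genes(scores, n):
    """Return the n genes with the largest absolute differential-expression score."""
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n]


def merged_feature_set(per_type_scores, n):
    """Union of the top-n genes across tumor types; shared genes are counted once."""
    features = set()
    for scores in per_type_scores.values():
        features.update(top_genes(scores, n))
    return sorted(features)


# Toy scores for three hypothetical tumor types.
per_type_scores = {
    "type_A": {"G1": 3.2, "G2": -2.9, "G3": 0.4, "G4": 1.1},
    "type_B": {"G2": 4.0, "G5": -3.5, "G1": 0.2, "G6": 0.9},
    "type_C": {"G5": 2.2, "G7": -2.0, "G3": 0.1, "G2": 0.3},
}
features = merged_feature_set(per_type_scores, n=2)
```

Here three tumor types with two genes each yield four distinct features rather than six, because G2 and G5 each rank highly in two types.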

Of course, as mentioned in Section 3, multi-omics technologies are more promising for application than single omics. Many published works explore the molecular mechanisms of disease and the discovery of reliable biomarkers to serve in the diagnosis and treatment of diseases through multi-omics technology. The growing scale of omics data and the increasing development of AI technology will greatly advance the development of precision medicine.

8.3. Utilization of AI in drug formulation and release

With advances in new drug discovery methods, advanced drug delivery systems have expanded rapidly; they promote clinical translation and are associated with safety, efficiency, and patient compliance [413], [414]. A drug delivery system can be visualized as a “cart” (i.e., a carrier) that transports “goods” (i.e., therapeutics) to the appropriate destination. With the advancement of materials, engineering, and biology technologies, the term “carrier” has expanded to include nanocarriers, cells, eluting devices, and micro-nano robots [415], [416]. Compared with conventional drug carriers, nanocarriers can improve drug solubility and mitigate the adverse effects of conventional solubilizers. In addition to protecting the drug from deterioration, nanocarriers can endow the drug with a targeting function [417].

Nevertheless, preparing a suitable nanocarrier is extraordinarily complicated, as it depends on the drugs, excipients, and reaction conditions (including temperature, time, and stirring speed). Experiments alone cannot screen all of these parameters. In addition to determining a drug’s molecular target and biological activity [418], [419], AI can accurately predict its optimal nano-forming conditions (Fig. 4) [420], [421], [422].

Shamay et al. [422] predicted particle self-assembly via computational methods. Using quantitative structure-nanoparticle assembly prediction (QSNAP) calculations, they discovered two molecular descriptors for predicting which drugs will form nanoparticles with indocyanine. This method also revealed crucial molecular structural characteristics that permit the self-assembly and the formation of nanoparticles. With the aid of indocyanine sulfate, these drugs were assembled into nanoparticles with a loading efficiency of 90%. The researchers also evaluated the targeted delivery properties of nanoparticles in human colon and primary liver cancer models expressing caveolin-1 (CAV1). Sorafenib- and trametinib-containing nanoparticles were able to selectively target tumors without harming healthy tissue.

In addition, Traverso et al. [421] utilized MD simulations, ML, and an HTE co-aggregation platform to determine which drug-excipient combinations could self-assemble into stable solid drug nanoparticles without additional stabilization. The researchers isolated 100 self-assembled drug nanoparticles from 2.1 million pairs, each containing one of 788 drug candidates and one of 2686 approved excipients. Nanoparticles of sorafenib-glycyrrhizin and terbinafine-taurocholic acid were subjected to proof-of-concept studies in vitro and in vivo. Both validations suggest that this platform can produce nanoparticles with a high drug loading and enhanced bioavailability, representing a significant step toward personalized drug delivery.
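
The size of this search space follows directly from the counts given above, since each pair combines one drug candidate with one excipient. The short calculation below is a worked check of that arithmetic; the variable names are illustrative.

```python
n_drugs = 788
n_excipients = 2686

# Every drug can be paired with every excipient.
n_pairs = n_drugs * n_excipients          # 2116568, i.e. about 2.1 million

# Share of all pairs represented by the 100 reported nanoparticles.
hit_fraction = 100 / n_pairs              # roughly 1 pair in 21 000
```

A hit rate this low is the practical argument for computational prescreening: testing every pair experimentally would mean millions of co-aggregation experiments.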

The release pattern of a drug is also crucial for disease treatment. Developing drugs that are released in response to differences in the physiological signals of various organs, tissues, and organelles can enhance the drug’s efficacy, prevent toxicity and side effects caused by non-specific off-target action, and achieve safe and precise treatment. Multiple endogenous signals, including pH, active redox species, enzymes, glucose, various ions, adenosine triphosphate (ATP), and oxygen, have been incorporated into the design of responsive drug nanocarriers (Fig. 5) [423]. In addition to the material’s properties, the target tissue environment influences drug release. AI can facilitate the evaluation of a drug-release mode and can provide feedback for the formulation of drug carriers through ML [424], [425], [426], [427].

8.4. Promoting the economic development of the pharmaceutical market

AI has shown itself to be powerful and promising in the pharmaceutical industries, leading to a surge of interest in AI-based drug development from both the scientific and industrial communities. In the past five years, numerous AI-based pharmaceutical companies have been established and have signed collaboration agreements with many large pharmaceutical companies [428]. These shifts have driven massive financing in the drug market, injecting new dynamics into the pharmaceutical economy.

Some of these AI-based pharmaceutical companies focus on a specific stage of the drug discovery pipeline, such as target discovery and the screening of compounds. Some are involved in multiple stages of the pipeline, while others have built end-to-end platforms for new drug discovery [428].

BenevolentAI is a leading AI-based pharmaceutical company that focuses on drug target discovery. Founded in 2013, the company has seen rapid growth in recent years and has emerged as a leader in AI-based drug discovery, attracting significant investor attention. The company was listed in Amsterdam on 6 December 2021 and has a pre-investment valuation of 1.1 billion EUR and a post-investment valuation of up to 1.5 billion EUR. BenevolentAI identifies drug targets for complex diseases through its leading KG technology, which integrates large amounts of publicly available biopharmaceutical data with internal company data. For example, the KG identified baricitinib as a possible treatment for COVID-19 [429]. Through this technology, BenevolentAI has entered into a long-term collaboration with AstraZeneca for target identification in chronic kidney disease, idiopathic pulmonary fibrosis, heart failure, and systemic lupus erythematosus. On 17 May 2022, AstraZeneca made a milestone payment to BenevolentAI for a new target discovery in idiopathic pulmonary fibrosis, which is the third new target identified through BenevolentAI’s R&D platform. In addition, BenevolentAI has entered into a new drug discovery collaboration with Johnson & Johnson. The company’s judgment-augmented cognition system (JACS) is a core technology that can process large amounts of unstructured data in a short period of time through its NLP capabilities. The current market opportunity around AI-led drug discovery capabilities is over 30 billion USD [430].

In 2019, Insilico Medicine completed a challenge to design new small molecule inhibitors of DDR1 in 21 days using the GENTRL AI system [28]. This challenge caused a great sensation at the time, because it was unimaginable for so many new inhibitors to be discovered in such a short period of time using AI methods. The total time taken was reduced by 1-2 years compared with the traditional process. Insilico Medicine’s outstanding performance has made it a hit with investors. In June 2021, Insilico Medicine raised 225 million USD in a Series C round of funding and, in February 2022, it announced the launch of a phase I clinical trial of a small molecule inhibitor for the treatment of idiopathic pulmonary fibrosis [430].

The company Exscientia stands out in the area of getting small molecules that have been discovered using AI into clinical trials. At a time when AI-based pharmaceutical companies are competing with each other, Exscientia has become the first company to send an AI-discovered drug candidate, DSP-1181, to the clinical stage. This process took less than 12 months, compared with a historical average of about 4.5 years for this step. In 2021, Exscientia raised a total of approximately 800 million USD through Series C and Series D funding, and an initial public offering (IPO). The company has also raised significant funding through deal partnerships, signing deals with Bristol Myers Squibb and Sanofi for potential transaction amounts of 1.2 billion and 5.2 billion USD, respectively. Both deals are focused on drug discovery in the areas of oncology and immunology. Over the decade of Exscientia’s development, a complete end-to-end AI drug development pipeline has been progressively established, from target selection to molecular screening and generation. It is this complete pipeline that continues to drive Exscientia’s growth. To date, Exscientia has three drugs in the clinical stage, and its market value is highly anticipated upon launch [430].

Thus far, the development of AI-driven drugs is at a historical inflection point, and the average funding for pharmaceutical companies with AI as a core technology has been on the rise. Table 6 provides some information on the core technologies of AI-based pharmaceutical companies. Investors now recognize that drug R&D based on AI technology is becoming a powerful tool to accelerate biopharmaceutical innovation. This technology can provide new insights to accelerate drug discovery by analyzing the biopharmaceutical data that is accumulated and generated on a daily basis. As a result, this field has become a strategic area of focus for pharmaceutical companies and continues to attract capital market attention.

9. Challenges

This review has elaborated on most of the applications of AI in the whole process of drug R&D. However, at the present stage, AI has not really broken down the traditional pharmaceutical system, and many research processes are still waiting for “optimization” by AI. The use of AI for more in-depth research in the field of pharmaceutical preparations is still being gradually explored. For example, some scholars have used AI technology to assist in studying the interaction of drug excipients with biomolecules [431]. In addition to the application areas of AI in the drug development stage that still require expansion, there are limitations in the application of AI to drug discovery.

9.1. Data limitations

The development of AI algorithms cannot be separated from the drive of data. High-quality and accurate data can sometimes enable simple models to outperform complex models. There are many excellent publicly accessible databases for data research, including TTD, ChEMBL, DrugBank, CMAP, and PRIDE, but the amount of data is insufficient to support more complex research. The construction of AI algorithms relies heavily on high-quality and sufficient data. The acquisition of high-quality data is a very important issue for sophisticated and complex biological systems, due to the limitations of current technology, and it is costly to process this data into standard data with high confidence. The method, time, and place of operation of each batch of data acquisition are different, making it more difficult to process the acquired data into uniform and valid data [432]. For example, the results obtained by current single-cell RNA-seq vary with their sequencing platforms and often tend to form doublets. Some data is obtained by in vitro assays; however, due to the lack of a thorough understanding of the response in the organism, the in vitro data often differs significantly from the actual in vivo data. Therefore, the prediction results of models trained with the data obtained from in vitro experiments are often unconvincing.

These limitations reflect the uneven quality of the data that is currently used. Data imbalance is also a major difficulty in model training. As previously mentioned, positive datasets are readily available in the pharmaceutical field, but negative datasets are often not accurately identified because failed data is often not publicly available. In addition to the problem of data quality and balance, some types of data are generally unavailable to researchers. The key core data for new drug R&D usually originates from drug companies; this part of the data is usually not open source, as drugs are commodities. Similarly, clinical data involves patient privacy and is usually not open source. The problem of data quality and balance requires advances in experimental techniques to obtain more accurate biomedical data than is currently available, in order to break the data bottleneck. The development of algorithms such as distributed training can be expected to address the data privacy problem to a certain extent. We also appeal to major institutions and companies to disclose as much high-quality data as possible without compromising their own interests.
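
One common form of privacy-preserving distributed training is federated averaging, in which each institution trains on its own data and shares only model parameters. Whether this is the specific approach the paragraph above has in mind is an assumption. The sketch below shows only the aggregation step, with hypothetical site sizes and parameter values; local training and secure communication are omitted.

```python
def federated_average(site_updates):
    """Sample-weighted mean of parameter vectors.

    site_updates: list of (n_samples, params) tuples, one per site; the raw
    records never leave the site, only these two items are shared.
    """
    total = sum(n for n, _ in site_updates)
    dim = len(site_updates[0][1])
    return [sum(n * params[d] for n, params in site_updates) / total for d in range(dim)]


# Two hypothetical sites with different amounts of local data.
site_updates = [
    (100, [0.20, -1.00]),
    (300, [0.40, -0.60]),
]
global_params = federated_average(site_updates)
```

Weighting by sample count means that the larger site contributes three quarters of the averaged parameters in this toy case.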

9.2. Limitations in interpretability

In addition to the limitations of data, DL methods lack interpretability. Compared with traditional ML methods, which often pass through a rigorous mathematical reasoning validation analysis, DL methods are considered to be a black box. Although DL performs better than traditional ML on most tasks, it is often impossible for researchers to understand why the results of DL are so good. When a DL model yields a new result that contradicts previous research, the lack of interpretability makes the result unacceptable. In particular, compared with other fields, the field of drug discovery has a complete set of knowledge logic, such as the mechanisms of action of molecules, the metabolic mechanisms of molecules, and the regulatory mechanisms of biological pathways. In order to ensure the safety and efficacy of drugs, relevant biological processes must be thoroughly studied, ranging from the physicochemical properties of a drug to what proteins it binds to in the body, how it binds, what biological reactions it triggers, and how it is metabolized. DL can only accept input and give predicted output; it cannot provide sufficient explanations for how this output is derived. For example, for protein function annotation, although DL methods can predict the GOA of a specific protein [70], the computational process is not known and most of the predictions are not accepted when the accuracy is not reliable. Even in terms of data representation methods, no uniform standards have been developed regarding which representation method is more suitable for which study and which representation methods lead to a loss of information.

In the future, the development of DL in the pharmaceutical sciences and industry should focus on improving interpretability as much as possible without compromising accuracy, and should involve the establishment of a set of well-established research methods that combine white-box models with black-box models.

10. Conclusions

In conclusion, AI is advantageous in all aspects of new drug R&D. It can be used in the discovery of drug targets, the design and development of new drugs, preclinical research, clinical trial design, and post-market surveillance to assist in the design of safe and effective drugs, while greatly reducing the cycle time and cost of drug R&D. Some limitations still remain in the AI-based drug R&D process. However, we believe that the emergence of AI is gradually assisting us in unraveling the mystery of large and complex biological systems, and that AI has become an indispensable technology in the drug R&D process. Furthermore, AI technologies will change the R&D paradigm of pharmaceutical sciences in the future, helping us to better overcome complex diseases while providing personalized medicine to patients. In this process, further research is needed to inject new energy into this field.

The authors would like to dedicate this article to Prof. Hualiang Jiang, a member of the Chinese Academy of Sciences (CAS) and professor at Shanghai Institute of Materia Medica and Lingang Laboratory. Prof. Jiang devoted great effort to cutting-edge research on CADD and artificial intelligence for drug discovery, and made significant contributions to the development of pharmaceutical sciences. All authors would like to take this opportunity to thank him for his kind and persistent support of their research.

Acknowledgments

This work was funded by the Natural Science Foundation of Zhejiang Province (LR21H300001), National Key R&D Program of China (2022YFC3400501), National Natural Science Foundation of China (22220102001, U1909208, 81872798, and 81825020), Leading Talent of the “Ten Thousand Plan”—National High-Level Talents Special Support Plan of China, Fundamental Research Fund of Central University (2018QNA7023), Key R&D Program of Zhejiang Province (2020C03010), “Double Top-Class” University (181201*194232101), Westlake Laboratory (Westlake Laboratory of Life Sciences and Biomedicine), Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, and Alibaba Cloud, Information Technology Center of Zhejiang University.

Compliance with ethics guidelines

Mingkun Lu, Jiayi Yin, Qi Zhu, Gaole Lin, Minjie Mou, Fuyao Liu, Ziqi Pan, Nanxin You, Xichen Lian, Fengcheng Li, Hongning Zhang, Lingyan Zheng, Wei Zhang, Hanyu Zhang, Zihao Shen, Zhen Gu, Honglin Li, and Feng Zhu declare that they have no conflict of interest or financial conflicts to disclose.

References

[1]

L. Martin, M. Hutchens, C. Hawkins, A. Radnov. How much do clinical trials cost? Nat Rev Drug Discov, 16 (6) (2017), pp. 381-382. DOI: 10.1038/nrd.2017.70

[2]

T.J. Moore, H. Zhang, G. Anderson, G.C. Alexander. Estimated costs of pivotal trials for novel therapeutic agents approved by the US Food and Drug Administration, 2015-2016. JAMA Intern Med, 178 (11) (2018), pp. 1451-1457. DOI: 10.1001/jamainternmed.2018.3931

[3]

S.M. Paul, D.S. Mytelka, C.T. Dunwiddie, C.C. Persinger, B.H. Munos, S.R. Lindborg, et al. How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat Rev Drug Discov, 9 (3) (2010), pp. 203-214. DOI: 10.1038/nrd3078

[4]

A.L. Hopkins. Network pharmacology: the next paradigm in drug discovery. Nat Chem Biol, 4 (11) (2008), pp. 682-690. DOI: 10.1038/nchembio.118

[5]

Z. Wang, M. Gerstein, M. Snyder. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 10 (1) (2009), pp. 57-63

[6]

J. Giacomotto, L. Ségalat. High-throughput screening and small animal models, where are we? Br J Pharmacol, 160 (2) (2010), pp. 204-216. DOI: 10.1111/j.1476-5381.2010.00725.x

[7]

L.M. Mayr, D. Bojanic. Novel trends in high-throughput screening. Curr Opin Pharmacol, 9 (5) (2009), pp. 580-588.

[8]

B.K. Shoichet. Virtual screening of chemical libraries. Nature, 432 (7019) (2004), pp. 862-865. DOI: 10.1038/nature03197

[9]

D.B. Kitchen, H. Decornez, J.R. Furr, J. Bajorath. Docking and scoring in virtual screening for drug discovery: methods and applications. Nat Rev Drug Discov, 3 (11) (2004), pp. 935-949. DOI: 10.1038/nrd1549

[10]

Y. LeCun, Y. Bengio, G. Hinton. Deep learning. Nature, 521 (7553) (2015), pp. 436-444. DOI: 10.1038/nature14539

[11]

C. Farabet, C. Couprie, L. Najman, Y. LeCun. Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell, 35 (8) (2013), pp. 1915-1929.

[12]

G.E. Dahl, D. Yu, L. Deng, A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Trans Audio Speech Lang Process, 20 (1) (2012), pp. 30-42.

[13]

J. Ding, N. Sharon, Z. Bar-Joseph. Temporal modelling using single-cell transcriptomics. Nat Rev Genet, 23 (6) (2022), pp. 355-368. DOI: 10.1038/s41576-021-00444-7

[14]

S. Liu, C. Trapnell. Single-cell transcriptome sequencing: recent advances and remaining challenges. F1000 Res, 5 (1) (2016), Article 182. DOI: 10.12688/f1000research.7223.1

[15]

M.D. Luecken, F.J. Theis. Current best practices in single-cell RNA-seq analysis: a tutorial. Mol Syst Biol, 15 (6) (2019), Article e8746.

[16]

R. Aebersold, M. Mann. Mass spectrometry-based proteomics. Nature, 422 (6928) (2003), pp. 198-207.

[17]

S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Res, 49 (D1) (2021), pp. D1388-D1395. DOI: 10.1093/nar/gkaa971

[18]

A. Bateman, M.J. Martin, S. Orchard, M. Magrane, R. Agivetova, S. Ahmad, et al. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res, 49 (D1) (2021), pp. D480-D489.

[19]

C. Manzoni, D.A. Kia, J. Vandrovcova, J. Hardy, N.W. Wood, P.A. Lewis, et al. Genome, transcriptome and proteome: the rise of omics data and their integration in biomedical sciences. Brief Bioinform, 19 (2) (2018), pp. 286-302. DOI: 10.1093/bib/bbw114

[20]

Y. Shi, P.L. Prieto, T. Zepel, S. Grunert, J.E. Hein. Automated experimentation powers data science in chemistry. Acc Chem Res, 54 (3) (2021), pp. 546-555. DOI: 10.1021/acs.accounts.0c00736

[21]

A.S. Nam, R. Chaligne, D.A. Landau. Integrating genetic and non-genetic determinants of cancer evolution by single-cell multi-omics. Nat Rev Genet, 22 (1) (2021), pp. 3-18. DOI: 10.1038/s41576-020-0265-5

[22]

M.J. Waring, J. Arrowsmith, A.R. Leach, P.D. Leeson, S. Mandrell, R.M. Owen, et al. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat Rev Drug Discov, 14 (7) (2015), pp. 475-486. DOI: 10.1038/nrd4609

[23]

K. Tunyasuvunakool, J. Adler, Z. Wu, T. Green, M. Zielinski, A. Žídek, et al. Highly accurate protein structure prediction for the human proteome. Nature, 596 (7873) (2021), pp. 590-596. DOI: 10.1038/s41586-021-03828-1

[24]

C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, et al. Do transformers really perform badly for graph representation? Adv Neural Inf Process Syst, 34 (1) (2021), pp. 28877-28888.

[25]

N.S. Seyed Tabib, M. Madgwick, P. Sudhakar, B. Verstockt, T. Korcsmaros, S. Vermeire. Big data in IBD: big progress for clinical practice. Gut, 69 (8) (2020), pp. 1520-1532. DOI: 10.1136/gutjnl-2019-320065

[26]

J.M. Granda, L. Donina, V. Dragone, D.L. Long, L. Cronin. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature, 559 (7714) (2018), pp. 377-381. Corrected in: Nature 2019;570:E67-9. DOI: 10.1038/s41586-018-0307-8

[27]

F. Zhong, J. Xing, X. Li, X. Liu, Z. Fu, Z. Xiong, et al. Artificial intelligence in drug design. Sci China Life Sci, 61 (10) (2018), pp. 1191-1204. DOI: 10.1007/s11427-018-9342-2

[28]

A. Zhavoronkov, Y.A. Ivanenkov, A. Aliper, M.S. Veselov, V.A. Aladinskiy, A.V. Aladinskaya, et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat Biotechnol, 37 (9) (2019), pp. 1038-1040. DOI: 10.1038/s41587-019-0224-x

[29]

R. Winter, F. Montanari, F. Noé, D.A. Clevert. Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations. Chem Sci, 10 (6) (2019), pp. 1692-1701. DOI: 10.1039/c8sc04175j

[30]

S. Zheng, J. Rao, Y. Song, J. Zhang, X. Xiao, E.F. Fang, et al. PharmKG: a dedicated knowledge graph benchmark for biomedical data mining. Brief Bioinform, 22 (4) (2021), Article bbaa344.

[31]

J. Jeon, S. Nim, J. Teyra, A. Datti, J.L. Wrana, S.S. Sidhu, et al. A systematic approach to identify novel cancer drug targets using machine learning, inhibitor design and high-throughput screening. Genome Med, 6 (7) (2014), Article 57.

[32]

S. Riniker, Y. Wang, J.L. Jenkins, G.A. Landrum. Using information from historical high-throughput screens to predict active compounds. J Chem Inf Model, 54 (7) (2014), pp. 1880-1891. DOI: 10.1021/ci500190p

[33]

A.O. Basile, A. Yahi, N.P. Tatonetti. Artificial intelligence for drug toxicity and safety. Trends Pharmacol Sci, 40 (9) (2019), pp. 624-635.

[34]

S. Cruz Rivera, X. Liu, A.W. Chan, A.K. Denniston, M.J. Calvert. The SPIRIT-AI and CONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Lancet Digit Health, 2 (10) (2020), pp. e549-e560.

[35]

S. Steiner, J. Wolf, S. Glatzel, A. Andreou, J.M. Granda, G. Keenan, et al. Organic synthesis in a modular robotic system driven by a chemical programming language. Science, 363 (6423) (2019), Article eaav2211.

[36]

P. Hamet, J. Tremblay. Artificial intelligence in medicine. Metabolism, 69 (Suppl) (2017), pp. S36-S40.

[37]

Y. Zhou, Y. Zhang, X. Lian, F. Li, C. Wang, F. Zhu, et al. Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res, 50 (D1) (2022), pp. D1398-D1407. DOI: 10.1093/nar/gkab953

[38]

K. Amahong, W. Zhang, Y. Zhou, S. Zhang, J. Yin, F. Li, et al. CovInter: interaction data between coronavirus RNAs and host proteins. Nucleic Acids Res, 51 (D1) (2022), pp. D546-D556

[39]

S. Liu, L. Chen, Y. Zhang, Y. Zhou, Y. He, Z. Chen, et al. M6AREG: m6A-centered regulation of disease development and drug response. Nucleic Acids Res, 51 (D1) (2022), pp. D1333-D1344

[40]

X. Sun, Y. Zhang, H. Li, Y. Zhou, S. Shi, Z. Chen, et al. DRESIS: the first comprehensive landscape of drug resistance information. Nucleic Acids Res, 51 (D1) (2022), pp. D1263-D1275

[41]

X. Wang, F. Li, W. Qiu, B. Xu, Y. Li, X. Lian, et al. SYNBIP: synthetic binding proteins for research, diagnosis and therapy. Nucleic Acids Res, 50 (D1) (2022), pp. D560-D570. DOI: 10.1093/nar/gkab926

[42]

S. Zhang, X. Sun, M. Mou, K. Amahong, H. Sun, W. Zhang, et al. REGLIV: molecular regulation data of diverse living systems facilitating current multiomics research. Comput Biol Med, 148 (1) (2022), Article 105825.

[43]

S.K. Burley, C. Bhikadiya, C. Bi, S. Bittrich, L. Chen, G.V. Crichlow, et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res, 49 (D1) (2021), pp. D437-D451. DOI: 10.1093/nar/gkaa1038

[44]

Y. Perez-Riverol, J. Bai, C. Bandla, D. García-Seisdedos, S. Hewapathirana, S. Kamatchinathan, et al. The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences. Nucleic Acids Res, 50 (D1) (2022), pp. D543-D552. DOI: 10.1093/nar/gkab1038

[45]

M. Blum, H.Y. Chang, S. Chuguransky, T. Grego, S. Kandasaamy, A. Mitchell, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res, 49 (D1) (2021), pp. D344-D354. DOI: 10.1093/nar/gkaa977

[46]

J. Yin, W. Sun, F. Li, J. Hong, X. Li, Y. Zhou, et al. VARIDT 1.0: variability of drug transporter database. Nucleic Acids Res, 48 (D1) (2020), pp. D1042-D1050. DOI: 10.1093/nar/gkz779

[47]

T. Fu, F. Li, Y. Zhang, J. Yin, W. Qiu, X. Li, et al. VARIDT 2.0: structural variability of drug transporter. Nucleic Acids Res, 50 (D1) (2022), pp. D1417-D1431. DOI: 10.1093/nar/gkab1013

[48]

F. Cunningham, J.E. Allen, J. Allen, J. Alvarez-Jarreta, M.R. Amode, I.M. Armean, et al. Ensembl 2022. Nucleic Acids Res, 50 (D1) (2022), pp. D988-D995. DOI: 10.1093/nar/gkab1049

[49]

W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, et al. The human genome browser at UCSC. Genome Res, 12 (6) (2002), pp. 996-1006

[50]

T. Barrett, S.E. Wilhite, P. Ledoux, C. Evangelista, I.F. Kim, M. Tomashevsky, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res, 41 (D1) (2013), pp. D991-D995.

[51]

E.W. Sayers, M. Cavanaugh, K. Clark, K.D. Pruitt, C.L. Schoch, S.T. Sherry, et al. GenBank. Nucleic Acids Res, 50 (D1) (2022), pp. D161-D164. DOI: 10.1093/nar/gkab1135

[52]

W. Li, K.R. O’Neill, D.H. Haft, M. DiCuccio, V. Chetvernin, A. Badretdin, et al. RefSeq: expanding the prokaryotic genome annotation pipeline reach with protein family model curation. Nucleic Acids Res, 49 (D1) (2021), pp. D1020-D1028. DOI: 10.1093/nar/gkaa1105

[53]

I. Papatheodorou, P. Moreno, J. Manning, A.M. Fuentes, N. George, S. Fexova, et al. Expression Atlas update: from tissues to single cells. Nucleic Acids Res, 48 (D1) (2020), pp. D77-D83.

[54]

D. Mendez, A. Gaulton, A.P. Bento, J. Chambers, M. De Veij, E. Félix, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res, 47 (D1) (2019), pp. D930-D940. DOI: 10.1093/nar/gky1075

[55]

D.S. Wishart, Y.D. Feunang, A.C. Guo, E.J. Lo, A. Marcu, J.R. Grant, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res, 46 (D1) (2018), pp. D1074-D1082. DOI: 10.1093/nar/gkx1037

[56]

F. Li, J. Yin, M. Lu, M. Mou, Z. Li, Z. Zeng, et al. DrugMAP: molecular atlas and pharma-information of all drugs. Nucleic Acids Res, 51 (D1) (2022), pp. D1288-D1299

[57]

J. Tang, Z.U. Tanoli, B. Ravikumar, Z. Alam, A. Rebane, M. Vähä-Koskela, et al. Drug target commons: a community effort to build a consensus knowledge base for drug-target interactions. Cell Chem Biol, 25 (2) (2018), pp. 224-229.

[58]

T.K. Sheils, S.L. Mathias, K.J. Kelleher, V.B. Siramshetty, D.T. Nguyen, C.G. Bologa, et al. TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res, 49 (D1) (2021), pp. D1334-D1346. DOI: 10.1093/nar/gkaa993

[59]

C. Hutter, J.C. Zenklusen. The Cancer Genome Atlas: creating lasting value beyond its data. Cell, 173 (2) (2018), pp. 283-285.

[60]

J. Piñero, J. Saüch, F. Sanz, L.I. Furlong. The DisGeNET Cytoscape App: exploring and visualizing disease genomics data. Comput Struct Biotechnol J, 19 (1) (2021), pp. 2960-2967.

[61]

M.J. Landrum, S. Chitipiralla, G.R. Brown, C. Chen, B. Gu, J. Hart, et al. ClinVar: improvements to accessing data. Nucleic Acids Res, 48 (D1) (2020), pp. D835-D844. DOI: 10.1093/nar/gkz972

[62]

J.S. Amberger, C.A. Bocchini, A.F. Scott, A. Hamosh. OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res, 47 (D1) (2019), pp. D1038-D1043. DOI: 10.1093/nar/gky1151

[63]

D.A. Hashimoto, E. Witkowski, L. Gao, O. Meireles, G. Rosman. Artificial intelligence in anesthesiology: current techniques, clinical applications, and limitations. Anesthesiology, 132 (2) (2020), pp. 379-394. DOI: 10.1097/aln.0000000000002960

[64]

W. Cheng, C.A. Ng. Using machine learning to classify bioactivity for 3486 per- and polyfluoroalkyl substances (PFASs) from the OECD list. Environ Sci Technol, 53 (23) (2019), pp. 13970-13980. DOI: 10.1021/acs.est.9b04833

[65]

J. Hong, Y. Luo, M. Mou, J. Fu, Y. Zhang, W. Xue, et al. Convolutional neural network-based annotation of bacterial type IV secretion system effectors with enhanced accuracy and reduced false discovery. Brief Bioinform, 21 (5) (2020), pp. 1825-1836. DOI: 10.1093/bib/bbz120

[66]

A.S. Rifaioglu, H. Atas, M.J. Martin, R. Cetin-Atalay, V. Atalay, T. Doğan. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform, 20 (5) (2019), pp. 1878-1912. DOI: 10.1093/bib/bby061

[67]

M. Kulmanov, R. Hoehndorf. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics, 36 (2) (2020), pp. 422-429. DOI: 10.1093/bioinformatics/btz595

[68]

M. Kulmanov, M.A. Khan, R. Hoehndorf. DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier. Bioinformatics, 34 (4) (2018), pp. 660-668. DOI: 10.1093/bioinformatics/btx624

[69]

V. Gligorijević, P.D. Renfrew, T. Kosciolek, J.K. Leman, D. Berenberg, T. Vatanen, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun, 12 (1) (2021), Article 3168.

[70]

W. Xia, L. Zheng, J. Fang, F. Li, Y. Zhou, Z. Zeng, et al. PFmulDL: a novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods. Comput Biol Med, 145 (1) (2022), Article 105465.

[71]

P. Carracedo-Reboredo, J. Liñares-Blanco, N. Rodríguez-Fernández, F. Cedrón, F.J. Novoa, A. Carballal, et al. A review on machine learning approaches and trends in drug discovery. Comput Struct Biotechnol J, 19 (1) (2021), pp. 4538-4558.

[72]

J. Ubels, T. Schaefers, C. Punt, H.J. Guchelaar, J. de Ridder. RAINFOREST: a random forest approach to predict treatment benefit in data from (failed) clinical drug trials. Bioinformatics, 36 (Suppl 2) (2020), pp. i601-i609. DOI: 10.1093/bioinformatics/btaa799

[73]

C. Yang, Y. Zhang. Delta machine learning to improve scoring-ranking-screening performances of protein-ligand scoring functions. J Chem Inf Model, 62 (11) (2022), pp. 2696-2712. DOI: 10.1021/acs.jcim.2c00485

[74]

K. Heikamp, J. Bajorath. Support vector machines for drug discovery. Expert Opin Drug Discov, 9 (1) (2014), pp. 93-104. DOI: 10.1517/17460441.2014.866943

[75]

S. Zhang, X. Li, M. Zong, X. Zhu, R. Wang. Efficient kNN classification with different numbers of nearest neighbors. IEEE Trans Neural Netw Learn Syst, 29 (5) (2018), pp. 1774-1785. DOI: 10.1109/tnnls.2017.2673241

[76]

B. Liu, H. He, H. Luo, T. Zhang, J. Jiang. Artificial intelligence and big data facilitated targeted drug discovery. Stroke Vasc Neurol, 4 (4) (2019), pp. 206-213

[77]

D. Cirillo, A. Valencia. Big data analytics for personalized medicine. Curr Opin Biotechnol, 58 (1) (2019), pp. 161-167.

[78]

J. Ma, R.P. Sheridan, A. Liaw, G.E. Dahl, V. Svetnik. Deep neural nets as a method for quantitative structure-activity relationships. J Chem Inf Model, 55 (2) (2015), pp. 263-274. DOI: 10.1021/ci500747n

[79]

H.C. Shin, H.R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, et al. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans Med Imaging, 35 (5) (2016), pp. 1285-1298.

[80]

B.J. Hou, Z.H. Zhou. Learning with interpretable structure from gated RNN. IEEE Trans Neural Netw Learn Syst, 31 (7) (2020), pp. 2267-2279.

[81]

Z. Zhang, L. Chen, F. Zhong, D. Wang, J. Jiang, S. Zhang, et al. Graph neural network approaches for drug-target interactions. Curr Opin Struct Biol, 73 (1) (2022), Article 102327

[82]

H. Zhang, Y. Wang, Z. Pan, X. Sun, M. Mou, B. Zhang, et al. ncRNAInter: a novel strategy based on graph neural network to discover interactions between lncRNA and miRNA. Brief Bioinform, 23 (6) (2022), Article bbac411.

[83]

C. Sun, Y. Cao, J.M. Wei, J. Liu. Autoencoder-based drug-target interaction prediction by preserving the consistency of chemical properties and functions of drugs. Bioinformatics, 37 (20) (2021), pp. 3618-3625. DOI: 10.1093/bioinformatics/btab384

[84]

X. Yi, E. Walia, P. Babyn. Generative adversarial network in medical imaging: a review. Med Image Anal, 58 (1) (2019), Article 101552.

[85]

X. Zhou, F. Shen, L. Liu, W. Liu, L. Nie, Y. Yang, et al. Graph convolutional network hashing. IEEE Trans Cybern, 50 (4) (2020), pp. 1460-1472. DOI: 10.1109/tcyb.2018.2883970

[86]

G.S. Handelman, H.K. Kok, R.V. Chandra, A.H. Razavi, S. Huang, M. Brooks, et al. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. AJR Am J Roentgenol, 212 (1) (2019), pp. 38-43. DOI: 10.2214/ajr.18.20224

[87]

B.A. Richards, P.W. Frankland. The persistence and transience of memory. Neuron, 94 (6) (2017), pp. 1071-1084.

[88]

A. Krizhevsky, I. Sutskever, G.E. Hinton. ImageNet classification with deep convolutional neural networks. Commun ACM, 60 (6) (2017), pp. 84-90. DOI: 10.1145/3065386

[89]

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res, 15 (1) (2014), pp. 1929-1958.

[90]

J. Sun, X. Chen, Z. Zhang, S. Lai, B. Zhao, H. Liu, et al. Forecasting the long-term trend of COVID-19 epidemic using a dynamic model. Sci Rep, 10 (1) (2020), Article 21122.

[91]

Y. Jiao, P. Du. Performance measures in evaluating machine learning based bioinformatics predictors for classifications. Quant Biol, 4 (4) (2016), pp. 320-330. DOI: 10.1007/s40484-016-0081-2

[92]

L. Xue, J. Bajorath. Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening. Comb Chem High Throughput Screen, 3 (5) (2000), pp. 363-372. DOI: 10.2174/1386207003331454

[93]

J. Wenzel, H. Matter, F. Schmidt. Predictive multitask deep neural network models for ADME-Tox properties: learning from large data sets. J Chem Inf Model, 59 (3) (2019), pp. 1253-1268. DOI: 10.1021/acs.jcim.8b00785

[94]

G.B. Goh, C. Siegel, A. Vishnu, N. Hodas. Using rule-based labels for weak supervised learning: a ChemNet for transferable chemical property prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018 Aug 19-23; London, UK. New York City: Association for Computing Machinery; 2018. pp. 302-310.

[95]

M. Popova, O. Isayev, A. Tropsha. Deep reinforcement learning for de novo drug design. Sci Adv, 4 (7) (2018), Article eaap7885.

[96]

P. Karpov, G. Godin, I.V. Tetko. Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform, 12 (1) (2020), Article 17.

[97]

G.B. Goh, N.O. Hodas, C. Siegel, A. Vishnu. SMILES2Vec: an interpretable general-purpose deep neural network for predicting chemical properties. 2017. arXiv:1712.02034.

[98]

Z. Xiong, D. Wang, X. Liu, F. Zhong, X. Wan, X. Li, et al. Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem, 63 (16) (2020), pp. 8749-8760. DOI: 10.1021/acs.jmedchem.9b00959

[99]

K. Yang, K. Swanson, W. Jin, C. Coley, P. Eiden, H. Gao, et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model, 59 (8) (2019), pp. 3370-3388. DOI: 10.1021/acs.jcim.9b00237

[100]

K. Li, C. Xu, J. Huang, W. Liu, L. Zhang, W. Wan, et al. Prediction and identification of the effectors of heterotrimeric G proteins in rice (Oryza sativa L.). Brief Bioinform, 18 (2) (2017), pp. 270-278.

[101]

M. Wu, Y. Yang, H. Wang, Y. Xu. A deep learning method to more accurately recall known lysine acetylation sites. BMC Bioinf, 20 (1) (2019), Article 49.

[102]

Y.H. Li, J.Y. Xu, L. Tao, X.F. Li, S. Li, X. Zeng, et al. SVM-Prot 2016: a web-server for machine learning prediction of protein functional families from sequence irrespective of similarity. PLoS One, 11 (8) (2016), Article e0155290. DOI: 10.1371/journal.pone.0155290

[103]

L. Zou, C. Nan, F. Hu. Accurate prediction of bacterial type IV secreted effectors using amino acid composition and PSSM profiles. Bioinformatics, 29 (24) (2013), pp. 3135-3142. DOI: 10.1093/bioinformatics/btt554

[104]

P. Petrilli. Classification of protein sequences by their dipeptide composition. Comput Appl Biosci, 9 (2) (1993), pp. 205-209. DOI: 10.1093/bioinformatics/9.2.205

[105]

S. Seo, M. Oh, Y. Park, S. Kim. DeepFam: deep learning based alignment-free method for protein family modeling and prediction. Bioinformatics, 34 (13) (2018), pp. i254-i262. DOI: 10.1093/bioinformatics/bty275

[106]

J. Wang, B. Yang, J. Revote, A. Leier, T.T. Marquez-Lago, G. Webb, et al. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics, 33 (17) (2017), pp. 2756-2758. DOI: 10.1093/bioinformatics/btx302

[107]

J. Hong, Y. Luo, Y. Zhang, J. Ying, W. Xue, T. Xie, et al. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Brief Bioinform, 21 (4) (2020), pp. 1437-1447. DOI: 10.1093/bib/bbz081

[108]

C.Y. Yu, X.X. Li, H. Yang, Y.H. Li, W.W. Xue, Y.Z. Chen, et al. Assessing the performances of protein function prediction algorithms from the perspectives of identification accuracy and false discovery rate. Int J Mol Sci, 19 (1) (2018), Article 183

[109]

K.C. Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21 (1) (2005), pp. 10-19.

[110]

K.C. Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins, 43 (3) (2001), pp. 246-255.

[111]

P.D. Mosier, A.E. Counterman, P.C. Jurs, D.E. Clemmer. Prediction of peptide ion collision cross sections from topological molecular structure and amino acid parameters. Anal Chem, 74 (6) (2002), pp. 1360-1370.

[112]

B. Ren. Atomic-level-based AI topological descriptors for structure-property correlations. J Chem Inf Comput Sci, 43 (1) (2003), pp. 161-169.

[113]

C.N. Magnan, P. Baldi. SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity. Bioinformatics, 30 (18) (2014), pp. 2592-2597. DOI: 10.1093/bioinformatics/btu352

[114]

A. Strokach, D. Becerra, C. Corbi-Verge, A. Perez-Riba, P.M. Kim. Fast and flexible protein design using deep graph neural networks. Cell Syst, 11 (4) (2020), pp. 402-411.

[115]

J. Ingraham, V. Garg, R. Barzilay, T. Jaakkola. Generative models for graph-based protein design. Adv Neural Inf Process Syst, 32 (1) (2019), pp. 15820-15831

[116]

J.G. Greener, L. Moffat, D.T. Jones. Design of metalloproteins and novel protein folds using variational autoencoders. Sci Rep, 8 (1) (2018), Article 16189.

[117]

M. Karimi, S. Zhu, Y. Cao, Y. Shen. De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J Chem Inf Model, 60 (12) (2020), pp. 5667-5681. DOI: 10.1021/acs.jcim.0c00593

[118]

Q. Ye, C.Y. Hsieh, Z. Yang, Y. Kang, J. Chen, D. Cao, et al. A unified drug-target interaction prediction framework based on knowledge graph and recommendation system. Nat Commun, 12 (1) (2021), Article 6775.

[119]

A. Rives, J. Meier, T. Sercu, S. Goyal, Z. Lin, J. Liu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA, 118 (15) (2021), Article e2016239118.

[120]

G.B. Li, L.L. Yang, W.J. Wang, L.L. Li, S.Y. Yang. ID-Score: a new empirical scoring function based on a comprehensive set of descriptors related to protein-ligand interactions. J Chem Inf Model, 53 (3) (2013), pp. 592-600. DOI: 10.1021/ci300493w

[121]

F. Montanari, L. Kuhnke, A. Ter Laak, D.A. Clevert. Modeling physico-chemical ADMET endpoints with multitask graph convolutional networks. Molecules, 25 (1) (2019), Article 44. DOI: 10.3390/molecules25010044

[122]

S. Dara, S. Dhamercherla, S.S. Jadav, C.M. Babu, M.J. Ahsan. Machine learning in drug discovery: a review. Artif Intell Rev, 55 (3) (2022), pp. 1947-1999. DOI: 10.1007/s10462-021-10058-4

[123]

M. Olivecrona, T. Blaschke, O. Engkvist, H. Chen. Molecular de-novo design through deep reinforcement learning. J Cheminform, 9 (1) (2017), Article 48.

[124]

S.N. Dean, J.A.E. Alvarez, D. Zabetakis, S.A. Walper, A.P. Malanoski. PepVAE: variational autoencoder framework for antimicrobial peptide generation and activity prediction. Front Microbiol, 12 (1) (2021), Article 725727.

[125]

G. Xiong, Z. Wu, J. Yi, L. Fu, Z. Yang, C. Hsieh, et al. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res, 49 (W1) (2021), pp. W5-W14. DOI: 10.1093/nar/gkab255

[126]

T. Gaudelet, B. Day, A.R. Jamasb, J. Soman, C. Regep, G. Liu, et al. Utilizing graph machine learning within drug discovery and development. Brief Bioinform, 22 (6) (2021), Article bbab159.

[127]

D.C. Swinney, J. Anthony. How were new medicines discovered? Nat Rev Drug Discov, 10 (7) (2011), pp. 507-519. DOI: 10.1038/nrd3480

[128]

F. Vincent, A. Nueda, J. Lee, M. Schenone, M. Prunotto, M. Mercola. Publisher correction: phenotypic drug discovery: recent successes, lessons learned and new directions. Nat Rev Drug Discov, 21 (7) (2022), p. 541. DOI: 10.1038/s41573-022-00503-6

[129]

Y.H. Li, X.X. Li, J.J. Hong, Y.X. Wang, J.B. Fu, H. Yang, et al. Clinical trials, progression-speed differentiating features and swiftness rule of the innovative targets of first-in-class drugs. Brief Bioinform, 21 (2) (2020), pp. 649-662. DOI: 10.1093/bib/bby130

[130]

B.B. Misra, C.D. Langefeld, M. Olivier, L.A. Cox. Integrated omics: tools, advances, and future approaches. J Mol Endocrinol, 62 (1) (2019), pp. 21-45.

[131]

J. Fu, Y. Zhang, Y. Wang, H. Zhang, J. Liu, J. Tang, et al. Optimization of metabolomic data processing using NOREVA. Nat Protoc, 17 (1) (2022), pp. 129-151. DOI: 10.1038/s41596-021-00636-9

[132]

J. Tang, J. Fu, Y. Wang, B. Li, Y. Li, Q. Yang, et al. ANPELA: analysis and performance assessment of the label-free quantification workflow for metaproteomic studies. Brief Bioinform, 21 (2) (2020), pp. 621-636. DOI: 10.1093/bib/bby127

[133]

F. Li, Y. Zhou, Y. Zhang, J. Yin, Y. Qiu, J. Gao, et al. POSREG: proteomic signature discovered by simultaneously optimizing its reproducibility and generalizability. Brief Bioinform, 23 (2) (2022), Article bbac040.

[134]

F. Li, J. Yin, M. Lu, Q. Yang, Z. Zeng, B. Zhang, et al. ConSIG: consistent discovery of molecular signature from OMIC data. Brief Bioinform, 23 (4) (2022), Article bbac253.

[135]

Q. Yang, B. Li, P. Wang, J. Xie, Y. Feng, Z. Liu, et al. LargeMetabo: an out-of-the-box tool for processing and analyzing large-scale metabolomic data. Brief Bioinform, 23 (6) (2022), Article bbac455.

[136]

M. Mou, Z. Pan, M. Lu, H. Sun, Y. Wang, Y. Luo, et al. Application of machine learning in spatial proteomics. J Chem Inf Model, 62 (23) (2022), pp. 5875-5895. DOI: 10.1021/acs.jcim.2c01161

[137]

J. Fu, Q. Yang, Y. Luo, S. Zhang, J. Tang, Y. Zhang, et al. Label-free proteome quantification and evaluation. Brief Bioinform, 24 (1) (2022), Article bbac477.

[138]

Q. Yang, Y. Wang, Y. Zhang, F. Li, W. Xia, Y. Zhou, et al. NOREVA: enhanced normalization and evaluation of time-course and multi-class metabolomic data. Nucleic Acids Res, 48 (W1) (2020), pp. W436-W448. DOI: 10.1093/nar/gkaa258

[139]

S. Zhang, K. Amahong, C. Zhang, F. Li, J. Gao, Y. Qiu, et al. RNA-RNA interactions between SARS-CoV-2 and host benefit viral development and evolution during COVID-19 infection. Brief Bioinform, 23 (1) (2022), Article bbab397.

[140]

Q. Yang, B. Li, J. Tang, X. Cui, Y. Wang, X. Li, et al. Consistent gene signature of schizophrenia identified by a novel feature selection strategy from comprehensive sets of transcriptomic data. Brief Bioinform, 21 (3) (2020), pp. 1058-1068. DOI: 10.1093/bib/bbz049

[141]

S. Zhang, K. Amahong, X. Sun, X. Lian, J. Liu, H. Sun, et al. The miRNA: a small but powerful RNA for COVID-19. Brief Bioinform, 22 (2) (2021), pp. 1137-1149. DOI: 10.1093/bib/bbab062

[142]

P.S. Reel, S. Reel, E. Pearson, E. Trucco, E. Jefferson. Using machine learning approaches for multi-omics data analysis: a review. Biotechnol Adv, 49 (1) (2021), Article 107739.

[143]

The Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature, 455 (7216) (2008), pp. 1061-1068. Corrected in: Nature 2013;494(7438):506.

[144]

A. Subramanian, R. Narayan, S.M. Corsello, D.D. Peck, T.E. Natoli, X. Lu, et al. A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell, 171 (6) (2017), pp. 1437-1452.

[145]

M. Uhlén, L. Fagerberg, B.M. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, et al. Tissue-based map of the human proteome. Science, 347 (6220) (2015), Article 1260419.

[146]

M.S. Kim, S.M. Pinto, D. Getnet, R.S. Nirujogi, S.S. Manda, R. Chaerkady, et al. A draft map of the human proteome. Nature, 509 (7502) (2014), pp. 575-581. DOI: 10.1038/nature13302

[147]

D.S. Wishart, A. Guo, E. Oler, F. Wang, A. Anjum, H. Peters, et al. HMDB 5.0: the human metabolome database for 2022. Nucleic Acids Res, 50 (D1) (2022), pp. D622-D631. DOI: 10.1093/nar/gkab1062

[148]

C.A. Smith, G. O’Maille, E.J. Want, C. Qin, S.A. Trauger, T.R. Brandon, et al. METLIN: a metabolite mass spectral database. Ther Drug Monit, 27 (6) (2005), pp. 747-751. DOI: 10.1097/01.ftd.0000179845.53213.39

[149]

M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res, 45 (D1) (2017), pp. D353-D361. DOI: 10.1093/nar/gkw1092

[150]

R. Caspi, R. Billington, I.M. Keseler, A. Kothari, M. Krummenacker, P.E. Midford, et al. The MetaCyc database of metabolic pathways and enzymes—a 2019 update. Nucleic Acids Res, 48 (D1) (2020), pp. D445-D453. DOI: 10.1093/nar/gkz862

[151]

M. Gillespie, B. Jassal, R. Stephan, M. Milacic, K. Rothfels, A. Senff-Ribeiro, et al. The Reactome pathway knowledgebase 2022. Nucleic Acids Res, 50 (D1) (2022), pp. D687-D692. DOI: 10.1093/nar/gkab1028

[152]

Y. Zhang, J.T. Tseng, I.C. Lien, F. Li, W. Wu, H. Li. mRNAsi index: machine learning in mining lung adenocarcinoma stem cell biomarkers. Genes, 11 (3) (2020), Article 257.

[153]

M. Duda, H. Zhang, H.D. Li, D.P. Wall, M. Burmeister, Y. Guan. Brain-specific functional relationship networks inform autism spectrum disorder gene prediction. Transl Psychiatry, 8 (1) (2018), Article 56.

[154]

T.P. Liu, Y.Y. Hsieh, C.J. Chou, P.M. Yang. Systematic polypharmacology and drug repurposing via an integrated L1000-based Connectivity Map database mining. R Soc Open Sci, 5 (11) (2018), Article 181321. DOI: 10.1098/rsos.181321

[155]

Y. Gao, S. Kim, Y.I. Lee, J. Lee. Cellular stress-modulating drugs can potentially be identified by in silico screening with Connectivity Map (CMap). Int J Mol Sci, 20 (22) (2019), Article 5601. DOI: 10.3390/ijms20225601

[156]

X. Liu, S. Ouyang, B. Yu, Y. Liu, K. Huang, J. Gong, et al. PharmMapper server: a web server for potential drug target identification using pharmacophore mapping approach. Nucleic Acids Res, 38 (Suppl 2) (2010), pp. W609-W614. DOI: 10.1093/nar/gkq300

[157]

X. Wang, Y. Shen, S. Wang, S. Li, W. Zhang, X. Liu, et al. PharmMapper 2017 update: a web server for potential drug target identification with a comprehensive target pharmacophore database. Nucleic Acids Res, 45 (W1) (2017), pp. W356-W360. DOI: 10.1093/nar/gkx374

[158]

X. Wang, C. Pan, J. Gong, X. Liu, H. Li. Enhancing the enrichment of pharmacophore-based target prediction for the polypharmacological profiles of drugs. J Chem Inf Model, 56 (6) (2016), pp. 1175-1183. DOI: 10.1021/acs.jcim.5b00690

[159]

J. Gong, C. Cai, X. Liu, X. Ku, H. Jiang, D. Gao, et al. ChemMapper: a versatile web server for exploring pharmacology and chemical structure association based on molecular 3D similarity method. Bioinformatics, 29 (14) (2013), pp. 1827-1829. DOI: 10.1093/bioinformatics/btt270

[160]

X. Wang, H. Chen, F. Yang, J. Gong, S. Li, J. Pei, et al. iDrug: a web-accessible and interactive drug discovery and design platform. J Cheminform, 6 (1) (2014), p. 28.

[161]

H. Noh, R. Gunawan. Inferring gene targets of drugs and chemical compounds from gene expression profiles. Bioinformatics, 32 (14) (2016), pp. 2120-2127. DOI: 10.1093/bioinformatics/btw148

[162]

J. Zhu, J. Wang, X. Wang, M. Gao, B. Guo, M. Gao, et al. Prediction of drug efficacy from transcriptional profiles with deep learning. Nat Biotechnol, 39 (11) (2021), pp. 1444-1452. DOI: 10.1038/s41587-021-00946-z

[163]

J.H. Woo, Y. Shimoni, W.S. Yang, P. Subramaniam, A. Iyer, P. Nicoletti, et al. Elucidating compound mechanism of action by network perturbation analysis. Cell, 162 (2) (2015), pp. 441-451.

[164]

B. Li, J. Tang, Q. Yang, S. Li, X. Cui, Y. Li, et al. NOREVA: normalization and evaluation of MS-based metabolomics data. Nucleic Acids Res, 45 (W1) (2017), pp. W162-W170. DOI: 10.1093/nar/gkx449

[165]

The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474 (7353) (2011), pp. 609-615. Erratum in: Nature 2012;490(7419):292.

[166]

D.L. Masica, R. Karchin. Correlation of somatic mutation and expression identifies genes important in human glioblastoma progression and survival. Cancer Res, 71 (13) (2011), pp. 4550-4561.

[167]

J. Fang, P. Zhang, Q. Wang, C.W. Chiang, Y. Zhou, Y. Hou, et al. Artificial intelligence framework identifies candidate targets for drug repurposing in Alzheimer’s disease. Alzheimers Res Ther, 14 (1) (2022), p. 7.

[168]

N.A. Pabon, Y. Xia, S.K. Estabrooks, Z. Ye, A.K. Herbrand, E. Süß, et al. Predicting protein targets for drug-like compounds using transcriptomics. PLoS Comput Biol, 14 (12) (2018), p. e1006651. DOI: 10.1371/journal.pcbi.1006651

[169]

F. Zhong, X. Wu, R. Yang, X. Li, D. Wang, Z. Fu, et al. Drug target inference by mining transcriptional data using a novel graph convolutional network framework. Protein Cell, 13 (4) (2022), pp. 281-301. DOI: 10.1007/s13238-021-00885-0

[170]

K. Jaganathan, S. Kyriazopoulou Panagiotopoulou, J.F. McRae, S.F. Darbandi, D. Knowles, Y.I. Li, et al. Predicting splicing from primary sequence with deep learning. Cell, 176 (3) (2019), pp. 535-548.

[171]

R. Lopez, J. Regier, M.B. Cole, M.I. Jordan, N. Yosef. Deep generative modeling for single-cell transcriptomics. Nat Methods, 15 (12) (2018), pp. 1053-1058. DOI: 10.1038/s41592-018-0229-2

[172]

F. Liu, H. Li, C. Ren, X. Bo, W. Shu. PEDLA: predicting enhancers with a deep learning-based algorithmic framework. Sci Rep, 6 (1) (2016), p. 28517.

[173]

D.J. Downes, A.R. Cross, P. Hua, N. Roberts, R. Schwessinger, A.J. Cutler, et al. COMBAT Consortium. Identification of LZTFL1 as a candidate effector gene at a COVID-19 risk locus. Nat Genet, 53 (11) (2021), pp. 1606-1615. DOI: 10.1038/s41588-021-00955-3

[174]

J. Barretina, G. Caponigro, N. Stransky, K. Venkatesan, A.A. Margolin, S. Kim, et al. The cancer cell line encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483 (7391) (2012), pp. 603-607. DOI: 10.1038/nature11003

[175]

J.R. Dry, S. Pavey, C.A. Pratilas, C. Harbron, S. Runswick, D. Hodgson, et al. Transcriptional pathway signatures predict MEK addiction and response to selumetinib (AZD6244). Cancer Res, 70 (6) (2010), pp. 2264-2273.

[176]

H. Sharifi-Noghabi, O. Zolotareva, C.C. Collins, M. Ester. MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics, 35 (14) (2019), pp. i501-i509. DOI: 10.1093/bioinformatics/btz318

[177]

F. Iorio, T.A. Knijnenburg, D.J. Vis, G.R. Bignell, M.P. Menden, M. Schubert, et al. A landscape of pharmacogenomic interactions in cancer. Cell, 166 (3) (2016), pp. 740-754.

[178]

W. Peng, T. Chen, W. Dai. Predicting drug response based on multi-omics fusion and graph convolution. IEEE J Biomed Health Inform, 26 (3) (2022), pp. 1384-1393. DOI: 10.1109/jbhi.2021.3102186

[179]

Y. Wang, Y. Yang, S. Chen, J. Wang. DeepDRK: a deep learning framework for drug repurposing through kernel-based multi-omics integration. Brief Bioinform, 22 (5) (2021), p. bbab048.

[180]

N. Novac. Challenges and opportunities of drug repositioning. Trends Pharmacol Sci, 34 (5) (2013), pp. 267-272.

[181]

M.J. Keiser, V. Setola, J.J. Irwin, C. Laggner, A.I. Abbas, S.J. Hufeisen, et al. Predicting new molecular targets for known drugs. Nature, 462 (7270) (2009), pp. 175-181. DOI: 10.1038/nature08506

[182]

A.P. Davis, C.J. Grondin, R.J. Johnson, D. Sciaky, J. Wiegers, T.C. Wiegers, et al. Comparative Toxicogenomics Database (CTD): update 2021. Nucleic Acids Res, 49 (D1) (2021), pp. D1138-D1143. DOI: 10.1093/nar/gkaa891

[183]

S.D. Harding, J.F. Armstrong, E. Faccenda, C. Southan, S.P.H. Alexander, A.P. Davenport, et al. The IUPHAR/BPS guide to PHARMACOLOGY in 2022: curating pharmacology for COVID-19, malaria and antibacterials. Nucleic Acids Res, 50 (D1) (2022), pp. D1282-D1294. DOI: 10.1093/nar/gkab1010

[184]

S. Avram, C.G. Bologa, J. Holmes, G. Bocci, T.B. Wilson, D.T. Nguyen, et al. DrugCentral 2021 supports drug discovery and repositioning. Nucleic Acids Res, 49 (D1) (2021), pp. D1160-D1169. DOI: 10.1093/nar/gkaa997

[185]

L. Urán Landaburu, A.J. Berenstein, S. Videla, P. Maru, D. Shanmugam, A. Chernomoretz, et al. TDR Targets 6: driving drug discovery for human pathogens through intensive chemogenomic data integration. Nucleic Acids Res, 48 (D1) (2020), pp. D992-D1005.

[186]

T.F. Chen, Y.C. Chang, Y. Hsiao, K.H. Lee, Y.C. Hsiao, Y.H. Lin, et al. DockCoV2: a drug database against SARS-CoV-2. Nucleic Acids Res, 49 (D1) (2021), pp. D1152-D1159. DOI: 10.1093/nar/gkaa861

[187]

M. Kanehisa, M. Furumichi, Y. Sato, M. Ishiguro-Watanabe, M. Tanabe. KEGG: integrating viruses and cellular organisms. Nucleic Acids Res, 49 (D1) (2021), pp. D545-D551. DOI: 10.1093/nar/gkaa970

[188]

C. Wang, G. Hu, K. Wang, M. Brylinski, L. Xie, L. Kurgan. PDID: database of molecular-level putative protein-drug interactions in the structural human proteome. Bioinformatics, 32 (4) (2016), pp. 579-586. DOI: 10.1093/bioinformatics/btv597

[189]

M. Kuhn, I. Letunic, L.J. Jensen, P. Bork. The SIDER database of drugs and side effects. Nucleic Acids Res, 44 (D1) (2016), pp. D1075-D1079. DOI: 10.1093/nar/gkv1075

[190]

D. Ochoa, A. Hercules, M. Carmona, D. Suveges, A. Gonzalez-Uriarte, C. Malangone, et al. Open Targets Platform: supporting systematic drug-target identification and prioritisation. Nucleic Acids Res, 49 (D1) (2021), pp. D1302-D1310. DOI: 10.1093/nar/gkaa1027

[191]

Z. Gao, H. Li, H. Zhang, X. Liu, L. Kang, X. Luo, et al. PDTD: a web-accessible protein database for drug target identification. BMC Bioinf, 9 (1) (2008), p. 104.

[192]

RDKit: open-source cheminformatics software [Internet]. Basel: T5 Informatics GmbH; [cited 2023 Feb 9]. Available from: https://www.rdkit.org/.

[193]

N.M. O’Boyle, M. Banck, C.A. James, C. Morley, T. Vandermeersch, G.R. Hutchison. Open Babel: an open chemical toolbox. J Cheminform, 3 (1) (2011), p. 33.

[194]

Daylight Toolkit: C-language interface for SMILES™, SMARTS® and SMIRKS® [Internet]. Laguna Niguel: Daylight Chemical Information Systems, Inc.; [cited 2023 Feb 9]. Available from: https://www.daylight.com/products/toolkit.html.

[195]

C. Steinbeck, Y. Han, S. Kuhn, O. Horlacher, E. Luttmann, E. Willighagen. The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J Chem Inf Comput Sci, 43 (2) (2003), pp. 493-500.

[196]

OpenEye Toolkits 2022.2.2 [Internet]. Santa Fe: OpenEye Scientific Software, Inc.; [cited 2023 Feb 9]. Available from: https://docs.eyesopen.com/toolkits/python/index.html.

[197]

Y. Cao, A. Charisi, L.C. Cheng, T. Jiang, T. Girke. ChemmineR: a compound mining framework for R. Bioinformatics, 24 (15) (2008), pp. 1733-1734. DOI: 10.1093/bioinformatics/btn307

[198]

Indigo Toolkit [Internet]. Newtown: EPAM Systems, Inc.; [cited 2023 Feb 9]. Available from: https://lifescience.opensource.epam.com/indigo/.

[199]

X. Liu, H. Jiang, H. Li. SHAFTS: a hybrid approach for 3D molecular similarity calculation. 1. Method and assessment of virtual screening. J Chem Inf Model, 51 (9) (2011), pp. 2372-2385. DOI: 10.1021/ci200060s

[200]

W. Lu, X. Liu, X. Cao, M. Xue, K. Liu, Z. Zhao, et al. SHAFTS: a hybrid approach for 3D molecular similarity calculation. 2. Prospective case study in the discovery of diverse p90 ribosomal S6 protein kinase 2 inhibitors to suppress cell migration. J Med Chem, 54 (10) (2011), pp. 3564-3574. DOI: 10.1021/jm200139j

[201]

G. He, Y. Song, W. Wei, X. Wang, X. Lu, H. Li. eSHAFTS: integrated and graphical drug design software based on 3D molecular similarity. J Comput Chem, 40 (6) (2019), pp. 826-838. DOI: 10.1002/jcc.25769

[202]

P. Zhang, L. Tao, X. Zeng, C. Qin, S. Chen, F. Zhu, et al. A protein network descriptor server and its use in studying protein, disease, metabolic and drug targeted networks. Brief Bioinform, 18 (6) (2017), pp. 1057-1070.

[203]

G.M. Boratyn, C. Camacho, P.S. Cooper, G. Coulouris, A. Fong, N. Ma, et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res, 41 (W1) (2013), pp. W29-W33. DOI: 10.1093/nar/gkt282

[204]

J.D. Thompson, D.G. Higgins, T.J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22 (22) (1994), pp. 4673-4680. DOI: 10.1093/nar/22.22.4673

[205]

L. Holm, L.M. Laakso. Dali server update. Nucleic Acids Res, 44 (W1) (2016), pp. W351-W355. DOI: 10.1093/nar/gkw357

[206]

M. Shatsky, R. Nussinov, H.J. Wolfson. A method for simultaneous alignment of multiple protein structures. Proteins, 56 (1) (2004), pp. 143-156.

[207]

Y. Zhang, J. Skolnick. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res, 33 (7) (2005), pp. 2302-2309. DOI: 10.1093/nar/gki524

[208]

S. Li, C. Cai, J. Gong, X. Liu, H. Li. A fast protein binding site comparison algorithm for proteome-wide protein function prediction and drug repurposing. Proteins, 89 (11) (2021), pp. 1541-1556. DOI: 10.1002/prot.26176

[209]

A. Prlić, S. Bliven, P.W. Rose, W.F. Bluhm, C. Bizon, A. Godzik, et al. Pre-calculated protein structure alignments at the RCSB PDB website. Bioinformatics, 26 (23) (2010), pp. 2983-2985. DOI: 10.1093/bioinformatics/btq572

[210]

A. Shulman-Peleg, R. Nussinov, H.J. Wolfson. Recognition of functional sites in protein structures. J Mol Biol, 339 (3) (2004), pp. 607-633.

[211]

M. Gao, J. Skolnick. APoc: large-scale identification of similar protein pockets. Bioinformatics, 29 (5) (2013), pp. 597-604. DOI: 10.1093/bioinformatics/btt024

[212]

M. Brylinski. eMatchSite: sequence order-independent structure alignments of ligand binding pockets in protein models. PLoS Comput Biol, 10 (9) (2014), p. e1003829. DOI: 10.1371/journal.pcbi.1003829

[213]

P. Björkholm, P. Daniluk, A. Kryshtafovych, K. Fidelis, R. Andersson, T.R. Hvidsten. Using multi-data hidden Markov models trained on local neighborhoods of protein structure to predict residue-residue contacts. Bioinformatics, 25 (10) (2009), pp. 1264-1270. DOI: 10.1093/bioinformatics/btp149

[214]

L.J. McGuffin, K. Bryson, D.T. Jones. The PSIPRED protein structure prediction server. Bioinformatics, 16 (4) (2000), pp. 404-405.

[215]

M. Nayal, B. Honig. On the nature of cavities on protein surfaces: application to the identification of drug-binding sites. Proteins, 63 (4) (2006), pp. 892-906. DOI: 10.1002/prot.20897

[216]

D.S. Cao, S. Liu, Q.S. Xu, H.M. Lu, J.H. Huang, Q.N. Hu, et al. Large-scale prediction of drug-target interactions using protein sequences and drug topological structures. Anal Chim Acta, 752 (1) (2012), pp. 1-10.

[217]

H. Öztürk, A. Özgür, E. Ozkirimli. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics, 34 (17) (2018), pp. i821-i829. DOI: 10.1093/bioinformatics/bty593

[218]

F. Rayhan, S. Ahmed, S. Shatabda, D.M. Farid, Z. Mousavian, A. Dehzangi, et al. iDTI-ESBoost: identification of drug target interaction using evolutionary and structural features with boosting. Sci Rep, 7 (1) (2017), p. 17731.

[219]

K. Huang, T. Fu, L.M. Glass, M. Zitnik, C. Xiao, J. Sun. DeepPurpose: a deep learning library for drug-target interaction prediction. Bioinformatics, 36 (22-23) (2021), pp. 5545-5547. DOI: 10.1093/bioinformatics/btaa1005

[220]

K. Bleakley, Y. Yamanishi. Supervised prediction of drug-target interactions using bipartite local models. Bioinformatics, 25 (18) (2009), pp. 2397-2403. DOI: 10.1093/bioinformatics/btp433

[221]

Y. Yamanishi, M. Araki, A. Gutteridge, W. Honda, M. Kanehisa. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics, 24 (13) (2008), pp. i232-i240. DOI: 10.1093/bioinformatics/btn162

[222]

M.A. Yıldırım, K.I. Goh, M.E. Cusick, A.L. Barabási, M. Vidal. Drug-target network. Nat Biotechnol, 25 (10) (2007), pp. 1119-1126. DOI: 10.1038/nbt1338

[223]

Y. Luo, X. Zhao, J. Zhou, J. Yang, Y. Zhang, W. Kuang, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun, 8 (1) (2017), p. 573.

[224]

X. Zeng, S. Zhu, W. Lu, Z. Liu, J. Huang, Y. Zhou, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci, 11 (7) (2020), pp. 1775-1797. DOI: 10.1039/c9sc04336e

[225]

S.K. Mohamed, A. Nounu, V. Nováček. Biological applications of knowledge graph embedding models. Brief Bioinform, 22 (2) (2021), pp. 1679-1693. DOI: 10.1093/bib/bbaa012

[226]

L. Perlman, A. Gottlieb, N. Atias, E. Ruppin, R. Sharan. Combining drug and gene similarity measures for drug-target elucidation. J Comput Biol, 18 (2) (2011), pp. 133-145. DOI: 10.1089/cmb.2010.0213

[227]

M.C. Cobanoglu, C. Liu, F. Hu, Z.N. Oltvai, I. Bahar. Predicting drug-target interactions using probabilistic matrix factorization. J Chem Inf Model, 53 (12) (2013), pp. 3399-3409. DOI: 10.1021/ci400219z

[228]

D. Sydow, L. Burggraaff, A. Szengel, H.W.T. van Vlijmen, A.P. IJzerman, et al. Advances and challenges in computational target prediction. J Chem Inf Model, 59 (5) (2019), pp. 1728-1742. DOI: 10.1021/acs.jcim.8b00832

[229]

M. Bagherian, E. Sabeti, K. Wang, M.A. Sartor, Z. Nikolovska-Coleska, K. Najarian. Machine learning approaches and databases for prediction of drug-target interaction: a survey paper. Brief Bioinform, 22 (1) (2021), pp. 247-269. DOI: 10.1093/bib/bbz157

[230]

X. Zhang, L. Li, M.K. Ng, S. Zhang. Drug-target interaction prediction by integrating multiview network data. Comput Biol Chem, 69 (1) (2017), pp. 185-193.

[231]

W. Zhang, Y. Chen, D. Li. Drug-target interaction prediction through label propagation with linear neighborhood information. Molecules, 22 (12) (2017), p. 2056. DOI: 10.3390/molecules22122056

[232]

T. van Laarhoven, E. Marchiori. Predicting drug-target interactions for new drug compounds using a weighted nearest neighbor profile. PLoS One, 8 (6) (2013), p. e66952. DOI: 10.1371/journal.pone.0066952

[233]

T. He, M. Heidemeyer, F. Ban, A. Cherkasov, M. Ester. SimBoost: a read-across approach for predicting drug-target binding affinities using gradient boosting machines. J Cheminform, 9 (1) (2017), p. 24.

[234]

A. Sharma, R. Rani. BE-DTI’: ensemble framework for drug target interaction prediction using dimensionality reduction and active learning. Comput Methods Programs Biomed, 165 (1) (2018), pp. 151-162.

[235]

Y. Liu, M. Wu, C. Miao, P. Zhao, X.L. Li. Neighborhood regularized logistic matrix factorization for drug-target interaction prediction. PLoS Comput Biol, 12 (2) (2016), p. e1004760. DOI: 10.1371/journal.pcbi.1004760

[236]

B. Bolgár, P. Antal. VB-MK-LMF: fusion of drugs, targets and interactions using variational Bayesian multiple kernel logistic matrix factorization. BMC Bioinf, 18 (1) (2017), p. 440.

[237]

L. Li, M. Cai. Drug target prediction by multi-view low rank embedding. IEEE/ACM Trans Comput Biol Bioinform, 16 (5) (2019), pp. 1712-1721. DOI: 10.1109/tcbb.2017.2706267

[238]

F. Cheng, C. Liu, J. Jiang, W. Lu, W. Li, G. Liu, et al. Prediction of drug-target interactions and drug repositioning via network-based inference. PLoS Comput Biol, 8 (5) (2012), p. e1002503. DOI: 10.1371/journal.pcbi.1002503

[239]

X. Chen, M.X. Liu, G.Y. Yan. Drug-target interaction prediction by random walk on the heterogeneous network. Mol Biosyst, 8 (7) (2012), pp. 1970-1978. DOI: 10.1039/c2mb00002d

[240]

H. Chen, Z. Zhang. A semi-supervised method for drug-target interaction prediction with consistency in networks. PLoS One, 8 (5) (2013), p. e62975. DOI: 10.1371/journal.pone.0062975

[241]

S. Alaimo, A. Pulvirenti, R. Giugno, A. Ferro. Drug-target interaction prediction through domain-tuned network-based inference. Bioinformatics, 29 (16) (2013), pp. 2004-2008. DOI: 10.1093/bioinformatics/btt307

[242]

A. Mongia, A. Majumdar. Drug-target interaction prediction using multi graph regularized nuclear norm minimization. PLoS One, 15 (1) (2020), p. e0226484. DOI: 10.1371/journal.pone.0226484

[243]

Y. Wang, J. Zeng. Predicting drug-target interactions using restricted Boltzmann machines. Bioinformatics, 29 (13) (2013), pp. i126-i134. DOI: 10.1093/bioinformatics/btt234

[244]

H. Shi, S. Liu, J. Chen, X. Li, Q. Ma, B. Yu. Predicting drug-target interactions using Lasso with random forest based on evolutionary information and chemical structure. Genomics, 111 (6) (2019), pp. 1839-1852.

[245]

M. Wen, Z. Zhang, S. Niu, H. Sha, R. Yang, Y. Yun, et al. Deep-learning-based drug-target interaction prediction. J Proteome Res, 16 (4) (2017), pp. 1401-1409. DOI: 10.1021/acs.jproteome.6b00618

[246]

I. Lee, J. Keum, H. Nam. DeepConv-DTI: prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol, 15 (6) (2019), p. e1007129. DOI: 10.1371/journal.pcbi.1007129

[247]

L. Xie, S. He, X. Song, X. Bo, Z. Zhang. Deep learning-based transcriptome data classification for drug-target interaction prediction. BMC Genomics, 19 (Suppl 7) (2018), p. 667.

[248]

J. Verma, V.M. Khedkar, E.C. Coutinho. 3D-QSAR in drug design—a review. Curr Top Med Chem, 10 (1) (2010), pp. 95-115. DOI: 10.2174/156802610790232260

[249]

Y. Jing, Y. Bian, Z. Hu, L. Wang, X.Q. Xie. Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. AAPS J, 20 (3) (2018), p. 58. Corrected in: AAPS J 2018;20(4):79.

[250]

G. Hessler, K.H. Baringhaus. Artificial intelligence in drug design. Molecules, 23 (10) (2018), p. 2520. DOI: 10.3390/molecules23102520

[251]

E. Burello, A.P. Worth. QSAR modeling of nanomaterials. Wiley Interdiscip Rev Nanomed Nanobiotechnol, 3 (3) (2011), pp. 298-306. DOI: 10.1002/wnan.137

[252]

W. Xue, T. Fu, S. Deng, F. Yang, J. Yang, F. Zhu. Molecular mechanism for the allosteric inhibition of the human serotonin transporter by antidepressant escitalopram. ACS Chem Neurosci, 13 (3) (2022), pp. 340-351. DOI: 10.1021/acschemneuro.1c00694

[253]

F. Ballante, A.J. Kooistra, S. Kampen, C. de Graaf, J. Carlsson. Structure-based virtual screening for ligands of G protein-coupled receptors: what can molecular docking do for you? Pharmacol Rev, 73 (4) (2021), pp. 1698-1736. DOI: 10.1124/pharmrev.120.000246

[254]

W.H. Shin, X. Zhu, M.G. Bures, D. Kihara. Three-dimensional compound comparison methods and their application in drug discovery. Molecules, 20 (7) (2015), pp. 12841-12862. DOI: 10.3390/molecules200712841

[255]

G. Ghislat, T. Rahman, P.J. Ballester. Recent progress on the prospective application of machine learning to structure-based virtual screening. Curr Opin Chem Biol, 65 (1) (2021), pp. 28-34.

[256]

M. Liu, S. Wang. MCDOCK: a Monte Carlo simulation approach to the molecular docking problem. J Comput Aided Mol Des, 13 (5) (1999), pp. 435-451.

[257]

P. Sneha, C. George Priya Doss. Molecular dynamics: new frontier in personalized medicine. Adv Protein Chem Struct Biol, 102 (1) (2016), pp. 181-224.

[258]

W. Xue, F. Yang, P. Wang, G. Zheng, Y. Chen, X. Yao, et al. What contributes to serotonin-norepinephrine reuptake inhibitors’ dual-targeting mechanism? The key role of transmembrane domain 6 in human serotonin and norepinephrine transporters revealed by molecular dynamics simulation. ACS Chem Neurosci, 9 (5) (2018), pp. 1128-1140. DOI: 10.1021/acschemneuro.7b00490

[259]

Q.Q. Xie, L. Zhong, Y.L. Pan, X.Y. Wang, J.P. Zhou, L. Di-wu, et al. Combined SVM-based and docking-based virtual screening for retrieving novel inhibitors of c-Met. Eur J Med Chem, 46 (9) (2011), pp. 3675-3680.

[260]

J.C. Pereira, E.R. Caffarena, C.N. Dos Santos. Boosting docking-based virtual screening with deep learning. J Chem Inf Model, 56 (12) (2016), pp. 2495-2506. DOI: 10.1021/acs.jcim.6b00355

[261]

N. Huang, B.K. Shoichet, J.J. Irwin. Benchmarking sets for molecular docking. J Med Chem, 49 (23) (2006), pp. 6789-6801. DOI: 10.1021/jm0608356

[262]

P.T. Lang, S.R. Brozell, S. Mukherjee, E.F. Pettersen, E.C. Meng, V. Thomas, et al. DOCK 6: combining techniques to model RNA—small molecule complexes. RNA, 15 (6) (2009), pp. 1219-1230. DOI: 10.1261/rna.1563609

[263]

O. Trott, A.J. Olson. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem, 31 (2) (2010), pp. 455-461. DOI: 10.1002/jcc.21334

[264]

M.D. AbdulHameed, D.L. Ippolito, A. Wallqvist. Predicting rat and human pregnane X receptor activators using Bayesian classification models. Chem Res Toxicol, 29 (10) (2016), pp. 1729-1740. DOI: 10.1021/acs.chemrestox.6b00227

[265]

E.J. Martin, V.R. Polyakov, L. Tian, R.C. Perez. Profile-QSAR 2.0: kinase virtual screening accuracy comparable to four-concentration IC50s for realistically novel compounds. J Chem Inf Model, 57 (8) (2017), pp. 2077-2088. DOI: 10.1021/acs.jcim.7b00166

[266]

J.J.F. Chen, D.P. Visco Jr. Developing an in silico pipeline for faster drug candidate discovery: virtual high throughput screening with the signature molecular descriptor using support vector machine models. Chem Eng Sci, 159 (1) (2017), pp. 31-42.

[267]

K.Z. Myint, L. Wang, Q. Tong, X.Q. Xie. Molecular fingerprint-based artificial neural networks QSAR for ligand biological activity predictions. Mol Pharm, 9 (10) (2012), pp. 2912-2923. DOI: 10.1021/mp300237z

[268]

J. Jaén-Oltra, M.T. Salabert-Salvador, F.J. García-March, F. Pérez-Giménez, F. Tomás-Vert. Artificial neural network applied to prediction of fluorquinolone antibacterial activity by topological methods. J Med Chem, 43 (6) (2000), pp. 1143-1148.

[269]

E.B. Lenselink, N. Ten Dijke, B. Bongers, G. Papadatos, H.W.T. van Vlijmen, W. Kowalczyk, et al. Beyond the hype: deep neural networks outperform established methods using a ChEMBL bioactivity benchmark set. J Cheminform, 9 (1) (2017), p. 45.

[270]

M.J. Keiser, B.L. Roth, B.N. Armbruster, P. Ernsberger, J.J. Irwin, B.K. Shoichet. Relating protein pharmacology by ligand chemistry. Nat Biotechnol, 25 (2) (2007), pp. 197-206. DOI: 10.1038/nbt1284

[271]

T. Xiao, X. Qi, Y. Chen, Y. Jiang. Development of ligand-based big data deep neural network models for virtual screening of large compound libraries. Mol Inform, 37 (11) (2018), p. 1800031.

[272]

J. Fang, L. Wang, Y. Li, W. Lian, X. Pang, H. Wang, et al. AlzhCPI: a knowledge base for predicting chemical-protein interactions towards Alzheimer’s disease. PLoS One, 12 (5) (2017), p. e0178347. DOI: 10.1371/journal.pone.0178347

[273]

A. Bender, H.Y. Mussa, R.C. Glen. Screening for dihydrofolate reductase inhibitors using MOLPRINT 2D, a fast fragment-based method employing the naïve Bayesian classifier: limitations of the descriptor and the importance of balanced chemistry in training and test sets. J Biomol Screen, 10 (7) (2005), pp. 658-666. DOI: 10.1177/1087057105281048

[274]

A. Abdo, B. Chen, C. Mueller, N. Salim, P. Willett. Ligand-based virtual screening using Bayesian networks. J Chem Inf Model, 50 (6) (2010), pp. 1012-1020. DOI: 10.1021/ci100090p

[275]

Y. Li, L. Wang, Z. Liu, C. Li, J. Xu, Q. Gu, et al. Predicting selective liver X receptor β agonists using multiple machine learning methods. Mol Biosyst, 11 (5) (2015), pp. 1241-1250.

[276]

J. Fang, R. Yang, L. Gao, D. Zhou, S. Yang, A.L. Liu, et al. Predictions of BuChE inhibitors using support vector machine and naive Bayesian classification techniques in drug discovery. J Chem Inf Model, 53 (11) (2013), pp. 3009-3020. DOI: 10.1021/ci400331p

[277]

P. Schneider, Y. Tanrikulu, G. Schneider. Self-organizing maps in drug discovery: compound library design, scaffold-hopping, repurposing. Curr Med Chem, 16 (3) (2009), pp. 258-266. DOI: 10.2174/092986709787002655

[278]

D. Hristozov, T.I. Oprea, J. Gasteiger. Ligand-based virtual screening by novelty detection with self-organizing maps. J Chem Inf Model, 47 (6) (2007), pp. 2044-2062. DOI: 10.1021/ci700040r

[279]

D. Reker, T. Rodrigues, P. Schneider, G. Schneider. Identifying the macromolecular targets of de novo-designed chemical entities through self-organizing map consensus. Proc Natl Acad Sci USA, 111 (11) (2014), pp. 4067-4072. DOI: 10.1073/pnas.1320001111

[280]

L. Stojanović, M. Popović, N. Tijanić, G. Rakočević, M. Kalinić. Improved scaffold hopping in ligand-based virtual screening using neural representation learning. J Chem Inf Model, 60 (10) (2020), pp. 4629-4639. DOI: 10.1021/acs.jcim.0c00622

[281]

A. Kadurin, A. Aliper, A. Kazennov, P. Mamoshina, Q. Vanhaelen, K. Khrabrov, et al. The cornucopia of meaningful leads: applying deep adversarial autoencoders for new molecule development in oncology. Oncotarget, 8 (7) (2017), pp. 10883-10890. DOI: 10.18632/oncotarget.14073

[282]

Y. Xu, P. Chen, X. Lin, H. Yao, K. Lin. Discovery of CDK4 inhibitors by convolutional neural networks. Future Med Chem, 11 (3) (2019), pp. 165-177.

[283]

H. Altae-Tran, B. Ramsundar, A.S. Pappu, V. Pande. Low data drug discovery with one-shot learning. ACS Cent Sci, 3 (4) (2017), pp. 283-293. DOI: 10.1021/acscentsci.6b00367

[284]

Z. Zhou, S. Kearnes, L. Li, R.N. Zare, P. Riley. Optimization of molecules via deep reinforcement learning. Sci Rep, 9 (1) (2019), p. 10752. Corrected in: Sci Rep 2020;10(1):10478.

[285]

M. Hartenfeller, G. Schneider. De novo drug design. Methods Mol Biol, 672 (1) (2011), pp. 299-323.

[286]

M. Segall. Advances in multiparameter optimization methods for de novo drug design. Expert Opin Drug Discov, 9 (7) (2014), pp. 803-817. DOI: 10.1517/17460441.2014.913565

[287]

G. Schneider, U. Fechner. Computer-based de novo design of drug-like molecules. Nat Rev Drug Discov, 4 (8) (2005), pp. 649-663. DOI: 10.1038/nrd1799

[288]

J.I. Sohn, J.W. Nam. The present and future of de novo whole-genome assembly. Brief Bioinform, 19 (1) (2018), pp. 23-40.

[289]

G. Schneider, D.E. Clark. Automated de novo drug design: are we nearly there yet? Angew Chem Int Ed Engl, 58 (32) (2019), pp. 10792-10803. DOI: 10.1002/anie.201814681

[290]

J. Xiong, Z. Xiong, K. Chen, H. Jiang, M. Zheng. Graph neural networks for automated de novo drug design. Drug Discov Today, 26 (6) (2021), pp. 1382-1393.

[291]

T. Pereira, M. Abbasi, B. Ribeiro, J.P. Arrais. Diversity oriented deep reinforcement learning for targeted molecule generation. J Cheminform, 13 (1) (2021), p. 21.

[292]

N. Ståhl, G. Falkman, A. Karlsson, G. Mathiason, J. Boström. Deep reinforcement learning for multiparameter optimization in de novo drug design. J Chem Inf Model, 59 (7) (2019), pp. 3166-3176. DOI: 10.1021/acs.jcim.9b00325

[293]

Ł. Maziarka, A. Pocha, J. Kaczmarczyk, K. Rataj, T. Danel, M. Warchoł. Mol-CycleGAN: a generative model for molecular optimization. J Cheminform, 12 (1) (2020), p. 2.

[294]

B. Sanchez-Lengeling, C. Outeiral, G.L. Guimaraes, A. Aspuru-Guzik. Optimizing distributions over molecular space. An objective-reinforced generative adversarial network for inverse-design chemistry (ORGANIC). ChemRxiv. Cambridge Open Engage, Cambridge (2017).

[295]

E. Putin, A. Asadulaev, Y. Ivanenkov, V. Aladinskiy, B. Sanchez-Lengeling, A. Aspuru-Guzik, et al. Reinforced adversarial neural computer for de novo molecular design. J Chem Inf Model, 58 (6) (2018), pp. 1194-1204. DOI: 10.1021/acs.jcim.7b00690

[296]

S. Harel, K. Radinsky. Prototype-based compound discovery using deep generative models. Mol Pharm, 15 (10) (2018), pp. 4406-4416. DOI: 10.1021/acs.molpharmaceut.8b00474

[297]

W. Wilman, S. Wrobel, W. Bielska, P. Deszynski, P. Dudzic, I. Jaszczyszyn, et al. Machine-designed biotherapeutics: opportunities, feasibility and advantages of deep learning in computational antibody discovery. Brief Bioinform, 23 (4) (2022), p. bbac267.

[298]

J.A. Ruffolo, J. Sulam, J.J. Gray. Antibody structure prediction using interpretable deep learning. Patterns, 3 (2) (2022), Article 100406.

[299]

A. Sivasubramanian, A. Sircar, S. Chaudhury, J.J. Gray. Toward high-resolution homology modeling of antibody Fv regions and application to antibody-antigen docking. Proteins, 74 (2) (2009), pp. 497-514. DOI: 10.1002/prot.22309

[300]

C. Schneider, A. Buchanan, B. Taddese, C.M. Deane. DLAB: deep learning methods for structure-based virtual screening of antibodies. Bioinformatics, 38 (2) (2022), pp. 377-383. DOI: 10.1093/bioinformatics/btab660

[301]

R.R. Eguchi, C.A. Choe, P.S. Huang. Ig-VAE: generative modeling of protein structure by direct 3D coordinate generation. PLoS Comput Biol, 18 (6) (2022), p. e1010271. DOI: 10.1371/journal.pcbi.1010271

[302]

M.I.J. Raybould, C. Marks, K. Krawczyk, B. Taddese, J. Nowak, A.P. Lewis, et al. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci USA, 116 (10) (2019), pp. 4025-4030. DOI: 10.1073/pnas.1810576116

[303]

J.H. Kim, H.J. Hong. Humanization by CDR grafting and specificity-determining residue grafting. Methods Mol Biol, 907 (1) (2012), pp. 237-245. DOI: 10.1007/978-1-61779-974-7_13

[304]

J. Leem, L.S. Mitchell, J.H.R. Farmery, J. Barton, J.D. Galson. Deciphering the language of antibodies using self-supervised learning. Patterns, 3 (7) (2022), Article 100513.

[305]

T.H. Olsen, I.H. Moal, C.M. Deane. AbLang: an antibody language model for completing antibody sequences. Bioinform Adv, 2 (1) (2022), p. vbac046.

[306]

J. Fu, Y. Zhang, J. Liu, X. Lian, J. Tang, F. Zhu. Pharmacometabonomics: data processing and statistical analysis. Brief Bioinform, 22 (5) (2021), p. bbab138.

[307]

N.A. Meanwell. Improving drug candidates by design: a focus on physicochemical properties as a means of improving compound disposition and safety. Chem Res Toxicol, 24 (9) (2011), pp. 1420-1456. DOI: 10.1021/tx200211v

[308]

C.A. Lipinski. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov Today Technol, 1 (4) (2004), pp. 337-341.

[309]

M.Q. Zhang, B. Wilkinson. Drug discovery beyond the ‘rule-of-five’. Curr Opin Biotechnol, 18 (6) (2007), pp. 478-488.

[310]

D.T. Manallack, R.J. Prankerd, E. Yuriev, T.I. Oprea, D.K. Chalmers. The significance of acid/base properties in drug discovery. Chem Soc Rev, 42 (2) (2013), pp. 485-496.

[311]

H. Zhang, M.L. Xiang, C.Y. Ma, Q. Huang, W. Li, Y. Xie, et al. Three-class classification models of logS and logP derived by using GA-CG-SVM approach. Mol Divers, 13 (2) (2009), pp. 261-268.

[312]

W.L. Jorgensen, E.M. Duffy. Prediction of drug solubility from Monte Carlo simulations. Bioorg Med Chem Lett, 10 (11) (2000), pp. 1155-1158.

[313]

M.J. Kamlet, R.M. Doherty, J.L. Abboud, M.H. Abraham, R.W. Taft. Linear solvation energy relationships: 36. molecular properties governing solubilities of organic nonelectrolytes in water. J Pharm Sci, 75 (4) (1986), pp. 338-349. DOI: 10.1002/jps.2600750405

[314]

X. Yang, Y. Wang, R. Byrne, G. Schneider, S. Yang. Concepts of artificial intelligence for computer-assisted drug discovery. Chem Rev, 119 (18) (2019), pp. 10520-10594. DOI: 10.1021/acs.chemrev.8b00728

[315]

D. Elder, R. Holm. Aqueous solubility: simple predictive methods (in silico, in vitro and bio-relevant approaches). Int J Pharm, 453 (1) (2013), pp. 3-11.

[316]

M. Hewitt, M.T. Cronin, S.J. Enoch, J.C. Madden, D.W. Roberts, J.C. Dearden. In silico prediction of aqueous solubility: the solubility challenge. J Chem Inf Model, 49 (11) (2009), pp. 2572-2587. DOI: 10.1021/ci900286s

[317]

P.G. Francoeur, D.R. Koes. SolTranNet—a machine learning tool for fast aqueous solubility prediction. J Chem Inf Model, 61 (6) (2021), pp. 2530-2536. DOI: 10.1021/acs.jcim.1c00331

[318]

D. Rogers, M. Hahn. Extended-connectivity fingerprints. J Chem Inf Model, 50 (5) (2010), pp. 742-754. DOI: 10.1021/ci100050t

[319]

W.X. Shen, X. Zeng, F. Zhu, Y.L. Wang, C. Qin, Y. Tan, et al. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nat Mach Intell, 3 (4) (2021), pp. 334-343. DOI: 10.1038/s42256-021-00301-6

[320]

J. Yin, F. Li, Y. Zhou, M. Mou, Y. Lu, K. Chen, et al. INTEDE: interactome of drug-metabolizing enzymes. Nucleic Acids Res, 49 (D1) (2021), pp. D1233-D1243. DOI: 10.1093/nar/gkaa755

[321]

F. Cheng, W. Li, G. Liu, Y. Tang. In silico ADMET prediction: recent advances, current challenges and future trends. Curr Top Med Chem, 13 (11) (2013), pp. 1273-1289. DOI: 10.2174/15680266113139990033

[322]

Y. Wang, J. Xing, Y. Xu, N. Zhou, J. Peng, Z. Xiong, et al. In silico ADME/T modelling for rational drug design. Q Rev Biophys, 48 (4) (2015), pp. 488-515.

[323]

L.L.G. Ferreira, A.D. Andricopulo. ADMET modeling approaches in drug discovery. Drug Discov Today, 24 (5) (2019), pp. 1157-1165.

[324]

L. Tao, P. Zhang, C. Qin, S.Y. Chen, C. Zhang, Z. Chen, et al. Recent progresses in the exploration of machine learning methods as in-silico ADME prediction tools. Adv Drug Deliv Rev, 86 (1) (2015), pp. 83-100.

[325]

A. Rácz, D. Bajusz, R.A. Miranda-Quintana, K. Héberger. Machine learning models for classification tasks related to drug safety. Mol Divers, 25 (3) (2021), pp. 1409-1424. DOI: 10.1007/s11030-021-10239-x

[326]

J.I. Vandenberg, M.D. Perry, M.J. Perrin, S.A. Mann, Y. Ke, A.P. Hill. hERG K+ channels: structure, function, and clinical significance. Physiol Rev, 92 (3) (2012), pp. 1393-1478. DOI: 10.1152/physrev.00036.2011

[327]

M.A. Kaisar, R.K. Sajja, S. Prasad, V.V. Abhyankar, T. Liles, L. Cucullo. New experimental models of the blood-brain barrier for CNS drug discovery. Expert Opin Drug Discov, 12 (1) (2017), pp. 89-103. DOI: 10.1080/17460441.2017.1253676

[328]

M.J. Smyth, E. Krasovskis, V.R. Sutton, R.W. Johnstone. The drug efflux protein, P-glycoprotein, additionally protects drug-resistant tumor cells from multiple forms of caspase-dependent apoptosis. Proc Natl Acad Sci USA, 95 (12) (1998), pp. 7024-7029.

[329]

A. Rácz, G.M. Keserű. Large-scale evaluation of cytochrome P450 2C9 mediated drug interaction potential with machine learning-based consensus modeling. J Comput Aided Mol Des, 34 (8) (2020), pp. 831-839. DOI: 10.1007/s10822-020-00308-y

[330]

H. Yang, L. Sun, W. Li, G. Liu, Y. Tang. In silico prediction of chemical toxicity for drug design using machine learning methods and structural alerts. Front Chem, 6 (1) (2018), p. 30.

[331]

I.J. Onakpoya, C.J. Heneghan, J.K. Aronson. Post-marketing withdrawal of 462 medicinal products because of adverse drug reactions: a systematic review of the world literature. BMC Med, 14 (1) (2016), p. 10. Corrected in: BMC Med 2019;17(1):56.

[332]

V.M. Alves, A. Golbraikh, S.J. Capuzzi, K. Liu, W.I. Lam, D.R. Korn, et al. Multi-descriptor read across (MuDRA): a simple and transparent approach for developing accurate quantitative structure-activity relationship models. J Chem Inf Model, 58 (6) (2018), pp. 1214-1223. DOI: 10.1021/acs.jcim.8b00124

[333]

T. Lei, F. Chen, H. Liu, H. Sun, Y. Kang, D. Li, et al. ADMET evaluation in drug discovery. Part 17: development of quantitative and qualitative prediction models for chemical-induced respiratory toxicity. Mol Pharm, 14 (7) (2017), pp. 2407-2421. DOI: 10.1021/acs.molpharmaceut.7b00317

[334]

L. Zhu, J. Zhao, Y. Zhang, W. Zhou, L. Yin, Y. Wang, et al. ADME properties evaluation in drug discovery: in silico prediction of blood-brain partitioning. Mol Divers, 22 (4) (2018), pp. 979-990. DOI: 10.1007/s11030-018-9866-8

[335]

B.H. Su, Y.S. Tu, C. Lin, C.Y. Shao, O.A. Lin, Y.J. Tseng. Rule-based prediction models of cytochrome P450 inhibition. J Chem Inf Model, 55 (7) (2015), pp. 1426-1434. DOI: 10.1021/acs.jcim.5b00130

[336]

M. Yang, J. Chen, L. Xu, X. Shi, X. Zhou, Z. Xi, et al. A novel adaptive ensemble classification framework for ADME prediction. RSC Adv, 8 (21) (2018), pp. 11661-11683. DOI: 10.1039/c8ra01206g

[337]

E.V. Radchenko, A.S. Dyabina, V.A. Palyulin. Towards deep neural network models for the prediction of the blood-brain barrier permeability for diverse organic compounds. Molecules, 25 (24) (2020), p. 5901. DOI: 10.3390/molecules25245901

[338]

D. Wang, W. Liu, Z. Shen, L. Jiang, J. Wang, S. Li, et al. Deep learning based drug metabolites prediction. Front Pharmacol, 10 (1) (2020), p. 1586.

[339]

H. Yang, C. Lou, L. Sun, J. Li, Y. Cai, Z. Wang, et al. admetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties. Bioinformatics, 35 (6) (2019), pp. 1067-1069. DOI: 10.1093/bioinformatics/bty707

[340]

A. Daina, O. Michielin, V. Zoete. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci Rep, 7 (1) (2017), p. 42717.

[341]

P. Banerjee, A.O. Eckert, A.K. Schrey, R. Preissner. ProTox-II: a webserver for the prediction of toxicity of chemicals. Nucleic Acids Res, 46 (W1) (2018), pp. W257-W263. DOI: 10.1093/nar/gky318

[342]

D.E. Pires, T.L. Blundell, D.B. Ascher. pkCSM: predicting small-molecule pharmacokinetic and toxicity properties using graph-based signatures. J Med Chem, 58 (9) (2015), pp. 4066-4072. DOI: 10.1021/acs.jmedchem.5b00104

[343]

M. Hay, D.W. Thomas, J.L. Craighead, C. Economides, J. Rosenthal. Clinical development success rates for investigational drugs. Nat Biotechnol, 32 (1) (2014), pp. 40-51. DOI: 10.1038/nbt.2786

[344]

S. Harrer, P. Shah, B. Antony, J. Hu. Artificial intelligence for clinical trial design. Trends Pharmacol Sci, 40 (8) (2019), pp. 577-591.

[345]

J.L. Perez-Gracia, M.F. Sanmamed, A. Bosch, A. Patiño-Garcia, K.A. Schalper, V. Segura, et al. Strategies to design clinical studies to identify predictive biomarkers in cancer research. Cancer Treat Rev, 53 (1) (2017), pp. 79-97.

[346]

J.M. Banda, M. Seneviratne, T. Hernandez-Boussard, N.H. Shah. Advances in electronic phenotyping: from rule-based definitions to machine learning models. Annu Rev Biomed Data Sci, 1 (1) (2018), pp. 53-68. DOI: 10.1146/annurev-biodatasci-080917-013315

[347]

S. Palmqvist, P.S. Insel, H. Zetterberg, K. Blennow, B. Brix, E. Stomrud, et al. Alzheimer’s Disease Neuroimaging Initiative, Swedish BioFINDER Study. Accurate risk estimation of β-amyloid positivity to identify prodromal Alzheimer’s disease: cross-validation study of practical algorithms. Alzheimers Dement, 15 (2) (2019), pp. 194-204. DOI: 10.1016/j.jalz.2018.08.014

[348]

K. Romero, K. Ito, J.A. Rogers, D. Polhamus, R. Qiu, D. Stephenson, et al. The future is now: model-based clinical trial design for Alzheimer’s disease. Clin Pharmacol Ther, 97 (3) (2015), pp. 210-214. DOI: 10.1002/cpt.16

[349]

E.E. Bain, L. Shafner, D.P. Walling, A.A. Othman, C. Chuang-Stein, J. Hinkle, et al. Use of a novel artificial intelligence platform on mobile devices to assess dosing compliance in a phase 2 clinical trial in subjects with schizophrenia. JMIR Mhealth Uhealth, 5 (2) (2017), p. e18. DOI: 10.2196/mhealth.7030

[350]

G. Yauney, P. Shah. Reinforcement learning with action-derived rewards for chemotherapy and clinical trial dosing regimen selection. In: Proceedings of the 3rd Machine Learning for Healthcare Conference; 2018 Aug 17-18; Stanford, CA, USA; 2018. pp. 161-226.

[351]

C.P. Farrington. Relative incidence estimation from case series for vaccine safety evaluation. Biometrics, 51 (1) (1995), pp. 228-235. DOI: 10.2307/2533328

[352]

P.B. Ryan, P.E. Stang, J.M. Overhage, M.A. Suchard, A.G. Hartzema, W. DuMouchel, et al. A comparison of the empirical performance of methods for a risk identification system. Drug Saf, 36 (Suppl 1) (2013), pp. 143-158. DOI: 10.1007/s40264-013-0108-9

[353]

G.N. Norén, J. Hopstadius, A. Bate, K. Star, I.R. Edwards. Temporal pattern discovery in longitudinal electronic patient records. Data Min Knowl Discov, 20 (3) (2010), pp. 361-387. DOI: 10.1007/s10618-009-0152-3

[354]

M. Morel, E. Bacry, S. Gaïffas, A. Guilloux, F. Leroy. ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection. Biostatistics, 21 (4) (2020), pp. 758-774. DOI: 10.1093/biostatistics/kxz003

[355]

A. Ben Abacha, M.F.M. Chowdhury, A. Karanasiou, Y. Mrabet, A. Lavelli, P. Zweigenbaum. Text mining for pharmacovigilance: using machine learning for drug name recognition and drug-drug interaction extraction and classification. J Biomed Inform, 58 (1) (2015), pp. 122-132.

[356]

J. Mower, D. Subramanian, T. Cohen. Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications. J Am Med Inform Assoc, 25 (10) (2018), pp. 1339-1350. DOI: 10.1093/jamia/ocy077

[357]

T. Lorberbaum, M. Nasir, M.J. Keiser, S. Vilar, G. Hripcsak, N.P. Tatonetti. Systems pharmacology augments drug safety surveillance. Clin Pharmacol Ther, 97 (2) (2015), pp. 151-158. DOI: 10.1002/cpt.2

[358]

A. Enshaei, C.N. Robson, R.J. Edmondson. Artificial intelligence systems as prognostic and predictive tools in ovarian cancer. Ann Surg Oncol, 22 (12) (2015), pp. 3970-3975. DOI: 10.1245/s10434-015-4475-6

[359]

D. Sun, M. Wang, A. Li. A multimodal deep neural network for human breast cancer prognosis prediction by integrating multi-dimensional data. IEEE/ACM Trans Comput Biol Bioinform, 16 (3) (2018), pp. 841-850. DOI: 10.1159/000494471

[360]

C.L. Chi, W.N. Street, W.H. Wolberg. Application of artificial neural network-based survival analysis on two breast cancer datasets. AMIA Annu Symp Proc, 2007 (1) (2007), pp. 130-134.

[361]

D. Delen, G. Walker, A. Kadam. Predicting breast cancer survivability: a comparison of three data mining methods. Artif Intell Med, 34 (2) (2005), pp. 113-127.

[362]

Y. Sun, S. Goodison, J. Li, L. Liu, W. Farmerie. Improved breast cancer prognosis through the combination of clinical and genetic markers. Bioinformatics, 23 (1) (2007), pp. 30-37. DOI: 10.1093/bioinformatics/btl543

[363]

O. Gevaert, F. De Smet, D. Timmerman, Y. Moreau, B. De Moor. Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22 (14) (2006), pp. e184-e190. DOI: 10.1093/bioinformatics/btl230

[364]

C.M. Lynch, B. Abdollahi, J.D. Fuqua, A.R. de Carlo, J.A. Bartholomai, R.N. Balgemann, et al. Prediction of lung cancer patient survival via supervised machine learning classification techniques. Int J Med Inform, 108 (1) (2017), pp. 1-8.

[365]

K.H. Yu, C. Zhang, G.J. Berry, R.B. Altman, C. Ré, D.L. Rubin, et al. Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. Nat Commun, 7 (1) (2016), p. 12474.

[366]

A. Biglarian, E. Hajizadeh, A. Kazemnejad, M. Zali. Application of artificial neural network in predicting the survival rate of gastric cancer patients. Iran J Public Health, 40 (2) (2011), pp. 80-86.

[367]

Y. Zhu, Q.C. Wang, M.D. Xu, Z. Zhang, J. Cheng, Y.S. Zhong, et al. Application of convolutional neural network in the diagnosis of the invasion depth of gastric cancer based on conventional endoscopy. Gastrointest Endosc, 89 (4) (2019), pp. 806-815. DOI: 10.1364/ao.58.000806

[368]

L. Zhu, W. Luo, M. Su, H. Wei, J. Wei, X. Zhang, et al. Comparison between artificial neural network and Cox regression model in predicting the survival rate of gastric cancer patients. Biomed Rep, 1 (5) (2013), pp. 757-760. DOI: 10.3892/br.2013.140

[369]

D.W. Tian, Z.L. Wu, L.M. Jiang, J. Gao, C.L. Wu, H.L. Hu. Neural precursor cell expressed, developmentally downregulated 8 promotes tumor progression and predicts poor prognosis of patients with bladder cancer. Cancer Sci, 110 (1) (2019), pp. 458-467. DOI: 10.1111/cas.13865

[370]

Z. Hasnain, J. Mason, K. Gill, G. Miranda, I.S. Gill, P. Kuhn, et al. Machine learning models for predicting post-cystectomy recurrence and survival in bladder cancer patients. PLoS One, 14 (2) (2019), p. e0210976. DOI: 10.1371/journal.pone.0210976

[371]

R.J. Kuo, M.H. Huang, W.C. Cheng, C.C. Lin, Y.H. Wu. Application of a two-stage fuzzy neural network to a prostate cancer prognosis system. Artif Intell Med, 63 (2) (2015), pp. 119-133.

[372]

S. Zhang, Y. Xu, X. Hui, F. Yang, Y. Hu, J. Shao, et al. Improvement in prediction of prostate cancer prognosis with somatic mutational signatures. J Cancer, 8 (16) (2017), pp. 3261-3267. DOI: 10.7150/jca.21261

[373]

E.J. Corey, W.T. Wipke. Computer-assisted design of complex organic syntheses. Science, 166 (3902) (1969), pp. 178-192. DOI: 10.1126/science.166.3902.178

[374]

T.J. Struble, J.C. Alvarez, S.P. Brown, M. Chytil, J. Cisar, R.L. DesJarlais, et al. Current and future roles of artificial intelligence in medicinal chemistry synthesis. J Med Chem, 63 (16) (2020), pp. 8667-8682. DOI: 10.1021/acs.jmedchem.9b02120

[375]

M.H.S. Segler, M. Preuss, M.P. Waller. Planning chemical syntheses with deep neural networks and symbolic AI. Nature, 555 (7698) (2018), pp. 604-610. DOI: 10.1038/nature25978

[376]

H. Gao, T.J. Struble, C.W. Coley, Y. Wang, W.H. Green, K.F. Jensen. Using machine learning to predict suitable conditions for organic reactions. ACS Cent Sci, 4 (11) (2018), pp. 1465-1476. DOI: 10.1021/acscentsci.8b00357

[377]

Y. Gong, D. Xue, G. Chuai, J. Yu, Q. Liu. DeepReac+: deep active learning for quantitative modeling of organic chemical reactions. Chem Sci, 12 (43) (2021), pp. 14459-14472. DOI: 10.1039/d1sc02087k

[378]

C.W. Coley, R. Barzilay, T.S. Jaakkola, W.H. Green, K.F. Jensen. Prediction of organic reaction outcomes using machine learning. ACS Cent Sci, 3 (5) (2017), pp. 434-443. DOI: 10.1021/acscentsci.7b00064

[379]

D. Caramelli, D. Salley, A. Henson, G.A. Camarasa, S. Sharabi, G. Keenan, et al. Networking chemical robots for reaction multitasking. Nat Commun, 9 (1) (2018), p. 3406.

[380]

R.B. Merrifield. Automated synthesis of peptides. Science, 150 (3693) (1965), pp. 178-185. DOI: 10.1126/science.150.3693.178

[381]

G. Alvarado-Urbina, G.M. Sathe, W.C. Liu, M.F. Gillen, P.D. Duck, R. Bender, et al. Automated synthesis of gene fragments. Science, 214 (4518) (1981), pp. 270-274. DOI: 10.1126/science.6169150

[382]

T. Doi, S. Fuse, S. Miyamoto, K. Nakai, D. Sasuga, T. Takahashi. A formal total synthesis of taxol aided by an automated synthesizer. Chem Asian J, 1 (3) (2006), pp. 370-383. DOI: 10.1002/asia.200600156

[383]

J. Boström, D.G. Brown, R.J. Young, G.M. Keserü. Expanding the medicinal chemistry synthetic toolbox. Nat Rev Drug Discov, 17 (10) (2018), pp. 709-727. Erratum in: Nat Rev Drug Discov 2018;17(12):922. DOI: 10.1038/nrd.2018.116

[384]

A. Bellomo, N. Celebi-Olcum, X. Bu, N. Rivera, R.T. Ruck, C.J. Welch, et al. Rapid catalyst identification for the synthesis of the pyrimidinone core of HIV integrase inhibitors. Angew Chem Int Ed Engl, 51 (28) (2012), pp. 6912-6915. DOI: 10.1002/anie.201201720

[385]

S.D. Dreher, P.G. Dormer, D.L. Sandrock, G.A. Molander. Efficient cross-coupling of secondary alkyltrifluoroborates with aryl chlorides—reaction discovery using parallel microscale experimentation. J Am Chem Soc, 130 (29) (2008), pp. 9257-9259. DOI: 10.1021/ja8031423

[386]

A. Buitrago Santanilla, E.L. Regalado, T. Pereira, M. Shevlin, K. Bateman, L.C. Campeau, et al. Nanomole-scale high-throughput chemistry for the synthesis of complex molecules. Science, 347 (6217) (2015), pp. 49-53. DOI: 10.1126/science.1259203

[387]

D. Perera, J.W. Tucker, S. Brahmbhatt, C.J. Helal, A. Chong, W. Farrell, et al. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science, 359 (6374) (2018), pp. 429-434. DOI: 10.1126/science.aap9112

[388]

M. Shevlin. Practical high-throughput experimentation for chemists. ACS Med Chem Lett, 8 (6) (2017), pp. 601-607. DOI: 10.1021/acsmedchemlett.7b00165

[389]

D.T. Ahneman, J.G. Estrada, S. Lin, S.D. Dreher, A.G. Doyle. Predicting reaction performance in C-N cross-coupling using machine learning. Science, 360 (6385) (2018), pp. 186-190. DOI: 10.1126/science.aar5169

[390]

O. Isayev. Text mining facilitates materials discovery. Nature, 571 (7763) (2019), pp. 42-43. DOI: 10.1038/d41586-019-01978-x

[391]

J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36 (4) (2020), pp. 1234-1240.

[392]

C. Sun, Z. Yang, L. Luo, L. Wang, J. Wang. A deep learning approach with deep contextualized word representations for chemical-protein interaction extraction from biomedical literature. IEEE Access, 7 (1) (2019), pp. 151034-151046. DOI: 10.1109/access.2019.2948155

[393]

S. Zhao, C. Su, Z. Lu, F. Wang. Recent advances in biomedical literature mining. Brief Bioinform, 22 (3) (2021), p. bbaa057.

[394]

S.N. Deftereos, C. Andronis, E.J. Friedla, A. Persidis, A. Persidis. Drug repurposing and adverse event prediction using high-throughput literature analysis. Wiley Interdiscip Rev Syst Biol Med, 3 (3) (2011), pp. 323-334. DOI: 10.1002/wsbm.147

[395]

H.T. Yang, J.H. Ju, Y.T. Wong, I. Shmulevich, J.H. Chiang. Literature-based discovery of new candidates for drug repurposing. Brief Bioinform, 18 (3) (2017), pp. 488-497.

[396]

R. Zhang, M.J. Cairelli, M. Fiszman, H. Kilicoglu, T.C. Rindflesch, S.V. Pakhomov, et al. Exploiting literature-derived knowledge and semantics to identify potential prostate cancer drugs. Cancer Inform, 13 (Suppl 1) (2014), pp. 103-111.

[397]

Y. Hu, L.M. Hines, H. Weng, D. Zuo, M. Rivera, A. Richardson, et al. Analysis of genomic and proteomic data using advanced literature mining. J Proteome Res, 2 (4) (2003), pp. 405-412.

[398]

N. Shang, H. Xu, T.C. Rindflesch, T. Cohen. Identifying plausible adverse drug reactions using knowledge extracted from the literature. J Biomed Inform, 52 (1) (2014), pp. 293-310.

[399]

S.A. Malec, P. Wei, E.V. Bernstam, R.D. Boyce, T. Cohen. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform, 117 (1) (2021), Article 103719.

[400]

L.L. Wang, K. Lo. Text mining approaches for dealing with the rapidly expanding literature on COVID-19. Brief Bioinform, 22 (2) (2021), pp. 781-799. DOI: 10.1093/bib/bbaa296

[401]

Z. Feng, Z. Shen, H. Li, S. Li. e-TSN: an interactive visual exploration platform for target-disease knowledge mapping from literature. Brief Bioinform, 23 (6) (2022), p. bbac465.

[402]

J. Wang, Z. Shen, Y. Liao, Z. Yuan, S. Li, G. He, et al. Multi-modal chemical information reconstruction from images and texts for exploring the near-drug space. Brief Bioinform, 23 (6) (2022), p. bbac461.

[403]

A.C. Ahn, M. Tewari, C.S. Poon, R.S. Phillips. The limits of reductionism in medicine: could systems biology offer an alternative? PLoS Med, 3 (6) (2006), p. e208. DOI: 10.1371/journal.pmed.0030208

[404]

I.R. König, O. Fuchs, G. Hansen, E. von Mutius, M.V. Kopp. What is precision medicine? Eur Respir J, 50 (4) (2017), p. 1700391. DOI: 10.1183/13993003.00391-2017

[405]

E.M. Antman, J. Loscalzo. Precision medicine in cardiology. Nat Rev Cardiol, 13 (10) (2016), pp. 591-602. DOI: 10.1038/nrcardio.2016.101

[406]

A.D. Hingorani, D.A. van der Windt, R.D. Riley, K. Abrams, K.G.M. Moons, E.W. Steyerberg, et al. Prognosis research strategy (PROGRESS) 4: stratified medicine research. BMJ, 346 (2013), p. e5793. DOI: 10.1136/bmj.e5793

[407]

J. Tang, M. Mou, Y. Wang, Y. Luo, F. Zhu. MetaFS: performance assessment of biomarker discovery in metaproteomics. Brief Bioinform, 22 (3) (2021), p. bbaa105.

[408]

Q. Yang, B. Li, S. Chen, J. Tang, Y. Li, Y. Li, et al. MMEASE: online meta-analysis of metabolomic data by enhanced metabolite annotation, marker selection and enrichment analysis. J Proteomics, 232 (1) (2021), Article 104023.

[409]

Y. Zhao, Z. Pan, S. Namburi, A. Pattison, A. Posner, S. Balachander, et al. CUP-AI-Dx: a tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence. EBioMedicine, 61 (1) (2020), Article 103030.

[410]

Y.L. Yeh, M.W. Su, B.L. Chiang, Y.H. Yang, C.H. Tsai, Y.L. Lee. Genetic profiles of transcriptomic clusters of childhood asthma determine specific severe subtype. Clin Exp Allergy, 48 (9) (2018), pp. 1164-1172. DOI: 10.1111/cea.13175

[411]

D.C.M. Rolland, V. Basrur, Y.K. Jeon, C. McNeil-Schwalm, D. Fermin, K.P. Conlon, et al. Functional proteogenomics reveals biomarkers and therapeutic targets in lymphomas. Proc Natl Acad Sci USA, 114 (25) (2017), pp. 6581-6586. DOI: 10.1073/pnas.1701263114

[412]

L. Niu, M. Thiele, P.E. Geyer, D.N. Rasmussen, H.E. Webel, A. Santos, et al. Noninvasive proteomic biomarkers for alcohol-related liver disease. Nat Med, 28 (6) (2022), pp. 1277-1287. DOI: 10.1038/s41591-022-01850-y

[413]

W. Poon, B.R. Kingston, B. Ouyang, W. Ngo, W.C.W. Chan. A framework for designing delivery systems. Nat Nanotechnol, 15 (10) (2020), pp. 819-829. DOI: 10.1038/s41565-020-0759-5

[414]

M.J. Mitchell, M.M. Billingsley, R.M. Haley, M.E. Wechsler, N.A. Peppas, R. Langer. Engineering precision nanoparticles for drug delivery. Nat Rev Drug Discov, 20 (2) (2021), pp. 101-124. DOI: 10.1038/s41573-020-0090-8

[415]

J. Li, B. Esteban-Fernández de Ávila, W. Gao, L. Zhang, J. Wang. Micro/nanorobots for biomedicine: delivery, surgery, sensing, and detoxification. Sci Robot, 2 (4) (2017), p. eaam6431.

[416]

A.T. Ong, P.W. Serruys. Technology insight: an overview of research in drug-eluting stents. Nat Clin Pract Cardiovasc Med, 2 (12) (2005), pp. 647-658. DOI: 10.1038/ncpcardio0378

[417]

S.N. Bhatia, X. Chen, M.A. Dobrovolskaia, T. Lammers. Cancer nanomedicine. Nat Rev Cancer, 22 (10) (2022), pp. 550-556. DOI: 10.1038/s41568-022-00496-9

[418]

J. Vamathevan, D. Clark, P. Czodrowski, I. Dunham, E. Ferran, G. Lee, et al. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov, 18 (6) (2019), pp. 463-477. DOI: 10.1038/s41573-019-0024-5

[419]

S. Ekins, A.C. Puhl, K.M. Zorn, T.R. Lane, D.P. Russo, J.J. Klein, et al. Exploiting machine learning for end-to-end drug discovery and development. Nat Mater, 18 (5) (2019), pp. 435-441. DOI: 10.1038/s41563-019-0338-z

[420]

C. Chen, Z. Yaari, E. Apfelbaum, P. Grodzinski, Y. Shamay, D.A. Heller. Merging data curation and machine learning to improve nanomedicines. Adv Drug Deliv Rev, 183 (1) (2022), Article 114172.

[421]

D. Reker, Y. Rybakova, A.R. Kirtane, R. Cao, J.W. Yang, N. Navamajiti, et al. Computationally guided high-throughput design of self-assembling drug nanoparticles. Nat Nanotechnol, 16 (6) (2021), pp. 725-733. DOI: 10.1038/s41565-021-00870-y

[422]

Y. Shamay, J. Shah, M. Işık, A. Mizrachi, J. Leibold, D.F. Tschaharganeh, et al. Quantitative self-assembly prediction yields targeted nanomedicines. Nat Mater, 17 (4) (2018), pp. 361-368. DOI: 10.1038/s41563-017-0007-z

[423]

Y. Lu, A.A. Aimetti, R. Langer, Z. Gu. Bioresponsive materials. Nat Rev Mater, 2 (1) (2016), p. 16075.

[424]

R. Santana, R. Zuluaga, P. Gañán, S. Arrasate, E. Onieva, H. González-Díaz. Predicting coated-nanoparticle drug release systems with perturbation-theory machine learning (PTML) models. Nanoscale, 12 (25) (2020), pp. 13471-13483. DOI: 10.1039/d0nr01849j

[425]

C. Owh, V. Ow, Q. Lin, J.H.M. Wong, D. Ho, X.J. Loh, et al. Bottom-up design of hydrogels for programmable drug release. Biomater Adv, 141 (1) (2022), Article 213100.

[426]

C. Boztepe, A. Künkül, M. Yüceer. Application of artificial intelligence in modeling of the doxorubicin release behavior of pH and temperature responsive poly (NIPAAm-co-AAc)-PEG IPN hydrogel. J Drug Deliv Sci Technol, 57 (1) (2020), Article 101603.

[427]

R.T. Stiepel, E.S. Pena, S.A. Ehrenzeller, M.D. Gallovic, L.M. Lifshits, C.J. Genito, et al. A predictive mechanistic model of drug release from surface eroding polymeric nanoparticles. J Control Release, 351 (1) (2022), pp. 883-895.

[428]

M.K.P. Jayatunga, W. Xie, L. Ruder, U. Schulze, C. Meier. AI in small-molecule drug discovery: a coming wave? Nat Rev Drug Discov, 21 (3) (2022), pp. 175-176. DOI: 10.1038/d41573-022-00025-1

[429]

P. Richardson, I. Griffin, C. Tucker, D. Smith, O. Oechsle, A. Phelan, et al. Baricitinib as potential treatment for 2019-nCoV acute respiratory disease. Lancet, 395 (10223) (2020), pp. e30-e31.

[430]

P. Kirkpatrick. Artificial intelligence makes a splash in small-molecule drug discovery. Biopharm Deal, 16 (2) (2022), pp. 84-86.

[431]

C. Zhang, M. Mou, Y. Zhou, W. Zhang, X. Lian, S. Shi, et al. Biological activities of drug inactive ingredients. Brief Bioinform, 23 (5) (2022), p. bbac160.

[432]

L. Haghverdi, A.T.L. Lun, M.D. Morgan, J.C. Marioni. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat Biotechnol, 36 (5) (2018), pp. 421-427. DOI: 10.1038/nbt.4091
