《1. Introduction》

1. Introduction

The earth-shaking changes in human society are inseparable from the exploration of nature. Such transformative changes have evolved from a focus on natural observation to realization through various tools and cutting-edge methods [1,2]. In this process, normative development paradigms covering the overall and interrelated assumptions of various disciplines have been formed [3,4]. Each paradigm shift results from changes in the basic assumptions of the ruling theory of a given era to meet subsequent requirements, thereby creating a new paradigm [5]. The fifth paradigm is now characterized as an intelligence-driven, knowledge-centric research paradigm that follows the data-intensive fourth paradigm, which itself came on the heels of the experimental, theoretical, and computer-simulation paradigms (the first to third paradigms) [6–10].

In the fifth-paradigm world view, the exploration of the physical universe is not merely projected onto the mathematically probable realm of intensive data driven by intelligence; the entire research process also incorporates the knowledge of human experts. Based on these features, an application of the fifth paradigm can be regarded as a cognitive system or cognitive application [9,10]. Taking the development of materials science as an example, the cognitive system of the fifth paradigm has evolved from the primitive early paradigms via a classic evolutionary spiral, in which materials such as metals and ceramics were discovered and used in ancient times, long before the emergence of Newton’s laws and the advent of the theory of relativity. The emergence of relativity and quantum mechanics then made it possible to simulate the electronic structure of molecules [11–13]. In recent years, the meteoric rise of artificial intelligence (AI) and machine learning has been transformational for data-driven materials design [14–18]. By applying these innovative technologies to ever-larger datasets, the hidden properties of new materials, such as metals and ceramics, can be revealed [19–22]. Since then, cognitive materials design has taken up the relay baton and formed a new ecosystem through the intellectual collaboration of interdisciplinary experts, thus greatly accelerating the exploration process.

At present, the fifth paradigm is in its emergent period and still has a long way to go. Unlike the mature fourth paradigm of data-intensive science, which has exploded rapidly in multiple application domains and has been used in industrial and scientific fields such as self-driving cars, computer vision, and brain modeling [23–27], the intelligence-driven, knowledge-centric fifth paradigm is still in a stage of vigorous development because it needs to break the boundaries of the computational and data-intensive paradigms and form a new ecosystem by merging and extending existing technologies. Fortunately, scientists are now on the road to researching and solving these problems. For example, a Spark–message passing interface (MPI) integrated platform proposed by Malitsky et al. [10] can be used to promote the transformation of the fourth paradigm processing pipeline, represented by data-intensive applications, into the fifth paradigm of knowledge-centric applications. Cognitive computing capabilities, such as natural language processing, knowledge representation, and automatic reasoning, are exactly what Zubarev and Pitera [9] suggested the fifth paradigm should possess. Furthermore, common aspects among diverse computing applications can be inferred in the fifth paradigm by integrating expert knowledge from different fields with the intensive data from experimental observation and theoretical simulation, steering the development of complementary solutions to meet emerging and future challenges. Therefore, although the task of developing the fifth paradigm is arduous, the prospects of its application are broad.

The strategic transition from data-intensive science toward the fifth paradigm of composite cognitive computing applications is a long-term journey with many unknowns. This paper addresses the fifth paradigm platform by dissecting a framework called generalized adsorption simulations in Python (GASpy) for catalytic materials [28], aiming to bring together human wisdom, algorithms in high-performance scientific computing, and deep-learning approaches to tackle new frontiers of data-driven discovery applications. The remainder of the paper is organized as follows. Section 2 provides an overview and discussion of the fifth paradigm platform. Section 3 elaborates on the performance evaluation of the platform. Section 4 discusses the platform, and Section 5 concludes with a summary.

https://github.com/ulissigroup/GASpy

《2. A platform of the fifth paradigm》

2. A platform of the fifth paradigm

In materials research, harnessing the synergy among experimental data, theoretical models, and machine learning requires experts in different fields to collaboratively analyze and process data; that is, a great deal of human wisdom is needed. Therefore, an intelligence-driven, knowledge-centric capability that versatilely links each step and operates in a platform-like manner is particularly important. Here, we introduce a platform of the fifth paradigm used in catalytic materials research, as shown in Fig. 1. The platform of the fifth paradigm couples the third and fourth paradigms, which in turn include the processes of the first and second paradigms. The original data come from experimental observations in the first paradigm, theoretical guidance in the second paradigm, and numerical calculations in the third paradigm, and can then be intelligence-driven by machine learning in the fourth paradigm. By integrating the knowledge of experimental and theoretical experts, the materials selected by machine learning are screened a second time, and the screening results are fed back to the numerical simulations of the third paradigm. The results obtained in the third paradigm can again be driven by the data of the fourth paradigm, and the new predictions are once more filtered through the integrated knowledge of experimental and theoretical experts and fed back to the third paradigm for numerical simulation. These approaches produce the fifth paradigm platform, which continuously provides samples for machine learning by intelligently controlling high-throughput physical-model calculations to compensate for the lack of machine-learning samples. Moreover, by using the knowledge integrated from different fields, machine learning can replace part of the numerical calculations, alleviating the time-consuming computation of massive models caused by insufficient computing resources.

《Fig. 1》

Fig. 1. The paradigms in science. The scientific paradigm has evolved from the simple 1st paradigm to the complex 5th paradigm. The core of the 5th paradigm is knowledge-centric and intelligence-driven, encompassing the successive 1st to 4th paradigms marked by experiment, theory, simulation, and data-driven processes, respectively.

The comprehensive work of the fifth paradigm platform stems from the framework designed by Tran and Ulissi [28] for bimetallic catalyst research in materials science. The framework uses machine learning to accelerate numerical calculations based on density functional theory (DFT), conducted with the Vienna ab initio simulation package (VASP) [29], and can intelligently drive the discovery of high-performance electrocatalysts. The platform can classify the active sites of each stable low-index surface of bimetallic crystals, resulting in hundreds of thousands of possible active sites. At the same time, a surrogate model based on artificial neural networks is used to predict the catalytic activity of these sites [30]. The discovered sites with high activity can then be fed into future DFT calculations.

《2.1. Automatic model construction and verification》

2.1. Automatic model construction and verification

In the fifth paradigm platform, the intelligence-driven extraction of raw data is reflected in automatic model construction and verification. Ever-larger sets of structures, with and without adsorbates, can be automatically constructed and verified by DFT calculations. Because the adsorption of surface species is an indispensable process in heterogeneous catalysis, constructing the many structures required by experiments and DFT calculations can be time-intensive before the catalytic activity can be determined by evaluating the adsorption energy. Therefore, automated model construction and verification are essential to solving this problem.

As shown in Fig. 2, the entire task calculation includes the preparation of raw data for the standard simulation and then the numerical calculation itself. All the raw data used for the theoretical simulations come from the Materials Project website. They are retrieved by the gas/bulk generation module through the Generate_Gas/Generate_Bulk functions, processed into a list form with items such as user information, task location, calculation status, and other attributes, and stored in the database by the update_atom_collection function, which creates the collections named ‘‘Firework,” ‘‘Atoms,” ‘‘Catalog,” and ‘‘Adsorption.”
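As a rough illustration of this data-preparation step (not GASpy's actual code), the following Python sketch pulls a bulk structure from the Materials Project with the legacy pymatgen MPRester client and stores a task-style document in a MongoDB collection; the database name, field values, and API key are placeholders.

```python
# Illustrative sketch only: it mimics the Generate_Bulk/update_atom_collection idea
# using the legacy pymatgen MPRester client and pymongo; it is not GASpy's own code.
from pymatgen.ext.matproj import MPRester
from pymongo import MongoClient

def fetch_and_store_bulk(mp_id: str, api_key: str) -> None:
    # Pull the relaxed bulk structure from the Materials Project
    with MPRester(api_key) as mpr:
        structure = mpr.get_structure_by_material_id(mp_id)

    # Store a task document in the "Atoms" collection, mirroring the
    # list-style attributes described in the text (user, location, status).
    db = MongoClient("mongodb://localhost:27017")["gaspy_demo"]
    db["Atoms"].insert_one({
        "mpid": mp_id,
        "structure": structure.as_dict(),   # JSON-serializable representation
        "user": "demo_user",
        "task_location": "tianhe-1/bulk_relaxations",
        "status": "READY",
    })

fetch_and_store_bulk("mp-30", api_key="YOUR_MP_API_KEY")  # mp-30: fcc Cu
```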

《Fig. 2》

Fig. 2. The framework of this fifth paradigm case. The intelligence-driven extraction of raw data in the GASpy framework is realized through modules for atomic operations, generation, and calculation. (a) The function of this module is to automatically calculate the adsorption energy from the gas and slab phases in the fifth paradigm platform. (b) This module automatically creates high-throughput tasks for the optimization of gas, bulk, and adslab structures with/without adsorbates through FireWorks. (c, d) These modules represent the (c) slab generation and (d) gas generation and structural relaxation described in part (a).

The relaxation calculations of the tasks in Fig. 2(a) can then be generated by the FireWorks workflow manager for submission, as shown in Fig. 2(b). The result attributes in FireWorks contain ‘‘gas phase optimization” in list format for gas relaxation, as well as ‘‘unit cell optimization” for bulk optimization (bulk_relaxation). The attribute ‘‘status” records the calculation status, such as ‘‘COMPLETED,” ‘‘RUNNING,” ‘‘READY,” and ‘‘FIZZLED,” and is judged by the Find_Bulk/Find_Gas function either to store a completed calculation in the Atoms collection or to generate a FireWorks task workflow for a calculation that has not yet started.

If the status determined by Find_Bulk/Find_Gas is ‘‘COMPLETED,” the calculated result is stored in the database. In addition, the irreducible crystal face (Miller index) enumeration (realized by the EnumerateDistinctFacets function) is carried out on the optimized crystal structure obtained from the Atoms collection. Crystal slab cutting then generates slabs for the given Miller indices (realized by the Generateslabs function), and all adsorption sites on each slab are found (realized by the GenerateAdsorptionSites function) by extending the primitive units (the Atom_operates function), enumerating the crystal slabs, and adding adsorbates, as shown in Figs. 2(c) and (d). For all the adsorption sites on all bulk materials, the GenerateAllSitesFromBulks function, composed of the EnumerateDistinctFacets and GenerateAdsorptionSites functions, enumerates the irreducible Miller indices of each slab and generates all the adsorption sites. All of this information is written into the Catalog collection by the update_catalog_collection function.
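The same facet-enumeration and site-finding steps can be sketched with public pymatgen APIs, as shown below; this is only an approximation of what the EnumerateDistinctFacets, Generateslabs, and GenerateAdsorptionSites wrappers do internally, and the Miller-index cutoff and slab/vacuum sizes are arbitrary choices.

```python
# A minimal sketch of the same workflow steps using public pymatgen APIs
# (distinct Miller indices -> slab cutting -> adsorption-site enumeration);
# it does not reproduce GASpy's own wrapper functions.
from pymatgen.ext.matproj import MPRester
from pymatgen.core.surface import SlabGenerator, get_symmetrically_distinct_miller_indices
from pymatgen.analysis.adsorption import AdsorbateSiteFinder

with MPRester("YOUR_MP_API_KEY") as mpr:
    bulk = mpr.get_structure_by_material_id("mp-30")  # fcc Cu as an example

# Enumerate the irreducible (symmetrically distinct) Miller indices
for miller in get_symmetrically_distinct_miller_indices(bulk, max_index=2):
    slab = SlabGenerator(bulk, miller, min_slab_size=7.0,
                         min_vacuum_size=20.0).get_slab()
    # Find all candidate adsorption sites (ontop, bridge, hollow) on the slab
    sites = AdsorbateSiteFinder(slab).find_adsorption_sites()
    print(miller, len(sites["all"]), "candidate sites")
```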

Furthermore, for each slab in which the adsorption site has been found, the adsorbates will be added to the adsorption sites by the GenerateAdslabs function to generate a ‘‘slab + adsorbate optimization” calculation model (adslab_relaxation); in addition, the adsorbates can also be eliminated by the GenerateAdslabs function to generate a ‘‘bare slab optimization” calculation model (bare_slab_relaxation). These calculation models can then be submitted for calculation through the FireWorks workflow manager.
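A minimal FireWorks sketch of this submission step is shown below; the ScriptTask commands and workflow layout are placeholders and do not reproduce GASpy's actual task definitions.

```python
# A small FireWorks sketch (standard FireWorks API, not GASpy's wrappers):
# two independent relaxation jobs submitted through a LaunchPad.
from fireworks import Firework, Workflow, LaunchPad, ScriptTask

bare_slab_fw = Firework(ScriptTask.from_str("python run_vasp.py bare_slab"),
                        name="bare slab optimization")
adslab_fw = Firework(ScriptTask.from_str("python run_vasp.py adslab"),
                     name="slab + adsorbate optimization")

# Bundle the relaxation jobs into one workflow for the workflow manager
wf = Workflow([bare_slab_fw, adslab_fw], name="adsorption energy workflow")

launchpad = LaunchPad.auto_load()  # reads the local my_launchpad.yaml configuration
launchpad.add_wf(wf)
```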

When completed, all the calculated results are stored in the database collections by the update_atom_collection function. The Find_Adslab function determines whether a relaxation task should be started by checking whether a corresponding calculated result already exists in the Atoms collection. For the adsorption energy Ead calculation, the CalculateAdsorptionEnergy function extracts the gas energy Eadsorbates, the adsorbate_slab energy Eadsorbate_slab, and the bare_slab energy Ebare_slab from the Atoms collection: Ead = Eadsorbate_slab − Ebare_slab − Eadsorbates. The Ead and the associated initial and final structure information can then be added to the Adsorption collection by the update_adsorption_collection function, from which the neural network features discussed next can be extracted as the input for machine learning. Thus, the process of intelligence-driven model construction and verification is realized.

《2.2. Automated fingerprint construction》

2.2. Automated fingerprint construction

The intelligence-driven selection of neural network features is reflected in the automatic fingerprint construction in the fifth paradigm platform. In this framework, the automatically constructed fingerprint converts the atomic structure of each material adsorption model into a graph representation that serves as the numerical input of a convolutional neural network (CNN) [31]. Three types of features are considered in the atomic structure information, as shown in Fig. 3: the atomic feature (FN1), the neighbor feature (FN2), and the connection distance (FN3). The basic atomic properties in the atomic feature are the atomic number, electronegativity, coordination number/covalent radius, group, period, number of valence electrons, first ionization energy, electron affinity, block, and atomic volume. The basic neighbor feature consists of the coordination numbers between adjacent atoms near the adsorption site, calculated by the Voronoi polyhedron algorithm [32]. The connection distances are the distances from the adsorbate to all atoms. The target of the fingerprint is the adsorption energy (EadN).
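The following sketch assembles the three feature groups with public pymatgen APIs (element properties, Voronoi coordination numbers, and adsorbate distances); the exact fingerprint layout used in GASpy differs, and the atomic properties chosen here are only a subset.

```python
# A rough sketch of the three feature groups described above (FN1-FN3),
# built with public pymatgen APIs; not the actual GASpy/CGCNN fingerprint code.
import numpy as np
from pymatgen.analysis.local_env import VoronoiNN
from pymatgen.core.periodic_table import Element

def fingerprint(structure, adsorbate_index: int):
    vnn = VoronoiNN()
    atomic, neighbor, distance = [], [], []
    ads_coords = structure[adsorbate_index].coords
    for i, site in enumerate(structure):
        el = Element(site.specie.symbol)
        # FN1: a few basic atomic properties (atomic number, electronegativity, group, row)
        atomic.append([el.Z, el.X, el.group, el.row])
        # FN2: coordination number from the Voronoi polyhedron construction
        neighbor.append(vnn.get_cn(structure, i))
        # FN3: distance from the adsorbate to every atom
        distance.append(np.linalg.norm(site.coords - ads_coords))
    return np.array(atomic), np.array(neighbor), np.array(distance)
```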

《Fig. 3》

Fig. 3. The intelligence-driven of neural network feature selection in the fifth paradigm platform. It is realized by the automatic fingerprint construction in the framework of GASpy. (a) The DFT calculation is schematically viewed as an example dataset (N is the number of training examples); (b) the automatic fingerprint construction is achieved by a predictive model through the fingerprinting and learning steps process; (c) the learning problem is stated, followed by abandoning some materials from the learning results through the scaling relationship, and carrying out further DFT calculation screening.

The process of automatic fingerprint construction includes extracting the final structures and adsorption energies from DFT calculations, generating fingerprints, performing machine learning, and stating the learning problem. The fingerprints constructed in GASpy come from both the original models without DFT calculation and the DFT calculation results. After the DFT calculations, the initial targets EadN are obtained, as shown in Fig. 3(a); these DFT-relaxed structures are then used to extract the fingerprints {FN1, FN2, FN3} for learning and prediction, as shown in Fig. 3(b). These features are used as a cross-validation dataset in machine learning, and the learning process finds the function used for the next prediction. In the prediction process, the fingerprints are obtained from the initial structures without any DFT calculation and are used to predict the adsorption energy of material X, as shown in Fig. 3(c); the DFT calculation candidates required for the next cycle are then screened through the learning problem. This learning problem is determined by the well-known scaling relationship [33,34], as shown in Fig. 3. The scaling relationship is the adsorption energy–catalytic activity (also known as the binding energy–catalytic activity) curve, which rises first and then declines like a volcano and is therefore also known as the ‘‘volcano plot.”

The data on adsorption energy and catalytic activity in the scaling relationship come from the many attempts of theoretical and experimental scientists and are further used by AI experts to screen the results of machine learning. Hence, the knowledge-centric collaboration of these interdisciplinary experts forms this fifth paradigm platform. With the help of the knowledge-centric module, the predicted materials described in Fig. 3(c) are further exploited: materials whose predicted adsorption energies do not match the ‘‘volcano plot” are discarded, and only those that match the ‘‘volcano plot” are further quantified by DFT calculation. In the next cycle, the exploited candidates are calculated again by DFT, and the dataset is enlarged through exploration. As the variety of materials calculated by DFT increases, the size of the dataset also increases. This automated exploration and exploitation process keeps the set of fingerprints constantly updated.

《2.3. The theoretical model for both DFT calculation and machine learning》

2.3. The theoretical model for both DFT calculation and machine learning

In the fifth paradigm platform, the Kohn–Sham theory and a method that integrates the CNN and Gaussian process (GP) [31,35–37] are the core theoretical models for both DFT and machine learning processes. Thus, we briefly introduce the details of these theoretical models.

2.3.1. The theoretical model for DFT calculation

In the process of numerical calculation, namely the DFT calculation, computing the adsorption energy mainly involves optimizing each slab by continuously adjusting the atomic and electronic structure to reach the most energy-stable structural state. This can be achieved by approximately solving the quantum mechanical many-body Schrödinger equation, and solving the Kohn–Sham equations within DFT is one of the main methods for obtaining such an approximate solution.

The Kohn–Sham energy functional is

$$E[n(\mathbf{r})] = T[n(\mathbf{r})] + \int v(\mathbf{r})\, n(\mathbf{r})\, \mathrm{d}\mathbf{r} + E_{\mathrm{xc}}[n(\mathbf{r})] + \frac{1}{2}\iint \frac{n(\mathbf{r})\, n(\mathbf{r}')}{\left|\mathbf{r}-\mathbf{r}'\right|}\, \mathrm{d}\mathbf{r}\, \mathrm{d}\mathbf{r}' \quad (1)$$

where the kinetic energy term is $T[n(\mathbf{r})] = -\frac{\hbar^{2}}{2m}\sum_{i=1}^{K}\int \psi_{i}^{*}(\mathbf{r})\, \nabla^{2}\psi_{i}(\mathbf{r})\, \mathrm{d}\mathbf{r}$.

Given a system that contains K ions, namely K occupied orbitals in the three-dimensional coordinate space $\mathbf{r}$, $\psi_{i}(\mathbf{r})$ refers to the wave function of ion i at coordinate $\mathbf{r}$, and $\psi_{i}^{*}(\mathbf{r})$ is its conjugate wave function. $n(\mathbf{r})$ is the local electron density, namely the probability of finding an electron at $\mathbf{r}$ within ion i. $E[n(\mathbf{r})]$ is the energy of the total system, $\hbar$ is the reduced Planck constant, and m is the particle's mass. $\varepsilon_{\mathrm{xc}}(n(\mathbf{r}))$ is the exchange–correlation energy of a homogeneous electron gas with the local electron density $n(\mathbf{r})$, and $E_{\mathrm{xc}}[n(\mathbf{r})]$ refers to the exchange and correlation energies; for example, the local density approximation, one of the exchange–correlation functionals, takes only the uniform electron gas density as a variable, whereas the generalized gradient approximation considers both the electron density and its gradient as variables. $v(\mathbf{r})$ is the potential energy of ion i at position $\mathbf{r}$. Hence, the first item $T[n(\mathbf{r})]$ in Eq. (1) is the kinetic energy, the second item $\int v(\mathbf{r})\, n(\mathbf{r})\, \mathrm{d}\mathbf{r}$ is the external potential energy, and the last item in Eq. (1) is the Hartree energy (electron–electron repulsion), where $\mathbf{r}'$ is the coordinate perturbation relative to $\mathbf{r}$ and $\mathbf{r}$ represents the position vector. $\nabla$ is the vector differential operator, and $\nabla^{2}$ is the Laplacian for the coordinate derivatives.

A self-consistent iterative procedure is described as follows.

Given an initial electron density $n(\mathbf{r})$ obtained from all occupied orbitals of an arbitrary wave function,

$$n(\mathbf{r}) = \sum_{i}^{\mathrm{occ.}} \left|\psi_{i}(\mathbf{r})\right|^{2}$$

where occ. refers to the number of occupied orbitals, the Kohn–Sham eigenvalue problem is then solved:

$$\hat{H}\,\psi_{i}(\mathbf{r}) = \varepsilon_{i}\,\psi_{i}(\mathbf{r})$$

where $\hat{H}$ refers to the Hamiltonian acting on the wave function $\psi_{i}(\mathbf{r})$, whose energy is represented by $\varepsilon_{i}$. A new electron density can then be obtained by

$$n_{\mathrm{new}}(\mathbf{r}) = \sum_{i}^{\mathrm{occ.}} \left|\psi_{i}(\mathbf{r})\right|^{2}$$

The iterative procedure terminates when the required convergence criterion is reached, and Ead can then be calculated from the energy difference between Eadsorbate_slab and Ebare_slab + Eadsorbates.
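The structure of this self-consistency loop can be illustrated with a deliberately simplified one-dimensional toy model; the grid, potentials, and mixing factor below are arbitrary, and the sketch is not representative of VASP's plane-wave implementation.

```python
# A schematic self-consistency loop on a 1D grid (toy mean-field potential,
# atomic-like units); it only illustrates the iterate-solve-mix-converge
# structure described above, not a production DFT code.
import numpy as np

n_grid, n_occ, dx = 200, 2, 0.1
x = np.arange(n_grid) * dx
v_ext = 0.5 * (x - x.mean())**2                   # harmonic external potential

# Second-order finite-difference kinetic operator: -(1/2) d^2/dx^2
lap = (np.diag(np.full(n_grid, -2.0))
       + np.diag(np.ones(n_grid - 1), 1)
       + np.diag(np.ones(n_grid - 1), -1)) / dx**2
kinetic = -0.5 * lap

density = np.full(n_grid, n_occ / (n_grid * dx))  # initial guess for n(r)
for iteration in range(100):
    v_eff = v_ext + 1.0 * density                 # toy density-dependent potential
    hamiltonian = kinetic + np.diag(v_eff)
    energies, orbitals = np.linalg.eigh(hamiltonian)

    # New density from the lowest occupied orbitals, normalized on the grid
    psi = orbitals[:, :n_occ] / np.sqrt(dx)
    new_density = (np.abs(psi)**2).sum(axis=1)

    if np.abs(new_density - density).max() < 1e-6:  # convergence criterion
        break
    density = 0.7 * density + 0.3 * new_density     # linear mixing for stability

print(f"converged after {iteration} iterations; lowest eigenvalue = {energies[0]:.4f}")
```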

2.3.2. The theoretical model for machine learning

The convolution-fed Gaussian process (CFGP) [37] is a method in which the pooled outputs of the convolutional layers of the network are used to supply features to a GP regressor [38], which is then trained to produce both mean and uncertainty predictions of the adsorption energies. The CNN was applied by Chen et al. [39] and Xie and Grossman [40] on top of a graph representation of bulk crystals to predict various properties, and was further modified by Back et al. [31] to collect neighbor information using Voronoi polyhedra [32] for application in predicting binding energies (for example, adsorption energies) on heterogeneous catalyst surfaces. In the CFGP method, a complete CNN is first trained to fix the final network weights. All the pooled outputs of the convolutional layers are then used as features in a new GP, which is trained to use these features to produce both mean and uncertainty predictions of the adsorption energies.

In the CFGP method, the crystal structure is represented by a crystal graph $G$, in which the nodes encode the atoms (carrying the atomic-feature and neighbor-feature information) and the edges encode the connections between atoms in the crystal; a CNN is then constructed on top of this undirected multigraph [40]. Owing to the periodicity of crystal graphs, multiple edges are allowed between the same pair of end nodes. Each node is indexed by i and represented by a feature vector $\mathbf{v}_i$; similarly, each edge $(i,j)_k$ is represented by a feature vector $\mathbf{u}_{(i,j)_k}$, which corresponds to the kth bond connecting atom i and atom j. Considering the differences in the interactions between each atom and its neighbors, the convolutional layers first combine the atom and bond features by

$$\mathbf{z}_{(i,j)_k}^{(t)} = \mathbf{v}_i^{(t)} \oplus \mathbf{v}_j^{(t)} \oplus \mathbf{u}_{(i,j)_k}$$

where $\mathbf{z}_{(i,j)_k}^{(t)}$ is the combined feature of atom i and atom j connected by the kth bond in crystal graph $G$, and $\oplus$ denotes the concatenation of atom and bond features. A nonlinear graph convolution function is then defined as follows:

$$\mathbf{v}_i^{(t+1)} = \mathbf{v}_i^{(t)} + \sum_{j,k} \sigma\!\left(\mathbf{z}_{(i,j)_k}^{(t)}\mathbf{W}_{\mathrm{f}}^{(t)} + \mathbf{b}_{\mathrm{f}}^{(t)}\right) \odot g\!\left(\mathbf{z}_{(i,j)_k}^{(t)}\mathbf{W}_{\mathrm{s}}^{(t)} + \mathbf{b}_{\mathrm{s}}^{(t)}\right)$$

where $\odot$ denotes element-wise multiplication, $\sigma$ is a sigmoid function, and $g$ is a nonlinear activation function (for example, Leaky ReLU or Softplus); $\mathbf{W}$ and $\mathbf{b}$ denote the weights and biases of the neural network, respectively. The $\sigma(\cdot)$ term acts as a learned weight matrix that differentiates the interactions between neighbors; the subscripts f and s are abbreviations of ‘‘first” and ‘‘self,” respectively. After R convolutional layers, the resulting vectors are fully connected via K hidden layers, followed by a linear transformation to scalar values. Distance filters built from the connection distances are then applied to exclude the contributions of atoms that are too far from the adsorbates. A mean pooling layer is then used to produce an overall feature vector, which can be represented by a pooling function:

$$\mathbf{v}_{c} = \mathrm{Pool}\!\left(\mathbf{v}_{0}^{(R)}, \mathbf{v}_{1}^{(R)}, \ldots, \mathbf{v}_{N}^{(R)}\right)$$
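A compact PyTorch sketch of this gated convolution is given below; the layer names, feature dimensions, and toy inputs are illustrative assumptions rather than the exact CGCNN or GASpy implementation.

```python
# A minimal PyTorch sketch of the CGCNN-style gated graph convolution above;
# shapes and names are illustrative, not the published implementation.
import torch
import torch.nn as nn

class GatedGraphConv(nn.Module):
    def __init__(self, atom_dim: int, bond_dim: int):
        super().__init__()
        z_dim = 2 * atom_dim + bond_dim              # v_i (+) v_j (+) u_(i,j)k
        self.filter_fc = nn.Linear(z_dim, atom_dim)  # "first" branch (sigmoid gate)
        self.self_fc = nn.Linear(z_dim, atom_dim)    # "self" branch (Softplus)
        self.gate = nn.Sigmoid()
        self.activation = nn.Softplus()

    def forward(self, v, u, edge_index):
        # v: (n_atoms, atom_dim); u: (n_edges, bond_dim)
        # edge_index: (n_edges, 2) integer pairs (i, j), one row per bond
        i, j = edge_index[:, 0], edge_index[:, 1]
        z = torch.cat([v[i], v[j], u], dim=1)        # concatenation step
        messages = self.gate(self.filter_fc(z)) * self.activation(self.self_fc(z))
        # Sum the gated messages back onto atom i, keeping the residual v_i term
        v_new = v.clone()
        v_new.index_add_(0, i, messages)
        return v_new

# Toy usage: 4 atoms, 5 bonds, 16-dim atom features, 8-dim bond features
conv = GatedGraphConv(16, 8)
v = torch.randn(4, 16)
u = torch.randn(5, 8)
edges = torch.tensor([[0, 1], [1, 0], [1, 2], [2, 3], [3, 2]])
print(conv(v, u, edges).shape)  # torch.Size([4, 16])
```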

The training is performed with a cost function $J\!\left(E_{\mathrm{ad}}, f(C; \mathbf{W})\right)$ that measures the difference between the DFT-calculated and predicted adsorption energies; the whole process thus produces a function $f$, parametrized by the weights $\mathbf{W}$, that maps a crystal $C$ to the target property $E_{\mathrm{ad}}$. Using backpropagation and stochastic gradient descent (SGD), the following optimization problem can be solved by iteratively updating the weights with the DFT-calculated data:

$$\min_{\mathbf{W}} J\!\left(E_{\mathrm{ad}}, f(C; \mathbf{W})\right)$$

Here, the pooled outputs of the penultimate layer and the corresponding learned weights $\mathbf{W}$, rather than the target property $E_{\mathrm{ad}}$ itself, are further extracted as features for the GP. Hence, the descriptor for the nodes is $\mathbf{v}$, and the GP is trained with the corresponding energies ($E_{\mathrm{ad}}$). The prediction function is

$$f(\mathbf{v}) \sim \mathcal{GP}\!\left(m(\mathbf{v}),\, k(\mathbf{v}, \mathbf{v}')\right)$$

where $m(\mathbf{v})$ is the constant mean of the prior function, $k(\mathbf{v}, \mathbf{v}')$ is the Matérn kernel with the length scale trained by the maximum likelihood estimation method, and $\mathbf{v}$ and $\mathbf{v}'$ refer to different feature vectors. All training and predictions were done with Tesla P100-PCIE GPU acceleration as implemented in GPyTorch [41].
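The GP stage can be sketched with GPyTorch as follows, using a constant prior mean and a Matérn kernel over pooled CNN features (here replaced by random stand-ins); the kernel smoothness and training schedule are assumptions.

```python
# A minimal GPyTorch sketch of the GP regressor stage of CFGP: constant prior
# mean plus a Matern kernel over pooled CNN features; hyperparameters are illustrative.
import torch
import gpytorch

class FeatureGP(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5))

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

# Pooled CNN features (toy random stand-ins) and their DFT adsorption energies
features = torch.randn(100, 64)
energies = torch.randn(100)

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = FeatureGP(features, energies, likelihood)

model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
for _ in range(50):                      # maximum likelihood training of mean/kernel
    optimizer.zero_grad()
    loss = -mll(model(features), energies)
    loss.backward()
    optimizer.step()

model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.randn(5, 64)))
    print(pred.mean, pred.stddev)        # mean and uncertainty predictions
```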

《2.4. Iteration between machine learning and numerical calculations》

2.4. Iteration between machine learning and numerical calculations

The intelligence-driven, knowledge-centric nature of the fifth paradigm platform is well depicted by the iterations between machine learning and numerical calculation, concatenated by the interdisciplinary knowledge of the ‘‘volcano plot.” This breaks through the bottleneck of manually screening new materials between machine learning and numerical calculation and realizes the mutual promotion of scientific experiments and AI, as shown in Fig. 4(a). The experiments involve fetching the primitive crystals (or primitive cells) from the Materials Project website and storing them in the database, together with the information of the ‘‘volcano plot.” Then, the models are automatically constructed to create a batch of adsorption energy calculation models. Through numerical calculation (i.e., ab initio DFT calculation), the optimized models and adsorption energy data are stored in the database, and fingerprints are extracted from them to train a suitable machine-learning model. The trained model then uses the fingerprints extracted from bulk materials that have not yet been theoretically calculated to predict their adsorption energies, which are stored in the database again. The adsorption energy predictions are intelligently analyzed through the ‘‘volcano plot” to screen the models that require further DFT calculations. The entire loop is therefore ①②③④⑤⑥⑦⑧⑨⑩, ④⑤⑥⑦⑧⑨⑩, ..., ④⑤⑥⑦⑧⑨⑩.

The cycle stops only when all the materials delivered in the framework are calculated in the machine learning or DFT processes. The characteristics of the fifth paradigm platform are well reflected in these steps. The step ⑤ indicates that the dataset obtained by numerical calculation supplements the problem of no dataset and fewer datasets in the machine-learning process. The step ⑩ indicates that the bulk of numerical calculations can be abandoned with the help of machine-learning prediction and the ‘‘volcano plot” to accelerate the entire DFT calculation. Moreover, the results of machine learning can be intelligently analyzed through the ‘‘volcano plot” that integrates the knowledge of experimental and theoretical scientists (the synergy of interdisciplinary experts), forming a knowledge-centric fifth paradigm driven by intelligence.
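The loop can be summarized in Python as follows; all functions passed in (run_dft, train_model, featurize) are hypothetical stand-ins for the platform's components, and the near-optimal window used here is the HER example discussed later in Section 3.

```python
# A schematic sketch of the repeated iterations (steps 4-10): train on the DFT
# database, predict untested candidates, screen with the volcano-plot window,
# and send the survivors back to DFT. All callables are hypothetical stand-ins.
NEAR_OPTIMAL = (-0.37, -0.17)  # eV window around the HER optimum of -0.27 eV

def active_learning_loop(candidates, dft_database, run_dft, train_model, featurize):
    while candidates:
        # Steps 4-5: the DFT results accumulated so far supply the training set
        model = train_model(dft_database)

        # Steps 6-8: predict adsorption energies for materials not yet calculated
        predictions = {m: model.predict(featurize(m)) for m in candidates}

        # Steps 9-10: keep only predictions that land in the near-optimal window
        survivors = [m for m, e_ad in predictions.items()
                     if NEAR_OPTIMAL[0] <= e_ad <= NEAR_OPTIMAL[1]]
        if not survivors:
            break

        # The screened candidates go back to DFT, enlarging the training data
        for material in survivors:
            dft_database[material] = run_dft(material)
            candidates.remove(material)
    return dft_database
```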

《2.5. Information science tools》

2.5. Information science tools

The framework of the fifth paradigm is built using various Python packages, for example, Python Materials Genomics (pymatgen), the atomic simulation environment (ASE), FireWorks, Luigi, and MongoDB [42–45]. Pymatgen is a powerful Python package for high-throughput materials calculations: it standardizes the initialization settings required before running high-throughput calculations and provides analysis of the data generated by the calculations. The ASE aims to set up, steer, and analyze atomistic simulations. The function of FireWorks is to manage jobs in high-throughput computing workflows running on high-performance computing clusters. Luigi can be used to build complex batch-job pipelines, handle dependency resolution, and conduct workflow management. MongoDB is written in C++, is used for real-time data storage, and stores documents compatible with the JavaScript Object Notation (JSON) data-exchange format.

As shown in Fig. 4(b), the data-intensive DFT calculations can be performed on the Tianhe-1 supercomputer using Lustre as the file-storage system [46]. High-throughput computing jobs are realized by running the security-monitoring system deployed on the cluster. Luigi is used to build the various physical models through dependency resolution (function dependencies, running, and output targets), which are then configured and calculated through FireWorks task management and batch-processed through the Slurm resource manager on the supercomputer [47]. These two task-management systems can automatically correct errors, re-run a single job, and simultaneously visualize the data through the installed visualization tools.
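As a small illustration of Luigi-style dependency resolution (not GASpy's actual Luigi tasks), the sketch below chains a bulk-relaxation task and a slab-generation task; the file targets and payloads are placeholders.

```python
# A small Luigi sketch of dependency resolution between two workflow steps
# (relax a bulk, then generate slabs from it); task names and targets are illustrative.
import json
import luigi

class RelaxBulk(luigi.Task):
    mpid = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"bulk_{self.mpid}.json")

    def run(self):
        # Placeholder for submitting a VASP relaxation through FireWorks
        result = {"mpid": self.mpid, "status": "COMPLETED", "energy": -3.74}
        with self.output().open("w") as f:
            json.dump(result, f)

class GenerateSlabs(luigi.Task):
    mpid = luigi.Parameter()

    def requires(self):
        return RelaxBulk(mpid=self.mpid)   # Luigi runs RelaxBulk first

    def output(self):
        return luigi.LocalTarget(f"slabs_{self.mpid}.json")

    def run(self):
        with self.input().open() as f:
            bulk = json.load(f)
        # Placeholder for slab cutting; here we just record the parent bulk
        with self.output().open("w") as f:
            json.dump({"parent": bulk["mpid"], "slabs": ["(1,1,1)", "(1,0,0)"]}, f)

if __name__ == "__main__":
    luigi.build([GenerateSlabs(mpid="mp-30")], local_scheduler=True)
```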

《Fig. 4》

Fig. 4. The architecture of this fifth paradigm case. (a) The repeated iteration framework includes machine learning and numerical computing in the fifth paradigm platform. Steps ① and ② and steps ⑦ and ⑧ pull the experimental results and machine-learning results into and out of the database, respectively. Step ③ shows the constructed model prepared for ab initio calculation. Step ④ refers to the storage of the calculation results, and steps ⑤ and ⑥ are the fingerprints extracted from the calculated results and experimental results, respectively. Step ⑨ refers to the online analysis of machine-learning results through the ‘‘volcano plot.” Step ⑩ shows the remaining models after online analysis (screening), which require further numerical calculations. (b) The realization of services and functions is based on the fifth paradigm platform on the Tianhe-1 supercomputer. Typical components dedicated to services in GASpy include the storage server, FireWorks server, and Luigi server. The basic environment of the supercomputing system is shown at the software level.

《3. Performance evaluation》

3. Performance evaluation

To illustrate the performance of the fifth paradigm platform in catalytic materials screening, we conducted a comparison test to show how the machine-learning process accelerates numerical calculations and how the numerical calculations provide trainable samples for machine-learning iterations. In this article, we do not use a dataset that is updated online by DFT calculations during the learning cycle of each model; instead, we use the already DFT-calculated dataset to extract the corresponding fingerprints for this study. Because the target prediction is not directly related to the DFT-relaxed structure but rather to the fingerprint extracted from the initial structure without any simulation, we believe that this choice does not affect the evaluation of the platform.

The dataset we prepared to test the cross-validation process comes from GitHub. It comprises five adsorbates (H, CO, OH, O, and N), with the bulk of the data coming from the first two (21 269 and 18 437 entries, respectively). The CFGP method is used to create models in order to compare the impact of different machine-learning models and dataset sizes on the accuracy of catalyst screening, using the performance metrics of the correlation coefficient (R²), the mean absolute error (MAE), and the root-mean-square error (RMSE). The hyperparameters for this dataset were tuned by Back et al. [31] and Tran et al. [37]; because the research in this paper focuses on the performance of different models under the same method, these hyperparameters remain applicable. In our work, the statement of the learning problem is determined by the well-known ‘‘volcano plot,” which evaluates the magnitude of the adsorption energy and the corresponding activity level. Taking the H adsorbate as an example, the hydrogen evolution reaction (HER) uses adsorption energies to predict catalytic performance. The optimal adsorption energy is −0.27 eV [48], and the near-optimal range of the ‘‘volcano plot” is defined as [−0.37 eV, −0.17 eV]. Therefore, if a result in a given cycle falls within this near-optimal range (which can also be defined as a hit in the near-optimal range), it is selected as a candidate to continue the DFT calculations before the start of the next cycle.

https://github.com/ulissigroup/uncertainty_benchmarking
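The evaluation described above can be sketched as follows; the array names and scikit-learn calls are illustrative, and the random numbers merely stand in for DFT and predicted adsorption energies.

```python
# A sketch of the evaluation: R2, MAE, and RMSE on a held-out set, plus counting
# "hits" inside the near-optimal HER window; not the paper's exact scripts.
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

NEAR_OPTIMAL = (-0.37, -0.17)  # eV, centered on the -0.27 eV HER optimum

def evaluate(e_dft: np.ndarray, e_pred: np.ndarray) -> dict:
    rmse = float(np.sqrt(np.mean((e_dft - e_pred) ** 2)))
    hits_dft = int(np.sum((e_dft >= NEAR_OPTIMAL[0]) & (e_dft <= NEAR_OPTIMAL[1])))
    hits_ml = int(np.sum((e_pred >= NEAR_OPTIMAL[0]) & (e_pred <= NEAR_OPTIMAL[1])))
    return {
        "R2": r2_score(e_dft, e_pred),
        "MAE": mean_absolute_error(e_dft, e_pred),
        "RMSE": rmse,
        "N_DFT": hits_dft,   # DFT-verified hits in the near-optimal range
        "N_ML": hits_ml,     # machine-learning-predicted hits
    }

# Toy usage with random numbers standing in for adsorption energies (eV)
rng = np.random.default_rng(0)
e_dft = rng.normal(-0.3, 0.4, size=1000)
print(evaluate(e_dft, e_dft + rng.normal(0, 0.15, size=1000)))
```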

One realization of the mutual feedback between machine learning and numerical calculation is that the trainable samples provided by DFT calculations can supplement the machine-learning iterations. In this platform, once an iteration occurs, the dataset containing the target features is determined, which means that the machine-learning model for the corresponding iteration is determined. In addition, as a typical case of the fifth paradigm platform, the performance comparison of each iterative process is derived from model comparisons under the same data-generation conditions. As shown in Table 1, the entire dataset is first randomly shuffled; 10% of the total dataset is taken as the dataset of the first model, and the fraction is then increased in increments until 100% of the total dataset forms the tenth model, yielding the datasets of the ten models. The dataset of each model is encompassed in the dataset of the next model. For the cross-validation process, the train/validate/test ratio of each model is 64/16/20, and all the monometallic slabs are added to the training set, as described by Tran et al. [37]. The cross-validation and its results are listed in Table 1 and Fig. 5. Each violin in Fig. 5(a) reflects the R² of the training and testing samples: the greater the difference between the two values, the more slender the violin; otherwise, it is stubby, and if the two are identical, it collapses to a line. Therefore, the slender violins of models 1, 2, 5, 6, and 9 indicate overfitting or underfitting, followed by models 3, 4, 7, and 10, with model 8 performing best. As the dataset increases, the MAE and RMSE in Table 1 gradually decrease, while the R² of the validation and testing process in Fig. 5(a) gradually increases, indicating that each training model is more accurate than the previous ones. In addition, the hit numbers of the H adsorbate verified by DFT calculation (NDFT) and predicted by machine learning (NML) are also listed; their trend also increases with the expansion of the dataset, as shown in Fig. S1 (in Appendix A). The hit dataset of model 1 is set as the baseline. To quantify the performance gain from the increasing trainable samples provided by numerical calculation to the machine-learning iterations, a formula is defined as follows:

where η represents the increment of NDFT compared with NML, and Dn and Mn refer to the NDFT and NML of model n in the near-optimal range (namely, the hit numbers). With the expansion of the dataset, η becomes larger and approaches 1, indicating that the hit number NML slowly approaches the hit number NDFT; that is, the larger the numerically calculated training sample, the higher the accuracy of the machine-learning model. Furthermore, η fits well linearly in Fig. 5(b), even though some points lie outside the linear range. For example, the η of model 4 is very small compared with the other points, which we attribute to compensation by the larger values of models 5 and 6.
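A sketch of how the ten nested datasets and the 64/16/20 splits could be constructed is given below; the index handling is an assumption, not the authors' script.

```python
# A sketch of building ten nested model datasets (10%, 20%, ..., 100% of the
# shuffled data) with a 64/16/20 train/validate/test split inside each.
import numpy as np

def build_nested_models(n_samples: int, n_models: int = 10, seed: int = 0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)          # shuffle once, reuse for all models
    models = []
    for k in range(1, n_models + 1):
        subset = order[: int(n_samples * k / n_models)]  # model k uses the first k*10%
        n = len(subset)
        train = subset[: int(0.64 * n)]                  # 64% train
        validate = subset[int(0.64 * n): int(0.80 * n)]  # 16% validate
        test = subset[int(0.80 * n):]                    # 20% test
        models.append({"train": train, "validate": validate, "test": test})
    return models

models = build_nested_models(39706)  # e.g., the combined H + CO entries
print([len(m["train"]) for m in models])
```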

《Table 1》

Table 1 Ten models constructed from the entire dataset to evaluate the performance of the fifth paradigm platform.

The datasets with multiple adsorbates, including the H adsorbate, are used for training/validation/testing, and MAE and RMSE are used to evaluate the performance of the machine-learning models. NDFT and NML represent the numbers of surfaces whose low-coverage H adsorption energies fall in the near-optimal activity range of the ‘‘volcano plot,” as verified by DFT calculation and machine-learning prediction, respectively. η is used to evaluate the trend of the change in model performance.

《Fig. 5》

Fig. 5. Performance metrics evaluation of the learning model in the fifth paradigm platform. (a) The R2 correlation coefficient of the validation and testing process in the ten models; (b) the linear fit of η for all the models.


To illustrate the realization of mutual feedback between machine learning and numerical calculation (e.g., machine learning solves the time-consuming problem of massive models caused by insufficient computing resources in numerical calculations, and the numerical calculation process provides machine-learning training samples), we prepared three types of prediction cases to understand the performance of the model trained and validated as described above. The dataset that we used in the prediction process is from the work of Tran and Ulissi [28], which encompasses 22 675 DFT results for the H adsorbate. Admittedly, it covers most of the 21 269 H entries mentioned above. However, this overlap does not matter, because our goal is to compare the performance of machine-learning models generated from samples of different sizes and to examine the acceleration behavior of machine learning for prediction samples of different sizes. Moreover, the material structures corresponding to the dataset to be predicted do not depend on whether simulation calculations have already been performed. Therefore, taking this machine-learning prediction dataset from the DFT-calculated dataset does not affect the overall evaluation of the intelligence-driven process.

In terms of the characteristics of the platform, the DFT calculations performed in each cycle (except the first) are obtained from the machine-learning results. Three types of prediction methods are designed in Table 2: Hit_no_split, No_hit_with_split, and No_hit_no_split. The No_hit_with_split method uses incremental prediction datasets, from 10% to 100% of the total, corresponding to machine-learning models 1 to 10 formed above. Alternatively, the entire prediction dataset can be kept the same in each cycle, as defined by the No_hit_no_split method. In the Hit_no_split method, the materials predicted by machine learning to lie in the near-optimal range are discarded from the next model's prediction. The process is as follows: starting from model 1, 4960 of the 22 675 predicted models are found to be hits by machine learning. When model 2 makes its predictions, these 4960 models are removed from the 22 675, leaving 17 715 (22 675 − 4960 = 17 715) models. Model 2 then finds 860 hits and provides a further reduced sample of 16 855 (17 715 − 860 = 16 855) models for model 3. This hit-and-drop process continues until the predictions of all ten models are completed. Note that NHits should equal NML, except that certain materials in the samples must be excluded from the near-optimal activity process.
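The Hit_no_split bookkeeping can be sketched as follows; the predictor objects and fingerprint pool are hypothetical stand-ins, and only the remove-hits-before-the-next-model logic mirrors the description above.

```python
# A sketch of the Hit_no_split bookkeeping: each model's predicted hits are
# removed from the pool before the next model predicts (22 675 -> 17 715 ->
# 16 855 in the text); the predictor objects here are hypothetical stand-ins.
NEAR_OPTIMAL = (-0.37, -0.17)  # eV

def hit_no_split(models, pool):
    """models: trained predictors with a .predict(fingerprint) method (assumed);
    pool: dict mapping material id -> fingerprint for all candidates."""
    remaining = dict(pool)
    history = []
    for k, model in enumerate(models, start=1):
        hits = [mid for mid, fp in remaining.items()
                if NEAR_OPTIMAL[0] <= model.predict(fp) <= NEAR_OPTIMAL[1]]
        for mid in hits:               # hit materials are not predicted again
            remaining.pop(mid)
        history.append({"model": k, "n_hits": len(hits),
                        "n_remaining": len(remaining)})
    return history
```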

Table 2 lists the results of the three methods in the near-optimal range. In the Hit_no_split method, because the NML predicted by the previous model is deducted from the prediction samples of the next model (except for model 1), NDFT, NML, and NHits decrease accordingly from model 1 to model 10. In the No_hit_with_split method, as the prediction samples increase, NDFT and NML expand gradually. In the No_hit_no_split method, NML fluctuates between 4177 and 4556, while NDFT remains unchanged; we infer that this is caused by the differing accuracies of the machine-learning models. Meanwhile, the larger a model's dataset, the more NML hits there are. From an acceleration point of view, the Hit_no_split method ensures that reasonably predicted samples are not predicted again (provided, of course, that the predictions are reasonable), whereas the other two methods involve repeated predictions of already predicted samples. Therefore, ideally, the Hit_no_split method should make optimal use of all samples that must be predicted, accelerating the predictions and providing a faster machine-learning process for accelerating numerical calculations.

《Table 2 》

Table 2 Three types of prediction methods in the near-optimal range and their performance for all models constructed in the fifth paradigm platform.

The dataset column refers to the total number of prediction samples for each model. NHits is the number of machine-learning predicted hits before certain materials are excluded.

To evaluate the differences among these methods in accelerating DFT calculations, we compared the number of NDFT replaced by NML, as well as the value of NML/NDFT, in Fig. 6. The number of DFT calculations replaced by machine learning is defined as follows:

where RE is the number of DFT calculations replaced by machine learning and Tn is the total prediction dataset of model n. As shown in Fig. 6(a), the replacement amount of all models of Hit_no_split is more than 15 000; it decreases slightly from model 1 to model 10 but, compared with the other methods, provides the largest NDFT replacement. For the No_hit_with_split method, the number of replacements increases roughly linearly from 1800 up to the same value as the other methods at model 10. For the No_hit_no_split method, except for model 1, the number of replacements for all models is approximately 14 000, with a slight downward trend. As for the large number of replacements of model 1 in the No_hit_no_split method and the subsequent sudden decrease, we believe this is caused by underfitting, because model 1 uses a small fraction of the dataset to train a model that must predict an ever-larger dataset. Among these methods, Hit_no_split replaces the maximum NDFT, as we expected.

The reason we also compare the value of NML/NDFT in Fig. 6 is that it reflects the performance of each model from another perspective; ideally, NML/NDFT should equal 1. In the No_hit_with_split and No_hit_no_split methods, NML/NDFT increases slightly toward 1, which indicates that the prediction behavior of the two methods is similar and suitable for accelerating DFT calculations. In the Hit_no_split method, except for model 1, which is set as the baseline, the NML/NDFT value gradually decreases from model 2 to model 7 and then gradually increases for the remaining models, always staying below 0.5. On one hand, we infer that these smaller values are caused by changes in the accuracy of the machine-learning model, since smaller datasets lead to underfitting. On the other hand, because the hits of the previous model are removed from the prediction samples in each step, the NML that can still be hit in the next model gradually decreases. Since the Hit_no_split method does not allow hit materials to be hit again in later iterations, its advantage in terms of speed is more obvious.

《Fig. 6》

Fig. 6. The predictive performance of all models constructed in the fifth paradigm platform. (a) The number of DFT calculations (NDFT) replaced by the number of machine-learning predictions (NML); (b) the change of NML/NDFT in the near-optimal range for different models within the prediction process. In the Hit_no_split method, model 1 is excluded because it serves as the baseline for the other models.

In addition, since the machine-learning model only gradually overcomes poor fitting as the cross-validation samples expand from small sizes, there is a certain degree of accuracy loss in the prediction process from model 2 to model 10. For example, a sample that should have been hit may be missed, or one that should not be hit may be hit, leading to missing hit data or extra non-hit data in the dataset of the next model. Moreover, the sample size may not be large enough, resulting in underfitting or overfitting of the machine-learning model. Therefore, the Hit_no_split method has the advantage of replacing more DFT calculations, although its accuracy is not suitably evaluated by the NML/NDFT indicator. However, this by no means indicates that the Hit_no_split method is not applicable to the fifth paradigm platform. We infer that, when the prediction model is good enough and the dataset is large enough, it can reduce the repeated prediction of data while maintaining the reliability of the results, thereby amplifying the ability of machine learning to accelerate numerical calculations.

Based on the results of the three types of methods, the accuracy loss of machine learning prediction relative to DFT calculation is used to evaluate the performance in the fifth paradigm platform. The accuracy loss can be defined as follows: 

where L is the accuracy loss. Given that the No_hit_with_split and No_hit_no_split methods have relatively suitable predictive performance, we only consider the accuracy loss of these two methods. As shown in Fig. 7, for No_hit_with_split, although model 1 has the lowest accuracy loss, its dataset is small; excluding it, model 9 has the lowest accuracy loss. For the No_hit_no_split method, model 5 has the lowest accuracy loss. Therefore, we believe that, as the dataset expands, machine learning will continue to replace DFT calculations with varying degrees of accuracy loss, and the point of smallest accuracy loss is most conducive to this type of machine-learning acceleration of the DFT calculation process.

《Fig. 7》

Fig. 7. The accuracy of the fifth paradigm. The mutual verification among scientific experiment, theoretical calculation, and machine learning in the process of exploring the unknown world represents the accuracy of the fifth paradigm. The accuracy loss (L) between machine learning and DFT calculation is shown for the No_hit_no_split and No_hit_with_split methods across all models constructed in the fifth paradigm platform.

We believe that the accuracy loss of this fifth paradigm case is related to the size of the samples involved in machine learning, theoretical calculations, and the experiments fed back through the ‘‘volcano plot,” which is exactly the knowledge-centric characteristic of the fifth paradigm in terms of precision. As shown in Fig. 7, an accurate fifth paradigm should make machine learning, theoretical calculation, and scientific experiment converge on a unique result when exploring the unknown world. Although this standard is very demanding, it remains the ultimate goal of exploring the unknown world.

《4. Discussion of the fifth paradigm platform》

4. Discussion of the fifth paradigm platform

Automated model construction, automated fingerprint extraction, and the intelligent coupling of intensive data with DFT calculation and machine learning through the ‘‘volcano plot” compose the architecture of the fifth paradigm platform. In this intelligence-driven framework, the workload of traditional model construction and calculation is effectively reduced by making full use of the current development of various information tools and methods, greatly simplifying and improving the extremely cumbersome and challenging work in materials research.

One of the challenges this framework faces is the limited application areas implemented in the fifth paradigm. This is because the most typical feature of the fifth paradigm is intelligence-driven, which entails the synergy of interdisciplinary experts to carry out in-depth research. For example, in the materials science introduced in this work, it is necessary to intelligently drive the efficient synergy of experimental experts and theoretical experts, which can be achieved by filtering the machine-learning results through the ‘‘volcano plot.” For some high-throughput interdisciplinary work, before designing a similar fifth paradigm framework, it is best to first consider appropriate methods of quantifying the collaborative work between these experts in different application fields.

In addition, because ever-larger datasets are still lacking, the number of samples during dataset expansion is inevitably insufficient, resulting in poor generalization of the trained model. Therefore, more data must be accumulated to achieve a high-precision machine-learning process. Fortunately, for this fifth paradigm platform, the Open Catalyst project, jointly researched and developed by Facebook AI Research and the Department of Chemical Engineering of Carnegie Mellon University, has released the Open Catalyst 2020 dataset [49], which contains a dramatically larger number of DFT calculation results and is still being updated online. Finally, the accuracy of the fifth paradigm in exploring the unknown world is affected by machine learning, theoretical calculation, and scientific experiment. A high-precision fifth paradigm tends to explore the same objective phenomenon of the unknown world through the cooperation of the three, within the scope of reasonable discovery, derivation, and judgment. We believe that the dissection of this fifth paradigm case can greatly promote the development of the fifth paradigm of materials science in the future.

《5. Conclusions》

5. Conclusions

In this work, we discuss the scientific explanation of the newest paradigm emerging from the prosperity engendered by AI. A detailed discussion is then carried out using a fifth paradigm platform as a typical case, which conforms to a specific and well-defined framework capable of promoting the development of materials science. Interdisciplinary knowledge and intelligence-driven characteristics are the keys to the fifth paradigm, as addressed in the work encompassing automatic model construction and verification, automated fingerprint construction, the theoretical models, and the repeated iteration between machine learning and theoretical calculations. The informatics tools needed to architect the framework are also discussed in detail. Finally, tests and comparisons are conducted to show how, in the framework of this fifth paradigm case, AI and numerical calculation meaningfully promote each other, reducing the numerical calculations and creating more trainable samples in the mutual feedback process. The curation of the numerical calculation and machine-learning models, as well as of the techniques, makes the fifth paradigm platform more interpretable.

With the expansion of the dataset, on one hand, the more machine learning replaces the DFT calculation, the faster the screening of materials will be. On the other hand, the more consistent the number of candidate materials predicted by the final machine learning is with the number of candidate materials calculated by DFT, the more accurate the prediction by machine learning is. Under the conditions of satisfying these two judgments, machine learning will continue to replace DFT calculation with different degrees of accuracy loss, and the smallest accuracy loss model is most conducive to machine learning to accelerate the DFT calculation process. This minimum accuracy loss discrimination represents the precise exploration premise of materials research under the scientific fifth paradigm, which requires consistent results when machine learning, theoretical calculation, and scientific experiment are jointly exploring the unknown world.

Although this article provides a scientific explanation of the fifth paradigm platform as represented in the field of catalytic materials, it also acknowledges that much more needs to be discussed. The overall development of the fifth paradigm across various fields still faces challenges in terms of the synergy between interdisciplinary experts and the dramatic rise in demand for data in data-driven disciplines. Despite these challenges, an ongoing endeavor in tandem with all the relevant parties can be envisioned to deepen the combination of AI technology and traditional disciplines, so that each simulation and calculation link has higher intelligence and automation, and finally runs as a platform that improves the efficiency of traditional scientific computing and promotes the development of materials research in a more intelligent and high-precision direction. We believe that this glimpse of the fifth paradigm platform can pave the way for the application of the fifth paradigm in other fields.

《Acknowledgments》

Acknowledgments

We thank Prof. Zachary W. Ulissi and Prof. Pari Palizahti at Carnegie Mellon University for providing advice on the platform. This study was supported by the National Key Research and Development Program of China (2021ZD40303), the National Natural Science Foundation of China (62225205 and 92055213), Natural Science Foundation of Hunan Province of China (2021JJ10023) and Shenzhen Basic Research Project (Natural Science Foundation) (JCYJ20210324140002006).

《Compliance with ethics guidelines》

Compliance with ethics guidelines

Can Leng, Zhuo Tang, Yi-Ge Zhou, Zean Tian, Wei-Qing Huang, Jie Liu, Keqin Li, and Kenli Li declare that they have no conflict of interest or financial conflicts to disclose.

《Appendix A. Supplementary data》

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2022.06.027.