Information on the physicochemical properties of chemical species is an important prerequisite when performing tasks such as process design and product design. However, the lack of extensive data and high experimental costs hinder the development of prediction techniques for these properties. Moreover, accuracy and predictive capabilities still limit the scope and applicability of most property estimation methods. This paper proposes a new Gaussian process-based modeling framework that aims to manage a discrete and high-dimensional input space related to molecular structure representation with the group-contribution approach. A warping function is used to map discrete input into a continuous domain in order to adjust the correlation between different compounds. Prior selection techniques, including prior elicitation and prior predictive checking, are also applied during the building procedure to provide the model with more information from previous research findings. The framework is assessed using datasets of varying sizes for 20 pure component properties. For 18 out of the 20 pure component properties, the new models are found to give improved accuracy and predictive power in comparison with other published models, with and without machine learning.
Xinyu Cao, Ming Gong, Anjan Tula, Xi Chen, Rafiqul Gani, Venkat Venkatasubramanian.
An Improved Machine Learning Model for Pure Component Property Estimation.
Engineering, 2024, 39(8): 65-78. DOI: 10.1016/j.eng.2023.08.024
The ability to predict the behavior of a chemical substance mainly depends on knowledge of its physicochemical properties. Thus, data on pure component properties are an important prerequisite for efficiently and consistently performing process and product design tasks [1,2]. In process design, the conditions (temperature, pressure, etc.) at which the properties (solubility, vapor pressure, etc.) of the identified chemical compounds match the design objectives are determined; in product design, chemical compounds with desirable target properties (boiling point, critical pressure, etc.) are chosen [3]. However, while the lack of data on pure component properties is a critical limitation for modeling the required properties, experimental observations can be demanding, expensive, and sometimes not even feasible [4]. Thus, there is a need for accurate and robust property prediction models. Recent advances in computer software for statistical analysis, which can fill the gaps of incomplete measured property data [5], need to be employed in this endeavor. A class of pure component properties, namely the primary properties, is known to be related to the structure of the molecules [6]. This article focuses on the modeling of the primary pure component properties of organic chemicals using their structural molecular information.
There has been rapid development in pure compound property prediction research in the past few decades. Mathematical models, ranging from simple polynomial functions to very large sets of differential-algebraic systems, are employed to estimate the desired pure compound properties [3]. Traditional methods include the group contribution (GC) method [7], quantitative structure-property relationship (QSPR) modeling [8], ab initio quantum-mechanics-based methods [9], and so forth. The GC method is the most widely used for the estimation of primary pure compound properties: the property of a component is determined as a function of the contributions of the functional groups representing the molecule [10]. Its quick estimation without requiring substantial computational resources and its ease of incorporation into other models have accelerated the application of the GC method in fields such as toxicity prediction [11], melting-point estimation [12], and biomass conversion processing [13]. On the other hand, low prediction accuracy has limited the use of GC models, generally to instances where qualitative estimates of the properties suffice [7]. Consequently, different approaches have been proposed to improve the performance of GC models: higher-level descriptors can reveal more structural information through second-order functional groups [10,14], and models have been created to represent missing GCs [1,7,15].
A review of the recent literature shows an increasing trend in the use of machine learning-based models for property estimation [16] and in artificial intelligence (AI)-based techniques to identify potential molecular structures with promising properties [17]. Venkatasubramanian and Mann [18] used AI in reaction prediction and chemical synthesis. Also, in a recent perspective paper, Mann et al. [19] highlighted the use of property prediction in chemical product design in the era of AI. Machine learning-based methods, such as neural networks and random forest algorithms, play a major role in property estimation using different molecular descriptors. Many of them have significant advantages over traditional modeling techniques, including flexibility, accuracy, and execution speed [20]. Their practicality has been demonstrated in reducing the computational cost associated with quantum mechanics/molecular mechanics calculations [21], in novel QSPR methods [22], and so forth. For illustration, Zhou et al. [23] treated simplified molecular-input line-entry system (SMILES) notation as a sentence and used natural language processing technologies for molecular information mining and the exploration of chemical properties. Zhang et al. [24] developed an accurate and interpretable deep neural network (DNN) model for property prediction. Wen et al. [25] proposed a systematic methodology that combines multiple machine learning technologies to address crucial issues such as the applicability domain and prediction uncertainty in DNN-based QSPR modeling. Moreover, in the realm of functional group representation, many machine learning models have better estimation and predictive capabilities at the cost of greater computational resources, which has expanded the scope of GC-based applications. For illustration, Paduszyński and Domańska [26] employed a two-layer feed-forward artificial neural network based on GC-type molecular descriptors, which proved to be the best GC model for the viscosity of ionic liquids (ILs) described in the literature at that time. Li et al. [27] developed regression models to predict fuel ignition quality with 23 machine learning algorithms via the MATLAB regression learner module, achieving high precision. Among available machine learning models, it is noteworthy that the Gaussian process (GP) [28], which naturally provides confidence intervals, is a widely used method for property prediction. It posits a prior belief over the possible objective function and, during training, iteratively refines the model by updating the Bayesian posterior conditioned on the observed data [29]. Confidence intervals have facilitated the adoption of GPs in many fields, including safety-critical settings [30,31], uncertainty forecasting [32,33], and Bayesian optimization [34,35]. Another advantage of the GP is that fewer decisions (e.g., architecture and learning rate) need to be made before modeling [28]. Given these prominent features, Alshehri et al. [36] developed a next-generation suite of GP-based pure component property models for 25 properties of organic chemicals, which achieved higher accuracy than simple GC-based models.
Although practitioners frequently use various prediction methods, several challenges remain in developing property models. In most cases, the simple GC-based models result in less accurate performance, with an average error threshold of around 10% [36]. For machine learning methods, the extrapolative ability outside the training set is still limited, despite a smaller error threshold. Part of the reason lies in the information gap between the original compound and its mathematical representation by molecular descriptors. Furthermore, a rigid application of models without considering the data characteristics may lead to suboptimal prediction results. Although GP-based pure component property models have been developed with higher accuracy than traditional ones, certain techniques can further improve the GP model. The functional group input space is discrete and high-dimensional, and mature modeling methods exist that can exploit this structure. Machine learning models with better accuracy and extrapolative ability can be built by considering both the features of the input space and the findings of previous research.
This paper proposes a new framework for property prediction under the functional group representation, based on the GP. In particular, the high-dimensional and discrete input space is handled with a warping function, and prior information is elicited and checked under scrutiny. The rest of this paper is organized as follows. Section 2 provides some of the basic concepts related to the GC method, as well as the latest machine learning approaches. Section 3 introduces the full model structure, which is applied to pure component regression in Section 4. The contributions of the individual techniques (i.e., the warping function, prior elicitation, and prior predictive checking) are successively analyzed for pure component regression, and models built with other mainstream machine learning methods are also listed and compared. Section 5 highlights the main contributions of this paper and gives brief insights into future research.
2. Property modeling basics
2.1. The GC method
In GC-based models, a compound's property is estimated as a function of the contributions of different groups that represent the molecular structure. Molecular representation by GC provides the important advantage of quick estimates without requiring substantial computational resources [10]. Here, the GC representation of the compound methoxychlor is illustrated in Fig. 1. The molecular structure of methoxychlor is represented with a 424-dimensional vector, divided into three group orders. In this case, the second dimension with a value of "2" corresponds to the fact that the functional group methyl appears twice in methoxychlor. The values for the other dimensions are obtained similarly, resulting in a vector containing only integers. It should be noted that only the properties of simple and monofunctional molecules can be accurately predicted using first-order groups alone. This is because the first-order groups do not capture the proximity effect (they do not consider the interactions between different groups in a molecule). However, the higher (second and third) levels provide polyfunctional and structural groups that give more information about the molecular structure of more complex compounds. The atomic balance, however, is ensured through the first-order groups.
In the traditional GC framework, the property estimation model takes the form of Eq. (1):

$$\hat{y} = C_0 + \sum_{i=1}^{N_\mathrm{F}} n_i C_i + \sum_{j=1}^{N_\mathrm{S}} m_j D_j + \sum_{k=1}^{N_\mathrm{T}} o_k E_k \quad (1)$$

where $C_i$, $D_j$, and $E_k$ represent the first-, second-, and third-order contributions, respectively; $n_i$, $m_j$, and $o_k$ represent the occurrence numbers of the functional groups (together forming the GC vector); $C_0$ is the intercept; and $\hat{y}$ is the estimate of the property value. $i$, $j$, and $k$ are the indices for the first-, second-, and third-order contributions, respectively, and $N_\mathrm{F}$, $N_\mathrm{S}$, and $N_\mathrm{T}$ represent the total numbers of groups in the three orders. Hukkerikar et al. [37] proposed two approaches (simultaneous and sequential) for optimizing the contribution coefficients, which differ in the order of coefficient estimation: the simultaneous approach estimates all parameters in a single step, whereas the sequential approach uses the second- and third-order terms to reduce the residual of the first order, step by step.
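To make Eq. (1) concrete, the following minimal Python sketch evaluates the GC sum for a toy molecule; the contribution values and group counts are hypothetical placeholders, not fitted coefficients from this work.

```python
import numpy as np

# Hypothetical toy model: intercept plus three group contributions
# (values are illustrative, not fitted coefficients).
C0 = 198.2                                   # intercept of Eq. (1)
contributions = np.array([23.6, 22.9, 7.4])  # first-/second-/third-order terms
counts = np.array([2, 1, 0])                 # group occurrences in the molecule

# Eq. (1): property estimate = intercept + sum(occurrence * contribution)
y_hat = C0 + counts @ contributions
print(f"estimated property value: {y_hat:.1f}")
```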
2.2. The support-vector regression (SVR) model
The SVR model [38] is a supervised learning model based on the support-vector machine (SVM). The most distinct difference between SVR with a linear kernel and linear regression is that the former ignores errors as long as they are smaller than a certain positive number $\varepsilon$. This feature makes the model less sensitive to noise in the data and thus more robust.
In this paper, the coefficients for the GCs in the model are regressed by SVR with a linear kernel (referred to herein as "the SVR model") through both the simultaneous and sequential approaches introduced in Section 2.1. The hyperparameters of the SVR model include the penalty $C$ (the proportionality coefficient of the violation condition) and the precision $\varepsilon$ (the maximum tolerated deviation), as shown in Eq. (2):

$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^*} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\left(\xi_i + \xi_i^*\right) \\ \text{s.t.} \quad & y_i - \mathbf{w}^\top\mathbf{x}_i - b \le \varepsilon + \xi_i \\ & \mathbf{w}^\top\mathbf{x}_i + b - y_i \le \varepsilon + \xi_i^* \\ & \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \dots, N \end{aligned} \quad (2)$$
where $\xi_i$ and $\xi_i^*$ are two slack variables dealing with points outside the $\varepsilon$-precision bound; $\mathbf{x}_i$ represents the $i$th molecule and $y_i$ represents its measured result; $\mathbf{w}^\top\mathbf{x}_i + b$ represents either the prediction result for the simultaneous method or the first-/second-/third-order prediction terms for the sequential method; $\mathbf{w}$ (which corresponds to the contributions in Eq. (1)) is the contribution vector; and $N$ represents the number of data points. In this work, hyperparameter tuning is implemented by means of the grid search method.
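As a sketch of this step, the snippet below fits an SVR with a linear kernel and grid-searches $C$ and $\varepsilon$; scikit-learn's GridSearchCV is used here for brevity (the paper performs the grid search with the skopt sampler), and the data are synthetic stand-ins for GC vectors and measured properties.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 424)).astype(float)           # toy GC vectors
y = X @ rng.normal(size=424) + rng.normal(scale=0.1, size=200)  # synthetic property

# Grid search over the penalty C and the precision epsilon of Eq. (2)
grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="linear"), grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

svr = search.best_estimator_
contributions = svr.coef_.ravel()  # fitted group contributions (cf. Eq. (1))
print(search.best_params_, contributions[:5])
```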
2.3. The GP model
The GP is a stochastic process in which every finite collection of random variables has a multivariate Gaussian distribution. In the most common cases, the GP prior satisfies Eq. (3). With Bayesian inference, the posterior obtained at an unobserved point also follows a normal distribution, whose mean and variance are shown in Eqs. (4) and (5), respectively:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}(X, X) + \sigma_n^2\mathbf{I} & \mathbf{K}(X, \mathbf{x}_*) \\ \mathbf{K}(\mathbf{x}_*, X) & \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix}\right) \quad (3)$$

$$\bar{f}_* = \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{y} \quad (4)$$

$$\operatorname{cov}(f_*) = \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}(X, \mathbf{x}_*) \quad (5)$$

where $X$ and $\mathbf{x}_*$ correspond to the input variables in the training and testing sets, respectively; $\mathbf{y}$ is the true output value for $X$; $f_*$ is the predicted output for $\mathbf{x}_*$, whose mean is $\bar{f}_*$ and variance matrix is $\operatorname{cov}(f_*)$; $\sigma_n$ denotes the measurement noise; $\mathbf{I}$ indicates the identity matrix; and $\mathcal{N}$ represents the joint normal distribution. $\mathbf{K}$ is the matrix composed of the values calculated by the kernel function [39], whose element in the $i$th row and $j$th column is equal to the value of the kernel function for the $i$th and $j$th inputs, as shown in Eq. (6):

$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \quad (6)$$
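The posterior formulas of Eqs. (4)-(6) translate directly into code. The sketch below is a minimal NumPy implementation with a squared exponential kernel; the kernel choice and hyperparameter values are illustrative assumptions rather than the tuned settings of this work.

```python
import numpy as np

def sq_exp_kernel(A, B, length=1.0, sigma_f=1.0):
    """Kernel matrix of Eq. (6) for a squared exponential kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior(X, y, X_star, sigma_n=0.1):
    """Posterior mean (Eq. (4)) and covariance (Eq. (5)) at X_star."""
    K = sq_exp_kernel(X, X) + sigma_n**2 * np.eye(len(X))
    K_star = sq_exp_kernel(X_star, X)
    mean = K_star @ np.linalg.solve(K, y)
    cov = sq_exp_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```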
The kernel function indicates the degree of correlation between the two variables and should be carefully selected. Alshehri et al. [36] combined four exponential kernels for property prediction under a GP (herein referred to as "the GP model"); the resulting kernel function takes the form of Eq. (7).
where $\mathbf{x}$ and $\mathbf{x}'$ are two 424-dimensional vectors representing two compounds under the GC framework, and $\boldsymbol{\theta}$ denotes the hyperparameters that need to be tuned for the GP model, which are vital for model performance. Commonly used hyperparameter optimization methods include maximum log-likelihood and cross-validation.
3. An improved machine learning model based on GP
Molecules represented by functional group descriptors are characterized by high dimensionality and all-integer elements. Although the GP model has remarkably improved estimation performance on the whole dataset [36], its predictive capability can be further improved. This section proposes an improved GP model, with the aim of dealing with a high-dimensional discrete input space and thereby further improving model performance.
3.1. A warping function for discrete input
A categorical variable with ordered levels is called an ordinal, whose levels can be regarded as a discretization of a continuous space [40]. GP$_Z$ is defined as an ordinal GP with the integer vector $\mathbf{z}$ as the input variable, while GP$_Y$ is an ordinary continuous GP. GP$_Z$ can be transformed into GP$_Y$ in the continuous domain through a nondecreasing function $\psi$ [41] (also called a warping function), as given by Eq. (8):

$$f_Z(\mathbf{z}) = f_Y(\psi(\mathbf{z})) \quad (8)$$

Consequently, the kernel function for the ordinal input can be transformed as shown in Eq. (9):

$$k_Z(\mathbf{z}, \mathbf{z}') = k_Y\left(\psi(\mathbf{z}), \psi(\mathbf{z}')\right) \quad (9)$$
where $\mathbf{z}$, as well as $\mathbf{z}'$, is an ordinal input with different levels, and $k_Y$ is the kernel function in GP$_Y$. The kernel indicates the degree of correlation between the two variables in a GP. Many commonly used kernel functions depend only on the distance between two variables rather than on the variables themselves. Thus, the correlation between molecules having 0 and 1 of a certain functional group is the same as that between molecules having 19 and 20 (as both distances are "1"). Nevertheless, intuitively, the correlation between two ordinals that represent quantities ought to be related to their own numerical values. This can be seen in the case of GC, as the correlation between molecules having 0 and 1 of a certain functional group should be less than that between molecules having 19 and 20 (the warped distance of the first pair should be larger). Therefore, the form of the warping function is defined as Eq. (10).
Fig. 2 demonstrates how the warping function works with a one-dimensional ordinal. Here, the GC representation and the popular exponential kernel are used as examples. Pair 1 involves molecules with 0 and 1 of a certain functional group, while pair 2 comprises molecules with 19 and 20 of a certain functional group. In the case where discrete variables are directly fed into the kernel, the correlation is the same for both pairs. However, after adding the warping function, pair 1 is less correlated, satisfying the model's need. When a compound is represented with a 424-dimensional vector, Eq. (10) is applied elementwise: the $i$th element of the warped vector is calculated from the $i$th-dimensional component of $\mathbf{z}$.
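Because the exact form of Eq. (10) is defined in the paper itself, the sketch below substitutes a concave logarithmic warp purely for illustration; it reproduces the qualitative behavior described above, where the pair (0, 1) becomes less correlated than the pair (19, 20) under an exponential kernel.

```python
import numpy as np

def exp_kernel(u, v, length=1.0):
    """One-dimensional exponential kernel."""
    return np.exp(-np.abs(u - v) / length)

def warp(z, a=1.0):
    """Nondecreasing concave warp; an assumed stand-in for Eq. (10)."""
    return np.log(z + a)

for z1, z2 in [(0, 1), (19, 20)]:
    raw = exp_kernel(z1, z2)
    warped = exp_kernel(warp(z1), warp(z2))
    print(f"pair ({z1:2d},{z2:2d}): raw corr = {raw:.3f}, warped corr = {warped:.3f}")
# Raw correlations are identical (both distances are 1), whereas after
# warping, pair (0,1) is farther apart and hence less correlated.
```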
To better handle the change in the degree of correlation for discrete variables, the parameter $a$ in the warping function and the parameters induced by the kernel function are regarded as hyperparameters and are tuned by cross-validation or maximum log-likelihood in GP$_Y$.
3.2. Prior elicitation for a property prediction model
Strategies for prior elicitation include asking a panel of experts for advice and deriving priors from sample data [42]. Many scholars have studied the prior elicitation of GPs in different scenarios [43-45]. Most often, the prior mean of the GP is set to zero when certain information or experience is lacking, whereas a good prior elicitation procedure grants the model better performance.
Dimensional information should be added into the GP model either through the kernel or through the prior, as it plays an important role under the GC representation. The squared exponential kernel can be parameterized in terms of hyperparameters, as shown in Eq. (11):

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\left(\mathbf{x} - \mathbf{x}'\right)^\top \mathbf{M} \left(\mathbf{x} - \mathbf{x}'\right)\right) \quad (11)$$

where $\mathbf{M}$ denotes a symmetric matrix. Possible choices for the matrix include [28]

$$\mathbf{M}_1 = \ell^{-2}\mathbf{I}, \qquad \mathbf{M}_2 = \operatorname{diag}(\boldsymbol{\ell})^{-2} \quad (12)$$

where $\ell$ and the elements of $\boldsymbol{\ell}$ are parameters of the kernel function. It should be noted that, although the form of $\mathbf{M}_1$ appears to be unrelated to the output space, the hyperparameter $\ell$ will largely depend on the output values during the model training process, thus making the GP model more flexible when predicting multiple properties. While $\mathbf{M}_2$ significantly increases the number of hyperparameters, making hyperparameter tuning intractable, $\mathbf{M}_1$ fails to present the dimensional information. For models whose inputs are high-dimensional and carry linear characteristics or meaning, such as a GC model, the prior mean can instead be set to a linear combination of the input space that provides the dimensional information. Under this framework, the GP model turns into Eq. (13), and the mean and variance of the posterior correspond to Eqs. (14) and (15), respectively [46]:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m_\mathrm{SVR}(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right) \quad (13)$$

$$\bar{f}_* = m_\mathrm{SVR}(\mathbf{x}_*) + \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\left(\mathbf{y} - m_\mathrm{SVR}(X)\right) \quad (14)$$

$$\operatorname{cov}(f_*) = \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}(X, \mathbf{x}_*) \quad (15)$$

where $\mathbf{y}$ is the true output value matching $X$, $f_*$ is the predicted output matching $\mathbf{x}_*$, and $m_\mathrm{SVR}(\cdot)$ is the linear model output obtained through the SVR technique.
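A minimal sketch of Eqs. (13)-(15), reusing the gp_posterior helper from the sketch in Section 2.3: the GP is fitted to the residuals of a linear prior mean, and the prior mean is added back at prediction time. The prior_fn callable stands in for the trained SVR model.

```python
def gp_with_linear_prior(X, y, X_star, prior_fn, sigma_n=0.1):
    """GP posterior with a non-zero prior mean, Eqs. (13)-(15).

    prior_fn: callable giving the prior-mean prediction, e.g. the output
    of the trained linear SVR model (a stand-in assumption here).
    """
    residual = y - prior_fn(X)                # GP models the SVR residual
    mean_res, cov = gp_posterior(X, residual, X_star, sigma_n)
    return prior_fn(X_star) + mean_res, cov   # add prior mean back, Eq. (14)
```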
3.3. Prior predictive checking for the GP
Prior distributions based on data analysis or expert experience cannot be inherently wrong if the prior elicitation procedure is valid. However, even in the case of a valid prior elicitation procedure, it is essential to check whether the prior generates unreasonable data [42]. Many methods have been proposed for prior predictive checking, such as the prior predictive p-value [47] and Bayes factors [48]. Although the SVR model is added as the prior of the GP to provide dimensional information, it does not necessarily outperform the GP with a zero prior. Therefore, prior predictive checking must be conducted before GP modeling. As only the training set can be used and cross-validation is conducted for hyperparameter tuning, the average cross-validation loss over the different folds is used to compare the zero and non-zero priors. Considering that the SVR model is trained on the training set, which gives the prior predictive checking process for models with an SVR prior an inherent advantage, the average cross-validation loss with a non-zero prior is multiplied by a penalty factor. The prior selection criterion is given in Eq. (16):

$$m = \begin{cases} m_\mathrm{SVR}, & \text{if } (1 + \gamma)L_\mathrm{SVR} < L_0 \\ 0, & \text{otherwise} \end{cases} \quad (16)$$
where $m$ is the prior of the final GP model; $L_\mathrm{SVR}$ and $L_0$ are the average cross-validation losses on the training set with the SVR prior and the zero prior, respectively; and $\gamma$ is the penalty.
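A sketch of the selection rule, assuming the average cross-validation losses for the two candidate priors have already been computed; the (1 + γ) penalty factor reflects the multiplicative penalty described above.

```python
def select_prior(loss_svr, loss_zero, gamma=0.05):
    """Prior predictive check of Eq. (16): the SVR prior must beat the
    zero prior even after its cross-validation loss is penalized."""
    return "svr" if (1 + gamma) * loss_svr < loss_zero else "zero"

print(select_prior(0.90, 1.00))  # -> svr
print(select_prior(0.97, 1.00))  # -> zero (penalized loss exceeds the zero-prior loss)
```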
3.4. An integrated modeling structure
Based on the warping function, prior elicitation, and prior predictive checking, the final GP model (referred to herein as the "GP-WP") is built for discrete input with high dimensions. Eqs. (17) and (18) present a mathematical representation of the model:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m_\mathrm{SVR}(\mathbf{x}),\ k_Y\left(\psi(\mathbf{x}), \psi(\mathbf{x}')\right)\right) \quad (17)$$

$$\left(\mathbf{K}_W\right)_{ij} = k_Y\left(\psi(\mathbf{x}_i), \psi(\mathbf{x}_j)\right) \quad (18)$$

where $\hat{y}$ is the predicted result, and $\mathbf{K}_W$ is the matrix composed of the values calculated by the warping and kernel functions, whose element in the $i$th row and $j$th column is equal to the value of the kernel function obtained through Eq. (9) for the $i$th and $j$th inputs. The parameters of the SVR prior are obtained with Eq. (2). Finally, Eqs. (19) and (20) give the prediction value and its uncertainty:

$$\hat{y} = m_\mathrm{SVR}(\mathbf{x}_*) + \mathbf{K}_W(\mathbf{x}_*, X)\left[\mathbf{K}_W(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\left(\mathbf{y} - m_\mathrm{SVR}(X)\right) \quad (19)$$

$$\operatorname{cov}(f_*) = \mathbf{K}_W(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}_W(\mathbf{x}_*, X)\left[\mathbf{K}_W(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}_W(X, \mathbf{x}_*) \quad (20)$$
The full model-building procedure is shown in Fig. 3. Before training, the dataset is divided into training and testing sets. The next stage involves transforming the discrete GP$_Z$ into the continuous GP$_Y$ through the warping function given above. Then, the SVR model is fitted on the training set. Based on the SVR model, cross-validation is conducted to tune the hyperparameters, and prior predictive checking is carried out using Eq. (16) for the models with zero and non-zero priors. Finally, the GP model is built with the prior information and with the kernel processed by the warping function. The hyperparameters include the length scales of the exponential kernel function, the warping parameter $a$, and the noise $\sigma_n$.
Fig. 4 depicts the structure of the machine learning model for predicting the properties of a new molecule. In step 1, the molecular formula of the new compound is required in order to obtain its GC representation. Next, in step 2, the integer vector is transformed into the continuous domain using the warping function, whose parameter is determined during the training process of the GP. In step 3, two covariance matrices are generated: one captures the correlation between the different molecules in the training dataset, while the other represents the correlation between the new molecule and the training molecules. The prediction value is then calculated using the formulas inferred through Bayesian statistics. It is important to note that the hyperparameters in the kernels and formulas are fixed after the training phase.
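Putting the pieces together, a hypothetical end-to-end prediction step for a new molecule could look as follows, reusing warp, gp_posterior, and the trained svr from the earlier sketches; the function names and settings come from those illustrative sketches, not from the paper's released code.

```python
import numpy as np

def predict_new_molecule(x_new, X_train, y_train, svr, a=1.0, sigma_n=0.1):
    """Steps 1-3 of Fig. 4: GC vector -> warped domain -> GP posterior."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    residual = y_train - svr.predict(X)       # SVR prior mean on raw GC vectors
    mean_res, cov = gp_posterior(warp(X, a), residual, warp(x, a), sigma_n)
    return svr.predict(x)[0] + mean_res[0], float(np.sqrt(np.diag(cov))[0])
```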
4. Results and discussions
The models mentioned in the previous sections (i.e., SVR in Section 2.2, GP in Section 2.3, and GP-WP in Section 3) are used for the estimation of pure component primary properties. All of the 20 pure component primary properties obtained from Appendix A are tested. Table 1 lists the database information for the 20 properties, while Table 2 gives detailed information about the three models.
Section 4 is organized as follows. First, the results of the three models are displayed using error threshold heatmaps along with quantitative error indices, including the root mean square error (RMSE) and R-squared ($R^2$). Second, the contributions of the three techniques (namely, the warping function, prior elicitation, and prior predictive checking) to the full model framework are analyzed. Finally, the performance of the GP-WP is compared with those of other mainstream machine learning models, more specifically, neural networks and decision trees.
4.1. Result analysis of the GP-WP for 20 properties
This section compares the prediction accuracy of the SVR (i.e., the model developed using SVR); the GP (i.e., the model developed using a regular GP framework); and the GP-WP (i.e., the model developed using a GP with the warping function, prior elicitation, and prior predictive checking). A total of 20 properties are tested: the normal boiling point (K), critical volume, critical temperature (K), critical pressure, auto-ignition temperature (K), bioconcentration factor, Gibbs energy of formation at 298 K, standard enthalpy of formation, enthalpy of fusion at 298 K, Hildebrand solubility parameter at 298 K, enthalpy of vaporization at 298 K, toxicity (fathead minnow, LC50), toxicity (oral rat, LD50), liquid molar volume at 298 K, octanol-water partition coefficient, aqueous solubility, permissible exposure limit, photochemical oxidation potential, acid dissociation constant, and normal melting point (K). The division of the training and testing sets is retained from the original dataset; however, for isomers, whose inputs are identical and whose outputs are close to each other, only the one closest to the average output value is kept.
The SVR model is trained with Eq. (2), with its hyperparameters $C$ and $\varepsilon$ tuned by means of a grid search using the Python package skopt.sampler (scikit-optimize library). The sequential and simultaneous methods are both used, and the one with the lower RMSE is chosen as the SVR model. SVR coefficient optimization is done with the Python package sklearn.svm (scikit-learn library), after which the contributions for each functional group are obtained. The five hyperparameters of the GP are tuned using the kernel function shown in Eq. (7). A prediction for each property is obtained with Bayesian inference. Then, the GP-WP model is implemented following the structure in Fig. 3. Six hyperparameters (where the extra one comes from the warping function) are set separately for the models with and without priors (the prior being zero). During the training process, the average validation loss can also be calculated. The final GP-WP model is determined using Eq. (16); here, the penalty $\gamma$ is manually set to 0.05.
First, the performance of the different models on the whole dataset, presented in the heatmap in Fig. 5, is measured through different error threshold rates (1%, 5%, and 10%). The first three columns correspond to the SVR, the middle three to the regular GP, and the last three to the GP-WP model. Different rows represent the percentages for different properties, while the last row gives the average over all 20 properties. In Fig. 5, red indicates a high percentage, and blue indicates a low percentage.
The boundary in Fig. 5 between the SVR model and the other two models turns out to be very clear, as a large area in the first three columns is covered with blue. The phenomenon becomes especially prominent when observing the three 1% columns. Although the prediction accuracy differs between properties, the percentages of the SVR predictions are consistently lower than those of their counterparts under the GP frameworks. Therefore, the GP will be a good choice when a surrogate is desired to substitute for the original dataset to achieve high-accuracy predictions. Meanwhile, from the heatmap, it can be seen that the average fractions under the 1%, 5%, and 10% thresholds for the GP-WP are all higher than their counterparts for the regular GP (94.51%, 95.78%, and 96.94%, respectively).
The heatmap for the testing set only is shown in Fig. 6. Although the GP model loses its prominent advantage over the SVR here, the GP-WP model still outperforms the SVR model on most properties according to the fractions on the heatmap, with the highest average percentage for the three thresholds among the three models. It must be admitted that, for some properties (e.g., Hfus and bcf), none of the models has very accurate prediction ability. However, while the two GP models can provide an uncertainty range to inform modelers of the model's poor performance, the SVR model only gives an inaccurate prediction result.
To better quantify the errors of the SVR, regular GP, and GP-WP, the RMSE and $R^2$ for each property are calculated using Eqs. (21) and (22):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \quad (21)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \quad (22)$$

where $y_i$ is the true property value and $\hat{y}_i$ corresponds to the predicted one; $\bar{y}$ is the mean of the true values of all samples. While a smaller RMSE is always desirable, $R^2$ moves from 0 toward 1 as the prediction model becomes more accurate. The results for the 20 property predictions are shown in Table 3 (the whole dataset) and Table 4 (the testing set).
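The two error indices translate into a few lines of code; the arrays below are toy values for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (21)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_square(y_true, y_pred):
    """Coefficient of determination, Eq. (22)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([300.0, 350.0, 400.0])   # toy property values
y_pred = np.array([305.0, 348.0, 395.0])
print(rmse(y_true, y_pred), r_square(y_true, y_pred))
```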
The results in Tables 3 and 4 tend to be consistent with those in the heatmap. Taking the Number 2 property (Vc) as an example, the GP-WP model reduces the RMSE from 28.280 to 8.074, a reduction of 71.45%. In addition, it increases the "fraction under the 1% error threshold" accordingly. Since both the RMSE and the "fraction under the 1% error threshold" represent the accuracy of the predictions, it is evident that the GP-WP model enhances the accuracy for most properties, both in terms of error threshold ratios and quantitative relative errors. On the whole dataset, the two GP-based models have a predominant advantage over the SVR model. Moreover, the GP model underperforms the GP-WP model for most properties. This is evident in the cases of Vc, Gf, and Hf, where the RMSE is decreased by more than 50% (from 28.280 to 8.074, 24.034 to 11.942, and 22.995 to 9.394, respectively) after the warping function and prior setting techniques are adopted. On the testing set, the GP-WP model consistently outperforms the SVR model, displaying smaller errors and higher $R^2$ values. Therefore, it is highly practical to utilize the SVR model as the prior for the GP-WP model.
4.2. An analysis of each technique in the GP-WP
The previous subsection demonstrated the advantage of the GP-WP over the traditional SVR and GP methods. This good performance results from the techniques introduced into the proposed GP method. This section further illustrates how the warping function, prior elicitation, and prior predictive checking each contribute to the whole model framework. The performance of the property estimation can be improved by the introduction of each technique, albeit not always distinctly for all properties.
First, the warping function is used to better handle the correlation between two discrete vectors. With the parameter of the warping function treated as a hyperparameter of the GP, the degree of correlation change is determined automatically during the hyperparameter tuning process. Compared with a regular GP, the model performance for property prediction after incorporating the warping function improves slightly but steadily. In other words, for most of the 20 properties, adding the warping function alone slightly increases the prediction accuracy. The RMSE results after adding the warping function for all properties are given in Appendix A, and some representative examples are listed in Table 5. It is clearly shown that the RMSE for properties such as Tb, Gf, and Tc becomes smaller (from 7.650, 24.034, and 24.310 to 6.890, 22.908, and 23.680, respectively).
Second, prior elicitation with traditional methods is especially effective when the SVR model performs well. From Eq. (14), it can be confirmed that the SVR prior makes the GP model fit the residual of the SVR model. The RMSE results after adding the prior elicitation procedure for all properties are given in Appendix A, and some representative examples are listed in Table 6. For properties such as Vc, where the prediction accuracy of the SVR model is already high, the prior elicitation process dramatically improves the model performance of the GP.
Lastly, the function of prior predictive checking lies in the determination of the final model. The result is shown in Table 7, where $L_0$ and $L_\mathrm{SVR}$ correspond to the average cross-validation loss for the GP model with zero and non-zero priors, respectively. According to the "$L_0$" and "$L_\mathrm{SVR}$" columns, the prior formulation of the final GP-WP model is determined, as shown in the "GP-WP" column. The RMSE on the testing set is also listed in the table for reference; it is unknown during the actual procedure. To be clear, a suboptimal prior choice can sometimes occur during the prior-checking process; for example, for bcf and Ld50, the prior chosen by the prior-predictive-checking technique is not the one that performs better on the testing set. However, the judgment is consistent with the error on the testing set in most cases.
4.3. Comparison with other mainstream machine learning models
To fully demonstrate the effectiveness of the GP-WP model, we further compare it with mainstream machine learning techniques other than GP-based models, namely neural networks and decision trees. Neural networks are one of the standard mainstream modeling methods and have been widely used in property prediction [49]. With the ability to learn complex patterns and relationships within the data, neural networks are frequently employed as a benchmark for comparison with other models. In comparison, decision trees are good at handling discrete-valued inputs, making them particularly well-suited for this problem, in which the inputs are discrete-valued GCs. Without loss of generality, three representative properties are picked here, with the number of samples ranging from 11 236 (logP) to 4658 (Tb), and down to 717 (Tc).
A neural network is a set of connected nodes called neurons, which form different layers. Data are passed through the input layer, hidden layers, and, finally, the output layer. In this work, two types of neural networks are used: one with fully connected dense hidden layers, whose width and depth can be adjusted as hyperparameters to optimize fitting (referred to herein as the "BP-$i$layer" model, with $i$ hidden layers), and the other with convolutional hidden layers added before the dense hidden layers (referred to herein as the "CNN" model). Unlike a neural network's neuron-and-layer structure, a decision tree uses a flowchart to categorize attributes and split different cases [50], with nodes representing attributes, branches corresponding to separate categories, and leaves at the end indicating the results. Similarly, the tree's depth, the number of leaves, and other parameters can be adjusted to optimize fitting. Here, a light gradient boosting machine (LightGBM) [51] regressor is used for the optimization and implementation of the gradient boosting decision tree (GBDT). The results of the property prediction for logP, Tb, and Tc are shown in Tables 8-10, respectively.
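As a sketch of the GBDT baseline, the snippet below fits a LightGBM regressor on synthetic GC-style inputs; the hyperparameter values are illustrative, not the tuned settings used in the comparison.

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 424)).astype(float)           # toy GC vectors
y = X @ rng.normal(size=424) + rng.normal(scale=0.1, size=500)  # synthetic property

# Gradient boosting decision tree; tree depth and leaf count are tunable
gbdt = LGBMRegressor(n_estimators=200, num_leaves=31, learning_rate=0.05)
gbdt.fit(X, y)
print(gbdt.predict(X[:3]))
```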
In Tables 8-10, "1% error," "5% error," and "10% error" represent the percentages of samples predicted within the error threshold rates of 1%, 5%, and 10%, respectively.
First, it is obvious that, for all three datasets of different sizes, the GP-WP model outperforms almost all the other machine learning models on both the testing set and the whole dataset, confirming its extrapolative and fitting abilities. When focusing on the testing set only, the neural networks and decision trees exhibit prediction errors similar to those of the GP on a large sample set. However, their performance tends to degrade as the sample size decreases. Moreover, the GP has a predominant advantage over the other mainstream machine learning models as a surrogate model for the whole dataset. In other words, the GP-WP models always obtain the most desirable results for both the prediction error and the fraction under different error threshold rates.
5. Conclusions
This paper introduces new options for the machine learning-based development of primary pure component property models employing the GC approach. The model development approach is suitable for modeling a wide range of properties and exhibits superior model performance compared with other options for machine learning-based model development, regardless of the dataset size. Similar to other GC-based property models, the developed models are predictive and have limitations related to the use of groups to represent molecules. They are not suitable for very small molecules, such as gases, but extrapolation to larger molecules has been found to be safe. While some isomers can be handled, others may not be accommodated. These limitations have also been acknowledged by Alshehri et al. [36]. Interested readers can access the training dataset, prediction results, and GP-WP model using the method provided on GitHub.
The proposed method has the following advantages over existing methods: ① Based on the GP, its construction process does not need a large amount of data, and no assumptions about the model's structure are required. ② Increased input dimensions do not entail an increase in the number of hyperparameters, yet effective dimensional information is contained in the model. ③ In most cases, it fits better and predicts more accurately for new samples than other models. The improvement of this method can mainly be attributed to two factors: ① A warping function is used to map the discrete variables to a continuous domain, which changes the degree of variable correlation to different extents through an adjustable hyperparameter. ② Prior information is elicited and carefully checked before the model is determined, making the posterior closer to the real values.
Building on the GC models, future work may involve simultaneously focusing on multiple outputs for several property predictions, as the correlation between different properties can be useful for model construction. With these features extracted, more precise predictions can be obtained. Moreover, unlike representations that encode spatial information (e.g., angles and distances between atoms), the current set of functional groups cannot differentiate isomers. Additional discriminative descriptors can also be studied and incorporated, with machine learning guiding these advances.
Acknowledgments
Financial support from the National Natural Science Foundation of China (22150410338 and 61973268) is gratefully acknowledged.
Compliance with ethics guidelines
Xinyu Cao, Ming Gong, Anjan Tula, Xi Chen, Rafiqul Gani, and Venkat Venkatasubramanian declare that they have no conflict of interest or financial conflicts to disclose.
References
[1] A.S. Hukkerikar, B. Sarup, A. Ten Kate, J. Abildskov, G. Sin, R. Gani. Group-contribution+ (GC+) based estimation of properties of pure components: improved property estimation and uncertainty analysis. Fluid Phase Equilib, 321 (2012), pp. 25-43.
[2] D. Mackay, R.S. Boethling. Handbook of property estimation methods for chemicals: environmental health sciences. CRC Press, Boca Raton (2000).
[3] A.S. Hukkerikar. Development of pure component property models for chemical product-process design and analysis [dissertation]. Denmark: Technical University of Denmark; 2013.
[4] T. Zhou, R. Gani, K. Sundmacher. Hybrid data-driven and mechanistic modeling approaches for multiscale material and process design. Engineering, 7 (9) (2021), pp. 1231-1238.
[5] K.G. Joback. Knowledge bases for computerized physical property estimation. Fluid Phase Equilib, 185 (1-2) (2001), pp. 45-52.
[6] K.G. Joback, R.C. Reid. Estimation of pure-component properties from group-contributions. Chem Eng Commun, 57 (1-6) (1987), pp. 233-243.
[7] R. Gani. Group contribution-based property estimation methods: advances and perspectives. Curr Opin Chem Eng, 23 (2019), pp. 184-196.
[8] T. Le, V.C. Epa, F.R. Burden, D.A. Winkler. Quantitative structure-property relationship modeling of diverse materials properties. Chem Rev, 112 (5) (2012), pp. 2889-2919.
[9] S. Wen, K. Nanda, Y. Huang, G.J.O. Beran. Practical quantum mechanics-based fragment methods for predicting molecular crystal properties. Phys Chem Chem Phys, 14 (21) (2012), pp. 7578-7590.
[10] L. Constantinou, R. Gani. New group contribution method for estimating properties of pure compounds. AIChE J, 40 (10) (1994), pp. 1697-1710.
[11] C. Gao, R. Govind, H.H. Tabak. Application of the group contribution method for predicting the toxicity of organic chemicals. Environ Toxicol Chem, 11 (5) (1992), pp. 631-636.
[12] C.L. Aguirre, L.A. Cisternas, J.O. Valderrama. Melting-point estimation of ionic liquids by a group contribution method. Int J Thermophys, 33 (1) (2012), pp. 34-46.
[13] E. Terrell. Estimation of Hansen solubility parameters with regularized regression for biomass conversion products: an application of adaptable group contribution. Chem Eng Sci, 248 (2022), Article 117184.
[14] J. Marrero, R. Gani. Group-contribution based estimation of pure component properties. Fluid Phase Equilib, 183-184 (2001), pp. 183-208.
[15] R. Gani, P.M. Harper, M. Hostrup. Automatic creation of missing groups through connectivity index for pure-component property prediction. Ind Eng Chem Res, 44 (18) (2005), pp. 7262-7269.
[17] V. Venkatasubramanian. The promise of artificial intelligence in chemical engineering: is it here, finally? AIChE J, 65 (2) (2019), pp. 466-478.
[18] V. Venkatasubramanian, V. Mann. Artificial intelligence in reaction prediction and chemical synthesis. Curr Opin Chem Eng, 36 (2022), Article 100749.
[19] V. Mann, R. Gani, V. Venkatasubramanian. Group contribution-based property modeling for chemical product design: a perspective in the AI era. Fluid Phase Equilib, 568 (2023), Article 113734.
[20] M.R. Dobbelaere, P.P. Plehiers, R. Van de Vijver, C.V. Stevens, K.M. Van Geem. Machine learning in chemical engineering: strengths, weaknesses, opportunities, and threats. Engineering, 7 (9) (2021), pp. 1201-1211.
[21] R. Nagai, R. Akashi, O. Sugino. Completing density functional theory by machine learning hidden messages from molecules. npj Comput Mater, 6 (1) (2020), p. 43.
[22] G.B. Goh, C. Siegel, A. Vishnu, N.O. Hodas, N. Baker. Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models. 2017. arXiv:1706.06689.
[23] Z. Zhou, M. Eden, W. Shen. Treat molecular linear notations as sentences: accurate quantitative structure-property relationship modeling via a natural language processing approach. Ind Eng Chem Res, 62 (12) (2023), pp. 5336-5346.
[24] J. Zhang, Q. Wang, Y. Su, S. Jin, J. Ren, M. Eden, et al. An accurate and interpretable deep learning model for environmental properties prediction using hybrid molecular representations. AIChE J, 68 (6) (2022), p. e17634.
[25] H. Wen, Y. Su, Z. Wang, S. Jin, J. Ren, W. Shen, et al. A systematic modeling methodology of deep neural network-based structure-property relationship for rapid and reliable prediction on flashpoints. AIChE J, 68 (1) (2022), p. e17402.
[26] K. Paduszyński, U. Domańska. Viscosity of ionic liquids: an extensive database and a new group contribution model based on a feed-forward artificial neural network. J Chem Inf Model, 54 (5) (2014), pp. 1311-1324.
[27] R. Li, J.M. Herreros, A. Tsolakis, W. Yang. Machine learning regression based group contribution method for cetane and octane numbers prediction of pure fuel compounds and mixtures. Fuel, 280 (2020), Article 118589.
[28] C.E. Rasmussen. Gaussian processes in machine learning. O. Bousquet, U. von Luxburg, G. Rätsch (Eds.), Advanced lectures on machine learning, Springer, Berlin (2003), pp. 63-71.
[29] X. Lu, K.E. Jordan, M.F. Wheeler, E.O. Pyzer-Knapp, M. Benatan. Bayesian optimization for field-scale geological carbon storage. Engineering, 18 (2022), pp. 96-104.
[30] A. Capone, A. Lederer, S. Hirche. Gaussian process uniform error bounds with unknown hyperparameters for safety-critical applications. Proceedings of the 39th International Conference on Machine Learning; 2022 Jul 17-23; Baltimore, MD, USA. New York: PMLR (2022), pp. 2609-2624.
[31] T. Akazaki. Falsification of conditional safety properties for cyber-physical systems with Gaussian process regression. In: Y. Falcone, C. Sánchez, editors. Proceedings of the 16th International Conference on Runtime Verification; 2016 Sep 23-30; Madrid, Spain. Cham: Springer; 2016. pp. 439-446.
[32] H. Mori, E. Kurata. Application of Gaussian process to wind speed forecasting for wind power generation. In: Proceedings of the 2008 IEEE International Conference on Sustainable Energy Technologies; 2008 Nov 24-27; Singapore. Piscataway: IEEE; 2008. pp. 956-959.
[33] A.Y. Sun, D. Wang, X. Xu. Monthly streamflow forecasting using Gaussian process regression. J Hydrol, 511 (2014), pp. 72-81.
[34] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, N. De Freitas. Taking the human out of the loop: a review of Bayesian optimization. Proc IEEE, 104 (1) (2016), pp. 148-175.
[35] M.A. Gelbart, J. Snoek, R.P. Adams. Bayesian optimization with unknown constraints. 2014. arXiv:1403.5607.
[36] A.S. Alshehri, A.K. Tula, F. You, R. Gani. Next generation pure component property estimation models: with and without machine learning techniques. AIChE J, 68 (6) (2022), p. e17469.
[37] A.S. Hukkerikar, S. Kalakul, B. Sarup, D.M. Young, G. Sin, R. Gani. Estimation of environment-related properties of chemicals for design of sustainable processes: development of group-contribution+ (GC+) property models and uncertainty analysis. J Chem Inf Model, 52 (11) (2012), pp. 2823-2839.
[38] A.J. Smola, B. Schölkopf. A tutorial on support vector regression. Stat Comput, 14 (3) (2004), pp. 199-222.
[39] T. Hofmann, B. Schölkopf, A.J. Smola. Kernel methods in machine learning. Ann Stat, 36 (3) (2008), pp. 1171-1220.
[40] O. Roustant, E. Padonou, Y. Deville, A. Clément, G. Perrin, J. Giorla, et al. Group kernels for Gaussian process metamodels with categorical inputs. SIAM/ASA J Uncertain Quantif, 8 (2) (2020), pp. 775-806.
[41] P.Z.G. Qian, H. Wu, C.F.J. Wu. Gaussian process models for computer experiments with qualitative and quantitative factors. Technometrics, 50 (3) (2008), pp. 383-396.
[42] R. van de Schoot, S. Depaoli, R. King, B. Kramer, K. Märtens, M.G. Tadesse, et al. Bayesian statistics and modelling. Nat Rev Methods Primers, 1 (1) (2021), p. 1.
[43] S. Ghosal, A. Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann Stat, 34 (5) (2006), pp. 2413-2429.
[44] F.P. Casale, A.V. Dalca, L. Saglietti, J. Listgarten, N. Fusi. Gaussian process prior variational autoencoders. In: S. Bengio, H.M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, editors. Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018 Dec 3-8; Montréal, QC, Canada. Red Hook: Curran Associates Inc.; 2018. pp. 10390-10401.
[45] C.G. Kaufman, S.R. Sain. Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Anal, 5 (1) (2010), pp. 123-149.
[46] R. Astudillo, P.I. Frazier. Thinking inside the box: a tutorial on grey-box Bayesian optimization. In: Proceedings of the 2021 Winter Simulation Conference; 2021 Dec 15-17; Phoenix, AZ, USA. Piscataway: IEEE; 2021. pp. 1-15.
[47] D.J. Nott, C.C. Drovandi, K. Mengersen, M. Evans. Approximation of Bayesian predictive p-values with regression ABC. Bayesian Anal, 13 (1) (2018), pp. 59-83.
[48] R.E. Kass, A.E. Raftery. Bayes factors. J Am Stat Assoc, 90 (430) (1995), pp. 773-795.
[49] L. Hirschfeld, K. Swanson, K. Yang, R. Barzilay, C.W. Coley. Uncertainty quantification using neural networks for molecular property prediction. J Chem Inf Model, 60 (8) (2020), pp. 3770-3780.
[50] J. Fang, B. Gong, J. Caers. Data-driven model falsification and uncertainty quantification for fractured reservoirs. Engineering, 18 (2022), pp. 116-128.
[51] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4-9; Long Beach, CA, USA. Red Hook: Curran Associates Inc.; 2017. pp. 3149-3157.