Machine-Learning-Assisted Design of Deep Eutectic Solvents Based on Uncovered Hydrogen Bond Patterns

Usman L. Abbas , Yuxuan Zhang , Joseph Tapia , Selim Md , Jin Chen , Jian Shi , Qing Shao

Engineering ›› 2024, Vol. 39 ›› Issue (8) : 79 -89.

DOI: 10.1016/j.eng.2023.10.020

Abstract

Non-ionic deep eutectic solvents (DESs) are designer solvents with various applications in catalysis, extraction, carbon capture, and pharmaceuticals. However, discovering new DES candidates is challenging due to a lack of efficient tools that accurately predict DES formation. The search for DESs relies heavily on intuition or trial-and-error processes, leading to low success rates or missed opportunities. Recognizing that hydrogen bonds (HBs) play a central role in DES formation, we aim to identify HB features that distinguish DES from non-DES systems and use them to develop machine learning (ML) models to discover new DES systems. We first analyze the HB properties of 38 known DES and 111 known non-DES systems using their molecular dynamics (MD) simulation trajectories. The analysis reveals that DES systems have two unique features compared with non-DES systems: The DESs have ① more imbalance between the numbers of the two intra-component HBs and ② more and stronger inter-component HBs. Based on these results, we develop 30 ML models using ten algorithms and three types of HB-based descriptors. The model performance is first benchmarked using the average and minimal receiver operating characteristic (ROC)-area under the curve (AUC) values. We also analyze the importance of individual features in the models, and the results are consistent with the simulation-based statistical analysis. Finally, we validate the models using the experimental data of 34 systems. The extra trees forest model outperforms the other models in the validation, with an ROC-AUC of 0.88. Our work illustrates the importance of HBs in DES formation and shows the potential of ML in discovering new DESs.

Keywords

Machine learning / Deep eutectic solvents / Molecular dynamics simulations / Hydrogen bond / Molecular design

1. Introduction

Deep eutectic solvents (DESs) are liquid mixtures composed of hydrogen bond acceptors (HBAs) and donors (HBDs) that exhibit tunable properties [1-13]. DESs have gained attention as sustainable solvents in a number of applications, including carbon capture [2,14-16], pharmaceuticals [9,14,15,17-23], material synthesis [8,19,24], electrochemistry [9,14,25-37], decontamination [17,18], and extractions [6,8,24,38-41], due to their potential for recovery [42] and reuse [43]. Compared with conventional solvents, non-ionic DESs have several desirable properties, including biodegradability, high conductivity, low volatility, and low toxicity [2,8,38,43,44]. Popularly classified as type V DESs [4,14-16,38], non-ionic DESs can be made using natural compounds and exhibit low viscosity, making them particularly suitable for industrial applications such as liquid-liquid extraction and carbon nanomaterial production [12,38,45].

One of the main challenges in the field of DES is the discovery of a large collection of DES candidates, which would enable the community to have a vast pool to explore and to search for the ones with the desired properties. Numerous experimental and computational studies have shown the important role of hydrogen bonds (HBs) in the formation and properties of DESs [1,2,4,9,14,15,39,42,46]. Farias et al. [43] carried out an experimental study to understand the role of the HBDs of DESs in aqueous biphasic systems. They concluded that HBDs with high relative hydrophilicity mainly serve as adjuvants in biphasic systems, while HBDs with moderate hydrophilicity control the formation of biphasic systems, and HBDs with low hydrophilicity (high hydrophobicity) form aqueous biphasic systems, with the HBAs acting as adjuvants in such systems. Abranches et al. [1] investigated the suitability of betaine, a molecule with polarity imbalance, as a universal HBA in the formation of DESs. Their study used a combination of experiments and density functional theory calculations and concluded that betaine is a suitable choice for producing natural DESs due to its non-selective nature, low cost, and low toxicity. These fundamental studies highlight the important role of HBs in DES formation and properties, indicating that HB-based descriptors could serve as suitable inputs to discover new DESs.

Machine learning (ML) models are becoming increasingly popular for predicting the physicochemical and thermophysical properties of DESs [17,19,46-51]. A review by Hansen et al. [15] summarized studies that developed quantitative structure-property relationship models for predicting DES properties [6,15]. Halder et al. [51] used a cheminformatics approach to determine the structural attributes of DESs necessary for accurate predictions of densities in industrial applications. They utilized a consensus modeling approach and concluded that features such as the number of HBDs, lipophilicity, polarizability, and van der Waals surface area could be used to obtain highly accurate estimates of novel DES densities. Dietz et al. [6] used perturbed-chain statistical association fluid theory (PC-SAFT) modeling to predict the liquid-liquid equilibrium and solid-liquid equilibrium of mixtures of hydrophobic DES with water or hydroxymethyl furfural, demonstrating the efficacy of this approach for predicting the phase behavior of hydrophobic DES mixtures.

Other studies have employed ML algorithms to estimate the densities and viscosities of DESs. Abdollahzadeh et al. [19] compared seven ML algorithms and showed that least squares support vector regression had the highest accuracy in predicting the densities of 149 DESs, performing 74.5% better than the best results obtained via empirical correlations. Zamora et al. [16] compared the suitability of five ML algorithms, trained on experimental data, to predict the densities and viscosities of type V DESs. Their study concluded that support vector machines performed best at predicting densities, and Gaussian process regression models did best at predicting viscosities. Xu et al. [50] used gradient boosting models to predict DES viscosities; their model showed satisfactory results when trained and tested on experimental and simulation data. Overall, these studies demonstrate the potential of combining ML and molecular simulations to predict the properties of DESs.

In contrast to other studies that focus on predicting the properties of DESs, our work aims to use ML models to predict the formation of DES systems. Molecular dynamics (MD) simulation has emerged as a valuable technique for determining descriptors to be used as inputs for ML models [14,24,46]. We hypothesize that HB properties could serve as predictors for the formation of DESs. However, determining the relevant HB properties is not trivial. Our previous work [52] classified non-ionic DESs into three groups based on the ratio of intra- and inter-component HB numbers. These observations inspired us to explore the possibility of developing ML models using HB-based descriptors. To the best of our knowledge, our work is the first to use ML models to classify solvents as DES or non-DES.

In ML model training, data is a crucial element. To facilitate our research, we curated a library of 38 known DES and 111 non-DES systems from the literature. The construction of this library allows us to conduct statistical analysis on molecular simulation data, which can be used to develop training and testing datasets for model development. We curated a separate library of 34 systems to validate our model performance. Given the size of our database, this paper focuses on traditional ML algorithms. We utilized ten ML algorithms; however, we acknowledge that deep learning algorithms have emerged as a promising technique for designing materials. One obstacle to using deep learning algorithms for predicting the formation of DESs is the relative sparsity of experimentally verified DESs in the literature. Models such as the one we developed could help speed up the discovery of novel DESs by generating solvent mixtures likely to form DESs. The rest of this paper is structured as follows: Section 2 provides details on the computational methods, Section 3 presents the results and discussion, and Section 4 gives our conclusions.

2. Methodology

2.1. Library of DES and non-DES systems

Tables S1-S8 in Appendix A provide detailed information on the 183 systems that were simulated in this study. Of these systems, 38 are identified as known DES and 111 are known non-DES systems, as reported in the literature. These comprised our training and testing set. Additionally, 34 experimentally verified systems (17 DES and 17 non-DES) were reserved for validation. Classification of the DES and non-DES systems is based on the experimental results of van Osch et al. [53,54], with only the non-ionic DESs from their list being considered. DESs that lacked all three types of HBs (A-A, B-B, and A-B) were excluded from this study. The compounds used in the simulation are represented using three-letter abbreviations, such as "DEA" for "decanoic acid." We follow the naming conventions from van Osch et al.’s work, in which component A in a system A-B is the expected HBD, and component B is the expected HBA. The systems are denoted by the three-letter abbreviations of their compounds and the corresponding molar ratio; for example, DEA-MEN11 represents a 1:1 mixture of decanoic acid and menthol. Tables S9-S11 in Appendix A list the abbreviations used for chemical compounds in this study.

2.2. Molecular simulations

2.2.1. Molecular models

The all-atom optimized potentials for liquid simulations (OPLS-AA/M) force field [55] was used to describe the molecules in this study. The nonbonded and bonded parameters in the systems were determined based on the OPLS-AA/M force field, due to its proven ability to accurately model the behavior of organic molecules. The force field parameters were generated using the LigParGen [56] web server.

2.2.2. Simulation detail

The simulation systems were created by randomly inserting specific numbers (based on the molar ratio) of the chosen organic molecules in a cubic box. Fig. 1 shows a snapshot of the THY-MEN11 system generated using visual molecular dynamics (VMD) [57].

For each system, the simulation process comprised three steps: ① an energy minimization to remove any atomic overlaps; ② a 50 ns isobaric-isothermal (NPT, where N is the number of particles, P is the pressure (P = 1 atm; 1 atm = 101.325 kPa), and T is the temperature (T = 295 K)) ensemble MD simulation to enable the system to reach thermodynamic equilibrium; and ③ a 10 ns canonical (NVT, where V is the system’s volume and T = 295 K) ensemble MD simulation to collect data every 10 ps. In step ②, the MD simulation used the method of Berendsen et al. [58] to control the system pressure, while the velocity rescaling method [59] was used to control the system temperature.

The short- and long-range nonbonded interactions in the OPLS-AA/M force field were calculated using the Lennard-Jones 12-6 and Coulomb potentials, respectively, with the total energy E given by Eq. (1).

$$E = \sum_{i} \sum_{j<i} \left\{ \frac{1}{4\pi\varepsilon_0} \frac{q_i q_j e^2}{r_{ij}} + 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] \right\} \qquad (1)$$

where $r_{ij}$ is the distance between atoms i and j; $q_i$ and $q_j$ are the partial charges of atoms i and j, respectively; $\varepsilon_0$ is the free space permittivity; and $\varepsilon_{ij}$ and $\sigma_{ij}$ are energetic and geometric parameters, respectively. The particle mesh Ewald (PME) [60] sum was used to calculate long-range potentials, and the linear constraint solver (LINCS) algorithm [61] was used to constrain bonds involving hydrogen atoms. All energy minimization and MD simulations were conducted using GROMACS 2021.2 [62].
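As an illustration of the pairwise sum in Eq. (1), the following minimal sketch evaluates the Coulomb and Lennard-Jones 12-6 terms for a handful of atoms. It is not the GROMACS implementation: it ignores cutoffs, PME, exclusions, 1-4 scaling, and periodic boundaries. The function names and the `pair_params` callback are hypothetical, and the units (nm, kJ/mol, charges in units of e, with the prefactor 1/(4πε0) = 138.935458 kJ·mol⁻¹·nm·e⁻²) are assumed to follow the GROMACS convention.

```python
import math

# Coulomb prefactor 1/(4*pi*eps0) in kJ mol^-1 nm e^-2 (GROMACS unit convention, assumed).
COULOMB_CONSTANT = 138.935458


def pair_energy(q_i, q_j, r_ij, epsilon_ij, sigma_ij):
    """Return the nonbonded energy (kJ/mol) of a single atom pair.

    q_i, q_j   : partial charges in units of the elementary charge
    r_ij       : interatomic distance in nm
    epsilon_ij : Lennard-Jones well depth in kJ/mol
    sigma_ij   : Lennard-Jones size parameter in nm
    """
    coulomb = COULOMB_CONSTANT * q_i * q_j / r_ij
    sr6 = (sigma_ij / r_ij) ** 6
    lennard_jones = 4.0 * epsilon_ij * (sr6 ** 2 - sr6)
    return coulomb + lennard_jones


def total_energy(atoms, pair_params):
    """Sum pair energies over all unique pairs with j < i.

    atoms       : list of (charge, (x, y, z)) tuples, positions in nm
    pair_params : hypothetical callback (i, j) -> (epsilon_ij, sigma_ij)
    """
    energy = 0.0
    for i in range(len(atoms)):
        for j in range(i):
            q_i, pos_i = atoms[i]
            q_j, pos_j = atoms[j]
            eps, sig = pair_params(i, j)
            energy += pair_energy(q_i, q_j, math.dist(pos_i, pos_j), eps, sig)
    return energy
```

For an uncharged pair, the Lennard-Jones term vanishes at a separation equal to σ and reaches its minimum of −ε at 2^(1/6)σ, which gives a quick sanity check of the implementation.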

2.3. Hydrogen bond analysis

We characterized the HBs using the criteria developed by Luzar and Chandler [63]: ① the distance between the O (donor) and O (acceptor) is ≤ 0.35 nm; and ② the O (acceptor)-H (donor)-O (donor) angle is ≤ 30°. We calculated the HB lifetime in two steps:
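A geometric test of this kind can be sketched as follows. The sketch assumes that both cutoffs are upper bounds (0.35 nm and 30°) and that the angle is measured at the donor oxygen, between the donor O-H bond and the donor O to acceptor O vector; this is one common convention and may not match the exact definition used in the study. Periodic boundary conditions are ignored, and all names are hypothetical.

```python
import math

MAX_OO_DISTANCE_NM = 0.35  # assumed upper bound on the donor O ... acceptor O distance
MAX_ANGLE_DEG = 30.0       # assumed upper bound on the angle at the donor oxygen


def _subtract(a, b):
    """Return the vector a - b for two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def angle_deg(u, v):
    """Return the angle between vectors u and v in degrees."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cosine))


def is_hydrogen_bond(donor_o, hydrogen, acceptor_o):
    """Apply the distance and angle cutoffs to one donor-H ... acceptor triplet.

    All positions are (x, y, z) tuples in nm; no periodic images are considered.
    """
    if math.dist(donor_o, acceptor_o) > MAX_OO_DISTANCE_NM:
        return False
    angle = angle_deg(_subtract(hydrogen, donor_o), _subtract(acceptor_o, donor_o))
    return angle <= MAX_ANGLE_DEG
```

Counting the triplets that pass this test in each trajectory frame, separately for A-A, B-B, and A-B pairs, would give HB numbers of the kind analyzed in the study.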

(1) Calculate the correlation function C(t), as shown in Eq. (2):

$$C(t) = \frac{N_{\mathrm{HB}}(t)}{N_{\mathrm{HB}}(0)} \qquad (2)$$

where $N_{\mathrm{HB}}(0)$ is the ensemble average of the number of HBs at the initial status, and $N_{\mathrm{HB}}(t)$ is the ensemble average of the number of HBs still existing at time t. The HBs are counted even if they break intermittently, based on Rappaport’s definition [64].

(2) Calculate the lifetime τ by numerically integrating the C(t) curve.
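The two steps above can be sketched in a few lines. This is a minimal illustration that assumes the surviving-HB counts have already been averaged over time origins and that the trapezoidal rule is an acceptable integrator; the study does not state which quadrature was used, and the function names are hypothetical.

```python
def correlation_from_counts(surviving_counts):
    """Normalize averaged surviving-HB counts N_HB(t) by the initial count N_HB(0)."""
    initial = surviving_counts[0]
    if initial <= 0:
        raise ValueError("the initial HB count must be positive")
    return [count / initial for count in surviving_counts]


def hb_lifetime(times, correlation):
    """Integrate C(t) over the sampled times with the trapezoidal rule.

    times       : increasing sample times (for example, in ns)
    correlation : C(t) values at those times
    The result carries the same time unit as `times`.
    """
    if len(times) != len(correlation) or len(times) < 2:
        raise ValueError("need at least two matching samples")
    area = 0.0
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]
        area += 0.5 * dt * (correlation[k] + correlation[k - 1])
    return area
```

For an exponentially decaying C(t) = exp(−t/τ) sampled well past the decay, the integral recovers τ, which is a convenient check. If C(t) has not decayed to near zero within the sampled window, the integral underestimates the lifetime.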

2.4. Machine learning models

The literature-based library introduced in Section 2.1 contains more non-DES than DES systems, a data imbalance that may cause bias in model training. To mitigate this potential source of bias, we curated a balanced database containing 38 DES and 38 non-DES systems for each round of training during ML model development. The 38 non-DES systems were selected randomly from the original 111 non-DES systems in the library. We further split this database into a training set consisting of 30 DES and 30 non-DES systems and a testing set consisting of eight DES and eight non-DES systems. We used fixed seeds when sampling from the DES and non-DES sets to ensure that all models were evaluated on the same dataset slices. All models were further validated with experimentally verified DES and non-DES systems, as described in Section 3.
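The balanced sampling described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the function name, the label encoding (1 for DES, 0 for non-DES), and the use of Python's `random.Random` for seeding are assumptions.

```python
import random


def balanced_split(des, non_des, n_train=30, n_test=8, seed=0):
    """Draw equal numbers of DES and non-DES systems and split them.

    des, non_des : lists of system identifiers
    Returns (train, test), each a list of (system, label) pairs with
    label 1 for DES and 0 for non-DES.
    """
    rng = random.Random(seed)  # a fixed seed makes the slices reproducible
    n_total = n_train + n_test
    des_sample = rng.sample(des, n_total)          # all 38 DESs, in shuffled order
    non_des_sample = rng.sample(non_des, n_total)  # 38 drawn from the larger pool
    train = [(s, 1) for s in des_sample[:n_train]]
    train += [(s, 0) for s in non_des_sample[:n_train]]
    test = [(s, 1) for s in des_sample[n_train:]]
    test += [(s, 0) for s in non_des_sample[n_train:]]
    return train, test
```

Calling the function with the same seed for every algorithm yields identical training and testing slices, which is what makes the model comparison fair.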

We trained ten distinct ML algorithms utilizing algorithm implementations provided by the scikit-learn [65,66] and XGBoost [67] packages. These algorithms were: ① logistic regression, ② decision tree, ③ gradient boost, ④ AdaBoost, ⑤ random forest, ⑥ extra trees forest, ⑦ support vector machine, ⑧ k-nearest neighbors, ⑨ XGBoost, and ⑩ XGBoost-random forest. Hyperparameter optimization was performed using scikit-learn’s grid search method. Each model’s performance was measured via repeated k-fold cross-validation with six folds and ten repeats, using the receiver operating characteristic (ROC)-area under the curve (AUC) metric. The model with the highest ROC-AUC value during optimization was considered to be the best model. For each ML algorithm, subsequent training and testing were only conducted on the best-trained model.
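To make the cross-validation scheme concrete, the sketch below generates repeated stratified splits with six folds and ten repeats using only the standard library. It stands in for the scikit-learn splitter used in the study and is not that library's implementation; whether the study stratified its folds is not stated, so stratification is an assumption here, and the function name is hypothetical.

```python
import random


def repeated_stratified_kfold(labels, n_splits=6, n_repeats=10, seed=0):
    """Yield (train_indices, validation_indices) for repeated stratified k-fold.

    labels : class label of each sample (for example, 1 for DES, 0 for non-DES)
    Each repeat reshuffles the samples within every class and deals them
    round-robin into n_splits folds, so class proportions stay balanced.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for position, index in enumerate(shuffled):
                folds[position % n_splits].append(index)
        for k in range(n_splits):
            validation = sorted(folds[k])
            train = sorted(i for f in range(n_splits) if f != k for i in folds[f])
            yield train, validation
```

With a training set of 30 DES and 30 non-DES systems, this produces 60 splits, each holding out five systems of each class. In a grid search, every hyperparameter combination would be scored on all splits and the combination with the highest mean ROC-AUC retained.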

HBs play a determining role in the formation of DESs. To obtain a full picture of the HB environment, it is imperative to know how many of the molecules in a system interact to form HBs (i.e., the HB number) and how long these HBs last (i.e., the HB lifetime). All ML algorithms consider three types of input features: ① HB numbers alone, ② HB lifetimes alone, and ③ HB numbers combined with lifetimes. The input features, generated from MD simulations, are shown in Tables S1-S8. A total of 30 models were trained in this study; the model hyperparameters are detailed in Tables S12-S14 in Appendix A.

The work presented in this study was conducted using Python (version 3.10.8) with the following packages: scikit-learn [66] (version 1.2.0), pandas [68] (version 1.5.2), NumPy [69] (version 1.22.3), matplotlib [70] (version 3.6.2), SciPy [71] (version 1.7.3), and XGBoost [67] (version 1.7.3). All ML work was executed on an 8th Gen Intel Core i7-8750H processor.

2.5. Experiment

To validate the trained models, we used a list of solvent formulas from our previous study [72] to determine whether the formulas could form DESs or not. To prepare the systems, two components were mixed at a specific molar ratio, with heating and constant stirring to ensure complete mixing. More specifically, the required mass of each component was first calculated based on the molar ratio and sequentially weighed out into a glass bottle on an analytic balance (VWR-224AC, VWR International, USA). The compounds were premixed using a glass rod, and a magnetic stir bar was subsequently added to the bottle. The bottle was then sealed and placed in an oil bath for heating. The temperature was generally maintained at 80 °C, with constant stirring at 500 r·min⁻¹ for 1 h on a magnetic stirrer hotplate (Hei-Tec, Heidolph Instruments, Germany). For combinations that did not form a homogeneous liquid at this temperature, higher temperatures of 100 and 120 °C were further applied to check whether these combinations could transform into the liquid state at elevated temperatures. After the heating process, the mixture was air-cooled to room temperature and kept in a desiccator for 24 h. Samples that remained in a liquid form with no crystals within the 24 h period were considered to be DES candidates. However, we observed that some systems initially exhibited DES-like behavior but eventually formed a solid phase after several days. These systems were excluded from this study. Finally, 17 DES and 17 non-DES systems were selected.
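The mass calculation mentioned in the preparation step can be sketched as below. The function name and the 10 g batch size are hypothetical, and the molar masses in the usage note (about 172.27 g/mol for decanoic acid and 156.27 g/mol for menthol) are standard approximate values rather than figures taken from the study.

```python
def component_masses(molar_masses, molar_ratio, total_mass):
    """Split a target batch mass between components at a given molar ratio.

    molar_masses : molar mass of each component in g/mol
    molar_ratio  : relative number of moles of each component
    total_mass   : desired total batch mass in g
    Returns the mass of each component in g.
    """
    # The mass of each component is proportional to (molar mass x mole share).
    weights = [m * r for m, r in zip(molar_masses, molar_ratio)]
    scale = total_mass / sum(weights)
    return [w * scale for w in weights]
```

For example, `component_masses([172.27, 156.27], [1, 1], 10.0)` splits a 10 g batch of a 1:1 decanoic acid and menthol mixture so that both components contain the same number of moles.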

3. Results and discussion

3.1. Statistical analysis of hydrogen bond features

3.1.1. Hydrogen bond number features

We first analyzed the probability density distribution of actual inter- and intra-component HB numbers for DES and non-DES systems. Fig. 2 shows the distribution of the 38 DES and 111 non-DES systems based on their average inter- and intra-component HB numbers. The pattern distributions in Fig. 2 do not show distinct differences. As depicted in Fig. 2(a), the intra-component HB numbers (A-A and B-B) for the DESs skew to the left, indicating that most of the DESs in our dataset have average HB numbers that are less than 20. The B-B HB numbers are concentrated on the lower end of the spectrum compared with the A-A HBs. The inter-component HB numbers skew to the right, suggesting that most of the DESs have higher inter-component HB numbers compared with intra-component HB numbers. In Fig. 2(b), the intra-component HB numbers for the non-DESs also skew to the left. In addition, most of the inter-component HB numbers are skewed to the right. Thus, based on an analysis of Fig. 2, the actual number of intra- and inter-component HBs may not be a suitable HB feature to differentiate between DES and non-DES systems.

Distinct patterns emerge when plotting the average inter- and intra-component HB numbers as boxplots for DES and non-DES systems. As shown in Figs. 3(a) and (b), the DES systems present a larger difference between the median values for the two intra-component HB numbers (A-A vs B-B) than the non-DES systems. In addition, the inter-component HBs in the DES systems exhibit a median value of 56.07, which is 6%-83% higher than the median values of A-A and B-B. For the non-DES systems, the A-B HBs present a median value of only 48.36, which is 55%-59% higher than those of A-A and B-B. In both the DES and non-DES systems, even when the intra-component HB numbers (A-A and B-B) are summed up, the inter-component HB number (A-B) is still greater. Such differences in the median imply that the ratio of the two intra-component HB numbers and the ratio of inter- to intra-component HB numbers may serve as important features for DES and non-DES system classification.

The plot of A-A/B-B and A-B/(A-A + B-B) in Fig. 3(c) further confirms our hypothesis. The ratio of inter- to intra-component HB numbers is well above 1.5 for some DESs. On average, the inter-component HB numbers are 35% greater than the total intra-component HB numbers for DESs. Finally, we looked at the ratio of the intra-component bonds to obtain more insight into the magnitudes of their differences. On average, the ratio of A-A to B-B HBs is 8.01 for DESs. For non-DESs, the ratio of A-A to B-B HBs falls to 3.44. The average intra-component HB numbers (A-A and B-B) for the non-DESs are roughly the same (24.31 and 24.08, respectively). The median HB numbers for A-A and B-B are also similar in non-DESs, at 20.01 and 21.41, respectively. This finding suggests that there is no dominant intra-component HB in non-DESs. This is also shown in Fig. 3(c), in which most of the intra-component HB number ratios are clustered around 1.0 for non-DESs, with few outliers. For non-DESs, the average inter-component (A-B) HB numbers are close to twice (1.93-1.95 times) those of the average intra-component (A-A and B-B) HBs, respectively. Relative to DESs, the ratios of intra-component HBs in non-DESs are also smaller. For example, in the 25th, 50th, and 75th percentiles, the DESs display A-A/B-B of 1.13, 1.91, and 15.51 compared with 0.15, 0.49, and 0.99 for the non-DESs. Tables S15 and S16 in Appendix A provide more details.
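The two ratios discussed here, together with the three raw values, can be computed as in the following sketch. The function name and dictionary keys are hypothetical, and the example values in the usage note are made up for illustration; the same function would apply to HB numbers or HB lifetimes.

```python
def hb_descriptors(aa, bb, ab):
    """Build the five HB descriptors from the A-A, B-B, and A-B values.

    aa, bb : intra-component values (HB numbers or lifetimes)
    ab     : inter-component value
    """
    if bb == 0 or (aa + bb) == 0:
        # Systems lacking an HB type would need separate handling.
        raise ValueError("ratios are undefined when B-B or A-A + B-B is zero")
    return {
        "A-A": aa,
        "B-B": bb,
        "A-B": ab,
        "A-A/B-B": aa / bb,               # imbalance between the two intra-component HBs
        "A-B/(A-A+B-B)": ab / (aa + bb),  # inter- relative to total intra-component HBs
    }
```

For instance, `hb_descriptors(30.0, 10.0, 60.0)` gives an intra-component ratio of 3.0 and an inter- to intra-component ratio of 1.5, the kind of pattern the analysis associates with DESs.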

3.1.2. Hydrogen bond lifetimes

We also analyzed the probability density distribution of the inter- and intra-component HB lifetimes of the 38 DES and 111 non-DES systems. Across the bins, we observed two distinct scenarios for DESs: ① dominant inter-component (A-B) HBs; and ② dominant intra-component HBs (A-A or B-B). This finding agrees with our previous work, in which we classified several known DESs into inter- or intra-dominant groups. In Fig. 4(a), the lifetimes of one intra-component HB type (A-A) are concentrated at 2.0-4.0 ns, while the B-B lifetimes are concentrated at 0.25-2.50 ns for DESs. The inter-component HB lifetimes (A-B) appear to skew to the right and to be longer than the intra-component HB lifetimes.

Fig. 4(b) shows that one of the intra-component HB lifetimes for non-DESs dominates in different bins, but there is no clear trend; for example, B-B dominates at lifetimes less than 1.25 ns, but A-A dominates at lifetimes greater than 3.00 ns. In each bin, the A-B lifetimes appear to be more dominant than one of the intra-component HB lifetimes, while being similar to the other intra-component HB lifetimes. The lack of a clear pattern means that actual intra- and inter-component HB lifetime features alone might not be enough to differentiate between DES and non-DES systems.

Some differences emerge when we plot the intra- and inter-component HB lifetime distributions as boxplots. As shown in Fig. 5(a), the DESs present a small difference between the median values for the inter-component (A-B) and one of the intra-component (A-A) lifetimes; the difference is wider between the median values of A-B and the other intra-component (B-B) lifetimes. The A-B lifetimes have a median of 2.67, which is 14% and 39% greater than those of the A-A and B-B lifetimes, respectively. As shown in Fig. 5(b), the non-DESs present a smaller difference between the median values for the inter- and intra-component HB lifetime values. The A-B lifetimes have a median of 2.72, which is only 3.6% and 14.0% greater than those of the A-A and B-B lifetimes, respectively. These differences indicate that the ratios of inter- to intra-component HB lifetimes could be more useful as features than the actual lifetimes.

The plot of A-A/B-B versus A-B/(A-A + B-B) in Fig. 5(c) confirms this hypothesis. The A-A median lifetimes are about 7% longer than the B-B lifetimes in DESs, compared with 13% for non-DESs. Even though there are more inter-component HBs than intra-component HBs, the intra-component HBs last longer. The median value of the A-B/(A-A + B-B) lifetime ratio is 0.63 for DESs and 0.53 for non-DESs. The ratio of inter- to intra-component HB lifetimes in DESs varies from 0.5 to 2.0, while most of the non-DESs have ratios of inter- to intra-component HB lifetimes clustered around 0.5. Similar to the HB numbers, ratios of the HB lifetimes might be more useful as features than the actual lifetime values.

3.2. Model development

We trained 30 models with ten algorithms (logistic regression, decision tree, gradient boost, AdaBoost, random forest, extra trees forest, support vector machine, k-nearest neighbors, XGBoost, and XGBoost-random forest) and three types of input features (HB number, HB lifetime, and a combination of HB number and lifetime features) to predict whether a system could be a DES. For each type of input feature, we used the five variables mentioned in Section 3.1: A-A, B-B, A-B, A-A/B-B, and A-B/(A-A + B-B). We trained each model for 100 rounds and calculated the average ROC-AUC values from each of the 100 rounds. The ROC is a probability curve, and the AUC represents the degree or measure of separability. The ROC-AUC shows how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting DES classes to be DESs and non-DES classes to be non-DESs. For each round, we randomly sampled 38 (30 for training, eight for testing) entries each from the DES and non-DES datasets. In each round, a six-fold grid search cross-validation was used for hyperparameter optimization, with the ROC-AUC as a metric. To ensure a fair comparison, each model was trained and tested with the same samples from the DES and non-DES datasets. Figs. S1-S11 in Appendix A show the variation in the ROC-AUC for each model during training.
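The ROC-AUC can be read as the probability that a randomly chosen DES receives a higher predicted score than a randomly chosen non-DES. The sketch below computes it directly from that pairwise definition, with ties counted as one half. It is an illustrative standard-library version, not the scikit-learn routine used in the study, and the function name and label encoding (1 for DES, 0 for non-DES) are assumptions.

```python
def roc_auc(labels, scores):
    """Compute the ROC-AUC from binary labels and predicted scores.

    labels : 1 for DES (positive), 0 for non-DES (negative)
    scores : predicted probability (or any ranking score) of being a DES
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

A value of 1.0 means every DES is ranked above every non-DES, 0.5 corresponds to random ranking, and values below 0.5 indicate a ranking that is systematically inverted.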

We ranked the models using two criteria: ① average ROC-AUC score (Table 1), and ② minimum ROC-AUC score (Table 2).

With an average ROC-AUC score of 0.70, the AdaBoost and extra trees forest classifiers were tied for the best performing models when trained with HB lifetime features. When trained with HB number features, XGBoost-random forest, random forest, and XGBoost were the top-performing models, with an average ROC-AUC of 0.82, 0.81, and 0.81, respectively. When HB number and lifetime features were combined, the top-performing models were the random forest and XGBoost-random forest classifiers, with both having an average ROC-AUC of 0.79. Overall, the top-performing models were the XGBoost-random forest and extra trees forest, based on the average and minimum ROC-AUC values, respectively.

The minimum ROC-AUC score in 100 training rounds could also be used to evaluate the performance of a model. Table 2 lists the minimum ROC-AUC scores for the 30 models. The minimum ROC-AUC scores trained with HB lifetime features alone range from 0.15 to 0.30, lower than those trained with HB number or the number and lifetime. Such observations indicate that the HB lifetime alone might not be sufficient to develop an ML model for classifying DES systems. Across all categories, the extra trees forest classifier had the highest minimum ROC-AUC score of 0.70 when trained with HB number.
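The two ranking criteria can be sketched as follows: the average rewards typical performance across rounds, while the minimum penalizes models that occasionally fail badly. The function names are hypothetical, and the scores in the usage note are made-up numbers, not values from Tables 1 and 2.

```python
from statistics import mean


def summarize_rounds(scores_by_model):
    """Return the mean and minimum ROC-AUC over rounds for each model."""
    return {
        name: {"mean": mean(scores), "min": min(scores)}
        for name, scores in scores_by_model.items()
    }


def rank_models(summary, criterion):
    """Return model names sorted from best to worst by 'mean' or 'min'."""
    return sorted(summary, key=lambda name: summary[name][criterion], reverse=True)
```

For example, with hypothetical per-round scores `{"model_a": [0.9, 0.5, 0.7], "model_b": [0.7, 0.65, 0.69]}`, `model_a` ranks first by the average but `model_b` ranks first by the minimum, which shows why the two criteria can favor different models.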

Some algorithms were among the top performers regardless of the criteria used for model selection. For models trained with HB numbers, the top-performing model was the extra trees forest, based on the minimum ROC-AUC, and this model was only slightly behind the XGBoost-random forest when judged by the average ROC-AUC score. For models trained with HB lifetime features, the extra trees forest and the AdaBoost were the top performers using either the average ROC-AUC scores or the highest minimum ROC-AUC scores. Among the models trained with the combined HB number and lifetime features, the extra trees forest classifier was the top performer using the highest minimum ROC-AUC score or the average ROC-AUC score. However, it should be noted that the stellar performance observed during training does not necessarily translate into excellence in the validation stage, as will be seen in the next section.

3.3. Model validation with experimental results

We validated the 30 trained models using 34 experimental results (17 DES and 17 non-DES systems). The results are presented in Table 3.

For models trained with the HB lifetime features, the XGBoost-random forest, logistic regression, and extra trees forest were the top performers, with ROC-AUC values of 0.68, 0.65, and 0.65, respectively, during validation. Support vector machine, extra trees forest, and gradient boost were the top performers, with ROC-AUC values of 0.80, 0.79, and 0.77, respectively, when the models were trained with the HB number features. Extra trees forest, logistic regression, and gradient boost were the top-performing models, with ROC-AUC values of 0.88, 0.84, and 0.81, respectively, among the models trained with both HB number and lifetime. In general, the ensemble algorithms (bagging and boosting) were observed to perform well. Bagging algorithms such as random forest, extra trees forest, and decision tree build and train independent estimators, and then average the independent predictions of these estimators to make a final prediction. This can help reduce variance in predictions and increase accuracy. The boosting algorithms (XGBoost, XGBoost-random forest, AdaBoost, and gradient boost), on the other hand, train several estimators sequentially. Each estimator focuses on reducing the errors of the previous estimator, and this typically reduces bias.

Fig. 6 presents confusion matrices for the top-performing models under each of the three input feature categories during validation. The confusion matrices present true positives, true negatives, false positives, and false negatives for each model’s predictions. In this study, DESs are positives, while non-DESs are negatives. The sensitivity measures how many DESs were correctly predicted to be DESs, while the specificity measures how many non-DESs were correctly predicted to be non-DESs by a model. Some models were better at predicting DESs (high sensitivity), while some were better at predicting non-DESs (high specificity).

XGBoost-random forest was the top-performing algorithm among the models trained with HB lifetime features. It performed best at predicting which systems were DESs, as shown by its high sensitivity of 0.82 (Fig. 6(a)), but it was not good at predicting which systems were non-DESs (with a low specificity of 0.47, Fig. 6(a)). Among the models trained with HB number features, the support vector machine was the top-performing algorithm. It had a specificity of 0.88 (Fig. 6(c)), which means that it performed best at predicting which systems were non-DESs. Its low sensitivity of 0.35 (Fig. 6(c)) means that it was not good at predicting DESs. When models were trained with combined HB lifetime and number features as inputs, the extra trees forest model performed best. It had a sensitivity of 0.76 (Fig. 6(e)), indicating it was among the top performers at predicting which systems were DESs. It had a specificity of 0.94 (Fig. 6(e)), indicating it was the top performer at predicting which systems were non-DESs. Relative to the top-performing models in other input feature categories, the extra trees forest algorithm was the best overall at predicting DESs and non-DESs. The confusion matrices for all models are shown in Figs. S18-S20 in Appendix A.
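Sensitivity and specificity follow directly from the confusion matrix counts, as in the sketch below. The function names are hypothetical. The counts in the usage note (13 of 17 DESs and 16 of 17 non-DESs correctly classified) are back-calculated from the reported rates of 0.76 and 0.94 with 17 systems per class; they are an inference, not values read from Fig. 6.

```python
def confusion_counts(labels, predictions):
    """Count (tp, fn, tn, fp) with DES = 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity).

    sensitivity : fraction of DESs correctly predicted as DESs
    specificity : fraction of non-DESs correctly predicted as non-DESs
    """
    return tp / (tp + fn), tn / (tn + fp)
```

With the inferred counts, `sensitivity_specificity(13, 4, 16, 1)` gives values that round to 0.76 and 0.94, matching the rates reported for the extra trees forest model.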

3.4. Prediction probabilities

Prediction probabilities are useful indicators of how well each model separates DESs and non-DESs. A model with good separation capability would have all its non-DES predictions with the probability of being DES <0.5 and as close to 0 as possible, and its DES predictions with the probability of being DES >0.5 and as close to 1 as possible. Fig. 7 presents the distribution of prediction probabilities for the best models during validation. It can be seen from the probabilities in Fig. 7(a) that the predictions of the XGBoost-random forest are closely distributed around 0.49 to 0.51, suggesting that there is not much separation for models trained with HB lifetime features. Notably, all of the XGBoost-random forest’s 14 DES predictions made with confidence >0.5 were correct. The separation improves in Fig. 7(b), with the probabilities distributed around 0.46 to 0.54, suggesting that HB number features helped the models detect non-DESs relatively better than HB lifetimes alone. This is backed up by the observation that all 15 non-DES predictions made by the support vector machine model with the probability of being DES <0.5 were correct. The probabilities are distributed between 0.30 and 0.70 in Fig. 7(c), indicating better confidence in the extra trees forest model’s predictions when HB number and lifetime features were combined as inputs. The extra trees forest model shows better separation in its classifications and is relatively more confident in its non-DES predictions, and this is backed up by its specificity of 0.94 (Fig. 6(e)). It got only one non-DES prediction wrong. The prediction probabilities for all the other models are shown in Figs. S21-S23 in Appendix A.

It is useful to have some insight into which input features carry the most weight when the ML models make predictions. Fig. 8 shows how the models ranked the importance of input features. Models that were trained with HB lifetime features alone overwhelmingly ranked the ratio of inter- to intra-component HB lifetime as the most important feature for predictions, followed by the inter-component HB lifetime. When trained with HB number features alone, the models ranked the inter-component HB number as the most important feature; however, the ratio of inter- to intra-component HB numbers was not far behind in second place. When number and lifetime features were combined, the trained models ranked the inter-component HB number as the most important feature, closely followed by the ratio of inter- to intra-component HB lifetimes.
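One common way to obtain such a ranking from a tree ensemble is sketched below. It assumes scikit-learn's `ExtraTreesClassifier` and its impurity-based `feature_importances_` attribute; the importance measure behind Fig. 8 is not stated in this section, so this is an assumption. The descriptor names are hypothetical and the data are synthetic, not the MD-derived features.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_systems = 200

# Hypothetical descriptor names, loosely mirroring the combined feature set.
feature_names = [
    "inter_hb_number",
    "inter_intra_hb_number_ratio",
    "inter_hb_lifetime",
    "inter_intra_hb_lifetime_ratio",
]

# Synthetic data: random descriptors, with the label driven mainly by the
# first and last columns. These are NOT the paper's simulation results.
X = rng.normal(size=(n_systems, len(feature_names)))
noise = 0.1 * rng.normal(size=n_systems)
y = (X[:, 0] + 0.8 * X[:, 3] + noise > 0).astype(int)

model = ExtraTreesClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = model.feature_importances_
ranking = sorted(zip(feature_names, importances),
                 key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can be biased toward features with many distinct values, so a permutation-based check on held-out data is a reasonable complement.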

4. Conclusions

We analyzed the HB features of 38 known DES and 111 known non-DES systems using MD simulation trajectories. The statistical analysis of inter- and intra-component HB numbers and lifetimes revealed two unique features of DES systems in comparison with non-DES systems: The DESs exhibited a greater imbalance between the two intra-component HB numbers, and more and stronger inter-component HBs. We then developed 30 ML models by training ten algorithms on three types of input features and validated the models using 17 DES and 17 non-DES systems that had been experimentally verified. Using the two criteria of highest average and highest minimum ROC-AUC scores, we found the logistic regression, gradient boost, support vector machine, and extra trees forest models among the top performers when trained using the HB lifetime, number, and combined lifetime and number features. When tested against the experimental validation data, the extra trees forest classifier was the top-performing model overall, with an ROC-AUC of 0.88 when the HB number and lifetime features were combined as inputs. Intuitively, it makes sense that models would perform better when given information about both how many HBs form and how long those HBs last. All models ranked the inter-component HB number and lifetime and the ratios of inter- to intra-component HB number and lifetime as the most important features for classifying a system as a DES or not.
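The two selection criteria mentioned above, highest average and highest minimum ROC-AUC, can be illustrated with a short sketch. The pairwise-ranking formula for ROC-AUC is standard; the function names, scores, and per-split values below are hypothetical and are not results from this work.

```python
from statistics import mean
from typing import Dict, List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins; ties contribute half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_models(split_aucs: Dict[str, List[float]]) -> Dict[str, str]:
    """Pick the best model under the average and the minimum criterion."""
    return {
        "best_average": max(split_aucs, key=lambda m: mean(split_aucs[m])),
        "best_minimum": max(split_aucs, key=lambda m: min(split_aucs[m])),
    }


# Hypothetical scores for four systems: one DES ranks below one non-DES.
auc = roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])

# Hypothetical per-split ROC-AUC values for two models.
best = rank_models({
    "model_a": [0.95, 0.90, 0.60],
    "model_b": [0.82, 0.80, 0.78],
})
print(auc, best)
```

In this toy case the two criteria disagree: the first model has the higher average, while the second has the higher worst-case score, which is why reporting both gives a fuller picture of model robustness.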

DESs are promising solvents that hold huge potential. Due to the sheer size of the candidate pool, it is important to have models that can accurately predict which compounds will or will not form DESs when mixed. The purpose of the ML models developed in this work was to determine whether a binary system could be a DES based on MD simulation data. These ML models could assist in DES research by accelerating the discovery of new DES candidates. Our work sheds light on which compounds are likely to form DESs but does not suggest what their physicochemical properties are likely to be. In the future, more work needs to be done to predict which compounds will form DESs with application-specific properties.

Acknowledgments

This work was supported by Ignite Research Collaborations (IRC), startup funds, and the UK Artificial Intelligence (AI) in Medicine Research Alliance Pilot (NCATS UL1TR001998 and NCI P30 CA177558). We thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for the use of the Lipscomb Compute Cluster.

Compliance with ethics guidelines

Usman L. Abbas, Yuxuan Zhang, Joseph Tapia, Selim Md, Jin Chen, Jian Shi, and Qing Shao declare that they have no conflict of interest or financial conflicts to disclose.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2023.10.020.

