Information on the physicochemical properties of chemical species is an important prerequisite when performing tasks such as process design and product design. However, the lack of extensive data and high experimental costs hinder the development of prediction techniques for these properties. Moreover, accuracy and predictive capabilities still limit the scope and applicability of most property estimation methods. This paper proposes a new Gaussian process-based modeling framework that aims to manage a discrete and high-dimensional input space related to molecular structure representation with the group-contribution approach. A warping function is used to map discrete input into a continuous domain in order to adjust the correlation between different compounds. Prior selection techniques, including prior elicitation and prior predictive checking, are also applied during the building procedure to provide the model with more information from previous research findings. The framework is assessed using datasets of varying sizes for 20 pure component properties. For 18 out of the 20 pure component properties, the new models are found to give improved accuracy and predictive power in comparison with other published models, with and without machine learning.
Xinyu Cao, Ming Gong, Anjan Tula, Xi Chen, Rafiqul Gani, Venkat Venkatasubramanian.
An Improved Machine Learning Model for Pure Component Property Estimation.
Engineering, 2024, 39(8): 65-78. DOI: 10.1016/j.eng.2023.08.024
The ability to predict the behavior of a chemical substance mainly depends on knowledge of its physicochemical properties. Thus, data on pure component properties are an important prerequisite for efficiently and consistently performing process and product design tasks [1,2]. In process design, the conditions (temperature, pressure, etc.) at which the properties (solubility, vapor pressure, etc.) of the identified chemical compounds match the design objectives are determined; in product design, chemical compounds with desirable target properties (boiling point, critical pressure, etc.) are chosen [3]. However, while the lack of data on pure component properties is a critical limitation for modeling the required properties, experimental observations can be demanding, expensive, and sometimes not even feasible [4]. Thus, there is a need for accurate and robust property prediction models. Recent advances in computer software for statistical analysis, which can fill the gaps of incomplete measured property data [5], need to be employed in this endeavor. A class of pure component properties, namely the primary properties, is known to be related to the structure of the molecules [6]. This article focuses on the modeling of the primary pure component properties of organic chemicals using their structural molecular information.
There has been rapid development in pure compound property prediction research in the past few decades. Mathematical models, ranging from simple polynomial functions to very large sets of differential-algebraic systems, are employed to estimate the desired pure compound properties [3]. Traditional methods include the group contribution (GC) method [7], quantitative structure-property relationship (QSPR) modeling [8], ab initio quantum-mechanics-based methods [9], and so forth. The GC method is the most widely used for the estimation of primary pure compound properties: the property of a component is determined as a function of the contributions of the functional groups representing the molecule [10]. Its quick estimation without requiring substantial computational resources and its ease of incorporation into other models have accelerated the application of the GC method in fields such as toxicity prediction [11], melting-point estimation [12], and biomass conversion processing [13]. On the other hand, low prediction accuracy has limited the use of GC models, generally to instances where qualitative estimates of the properties suffice [7]. Consequently, different approaches have been proposed to improve the performance of GC models: higher-level descriptors can reveal more structural information through second-order functional groups [10,14], and models have been created to represent missing GCs [1,7,15].
A review of the recent literature shows an increasing trend in the use of machine learning-based models for property estimation [16] and in artificial intelligence (AI)-based techniques to identify potential molecular structures with promising properties [17]. Venkatasubramanian and Mann [18] used AI in reaction prediction and chemical synthesis. Also, in a recent perspective paper, Mann et al. [19] highlighted the use of property prediction in chemical product design in the era of AI. Machine learning-based methods, such as neural networks and random forest algorithms, play a major role in property estimation using different molecular descriptors. Many of them have significant advantages over traditional modeling techniques, including flexibility, accuracy, and execution speed [20]. Their practicality has been demonstrated in reducing the computational cost associated with quantum mechanics/molecular mechanics calculations [21], in novel QSPR methods [22], and so forth. For illustration, Zhou et al. [23] treated simplified molecular-input line-entry system (SMILES) notation as a sentence and used natural language processing technologies for molecular information mining and the exploration of chemical properties. Zhang et al. [24] developed an accurate and interpretable deep neural network (DNN) model for property prediction. Wen et al. [25] proposed a systematic methodology that combines multiple machine learning technologies to address crucial issues such as the applicability domain and prediction uncertainty in DNN-based QSPR modeling. Moreover, in the realm of functional group representation, many machine learning models have better estimation and predictive capabilities at the cost of greater computational resources, which has expanded the scope of GC-based applications. For illustration, Paduszyński and Domańska [26] employed a two-layer feed-forward artificial neural network based on GC-type molecular descriptors, which proved to be the best GC model for the viscosity of ionic liquids (ILs) described in the literature at that time. Li et al. [27] developed regression models to predict fuel ignition quality with 23 machine learning algorithms via the MATLAB regression learner module, achieving high precision. Among available machine learning models, it is noteworthy that the Gaussian process (GP) [28], which naturally provides confidence intervals, is a widely used method for property prediction. It posits a prior belief over the possible objective function and, during training, iteratively refines the model by updating the Bayesian posterior conditioned on the observed data [29]. Confidence intervals have facilitated the adoption of GPs in many fields, including safety-critical settings [30,31], uncertainty forecasting [32,33], and Bayesian optimization [34,35]. Another advantage of the GP is that fewer decisions (e.g., architecture and learning rate) need to be made before modeling [28]. Given these prominent features, Alshehri et al. [36] developed a next-generation suite of GP-based pure component property models for 25 properties of organic chemicals, which achieved higher accuracy than simple GC-based models.
Although practitioners frequently use various prediction methods, several challenges remain in developing property models. In most cases, the simple GC-based models result in less accurate performance, with an average error threshold of around 10% [36]. For machine learning methods, the extrapolative ability outside the training set is still limited, despite a smaller error threshold. Part of the reason lies in the information gap between the original compound and its mathematical representation by molecular descriptors. Furthermore, a rigid application of models without considering the data characteristics may lead to suboptimal prediction results. Although GP-based pure component property models have been developed with higher accuracy than traditional ones, certain techniques can further improve the GP model. The functional group input space is discrete and high-dimensional, and mature modeling methods exist that can exploit this structure. Machine learning models with better accuracy and extrapolative ability can be built by considering both the features of the input space and the findings of previous research.
This paper proposes a new framework for property prediction under the functional group representation, based on the GP. In particular, the high-dimensional and discrete input space is handled with a warping function, and prior information is elicited and checked under scrutiny. The rest of this paper is organized as follows. Section 2 provides some of the basic concepts related to the GC method, as well as the latest machine learning approaches. Section 3 introduces the full model structure, which is applied to pure component regression in Section 4. The contributions of the individual techniques (i.e., the warping function, prior elicitation, and prior predictive checking) are successively analyzed for pure component regression, and models built with other mainstream machine learning methods are also listed and compared. Section 5 highlights the main contributions of this paper and gives brief insights into future research.
2. Property modeling basics
2.1. The GC method
In GC-based models, a compound's property is estimated as a function of the contributions of different groups that represent the molecular structure. Molecular representation by GC provides the important advantage of quick estimates without requiring substantial computational resources [10]. Here, the GC representation of the compound methoxychlor is illustrated in Fig. 1. The molecular structure of methoxychlor is represented with a 424-dimensional vector, divided into three group orders. In this case, the second dimension with a value of "2" corresponds to the fact that the functional group methyl appears twice in methoxychlor. The values for the other dimensions are obtained similarly, resulting in a vector containing only integers. It should be noted that only the properties of simple and monofunctional molecules can be accurately predicted using first-order groups alone. This is because the first-order groups do not capture the proximity effect (they do not consider the interactions between different groups in a molecule). However, the higher (second and third) levels provide polyfunctional and structural groups that give more information about the molecular structure of more complex compounds. The atomic balance, however, is ensured through the first-order groups.
In the traditional GC framework, the property estimation model takes the form of Eq. (1):

$$\hat{y} = C_0 + \sum_{i=1}^{N_\mathrm{F}} n_i C_i + \sum_{j=1}^{N_\mathrm{S}} m_j D_j + \sum_{k=1}^{N_\mathrm{T}} o_k E_k \quad (1)$$

where $C_i$, $D_j$, and $E_k$ represent the first-, second-, and third-order contributions, respectively; $n_i$, $m_j$, and $o_k$ represent the occurrence numbers of the functional groups (together forming the GC vector); $C_0$ is the intercept; and $\hat{y}$ is the estimate of the property value. $i$, $j$, and $k$ are the indices for the first-, second-, and third-order contributions, respectively, and $N_\mathrm{F}$, $N_\mathrm{S}$, and $N_\mathrm{T}$ represent the total numbers of groups in the three orders. Hukkerikar et al. [37] proposed two approaches (simultaneous and sequential) for optimizing the contribution coefficients, which differ in the order of coefficient estimation: the simultaneous approach estimates all parameters in a single step, whereas the sequential approach uses the second- and third-order terms to reduce the residual of the first order, step by step.
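To make Eq. (1) concrete, the following minimal Python sketch evaluates the GC sum for a toy molecule; the contribution values and group counts are hypothetical placeholders, not fitted coefficients from this work.

```python
import numpy as np

# Hypothetical toy model: intercept plus three group contributions
# (values are illustrative, not fitted coefficients).
C0 = 198.2                                   # intercept of Eq. (1)
contributions = np.array([23.6, 22.9, 7.4])  # first-/second-/third-order terms
counts = np.array([2, 1, 0])                 # group occurrences in the molecule

# Eq. (1): property estimate = intercept + sum(occurrence * contribution)
y_hat = C0 + counts @ contributions
print(f"estimated property value: {y_hat:.1f}")
```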
2.2. The support-vector regression (SVR) model
The SVR model [38] is a supervised learning model based on the support-vector machine (SVM). The most distinct difference between SVR with a linear kernel and linear regression is that the former ignores errors as long as they are smaller than a certain positive number $\varepsilon$. This feature makes the model less sensitive to noise in the data and thus more robust.
In this paper, the coefficients for the GCs in the model are regressed by SVR with a linear kernel (referred to herein as "the SVR model") through both the simultaneous and sequential approaches introduced in Section 2.1. The hyperparameters of the SVR model include the penalty $C$ (the proportionality coefficient of the violation condition) and the precision $\varepsilon$ (the maximum tolerated deviation), as shown in Eq. (2):

$$\begin{aligned} \min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^*} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\left(\xi_i + \xi_i^*\right) \\ \text{s.t.} \quad & y_i - \mathbf{w}^\top\mathbf{x}_i - b \le \varepsilon + \xi_i \\ & \mathbf{w}^\top\mathbf{x}_i + b - y_i \le \varepsilon + \xi_i^* \\ & \xi_i,\ \xi_i^* \ge 0, \quad i = 1, \dots, N \end{aligned} \quad (2)$$
where $\xi_i$ and $\xi_i^*$ are two slack variables dealing with points outside the $\varepsilon$-precision bound; $\mathbf{x}_i$ represents the $i$th molecule and $y_i$ represents its measured result; $\mathbf{w}^\top\mathbf{x}_i + b$ represents either the prediction result for the simultaneous method or the first-/second-/third-order prediction terms for the sequential method; $\mathbf{w}$ (which corresponds to the contributions in Eq. (1)) is the contribution vector; and $N$ represents the number of data points. In this work, hyperparameter tuning is implemented by means of the grid search method.
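As a sketch of this step, the snippet below fits an SVR with a linear kernel and grid-searches $C$ and $\varepsilon$; scikit-learn's GridSearchCV is used here for brevity (the paper performs the grid search with the skopt sampler), and the data are synthetic stand-ins for GC vectors and measured properties.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 424)).astype(float)           # toy GC vectors
y = X @ rng.normal(size=424) + rng.normal(scale=0.1, size=200)  # synthetic property

# Grid search over the penalty C and the precision epsilon of Eq. (2)
grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="linear"), grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X, y)

svr = search.best_estimator_
contributions = svr.coef_.ravel()  # fitted group contributions (cf. Eq. (1))
print(search.best_params_, contributions[:5])
```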
2.3. The GP model
The GP is a stochastic process in which every finite collection of random variables has a multivariate Gaussian distribution. In the most common cases, the GP prior satisfies Eq. (3). With Bayesian inference, the posterior obtained at an unobserved point also follows a normal distribution, whose mean and variance are shown in Eqs. (4) and (5), respectively:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \mathbf{K}(X, X) + \sigma_n^2\mathbf{I} & \mathbf{K}(X, \mathbf{x}_*) \\ \mathbf{K}(\mathbf{x}_*, X) & \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix}\right) \quad (3)$$

$$\bar{f}_* = \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{y} \quad (4)$$

$$\operatorname{cov}(f_*) = \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}(X, \mathbf{x}_*) \quad (5)$$

where $X$ and $\mathbf{x}_*$ correspond to the input variables in the training and testing sets, respectively; $\mathbf{y}$ is the true output value for $X$; $f_*$ is the predicted output for $\mathbf{x}_*$, whose mean is $\bar{f}_*$ and variance matrix is $\operatorname{cov}(f_*)$; $\sigma_n$ denotes the measurement noise; $\mathbf{I}$ indicates the identity matrix; and $\mathcal{N}$ represents the joint normal distribution. $\mathbf{K}$ is the matrix composed of the values calculated by the kernel function [39], whose element in the $i$th row and $j$th column is equal to the value of the kernel function for the $i$th and $j$th inputs, as shown in Eq. (6):

$$\mathbf{K}_{ij} = k(\mathbf{x}_i, \mathbf{x}_j) \quad (6)$$
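The posterior formulas of Eqs. (4)-(6) translate directly into code. The sketch below is a minimal NumPy implementation with a squared exponential kernel; the kernel choice and hyperparameter values are illustrative assumptions rather than the tuned settings of this work.

```python
import numpy as np

def sq_exp_kernel(A, B, length=1.0, sigma_f=1.0):
    """Kernel matrix of Eq. (6) for a squared exponential kernel."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-0.5 * d2 / length**2)

def gp_posterior(X, y, X_star, sigma_n=0.1):
    """Posterior mean (Eq. (4)) and covariance (Eq. (5)) at X_star."""
    K = sq_exp_kernel(X, X) + sigma_n**2 * np.eye(len(X))
    K_star = sq_exp_kernel(X_star, X)
    mean = K_star @ np.linalg.solve(K, y)
    cov = sq_exp_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```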
The kernel function indicates the degree of correlation between the two variables and should be carefully selected. Alshehri et al. [36] combined four exponential kernels for property prediction under a GP (herein referred to as "the GP model"); the resulting kernel function takes the form of Eq. (7).
where $\mathbf{x}$ and $\mathbf{x}'$ are two 424-dimensional vectors representing two compounds under the GC framework, and $\boldsymbol{\theta}$ denotes the hyperparameters that need to be tuned for the GP model, which are vital for model performance. Commonly used hyperparameter optimization methods include maximum log-likelihood and cross-validation.
3. An improved machine learning model based on GP
Molecules represented by functional group descriptors are characterized by high dimensionality and all-integer elements. Although the GP model has remarkably improved estimation performance on the whole dataset [36], its predictive capability can be further improved. This section proposes an improved GP model, with the aim of dealing with a high-dimensional discrete input space and thereby further improving model performance.
3.1. A warping function for discrete input
A categorical variable with ordered levels is called an ordinal, whose levels can be regarded as a discretization of a continuous space [40]. GP$_Z$ is defined as an ordinal GP with the integer vector $\mathbf{z}$ as the input variable, while GP$_Y$ is an ordinary continuous GP. GP$_Z$ can be transformed into GP$_Y$ in the continuous domain through a nondecreasing function $\psi$ [41] (also called a warping function), as given by Eq. (8):

$$f_Z(\mathbf{z}) = f_Y(\psi(\mathbf{z})) \quad (8)$$

Consequently, the kernel function for the ordinal input can be transformed as shown in Eq. (9):

$$k_Z(\mathbf{z}, \mathbf{z}') = k_Y\left(\psi(\mathbf{z}), \psi(\mathbf{z}')\right) \quad (9)$$
where $\mathbf{z}$, as well as $\mathbf{z}'$, is an ordinal input with different levels, and $k_Y$ is the kernel function in GP$_Y$. The kernel indicates the degree of correlation between the two variables in a GP. Many commonly used kernel functions depend only on the distance between two variables rather than on the variables themselves. Thus, the correlation between molecules having 0 and 1 of a certain functional group is the same as that between molecules having 19 and 20 (as both distances are "1"). Nevertheless, intuitively, the correlation between two ordinals that represent quantities ought to be related to their own numerical values. This can be seen in the case of GC, as the correlation between molecules having 0 and 1 of a certain functional group should be less than that between molecules having 19 and 20 (the warped distance of the first pair should be larger). Therefore, the form of the warping function is defined as Eq. (10).
Fig. 2 demonstrates how the warping function works with a one-dimensional ordinal. Here, the GC representation and the popular exponential kernel are used as examples. Pair 1 involves molecules with 0 and 1 of a certain functional group, while pair 2 comprises molecules with 19 and 20 of a certain functional group. In the case where discrete variables are directly fed into the kernel, the correlation is the same for both pairs. However, after adding the warping function, pair 1 is less correlated, satisfying the model's need. When a compound is represented with a 424-dimensional vector, Eq. (10) is applied elementwise: the $i$th element of the warped vector is calculated from the $i$th-dimensional component of $\mathbf{z}$.
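Because the exact form of Eq. (10) is defined in the paper itself, the sketch below substitutes a concave logarithmic warp purely for illustration; it reproduces the qualitative behavior described above, where the pair (0, 1) becomes less correlated than the pair (19, 20) under an exponential kernel.

```python
import numpy as np

def exp_kernel(u, v, length=1.0):
    """One-dimensional exponential kernel."""
    return np.exp(-np.abs(u - v) / length)

def warp(z, a=1.0):
    """Nondecreasing concave warp; an assumed stand-in for Eq. (10)."""
    return np.log(z + a)

for z1, z2 in [(0, 1), (19, 20)]:
    raw = exp_kernel(z1, z2)
    warped = exp_kernel(warp(z1), warp(z2))
    print(f"pair ({z1:2d},{z2:2d}): raw corr = {raw:.3f}, warped corr = {warped:.3f}")
# Raw correlations are identical (both distances are 1), whereas after
# warping, pair (0,1) is farther apart and hence less correlated.
```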
To better handle the change in the degree of correlation for discrete variables, the parameter $a$ in the warping function and the parameters induced by the kernel function are regarded as hyperparameters and are tuned by cross-validation or maximum log-likelihood in GP$_Y$.
3.2. Prior elicitation for a property prediction model
Strategies for prior elicitation include asking a panel of experts for advice and deriving priors from sample data [42]. Many scholars have studied the prior elicitation of GPs in different scenarios [43-45]. Most often, the prior mean of the GP is set to zero when certain information or experience is lacking, whereas a good prior elicitation procedure grants the model better performance.
Dimensional information should be added into the GP model either through the kernel or through the prior, as it plays an important role under the GC representation. The squared exponential kernel can be parameterized in terms of hyperparameters, as shown in Eq. (11):

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left(-\frac{1}{2}\left(\mathbf{x} - \mathbf{x}'\right)^\top \mathbf{M} \left(\mathbf{x} - \mathbf{x}'\right)\right) \quad (11)$$

where $\mathbf{M}$ denotes a symmetric matrix. Possible choices for the matrix include [28]

$$\mathbf{M}_1 = \ell^{-2}\mathbf{I}, \qquad \mathbf{M}_2 = \operatorname{diag}(\boldsymbol{\ell})^{-2} \quad (12)$$

where $\ell$ and the elements of $\boldsymbol{\ell}$ are parameters of the kernel function. It should be noted that, although the form of $\mathbf{M}_1$ appears to be unrelated to the output space, the hyperparameter $\ell$ will largely depend on the output values during the model training process, thus making the GP model more flexible when predicting multiple properties. While $\mathbf{M}_2$ significantly increases the number of hyperparameters, making hyperparameter tuning intractable, $\mathbf{M}_1$ fails to present the dimensional information. For models whose inputs are high-dimensional and carry linear characteristics or meaning, such as a GC model, the prior mean can instead be set to a linear combination of the input space that provides the dimensional information. Under this framework, the GP model turns into Eq. (13), and the mean and variance of the posterior correspond to Eqs. (14) and (15), respectively [46]:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m_\mathrm{SVR}(\mathbf{x}), k(\mathbf{x}, \mathbf{x}')\right) \quad (13)$$

$$\bar{f}_* = m_\mathrm{SVR}(\mathbf{x}_*) + \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\left(\mathbf{y} - m_\mathrm{SVR}(X)\right) \quad (14)$$

$$\operatorname{cov}(f_*) = \mathbf{K}(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}(\mathbf{x}_*, X)\left[\mathbf{K}(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}(X, \mathbf{x}_*) \quad (15)$$

where $\mathbf{y}$ is the true output value matching $X$, $f_*$ is the predicted output matching $\mathbf{x}_*$, and $m_\mathrm{SVR}(\cdot)$ is the linear model output obtained through the SVR technique.
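A minimal sketch of Eqs. (13)-(15), reusing the gp_posterior helper from the sketch in Section 2.3: the GP is fitted to the residuals of a linear prior mean, and the prior mean is added back at prediction time. The prior_fn callable stands in for the trained SVR model.

```python
def gp_with_linear_prior(X, y, X_star, prior_fn, sigma_n=0.1):
    """GP posterior with a non-zero prior mean, Eqs. (13)-(15).

    prior_fn: callable giving the prior-mean prediction, e.g. the output
    of the trained linear SVR model (a stand-in assumption here).
    """
    residual = y - prior_fn(X)                # GP models the SVR residual
    mean_res, cov = gp_posterior(X, residual, X_star, sigma_n)
    return prior_fn(X_star) + mean_res, cov   # add prior mean back, Eq. (14)
```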
3.3. Prior predictive checking for the GP
Prior distributions based on data analysis or expert experience cannot be inherently wrong if the prior elicitation procedure is valid. However, even in the case of a valid prior elicitation procedure, it is essential to check whether the prior generates unreasonable data [42]. Many methods have been proposed for prior predictive checking, such as the prior predictive p-value [47] and Bayes factors [48]. Although the SVR model is added as the prior of the GP to provide dimensional information, it does not necessarily outperform the GP with a zero prior. Therefore, prior predictive checking must be conducted before GP modeling. As only the training set can be used and cross-validation is conducted for hyperparameter tuning, the average cross-validation loss over the different folds is used to compare the zero and non-zero priors. Considering that the SVR model is trained on the training set, which gives the prior predictive checking process for models with an SVR prior an inherent advantage, the average cross-validation loss with a non-zero prior is multiplied by a penalty factor. The prior selection criterion is given in Eq. (16):

$$m = \begin{cases} m_\mathrm{SVR}, & \text{if } (1 + \gamma)L_\mathrm{SVR} < L_0 \\ 0, & \text{otherwise} \end{cases} \quad (16)$$
where $m$ is the prior of the final GP model; $L_\mathrm{SVR}$ and $L_0$ are the average cross-validation losses on the training set with the SVR prior and the zero prior, respectively; and $\gamma$ is the penalty.
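A sketch of the selection rule, assuming the average cross-validation losses for the two candidate priors have already been computed; the (1 + γ) penalty factor reflects the multiplicative penalty described above.

```python
def select_prior(loss_svr, loss_zero, gamma=0.05):
    """Prior predictive check of Eq. (16): the SVR prior must beat the
    zero prior even after its cross-validation loss is penalized."""
    return "svr" if (1 + gamma) * loss_svr < loss_zero else "zero"

print(select_prior(0.90, 1.00))  # -> svr
print(select_prior(0.97, 1.00))  # -> zero (penalized loss exceeds the zero-prior loss)
```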
3.4. An integrated modeling structure
Based on the warping function, prior elicitation, and prior predictive checking, the final GP model (referred to herein as the "GP-WP") is built for discrete input with high dimensions. Eqs. (17) and (18) present a mathematical representation of the model:

$$f(\mathbf{x}) \sim \mathcal{GP}\left(m_\mathrm{SVR}(\mathbf{x}),\ k_Y\left(\psi(\mathbf{x}), \psi(\mathbf{x}')\right)\right) \quad (17)$$

$$\left(\mathbf{K}_W\right)_{ij} = k_Y\left(\psi(\mathbf{x}_i), \psi(\mathbf{x}_j)\right) \quad (18)$$

where $\hat{y}$ is the predicted result, and $\mathbf{K}_W$ is the matrix composed of the values calculated by the warping and kernel functions, whose element in the $i$th row and $j$th column is equal to the value of the kernel function obtained through Eq. (9) for the $i$th and $j$th inputs. The parameters of the SVR prior are obtained with Eq. (2). Finally, Eqs. (19) and (20) give the prediction value and its uncertainty:

$$\hat{y} = m_\mathrm{SVR}(\mathbf{x}_*) + \mathbf{K}_W(\mathbf{x}_*, X)\left[\mathbf{K}_W(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\left(\mathbf{y} - m_\mathrm{SVR}(X)\right) \quad (19)$$

$$\operatorname{cov}(f_*) = \mathbf{K}_W(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{K}_W(\mathbf{x}_*, X)\left[\mathbf{K}_W(X, X) + \sigma_n^2\mathbf{I}\right]^{-1}\mathbf{K}_W(X, \mathbf{x}_*) \quad (20)$$
The full model-building procedure is shown in Fig. 3. Before training, the dataset is divided into training and testing sets. The next stage involves transforming the discrete GP$_Z$ into the continuous GP$_Y$ through the warping function given above. Then, the SVR model is fitted on the training set. Based on the SVR model, cross-validation is conducted to tune the hyperparameters, and prior predictive checking is carried out using Eq. (16) for the models with zero and non-zero priors. Finally, the GP model is built with the prior information and with the kernel processed by the warping function. The hyperparameters include the length scales of the exponential kernel function, the warping parameter $a$, and the noise $\sigma_n$.
Fig. 4 depicts the structure of the machine learning model for predicting the properties of a new molecule. In step 1, the molecular formula of the new compound is required in order to obtain its GC representation. Next, in step 2, the integer vector is transformed into the continuous domain using the warping function, whose parameter is determined during the training process of the GP. In step 3, two covariance matrices are generated: one captures the correlation between the different molecules in the training dataset, while the other represents the correlation between the new molecule and the training molecules. The prediction value is then calculated using the formulas inferred through Bayesian statistics. It is important to note that the hyperparameters in the kernels and formulas are fixed after the training phase.
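Putting the pieces together, a hypothetical end-to-end prediction step for a new molecule could look as follows, reusing warp, gp_posterior, and the trained svr from the earlier sketches; the function names and settings come from those illustrative sketches, not from the paper's released code.

```python
import numpy as np

def predict_new_molecule(x_new, X_train, y_train, svr, a=1.0, sigma_n=0.1):
    """Steps 1-3 of Fig. 4: GC vector -> warped domain -> GP posterior."""
    X = np.asarray(X_train, dtype=float)
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    residual = y_train - svr.predict(X)       # SVR prior mean on raw GC vectors
    mean_res, cov = gp_posterior(warp(X, a), residual, warp(x, a), sigma_n)
    return svr.predict(x)[0] + mean_res[0], float(np.sqrt(np.diag(cov))[0])
```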
4. Results and discussions
The models mentioned in the previous sections (i.e., SVR in Section 2.2, GP in Section 2.3, and GP-WP in Section 3) are used for the estimation of pure component primary properties. All of the 20 pure component primary properties obtained from Appendix A are tested. Table 1 lists the database information for the 20 properties, while Table 2 gives detailed information about the three models.
Section 4 is organized as follows. First, the results of the three models are displayed using error threshold heatmaps along with quantitative error indices, including the root mean square error (RMSE) and R-squared ($R^2$). Second, the contributions of the three techniques (namely, the warping function, prior elicitation, and prior predictive checking) to the full model framework are analyzed. Finally, the performance of the GP-WP is compared with those of other mainstream machine learning models, more specifically, neural networks and decision trees.
4.1. Result analysis of the GP-WP for 20 properties
This section compares the prediction accuracy of the SVR (i.e., the model developed using SVR); the GP (i.e., the model developed using a regular GP framework); and the GP-WP (i.e., the model developed using a GP with the warping function, prior elicitation, and prior predictive checking). A total of 20 properties are tested: the normal boiling point (K), critical volume, critical temperature (K), critical pressure, auto-ignition temperature (K), bioconcentration factor, Gibbs energy of formation at 298 K, standard enthalpy of formation, enthalpy of fusion at 298 K, Hildebrand solubility parameter at 298 K, enthalpy of vaporization at 298 K, toxicity (fathead minnow, LC50), toxicity (oral rat, LD50), liquid molar volume at 298 K, octanol-water partition coefficient, aqueous solubility, permissible exposure limit, photochemical oxidation potential, acid dissociation constant, and normal melting point (K). The division of the training and testing sets is retained from the original dataset; however, for isomers, whose inputs are identical and whose outputs are close to each other, only the one closest to the average output value is kept.
The SVR model is trained with Eq. (2), with its hyperparameters $C$ and $\varepsilon$ tuned by means of a grid search using the Python package skopt.sampler (scikit-optimize library). The sequential and simultaneous methods are both used, and the one with the lower RMSE is chosen as the SVR model. SVR coefficient optimization is done with the Python package sklearn.svm (scikit-learn library), after which the contributions for each functional group are obtained. The five hyperparameters of the GP are tuned using the kernel function shown in Eq. (7). A prediction for each property is obtained with Bayesian inference. Then, the GP-WP model is implemented following the structure in Fig. 3. Six hyperparameters (where the extra one comes from the warping function) are set separately for the models with and without priors (the prior being zero). During the training process, the average validation loss can also be calculated. The final GP-WP model is determined using Eq. (16); here, the penalty $\gamma$ is manually set to 0.05.
First, the performance of the different models on the whole dataset, presented in the heatmap in Fig. 5, is measured through different error threshold rates (1%, 5%, and 10%). The first three columns correspond to the SVR, the middle three to the regular GP, and the last three to the GP-WP model. Different rows represent the percentages for different properties, while the last row gives the average over all 20 properties. In Fig. 5, red indicates a high percentage, and blue indicates a low percentage.
The boundary in Fig. 5 between the SVR model and the other two models turns out to be very clear, as a large area in the first three columns is covered with blue. The phenomenon becomes especially prominent when observing the three 1% columns. Although the prediction accuracy differs between properties, the percentages of the SVR predictions are consistently lower than those of their counterparts under the GP frameworks. Therefore, the GP will be a good choice when a surrogate is desired to substitute for the original dataset to achieve high-accuracy predictions. Meanwhile, from the heatmap, it can be seen that the average fractions under the 1%, 5%, and 10% thresholds for the GP-WP are all higher than their counterparts for the regular GP (94.51%, 95.78%, and 96.94%, respectively).
The heatmap for the testing set only is shown in Fig. 6. Although the GP model loses its prominent advantage over the SVR here, the GP-WP model still outperforms the SVR model on most properties according to the fractions on the heatmap, with the highest average percentage for the three thresholds among the three models. It must be admitted that, for some properties (e.g., Hfus and bcf), none of the models has very accurate prediction ability. However, while the two GP models can provide an uncertainty range to inform modelers of the model's poor performance, the SVR model only gives an inaccurate prediction result.
To better quantify the errors of the SVR, regular GP, and GP-WP, the RMSE and $R^2$ for each property are calculated using Eqs. (21) and (22):

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \quad (21)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \quad (22)$$

where $y_i$ is the true property value and $\hat{y}_i$ corresponds to the predicted one; $\bar{y}$ is the mean of the true values of all samples. While a smaller RMSE is always desirable, $R^2$ moves from 0 toward 1 as the prediction model becomes more accurate. The results for the 20 property predictions are shown in Table 3 (the whole dataset) and Table 4 (the testing set).
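The two error indices translate into a few lines of code; the arrays below are toy values for illustration only.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (21)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r_square(y_true, y_pred):
    """Coefficient of determination, Eq. (22)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([300.0, 350.0, 400.0])   # toy property values
y_pred = np.array([305.0, 348.0, 395.0])
print(rmse(y_true, y_pred), r_square(y_true, y_pred))
```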
The results in Tables 3 and 4 tend to be consistent with those in the heatmap. Taking the Number 2 property (Vc) as an example, the GP-WP model reduces the RMSE from 28.280 to 8.074, a reduction of 71.45%. In addition, it increases the "fraction under the 1% error threshold" accordingly. Since both the RMSE and the "fraction under the 1% error threshold" represent the accuracy of the predictions, it is evident that the GP-WP model enhances the accuracy for most properties, both in terms of error threshold ratios and quantitative relative errors. On the whole dataset, the two GP-based models have a predominant advantage over the SVR model. Moreover, the GP model underperforms the GP-WP model for most properties. This is evident in the cases of Vc, Gf, and Hf, where the RMSE is decreased by more than 50% (from 28.280 to 8.074, 24.034 to 11.942, and 22.995 to 9.394, respectively) after the warping function and prior setting techniques are adopted. On the testing set, the GP-WP model consistently outperforms the SVR model, displaying smaller errors and higher $R^2$ values. Therefore, it is highly practical to utilize the SVR model as the prior for the GP-WP model.
4.2. An analysis of each technique in the GP-WP
The previous subsection demonstrated the advantage of the GP-WP over the traditional SVR and GP methods. This good performance results from the techniques introduced into the proposed GP method. This section further illustrates how the warping function, prior elicitation, and prior predictive checking each contribute to the whole model framework. The performance of the property estimation can be improved by the introduction of each technique, albeit not always distinctly for all properties.
First, the warping function is used to better handle the correlation between two discrete vectors. With the parameter of the warping function treated as a hyperparameter of the GP, the degree of correlation change is determined automatically during the hyperparameter tuning process. Compared with a regular GP, the model performance for property prediction after incorporating the warping function improves slightly but steadily. In other words, for most of the 20 properties, adding the warping function alone slightly increases the prediction accuracy. The RMSE results after adding the warping function for all properties are given in Appendix A, and some representative examples are listed in Table 5. It is clearly shown that the RMSE for properties such as Tb, Gf, and Tc becomes smaller (from 7.650, 24.034, and 24.310 to 6.890, 22.908, and 23.680, respectively).
Second, prior elicitation with traditional methods is especially effective when the SVR model performs well. From Eq. (14), it can be confirmed that the SVR prior makes the GP model fit the residual of the SVR model. The RMSE results after adding the prior elicitation procedure for all properties are given in Appendix A, and some representative examples are listed in Table 6. For properties such as Vc, where the prediction accuracy of the SVR model is already high, the prior elicitation process dramatically improves the model performance of the GP.
Lastly, the function of prior predictive checking lies in the determination of the final model. The result is shown in Table 7, where $L_0$ and $L_\mathrm{SVR}$ correspond to the average cross-validation loss for the GP model with zero and non-zero priors, respectively. According to the "$L_0$" and "$L_\mathrm{SVR}$" columns, the prior formulation of the final GP-WP model is determined, as shown in the "GP-WP" column. The RMSE on the testing set is also listed in the table for reference; it is unknown during the actual procedure. To be clear, a suboptimal prior choice can sometimes occur during the prior-checking process; for example, for bcf and Ld50, the prior chosen by the prior-predictive-checking technique is not the one that performs better on the testing set. However, the judgment is consistent with the error on the testing set in most cases.
4.3. Comparison with other mainstream machine learning models
To fully demonstrate the effectiveness of the GP-WP model, we further compare it with mainstream machine learning techniques other than GP-based models, namely neural networks and decision trees. Neural networks are one of the standard mainstream modeling methods and have been widely used in property prediction [49]. With the ability to learn complex patterns and relationships within the data, neural networks are frequently employed as a benchmark for comparison with other models. In comparison, decision trees are good at handling discrete-valued inputs, making them particularly well-suited for this problem, in which the inputs are discrete-valued GCs. Without loss of generality, three representative properties are picked here, with the number of samples ranging from 11 236 (logP) to 4658 (Tb), and down to 717 (Tc).
A neural network is a set of connected nodes called neurons, which form different layers. Data are passed through the input layer, hidden layers, and, finally, the output layer. In this work, two types of neural networks are used: one with fully connected dense hidden layers, whose width and depth can be adjusted as hyperparameters to optimize fitting (referred to herein as the "BP-$i$layer" model, with $i$ hidden layers), and the other with convolutional hidden layers added before the dense hidden layers (referred to herein as the "CNN" model). Unlike a neural network's neuron-and-layer structure, a decision tree uses a flowchart to categorize attributes and split different cases [50], with nodes representing attributes, branches corresponding to separate categories, and leaves at the end indicating the results. Similarly, the tree's depth, the number of leaves, and other parameters can be adjusted to optimize fitting. Here, a light gradient boosting machine (LightGBM) [51] regressor is used for the optimization and implementation of the gradient boosting decision tree (GBDT). The results of the property prediction for logP, Tb, and Tc are shown in Tables 8-10, respectively.
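As a sketch of the GBDT baseline, the snippet below fits a LightGBM regressor on synthetic GC-style inputs; the hyperparameter values are illustrative, not the tuned settings used in the comparison.

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 424)).astype(float)           # toy GC vectors
y = X @ rng.normal(size=424) + rng.normal(scale=0.1, size=500)  # synthetic property

# Gradient boosting decision tree; tree depth and leaf count are tunable
gbdt = LGBMRegressor(n_estimators=200, num_leaves=31, learning_rate=0.05)
gbdt.fit(X, y)
print(gbdt.predict(X[:3]))
```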
In Tables 8-10, "1% error," "5% error," and "10% error" represent the percentages of samples predicted within the error threshold rates of 1%, 5%, and 10%, respectively.
First, it is obvious that, for all three datasets of different sizes, the GP-WP model outperforms almost all the other machine learning models on both the testing set and the whole dataset, confirming its extrapolative and fitting abilities. When focusing on the testing set only, the neural networks and decision trees exhibit prediction errors similar to those of the GP on a large sample set. However, their performance tends to degrade as the sample size decreases. Moreover, the GP has a predominant advantage over the other mainstream machine learning models as a surrogate model for the whole dataset. In other words, the GP-WP models always obtain the most desirable results for both the prediction error and the fraction under different error threshold rates.
5. Conclusions
This paper introduces new options for the machine learning-based development of primary pure component property models employing the GC approach. The model development approach is suitable for modeling a wide range of properties and exhibits superior model performance compared with other options for machine learning-based model development, regardless of the dataset size. Similar to other GC-based property models, the developed models are predictive and have limitations related to the use of groups to represent molecules. They are not suitable for very small molecules, such as gases, but extrapolation to larger molecules has been found to be safe. While some isomers can be handled, others may not be accommodated. These limitations have also been acknowledged by Alshehri et al. [36]. Interested readers can access the training dataset, prediction results, and GP-WP model using the method provided on GitHub.
The proposed method has the following advantages over existing methods: ① Based on the GP, its construction process does not need a large amount of data, and no assumptions about the model's structure are required. ② Increased input dimensions do not entail an increase in the number of hyperparameters, yet effective dimensional information is contained in the model. ③ In most cases, it fits better and predicts more accurately for new samples than other models. The improvement of this method can mainly be attributed to two factors: ① A warping function is used to map the discrete variables to a continuous domain, which changes the degree of variable correlation to different extents through an adjustable hyperparameter. ② Prior information is elicited and carefully checked before the model is determined, making the posterior closer to the real values.
Building on the GC models, future work may involve simultaneously focusing on multiple outputs for several property predictions, as the correlation between different properties can be useful for model construction. With these features extracted, more precise predictions can be obtained. Moreover, unlike representations that encode spatial information (e.g., angles and distances between atoms), the current set of functional groups cannot differentiate isomers. Additional discriminative descriptors can also be studied and incorporated, with machine learning guiding these advances.
Acknowledgments
Financial support from the National Natural Science Foundation of China (22150410338 and 61973268) is gratefully acknowledged.
Compliance with ethics guidelines
Xinyu Cao, Ming Gong, Anjan Tula, Xi Chen, Rafiqul Gani, and Venkat Venkatasubramanian declare that they have no conflict of interest or financial conflicts to disclose.
References
[1] A.S. Hukkerikar, B. Sarup, A. Ten Kate, J. Abildskov, G. Sin, R. Gani. Group-contribution+ (GC+) based estimation of properties of pure components: improved property estimation and uncertainty analysis. Fluid Phase Equilib, 321 (2012), pp. 25-43.
[2] D. Mackay, R.S. Boethling. Handbook of property estimation methods for chemicals: environmental health sciences. CRC Press, Boca Raton (2000).
[3] A.S. Hukkerikar. Development of pure component property models for chemical product-process design and analysis [dissertation]. Denmark: Technical University of Denmark; 2013.
[4] T. Zhou, R. Gani, K. Sundmacher. Hybrid data-driven and mechanistic modeling approaches for multiscale material and process design. Engineering, 7 (9) (2021), pp. 1231-1238.
[5] K.G. Joback. Knowledge bases for computerized physical property estimation. Fluid Phase Equilib, 185 (1-2) (2001), pp. 45-52.
[6] K.G. Joback, R.C. Reid. Estimation of pure-component properties from group-contributions. Chem Eng Commun, 57 (1-6) (1987), pp. 233-243.
[7] R. Gani. Group contribution-based property estimation methods: advances and perspectives. Curr Opin Chem Eng, 23 (2019), pp. 184-196.
[8] T. Le, V.C. Epa, F.R. Burden, D.A. Winkler. Quantitative structure-property relationship modeling of diverse materials properties. Chem Rev, 112 (5) (2012), pp. 2889-2919.
[9] S. Wen, K. Nanda, Y. Huang, G.J.O. Beran. Practical quantum mechanics-based fragment methods for predicting molecular crystal properties. Phys Chem Chem Phys, 14 (21) (2012), pp. 7578-7590.
[10] L. Constantinou, R. Gani. New group contribution method for estimating properties of pure compounds. AIChE J, 40 (10) (1994), pp. 1697-1710.
[11] C. Gao, R. Govind, H.H. Tabak. Application of the group contribution method for predicting the toxicity of organic chemicals. Environ Toxicol Chem, 11 (5) (1992), pp. 631-636.
[12] C.L. Aguirre, L.A. Cisternas, J.O. Valderrama. Melting-point estimation of ionic liquids by a group contribution method. Int J Thermophys, 33 (1) (2012), pp. 34-46.
[13] E. Terrell. Estimation of Hansen solubility parameters with regularized regression for biomass conversion products: an application of adaptable group contribution. Chem Eng Sci, 248 (2022), Article 117184.
[14] J. Marrero, R. Gani. Group-contribution based estimation of pure component properties. Fluid Phase Equilib, 183-184 (2001), pp. 183-208.
[15] R. Gani, P.M. Harper, M. Hostrup. Automatic creation of missing groups through connectivity index for pure-component property prediction. Ind Eng Chem Res, 44 (18) (2005), pp. 7262-7269.
[17] V. Venkatasubramanian. The promise of artificial intelligence in chemical engineering: is it here, finally? AIChE J, 65 (2) (2019), pp. 466-478.
[18] V. Venkatasubramanian, V. Mann. Artificial intelligence in reaction prediction and chemical synthesis. Curr Opin Chem Eng, 36 (2022), Article 100749.
[19] V. Mann, R. Gani, V. Venkatasubramanian. Group contribution-based property modeling for chemical product design: a perspective in the AI era. Fluid Phase Equilib, 568 (2023), Article 113734.
[20] M.R. Dobbelaere, P.P. Plehiers, R. Van de Vijver, C.V. Stevens, K.M. Van Geem. Machine learning in chemical engineering: strengths, weaknesses, opportunities, and threats. Engineering, 7 (9) (2021), pp. 1201-1211.
[21] R. Nagai, R. Akashi, O. Sugino. Completing density functional theory by machine learning hidden messages from molecules. npj Comput Mater, 6 (1) (2020), p. 43.
[22] G.B. Goh, C. Siegel, A. Vishnu, N.O. Hodas, N. Baker. Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR models. 2017. arXiv:1706.06689.
[23] Z. Zhou, M. Eden, W. Shen. Treat molecular linear notations as sentences: accurate quantitative structure-property relationship modeling via a natural language processing approach. Ind Eng Chem Res, 62 (12) (2023), pp. 5336-5346.
[24] J. Zhang, Q. Wang, Y. Su, S. Jin, J. Ren, M. Eden, et al. An accurate and interpretable deep learning model for environmental properties prediction using hybrid molecular representations. AIChE J, 68 (6) (2022), p. e17634.
[25] H. Wen, Y. Su, Z. Wang, S. Jin, J. Ren, W. Shen, et al. A systematic modeling methodology of deep neural network-based structure-property relationship for rapid and reliable prediction on flashpoints. AIChE J, 68 (1) (2022), p. e17402.
[26] K. Paduszyński, U. Domańska. Viscosity of ionic liquids: an extensive database and a new group contribution model based on a feed-forward artificial neural network. J Chem Inf Model, 54 (5) (2014), pp. 1311-1324.
[27] R. Li, J.M. Herreros, A. Tsolakis, W. Yang. Machine learning regression based group contribution method for cetane and octane numbers prediction of pure fuel compounds and mixtures. Fuel, 280 (2020), Article 118589.
[28] C.E. Rasmussen. Gaussian processes in machine learning. O. Bousquet, U. von Luxburg, G. Rätsch (Eds.), Advanced lectures on machine learning, Springer, Berlin (2003), pp. 63-71.
[29] X. Lu, K.E. Jordan, M.F. Wheeler, E.O. Pyzer-Knapp, M. Benatan. Bayesian optimization for field-scale geological carbon storage. Engineering, 18 (2022), pp. 96-104.
[30] A. Capone, A. Lederer, S. Hirche. Gaussian process uniform error bounds with unknown hyperparameters for safety-critical applications. Proceedings of the 39th International Conference on Machine Learning; 2022 Jul 17-23; Baltimore, MD, USA. New York: PMLR (2022), pp. 2609-2624.
[31] T. Akazaki. Falsification of conditional safety properties for cyber-physical systems with Gaussian process regression. In: Y. Falcone, C. Sánchez, editors. Proceedings of the 16th International Conference on Runtime Verification; 2016 Sep 23-30; Madrid, Spain. Cham: Springer; 2016. pp. 439-446.
[32] H. Mori, E. Kurata. Application of Gaussian process to wind speed forecasting for wind power generation. In: Proceedings of the 2008 IEEE International Conference on Sustainable Energy Technologies; 2008 Nov 24-27; Singapore. Piscataway: IEEE; 2008. pp. 956-959.
[33] A.Y. Sun, D. Wang, X. Xu. Monthly streamflow forecasting using Gaussian process regression. J Hydrol, 511 (2014), pp. 72-81.
[34] B. Shahriari, K. Swersky, Z. Wang, R.P. Adams, N. De Freitas. Taking the human out of the loop: a review of Bayesian optimization. Proc IEEE, 104 (1) (2016), pp. 148-175.
[35] M.A. Gelbart, J. Snoek, R.P. Adams. Bayesian optimization with unknown constraints. 2014. arXiv:1403.5607.
[36] A.S. Alshehri, A.K. Tula, F. You, R. Gani. Next generation pure component property estimation models: with and without machine learning techniques. AIChE J, 68 (6) (2022), p. e17469.
[37] A.S. Hukkerikar, S. Kalakul, B. Sarup, D.M. Young, G. Sin, R. Gani. Estimation of environment-related properties of chemicals for design of sustainable processes: development of group-contribution+ (GC+) property models and uncertainty analysis. J Chem Inf Model, 52 (11) (2012), pp. 2823-2839.
[38] A.J. Smola, B. Schölkopf. A tutorial on support vector regression. Stat Comput, 14 (3) (2004), pp. 199-222.
[39] T. Hofmann, B. Schölkopf, A.J. Smola. Kernel methods in machine learning. Ann Stat, 36 (3) (2008), pp. 1171-1220.
[40] O. Roustant, E. Padonou, Y. Deville, A. Clément, G. Perrin, J. Giorla, et al. Group kernels for Gaussian process metamodels with categorical inputs. SIAM/ASA J Uncertain Quantif, 8 (2) (2020), pp. 775-806.
[41] P.Z.G. Qian, H. Wu, C.F.J. Wu. Gaussian process models for computer experiments with qualitative and quantitative factors. Technometrics, 50 (3) (2008), pp. 383-396.
[42] R. van de Schoot, S. Depaoli, R. King, B. Kramer, K. Märtens, M.G. Tadesse, et al. Bayesian statistics and modelling. Nat Rev Methods Primers, 1 (1) (2021), p. 1.
[43] S. Ghosal, A. Roy. Posterior consistency of Gaussian process prior for nonparametric binary regression. Ann Stat, 34 (5) (2006), pp. 2413-2429.
[44] F.P. Casale, A.V. Dalca, L. Saglietti, J. Listgarten, N. Fusi. Gaussian process prior variational autoencoders. In: S. Bengio, H.M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, editors. Proceedings of the 32nd International Conference on Neural Information Processing Systems; 2018 Dec 3-8; Montréal, QC, Canada. Red Hook: Curran Associates Inc.; 2018. pp. 10390-10401.
[45] C.G. Kaufman, S.R. Sain. Bayesian functional ANOVA modeling using Gaussian process prior distributions. Bayesian Anal, 5 (1) (2010), pp. 123-149.
[46] R. Astudillo, P.I. Frazier. Thinking inside the box: a tutorial on grey-box Bayesian optimization. In: Proceedings of the 2021 Winter Simulation Conference; 2021 Dec 15-17; Phoenix, AZ, USA. Piscataway: IEEE; 2021. pp. 1-15.
[47] D.J. Nott, C.C. Drovandi, K. Mengersen, M. Evans. Approximation of Bayesian predictive p-values with regression ABC. Bayesian Anal, 13 (1) (2018), pp. 59-83.
[48] R.E. Kass, A.E. Raftery. Bayes factors. J Am Stat Assoc, 90 (430) (1995), pp. 773-795.
[49] L. Hirschfeld, K. Swanson, K. Yang, R. Barzilay, C.W. Coley. Uncertainty quantification using neural networks for molecular property prediction. J Chem Inf Model, 60 (8) (2020), pp. 3770-3780.
[50] J. Fang, B. Gong, J. Caers. Data-driven model falsification and uncertainty quantification for fractured reservoirs. Engineering, 18 (2022), pp. 116-128.
[51] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017 Dec 4-9; Long Beach, CA, USA. Red Hook: Curran Associates Inc.; 2017. pp. 3149-3157.