
Engineering, 2021, Volume 7, Issue 9. doi: 10.1016/j.eng.2021.04.027

Actor–Critic Reinforcement Learning and Application in Developing Computer-Vision-Based Interface Tracking

Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada

# These authors contributed equally to this work.

Received: 2020-07-16 Revised: 2020-11-04 Accepted: 2021-04-02 Available online: 2021-08-14


Abstract

This paper bridges control theory and computer vision by formalizing object tracking as a sequential decision-making process. A reinforcement learning (RL) agent successfully tracks the interface between two liquids, a variable that is often critical to monitor in the chemical, petrochemical, metallurgical, and oil industries. The method uses fewer than 100 images to create an environment, from which the agent generates its own data without the need for expert knowledge. Unlike supervised learning (SL) methods that rely on a huge number of parameters, this approach requires far fewer parameters, which naturally reduces its maintenance cost. Beyond its frugal nature, the agent is robust to environmental uncertainties such as occlusion, intensity changes, and excessive noise. In a closed-loop control context, a deviation based on the interface location is chosen as the optimization goal during training. The methodology showcases RL for real-time object-tracking applications in the oil sands industry. Along with a presentation of the interface tracking problem, this paper provides a detailed review of one of the most effective RL methodologies: the actor–critic policy.
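As a reading aid for the actor–critic methodology reviewed in the full paper, the sketch below illustrates a one-step actor–critic update on a toy one-dimensional stand-in for the interface-tracking task. It is a minimal sketch only: the linear features, Gaussian policy, hyperparameters, and the synthetic true_level/estimate variables are assumptions introduced for illustration, not the paper's image-based environment or network architecture; the reward merely mirrors the interface-location-deviation objective in spirit.

```python
# Minimal actor-critic sketch for 1-D interface tracking (illustrative only).
# The environment, features, and hyperparameters below are assumptions, not the
# paper's setup: the actual pipeline builds its environment from plant images
# and uses deep function approximators.
import numpy as np

rng = np.random.default_rng(0)

def features(state):
    # Simple polynomial features of the normalized tracking error.
    return np.array([1.0, state, state**2])

# Linear critic V(s) = w . phi(s); Gaussian actor with mean theta . phi(s).
w = np.zeros(3)          # critic weights
theta = np.zeros(3)      # actor (policy-mean) weights
sigma = 0.1              # fixed exploration noise
alpha_w, alpha_theta, gamma = 0.05, 0.01, 0.95

true_level = 0.3         # hypothetical "true" interface location (normalized)
estimate = 0.8           # agent's current estimate of the interface location

for step in range(2000):
    state = estimate - true_level          # tracking error as the state
    phi = features(state)

    # Actor: sample an adjustment to the estimate from a Gaussian policy.
    mu = theta @ phi
    action = rng.normal(mu, sigma)
    estimate = np.clip(estimate + action, 0.0, 1.0)

    # Reward: negative deviation between the estimate and the true interface
    # location, echoing the interface-location-based optimization goal.
    reward = -abs(estimate - true_level)
    next_phi = features(estimate - true_level)

    # Critic: one-step temporal-difference (TD) error and weight update.
    td_error = reward + gamma * (w @ next_phi) - (w @ phi)
    w += alpha_w * td_error * phi

    # Actor: policy-gradient step using the TD error as the advantage signal.
    theta += alpha_theta * td_error * (action - mu) / sigma**2 * phi

print(f"final estimate {estimate:.3f}, true level {true_level:.3f}")
```

The defining feature of the actor–critic family is visible in the last two updates: the critic's TD error serves both to improve the value estimate and as the advantage signal that scales the actor's policy-gradient step.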


