
Engineering, 2021, Volume 7, Issue 9, doi: 10.1016/j.eng.2021.04.027


Department of Chemical and Materials Engineering, University of Alberta, Edmonton, AB T6G 1H9, Canada

# These authors contributed equally to this work.

Received: 2020-07-16 Revised: 2020-11-04 Accepted: 2021-04-02 Published: 2021-08-14



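This paper synchronizes control theory with computer vision by formalizing object tracking as a sequential decision-making process. A reinforcement learning (RL) agent successfully tracks the interface between two liquids, which is often a critical variable to track in the chemical, petrochemical, metallurgical, and oil industries. The method uses fewer than 100 images to create an environment, from which the agent generates its own data without the need for expert knowledge. Unlike supervised learning (SL) methods, which rely on a large number of parameters, this approach requires far fewer parameters, which naturally reduces its maintenance cost. Besides being economical, the agent is robust to environmental uncertainties such as occlusion, intensity changes, and excessive noise. In a closed-loop control context, a deviation based on the interface location is chosen as the optimization objective during training. The method demonstrates the use of RL for real-time object tracking in the oil sands industry. In addition to presenting the interface tracking problem, this paper provides a detailed review of one of the most effective RL methodologies: the actor-critic policy.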
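To make the "tracking as a sequential decision-making process" framing concrete, the following is a minimal sketch, not the paper's implementation. It assumes a one-dimensional intensity profile in place of real images, a three-valued action (move the estimate up, stay, or move down), and a reward equal to the negative absolute deviation from the true interface row. The class `InterfaceTrackingEnv`, the function `threshold_policy`, and all numeric settings are hypothetical; the hand-written threshold rule merely stands in for the learned actor-critic policy.

```python
import random


class InterfaceTrackingEnv:
    """Toy 1-D stand-in for an image-based tracking environment (hypothetical)."""

    def __init__(self, height=100, interface=60, start=10, seed=0):
        self.height = height        # number of pixel rows in the intensity profile
        self.interface = interface  # true interface row, hidden from the agent
        self.position = start       # agent's current estimate of the interface row
        self.rng = random.Random(seed)

    def observe(self):
        # Bright rows above the interface, dark rows below, plus bounded noise.
        return [
            (1.0 if row < self.interface else 0.0) + self.rng.uniform(-0.2, 0.2)
            for row in range(self.height)
        ]

    def step(self, action):
        # action is -1 (move up), 0 (stay), or +1 (move down).
        self.position = min(self.height - 1, max(0, self.position + action))
        # Deviation-based reward: zero when the estimate sits on the interface.
        reward = -abs(self.position - self.interface)
        return self.observe(), reward


def threshold_policy(observation, position):
    # Placeholder for a learned policy: move down while the current row is bright.
    return 1 if observation[position] > 0.5 else -1


def run_episode(env, steps=200):
    # Standard agent-environment loop: observe, act, receive a reward.
    rewards = []
    observation = env.observe()
    for _ in range(steps):
        action = threshold_policy(observation, env.position)
        observation, reward = env.step(action)
        rewards.append(reward)
    return rewards


if __name__ == "__main__":
    env = InterfaceTrackingEnv()
    rewards = run_episode(env)
    print("final estimate:", env.position, "last reward:", rewards[-1])
```

In an actual RL setup, `threshold_policy` would be replaced by an actor network trained against a critic's value estimates, with the same deviation-based reward driving the updates.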















