具身智能发展趋势与展望
Embodied Intelligence: Development Trends and Prospects
人工智能的发展目标是使机器像人类一样思维和行动,不仅能求解复杂问题,更重要的是能在一个复杂、动态、不确定的物理世界中进行交互。具身智能强调智能体通过物理载体与环境的动态交互,在感知、决策与行动中不断学习和进化,从而突破传统静态数据训练模型的局限,展现出更强的环境适应性与泛化能力,已成为实现人工智能发展目标的关键路径之一。本文深入探讨了具身智能的概念、内涵、计算框架与系统实现,在此基础上进一步梳理了具身智能的发展现状、演进趋势与面临的挑战。同时,特别指出,生成式人工智能,尤其是大语言模型、多模态大模型以及正在演进的“信息 ‒ 物理 ‒ 认知”三域融合大模型等技术在加速具身智能演进中的关键作用。面对全球人工智能竞争日益加剧的态势,总结与分析了我国在具身智能领域发展取得的进展和面临的风险,并提出了我国应重点布局的研究方向和针对性的对策建议,助力我国在全球具身智能竞赛中占据领先地位。
The goal of artificial intelligence is to enable machines to think and act like humans: not only to solve complex problems but, more importantly, to interact effectively in a complex, dynamic, and uncertain physical world. Embodied intelligence emphasizes that intelligent agents, through dynamic interaction with their environment via physical embodiments, continuously learn and evolve in the course of perception, decision-making, and action. This approach overcomes the limitations of models trained on static data, demonstrates stronger environmental adaptability and generalization capability, and has become one of the key pathways toward achieving the goals of artificial intelligence. This study explores the concept, connotations, computational frameworks, and system implementations of embodied intelligence and, on this basis, reviews its current development status, evolutionary trends, and outstanding challenges. In particular, the study highlights the pivotal role of generative artificial intelligence, especially large language models, multimodal large models, and the evolving "information ‒ physical ‒ cognitive" tri-domain fusion models, in accelerating the evolution of embodied intelligence. In the face of intensifying global competition in artificial intelligence, the study summarizes the progress China has made in embodied intelligence, analyzes the risks it faces, and proposes priority research directions and targeted policy recommendations to help China secure a leading position in the global race for embodied intelligence.
embodied intelligence / artificial intelligence / generative artificial intelligence / environment interaction
Consulting project of the Chinese Academy of Engineering, "Research on Development Strategies and Countermeasures for Generative AI and Embodied Intelligence" (2024-XZ-14)