The Tong Test: Evaluating Artificial General Intelligence Through Dynamic Embodied Physical and Social Interactions

Yujia Peng, Jiaheng Han, Zhenliang Zhang, Lifeng Fan, Tengyu Liu, Siyuan Qi, Xue Feng, Yuxi Ma, Yizhou Wang, Song-Chun Zhu

Engineering ›› 2024, Vol. 34, Issue (3): 12-22. DOI: 10.1016/j.eng.2023.07.006

Perspective

Abstract

The release of the generative pre-trained transformer (GPT) series has once again brought artificial general intelligence (AGI) to the forefront of the artificial intelligence (AI) field. However, how to define and evaluate AGI remains an open question. This perspective article proposes that the evaluation of AGI should be rooted in dynamic embodied physical and social interactions (DEPSI). More specifically, we propose five critical characteristics to be considered as AGI benchmarks and suggest the Tong test as an AGI evaluation system. The Tong test describes a value- and ability-oriented testing system that delineates five levels of AGI milestones through a virtual environment with DEPSI, allowing for infinite task generation. We contrast the Tong test with classical AI testing systems across multiple dimensions and propose a systematic evaluation framework to promote standardized, quantitative, and objective benchmarks and evaluation of AGI.
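
To make the evaluation design concrete, the sketch below shows, in Python, one way a DEPSI-style test grid might pair ability and value dimensions with the five milestone levels and procedurally generate an unbounded stream of task instances. It is a minimal sketch under our own assumptions: the dimension list and every name in it (Task, generate_task, evaluate, RandomAgent) are hypothetical illustrations, not the authors' actual Tong test implementation.

    # Minimal, hypothetical sketch of a Tong-test-style evaluation loop.
    # The dimension list and all names below are illustrative assumptions,
    # not the authors' implementation.
    import random
    from dataclasses import dataclass

    DIMENSIONS = ["perception", "cognition", "action", "learning", "values"]  # assumed axes
    LEVELS = [1, 2, 3, 4, 5]  # the five AGI milestone levels named in the abstract

    @dataclass(frozen=True)
    class Task:
        dimension: str
        level: int
        seed: int  # fresh procedural seeds yield effectively infinite task variants

    def generate_task(dimension: str, level: int) -> Task:
        """Procedurally instantiate one task in the simulated environment."""
        return Task(dimension, level, random.randrange(2**32))

    def evaluate(agent, episodes_per_cell: int = 10) -> dict:
        """Average an agent's score over every (dimension, level) cell of the grid."""
        return {
            (dim, lvl): sum(agent.run(generate_task(dim, lvl))
                            for _ in range(episodes_per_cell)) / episodes_per_cell
            for dim in DIMENSIONS
            for lvl in LEVELS
        }

    class RandomAgent:
        """Trivial stand-in agent returning a random score in [0, 1)."""
        def run(self, task: Task) -> float:
            return random.random()

    if __name__ == "__main__":
        scores = evaluate(RandomAgent(), episodes_per_cell=3)
        print(scores[("values", 5)])  # e.g., inspect the hardest value-oriented cell

Under this framing, an agent's milestone level could be read off as the highest level at which it clears a preset score threshold on every dimension; the actual levels, dimensions, and scoring are defined in the full paper.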

Keywords

Artificial general intelligence / Artificial intelligence benchmark / Artificial intelligence evaluation / Embodied artificial intelligence / Value alignment / Turing test / Causality
