Vision-Language Model-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing

Tian Wang, Junming Fan, Pai Zheng, Ruqiang Yan, Lihui Wang

Abstract

In smart manufacturing, autonomous mobile robots play an indispensable role in inspection and material-handling operations, yet they face significant limitations in adaptability and resilience within unstructured environments. Vision-and-language navigation (VLN), a human-guided navigation paradigm, has emerged as a compelling solution to these challenges. Nevertheless, VLN’s practical implementation is constrained by limited task generalization, inadequate response to diverse linguistic commands, and insufficient consideration of sensor-induced noise in environmental perception. This research addresses these limitations by introducing an innovative vision-language model (VLM)-based human-guided mobile robot navigation approach in an unstructured environment for human-centric smart manufacturing (HSM). The approach encompasses robust three-dimensional (3D) scene reconstruction through point cloud techniques, zero-shot semantic segmentation via a VLM, and natural language processing through a large language model (LLM) to interpret instructions and generate control code for navigation. The system’s efficacy is validated through extensive experiments in an unstructured manufacturing setup.
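To make the reconstruction step concrete, below is a minimal Python sketch of multiscale colored-ICP registration between two RGB-D point-cloud fragments, using the open-source Open3D library. The library choice and every parameter value are illustrative assumptions, not the paper’s reported settings.

    # Minimal sketch: coarse-to-fine colored-ICP registration of two colored
    # point clouds with Open3D. Voxel sizes and convergence thresholds are
    # illustrative assumptions, not the paper's settings.
    import numpy as np
    import open3d as o3d

    def register_fragment_pair(source, target, voxel_size=0.02):
        """Align `source` to `target`; both must carry point colors."""
        current_transform = np.identity(4)
        # Coarse-to-fine schedule: align at a coarse resolution first,
        # then refine at progressively finer ones.
        for scale in [voxel_size * 4, voxel_size * 2, voxel_size]:
            src = source.voxel_down_sample(scale)
            tgt = target.voxel_down_sample(scale)
            for pcd in (src, tgt):
                # Colored ICP needs surface normals on the downsampled clouds.
                pcd.estimate_normals(
                    o3d.geometry.KDTreeSearchParamHybrid(radius=scale * 2, max_nn=30))
            result = o3d.pipelines.registration.registration_colored_icp(
                src, tgt,
                max_correspondence_distance=scale * 1.5,
                init=current_transform,
                estimation_method=o3d.pipelines.registration.
                    TransformationEstimationForColoredICP(),
                criteria=o3d.pipelines.registration.ICPConvergenceCriteria(
                    relative_fitness=1e-6, relative_rmse=1e-6, max_iteration=50))
            current_transform = result.transformation
        return current_transform

Running the alignment across multiple voxel scales is the standard way to keep colored ICP from falling into local minima when the initial relative pose between fragments is poor.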
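The zero-shot segmentation component queries a VLM with free-form text so that new object categories require no retraining. The sketch below uses the publicly available CLIPSeg checkpoint from Hugging Face Transformers as an illustrative stand-in; the paper’s actual VLM and prompt set are not specified here, so treat the model name and prompts as assumptions.

    # Minimal sketch: text-prompted zero-shot semantic segmentation with the
    # CLIPSeg checkpoint, used here as a stand-in VLM for illustration.
    import torch
    from PIL import Image
    from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

    processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
    model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

    def segment(image: Image.Image, prompts: list[str]) -> torch.Tensor:
        """Return a label map (at CLIPSeg's 352x352 output resolution) where
        each pixel holds the index of the best-matching text prompt."""
        inputs = processor(text=prompts, images=[image] * len(prompts),
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits   # shape: (num_prompts, 352, 352)
        return logits.argmax(dim=0)           # per-pixel winning prompt index

    # Hypothetical open-vocabulary queries for an unstructured shop floor:
    # labels = segment(Image.open("shopfloor.png"),
    #                  ["floor", "workbench", "AGV", "operator"])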
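Finally, the abstract describes an LLM that interprets a natural-language command and generates control code for navigation. A minimal sketch of that prompt-to-code pattern follows; the MobileRobot methods, the system prompt, and the model name are hypothetical placeholders, not the paper’s interface.

    # Minimal sketch: turning a natural-language command into executable
    # navigation code via an LLM. The robot API and model name are
    # illustrative assumptions.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM_PROMPT = """You control a mobile robot through this Python API:
      robot.goto(landmark: str)     # navigate to a named landmark on the semantic map
      robot.inspect(landmark: str)  # capture and report images of the landmark
    Reply with Python code only."""

    def command_to_code(instruction: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "system", "content": SYSTEM_PROMPT},
                      {"role": "user", "content": instruction}],
        )
        return response.choices[0].message.content

    # e.g. command_to_code("Check the CNC machine, then return to the dock")
    # might yield:
    #   robot.goto("CNC machine"); robot.inspect("CNC machine"); robot.goto("dock")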

Keywords

Vision-language model / Large language model / Human-robot interaction / Mobile robot navigation / Human-centric smart manufacturing

Cite this article

Tian Wang, Junming Fan, Pai Zheng, Ruqiang Yan, Lihui Wang. Vision-Language Model-Based Human-Guided Mobile Robot Navigation in an Unstructured Environment for Human-Centric Smart Manufacturing. Engineering. DOI: 10.1016/j.eng.2025.04.028


