Automated Concrete Bridge Damage Detection Using an Efficient Vision Transformer-Enhanced Anchor-Free YOLO

Engineering, 2025, Vol. 51, Issue 8: 311-326. DOI: 10.1016/j.eng.2025.02.018

Research Article

Abstract

Deep learning has recently become the most popular approach for automatically detecting bridge damage in images captured by unmanned aerial vehicles (UAVs). However, its wider application to real-world scenarios is hindered by three challenges: ① defect scale variance, motion blur, and strong illumination significantly reduce the accuracy and reliability of damage detectors; ② the anchor-based damage detectors in common use struggle to generalize to harsh real-world scenarios; and ③ convolutional neural networks (CNNs) lack the capability to model long-range dependencies across the entire image. This paper presents an efficient Vision Transformer-enhanced anchor-free YOLO (you only look once) method to address these challenges. First, a concrete bridge damage dataset was established and augmented with motion blur and varying brightness. Four key enhancements were then applied to an anchor-based YOLO method: ① four detection heads were introduced to alleviate the multi-scale damage detection issue; ② decoupled heads were employed to address the conflict between the classification and bounding box regression tasks inherent in the original coupled head design; ③ an anchor-free mechanism was incorporated to reduce computational complexity and improve generalization to real-world scenarios; and ④ a novel Vision Transformer block, C3MaxViT, was added to enable CNNs to model long-range dependencies. These enhancements were integrated into an advanced anchor-based YOLOv5l algorithm, and the resulting Vision Transformer-enhanced anchor-free YOLO method was compared against cutting-edge damage detection methods. The experimental results demonstrated the effectiveness of the proposed method, with an increase of 8.1% in mean average precision at an intersection over union (IoU) threshold of 0.5 (mAP₅₀) and an increase of 8.4% in mAP@[0.5:.05:.95]. Furthermore, extensive ablation studies revealed that the four detection heads, the decoupled head design, the anchor-free mechanism, and C3MaxViT contributed improvements of 2.4%, 1.2%, 2.6%, and 1.9% in mAP₅₀, respectively.
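
For readers unfamiliar with the two reported metrics, the following minimal Python sketch illustrates the underlying intersection over union computation and the threshold grid implied by the notation mAP@[0.5:.05:.95]. It is an illustration only and not the authors' evaluation code: the function names (`iou`, `is_match`), the `(x1, y1, x2, y2)` box format, and the example boxes are assumptions introduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are assumed to be (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Threshold grid for mAP@[0.5:.05:.95]: 0.50, 0.55, ..., 0.95 (ten values).
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction counts as a hit at a given threshold if IoU >= threshold.

    threshold=0.5 corresponds to the criterion behind mAP50.
    """
    return iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    pred = (0, 0, 10, 10)    # hypothetical predicted damage box
    truth = (0, 0, 10, 7.2)  # hypothetical ground-truth box
    print("IoU:", iou(pred, truth))
    hits = [t for t in IOU_THRESHOLDS if is_match(pred, truth, t)]
    print("Thresholds satisfied:", hits)
```

A full mAP computation additionally ranks predictions by confidence, builds a precision-recall curve per class, and averages the area under that curve over classes (and, for mAP@[0.5:.05:.95], over the ten thresholds); those steps are omitted here.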