Automated Concrete Bridge Damage Detection Using an Efficient Vision Transformer-Enhanced Anchor-Free YOLO

Xiaofei Yang , Enrique del Rey Castillo , Yang Zou , Liam Wotherspoon , Jianxi Yang , Hao Li

Engineering ›› 2025, Vol. 51 ›› Issue (8) : 331 -347.

DOI: 10.1016/j.eng.2025.02.018

Abstract

Deep learning techniques have recently become the most popular approach for automatically detecting bridge damage in images captured by unmanned aerial vehicles (UAVs). However, their wider application to real-world scenarios is hindered by three challenges: ① defect scale variance, motion blur, and strong illumination significantly affect the accuracy and reliability of damage detectors; ② commonly used anchor-based damage detectors struggle to generalize effectively to harsh real-world scenarios; and ③ convolutional neural networks (CNNs) lack the capability to model long-range dependencies across the entire image. This paper presents an efficient Vision Transformer-enhanced anchor-free YOLO (you only look once) method to address these challenges. First, a concrete bridge damage dataset was established and augmented with motion blur and varying brightness. Four key enhancements were then applied to an anchor-based YOLO method: ① four detection heads were introduced to alleviate the multi-scale damage detection issue; ② decoupled heads were employed to address the conflict between classification and bounding box regression tasks inherent in the original coupled head design; ③ an anchor-free mechanism was incorporated to reduce the computational complexity and improve generalization to real-world scenarios; and ④ a novel Vision Transformer block, C3MaxViT, was added to enable CNNs to model long-range dependencies. These enhancements were integrated into an advanced anchor-based YOLOv5l algorithm, and the proposed Vision Transformer-enhanced anchor-free YOLO method was then compared against cutting-edge damage detection methods. The experimental results demonstrated the effectiveness of the proposed method, with increases of 8.1% in mean average precision at an intersection over union threshold of 0.5 (mAP50) and 8.4% in mAP@[0.5:.05:.95]. Furthermore, extensive ablation studies revealed that the four detection heads, decoupled head design, anchor-free mechanism, and C3MaxViT contributed improvements of 2.4%, 1.2%, 2.6%, and 1.9% in mAP50, respectively.

Keywords

Computer vision / Deep learning techniques / Vision Transformer / Object detection / Bridge visual inspection


1. Introduction

Environmental conditions, natural hazards, increasing traffic, and aging all make concrete bridges prone to deterioration and can impair their functionality [1], [2]. Periodic bridge inspection is vital for assessing bridge condition in a timely manner and for providing early warning of any defects that may affect the safety of bridge users and their surroundings. Visual inspection is the most common method for bridge condition monitoring [3]. Qualified inspectors physically assess the bridge appearance, a process that has been recognized as time-consuming, subjective, and error-prone [4]. Published research has suggested that around 50% of bridge condition ratings are either incorrect or differ between inspectors [5]. In addition, a considerable number of bridges need to be inspected across transportation networks, yet resources are often limited, particularly given the shortage of professionals and budget shortfalls, resulting in maintenance backlogs and undetected deterioration of bridges. Thus, new techniques that can provide more comprehensive data to inform objective and repeatable inspection decisions while reducing workload and inspection costs are in high demand.

Unmanned aerial vehicles (UAVs) equipped with high-definition cameras have recently seen increasing use for bridge inspections [6]. In the United States, over half of the bridge stock has been deemed appropriate for UAV inspection [6], offering up to 60% cost savings compared with manual visual inspections [7]. The integration of UAVs and computer vision-based damage detection algorithms can further improve the efficiency of bridge visual inspections [8], [9]. Extensive research efforts in automated damage detection have shown that the deep learning technique is optimal and the most popular method [2]. However, three challenges remain in using deep learning-based methods for damage detection:

Firstly, existing deep learning-based object detectors are primarily designed for natural scene imagery captured by handheld cameras. However, there is a significant disparity between natural scene images collected by handheld cameras and concrete defect images captured by UAVs [10]. Specifically, defects in aerial images often exhibit scale variations due to the wide field of view of UAV cameras and changes in navigation altitudes [11]. Additionally, motion blur induced by UAV high-speed flights, vibrations, and wind degrades image quality, compromising damage detection accuracy [12]. Consequently, existing deep learning-based object detectors may not be directly applicable to UAV-captured images of concrete defects.

Secondly, previous studies have leveraged anchor-based algorithms such as faster region-based convolutional neural networks (Faster R-CNNs) [13], You Only Look Once version 3 (YOLOv3) [3], YOLOv4 [14], and YOLOv5 [15] to detect damage. These algorithms rely on pre-defined anchor boxes with distinct sizes and aspect ratios, which are typically generated through k-means clustering on the dimensions of ground truth bounding boxes from the training dataset [16]. However, the reliance on predefined anchors limits the generalizability of these methods to real-world scenarios, where damage dimensions may differ from those in the training dataset. As a result, detection performance may significantly drop when these algorithms are applied to real-world scenarios. In addition, the anchor mechanism increases computational complexity, as it generates multiple anchor boxes with varying sizes and aspect ratios at each location in the feature map [17]. This results in the creation of a large number of potential bounding boxes, significantly raising the computational load required to process them [18].

Lastly, prior damage detection models have been developed based on CNNs due to their excellent detection ability and computational efficiency. However, CNN-based models lack human-level generalization ability because they focus only on local information and struggle to capture long-range dependencies, a limitation inherent to the locality of convolution operations [19]. Vision Transformers (ViTs) [20] have recently emerged as an alternative paradigm that can extract and integrate global contextual information through self-attention mechanisms, though this comes at the cost of quadratic computational complexity [21]. To address this issue, state-of-the-art Vision Transformers, like the Swin Transformer [22], partition an image into a set of local windows called shifted windows and restrict self-attention computation to these local windows to avoid quadratic computational complexity. Nevertheless, this shift from global to local self-attention compromises the model’s capability to capture global dependencies. In addition, pure Transformer-based vision models suffer from significant performance drops on small datasets [23] and therefore generalize poorly [24]. Given that damage detection datasets in civil engineering are typically small and difficult to label, pure Transformer-based models are not well-suited for this application [25]. Few studies have investigated hybrid approaches that combine CNNs and ViTs to capitalize on their respective advantages.

To address these challenges, this study aims to develop an efficient Vision Transformer-enhanced anchor-free YOLO method for automated concrete bridge damage detection. Firstly, a multi-scale concrete bridge damage dataset was established and augmented with motion blur and brightness adjustments to better adapt to complex real-world conditions. Subsequently, four enhancements were applied to an advanced anchor-based YOLOv5l method, which was chosen for its good trade-off between accuracy, speed, and hardware friendliness [26]. Specifically, the following four enhancements were leveraged:

•Additional detection head: Four detection heads were implemented to detect multi-scale defects, enhancing robustness to scale variance.

•Decoupled head design: The original coupled head was replaced with a decoupled design, separating classification and regression tasks to improve detection accuracy.

•Anchor-free mechanism: An anchor-free mechanism was introduced to directly predict object center points and dimensions, eliminating the need for anchor-based methods that rely on classifying and regressing multiple predefined anchor boxes at each feature map location, thereby enhancing computational efficiency and offering better generalization to real-world scenarios [18].

•C3MaxViT module: An efficient Transformer module, the C3MaxViT block, incorporating a Cross Stage Partial (CSP) design [27], was developed and integrated into the network, effectively capturing both local and global dependencies across the entire image, while mitigating the quadratic computational complexity typically associated with Transformers, making it more suitable for training on smaller datasets.

The main contributions of this study are:

•A multi-scale concrete bridge damage dataset has been expertly annotated and augmented with motion blur and varying brightness levels to strengthen the damage detector’s robustness in handling challenging real-world conditions.

•Four enhancements (four detection heads, decoupled head design, an anchor-free mechanism, and a C3MaxViT block) were leveraged to develop a Vision Transformer-enhanced anchor-free YOLO method based on the YOLOv5l algorithm. Furthermore, the efficacy of each enhancement was analyzed through ablation studies.

•A novel C3MaxViT block was proposed to model long-range dependencies. The effect of different embedding positions of the C3MaxViT block within the proposed method was also investigated.

The rest of the paper is structured as follows. Existing research in the realm of automated concrete bridge damage detection is summarized in Section 2. Section 3 provides an overview of the YOLOv5l architecture and illustrates the proposed method in detail. Section 4 elaborates on the concrete bridge damage dataset preparation, experiment implementation, evaluation, and analysis of experiment results. Section 5 concludes the paper and discusses potential future research.

2. Literature review

Deep learning-based damage detection methods have attracted extensive attention and have been applied over a number of different domains due to their strong automated feature extraction capability, high computational efficiency, and multi-class detection ability [14]. Existing deep learning-based damage detection methods can be categorized into two groups: anchor-based methods and anchor-free approaches.

2.1. Anchor-based damage detectors

Damage detection methods have recently been dominated by anchor-based methods. These methods leverage a set of anchor boxes that are tiled across each grid cell of a feature map to enumerate possible locations, scales, and aspect ratios for defects. The sizes and aspect ratios of the predefined anchor boxes are dependent on specific datasets, and the best anchor boxes are selected through clustering methods. The anchor box mechanism transforms the damage detection task into the damage classification task for an extensive number of potential bounding boxes. For each bounding box, the damage detector predicts defect probability, category, and offsets to that box to determine the damage categories and locations [27].

Anchor-based methods can be generally divided into two-stage and one-stage methods. Two-stage damage detectors were popularized by Faster R-CNNs [28]. Cha et al. [13] adopted Faster R-CNNs to successfully detect five types of surface defects with distinctive features. In addition, a modified Faster R-CNN method was proposed to detect minor defects in various real-world scenarios [21]. The novelty of this study was the use of a multi-scale defect region proposal network to improve damage detection accuracy for minor defects. While two-stage models typically feature high detection accuracy, they suffer from slow detection speeds due to the intermediate region proposal generation step.

One-stage methods can conduct defect classification and box regression within a single convolution network, removing the time-consuming region proposal generation process [22]. One-stage methods are represented by two popular models: ① Single Shot MultiBox Detector (SSD) [29] and ② the YOLO series [30]. A typical application of the SSD method was to detect road surface defects in real time [31]. While this study achieved a relatively satisfactory detection speed, it compromised damage detection accuracy as a result [32]. For the YOLO series, a recent example leveraged the architecture of YOLOv3 for detecting four concrete bridge damage types. The comparison results demonstrated that YOLOv3’s detection speed is three to five times faster than that of Faster R-CNN, while its detection accuracy is slightly lower [2]. To further enhance damage detection accuracy, YOLOv4 [33] added more tuning tricks based on YOLOv3, such as mosaic data augmentation and the replacement of the original Feature Pyramid Network (FPN) with a Path Aggregation Network (PAN) for better feature aggregation. Zou et al. [14] adopted an improved YOLOv4 algorithm to conduct damage detection, where depth-wise separable convolutions were introduced into the model to reduce computational costs without decreasing accuracy. YOLOv5 [34] consists of four models of increasing size: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. The smaller models, YOLOv5s and YOLOv5m, are suitable for real-time damage detection, while the larger YOLOv5l and YOLOv5x prioritize detection accuracy. A recent attempt by Zhao et al. [15] presented a YOLOv5s-HSC method, enhancing YOLOv5s with Swin Transformer blocks and coordinate attention modules to improve feature extraction. They also alleviated defect scale variation by adding an extra detection head to the model. Nevertheless, the Swin Transformer block’s reliance on a window-based attention mechanism limits the model’s ability to capture global dependencies, thereby lowering detection performance.

While anchor-based damage detectors have achieved great success, some common limitations remain. Firstly, predefined anchor boxes are dataset-specific, whereas surface defects in real-world scenarios could have various scales and aspect ratios. This mismatch hinders the generalization ability of anchor-based methods to diverse real-world scenarios. Secondly, anchor-based methods include complicated computations related to predefined anchor boxes and are sensitive to hyperparameters. Lastly, the efficient establishment of long-range dependencies across an entire image remains challenging, especially under a limited computational budget.

2.2. Anchor-free damage detectors

The emergence of anchor-free damage detectors has recently attracted increasing research attention. These detectors directly perform damage classification and bounding box regression without relying on anchor references, avoiding hyperparameter tuning and expensive computations related to anchor boxes, leading to a more efficient and simpler damage detection process. Moreover, anchor-free damage detectors are more appropriate to real-world scenarios where defects are usually distributed arbitrarily and display large variance in both scale and aspect ratio.

Anchor-free methods such as CornerNet [35], CenterNet [36] and YOLOX [18] have been developed. A recent example developed a CenWholeNet method based on CenterNet [37] by incorporating both center point features and holistic features of defects such as the diagonal length and angle of the bounding box [16]. This method outperformed the original CenterNet, demonstrating its effectiveness and marking the first application of an anchor-free method to infrastructure damage detection. Another notable advancement is the YOLOX algorithm [38], which transformed the YOLO series into an anchor-free framework. YOLOX implements a decoupled head and a novel label assignment strategy called the simplified optimal transport assignment (SimOTA), leading to excellent detection performance. Although anchor-free methods have gradually led the trend of damage detectors, they remain underused in civil infrastructure applications.

The review of previous literature on both anchor-based and anchor-free damage detectors has revealed three key challenges in infrastructure damage detection. Firstly, previous studies overlooked the presence of motion blur in UAV-captured images. Secondly, existing studies are dominated by anchor-based methods and neglect the potential benefits of anchor-free damage detectors, which are deemed more suitable for industry applications. Thirdly, while Vision Transformer blocks can help CNNs establish long-range dependencies and improve damage detection, their high computational cost poses a significant challenge. There is a pressing need for more efficient vision Transformer modules that can enhance detection capabilities without excessive computational demands.

3. Methodology

3.1. Overview

This section first details the dataset preparation used for model training and then elaborates on the proposed efficient Transformer-based anchor-free YOLO method. Fig. 1 illustrates the technical roadmap for the methodology section. The framework of the original YOLOv5l algorithm is outlined in Section 3.3, offering context on the base model architecture. Subsequently, an efficient Transformer-based anchor-free YOLO method is presented, integrating five enhancements (i.e., motion blur and brightness augmentation, an additional detection head, decoupled head design, anchor-free mechanism, and the C3MaxViT module) into the original YOLOv5l algorithm.

3.2. Data preparation

The dataset was established to meet the need for detecting concrete bridge damage in harsh real-world scenarios. The damage types (spalling, rebar exposure, and efflorescence) were selected based on their availability, prevalence, and severity in the captured concrete bridges, particularly those prone to structural degradation. These types of damage are critical because they can significantly impact the structural integrity and long-term durability of the bridges. The dataset encompassed 1969 images with 4385 annotations, all labelled by a group of researchers following a consistent standard using the image annotation tool LabelImg [39]. Notably, 72% of the images contained more than two defects. The dataset contains two sources of images: 1719 real-world captured images and 250 images augmented with motion blur and brightness adjustments to simulate scenes influenced by motion blur and strong illumination (described in Section 3.4.1). The real-world images varied in size from 563 × 421 pixels to 4608 × 3456 pixels and were collected by different bridge inspectors on-site from a shooting distance of 3–5 meters, similar to UAV capture scenes. In addition, the damage dataset comprises spalling (labeled as “spall”) with 1359 annotations, exposed rebar (labeled as “rebar”) with 1950 annotations, and efflorescence with 1076 annotations. The dataset was randomly split into training, validation, and testing sets with a ratio of 7:1:2. The training set contains 1352 images, including 175 augmented images; the validation set contains 222 images, including 25 augmented images; and the testing set contains 395 images, including 50 augmented images. Table 1 summarizes the number of images from different sources in the damage dataset, while Fig. 2 presents some visual examples of different categories of annotated damage images.

3.3. Original YOLOv5l algorithm

The framework of the YOLOv5l algorithm [34] can be divided into two components: data preprocessing and network architecture, both of which are illustrated in detail in the following subsections.

3.3.1. Data preprocessing

Data preprocessing consists of three key steps, namely, ① data augmentation, ② adaptive image scaling, and ③ adaptive calculation of prior anchor boxes.

Data augmentation aims to increase the diversity of the original dataset, thereby improving the deep learning model’s robustness to images with complex backgrounds. In addition to traditional data augmentation methods such as crop, rotation, flip, and saturation, YOLOv5 specifically employs the mosaic method. This method involves randomly selecting four images from the batch, applying random scaling, cropping, and arrangement to each, and then splicing these processed images, each with a different size and shape, into a larger image, as shown in Fig. 3. The mosaic method can greatly increase the number of small objects and diversify the background of the detected objects, enabling YOLOv5l to achieve better performance and higher robustness.

After data augmentation, both the original and augmented images in the dataset are resized to a resolution of 640 × 640 pixels, which is the model’s input size. This resizing maintains the aspect ratio without distortion by scaling the longer side of the image to match the input size and resizing the shorter side using the same scaling ratio. Gray bars are then added to the shorter side to create a square image. This padding adds redundant information and slows inference. To mitigate this issue, YOLOv5l adopts an adaptive image scaling method to calculate the padding size during model inference: the difference between the resized long and short sides is taken modulo a fixed value of 32 (the value used in this research), and only this remainder is padded. Fig. 4 provides an example of adaptive image scaling during model training and model inference.
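The padding arithmetic described above can be sketched in a few lines. This is a minimal illustration rather than the YOLOv5 implementation; the function name `letterbox_shape` and the `minimal` flag are hypothetical names introduced here, and rounding the resized side to the nearest pixel is an assumption.

```python
def letterbox_shape(width, height, target=640, stride=32, minimal=True):
    """Return (new_w, new_h, pad_w, pad_h) for an aspect-preserving resize.

    minimal=False pads to a full target x target square (training-style).
    minimal=True keeps only the padding remainder modulo `stride`
    (inference-style adaptive scaling).
    """
    # Scale so that the longer side matches the target size.
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Total padding needed to reach a square input.
    pad_w, pad_h = target - new_w, target - new_h
    if minimal:
        # Keep only what is needed to reach the next multiple of the stride.
        pad_w, pad_h = pad_w % stride, pad_h % stride
    return new_w, new_h, pad_w, pad_h


# Largest image size in the dataset: 4608 x 3456 pixels.
print(letterbox_shape(4608, 3456, minimal=False))  # square padding
print(letterbox_shape(4608, 3456, minimal=True))   # stride-aligned padding
```

For the 4608 × 3456 image, the resized content is 640 × 480; square padding adds 160 rows of gray bars, whereas the stride-aligned variant adds none because 480 is already a multiple of 32.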

YOLOv5 automatically calculates the width and height for initial anchor boxes by using a k-means clustering algorithm before training the model [40], tailored to the custom dataset. This is a crucial step because the model generates predicted bounding boxes based on these anchor boxes by closing the gap between the predicted bounding boxes and their ground truth boxes with back propagation to update the model’s parameters. The adaptive calculation of initial anchor boxes begins with gathering a custom dataset of annotated training images. Each annotation provides details about the bounding boxes of objects within the image. Next, various features are extracted from each annotated bounding box, including width, height, aspect ratio, and center coordinates. These features are then normalized to ensure consistency in scale, which is crucial for the application of the k-means clustering algorithm. Once the features are normalized, the k-means clustering algorithm is applied to group these normalized features into k clusters based on their similarities in feature space, where k represents the desired number of anchor boxes. Each cluster centroid represents an anchor box. To obtain the final anchor box dimensions and aspect ratios, these centroid values are denormalized. Finally, the anchor boxes are typically sorted by area to ensure smaller boxes correspond to smaller objects, while larger boxes correspond to larger objects.
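The clustering step can be illustrated with a small, dependency-free sketch. It is an assumption-laden simplification: it clusters only (width, height) pairs with a Euclidean distance, skips the normalization step, and uses a deterministic initialization; the YOLOv5 code base may differ in its distance metric and refinement details. The function name `kmeans_anchors` is hypothetical.

```python
def kmeans_anchors(boxes, k, iterations=50):
    """Cluster (width, height) pairs into k anchors, sorted by area."""
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    # Deterministic initialization: evenly spaced picks from area-sorted boxes.
    step = len(ordered) / k
    centroids = [ordered[int((i + 0.5) * step)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for w, h in boxes:
            # Assign each box to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (w - centroids[i][0]) ** 2 + (h - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((w, h))
        updated = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                updated.append((mean_w, mean_h))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    # Smaller anchors first, so they can be matched to higher-resolution maps.
    return sorted(centroids, key=lambda a: a[0] * a[1])


# Toy ground-truth box sizes in pixels (illustrative values only).
toy_boxes = [(10, 12), (12, 10), (11, 11), (50, 60), (60, 50), (55, 55),
             (200, 180), (180, 200), (190, 190)]
print(kmeans_anchors(toy_boxes, k=3))
```

Each returned centroid plays the role of one anchor box; in the anchor-based YOLOv5l, three such anchors are assigned to each detection head.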

3.3.2. Original YOLOv5l architecture

YOLOv5l [34] can be grouped into three parts: ① CSPDarknet53 as the backbone module for feature extraction, ② Path Aggregation Network (PANet) as the neck module for feature aggregation, and ③ coupled detection head for damage classification and bounding box regression based on preset anchor boxes. The overall architecture of the YOLOv5l algorithm is presented in Fig. 5.

3.3.2.1. Backbone module

The backbone module consists of three major blocks: the convolution (Conv) block, the simplified CSP bottleneck (C3 block), and the Spatial Pyramid Pooling Fast (SPPF) block. The C3 block leverages CSP networks [41] for its residual block design, which can decrease computational cost while preserving accuracy. The C3 block operates by dividing the input features into two paths: One path passes through the Conv block and then through n bottleneck operations, while the other path is only processed by the Conv block. The final step involves merging the cross-stage features via a concatenation (Concat) operation. The architecture of the C3 block is depicted in Fig. 6.

The SPPF block extracts multi-scale features and expands the receptive field without compromising operation speed by concatenating the outputs of max-pooling operations with a kernel size of k × k, where k = 5. The architecture of the SPPF block is shown in Fig. 7.

3.3.2.2. Neck module

The neck module, which consists of the FPN and the PAN, fuses the features of different scales extracted by the backbone module at different stages. The FPN conveys high-level semantic features in a top-down manner, while the PAN propagates low-level localization features through bottom-up path augmentation to enhance the overall feature hierarchy, as shown in Fig. 5.

3.3.2.3. Detection head module

The detection head consists of a 1 × 1 convolutional layer and constructs three levels of feature maps with subsampling strides of 32, 16, and 8 for detecting different scales of defects. The detection head predicts three bounding boxes through translating and scaling prior anchor boxes at each grid cell on three different levels of feature maps with distinct resolutions. It is worth noting that the prior anchor boxes are predefined with different scales corresponding to different levels of feature maps.

3.4. Efficient Transformer-enhanced anchor-free YOLO

The framework of the proposed efficient Transformer-enhanced anchor-free YOLO method is illustrated in Fig. 8. We adapted the original YOLOv5l for UAV-assisted automated concrete bridge damage detection tasks. The main modifications can be summarized into five aspects:

•Implementing motion blur and brightness augmentation to enhance the model’s robustness in challenging real-world conditions;

•Introducing an additional detection head to alleviate the issue of damage scale variation;

•Replacing the original coupled head with a decoupled head design to further improve damage detection performance;

•Adding an anchor-free mechanism to improve generalization in real-world scenarios and reduce computational complexity; and

•Leveraging the proposed C3MaxViT block to model long-range dependencies to remedy drawbacks of CNNs.

The relevant modification details are presented in the following subsections.

3.4.1. Additional data augmentation

Along with the original image augmentation methods such as mosaic, crop, rotation, flip, and saturation, we introduce motion blur and brightness into the proposed method using the imgaug library [42]. These augmentations enable the proposed method to learn useful features from augmented damage images, improving its resilience to challenging real-world conditions. Some visual examples of the hybrid augmentation results with motion blur and brightness are presented in Fig. 9.
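To make the two augmentations concrete, the sketch below applies a horizontal motion blur and a brightness change to a tiny grayscale image stored as nested lists. It deliberately does not use the imgaug library named above; the function names, the horizontal blur direction, the kernel size, and the brightness factor are all illustrative assumptions, not the settings used in the study.

```python
def motion_blur_horizontal(image, kernel_size=5):
    """Average each pixel with its horizontal neighbours (edges are replicated)."""
    half = kernel_size // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            # Collect the horizontal window, clamping indices at the borders.
            window = [row[min(max(x + d, 0), width - 1)]
                      for d in range(-half, half + 1)]
            new_row.append(sum(window) / len(window))
        blurred.append(new_row)
    return blurred


def adjust_brightness(image, factor):
    """Multiply pixel intensities by `factor` and clip to the range [0, 255]."""
    return [[min(255.0, max(0.0, p * factor)) for p in row] for row in image]


# A single bright pixel smears along the row after blurring.
image = [[0, 0, 255, 0, 0]]
blurred = motion_blur_horizontal(image, kernel_size=3)
print(blurred)
# Simulate strong illumination by brightening the blurred image.
print(adjust_brightness(blurred, 1.5))
```

Chaining the two functions mimics the hybrid augmentation, in which a single image is both blurred and brightened.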

3.4.2. Additional detection head

As defects in aerial images are present at various scales, an additional detection head is added to improve performance in detecting multi-scale defects. This additional (fourth) detection head is specifically tasked with processing high-level, low-resolution feature maps (1/64 of the input resolution). When combined with the original three detection heads, the resulting four detection head architecture, shown in Fig. 8, largely alleviates multi-scale variance issues and improves damage detection ability, although this enhancement does come with an increase in computational cost.

The detection heads were strategically positioned at various scales within the network, corresponding to different feature map resolutions: 1/8 of input resolution targeting the smallest defects, 1/16 of input resolution focusing on small to medium defects, 1/32 of input resolution detecting medium to large defects, and 1/64 of input resolution identifying the largest defects. By leveraging multi-scale feature representations, we enhance detection performance across various defect sizes. The detection heads are embedded at different layers of the network, allowing for the integration of both low-level features, which capture fine details, and high-level features, which provide contextual information.

3.4.3. Efficient decoupled head design

The original YOLOv5 detection head involves a coupled design that simultaneously performs classification and box regression tasks through sharing learned parameters between the two. However, this coupling causes a conflict between these two tasks because the classification task focuses on damage texture information while the regression task concentrates on damage edge features for localization [18]. To address this issue, an efficient decoupled head is designed to divide the classification and box regression tasks into distinct branches. This design, compared to YOLOX’s decoupled head, reduces the additional latency typically introduced by decoupled heads while maintaining accuracy. In this decoupled head, a 1 × 1 Conv block is firstly adopted to reduce the feature channels to 256, 512, 768, 1024 as per each level of FPN features. Subsequently, two parallel branches, each incorporating a standard 3 × 3 Conv block, are added for damage classification and box regression respectively. Finally, a standard convolution layer and sigmoid function are employed to predict the objectness score, class probability, and coordinate values of the predicted bounding boxes. The architecture of the decoupled head is shown in Fig. 10.

3.4.4. Anchor-free mechanism

The anchor-based mechanism of the YOLOv5l algorithm assigns three predefined anchor boxes of distinct sizes to each grid cell across different scale feature maps generated by the head module. This study adopts an anchor-free mechanism, reducing the number of anchor boxes for each grid cell from three to one, thereby improving computational efficiency. The anchor box size is designated to match the grid cell size at different levels of the feature maps. For each grid cell, a bounding box is predicted, consisting of 4 + 1 + T attributes: 4 for the center coordinate offsets and the scale values of the width and height of the anchor box, 1 for the objectness score that indicates whether a defect is located in the predicted bounding box, and T for the damage classes, where T is the number of predicted damage types. The anchor-free mechanism is not dependent on the preset anchors, so it can better generalize to real-world scenarios.
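The effect on the number of predicted boxes can be checked with simple arithmetic. The sketch below assumes a 640 × 640 input, head strides of 8, 16, 32 and 64 (from the 1/8 to 1/64 resolutions given earlier), and T = 3 damage types as in the dataset; the function names are hypothetical.

```python
def grid_sizes(input_size, strides):
    """Side length of the square feature map at each detection head."""
    return [input_size // s for s in strides]


def num_predictions(input_size, strides, boxes_per_cell):
    """Total number of predicted boxes over all heads."""
    return sum(boxes_per_cell * g * g for g in grid_sizes(input_size, strides))


# Original YOLOv5l: three heads, three anchor boxes per grid cell.
anchor_based = num_predictions(640, [8, 16, 32], boxes_per_cell=3)
# Proposed design: four heads, one box per grid cell.
anchor_free = num_predictions(640, [8, 16, 32, 64], boxes_per_cell=1)

num_damage_types = 3                       # spall, rebar, efflorescence
attributes_per_box = 4 + 1 + num_damage_types

print(grid_sizes(640, [8, 16, 32, 64]))
print(anchor_based, anchor_free, attributes_per_box)
```

Under these assumptions, the four-head anchor-free layout produces roughly one third as many candidate boxes as the three-head anchor-based layout, even with the extra head.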

A two-stage positive sample selection strategy is leveraged after the predicted bounding box is generated on each grid cell to balance the number of positive and negative predictions. This is necessary because most predicted bounding boxes on grid cells are negative samples, containing background information rather than defects. In the first stage, an initial positive sample selection strategy is used to assign two types of predictions on grid cells as initial positive samples: ① grid cells whose center points are located within the ground truth bounding box, and ② grid cells whose center points lie within a square with side length five times that of the grid cell, where the square’s center point aligns with the center point of the ground truth box. This operation could include more high-quality predictions, improving the detection performance. In the second stage, the SimOTA approach [18] is employed to perform refined positive sample selection, which is a simplified version of the OTA approach [43]. The refined positive sample selection strategy firstly computes the Intersection over Union (IoU) between the initial selected positive samples and their corresponding ground truth, which is defined as follows:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}$$

where A and B represent predicted bounding boxes and their corresponding ground truth respectively.
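For axis-aligned boxes, the IoU can be computed as in the sketch below. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration.

```python
def box_area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap of 1 over a union of 7
```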

The cost $M_{ij}$ is then computed to measure the matching degree between the initially selected positive samples and their corresponding ground truth:

$$M_{ij} = L_{ij}^{\mathrm{cls}} + \lambda L_{ij}^{\mathrm{reg}}$$

where $\lambda = 3$ is a trade-off coefficient, $L_{ij}^{\mathrm{cls}}$ is the classification loss, and $L_{ij}^{\mathrm{reg}}$ is the regression loss.

The top 10 IoU values for each ground truth are selected and summed, followed by the round operation to determine k that represents the number of positive samples for the corresponding ground truth. The top k positive samples with the least cost are then selected as the final positive samples for that ground truth. The SimOTA operation reduces both training time and the number of hyperparameters.
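The dynamic-k selection for a single ground truth can be sketched as below. This is a simplified illustration, not the SimOTA reference code: the function name and example values are hypothetical, the lower bound of one positive sample is an added assumption, and the handling of a candidate matched to several ground truths is omitted.

```python
def simota_select(ious, cls_losses, reg_losses, lam=3.0, top=10):
    """Return indices of the final positive samples for ONE ground truth.

    All three lists are aligned per initial positive candidate.
    """
    # Matching cost: classification loss + lambda * regression loss.
    costs = [c + lam * r for c, r in zip(cls_losses, reg_losses)]
    # Dynamic k: sum of the top-10 IoUs, rounded.
    top_ious = sorted(ious, reverse=True)[:top]
    k = round(sum(top_ious))
    k = max(1, min(k, len(costs)))  # assumption: keep at least one positive
    # Keep the k candidates with the lowest cost.
    order = sorted(range(len(costs)), key=lambda i: costs[i])
    return order[:k]


# Hypothetical candidates: IoUs sum to about 2.7, so k = 3.
example_ious = [0.9, 0.8, 0.7, 0.2, 0.1]
example_cls = [0.2, 0.3, 2.0, 0.1, 0.5]
example_reg = [0.1, 0.2, 0.3, 0.4, 0.9]
final_positives = simota_select(example_ious, example_cls, example_reg)
print(final_positives)
```

In this example the third candidate has a high IoU but is dropped because its classification loss makes its cost larger than that of the fourth candidate.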

The loss function used in this study is described as follows. The damage detectors encompass two types of loss functions for the two sub-tasks: classification loss for the classification task and box regression loss for localization. VariFocal loss (VFL) [44] is adopted as the classification loss function to alleviate the extreme imbalance issue between positive samples and negative samples in damage detection tasks. This loss function treats positive and negative samples asymmetrically by referencing the weighting idea of focal loss [45] and is defined as below:

$$\mathrm{VFL}(p, q) = \begin{cases} -q\left(q\log p + (1-q)\log(1-p)\right), & q > 0 \\ -\alpha p^{\gamma}\log(1-p), & q = 0 \end{cases}$$

where p represents the IoU-aware classification score that merges the defect presence confidence with localization accuracy, while q denotes the IoU between the prediction bounding box and its ground truth. The parameter α is set to 0.75, acting as a balancing factor between positive and negative samples, while γ, assigned a value of 2.0, serves as the focal weight factor for negative samples.
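A scalar version of this loss can be sketched as follows, using the stated values α = 0.75 and γ = 2.0. The clamping constant `eps` is an added numerical safeguard and not part of the definition.

```python
import math


def varifocal_loss(p, q, alpha=0.75, gamma=2.0, eps=1e-12):
    """Varifocal loss for one prediction.

    p: predicted IoU-aware classification score in [0, 1]
    q: target score (IoU with the ground truth for positives, 0 for negatives)
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0); safeguard, not in the formula
    if q > 0:
        # Positive sample: binary cross-entropy weighted by the target q.
        return -q * (q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
    # Negative sample: down-weighted by alpha * p^gamma.
    return -alpha * (p ** gamma) * math.log(1.0 - p)


easy_negative = varifocal_loss(0.1, 0.0)
hard_negative = varifocal_loss(0.9, 0.0)
print(easy_negative, hard_negative)
```

The asymmetry is visible in the two negatives: a background cell scored at 0.1 contributes far less loss than one scored at 0.9, whereas positives are weighted only by their target q.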

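For the box regression loss $L_{\mathrm{GIoU}}$, generalized IoU (GIoU) [46] is applied, which is defined as follows:

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|E \setminus (A \cup B)|}{|E|}$$
$$L_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}$$

where $E$ is the smallest convex shape enclosing both $A$ and $B$.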
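For axis-aligned boxes, the smallest enclosing shape E is itself a box, which gives the sketch below. The `(x1, y1, x2, y2)` box format is assumed, and both boxes are assumed to have positive area.

```python
def giou_loss(a, b):
    """GIoU loss (1 - GIoU) for two axis-aligned boxes (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest enclosing box E of A and B.
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # overlapping boxes
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes, IoU = 0
```

Unlike a plain IoU loss, the second case still yields a value that depends on how far apart the boxes are, which provides a gradient signal for non-overlapping predictions.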

3.4.5. Improved multi-axis Vision Transformer with CSP design (C3MaxViT block)

This study presents the C3MaxViT block, an efficient and universal block with CSP design, inspired by the original C3 block and the success of the multi-axis Vision Transformer (MaxViT) [21]. The C3MaxViT block can serve as a plug-and-play component for different CNNs, allowing for local-global spatial interactions to establish long-range dependencies at arbitrary input resolution, while only requiring linear computational complexity. The architecture of the proposed C3MaxViT block is presented in Fig. 11.

Within the proposed C3MaxViT block, only one improved MaxViT block is used, because the strong model capacity of Transformer blocks may lead to overfitting. The improved MaxViT block comprises three sub-modules: Fused-MBConv, Block Attention, and Grid Attention. First, Fused-MBConv [47] replaces the MBConv [48] of the original MaxViT module; here the depth-wise convolution is replaced with a standard 3 × 3 convolution to increase computational speed. Fused-MBConv takes an input feature map $X \in \mathbb{R}^{H \times W \times C}$ and applies a single 3 × 3 Conv block with a channel expansion rate of 4, so that the output of this Conv block has dimensions $H \times W \times 4C$. A squeeze-and-excitation block is then used to reduce the dimensions back to $H \times W \times C$, followed by a standard convolution and batch normalization. Using Fused-MBConv can improve the learning capacity, generalization, and trainability of the network [47]. In addition, the core innovation of the original multi-axis Vision Transformer method lies in its use of Block Attention and Grid Attention to conduct local and global interactions. The advantage of the Vision Transformer module over the convolution operation is the self-attention mechanism, which establishes global interactions across the input feature map. However, applying self-attention directly to the entire feature map is computationally expensive, with complexity that is quadratic in the number of spatial positions. To address this issue, the MaxViT block decomposes full-size attention into two sparse axis attentions with linear computational complexity. These attentions, Block Attention and Grid Attention, are described in detail in the following paragraphs.
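The quadratic-versus-linear claim can be checked by counting query-key pairs, as in this minimal sketch. The 64 × 64 and 128 × 128 feature map sizes are hypothetical, and the count ignores channels and constant factors.

```python
def full_attention_pairs(h, w):
    """Query-key pairs when every position attends to every position."""
    n = h * w
    return n * n


def windowed_attention_pairs(h, w, s):
    """Query-key pairs when each position attends only within a group of s*s tokens."""
    return (h * w) * (s * s)


# Hypothetical feature maps with an 8 x 8 window or grid.
small_full = full_attention_pairs(64, 64)
small_sparse = windowed_attention_pairs(64, 64, 8)
large_full = full_attention_pairs(128, 128)
large_sparse = windowed_attention_pairs(128, 128, 8)
print(small_full, small_sparse, large_full, large_sparse)
```

Quadrupling the number of positions multiplies the full-attention count by 16 but the sparse count only by 4, which is the linear scaling referred to above.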

Block Attention first partitions the input feature map $(H, W, C)$ into a tensor of shape $\left(\frac{H}{S} \times \frac{W}{S},\ S \times S,\ C\right)$ that represents non-overlapping windows, each of size $S \times S$. This operation is similar to the Swin Transformer approach [22]. The self-attention mechanism is then applied to each window. Subsequently, a window reverse operation, which is the inverse of the window partition operation, transforms the tensor shape back to $(H, W, C)$. The final output of the Block Attention module is processed by a feed-forward neural network (FFN) and used as input for the Grid Attention module.

Grid Attention first employs a uniform $G \times G$ grid to decompose the input feature map $(H, W, C)$ into $\left(G \times G,\ \frac{H}{G} \times \frac{W}{G},\ C\right)$. The self-attention mechanism is then applied to each grid, followed by a grid reverse operation that transforms the tensor shape back to $(H, W, C)$. The operation processes of Block Attention and Grid Attention are illustrated in Fig. 12, and the effectiveness of the proposed C3MaxViT module is discussed in Section 4.3.3. Note that the window size $S \times S$ and the grid size $G \times G$ are both set to 8 × 8 to balance computational efficiency, receptive field, and cross-window context capture. This choice ensures that the model can process images with high detail retention, effectively capture global context, and maintain computational efficiency.
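The difference between the two partitions can be illustrated by listing which pixel coordinates end up attending to one another. The sketch below follows the partitioning described for the original MaxViT [21] on a hypothetical 16 × 16 map; whether the improved block handles map sizes that are not multiples of 8 (for example by padding) is not stated here, so divisibility is assumed.

```python
def block_groups(h, w, s):
    """Block Attention: group pixels into non-overlapping s x s local windows."""
    groups = {}
    for y in range(h):
        for x in range(w):
            groups.setdefault((y // s, x // s), []).append((y, x))
    return groups


def grid_groups(h, w, g):
    """Grid Attention: each group holds g x g pixels spread uniformly over the map."""
    step_y, step_x = h // g, w // g  # spacing between pixels of the same group
    groups = {}
    for y in range(h):
        for x in range(w):
            groups.setdefault((y % step_y, x % step_x), []).append((y, x))
    return groups


blocks = block_groups(16, 16, 8)
grids = grid_groups(16, 16, 8)
# Both partitions give 4 groups of 64 tokens, but with different spatial extent.
print(max(y for y, _ in blocks[(0, 0)]))  # window stays in the top-left corner
print(max(y for y, _ in grids[(0, 0)]))   # grid group spans the whole map
```

Each attention call therefore sees the same number of tokens, while the grid groups mix information from across the entire feature map.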

4. Experiment and results

4.1. Implementation details

The efficient Transformer-enhanced anchor-free YOLO method was implemented on a Linux operating system with an AMD EPYC 7601 central processing unit (CPU) and a single NVIDIA GeForce RTX3090 graphics processing unit (GPU) with 24 gigabytes (GB) of memory for training, validating, and testing the network. The deep learning network was built using PyTorch v1.8.1 [49]. A Genetic Algorithm [50] was adopted for hyperparameter optimization. First, an initial hyperparameter selection was performed based on previous literature and similar tasks. The fitness function was defined as a weighted combination of 10% of the mean average precision at an IoU threshold of 0.5 (mAP50) and 90% of mAP@[0.5:.05:.95]. The evolution process was then executed to determine the optimal hyperparameters. As a result, the input image size was set to 640 and the batch size to 16. The training process was conducted over 50 epochs, with the first 2 epochs used for warm-up due to the relatively small training dataset (1352 images). Stochastic gradient descent was chosen as the optimizer, with an initial learning rate of 0.0032, a momentum of 0.843, and a weight decay of 0.00036.
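The fitness function described above can be written as a one-line weighted sum. In the sketch below, the candidate names and metric values are hypothetical and serve only to show how the weighting ranks two hyperparameter sets.

```python
def fitness(map50, map50_95, w_map50=0.1, w_map50_95=0.9):
    """Weighted fitness: 10% mAP50 + 90% mAP@[0.5:.05:.95]."""
    return w_map50 * map50 + w_map50_95 * map50_95


# Hypothetical (mAP50, mAP@[0.5:.05:.95]) results of two candidate settings.
candidates = {
    "candidate_a": (0.80, 0.45),
    "candidate_b": (0.78, 0.47),
}
best = max(candidates, key=lambda name: fitness(*candidates[name]))
print(best)
```

Because the stricter metric carries 90% of the weight, the candidate with the lower mAP50 but higher mAP@[0.5:.05:.95] is preferred.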

4.2. Evaluation metrics

The average precision (AP) and mean average precision (mAP) were selected to evaluate the damage detectors’ performance. Before AP and mAP are explained, some foundational concepts are introduced below.

A predicted bounding box is considered a true positive (TP) when it meets three requirements: ① the confidence score is greater than the preset threshold; ② the predicted damage category matches the ground truth; and ③ the IoU is greater than 0.5. Based on these three requirements, predicted bounding boxes that do not meet either of the latter two conditions are considered false positives (FP). Predicted bounding boxes whose confidence scores are lower than a certain threshold are labelled as false negatives (FN), while those that do not detect defects or whose confidence scores are lower than the preset threshold are regarded as true negatives (TN). The metrics of precision and recall are then defined as follows:

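$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$$
$$\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$$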
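As a small worked example of these two ratios, the sketch below uses hypothetical counts; returning 0 when a denominator is empty is an added convention.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts at one confidence threshold.
print(precision_recall(tp=8, fp=2, fn=8))
```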

Different pairs of precision and recall can be obtained by varying the confidence threshold from 1 to 0, and a precision–recall curve can be drawn. The interpolated precision $\mathrm{Precision}_{\mathrm{interp}}$ at a certain recall level $r$ is defined as the highest precision at any recall level $r' \geq r$, and is computed as follows:

$$\mathrm{Precision}_{\mathrm{interp}}(r) = \max_{r' \geq r} \mathrm{Precision}(r')$$

The interpolated precision–recall curve can be drawn based on the different pairs of interpolated precision and recall. AP is defined as the area under the interpolated precision–recall curve and calculated as below:

$$\mathrm{AP} = \sum_{i=1}^{n-1} \left(r_{i+1} - r_i\right) \mathrm{Precision}_{\mathrm{interp}}\left(r_{i+1}\right)$$
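The interpolation and the area computation can be sketched as below. Prepending a recall of 0 as the first point is an assumption commonly made in evaluation code and is not stated in the text; the precision–recall values in the example are hypothetical.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated precision-recall curve.

    recalls must be sorted in ascending order, with precisions aligned to them.
    """
    r = [0.0] + list(recalls)      # assumption: the curve starts at recall 0
    p = [0.0] + list(precisions)
    # Interpolated precision: highest precision at any recall >= the current one.
    interp = p[:]
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    # Sum of rectangle areas (r_{i+1} - r_i) * Precision_interp(r_{i+1}).
    return sum((r[i + 1] - r[i]) * interp[i + 1] for i in range(len(r) - 1))


# Hypothetical curve: the dip to 0.5 at recall 0.4 is lifted to 0.75.
ap_example = average_precision([0.2, 0.4, 0.6], [1.0, 0.5, 0.75])
print(ap_example)
```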

The metric of AP is class-wise, while mAP can be used to evaluate the performance of the damage detector across all K classes and defined as follows:

$$\mathrm{mAP} = \frac{1}{K}\sum_{i=1}^{K} \mathrm{AP}_i$$

mAP50 denotes the mAP computed at the single IoU threshold of 0.5, and mAP@[0.5:.05:.95] represents the mAP averaged over 10 IoU thresholds ranging from 0.50 to 0.95 in increments of 0.05.
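The two levels of averaging, over classes and then over IoU thresholds, can be sketched as follows. The per-class AP values are hypothetical placeholders and not results from this study.

```python
def mean_average_precision(ap_per_class):
    """mAP: mean of the class-wise AP values at one IoU threshold."""
    return sum(ap_per_class.values()) / len(ap_per_class)


def iou_thresholds():
    """The 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_over_thresholds(ap_table):
    """mAP@[0.5:.05:.95]: mean of the per-threshold mAP values.

    ap_table maps each IoU threshold to a {class name: AP} dictionary.
    """
    return sum(mean_average_precision(aps) for aps in ap_table.values()) / len(ap_table)


# Hypothetical AP values that shrink as the IoU threshold becomes stricter.
table = {
    t: {"exposed rebar": 1.0 - t, "efflorescence": 0.8 - t / 2, "spalling": 0.9 - t / 2}
    for t in iou_thresholds()
}
print(mean_average_precision(table[0.5]), map_over_thresholds(table))
```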

4.3. Result evaluation

In this section, the effectiveness of the presented efficient Transformer-enhanced anchor-free YOLO method is qualitatively and quantitatively evaluated. In addition, the performance of the proposed method is compared with existing state-of-the-art detectors, and the impact of each modified component within the proposed method is analyzed.

4.3.1. Overall analysis of the proposed method

The experimental results were first analyzed qualitatively. Visual examples of the prediction results, as shown in Fig. 13, demonstrate that the proposed method can accurately generate predicted bounding boxes with high confidence scores when localizing damage positions.

Table 2 provides a summary of the performance of the proposed method. The overall detection accuracy was satisfactory, with an mAP50 of 81.2%. The detection accuracy of each damage category is also reported in Table 2, with exposed rebar achieving the highest detection accuracy compared to efflorescence and spalling. This can be credited to the salient features of exposed rebar, such as its distinct shape and color, which make it easier to distinguish from background pixels and other damage types. Efflorescence detection had the lowest performance, with an mAP50 approximately 16% lower than that of exposed rebar. This is likely due to the lower contrast of minor efflorescence against the background pixels and its more discrete distribution, which make its classification and localization more challenging.

The confusion matrix of the proposed method is presented in Fig. 14, where the columns represent the ground truth annotations and the rows represent the prediction results. Sensitivity is computed as the number of correct predictions divided by the total number of annotations of a specific category. As can be seen from Fig. 14, the damage types were reliably detected with over 80% sensitivity, validating the effectiveness of the proposed method. The biggest influencing factor for each category was the background, with 25%, 55%, and 20% of background pixels being wrongly classified as exposed rebar, efflorescence, and spalling, respectively. In addition, 18% of the ground truth efflorescence instances were wrongly classified as background. A possible reason is that the white material of the efflorescence is similar in color to the concrete, which makes useful feature extraction challenging.
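The column-wise normalization behind such a confusion matrix can be sketched as below. The counts are invented for illustration and are not the data behind Fig. 14; rows are predictions and columns are ground truth, as in the figure description.

```python
def column_normalize(matrix):
    """Divide every entry by its column total (columns = ground truth categories)."""
    n_cols = len(matrix[0])
    totals = [sum(row[j] for row in matrix) for j in range(n_cols)]
    return [[row[j] / totals[j] if totals[j] else 0.0 for j in range(n_cols)]
            for row in matrix]


# Hypothetical counts; order: exposed rebar, efflorescence, spalling, background.
counts = [
    [45, 0, 1, 3],   # predicted exposed rebar
    [0, 32, 1, 6],   # predicted efflorescence
    [1, 0, 16, 3],   # predicted spalling
    [4, 8, 2, 0],    # predicted background (missed defects)
]
normalized = column_normalize(counts)
sensitivities = [normalized[i][i] for i in range(3)]
print(sensitivities)
```

The diagonal entries are the per-category sensitivities, the last row gives the share of each defect type missed as background, and the last column shows how background false detections are distributed over the three damage types.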

4.3.2. Comparison with the state-of-the-art methods

Existing methods in the YOLO series were selected for comparison experiments to demonstrate the performance of the proposed method. Fig. 15 compares the detection accuracy (mAP50) on the validation dataset during the training phase across various YOLO series algorithms and the proposed method, which reveals several key observations. At the early stages of training, the proposed method exhibited a sharp increase in mAP50, quickly surpassing the other YOLO models. During the early to mid-training stage, it continued to rise significantly, with some fluctuations, while consistently outperforming the other methods except YOLOv6l up to 35 epochs. Afterward, the performance of YOLOv6l, YOLOv7 [51], and the proposed method became comparable, showing slight improvements and minor fluctuations and converging in the later stages of training. This performance trajectory underscores the efficiency and effectiveness of the proposed method, particularly in the crucial early and middle stages of training.

Visual comparison examples of the predicted results from different methods, including the proposed approach, are provided in Fig. S1 in Appendix A. The proposed method exhibits stronger detection ability, yielding higher confidence scores and producing more accurate prediction results compared to other damage detection methods. Table 3 summarizes the overall performance of both the existing methods and the proposed method [18], [26], [34], [51], [52]. The proposed method achieved the best performance on both evaluation metrics across all methods, exceeding the original YOLOv5l algorithm by 8.3% in mAP50 and 7.6% in mAP@[0.5:.05:.95], while maintaining a comparable training time and achieving an acceptable inference speed. Despite using CSPDarknet53 as the backbone, with depth and width parameters set to 1.00, the proposed method outperformed the most advanced methods, such as YOLOv6l with a CSPBepBackbone, YOLOv7 with an advanced version of CSPDarknet incorporating a reparameterized network, and YOLOv8 [53] with a CSPDarkNet53 variant. This demonstrates its superiority over other state-of-the-art methods. In addition, anchor-free methods were found to generally perform better than anchor-based methods. A possible explanation is that anchor-based methods rely heavily on preset anchor boxes designed for specific training datasets, limiting their ability to generalize to new data.

4.3.3. Ablation studies

In this section, the importance of each modification within the proposed method is analyzed. During the ablation experiments, part of the pre-trained weights from the YOLOv5l model was transferred to the proposed model for training, as the proposed method shares layers 0–7 of the backbone network with the pre-trained model; this significantly reduced training time and computational resources. Although this operation slightly decreased the performance of the damage detectors, it did not influence the ablation studies, which focused on validating the effect of each modification within the proposed method. The effect of each modification is summarized in Table 4. Overall, the proposed method enhanced damage detection accuracy by 8.1% in mAP50 and 8.4% in mAP@[0.5:.05:.95] respectively compared with the original YOLOv5l algorithm, while achieving an acceptable inference speed, demonstrating its effectiveness. The effect of each modification is analyzed in detail below.

Introducing an extra detection head improved mAP50 and mAP@[0.5:.05:.95] by 2.4% and 3.5% respectively. However, this enhancement comes at the cost of an increase in inference time. Subsequently, the decoupled detection head further increased mAP50 by 1.2% and mAP@[0.5:.05:.95] by 1.1% compared to the previous method. The reason for this improvement was the separation of the classification and localization tasks, which avoids their mutual conflicts. Next, the introduction of the anchor-free mechanism was investigated; it yielded the most significant improvement across all modifications. This enhancement demonstrated the advantage of anchor-free methods over anchor-based methods like YOLOv5, as the anchor-free mechanism is not limited by preset anchor boxes, allowing for better generalization to new data.

The evaluation metrics of mAP50 and mAP@[0.5:.05:.95] improved by 1.9% and 2.5% respectively, when using the proposed C3MaxViT block compared to method 4. Introducing the Vision Transformer block into a CNN-based network not only helps CNNs establish long-range dependencies but also maintains computational efficiency. In terms of the hybrid use of Vision Transformer blocks and CNNs, we also explored the effect of the Transformer block’s embedding position. Prior studies have demonstrated that the Transformer block can improve detection accuracy while reducing the model size [48]. However, no previous work has discussed the impact of the Transformer block’s position within the network. In this paper, we embedded the C3MaxViT block in different positions, as used in prior studies. Table 5 presents the performance comparison of the C3MaxViT block embedded at different positions.

As indicated in Table 5, the best detection results for both evaluation metrics and the least training time were achieved when the C3MaxViT block was only embedded at position 10 within the backbone module. This shows that the Transformer block has a significant impact on both performance and training time when added in the later layers of the backbone module. We also found that when the C3MaxViT block was progressively added into the latter two detection heads at positions 32 and 29, the metrics of mAP50 and mAP@[0.5:.05:.95] decreased slightly by around 1% compared to only adding the C3MaxViT block into the backbone module. The performance dropped significantly when the C3MaxViT block was added into positions 26 and 23, resulting in outcomes worse than those achieved using only CNNs. This phenomenon demonstrated that the performance did not improve with the increasing use of Transformer heads; rather, overusing Transformer blocks may cause the overfitting problem that decreased detection performance. As a result, the proposed C3MaxViT block was embedded into the backbone network at position 10 only to replace the original C3 blocks.

Other advanced Transformer blocks used in prior studies, such as C3TR [11], which incorporates standard Transformer blocks, and C3STR [15], which is enhanced by the Swin Transformer, were also investigated, as summarized in Figs. 16 and 17. The C3MaxViT block achieved the best detection accuracy with 76.1% in mAP50 and 44.0% in mAP@[0.5:.05:.95], outperforming the other methods in mAP50 by 1.4% and 1.1% respectively. The improved detection performance can be credited to the special design of the C3MaxViT block, which has only one modified MaxViT block, considering the overfitting issues commonly associated with the use of Transformer blocks. Minimizing the number of Transformer blocks also reduces computational cost, saves training time, and expedites detection speed. Moreover, the combination of window attention and global attention in the modified MaxViT block increased detection accuracy beyond that of a window-attention-only strategy, such as the Swin Transformer method. It should be noted that the proposed C3MaxViT block noticeably improved the performance of efflorescence detection, achieving increases of 5% and 3.9% in mAP50 compared to the other methods. As can be seen from Fig. 17, the proposed C3MaxViT block achieved the best mAP@[0.5:.05:.95] results for overall performance and excelled in most damage types except for exposed rebar. This indicates that the proposed method can more accurately localize damage positions, outperforming the other methods by 1.8% and 1.2% respectively.

Previous research reported that attention blocks together with Vision Transformer blocks can further improve detection performance [11], [15], [16]. Two different attention blocks, the convolutional block attention module (CBAM) [54] and the global attention mechanism (GAM) [55], were embedded next to the C3 block in the neck module of the proposed method to validate the effectiveness of these commonly used attention mechanisms. However, the experimental results in Figs. 18 and 19 showed that embedding attention blocks slightly decreased detection accuracy. A possible explanation is that the addition of attention blocks may cause the model to overfit the training data.

5. Conclusions

Automated concrete bridge damage detection for UAV-captured images in harsh real-world conditions remains a global challenge. In this paper, a well-annotated concrete bridge damage dataset augmented with motion blur and various illumination conditions was firstly established. We then presented an efficient Transformer-enhanced anchor-free YOLO method to automate the damage detection task. The advantages of the proposed method lie in four aspects: ① employing four detection heads to improve multi-scale damage detection ability, alleviating the defect scale variance in aerial images; ② designing a decoupled head to avoid conflicts between classification and box regression tasks; ③ introducing an anchor-free mechanism to enhance generalization ability to real-world scenarios, reducing computational complexity; and ④ using a high efficiency Vision Transformer block (C3MaxViT block) to model long-range dependencies for CNN, avoiding quadratic computational complexity of the standard Transformer block. It is worth noting that the proposed enhancements can also be extended to the architecture of newer YOLO models. These improvements are based on general concepts that are independent of the specific base model used. Additionally, the modular design philosophy shared by the YOLO series algorithms facilitates seamless integration of our enhancements, enabling further performance improvements across different YOLO architectures. The effectiveness of the proposed method was validated through a series of comparison experiments against existing state-of-the-art methods. In addition, the effect of each improvement was also investigated by detailed ablation studies. Experiment results showed that the proposed method achieved the best detection accuracy for both evaluation metrics, mAP50 and mAP@[0.5:.05:.95], compared to the state-of-the-art methods, particularly outperforming the original YOLOv5l algorithm by 8.1% in mAP50 and 8.4% in mAP@[0.5:.05:.95] respectively. 
Furthermore, extensive ablation studies showed that the additional detection head, decoupled head design, anchor-free mechanism, and C3MaxViT block improved the detection performance by 2.4%, 1.2%, 2.6%, and 1.9% in mAP50, respectively. The effect of the embedding position of the C3MaxViT block was also explored. The experimental results showed that the best detection performance was achieved when the C3MaxViT block was embedded at position 10 only. The C3MaxViT block also exceeded the performance of the C3TR block and the C3STR block by 1.4% and 1.1% respectively in terms of mAP50.

While the proposed method performed well, some limitations remain in the realm of automated concrete bridge damage detection. Firstly, existing damage datasets in the civil engineering field are relatively small, which limits the generalization of deep learning networks to harsh real-world conditions. Secondly, leveraging horizontal bounding boxes to annotate damage positions within images results in bounding boxes that contain large areas of background information. This issue arises due to the arbitrary direction of defects in images captured by UAVs, making damage feature extraction more difficult and thus limiting damage detection performance. Thirdly, current studies only focus on detecting damage in normal illumination conditions, with no prior work considering damage detection in the low-light, under-exposure, and over-exposure conditions that are often encountered in real-world bridge inspections, particularly when observing the base of a bridge.

To address these limitations, future research could consider: ① establishing a much larger dataset with multiple damage types as per the bridge inspection standard, and exploring dataset augmentation with synthetic defects; ② annotating defects with rotated bounding boxes, and developing novel damage detection methods that can perform rotated damage detection; and ③ developing a deep learning model with illumination enhancement and exposure correction, which can deal with challenging illumination conditions.

CRediT authorship contribution statement

Xiaofei Yang: Writing – original draft, Validation, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization. Enrique del Rey Castillo: Writing – review & editing, Supervision, Conceptualization. Yang Zou: Writing – review & editing, Supervision, Funding acquisition, Conceptualization. Liam Wotherspoon: Writing – review & editing, Supervision. Jianxi Yang: Data curation. Hao Li: Data curation.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors would like to acknowledge the support by University of Auckland Faculty Research Development Fund (3716476).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.eng.2025.02.018.

References

[1]

Moselhi O, Ahmed M, Bhowmick A. Multisensor data fusion for bridge condition assessment. J Perform Constr Facil 2017; 31(4):04017008.

[2]

Yang X, del Rey CE, Zou Y, Wotherspoon L, Tan Y. Automated semantic segmentation of bridge components from large-scale point clouds using a weighted superpoint graph. Autom Construct 2022; 142:104519.

[3]

Zhang C, Chang C, Jamshidi M. Concrete bridge surface damage detection using a single-stage detector. Comput Aided Civ Infrastruct Eng 2020; 35(4):389-409.

[4]

Guldur B, Yan Y, Hajjar JF. Condition assessment of bridges using terrestrial laser scanners. In: Proceedings of the Structures Congress 2015; 2015 Apr 23–25; Portland, OR, USA. Reston: American Society of Civil Engineers; 2015. p. 355–66.

[5]

Phares BM, Washer GA, Rolander DD, Graybeal BA, Moore M. Routine highway bridge inspection condition documentation accuracy and reliability. J Bridge Eng 2004; 9(4):403-413.

[6]

Otero LD. Proof of concept for using unmanned aerial vehicles for high mast pole and bridge inspections. Report. Tallahassee: Florida Department of Transportation-Research Center; 2015.

[7]

Wells J, Lovelace B. Unmanned aircraft system bridge inspection demonstration project phase II final report. Report. Saint Paul: Minnesota Department of Transportation-Research Services & Library; 2017.

[8]

Meng S, Gao Z, Zhou Y, He B, Djerrad A. Real-time automatic crack detection method based on drone. Comput Aided Civ Infrastruct Eng 2023; 38(7):849-872.

[9]

Zhang C, Zou Y, Wang F, del Rey CE, Dimyadi J, Chen L. Towards fully automated unmanned aerial vehicle-enabled bridge inspection: where are we at? Constr Build Mater 2022; 347:128543.

[10]

Chen C, Zhang Y, Lv Q, Wei S, Wang X, Sun X, et al. RRNET: a hybrid detector for object detection in drone-captured images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops 2019; 2019 Oct 27–28; Seoul, Republic of Korea. New York City: IEEE; 2019.

[11]

Zhu X, Lyu S, Wang X, Zhao Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021 Oct 11–17; Montreal, BC, Canada. New York City: IEEE; 2021. p. 2778–88.

[12]

Lee JH, Gwon GH, Kim IH, Jung HJ. A motion deblurring network for enhancing UAV image quality in bridge inspection. Drones 2023; 7(11):657.

[13]

Cha YJ, Choi W, Suh G, Mahmoudkhani S, Büyüköztürk O. Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Comput Aided Civ Infrastruct Eng 2018; 33(9):731-747.

[14]

Zou D, Zhang M, Bai Z, Liu T, Zhou A, Wang X, et al. Multicategory damage detection and safety assessment of post-earthquake reinforced concrete structures using deep learning. Comput Aided Civ Infrastruct Eng 2022; 37(9):1188-1204.

[15]

Zhao S, Kang F, Li J. Concrete dam damage detection and localisation based on YOLOv5s-HSC and photogrammetric 3D reconstruction. Autom Construct 2022; 143:104555.

[16]

He Z, Jiang S, Zhang J, Wu G. Automatic damage detection using anchor-free method and unmanned surface vessel. Autom Construct 2022; 133:104017.

[17]

Hüthwohl P, Lu R, Brilakis I. Multi-classifier for reinforced concrete bridge defects. Autom Construct 2019; 105:102824.

[18]

Ge Z, Liu S, Wang F, Li Z, Sun J. YOLOX: exceeding YOLO series in 2021. 2021. arXiv: 2107.08430.

[19]

Dosovitskiy A, Beyer L, Kolesnikov L, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. 2020. arXiv: 2010.11929.

[20]

Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M. Transformers in vision: a survey. ACM Comput Surv 2022; 54(10S):1-41.

[21]

Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, et al. MaxViT: multi-axis vision transformer. 2022. arXiv: 2204.01697.

[22]

Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021 Oct 11–17; Montreal, BC, Canada. New York City: IEEE; 2021. p. 10012–22.

[23]

Wang W, Zhang J, Cao Y, Shen Y, Tao D. Towards data-efficient detection transformers. 2022. arXiv: 2203.09507.

[24]

Dai Z, Liu H, Le QV, Tan M. CoAtNet: marrying convolution and attention for all data sizes. In: Proceedings of the 35th International Conference on Neural Information Processing Systems; 2021 Dec 6–14; Online. Red Hook: Curran Associates Inc.; 2021. p. 3965–77.

[25]

Cui Z, Wang Q, Guo J, Lu N. Few-shot classification of façade defects based on extensible classifier and contrastive learning. Autom Construct 2022; 141:104381.

[26]

Li C, Li L, Jiang H, Weng K, Geng Y, Li L, et al. YOLOv6: a single-stage object detection framework for industrial applications. 2022. arXiv: 2209.02976.

[27]

Zhang S, Chi C, Yao Y, Lei Z, Li SZ. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 13–19; Seattle, WA, USA. New York City: IEEE; 2020. p. 9759–68.

[28]

Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of the 29th International Conference on Neural Information Processing Systems; 2015 Dec 7–12; Montreal, Canada. Cambridge: MIT Press; 2015. p. 91–9.

[29]

Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: Single shot multibox detector. In: Proceedings of the European Conference on Computer Vision (ECCV 2016); 2016 Oct 11–14; Amsterdam, The Netherlands. Berlin: Springer; 2016. p. 21–37.

[30]

Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA. New York City: IEEE; 2016. p. 779–88.

[31]

Maeda H, Sekimoto Y, Seto T, Kashiyama T, Omata H. Road damage detection using deep neural networks with images captured through a smartphone. 2018. arXiv: 1801.09454.

[32]

Li R, Yuan Y, Zhang W, Yuan Y. Unified vision-based methodology for simultaneous concrete defect detection and geolocalization. Comput Aided Civ Infrastruct Eng 2018; 33(7):527-544.

[33]

Bochkovskiy A, Wang CY, Liao HYM. YOLOv4: optimal speed and accuracy of object detection. 2020. arXiv: 2004.10934.

[34]

Jocher G. YOLOv5 [Internet]. San Francisco: GitHub; 2022 Nov 22 [cited 2024 May 24]. Available from: https://github.com/ultralytics/yolov5/tree/v6.1.

[35]

Law H, Deng J. CornerNet: detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV 2018); 2018 Sep 8–14; Munich, Germany. Berlin: Springer-Verlag; 2018. p. 734–50.

[36]

Zhou X, Wang D, Krähenbühl P. Objects as points. 2019. arXiv: 1904.07850.

[37]

Chen R, Liu Y, Zhang M, Liu S, Yu B, Tai YW. Dive deeper into box for object detection. In: Proceedings of the European Conference on Computer Vision (ECCV 2020); 2020 Aug 23–28; Glasgow, UK. Berlin: Springer; 2020. p. 412–28.

[38]

Agyemang IO, Zhang X, Acheampong D, Adjei-Mensah I, Kusi GA, Mawuli BC, et al. Autonomous health assessment of civil infrastructure using deep learning and smart devices. Autom Construct 2022; 141:104396.

[39]

Sell L. LabelImg [Internet]. San Francisco: Github; 2018 Dec 3 [cited 2024 May 24]. Available from: https://github.com/tzutalin/labelImg.

[40]

Likas A, Vlassis N, Verbeek JJ. The global K-means clustering algorithm. Pattern Recognit 2003; 36(2):451-461.

[41]

Wang CY, Liao HYM, Wu YH, Chen PY, Hsieh JW, Yeh IH. CSPNet: a new backbone that can enhance learning capability of CNN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; 2020 Jun 14–19; Seattle, WA, USA. New York City: IEEE; 2020. p. 390–1.

[42]

Jung AB, Wada K, Crall J, Tanaka S, Graving J, Reinders C, et al. imgaug [Internet]. San Francisco: Github; 2020 Feb 6 [cited 2024 May 24]. Available from:

[43]

Ge Z, Liu S, Li Z, Yoshie O, Sun J. OTA: optimal transport assignment for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 20–25; Nashville, TN, USA. New York City: IEEE; 2021. p. 303–12.

[44]

Zhang H, Wang Y, Dayoub F, Sunderhauf N. VarifocalNet: an IoU-aware dense object detector. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 20–25; Nashville, TN, USA. New York City: IEEE; 2021. p. 8514–23.

[45]

Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision; 2017 Oct 22–29; Venice, Italy. New York City: IEEE; 2017. p. 2980–8.

[46]

Rezatofighi H, Tsoi N, Gwak J, Sadeghian A, Reid I, Savarese S. Generalized intersection over union: a metric and a loss for bounding box regression. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 15–20; Long Beach, CA, USA. New York City: IEEE; 2019. p. 658–66.

[47]

Tan M, Le Q. EfficientNetV2: smaller models and faster training. PMLR 2021; 139:10096-10106.

[48]

Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. PMLR 2019; 97:6105-6114.

[49]

Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, et al. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems; 2019 Dec 8–14; Vancouver, BC, Canada. Red Hook: Curran Associates Inc.; 2019.

[50]

Jocher G. Hyperparameter evolution [Internet]. San Francisco: Github; 2020 Aug 3 [cited 2024 May 24]. Available from: https://github.com/ultralytics/yolov5/issues/607.

[51]

Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In: Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2023 Jun 17–24; Vancouver, BC, Canada. New York City: IEEE; 2023. p. 7464–75.

[52]

Redmon J, Farhadi A. YOLOv3: an incremental improvement. 2018. arXiv: 1804.02767.

[53]

Varghese R, Sambath M. YOLOv8: a novel object detection algorithm with enhanced performance and robustness. In: Proceedings of 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS); 2024 Apr 18–1; Chennai, India; 2024. p. 1–6.

[54]

Woo S, Park J, Lee JY, Kweon IS. CBAM: convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV 2018); 2018 Sep 8–14; Munich, Germany. Berlin: Springer-Verlag; 2018. p. 3–19.

[55]

Liu Y, Shao Z, Hoffmann N. Global attention mechanism: retain information to enhance channel–spatial interactions. 2021. arXiv: 2112.05561.

