Abstract
We propose a joint feature and metric learning architecture, called the associative affinity network (AAN), as an affinity model for multi-object tracking (MOT) in videos. The AAN learns the associative affinity between tracks and detections across frames in an end-to-end manner. To account for imperfect detections, the AAN jointly learns bounding box regression, classification, and affinity regression via the proposed multi-task loss. In contrast to networks trained with a ranking loss, we directly train a binary classifier to learn the associative affinity of each track-detection pair and use a matching cardinality loss to capture information among candidate pairs. The AAN learns a discriminative affinity model for data association to tackle MOT, and can also perform single-object tracking. Based on the AAN, we propose a simple multi-object tracker that achieves competitive performance on the public MOT16 and MOT17 test datasets.