Towards Robust and Accurate Visual Object Tracking: Scale Estimation

Author:
Ma, Haoyi, Electrical Engineering - School of Engineering and Applied Science, University of Virginia
Lin, Zongli, Electrical and Computer Engineering, University of Virginia
Acton, Scott, Electrical and Computer Engineering, University of Virginia

Visual object tracking is a critical task with applications ranging from medical imaging to robotics. Given the state of the target object in the initial frame of a video sequence, the goal is to estimate the target state in the subsequent frames. The problem is challenging because numerous factors affect the performance of a tracking algorithm, such as scale change, viewpoint change, illumination variation, and occlusion. Furthermore, because a visual tracker is a basic building block of many time-critical systems, it must also meet strict time and computational budgets.

Recently, visual object tracking has advanced through the tracking-by-detection paradigm, in which tracking is formulated as a classification problem: a machine learning technique discriminates the target appearance from the surrounding background. With advances in discriminative modeling and feature representations, many important works have been carried out, such as discriminative correlation filter-based trackers and Siamese network-based trackers. However, most tracking methods cannot accommodate large variations in target scale in complex video sequences, which results in unsatisfactory performance.

In this thesis, we focus on robust and accurate scale estimation methods as a means to robust and accurate visual object tracking. First, a novel criterion, the average peak-to-correlation energy, is incorporated into a multi-scale searching framework to obtain robust scale estimation. The resulting system is referred to as SITUP: Scale Invariant Tracking Using average Peak-to-correlation energy.
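To make the criterion concrete, here is a minimal sketch of scale selection by average peak-to-correlation energy (APCE). It assumes the commonly used definition, APCE = |F_max - F_min|^2 / mean((F - F_min)^2) over a correlation response map F, and it assumes that the candidate scale whose response map scores highest is chosen. The function names `apce` and `select_scale` are hypothetical, and the exact formulation used in SITUP may differ.

```python
import numpy as np


def apce(response):
    """Average peak-to-correlation energy of a 2-D response map.

    Assumed definition: |F_max - F_min|^2 / mean((F - F_min)^2).
    A sharp single peak gives a high score; a broad or multi-modal
    response gives a low one.
    """
    response = np.asarray(response, dtype=float)
    f_max = response.max()
    f_min = response.min()
    energy = np.mean((response - f_min) ** 2)
    if energy == 0.0:
        # Completely flat map: there is no peak to measure.
        return 0.0
    return (f_max - f_min) ** 2 / energy


def select_scale(responses, scale_factors):
    """Return the (scale factor, APCE score) of the best-scoring candidate.

    `responses` holds one response map per entry of `scale_factors`.
    Picking the maximum APCE is an illustrative assumption.
    """
    scores = [apce(r) for r in responses]
    best = int(np.argmax(scores))
    return scale_factors[best], scores[best]
```

In this sketch, each candidate scale would contribute one response map (for example, from correlating a filter with the search patch resized by that scale factor), and `select_scale` would be called once per frame.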

Second, to reduce the heavy computational load of the multi-scale searching scheme in SITUP, we investigate several strategies for lowering its cost. The resulting system is named FAST: Fast and Accurate Scale estimation for Tracking. Compared with SITUP, FAST obtains comparable performance while operating at a frame rate up to three times higher.

Third, to enable aspect ratio adaptability, we propose a scale estimation scheme that uses a group of discriminative correlation filters to localize the target boundary and thereby adapts to aspect ratio variation. The resulting system is named TARA: Tracking with Aspect Ratio Adaptability.

Last, to obviate the need for hyper-parameters associated with candidate boxes (e.g., the scale factors of the multi-scale searching scheme, or the sizes and aspect ratios of predefined anchor boxes), we formulate tracking as parallel classification and regression problems: the target object is classified and its bounding box is regressed directly in a unified fully convolutional network. The resulting system is named CAT: Centerness-aware Anchor-free Tracker. Extensive experiments on publicly available tracking benchmark datasets show that the proposed scale estimation methods achieve strong performance in a wide variety of tracking scenarios.
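As an illustration of how an anchor-free formulation avoids predefined candidate boxes, the sketch below computes per-location regression targets and a centerness score. It assumes the FCOS-style convention, in which each location regresses its distances (l, t, r, b) to the four box sides and centerness is sqrt(min(l, r)/max(l, r) * min(t, b)/max(t, b)). Whether CAT uses exactly this definition is an assumption, and all function names are hypothetical.

```python
import math


def regression_targets(point, box):
    """Distances (l, t, r, b) from a point to the sides of box = (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = box
    return (x - x1, y - y1, x2 - x, y2 - y)


def centerness(l, t, r, b):
    """FCOS-style centerness in [0, 1]; 1 at the box center.

    Points on or outside the box boundary score 0 (assumed convention).
    """
    if min(l, t, r, b) <= 0:
        return 0.0
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))


def decode_box(point, distances):
    """Invert regression_targets: recover (x1, y1, x2, y2) from a point and (l, t, r, b)."""
    x, y = point
    l, t, r, b = distances
    return (x - l, y - t, x + r, y + b)
```

Because the box is recovered directly from the four regressed distances, both scale and aspect ratio follow from the prediction itself, with no scale factors or anchor shapes to tune. A centerness score of this kind is typically used to down-weight predictions made far from the target center.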

PHD (Doctor of Philosophy)
Visual object tracking, scale estimation, aspect ratio adaptability, discriminative correlation filter, Siamese network, fully convolutional network, anchor-free, centerness
All rights reserved (no additional license for public reuse)
Issued Date: