Documat


Summary of Towards end-to-end networks for visual tracking in RGB and TIR videos

Lichao Zhang

  • Visual tracking is a fundamental research topic in computer vision, with applications in many fields, including autonomous driving, navigation, and robotics. The goal of visual tracking is to estimate the trajectory of an object through a sequence of images, given an object that is selected manually in the first frame. Tracking is regarded as a difficult task because real-world videos exhibit a wide range of variations. In recent years, end-to-end training of deep learning methods has dominated tracking research. Visual tracking can be applied to different modalities, such as RGB and thermal infrared (TIR).

    In this thesis, we identify several problems with current tracking systems. The lack of large-scale labeled datasets hampers the use of deep learning, especially end-to-end training, for tracking in TIR images. As a result, many methods for tracking on TIR data are still based on hand-crafted features. The same holds for multi-modal tracking, e.g., RGB-T tracking. The development of RGB-T tracking is further hampered by the scarcity of research on fusion mechanisms for combining information from the RGB and TIR modalities. One of the crucial components of most trackers is the update module. In existing end-to-end tracking architectures, e.g., Siamese trackers, the online model update is still not taken into consideration at the training stage. These trackers either perform no update or apply a linear update strategy during the inference stage. While such a hand-crafted approach to updating has led to improved results, its simplicity limits the gain that could be obtained by learning to update.

    To address the data scarcity for TIR and RGB-T tracking, we use image-to-image translation to generate a large-scale synthetic TIR dataset. This dataset allows us to perform end-to-end training for TIR tracking. Furthermore, we investigate several fusion mechanisms for RGB-T tracking. The multi-modal trackers are also trained in an end-to-end manner on the synthetic data. To improve on the standard online update, we pose the updating step as an optimization problem that can be solved by training a neural network. Our approach thereby reduces the hand-crafted components in the tracking pipeline and takes a further step toward a complete end-to-end trained tracking network that also considers updating during optimization.

    Extensive experiments on several benchmark datasets from the RGB, TIR, and RGB-T modalities demonstrate the effectiveness of our proposed methods. Specifically, synthetic TIR data is effective for end-to-end training, our fusion mechanisms outperform their single-modality counterparts, and our update network outperforms the standard linear update.
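
    For readers unfamiliar with the "linear update" baseline mentioned in the abstract, it is commonly implemented as a running average of the target template. The following is a minimal sketch under that assumption, not the thesis's code: the function name `linear_update`, the rate `gamma`, and the toy flattened templates are all hypothetical and chosen only for illustration.

```python
def linear_update(accumulated, new_template, gamma=0.01):
    """Blend the accumulated template with a newly observed one.

    Assumed form (a running average, not taken from the thesis):
        updated = (1 - gamma) * accumulated + gamma * new_template
    Templates are represented here as flat lists of floats.
    """
    return [(1.0 - gamma) * a + gamma * n
            for a, n in zip(accumulated, new_template)]


# Toy example: hypothetical 4-element "templates".
template = [0.0, 0.0, 0.0, 0.0]   # template from the first frame
observed = [1.0, 1.0, 1.0, 1.0]   # template extracted in later frames

# Apply the same fixed-rate update over three frames.
for _ in range(3):
    template = linear_update(template, observed, gamma=0.5)

print(template)
```

    The update rate is fixed by hand and is identical for every feature and every frame. A learned update, as proposed in the abstract, would replace this fixed rule with a network trained to produce the next template.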

