Abstract of Context-driven vision for image and video analysis

Alejandro López Cifuentes

The paradigm shift of the last decade towards Deep Learning methods, and specifically towards Convolutional Neural Networks, has driven computer vision methods to a considerable level of performance in tasks such as image classification, semantic segmentation, and pedestrian detection, among others. Nowadays, the widely adopted path to solving a computer vision task is to train a Deep Learning model on manually annotated labels collected specifically for that task. Although the straightforward use of these annotations usually yields acceptable performance, identifying and describing the features needed to solve these or more complex tasks requires cognitive capabilities close to Human Visual Understanding, something that is still far from being achieved. Along this line, Human Visual Understanding relies not only on intrinsic visual cues but also on context: the collection of extra information, contained or not in the image domain, that humans use to disambiguate when solving visual tasks.

This Thesis addresses the usage and automatic generation of context information in computer vision. Its main hypothesis is that using different sources of context information, at different stages of computer vision algorithms, may help to automatically disambiguate errors and reinforce the learning of the data relevant to a given task, thus increasing the performance of algorithms and closing the gap between computer vision and Human Visual Understanding. To this aim, the Thesis studies the benefits of generating and incorporating different sources of context information into two paramount computer vision applications: semantic segmentation and pedestrian detection in multi-camera scenarios, and scene recognition in images. The Thesis is arranged in two parts, one for each of these application scenarios.

Specifically, the first part of the Thesis proposes two algorithms for multi-camera scenarios. First, a novel context-generation approach assigns a semantic label to every point in the ground plane by projecting the labels obtained for each camera view. These labels permit the automatic generation of context, which extends to the automatic definition of areas of interest. Afterwards, a novel multi-camera pedestrian detector is presented. Leveraging the automatically extracted areas of interest, context is used to improve pedestrian detection in three ways. Firstly, false pedestrian detections that do not lie on the ground are suppressed. Secondly, a novel graph-based approach is proposed to obtain global detections on the ground plane. Thirdly, context information is used to handle occlusions and to globally refine the location and size of the back-projected detections by aggregating information from all cameras.
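
The abstract does not detail how the projection is implemented; the sketch below illustrates the idea for a single camera, assuming each view provides a pixel-wise semantic label map and a calibrated 3x3 image-to-ground-plane homography H (the function name, the grid discretisation, and the fusion note are illustrative assumptions, not the Thesis' actual code):

import numpy as np

def project_labels_to_ground(label_map: np.ndarray, H: np.ndarray,
                             plane_shape: tuple) -> np.ndarray:
    """Project a per-camera semantic label map onto a ground-plane grid.

    label_map:   (h, w) integer semantic labels for one camera view.
    H:           3x3 image-to-ground-plane homography for that camera.
    plane_shape: (rows, cols) of the discretised ground-plane grid.
    """
    h, w = label_map.shape
    # Homogeneous coordinates for every pixel in the view.
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # (3, h*w)
    # Map pixels to ground-plane coordinates and normalise.
    gp = H @ pix
    gp = gp[:2] / gp[2]
    # -1 marks ground-plane cells that no pixel projects to.
    plane = np.full(plane_shape, -1, dtype=np.int32)
    u = np.round(gp[0]).astype(int)
    v = np.round(gp[1]).astype(int)
    ok = (u >= 0) & (u < plane_shape[1]) & (v >= 0) & (v < plane_shape[0])
    plane[v[ok], u[ok]] = label_map.ravel()[ok]
    return plane

With several calibrated cameras, one such map per view can be fused, for instance by a per-cell majority vote, giving a single ground-plane context map from which areas of interest can be derived.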

In the second part of the Thesis, context information is applied to the task of scene recognition. First, context is introduced into a scene recognition Convolutional Neural Network: features learnt from semantic segmentation are used to gate image features via an attention module, reinforcing the learning of relevant context information by shifting the focus of attention towards human-accountable concepts indicative of scene classes. Second, a novel Attention-based Knowledge Distillation method that compares 2D activations in the Discrete Cosine Transform domain is presented. With the proposed approach, context is inherently learnt by the student model from a teacher model, usually a stronger one. Finally, the Thesis studies the interpretability and explainability of Convolutional Neural Networks. Towards this goal, a perturbation-based attribution method guided by context, in the form of semantic segmentation, is used to obtain complete attribution maps that enable a deeper analysis of scene recognition interpretability, yielding relevant, irrelevant, and distracting semantic labels on a per-scene basis.
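
The abstract states only that 2D activations are compared in the Discrete Cosine Transform domain; the exact loss is not given here, so the sketch below shows one plausible instantiation, assuming spatial attention maps computed as the channel-wise energy of an activation tensor and an L2 comparison over a shared low-frequency block of DCT coefficients (the attention-map definition, the crop size k, and all names are assumptions):

import numpy as np
from scipy.fft import dctn

def attention_map(activations: np.ndarray) -> np.ndarray:
    """Spatial attention map of a (C, H, W) activation: channel-wise energy."""
    att = (activations ** 2).sum(axis=0)
    return att / (np.linalg.norm(att) + 1e-8)

def dct_attention_loss(student_act: np.ndarray,
                       teacher_act: np.ndarray, k: int = 16) -> float:
    """L2 distance between student and teacher attention maps in DCT space."""
    s = dctn(attention_map(student_act), norm='ortho')
    t = dctn(attention_map(teacher_act), norm='ortho')
    # Crop both spectra to a shared low-frequency block so that maps of
    # different spatial resolutions become directly comparable.
    k = min(k, s.shape[0], t.shape[0], s.shape[1], t.shape[1])
    return float(np.mean((s[:k, :k] - t[:k, :k]) ** 2))

Comparing DCT coefficients rather than raw maps makes the loss independent of the student's and teacher's spatial resolutions, which is one motivation for moving an attention-distillation loss into a frequency domain.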

In conclusion, the results presented in this Thesis suggest that the use of context information is highly beneficial for computer vision, leading to better automatic disambiguation of errors and, in general, to better modelling of the underlying data distribution that is effective for solving the task. This yields not only improved state-of-the-art performance but also a better comprehension of the learning processes underlying the explored computer vision methods.

