Tomas Martinez Cortes
In this thesis, we study loss functions that allow Convolutional Neural Networks (CNNs) to be trained on noisy datasets for the particular task of Content-Based Image Retrieval (CBIR). In particular, we propose two novel losses to fit models that generate global image representations. First, a Soft-Matching (SM) loss, exploiting both image content and metadata, is used to specialize general CNNs for particular cities or regions using weakly annotated datasets. Second, a Bag Exponential (BE) loss, inspired by the Multiple Instance Learning (MIL) framework, is employed to train CNNs for CBIR on noisy datasets.
The first part of the thesis introduces a novel training framework that, relying on image content and metadata, learns location-adapted deep models that provide image descriptors tuned to specific visual contents. Our networks, which start from a baseline model originally learned for a different task, are specialized by means of a custom pairwise loss function, the proposed SM loss, that uses weak labels based on image content and metadata.
The experimental results show that the proposed location-adapted CNNs achieve an improvement of up to 55% over the baseline networks on a landmark discovery task. This implies that the model has successfully learned the visual cues and peculiarities of the region for which it was trained, and that it generates image descriptors that are better adapted to that location. In addition, for landmarks that were not present in the training set, or that even belong to other cities, our proposed models performed at least as well as the baseline network, which indicates good resilience to overfitting.
The second part of the thesis introduces the BE loss function to train CNNs for image retrieval, borrowing inspiration from the MIL framework. The loss combines an exponential acting as a soft margin with a MIL-based mechanism that works with bags of positive and negative pairs of images. The method allows deep retrieval networks to be trained on noisy datasets by weighting the influence of the different samples at the loss level, which increases the performance of the generated global descriptors. The rationale behind the improvement is that we handle noise in an end-to-end manner and therefore avoid both its negative influence and the unintentional biases introduced by fixed pre-processing cleaning procedures. In addition, our method is general enough to suit other scenarios that require different weights for the training instances (e.g. boosting the influence of hard positives during training). The proposed bag exponential function can be seen as a back door for guiding the learning process towards a certain objective in an end-to-end manner, allowing the model to approach that objective smoothly and progressively.
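As an illustration only, the idea of weighting samples inside bags at the loss level can be sketched as follows. This is not the BE loss as defined in the thesis, whose exact formulation is not given here. The function name, the softmax weighting over pair distances, and the `margin` and `temperature` parameters are all assumptions introduced for this sketch.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def bag_exponential_loss_sketch(pos_dists, neg_dists, margin=1.0, temperature=0.5):
    """Hypothetical bag-level loss with an exponential soft margin.

    pos_dists: descriptor distances for the pairs in a (possibly noisy) positive bag.
    neg_dists: descriptor distances for the pairs in a negative bag.
    Both the weighting scheme and the penalty terms are assumptions of this sketch,
    not the formulation proposed in the thesis.
    """
    pos = np.asarray(pos_dists, dtype=float)
    neg = np.asarray(neg_dists, dtype=float)

    # MIL-style weighting (assumed): within the positive bag, closer pairs are
    # more likely to be true matches, so they receive larger weights and
    # distant, probably mislabelled pairs are down-weighted.
    w_pos = softmax(-pos / temperature)
    # Within the negative bag, closer pairs are the harder negatives and
    # receive larger weights.
    w_neg = softmax(-neg / temperature)

    # Exponential soft margin (assumed): the penalty grows smoothly as positive
    # pairs move beyond the margin or negative pairs move inside it.
    pos_term = np.sum(w_pos * np.exp(pos - margin))
    neg_term = np.sum(w_neg * np.exp(margin - neg))
    return float(pos_term + neg_term), w_pos, w_neg


# Usage: the third "positive" pair is far apart, as a mislabelled pair would be.
loss, w_pos, w_neg = bag_exponential_loss_sketch([0.2, 0.3, 1.8], [0.9, 1.5, 2.0])
```

In this sketch the distant pair in the positive bag receives the smallest weight, so it contributes little to the loss even though its label is wrong, without any separate cleaning step.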
Our results show that our loss allows CNN-based retrieval systems to be trained with noisy training sets and still achieve state-of-the-art performance. Furthermore, we have found that it is better to use ad-hoc training sets that are highly correlated with the final task, even if they are noisy, than to train with a clean set that is only weakly related. From our point of view, these results represent a big leap in the applicability of retrieval systems and help to reduce the effort needed to set up new CBIR applications, e.g. by allowing fast automatic generation of noisy training datasets and then using our bag exponential loss to deal with the noise. Moreover, we consider that this result opens a new line of research for CNN-based image retrieval: letting the models decide not only on the best features for solving the task but also on the most relevant samples for doing so.