
Documat


Abstract of Deep spatio-temporal neural network for facial analysis

Decky Aspandi Latif

  • Automatic Facial Analysis is one of the most important fields of computer vision due to its significant impact on the world we currently live in. Among its many applications, Facial Alignment and Facial-Based Emotion Recognition are two of the most prominent tasks: the former serves as an intermediary step enabling many higher-level facial analysis tasks, and the latter provides direct, real-world, high-level facial-based analysis and applications to society. Together, they have significant impact on areas ranging from biometric recognition and facial recognition to health and many others.

    These facial analysis tasks are even more relevant now given the emergence of big data, which enables the rapid development of machine learning models that advance state-of-the-art accuracy. In this regard, video-based data have been used more frequently in the construction of current datasets. Such sequence-based data have been explicitly exploited in other machine learning fields through their inherent temporal information, which, in contrast, has not been the case for either Facial Alignment or Facial-Based Emotion Recognition. Furthermore, the in-the-wild characteristics of the data in current datasets pose an additional challenge for developing accurate systems for these tasks. In this context, the main purpose of this thesis is to evaluate the benefit of incorporating both the temporal information and the in-the-wild data characteristics that are largely overlooked in both Facial Alignment and Facial-Based Emotion Recognition. We mainly focus on deep learning models given their capability and capacity to leverage the current sheer size of input data. We also investigate the introduction of internal noise modelling in order to assess its impact on the proposed works.

    Specifically, this thesis analyses the benefit of sequence modelling through progressive learning applied to the facial tracking task, using a model that is fully end-to-end trainable. This arrangement allows us to evaluate the optimal sequence length for increasing the quality of our models' estimations. Subsequently, we expand our investigation to the introduction of internal noise modelling, which benefits from the characteristics of each image degradation, for single-image facial alignment alongside the facial tracking task. Following this approach, we can study and quantify its direct impact. We then combine the sequence-based approach and internal noise modelling by proposing a unified system that can simultaneously perform both single-image facial alignment and facial tracking with state-of-the-art accuracy.
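    The abstract does not give the concrete formulation of the sequence modelling, so the following numpy sketch only illustrates the general idea of refining per-frame landmark estimates with temporal context; the exponential-smoothing recurrence, the `alpha` weight, and the synthetic noisy track are all illustrative assumptions, not the thesis's actual architecture.

    ```python
    import numpy as np

    def temporal_refine(frame_preds, alpha=0.7):
        """Refine per-frame landmark estimates using temporal context.

        frame_preds: (T, K, 2) array of per-frame (x, y) landmark predictions.
        alpha: weight on the current frame; (1 - alpha) carries over the
               refined estimate from the previous frame.
        Returns a refined (T, K, 2) array of estimates.
        """
        refined = np.empty_like(frame_preds, dtype=float)
        refined[0] = frame_preds[0]
        for t in range(1, len(frame_preds)):
            # Blend the current frame's prediction with the running estimate.
            refined[t] = alpha * frame_preds[t] + (1 - alpha) * refined[t - 1]
        return refined

    # Synthetic example: jitter around a static landmark is damped by
    # carrying temporal context across the sequence.
    rng = np.random.default_rng(0)
    true_pt = np.array([[50.0, 60.0]])
    noisy = true_pt + rng.normal(0.0, 2.0, size=(20, 1, 2))
    smoothed = temporal_refine(noisy)
    print(np.abs(smoothed[1:] - true_pt).mean(),
          np.abs(noisy[1:] - true_pt).mean())
    ```

    Longer sequences give the recurrence more context to average over, which is one way to picture why an optimal sequence length exists: too short and little noise is cancelled, too long and stale frames dominate the estimate.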

    Motivated by our findings on the Facial Alignment task, we then extend these approaches to the Facial-Based Emotion Recognition problem. We first explore the use of adversarial learning to enhance our image degradation modelling and simultaneously increase the efficiency of our approaches through the formation of internal visual latent features. We then equip our base sequence modelling with soft attention modules, allowing the proposed model to adjust its focus using an adaptive weighting scheme. Subsequently, we introduce a more effective fusion method for the facial feature modality and the visual representation of audio using a gating mechanism. At this stage, we also analyse the impact of our proposed gating mechanism together with the attention-enhanced sequence modelling. Finally, we find that these approaches improve our models' estimation quality, leading to high accuracy that outperforms the alternative methods.
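    The exact attention and gating formulations are not given in this abstract; the numpy sketch below shows one common way such components are built, with randomly initialised weight arrays standing in for learned parameters. The function names (`attend`, `gated_fuse`) and feature dimensions are illustrative assumptions only.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def attend(seq_feats, w_attn):
        """Soft attention: score each of the T timesteps, normalise the
        scores into adaptive weights, and return the weighted sum."""
        scores = seq_feats @ w_attn              # (T,)
        weights = softmax(scores)                # adaptive weighting over time
        return weights @ seq_feats, weights      # pooled (D,), weights (T,)

    def gated_fuse(visual, audio, w_gate):
        """Gating mechanism: a sigmoid gate decides, per feature dimension,
        how much of each modality enters the fused representation."""
        gate = sigmoid(np.concatenate([visual, audio]) @ w_gate)  # (D,)
        return gate * visual + (1.0 - gate) * audio

    # Toy sequences of visual and audio features (placeholder values).
    rng = np.random.default_rng(1)
    T, D = 8, 4
    visual_seq = rng.normal(size=(T, D))
    audio_seq = rng.normal(size=(T, D))

    v_pooled, v_w = attend(visual_seq, rng.normal(size=D))
    a_pooled, a_w = attend(audio_seq, rng.normal(size=D))
    fused = gated_fuse(v_pooled, a_pooled, rng.normal(size=(2 * D, D)))
    print(fused.shape)
    ```

    The design intuition this sketch tries to capture: attention lets the model down-weight uninformative frames before pooling, while the gate lets it lean on whichever modality is more reliable for a given feature dimension, rather than fusing by plain concatenation or averaging.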

