
Inspired by the multi-sensory nature of human speech perception, researchers have shown growing interest in developing automatic lipreading systems. This challenging task aims to interpret speech solely by analyzing the speaker's lip movements, without relying on acoustic cues. Among the wide variety of applications offered by lipreading technology, we highlight its promising role in advancing silent speech interfaces through the design of non-invasive solutions that support communication for individuals who have lost the ability to speak.
The efficacy of this technology, however, is compromised by numerous factors, including visual ambiguities when distinguishing between distinct but visually similar sounds, and the difficulty multi-speaker models face in generalizing across the interpersonal variations between speakers. Beyond these challenges inherent to lipreading itself, more technical issues also hinder progress, such as the scarcity of audio-visual data for continuous, natural speech in languages other than English and the current reliance on model complexity to achieve strong performance. Regarding the development of silent speech interfaces, we also observed that research on speaker-dependent approaches has mostly been limited to controlled indoor settings and constrained experimental protocols, overlooking the importance of studying speaker-dependent systems under more naturalistic conditions, particularly with continuous speech.
This thesis not only addresses this research gap in speaker dependency, but also investigates the design of automatic lipreading systems and their integration with acoustic speech cues for natural, continuous Spanish. Our approach offers a holistic perspective on the problem, encompassing the collection and processing of audio-visual data, the transition from traditional Hidden Markov Model (HMM)-based methods to state-of-the-art deep end-to-end architectures, and the integration of audio-visual speech cues for robust speech recognition. More specifically, we propose cross-modal training strategies in the context of traditional decoders, evaluate and analyze the efficacy of current end-to-end architectures and their adaptation to specific speakers, conduct systematic comparative studies between these two decoding paradigms under diverse training conditions, and explore the contribution of visual cues when integrated with acoustic cues for audio-visual speech recognition from a parameter-efficient perspective. As a result, our research establishes a Spanish lipreading benchmark, aiming to ensure that lipreading technologies can be effectively studied for this language under diverse data conditions.
A common theme throughout these contributions is our particular emphasis on the efficiency of this technology and its application to real-world scenarios. We strongly believe that, by addressing the heterogeneity and, in many cases, severe scarcity of data that most languages face in the context of audio-visual speech technologies, our work promotes further research in underrepresented languages and low-resource settings.