Streaming automatic speech recognition with hybrid architectures and deep neural network models

Javier Jorge Cano

Authors: Javier Jorge Cano
Thesis supervisors: Alfons Juan Císcar (supervisor), Jorge Civera Saiz (supervisor)
Defense: Universitat Politècnica de València (Spain), 2022
Language: Spanish
Thesis committee: Eva Onaindia de la Rivaherrera (chair), Eduardo Lleida Solano (secretary), Ralf Schlüter (member)
Links
- Open-access thesis at: RiuNet
Abstract
- Over the last decade, the media have undergone a revolution, turning away from conventional TV in favor of on-demand platforms. This revolution has changed not only the way entertainment is conceived but also how learning is conducted: on-demand educational platforms have proliferated and now provide educational resources on diverse topics. These new ways of distributing content have come with requirements to improve accessibility, particularly with regard to hearing difficulties and language barriers. Automatic speech recognition (ASR) can help meet these requirements by providing high-quality automatic captioning. Automatic captioning provides a sound basis for narrowing the accessibility gap, especially for live or streaming content. To this end, streaming ASR must work under strict real-time conditions, providing captions as fast as possible while working with limited context. However, this limited context usually degrades quality compared with recognition of pre-recorded (offline) content.
  
  This thesis aims to develop low-latency streaming ASR with quality similar to that of offline ASR. More precisely, it describes the path followed from an initial hybrid offline system to an efficient streaming-adapted system. The first step is to perform a single recognition pass using a state-of-the-art neural network-based language model. In conventional multi-pass systems, this model is often deferred to the second or a later pass because of its computational complexity. Like the language model, the neural acoustic model is also adapted to work with limited context. The adaptation and integration of these models are thoroughly described and assessed using fully-fledged streaming systems on well-known academic benchmarks and challenging real-world benchmarks. In brief, it is shown that the proposed adaptation of the language and acoustic models allows the streaming-adapted system to reach the accuracy of the initial offline system with low latency.