Ir al contenido

Documat


Streaming automatic speech recognition with hybrid architectures and deep neural network models

  • Autores: Javier Jorge Cano
  • Directores de la Tesis: Alfons Juan Císcar (dir. tes.) Árbol académico, Jorge Civera Saiz (dir. tes.) Árbol académico
  • Lectura: En la Universitat Politècnica de València ( España ) en 2022
  • Idioma: español
  • Tribunal Calificador de la Tesis: Eva Onaindia de la Rivaherrera (presid.) Árbol académico, Eduardo Lleida Solano (secret.) Árbol académico, Ralf Schlüter (voc.) Árbol académico
  • Enlaces
    • Tesis en acceso abierto en: RiuNet
  • Resumen
    • Over the last decade, the media have experienced a revolution, turning away from the conventional TV in favor of on-demand platforms. In addition, this media revolution not only changed the way entertainment is conceived but also how learning is conducted. Indeed, on-demand educational platforms have also proliferated and are now providing educational resources on diverse topics. These new ways to distribute content have come along with requirements to improve accessibility, particularly related to hearing difficulties and language barriers. Here is the opportunity for automatic speech recognition (ASR) to comply with these requirements by providing high-quality automatic captioning. Automatic captioning provides a sound basis for diminishing the accessibility gap, especially for live or streaming content. To this end, streaming ASR must work under strict real-time conditions, providing captions as fast as possible, and working with limited context. However, this limited context usually leads to a quality degradation as compared to the pre-recorded or offline content.

      This thesis is aimed at developing low-latency streaming ASR with a quality similar to offline ASR. More precisely, it describes the path followed from an initial hybrid offline system to an efficient streaming-adapted system. The first step is to perform a single recognition pass using a state-of-the-art neural network-based language model. In conventional multi-pass systems, this model is often deferred to the second or later pass due to its computational complexity. As with the language model, the neural-based acoustic model is also properly adapted to work with limited context. The adaptation and integration of these models is thoroughly described and assessed using fully-fledged streaming systems on well-known academic and challenging real-world benchmarks. In brief, it is shown that the proposed adaptation of the language and acoustic models allows the streaming-adapted system to reach the accuracy of the initial offline system with low latency.


Fundación Dialnet

Mi Documat

Opciones de tesis

Opciones de compartir

Opciones de entorno