Is ASR the right tool for the construction of Spoken Corpus Linguistics in European Spanish?

  • Authors: Mirari San Martín, Jónathan Heras Vicente, Gadea Mata Martínez, Sara Gómez Seibane
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 73, 2024, pp. 165-176
  • Language: English
  • Parallel titles:
    • ¿Es el ASR la herramienta adecuada para la construcción de Corpus Lingüísticos Orales en Castellano?
  • Abstract
    • Spanish

      Los corpus orales son un recurso muy valioso para explorar el discurso que ocurre de manera natural. Sin embargo, grandes partes de estos corpus permanecen sin transcribir debido al alto coste de transcribir manualmente ficheros de audio; y, por lo tanto, el acceso a estos recursos es limitado. Este problema podría ser abordado mediante herramientas de Reconocimiento Automático del Habla (ASR, por sus siglas en inglés), que han demostrado su potencial para transcribir automáticamente ficheros de audio. En este trabajo, estudiamos dos familias de modelos ASR (Whisper y Seamless) para transcribir automáticamente archivos del corpus COSER (sigla formada a partir de Corpus Oral y Sonoro del Español Rural). Nuestros resultados muestran que los modelos de ASR pueden producir transcripciones precisas independientemente del dialecto de los hablantes y su velocidad de habla; especialmente con la versión large v3 de Whisper, que es el modelo que produce los mejores resultados (WER promedio de 0.292). Sin embargo, en algunos casos, las transcripciones no se alinean perfectamente con las producidas por humanos, ya que los transcriptores humanos reflejan matices introducidos por los hablantes que no son capturados con los modelos ASR. Esto muestra que las herramientas ASR pueden reducir la carga de transcribir manualmente horas de audio de los corpus orales, pero aún se necesita supervisión humana.

    • English

      Spoken corpora are a valuable resource for exploring naturally occurring discourse. However, large parts of these corpora remain untranscribed because manually transcribing audio files is costly, and access to these resources is therefore limited. This problem could be addressed with Automatic Speech Recognition (ASR) tools, which have shown their potential to transcribe audio files automatically. In this work, we study two families of ASR models (Whisper and Seamless) for automatically transcribing files from the COSER corpus (Corpus Oral y Sonoro del Español Rural, in English Audible Corpus of Rural Spanish). Our results show that these ASR models can produce accurate transcriptions regardless of the speakers' dialect and speech rate, especially the large v3 version of Whisper, which produces the best results (mean WER of 0.292). However, in some cases the transcriptions do not align perfectly with those produced by humans, since human transcribers reflect nuances in the speakers' speech that the ASR models do not capture. This shows that ASR tools can reduce the burden of manually transcribing hours of audio from spoken corpora, but human supervision is still needed.

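
The abstract reports a mean WER (word error rate) of 0.292 for Whisper large v3. As a rough illustration of what that figure measures, the sketch below computes WER as the word-level Levenshtein distance between a human reference transcript and an ASR hypothesis, divided by the number of reference words. The function name `word_error_rate`, the lowercasing, and the whitespace tokenization are assumptions made for this example; this record does not state which normalization or tooling the authors used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace; the
    normalization used in the paper is not described in this record.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against four reference words.
    print(word_error_rate("el perro come pan", "el gato come"))
```

Under this definition a WER of 0.292 would correspond to roughly 29 word-level edits per 100 reference words. The value can exceed 1 when the hypothesis contains many insertions.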
