Evaluation of transformer-based models for punctuation and capitalization restoration in Catalan and Galician

  • Authors: Pedro J. Vivancos Vicente, Rafael Valencia García, Ronghao Pan, José Antonio García Díaz
  • Source: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 70, 2023, pp. 27-38
  • Language: English
  • Parallel titles:
    • Evaluación de modelos basados en Transformers para el sistema de recuperación de puntuación y mayúsculas en Catalán y Gallego
  • Abstract
    • Spanish (translated)

      In recent years, the performance of automatic speech recognition systems has improved considerably thanks to new deep learning methods. However, the raw output of these systems consists of word sequences without capitalization or punctuation marks. Restoring this information improves readability and allows the output to be used in other NLP models. Most existing solutions focus solely on English, although punctuation restoration models for Spanish have recently emerged. None, however, focuses on Galician and Catalan. We therefore propose a capitalization and punctuation restoration system based on Transformer models for these languages. Both models perform very well: 90.2% for Galician and 90.86% for Catalan. They can also identify proper names, country names, and organizations for capitalization restoration.

    • English

      In recent years, the performance of Automatic Speech Recognition (ASR) systems has increased considerably thanks to new deep learning methods. However, the raw output of an ASR system is a sequence of words without capital letters or punctuation marks. A capitalization and punctuation restoration system is therefore one of the most important ASR post-processing steps: it improves readability and enables the subsequent use of the output in other NLP models. Most models focus solely on English punctuation restoration, and models for Spanish punctuation restoration have emerged recently. However, none focuses on capitalization and punctuation restoration in Galician and Catalan. We therefore propose a system for capitalization and punctuation restoration based on Transformer models for Catalan and Galician. Both models perform very well, with an overall performance of 90.2% for Galician and 90.86% for Catalan, and can identify proper names, country names, and organizations for uppercase restoration.
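
      The abstract does not say how the restoration task is encoded. One common framing is per-token sequence labeling, where each lowercased word gets a tag for its casing and for the punctuation mark that follows it. The sketch below illustrates only that data framing, not the authors' method: the `case|punct` label scheme, the function names `to_example` and `restore`, and the three-mark punctuation set are all assumptions made for illustration, and no Transformer model is involved.

```python
import re

# Hypothetical label scheme (an assumption, not taken from the paper):
# "<case>|<punct>", where case is "U" (capitalize first letter) or "L",
# and punct is one of ",", ".", "?" or "_" for "no punctuation".

def to_example(text):
    """Turn punctuated, cased text into (ASR-like tokens, labels)."""
    tokens, labels = [], []
    for word in text.split():
        # Split off at most one trailing punctuation mark.
        match = re.match(r"^(.*?)([,.?]?)$", word)
        core, punct = match.group(1), match.group(2)
        if not core:
            continue  # skip stand-alone punctuation
        case = "U" if core[0].isupper() else "L"
        tokens.append(core.lower())
        labels.append(f"{case}|{punct or '_'}")
    return tokens, labels

def restore(tokens, labels):
    """Apply per-token labels to rebuild cased, punctuated text."""
    words = []
    for token, label in zip(tokens, labels):
        case, punct = label.split("|")
        word = token[:1].upper() + token[1:] if case == "U" else token
        if punct != "_":
            word += punct
        words.append(word)
    return " ".join(words)

sentence = "Visc a Barcelona, a prop de Girona."
tokens, labels = to_example(sentence)
restored = restore(tokens, labels)
```

      In a trained system, a token classifier would predict `labels` from `tokens`, and `restore` would then rebuild the readable text. This simple scheme cannot represent all-caps acronyms, which would need an extra case label.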

