Transformers for Lexical Complexity Prediction in Spanish Language

  • Authors: Jenny Alexandra Ortiz Zambrano, César Espin Riofrio, Arturo Montejo Ráez
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 69, 2022, pp. 177-188
  • Language: English
  • Parallel title:
    • Transformers para la Predicción de la Complejidad Léxica en Lengua Española
  • Abstract

      In this article we present a contribution to the prediction of the complexity of single words in the Spanish language, based on the combination of a large number of features of different kinds. We obtained the results by running fine-tuned Transformer-based models, built on the pre-trained models BERT, XLM-RoBERTa, and RoBERTa-large-BNE, over the various Spanish datasets, combined with several regression algorithms. The evaluation showed that good performance was achieved, with a Mean Absolute Error (MAE) of 0.1598 and a Pearson correlation of 0.9883, obtained by training and evaluating the Random Forest Regressor algorithm on the fine-tuned BERT model. As a possible avenue toward better lexical complexity prediction, we are very interested in continuing to experiment with Spanish datasets, testing state-of-the-art Transformer models.

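As a rough illustration of the pipeline the abstract describes (representations from a fine-tuned Transformer fed to a regression algorithm and scored with MAE and Pearson correlation), here is a minimal sketch in Python. Random vectors stand in for the actual BERT embeddings and the complexity annotations — both are assumptions for brevity; the real features and labels come from the paper's fine-tuned models and Spanish datasets.

```python
# Sketch only: synthetic stand-ins for Transformer embeddings and
# gold lexical-complexity scores (hypothetical data, not the paper's).
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                # stand-in for embeddings
y = X[:, :4].sum(axis=1) + rng.normal(scale=0.1, size=500)  # stand-in scores

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Random Forest Regressor, the algorithm that gave the paper's best result.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Evaluate with the same two metrics reported in the abstract.
mae = mean_absolute_error(y_te, pred)
r, _ = pearsonr(y_te, pred)
print(f"MAE={mae:.4f}  Pearson={r:.4f}")
```

In the actual system, `X` would be embeddings extracted from the fine-tuned BERT / XLM-RoBERTa / RoBERTa-large-BNE checkpoints rather than random noise.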
