Widaug: Aumento de datos para el reconocimiento de entidades nombradas usando Wikidata

Pablo Calleja; Oscar Corcho García; Alberto Sánchez

Authors: Pablo Calleja, Oscar Corcho García, Alberto Sánchez
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 70, 2023, pp. 145-155
Language: Spanish
Parallel titles:
- Widaug: Data augmentation for named entity recognition using Wikidata
Abstract
  State-of-the-art natural language processing (NLP) models rely on large amounts of training data: the more, the better. However, this requirement is a major limitation when datasets must be created for specific NLP tasks such as named entity recognition (NER), where one or more annotators have to read, understand, and annotate the required named entities throughout a corpus. Many good general-domain corpora currently exist for English, but particular domains or scenarios and languages other than English are still underrepresented in the research community. Data augmentation techniques are therefore explored to create synthetic data similar to the original data and so enrich the training of the models. Meanwhile, knowledge graphs contain a great deal of valuable information that is not being used to support data augmentation. This work proposes a data augmentation method based on the Wikidata knowledge graph, which is evaluated on a Spanish corpus for a named entity recognition challenge.