UMUCorpusClassifier: Compilation and evaluation of linguistic corpus for Natural Language Processing tasks

José Antonio García Díaz; Ángela Almela Sánchez-Lafuente; Gema Alcaraz Mármol; Rafael Valencia García

Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 65, 2020, pp. 139-142
Language: English
Parallel titles:
- UMUCorpusClassifier: Recolección y evaluación de corpus lingüísticos para tareas de Procesamiento del Lenguaje Natural

Dialnet Metrics: 7 citations

Abstract
- Spanish (translated into English)
  Building an annotated corpus is a very time-consuming task. Although some researchers have proposed automatic annotation based on heuristics, this is not always possible. Moreover, even when annotation is carried out by people, there may be discrepancies among annotators, or within a single annotator's own work, that affect the quality of the corpus. The lack of supervision over the annotation process can therefore lead to low-quality corpora. In this work, we present a demonstration of UMUCorpusClassifier, an NLP tool that helps researchers compile corpora and also coordinate and supervise the annotation process. The tool facilitates daily monitoring and makes it possible to detect inconsistencies in the early stages of the annotation process.
- English
  The development of an annotated corpus is a very time-consuming task. Although some researchers have proposed annotating corpora automatically using ad hoc heuristics, valid hypotheses cannot always be made. Even when annotation is performed by human annotators, the quality of the corpus is heavily influenced by disagreements between annotators or by an annotator's inconsistency with their own judgments. Therefore, a lack of supervision of the annotation process can lead to poor-quality corpora. In this work, we present a demonstration of UMUCorpusClassifier, an NLP tool that helps researchers compile corpora as well as coordinate and supervise the annotation process. The tool eases daily supervision and makes it possible to detect deviations and inconsistencies during the early stages of the annotation process.
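
The abstract refers to disagreements between annotators and to an annotator disagreeing with themselves, but it does not say how UMUCorpusClassifier measures either. As a minimal sketch of how such disagreement could be quantified, the Python below computes Cohen's kappa, which is an assumption chosen for illustration and not necessarily the metric the tool uses. The function name, the label values, and the sample annotations are all hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items that received the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical sentiment labels for the same six documents (illustrative data).
annotator_1 = ["pos", "neg", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "neg", "pos", "neu", "pos", "neu"]
# Hypothetical second pass by annotator 1 over the same documents.
annotator_1_again = ["pos", "neg", "neg", "neu", "pos", "pos"]

# Agreement between two different annotators.
inter_kappa = cohen_kappa(annotator_1, annotator_2)
# Agreement of one annotator with their own second pass.
intra_kappa = cohen_kappa(annotator_1, annotator_1_again)
```

Values close to 1 indicate strong agreement, while values near 0 indicate agreement no better than chance. Recomputing such a score as annotations accumulate is one way a supervisor might notice drift early in the process.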
References
- Apolinardo-Arzube, O., J. A. García-Díaz, J. Medina-Moreira, H. Luna-Aveiga, and R. Valencia-García. 2019. Evaluating information-retrieval...
- García-Díaz, J. A., M. Cánovas-García, and R. Valencia-García. 2020. Ontology-driven aspect-based sentiment analysis classification: An infodemiological...
- Go, A., R. Bhayani, and L. Huang. 2009. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009.
- Grave, E., P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov. 2018. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893.
- Krippendorff, K. 2018. Content analysis: An introduction to its methodology. Sage publications.
- Medina-Moreira, J., J. A. García-Díaz, O. Apolinardo-Arzube, H. Luna-Aveiga, and R. Valencia-García. 2019. Mining twitter for measuring social...
- Medina-Moreira, J., J. O. Salavarria-Melo, K. Lagos-Ortiz, H. Luna-Aveiga, and R. Valencia-García. 2018. Opinion mining for measuring the...
- Mozetič, I., M. Grčar, and J. Smailović. 2016. Multilingual Twitter sentiment classification: The role of human annotators. PLoS ONE, 11(5).
- Pak, A. and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326.
- Salas-Zárate, M. d. P., M. A. Paredes-Valverde, M. A. Rodríguez-García, R. Valencia-García, and G. Alor-Hernández. 2017. Automatic detection...
- Singh, A., N. Thakur, and A. Sharma. 2016. A review of supervised machine learning algorithms. In 2016 3rd International Conference on Computing...