Creación de un corpus de noticias de gran tamaño en español para el análisis diacrónico y diatópico del uso del lenguaje

Pavel Razgovorov; David Tomás Díaz

Authors: Pavel Razgovorov, David Tomás Díaz
Source: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 62, 2019, pp. 29-36
Language: Spanish
Parallel titles:
- Creation of a large news corpus in Spanish for the diachronic and diatopic analysis of the use of language

Abstract
This article describes the process carried out to develop a large corpus of news stories in Spanish. All of the collected texts are located both temporally and geographically. This makes the corpus a very useful resource for work in linguistics, sociology, and data journalism, enabling both the diachronic and diatopic study of language use and the tracking of how specific events evolve. The corpus can be freely downloaded using the software developed as part of this work. The article concludes with a statistical analysis of the corpus and two case studies that show its potential for event analysis.