Detección de Idioma en Twitter

Yudivián Almeida Cruz; Suilan Estévez Velarde; Alejandro Piad Morffis

Affiliation (all authors): Universidad de La Habana, Cuba
Published in: GECONTEC: Revista Internacional de Gestión del Conocimiento y la Tecnología, e-ISSN 2255-5684, Vol. 2, No. 3, 2014, pp. 35-45
Language: Spanish
Parallel titles:
- Language Detection on Twitter
Abstract
  This paper presents an approach to identifying the language of tweets that requires neither training sets nor aggregated information. The approach uses techniques based on trigram recognition and small-word recognition algorithms. These algorithms are evaluated both individually and combined in a composite model. The paper also analyzes how pre-processing the tweets affects the accuracy of language identification. Finally, following an experimental evaluation, the best of the alternatives studied is determined.
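The abstract names three ingredients: tweet pre-processing, trigram matching, and small-word matching, used alone or combined. The record does not give the authors' actual algorithms, so the following is only a minimal sketch of how such a pipeline might look. The word and trigram lists, the function names, the weighted-sum combination, and the `weight` parameter are all assumptions made for illustration, not the method from the paper.

```python
import re

# Hypothetical, hand-picked resources for illustration only (an assumption,
# not data from the paper); a usable system would need far larger lists.
SMALL_WORDS = {
    "es": {"el", "la", "de", "que", "y", "en", "los", "un", "por", "con"},
    "en": {"the", "of", "and", "to", "in", "is", "that", "it", "for", "on"},
}
TRIGRAMS = {
    "es": {" de", "de ", " la", "que", "ue ", " el", "ión", "os ", " en", "ent"},
    "en": {" th", "the", "he ", "ing", "ng ", "and", " an", "nd ", " to", "ion"},
}


def preprocess(tweet: str) -> str:
    """Drop URLs, @mentions, '#' signs, digits and punctuation; lowercase."""
    text = re.sub(r"https?://\S+", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    text = re.sub(r"[^\w\s]|\d|_", " ", text.lower())
    return " ".join(text.split())


def small_word_scores(text: str) -> dict:
    """Fraction of tokens that appear in each language's small-word list."""
    words = text.split()
    return {
        lang: sum(w in vocab for w in words) / max(len(words), 1)
        for lang, vocab in SMALL_WORDS.items()
    }


def trigram_scores(text: str) -> dict:
    """Fraction of character trigrams found in each language's trigram list."""
    padded = f" {text} "
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    return {
        lang: sum(g in known for g in grams) / max(len(grams), 1)
        for lang, known in TRIGRAMS.items()
    }


def detect(tweet: str, weight: float = 0.5, clean: bool = True) -> str:
    """Combine both scores with a weighted sum (an assumed combination rule).

    weight=1.0 uses trigrams only, weight=0.0 uses small words only.
    clean=False skips pre-processing so its effect can be compared.
    """
    text = preprocess(tweet) if clean else tweet.lower()
    sw, tg = small_word_scores(text), trigram_scores(text)
    combined = {lang: weight * tg[lang] + (1 - weight) * sw[lang] for lang in sw}
    best = max(combined, key=combined.get)
    return best if combined[best] > 0 else "unknown"
```

Calling `detect(tweet, weight=1.0)` or `detect(tweet, weight=0.0)` runs each technique on its own, and toggling `clean` gives a rough way to see how pre-processing changes the result on a given tweet.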