NESM: a Named Entity based Proximity Measure for Multilingual News Clustering

  • Authors: X Soto Montalvo, Víctor Fresno Fernández, Raquel Martínez Lucas
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 48, 2012, pp. 81-88
  • Language: English
  • Parallel titles:
    • NESM: una medida de similitud para el clustering multilingüe de noticias basada en entidades nombradas
  • Abstract

      Measuring the similarity between documents is an essential task in document clustering. This paper presents a new measure based on the number and category of the named entities shared between news documents. Three feature-weighting functions and two standard similarity measures were used to evaluate the quality of the proposed measure in multilingual news clustering. Results on three collections of comparable news written in English and Spanish indicate that the new measure is competitive and in some cases outperforms standard similarity measures such as cosine similarity and the correlation coefficient.
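
      The abstract does not give the NESM formula, so the following is only a minimal sketch of the kind of comparison it describes. `cosine_similarity` and `pearson_correlation` are the standard definitions; `shared_entity_overlap`, its category weights, and its normalization are hypothetical illustrations of a measure driven by the number and category of shared named entities, not the authors' method. The sketch also assumes entities have already been matched across languages.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse feature-weight vectors (dicts)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pearson_correlation(a, b):
    """Pearson correlation computed over the union of both documents' features."""
    features = sorted(set(a) | set(b))
    n = len(features)
    if n == 0:
        return 0.0
    xs = [a.get(f, 0.0) for f in features]
    ys = [b.get(f, 0.0) for f in features]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def shared_entity_overlap(doc_a, doc_b, category_weights):
    """HYPOTHETICAL measure, not NESM: category-weighted share of common entities.

    Each document is a set of (entity, category) pairs. The score is the
    weight of the shared pairs divided by the weight of the lighter document.
    """
    def weight(pairs):
        return sum(category_weights.get(cat, 1.0) for _, cat in pairs)

    smaller = min(weight(doc_a), weight(doc_b))
    if smaller == 0.0:
        return 0.0
    return weight(doc_a & doc_b) / smaller


if __name__ == "__main__":
    # Toy documents; entity names and category weights are made up for illustration.
    doc_en = {("Madrid", "LOCATION"), ("UNED", "ORGANIZATION"), ("Obama", "PERSON")}
    doc_es = {("Madrid", "LOCATION"), ("UNED", "ORGANIZATION"), ("Zapatero", "PERSON")}
    weights = {"PERSON": 2.0, "ORGANIZATION": 1.5, "LOCATION": 1.0}

    # Binary entity vectors for the two standard measures.
    vec_en = {name: 1.0 for name, _ in doc_en}
    vec_es = {name: 1.0 for name, _ in doc_es}

    print("cosine:     ", cosine_similarity(vec_en, vec_es))
    print("correlation:", pearson_correlation(vec_en, vec_es))
    print("overlap:    ", shared_entity_overlap(doc_en, doc_es, weights))
```

      In this toy setup the category weights let a shared person count for more than a shared location, which is one plausible way that entity category could influence a proximity score.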

