NESM: a named entity based proximity measure for multilingual news clustering

Montalvo Herranz, Soto; Fresno Fernández, Víctor; Martínez Unanue, Raquel

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/22034

Item information
Title:	NESM: a named entity based proximity measure for multilingual news clustering
Alternative title:	NESM: una medida de similitud para el clustering multilingüe de noticias basada en entidades nombradas
Author(s):	Montalvo Herranz, Soto | Fresno Fernández, Víctor | Martínez Unanue, Raquel
Keywords:	Named entity | Multilingual clustering | Document similarity
Knowledge area(s):	Computer Languages and Systems (Lenguajes y Sistemas Informáticos)
Publication date:	Mar-2012
Publisher:	Sociedad Española para el Procesamiento del Lenguaje Natural
Bibliographic citation:	MONTALVO, Soto; FRESNO, Víctor; MARTÍNEZ, Raquel. “NESM: a named entity based proximity measure for multilingual news clustering”. Procesamiento del Lenguaje Natural. N. 48 (2012). ISSN 1135-5948, pp. 81-88
Abstract:	Measuring the similarity between documents is an essential task in Document Clustering. This paper presents a new measure based on the number and the category of the Named Entities shared between news documents. To evaluate the quality of the proposed measure in multilingual news clustering, three different feature-weighting functions and two standard similarity measures were used. The results, obtained on three collections of comparable news written in English and Spanish, indicate that the proposed measure is competitive and in some cases outperforms standard similarity measures such as cosine similarity and the correlation coefficient.
Sponsor(s):	This work has been part-funded by the Education Council of the Regional Government of Madrid, MA2VICMR (S-2009/TIC-1542), and the research project Holopedia, funded by the Ministerio de Ciencia e Innovación under grant TIN2010-21128-C02.
URI:	http://hdl.handle.net/10045/22034
ISSN:	1135-5948
Language:	eng
Type:	info:eu-repo/semantics/article
Peer reviewed:	yes
Appears in collections:	Procesamiento del Lenguaje Natural - Nº 48 (2012)

Files in this item:
File	Description	Size	Format
PLN_48_10.pdf		794 kB	Adobe PDF
