Generación de un tesauro de similitud multilingüe a partir de un corpus comparable a CLIR

María Teresa Martín Valdivia; Manuel García Vega; Fernando Martínez Santiago; Luis Alfonso Ureña López

Authors: María Teresa Martín Valdivia, Manuel García Vega, Fernando Martínez Santiago, Luis Alfonso Ureña López
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 28, 2002, pp. 55-62
Language: Spanish
Abstract
- English
  This paper describes a new approach to automatically generating a similarity thesaurus from a comparable corpus, with the aim of applying it to Cross-Language Information Retrieval (CLIR). Although linguistic resources are increasingly available, some of them remain difficult to access, especially in multilingual settings. Moreover, the complexity of the CLIR task itself requires the combined use of several resources to increase system effectiveness. Comparable corpora are one such multilingual resource, of particular interest because they are readily available and can be generated automatically. To be useful, however, they must be aligned at least at the document level. Clustering techniques were used to carry out this alignment. Once the documents are aligned, the similarity thesaurus is generated from them. The experiments show that multilingual similarity thesauri are a good alternative when other, more suitable resources are not available.
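  The last step of the abstract, generating a similarity thesaurus from documents already aligned at document level, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a common formulation in which each term is represented by its frequencies across the aligned document pairs, and cross-language term similarity is the cosine between those vectors. The function name `build_similarity_thesaurus`, the raw-frequency weighting, and the toy Spanish-English data are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def build_similarity_thesaurus(aligned_pairs):
    """Build a cross-language similarity thesaurus (illustrative sketch).

    aligned_pairs: list of (source_tokens, target_tokens), one entry per
    document pair that has already been aligned at document level.
    Returns a dict mapping each source term to a list of
    (target_term, cosine_similarity) sorted by decreasing similarity.
    """
    # Each term is a sparse vector: aligned-pair index -> term frequency.
    src_vecs = defaultdict(Counter)
    tgt_vecs = defaultdict(Counter)
    for i, (src_tokens, tgt_tokens) in enumerate(aligned_pairs):
        for term in src_tokens:
            src_vecs[term][i] += 1
        for term in tgt_tokens:
            tgt_vecs[term][i] += 1

    def norm(vec):
        return math.sqrt(sum(x * x for x in vec.values()))

    def cosine(u, v):
        dot = sum(w * v[i] for i, w in u.items() if i in v)
        return dot / (norm(u) * norm(v)) if dot else 0.0

    thesaurus = {}
    for s_term, u in src_vecs.items():
        scored = [(t_term, cosine(u, v)) for t_term, v in tgt_vecs.items()]
        # Keep only target terms that co-occur with the source term in at
        # least one aligned pair; break ties alphabetically.
        thesaurus[s_term] = sorted(
            (p for p in scored if p[1] > 0),
            key=lambda p: (-p[1], p[0]),
        )
    return thesaurus


# Toy aligned corpus (assumed data, for illustration only).
pairs = [
    (["gato", "casa"], ["cat", "house"]),
    (["gato", "perro"], ["cat", "dog"]),
    (["casa"], ["house"]),
]
thesaurus = build_similarity_thesaurus(pairs)
print(thesaurus["gato"][0][0])  # highest-ranked English term for "gato"
```

  In a retrieval setting, the top-ranked target terms for each query term could serve as translation candidates. A realistic system would typically replace raw frequencies with a weighting scheme such as tf-idf and would need a far larger corpus than this toy example.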