Documat


Abstract of Semantically-enabled browsing of large multilingual document collections

Carlos Badenes Olmedo

  • Searching for similar documents and exploring the major themes they cover are common activities when browsing document collections. With the ongoing growth in the number of digital documents in multiple languages, we need better tools to browse large multilingual corpora. Manual document annotation has traditionally been used to facilitate such browsing. However, manual annotation is a knowledge-intensive and tedious task, which can be alleviated by automatic document annotation algorithms. Most algorithms represent documents in a common feature space that abstracts them away from the specific sequence of words they contain. Probabilistic topic models reduce that feature space by annotating documents with thematic information. Several algorithms have been proposed to perform document similarity search over this low-dimensional latent space, including on collections of texts in multiple languages. However, theme-aligned data or dictionaries are required to create multilingual topics, and the thematic information gets hidden behind specific representations that limit the capability of topics to explain content-based similarities. In this thesis we address the challenge of automatically relating large corpora of multilingual documents without losing the knowledge that topics offer to explain the relationships, and without the need for parallel or comparable corpora. To do so, we have created a framework in which probabilistic topic models can be created and reused, a hierarchical model for describing documents with thematic annotations, and an unsupervised algorithm that relates multilingual documents through their most relevant themes. Evaluations on classifying and sorting documents by similar content show good results across multiple domains.
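
As a rough illustration of similarity search over a topic-based latent space, the sketch below ranks documents by the Jensen-Shannon divergence between their topic distributions. This is a generic technique and not necessarily the algorithm proposed in the thesis; the topic vectors, document identifiers and function names are all hypothetical, and in practice the distributions would come from a trained topic model.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with log base 2."""
    # The midpoint is strictly positive wherever p or q is, so KL is well defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def rank_by_similarity(query, corpus):
    """Return document ids ordered from most to least similar to the query."""
    return sorted(corpus, key=lambda doc_id: js_divergence(query, corpus[doc_id]))


# Hypothetical topic distributions over three topics (each sums to 1).
doc_topics = {
    "doc_a": [0.70, 0.20, 0.10],
    "doc_b": [0.10, 0.10, 0.80],
    "doc_c": [0.60, 0.30, 0.10],
}
query_topics = [0.65, 0.25, 0.10]

ranking = rank_by_similarity(query_topics, doc_topics)
print(ranking)
```

Because the comparison is made directly on topic weights, the topics that dominate both distributions can be read off to explain why two documents were judged similar.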

