Influencia de la Longitud del Texto en Tareas de Recuperación de Información mediante Tópicos Probabilísticos

Carlos Badenes Olmedo; Oscar Corcho García; Borja Lozano Álvarez

Authors: Carlos Badenes Olmedo, Oscar Corcho García, Borja Lozano Álvarez
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 67, 2021, pp. 27-36
Language: Spanish
Parallel titles:
- Impact of Text Length for Information Retrieval Tasks based on Probabilistic Topics
Abstract
  Information retrieval has traditionally used vector models to describe texts. On large document collections, these models need to reduce the dimensionality of the vectors so that operations remain manageable without compromising performance. Probabilistic topic models (PTM) propose smaller vector spaces: words are organized into topics, and documents are related to each other through their topic distributions. As with many other AI techniques, the texts used to train the models influence their performance. In particular, we are interested in the impact that text length has when creating PTM. We have studied how it influences the semantic linking of multilingual documents and the capture of the knowledge derived from their relationships. The results suggest that the most suitable training texts should be of equal or greater length than those used later for inference, and that, at large scale, documents should be related using hierarchy-based similarity metrics.
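  To illustrate what relating documents through their topic distributions can look like, here is a minimal Python sketch. It assumes each document is already represented as a probability vector over topics and compares two such vectors with a similarity derived from the Jensen-Shannon divergence. The choice of metric, the function names and the example vectors are all assumptions made for illustration; the abstract does not state which density-based metric the authors use, and this is not the hierarchy-based metric that the results favour.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms where p_i == 0 contribute nothing to the sum.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_similarity(p, q):
    """Similarity score in [0, 1]: 1 means identical distributions."""
    return 1.0 - js_divergence(p, q)


# Hypothetical topic distributions over four topics (illustrative values only).
doc_a = [0.70, 0.20, 0.05, 0.05]
doc_b = [0.65, 0.25, 0.05, 0.05]
doc_c = [0.05, 0.05, 0.20, 0.70]

if __name__ == "__main__":
    print("a vs b:", js_similarity(doc_a, doc_b))
    print("a vs c:", js_similarity(doc_a, doc_c))
```

  In this sketch, doc_a and doc_b concentrate their probability mass on the same topic, so they score as more similar to each other than either does to doc_c.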