Una aproximación al uso de word embeddings en una tarea de similitud de textos en español

Francisco Javier Ortega Rodríguez; Tomás López Solaz; José Antonio Troyano Jiménez; Fernando Enríquez de Salamanca Ros

Authors: Francisco Javier Ortega Rodríguez, Tomás López Solaz, José Antonio Troyano Jiménez, Fernando Enríquez de Salamanca Ros
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 57, 2016, pp. 67-74
Language: Spanish
Parallel titles:
- An approach to the use of word embeddings in a textual similarity task for Spanish texts

Dialnet Métricas: 2 citations

Abstract
  This paper shows how a vector representation of words based on word embeddings can improve results on a semantic textual similarity task. We experimented with two methods that use word vectors to compute the degree of similarity between two texts: one based on vector aggregation and one based on word alignment. The alignment method uses the similarity between word vectors to decide which words in the two texts are semantically linked. The aggregation method builds a vector representation of each text from the vectors of its individual words; these text vectors are then compared using two classic measures, Euclidean distance and cosine similarity. We evaluated both systems on the Wikipedia-based corpus distributed for the Spanish semantic textual similarity task at SemEval-2015. The alignment-based method performs much better, with results very close to those of the best system at SemEval. The aggregation-based method performs substantially worse. However, it seems to capture aspects of similarity that the alignment method misses: combining the outputs of the two systems improves on the alignment method alone and even exceeds the results of the best system at SemEval.
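The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy vectors, the mean-based aggregation, the greedy best-match alignment, and the equal-weight combination are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math

# Hypothetical 3-dimensional toy vectors; real embeddings are learned from a large corpus.
TOY_VECTORS = {
    "gato": [0.9, 0.1, 0.0],
    "felino": [0.8, 0.2, 0.1],
    "perro": [0.7, 0.3, 0.0],
    "duerme": [0.1, 0.9, 0.2],
    "descansa": [0.2, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def lookup(tokens, vectors):
    """Return the vectors of the tokens that appear in the vocabulary."""
    return [vectors[t] for t in tokens if t in vectors]


def aggregate(tokens, vectors):
    """Text vector as the component-wise mean of its word vectors.

    Averaging is an assumption; the abstract only says texts are built
    from the individual word vectors.
    """
    found = lookup(tokens, vectors)
    if not found:
        raise ValueError("no token of the text is in the vocabulary")
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


def alignment_score(tokens_a, tokens_b, vectors):
    """Greedy alignment (an assumed scheme): link each word to its most
    similar word in the other text and average the link similarities
    over both directions."""
    vecs_a, vecs_b = lookup(tokens_a, vectors), lookup(tokens_b, vectors)
    if not vecs_a or not vecs_b:
        return 0.0
    a_to_b = sum(max(cosine(u, v) for v in vecs_b) for u in vecs_a) / len(vecs_a)
    b_to_a = sum(max(cosine(v, u) for u in vecs_a) for v in vecs_b) / len(vecs_b)
    return (a_to_b + b_to_a) / 2


def combined_score(tokens_a, tokens_b, vectors, weight=0.5):
    """Hypothetical combination: weighted mean of the alignment score and
    the cosine similarity of the aggregated text vectors."""
    agg = cosine(aggregate(tokens_a, vectors), aggregate(tokens_b, vectors))
    ali = alignment_score(tokens_a, tokens_b, vectors)
    return weight * ali + (1 - weight) * agg


text_a = ["gato", "duerme"]
text_b = ["felino", "descansa"]
print(alignment_score(text_a, text_b, TOY_VECTORS))
print(combined_score(text_a, text_b, TOY_VECTORS))
```

In this sketch the aggregation method reduces each text to a single vector before comparing, while the alignment method compares the texts word by word. That difference is one plausible reason why the two could capture different aspects of similarity, as the abstract suggests.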