Documat


Extracting terminology from Wikipedia

  • Authors: Jorge Vivaldi Palatresi, Horacio Rodríguez Hontoria
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 47, 2011, pp. 65-73
  • Language: English
  • Abstract
    • Spanish (translated into English)

      In this article we present a novel approach for obtaining the terminology of a domain using Wikipedia's page and category structures in a domain- and language-independent way. The idea is to exploit the Wikipedia category graph, starting from a set of categories that we associate with the domain. After obtaining the categories of the selected domain, the corresponding pages are extracted under certain constraints. The resulting set of pages and categories is selected as the initial vocabulary of the domain. We compare the results obtained with a module of a hybrid extractor, YATE, and with its Wikipedia-based equivalent. The results show that this resource can be used for this task. We apply this approach to four domains (astronomy, chemistry, economics and medicine) and two languages (English and Spanish).

    • English

      In this paper we present a new approach for obtaining the terminology of a given domain using the category and page structures of Wikipedia in a domain- and language-independent way. The idea is to take advantage of the Wikipedia category graph, starting with a set of categories that we associate with the domain. After obtaining the full set of categories belonging to the selected domain, the collection of corresponding pages is extracted, subject to some constraints. The set of titles of the recovered pages and categories is selected as the initial domain term vocabulary. The system has been evaluated by substituting it for the term candidate analyzer module of a state-of-the-art term extractor, YATE. The results show that this resource may be used for this task, overcoming some of the limitations of alternative knowledge sources. This approach has been applied to four domains (astronomy, chemistry, economics and medicine) and two languages (English and Spanish).
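
      The traversal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the toy category graph, and the depth limit are all hypothetical. The abstract only says that pages are extracted "using some constraints", so the depth limit stands in as one assumed example of such a constraint.

```python
from collections import deque


def collect_domain_vocabulary(seed_categories, subcategories, pages_in_category, max_depth=2):
    """Walk a category graph breadth-first from the seed categories and
    return the titles of all categories and pages reached.

    max_depth is an assumed constraint (not taken from the paper): children
    of categories at this depth are not explored.
    """
    visited = set(seed_categories)
    queue = deque((category, 0) for category in seed_categories)
    vocabulary = set()
    while queue:
        category, depth = queue.popleft()
        # Both the category title and its page titles become term candidates.
        vocabulary.add(category)
        vocabulary.update(pages_in_category.get(category, ()))
        if depth >= max_depth:
            continue
        for child in subcategories.get(category, ()):
            if child not in visited:  # the category graph may contain cycles
                visited.add(child)
                queue.append((child, depth + 1))
    return vocabulary


# Toy, hand-made graph for illustration only (not real Wikipedia data).
SUBCATEGORIES = {
    "Astronomy": ["Stars", "Galaxies"],
    "Stars": ["Star types"],
    "Star types": ["Fictional stars"],
    "Galaxies": ["Astronomy"],  # deliberate cycle back to the seed
}
PAGES = {
    "Astronomy": ["Telescope"],
    "Stars": ["Red giant"],
    "Star types": ["White dwarf"],
    "Fictional stars": ["Death Star"],
    "Galaxies": ["Milky Way"],
}

vocab = collect_domain_vocabulary(["Astronomy"], SUBCATEGORIES, PAGES, max_depth=2)
print(sorted(vocab))
```

      In this toy graph the depth limit keeps the off-domain category "Fictional stars" and its page out of the vocabulary, which is the kind of drift a constraint on the traversal is meant to prevent.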
