Desenvolvimento e avaliação de um modelo NER no domínio da análise cultural e do turismo

  • Sotelo Docío, Susana [1] ; Gamallo, Pablo [1] ; Iriarte, Álvaro [2]
    1. [1] Universidade de Santiago de Compostela, Santiago de Compostela, Spain
    2. [2] Universidade do Minho, Braga (São José de São Lázaro), Portugal

  • Published in: Linguamática, ISSN 1647-0818, Vol. 15, No. 2, 2023, pp. 3-18
  • Language: Portuguese
  • DOI: 10.21814/lm.15.2.405
  • Parallel titles:
    • Development and evaluation of a NER model in the domain of cultural analysis and tourism
  • Abstract
    • English

       Named Entity Recognition (NER) is an essential information extraction task in which the entities in a text are identified and classified. One of the main challenges faced by NER systems is generalizing what was learned to types of corpora other than those used for training. This problem is compounded by the fact that most training corpora are journalistic, so systems need to be adapted to other genres and domains. In this paper, we use a Spanish corpus of interviews with visitors to the city of Santiago de Compostela, annotated with named entities, to train and evaluate NER systems tailored to the domain of cultural analysis and tourism. We compare the approaches applied, ranging from classical machine learning algorithms to the fine-tuning of several Transformer models. The results significantly outperform the baseline, represented here by the toolkits Stanza, spaCy and Flair, although preliminary tests with entities not seen during training point to the need for further evaluation of the models' generalization capability and for the use of adversarial splits of the corpus.

    • Portuguese

       O Reconhecimento de Entidades Mencionadas (NER) é uma tarefa essencial de extracção de informação em que as entidades de um texto são identificadas e classificadas. Um dos principais desafios enfrentados pelos sistemas NER é a dificuldade de generalização do aprendido para outros tipos de corpora diferentes dos utilizados durante o treino. Este problema é acentuado pelo facto de a maioria dos corpora de treino utilizados serem de natureza jornalística e, portanto, precisarem de ser adaptados a outros géneros e domínios. Neste artigo, utilizamos um corpus espanhol composto por entrevistas a visitantes da cidade de Santiago de Compostela e anotado com entidades mencionadas, para a avaliação e treino de sistemas NER adaptados ao domínio da cultura e do turismo. Apresentamos uma comparação das diferentes abordagens aplicadas, desde algoritmos clássicos de aprendizagem automática ao afinamento de vários modelos de Transformers. Os resultados obtidos superam significativamente o baseline, representado aqui pelos toolkits Stanza, spaCy e Flair, embora os testes preliminares com entidades não observadas durante o treino sugiram a necessidade de avaliações adicionais da sua capacidade de generalização e o uso de um método de segmentação adversarial no corpus.
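       The abstract mentions preliminary tests with entities not seen during training. As a rough illustration of what such a check can look like, the sketch below computes entity-level precision, recall and F1 from BIO tags, and then reports recall separately for gold entities whose surface form did or did not occur in the training data. This is a minimal sketch under stated assumptions, not the authors' evaluation code: the BIO scheme, the `LOC`/`PER` labels, the toy sentence, and the helper names `bio_to_spans`, `prf` and `recall_by_novelty` are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of entity spans.

    Assumption: a stray I- tag with no matching open entity starts a new one.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, label = i, tag[2:]
    return spans


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall and F1."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f


def recall_by_novelty(tokens: List[str], gold: Set[Span], pred: Set[Span],
                      train_mentions: Set[str]) -> Dict[str, float]:
    """Recall on gold entities split by whether their lowercased surface
    form appeared in the (hypothetical) training set."""
    hits = {"seen": 0, "unseen": 0}
    totals = {"seen": 0, "unseen": 0}
    for span in gold:
        start, end, _ = span
        surface = " ".join(tokens[start:end]).lower()
        key = "seen" if surface in train_mentions else "unseen"
        totals[key] += 1
        hits[key] += span in pred
    return {k: (hits[k] / totals[k] if totals[k] else 0.0) for k in totals}


# Toy example (invented data, not taken from the corpus described above).
tokens = ["Visitamos", "la", "Catedral", "de", "Santiago", "con", "Ana"]
gold_tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "B-PER"]
pred_tags = ["O", "O", "B-LOC", "I-LOC", "I-LOC", "O", "O"]
train_mentions = {"catedral de santiago"}

gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
precision, recall, f1 = prf(gold, pred)
novelty = recall_by_novelty(tokens, gold, pred, train_mentions)
```

       Reporting the two recall figures side by side makes it visible when a high overall score is driven mainly by entities the model has already memorized, which is the kind of concern that motivates adversarial splits.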

