
Documat


Un método ligero de generación de datos: combinación entre Cadenas de Markov y Word Embeddings

  • Authors: Eva Martínez García, Álvaro García Tejedor, Javier Morales, Alberto Nogales Moyano
  • Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 64, 2020, pp. 85-92
  • Language: Spanish
  • Parallel titles:
    • A light method for data generation: a combination of Markov Chains and Word Embeddings
  • Links
  • Abstract
    • Spanish

      Current state-of-the-art Natural Language Processing (NLP) techniques require a substantial amount of training data, which in some scenarios can be hard to obtain. We present a hybrid method for generating new sentences to augment the training data, combining Markov chains and word embeddings to produce high-quality data similar to an initial dataset. We propose a lightweight method that does not require a large amount of data. The results show that our method is able to generate useful data. In particular, we evaluate the generated data by building Transformer-based Language Models with data from three different domains, in the context of enriching general-purpose chatbots.

    • English

      Most of the current state-of-the-art Natural Language Processing (NLP) techniques are highly data-dependent. A significant amount of data is required for their training, and in some scenarios data is scarce. We present a hybrid method to generate new sentences for augmenting the training data. Our approach takes advantage of the combination of Markov Chains and word embeddings to produce high-quality data similar to an initial dataset. In contrast to other neural-based generative methods, it does not need a large amount of training data. Results show how our approach can generate useful data for NLP tools. In particular, we validate our approach by building Transformer-based Language Models using data from three different domains in the context of enriching general-purpose chatbots.
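      The record does not include the paper's implementation, but the idea described in the abstract — generate candidate sentences with a Markov chain trained on a small seed corpus, then keep only candidates whose embedding is close to the corpus — can be sketched. The snippet below is an illustrative toy, not the authors' method: it uses a bigram chain and simple co-occurrence vectors in place of trained word embeddings, and the seed sentences, similarity threshold, and helper names are all invented for the example.

      ```python
      import random
      from collections import defaultdict
      from math import sqrt

      # Hypothetical seed corpus standing in for a small in-domain dataset.
      SEED_SENTENCES = [
          "the chatbot answers questions about the weather",
          "the chatbot answers questions about train schedules",
          "users ask the chatbot about the weather forecast",
          "users ask questions about train schedules",
      ]

      def build_markov_chain(sentences):
          """Bigram transition table: word -> list of observed next words."""
          chain = defaultdict(list)
          for s in sentences:
              words = ["<s>"] + s.split() + ["</s>"]
              for a, b in zip(words, words[1:]):
                  chain[a].append(b)
          return chain

      def generate(chain, rng, max_len=12):
          """Random walk over the chain until the end token or max_len."""
          word, out = "<s>", []
          while len(out) < max_len:
              word = rng.choice(chain[word])
              if word == "</s>":
                  break
              out.append(word)
          return " ".join(out)

      def cooc_vectors(sentences):
          """Toy word 'embeddings': each word's bag of co-occurring words."""
          vocab = sorted({w for s in sentences for w in s.split()})
          index = {w: i for i, w in enumerate(vocab)}
          vecs = {w: [0.0] * len(vocab) for w in vocab}
          for s in sentences:
              ws = s.split()
              for w in ws:
                  for c in ws:
                      if c != w:
                          vecs[w][index[c]] += 1.0
          return vecs

      def sentence_vector(sentence, vecs, dim):
          """Sum of the word vectors of a sentence."""
          v = [0.0] * dim
          for w in sentence.split():
              if w in vecs:
                  for i, x in enumerate(vecs[w]):
                      v[i] += x
          return v

      def cosine(a, b):
          dot = sum(x * y for x, y in zip(a, b))
          na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(x * x for x in b))
          return dot / (na * nb) if na and nb else 0.0

      def augment(sentences, n=5, threshold=0.6, seed=0, max_tries=1000):
          """Generate Markov candidates; keep novel ones close to the corpus."""
          rng = random.Random(seed)
          chain = build_markov_chain(sentences)
          vecs = cooc_vectors(sentences)
          dim = len(vecs[next(iter(vecs))])
          centroid = sentence_vector(" ".join(sentences), vecs, dim)
          accepted = []
          for _ in range(max_tries):
              if len(accepted) >= n:
                  break
              cand = generate(chain, rng)
              if not cand or cand in sentences or cand in accepted:
                  continue
              if cosine(sentence_vector(cand, vecs, dim), centroid) >= threshold:
                  accepted.append(cand)
          return accepted

      if __name__ == "__main__":
          for s in augment(SEED_SENTENCES, n=3):
              print(s)
      ```

      In a setting closer to the paper's, the co-occurrence vectors would be replaced by pre-trained word embeddings, and the accepted sentences would feed the training data of a downstream language model.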

  • References
    • Artetxe, M., G. Labaka, and E. Agirre. 2017. Learning bilingual word embeddings with...
    • Artetxe, M. and H. Schwenk. 2019. Margin-based parallel corpus mining with multilingual sentence embeddings. In Proceedings of the ACL 2019...
    • Bahdanau, D., K. Cho, and Y. Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
    • Chan, W., N. Jaitly, Q. Le, and O. Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition....
    • Devlin, J., M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In...
    • Dušek, O. and F. Jurčíček. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings...
    • Freitag, M. and S. Roy. 2018. Unsupervised natural language generation with denoising autoencoders. In Proceedings of the EMNLP 2018, pages...
    • Gagniuc, P. A. 2017. Markov chains: from theory to implementation and experimentation. John Wiley & Sons.
    • Ghaddar, A. and P. Langlais. 2017. WiNER: A Wikipedia annotated corpus for named entity recognition. In Proceedings of the IJCNLP 2017(Volume...
    • Inaba, M. and K. Takahashi. 2016. Neural utterance ranking model for conversational dialogue systems. In Proceedings of the SIGDIAL 2016,...
    • Inoue, H. 2018. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929.
    • Junczys-Dowmunt, M. 2019. Microsoft Translator at WMT 2019: Towards large-scale document-level neural machine translation. In Proceedings...
    • Junczys-Dowmunt, M., R. Grundkiewicz, T. Dwojak, H. Hoang, K. Heafield, T. Neckermann, F. Seide, U. Germann, A. Fikri Aji, N. Bogoychev, A....
    • Jurafsky, D. and J. H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics,...
    • Lample, G., A. Conneau, M. Ranzato, L. Denoyer, and H. Jégou. 2018. Word translation without parallel data. In Proceedings of the ICLR 2018.
    • Le, Q. and T. Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages...
    • Liu, T., K. Wang, L. Sha, B. Chang, and Z. Sui. 2018. Table-to-text generation by structure-aware seq2seq learning. In 32nd AAAI Conference...
    • Mikolov, T., K. Chen, G. S. Corrado, and J. Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
    • Moritz, N., T. Hori, and J. L. Roux. 2019. Unidirectional Neural Network Architectures for End-to-End Automatic Speech Recognition. In Proc....
    • Peters, M. E., M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. 2018. Deep contextualized word representations. In...
    • Pham, N.-Q., T.-S. Nguyen, J. Niehues, M. Müller, and A. Waibel. 2019. Very Deep Self-Attention Networks for End-to-End Speech Recognition....
    • Puzikov, Y. and I. Gurevych. 2018. E2E NLG challenge: Neural models vs. templates. In Proceedings of the INLG 2018, pages 463–471.
    • Reimers, N. and I. Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the EMNLP-IJCNLP 2019,...
    • Ruiter, D., C. España-Bonet, and J. van Genabith. 2019. Self-Supervised Neural Machine Translation. In Proceedings of the ACL 2019, Volume...
    • Sankar, C., S. Subramanian, C. Pal, S. Chandar, and Y. Bengio. 2019. Do neural dialog systems use the conversation history effectively? an...
    • Sennrich, R., B. Haddow, and A. Birch. 2016. Improving neural machine translation models with monolingual data. In Proceedings of the ACL 2016...
    • Serban, I. V., A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical...
    • Sordoni, A., M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J.-Y. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive...
    • Tanner, M. A. and W. H. Wong. 1987. The calculation of posterior distributions by data augmentation. Journal of the American statistical Association,...
    • Taulé, M., M. A. Martí, and M. Recasens. 2008. AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the LREC'08.
    • Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the LREC 2012, pages 2214-2218.
    • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In...
    • Wen, T.-H., M. Gašić, D. Kim, N. Mrkšić, P.-H. Su, D. Vandyke, and S. Young. 2015. Stochastic language generation in dialogue using recurrent...
    • Yang, Z., Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le. 2019. Xlnet: Generalized autoregressive pretraining for language...
