An empirical analysis of data selection techniques in statistical machine translation.

Mara Chinea Rios; Germán Sanchis Triches; Francisco Casacuberta Nolla

Authors: Mara Chinea Rios, Germán Sanchis Triches, Francisco Casacuberta Nolla
Published in: Procesamiento del lenguaje natural, ISSN 1135-5948, No. 55, 2015, pp. 101-108
Language: English
Parallel titles:
- Análisis empírico de técnicas de selección de datos en traducción automática estadística
Abstract
- Spanish
  Domain adaptation attracts considerable interest within statistical machine translation. One adaptation technique is based on data selection, which aims to select the best subset of bilingual sentences from a large pool of sentences. In this article, we study how the bilingual corpora used by sentence selection methods affect translation quality.
- English
  Domain adaptation has recently gained interest in statistical machine translation (SMT). One adaptation technique is based on data selection, which aims to select the best subset of bilingual sentences from an available pool of sentences with which to train an SMT system. In this paper, we study how the bilingual corpora used by data selection methods affect translation quality.