
Native language detection with 'cheap' learner corpora

  • Authors: Julian Brooke, Graeme Hirst
  • Source: Twenty years of learner corpus research: looking back, moving ahead / Sylviane Granger (ed.), Gaëtanelle Gilquin (ed.), Fanny Meunier (ed.), 2013, ISBN 978-2-87558-199-0, pp. 37-47
  • Language: English
  • Abstract
    • We begin by showing that the best publicly available, multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), has issues when used directly for the task of native language detection (NLD). The topic biases in the corpus are a confounding factor that results in cross-validated performance that appears misleadingly high, for all the feature types which are traditionally used. Our approach here is to look for other, cheap ways to get training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus, and show that it is useful for the task, particularly if large quantities of data are used. This also seems to facilitate the use of lexical features, which have been previously avoided. We also investigate ways to do NLD that do not involve having learner corpora at all, including double-translation and extracting information from L1 corpora directly. All of these avenues are shown to be promising.

