We begin by showing that the best publicly available multiple-L1 learner corpus, the International Corpus of Learner English (Granger et al. 2009), is problematic when used directly for the task of native language detection (NLD). Topic biases in the corpus are a confounding factor that makes cross-validated performance appear misleadingly high for all of the feature types that are traditionally used. We therefore look for other, cheap ways to obtain training data for NLD. To that end, we present the web-scraped Lang-8 learner corpus and show that it is useful for the task, particularly when large quantities of data are used. Using large quantities of data also seems to facilitate the use of lexical features, which have previously been avoided. We also investigate ways to do NLD that do not involve learner corpora at all, including double-translation and extracting information directly from L1 corpora. All of these avenues are shown to be promising.