Abstract of "Distribution-free tests for lossless feature selection in classification and regression"

László Györfi, Tamás Linder, Harro Walk

We study the problem of lossless feature selection for a $d$-dimensional feature vector $X$ and label $Y$, for binary classification as well as nonparametric regression. For an index set $S \subseteq \{1, \dots, d\}$, consider the selected $|S|$-dimensional feature subvector $X_S = (X_i)_{i \in S}$. If $L^*$ and $L^*(S)$ stand for the minimum risk based on $X$ and $X_S$, respectively, then $X_S$ is called lossless if $L^*(S) = L^*$. For classification, the minimum risk is the Bayes error probability, while in regression it is the residual variance. We introduce nearest-neighbor-based test statistics to test the hypothesis that $X_S$ is lossless. The test statistic is an estimate of the excess risk $L^*(S) - L^*$. Surprisingly, estimating this excess risk turns out to be a functional estimation problem that does not suffer from the curse of dimensionality, in the sense that the convergence rate does not depend on the dimension $d$. With a suitable choice of threshold, the corresponding tests are proved to be consistent under conditions on the distribution of $(X, Y)$ that are significantly milder than in previous work. Moreover, our threshold is universal (dimension independent), in contrast to earlier methods, where for large $d$ the threshold becomes too large to be useful in practice.
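As a rough illustration of the idea of testing losslessness via nearest neighbors, the sketch below compares leave-one-out 1-NN classification error using only the features in $S$ against the error using all of $X$, and uses the difference as a proxy for the excess risk $L^*(S) - L^*$. This is an assumption-laden simplification for intuition only, not the authors' estimator: the function names, the use of plain leave-one-out 1-NN error, and the synthetic data are all illustrative choices not taken from the paper.

```python
import numpy as np

def loo_1nn_error(X, y):
    """Leave-one-out 1-nearest-neighbor misclassification rate (illustrative)."""
    # Pairwise squared Euclidean distances between all sample points.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(D, np.inf)  # a point may not be its own neighbor
    nn = D.argmin(axis=1)        # index of each point's nearest neighbor
    return float((y[nn] != y).mean())

def excess_risk_statistic(X, y, S):
    """Proxy for the excess risk L*(S) - L*: 1-NN error using only the
    features in S minus 1-NN error using all features. Near zero suggests
    the subvector X_S loses little information about Y."""
    return loo_1nn_error(X[:, list(S)], y) - loo_1nn_error(X, y)

if __name__ == "__main__":
    # Synthetic example: the label depends only on feature 0,
    # so S = {0} should look (nearly) lossless and S = {1} should not.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(400, 3))
    y = (X[:, 0] > 0).astype(int)
    print("S = {0}:", excess_risk_statistic(X, y, [0]))
    print("S = {1}:", excess_risk_statistic(X, y, [1]))
```

A real test would compare such a statistic to a threshold; the paper's point is that its nearest-neighbor estimate of the excess risk admits a dimension-independent convergence rate, so a universal threshold works.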

