
Documat


Distribution-free tests for lossless feature selection in classification and regression

  • László Györfi [1] ; Tamás Linder [2] ; Harro Walk [3]

    1. [1] Budapest University of Technology and Economics, Budapest, Hungary

    2. [2] Queen's University, Canada

    3. [3] University of Stuttgart, Stuttgart, Germany

  • Published in: Test: An Official Journal of the Spanish Society of Statistics and Operations Research, ISSN-e 1863-8260, ISSN 1133-0686, Vol. 34, No. 1, 2025, pp. 262-287
  • Language: English
  • Full text not available
  • Abstract
    • We study the problem of lossless feature selection for a d-dimensional feature vector X and label Y for binary classification as well as nonparametric regression. For an index set S ⊆ {1, …, d}, consider the selected |S|-dimensional feature subvector X_S = (X^(i), i ∈ S). If L* and L*(S) stand for the minimum risk based on X and X_S, respectively, then X_S is called lossless if L* = L*(S). For classification, the minimum risk is the Bayes error probability, while in regression, the minimum risk is the residual variance. We introduce nearest-neighbor-based test statistics to test the hypothesis that X_S is lossless. This test statistic is an estimate of the excess risk L*(S) − L*. Surprisingly, estimating this excess risk turns out to be a functional estimation problem that does not suffer from the curse of dimensionality, in the sense that the convergence rate does not depend on the dimension d. For an appropriate threshold, the corresponding tests are proved to be consistent under conditions on the distribution of (X, Y) that are significantly milder than in previous work. Also, our threshold is universal (dimension independent), in contrast to earlier methods where for large d the threshold becomes too large to be useful in practice.
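
To make the idea in the abstract concrete, here is a minimal, hypothetical sketch (Python, using NumPy and SciPy) of a nearest-neighbor-based estimate of the excess risk L*(S) − L* in the regression setting. It is not the authors' exact statistic: it simply uses the classical 1-nearest-neighbor estimate of the residual variance, (1/(2n)) Σ_i (Y_i − Y_{NN(i)})², computed once on the full feature vector X and once on the candidate subvector X_S, and takes the difference. The function names, the choice of estimator, and the toy data are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch (not the authors' exact construction): a 1-NN based
# estimate of the excess residual variance L*(S) - L* for regression.
import numpy as np
from scipy.spatial import cKDTree


def one_nn_risk_estimate(X, Y):
    """Estimate the residual variance E[(Y - E[Y|X])^2] via the classical
    1-nearest-neighbor heuristic  (1/(2n)) * sum_i (Y_i - Y_{NN(i)})^2."""
    tree = cKDTree(X)
    # k=2 because the closest point to X_i is X_i itself; take the second.
    _, idx = tree.query(X, k=2)
    nn = idx[:, 1]
    return 0.5 * np.mean((Y - Y[nn]) ** 2)


def excess_risk_statistic(X, Y, S):
    """Difference of the 1-NN risk estimates on the subvector X_S and on the
    full vector X; large values suggest the feature subset S is not lossless."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return one_nn_risk_estimate(X[:, list(S)], Y) - one_nn_risk_estimate(X, Y)


# Toy usage: Y depends only on the first coordinate, so S = {0} should be
# (nearly) lossless while S = {1} should not.
rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
print(excess_risk_statistic(X, Y, S=[0]))  # close to 0
print(excess_risk_statistic(X, Y, S=[1]))  # clearly positive
```

In a test of this style, losslessness of S would be rejected when the statistic exceeds a small threshold; the paper's contribution lies precisely in proving consistency of such nearest-neighbor-based tests with a universal, dimension-independent threshold.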

