A practical view of large-scale classification: feature selection and real-time classification

Irene Rodríguez Luján

Author: Irene Rodríguez Luján
Thesis supervisor: Carlos Santa Cruz Fernández
Defended: at the Universidad Autónoma de Madrid (Spain) in 2012
Language: English
Thesis examination committee: José Ramón Dorronsoro Ibero (chair), Alberto Suárez González (secretary), Antonio Artés Rodríguez (member), Pedro Larrañaga Múgica (member), Ramón Figueras (member)
Links
- Open-access thesis at: Biblos-e Archivo
Abstract
- The growing volume of data in recent years, together with the emergence of classification systems that require real-time responses, has given rise to a new school of thought in machine learning, because many classification methods that were traditionally successful can no longer be applied. This may be due to hardware limitations (the data cannot be held entirely in memory, as many existing algorithms require) or to training and/or classification time requirements that traditional classifiers cannot meet. In either case, current machine learning solutions must be adapted to the new scenario, which inevitably leads to the design of simple and easily scalable algorithms. In particular, this thesis proposes two complementary solutions for tackling classification problems that require real-time predictions.
  
  The first of these contributions is a new feature selection algorithm that is independent of the classifier and reduces the classifier's computational cost in both the training and classification phases. The proposed method belongs to the group of multivariate filters and yields classification rates similar to those of its state-of-the-art counterparts at a lower computational complexity. In addition, the new algorithm is reformulated in a higher-dimensional space induced by a kernel, where it turns out to be equivalent to the well-known Kernel Fisher Discriminant Analysis (KFDA). This equivalence is proven theoretically and provides new insights into KFDA.
  
  The second contribution of this work is the design of an algorithm capable of classifying patterns in a few milliseconds. Motivated by the difficulty of applying nonlinear Support Vector Machines (SVMs) in real-time classification domains, the new method approximates the nonlinear decision boundaries by means of piecewise linear functions while locally preserving the maximum margin criterion. The results presented in this thesis show how the proposed algorithm can bridge the gap, in real-time classification systems, between linear SVMs, which are simple but less accurate, and nonlinear SVMs, which are effective but more complex.
  
  In conclusion, the large amount of data sometimes makes it necessary to set aside the precision of the most complex models in favor of approximate solutions that meet the requirements of the classification system.