Variable selection and predictive models in big data environments

Álvaro Méndez Civieta

Authors: Álvaro Méndez Civieta
Thesis supervisors: María del Carmen Aguilera Morillo (co-supervisor), Rosa Elvira Lillo Rodríguez (co-supervisor)
Defense: Universidad Carlos III de Madrid (Spain), 2022
Language: Spanish
Thesis committee: María Luz Durbán Reguera (chair), María Angeles Gil Alvarez (secretary), Ying Wei (member)
Links
- Open-access thesis available at: e-Archivo
Abstract
- In recent years, advances in data collection technologies have produced increasingly large and complex datasets, which pose a difficult challenge. Traditionally, statistical methodologies dealt with datasets in which the number of variables did not exceed the number of observations. However, problems in which the number of variables is larger than the number of observations have become increasingly common and arise in areas such as economics, genetics, climate data and computer vision. This has required the development of new methodologies suited to a high-dimensional framework.
  
  Most statistical methodologies are limited to the study of averages. Least squares regression, principal component analysis and partial least squares, among others, all provide mean-based estimates and are built around the key idea that the data are normally distributed. This assumption often goes unverified in real datasets, where skewness, heteroscedasticity and outliers are easily found. Estimating more robust alternatives, such as quantiles, can help solve these problems and provides a more complete picture of the data distribution.
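
As a small illustration of why quantiles fit naturally into an estimation framework, the sketch below uses the standard check (pinball) loss from quantile regression and confirms numerically that the constant minimizing it over a skewed sample is the sample quantile. The function name `check_loss` and the simulated data are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def check_loss(residuals, tau):
    """Mean check (pinball) loss: tau * r if r >= 0, (tau - 1) * r if r < 0."""
    residuals = np.asarray(residuals, dtype=float)
    return np.mean(residuals * (tau - (residuals < 0)))

# Skewed synthetic sample (assumption: exponential data, for illustration only).
rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=2001)
tau = 0.9

# Evaluate the loss at every observed value and keep the best constant fit.
candidates = np.sort(y)
losses = np.array([check_loss(y - c, tau) for c in candidates])
best = candidates[np.argmin(losses)]

print("check-loss minimizer:", best)
print("sample 0.9 quantile: ", np.quantile(y, tau))
print("sample mean:         ", y.mean())
```

Replacing the constant with a linear predictor and minimizing the same loss gives quantile regression, in the same way that minimizing squared error gives least squares regression.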
  
  This thesis is built around these two core ideas. We seek to develop more robust, quantile-based methodologies and to extend them to high-dimensional problems in which the number of variables may be larger than the number of observations. The thesis is structured as a compendium of articles divided into four chapters. Each chapter has independent content and structure but remains within the main objective of the thesis.
  
  First, a series of basic concepts and results that are assumed to be known, or that are referenced in the rest of the thesis, is introduced. These include traditional least squares regression, quantile regression, penalized regression models, dimension reduction techniques such as principal component analysis and partial least squares, and functional data analysis.
  
  One possible solution for high-dimensional regression problems is the use of variable selection techniques. In this regard, sparse group lasso (SGL), a linear combination of lasso and group lasso, has proven to be a very effective alternative. However, these penalizations are based on the bias-variance tradeoff: they seek to reduce the variability of the estimates by introducing some bias into the model, which means that the variables selected by the model may not be the truly significant ones. The first contribution of this thesis studies the formulation of an adaptive sparse group lasso for quantile regression, a more flexible formulation of the sparse group lasso that makes use of the adaptive idea, that is, the use of adaptive weights in the penalization to help correct the bias, thereby improving variable selection and prediction accuracy. The adaptive idea has traditionally been limited to low-dimensional scenarios, because it requires solving an unpenalized model, which is infeasible in high dimensions. This thesis studies a series of alternatives for computing the weights that effectively extend adaptive estimators to high-dimensional problems.
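
To make the penalization concrete, the sketch below evaluates a sparse group lasso penalty with optional adaptive weights. The abstract does not state the formula, so the form used here (a convex combination with mixing parameter `alpha`, group terms scaled by the square root of the group size, and weights computed as an inverse power of an initial estimate) is an assumption based on the usual way these penalties are written. All function and parameter names are hypothetical.

```python
import numpy as np

def asgl_penalty(beta, group_index, lam, alpha,
                 lasso_weights=None, group_weights=None):
    """Assumed form: lam * (alpha * sum_j v_j |b_j|
                            + (1 - alpha) * sum_g w_g * sqrt(p_g) * ||b_g||_2).
    With all weights equal to 1 this reduces to the plain sparse group lasso."""
    beta = np.asarray(beta, dtype=float)
    group_index = np.asarray(group_index)
    groups = np.unique(group_index)
    if lasso_weights is None:
        lasso_weights = np.ones(beta.size)
    if group_weights is None:
        group_weights = np.ones(groups.size)
    lasso_term = np.sum(lasso_weights * np.abs(beta))
    group_term = 0.0
    for w, g in zip(group_weights, groups):
        block = beta[group_index == g]
        group_term += w * np.sqrt(block.size) * np.linalg.norm(block)
    return lam * (alpha * lasso_term + (1 - alpha) * group_term)

def adaptive_weights(initial_estimate, gamma=1.0, eps=1e-8):
    """Hypothetical weight rule: large initial coefficients get small weights,
    so they are penalized (and biased) less."""
    return 1.0 / (np.abs(initial_estimate) ** gamma + eps)

beta = np.array([3.0, 4.0, 0.0])
groups = np.array([1, 1, 2])
plain = asgl_penalty(beta, groups, lam=1.0, alpha=0.5)
weighted = asgl_penalty(beta, groups, lam=1.0, alpha=0.5,
                        lasso_weights=adaptive_weights(np.array([3.0, 4.0, 0.1])))
print(plain, weighted)
```

The weights need an initial estimate of the coefficients. In low dimensions this can come from an unpenalized fit, which is exactly the step that breaks down when there are more variables than observations.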
  
  An alternative solution to the high-dimensional problem is the use of a dimension reduction technique such as partial least squares. Partial least squares (PLS) is a methodology initially proposed in the field of chemometrics as an alternative to traditional least squares regression when the data are high dimensional or affected by collinearity. It works by projecting the matrix of independent variables onto a subspace of uncorrelated variables that maximize the covariance with the response matrix. However, because it is an iterative process based on least squares, this methodology provides mean-based estimates and is extremely sensitive to outliers, skewness and heteroscedasticity. The second contribution of this thesis defines fast partial quantile regression, a technique that projects onto a subspace in which a quantile covariance metric is maximized, effectively extending partial least squares to the quantile regression framework. Unlike the traditional covariance, the quantile covariance has no unique definition. For this reason, three different alternatives for this metric are studied in this work using a series of synthetic datasets.
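
The covariance-maximizing projection can be illustrated with the first component of classical PLS for a single response. The sketch below is not the fast partial quantile regression of the thesis, whose quantile covariance metrics are not defined in this abstract. It only shows the mean-based step that the thesis generalizes, on simulated data with more variables than observations. The function name is hypothetical.

```python
import numpy as np

def first_pls_component(X, y):
    """First PLS direction for a single response: the unit vector w that
    maximizes the sample covariance between X w and y is proportional to X^T y
    (computed on centered data)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    w = Xc.T @ yc
    w /= np.linalg.norm(w)
    t = Xc @ w  # scores: projection of the observations onto w
    return w, t

# Simulated high-dimensional data (assumption: 30 observations, 100 variables).
rng = np.random.default_rng(1)
n, p = 30, 100
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

w, t = first_pls_component(X, y)
print("covariance of first component with y:", (t @ (y - y.mean())) / (n - 1))
```

Further components are obtained by deflating `X` and repeating the step, which keeps the components uncorrelated. Because every step relies on the ordinary covariance, a few outlying observations can dominate the direction `w`.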
  
  High-dimensional data are also common in functional data analysis. Functional data analysis (FDA) is a statistical field that studies observations that are not scalars but functions varying along a continuum, usually time. A key technique in this field is functional principal component analysis (FPCA), a methodology that decomposes functional observations into an orthogonal set of basis functions that best explain the variability in the data. However, FPCA fails to capture shifts in the scale of the data that affect the quantiles, and it is sensitive to outliers. The third contribution of this thesis introduces the functional quantile factor model (FQFM), a methodology that extends the concept of FPCA to quantile regression and yields a model that can explain the quantiles of the data conditional on a set of common functions. An iterative algorithm for computing the FQFM estimator is also proposed. This algorithm can handle missing data and observations measured on irregular time grids.
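
For reference, the classical FPCA decomposition can be approximated on discretized curves with a singular value decomposition. The sketch below assumes fully observed curves on a common, equally spaced grid, which is precisely the restriction that the FQFM algorithm is described as avoiding. It is a mean-based illustration, not the FQFM estimator, and the function name and simulated curves are hypothetical.

```python
import numpy as np

def discretized_fpca(curves, n_components):
    """Approximate FPCA for curves sampled on a common regular grid.
    Rows of `curves` are observations, columns are grid points."""
    mean_curve = curves.mean(axis=0)
    centered = curves - mean_curve
    # Right singular vectors play the role of the orthogonal basis functions.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return mean_curve, components, scores, explained

# Simulated curves (assumption: two sinusoidal modes of variation plus noise).
grid = np.linspace(0.0, 1.0, 50)
rng = np.random.default_rng(2)
a = rng.normal(size=(40, 1))
b = rng.normal(scale=0.5, size=(40, 1))
curves = (a * np.sin(2 * np.pi * grid) + b * np.cos(2 * np.pi * grid)
          + rng.normal(scale=0.05, size=(40, grid.size)))

mean_curve, components, scores, explained = discretized_fpca(curves, 2)
print("share of variability explained by two components:", explained.sum())
```

The components summarize variation around the mean curve, so a change that affects only the spread of the curves, and therefore their upper or lower quantiles, is not separated out by this decomposition.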
  
  The last contribution of this thesis is asgl, a Python package that solves penalized least squares and quantile regression models in low- and high-dimensional frameworks. This package fills a gap in the tools available in programming languages such as R, MATLAB and Python by making it possible to use adaptive penalizations. It also provides different alternatives for computing the weights and is written so that it can be executed in parallel, potentially reducing computation time.
  
  Finally, the last chapter of this thesis presents the conclusions of this work and outlines possible lines of future research.