This thesis has been developed at University Carlos III of Madrid, motivated by a collaboration with the Gregorio Marañón General University Hospital in Madrid. It is framed within the field of Penalized Linear Models, specifically Variable Selection in Regression, Classification and Survival Models, but it also explores other techniques such as Variable Clustering and Semi-Supervised Learning.
In recent years, variable selection techniques based on penalized models have gained considerable importance. With the advances in technology over the last decade, it has become possible to collect and process huge volumes of data with algorithms of greater computational complexity. However, although it seemed that models providing simple and interpretable solutions would be definitively displaced by more complex ones, they have proved to be as useful as ever. Indeed, in practice, a model that filters out the important information and is easily extrapolated and interpreted by a human is often more valuable than a more complex model that cannot provide any feedback on the underlying problem, even when the latter offers better predictions.
This thesis focuses on high-dimensional problems, in which the number of variables is of the same order as, or larger than, the sample size. In this type of problem, constraints that eliminate variables from the model often lead to better performance and more interpretable results. For fitting linear regression models in high dimensions, the Sparse Group Lasso regularization method has proven to be very efficient. However, using the Sparse Group Lasso in practice depends on two critical aspects: the correct selection of the regularization parameters, and a prior specification of the groups of variables. Very little research has focused on algorithms for selecting the regularization parameters of the Sparse Group Lasso, and none has explored the grouping requirement and how to relax this restriction, which in practice is an obstacle to using the method.
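The abstract gives no formulas; for reference, the standard Sparse Group Lasso objective combines a lasso term and a group-lasso term controlled by the two regularization parameters mentioned above (here called `lam` and `alpha` — an assumed but common parameterization, not necessarily the thesis's notation). A minimal numpy sketch:

```python
import numpy as np

def sgl_objective(X, y, beta, groups, lam, alpha):
    """Least-squares loss plus the standard Sparse Group Lasso penalty.

    groups: list of index arrays -- the prior group specification the
            text refers to.
    lam, alpha: the two regularization parameters whose selection the
            thesis studies (assumed standard parameterization).
    """
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2 * n)
    l1 = alpha * np.sum(np.abs(beta))                       # within-group sparsity
    l2 = (1 - alpha) * sum(np.sqrt(len(g)) * np.linalg.norm(beta[g])
                           for g in groups)                 # group-level sparsity
    return loss + lam * (l1 + l2)
```

Setting `alpha = 1` recovers the lasso and `alpha = 0` the group lasso, which is why the correct choice of both parameters matters so much in practice.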
The main objective of this thesis is to propose new methods of variable selection in generalized linear models. This thesis explores the Sparse Group Lasso regularization method, analyzing in detail the correct selection of the regularization parameters, and finally relaxing the problem of group specification by introducing a new variable clustering algorithm that is based on the Sparse Group Lasso but is much more flexible and extends it. In a parallel but related line of research, this thesis reveals a connection between penalized linear models and semi-supervised learning.
This thesis is structured as a compendium of articles, divided into four chapters. Each chapter has a structure and contents independent of the rest; however, all of them follow a common thread.
First, variable selection methods based on regularization are introduced, describing the optimization problem that arises and a numerical algorithm to approximate its solution when a term of the objective function is not differentiable. The latter occurs naturally when penalties inducing variable selection are added.
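As an illustration of such an algorithm (not the thesis's own implementation), proximal gradient descent handles the non-differentiable penalty through its proximal operator; for the lasso penalty this is elementwise soft-thresholding, which is precisely what produces exact zeros in the coefficients:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1 -- the source of exact sparsity.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for the lasso: smooth least-squares
    gradient step followed by the non-smooth penalty's prox step."""
    n, p = X.shape
    beta = np.zeros(p)
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the smooth term
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

The same two-step structure (gradient step, then prox step) carries over to the Sparse Group Lasso, whose prox combines elementwise and groupwise shrinkage.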
A contribution of this work is the iterative Sparse Group Lasso, an algorithm that estimates the coefficients of the Sparse Group Lasso model without requiring the regularization parameters to be specified in advance. It applies coordinate descent to the regularization parameters while approximating the error function on a validation sample. Moreover, compared with the traditional Sparse Group Lasso, this new proposal considers a more general penalty in which each group has a flexible weight.
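The abstract does not detail the update rules, so the following is only a crude stand-in for the idea: alternately adjust each regularization parameter to reduce the error on a held-out validation sample. For brevity the inner model here is an elastic net rather than the Sparse Group Lasso, and the parameter updates are simple multiplicative/additive probes rather than the thesis's coordinate-descent scheme:

```python
import numpy as np

def fit_enet(X, y, lam, alpha, n_iter=300):
    """Proximal-gradient elastic net: prox of lam*(alpha*|.| + (1-alpha)/2*|.|^2)
    is soft-thresholding followed by a rescaling."""
    n, p = X.shape
    beta = np.zeros(p)
    t = n / np.linalg.norm(X, 2) ** 2          # step size 1/L
    for _ in range(n_iter):
        v = beta - t * X.T @ (X @ beta - y) / n
        beta = (np.sign(v) * np.maximum(np.abs(v) - t * lam * alpha, 0.0)
                / (1.0 + t * lam * (1 - alpha)))
    return beta

def coordinate_search(Xtr, ytr, Xval, yval, lam=1.0, alpha=0.5, n_rounds=15):
    """Alternately probe lam and alpha, keeping any move that lowers
    the validation error."""
    def err(l, a):
        beta = fit_enet(Xtr, ytr, l, a)
        return np.mean((yval - Xval @ beta) ** 2)
    best = err(lam, alpha)
    for _ in range(n_rounds):
        for cand in (lam * 2.0, lam * 0.5):                          # lam update
            e = err(cand, alpha)
            if e < best:
                best, lam = e, cand
        for cand in (min(alpha + 0.1, 1.0), max(alpha - 0.1, 0.0)):  # alpha update
            e = err(lam, cand)
            if e < best:
                best, alpha = e, cand
    return lam, alpha, best
```

The appeal of treating the parameters this way is that no two-dimensional grid search is needed; each parameter is improved while the other is held fixed.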
A separate chapter presents an extension that uses the iterative Sparse Group Lasso to rank the variables in the model according to a defined importance index. This index is motivated by problems with a large number of variables, only a few of which are directly related to the response variable. The methodology is applied to genetic data, revealing promising results.
A further significant contribution of this thesis is the Group Linear Algorithm with Sparse Principal decomposition, which is also motivated by problems in which only a small number of variables influence the response variable. However, unlike other methodologies, in this case the relevant variables are not necessarily among the observed data. This makes it a potentially powerful method, adaptable to multiple scenarios, which is also, as a side effect, a supervised variable clustering algorithm. Moreover, it can be interpreted as an extension of the Sparse Group Lasso that does not require an initial specification of the groups.
From a computational point of view, this thesis presents an organized framework for solving problems in which the objective function is a linear combination of a differentiable error term and a penalty. The flexibility of this implementation allows it to be applied to problems in very different contexts, for example, the proposed Generalized Elastic Net for semi-supervised learning.
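Such a framework can be pictured as a solver parameterized by a gradient and a proximal operator, so that swapping the penalty never requires touching the solver. The sketch below is an assumption about this design, not the thesis's actual code; it plugs in a group-lasso prox as an example:

```python
import numpy as np

def proximal_gradient(grad_f, prox, x0, step, n_iter=500):
    """Generic solver for min_x f(x) + h(x), where f is differentiable
    (grad_f) and the penalty h is known only through its prox(v, t)."""
    x = x0.copy()
    for _ in range(n_iter):
        x = prox(x - step * grad_f(x), step)
    return x

def group_lasso_prox(groups, lam):
    """Blockwise shrinkage: zeroes out entire groups at once."""
    def prox(v, t):
        out = v.copy()
        for g in groups:
            norm = np.linalg.norm(v[g])
            scale = (max(0.0, 1.0 - t * lam * np.sqrt(len(g)) / norm)
                     if norm > 0 else 0.0)
            out[g] = scale * v[g]
        return out
    return prox
```

Plugging in a different prox (lasso, elastic net, or a penalty coupling labeled and unlabeled data, as the Generalized Elastic Net suggests) changes the model while reusing the same solver, which is the kind of flexibility the paragraph describes.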
Regarding its main objective, this thesis offers a framework for the exploration of generalized interpretable models. The last chapter, in addition to compiling a summary of the contributions of the thesis, outlines future lines of work within its scope.