
Documat


Integer constraints for enhancing interpretability in linear regression

    1. [1] Universidad de Sevilla, Sevilla, Spain

    2. [2] University of Chicago, Chicago, United States

    3. [3] Universidad de Cádiz, Cádiz, Spain

  • Published in: Sort: Statistics and Operations Research Transactions, ISSN 1696-2281, Vol. 44, No. 1, 2020, pp. 69-78
  • Language: English
  • DOI: 10.2436/20.8080.02.95
  • Abstract
    • One of the main challenges researchers face is to identify the most relevant features in a prediction model. As a consequence, many regularized methods seeking sparsity have flourished. Although sparse, their solutions may not be interpretable in the presence of spurious coefficients and correlated features. In this paper we aim to enhance interpretability in linear regression in the presence of multicollinearity by: (i) forcing the sign of the estimated coefficients to be consistent with the sign of the correlations between predictors, and (ii) avoiding spurious coefficients so that only significant features are represented in the model. This is addressed by modelling both requirements as constraints and adding them to an optimization problem expressing some estimation procedure, such as ordinary least squares or the lasso. The resulting constrained regression models become Mixed Integer Quadratic Problems. The numerical experiments carried out on real and simulated datasets show that tightening the search space of some standard linear regression models by adding the constraints modelling (i) and/or (ii) helps to improve the sparsity and interpretability of the solutions with competitive predictive quality.
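Constraint (i) above can be illustrated in isolation. The following is a minimal sketch, not the paper's MIQP formulation: it drops the binary variables and instead imposes the sign-consistency requirement as a continuous bound-constrained least squares problem, forcing each coefficient to share the sign of its predictor's marginal correlation with the response. The data-generating setup and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

# Illustrative data with two deliberately collinear predictors.
rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)  # induce multicollinearity
beta_true = np.array([1.0, 1.0, -2.0, 0.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Sign of the correlation between each predictor and the response.
signs = np.sign([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])

# Constraint (i): beta_j >= 0 if corr(x_j, y) > 0, beta_j <= 0 otherwise.
lower = np.where(signs > 0, 0.0, -np.inf)
upper = np.where(signs > 0, np.inf, 0.0)

# Bound-constrained least squares (continuous relaxation of the MIQP).
res = lsq_linear(X, y, bounds=(lower, upper))
beta_hat = res.x

# Every fitted coefficient is sign-consistent with its correlation.
print(beta_hat * signs >= 0)
```

Requirement (ii), forcing insignificant coefficients exactly to zero, is what introduces the binary variables and turns the problem into a Mixed Integer Quadratic Problem; a general-purpose MIQP solver is needed for the full model.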

