STA 721: Lecture 11
Duke University
Readings (see reading link)
Tibshirani (JRSS B 1996) proposed estimating coefficients through \(L_1\)-constrained least squares via the Least Absolute Shrinkage and Selection Operator (lasso), written here in penalized (Lagrangian) form \[\hat{\boldsymbol{\beta}}_{L} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \left\{ \|\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}\]
\(\mathbf{Y}_c\) is the centered \(\mathbf{Y}\), \(\mathbf{Y}_c = \mathbf{Y}- \bar{\mathbf{Y}} \mathbf{1}\)
\(\mathbf{X}_s\) is the centered and standardized \(\mathbf{X}\) matrix, so that the diagonal elements of \(\mathbf{X}_s^T\mathbf{X}_s\) are all equal to a constant \(c\).
you can use the scale function in R, but standardization is usually handled within the packages
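For example, a minimal sketch of doing the standardization by hand in R (using the longley data and the Employed response from the example later in the lecture; most packages do this internally):

Yc <- longley$Employed - mean(longley$Employed)   # centered response
Xs <- scale(as.matrix(longley[, -7]))             # center and scale each column
diag(crossprod(Xs))                               # all equal to the constant c (here n - 1 = 15)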
Control how large coefficients may grow \[\arg \min_{\boldsymbol{\beta}} (\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta})^T (\mathbf{Y}_c - \mathbf{X}_s\boldsymbol{\beta}) \quad \text{subject to } \sum_j |\beta_j| \le t\]
[Figure: image from Machine Learning with R]
The entire path of solutions can be easily found using the "Least Angle Regression" (LARS) algorithm of Efron et al. (Annals of Statistics, 2004)
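For example, a sketch of obtaining the lasso path for the longley data with the lars package (this matches the call shown in the output below; the object name and rounding are just for display):

library(lars)
longley.lars <- lars(x = as.matrix(longley[, -7]), y = longley[, 7],
                     type = "lasso")
round(coef(longley.lars), 4)   # coefficients at each step of the path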
GNP.deflator GNP Unemployed Armed.Forces Population Year
[1,] 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
[2,] 0.0000 0.0327 0.0000 0.0000 0.0000 0.0000
[3,] 0.0000 0.0362 -0.0037 0.0000 0.0000 0.0000
[4,] 0.0000 0.0372 -0.0046 -0.0010 0.0000 0.0000
[5,] 0.0000 0.0000 -0.0124 -0.0054 0.0000 0.9068
[6,] 0.0000 0.0000 -0.0141 -0.0071 0.0000 0.9438
[7,] 0.0000 0.0000 -0.0147 -0.0086 -0.1534 1.1843
[8,] -0.0077 0.0000 -0.0148 -0.0087 -0.1708 1.2289
[9,] 0.0000 -0.0121 -0.0166 -0.0093 -0.1303 1.4319
[10,] 0.0000 -0.0253 -0.0187 -0.0099 -0.0951 1.6865
[11,] 0.0151 -0.0358 -0.0202 -0.0103 -0.0511 1.8292
LARS/LASSO
Call: lars(x = as.matrix(longley[, -7]), y = longley[, 7], type = "lasso")
Df Rss Cp
0 1 185.009 1976.7120
1 2 6.642 59.4712
2 3 3.883 31.7832
3 4 3.468 29.3165
4 5 1.563 10.8183
5 4 1.339 6.4068
6 5 1.024 5.0186
7 6 0.998 6.7388
8 7 0.907 7.7615
9 6 0.847 5.1128
10 7 0.836 7.0000
For \(p\) predictors, \[C_p = \frac{\textsf{SSE}_p}{s^2} -n + 2p\]
\(s^2\) is the residual variance from the full model
\(\textsf{SSE}_p\) is the sum of squared errors for the model with \(p\) predictors (RSS)
if the model includes all the predictors with non-zero coefficients, then \(C_p \approx p\)
choose the model with the minimum \(C_p\) (ideally with \(C_p \approx p\))
in practice use Cross-validation or Generalized Cross Validation (GCV) to choose \(\lambda\)
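A sketch of both approaches applied to the lars fit above (the object name longley.lars and the K = 10 folds are illustrative):

summary(longley.lars)           # the Df, Rss and Cp table shown above
cv.out <- cv.lars(x = as.matrix(longley[, -7]), y = longley[, 7],
                  K = 10, type = "lasso", plot.it = FALSE)
s.best <- cv.out$index[which.min(cv.out$cv)]    # fraction of the L1 norm
predict(longley.lars, s = s.best, mode = "fraction",
        type = "coefficients")$coefficients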
Combines shrinkage (like Ridge Regression) with Variable Selection to deal with collinearity
Can be used for prediction or variable selection
not invariant under linear transformations of the predictors
typically no uncertainty estimates for the coefficients or predictions
ignores uncertainty in the choice of \(\lambda\)
may overshrink large coefficients
Equivalent to finding the posterior mode under a double exponential (Laplace) prior \[ \mathop{\mathrm{argmax}}_{\boldsymbol{\beta}} -\frac{\phi}{2} \{ \| \mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 + \lambda^* \|\boldsymbol{\beta}\|_1 \} \]
Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso \[\begin{eqnarray*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta}, \phi & \sim & \textsf{N}(\mathbf{1}_n \alpha + \mathbf{X}^s \boldsymbol{\beta}^s, \mathbf{I}_n/\phi) \\ \boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\ \tau_1^2 \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\ p(\alpha, \phi) & \propto& 1/\phi \\ \end{eqnarray*}\]
Generalizes Ridge Priors to allow different prior variances for each coefficient
Marginal distribution of \(\beta_j\) \[\begin{eqnarray*} \boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\ \tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\ p(\beta_j \mid \phi, \lambda) & = & \int_0^\infty p(\beta_j \mid \phi, \tau^2_j) p(\tau^2_j \mid \phi, \lambda) \, d\tau^2_j \\ \end{eqnarray*}\]
Can show that \(\beta_j \mid \phi, \lambda \mathrel{\mathop{\sim}\limits^{\rm iid}}DE(\lambda \sqrt{\phi})\) \[\int_0^\infty \sqrt{\frac{\phi}{2 \pi t}} e^{-\frac{1}{2} \phi \frac{\beta^2}{t }} \, \frac{\lambda^2}{2} e^{- \frac{\lambda^2 t}{2}}\, dt = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|} \]
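One way to verify this (a sketch of the calculation): use the standard identity, for \(a, c > 0\), \[\int_0^\infty t^{-1/2} e^{-\frac{a}{t} - c t}\, dt = \sqrt{\frac{\pi}{c}}\, e^{-2\sqrt{a c}}\] with \(a = \phi \beta^2/2\) and \(c = \lambda^2/2\), so that \(2\sqrt{ac} = \lambda \sqrt{\phi}\, |\beta|\) and the constants in front collect to \(\lambda \phi^{1/2}/2\).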
Scale Mixture of Normals (Andrews and Mallows 1974)
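Putting this together (a quick check of the correspondence with the earlier penalized criterion): conditional on \(\phi\) and \(\lambda\), the log posterior for \(\boldsymbol{\beta}\) is, up to a constant, \[-\frac{\phi}{2} \|\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 - \lambda \sqrt{\phi}\, \|\boldsymbol{\beta}\|_1 = -\frac{\phi}{2} \left\{ \|\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 + \frac{2 \lambda}{\sqrt{\phi}} \|\boldsymbol{\beta}\|_1 \right\}\] so the posterior mode is the lasso solution with \(\lambda^* = 2\lambda/\sqrt{\phi}\).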
Integrate out \(\alpha\): \(\alpha \mid \mathbf{Y}, \phi \sim \textsf{N}(\bar{y}, 1/(n \phi))\)
\(\boldsymbol{\beta}\mid \boldsymbol{\tau}, \phi, \lambda, \mathbf{Y}_c \sim \textsf{N}(, )\)
\(\phi \mid \boldsymbol{\tau}, \boldsymbol{\beta}, \lambda, \mathbf{Y}_c \sim \textsf{G}( , )\)
\(1/\tau_j^2 \mid \boldsymbol{\beta}, \phi, \lambda, \mathbf{Y}\sim \textsf{InvGaussian}( , )\)
\(X \sim \textsf{InvGaussian}(\mu, \lambda)\) has density \[ f(x) = \sqrt{\frac{\lambda^2}{2 \pi}} x^{-3/2} e^{- \frac{1}{2} \frac{ \lambda^2( x - \mu)^2} {\mu^2 x}} \qquad x > 0 \]
Homework: Derive the full conditionals for \(\boldsymbol{\beta}^s\), \(\phi\), \(1/\tau^2\)
see Park & Casella (JASA 2008)
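A sketch (not the homework solution) of running the Bayesian lasso by MCMC with the monomvn package; if you code the Gibbs sampler yourself, statmod::rinvgauss can draw the inverse-Gaussian full conditional. Object names here are illustrative.

library(monomvn)
X <- as.matrix(longley[, -7]); y <- longley[, 7]
# Gibbs sampler for the Bayesian lasso (RJ = FALSE keeps all coefficients in the model)
bl <- blasso(X, y, T = 5000, RJ = FALSE)
colMeans(bl$beta)                 # posterior means of the coefficients
# statmod::rinvgauss(n, mean, shape) samples the inverse Gaussian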
Carvalho, Polson & Scott (2010) propose an alternative shrinkage prior \[\begin{align*} \boldsymbol{\beta}\mid \phi & \sim \textsf{N}(\mathbf{0}_p, \frac{\textsf{diag}(\boldsymbol{\tau}^2)}{ \phi }) \\ \tau_j \mid \lambda & \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{C}^+(0, \lambda) \\ \lambda & \sim \textsf{C}^+(0, 1/\phi) \\ p(\alpha, \phi) & \propto 1/\phi \end{align*}\]
In the case \(\lambda = \phi = 1\) and with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\), \(\mathbf{Y}^* = \mathbf{X}^T\mathbf{Y}\) \[\begin{align*} E[\beta_i \mid \mathbf{Y}] & = \textsf{E}_{\kappa_i \mid \mathbf{Y}}\left[ \textsf{E}_{\beta_i \mid \kappa_i, \mathbf{Y}}[\beta_i] \right] \\ & = \int_0^1 (1 - \kappa_i) y^*_i \, p(\kappa_i \mid \mathbf{Y}) \ d\kappa_i \\ & = (1 - \textsf{E}[\kappa_i \mid y^*_i]) y^*_i \end{align*}\] where \(\kappa_i = 1/(1 + \tau_i^2)\) is the shrinkage factor (like in James-Stein)
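A small numerical sketch of this representation (my own illustration, not from the slides): with \(\lambda = \phi = 1\), the half-Cauchy prior on \(\tau_i\) implies \(\kappa_i \sim \textsf{Beta}(1/2, 1/2)\) a priori and \(y^*_i \mid \kappa_i \sim \textsf{N}(0, 1/\kappa_i)\), so \(\textsf{E}[\kappa_i \mid y^*_i]\) is a one-dimensional integral.

# horseshoe posterior mean via the shrinkage factor kappa (lambda = phi = 1)
post.mean.hs <- function(ystar) {
  lik <- function(k) sqrt(k) * exp(-k * ystar^2 / 2)     # y* | kappa ~ N(0, 1/kappa)
  num <- integrate(function(k) k * lik(k) * dbeta(k, 0.5, 0.5), 0, 1)$value
  den <- integrate(function(k) lik(k) * dbeta(k, 0.5, 0.5), 0, 1)$value
  (1 - num / den) * ystar                                # (1 - E[kappa | y*]) y*
}
sapply(c(0.5, 2, 5, 10), post.mean.hs)   # little shrinkage once |y*| is large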
the posterior mode also induces shrinkage and variable selection if the mode is at zero
the posterior mean is a shrinkage estimator (no selection)
the tails of the distribution are heavier than the Laplace prior (like a Cauchy distribution) so that there is less shrinkage of large \(|\hat{\boldsymbol{\beta}}|\).
Desirable in the orthogonal case, where lasso is more like ridge regression (related to bounded influence)
MCMC is slow to mix using general-purpose programs like stan, but specialized R packages such as horseshoe and monomvn::bhs are available
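For example (a sketch; the argument names follow the CRAN horseshoe package and may differ across versions):

library(horseshoe)
hs.fit <- horseshoe(y = longley[, 7], X = as.matrix(longley[, -7]),
                    method.tau = "halfCauchy", method.sigma = "Jeffreys",
                    burn = 1000, nmc = 5000)
hs.fit$BetaHat                   # posterior means of the coefficients
# or monomvn::bhs(X, y, T = 5000) for the horseshoe via the blasso sampler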
Posterior mean of \(\beta\) may also be written as \[E[\beta_i \mid y^*_i] = y^*_i + \frac{d} {d y} \log m(y^*_i)\] where \(m(y)\) is the predictive (marginal) density of \(y^*_i\) under the prior (with \(\lambda\) known)
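A sketch of why (Tweedie's formula in the normal-means case, taking \(\phi = 1\)): writing \(m(y) = \int \textsf{N}(y; \beta, 1)\, \pi(\beta)\, d\beta\), \[\frac{d}{dy} \log m(y) = \frac{\int -(y - \beta)\, \textsf{N}(y; \beta, 1)\, \pi(\beta)\, d\beta}{m(y)} = \textsf{E}[\beta \mid y] - y,\] which rearranges to the expression above.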
HS has Bounded Influence: \[\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = 0\]
\(E[\beta_i \mid y^*_i] \to y^*_i\) as \(|y_i^*| \to \infty\) (the MLE)
since the MLE \(\to \beta_i^*\) as \(n \to \infty\), the HS is asymptotically consistent
Diabetes data (from the lars package)
64 predictors: 10 main effects, 2-way interactions and quadratic terms
sample size of 442
split into training and test sets
compare MSE for out-of-sample prediction using OLS, lasso and horseshoe priors
[Figure: root MSE for prediction of the left-out data, based on 25 different random training/test splits with 100 test cases each]
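A sketch of how one such training/test split might be run (the seed, split and number of MCMC iterations are illustrative; this is not the slides' exact code):

library(lars); library(monomvn)
data(diabetes)                                # diabetes$x2 holds the 64 predictors
X <- diabetes$x2; y <- diabetes$y
set.seed(42)
test <- sample(nrow(X), 100)                  # 100 test cases, rest for training
Xtr <- X[-test, ]; ytr <- y[-test]; Xte <- X[test, ]; yte <- y[test]
# OLS
ols <- lm(ytr ~ Xtr)
pred.ols <- cbind(1, Xte) %*% coef(ols)
# lasso, with the penalty chosen by 10-fold cross-validation
cv <- cv.lars(Xtr, ytr, type = "lasso", plot.it = FALSE)
s.best <- cv$index[which.min(cv$cv)]
pred.lasso <- predict(lars(Xtr, ytr, type = "lasso"), newx = Xte,
                      s = s.best, mode = "fraction")$fit
# horseshoe via monomvn
hs <- bhs(Xtr, ytr, T = 2000)
pred.hs <- mean(hs$mu) + Xte %*% colMeans(hs$beta)
sqrt(c(ols   = mean((yte - pred.ols)^2),
       lasso = mean((yte - pred.lasso)^2),
       hs    = mean((yte - pred.hs)^2)))      # root MSE on the test set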
The literature on shrinkage estimators (with or without selection) is vast
For Bayes, the choice of estimator (posterior mean, median, or mode) matters
What are their properties?