Shrinkage Estimators and Hierarchical Bayes

STA 721: Lecture 11

Merlise Clyde (clyde@duke.edu)

Duke University

Outline

  • Lasso
  • Bayesian Lasso
  • Readings (see reading link)

    • Seber & Lee Chapter 12
    • Tibshirani (JRSS B 1996)
    • Park & Casella (JASA 2008)
    • Hans (Biometrika 2010)
    • Carvalho, Polson & Scott (Biometrika 2010)

LASSO Estimator

Tibshirani (JRSS B 1996) proposed estimating coefficients through \(L_1\) constrained least squares via the Least Absolute Shrinkage and Selection Operator or lasso \[\hat{\boldsymbol{\beta}}_{L} = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \left\{ \|\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1 \right\}\]

  • \(\mathbf{Y}_c\) is the centered \(\mathbf{Y}\), \(\mathbf{Y}_c = \mathbf{Y}- \bar{\mathbf{Y}} \mathbf{1}\)

  • \(\mathbf{X}_s\) is the centered and standardized \(\mathbf{X}\) matrix, so that the diagonal elements of \(\mathbf{X}_s^T\mathbf{X}_s\) are all equal to a constant \(c\).

  • the scale function in R can be used, but standardization is usually handled within the packages themselves

  • Control how large the coefficients may grow via the equivalent constrained form \[\mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} (\mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta})^T (\mathbf{Y}_c - \mathbf{X}_s\boldsymbol{\beta}) \quad \text{ subject to } \sum_j |\beta_j| \le t\]
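
For reference, the penalized form can also be fit with the glmnet package (a minimal sketch, assuming glmnet is installed; glmnet standardizes the predictors internally and parameterizes its penalty as \(\frac{1}{2n}\|\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|_1\), so its \(\lambda\) is on a different scale than above).

library(glmnet)
# lasso path for the longley data; column 7 (Employed) is the response
fit.glmnet = glmnet(as.matrix(longley[, -7]), longley[, 7], alpha = 1)
coef(fit.glmnet, s = 0.1)   # coefficients at an illustrative value of lambda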

Lasso Solutions

The entire path of solutions can be easily found using the "Least Angle Regression" algorithm of Efron et al. (Annals of Statistics 2004)

library(lars)
data(longley, package = "datasets")  # Longley macroeconomic data; column 7 (Employed) is the response
longley.lars = lars(as.matrix(longley[,-7]), longley[,7], type="lasso")

Coefficients

round(coef(longley.lars),4)
      GNP.deflator     GNP Unemployed Armed.Forces Population   Year
 [1,]       0.0000  0.0000     0.0000       0.0000     0.0000 0.0000
 [2,]       0.0000  0.0327     0.0000       0.0000     0.0000 0.0000
 [3,]       0.0000  0.0362    -0.0037       0.0000     0.0000 0.0000
 [4,]       0.0000  0.0372    -0.0046      -0.0010     0.0000 0.0000
 [5,]       0.0000  0.0000    -0.0124      -0.0054     0.0000 0.9068
 [6,]       0.0000  0.0000    -0.0141      -0.0071     0.0000 0.9438
 [7,]       0.0000  0.0000    -0.0147      -0.0086    -0.1534 1.1843
 [8,]      -0.0077  0.0000    -0.0148      -0.0087    -0.1708 1.2289
 [9,]       0.0000 -0.0121    -0.0166      -0.0093    -0.1303 1.4319
[10,]       0.0000 -0.0253    -0.0187      -0.0099    -0.0951 1.6865
[11,]       0.0151 -0.0358    -0.0202      -0.0103    -0.0511 1.8292

Selecting a Solution from the Path

summary(longley.lars)
LARS/LASSO
Call: lars(x = as.matrix(longley[, -7]), y = longley[, 7], type = "lasso")
   Df     Rss        Cp
0   1 185.009 1976.7120
1   2   6.642   59.4712
2   3   3.883   31.7832
3   4   3.468   29.3165
4   5   1.563   10.8183
5   4   1.339    6.4068
6   5   1.024    5.0186
7   6   0.998    6.7388
8   7   0.907    7.7615
9   6   0.847    5.1128
10  7   0.836    7.0000
  • For \(p\) predictors, \[C_p = \frac{\textsf{SSE}_p}{s^2} -n + 2p\]

  • \(s^2\) is the residual variance from the full model

  • \(\textsf{SSE}_p\) is the sum of squared errors for the model with \(p\) predictors (RSS)

  • if the model includes all the predictors with non-zero coefficients, then \(C_p \approx p\)

  • choose the model with the minimum \(C_p\) (ideally with \(C_p \approx p\))

  • in practice use Cross-validation or Generalized Cross Validation (GCV) to choose \(\lambda\)
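
For example, a cross-validation sketch with cv.lars from the lars package (the seed, \(K = 5\), and the default grid of \(\|\boldsymbol{\beta}\|_1\) fractions are illustrative choices):

set.seed(42)
longley.cv = cv.lars(as.matrix(longley[, -7]), longley[, 7], K = 5,
                     type = "lasso", plot.it = FALSE)
best.frac = longley.cv$index[which.min(longley.cv$cv)]   # fraction minimizing CV error
round(coef(longley.lars, s = best.frac, mode = "fraction"), 4)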

Features and Issues

  • Combines shrinkage (like Ridge Regression) with Variable Selection to deal with collinearity

  • Can be used for prediction or variable selection

  • not invariant under linear transformations of the predictors

  • typically no uncertainty estimates for the coefficients or predictions

  • ignores uncertainty in the choice of \(\lambda\)

  • may overshrink large coefficients

Bayesian LASSO

  • Equivalent to finding the posterior mode under a double exponential (Laplace) prior \[ \mathop{\mathrm{argmax}}_{\boldsymbol{\beta}} -\frac{\phi}{2} \left\{ \| \mathbf{Y}_c - \mathbf{X}_s \boldsymbol{\beta}\|^2 + \lambda^* \|\boldsymbol{\beta}\|_1 \right\} \]

  • Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso \[\begin{eqnarray*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta}, \phi & \sim & \textsf{N}(\mathbf{1}_n \alpha + \mathbf{X}^s \boldsymbol{\beta}^s, \mathbf{I}_n/\phi) \\ \boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\ \tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\ p(\alpha, \phi) & \propto& 1/\phi \\ \end{eqnarray*}\]

  • Generalizes Ridge Priors to allow different prior variances for each coefficient

Double Exponential (Laplace) Prior

  • Marginal distribution of \(\beta_j\) \[\begin{eqnarray*} \boldsymbol{\beta}\mid \alpha, \phi, \boldsymbol{\tau}& \sim & \textsf{N}(\mathbf{0}, \textsf{diag}(\boldsymbol{\tau}^2)/\phi) \\ \tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi & \mathrel{\mathop{\sim}\limits^{\rm iid}}& \textsf{Exp}(\lambda^2/2) \\ p(\beta_j \mid \phi, \lambda) & = & \int_0^\infty p(\beta_j \mid \phi, \tau^2_j) p(\tau^2_j \mid \phi, \lambda) \, d\tau^2_j \\ \end{eqnarray*}\]

  • Can show that \(\beta_j \mid \phi, \lambda \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{DE}(\lambda \sqrt{\phi})\) \[\int_0^\infty \sqrt{\frac{\phi}{2 \pi t}}\, e^{-\frac{1}{2} \phi \frac{\beta^2}{t }} \, \frac{\lambda^2}{2} e^{- \frac{\lambda^2 t}{2}}\, dt = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|} \]

  • Scale Mixture of Normals (Andrews and Mallows 1974)
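
A quick Monte Carlo check of the scale-mixture representation (the values of \(\lambda\) and \(\phi\) below are arbitrary illustrative choices):

set.seed(123)
lambda = 2; phi = 1.5
tau2 = rexp(1e5, rate = lambda^2 / 2)               # tau_j^2 ~ Exp(lambda^2 / 2)
beta = rnorm(1e5, mean = 0, sd = sqrt(tau2 / phi))  # beta_j | tau_j^2 ~ N(0, tau_j^2 / phi)
# overlay the DE(lambda * sqrt(phi)) density on the simulated draws
curve(lambda * sqrt(phi) / 2 * exp(-lambda * sqrt(phi) * abs(x)),
      from = -3, to = 3, ylab = "density")
lines(density(beta), lty = 2)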

Gibbs Sampler

  • Integrate out \(\alpha\): \(\alpha \mid \mathbf{Y}, \phi \sim \textsf{N}(\bar{y}, 1/(n \phi))\)

  • \(\boldsymbol{\beta}\mid \boldsymbol{\tau}, \phi, \lambda, \mathbf{Y}_c \sim \textsf{N}(, )\)

  • \(\phi \mid \boldsymbol{\tau}, \boldsymbol{\beta}, \lambda, \mathbf{Y}_c \sim \textsf{G}( , )\)

  • \(1/\tau_j^2 \mid \boldsymbol{\beta}, \phi, \lambda, \mathbf{Y}\sim \textsf{InvGaussian}( , )\)

  • \(X \sim \textsf{InvGaussian}(\mu, \lambda)\) has density \[ f(x) = \sqrt{\frac{\lambda^2}{2 \pi}} x^{-3/2} e^{- \frac{1}{2} \frac{ \lambda^2( x - \mu)^2} {\mu^2 x}} \qquad x > 0 \]

  • Homework: Derive the full conditionals for \(\boldsymbol{\beta}^s\), \(\phi\), \(1/\tau_j^2\)

  • see Park & Casella (JASA 2008)
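
A sketch of running the sampler with an existing implementation (assuming the monomvn package is installed; the chain length and RJ = FALSE are illustrative choices, and an initial burn-in should be discarded in practice):

library(monomvn)
X = scale(as.matrix(longley[, -7]))        # centered and standardized predictors
y = longley[, 7] - mean(longley[, 7])      # centered response
blas = blasso(X, y, T = 5000, RJ = FALSE)  # Park & Casella-style Gibbs sampler
colMeans(blas$beta)                        # posterior means of the coefficients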

Horseshoe Priors

Carvalho, Polson & Scott (2010) propose an alternative shrinkage prior \[\begin{align*} \boldsymbol{\beta}\mid \phi, \boldsymbol{\tau} & \sim \textsf{N}\left(\mathbf{0}_p, \frac{\textsf{diag}(\boldsymbol{\tau}^2)}{ \phi }\right) \\ \tau_j \mid \lambda & \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{C}^+(0, \lambda) \\ \lambda & \sim \textsf{C}^+(0, 1/\phi) \\ p(\alpha, \phi) & \propto 1/\phi \end{align*}\]

  • \(\textsf{C}^+(0, \lambda)\) is the half-Cauchy distribution with scale \(\lambda\) \[ p(\tau_j \mid \lambda) = \frac{2}{\pi} \frac{\lambda}{\lambda^2 + \tau_j^2} \]
  • \(\textsf{C}^+(0, 1/\phi)\) is the half-Cauchy distribution with scale \(1/\phi\)
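
A quick Monte Carlo look at the marginal prior on \(\beta_j\) implied by the half-Cauchy mixture (taking \(\lambda = \phi = 1\); the overlay of a \(\textsf{DE}(1)\) density is purely to illustrate the heavier tails):

set.seed(1)
tau = abs(rcauchy(1e5))                   # tau_j ~ C+(0, 1)
beta.hs = rnorm(1e5, mean = 0, sd = tau)  # beta_j | tau_j ~ N(0, tau_j^2)
# heavier tails than the double exponential DE(1) density, (1/2) exp(-|beta|)
curve(0.5 * exp(-abs(x)), from = -6, to = 6, ylab = "prior density")
lines(density(beta.hs), lty = 2)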

Special Case

In the case \(\lambda = \phi = 1\) with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\) and \(\mathbf{Y}^* = \mathbf{X}^T\mathbf{Y}\), \[\begin{align*} \textsf{E}[\beta_i \mid \mathbf{Y}] & = \textsf{E}_{\kappa_i \mid \mathbf{Y}}\left[ \textsf{E}[\beta_i \mid \kappa_i, \mathbf{Y}]\right] \\ & = \int_0^1 (1 - \kappa_i) y^*_i \, p(\kappa_i \mid \mathbf{Y}) \, d\kappa_i \\ & = (1 - \textsf{E}[\kappa_i \mid y^*_i]) y^*_i \end{align*}\] where \(\kappa_i = 1/(1 + \tau_i^2)\) is the shrinkage factor (as in James-Stein)

  • Half-Cauchy prior induces a Beta(1/2, 1/2) distribution on \(\kappa_i\) a priori (change of variables)
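
A sketch of the change of variables (with \(\lambda = \phi = 1\), so \(\tau_i \sim \textsf{C}^+(0,1)\) and \(\tau_i = \sqrt{(1-\kappa_i)/\kappa_i}\)): \[\begin{align*} p(\kappa_i) & = \frac{2}{\pi} \frac{1}{1 + \tau_i^2} \left| \frac{d \tau_i}{d \kappa_i} \right| = \frac{2}{\pi}\, \kappa_i \cdot \frac{1}{2}\, \kappa_i^{-3/2} (1 - \kappa_i)^{-1/2} \\ & = \frac{1}{\pi}\, \kappa_i^{-1/2} (1 - \kappa_i)^{-1/2} \qquad \Longrightarrow \qquad \kappa_i \sim \textsf{Beta}(1/2, 1/2) \end{align*}\]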

Features and Issues

  • the posterior mode also induces shrinkage and variable selection if the mode is at zero

  • the posterior mean is a shrinkage estimator (no selection)

  • the tails of the prior are heavier than those of the Laplace prior (like a Cauchy distribution), so there is less shrinkage of large \(|\hat{\boldsymbol{\beta}}|\).

  • Desirable in the orthogonal case, where the lasso (like ridge regression) continues to shrink large coefficients (related to bounded influence)

  • MCMC can be slow to mix using general-purpose programs like Stan, but specialized R packages such as horseshoe and monomvn::bhs are available (see the sketch below)
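
A minimal sketch with the horseshoe package (assuming it is installed; the half-Cauchy prior on \(\tau\), Jeffreys prior on \(\sigma^2\), and the chain lengths are illustrative choices):

library(horseshoe)
X = scale(as.matrix(longley[, -7]))
y = longley[, 7] - mean(longley[, 7])
hs.fit = horseshoe(y, X, method.tau = "halfCauchy", method.sigma = "Jeffreys",
                   burn = 1000, nmc = 5000)
hs.fit$BetaHat   # posterior means of the coefficients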

Bounded Influence and Posterior Mean

The posterior mean of \(\beta_i\) may also be written as \[\textsf{E}[\beta_i \mid y^*_i] = y^*_i + \frac{d} {d y^*_i} \log m(y^*_i)\] where \(m(\cdot)\) is the predictive (marginal) density of \(y^*_i\) under the prior (with \(\lambda\) known)
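
A one-step derivation in this setting (\(\phi = 1\), \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\), so \(y^*_i \mid \beta_i \sim \textsf{N}(\beta_i, 1)\)), writing \(\varphi\) for the standard normal density and \(m(y) = \int \varphi(y - \beta_i)\, p(\beta_i)\, d\beta_i\): \[\begin{align*} m'(y) & = \int -(y - \beta_i)\, \varphi(y - \beta_i)\, p(\beta_i)\, d\beta_i = -y\, m(y) + m(y)\, \textsf{E}[\beta_i \mid y] \\ \Rightarrow \quad \frac{d}{d y} \log m(y) & = \frac{m'(y)}{m(y)} = -y + \textsf{E}[\beta_i \mid y] \end{align*}\]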

  • HS has Bounded Influence: \[\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = 0\]

  • \(\textsf{E}[\beta_i \mid y^*_i] \to y^*_i\) as \(|y_i^*| \to \infty\) (the MLE)

  • since the MLE converges to the true \(\beta_i\) as \(n \to \infty\), the HS posterior mean is asymptotically consistent

  • the DE prior also has bounded influence, but the bound does not decay to zero in the tails, so the posterior mean does not converge to the MLE for large \(|y^*_i|\)

Comparison

  • Diabetes data (from the lars package)

  • 64 predictors: 10 main effects, 2-way interactions and quadratic terms

  • sample size of 442

  • split into training and test sets

  • compare MSE for out-of-sample prediction using OLS, lasso and horseshoe priors

  • Root MSE for prediction of the held-out data, based on 25 different random splits with 100 test cases each (see the sketch below)
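
A minimal sketch of one such random split, comparing OLS and the lasso (the seed and the CV-based choice of the lasso fraction are illustrative; a horseshoe fit can be added with the horseshoe package as in the earlier sketch):

library(lars)
data(diabetes)   # diabetes$x2 holds the 64 predictors, diabetes$y the response
set.seed(1)
test = sample(nrow(diabetes$x2), 100)
Xtr = diabetes$x2[-test, ]; ytr = diabetes$y[-test]
Xte = diabetes$x2[test, ];  yte = diabetes$y[test]
# OLS predictions on the held-out cases
ols = lm(ytr ~ Xtr)
pred.ols = cbind(1, Xte) %*% coef(ols)
# lasso with the fraction chosen by 10-fold cross-validation
cv = cv.lars(Xtr, ytr, K = 10, type = "lasso", plot.it = FALSE)
frac = cv$index[which.min(cv$cv)]
lasso.fit = lars(Xtr, ytr, type = "lasso")
pred.lasso = predict(lasso.fit, Xte, s = frac, mode = "fraction")$fit
# root MSE over the 100 test cases
c(OLS = sqrt(mean((yte - pred.ols)^2)),
  lasso = sqrt(mean((yte - pred.lasso)^2)))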

Summary

The literature on shrinkage estimators (with or without selection) is vast

  • Elastic Net (Zou & Hastie 2005)
  • SCAD (Fan & Li 2001)
  • Generalized Double Pareto Prior (Armagan, Dunson & Lee 2013)
  • Spike-and-Slab Lasso (Rockova & George 2018)

For Bayes, choice of estimator

  • posterior mean (easy via MCMC)
  • posterior mode (optimization)
  • posterior median (via MCMC)

Properties?

  • Fan & Li (JASA 2001) discuss variable selection via non-concave penalties and oracle properties (next time …)