STA 721: Lecture 12
Duke University
Bounded Influence and Posterior Mean
Shrinkage properties and nonconcave penalties
conditions for optimal shrinkage and selection . . .
Readings (see reading link)
Carvalho, Polson & Scott (2010) propose an alternative shrinkage prior (the horseshoe, HS) \[\begin{align*} \boldsymbol{\beta}\mid \phi & \sim \textsf{N}\left(\mathbf{0}_p, \frac{\textsf{diag}(\tau_1^2, \ldots, \tau_p^2)}{ \phi }\right) \\ \tau_j \mid \lambda & \mathrel{\mathop{\sim}\limits^{\rm iid}}\textsf{C}^+(0, \lambda) \\ \lambda & \sim \textsf{C}^+(0, 1/\phi) \\ p(\alpha, \phi) & \propto 1/\phi \end{align*}\]
In the case \(\lambda = \phi = 1\) and with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\), \(\mathbf{Y}^* = \mathbf{X}^T\mathbf{Y}\) \[\begin{align*} \textsf{E}[\beta_i \mid \mathbf{Y}] & = \textsf{E}_{\kappa_i \mid \mathbf{Y}}\left[ \textsf{E}[\beta_i \mid \kappa_i, \mathbf{Y}]\right] \\ & = \int_0^1 (1 - \kappa_i)\, y^*_i\, p(\kappa_i \mid \mathbf{Y}) \ d\kappa_i \\ & = (1 - \textsf{E}[\kappa_i \mid y^*_i])\, y^*_i \end{align*}\] where \(\kappa_i = 1/(1 + \tau_i^2)\) is the shrinkage factor (as in James-Stein)
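The inner expectation is a standard normal-normal update: with \(\sigma^2 = 1\) and \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\), \(y^*_i \mid \beta_i \sim \textsf{N}(\beta_i, 1)\), so \[ \beta_i \mid \tau_i, \mathbf{Y}\sim \textsf{N}\left( \frac{\tau_i^2}{1 + \tau_i^2}\, y^*_i, \ \frac{\tau_i^2}{1 + \tau_i^2} \right), \qquad \textsf{E}[\beta_i \mid \kappa_i, \mathbf{Y}] = (1 - \kappa_i)\, y^*_i, \] which is why the posterior mean averages the linear shrinkage \((1-\kappa_i)\, y^*_i\) over \(p(\kappa_i \mid \mathbf{Y})\).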
Half-Cauchy prior induces a Beta(1/2, 1/2) distribution on \(\kappa_i\) a priori (change of variables)
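Sketch of the change of variables (with \(\lambda = 1\)): \(\tau_i \sim \textsf{C}^+(0,1)\) has density \(p(\tau_i) = \frac{2}{\pi(1+\tau_i^2)}\) for \(\tau_i > 0\), and \(\kappa_i = 1/(1+\tau_i^2)\) gives \(\tau_i = \left(\frac{1-\kappa_i}{\kappa_i}\right)^{1/2}\) with \(\left|\frac{d\tau_i}{d\kappa_i}\right| = \frac{1}{2\,\kappa_i^{3/2}(1-\kappa_i)^{1/2}}\), so \[ p(\kappa_i) = \frac{2}{\pi\left(1 + \frac{1-\kappa_i}{\kappa_i}\right)} \cdot \frac{1}{2\,\kappa_i^{3/2}(1-\kappa_i)^{1/2}} = \frac{1}{\pi}\,\kappa_i^{-1/2}(1-\kappa_i)^{-1/2}, \qquad \kappa_i \in (0,1), \] the \(\textsf{Beta}(1/2, 1/2)\) density, with most of its mass near \(\kappa_i = 0\) (no shrinkage) and \(\kappa_i = 1\) (total shrinkage): the horseshoe shape.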
marginal prior on \(\beta_i\) (after integrating out \(\tau_i\))
Posterior mean of \(\beta_i\) may be written as \[E[\beta_i \mid y^*_i] = y^*_i + \frac{d} {d y} \log m(y^*_i)\] where \(m(y)\) is the predictive density of \(y^*_i\) under the prior (with \(\lambda\) known)
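This is Tweedie's formula; it follows directly from the normal likelihood: with \(m(y) = \int \textsf{N}(y; \beta_i, 1)\, p(\beta_i)\, d\beta_i\), \[ \frac{d}{dy} \log m(y) = \frac{\int (\beta_i - y)\,\textsf{N}(y; \beta_i, 1)\, p(\beta_i)\, d\beta_i}{m(y)} = \textsf{E}[\beta_i \mid y] - y . \]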
Bounded Influence of the prior (in this setting) means that \[\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = c\] for some finite constant \(c\)
For HS \(\lim_{|y| \to \infty} \frac{d}{dy} \log m(y) = 0\)
\(E[\beta_i \mid y^*_i] \to y^*_i\) as \(|y_i^*| \to \infty\) (the MLE)
approximate unbiasedness for large \(|y_i^*|\)
Diabetes data (from the lars package)
64 predictors: 10 main effects, 2-way interactions and quadratic terms
sample size of 442
split into training and test sets
compare MSE for out-of-sample prediction using OLS, lasso and horseshoe priors
Root MSE for prediction of the left-out data, based on 25 different random splits with 100 test cases each
both Lasso and Horseshoe much better than OLS
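The slides do not say which software was used; below is a minimal R sketch of one random split (the comparison above averages over 25 splits). The package choices are assumptions: glmnet for the lasso and the horseshoe package for the horseshoe posterior mean; the seed and centering/scaling are illustrative.

```r
## One train/test split on the diabetes data (sketch only)
library(lars)       # diabetes data: x2 holds the 64 predictors
library(glmnet)     # lasso with cross-validated penalty
library(horseshoe)  # Gibbs sampler for the horseshoe prior

data(diabetes)
X <- scale(as.matrix(diabetes$x2))   # main effects, interactions, quadratics
y <- diabetes$y - mean(diabetes$y)   # center so no intercept is needed

set.seed(42)                         # arbitrary seed for the illustration
test <- sample(nrow(X), 100)         # 100 test cases; the rest for training
Xtr <- X[-test, ]; ytr <- y[-test]
Xte <- X[test, ];  yte <- y[test]
rmse <- function(yhat) sqrt(mean((yte - yhat)^2))

## OLS on all 64 predictors
b_ols <- coef(lm(ytr ~ Xtr - 1))
rmse(Xte %*% b_ols)

## Lasso, penalty chosen by 10-fold cross-validation
cvfit <- cv.glmnet(Xtr, ytr)
rmse(predict(cvfit, newx = Xte, s = "lambda.min"))

## Horseshoe: posterior mean of beta (shrinkage, no exact zeros)
hs <- horseshoe(ytr, Xtr, method.tau = "halfCauchy", method.sigma = "Jeffreys")
rmse(Xte %*% hs$BetaHat)
```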
Model \(Y = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\mathbf{X}^T\mathbf{X}= \mathbf{I}_p\) and \(\hat{\boldsymbol{\beta}}= \mathbf{X}^T \mathbf{Y}\equiv \mathbf{Y}^*\) (take \(\sigma^2 = 1\))
Penalized Least Squares: since \(\mathbf{X}^T\mathbf{X}= \mathbf{I}_p\), \(\| \mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 = \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 + \sum_j(\beta_j -\hat{\beta}_j)^2\), so \[ \hat{\boldsymbol{\beta}}^\lambda = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \ \frac{1}{2}\| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 + \frac{1}{2} \sum_j(\beta_j -\hat{\beta}_j)^2 + \sum_j \textsf{pen}_\lambda(\beta_j) \]
Bayes posterior mode (conditional) with prior \(p(\boldsymbol{\beta}\mid \lambda) = \prod_j p(\beta_j \mid \lambda)\) \[\begin{align*} \hat{\boldsymbol{\beta}}^\lambda & =\mathop{\mathrm{argmax}}_{\boldsymbol{\beta}} -\frac{1}{2} \| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 - \frac{1}{2} \sum_j(\beta_j -\hat{\beta}_j)^2 + \sum_j \log(p(\beta_j \mid \lambda)) \\ & = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \frac{1}{2}\| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 + \frac{1}{2} \sum_j(\beta_j -\hat{\beta}_j)^2 - \sum_j\log(p(\beta_j \mid \lambda)) \\ & = \mathop{\mathrm{argmin}}_{\boldsymbol{\beta}} \frac{1}{2}\| \mathbf{Y}- \mathbf{X}\hat{\boldsymbol{\beta}}\|^2 + \frac{1}{2} \sum_j(\beta_j -\hat{\beta}_j)^2 + \sum_j\textsf{pen}_\lambda(\beta_j) \end{align*}\] with \(\textsf{pen}_\lambda(\beta_j) = -\log p(\beta_j \mid \lambda)\)
Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties in the context of penalized likelihoods in this setting
With the duality of the negative log prior as the penalty, we can extend to Bayesian modal estimates where the prior is a function of \(|\beta_j|\): since \(\hat{\beta}_j = y^*_j\), we minimize (dropping the constant term) \[\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \textsf{pen}_\lambda(|\beta_j|)\]
Requirements on the penalty
To find the optimal estimator take derivative of \(\frac 1 2 \sum_j(\beta_j - \hat{\beta}_j)^2 + \sum_j \textsf{pen}_\lambda(|\beta_j|)\) componentwise and set to zero
Derivative is \[\begin{align*} \frac{d}{d\,\beta_j} \left\{\frac 1 2 (\beta_j - \hat{\beta}_j)^2 + \textsf{pen}_\lambda(|\beta_j|)\right\} & = (\beta_j - \hat{\beta}_j) + \mathop{\mathrm{sgn}}(\beta_j)\textsf{pen}^\prime_\lambda(|\beta_j|) \\ & = \mathop{\mathrm{sgn}}(\beta_j)\left\{|\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|) \right\} - \hat{\beta}_j \end{align*}\]
setting derivative to zero gives \(\hat{\beta}_j = \mathop{\mathrm{sgn}}(\beta_j)\left\{|\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|) \right\}\)
if \(\lim_{|\beta_j| \to \infty} \textsf{pen}^\prime_\lambda(|\beta_j|) = 0\) then for large \(|\beta_j|\), \(\hat{\beta}_j \approx \mathop{\mathrm{sgn}}(\beta_j) |\beta_j| = \beta_j\)
for large \(|\beta_j|\), \(|\hat{\beta}_j|\) is large with high probability
as MLE is unbiased, the optimal estimator is approximately unbiased for large \(|\beta_j|\)
A sufficient condition for a thresholding rule (\(\hat{\beta}_j^\lambda = 0\) when \(|\hat{\beta}_j|\) is small) is that \[\min_{\beta_j} \left\{ |\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|)\right\} > 0\]
if \(|\hat{\beta}_j| < \min_{\beta_j} \left\{ |\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|) \right\}\) then the derivative is positive for all positive \(\beta_j\) and negative for all negative \(\beta_j\), so \(\hat{\beta}_j^\lambda = 0\) is a local minimum
if \(|\hat{\beta}_j| > \min_{\beta_j} \left\{ |\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|) \right\}\) there may be multiple crossings (local roots)
a necessary and sufficient condition for continuity of \(\hat{\beta}_j^\lambda\) in the data is that the minimum of \(|\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|)\) is attained at zero
Prior \(\textsf{N}(0, 1/\lambda)\)
Penalty: \(\textsf{pen}_\lambda(|\beta_j|) = \frac{1}{2} \lambda |\beta_j|^2\)
Unbiasedness: for large \(|\beta_j|\)?
not a thresholding rule as \[|\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|) = (1 + \lambda)|\beta_j|,\] whose minimum over \(\beta_j\) is zero
is continuous as minimum is at zero
Penalty: \(\textsf{pen}_\lambda(|\beta_j|) = \lambda |\beta_j|\)
Unbiasedness: for large \(|\beta_j|\)?
Is a thresholding rule as \[\min_{\beta_j} \left\{ |\beta_j| + \textsf{pen}^\prime_\lambda(|\beta_j|)\right\} = \min_{\beta_j} \left\{ |\beta_j| + \lambda \right\} = \lambda > 0 \]
is continuous as minimum is at \(\beta_j = 0\)
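To see the contrast, here is a small R sketch of the componentwise solutions under \(\mathbf{X}^T\mathbf{X}= \mathbf{I}\) (the function names ridge_est and lasso_est are just for this illustration): ridge shrinks every \(y^*_j\) by \(1/(1+\lambda)\) and never produces an exact zero, while the lasso soft-thresholds, giving exact zeros but a bias of \(\lambda\) for large \(|y^*_j|\).

```r
## Componentwise estimates when X^T X = I (sigma^2 = 1): ridge is the posterior
## mode under the normal prior, the lasso the mode under the double exponential
ridge_est <- function(ystar, lambda) ystar / (1 + lambda)                        # shrinks, never exactly 0
lasso_est <- function(ystar, lambda) sign(ystar) * pmax(abs(ystar) - lambda, 0)  # soft threshold

ystar  <- seq(-4, 4, length.out = 401)
lambda <- 1
plot(ystar, ystar, type = "l", lty = 3, xlab = "y*", ylab = "estimate")  # MLE: 45-degree line
lines(ystar, ridge_est(ystar, lambda), col = "blue")  # biased everywhere, no thresholding
lines(ystar, lasso_est(ystar, lambda), col = "red")   # zero on [-1, 1], bias lambda for large |y*|
legend("topleft", legend = c("MLE", "ridge", "lasso"),
       lty = c(3, 1, 1), col = c("black", "blue", "red"), bty = "n")
```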
The Generalized Double Pareto of Armagan, Dunson & Lee (2013)
has a prior density for \(\beta_j\) of the form \[ p(\beta_j \mid \xi, \alpha) = \frac{1}{2 \xi} \left(1 + \frac{|\beta_j|}{\alpha \xi}\right)^{-(1 + \alpha)} \]
express as \(\beta_j \mid \xi, \alpha \sim \textsf{GDP}(\xi, \alpha)\)
Scale mixture of normals representation \[\begin{align*} \beta_j \mid \tau_j & \sim \textsf{N}(0, \tau_j) \\ \tau_j \mid \lambda_j & \sim \textsf{Exp}(\lambda_j^2/2) \\ \lambda_j & \sim \textsf{Gamma}(\alpha, \eta) \\ \beta_j & \sim \textsf{GDP}(\xi = \eta/\alpha, \alpha) \end{align*}\]
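Sketch of why the mixture gives the GDP (taking \(\eta\) as the rate of the Gamma): integrating out \(\tau_j\) gives a double-exponential conditional, \(p(\beta_j \mid \lambda_j) = \frac{\lambda_j}{2} e^{-\lambda_j |\beta_j|}\), and then mixing over \(\lambda_j \sim \textsf{Gamma}(\alpha, \eta)\) \[ p(\beta_j) = \int_0^\infty \frac{\lambda_j}{2} e^{-\lambda_j |\beta_j|}\, \frac{\eta^\alpha}{\Gamma(\alpha)}\, \lambda_j^{\alpha-1} e^{-\eta \lambda_j}\, d\lambda_j = \frac{\alpha\, \eta^\alpha}{2\, (|\beta_j| + \eta)^{\alpha+1}} = \frac{1}{2\xi}\left(1 + \frac{|\beta_j|}{\alpha\xi}\right)^{-(1+\alpha)}, \qquad \xi = \eta/\alpha . \]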
is this a thresholding rule? unbiasedness? continuity?
for all parameters or are there restrictions?
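A sketch of the answers: up to a constant, \(\textsf{pen}(|\beta_j|) = -\log p(\beta_j \mid \xi, \alpha) = (1+\alpha) \log\left(1 + \frac{|\beta_j|}{\alpha\xi}\right)\), so \[ \textsf{pen}^\prime(|\beta_j|) = \frac{1+\alpha}{\alpha\xi + |\beta_j|} \to 0 \ \text{ as } |\beta_j| \to \infty \quad \text{(approximate unbiasedness)}, \qquad \min_{\beta_j}\left\{ |\beta_j| + \frac{1+\alpha}{\alpha\xi + |\beta_j|} \right\} > 0 \quad \text{(thresholding)} \] for all \(\alpha, \xi > 0\), while the minimum is attained at \(\beta_j = 0\) (continuity) only when \(\alpha\xi \geq \sqrt{1+\alpha}\).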
The literature on shrinkage estimators (with or without selection) is extensive
For Bayes, there is a choice of estimator (posterior mean or posterior mode)
Prior/Posterior do not put any probability on the event \(\beta_j = 0\)
Uncertainty that the coefficient is zero?
Selection solved as a post-analysis decision problem
Selection part of model uncertainty