Bayesian Model Uncertainty

STA721: Lecture 19

Merlise Clyde

Duke University

Recap Diabetes Data

set.seed(8675309)
# the .txt files contain R code that creates the matrices diabetes.train and diabetes.test
source("yX.diabetes.train.txt")
diabetes.train = as.data.frame(diabetes.train)
source("yX.diabetes.test.txt")
diabetes.test = as.data.frame(diabetes.test)
colnames(diabetes.test)[1] = "y"   # name the response column

str(diabetes.train)
'data.frame':   342 obs. of  65 variables:
 $ y      : num  -0.0147 -1.0005 -0.1444 0.6987 -0.2222 ...
 $ age    : num  0.7996 -0.0395 1.7913 -1.8703 0.113 ...
 $ sex    : num  1.064 -0.937 1.064 -0.937 -0.937 ...
 $ bmi    : num  1.296 -1.081 0.933 -0.243 -0.764 ...
 $ map    : num  0.459 -0.553 -0.119 -0.77 0.459 ...
 $ tc     : num  -0.9287 -0.1774 -0.9576 0.256 0.0826 ...
 $ ldl    : num  -0.731 -0.402 -0.718 0.525 0.328 ...
 $ hdl    : num  -0.911 1.563 -0.679 -0.757 0.171 ...
 $ tch    : num  -0.0544 -0.8294 -0.0544 0.7205 -0.0544 ...
 $ ltg    : num  0.4181 -1.4349 0.0601 0.4765 -0.6718 ...
 $ glu    : num  -0.371 -1.936 -0.545 -0.197 -0.979 ...
 $ age^2  : num  -0.312 -0.867 1.925 2.176 -0.857 ...
 $ bmi^2  : num  0.4726 0.1185 -0.0877 -0.6514 -0.2873 ...
 $ map^2  : num  -0.652 -0.573 -0.815 -0.336 -0.652 ...
 $ tc^2   : num  -0.091 -0.6497 -0.0543 -0.6268 -0.6663 ...
 $ ldl^2  : num  -0.289 -0.521 -0.3 -0.45 -0.555 ...
 $ hdl^2  : num  -0.0973 0.8408 -0.3121 -0.2474 -0.5639 ...
 $ tch^2  : num  -0.639 -0.199 -0.639 -0.308 -0.639 ...
 $ ltg^2  : num  -0.605 0.78 -0.731 -0.567 -0.402 ...
 $ glu^2  : num  -0.578 1.8485 -0.4711 -0.6443 -0.0258 ...
 $ age:sex: num  0.69 -0.139 1.765 1.609 -0.284 ...
 $ age:bmi: num  0.852 -0.142 1.489 0.271 -0.271 ...
 $ age:map: num  0.0349 -0.3346 -0.5862 1.1821 -0.3025 ...
 $ age:tc : num  -0.978 -0.246 -1.927 -0.72 -0.244 ...
 $ age:ldl: num  -0.803 -0.203 -1.504 -1.2 -0.182 ...
 $ age:hdl: num  -0.7247 0.0147 -1.2661 1.6523 0.1046 ...
 $ age:tch: num  -0.254 -0.176 -0.31 -1.598 -0.216 ...
 $ age:ltg: num  0.0644 -0.2142 -0.163 -1.1657 -0.3474 ...
 $ age:glu: num  -0.636 -0.239 -1.359 0.071 -0.438 ...
 $ sex:bmi: num  1.304 0.935 0.915 0.142 0.635 ...
 $ sex:map: num  0.258 0.289 -0.381 0.5 -0.697 ...
 $ sex:tc : num  -1.02 0.131 -1.051 -0.274 -0.112 ...
 $ sex:ldl: num  -0.927 0.236 -0.913 -0.638 -0.452 ...
 $ sex:hdl: num  -0.647 -1.188 -0.377 1.189 0.238 ...
 $ sex:tch: num  -0.411 0.47 -0.411 -1.062 -0.296 ...
 $ sex:ltg: num  0.2988 1.2093 -0.0866 -0.6032 0.4857 ...
 $ sex:glu: num  -0.6171 1.6477 -0.8069 -0.0239 0.7283 ...
 $ bmi:map: num  0.189 0.191 -0.477 -0.195 -0.702 ...
 $ bmi:tc : num  -1.5061 -0.0595 -1.1853 -0.3231 -0.3239 ...
 $ bmi:ldl: num  -1.267 0.183 -0.976 -0.407 -0.536 ...
 $ bmi:hdl: num  -0.869 -1.41 -0.286 0.586 0.251 ...
 $ bmi:tch: num  -0.505 0.505 -0.484 -0.614 -0.388 ...
 $ bmi:ltg: num  0.1014 1.1613 -0.4085 -0.5893 0.0716 ...
 $ bmi:glu: num  -0.862 1.693 -0.89 -0.337 0.358 ...
 $ map:tc : num  -0.687 -0.148 -0.131 -0.451 -0.21 ...
 $ map:ldl: num  -0.5407 0.0388 -0.1034 -0.6114 -0.036 ...
 $ map:hdl: num  -0.235 -0.672 0.254 0.745 0.252 ...
 $ map:tch: num  -0.29 0.207 -0.258 -0.835 -0.29 ...
 $ map:ltg: num  -0.214 0.428 -0.427 -0.811 -0.748 ...
 $ map:glu: num  -0.541 0.659 -0.314 -0.23 -0.812 ...
 $ tc:ldl : num  -0.144 -0.551 -0.139 -0.509 -0.581 ...
 $ tc:hdl : num  0.8363 -0.3457 0.6304 -0.2579 -0.0392 ...
 $ tc:tch : num  -0.405 -0.326 -0.404 -0.295 -0.451 ...
 $ tc:ltg : num  -0.901 -0.259 -0.571 -0.392 -0.569 ...
 $ tc:glu : num  0.0202 0.0196 0.2073 -0.396 -0.4283 ...
 $ ldl:hdl: num  0.889 -0.446 0.705 -0.207 0.26 ...
 $ ldl:tch: num  -0.463 -0.243 -0.463 -0.21 -0.506 ...
 $ ldl:ltg: num  -0.6536 0.2724 -0.3783 -0.0708 -0.5638 ...
 $ ldl:glu: num  -0.0194 0.4995 0.1032 -0.4013 -0.6234 ...
 $ hdl:tch: num  0.703 -0.5 0.692 0.171 0.651 ...
 $ hdl:ltg: num  0.0179 -1.9846 0.3839 0.0399 0.3043 ...
 $ hdl:glu: num  0.654 -2.948 0.689 0.452 0.113 ...
 $ tch:ltg: num  -0.592 0.531 -0.574 -0.253 -0.537 ...
 $ tch:glu: num  -0.371 1.114 -0.362 -0.522 -0.34 ...
 $ ltg:glu: num  -0.584 2.184 -0.468 -0.526 0.183 ...
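
The slides that follow use a fitted BAS object diabetes.bas from the previous lecture. A minimal sketch of how such an object might be created with BAS::bas.lm is shown below; the Zellner-Siow (JZS) prior, the beta-binomial model prior, and the MCMC settings are illustrative assumptions, not necessarily the choices made in the earlier lecture.

library(BAS)
# model search over the 64 candidate predictors via MCMC;
# prior and sampler settings here are placeholders (see note above)
diabetes.bas = bas.lm(y ~ .,
                      data = diabetes.train,
                      prior = "JZS",
                      modelprior = beta.binomial(1, 1),
                      method = "MCMC",
                      MCMC.iterations = 500000)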

Credible Intervals under BMA

# posterior summaries of the coefficients under BMA
coef.diabetes = coefficients(diabetes.bas)
# 95% credible intervals for each coefficient
ci.coef.bas = confint(coef.diabetes, level = 0.95)
plot(ci.coef.bas)
  • intervals are based on Monte Carlo samples from the posterior distributions of the coefficients

  • highest posterior density (HPD) intervals are computed using the coda package

Out of Sample Prediction

  • What is the optimal value \(a\) for predicting \(\mathbf{Y}^{\text{test}}\) given \(\mathbf{Y}\) under squared error loss?

  • BMA is the Bayes-optimal prediction under squared error loss: \[\textsf{E}[\| \mathbf{Y}^{\text{test}} - a\|^2 \mid \mathbf{y}] = \textsf{E}[\| \mathbf{Y}^{\text{test}} - \textsf{E}[\mathbf{Y}^{\text{test}}\mid \mathbf{y}] \|^2 \mid \mathbf{y}] + \| \textsf{E}[\mathbf{Y}^{\text{test}}\mid \mathbf{y}] - a\|^2 \]

  • Iterated expectation leads to BMA for \(\textsf{E}[\mathbf{Y}^{\text{test}} \mid \mathbf{y}]\): \[\textsf{E}[\mathbf{Y}^{\text{test}} \mid \mathbf{y}] = \sum_{\boldsymbol{\gamma}} \textsf{E}[\mathbf{Y}^{\text{test}} \mid \mathbf{y}, \boldsymbol{\gamma}]\, p(\boldsymbol{\gamma}\mid \mathbf{y})\]

  • Prediction under model averaging, summing over the set \(S\) of sampled models with estimated posterior probabilities: \[\hat{\mathbf{Y}} = \sum_{\boldsymbol{\gamma}\in S} (\hat{\alpha}_\boldsymbol{\gamma}+ \mathbf{X}_{\boldsymbol{\gamma}}^{\text{test}} \hat{\boldsymbol{\beta}}_{\boldsymbol{\gamma}}) \hat{p}(\boldsymbol{\gamma}\mid \mathbf{Y})\]

Out-of-sample mean squared error under BMA (the code that produces this is on the Example slide below):

[1] 0.4556414

Credible Intervals & Coverage

  • posterior predictive distribution \[ p(\mathbf{y}^{\text{test}} \mid \mathbf{y}) = \sum_\boldsymbol{\gamma}p(\mathbf{y}^{\text{test}} \mid \mathbf{y}, \boldsymbol{\gamma})p(\boldsymbol{\gamma}\mid \mathbf{y}) \]

  • integrate out \(\alpha\) and \(\boldsymbol{\beta_\gamma}\) to get a normal predictive given \(\phi\) and \(\boldsymbol{\gamma}\) (and \(\mathbf{y}\))

  • integrate out \(\phi\) to get a t distribution given \(\boldsymbol{\gamma}\) and \(\mathbf{y}\)

  • credible intervals via sampling

    • sample a model from \(p(\boldsymbol{\gamma}\mid \mathbf{y})\)
    • conditional on the sampled model, draw \(y \sim p(\mathbf{y}^{\text{test}} \mid \mathbf{y}, \boldsymbol{\gamma})\)
    • compute HPD intervals or quantiles from the samples of \(y\) (a toy numerical sketch follows below)
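
A toy numerical sketch of this mixture sampling for a single test case; the posterior model probabilities and per-model Student-t predictive parameters below are made-up values, not output from diabetes.bas.

post.prob  = c(0.5, 0.3, 0.2)      # toy p(gamma | y) over three models
pred.mean  = c(0.10, 0.05, 0.20)   # per-model predictive means
pred.scale = c(0.70, 0.75, 0.72)   # per-model predictive scales
pred.df    = c(340, 339, 338)      # per-model predictive degrees of freedom
nsim  = 10000
m     = sample(seq_along(post.prob), nsim, replace = TRUE, prob = post.prob)
ytest = pred.mean[m] + pred.scale[m] * rt(nsim, df = pred.df[m])
quantile(ytest, c(0.025, 0.975))   # equal-tail 95% prediction interval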

95% Prediction intervals

# pred.bas is the BMA prediction object created on the Example slide
ci.bas = confint(pred.bas)
coverage = mean(diabetes.test$y > ci.bas[,1] & diabetes.test$y < ci.bas[,2])
coverage
[1] 0.99
plot(ci.bas)
points(diabetes.test$y, col=2, pch=15)

Selection and Prediction

  • BMA is the Bayes-optimal prediction under squared error loss

  • What if we want to use only a single model for prediction under squared error loss?

  • HPM: the Highest Posterior Probability model is optimal for selection (0-1 loss on models), but not for prediction

  • MPM: the Median Probability Model includes all predictors with posterior inclusion probability (PIP) > 0.5; optimal for prediction under certain conditions (e.g., orthogonal designs or a nested sequence of models)

  • BPM: the Best Predictive Model is the single model whose predictions are closest to BMA under squared error loss (usually includes more predictors than HPM or MPM)

  • costs of using variables in prediction?

Example

pred.bas = predict(diabetes.bas,
                   newdata=diabetes.test,
                   estimator="BMA",
                   se=TRUE)
# MSE under BMA
mean((pred.bas$fit - diabetes.test$y)^2)
[1] 0.4556414
pred.bas = predict(diabetes.bas,
                   newdata=diabetes.test,
                   estimator="BPM",
                   se=TRUE)
# MSE under BPM
mean((pred.bas$fit - diabetes.test$y)^2)
[1] 0.4740667
# coverage of 95% prediction intervals under BPM
ci.bas = confint(pred.bas)
mean(diabetes.test$y > ci.bas[,1] &
     diabetes.test$y < ci.bas[,2])
[1] 0.98
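
The same predict interface also accepts estimator = "HPM" and estimator = "MPM"; a sketch for comparing their out-of-sample MSE (resulting values not shown here):

# highest posterior probability model
pred.hpm = predict(diabetes.bas, newdata = diabetes.test, estimator = "HPM")
mean((pred.hpm$fit - diabetes.test$y)^2)
# median probability model (predictors with PIP > 0.5)
pred.mpm = predict(diabetes.bas, newdata = diabetes.test, estimator = "MPM")
mean((pred.mpm$fit - diabetes.test$y)^2)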

Theory - Consistency of g-priors

  • desire that the posterior probability of the true model goes to 1 as \(n \to \infty\)

    • does not always hold if the null model is true (the null may remain the highest posterior probability model, but its probability need not converge to 1)
    • need the prior on \(g\) to depend on \(n\); this rules out EB and fixed g-priors with \(g \ne n\) (see the BAS sketch below)
    • asymptotically BMA collapses onto the true model
  • other quantities, e.g. the posterior mean, may still converge
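
To make the dependence on \(n\) concrete, here is a sketch of how these choices can be specified with BAS::bas.lm; the prior names follow the BAS package, while the fixed value \(g = 100\) and the number of MCMC iterations are arbitrary illustrations.

library(BAS)
# fixed g (does not grow with n) and EB g-priors: the posterior probability of
# the null model need not converge to 1 even when the null model is true
fit.fixedg = bas.lm(y ~ ., data = diabetes.train, prior = "g-prior", alpha = 100,
                    method = "MCMC", MCMC.iterations = 100000)
fit.eb     = bas.lm(y ~ ., data = diabetes.train, prior = "EB-local",
                    method = "MCMC", MCMC.iterations = 100000)
# mixture of g-priors with g scaling with n (Zellner-Siow / JZS): consistent
fit.jzs    = bas.lm(y ~ ., data = diabetes.train, prior = "JZS",
                    method = "MCMC", MCMC.iterations = 100000)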

Model Paradigms

  • what if the true model \(\boldsymbol{\gamma}_T\) is not in \(\Gamma\)? What can we say?

  • \(\boldsymbol{{\cal M}}\)-closed: proceed as if the true model were in \(\Gamma\); BMA then converges to the model in \(\Gamma\) that is “closest” to the truth in Kullback-Leibler divergence

  • \(\boldsymbol{{\cal M}}\)-complete:

    • know \(\boldsymbol{\gamma}_T \notin \Gamma\), so that \(p(\boldsymbol{\gamma}= \boldsymbol{\gamma}_T) = 0 \ \forall \boldsymbol{\gamma}\in \Gamma\), but still want to use the models in \(\Gamma\)
    • Predictive distribution \(p(\mathbf{Y}^* \mid \mathbf{Y}, \boldsymbol{\gamma}_T)\) is available
  • \(\boldsymbol{{\cal M}}\)-open:

    • know \(\boldsymbol{\gamma}_T \notin \Gamma\), so that \(p(\boldsymbol{\gamma}= \boldsymbol{\gamma}_T) = 0 \ \forall \boldsymbol{\gamma}\in \Gamma\), but still want to use the models in \(\Gamma\)
    • Predictive distribution \(p(\mathbf{Y}^* \mid \mathbf{Y}, \boldsymbol{\gamma}_T)\) is not available (too complicated to use, etc.)

\(\boldsymbol{{\cal M}}\)-Open and \(\boldsymbol{{\cal M}}\)-Complete Prediction

Clyde & Iversen (2013) motivate a solution via decision theory.

  • Use the models in \(\Gamma\) to predict \(\mathbf{Y}^*\) given \(\mathbf{Y}\) under squared error loss \[\textsf{E}\left[\Big\| \mathbf{Y}^* - \sum_{\boldsymbol{\gamma}\in \Gamma} \omega_\boldsymbol{\gamma}\hat{\mathbf{Y}}^*_\boldsymbol{\gamma}\Big\|^2 \,\Big|\, \mathbf{Y}\right] = \int \Big\| \mathbf{Y}^* - \sum_{\boldsymbol{\gamma}\in \Gamma} \omega_\boldsymbol{\gamma}\hat{\mathbf{Y}}^*_\boldsymbol{\gamma}\Big\|^2\, p(\mathbf{Y}^* \mid \mathbf{Y})\, d\mathbf{Y}^*\]
  • Still use a weighted combination of predictions (or densities) from the models in \(\Gamma\), but the weights are no longer posterior probabilities; they are chosen to minimize the loss function
    • additional constraints or penalties on the weights can be included as part of the loss function (a small weight-fitting sketch follows below)
    • the predictive distribution for \(\mathbf{Y}^* \mid \mathbf{Y}\) must be approximated (via an approximate Dirichlet process model)
    • this approach is related to “stacking” (Wolpert 1992), a frequentist ensemble-learning method that chooses the weights by cross-validation
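
A small sketch of the weight-fitting idea in the stacking spirit: choose non-negative weights that sum to one so as to minimize the squared error of the combined prediction, here by quadratic programming against a matrix of per-model predictions. This illustrates the constrained-weights idea only; it is not the Clyde & Iversen procedure, which works with (an approximation to) the posterior predictive for \(\mathbf{Y}^*\) rather than cross-validated fits.

library(quadprog)
# P: n x K matrix of per-model predictions of y (columns = models)
# returns weights w >= 0 with sum(w) = 1 minimizing || y - P %*% w ||^2
# (assumes crossprod(P) is positive definite, i.e. no two models predict identically)
stack.weights = function(P, y) {
  K = ncol(P)
  solve.QP(Dmat = crossprod(P),
           dvec = as.vector(crossprod(P, y)),
           Amat = cbind(rep(1, K), diag(K)),   # sum-to-one, then w_k >= 0
           bvec = c(1, rep(0, K)),
           meq  = 1)$solution
}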

Summary

  • Choice of prior on \(\boldsymbol{\beta_\gamma}\)

    • multivariate Spike & Slab
    • products of independent Spike & Slab priors
    • intermediate choices such as block g-priors
    • non- and semi-conjugate priors
    • non-local priors
    • shrinkage priors without point-masses
  • priors on the models (sensitivity)

  • computation (MCMC, “stochastic search”, adaptive MH, variational approximations, orthogonal data augmentation, reversible-jump MCMC)

  • decision theory - select a model or “average” over all models

  • asymptotic properties - large \(n\) and large \(p > n\)

Other aspects of model selection?

  • transformations of \(\mathbf{Y}\)
  • functions of \(\mathbf{X}\): interactions or nonlinear functions such as splines or kernels
  • choice of error distribution
  • outliers