STA 721: Lecture 8
Duke University
Readings:
Model \(\mathbf{Y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon}\sim \textsf{N}(\mathbf{0}_n , \sigma^2 \mathbf{I}_n)\) is equivalent to \[ \mathbf{Y}\sim \textsf{N}(\mathbf{X}\boldsymbol{\beta}, \mathbf{I}_n/\phi) \]
\(\phi = 1/\sigma^2\) is the precision of the data.
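For concreteness, a minimal simulation sketch of this sampling model (not from the lecture; the dimensions, \(\sigma\), and coefficient values are purely illustrative):

```python
# Minimal simulation of Y = X beta + eps, eps ~ N(0, sigma^2 I_n); phi = 1/sigma^2.
# All dimensions and values below are illustrative, not from the lecture.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
sigma = 2.0                                  # error SD, so phi = 1 / sigma^2
phi = 1.0 / sigma**2
X = rng.normal(size=(n, p))                  # an arbitrary design matrix
beta_true = np.array([1.0, -0.5, 0.25])      # coefficients used only for simulation
Y = X @ beta_true + rng.normal(scale=sigma, size=n)
```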
we might expect \(\boldsymbol{\beta}\) to be close to some vector \(\mathbf{b}_0\)
represent this a priori with a Prior Distribution for \(\boldsymbol{\beta}\), e.g. \[\boldsymbol{\beta}\sim \textsf{N}(\mathbf{b}_0, \boldsymbol{\Phi}_0^{-1})\]
\(\mathbf{b}_0\) is the prior mean and \(\boldsymbol{\Phi}_0\) is the prior precision of \(\boldsymbol{\beta}\) that captures how close \(\boldsymbol{\beta}\) is to \(\mathbf{b}_0\)
Similarly, we could represent prior uncertainty about \(\sigma\), \(\sigma^2\) or equivalently \(\phi\) with a probability distribution
for now treat \(\phi\) as fixed
once we see data \(\mathbf{Y}\), Bayesian inference proceeds by updating prior beliefs
represented by the posterior distribution of \(\boldsymbol{\beta}\) which is the conditional distribution of \(\boldsymbol{\beta}\) given the data \(\mathbf{Y}\) (and \(\phi\) for now)
Posterior \(p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi)\) \[p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) = \frac{p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) p(\boldsymbol{\beta}\mid \phi)}{c}\]
\(c\) is a constant so that the posterior density integrates to \(1\) \[c = \int_{\mathbb{R}^p} p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) p(\boldsymbol{\beta}\mid \phi) \, d\boldsymbol{\beta}\equiv p(\mathbf{Y}\mid \phi)\]
since \(c\) is a constant that does not depend on \(\boldsymbol{\beta}\), we can ignore it
work with the density up to a constant of proportionality
Posterior for \(\boldsymbol{\beta}\) is \(p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) \propto p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) p(\boldsymbol{\beta}\mid \phi)\)
\[\begin{align*} p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) & = (2 \pi)^{-n/2} |\mathbf{I}_n / \phi |^{-1/2} \exp\left\{-\frac{1}{2} \left( (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T \phi \mathbf{I}_n (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \phi \mathbf{Y}^T\mathbf{Y}- 2 \boldsymbol{\beta}^T \phi \mathbf{X}^T\mathbf{Y}+ \phi \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\right) \right\} \end{align*}\]
Kernel of a Multivariate Normal
\(\dagger\) necessary to keep track of all terms for \(\phi\) when we do not condition on \(\phi\)
\[ p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \]
Posterior Precision \[\boldsymbol{\Phi}_n \equiv \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0\]
sum of data precision and prior precision
Posterior Covariance \[\textsf{Cov}[\boldsymbol{\beta}\mid \mathbf{Y}, \phi] = \boldsymbol{\Phi}_n^{-1} = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}\]
if \(\boldsymbol{\Phi}_0\) is full rank, then \(\textsf{Cov}[\boldsymbol{\beta}\mid \mathbf{Y}, \phi]\) is full rank even if \(\mathbf{X}^T\mathbf{X}\) is not
\[\begin{align*} p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) & \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T\boldsymbol{\Phi}_n \boldsymbol{\Phi}_n^{-1}(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \\ \end{align*}\]
posterior mean \(\mathbf{b}_n\) \[\begin{align*} \mathbf{b}_n & \equiv \boldsymbol{\Phi}_n^{-1} (\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 ) \\ & = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}\left(\phi (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 \right) \\ & = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1} \left( \phi (\mathbf{X}^T\mathbf{X}) \hat{\boldsymbol{\beta}}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 \right) \end{align*}\]
a precision-weighted linear combination of the MLE and the prior mean
the first expression is useful when \(\mathbf{X}\) is not full rank, since it avoids \((\mathbf{X}^T\mathbf{X})^{-1}\)!
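A minimal numerical sketch of these formulas (assuming numpy and simulated data of the kind above; the prior mean and precision are illustrative):

```python
# Posterior precision Phi_n = phi X'X + Phi_0 and mean b_n = Phi_n^{-1}(phi X'Y + Phi_0 b_0).
# Data, prior mean, and prior precision below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
sigma = 2.0
phi = 1.0 / sigma**2
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=sigma, size=n)

b0 = np.zeros(p)                 # prior mean (assumed)
Phi0 = 0.1 * np.eye(p)           # prior precision (assumed)

Phi_n = phi * X.T @ X + Phi0                               # posterior precision
b_n = np.linalg.solve(Phi_n, phi * X.T @ Y + Phi0 @ b0)    # posterior mean
Cov_n = np.linalg.inv(Phi_n)                               # posterior covariance
# this form uses X'Y directly, so it remains valid when X'X is singular
# (as long as Phi_0 makes Phi_n full rank)
```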
Posterior is a Multivariate Normal: \(\boldsymbol{\beta}\mid \mathbf{Y}, \phi \sim \textsf{N}(\mathbf{b}_n, \boldsymbol{\Phi}_n^{-1})\)
posterior mean: \(\mathbf{b}_n = \boldsymbol{\Phi}_n^{-1} (\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 )\)
posterior precision: \(\boldsymbol{\Phi}_n = \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0\)
the posterior precision (inverse posterior variance) is the sum of the prior precision and the data precision.
the posterior mean is a linear combination of MLE/OLS and prior mean
if the prior precision \(\boldsymbol{\Phi}_0\) is very large compared to the data precision \(\phi \mathbf{X}^T\mathbf{X}\), the posterior mean will be close to the prior mean \(\mathbf{b}_0\).
if the prior precision \(\boldsymbol{\Phi}_0\) is very small compared to the data precision \(\phi \mathbf{X}^T\mathbf{X}\), the posterior mean will be close to the MLE/OLS estimator (see the sketch after this list).
data precision will generally be increasing with sample size
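A quick sketch of the two limits above (illustrative data; a dominant prior precision pulls \(\mathbf{b}_n\) to \(\mathbf{b}_0\), a negligible one gives back OLS):

```python
# The posterior mean moves from b0 to the OLS estimate as the prior precision
# goes from dominant to negligible; data and settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
sigma = 2.0
phi = 1.0 / sigma**2
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=sigma, size=n)
b0 = np.ones(p)
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]    # OLS/MLE

for scale in (1e8, 1e-8):                          # huge vs. negligible prior precision
    Phi0 = scale * np.eye(p)
    b_n = np.linalg.solve(phi * X.T @ X + Phi0, phi * X.T @ Y + Phi0 @ b0)
    print(np.round(b_n, 3))
# first line is essentially b0 = (1, 1, 1); second essentially matches beta_hat
```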
A Bayes estimator is a potential value of \(\boldsymbol{\beta}\) that is obtained from the posterior distribution in some principled way.
Standard estimators include
the posterior mean \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\)
the posterior mode (MAP estimator) \(\arg\max_{\boldsymbol{\beta}} p(\boldsymbol{\beta}\mid \mathbf{Y})\)
The first estimator is based on principles from classical decision theory, whereas the second can be related to penalized likelihood estimation.
in the case of linear regression they turn out to be the same estimator!
\[B(p_{\boldsymbol{\beta}}, \delta) = \textsf{E}_{\boldsymbol{\beta}} R(\boldsymbol{\beta}, \delta) = \textsf{E}_{\boldsymbol{\beta}} \textsf{E}_{\mathbf{Y}\mid \boldsymbol{\beta}}[\| \delta(\mathbf{Y}) - \boldsymbol{\beta}\|^2]\] where the expectation is with respect to the prior distribution, \(p_{\boldsymbol{\beta}}\), over \(\boldsymbol{\beta}\) and the conditional distribution of \(\mathbf{Y}\) given \(\boldsymbol{\beta}\)
one can show that the Bayes action minimizing the posterior expected loss is the posterior mean \(\mathbf{b}_n = (\phi \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0)\), which is also the Bayes rule
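this follows from the standard decomposition of the posterior expected loss (conditional on \(\mathbf{Y}\) and \(\phi\)): \[\textsf{E}[\|\delta - \boldsymbol{\beta}\|^2 \mid \mathbf{Y}, \phi] = \|\delta - \mathbf{b}_n\|^2 + \textsf{tr}(\boldsymbol{\Phi}_n^{-1})\] which is minimized at \(\delta = \mathbf{b}_n\); since this holds for every \(\mathbf{Y}\), the posterior mean also minimizes the Bayes risk \(B(p_{\boldsymbol{\beta}}, \delta)\)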
different values of \(\mathbf{b}_0\) and \(\boldsymbol{\Phi}_0\) will lead to different Bayes estimators as will different prior distributions besides the Normal
take \(\mathbf{b}_0 = \mathbf{0}\); the resulting Bayes estimators are often referred to as shrinkage estimators \[\mathbf{b}_n = \left( \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0/\phi \right)^{-1} \mathbf{X}^T\mathbf{Y}\] as they shrink the MLE/OLS estimator towards \(\mathbf{0}\)
One of the most common priors for the normal linear model is the g-prior of Zellner (1986) where \(\boldsymbol{\Phi}_0 = \frac{\phi}{g} \mathbf{X}^T\mathbf{X}\) \[\boldsymbol{\beta}\mid \phi, g \sim \textsf{N}(\mathbf{0}, g/\phi (\mathbf{X}^T\mathbf{X})^{-1})\]
\[\begin{align*} \mathbf{b}_n & = \left( \mathbf{X}^T\mathbf{X}+ \frac{\phi}{g} \frac{\mathbf{X}^T\mathbf{X}}{\phi} \right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \left( \mathbf{X}^T\mathbf{X}+ \frac{1}{g} \mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \left( \frac{1 +g}{g} \mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \frac{g}{1+g} \hat{\boldsymbol{\beta}} \end{align*}\]
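A quick numerical check of the \(g/(1+g)\) shrinkage factor (a sketch with illustrative data, \(g\), and \(\sigma\); not from the lecture):

```python
# Verify b_n = g/(1+g) * beta_hat under the g-prior with b0 = 0.
# Data, g, and sigma are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p, g, sigma = 50, 4, 10.0, 1.5
phi = 1.0 / sigma**2
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(scale=sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                 # OLS/MLE
Phi0 = (phi / g) * X.T @ X                                   # g-prior precision
b_n = np.linalg.solve(phi * X.T @ X + Phi0, phi * X.T @ Y)   # posterior mean (b0 = 0)
assert np.allclose(b_n, g / (1 + g) * beta_hat)
```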
another common choice is the independent prior \[\boldsymbol{\beta}\mid \phi \sim \textsf{N}(\mathbf{0}, \boldsymbol{\Phi}_0^{-1})\] where \(\boldsymbol{\Phi}_0 = \phi \kappa \mathbf{I}_p\) for some \(\kappa > 0\)
the posterior mean is \[\begin{align*} \mathbf{b}_n & = (\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1} \mathbf{X}^T\mathbf{Y}\\ & = (\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1} \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \end{align*}\]
this is also a shrinkage estimator, but the amount of shrinkage differs across components of \(\mathbf{b}_n\), depending on the eigenvalues of \(\mathbf{X}^T\mathbf{X}\)
easiest to see this via an orthogonal rotation of the model
Use the singular value decomposition of \(\mathbf{X}= \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T\) and multiply through by \(\mathbf{U}^T\) to get the rotated model \[\begin{align*} \mathbf{U}^T \mathbf{Y}& = \boldsymbol{\Lambda}\mathbf{V}^T\boldsymbol{\beta}+ \mathbf{U}^T\boldsymbol{\epsilon}\\ \tilde{\mathbf{Y}}& = \boldsymbol{\Lambda}\boldsymbol{\alpha}+ \tilde{\boldsymbol{\epsilon}} \end{align*}\] where \(\boldsymbol{\alpha}= \mathbf{V}^T\boldsymbol{\beta}\) and \(\tilde{\boldsymbol{\epsilon}} = \mathbf{U}^T\boldsymbol{\epsilon}\)
the induced prior is still \(\boldsymbol{\alpha}\mid \phi \sim \textsf{N}(\mathbf{0}, (\phi \kappa)^{-1} \mathbf{I})\)
the posterior mean of \(\boldsymbol{\alpha}\) is \[\begin{align*} \mathbf{a}& = (\boldsymbol{\Lambda}^2 + \kappa \mathbf{I})^{-1} \boldsymbol{\Lambda}^2 \hat{\boldsymbol{\alpha}}\\ a_j & = \frac{\lambda_j^2}{\lambda_j^2 + \kappa} \hat{\alpha}_j \end{align*}\]
components of \(\boldsymbol{\alpha}\) with \(\lambda_j = 0\), which OLS cannot estimate, are set to zero (the prior mean)!
The posterior mean under this independent prior is the same as the classic ridge regression estimator of Hoerl and Kennard (1970)
the sampling variance of \(\hat{\alpha}_j\) is \(\sigma^2/\lambda_j^2\), while the posterior variance of \(\alpha_j\) is \(\sigma^2/(\lambda_j^2 + \kappa)\)
clearly the components of \(\hat{\boldsymbol{\alpha}}\) corresponding to small \(\lambda_j\) have large variances
ridge regression keeps those components from “blowing up” by shrinking them towards zero, so they have finite variance
rotate back to get the ridge estimator for \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}_R = \mathbf{V}\mathbf{a}\)
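A sketch checking that the componentwise shrinkage in the rotated model reproduces the direct formula \((\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y}\) (illustrative data and \(\kappa\)):

```python
# Ridge/posterior mean two ways: direct formula vs. SVD rotation and shrinkage.
# Data and kappa are illustrative.
import numpy as np

rng = np.random.default_rng(2)
n, p, kappa = 60, 5, 3.0
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

# direct formula
b_ridge = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ Y)

# SVD route: shrink each rotated OLS coefficient by lambda_j^2 / (lambda_j^2 + kappa)
U, lam, Vt = np.linalg.svd(X, full_matrices=False)
alpha_hat = (U.T @ Y) / lam                      # OLS in the rotated model
a = lam**2 / (lam**2 + kappa) * alpha_hat        # shrunken rotated coefficients
assert np.allclose(Vt.T @ a, b_ridge)            # rotating back recovers the ridge estimator
```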
ridge regression applies a high degree of shrinkage to the “parts” (linear combinations) of \(\boldsymbol{\beta}\) that have high variability, and a low degree of shrinkage to the parts that are well-estimated.
it turns out there always exists a value of \(\kappa > 0\) for which ridge regression has smaller mean squared error than OLS!
unfortunately there is no closed-form expression for this optimal \(\kappa\) except in orthogonal regression, and even then it depends on the unknown \(\|\boldsymbol{\beta}\|^2\)!
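A Monte Carlo sketch of the existence claim above (all settings are illustrative; it only illustrates the phenomenon for one design and one \(\boldsymbol{\beta}\), it proves nothing):

```python
# Compare the MSE of the ridge estimator over a grid of kappa values (kappa = 0 is OLS).
# Design, beta, sigma, and the kappa grid are all illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, reps = 30, 8, 2.0, 2000
beta = rng.normal(size=p)
X = rng.normal(size=(n, p))
kappas = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]

mse = {k: 0.0 for k in kappas}
for _ in range(reps):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    for k in kappas:
        b = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ Y)
        mse[k] += np.sum((b - beta) ** 2) / reps

for k in kappas:
    print(f"kappa = {k:5.1f}   MSE = {mse[k]:.3f}")
# typically some kappa > 0 beats kappa = 0 (OLS), but the best kappa
# depends on the unknown beta, as noted above
```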