Bayesian Estimation in Linear Models

STA 721: Lecture 8

Merlise Clyde (clyde@duke.edu)

Duke University

Outline

Readings:

  • Christensen Chapter 2.9 and Chapter 15
  • Seber & Lee Chapter 3.12

Bayes Estimation

Model \(\mathbf{Y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\boldsymbol{\epsilon}\sim \textsf{N}(\mathbf{0}_n , \sigma^2 \mathbf{I}_n)\) is equivalent to \[ \mathbf{Y}\sim \textsf{N}(\mathbf{X}\boldsymbol{\beta}, \mathbf{I}_n/\phi) \]

  • \(\phi = 1/\sigma^2\) is the precision of the data.

  • we might expect \(\boldsymbol{\beta}\) to be close to some vector \(\mathbf{b}_0\)

  • represent this a priori with a Prior Distribution for \(\boldsymbol{\beta}\), e.g. \[\boldsymbol{\beta}\sim \textsf{N}(\mathbf{b}_0, \boldsymbol{\Phi}_0^{-1})\]

  • \(\mathbf{b}_0\) is the prior mean and \(\boldsymbol{\Phi}_0\) is the prior precision of \(\boldsymbol{\beta}\) that captures how close \(\boldsymbol{\beta}\) is to \(\mathbf{b}_0\)

  • Similarly, we could represent prior uncertainty about \(\sigma\), \(\sigma^2\) or equivalently \(\phi\) with a probability distribution

  • for now treat \(\phi\) as fixed
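
To make the setup concrete, here is a minimal simulation sketch; all values (\(n\), \(p\), \(\phi\), \(\mathbf{b}_0\), \(\boldsymbol{\Phi}_0\)) are illustrative choices, not from the lecture.

```python
# Minimal simulation sketch of the model Y = X beta + eps and a normal prior.
import numpy as np

rng = np.random.default_rng(42)

n, p = 50, 3
phi = 0.25                              # data precision, i.e. sigma^2 = 4 (illustrative)
beta_true = np.array([1.0, -2.0, 0.5])  # "true" coefficients for the simulation

X = rng.normal(size=(n, p))                                      # design matrix
Y = X @ beta_true + rng.normal(scale=1 / np.sqrt(phi), size=n)   # Y = X beta + eps

# prior beta ~ N(b0, Phi0^{-1}) with illustrative hyperparameters
b0 = np.zeros(p)             # prior mean
Phi0 = 0.1 * np.eye(p)       # prior precision (weakly informative)
```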

Bayesian Inference

  • once we see data \(\mathbf{Y}\), Bayesian inference proceeds by updating prior beliefs

  • represented by the posterior distribution of \(\boldsymbol{\beta}\) which is the conditional distribution of \(\boldsymbol{\beta}\) given the data \(\mathbf{Y}\) (and \(\phi\) for now)

  • Posterior \(p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi)\) \[p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) = \frac{p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi)\, p(\boldsymbol{\beta}\mid \phi)}{c}\]

  • \(c\) is a constant so that the posterior density integrates to \(1\) \[c = \int_{\mathbb{R}^p} p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi)\, p(\boldsymbol{\beta}\mid \phi) \, d\boldsymbol{\beta}\equiv p(\mathbf{Y}\mid \phi)\]

  • since \(c\) does not depend on \(\boldsymbol{\beta}\), we can simply ignore it

  • and work with the density up to a constant of proportionality

Posterior Density

Posterior for \(\boldsymbol{\beta}\) is \(p(\boldsymbol{\beta}\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) p(\boldsymbol{\beta}\mid \phi)\)

  • Likelihood for \(\boldsymbol{\beta}\) is proportional to \(p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi)\)

\[\begin{align*} p(\mathbf{Y}\mid \boldsymbol{\beta}, \phi) & = (2 \pi)^{-n/2} |\mathbf{I}_n / \phi |^{-1/2} \exp\left\{-\frac{1}{2} \left( (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T \phi \mathbf{I}_n (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \phi \mathbf{Y}^T\mathbf{Y}- 2 \boldsymbol{\beta}^T \phi \mathbf{X}^T\mathbf{Y}+ \phi \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}\right) \right\} \end{align*}\]

  • similarly expand the prior \[\begin{align*} p(\boldsymbol{\beta}\mid \phi) & = (2 \pi)^{-p/2} |\boldsymbol{\Phi}_0^{-1}|^{-1/2} \exp\left\{-\frac{1}{2} \left( (\boldsymbol{\beta}- \mathbf{b}_0)^T \boldsymbol{\Phi}_0 (\boldsymbol{\beta}- \mathbf{b}_0) \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \mathbf{b}_0^T \boldsymbol{\Phi}_0\mathbf{b}_0 - 2 \boldsymbol{\beta}^T\boldsymbol{\Phi}_0 \mathbf{b}_0 + \boldsymbol{\beta}^T\boldsymbol{\Phi}_0 \boldsymbol{\beta}\right) \right\} \end{align*}\]

Posterior Steps

  • Expand quadratics and regroup terms \[\begin{align*} p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) & \propto \exp\left\{-\frac{1}{2} \left( \phi \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\beta}^T\boldsymbol{\Phi}_0 \boldsymbol{\beta}- 2(\phi \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{Y}+ \boldsymbol{\beta}^T\boldsymbol{\Phi}_0 \mathbf{b}_0) + \phi \mathbf{Y}^T\mathbf{Y}+ \mathbf{b}_0^T \boldsymbol{\Phi}_0\mathbf{b}_0 \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \end{align*}\]

Kernel of a Multivariate Normal

  • Read off posterior precision from Quadratic in \(\boldsymbol{\beta}\)
  • Read off posterior precision \(\times\) posterior mean from Linear term in \(\boldsymbol{\beta}\)
  • will need to complete the square to identify the posterior mean, as shown below
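  • explicitly, completing the square in \(\boldsymbol{\beta}\) gives \[\boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) = (\boldsymbol{\beta}- \mathbf{b}_n)^T \boldsymbol{\Phi}_n (\boldsymbol{\beta}- \mathbf{b}_n) - \mathbf{b}_n^T \boldsymbol{\Phi}_n \mathbf{b}_n\] with \(\boldsymbol{\Phi}_n \equiv \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0\) and \(\mathbf{b}_n \equiv \boldsymbol{\Phi}_n^{-1}(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0)\); the last term does not involve \(\boldsymbol{\beta}\) and is absorbed into the constant of proportionality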

Posterior Precision and Covariance

\[ p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \]

  • Posterior Precision \[\boldsymbol{\Phi}_n \equiv \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0\]

  • sum of data precision and prior precision

  • posterior Covariance \[\textsf{Cov}[\boldsymbol{\beta}\mid \mathbf{Y}, \phi] = \boldsymbol{\Phi}_n^{-1} = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}\]

  • if \(\boldsymbol{\Phi}_0\) is full rank, then \(\textsf{Cov}[\boldsymbol{\beta}\mid \mathbf{Y}, \phi]\) is full rank even if \(\mathbf{X}^T\mathbf{X}\) is not
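
A small numerical sketch of this point, assuming an illustrative rank-deficient design and prior precision (not from the lecture):

```python
# With a rank-deficient X, X^T X is singular, but the posterior precision
# Phi_n = phi X^T X + Phi_0 is still full rank when Phi_0 is.
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 1] + X[:, 2]              # force a linear dependency: rank(X) = 3

phi = 1.0                                # precision treated as known
Phi0 = 0.5 * np.eye(p)                   # full-rank prior precision (illustrative)

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))        # 3 -- X^T X is singular
Phi_n = phi * XtX + Phi0
print(np.linalg.matrix_rank(Phi_n))      # 4 -- posterior precision is invertible
post_cov = np.linalg.inv(Phi_n)          # posterior covariance exists
```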

Posterior Mean Updating

\[\begin{align*} p(\boldsymbol{\beta}\mid \mathbf{Y}, \phi) & \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \\ & \propto \exp\left\{-\frac{1}{2} \left( \boldsymbol{\beta}^T( \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0) \boldsymbol{\beta}- 2 \boldsymbol{\beta}^T\boldsymbol{\Phi}_n \boldsymbol{\Phi}_n^{-1}(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0) \right) \right\} \\ \end{align*}\]

  • posterior mean \(\mathbf{b}_n\) \[\begin{align*} \mathbf{b}_n & \equiv \boldsymbol{\Phi}_n^{-1} (\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 ) \\ & = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}\left(\phi (\mathbf{X}^T\mathbf{X}) (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 \right) \\ & = (\phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1} \left( \phi (\mathbf{X}^T\mathbf{X}) \hat{\boldsymbol{\beta}}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 \right) \end{align*}\]

  • a precision-weighted combination of the MLE \(\hat{\boldsymbol{\beta}}\) and the prior mean \(\mathbf{b}_0\)

  • first expression useful if \(\mathbf{X}\) is not full rank!
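
A minimal sketch checking the two expressions for \(\mathbf{b}_n\) numerically; the data, prior mean, and prior precision below are illustrative assumptions:

```python
# Compute the posterior mean b_n = Phi_n^{-1}(phi X^T Y + Phi_0 b_0) and check
# that it matches the precision-weighted combination of the MLE and prior mean.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
phi = 1.0                                   # known precision (sigma^2 = 1)
beta_true = np.array([2.0, 0.0, -1.0])
X = rng.normal(size=(n, p))
Y = X @ beta_true + rng.normal(scale=1 / np.sqrt(phi), size=n)

b0 = np.array([1.0, 1.0, 1.0])              # prior mean (illustrative)
Phi0 = 2.0 * np.eye(p)                      # prior precision (illustrative)

XtX, XtY = X.T @ X, X.T @ Y
beta_hat = np.linalg.solve(XtX, XtY)        # MLE/OLS (X is full rank here)

Phi_n = phi * XtX + Phi0                    # posterior precision
b_n = np.linalg.solve(Phi_n, phi * XtY + Phi0 @ b0)
b_n_alt = np.linalg.solve(Phi_n, phi * XtX @ beta_hat + Phi0 @ b0)

print(np.allclose(b_n, b_n_alt))            # True: same estimator
```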

Notes

Posterior is a Multivariate Normal: \(\boldsymbol{\beta}\mid \mathbf{Y}, \phi \sim \textsf{N}(\mathbf{b}_n, \boldsymbol{\Phi}_n^{-1})\)

  • posterior mean: \(\mathbf{b}_n = \boldsymbol{\Phi}_n^{-1} (\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0 )\)

  • posterior precision: \(\boldsymbol{\Phi}_n = \phi\mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0\)

  • the posterior precision (inverse posterior variance) is the sum of the prior precision and the data precision.

  • the posterior mean is a linear combination of MLE/OLS and prior mean

  • if the prior precision \(\boldsymbol{\Phi}_0\) is very large compared to the data precision \(\phi \mathbf{X}^T\mathbf{X}\), the posterior mean will be close to the prior mean \(\mathbf{b}_0\).

  • if the prior precision \(\boldsymbol{\Phi}_0\) is very small compared to the data precision \(\phi \mathbf{X}^T\mathbf{X}\), the posterior mean will be close to the MLE/OLS estimator.

  • the data precision \(\phi \mathbf{X}^T\mathbf{X}\) generally increases with the sample size, so with enough data the MLE/OLS estimator dominates
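
A quick illustrative sketch of these limits, scaling an assumed prior precision up and down (all values arbitrary):

```python
# Scale the prior precision and watch the posterior mean move between the
# MLE (weak prior) and the prior mean b0 (dominant prior).
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 2
phi = 1.0
X = rng.normal(size=(n, p))
Y = X @ np.array([3.0, -3.0]) + rng.normal(size=n)

b0 = np.zeros(p)
XtX, XtY = X.T @ X, X.T @ Y
beta_hat = np.linalg.solve(XtX, XtY)

for scale in (1e-4, 1.0, 1e4):              # weak, moderate, dominant prior
    Phi0 = scale * np.eye(p)
    b_n = np.linalg.solve(phi * XtX + Phi0, phi * XtY + Phi0 @ b0)
    print(scale, b_n)                        # moves from near beta_hat to near b0
```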

Bayes Estimators

A Bayes estimator is a point estimate of \(\boldsymbol{\beta}\) obtained from the posterior distribution in some principled way.

  • Standard estimators include

    • the posterior mean estimator, which is the minimizer of the Bayes risk under squared error loss
    • the maximum a posteriori (MAP) estimator, the value \(\boldsymbol{\beta}\) that maximizes the posterior density (or log posterior density)
  • The first estimator is based on principles from classical decision theory, whereas the second can be related to penalized likelihood estimation.

  • with a normal prior in the normal linear model, the posterior is normal and symmetric, so the two turn out to be the same estimator!

Bayes Estimator under Squared Error Loss

  • the Frequentist Risk \(R(\boldsymbol{\beta}, \delta) \equiv \textsf{E}_{\mathbf{Y}\mid \boldsymbol{\beta}}[\| \delta(\mathbf{Y}) - \boldsymbol{\beta}\|^2]\) is the expected loss of the decision rule \(\delta\) for a given \(\boldsymbol{\beta}\)

Definition: Bayes Rule and Bayes Risk
The Bayes rule under squared error loss is the function of \(\mathbf{Y}\), \(\delta^*(\mathbf{Y})\), that minimizes the Bayes risk \(B(p_\boldsymbol{\beta}, \delta)\) \[\delta^*(\mathbf{Y}) = \arg \min_{\delta \in \cal{D}} B(p_\boldsymbol{\beta}, \delta)\]

\[B(p_\boldsymbol{\beta}, \delta) = \textsf{E}_\boldsymbol{\beta}R(\boldsymbol{\beta}, \delta) = \textsf{E}_{\boldsymbol{\beta}} \textsf{E}_{\mathbf{Y}\mid \boldsymbol{\beta}}[\| \delta(\mathbf{Y})− \boldsymbol{\beta}\|^2]\] where the expectation is with respect to the prior distribution, \(p_\boldsymbol{\beta}\), over \(\boldsymbol{\beta}\) and the conditional distribution of \(\mathbf{Y}\) given \(\boldsymbol{\beta}\)

Bayes Estimators

Definition: Bayes Action
The Bayes Action is the action \(a \in {\cal{A}}\) that minimizes the posterior expected loss: \[ \delta_B^*(\mathbf{Y}) = \arg \min_{a \in \cal{A}} \textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}} [\| a - \boldsymbol{\beta}\|^2] \]

  • can show that the Bayes action that minimizes the posterior expected loss is the posterior mean \(\mathbf{b}_n = (\phi \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0)^{-1}(\phi \mathbf{X}^T\mathbf{Y}+ \boldsymbol{\Phi}_0 \mathbf{b}_0)\), which is also the Bayes rule (a short derivation is sketched below)

  • different values of \(\mathbf{b}_0\) and \(\boldsymbol{\Phi}_0\) will lead to different Bayes estimators as will different prior distributions besides the Normal

  • if we take \(\mathbf{b}_0 = \mathbf{0}\), the Bayes estimator \[\mathbf{b}_n = \left( \mathbf{X}^T\mathbf{X}+ \boldsymbol{\Phi}_0/\phi \right)^{-1} \mathbf{X}^T\mathbf{Y}\] shrinks the MLE/OLS estimator towards \(\mathbf{0}\); such Bayes estimators are often referred to as shrinkage estimators
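  • to fill in the step above: for any action \(a\), \[\begin{align*} \textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}} [\| a - \boldsymbol{\beta}\|^2] & = \textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}} [\| (a - \mathbf{b}_n) - (\boldsymbol{\beta}- \mathbf{b}_n)\|^2] \\ & = \|a - \mathbf{b}_n\|^2 - 2 (a - \mathbf{b}_n)^T \textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}}[\boldsymbol{\beta}- \mathbf{b}_n] + \textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}} [\|\boldsymbol{\beta}- \mathbf{b}_n\|^2] \\ & = \|a - \mathbf{b}_n\|^2 + \textsf{tr}\left(\boldsymbol{\Phi}_n^{-1}\right) \end{align*}\] since \(\textsf{E}_{\boldsymbol{\beta}\mid \mathbf{Y}}[\boldsymbol{\beta}- \mathbf{b}_n] = \mathbf{0}\); the trace term does not depend on \(a\), so the posterior expected loss is minimized at \(a = \mathbf{b}_n\)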

Prior Choice

One of the most common priors for the normal linear model is the g-prior of Zellner (1986), where \(\boldsymbol{\Phi}_0 = \frac{\phi}{g} \mathbf{X}^T\mathbf{X}\) \[\boldsymbol{\beta}\mid \phi, g \sim \textsf{N}\left(\mathbf{0}, \frac{g}{\phi} (\mathbf{X}^T\mathbf{X})^{-1}\right)\]

\[\begin{align*} \mathbf{b}_n & = \left( \mathbf{X}^T\mathbf{X}+ \frac{\phi}{g} \frac{\mathbf{X}^T\mathbf{X}}{\phi} \right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \left( \mathbf{X}^T\mathbf{X}+ \frac{1}{g} \mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \left( \frac{1 +g}{g} \mathbf{X}^T\mathbf{X}\right)^{-1} \mathbf{X}^T\mathbf{Y}\\ & = \frac{g}{1+g} \hat{\boldsymbol{\beta}} \end{align*}\]

  • \(g\) controls the amount of shrinkage: every component of the MLE is multiplied by the same factor \(g/(1+g)\), i.e. shrunk towards zero by the same proportion
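
A short numerical check of this shrinkage factor under an illustrative design and an assumed value of \(g\):

```python
# Under Zellner's g-prior with b0 = 0, the posterior mean is the OLS estimator
# shrunk by the common factor g / (1 + g).
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 4
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)

g = 10.0                                     # illustrative choice of g
XtX, XtY = X.T @ X, X.T @ Y
beta_hat = np.linalg.solve(XtX, XtY)         # MLE/OLS

# posterior mean with Phi_0 = (phi / g) X^T X  (phi cancels)
b_n = np.linalg.solve(XtX + XtX / g, XtY)
print(np.allclose(b_n, g / (1 + g) * beta_hat))   # True
```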

Another Common Choice

  • another common choice is the independent prior \[\boldsymbol{\beta}\mid \phi \sim \textsf{N}(\mathbf{0}, \boldsymbol{\Phi}_0^{-1})\] where \(\boldsymbol{\Phi}_0 = \phi \kappa \mathbf{I}_p\) for some \(\kappa > 0\)

  • the posterior mean is \[\begin{align*} \mathbf{b}_n & = (\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1} \mathbf{X}^T\mathbf{Y}\\ & = (\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1} \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \end{align*}\]

  • this is also a shrinkage estimator but the amount of shrinkage is different for the different components of \(\mathbf{b}_n\) depending on the eigenvalues of \(\mathbf{X}^T\mathbf{X}\)

  • easiest to see this via an orthogonal rotation of the model

Rotated Regression

  • Use the singular value decomposition of \(\mathbf{X}= \mathbf{U}\boldsymbol{\Lambda}\mathbf{V}^T\) and multiply through by \(\mathbf{U}^T\) to get the rotated model \[\begin{align*} \mathbf{U}^T \mathbf{Y}& = \boldsymbol{\Lambda}\mathbf{V}^T\boldsymbol{\beta}+ \mathbf{U}^T\boldsymbol{\epsilon}\\ \tilde{\mathbf{Y}}& = \boldsymbol{\Lambda}\boldsymbol{\alpha}+ \tilde{\boldsymbol{\epsilon}} \end{align*}\] where \(\boldsymbol{\alpha}= \mathbf{V}^T\boldsymbol{\beta}\) and \(\tilde{\boldsymbol{\epsilon}} = \mathbf{U}^T\boldsymbol{\epsilon}\)

  • the induced prior is still \(\boldsymbol{\alpha}\mid \phi \sim \textsf{N}(\mathbf{0}, (\phi \kappa)^{-1} \mathbf{I})\)

  • the posterior mean of \(\boldsymbol{\alpha}\) is \[\begin{align*} \mathbf{a}& = (\boldsymbol{\Lambda}^2 + \kappa \mathbf{I})^{-1} \boldsymbol{\Lambda}^2 \hat{\boldsymbol{\alpha}}\\ a_j & = \frac{\lambda_j^2}{\lambda_j^2 + \kappa} \hat{\alpha}_j \end{align*}\]

  • sets to zero the components corresponding to zero singular values \(\lambda_j\), where the OLS solution is not unique!
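
A small sketch verifying, for an illustrative design and an assumed \(\kappa\), that componentwise shrinkage of the rotated coefficients reproduces \(\mathbf{b}_n = (\mathbf{X}^T\mathbf{X}+ \kappa \mathbf{I})^{-1} \mathbf{X}^T\mathbf{Y}\):

```python
# The posterior mean under the independent prior shrinks each rotated
# coefficient alpha_j by lambda_j^2 / (lambda_j^2 + kappa).
import numpy as np

rng = np.random.default_rng(4)
n, p = 60, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)
kappa = 5.0                                   # illustrative value of kappa

# direct formula
b_n = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ Y)

# via the SVD X = U Lambda V^T
U, lam, Vt = np.linalg.svd(X, full_matrices=False)
Ytilde = U.T @ Y                              # rotated response
alpha_hat = Ytilde / lam                      # OLS in the rotated model
a = lam**2 / (lam**2 + kappa) * alpha_hat     # componentwise shrinkage
print(np.allclose(b_n, Vt.T @ a))             # True: rotate back with V
```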

Connections to Frequentist Estimators

  • The posterior mean under this independent prior is the same as the classic ridge regression estimator of Hoerl and Kennard (1970)

  • the sampling variance of \(\hat{\alpha}_j\) is \(\sigma^2/\lambda_j^2\), while the posterior variance of \(\alpha_j\) is \(\sigma^2/(\lambda_j^2 + \kappa)\)

  • clearly components of \(\boldsymbol{\alpha}\) associated with small singular values will have large variances

  • ridge regression keeps those components from “blowing up” by shrinking them towards zero, so that they have finite variance even as \(\lambda_j \to 0\)

  • rotate back to get the ridge estimator for \(\boldsymbol{\beta}\), \(\hat{\boldsymbol{\beta}}_R = \mathbf{V}\mathbf{a}\)

  • ridge regression applies a high degree of shrinkage to the “parts” (linear combinations) of \(\boldsymbol{\beta}\) that have high variability, and a low degree of shrinkage to the parts that are well-estimated.

  • it turns out there always exists a value of \(\kappa > 0\) whose frequentist risk improves over OLS!

  • Unfortunately there is no closed-form expression for the optimal \(\kappa\) except in orthogonal regression, and even then it depends on the unknown \(\|\boldsymbol{\beta}\|^2\)!
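
A rough Monte Carlo illustration (not a proof): with a nearly collinear design and one arbitrarily chosen \(\kappa\), the estimated frequentist risk of the ridge estimator is typically smaller than that of OLS.

```python
# Monte Carlo estimate of E||estimator - beta||^2 for OLS and for ridge at one
# illustrative kappa, using a nearly collinear design where OLS is unstable.
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 4
X = rng.normal(size=(n, p))
X[:, 3] = X[:, 2] + 0.05 * rng.normal(size=n)    # nearly collinear columns
beta = np.array([1.0, 0.5, -1.0, 1.0])           # fixed "true" beta (illustrative)
sigma, kappa, reps = 1.0, 2.0, 2000              # kappa chosen arbitrarily

sse_ols = sse_ridge = 0.0
for _ in range(reps):
    Y = X @ beta + rng.normal(scale=sigma, size=n)
    XtX, XtY = X.T @ X, X.T @ Y
    sse_ols += np.sum((np.linalg.solve(XtX, XtY) - beta) ** 2)
    sse_ridge += np.sum((np.linalg.solve(XtX + kappa * np.eye(p), XtY) - beta) ** 2)

print("estimated risk, OLS:  ", sse_ols / reps)
print("estimated risk, ridge:", sse_ridge / reps)   # typically smaller in this setup
```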

Next Class

  • Frequentist risk of Bayes estimators
  • Bayes and penalized loss functions