STA 721: Lecture 6
Duke University
Readings:
Christensen Chapter 2 and 10 (Appendix B as needed)
Seber & Lee Chapter 3
Model:
\[\begin{align} \mathbf{Y}& = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\quad
\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n \\
\textsf{Cov}[\boldsymbol{\epsilon}] & = \sigma^2 \mathbf{V}
\end{align}\] where \(\sigma^2\) is a scalar and \(\mathbf{V}\) is an \(n \times n\) symmetric matrix
Examples: heteroscedastic errors (unequal variances) or correlated errors (e.g., time series or grouped data), so that \(\mathbf{V}\ne \mathbf{I}_n\)
Suppose we still use the OLS estimator \(\hat{\boldsymbol{\beta}}= (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}\). Is it still unbiased? What's its variance? Is it still the BLUE?
Unbiasedness of \(\hat{\boldsymbol{\beta}}\) \[\begin{align} \textsf{E}[\hat{\boldsymbol{\beta}}] & = \textsf{E}[(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}] \\ & = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \textsf{E}[\mathbf{Y}] = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \textsf{E}[\mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}] \\ & = \boldsymbol{\beta}+ \mathbf{0}_p = \boldsymbol{\beta} \end{align}\]
Covariance of \(\hat{\boldsymbol{\beta}}\) \[\begin{align} \textsf{Cov}[\hat{\boldsymbol{\beta}}] & = \textsf{Cov}[(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y}] \\ & = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \textsf{Cov}[\mathbf{Y}] \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ & = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{V}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \end{align}\]
Not necessarily \(\sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}\) unless \(\mathbf{V}\) has a special form
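A minimal numpy sketch of these two facts, using a hypothetical design matrix and an AR(1)-style \(\mathbf{V}\) chosen purely for illustration: the OLS estimator stays unbiased, but its covariance follows the sandwich form above rather than \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 50, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -2.0, 0.5])

# an AR(1)-style correlation matrix as one illustrative choice of V != I
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

XtX_inv = np.linalg.inv(X.T @ X)
sandwich = sigma2 * XtX_inv @ X.T @ V @ X @ XtX_inv   # Cov[beta_hat] under sigma^2 V
naive = sigma2 * XtX_inv                              # the usual formula, which assumes V = I

# Monte Carlo check: OLS stays unbiased and its covariance matches the sandwich form
L = np.linalg.cholesky(sigma2 * V)
draws = np.array([XtX_inv @ X.T @ (X @ beta + L @ rng.normal(size=n))
                  for _ in range(20000)])
print(np.round(draws.mean(axis=0) - beta, 3))    # ~ 0: still unbiased
print(np.abs(np.cov(draws.T) - sandwich).max())  # small: matches the sandwich covariance
print(np.abs(sandwich - naive).max())            # generally not small
```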
Transform the data and reduce problem to one we have solved!
For \(\mathbf{V}> 0\) use the Spectral Decomposition \[\mathbf{V}= \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^T = \mathbf{U}\boldsymbol{\Lambda}^{1/2} \boldsymbol{\Lambda}^{1/2} \mathbf{U}^T\]
define the symmetric square root of \(\mathbf{V}\) as \[\mathbf{V}^{1/2} \equiv \mathbf{U}\boldsymbol{\Lambda}^{1/2} \mathbf{U}^T\]
transform model: \[\begin{align*} \mathbf{V}^{-1/2} \mathbf{Y}& = \mathbf{V}^{-1/2} \mathbf{X}\boldsymbol{\beta}+ \mathbf{V}^{-1/2}\boldsymbol{\epsilon}\\ \tilde{\mathbf{Y}} & = \tilde{\mathbf{X}} \boldsymbol{\beta}+ \tilde{\boldsymbol{\epsilon}} \end{align*}\]
Since \(\textsf{Cov}[\tilde{\boldsymbol{\epsilon}}] = \sigma^2\mathbf{V}^{-1/2} \mathbf{V}\mathbf{V}^{-1/2} = \sigma^2 \mathbf{I}_n\), we know that \(\hat{\boldsymbol{\beta}}_\mathbf{V}\equiv (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T\tilde{\mathbf{Y}}\) is the BLUE for \(\boldsymbol{\beta}\) based on \(\tilde{\mathbf{Y}}\) (\(\mathbf{X}\) full rank)
If \(\mathbf{V}\) is known, then \(\tilde{\mathbf{Y}}\) and \(\mathbf{Y}\) are known linear transformations of each other
any estimator of \(\boldsymbol{\beta}\) that is linear in \(\mathbf{Y}\) is linear in \(\tilde{\mathbf{Y}}\) and vice versa, so the previous results apply
\(\hat{\boldsymbol{\beta}}_\mathbf{V}\) is the BLUE of \(\boldsymbol{\beta}\) based on either \(\tilde{\mathbf{Y}}\) or \(\mathbf{Y}\)!
Substituting back, we have \[\begin{align} \hat{\boldsymbol{\beta}}_\mathbf{V}& = (\tilde{\mathbf{X}}^T\tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^T\tilde{\mathbf{Y}}\\ & = (\mathbf{X}^T \mathbf{V}^{-1/2}\mathbf{V}^{-1/2} \mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1/2}\mathbf{V}^{-1/2}\mathbf{Y}\\ & = (\mathbf{X}^T \mathbf{V}^{-1} \mathbf{X})^{-1} \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y} \end{align}\] which is the Generalized Least Squares Estimator of \(\boldsymbol{\beta}\)
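A short numpy sketch of this equivalence, with a hypothetical \(\mathbf{X}\), \(\mathbf{V}\), and \(\mathbf{Y}\): building \(\mathbf{V}^{-1/2}\) from the spectral decomposition and running OLS on the transformed data reproduces the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.multivariate_normal(np.zeros(n), 2.0 * V)

# symmetric square root of V from the spectral decomposition V = U diag(lam) U^T
lam, U = np.linalg.eigh(V)
V_inv_half = U @ np.diag(lam ** -0.5) @ U.T

# OLS on the transformed data ...
X_t, Y_t = V_inv_half @ X, V_inv_half @ Y
beta_tilde = np.linalg.solve(X_t.T @ X_t, X_t.T @ Y_t)

# ... equals the direct GLS formula (X^T V^{-1} X)^{-1} X^T V^{-1} Y
V_inv = U @ np.diag(1.0 / lam) @ U.T
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)

print(np.allclose(beta_tilde, beta_gls))   # True
```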
the OLS/MLE of \(\boldsymbol{\mu}\in C(\mathbf{X})\) with transformed variables is \[\begin{align*} \mathbf{P}_{\tilde{\mathbf{X}}} \tilde{\mathbf{Y}}& = \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_\mathbf{V}\\ \tilde{\mathbf{X}}\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^T \tilde{\mathbf{Y}}& = \tilde{\mathbf{X}}\hat{\boldsymbol{\beta}}_\mathbf{V}\\ \mathbf{V}^{-1/2} \mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{V}^{-1} \mathbf{Y}& = \mathbf{V}^{-1/2} \mathbf{X}\hat{\boldsymbol{\beta}}_\mathbf{V}\end{align*}\]
since \(\mathbf{V}\) is positive definite, multiply through by \(\mathbf{V}^{1/2}\) to show that \(\hat{\boldsymbol{\beta}}_\mathbf{V}\) is a GLS/MLE estimator of \(\boldsymbol{\beta}\) iff \[\mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{V}^{-1} \mathbf{Y}= \mathbf{X}\hat{\boldsymbol{\beta}}_\mathbf{V}\]
Is \(\mathbf{P}_\mathbf{V}\equiv \mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{V}^{-1}\) a projection onto \(C(\mathbf{X})\)? Is it an orthogonal projection onto \(C(\mathbf{X})\)?
\(\dagger\) if \(\mathbf{X}\) is not full rank replace \(\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\) with \(\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-}\)
We want to show that \(\mathbf{P}_\mathbf{V}\equiv \mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{V}^{-1}\) is a projection onto \(C(\mathbf{X})\)
from the definition of \(\mathbf{P}_\mathbf{V}\) it follows that \(\mathbf{m}\in C(\mathbf{P}_\mathbf{V})\) implies that \(\mathbf{m}= \mathbf{P}_\mathbf{V}\mathbf{b}= \mathbf{X}\left[\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{b}\right]\) for some \(\mathbf{b}\), so \(C(\mathbf{P}_\mathbf{V}) \subset C(\mathbf{X})\)
since \(\mathbf{P}_{\tilde{\mathbf{X}}}\) is a projection onto \(C(\tilde{\mathbf{X}})\) we have \[\begin{align*} \mathbf{P}_{\tilde{\mathbf{X}}} \tilde{\mathbf{X}}& = \tilde{\mathbf{X}}\\ \tilde{\mathbf{X}}\left(\tilde{\mathbf{X}}^T\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^T \tilde{\mathbf{X}}& = \tilde{\mathbf{X}}\\ \mathbf{V}^{-1/2} \mathbf{X}\left(\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}^T \mathbf{V}^{-1} \mathbf{X}& = \mathbf{V}^{-1/2} \mathbf{X}\\ \mathbf{V}^{-1/2} \mathbf{P}_\mathbf{V}\mathbf{X}& = \mathbf{V}^{-1/2} \mathbf{X} \end{align*}\]
We can multiply both sides by \(\mathbf{V}^{1/2} > 0\), so that \(\mathbf{P}_\mathbf{V}\mathbf{X}= \mathbf{X}\)
for \(\mathbf{m}\in C(\mathbf{X})\), \(\mathbf{P}_\mathbf{V}\mathbf{m}= \mathbf{m}\) and \(C(\mathbf{X}) \subset C(\mathbf{P}_\mathbf{V})\)
\(\quad \quad \therefore C(\mathbf{P}_\mathbf{V}) = C(\mathbf{X})\) so that \(\mathbf{P}_\mathbf{V}\) is a projection onto \(C(\mathbf{X})\)
Show that \(\mathbf{P}_\mathbf{V}^2 = \mathbf{P}_\mathbf{V}\) (idempotent)
every vector \(\mathbf{y}\in \mathbb{R}^n\) may be written as \(\mathbf{y}= \mathbf{m}+ \mathbf{n}\) with \(\mathbf{m}= \mathbf{P}_\mathbf{V}\mathbf{y}\in C(\mathbf{P}_\mathbf{V})\) and \(\mathbf{n}= (\mathbf{I}_n - \mathbf{P}_\mathbf{V})\mathbf{y}\in N(\mathbf{P}_\mathbf{V})\)
Is \(\mathbf{P}_\mathbf{V}\) an orthogonal projection onto \(C(\mathbf{X})\) for the inner product space \((\mathbb{R}^n, \langle \mathbf{v}, \mathbf{u}\rangle = \mathbf{v}^T\mathbf{u})\)?
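A small numerical check of these properties, with a hypothetical \(\mathbf{X}\) and positive definite \(\mathbf{V}\): \(\mathbf{P}_\mathbf{V}\) fixes \(C(\mathbf{X})\) and is idempotent, but it is generally not symmetric, which previews the answer to the question above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 3
X = rng.normal(size=(n, p))
A = rng.normal(size=(n, n))
V = A @ A.T + n * np.eye(n)          # an arbitrary positive definite V
V_inv = np.linalg.inv(V)

# P_V = X (X^T V^{-1} X)^{-1} X^T V^{-1}
P_V = X @ np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv)

print(np.allclose(P_V @ X, X))       # True: P_V X = X
print(np.allclose(P_V @ P_V, P_V))   # True: idempotent, so a projection onto C(X)
print(np.allclose(P_V, P_V.T))       # typically False: not symmetric, so not an
                                     # orthogonal projection under <u, v> = u^T v
```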
The GLS estimator minimizes the following generalized squared error loss: \[\begin{align} \| \tilde{\mathbf{Y}}- \tilde{\mathbf{X}}\boldsymbol{\beta}\|^2 & = (\tilde{\mathbf{Y}}- \tilde{\mathbf{X}}\boldsymbol{\beta})^T(\tilde{\mathbf{Y}}- \tilde{\mathbf{X}}\boldsymbol{\beta}) \\ & = (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T \mathbf{V}^{-1/2}\mathbf{V}^{-1/2}(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) \\ & = (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T \mathbf{V}^{-1}(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) \\ & = \| \mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2_{\mathbf{V}^{-1}} \end{align}\] where we can change the inner product to be \[\langle \mathbf{u}, \mathbf{v}\rangle_{\mathbf{V}^{-1}} \equiv \mathbf{u}^T\mathbf{V}^{-1} \mathbf{v}\]
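Minimizing this criterion directly leads to the same estimator: differentiating with respect to \(\boldsymbol{\beta}\) and setting the gradient to zero gives the generalized normal equations \[ -2\,\mathbf{X}^T\mathbf{V}^{-1}(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}_p \quad \Longleftrightarrow \quad \mathbf{X}^T\mathbf{V}^{-1}\mathbf{X}\,\hat{\boldsymbol{\beta}}_\mathbf{V}= \mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}\] whose solution for full-rank \(\mathbf{X}\) is \(\hat{\boldsymbol{\beta}}_\mathbf{V}= (\mathbf{X}^T\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\mathbf{V}^{-1}\mathbf{Y}\), as derived above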
For what covariance matrices \(\mathbf{V}\) will the OLS and GLS estimators be the same?
Characterizing these \(\mathbf{V}\) also helps explain why, in general, the GLS estimator has a lower variance than OLS.
We need to show that \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\beta}}_\mathbf{V}\) are the same for all \(\mathbf{Y}\). Since both \(\mathbf{P}\) and \(\mathbf{P}_\mathbf{V}\) are projections onto \(C(\mathbf{X})\), \(\hat{\boldsymbol{\beta}}\) and \(\hat{\boldsymbol{\beta}}_\mathbf{V}\) will be the same iff \(\mathbf{P}_\mathbf{V}\) is an orthogonal projection onto \(C(\mathbf{X})\) so that \(\mathbf{P}_\mathbf{V}\mathbf{n}= 0\) for \(\mathbf{n}\in C(\mathbf{X})^\perp\) (they have the same null spaces)
Show that \(C(\mathbf{X}) = C(\mathbf{V}\mathbf{X})\) iff \(\mathbf{V}\) can be written as \[\mathbf{V}= \mathbf{X}\boldsymbol{\Psi}\mathbf{X}^T + \mathbf{H}\boldsymbol{\Phi}\mathbf{H}^T\] where the columns of \(\mathbf{H}\) span \(C(\mathbf{X})^\perp\). (Show that \(C(\mathbf{V}\mathbf{X}) \subset C(\mathbf{X})\) iff \(\mathbf{V}\) has the above form; since the two subspaces have the same rank, \(C(\mathbf{X}) = C(\mathbf{V}\mathbf{X})\).)
Show that \(C(\mathbf{X}) = C(\mathbf{V}^{-1} \mathbf{X})\) iff \(C(\mathbf{X}) = C(\mathbf{V}\mathbf{X})\)
Show that \(C(\mathbf{X})^\perp = C(\mathbf{V}^{-1} \mathbf{X})^\perp\) iff \(C(\mathbf{X}) = C(\mathbf{V}^{-1} \mathbf{X})\)
Show that \(\mathbf{n}\in C(\mathbf{X})^\perp\) iff \(\mathbf{n}\in C(\mathbf{V}^{-1}\mathbf{X})^\perp\) so \(\mathbf{P}_\mathbf{V}\mathbf{n}= 0\)
See Proposition 2.7.5 and Proof in Christensen
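A numerical sketch of the conclusion, using hypothetical \(\mathbf{X}\), \(\boldsymbol{\Psi}\), \(\boldsymbol{\Phi}\) and taking \(\mathbf{H}\) to be an orthonormal basis of \(C(\mathbf{X})^\perp\): for \(\mathbf{V}\) of this special form, the OLS and GLS estimates coincide for any \(\mathbf{Y}\).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 4
X = rng.normal(size=(n, p))

# orthonormal basis H for C(X)^perp from a full QR decomposition of X
Q, _ = np.linalg.qr(X, mode="complete")
H = Q[:, p:]                                   # n x (n - p)

def random_pd(k, rng):
    """A random symmetric positive definite k x k matrix (illustrative)."""
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

Psi, Phi = random_pd(p, rng), random_pd(n - p, rng)
V = X @ Psi @ X.T + H @ Phi @ H.T              # the special covariance structure
V_inv = np.linalg.inv(V)

Y = rng.normal(size=n)                         # any Y gives the same conclusion
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
print(np.allclose(beta_ols, beta_gls))         # True
```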
For the linear model \(\mathbf{Y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\) with \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\) and \(\textsf{Cov}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{V}\), we can always write
\[\begin{align} \boldsymbol{\epsilon}& = \mathbf{P}\boldsymbol{\epsilon}+ (\mathbf{I}- \mathbf{P})\boldsymbol{\epsilon}\\ & = \boldsymbol{\epsilon}_\mathbf{X}+ \boldsymbol{\epsilon}_N \end{align}\]
we can recover \(\boldsymbol{\epsilon}_N\) from the data \(\mathbf{Y}\) but not \(\boldsymbol{\epsilon}_\mathbf{X}\): \[\begin{align} \mathbf{P}\mathbf{Y}& = \mathbf{P}( \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}_\mathbf{X}+ \boldsymbol{\epsilon}_N )\\ & = \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}_\mathbf{X}= \mathbf{X}\hat{\boldsymbol{\beta}}\\ (\mathbf{I}_n - \mathbf{P}) \mathbf{Y}& = \boldsymbol{\epsilon}_N = \hat{\boldsymbol{\epsilon}} = \mathbf{e} \end{align}\]
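A tiny numpy sketch of this point, with a hypothetical \(\mathbf{X}\), \(\boldsymbol{\beta}\), and \(\boldsymbol{\epsilon}\): the residual part \((\mathbf{I}_n - \mathbf{P})\boldsymbol{\epsilon}\) is recoverable from \(\mathbf{Y}\) alone, while \(\mathbf{P}\boldsymbol{\epsilon}\) is confounded with \(\mathbf{X}\boldsymbol{\beta}\).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 25, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
eps = rng.normal(size=n)
Y = X @ beta + eps

P = X @ np.linalg.solve(X.T @ X, X.T)            # orthogonal projection onto C(X)
I = np.eye(n)

print(np.allclose((I - P) @ Y, (I - P) @ eps))   # True: eps_N is the residual e
print(np.allclose(P @ Y, P @ eps))               # False: eps_X is hidden inside X beta
```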
Can \(\boldsymbol{\epsilon}_\textsf{N}\) help us estimate \(\mathbf{X}\boldsymbol{\beta}\)? What if \(\boldsymbol{\epsilon}_N\) could tell us something about \(\boldsymbol{\epsilon}_X\)?
Yes if they were highly correlated! But if they were independent or uncorrelated then knowing \(\boldsymbol{\epsilon}_\textsf{N}\) doesn’t help us!
For what matrices are \(\boldsymbol{\epsilon}_\mathbf{X}\) and \(\boldsymbol{\epsilon}_N\) uncorrelated?
Under \(\mathbf{V}= \mathbf{I}_n\): \[\begin{align} \textsf{E}[\boldsymbol{\epsilon}_\mathbf{X}\boldsymbol{\epsilon}_\textsf{N}^T] & = \mathbf{P}\textsf{E}[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T](\mathbf{I}-\mathbf{P})^T \\ & = \sigma^2 \mathbf{P}(\mathbf{I}- \mathbf{P}) = \mathbf{0} \end{align}\] (using symmetry of \(\mathbf{I}- \mathbf{P}\)), so they are uncorrelated
For the \(\mathbf{V}\) in the theorem, introduce uncorrelated mean-zero random vectors \(\mathbf{Z}_\mathbf{X}\) and \(\mathbf{Z}_\textsf{N}\) with \(\textsf{Cov}[\mathbf{Z}_\mathbf{X}] = \sigma^2 \boldsymbol{\Psi}\) and \(\textsf{Cov}[\mathbf{Z}_\textsf{N}] = \sigma^2 \boldsymbol{\Phi}\), and write \(\boldsymbol{\epsilon}= \mathbf{X}\mathbf{Z}_\mathbf{X}+ \mathbf{H}\mathbf{Z}_\textsf{N}\), so that \(\textsf{Cov}[\boldsymbol{\epsilon}] = \sigma^2 (\mathbf{X}\boldsymbol{\Psi}\mathbf{X}^T + \mathbf{H}\boldsymbol{\Phi}\mathbf{H}^T) = \sigma^2 \mathbf{V}\)
As a consequence we have
\(\boldsymbol{\epsilon}_\mathbf{X}= \mathbf{P}\boldsymbol{\epsilon}= \mathbf{X}\mathbf{Z}_\mathbf{X}\)
\(\boldsymbol{\epsilon}_\textsf{N}= (\mathbf{I}_n - \mathbf{P})\boldsymbol{\epsilon}= \mathbf{H}\mathbf{Z}_\textsf{N}\)
\(\boldsymbol{\epsilon}_\mathbf{X}\) and \(\boldsymbol{\epsilon}_\textsf{N}\) are uncorrelated \[\begin{align} \textsf{E}[\boldsymbol{\epsilon}_\mathbf{X}\boldsymbol{\epsilon}_\textsf{N}^T] & = \textsf{E}[\mathbf{X}\mathbf{Z}_\mathbf{X}\mathbf{Z}_\textsf{N}^T \mathbf{H}^T] \\ & = \mathbf{X}\,\textsf{E}[\mathbf{Z}_\mathbf{X}\mathbf{Z}_\textsf{N}^T]\,\mathbf{H}^T = \mathbf{X}\mathbf{0}\mathbf{H}^T \\ & = \mathbf{0} \end{align}\]
so that \(\boldsymbol{\epsilon}_\mathbf{X}\) and \(\boldsymbol{\epsilon}_\textsf{N}\) are uncorrelated when \(\mathbf{V}= \mathbf{X}\boldsymbol{\Psi}\mathbf{X}^T + \mathbf{H}\boldsymbol{\Phi}\mathbf{H}^T\)
Alternative Statement of Theorem: \(\hat{\boldsymbol{\beta}}= \hat{\boldsymbol{\beta}}_\mathbf{V}\) for all \(\mathbf{Y}\) under \(\textsf{Cov}[\mathbf{Y}] = \sigma^2 \mathbf{V}\) iff \(\mathbf{P}\mathbf{Y}\) and \((\mathbf{I}- \mathbf{P})\mathbf{Y}\) are uncorrelated
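A quick numerical check of the alternative statement, using the same kind of hypothetical construction as in the earlier sketch: for \(\mathbf{V}= \mathbf{X}\boldsymbol{\Psi}\mathbf{X}^T + \mathbf{H}\boldsymbol{\Phi}\mathbf{H}^T\), the population cross-covariance between \(\mathbf{P}\boldsymbol{\epsilon}\) and \((\mathbf{I}- \mathbf{P})\boldsymbol{\epsilon}\), which is \(\sigma^2\,\mathbf{P}\mathbf{V}(\mathbf{I}- \mathbf{P})\), is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 4
X = rng.normal(size=(n, p))
Q, _ = np.linalg.qr(X, mode="complete")
H = Q[:, p:]                                    # orthonormal basis for C(X)^perp

Psi = 2.0 * np.eye(p)                           # illustrative choices of Psi, Phi
Phi = 0.5 * np.eye(n - p)
V = X @ Psi @ X.T + H @ Phi @ H.T

P = X @ np.linalg.solve(X.T @ X, X.T)           # orthogonal projection onto C(X)

# Cov[P eps, (I - P) eps] / sigma^2 = P V (I - P), which vanishes for this V
cross_cov = P @ V @ (np.eye(n) - P)
print(np.abs(cross_cov).max())                  # ~ 0 (numerical round-off only)
```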
The following corollary to the theorem establishes when two GLS estimators for different \(\textsf{Cov}[\boldsymbol{\epsilon}]\) are equivalent: