Best Linear Unbiased Estimators in Prediction, MVUEs and BUEs

STA 721: Lecture 5

Merlise Clyde (clyde@duke.edu)

Duke University

Outline

  • Gauss-Markov Theorem for non-full rank \(\mathbf{X}\) (recap)
  • Best Linear Unbiased Estimators for Prediction
  • MVUE
  • Discussion of recent papers on Best Unbiased Estimators beyond linearity

Readings:

Identifiability

Definition: Identifiable
\(\boldsymbol{\beta}\) and \(\sigma^2\) are identifiable if equality of the distributions of \(\mathbf{Y}\), \(f_\mathbf{Y}(\mathbf{y}; \boldsymbol{\beta}_1, \sigma^2_1) = f_\mathbf{Y}(\mathbf{y}; \boldsymbol{\beta}_2, \sigma^2_2)\) for all \(\mathbf{y}\), implies that \((\boldsymbol{\beta}_1, \sigma^2_1) = (\boldsymbol{\beta}_2, \sigma^2_2)\)

  • For linear models, an equivalent definition is that \(\boldsymbol{\beta}\) is identifiable if for any \(\boldsymbol{\beta}_1\) and \(\boldsymbol{\beta}_2\), \(\boldsymbol{\mu}(\boldsymbol{\beta}_1) = \boldsymbol{\mu}(\boldsymbol{\beta}_2)\), i.e. \(\mathbf{X}\boldsymbol{\beta}_1 =\mathbf{X}\boldsymbol{\beta}_2\), implies that \(\boldsymbol{\beta}_1 = \boldsymbol{\beta}_2\).

  • If \(r(\mathbf{X}) = p\) then \(\boldsymbol{\beta}\) is identifiable

  • If \(\mathbf{X}\) is not full rank, there exist \(\boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2\) with \(\mathbf{X}\boldsymbol{\beta}_1 = \mathbf{X}\boldsymbol{\beta}_2\), hence \(\boldsymbol{\beta}\) is not identifiable! (see the display after this list)

  • identifiable linear functions of \(\boldsymbol{\beta}\), \(\boldsymbol{\Lambda}^T\boldsymbol{\beta}\), that have a linear unbiased estimator are historically referred to as estimable in linear models.
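
To see why, note that when \(r(\mathbf{X}) < p\) the null space \(N(\mathbf{X})\) is nontrivial; for any nonzero \(\mathbf{v}\in N(\mathbf{X})\) and \(\boldsymbol{\beta}_2 = \boldsymbol{\beta}_1 + \mathbf{v}\), \[\mathbf{X}\boldsymbol{\beta}_2 = \mathbf{X}\boldsymbol{\beta}_1 + \mathbf{X}\mathbf{v}= \mathbf{X}\boldsymbol{\beta}_1,\] so \(\boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2\) give the same mean \(\boldsymbol{\mu}\) and hence the same distribution for \(\mathbf{Y}\).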

BLUE of \(\boldsymbol{\Lambda}^T \boldsymbol{\beta}\)

If \(\boldsymbol{\Lambda}^T= \mathbf{B}\mathbf{X}\) for some matrix \(\mathbf{B}\) (equivalently \(\boldsymbol{\Lambda}= \mathbf{X}^T\mathbf{B}^T\)), then

  • \(\textsf{E}[\mathbf{B}\mathbf{P}\mathbf{Y}] = \textsf{E}[\mathbf{B}\mathbf{X}\hat{\boldsymbol{\beta}}] = \textsf{E}[\boldsymbol{\Lambda}^T \hat{\boldsymbol{\beta}}] = \boldsymbol{\Lambda}^T\boldsymbol{\beta}\)
  • \(\boldsymbol{\Lambda}^T\boldsymbol{\beta}\) is identifiable as it is a function of \(\boldsymbol{\mu}\), and \(\mathbf{B}\mathbf{P}\mathbf{Y}\) is linear in \(\mathbf{Y}\) and unbiased
  • The unique OLS estimate of \(\boldsymbol{\Lambda}^T\boldsymbol{\beta}\) is \(\boldsymbol{\Lambda}^T\hat{\boldsymbol{\beta}}\)
  • \(\mathbf{B}\mathbf{P}\mathbf{Y}= \boldsymbol{\Lambda}^T\hat{\boldsymbol{\beta}}\) is the BLUE of \(\boldsymbol{\Lambda}^T\boldsymbol{\beta}\) \[\begin{align*} & \textsf{E}[\|\mathbf{B}\mathbf{P}\mathbf{Y}- \mathbf{B}\boldsymbol{\mu}\|^2] \le \textsf{E}[\|\mathbf{A}\mathbf{Y}- \mathbf{B}\boldsymbol{\mu}\|^2] \\ \Leftrightarrow & \\ & \textsf{E}[\|\boldsymbol{\Lambda}^T\hat{\boldsymbol{\beta}}- \boldsymbol{\Lambda}^T\boldsymbol{\beta}\|^2] \le \textsf{E}[\|\mathbf{L}^T\tilde{\boldsymbol{\beta}}- \boldsymbol{\Lambda}^T\boldsymbol{\beta}\|^2] \end{align*}\] for any LUE \(\mathbf{A}\mathbf{Y}= \mathbf{L}^T\tilde{\boldsymbol{\beta}}\) of \(\boldsymbol{\Lambda}^T\boldsymbol{\beta}\)

Non-Identifiable Example

One-way ANOVA model \[\mu_{ij} = \mu + \tau_j \qquad \boldsymbol{\mu}= ( \mu_{11}, \ldots, \mu_{n_1 1}, \mu_{12}, \ldots, \mu_{n_2 2}, \ldots, \mu_{1J}, \ldots, \mu_{n_J J})^T \]

  • Let \(\boldsymbol{\beta}_{1} = (\mu, \tau_1, \ldots, \tau_J)^T\)
  • Let \(\boldsymbol{\beta}_{2} = (\mu - 42, \tau_1 + 42, \ldots, \tau_J + 42)^T\)
  • Then \(\boldsymbol{\mu}_{1} = \boldsymbol{\mu}_{2}\) even though \(\boldsymbol{\beta}_1 \neq \boldsymbol{\beta}_2\)
  • \(\boldsymbol{\beta}\) is not identifiable (the R sketch below verifies this numerically)
  • yet \(\boldsymbol{\mu}\) is identifiable, where \(\boldsymbol{\mu}= \mathbf{X}\boldsymbol{\beta}\) (a linear function of \(\boldsymbol{\beta}\))
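
A minimal numerical check of the last three bullets, using hypothetical group sizes \(n_1 = n_2 = n_3 = 2\) and a made-up \(\boldsymbol{\beta}_1\):

# one-way ANOVA design with J = 3 groups of size 2; columns: intercept, group 1, 2, 3 indicators
X = cbind(1, kronecker(diag(3), rep(1, 2)))
beta1 = c(10, 1, 2, 3)                             # (mu, tau_1, tau_2, tau_3)
beta2 = c(10 - 42, 1 + 42, 2 + 42, 3 + 42)         # shifted as in the bullets above
all.equal(drop(X %*% beta1), drop(X %*% beta2))    # TRUE: same mu even though beta1 != beta2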

LUEs of Individual \(\beta_j\)

Proposition: Christensen 2.1.6
For \(\boldsymbol{\mu}= \mathbf{X}\boldsymbol{\beta}= \sum_j \mathbf{X}_j \beta_j\), \(\beta_j\) is not identifiable if and only if there exist scalars \(\alpha_i\) such that \(\mathbf{X}_j = \sum_{i \neq j} \mathbf{X}_i \alpha_i\)

One-way Anova Model: \(Y_{ij} = \mu + \tau_j + \epsilon_{ij}\) \[\boldsymbol{\mu}= \left[ \begin{array}{lllll} \mathbf{1}_{n_1} & \mathbf{1}_{n_1} & \mathbf{0}_{n_1} & \ldots & \mathbf{0}_{n_1} \\ \mathbf{1}_{n_2} & \mathbf{0}_{n_2} & \mathbf{1}_{n_2} & \ldots & \mathbf{0}_{n_2} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \mathbf{1}_{n_J} & \mathbf{0}_{n_J} & \mathbf{0}_{n_J} & \ldots & \mathbf{1}_{n_J} \\ \end{array} \right] \left( \begin{array}{l} \mu \\ \tau_1 \\ \tau_2 \\ \vdots \\ \tau_J \end{array} \right) \]

  • Are any parameters \(\mu\) or \(\tau_j\) identifiable on their own? (see the check below)
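
A quick check of the proposition for this design, again with hypothetical \(J = 3\) and \(n_j = 2\): the intercept column equals the sum of the group-indicator columns, so by Proposition 2.1.6 neither \(\mu\) nor any individual \(\tau_j\) is identifiable.

X = cbind(1, kronecker(diag(3), rep(1, 2)))   # same one-way ANOVA design as in the previous sketch
qr(X)$rank                                    # 3, but X has 4 columns: not full rank
all(X[, 1] == rowSums(X[, -1]))               # TRUE: X_1 is a linear combination of the other columns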

Examples of \(\boldsymbol{\lambda}\) of Interest:

  • The \(j\)th element of \(\boldsymbol{\beta}\): \(\boldsymbol{\lambda}= (0, 0, \ldots, 1, 0, \ldots, 0)^T\) with the 1 in position \(j\), \[\boldsymbol{\lambda}^T\boldsymbol{\beta}= \beta_j\]

  • Difference between two treatments, \(\tau_1 - \tau_2\): \(\boldsymbol{\lambda}= (0, 1, -1, 0, \ldots, 0)^T\), \[\boldsymbol{\lambda}^T\boldsymbol{\beta}= \tau_1 - \tau_2\]

  • Estimation at observed \(\mathbf{x}_i\): \(\boldsymbol{\lambda}= \mathbf{x}_i\) \[\mu_i = \mathbf{x}_i^T \boldsymbol{\beta}\]

  • Estimation or prediction at a new point \(\mathbf{x}_*\): \(\boldsymbol{\lambda}= \mathbf{x}_*\), \[\mu_* = \mathbf{x}_*^T \boldsymbol{\beta}\]

Another Non-Full Rank Example

x1 = -4:4
x2 = c(-2, 1, -1, 2, 0, 2, -1, 1, -2)
x3 = 3*x1 - 2*x2                          # exact linear combination of x1 and x2
x4 = x2 - x1 + 4                          # exact linear combination of the intercept, x1, and x2
Y = 1 + x1 + x2 + x3 + x4 + c(-.5, .5, .5, -.5, 0, .5, -.5, -.5, .5)
dev.set = data.frame(Y, x1, x2, x3, x4)   # the design is rank 3, not full rank

# Order 1
lm1234 = lm(Y ~ x1 + x2 + x3 + x4, data=dev.set)
round(coefficients(lm1234), 4)
(Intercept)          x1          x2          x3          x4 
          5           3           0          NA          NA 
# Order 2
lm3412 = lm(Y ~ x3 + x4 + x1 + x2, data = dev.set)
round(coefficients(lm3412), 4)
(Intercept)          x3          x4          x1          x2 
        -19           3           6          NA          NA 
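
One way to see exactly how the dropped columns depend on the retained ones is base R's alias() function (output not shown here):

alias(lm1234)   # reports x3 and x4 as exact linear combinations of the intercept, x1, and x2
alias(lm3412)   # the complementary dependencies: x1 and x2 in terms of the intercept, x3, and x4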

In Sample Predictions

cbind(dev.set, predict(lm1234), predict(lm3412))
     Y x1 x2  x3 x4 predict(lm1234) predict(lm3412)
1 -7.5 -4 -2  -8  6              -7              -7
2 -3.5 -3  1 -11  8              -4              -4
3 -0.5 -2 -1  -4  5              -1              -1
4  1.5 -1  2  -7  7               2               2
5  5.0  0  0   0  4               5               5
6  8.5  1  2  -1  5               8               8
7 10.5  2 -1   8  1              11              11
8 13.5  3  1   7  2              14              14
9 17.5  4 -2  16 -2              17              17
  • Both models agree for estimating the mean at the observed \(\mathbf{X}\) points!

Out of Sample

test.set = data.frame(x1 = c(3, 6, 6, 0, 0, 1), x2 = c(1, 2, 2, 0, 0, 2),    # new x_* points
                      x3 = c(7, 14, 14, 0, 0, 3), x4 = c(2, 4, 0, 4, 0, 4))  # (the six rows shown below)
out = data.frame(test.set,
      Y1234 = predict(lm1234, newdata = test.set),
      Y3412 = predict(lm3412, newdata = test.set))
out
  x1 x2 x3 x4 Y1234 Y3412
1  3  1  7  2    14    14
2  6  2 14  4    23    47
3  6  2 14  0    23    23
4  0  0  0  4     5     5
5  0  0  0  0     5   -19
6  1  2  3  4     8    14
  • Agreement for cases 1, 3, and 4 only!

  • Can we determine that without finding the predictions and comparing?

  • Conditions for general \(\boldsymbol{\Lambda}\) or \(\boldsymbol{\lambda}\) without finding \(\mathbf{B}\) (or \(\mathbf{b}\))?

Conditions for a LUE of \(\boldsymbol{\lambda}^T\boldsymbol{\beta}\)

  • GM requires that \(\boldsymbol{\lambda}^T = \mathbf{b}^T\mathbf{X}\Leftrightarrow \boldsymbol{\lambda}= \mathbf{X}^T \mathbf{b}\) therefore \(\boldsymbol{\lambda}\in C(\mathbf{X}^T)\)

  • Suppose we have an arbitrary \(\boldsymbol{\lambda}= \boldsymbol{\lambda}_* + \mathbf{u}\), where \(\boldsymbol{\lambda}_* \in C(\mathbf{X}^T)\) and \(\mathbf{u}\in C(\mathbf{X}^T)^\perp\) (orthogonal complement)

  • Let \(\mathbf{P}_{\mathbf{X}^T}\) denote an orthogonal projection onto \(C(\mathbf{X}^T)\) then \(\mathbf{I}- \mathbf{P}_{\mathbf{X}^T}\) is an orthogonal projection onto \(C(\mathbf{X}^T)^\perp\)

  • \((\mathbf{I}- \mathbf{P}_{\mathbf{X}^T})\boldsymbol{\lambda}= (\mathbf{I}- \mathbf{P}_{\mathbf{X}^T})\boldsymbol{\lambda}_* + (\mathbf{I}- \mathbf{P}_{\mathbf{X}^T})\mathbf{u}= \mathbf{0}_p + \mathbf{u}\)

  • so \(\boldsymbol{\lambda}\in C(\mathbf{X}^T)\) if and only if \((\mathbf{I}- \mathbf{P}_{\mathbf{X}^T})\boldsymbol{\lambda}= \mathbf{0}_p\) (equivalently \(\mathbf{P}_{\mathbf{X}^T} \boldsymbol{\lambda}= \boldsymbol{\lambda}\)); see the R sketch after this list

  • Note this is really just a generalization of Proposition 2.1.6 in Christensen that \(\beta_j\) is not identifiable iff there exist scalars \(\alpha_i\) such that \(\mathbf{X}_j = \sum_{i \neq j} \mathbf{X}_i \alpha_i\)
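
A minimal sketch of this check applied to the earlier out-of-sample example, assuming x1 through x4 from that example are still in the workspace and that the MASS package is available (MASS::ginv computes the Moore-Penrose generalized inverse):

X = cbind(1, x1, x2, x3, x4)              # n x p design matrix with r(X) = 3 < p = 5
P.Xt = MASS::ginv(X) %*% X                # X^+ X, the orthogonal projection onto C(X^T)
lambda2 = c(1, 6, 2, 14, 4)               # x_* for out-of-sample case 2 (intercept first)
lambda3 = c(1, 6, 2, 14, 0)               # x_* for out-of-sample case 3
max(abs(P.Xt %*% lambda2 - lambda2))      # far from 0: case 2 is not estimable
max(abs(P.Xt %*% lambda3 - lambda3))      # essentially 0: case 3 is estimable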

Exercise
  1. Is \(\mathbf{P}_{\mathbf{X}^T} = (\mathbf{X}^T\mathbf{X})(\mathbf{X}^T\mathbf{X})^{-}\) a projection onto \(C(\mathbf{X}^T)\)?
  2. Is the expression for \(\mathbf{P}_{\mathbf{X}^T}\) unique?
  3. Is \(\mathbf{P}_{\mathbf{X}^T}\) an orthogonal projection in general?
  4. Is \(\mathbf{P}_{\mathbf{X}^T}\) using the Moore-Penrose generalized inverse an orthogonal projection?

Prediction Example Again

For estimation or prediction at a new \(\mathbf{x}_*\), this check is implemented in the R package estimability

require("estimability" )
cbind(epredict(lm1234, test.set), epredict(lm3412, test.set))
  [,1] [,2]
1   14   14
2   NA   NA
3   23   23
4    5    5
5   NA   NA
6   NA   NA

Rows 2, 5, and 6 do not have a unique best linear unbiased estimator of \(\mathbf{x}_*^T \boldsymbol{\beta}\): for those rows \(\mathbf{x}_*\notin C(\mathbf{X}^T)\), so \(\mathbf{x}_*^T \boldsymbol{\beta}\) is not estimable

MVUE: Minimum Variance Unbiased Estimators

  • Gauss-Markov Theorem says that OLS has minimum variance in the class of all Linear Unbiased estimators for \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\) and \(\textsf{Cov}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{I}_n\)

  • Requires just first and second moments

  • Under the additional assumptions of normality and full rank \(\mathbf{X}\), the OLS estimator of \(\boldsymbol{\beta}\) is the MLE and has minimum variance among ALL unbiased estimators (MVUE), not just linear estimators (Section 2.5 in Christensen)

  • the proof requires complete sufficient statistics and the Rao-Blackwell Theorem (covered next semester in STA 732)

  • so OLS is the Best Unbiased Estimator (BUE), not just the BLUE!

What about Nonlinear Estimators?

  • are there nonlinear estimators that are better than OLS under the assumptions \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\) and \(\textsf{Cov}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{I}_n\)?

  • Anderson (1962) showed OLS is not generally the MVUE with \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\) and \(\textsf{Cov}[\boldsymbol{\epsilon}] = \sigma^2 \mathbf{I}_n\)

  • pointed out that linear-plus-quadratic (LPQ) estimators can outperform the OLS estimator for certain error distributions.

  • Other assumptions on \(\textsf{Cov}[\boldsymbol{\epsilon}] = \boldsymbol{\Sigma}\)?

    • Generalized Least Squares (GLS) estimators are BLUE (not necessarily equal to OLS)
  • more recently Hansen (2022) concludes that OLS is BUE over the broader class of linear models with \(\textsf{Cov}[\boldsymbol{\epsilon}]\) finite and \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\)

  • lively ongoing debate! See What Estimators are Unbiased for Linear Models (2023) and references therein

Next Up

  • GLS under assumptions \(\textsf{E}[\boldsymbol{\epsilon}] = \mathbf{0}_n\) and \(\textsf{Cov}[\boldsymbol{\epsilon}] = \boldsymbol{\Sigma}\)

  • Oblique projections and orthogonality with other inner products on \(\mathbb{R}^n\)

  • MLEs in Multivariate Normal setting

  • Gauss-Markov