Residuals and Diagnostics

STA 721: Lecture 21

Merlise Clyde

Duke University

Linear Model Assumptions

Linear Model: \[ \mathbf{Y}= \boldsymbol{\mu}+ \boldsymbol{\epsilon}\]
Assumptions:
\[\begin{eqnarray*} \boldsymbol{\mu}\in C(\mathbf{X}) & \Leftrightarrow & \boldsymbol{\mu}= \mathbf{X}\boldsymbol{\beta}\\ \boldsymbol{\epsilon}& \sim & \textsf{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_n) \end{eqnarray*}\]

Focus on

  • Wrong mean for a case or cases
  • Cases that influence the estimates of the mean
  • Wrong distribution for \(\boldsymbol{\epsilon}\)

If \(\mu_i \neq \mathbf{x}_i^T\boldsymbol{\beta}\), then the expected value of the residual \(e_i = Y_i -\hat{Y}_i\) is not zero

Standardized residuals

  • Standardized residuals \(e_i/\sqrt{\sigma^2(1 - h_{ii})}\)

  • \(h_{ii}\), the \(i\)th diagonal element of the hat matrix \(\mathbf{H}= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\), is the leverage of case \(i\)

  • Under correct model standardized residuals have mean 0 and scale 1

  • plug in the usual unbiased estimate of \(\sigma^2\) \[r_i = e_i/\sqrt{\hat{\sigma}^2(1 - h_{ii})} \]

  • if \(h_{ii}\) is close to 1, then \(\hat{Y}_i\) is close to \(Y_i\) (why?!?) so \(e_i\) is approximately 0

  • \(\textsf{var}(e_i) = \sigma^2(1 - h_{ii})\) is also almost 0 as \(h_{ii} \to 1\), so \(e_i \to 0\) with probability 1

  • if \(h_{ii} \approx 1\), \(r_i\) may not flag “outliers”

  • even if \(h_{ii}\) is not close to 1, the distribution of \(r_i\) is not a \(t\) (hard to judge if large \(|r_i|\) is unusual)
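
As a quick numerical check, here is a minimal sketch in R using the built-in stackloss data (which reappears later in this lecture); rstandard() is base R's function for these quantities:

fit <- lm(stack.loss ~ ., data = stackloss)
e <- residuals(fit)
h <- hatvalues(fit)                   # leverages h_ii
sigma2.hat <- summary(fit)$sigma^2    # usual unbiased estimate of sigma^2
r <- e / sqrt(sigma2.hat * (1 - h))   # standardized residuals r_i
all.equal(r, rstandard(fit))          # agrees with R's built-in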

Outlier Test for Mean Shift

Test \(H_0\): \(\mu_i = \mathbf{x}_i^T \boldsymbol{\beta}\) versus \(H_a\): \(\mu_i = \mathbf{x}_i^T\boldsymbol{\beta}+ \alpha_i\)

  • t-test for testing \(H_0\): \(\alpha_i = 0\) has \(n - p - 1\) degrees of freedom (sketch after this list)
  • if the p-value is small, declare the \(i\)th case an outlier: \(\textsf{E}[\mathbf{Y}]\) is not \(\mathbf{X}\boldsymbol{\beta}\) but \(\mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\delta}_i \alpha_i\), where \(\boldsymbol{\delta}_i\) is the indicator vector for case \(i\)
  • Can extend to include multiple indicators \(\boldsymbol{\delta}_i\) and \(\boldsymbol{\delta}_j\) to test that cases \(i\) and \(j\) are both outliers
  • Extreme case: \(\boldsymbol{\mu}= \mathbf{X}\boldsymbol{\beta}+ \mathbf{I}_n \boldsymbol{\alpha}\), where every point has its own mean!
  • Need to control for multiple testing without prior reason to expect a case to be an outlier (or use a Bayesian approach)
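
A minimal sketch of the single-case test in R; case 21 of the stackloss data is chosen purely for illustration, matching the example later in the lecture:

i <- 21
delta <- as.numeric(seq_len(nrow(stackloss)) == i)  # indicator for case i
fit.shift <- lm(stack.loss ~ . + delta, data = stackloss)
summary(fit.shift)$coefficients["delta", ]  # t-test of alpha_i = 0 with n - p - 1 df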

Predicted Residuals

Estimates without Case (i): \[\begin{eqnarray*} \hat{\boldsymbol{\beta}}_{(i)} & = & (\mathbf{X}_{(i)}^T\mathbf{X}_{(i)})^{-1 }\mathbf{X}_{(i)}^T \mathbf{Y}_{(i)} \\ & = & \hat{\boldsymbol{\beta}}- \frac{ (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i e_i}{ 1 - h_{ii}} \end{eqnarray*}\]

  • Predicted residual \(e_{(i)} = y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_{(i)} = \frac{e_i}{1 - h_{ii}}\)

  • variance \(\textsf{var}(e_{(i)}) = \frac{\sigma^2}{1 - h_{ii}}\)

  • standardized predicted residual is \[\frac{e_{(i)}}{\sqrt{\textsf{var}(e_{(i)})}} = \frac{e_i/(1 - h_{ii})}{\sigma/\sqrt{1 - h_{ii}}} = \frac{e_i}{\sigma \sqrt{1 - h_{ii}} } \]
    these are the same as the standardized residuals!
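
A minimal sketch in R verifying the leave-one-out identity for the predicted residual, refitting once (for case 1) as a check:

fit <- lm(stack.loss ~ ., data = stackloss)
e.loo <- residuals(fit) / (1 - hatvalues(fit))             # predicted residuals e_(i)
fit.m1 <- lm(stack.loss ~ ., data = stackloss[-1, ])       # actually drop case 1
stackloss$stack.loss[1] - predict(fit.m1, stackloss[1, ])  # same value as e.loo[1]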

How to Calculate \(\hat{\boldsymbol{\beta}}_{(i)}\)

How do we calculate \(\hat{\boldsymbol{\beta}}_{(i)}\), the estimate with case \(i\) deleted, without refitting the model \(n\) times?

  • Note: \(\mathbf{X}^T\mathbf{X}= \mathbf{X}_{(i)}^T\mathbf{X}_{(i)} + \mathbf{x}_i\mathbf{x}_i^T\) rearrange to get \(\mathbf{X}_{(i)}^T\mathbf{X}_{(i)} = \mathbf{X}^T\mathbf{X}- \mathbf{x}_i\mathbf{x}_i^T\)
  • Special Case of Binomial Inverse Theorem or Woodbury Theorem: (Thm B.56 in Christensen) \[(\mathbf{A}+ \mathbf{u}\mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u}\mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}}\] with \(\mathbf{A}= \mathbf{X}^T\mathbf{X}\) and \(\mathbf{u}= -\mathbf{x}_i\) and \(\mathbf{v}= \mathbf{x}_i\)

\[(\mathbf{X}_{(i)}^T\mathbf{X}_{(i)})^{-1} = (\mathbf{X}^T\mathbf{X}- \mathbf{x}_i\mathbf{x}_i^T)^{-1} = (\mathbf{X}^T\mathbf{X})^{-1} + \frac{(\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1}}{1 - \mathbf{x}_i^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_i}\]

  • use \(\mathbf{X}_{(i)}^T\mathbf{Y}_{(i)} = \mathbf{X}^T\mathbf{Y}- \mathbf{x}_i y_i\) to get \(\hat{\boldsymbol{\beta}}_{(i)}\) and other quantities (sketch below)
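
A minimal sketch of this downdate in R, checked against an actual refit (case 21, for illustration):

fit <- lm(stack.loss ~ ., data = stackloss)
X <- model.matrix(fit)
y <- stackloss$stack.loss
XtXinv <- solve(crossprod(X))    # (X^T X)^{-1}
i <- 21
xi <- X[i, ]
hi <- sum(xi * (XtXinv %*% xi))  # leverage h_ii
XtXinv.i <- XtXinv + (XtXinv %*% xi) %*% (xi %*% XtXinv) / (1 - hi)
beta.i <- XtXinv.i %*% (crossprod(X, y) - xi * y[i])  # beta.hat_(i)
all.equal(drop(beta.i),
          coef(lm(stack.loss ~ ., data = stackloss, subset = -i)),
          check.attributes = FALSE)  # agrees with the refit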

External estimate of \(\sigma^2\)

Estimate \({\hat{\sigma}}^2_{(i)}\) using data with case \(i\) deleted
\[\begin{eqnarray*} \textsf{SSE}_{(i)} & = & \textsf{SSE}- \frac{e_i^2}{1 - h_{ii}} \\ {\hat{\sigma}}^2_{(i)} = \textsf{MSE}_{(i)} & = & \frac{\textsf{SSE}_{(i)}}{n - p - 1} \\ \end{eqnarray*}\]

  • Externally Standardized residuals
    \[t_i = \frac{e_{(i)}}{\sqrt{{\hat{\sigma}}^2_{(i)}/(1 - h_{ii})}} = \frac{y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_{(i)}} {\sqrt{{\hat{\sigma}}^2_{(i)}/(1 - h_{ii})}} = r_i \left( \frac{ n - p - 1}{n - p - r_i^2}\right)^{1/2}\]

  • May still miss extreme points with high leverage, but will pick up unusual \(y_i\)’s
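
A sketch of the \(\textsf{SSE}\) downdate in R; base R's influence() returns the same leave-one-out estimates \(\hat{\sigma}_{(i)}\):

fit <- lm(stack.loss ~ ., data = stackloss)
e <- residuals(fit)
h <- hatvalues(fit)
n <- nrow(stackloss)
p <- length(coef(fit))
sigma2.i <- (sum(e^2) - e^2 / (1 - h)) / (n - p - 1)  # MSE_(i) for every case at once
all.equal(sigma2.i, influence(fit)$sigma^2)           # agrees with base R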

Externally Studentized Residual

  • Externally studentized residuals have a \(t\) distribution with \(n - p - 1\) degrees of freedom: \[t_i = \frac{e_{(i)}}{\sqrt{{\hat{\sigma}}^2_{(i)}/(1 - h_{ii})}} = \frac{y_i - \mathbf{x}_i^T \hat{\boldsymbol{\beta}}_{(i)}} {\sqrt{{\hat{\sigma}}^2_{(i)}/(1 - h_{ii})}} \sim \textsf{St}(n - p - 1)\] under the hypothesis that the \(i\)th case is not an “outlier”.

  • This externally studentized residual is equivalent to the \(t\)-statistic for testing that \(\alpha_i\) is zero! (HW)
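
In R, rstudent() computes these directly; a minimal sketch checking the monotone relationship to \(r_i\) from the previous slide:

fit <- lm(stack.loss ~ ., data = stackloss)
r <- rstandard(fit)  # internally standardized residuals
n <- nrow(stackloss)
p <- length(coef(fit))
t.i <- r * sqrt((n - p - 1) / (n - p - r^2))
all.equal(t.i, rstudent(fit))  # externally studentized residuals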

Multiple Testing

  • without prior reason to suspect an outlier, usually look at the maximum of the \(|t_i|\)’s
  • is the \(\max |t_i|\) larger than expected under the null of no outliers?
  • Need distribution of the max of Student \(t\) random variables (simulation?)
  • a conservative approach is the Bonferroni Correction: For \(n\) tests of size \(\alpha\) the probability of falsely labeling at least one case as an outlier is no greater than \(n \alpha\); e.g. with 21 cases and \(\alpha = 0.05\), the probability is no greater than 1.05!
  • adjust \(\alpha^* = \alpha/n\) so that the probability of falsely labeling at least one point an outlier is \(\alpha\)
  • with 21 cases and \(\alpha = 0.05\), \(\alpha/n = 0.00238\), so use \(\alpha^* = 0.0024\) for each test (sketch below)
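
A minimal sketch of the Bonferroni-corrected outlier test in R:

fit <- lm(stack.loss ~ ., data = stackloss)
t.i <- rstudent(fit)
n <- nrow(stackloss)
p <- length(coef(fit))
pval <- 2 * pt(-abs(t.i), df = n - p - 1)  # two-sided p-values
which.max(abs(t.i))   # case 21 is the most extreme here
min(pval) < 0.05 / n  # FALSE: 0.0042 > 0.0024, so no case is declared an outlier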

Influence - Cook’s Distance

  • Cook’s Distance measures how much the predictions change when the \(i\)th case is deleted:
    \[\begin{eqnarray*} D_i & = &\frac{\| \hat{\mathbf{Y}}_{(i)} - \hat{\mathbf{Y}} \|^2}{ p {\hat{\sigma}}^2} = \frac{ (\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})^T \mathbf{X}^T\mathbf{X}(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}) }{ p {\hat{\sigma}}^2} \\ & = & \frac{r_i^2}{p} \frac{h_{ii}}{ 1 - h_{ii}} \end{eqnarray*}\]

  • Flag cases where \(D_i > 1\) or large relative to other cases

  • Influential Cases are those with extreme leverage or large \(r_i^2\)
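
In R, cooks.distance() computes \(D_i\); a minimal sketch using the identity above:

fit <- lm(stack.loss ~ ., data = stackloss)
r <- rstandard(fit)
h <- hatvalues(fit)
p <- length(coef(fit))
D <- (r^2 / p) * h / (1 - h)
all.equal(D, cooks.distance(fit))  # agrees with base R
which(D > 1)  # none here; case 21 has the largest, about 0.69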

Stackloss Data

Case 21

  • Leverage \(0.285\) (compare to \(p/n = .19\) )
  • p-value for \(t_{21}\) is \(0.0042\)
  • Bonferroni-adjusted level is \(\alpha^* = 0.0024\), and \(0.0042 > 0.0024\) (not really an outlier?)
  • Cook’s Distance is \(0.69\)
  • Other points? Masking?
  • Refit without Case 21 and compare results (sketch below)

Other analyses have suggested that cases (1, 2, 3, 4, 21) are outliers

  • look at MC3.REG or BAS or robust regression
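
A minimal sketch of that comparison in R, dropping case 21 alone and then the full flagged set:

fit <- lm(stack.loss ~ ., data = stackloss)
fit.drop21 <- update(fit, subset = -21)
fit.drop5 <- update(fit, subset = -c(1, 2, 3, 4, 21))
round(cbind(full = coef(fit),
            drop21 = coef(fit.drop21),
            drop5 = coef(fit.drop5)), 3)  # how much do the estimates move?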

Bayesian Outlier Detection

Chaloner & Brant (1988) “A Bayesian approach to outlier detection and residual analysis”

  • provides an approach to identifying outliers or surprising observations by looking at the probability that the error for a case is more than \(k\) standard deviations above or below zero: \[P(|\epsilon_i| > k \sigma \mid \mathbf{Y})\]

  • Cases with a high probability (absolute fixed value of \(k\) or relative to a multiplicity correction to determine \(k\)) are then investigated.

  • find posterior distribution of \(\epsilon_i\) given the data and model

  • Chaloner and Brant use a reference prior for the analysis \(p(\boldsymbol{\beta}, \phi) \propto 1/\phi\)

  • no closed-form solution for the probability, but it can be approximated by MCMC or a one-dimensional integral! See ?BAS::Bayes.outlier

Stackloss Data

library(BAS)
stack.lm <- lm(stack.loss ~ ., data = stackloss)
# posterior probability that |epsilon_i| > k sigma for each case
stack.outliers <- BAS::Bayes.outlier(stack.lm, k = 3)
plot(stack.outliers$prob.outlier, 
     type = "h", 
     ylab = "Posterior Probability")

Stackloss Data

Adjust the prior probability for multiple testing, with sample size 21 and prior probability of no outliers in the sample equal to 0.95:

stack.outliers <- BAS::Bayes.outlier(stack.lm, prior.prob = 0.95)

Cases where the posterior probability exceeds the prior probability:

which(stack.outliers$prob.outlier > 
      stack.outliers$prior.prob)
[1]  4 21
plot(stack.loss ~ Air.Flow, 
  data = stackloss, pch = 19,
  # color flagged cases red, the rest blue
  col = ifelse(
        stack.outliers$prob.outlier > 
        stack.outliers$prior.prob, 
        "red", "blue")
)

To Remove or Not Remove?

  • For suspicious cases, check data sources for errors
  • Check that points are not outliers/influential because of a wrong mean function or wrong distributional assumptions (transformations)
  • Investigate need for transformations (use EDA at several stages)
  • Influential cases - report results with and without cases (results may change - are differences meaningful?)
  • Outlier test suggests an alternative population for the case(s); if kept in the analysis, they will inflate \({\hat{\sigma}}^2\) and interval estimates
  • Document how you handle any case deletions - reproducibility!
  • If lots of outliers - consider throwing out the model rather than data
  • Alternative: model averaging over outlier models
  • Robust Regression Methods - M-estimation, L-estimation, S-estimation, MM-estimation, etc. or Bayes with heavy tails