Introduction to STA721

Merlise Clyde (clyde@duke.edu)

Duke University

Introduction to STA721

  • Course: Theory and Application of linear models from both a frequentist (classical) and Bayesian perspective
  • Prerequisites: linear algebra and a mathematical statistics course covering likelihoods and distribution theory (normal, t, F, chi-square, gamma distributions)
  • Introduce R programming as needed in the lab
  • Introduce Bayesian methods, but assume that you are co-registered in 702 or have taken it previously
  • More info on the course website: https://sta721-F24.github.io/website/
    • schedule, slides, HW, etc.
    • critical dates (Midterms and Finals)
    • office hours
  • Canvas for grades, email, announcements

Please let me know if there are broken links for slides, etc!

Notation

  • scalars are \(a\) (italics or math italics)
  • vectors are in bold lower case, \(\mathbf{a}\), with the exception of random variables
  • all vectors are column vectors \[\mathbf{a}= \left[\begin{array}{c} a_1 \\ a_2 \\ \vdots \\ a_n \end{array} \right] \]
  • \(\mathbf{1}_n\) is an \(n \times 1\) vector of all ones
  • inner product \(\langle \mathbf{a}, \mathbf{a}\rangle = \mathbf{a}^T\mathbf{a}= \|\mathbf{a}\|^2 = \sum_{i=1}^n a_i^2\); \(\langle \mathbf{a}, \mathbf{b}\rangle = \mathbf{a}^T\mathbf{b}\)
  • length or norm of \(\mathbf{a}\) is \(\|\mathbf{a}\|\)
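
As a quick numeric check of this notation, here is a minimal R sketch (the vectors below are made up purely for illustration):

```r
# two small example vectors (made up for illustration)
a <- c(1, 2, 3)
b <- c(4, 0, -1)

sum(a * b)          # inner product <a, b> = a^T b
crossprod(a, b)     # same value, computed as t(a) %*% b (1 x 1 matrix)
sqrt(sum(a^2))      # length/norm ||a|| = sqrt(<a, a>)
```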

Matrices

  • Matrices are represented in bold \(\mathbf{A}= (a_{ij})\) \[\mathbf{A}= \left[\begin{array}{cccc} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & \vdots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nm} \end{array} \right] \]
  • identity matrix: \(\mathbf{I}_n\) is the \(n \times n\) square matrix with diagonal elements 1 and off-diagonal elements 0
  • trace: if \(\mathbf{A}\) is \(n \times m\), then \(\textsf{tr}(\mathbf{A}) = \sum_{i=1}^{\min(n,m)} a_{ii}\)
  • determinant: if \(\mathbf{A}\) is \(n \times n\), then its determinant is \(\det(\mathbf{A})\)
  • inverse: if \(\mathbf{A}\) is nonsingular (\(\det(\mathbf{A}) \neq 0\)), then its inverse is \(\mathbf{A}^{-1}\), satisfying \(\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_n\)
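
A short R sketch of these matrix operations, using a made-up 2 × 2 matrix:

```r
# illustrative 2 x 2 matrix (values are arbitrary)
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)

sum(diag(A))    # trace: sum of the diagonal elements
det(A)          # determinant
solve(A)        # inverse A^{-1}, which exists since det(A) != 0
A %*% solve(A)  # recovers the 2 x 2 identity matrix (up to rounding)
```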

Statistical Models

Ohm’s Law: \(Y\) is voltage across a resistor of \(r\) ohms and \(X\) is the amperes of the current through the resistor (in theory) \[Y = rX\]

  • Simple linear regression for observational data \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \text{ for } i = 1, \ldots, n\]

  • Rewrite in vectors: \[\begin{eqnarray*} \left[ \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right] = & \left[ \begin{array}{c} 1 \\ \vdots \\ 1 \end{array} \right] \beta_0 + \left[ \begin{array}{c} x_1 \\ \vdots \\ x_n \end{array} \right] \beta_1 + \left[ \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right] = & \left[ \begin{array}{cc} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n\end{array} \right] \left[ \begin{array}{c} \beta_0 \\ \beta_1 \end{array} \right] + \left[ \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right] \\ \\ \mathbf{Y}= & \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon} \end{eqnarray*}\]
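
A minimal R sketch of fitting this simple linear regression to simulated data (the intercept, slope, and noise level below are made-up values for illustration):

```r
set.seed(42)
n <- 100
x <- runif(n, 0, 10)                  # observed predictor values
y <- 2 + 0.5 * x + rnorm(n, sd = 1)   # Y_i = beta_0 + beta_1 x_i + eps_i

fit <- lm(y ~ x)        # ordinary least squares fit
coef(fit)               # estimates of (beta_0, beta_1)
head(model.matrix(fit)) # first rows of the n x 2 design matrix [1_n  x]
```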

Nonlinear Models

Gravitational Law: \(F = \alpha/d^\beta\) where \(d\) is distance between 2 objects and \(F\) is the force of gravity between them

  • log transformations \[\log(F) = \log(\alpha) - \beta \log(d)\]

  • compare to noisy experimental data \(Y_i =\log(F_i)\) observed at \(x_i = \log(d_i)\)

  • write \(\mathbf{X}= [\mathbf{1}_n \, \mathbf{x}]\)

  • \(\boldsymbol{\beta}= (\log(\alpha), -\beta)^T\)

  • model with additive error on log scale \(\mathbf{Y}= \mathbf{X}\boldsymbol{\beta}+ \mathbf{e}\)

  • test if \(\beta = 2\)

  • error assumptions?
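
A hedged R sketch of this log-log analysis, assuming multiplicative noise on the original scale (so the error is additive on the log scale); the constants and noise level are made up:

```r
set.seed(1)
n     <- 50
alpha <- 3                                       # made-up constant
d     <- runif(n, 1, 10)                         # distances
F_obs <- (alpha / d^2) * exp(rnorm(n, sd = 0.1)) # true beta = 2, multiplicative noise

y   <- log(F_obs)                                # Y_i = log(F_i)
x   <- log(d)                                    # x_i = log(d_i)
fit <- lm(y ~ x)                                 # slope estimates -beta

# test H0: beta = 2, i.e. slope = -2, with a t statistic
b      <- coef(fit)["x"]
se     <- summary(fit)$coefficients["x", "Std. Error"]
t_stat <- (b - (-2)) / se
2 * pt(-abs(t_stat), df = fit$df.residual)       # two-sided p-value
```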

Intrinsically Nonlinear Models

Regression function may be an intrinsically nonlinear function of \(t_i\) (time) and parameters \(\boldsymbol{\theta}\) \[Y_i = f(t_i, \boldsymbol{\theta}) + \epsilon_i\]

Quadratic Linear Regression

Taylor’s Theorem: \[f(t_i, \boldsymbol{\theta}) = f(t_0, \boldsymbol{\theta}) + (t_i - t_0) f'(t_0, \boldsymbol{\theta}) + (t_i - t_0)^2 \frac{f^{''}(t_0, \boldsymbol{\theta})}{2} + R(t_i, \boldsymbol{\theta})\]

\[Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \text{ for } i = 1, \ldots, n\]

Rewrite in vectors: \[\begin{eqnarray*} \left[ \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right] = & \left[ \begin{array}{ccc} 1 & x_1 & x_1^2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & x_n^2\end{array} \right] \left[ \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \end{array} \right] + \left[ \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right] \\ & \\ \mathbf{Y}= & \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon} \end{eqnarray*}\]

Quadratic in \(x\), but linear in \(\beta\)’s - how do we know this model is adequate?
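
A minimal R sketch of the quadratic fit on simulated data (coefficients below are made up), with two informal adequacy checks: an F test against the straight-line fit and a residual plot:

```r
set.seed(7)
n <- 80
x <- seq(-2, 2, length.out = n)
y <- 1 + 0.5 * x - 1.5 * x^2 + rnorm(n, sd = 0.3)  # made-up quadratic truth

fit_quad <- lm(y ~ x + I(x^2))   # quadratic in x, but linear in the betas
fit_lin  <- lm(y ~ x)

anova(fit_lin, fit_quad)                  # does the quadratic term improve the fit?
plot(fitted(fit_quad), resid(fit_quad))   # residual plot as an informal adequacy check
```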

Kernel Regression (NonParametric)

\[y_i = \beta_0 + \sum_{j = 1}^J \beta_j e^{-\lambda (x_i - k_j)^d} + \epsilon_i \text{ for } i = 1, \ldots, n\] where \(k_j\) are kernel locations and \(\lambda\) is a smoothing parameter \[\begin{eqnarray*} \left[ \begin{array}{c} y_1 \\ \vdots \\ y_n \end{array} \right] = & \left[ \begin{array}{cccc} 1 & e^{-\lambda (x_1 - k_1)^d} & \ldots & e^{-\lambda (x_1 - k_J)^d} \\ \vdots & \vdots & & \vdots \\ 1 & e^{-\lambda (x_n - k_1)^d} & \ldots & e^{-\lambda (x_n - k_J)^d} \end{array} \right] \left[ \begin{array}{c} \beta_0 \\ \beta_1 \\\vdots \\ \beta_J \end{array} \right] + \left[ \begin{array}{c} \epsilon_1 \\ \vdots \\ \epsilon_n \end{array} \right] \\ & \\ \mathbf{Y}= & \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon} \end{eqnarray*}\]

  • Linear in \(\beta\) given \(\lambda\) and \(k_1, \ldots k_J\)

  • Learn \(\lambda\), \(k_1, \ldots k_J\) and \(J\)
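
A minimal R sketch of this kernel regression with Gaussian kernels (\(d = 2\)), holding \(\lambda\) and the kernel locations fixed rather than learning them; all values below are made up:

```r
set.seed(3)
n <- 100
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)   # made-up smooth signal plus noise

J      <- 10
knots  <- seq(0, 1, length.out = J)         # kernel locations k_1, ..., k_J
lambda <- 50                                # smoothing parameter (fixed here, not learned)

# n x (J + 1) design matrix: intercept plus Gaussian kernel basis functions
X <- cbind(1, exp(-lambda * outer(x, knots, "-")^2))

beta_hat <- qr.solve(X, y)                  # least squares: linear in beta given lambda, knots
y_hat    <- X %*% beta_hat                  # fitted curve
```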

Hierarchical Models

  • each line represents an individual’s sample trajectory
  • correlation between an individual’s measurements
  • similarities within groups
  • differences among groups?
  • fit a separate regression for each individual?
  • add more structure?
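
One way to add that structure is a mixed-effects (hierarchical) model with individual-specific intercepts and slopes; a minimal sketch using the lme4 package on simulated trajectories (all names and values below are hypothetical):

```r
library(lme4)
set.seed(11)

# simulate hypothetical trajectories: 20 individuals, 6 time points each
n_id <- 20; n_t <- 6
dat  <- expand.grid(id = factor(1:n_id), time = 0:(n_t - 1))
b0   <- rnorm(n_id, mean = 10, sd = 2)    # individual-specific intercepts
b1   <- rnorm(n_id, mean = 1,  sd = 0.5)  # individual-specific slopes
dat$y <- b0[dat$id] + b1[dat$id] * dat$time + rnorm(nrow(dat), sd = 1)

# random intercept and slope per individual: separate trajectories,
# but with information shared (pooled) across individuals
fit <- lmer(y ~ time + (1 + time | id), data = dat)
summary(fit)
```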

Linear Regression Models

Response \(Y_i\) and \(p\) predictors \(x_{i1}, x_{i2}, \dots, x_{ip}\) \[Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_{p} x_{ip} + \epsilon_i\]

  • Design matrix \[\mathbf{X}= \left[\begin{array}{cccc} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \\ \end{array} \right] = \left[ \begin{array}{cc} 1 & \mathbf{x}_1^T \\ \vdots & \vdots \\ 1 & \mathbf{x}_n^T \end{array} \right] = \left[\begin{array}{ccccc} \mathbf{1}_n & \mathbf{X}_1 & \mathbf{X}_2 & \cdots & \mathbf{X}_p \end{array} \right] \]

  • matrix version \[\mathbf{Y}= \mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\] what should go into \(\mathbf{X}\) and do we need all columns of \(\mathbf{X}\) for inference about \(\mathbf{Y}\)?
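
A short R sketch of how a design matrix is built in practice with model.matrix(), including how a categorical predictor expands into several columns (the data frame below is hypothetical):

```r
# hypothetical data: one numeric predictor and one factor
dat <- data.frame(
  y     = rnorm(6),
  x1    = c(1.2, 0.5, 3.1, 2.2, 0.9, 1.7),
  group = factor(c("a", "a", "b", "b", "c", "c"))
)

# model.matrix builds the n x p design matrix X: an intercept column 1_n,
# the numeric column, and dummy (indicator) columns for the factor levels
X <- model.matrix(y ~ x1 + group, data = dat)
X
```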

Linear Model

  • \(\mathbf{Y}= \mathbf{X}\, \boldsymbol{\beta}+ \boldsymbol{\epsilon}\)
  • \(\mathbf{Y}\) (\(n \times 1\)) vector of random responses (we observe \(\mathbf{y}\)); \(\mathbf{Y}, \mathbf{y}\in \mathbb{R}^n\)
  • \(\mathbf{X}\) (\(n \times p\)) design matrix (observed)
  • \(\boldsymbol{\beta}\) (\(p \times 1\)) vector of coefficients (unknown)
  • \(\boldsymbol{\epsilon}\) (\(n \times 1\)) vector of “errors” (unobservable)

Goals:

  • What goes into \(\mathbf{X}\)? (model building, model selection - post-selection inference?)
  • What if multiple models are “good”? (model averaging or ensembles)
  • What about the future? (Prediction)
  • Uncertainty Quantification - assumptions about \(\boldsymbol{\epsilon}\)

All models are wrong, but some may be useful (George Box)

Ordinary Least Squares

Goal: Find the best fitting “line” or “hyper-plane” that minimizes \[\sum_i (Y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = (\mathbf{Y}- \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y}- \mathbf{X}\boldsymbol{\beta}) = \| \mathbf{Y}- \mathbf{X}\boldsymbol{\beta}\|^2 \]

  • Optimization problem - seek \(\boldsymbol{\beta}\) such that \(\mathbf{X}\boldsymbol{\beta}\) is close to \(\mathbf{Y}\) in squared error
  • May over-fit \(\Rightarrow\) add other criteria that provide a penalty (Penalized Least Squares)
  • Robustness to extreme points \(\Rightarrow\) replace quadratic loss with other loss functions
  • no notion of uncertainty of estimates
  • no structure of problem (repeated measures on individual, randomization restrictions, etc)
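
As a concrete illustration of the least-squares criterion above, here is a minimal R sketch on simulated data (design and coefficients are made up) that solves the normal equations \((\mathbf{X}^T\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y}\) directly and compares with a QR-based solve and lm():

```r
set.seed(5)
n <- 100
X <- cbind(1, runif(n), runif(n))        # made-up design: intercept + two predictors
beta_true <- c(1, 2, -1)
y <- drop(X %*% beta_true) + rnorm(n)

# OLS via the normal equations (X^T X) beta = X^T y
beta_ne <- drop(solve(crossprod(X), crossprod(X, y)))

# numerically preferable: QR-based least squares (what lm() uses internally)
beta_qr <- qr.solve(X, y)

cbind(normal_eq = beta_ne, qr = beta_qr, lm = coef(lm(y ~ X - 1)))
```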

Need Distribution Assumptions of \(\mathbf{Y}\) (or \(\boldsymbol{\epsilon}\)) for testing and uncertainty measures \(\Rightarrow\) Likelihood and Bayesian inference

Random Vectors

  • Let \(Y_1, \ldots, Y_n\) be random variables in \(\mathbb{R}\). Then \[\mathbf{Y}\equiv \left[ \begin{array}{c} Y_1 \\ \vdots \\ Y_n \end{array} \right]\] is a random vector in \(\mathbb{R}^n\)

  • Expectations of random vectors are defined element-wise: \[\textsf{E}[\mathbf{Y}] \equiv \textsf{E}\left[ \begin{array}{c} Y_1 \\ \vdots \\ Y_n \end{array} \right] \equiv \left[ \begin{array}{c} \textsf{E}[Y_1] \\ \vdots \\ \textsf{E}[Y_n] \end{array} \right] = \left[ \begin{array}{c} \mu_1 \\ \vdots \\ \mu_n \end{array} \right] \equiv \boldsymbol{\mu}\in \mathbb{R}^n \] where mean or expected value \(\textsf{E}[Y_i] = \mu_i\)
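
A tiny Monte Carlo check in R that the expectation of a random vector is just the vector of elementwise means (the mean vector below is made up):

```r
set.seed(2)
mu <- c(1, -2, 5)      # made-up mean vector
R  <- 10000            # number of Monte Carlo replicates

# each row of Y is one realization of the random vector (independent normal components)
Y <- matrix(rnorm(R * 3, mean = rep(mu, each = R)), nrow = R)

colMeans(Y)            # elementwise sample means, approximately equal to mu
```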

Model Space

We will work with inner product spaces: a vector space, say \(\mathbb{R}^n\), equipped with an inner product \(\langle \mathbf{x},\mathbf{y}\rangle \equiv \mathbf{x}^T\mathbf{y}, \quad \mathbf{x}, \mathbf{y}\in \mathbb{R}^n\)

Definition: Subspace
A set \(\boldsymbol{{\cal M}}\) is a subspace of \(\mathbb{R}^n\) if it is a subset of \(\mathbb{R}^n\) and also a vector space.

That is, if \(\mathbf{x}_1 \in \boldsymbol{{\cal M}}\) and \(\mathbf{x}_2 \in \boldsymbol{{\cal M}}\), then \(b_1\mathbf{x}_1 + b_2 \mathbf{x}_2 \in \boldsymbol{{\cal M}}\) for all \(b_1, b_2 \in \mathbb{R}\)

Definition: Column Space
The column space of \(\mathbf{X}\) is \(C(\mathbf{X}) = \{\mathbf{X}\boldsymbol{\beta}: \boldsymbol{\beta}\in \mathbb{R}^p\}\)

If \(\mathbf{X}\) is full column rank, then the columns of \(\mathbf{X}\) form a basis for \(C(\mathbf{X})\) and \(C(\mathbf{X})\) is a \(p\)-dimensional subspace of \(\mathbb{R}^n\)

If we have just a single model matrix \(\mathbf{X}\), then the subspace \(\boldsymbol{{\cal M}}\) is the model space.
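
A short R sketch connecting these definitions to computation: the rank of \(\mathbf{X}\) gives the dimension of \(C(\mathbf{X})\), and fitted values from least squares lie in that model space (the data below are simulated for illustration):

```r
set.seed(9)
n <- 20
x <- runif(n)

X_deficient <- cbind(1, x, 2 * x)   # third column is a linear combination of the others
qr(X_deficient)$rank                # rank 2: C(X) is only a 2-dimensional subspace of R^n

X_full <- cbind(1, x, x^2)          # full column rank
qr(X_full)$rank                     # rank 3: columns form a basis for a 3-dim model space

# fitted values are the projection of y onto C(X_full), so they lie in the model space:
y     <- rnorm(n)
y_hat <- fitted(lm(y ~ x + I(x^2)))
max(abs(resid(lm(y_hat ~ X_full - 1))))  # ~ 0: y_hat equals X_full %*% b for some b
```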

Philosophy

  • for many problems frequentist and Bayesian methods will give similar answers (more a matter of taste in interpretation)
    • For small problems, Bayesian methods allow us to incorporate prior information, which can provide better-calibrated answers
    • for problems with complex designs and/or missing data Bayesian methods are often easier to implement (do not need to rely on asymptotics)
  • For problems involving hypothesis testing or model selection frequentist and Bayesian methods can be strikingly different.
  • Frequentist methods are often faster (particularly with “big data”), so they are great for exploratory analysis and for building a “data-sense”
  • Bayesian methods sit on top of the frequentist likelihood
  • Geometric perspective is important in both!

Important to understand advantages and problems of each perspective!