Introduction
The main objective of many statistical investigations is to make predictions.
For instance, an engineer may wish to predict the amount of oxide that will form on the surface of a metal baked in an oven for one hour at 200$^{\circ}$C, or the amount of deformation of a ring subjected to a certain compressive force.
Usually, such predictions require a formula that relates the dependent variable whose value we want to predict (also called the response) to one or more other variables, usually called predictors (or regressors).
The collection of statistical tools for this is called regression analysis.
Simple Linear Regression
The simplest form of regression analysis is simple linear regression, which involves a single predictor and a single response variable. The model is
$$ Y = \beta_0 + \beta_1 X + e $$
where:
- $Y$ is the response variable.
- $X$ is the predictor variable.
- $\beta_0$ is the intercept.
- $\beta_1$ is the slope.
Assume that $\mathbb{E}(e) = 0$ and $\text{Var}(e) = \sigma^2$.
Suppose we fix $X = x$, meaning
$$ Y = \beta_0 + \beta_1 x + e, $$
with mean $\mathbb{E}(Y) = \beta_0 + \beta_1 x$ and variance $\text{Var}(Y) = \sigma^2$ ($Y$ is normal if $e$ is normal).
The linear function $\beta_0 + \beta_1 x$ is thus the function giving the mean value of $Y$ for each possible value of $X$.
It is called the regression function (or regression line) and will be denoted as, $$ \mu_{Y | X = x} = \mathbb{E}(Y | X = x) = \beta_0 + \beta_1 x. $$
The slope $\beta_1$ is the change in the mean of $Y$ for a one-unit change in $X$; the intercept $\beta_0$ is the mean value of $Y$ when $X = 0$.
Least Squares Estimator
Gauss proposed estimating $\beta_0$ and $\beta_1$ by minimizing the sum of the squares of the vertical deviations between the observed responses and the fitted line.
These deviations are called the residuals.
Figure 1: The residuals are the vertical distances between the data points and the fitted line.
For any “candidate” straight line $Y = a + bX$, we can measure its fit by the sum of squared deviations $R(a, b) = \sum_{i = 1}^n (Y_i - (a + bX_i))^2$.
The minimizers are $(\hat{\beta_0}, \hat{\beta_1})$, with
$$ \hat{\beta_1}= \frac{\sum_i X_i Y_i - \frac{\left(\sum_i X_i\right) \left(\sum_i Y_i\right)}{n}}{\sum_i X_i^2 - \frac{\left(\sum_i X_i\right)^2}{n}} $$
and
$$ \hat{\beta_0}= \bar{Y} - \hat{\beta}_1 \bar{X}, $$
where $\bar{X} = \frac{1}{n} \sum_i X_i$ and $\bar{Y} = \frac{1}{n} \sum_i Y_i$.
We can clean this up by defining the following quantities:
$$ \begin{align*} S_{XX} = \sum_i (X_i - \bar{X})^2 \left(= \sum_i X_i^2 - \frac{\left(\sum_i X_i\right)^2}{n}\right) \newline S_{XY} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y}) \left(= \sum_i X_i Y_i - \frac{\left(\sum_i X_i\right) \left(\sum_i Y_i\right)}{n}\right) \newline \end{align*} $$
Thus, we have,
$$ \hat{\beta_1}= \frac{S_{XY}}{S_{XX}} $$
and
$$ \hat{\beta_0}= \bar{Y} - \hat{\beta}_1 \bar{X}. $$
The variance $\sigma^2$ of the error term $e = Y - (\beta_0 + \beta_1 X)$ is another unknown parameter.
The residuals,
$$ \hat{e_i} = y_i - (\hat{\beta_0} + \hat{\beta_1} x_i) = y_i - \hat{y}_i, \quad i = 1, \ldots, n, $$
can be regarded as a ‘sample’ of the unobservable errors $e_i$.
First, it can be checked that,
$$ \bar{e} = \frac{1}{n} \sum_i \hat{e}_i = \bar{y} - \hat{\beta}_0 - \hat{\beta}_1 \bar{x} = 0. $$
(by definition of the estimated coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$).
Therefore, an unbiased estimate of $\sigma^2$ is,
$$ s^2 = \frac{1}{n - 2} \sum_i \hat{e}_i^2. $$
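As a quick illustration, here is a minimal R sketch on simulated data (all object names below are hypothetical) that computes $\hat{\beta}_0$, $\hat{\beta}_1$ and $s^2$ from the formulas above and compares them with the output of `lm()`.
# simulate a small data set (hypothetical example)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
# least squares estimates from the formulas above
sxx<-sum((x-mean(x))^2)
sxy<-sum((x-mean(x))*(y-mean(y)))
b1<-sxy/sxx
b0<-mean(y)-b1*mean(x)
# residuals and the unbiased estimate of sigma^2
e<-y-(b0+b1*x)
s2<-sum(e^2)/(n-2)
c(b0,b1,s2)
# same fit with lm(): coef(fit) matches (b0, b1) and sigma(fit)^2 matches s2
fit<-lm(y~x)
coef(fit)
sigma(fit)^2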
Inferences
From now on, we will assume that the values of the predictors are deterministic (or that we condition on $X$).
Assuming normal errors, we note that $Y_i | (X_i = x_i) \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.
Then, because $\sum_i (x_i - \bar{x}) = 0$, we can write,
$$ \hat{\beta_1} = \frac{S_{xY}}{s_{xx}} = \sum_i \frac{x_i - \bar{x}}{s_{xx}} Y_i. $$
The sampling distribution of $\hat{\beta_1}$ is normal with mean $\beta_1$ and variance $\sigma^2 / s_{xx}$.
Now we can write, $$ \hat{\beta_0} = \sum_i \frac{Y_i}{n} - \hat{\beta_1} \bar{x}, $$
which is again a linear combination of normal random variables (the $Y_i$‘s and $\hat{\beta}_1$).
Thus, the estimator of $\beta_0$ is also normally distributed.
The sampling distribution of $\hat{\beta_0}$ is normal with mean $\beta_0$ and variance $\sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{s_{xx}}\right)$.
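As a short sketch (again on simulated data, with hypothetical object names), the standard errors $s/\sqrt{s_{xx}}$ and $s\sqrt{1/n + \bar{x}^2/s_{xx}}$ can be computed directly and compared with the "Std. Error" column of `summary(lm())`.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
s<-sigma(fit)
sxx<-sum((x-mean(x))^2)
# standard errors implied by the sampling distributions above
se_b1<-s/sqrt(sxx)
se_b0<-s*sqrt(1/n+mean(x)^2/sxx)
c(se_b0,se_b1)
summary(fit)$coefficients[,"Std. Error"]   # should agree with the two values above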
Confidence Interval
Suppose we wish to test the hypotheses,
$$ H_0: \beta_1 = 0 \quad \text{vs.} \quad H_1: \beta_1 \neq 0, $$
We have $\sqrt{s_{xx}} \frac{\hat{\beta}_1 - \beta_1}{\sigma} \sim N(0, 1)$.
Replacing $\sigma$ with its estimator $S$, we find,
$$ \frac{\hat{\beta_1} - \beta_1}{S / \sqrt{s_{xx}}} \sim t_{n - 2}. $$
The rejection region is then,
$$ \text{reject } H_0 \text{ if } \hat{\beta_1} \notin \left[-t_{n - 2, 1 - \alpha / 2} \frac{s}{\sqrt{s_{xx}}}, t_{n - 2, 1 - \alpha / 2} \frac{s}{\sqrt{s_{xx}}}\right]. $$
The $p$-value is, $$ p = 1 - P(T \in [-|t_0|, |t_0|]) = 2P(T > |t_0|). $$
where $t_0 = \frac{\hat{\beta_1}}{s / \sqrt{s_{xx}}}$ and $T$ is a random variable with distribution $t_{n - 2}$.
So, the CI for $\beta_1$ is,
$$ \left[\hat{\beta_1} - t_{n - 2, 1 - \alpha / 2} \frac{s}{\sqrt{s_{xx}}}, \hat{\beta_1} + t_{n - 2, 1 - \alpha / 2} \frac{s}{\sqrt{s_{xx}}}\right]. $$
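A minimal R sketch on simulated data (hypothetical names) that computes the $t$ statistic, its two-sided $p$-value and the 95% CI for $\beta_1$ from the formulas above, and checks the interval against `confint()`.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
s<-sigma(fit)
sxx<-sum((x-mean(x))^2)
b1<-coef(fit)["x"]
# t statistic, two-sided p-value and 95% CI for beta_1
t0<-b1/(s/sqrt(sxx))
pval<-2*pt(-abs(t0),df=n-2)
ci<-b1+c(-1,1)*qt(0.975,df=n-2)*s/sqrt(sxx)
c(t0,pval)
ci
confint(fit,"x",level=0.95)   # should match ci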
Confidence Interval on the Mean Response
For a specified value of $X$, say, $x$, we want a confidence interval for,
$$ y(x) = \mu_{Y | X = x} = \beta_0 + \beta_1 x. $$
We have an estimator,
$$ \hat{y}(x) = \hat{\mu}_{Y | X = x} = \hat{\beta}_0 + \hat{\beta}_1 x. $$
and
$$ \hat{y}(x) \sim N\left(\mu_{Y | X = x}, \sigma^2 \left(\frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}\right)\right). $$
Replacing the unknown $\sigma$ by its estimator $S$, we get,
$$ \frac{\hat{\mu_{Y | X = x}} - \mu_{Y | X = x}}{S \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}} } \sim t_{n - 2}. $$
and the CI is,
$$ \left[\hat{\mu_{Y | X = x}} - t_{n - 2, 1 - \alpha / 2} S \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}}, \hat{\mu_{Y | X = x}} + t_{n - 2, 1 - \alpha / 2} S \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}}\right]. $$
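A minimal sketch (simulated data, hypothetical names) that builds this interval from the formula and checks it against `predict()` with `interval="confidence"`.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
s<-sigma(fit)
sxx<-sum((x-mean(x))^2)
# 95% CI for the mean response at x = 5, from the formula above
x0<-5
yhat0<-sum(coef(fit)*c(1,x0))
half<-qt(0.975,df=n-2)*s*sqrt(1/n+(x0-mean(x))^2/sxx)
c(yhat0-half,yhat0+half)
# the same interval via predict()
predict(fit,newdata=data.frame(x=x0),interval="confidence",level=0.95)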
Prediction of new observations
An important application is predicting new or future observations $Y$ corresponding to a specified level $X = x$.
This is different from estimating the mean response $\mu_{Y | X = x}$ at $X = x$: the predicted value is still $\hat{y}(x) = \hat{\beta}_0 + \hat{\beta}_1 x$, but a new observation also carries its own error term $e$.
We can derive,
$$ Y | (X = x) - \hat{y}(x) \sim N\left(0, \sigma^2 \left(1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}\right)\right), $$
which leads to the following prediction interval,
$$ \left[\hat{y}(x) - t_{n - 2, 1 - \alpha / 2} S \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}}, \hat{y}(x) + t_{n - 2, 1 - \alpha / 2} S \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{s_{xx}}}\right]. $$
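In R, the two kinds of intervals can be compared directly with `predict()`; a minimal sketch on simulated data (hypothetical names) follows.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
new<-data.frame(x=5)
# the prediction interval includes the extra "1 +" term, so it is wider
predict(fit,newdata=new,interval="confidence",level=0.95)
predict(fit,newdata=new,interval="prediction",level=0.95)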
Remarks
- A prediction interval for $Y$ at $X = x$ will always be wider than the CI for $\mu_{Y | X = x}$, because there is much more variability in a single observation than in an average.
- As $n$ gets larger ($n \to \infty$), the width of the CI for $\mu_{Y | X = x}$ decreases to 0, but this is not the case for the prediction interval.
Analysis of Variance (ANOVA)
The most interesting test for simple linear regression is probably,
$$ H_0 : \beta_1 = 0 \quad \text{vs.} \quad H_1 : \beta_1 \neq 0. $$
We previously used the $t$-test based on,
$$ \frac{\hat{\beta_1}}{s / \sqrt{s_{xx}}} \sim t_{n - 2} \quad \text{under } H_0. $$
There is an alternative (and equivalent) test, the $F$-test based on analysis of variance.
Firstly, let’s define some quantities,
$$ \begin{align*} SST = \sum_i (y_i - \bar{y})^2 \newline SS_{\text{reg}} = \sum_i (\hat{y}_i - \bar{y})^2 \newline RSS = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \hat{e}_i^2 \end{align*} $$
where $SST$ is the total sum of squares, $SS_{\text{reg}}$ is the regression sum of squares, and $RSS$ is the residual sum of squares.
For the $F$-test, we use,
$$ F = \frac{SS_{\text{reg}} / 1}{RSS / (n - 2)} \sim F(1, n - 2) \text{ under } H_0. $$
We reject $H_0$ if $F > F_{1 - \alpha, 1, n - 2}$.
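In R, the `anova()` function produces this test; a minimal sketch on simulated data (hypothetical names) also checks the equivalence with the $t$-test.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
# anova() reports SSreg, RSS, the mean squares and the F statistic with its p-value
anova(fit)
# for simple linear regression, F equals the square of the slope's t statistic
summary(fit)$coefficients["x","t value"]^2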
Analysis of Variance Table
The ANOVA table summarizes the sources of variation in the data.
Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | F-statistic |
---|---|---|---|---|
Regression | 1 | $SS_{\text{reg}}$ | $SS_{\text{reg}}/1$ | $F = \frac{SS_{\text{reg}}/1}{RSS/(n - 2)}$ |
Residual | $n - 2$ | $RSS$ | $RSS/(n - 2)$ | |
Total | $n - 1$ | $SST$ | | |
Coefficient of Determination
Suppose $SST \simeq SS_{\text{reg}}$ and $RSS \simeq 0$: almost the whole variation is explained by the regression, and the fit is good.
Conversely, if $SS_{\text{reg}} \simeq 0$ and $RSS \simeq SST$, almost the whole variation is due to error, and the regression is not significant.
The coefficient of determination is defined as,
$$ R^2 = \frac{SS_{\text{reg}}}{SST}, $$
which represents the proportion of the variability in the responses that is explained by the predictor.
A value of $R^2$ near 1 indicates a good fit to the data.
A value of $R^2$ near 0 indicates a poor fit to the data.
Connection to correlation coefficient
We can derive the following relationship,
$$ \begin{align*} R^2 & = \frac{SS_{\text{reg}}}{SST} \newline & = \frac{SST - RSS}{s_{yy}} \newline & = \frac{s_{xx}(SST - RSS)}{s_{xx} s_{yy}} \newline & = \frac{s^2_{xy}}{s_{xx} s_{yy}} \newline & = \frac{\left(\sum_i (x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}. \end{align*} $$
Except for its sign (positive or negative linear relationship), the sample correlation is the square root of the coefficient of determination.
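A minimal sketch on simulated data (hypothetical names) verifying that $R^2$ from the ANOVA decomposition equals the squared sample correlation and the value reported by `summary()`.
# simulated data (hypothetical)
set.seed(1)
n<-30
x<-runif(n,0,10)
y<-2+0.5*x+rnorm(n)
fit<-lm(y~x)
# R^2 from the ANOVA decomposition ...
rss<-sum(residuals(fit)^2)
sst<-sum((y-mean(y))^2)
1-rss/sst
# ... equals the squared sample correlation and the "Multiple R-squared" from summary()
cor(x,y)^2
summary(fit)$r.squared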
Example
A food processing center needs to be able to switch from one type of package to another quickly to react to changes in order patterns. Consultants have developed a new method for changing the production line. We have a sample of 48 change-over times using the new method, and an independent sample of 72 change-over times for the existing method.
We model the relationship between $Y$, the change-over time and $X$, the dummy variable for the new method.
We consider the simple linear regression model,
$$ Y = \beta_0 + \beta_1 X + e, $$
where $Y$ is the change-over time and $X$ is the dummy variable (i.e., $X = 1$ for the new change-over method and $X = 0$ for the existing method).
# read the data and fit the simple linear regression of change-over time on the dummy variable
changeover_times<-read.table("changeover_times.txt", header=TRUE)
attach(changeover_times)
m1<-lm(Changeover~New)
summary(m1)
# scatter plot with the fitted line, plus boxplots by dummy variable and by method
par(mfrow=c(2,2))
plot(New,Changeover,xlab="Dummy variable, New",ylab="Change Over Time")
abline(lsfit(New,Changeover))
boxplot(Changeover~New,xlab="Dummy variable, New",ylab="Change Over Time")
boxplot(Changeover~Method,ylab="Change Over Time",xlab="Method")
# two-sided p-value for the slope's t-statistic (from summary(m1)) with n - 2 = 118 df
pval<-2*pt(-2.254,df=118)
pval
# equivalently, using a two-sample t-test (with and without equal variances)
t.test(Changeover[New==0],Changeover[New==1],var.equal = TRUE)
t.test(Changeover[New==0],Changeover[New==1],var.equal = FALSE)
detach(changeover_times)
Diagnostic Tools for Model Checking
So far we have assumed independence, linearity, constant variance and normal errors. We will now look at some techniques and tools to check these assumptions.
We will discuss the following,
- Leverage points
- If they exist, determine if they are bad leverage points.
- Standardized residuals
- Outliers
- Examine assumption of constant variance
- Examine whether the errors are normally distributed
Leverage Points
A leverage point is a point whose $x$-value is distant from the other $x$-values.
A point is a bad leverage point if its $y$-value does not follow the pattern set by the other data points. In other words, a bad leverage point is a leverage point which is also an outlier.
The mathematical definition of leverage comes from writing each fitted value as a linear combination of the observed responses,
$$ \hat{y_i} = \sum_{j = 1}^n h_{ij} y_j, $$
where $h_{ij} = \left[\frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}}\right]$.
We can see that, if we sum up $h_{ij}$ for all $j$,
$$ \sum_{j = 1}^n h_{ij} = 1. $$
Thus,
$$ h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j = 1}^n (x_j - \bar{x})^2}, $$
is called the leverage of the $i$-th data point. If $h_{ii} \approx 1$, then $\hat{y}_i \approx y_i$.
A popular rule of thumb for detecting leverage points is,
$$ h_{ii} > 2 \times \text{average}(h_{ii}) = 2 \times \frac{2}{n} = \frac{4}{n}. $$
Once bad leverage points have been found, we remove them and then fit the model again (possibly a different model).
Standardized Residuals
It can be shown that,
$$ \text{Var}(\hat{e_i}) = \sigma^2 (1 - h_{ii}). $$
This is reasonable since $h_{ii} \approx 1$ implies that $\hat{y_i} \approx y_i$ and thus $\hat{e_i}$ is small.
The standardized residuals are defined as,
$$ r_i = \frac{\hat{e_i}}{s \sqrt{1 - h_{ii}}}. $$
where $s = \sqrt{\frac{1}{n - 2} \sum_{j = 1}^n \hat{e_j}^2}$.
Outliers
A common practice for labelling points as outliers is the following.
If the standardized residual for the point falls outside the interval from -2 to 2, it is considered an outlier. In very large data sets, we change this interval to -4 to 4.
In R, we can use the following code to find the leverage and standardized residuals,
leverage<-hatvalues(m1)
# or influence(m1)$hat
stdres<-rstandard(m1)
res<-m1$residuals
# or residuals(m1)
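Continuing with `m1`, `leverage` and `stdres` from above, a small sketch (the cutoffs $4/n$ and $\pm 2$ are the rules of thumb stated earlier):
# flag leverage points (h_ii > 4/n) and outliers (|standardized residual| > 2)
n<-length(leverage)
which(leverage>4/n)
which(abs(stdres)>2)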
Cook’s Distance
Cook’s distance is a measure of the influence of each observation on the fitted values.
It is defined as,
$$ D_i = \frac{\sum_{j = 1}^n (\hat{y}_{j(i)} - \hat{y}_j)^2}{2S^2} = \frac{r_i^2 h_{ii}}{2(1 - h_{ii})}. $$
where $\hat{y_{j(i)}}$ denotes the $j$-th fitted value when the $i$-th observation is deleted. A recommended cutoff is $4/(n - 2)$.
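In R, Cook's distances can be obtained with `cooks.distance()`; a minimal sketch for the fitted model `m1` from the example, using the $4/(n - 2)$ cutoff:
# Cook's distance for each observation, with the 4/(n - 2) cutoff
cooksd<-cooks.distance(m1)
n<-length(cooksd)
which(cooksd>4/(n-2))
# index plot of Cook's distance with the cutoff line
plot(cooksd,ylab="Cook's distance")
abline(h=4/(n-2),lty=2)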
Normality of Errors
Usually the approximation $\hat{e_i} \approx e_i$ is accurate enough to justify using a normal QQ-plot of the (standardized) residuals to check the normality assumption.
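For example, a minimal base-R sketch for the fitted model `m1` from the example above (the built-in `plot(m1, which = 2)` produces essentially the same plot):
# normal QQ-plot of the standardized residuals of m1
stdres<-rstandard(m1)
qqnorm(stdres,ylab="Standardized residuals")
qqline(stdres,lty=2)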
Non-constant Variance
What do we do if the variance is not constant?
We can use a transformation; transformations help not only with non-constant variance but also with nonlinearity.
The methods that we will discuss deal with positive variables only.
There is no general rule for choosing a transformation. For example, count data are often modelled by the Poisson distribution, and in that case the appropriate (variance-stabilizing) transformation is the square root.
For overcoming nonlinearity, we need to think a bit more.
Suppose the true model is $Y = g(\beta_0 + \beta_1x + e)$ for an unknown monotone function $g$.
We can rewrite this as,
$$ g^{-1}(Y) = \beta_0 + \beta_1 x + e. $$
To approximate the inverse transformation $g^{-1}$, we can consider the family of scaled power transformations,
$$ \psi_s(Y, \lambda) = \begin{cases} (Y^{\lambda} - 1) / \lambda, & \text{if } \lambda \neq 0, \newline \log(Y), & \text{if } \lambda = 0. \end{cases} $$
So, the procedure will be,
- Fit a linear regression model $y_i = \beta_0 + \beta_1 x_i + e_i$.
- Estimate $g^{-1}$ by plotting $\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i$ against $y_i$ (an inverse response plot).
- Mathematically, we minimize $\sum_i (\hat{y_i} - \alpha_0 - \alpha_1 \psi_s(y_i, \lambda))^2$ over $(\alpha_0, \alpha_1, \lambda)$; a small sketch of this is given below.
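A minimal sketch of this procedure on simulated data (all object names are hypothetical, and the true $g$ is taken to be $\exp$, so the selected $\lambda$ should be near 0): for each candidate $\lambda$ we regress $\hat{y}_i$ on $\psi_s(y_i, \lambda)$, which gives the optimal $(\alpha_0, \alpha_1)$ for that $\lambda$, and we keep the $\lambda$ with the smallest residual sum of squares.
# scaled power transformation psi_s
psi<-function(y,lambda) if(lambda==0) log(y) else (y^lambda-1)/lambda
# simulated data with g = exp (hypothetical example)
set.seed(1)
n<-100
x<-runif(n,0,2)
y<-exp(1+0.7*x+rnorm(n,sd=0.1))
fit<-lm(y~x)
yhat<-fitted(fit)
# profile RSS over lambda: for fixed lambda, lm() gives the best (alpha_0, alpha_1)
rss<-function(lambda) sum(residuals(lm(yhat~psi(y,lambda)))^2)
lambdas<-seq(-1,1,by=0.25)
lambdas[which.min(sapply(lambdas,rss))]   # should be close to 0, suggesting log(y)
If the `car` package is available, its `inverseResponsePlot()` function implements a more refined version of this idea.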