Introduction
Last time we discussed simple linear regression, a model that predicts the value of a dependent variable from a single independent variable. Often in the real world, however, we want to model our dependent variable using multiple independent variables.
Mathematically, this is represented as,
$$ Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_p x_{pi} + e_i $$
To obtain a “good” fit, we again try to minimize the RSS; however, since we now have multiple coefficients, we must minimize the RSS with respect to all of them.
$$ \begin{align*} RSS &= \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - \ldots - b_p x_{pi})^2 \newline \frac{\partial RSS}{\partial b_0} &= -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \newline \frac{\partial RSS}{\partial b_1} &= -2 \sum_{i=1}^{n} x_{1i} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \newline & \vdots \newline \frac{\partial RSS}{\partial b_p} &= -2 \sum_{i=1}^{n} x_{pi} (y_i - b_0 - b_1 x_{1i} - b_2 x_{2i} - \ldots - b_p x_{pi}) = 0 \end{align*} $$
Matrix Form
It is more convenient to work with the matrix formulation. Define,
$$ \mathbf{Y} = \begin{bmatrix} y_1 \newline y_2 \newline \vdots \newline y_n \end{bmatrix}, \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \ldots & x_{1p} \newline 1 & x_{21} & \ldots & x_{2p} \newline \vdots & \vdots & \ddots & \vdots \newline 1 & x_{n1} & \ldots & x_{np} \end{bmatrix}, \beta = \begin{bmatrix} \beta_0 \newline \beta_1 \newline \vdots \newline \beta_p \end{bmatrix}, \mathbf{e} = \begin{bmatrix} e_1 \newline e_2 \newline \vdots \newline e_n \end{bmatrix} $$
Thus, the matrix form of the equation is,
$$ \mathbf{Y} = \mathbf{X} \beta + \mathbf{e} $$
Now, we can write the RSS, denoted as $RSS(\beta)$, in matrix form as,
$$ RSS(\beta) = (\mathbf{Y} - \mathbf{X} \beta)^T (\mathbf{Y} - \mathbf{X} \beta) $$
Setting the gradient of $RSS(\beta)$ to zero gives the normal equations $\mathbf{X}^T \mathbf{X} \hat{\beta} = \mathbf{X}^T \mathbf{Y}$, so the least squares estimator is,
$$ \hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} $$
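As a quick sanity check of this formula, we can compute $\hat{\beta}$ by hand and compare it with R's `lm()`. The data below are simulated purely for illustration,
# Sanity check of the closed-form estimator on simulated data.
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)
X <- cbind(1, x1, x2)                      # design matrix with intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T Y
beta_hat
coef(lm(y ~ x1 + x2))                      # should match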
Since multiple linear regression generalizes the simple case, simple linear regression must be a special case of it; let's verify this for $p = 1$.
$$ \mathbf{X}^T \mathbf{X} = \begin{bmatrix} 1 & 1 & \ldots & 1 \newline x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \newline 1 & x_2 \newline \vdots & \vdots \newline 1 & x_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^{n} x_i \newline \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{bmatrix} = n \begin{bmatrix} 1 & \bar{x} \newline \bar{x} & \frac{1}{n} \sum_{i=1}^{n} x_i^2 \end{bmatrix} $$
The inverse of this matrix is, $$ (\mathbf{X}^T \mathbf{X})^{-1} = \frac{1}{n \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2 \right)} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} = \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} $$
$$ \mathbf{X}^T \mathbf{Y} = \begin{bmatrix} 1 & 1 & \ldots & 1 \newline x_1 & x_2 & \ldots & x_n \end{bmatrix} \begin{bmatrix} y_1 \newline y_2 \newline \vdots \newline y_n \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} y_i \newline \sum_{i=1}^{n} x_i y_i \end{bmatrix} $$
This finally gives $\hat{\beta}$ as,
$$ \begin{align*} \hat{\beta} &= \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} x_i^2 & -\bar{x} \newline -\bar{x} & 1 \end{bmatrix} \begin{bmatrix} \sum_{i=1}^{n} y_i \newline \sum_{i=1}^{n} x_i y_i \end{bmatrix} \newline &= \frac{1}{S_{xx}} \begin{bmatrix} \frac{1}{n} \sum_{i=1}^{n} y_i \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i y_i \newline \sum_{i=1}^{n} x_i y_i - \bar{x} \sum_{i=1}^{n} y_i \end{bmatrix} \newline &= \begin{bmatrix} \frac{\bar{y} \left[ \sum_{i=1}^n x_i^2 - n\bar{x}^2 \right] - \bar{x} \left[ \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} \right]}{S_{xx}} \newline \frac{S_{xy}}{S_{xx}} \end{bmatrix} \newline &= \begin{bmatrix} \bar{y} - \frac{S_{xy}}{S_{xx}} \bar{x} \newline \frac{S_{xy}}{S_{xx}} \end{bmatrix} \end{align*} $$
So we recover the familiar simple linear regression estimates $\hat{\beta}_0 = \bar{y} - \frac{S_{xy}}{S_{xx}} \bar{x}$ and $\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}$.
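We can check these closed-form expressions numerically as well; again, the data are simulated only for illustration,
# Closed-form simple regression coefficients vs lm().
set.seed(2)
x <- runif(50, 0, 30)
y <- 5 + 2 * x + rnorm(50, sd = 3)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(y ~ x))  # should agree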
Let’s take a look at some properties of the multiple linear regression model.
$$ \begin{align*} \mathbb{E}(\hat{\beta} | \mathbf{X}) &= \beta \newline \text{Var}(\hat{\beta} | \mathbf{X}) &= \sigma^2 (\mathbf{X}^T \mathbf{X})^{-1} \newline S^2 &= \frac{RSS}{n-p-1} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n-p-1} \end{align*} $$
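In particular, the standard errors reported by `summary()` are the square roots of the diagonal of $S^2 (\mathbf{X}^T \mathbf{X})^{-1}$. A small sketch on simulated data (made up for illustration),
# Standard errors from S^2 (X^T X)^{-1}, compared with summary(lm()).
set.seed(3)
n <- 100; p <- 2
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 2 * x2 + rnorm(n)
X <- cbind(1, x1, x2)
fit <- lm(y ~ x1 + x2)
s2 <- sum(resid(fit)^2) / (n - p - 1)       # S^2 = RSS / (n - p - 1)
se <- sqrt(diag(s2 * solve(t(X) %*% X)))    # SE(beta_hat)
se
summary(fit)$coefficients[, "Std. Error"]   # should match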
Inference
For inference on a single parameter, we use the $t$-statistic,
$$ T_i = \frac{\hat{\beta_i} - \beta_i}{SE(\hat{\beta_i})} \sim t_{n-p-1} $$
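Reusing `fit` from the sketch above, the $T_i$ statistics and their two-sided p-values (for $H_0 : \beta_i = 0$) can be reproduced directly,
# t-statistics and two-sided p-values for H0: beta_i = 0.
est <- coef(fit)
se  <- summary(fit)$coefficients[, "Std. Error"]
t_stat <- est / se
p_val  <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)
cbind(t_stat, p_val)  # matches the summary(fit) table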
An analysis of variance approach to inference is also possible, with hypotheses,
$$ H_0 : \beta_1 = \ldots = \beta_p = 0 \newline H_1 : \text{At least one } \beta_j \neq 0 $$
Let’s recall some important definitions,
$$ \begin{align*} SST = S_{yy} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 \newline RSS &= \sum_{i=1}^{n} (y_i - \hat{y_i})^2 \newline SS_{\text{reg}} &= \sum_{i=1}^{n} (\hat{y_i} - \bar{y})^2 \newline SST &= SS_{\text{reg}} + RSS \end{align*} $$
The test statistic is defined as,
$$ F = \frac{SS_{reg}/p}{RSS/(n-p-1)} \sim F_{p, n-p-1} $$
where we reject $H_0$ if $F$ is too large.
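Reusing `fit`, `y`, `n` and `p` from the earlier simulated sketch, the overall F-statistic and its p-value can be computed by hand,
# Overall F-test computed from the sums of squares.
RSS    <- sum(resid(fit)^2)
SSreg  <- sum((fitted(fit) - mean(y))^2)
F_stat <- (SSreg / p) / (RSS / (n - p - 1))
F_stat
pf(F_stat, p, n - p - 1, lower.tail = FALSE)  # p-value; matches summary(fit)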
Recall our analysis of variance table,
Source of Variation | Degrees of Freedom (df) | Sum of Squares (SS) | Mean Square (MS) | F-statistic |
---|---|---|---|---|
Regression | $p$ | $SS_{\text{reg}}$ | $SS_{\text{reg}}/p$ | $F = \frac{SS_{\text{reg}}/p}{RSS/(n - p - 1)}$ |
Residual | $n - p - 1$ | $RSS$ | $S^2 = \frac{RSS}{n - p - 1}$ | |
Total | $n - 1$ | $SST = S_{yy}$ | | |
The coefficient of determination and adjusted coefficient of determination are defined as,
$$ \begin{align*} R^2 &= \frac{SS_{\text{reg}}}{SST} = 1 - \frac{RSS}{SST} \newline R^2_{\text{adj}} &= 1 - \frac{RSS/(n-p-1)}{SST/(n-1)} \end{align*} $$
The adjusted coefficient of determination is needed because $R^2$ never decreases when we add variables, even irrelevant ones. Thus, to compare models with different numbers of variables, we use $R^2_{\text{adj}}$.
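Both quantities are easy to compute by hand and check against `summary()`; the sketch below again reuses `fit`, `y`, `n` and `p` from the simulated example,
# R^2 and adjusted R^2 by hand.
SST <- sum((y - mean(y))^2)
RSS <- sum(resid(fit)^2)
R2     <- 1 - RSS / SST
R2_adj <- 1 - (RSS / (n - p - 1)) / (SST / (n - 1))
c(R2, R2_adj)  # compare with summary(fit)$r.squared and summary(fit)$adj.r.squared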
We can also test a subset of coefficients using ANOVA, comparing a reduced model (with those coefficients removed) to the full model.
$$ H_0 : \beta_1 = \ldots = \beta_k = 0 \newline H_1 : \text{At least one } \beta_j \neq 0 $$
Then the test statistic is,
$$ \begin{align*} F &= \frac{(RSS(\text{reduced}) - RSS(\text{full}))/(df_{\text{reduced}} - df_{\text{full}})}{RSS(\text{full})/df_{\text{full}}} \newline &= \frac{(RSS(\text{reduced}) - RSS(\text{full}))/k}{RSS(\text{full})/(n-p-1)} \end{align*} $$
since $df_{\text{reduced}} - df_{\text{full}} = (n - (p + 1 - k)) - (n - (p + 1)) = k$. Under $H_0$, $F \sim F_{k,\, n-p-1}$, and we again reject $H_0$ when $F$ is too large.
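In R this partial F-test is exactly what `anova(reduced, full)` reports; as a sketch, for two hypothetical nested fits `m_red` and `m_full`,
# Partial F-test for two nested lm fits (m_red nested in m_full).
partial_F <- function(m_red, m_full) {
  rss_r <- sum(resid(m_red)^2);  df_r <- df.residual(m_red)
  rss_f <- sum(resid(m_full)^2); df_f <- df.residual(m_full)
  F_stat <- ((rss_r - rss_f) / (df_r - df_f)) / (rss_f / df_f)
  c(F = F_stat, p_value = pf(F_stat, df_r - df_f, df_f, lower.tail = FALSE))
}
# partial_F(m_red, m_full)  # should agree with anova(m_red, m_full)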
Example
Let’s look at a simple example and the corresponding R code.
Data from surveys of customers of 168 Italian restaurants in the target area are available, with the following variables: $$ \begin{align*} Y &= \text{Price} = \text{the price (in \$US) of a dinner (including 1 drink \& a tip)} \newline x_1 &= \text{Food} = \text{customer rating of the food (out of 30)} \newline x_2 &= \text{Décor} = \text{customer rating of the decor (out of 30)} \newline x_3 &= \text{Service} = \text{customer rating of the service (out of 30)} \newline x_4 &= \text{East} = \text{dummy variable} = 1 \text{ if the restaurant is east of Fifth Avenue, 0 if it is west} \end{align*} $$
So, let’s first develop a regression model that predicts the price of a dinner.
# Assume data is loaded into `nyc`.
attach(nyc)
m1 <- lm(Price ~ Food + Decor + Service + East)
summary(m1)
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -24.244248 4.758009 -5.095 9.70e-07 ***
> Food 1.553922 0.372672 4.170 4.98e-05 ***
> Decor 1.897926 0.219100 8.662 4.79e-15 ***
> Service 0.005318 0.399493 0.013 0.9894
> East 2.003432 0.961794 2.083 0.0388 *
> ---
> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
So, our initial regression model is,
$$ \text{Price} = -24.244248 + 1.553922 \text{Food} + 1.897926 \text{Decor} + 0.005318 \text{Service} + 2.003432 \text{East} $$
We can see that Décor has the largest effect on the price (note that Food, Décor and Service are each measured on the same 0 to 30 scale so it is meaningful to compare the regression coefficients).
We can also observe that, if a new restaurant gets to choose its location, it should be east of Fifth Avenue, since the coefficient of the East dummy variable is statistically significantly greater than zero ($p = 0.0388$).
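With this fitted model we can also predict the price of a dinner at a new restaurant; the ratings below are hypothetical and chosen only for illustration,
# Predicted price (with a prediction interval) for a hypothetical new restaurant.
new_rest <- data.frame(Food = 22, Decor = 20, Service = 21, East = 1)
predict(m1, newdata = new_rest, interval = "prediction")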
Interaction Effects
Suppose we want to model $Y$ using a continuous predictor $x$ and a dummy variable $d$. Compared to using $x$ alone, the additional effect of $d$ could be one of the following (an R sketch of the interaction case follows the list),
- $d$ has no effect on $Y$, that is, $Y = \beta_0 + \beta_1 x + e$.
- $d$ produces only an additive change in $Y$, that is, $$ Y = \beta_0 + \beta_1 x + \beta_2 d + e = \begin{cases} \beta_0 + \beta_1 x + e & d = 0 \newline \beta_0 + \beta_2 + \beta_1 x + e & d = 1 \end{cases} $$
- $d$ produces an additive change in $Y$ and also changes the size of the effect of $x$ on $Y$, that is, $$ Y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 xd + e = \begin{cases} \beta_0 + \beta_1 x + e & d = 0 \newline \beta_0 + \beta_2 + (\beta_1 + \beta_3) x + e & d = 1 \end{cases} $$
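In R, this last case corresponds to the formula `Y ~ x * d`, which expands to `x + d + x:d`; a minimal sketch on simulated data,
# Fitting Y = b0 + b1 x + b2 d + b3 x d + e on simulated data.
set.seed(4)
x <- runif(80, 0, 30)
d <- rbinom(80, 1, 0.5)
Y <- 10 + 2 * x + 5 * d + 1.5 * x * d + rnorm(80, sd = 4)
m_int <- lm(Y ~ x * d)   # equivalent to Y ~ x + d + x:d
coef(m_int)              # intercept, x slope, additive shift for d, slope change for d = 1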
Let’s go back to our example again.
After you present your model to the management team, the debate focuses on the statistical insignificance of the Service variable. The team feels that the model may be too simple to reflect the reality of an Italian restaurant in Manhattan. In particular, there is general consensus amongst the team that restaurants east of Fifth Avenue are very different from those on the west side. As such, you have been asked to consider different models for the east and west sides.
So, we can develop a mathematical model as such,
$$ Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 \text{East} + \beta_5 x_1 \text{East} + \beta_6 x_2 \text{East} + \beta_7 x_3 \text{East} + e $$
In R, this becomes,
m2 <- lm(Price ~ Food + Decor + Service + East + Food:East + Decor:East + Service:East)
summary(m2)
> Coefficients:
> Estimate Std. Error t value Pr(>|t|)
> (Intercept) -26.9305 8.5195 -3.161 0.00189 **
> Food 0.9986 0.5742 1.739 0.08398 .
> Decor 1.8824 0.3005 6.264 3.45e-09 ***
> Service 0.7566 0.6489 1.166 0.24536
> East 5.5702 10.3285 0.539 0.59044
> Food:East 1.2679 0.7815 1.622 0.10673
> Decor:East -0.2791 0.4618 -0.604 0.54646
> Service:East -1.2842 0.8230 -1.560 0.12071
> Residual standard error: 5.747 on 157 degrees of freedom
> (3 observations deleted due to missingness)
> Multiple R-squared: 0.6381, Adjusted R-squared: 0.622
> F-statistic: 39.55 on 7 and 157 DF, p-value: < 2.2e-16
We can also carry out a partial F-test, comparing the reduced model without the insignificant Service term against the full interaction model,
m3 <- lm(Price ~ Food + Decor + East)  # reduced model, dropping Service
anova(m3, m2)
> Analysis of Variance Table
> Model 1: Price ~ Food + Decor + East
> Model 2: Price ~ Food + Decor + Service + East + Food:East + Decor:East + Service:East
> Res.Df RSS Df Sum of Sq F Pr(>F)
> 1 161 5338.9
> 2 157 5186.1 4 152.82 1.1566 0.3322
Since the p-value (0.3322) is large, we fail to reject $H_0$: the Service and interaction terms do not significantly improve the fit, so we are happy to adopt the reduced model.