Deriving Multiple Linear Regression
Let's say you're trying to make a special chemical called the T Virus. Why? Because it's highly profitable. Unfortunately, your recipe isn't perfect and you need to improve it.
You make the T Virus by mixing raw materials together and heating things to different temperatures. What is the right ratio of ingredients? What's the right temperature?
This is where Multiple Linear Regression can help. There are so many possible recipes for the T Virus, and the raw materials are very expensive; you don't have the time or the money to try every recipe. Note that Multiple Linear Regression is not always the best model (for example, the true relationship might be polynomial), but for now we are sufficiently motivated to derive it.
Often a response, or dependent variable, depends on many inputs, and you have the relationship
$$ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_k x_{ik} + e_i \qquad \text{with $e_i \sim N(0,\sigma^2)$}$$Let's say there are $n=3$ observations and $k=3$ inputs. This can also be written as
$$ Y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{12} + \beta_3 x_{13} + e_1 $$ $$ Y_2 = \beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + e_2 $$ $$ Y_3 = \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + e_3 $$and this can be put in matrix notation:
\[ \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & x_{13} \\ 1 & x_{21} & x_{22} & x_{23} \\ 1 & x_{31} & x_{32} & x_{33} \\ \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \end{bmatrix} \] $$ Y = X \beta + e $$Which is (warning abuse of notation): $\mathbb{R}^{3 \times 1} = \mathbb{R}^{3 \times 4} \times \mathbb{R}^{4 \times 1}+\mathbb{R}^{3 \times 1} $
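To make the matrix form concrete, here is a minimal numpy sketch that builds the design matrix $X$ (a column of ones plus the $k$ inputs) and generates $Y = X\beta + e$. The coefficients and data are hypothetical, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, k = 3, 3                                 # n observations, k inputs (as in the text)
sigma_e = 0.5                               # hypothetical error standard deviation
beta = np.array([1.0, 2.0, -0.5, 0.3])      # hypothetical true beta_0, ..., beta_3

inputs = rng.uniform(0, 10, size=(n, k))    # the x_{ij} values
X = np.column_stack([np.ones(n), inputs])   # prepend the column of ones -> X is n x (k+1)
e = rng.normal(0, sigma_e, size=n)          # e_i ~ N(0, sigma_e^2)
Y = X @ beta + e                            # Y = X beta + e

print(X.shape, Y.shape)                     # (3, 4) (3,)
```

(With $n = 3$ observations and $k+1 = 4$ coefficients, $X^TX$ is not invertible; the $n=3$ case is only for illustrating the shapes, and the later sketches use a larger $n$.)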
So we want to know what the vector $\beta$ is. I should say, and this is implied, that we want to know what $\beta$ is given that we first observe a bunch of data: $(x_{11}, x_{12}, x_{13}, Y_1)$, $(x_{21}, x_{22}, x_{23}, Y_2)$, and $(x_{31}, x_{32}, x_{33}, Y_3)$.
With this data in hand, we estimate $\beta$ using the estimators $B_0, ..., B_k$.
How do we figure out what the "best" estimators are? How do we define best? Well, let's define best as the values that minimize $SS_R$:
$$ SS_R = \sum_{i=1}^n (Y_i - B_0 - B_1 x_{i1} - ... - B_k x_{ik})^2$$To determine the estimators, take the $k+1$ partial derivatives, set each to zero, and rearrange. With the convention $x_{i0} \equiv 1$,
$$ \frac{\partial SS_R}{\partial B_j} = -2 \sum_{i=1}^n x_{ij} (Y_i - B_0 - B_1 x_{i1} - ... - B_k x_{ik}) = 0 \qquad j = 0, 1, ..., k$$This gives the $k+1$ normal equations, which can be expressed in matrix form as
$$ X^T Y = X^T X B $$ $$\mathbb{R}^{k+1 \times n} \times \mathbb{R}^{n \times 1} = \mathbb{R}^{k+1 \times n} \times \mathbb{R}^{n \times k+1} \times \mathbb{R}^{k+1 \times 1} $$ $$ X^T X B = X^T Y $$which then means that
$$ B = (X^T X)^{-1} X^T Y $$Just like in simple linear regression, we have found the estimators. But are they unbiased?
$$ E[B] = E[(X^T X)^{-1} X^T Y ] = (X^T X)^{-1} X^T E[ Y ] = (X^T X)^{-1} X^T E[ X\beta + e] $$ $$ = (X^T X)^{-1} X^T X\beta + (X^T X)^{-1} X^T E[e]$$ $$= I\beta + 0 = \beta$$So the estimators are unbiased, and they are normal RVs. We then want to know the variance of the estimators to be able to make confidence intervals, do hypothesis testing, etc. For example, $H_0 : \beta_1 = 0$, i.e. there is no regression on $x_{i1}$, the first input / independent variable.
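As a hedged sanity check of unbiasedness, here is a small simulation sketch with made-up data (note $n$ is taken larger than the $n=3$ illustration above so that $X^TX$ is invertible): averaging $B = (X^T X)^{-1} X^T Y$ over many simulated error vectors should recover $\beta$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, k = 30, 3                                # more observations than coefficients
sigma_e = 0.5                               # hypothetical error standard deviation
beta = np.array([1.0, 2.0, -0.5, 0.3])      # hypothetical true coefficients

inputs = rng.uniform(0, 10, size=(n, k))
X = np.column_stack([np.ones(n), inputs])   # fixed design matrix, n x (k+1)

def estimate(X, Y):
    # B = (X^T X)^{-1} X^T Y, computed with solve() rather than an explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ Y)

trials = 5000
B_sum = np.zeros(k + 1)
for _ in range(trials):
    Y = X @ beta + rng.normal(0, sigma_e, size=n)   # fresh errors, same X each time
    B_sum += estimate(X, Y)

print(B_sum / trials)                       # should be close to beta
print(beta)
```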
It turns out $Var(B_i) = \sigma_{B_i}^2 = Cov(B_i,B_i)$ is equal to entry number $i+1,i+1$ of the matrix $\sigma_e^2 (X^TX)^{-1}$
$$ \sigma_{B_i}^2 = \sigma_e^2 (X^TX)_{i+1,i+1}^{-1}$$and generally speaking
$$ Cov(B_i, B_j) = \sigma_e^2 (X^TX)_{i+1,j+1}^{-1} $$How can we show this? Well note that
\begin{align*} B = (X^T X)^{-1} X^T Y & \qquad (\mathbb{R}^{4 \times 3} \times \mathbb{R}^{3 \times 4}) \times \mathbb{R}^{4 \times 3} \times \mathbb{R}^{3 \times 1} = \mathbb{R}^{4 \times 1} \\ C \equiv (X^T X)^{-1} X^T & \qquad (\mathbb{R}^{4 \times 3} \times \mathbb{R}^{3 \times 4}) \times \mathbb{R}^{4 \times 3} = \mathbb{R}^{4 \times 3} \\ B = CY & \qquad \mathbb{R}^{4 \times 3} \times \mathbb{R}^{3 \times 1} = \mathbb{R}^{4 \times 1} \end{align*} \[ \begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \\ \end{bmatrix} = \begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \\ C_{41} & C_{42} & C_{43} \\ \end{bmatrix} \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ \end{bmatrix} \text{ which is $\mathbb{R}^{4 \times 1} = \mathbb{R}^{4 \times 3} \times \mathbb{R}^{3 \times 1}$} \]And note for future reference that...
\[ C C^T = \begin{bmatrix} C_{11} & C_{12} & C_{13} \\ C_{21} & C_{22} & C_{23} \\ C_{31} & C_{32} & C_{33} \\ C_{41} & C_{42} & C_{43} \\ \end{bmatrix} \begin{bmatrix} C_{11} & C_{21} & C_{31} & C_{41} \\ C_{12} & C_{22} & C_{32} & C_{42} \\ C_{13} & C_{23} & C_{33} & C_{43} \\ \end{bmatrix} = \begin{bmatrix} \sum_{r=1}^3 C_{1r} C_{1r} & \sum_{r=1}^3 C_{1r} C_{2r} & \sum_{r=1}^3 C_{1r} C_{3r} & \sum_{r=1}^3 C_{1r} C_{4r} \\ \sum_{r=1}^3 C_{2r} C_{1r} & \sum_{r=1}^3 C_{2r} C_{2r} & \sum_{r=1}^3 C_{2r} C_{3r} & \sum_{r=1}^3 C_{2r} C_{4r} \\ \sum_{r=1}^3 C_{3r} C_{1r} & \sum_{r=1}^3 C_{3r} C_{2r} & \sum_{r=1}^3 C_{3r} C_{3r} & \sum_{r=1}^3 C_{3r} C_{4r} \\ \sum_{r=1}^3 C_{4r} C_{1r} & \sum_{r=1}^3 C_{4r} C_{2r} & \sum_{r=1}^3 C_{4r} C_{3r} & \sum_{r=1}^3 C_{4r} C_{4r} \\ \end{bmatrix} \]Back to $B=CY$, we know that
$$B_1 = \sum_{l=1}^3 C_{2l} Y_l = C_{21} Y_1 + C_{22} Y_2 + C_{23} Y_3$$ $$ Var(B_1) = Cov(B_1,B_1) = Cov \left( \sum_{r=1}^3 C_{2r} Y_r , \sum_{l=1}^3 C_{2l} Y_l \right) = \sum_{r=1}^3 \sum_{l=1}^3 C_{2r} C_{2l} Cov(Y_r,Y_l)$$ \begin{align*} Var(B_1) = & Cov(C_{21} Y_1 + C_{22} Y_2 + C_{23} Y_3, C_{21} Y_1 + C_{22} Y_2 + C_{23} Y_3) \\ = & C_{21} C_{21} Cov(Y_1,Y_1) + C_{21} C_{22} Cov(Y_1,Y_2) + C_{21} C_{23} Cov(Y_1,Y_3) \\ & + C_{22} C_{21} Cov(Y_2,Y_1) + C_{22} C_{22} Cov(Y_2,Y_2) + C_{22} C_{23} Cov(Y_2,Y_3) \\ & + C_{23} C_{21} Cov(Y_3,Y_1) + C_{23} C_{22} Cov(Y_3,Y_2) + C_{23} C_{23} Cov(Y_3,Y_3) \end{align*}And now note that $Cov(Y_i,Y_j)$ is equal to $\sigma_e^2$ if $i=j$ and zero otherwise.
Note that $Cov(Y_2,Y_3) = Cov(\beta_0 + \beta_1 x_{21} + \beta_2 x_{22} + \beta_3 x_{23} + e_2,\ \beta_0 + \beta_1 x_{31} + \beta_2 x_{32} + \beta_3 x_{33} + e_3) = Cov(e_2,e_3) = 0$, since the regression terms are constants and the errors are independent.
$$= C_{21} C_{21} Cov(Y_1,Y_1) + C_{22} C_{22} Cov(Y_2,Y_2) + C_{23} C_{23} Cov(Y_3,Y_3)$$ $$= C_{21} C_{21} \sigma_e^2 + C_{22} C_{22} \sigma_e^2 + C_{23} C_{23} \sigma_e^2$$ $$= \sigma_e^2 (C_{21} C_{21} + C_{22} C_{22} + C_{23} C_{23} )$$ $$= \sigma_e^2 (C_{21}^2 + C_{22}^2 + C_{23}^2)$$ $$= Var(B_1)$$and note that $Var(B_1) = Cov(B_1,B_1) = \sigma_e^2 (CC^T)_{2,2}$
But since $C \equiv (X^T X)^{-1} X^T$ we have that
\begin{align*} CC^T = & (X^T X)^{-1} X^T ((X^T X)^{-1} X^T)^T \\ = & (X^T X)^{-1} X^T (X (X^T X)^{-1} ) \\ = & (X^T X)^{-1} (X^T X) (X^T X)^{-1} \\ = & (X^T X)^{-1} \end{align*}where we know $(X^T X)^{-1}$ is symmetric because $X^T X$ is symmetric.
Therefore we conclude that $ Cov(B) = \sigma_e^2 (X^TX)^{-1} $ where $Cov(B)$ is a symmetric matrix of the covariances of the different estimators.
\[ Cov(B) = \begin{bmatrix} Cov(B_0,B_0) & Cov(B_0,B_1) & \dots & Cov(B_0,B_k)\\ Cov(B_1,B_0) & Cov(B_1,B_1) & \dots & Cov(B_1,B_k)\\ \vdots & \vdots & \ddots & \vdots \\ Cov(B_k,B_0) & Cov(B_k,B_1) & \dots & Cov(B_k,B_k)\\ \end{bmatrix} = \begin{bmatrix} Var(B_0) & Cov(B_0,B_1) & \dots & Cov(B_0,B_k)\\ Cov(B_0,B_1) & Var(B_1) & \dots & Cov(B_1,B_k)\\ \vdots & \vdots & \ddots & \vdots \\ Cov(B_0,B_k) & Cov(B_1,B_k) & \dots & Var(B_k)\\ \end{bmatrix} \] $$ = \sigma_e^2 (X^TX)^{-1} \in \mathbb{R}^{k+1 \times k+1} $$
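As a quick numerical sanity check of this covariance formula, here is a hedged simulation sketch (hypothetical data again): the sample covariance of $B$ over many simulated error vectors should approximate $\sigma_e^2 (X^TX)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)

n, k = 30, 3
sigma_e = 0.5
beta = np.array([1.0, 2.0, -0.5, 0.3])      # hypothetical true coefficients

inputs = rng.uniform(0, 10, size=(n, k))
X = np.column_stack([np.ones(n), inputs])

# Draw many estimates B, each from a fresh error vector, and compare covariances.
Bs = np.array([
    np.linalg.solve(X.T @ X, X.T @ (X @ beta + rng.normal(0, sigma_e, size=n)))
    for _ in range(20000)
])                                          # shape (20000, k+1)

empirical = np.cov(Bs, rowvar=False)        # sample Cov(B) across simulations
theoretical = sigma_e**2 * np.linalg.inv(X.T @ X)

print(np.round(empirical, 5))
print(np.round(theoretical, 5))             # the two matrices should roughly agree
```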
We can estimate $\sigma_e^2$ using $SS_R$: $$ \frac{SS_R}{\sigma_e^2} \sim \chi_{n-k-1}^2$$ $$ E \left[ \frac{SS_R}{\sigma_e^2} \right] = n-k-1$$ $$ E \left[ \frac{SS_R}{n-k-1} \right] = \sigma_e^2$$That is, $\frac{SS_R}{n-k-1}$ is an unbiased estimator of $\sigma_e^2$.
$SS_R$ is independent of $B_0, B_1, ... ,B_k$, and $$ SS_R = Y^T Y - B^T X^T Y $$You can get the above by starting with the residual vector $r = Y - XB$ and expanding $SS_R = r^T r$.
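A minimal sketch (again with made-up data) checking that the two expressions for $SS_R$ agree, and that $SS_R/(n-k-1)$ lands in the neighborhood of $\sigma_e^2$:

```python
import numpy as np

rng = np.random.default_rng(3)

n, k = 30, 3
sigma_e = 0.5
beta = np.array([1.0, 2.0, -0.5, 0.3])      # hypothetical true coefficients

inputs = rng.uniform(0, 10, size=(n, k))
X = np.column_stack([np.ones(n), inputs])
Y = X @ beta + rng.normal(0, sigma_e, size=n)

B = np.linalg.solve(X.T @ X, X.T @ Y)

r = Y - X @ B                               # residual vector
SSR_residuals = r @ r                       # SS_R = r^T r
SSR_matrix = Y @ Y - B @ (X.T @ Y)          # SS_R = Y^T Y - B^T X^T Y

print(SSR_residuals, SSR_matrix)            # identical up to floating point error
print(SSR_residuals / (n - k - 1), sigma_e**2)   # single-sample estimate (unbiased in expectation) vs true value
```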
\textbf{Down and dirty example of how to get the standard deviations of $B$}
Abuse of notation: After getting your data, do:
$$\sqrt{\frac{SS_R}{n-k-1}(X^TX)^{-1}} \qquad \text{which will yield a $\mathbb{R}^{k+1 \times k+1}$ symmetric matrix}$$and the diagonal entries are the standard deviations of $B_0, B_1, ... , B_k$.

\textbf{Hypothesis testing}

Now that we have the matrix $Cov(B)$ and we know that $E[B] = \beta$, we know the distribution of each element of $B$. This is a big deal. Let's say we have the model $$ Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + ... + \beta_k x_{ik} + e_i \qquad \text{with $e_i \sim N(0,\sigma^2)$}$$ and we want to test the hypothesis that $\beta_1 = 0$, that there is no regression on $x_1$. Well $$B_1 \sim N \left(\beta_1, \sigma_e^2 (X^TX)_{2,2}^{-1} \right)$$ And we are interested in an alpha level test. That is, given the hypothesis is true and $\beta_1 = 0$, the probability of accidentally rejecting is no more than $\alpha$. So given it's true, we "know" that $$ \frac{B_1 - \beta_1}{\sqrt{\sigma_e^2 (X^TX)_{2,2}^{-1}}} \sim N(0,1)$$ We then use a $t$ distribution, since we have a $\sigma_e^2$ that we want to get rid of: $$ \frac{ \frac{B_1 - \beta_1}{\sqrt{\sigma_e^2 (X^TX)_{2,2}^{-1}}} } { \sqrt{\frac{\frac{SS_R}{\sigma_e^2}}{n-k-1}} } = \frac{ \frac{B_1 - \beta_1}{\sqrt{\sigma_e^2 (X^TX)_{2,2}^{-1}}} } { \sqrt{\frac{SS_R}{(n-k-1)\sigma_e^2}} } = B_1 \sqrt{\frac{n-k-1}{SS_R(X^TX)_{2,2}^{-1}}} \sim t_{n-k-1} $$ At this point you calculate $B_1$, $SS_R$, and $(X^T X)^{-1}$ from the actual data. This gives a $t$ value, which is the number of standard deviations from zero (the $t$ distribution is symmetric about zero). If you get a value very far from zero, then $\beta_1$ is likely not zero. If you get a value close to zero, then $\beta_1$ is probably close to zero. Let's say you get a $t$ value of $-0.5$, which is pretty close to zero; its p value is $2 F_{t,n-k-1}(-0.5)$, which is relatively big for decently sized $n-k-1$, so you would not reject the hypothesis. As $n$ gets bigger the p value gets smaller, but it bottoms out at $2\Phi(-0.5)$ and does not get smaller than that, since a $t$ distribution becomes normal as the degrees of freedom go to infinity. You showed this on Cross Validated.

\textbf{Coefficient of multiple determination}

You can still use $$ R^2 = 1 - \frac{SS_R}{\sum_i (Y_i - \overline{Y})^2}$$ to get the proportion of the variance in $Y$ explained by the inputs.
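To tie the pieces together, here is a hedged end-to-end sketch on hypothetical data: it computes $B$, $SS_R$, the $t$ statistic for $H_0: \beta_1 = 0$ with its two-sided p value, and $R^2$. The data and coefficients are made up; scipy is used only for the $t$ CDF.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n, k = 30, 3
sigma_e = 0.5
beta = np.array([1.0, 2.0, -0.5, 0.3])      # hypothetical truth; beta_1 = 2 is genuinely nonzero

inputs = rng.uniform(0, 10, size=(n, k))
X = np.column_stack([np.ones(n), inputs])
Y = X @ beta + rng.normal(0, sigma_e, size=n)

B = np.linalg.solve(X.T @ X, X.T @ Y)
SS_R = Y @ Y - B @ (X.T @ Y)
XtX_inv = np.linalg.inv(X.T @ X)

# t statistic for H0: beta_1 = 0; entry (2,2) in the text's 1-based indexing is [1, 1] here.
t_stat = B[1] * np.sqrt((n - k - 1) / (SS_R * XtX_inv[1, 1]))
p_value = 2 * stats.t.cdf(-abs(t_stat), df=n - k - 1)   # two-sided p value

# Coefficient of multiple determination.
R2 = 1 - SS_R / np.sum((Y - Y.mean()) ** 2)

print(t_stat, p_value, R2)
```

With $\beta_1$ truly nonzero, the $t$ value should land far from zero and the p value should be tiny; setting the second entry of `beta` to zero and rerunning is an easy way to see the opposite behavior.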