Deriving Simple Linear Regression
Throughout my working life, I've seen so many charts with regression lines. But how does regression work and when should we apply it?
Well, let's see how it's derived.
Background
Two variables $Y_i$ and $x_i$ have the following linear relationship
$$Y_i = \alpha + \beta x_i + e_i \qquad \text{where} \quad e_i \sim N(0,\sigma^2)$$
Let's call $x_i$ the input or independent variable
Let's call $Y_i$ the response or dependent variable
Examples:
- the temperature $x_i$ of a steel-making process and the hardness $Y_i$ of the steel
- the temperature $x_i$ of a chemical process and the yield $Y_i$
We want to know what $\alpha$ and $\beta$ are, since knowing them could save a lot of money. Unfortunately we can't observe them directly; we have to estimate their values from the observations $(x_i, Y_i)$.
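To make the setup concrete, here is a minimal sketch in Python that simulates observations from this model. The "true" values $\alpha = \beta = 2$, $\sigma = 1$, the sample size, and the input range are all made up purely for illustration.

```python
import numpy as np

# A minimal simulation of the model Y_i = alpha + beta*x_i + e_i,
# with made-up "true" parameters just for illustration.
rng = np.random.default_rng(0)

alpha_true, beta_true, sigma = 2.0, 2.0, 1.0   # assumed values, not from real data
n = 30

x = rng.uniform(0, 10, size=n)                 # inputs (e.g. process temperature)
e = rng.normal(0, sigma, size=n)               # errors e_i ~ N(0, sigma^2)
Y = alpha_true + beta_true * x + e             # observed responses

print(list(zip(x[:3].round(2), Y[:3].round(2))))  # a few (x_i, Y_i) pairs
```

In practice we only ever see the $(x_i, Y_i)$ pairs; the point of the rest of the derivation is to recover good guesses for $\alpha$ and $\beta$ from them.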
What is a good estimator?
So we have to come up with estimators $A$ and $B$ for $\alpha$ and $\beta$, and the question is: what makes an estimator "good"?
Note that you can do whatever you want. For example, let's say God tells us that $\alpha = \beta = 2$ and thus $Y_i = 2 + 2 x_i + e_i$.
By sheer dumb luck I might choose $A = B = 2$, estimates not based on any observations at all, and yet they would be exactly right.
Another dumb idea for an estimator is to just draw a line through the first two observations $(x_1, Y_1)$ and $(x_2, Y_2)$ and ignore everything else.
Enter the method of least squares. Let
$$ SS_R = \sum_{i=1}^{n} (Y_i - A - B x_i)^2 \qquad \text{Sum of Squared Residuals} $$ $$ = (Y_1 - A - B x_1)^2 + (Y_2 - A - B x_2)^2 + \cdots + (Y_n - A - B x_n)^2 $$
Choose $A$ and $B$ to minimize $SS_R$. To do this we can take derivatives and set equal to zero. Note that the $(x_i, Y_i)$ pairs are constants and our variables are $A$ and $B$.
$$ \frac{\partial SS_R}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i) = 0$$ $$ \sum_{i=1}^{n} Y_i - \sum_{i=1}^{n} A - \sum_{i=1}^{n} B x_i = 0 $$ $$ \sum_{i=1}^{n} Y_i - n A - B \sum_{i=1}^{n} x_i = 0 $$ $$ \sum_{i=1}^{n} Y_i - B \sum_{i=1}^{n} x_i = nA $$ $$ A = \overline{Y} - B \overline{x}$$
Now let's take the partial derivative with respect to $B$ and set equal to zero.
$$ \frac{\partial SS_R}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i) = 0$$ $$ \sum_{i=1}^{n} x_i (Y_i - A - B x_i) = \sum_{i=1}^{n} \left( x_i Y_i - A x_i - B x_i^2 \right) = 0$$
Now plugging in $A = \overline{Y} - B \overline{x}$ from above,
$$ \sum_{i=1}^{n} \left( x_i Y_i - (\overline{Y} - B \overline{x}) x_i - B x_i^2 \right) = 0$$ $$ \sum_{i=1}^{n} \left( x_i Y_i - \overline{Y} x_i + B \overline{x} x_i - B x_i^2 \right) = 0$$ $$ \sum_{i=1}^{n} \left( x_i Y_i - \overline{Y} x_i + B (\overline{x} x_i - x_i^2) \right) = 0$$ $$ \sum_{i=1}^{n} (x_i Y_i - \overline{Y} x_i ) + \sum_{i=1}^{n} B (\overline{x} x_i - x_i^2) = 0$$ $$ B \sum_{i=1}^{n} (\overline{x} x_i - x_i^2) = \sum_{i=1}^{n} (\overline{Y} x_i - x_i Y_i)$$ $$ B = \frac{ \sum_{i=1}^{n} (x_i \overline{Y} - x_i Y_i) } { \sum_{i=1}^{n} (x_i \overline{x} - x_i^2) } = \frac{ \sum_{i=1}^{n} x_i (\overline{Y} - Y_i) } { \sum_{i=1}^{n} x_i (\overline{x} - x_i) } = \frac{ \sum_{i=1}^{n} x_i (Y_i - \overline{Y}) } { \sum_{i=1}^{n} x_i (x_i - \overline{x}) } $$
I know that the standard way to write this is
$$ B = \frac{S_{xY}}{S_{xx}} = \frac{\sum_{i=1}^{n} (x_i - \overline{x})(Y_i - \overline{Y})}{\sum_{i=1}^{n} (x_i - \overline{x})^2}$$
and the two forms agree because $\sum_{i=1}^{n} \overline{x} (Y_i - \overline{Y}) = \overline{x} \sum_{i=1}^{n} (Y_i - \overline{Y}) = 0$ and likewise $\sum_{i=1}^{n} \overline{x} (x_i - \overline{x}) = 0$, so swapping $x_i$ for $(x_i - \overline{x})$ in the numerator and denominator changes nothing. Expanding the square in the denominator,
$$ S_{xx} = \sum_{i=1}^{n} (x_i - \overline{x})^2 = \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \overline{x} + \overline{x}^2 \right) = \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 $$
a form that will come in handy later.
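Here is a small sketch of how these closed-form estimates could be computed in Python. The data are simulated with made-up parameters (not from any real process), and the result is cross-checked against numpy's degree-1 `polyfit`, which fits the same least-squares line.

```python
import numpy as np

# Least-squares estimates from the closed-form solution:
# B = S_xY / S_xx and A = Ybar - B * xbar.
# Simulated data with made-up true parameters (alpha=2, beta=2, sigma=1).
rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
Y = 2.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

xbar, Ybar = x.mean(), Y.mean()
S_xx = np.sum((x - xbar) ** 2)
S_xY = np.sum((x - xbar) * (Y - Ybar))

B = S_xY / S_xx
A = Ybar - B * xbar
print(f"A = {A:.3f}, B = {B:.3f}")

# Cross-check against numpy's least-squares line fit (degree-1 polyfit).
B_np, A_np = np.polyfit(x, Y, 1)
print(f"polyfit: A = {A_np:.3f}, B = {B_np:.3f}")
```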
So are we done now? Well, let's say you want to do hypothesis testing, for example $H_0: \beta = 0$, which says there is no regression on the input variable. Note that $R^2$, the usual goodness-of-fit measure, is closely related to correlation (in simple linear regression it is the squared sample correlation). If $\beta = 0$ then there is no regression on the input variable, and the variation of the $Y_i$'s comes entirely from the error term $e$. A standard measure of the variation of the $Y_i$'s is $S_{YY} = \sum (Y_i - \overline{Y})^2$.
Anyway, to do hypothesis testing on $\beta$, or to build a confidence interval for $\alpha$, we need to know the distributions of $A$ and $B$.
$$ A \sim N \left( \alpha, \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n S_{xx}}\right)$$ $$ B \sim N \left( \beta, \frac{\sigma^2}{ S_{xx}}\right)$$
And note that
$$ \sqrt{S_{xx}}\frac{(B-\beta)}{\sigma} \sim N(0,1) \qquad \text{which is independent of} \qquad \frac{SS_R}{\sigma^2} \sim \chi_{n-2}^2$$
and so we have
$$ \frac{\sqrt{S_{xx}}\frac{(B-\beta)}{\sigma} }{\sqrt{\frac{\frac{SS_R}{\sigma^2}}{n-2}}} = \sqrt{\frac{(n-2)S_{xx}}{SS_R}} (B-\beta) \sim t_{n-2} $$
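As a quick illustration, here is a sketch of testing $H_0: \beta = 0$ with this statistic on simulated data (made-up parameters again). The p-value uses the $t_{n-2}$ distribution, and `scipy.stats.linregress` is included only as a sanity check, since it performs the same two-sided test on the slope.

```python
import numpy as np
from scipy import stats

# Test H0: beta = 0 using sqrt((n-2) * S_xx / SS_R) * (B - 0) ~ t_{n-2}.
# Data are simulated with made-up parameters purely for illustration.
rng = np.random.default_rng(2)
n = 50
x = rng.uniform(0, 10, size=n)
Y = 2.0 + 2.0 * x + rng.normal(0, 1.0, size=n)

xbar, Ybar = x.mean(), Y.mean()
S_xx = np.sum((x - xbar) ** 2)
B = np.sum((x - xbar) * (Y - Ybar)) / S_xx
A = Ybar - B * xbar

SS_R = np.sum((Y - A - B * x) ** 2)              # residual sum of squares
t_stat = np.sqrt((n - 2) * S_xx / SS_R) * (B - 0.0)
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")

# Sanity check against scipy's built-in simple linear regression test.
res = stats.linregress(x, Y)
print(f"scipy: t = {res.slope / res.stderr:.2f}, p = {res.pvalue:.3g}")
```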
Predicting future responses
Suppose we have some specified input value $x_0$. For example, now that you have your glorious model, what is the mean response when the input is $42$, and in what range should a single new observation at that input fall?
There are two different intervals you might want:
1. A confidence interval (CI) for $E[Y \mid x_0]$, the mean response, which is an interval for a parameter
2. A prediction interval (PI) for $Y(x_0)$, a future observation, which is an interval for a random variable
$$ \text{CI} = A+B x_0 \pm \sqrt{\frac{SS_R}{n-2}} \sqrt{\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}\, \, t_{\alpha/2,n-2}$$ $$ \text{PI} = A+B x_0 \pm \sqrt{\frac{SS_R}{n-2}} \sqrt{1+\frac{1}{n} + \frac{(x_0 - \overline{x})^2}{S_{xx}}}\, \, t_{\alpha/2,n-2}$$
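Here is a sketch of computing both intervals at a given $x_0$ from these formulas, again on simulated data with made-up parameters; $x_0 = 42$ just echoes the example above, and the 95% level is an arbitrary choice.

```python
import numpy as np
from scipy import stats

# 95% CI for the mean response E[Y | x0] and 95% PI for a new observation
# Y(x0), from the formulas above. Data and parameters are made up.
rng = np.random.default_rng(3)
n = 50
x = rng.uniform(0, 100, size=n)
Y = 2.0 + 2.0 * x + rng.normal(0, 5.0, size=n)

xbar, Ybar = x.mean(), Y.mean()
S_xx = np.sum((x - xbar) ** 2)
B = np.sum((x - xbar) * (Y - Ybar)) / S_xx
A = Ybar - B * xbar
SS_R = np.sum((Y - A - B * x) ** 2)

x0 = 42.0
conf = 0.95
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 2)
s = np.sqrt(SS_R / (n - 2))                      # estimate of sigma
fit = A + B * x0

half_ci = t_crit * s * np.sqrt(1 / n + (x0 - xbar) ** 2 / S_xx)
half_pi = t_crit * s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / S_xx)
print(f"CI: {fit:.1f} +/- {half_ci:.1f}")
print(f"PI: {fit:.1f} +/- {half_pi:.1f}")
```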
What happens as $n \rightarrow \infty$?
1) Well we would expect the CI to shrink down to a single point, namely $\alpha + \beta x_0$
2) We would expect the PI to turn into $\alpha + \beta x_0 \pm \sigma z_{\alpha/2}$
And this does happen because clearly $\frac{1}{n} \rightarrow 0$ and
$$\frac{(x_0 - \overline{x})^2}{S_{xx}} = \frac{(x_0 - \overline{x})^2}{\sum_{i=1}^n (x_i-\overline{x})^2} = \frac{(x_0 - \overline{x})^2}{ \left(\sum x_i^2 \right) - n\overline{x}^2} \rightarrow 0 \qquad \text{as} \quad n \rightarrow \infty$$
since the numerator stays bounded while $S_{xx}$ keeps growing as we add more (non-identical) observations. Meanwhile $\sqrt{SS_R/(n-2)} \rightarrow \sigma$ and $t_{\alpha/2,n-2} \rightarrow z_{\alpha/2}$, which gives exactly the limits above.
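To see this numerically, the sketch below recomputes the two 95% half-widths at $x_0$ for increasing $n$ on simulated data (made-up parameters as before): the CI half-width heads toward zero while the PI half-width settles near $\sigma z_{\alpha/2}$.

```python
import numpy as np
from scipy import stats

# Watch the 95% interval half-widths at x0 as n grows: the CI half-width
# shrinks toward 0 while the PI half-width settles near sigma * z_{0.025}.
# Everything here is simulated with made-up parameters for illustration.
rng = np.random.default_rng(4)
sigma, x0 = 5.0, 42.0

for n in (10, 100, 1000, 10000):
    x = rng.uniform(0, 100, size=n)
    Y = 2.0 + 2.0 * x + rng.normal(0, sigma, size=n)
    xbar = x.mean()
    S_xx = np.sum((x - xbar) ** 2)
    B = np.sum((x - xbar) * (Y - Y.mean())) / S_xx
    A = Y.mean() - B * xbar
    s = np.sqrt(np.sum((Y - A - B * x) ** 2) / (n - 2))
    t_crit = stats.t.ppf(0.975, df=n - 2)
    ci = t_crit * s * np.sqrt(1 / n + (x0 - xbar) ** 2 / S_xx)
    pi = t_crit * s * np.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / S_xx)
    print(f"n={n:>5}: CI half-width {ci:.3f}, PI half-width {pi:.3f}")

print(f"sigma * z_0.025 = {sigma * stats.norm.ppf(0.975):.3f}")
```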