Thu 21 February 2019

Notes on Weighted Linear Regression

Written by Hongjinn Park in Articles

Simple linear regression makes a big assumption: that the error term has constant variance no matter the input level. But this doesn't always make sense, and I find the following example about driving distances very intuitive.


In the general case we have $Y_i = \alpha + \beta x_i + e_i$ where $\forall i$ we have $e_i \sim N(0,\sigma^2)$.

As you can see in this model (simple linear regression), the error term $e_i$ has the same distribution no matter the input level $x_i$.

Sometimes this doesn't make sense. For example, let's say you want a model for commute time based on distance traveled. The variance in travel time over 1 foot is tiny compared to the variance over 500 miles. Therefore the amount of variance (in the distribution of $e_i$, and consequently of $Y_i$, since $Var(e_i)=Var(Y_i)$ for a fixed $x_i$) depends on the input level $x_i$.
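To make this concrete, here's a minimal simulation sketch of such data. The parameter values (`alpha`, `beta`, `sigma`) are made up for illustration, and it uses the variance structure $Var(e_i) = x_i \sigma^2$ that comes up below:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta, sigma = 5.0, 1.2, 0.5  # hypothetical "true" parameters
x = rng.uniform(1, 500, size=200)   # distances traveled

# Heteroscedastic errors: Var(e_i) = x_i * sigma^2, so the standard
# deviation grows like sqrt(x_i)
e = rng.normal(0, sigma * np.sqrt(x))
Y = alpha + beta * x + e            # commute times
```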

In the commute time model we would want to favor or add weight to the data points $(x_i, Y_i)$ when $x_i$ is lower.

The data points where $x_i$ is big are subject to higher variance, or noise, which distorts the underlying relationship. And the underlying relationship, the values of $\alpha$ and $\beta$, is the end goal and what we're trying to determine.

At this point you're thinking, okay, the flaw in the commute model makes sense, but how the hell am I supposed to know the variance as a function of the input? That is, $e_i \sim N(0,f(x_i))$ so $Var(Y_i) = \sigma_{Y_i}^2 = f(x_i)$.

What the heck is $f$?

Well, we can make this more doable by using an assumption: that the variance is inversely proportional to a known weight $w_i$ determined by the input. That is, $Var(Y_i) = \frac{\sigma^2}{w_i}$, so $e_i \sim N(0,\frac{\sigma^2}{w_i})$.

The book example for commute time has $w_i = \frac{1}{x_i}$ and so $Var(Y_i) = \frac{\sigma^2}{\frac{1}{x_i}} = x_i \sigma^2$ so $e_i \sim N(0, x_i \sigma^2)$

I'm not totally clear on it, but how they came to this is logical. Let's say you're traveling 500 miles. There's going to be much bigger variance than traveling 1 mile. But say you chop up the 500 mile journey into 1 mile increments, so $Y = Y_1 + ... + Y_{500}$. If the increments are independent with equal variance, then $Var(Y) = Var(Y_1) + ... + Var(Y_{500}) = 500 Var(Y_1)$. So the variance grows linearly with the distance.
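A quick sanity check of that additivity argument (assuming the 1-mile legs really are independent with a common variance; the per-mile mean and standard deviation here are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# 100,000 simulated 500-mile trips, each the sum of 500 independent 1-mile legs
legs = rng.normal(loc=2.0, scale=0.3, size=(100_000, 500))  # minutes per mile
trips = legs.sum(axis=1)

print(trips.var())             # ~ 500 * 0.3**2 = 45
print(500 * legs[:, 0].var())  # ~ 45 as well
```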

Therefore, instead of minimizing the vanilla $SS_R$, which would yield the regular estimators $A$ and $B$, we want to minimize

$$SS_R = \sum_{i=1}^n \frac{1}{x_i}(Y_i-A-B x_i)^2$$

Note here that when $x_i$ is small, the weighted squared residual $\frac{(Y_i-A-B x_i)^2}{x_i}$ is bigger. So you want to choose an $A$ and $B$ that fit these kinds of points better.

When $x_i$ is very big, the weighted squared residual is smaller. Since we're trying to minimize the sum of squared residuals, it's not as important that $A$ and $B$ fit these kinds of points.
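Here's that objective written out as code, continuing the simulated `x` and `Y` from the sketch above:

```python
import numpy as np

def weighted_ssr(A, B, x, Y, w):
    """Weighted sum of squared residuals: sum_i w_i * (Y_i - A - B*x_i)^2."""
    resid = Y - A - B * x
    return np.sum(w * resid ** 2)

# The book's weights for the commute model: w_i = 1 / x_i.
# A point with small x_i gets a large weight, so its residual counts for more.
w = 1.0 / x
```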

General case

Let's say $Var(Y_i) = \frac{\sigma^2}{w_i}$. Then we want to minimize

$$\sum_{i=1}^n \frac{(Y_i-A-B x_i)^2}{Var(Y_i)} = \frac{1}{\sigma^2}\sum_{i=1}^n w_i(Y_i-A-B x_i)^2$$

On the LHS you see that if the variance of a data point is large, we weight that point less, since we want to minimize the above, and if the variance is small, we weight it more. So $A$ and $B$ will conform more to the points that have smaller variances, in an effort to minimize the above expression.

Chugging right along, we take the partial derivatives and set them equal to zero,

$$ \frac{\partial{SS_R}}{\partial{A}} = -2 \sum_{i=1}^n w_i(Y_i-A-B x_i) = 0$$ $$ \frac{\partial{SS_R}}{\partial{B}} = -2 \sum_{i=1}^n w_i x_i(Y_i-A-B x_i) = 0$$

and after relatively simple manipulation we get the two normal equations,

$$\sum_{i=1}^n w_i Y_i = A \sum_{i=1}^n w_i + B \sum_{i=1}^n w_i x_i $$ $$\sum_{i=1}^n w_i x_i Y_i = A \sum_{i=1}^n w_i x_i + B \sum_{i=1}^n w_i x_i^2$$

You can solve for $A$ and $B$ using matrix algebra.
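As a sketch, here are the two normal equations set up as a 2x2 linear system and solved with numpy, reusing `x`, `Y`, and `w` from the earlier snippets:

```python
import numpy as np

# Normal equations as a 2x2 system M @ [A, B] = v
M = np.array([[np.sum(w),     np.sum(w * x)],
              [np.sum(w * x), np.sum(w * x ** 2)]])
v = np.array([np.sum(w * Y), np.sum(w * x * Y)])

A_hat, B_hat = np.linalg.solve(M, v)
print(A_hat, B_hat)  # should land near the true alpha and beta from the simulation
```

Equivalently, with a design matrix $X$ whose columns are a constant and $x_i$, and $W = diag(w_i)$, this is just solving $(X^T W X)b = X^T W Y$, which is what library implementations of weighted least squares (e.g. statsmodels' `WLS`, if I recall correctly) do under the hood.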

You can also get an expression for $B$ and then $A$ in terms of $x_i$, $Y_i$, and $w_i$, but honestly the matrix algebra route might be better.
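For reference, grinding through the algebra on the two normal equations gives

$$B = \frac{\sum_{i=1}^n w_i \sum_{i=1}^n w_i x_i Y_i - \sum_{i=1}^n w_i x_i \sum_{i=1}^n w_i Y_i}{\sum_{i=1}^n w_i \sum_{i=1}^n w_i x_i^2 - \left(\sum_{i=1}^n w_i x_i\right)^2}$$ $$A = \frac{\sum_{i=1}^n w_i Y_i - B \sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}$$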


