Tue 24 September 2019

Comparing means of Normal populations

Written by Hongjinn Park in Articles

The point of this article is to show how the t-test, ANOVA, and regression are all related. The punch line is that all three can be written as a regression equation, and all three can be used to test whether different Normal populations have the same mean.


Turning a t-test into a regression equation

For a t-test, you have RVs $A_i$ that are iid $N(\mu_A,\sigma^2)$ and $B_i$ that are iid $N(\mu_B,\sigma^2)$. To develop an $\alpha$-level significance test for $H_0: \mu_A = \mu_B$, you look at the distribution of the RV $\overline{A} - \overline{B}$, transform it into a $t$ RV, and reject $H_0$ if it lands far from zero (too negative or too positive).

How do we turn this into a regression equation? Start with simple linear regression,

$$Y_i = \beta_0 + \beta_1 x_i + e_i$$

and this can become a t-test by setting up the variables as follows.

Let $\beta_0 = \mu_A$ and $\beta_1 = \mu_B-\mu_A$, and let $x_i$ be a dummy variable that equals $0$ if observation $i$ comes from the control group $A$ and $1$ if it comes from the treatment group $B$. The vector $Y$ is the concatenation of the $A_i$'s and $B_i$'s.

How did we come up with this? Given that the values of $x$ are $0$ and $1$ (any two distinct values would work), just solve the linear system of equations. You want $E[Y_i] = \mu_A$ when $x_i=0$, so $E[Y_i] = \mu_A = \beta_0$. To get $\beta_1$, you want $E[Y_i] = \mu_B$ when $x_i = 1$, so $E[Y_i] = \mu_B = \mu_A + \beta_1$, and therefore $\beta_1 = \mu_B - \mu_A$.

$$Y_i = \mu_A + (\mu_B-\mu_A) x_i + e_i$$

$$\text{When } x_i = 0: \qquad Y_i = \mu_A + e_i \sim N(\mu_A, \sigma^2)$$

$$\text{When } x_i = 1: \qquad Y_i = \mu_B + e_i \sim N(\mu_B, \sigma^2)$$

In vector form, with the $A_i$'s stacked on top of the $B_i$'s, the slope is the scalar $\mu_B - \mu_A$ multiplying the dummy vector:

\[ \begin{bmatrix} A_1 \\ A_2 \\ \vdots \\ A_n \\ B_1 \\ B_2 \\ \vdots \\ B_n \\ \end{bmatrix} = \mu_A \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ 1 \\ 1 \\ \vdots \\ 1 \\ \end{bmatrix} + (\mu_B - \mu_A) \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \\ 1 \\ \vdots \\ 1 \\ \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \\ e_{n+1} \\ e_{n+2} \\ \vdots \\ e_{2n} \\ \end{bmatrix} \]

Recall that $H_0: \mu_A = \mu_B$, which holds exactly when $\beta_1 = \mu_B - \mu_A = 0$. Therefore the hypothesis we are testing is equivalent to $\beta_1 = 0$, which is the hypothesis that there is no regression on the input variable! So now you have proven to yourself that the t-test for the hypothesis that the means of two groups are equal can be written as a regression equation.
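You can check this numerically. Below is a minimal sketch (assuming numpy, scipy, and statsmodels are installed; the means, variance, sample sizes, and seed are all made up for illustration) that runs the two-sample t-test and then fits the dummy-variable regression:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(0)
A = rng.normal(loc=5.0, scale=2.0, size=30)   # control group, mu_A = 5
B = rng.normal(loc=6.0, scale=2.0, size=30)   # treatment group, mu_B = 6

# Classic pooled two-sample t-test of H0: mu_A = mu_B
t_stat, p_val = stats.ttest_ind(A, B)

# The same test as a regression of Y on a 0/1 dummy variable
y = np.concatenate([A, B])
x = np.concatenate([np.zeros(len(A)), np.ones(len(B))])
fit = sm.OLS(y, sm.add_constant(x)).fit()

# The slope's t statistic matches the t-test statistic up to sign
# (beta_1 estimates mu_B - mu_A while ttest_ind uses mean(A) - mean(B)),
# and the two-sided p-values are identical
print(t_stat, p_val)
print(fit.tvalues[1], fit.pvalues[1])
```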

F and t distribution

If $t_v \sim t(df=v)$ then $t_v^2 \sim F(1,v)$, i.e. it is an $F_{1,v}$ RV.

$$ t_v = \frac{Z}{\sqrt{\frac{\chi_v^2}{v}}} $$ $$ t_v^2 = \frac{Z^2}{\frac{\chi_v^2}{v}} = \frac{\frac{\chi_1^2}{1}}{\frac{\chi_v^2}{v}} \sim F_{1,v}$$

and note that the independence of the numerator and denominator required by the $F$ definition checks out, since $Z$ and $\chi_v^2$ are independent in the definition of $t_v$.

It is true that

$$ 2P(t_n > |v|) = P(F_{1,n}> v^2)$$

To see this, note that $P(t_n^2 > v^2) = P(|t_n| > |v|) = P(t_n > |v|) + P(t_n < -|v|) = 2P(t_n > |v|)$ by the symmetry of the $t$ distribution, and since $t_n^2 \sim F_{1,n}$ that completes the proof. It is also true that $2P(Z > |z|) = P(\chi_1^2 > z^2)$, since $Z^2 \sim \chi_1^2$.
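A quick numerical check of both identities (a sketch assuming scipy; the values of $v$ and $n$ are arbitrary):

```python
from scipy import stats

v, n = 2.3, 17   # arbitrary statistic value and degrees of freedom

p_two_sided_t = 2 * stats.t.sf(abs(v), df=n)   # 2 P(t_n > |v|)
p_f = stats.f.sf(v**2, dfn=1, dfd=n)           # P(F_{1,n} > v^2)

p_two_sided_z = 2 * stats.norm.sf(abs(v))      # 2 P(Z > |v|)
p_chi = stats.chi2.sf(v**2, df=1)              # P(chi^2_1 > v^2)

print(p_two_sided_t, p_f)      # equal
print(p_two_sided_z, p_chi)    # equal
```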

Now the following three methods all test the same hypothesis.

Method 1: Comparing two population means, $H_0: \mu_X = \mu_Y$, using a t-test

$$ \text{TS} = \frac{\frac{\overline{X}_n-\overline{Y}_m }{\sigma\sqrt{\frac{1}{n}+\frac{1}{m}}}}{\sqrt{\frac{\frac{(n-1)S_X^2}{\sigma^2}+\frac{(m-1)S_Y^2}{\sigma^2}}{n+m-2}}} = \frac{\frac{\overline{X}_n-\overline{Y}_m }{\sigma\sqrt{\frac{1}{n}+\frac{1}{m}}}}{\frac{1}{\sigma}\sqrt{\frac{(n-1)S_X^2+(m-1)S_Y^2}{n+m-2}}} = \frac{\overline{X}_n-\overline{Y}_m }{\sqrt{\frac{1}{n}+\frac{1}{m}}} \sqrt{\frac{n+m-2}{(n-1)S_X^2+(m-1)S_Y^2}} \sim t_{n+m-2}$$
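As a sanity check, here is the statistic above computed directly from its formula and compared against scipy's pooled (equal-variance) t-test, which is scipy's default. This is a sketch with arbitrary simulated data:

```python
import numpy as np
from scipy import stats

def pooled_t(X, Y):
    """Pooled two-sample t statistic, assuming equal variances."""
    n, m = len(X), len(Y)
    pooled_var = ((n - 1) * X.var(ddof=1) + (m - 1) * Y.var(ddof=1)) / (n + m - 2)
    return (X.mean() - Y.mean()) / np.sqrt(pooled_var * (1 / n + 1 / m))

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=12)
Y = rng.normal(0.5, 1.0, size=15)

print(pooled_t(X, Y), stats.ttest_ind(X, Y).statistic)   # identical
```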

Method 2: Simple regression $Y_i = \alpha + \beta x_i + e_i$, hypothesis that there is no regression, $H_0: \beta = 0$

$$ \frac{\sqrt{S_{xx}}\frac{(B-\beta)}{\sigma} }{\sqrt{\frac{\frac{SS_R}{\sigma^2}}{n-2}}} = \sqrt{\frac{(n-2)S_{xx}}{SS_R}} (B-\beta) \sim t_{n-2} $$
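And here is that slope statistic assembled directly from $S_{xx}$, $SS_R$, and the least squares slope $B$, checked against statsmodels (a sketch; the intercept, slope, and sample size are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 40
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

Sxx = np.sum((x - x.mean()) ** 2)
B = np.sum((x - x.mean()) * (y - y.mean())) / Sxx   # least squares slope
a = y.mean() - B * x.mean()                         # least squares intercept
SSR = np.sum((y - a - B * x) ** 2)                  # residual sum of squares

t_slope = np.sqrt((n - 2) * Sxx / SSR) * B          # statistic for H0: beta = 0

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(t_slope, fit.tvalues[1])                      # agree
```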

Method 3: One-way ANOVA with two levels, $H_0$: there is no row effect

$$\text{TS} = \frac{\frac{SS_b}{m-1}}{\frac{SS_W}{nm-m}} = \frac{SS_b (nm-m)}{SS_W (m-1)} = \frac{\sum_{i=1}^m n (X_{i.} - X_{..})^2 (nm-m)}{\sum_{i=1}^m \sum_{j=1}^n (X_{ij} - X_{i.})^2(m-1)} \sim F_{m-1,\,nm-m}$$

$$SS_b = \sum_{i=1}^m n (X_{i.} - X_{..})^2 \qquad \qquad SS_W = \sum_{i=1}^m \sum_{j=1}^n (X_{ij} - X_{i.})^2 $$

and when this quantity is sufficiently large we can reject the hypothesis that there is no row effect, i.e. that the population means are all the same.
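Here is the F statistic built from $SS_b$ and $SS_W$ and compared against scipy's one-way ANOVA (a sketch with $m=3$ balanced groups of $n=10$ each, chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, n = 3, 10                                    # m levels, n observations each
groups = [rng.normal(0.0, 1.0, size=n) for _ in range(m)]

grand_mean = np.mean(np.concatenate(groups))
SSb = sum(n * (g.mean() - grand_mean) ** 2 for g in groups)
SSW = sum(((g - g.mean()) ** 2).sum() for g in groups)

F = (SSb / (m - 1)) / (SSW / (n * m - m))       # ~ F_{m-1, nm-m} under H0
print(F, stats.f_oneway(*groups).statistic)     # agree
```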

So in summary...

$$ \left( \frac{\overline{X}_n-\overline{Y}_m }{\sqrt{\frac{1}{n}+\frac{1}{m}}} \sqrt{\frac{n+m-2}{(n-1)S_X^2+(m-1)S_Y^2}} \right)^2= \left( \sqrt{\frac{(n-2)S_{xx}}{SS_R}} (B-\beta) \right)^2 = \frac{\sum_{i=1}^m n (X_{i.} - X_{..})^2 (nm-m)}{\sum_{i=1}^m \sum_{j=1}^n (X_{ij} - X_{i.})^2(m-1)} \sim F_{m-1,nm-m}$$

The above is inconsistent in that symbols like $n$ and $m$ do not mean the same thing across the three methods. Correcting for this (assuming a balanced design and only two groups to compare):

$$ \left( \frac{\overline{X}_n-\overline{Y}_n }{\sqrt{\frac{2}{n}}} \sqrt{\frac{2n-2}{(n-1)(S_X^2+S_Y^2)}} \right)^2= \left( \sqrt{\frac{(n-2)S_{xx}}{SS_R}} B \right)^2 = \frac{\sum_{i=1}^2 n (X_{i.} - X_{..})^2 (2n-2)}{\sum_{i=1}^2 \sum_{j=1}^n (X_{ij} - X_{i.})^2} \sim F_{1,2n-2}$$

$$ \left( \frac{X_{1.}-X_{2.}}{\sqrt{\frac{2}{n}}} \sqrt{\frac{2n-2}{\sum_{j=1}^n (X_{1j} - X_{1.})^2+\sum_{j=1}^n (X_{2j} - X_{2.})^2 }} \right)^2= \left( \sqrt{\frac{(n-2)S_{xx}}{SS_R}} B \right)^2 = \frac{\sum_{i=1}^2 n (X_{i.} - X_{..})^2 (2n-2)}{\sum_{i=1}^2 \sum_{j=1}^n (X_{ij} - X_{i.})^2} \sim F_{1,2n-2}$$

$$ \frac{(X_{1.}-X_{2.})^2}{\frac{2}{n}} \cdot \frac{2n-2}{\sum_{j=1}^n (X_{1j} - X_{1.})^2+\sum_{j=1}^n (X_{2j} - X_{2.})^2 } = \left( \sqrt{\frac{(n-2)S_{xx}}{SS_R}} B \right)^2 = \frac{\sum_{i=1}^2 n (X_{i.} - X_{..})^2 (2n-2)}{\sum_{i=1}^2 \sum_{j=1}^n (X_{ij} - X_{i.})^2} \sim F_{1,2n-2}$$

$$ \frac{n(2n-2) \frac{(X_{1.}-X_{2.})^2}{2}}{\sum_{i=1}^2 \sum_{j=1}^n (X_{ij} - X_{i.})^2 } = \left( \sqrt{\frac{(n-2)S_{xx}}{SS_R}} B \right)^2 = \frac{n(2n-2) \sum_{i=1}^2 (X_{i.} - X_{..})^2 }{\sum_{i=1}^2 \sum_{j=1}^n (X_{ij} - X_{i.})^2} \sim F_{1,2n-2}$$

and the LHS is equal to the RHS since

$$ \frac{(X_{1.}-X_{2.})^2}{2} = \sum_{i=1}^2 (X_{i.} - X_{..})^2 = (X_{1.} - X_{..})^2 + (X_{2.} - X_{..})^2 $$
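To spell that out: in a balanced design with two groups, the grand mean is the midpoint of the two group means, so

$$ X_{..} = \frac{X_{1.}+X_{2.}}{2} \implies X_{1.} - X_{..} = \frac{X_{1.}-X_{2.}}{2}, \qquad X_{2.} - X_{..} = \frac{X_{2.}-X_{1.}}{2} $$

$$ \sum_{i=1}^2 (X_{i.} - X_{..})^2 = 2 \cdot \frac{(X_{1.}-X_{2.})^2}{4} = \frac{(X_{1.}-X_{2.})^2}{2} $$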

So I have shown (basically ignoring the middle expression, which is the test statistic for $H_0:\beta = 0$) that the t-test and ANOVA are related.

To sum up, the t-test is a form of regression, and the squared t-test statistic is the ANOVA F statistic. Therefore the t-test, regression, and ANOVA are all related. In fact, both a t-test and an ANOVA can be written as a regression.
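As one final numerical check (a sketch assuming numpy, scipy, and statsmodels; data simulated with made-up parameters), all three approaches produce the same F statistic, with the t-test entering through its square:

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(4)
A = rng.normal(10.0, 3.0, size=25)
B = rng.normal(12.0, 3.0, size=25)

t, _ = stats.ttest_ind(A, B)                   # two-sample t-test
F_anova, _ = stats.f_oneway(A, B)              # one-way ANOVA with two levels

y = np.concatenate([A, B])
x = np.concatenate([np.zeros(25), np.ones(25)])
F_reg = sm.OLS(y, sm.add_constant(x)).fit().fvalue   # regression overall F

print(t ** 2, F_anova, F_reg)                  # all three agree
```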


