MA3518

Part 3 - Confidence Interval and Hypothesis Testing


Inferences for mean/proportion

Let’s recap what a confidence interval is.

A confidence interval of level $100(1 - \alpha)\%$ means that we are $100(1 - \alpha)\%$ confident that the true value of the parameter is contained in the interval.

When dealing with confidence intervals, we will often encounter different “types” of situations. Let’s review these.

Confidence Interval on the mean of a Normal Distribution, variance known

Suppose we have a normal distribution with unknown mean $\mu$ and known variance $\sigma^2$.

We have thus a random sample $X_1, X_2, \ldots, X_n$, such that, for all $i$, $$ X_i \sim N(\mu, \sigma^2) $$

with $\mu$ unknown and $\sigma$ a known constant.

We would like a confidence interval for $\mu$.

We know that, $$ \bar{X} = \frac{1}{n} \sum_{i = 1}^n X_i \sim N\left(\mu, \frac{\sigma^2}{n}\right) $$

We can standardize this to, $$ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \sqrt{n} \frac{\bar{X} - \mu}{\sigma} \sim N(0, 1) $$

So, because $Z \sim N(0, 1)$, $$ P(-z_{1 - \alpha/2} \leq Z \leq z_{1 - \alpha/2}) = 1 - \alpha $$

Thus, $$ P\left(-z_{1 - \alpha/2} \leq \sqrt{n} \frac{\bar{X} - \mu}{\sigma} \leq z_{1 - \alpha/2}\right) = 1 - \alpha $$

Re-arranging this, we get, $$ P\left(\bar{X} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}} \leq \mu \leq \bar{X} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha $$

Calling the endpoints $L$ and $U$ (for lower and upper), the interval is finally, $$ [L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right] $$
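In R, this interval is easy to compute by hand. A minimal sketch with made-up data, assuming $\sigma = 2$ is known:

x<-c(4.9, 5.3, 5.1, 4.7, 5.0, 5.2)   # hypothetical sample
sigma<-2                              # assumed known
n<-length(x)
alpha<-0.05
zstar<-qnorm(1-alpha/2)               # z_{1-alpha/2}, about 1.96 for alpha = 0.05
c(mean(x)-zstar*sigma/sqrt(n), mean(x)+zstar*sigma/sqrt(n))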

Confidence Interval on the mean of an arbitrary distribution, variance known

Let us recap the central limit theorem.

The Central Limit Theorem (CLT) implies that, if $n$ is large enough, $$ Z = \sqrt{n} \frac{\bar{X} - \mu}{\sigma} \simeq N(0, 1) $$

Thus, $$ P\left(-z_{1 - \alpha/2} \leq \sqrt{n} \frac{\bar{X} - \mu}{\sigma} \leq z_{1 - \alpha/2}\right) \simeq 1 - \alpha $$

Which yields the (same) interval, $$ [L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right] $$

Confidence interval on the mean of a normal distribution, variance unknown

In the case where the variance is unknown, we need to estimate it from the sample!

Recall the sample variance, $$ S^2 = \frac{1}{n - 1} \sum_{i = 1}^n (X_i - \bar{X})^2 $$

A natural procedure is thus to use the statistic, $$ T = \sqrt{n} \frac{\bar{X} - \mu}{S} $$

In a normal population, the exact distribution of $T$ is $T \sim t_{n - 1}$.

We can write, $$ P(-t_{n - 1;1 - \alpha/2} \leq \sqrt{n} \frac{\bar{X} - \mu}{S} \leq t_{n - 1;1 - \alpha/2}) = 1 - \alpha $$

or, $$ P\left(\bar{X} - t_{n - 1;1 - \alpha/2} \frac{S}{\sqrt{n}} \leq \mu \leq \bar{X} + t_{n - 1;1 - \alpha/2} \frac{S}{\sqrt{n}}\right) = 1 - \alpha $$

Which yields us the interval, $$ [L, U] = \left[\bar{x} - t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}, \bar{x} + t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}\right] $$
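In R, qt gives the $t$ quantile, so the interval can be computed by hand. A sketch with hypothetical data (the built-in t.test(x) reports the same interval):

x<-c(4.9, 5.3, 5.1, 4.7, 5.0, 5.2)   # hypothetical sample
n<-length(x)
alpha<-0.05
tstar<-qt(1-alpha/2, df=n-1)          # t_{n-1;1-alpha/2}
c(mean(x)-tstar*sd(x)/sqrt(n), mean(x)+tstar*sd(x)/sqrt(n))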

In large samples, $$ T \simeq N(0, 1) $$

Consequently, an approximate confidence interval of level $100(1 - \alpha)\%$ for $\mu$ is, $$ [L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{s}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{s}{\sqrt{n}}\right] $$

Confidence Interval on the mean: Summary

Is the population normal?

  • If yes, is $\sigma$ known?
    • If yes, use an **exact $z$-confidence interval**:
      • $[L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right]$
    • If no, use an **exact $t$-confidence interval**:
      • $[L, U] = \left[\bar{x} - t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}, \bar{x} + t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}\right]$
  • If no, use an approximate large sample confidence interval:
    • $[L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right]$
    • or
    • $[L, U] = \left[\bar{x} - z_{1 - \alpha/2} \frac{s}{\sqrt{n}}, \bar{x} + z_{1 - \alpha/2} \frac{s}{\sqrt{n}}\right]$

Confidence Interval on the proportion

The random variable to study is, $$ X = \begin{cases} 1 & \text{If the individual has the characteristic of interest} \newline 0 & \text{If not} \end{cases} $$

$X_1, X_2, \ldots, X_n$ is a set of $n$ independent Bern($p$) random variables.

Thus, $$ Y = \sum_{i = 1}^n X_i \sim B(n, p) $$

and the sample proportion is, $$ \hat{P} = \frac{Y}{n} $$

We also know, by the CLT, that, $$ \sqrt{n} \frac{\hat{P} - p}{\sqrt{p(1 - p)}} \simeq N(0, 1) $$

if $n$ is large.

An approximate two-sided confidence interval of level $100(1 - \alpha)\%$ for $p$ is given by, $$ \left[\hat{p} - z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \hat{p} + z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right] $$

Hypotheses testing for the mean

Let’s recall hypothesis testing. The null hypothesis is usually of the form, $$ H_0 : \mu = \mu_0 $$

The alternative hypothesis can be a two-sided alternative, $$ H_a : \mu \neq \mu_0 $$

or one-sided alternatives, $$ H_a : \mu > \mu_0 \quad \text{or} \quad H_a : \mu < \mu_0 $$

Remember, there are two types of error we can make:

  1. Rejecting $H_0$ when it is true: type I error.
  2. Failing to reject $H_0$ when it is false: type II error.

$$ P(\text{Type I error}) = P(\text{reject } H_0 | H_0 \text{ is true }) = \alpha \newline P(\text{Type II error}) = P(\text{fail to reject } H_0 | H_0 \text{ is false }) = \beta $$

Note that $\beta$ depends on the (unknown) value of $\mu$ under the alternative.

Assume for the moment that the population is normal with known $\sigma$. $$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$

At significance level $\alpha$, we are after two constants $\ell$ and $u$ such that, $$ \alpha = P(\bar{X} \notin [\ell, u] \text{ when } \mu = \mu_0) = P\left(Z \notin \left[\sqrt{n} \frac{\ell - \mu_0}{\sigma}, \sqrt{n} \frac{u - \mu_0}{\sigma}\right]\right) $$

Thus, $$ \sqrt{n} \frac{\ell - \mu_0}{\sigma} = z_{\alpha/2} = -z_{1 - \alpha/2} $$

and, $$ \sqrt{n} \frac{u - \mu_0}{\sigma} = z_{1 - \alpha/2} $$

This yields, $$ \ell = \mu_0 - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}} \newline u = \mu_0 + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}. $$

The decision rule is then, $$ \text{Reject } H_0 \text{ if } \bar{x} \notin [\mu_0 - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \mu_0 + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}] $$

Hypotheses testing for the mean: $p$-value

The $p$-value is the probability that the test statistic will take on a value that is at least as extreme as the observed value when $H_0$ is true (“extreme” is to be understood in the direction of the alternative).

When testing $H_0 : \mu = \mu_0$ against $H_a : \mu \neq \mu_0$, the $p$-value will be the probability of finding the random variable $\bar{X}$ further from $\mu_0$ than the observed $\bar{x}$, that is, $$ \begin{align*} p & = P(\bar{X} \notin [\mu_0 \pm |\bar{x} - \mu_0|] \text{ when } \mu = \mu_0) \newline & = 1 - P(\bar{X} \in [\mu_0 \pm |\bar{x} - \mu_0|] \text{ when } \mu = \mu_0) \end{align*} $$

Let us define, $$ z_0 = \sqrt{n} \frac{\bar{x} - \mu_0}{\sigma} $$

as the “observed value of the test statistic”.

As we know that $Z = \sqrt{n} \frac{\bar{X} - \mu_0}{\sigma} \sim N(0, 1)$, we can write, $$ \begin{align*} p & = 1 - P\left(\sqrt{n} \frac{\bar{X} - \mu_0}{\sigma} \in \left[-\sqrt{n} \frac{|\bar{x} - \mu_0|}{\sigma}, \sqrt{n} \frac{|\bar{x} - \mu_0|}{\sigma}\right]\right) \newline & = 1 - P(Z \in [-|z_0|, |z_0|]) = 2(1 - \Phi(|z_0|)) \end{align*} $$

Operationally, once the $p$-value is computed, we compare it to a predefined significance level $\alpha$ to make a decision: $$ \begin{cases} \text{if } p < \alpha, \text{ reject } H_0 \newline \text{if } p \geq \alpha, \text{ do not reject } H_0 \end{cases} $$
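As a sketch in R, with hypothetical numbers ($n = 25$ observations, known $\sigma = 2$, testing $\mu_0 = 5$ against a two-sided alternative):

xbar<-5.8; mu0<-5; sigma<-2; n<-25; alpha<-0.05   # hypothetical values
z0<-sqrt(n)*(xbar-mu0)/sigma    # observed test statistic, here 2
abs(z0)>qnorm(1-alpha/2)        # TRUE, so reject H0 at level 0.05
2*(1-pnorm(abs(z0)))            # two-sided p-value, about 0.0455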

Hypotheses testing for the mean: one-sided

With $H_a : \mu > \mu_0$ we are after a constant $u$ such that, $$ P(\bar{X} > u \text{ when } \mu = \mu_0) = \alpha $$

As we know that $Z = \sqrt{n} \frac{\bar{X} - \mu_0}{\sigma} \sim N(0, 1)$, the decision rule is reject $H_0$ if $\bar{x} > \mu_0 + z_{1 - \alpha} \frac{\sigma}{\sqrt{n}}$.

Again with $z_0 = \sqrt{n} \frac{\bar{x} - \mu_0}{\sigma}$, the $p$-value is, $$ p = P(\bar{X} > \bar{x} \text{ when } \mu = \mu_0) = P(Z > \sqrt{n} \frac{\bar{x} - \mu_0}{\sigma}) = 1 - \Phi(z_0) $$

With $H_a : \mu < \mu_0$, we are after a constant $\ell$ such that, $$ P(\bar{X} < \ell \text{ when } \mu = \mu_0) = \alpha $$

The decision rule is reject $H_0$ if $\bar{x} < \mu_0 - z_{1 - \alpha} \frac{\sigma}{\sqrt{n}}$.

The $p$-value is, $$ p = P(\bar{X} < \bar{x} \text{ when } \mu = \mu_0) = P(Z < \sqrt{n} \frac{\bar{x} - \mu_0}{\sigma}) = \Phi(z_0) $$
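In R, with z0 computed as in the sketch above, the one-sided $p$-values are one-liners (pnorm is the standard normal CDF $\Phi$):

1-pnorm(z0)   # p-value for H_a: mu > mu0
pnorm(z0)     # p-value for H_a: mu < mu0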

Hypotheses testing for the mean: other cases

Say we have a normal population with unknown standard deviation.

Specifically, for the two-sided test $H_0 : \mu = \mu_0$ against $H_a : \mu \neq \mu_0$, the decision rule is, $$ \text{reject } H_0 \text{ if } \bar{x} \notin [\mu_0 - t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}, \mu_0 + t_{n - 1;1 - \alpha/2} \frac{s}{\sqrt{n}}] $$

and from the observed value of the test statistic, $$ t_0 = \sqrt{n} \frac{\bar{x} - \mu_0}{s} $$

we can compute the $p$-value, $$ p = 1 - P(T \in [-|t_0|, |t_0|]) = 2P(T > |t_0|) \text{ where } T \sim t_{n - 1} $$
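In R, pt is the CDF of the $t$ distribution, so the $p$-value is, as a sketch with hypothetical values:

xbar<-5.8; s<-2.1; mu0<-5; n<-25          # hypothetical values
t0<-sqrt(n)*(xbar-mu0)/s                  # observed test statistic
2*pt(abs(t0), df=n-1, lower.tail=FALSE)   # two-sided p-value, 2P(T > |t0|)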

If the population is non-normal, with known or unknown standard deviation, the approximate large-sample (CLT) decision rule is, $$ \text{reject } H_0 \text{ if } \bar{x} \notin \left[\mu_0 - z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}, \mu_0 + z_{1 - \alpha/2} \frac{\sigma}{\sqrt{n}}\right] $$

or, $$ \text{reject } H_0 \text{ if } \bar{x} \notin \left[\mu_0 - z_{1 - \alpha/2} \frac{s}{\sqrt{n}}, \mu_0 + z_{1 - \alpha/2} \frac{s}{\sqrt{n}}\right] $$

The associated approximate $p$-value will be given by, $$ p = 2(1 - \Phi(|z_0|)), $$

with $z_0 = \sqrt{n} \frac{\bar{x} - \mu_0}{\sigma}$ or $z_0 = \sqrt{n} \frac{\bar{x} - \mu_0}{s}$, respectively.

Hypotheses testing for the proportion

Recall the large-sample confidence interval for $p$, $$ \left[\hat{p} - z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}, \hat{p} + z_{1 - \alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right] $$

For testing $H_0 : p = p_0$ against $H_a : p \neq p_0$, the decision rule at (approximate) level $\alpha$ is, $$ \text{reject } H_0 \text{ if } \hat{p} \notin \left[p_0 - z_{1 - \alpha/2} \sqrt{\frac{p_0(1 - p_0)}{n}}, p_0 + z_{1 - \alpha/2} \sqrt{\frac{p_0(1 - p_0)}{n}}\right] $$

The (approximate) $p$-value for this test is, $$ p = 2(1 - \Phi(|z_0|)) $$

where $z_0 = \sqrt{n} \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)}}$.
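As a sketch in R, using the 53-out-of-100 survey data from the example further below and a hypothetical null value $p_0 = 0.5$ (no continuity correction, so the result differs slightly from prop.test):

n<-100; phat<-53/n; p0<-0.5             # p0 = 0.5 is a hypothetical null value
z0<-sqrt(n)*(phat-p0)/sqrt(p0*(1-p0))   # observed test statistic
2*(1-pnorm(abs(z0)))                    # approximate p-value, about 0.55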

R code for finding $z_{1 - \alpha/2}$

alpha<-0.05
zstar<-qnorm(1-alpha/2)

Example

Suppose 53 people among 100 surveyed are in favor of the proposition. Find the 95% confidence interval for the proportion of people in favor of the proposition.

n<-100
phat<-53/n
SE<-sqrt(phat*(1-phat)/n)
alpha<-0.05
zstar<-qnorm(1 - alpha/2)
c(phat-zstar*SE, phat+zstar*SE)
> [1] 0.4321784 0.6278216

Built-in functions in R for finding the CI of proportion estimate

prop.test(x=53, n=100, conf.level=0.95)
>   1-sample proportions test with continuity correction
>
> data:  53 out of 100, null probability 0.5
> X-squared = 0.25, df = 1, p-value = 0.6171
> alternative hypothesis: true p is not equal to 0.5
> 95 percent confidence interval:
>   0.4280225 0.6296465
> sample estimates:
>   p
> 0.53

Built-in functions in R for finding the CI of mean estimate

x<-c(175, 185, 170, 184, 175)
t.test(x, conf.level=0.90)
> 	One Sample t-test
>
> data:  x
> t = 61.567, df = 4, p-value = 4.169e-07
> alternative hypothesis: true mean is not equal to 0
> 90 percent confidence interval:
>   171.6434 183.9566
> sample estimates:
> mean of x
>  177.8


t.test(x, conf.level=0.90, alt="less")
>   One Sample t-test
>
> data:  x
> t = 61.567, df = 4, p-value = 1
> alternative hypothesis: true mean is less than 0
> 90 percent confidence interval:
>   -Inf 182.2278
> sample estimates:
> mean of x
>  177.8

Inferences for difference of means

It is quite common to be interested in comparing two ‘populations’ with regard to a parameter of interest.

The two ‘populations’ may be:

  • Produced items using an existing and a new technique.
  • Success rates in two groups of individuals.
  • Health test results for patients who received a drug and for patients who received a placebo.

Two-sample test

$X_{11}, X_{12}, \ldots, X_{1n_1}$ is a sample from population 1. $X_{21}, X_{22}, \ldots, X_{2n_2}$ is a sample from population 2.

The samples are independent (i.e., observations in sample 1 are by no means linked to the observations in sample 2, they concern different individuals).

What we would like to know is whether $\mu_1 = \mu_2$ or not.

So, $$ H_0 : \mu_1 = \mu_2 $$

We compute the sample means $\bar{x}_1$ and $\bar{x}_2$.

  • If $\bar{x}_1 \simeq \bar{x}_2$, then $H_0$ is probably acceptable.
  • If $\bar{x}_1$ is considerably different from $\bar{x}_2$, that is evidence that $H_0$ is not true and we are tempted to reject it.

Note that the alternative hypothesis can be, $$ H_1 : \mu_1 \neq \mu_2 \ | \ \text{two-sided alternative} $$

or, $$ H_1 : \mu_1 > \mu_2 \quad \text{or} \quad H_1 : \mu_1 < \mu_2 \ | \ \text{one-sided alternative} $$

We know that, $$ \bar{X_1} = \frac{1}{n_1} \sum_{i = 1}^{n_1} X_{1i} \sim N\left(\mu_1, \frac{\sigma_1^2}{n_1}\right) $$

and, $$ \bar{X_2} = \frac{1}{n_2} \sum_{i = 1}^{n_2} X_{2i} \sim N\left(\mu_2, \frac{\sigma_2^2}{n_2}\right) $$

we deduce the sampling distribution of $\bar{X_1} - \bar{X_2}$, $$ \bar{X_1} - \bar{X_2} \sim N\left(\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right) $$

Now, testing for $H_0 : \mu_1 = \mu_2$ exactly amounts to testing for $H_0 : \mu_1 - \mu_2 = 0$, with $\bar{X_1} - \bar{X_2}$ as an estimator for $\mu_1 - \mu_2$.

Two-sample test: known variances

Suppose that $\sigma_1$ and $\sigma_2$ are known.

For the two-sided test (with $H_a : \mu_1 - \mu_2 \neq 0$), at significance level $\alpha$, the decision rule is, $$ \text{Reject } H_0 \text{ if } \bar{x_1} - \bar{x_2} \notin \left[-z_{1 - \alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, z_{1 - \alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] $$

The $p$-value is, $$ p = 2(1 - \Phi(|z_0|)), $$

where $z_0 = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$.

Similarly, for the one-sided test $H_a : \mu_1 > \mu_2$, the decision rule is, $$ \text{reject } H_0 \text{ if } \bar{x_1} - \bar{x_2} > z_{1 - \alpha} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} $$

and the $p$-value is, $$ p = 1 - \Phi(z_0) $$

while for the one-sided test $H_a : \mu_1 < \mu_2$, the decision rule is, $$ \text{reject } H_0 \text{ if } \bar{x_1} - \bar{x_2} < -z_{1 - \alpha} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} $$

and the $p$-value is, $$ p = \Phi(z_0) $$

A $100(1 - \alpha)\%$ two-sided confidence interval for $\mu_1 - \mu_2$ is, $$ \left[(\bar{x_1} - \bar{x_2}) - z_{1 - \alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, (\bar{x_1} - \bar{x_2}) + z_{1 - \alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] $$

The $100(1 - \alpha)\%$ one-sided confidence intervals for $\mu_1 - \mu_2$ are, $$ \left(-\infty, (\bar{x_1} - \bar{x_2}) + z_{1 - \alpha} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] $$

and, $$ \left[(\bar{x_1} - \bar{x_2}) - z_{1 - \alpha} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}, +\infty\right] $$
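A sketch in R with hypothetical summary data, assuming $\sigma_1$ and $\sigma_2$ are known:

xbar1<-88.8; xbar2<-101.7            # hypothetical sample means
sigma1<-7; sigma2<-6; n1<-6; n2<-6   # assumed known sds and sample sizes
alpha<-0.05
se<-sqrt(sigma1^2/n1 + sigma2^2/n2)
z0<-(xbar1-xbar2)/se                 # observed test statistic
2*(1-pnorm(abs(z0)))                 # two-sided p-value
c((xbar1-xbar2)-qnorm(1-alpha/2)*se, (xbar1-xbar2)+qnorm(1-alpha/2)*se)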

Two-sample test: unknown equal variances

Assume now $\sigma_1 = \sigma_2 = \sigma$, but $\sigma$ is unknown.

We can estimate $\sigma^2$ by the pooled variance estimator, $$ S_{p}^2 = \frac{\sum_{i = 1}^{n_1} \left(X_{1i} - \bar{X_1}\right)^2 + \sum_{i = 1}^{n_2} \left(X_{2i} - \bar{X_2}\right)^2}{n_1 + n_2 - 2} = \frac{(n_1 - 1) S_1^2 + (n_2 - 1) S_2^2}{n_1 + n_2 - 2} $$

where $S_1^2$ and $S_2^2$ are the sample variances of the two samples, $$ S_1^2 = \frac{1}{n_1 - 1} \sum_{i = 1}^{n_1} (X_{1i} - \bar{X_1})^2 $$

and, $$ S_2^2 = \frac{1}{n_2 - 1} \sum_{i = 1}^{n_2} (X_{2i} - \bar{X_2})^2 $$

For the two-sided test (with $H_a : \mu_1 - \mu_2 \neq 0$), at significance level $\alpha$, the decision rule is, $$ \text{reject } H_0 \text{ if } \bar{x_1} - \bar{x_2} \notin \left[-t_{n_1 + n_2 - 2;1 - \alpha/2} s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, t_{n_1 + n_2 - 2;1 - \alpha/2} s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right] $$

The $p$-value is given by, $$ p = 2P(T > |t_0|) $$

with $T \sim t_{n_1 + n_2 - 2}$, and where $t_0$ is the observed value of the test statistic, $$ t_0 = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

This test is known as the two-sample $t$-test.

A $100(1 - \alpha)\%$ two-sided confidence interval for $\mu_1 - \mu_2$ is, $$ \left[(\bar{x_1} - \bar{x_2}) - t_{n_1 + n_2 - 2;1 - \alpha/2} s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, (\bar{x_1} - \bar{x_2}) + t_{n_1 + n_2 - 2;1 - \alpha/2} s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right] $$

Two-sample test: unknown unequal variances

There is no exact result available. An approximate result can be applied, $$ \frac{(\bar{X_1} - \bar{X_2}) - (\mu_1 - \mu_2)}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}} \sim t_{\nu} $$

where the number of degrees of freedom is, $$ \nu = \frac{\left(\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}\right)^2}{\frac{\left(\frac{S_1^2}{n_1}\right)^2}{n_1 - 1} + \frac{\left(\frac{S_2^2}{n_2}\right)^2}{n_2 - 1}} $$

(rounded down to the nearest integer).

This is called Welch’s two-sample $t$-test.
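The degrees-of-freedom formula translates directly into R (a sketch with hypothetical sample summaries; t.test computes $\nu$ automatically when var.equal is left at its default FALSE):

s1<-6.9; s2<-5.8; n1<-6; n2<-6   # hypothetical sample sds and sizes
v1<-s1^2/n1; v2<-s2^2/n2
nu<-(v1+v2)^2/(v1^2/(n1-1)+v2^2/(n2-1))
floor(nu)                        # rounded down to the nearest integer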

Example

Six subjects were given a drug (treatment group, $\mu_2$) and another six subjects a placebo (control group, $\mu_1$). Their reaction time to a stimulus was measured (in ms). Perform a two-sample $t$-test for comparing the means of the treatment and control groups.

Let’s use a one-sided test. $$ H_0 : \mu_1 = \mu_2 \quad \text{vs} \quad H_1 : \mu_1 < \mu_2 $$

control<-c(91, 87, 99, 77, 88, 91)
treat<-c(101, 110, 103, 93, 99, 104)
t.test(control, treat, alternative="less", var.equal=TRUE)
>  Two Sample t-test
>
> data:  control and treat
> t = -3.4456, df = 10, p-value = 0.003136
> alternative hypothesis: true difference in means is less than 0
> 95 percent confidence interval:
>   -Inf -6.082744
> sample estimates:
> mean of x mean of y
> 88.83333 101.66667

We can also do a Welch test,

t.test(control, treat, alternative="less")
>  Welch Two Sample t-test
>
> data:  control and treat
> t = -3.4456, df = 9.4797, p-value = 0.003391
> alternative hypothesis: true difference in means is less than 0
> 95 percent confidence interval:
>   -Inf -6.044949
> sample estimates:
> mean of x mean of y
> 88.83333 101.66667

Paired data for difference of means

The two-sample $t$-test cannot be used when we deal with “before and after” data, or in the many other situations where the data are naturally paired (and thus not independent).

Let $(X_{11}, X_{21}), (X_{12}, X_{22}), \ldots, (X_{1n}, X_{2n})$ be a random sample of $n$ pairs of observations drawn from two subpopulations $X_1$ and $X_2$, with respective means $\mu_1$ and $\mu_2$.

An easy way is just to consider the differences, $$ D_i = X_{1i} - X_{2i} $$

We have just a sample $D_1, D_2, \ldots, D_n$ from a distribution with mean, $$ \mu_D = \mu_1 - \mu_2 $$

Testing for $H_0 : \mu_1 = \mu_2$ is just $H_0 : \mu_D = 0$. This can be accomplished by performing the usual one-sample test for the mean.

So, in R we can,

t.test(x,y, paired=TRUE)
# or
t.test(x-y)

Inferences for variance

Recall the sample variance, $$ S^2 = \frac{1}{n - 1} \sum_{i = 1}^n (X_i - \bar{X})^2 $$

which is a natural estimator for the population variance $\sigma^2$.

In general, little can be said about the distribution of $S^2$. However, when the population is normal, that is $X \sim N(\mu, \sigma^2)$, then, $$ \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n - 1} $$

Let $\chi^2_{\nu;\alpha}$ be the value such that, $$ P(X > \chi^2_{\nu;\alpha}) = \alpha $$

for $X \sim \chi^2_{\nu}$.

As we know that, $$ \frac{(n - 1)S^2}{\sigma^2} \sim \chi^2_{n - 1} $$

we can write, $$ P(\chi^2_{n - 1;1 - \alpha/2} \leq \frac{(n - 1)S^2}{\sigma^2} \leq \chi^2_{n - 1;\alpha/2}) = 1 - \alpha $$

which can be re-arranged to, $$ P\left(\frac{(n - 1)S^2}{\chi^2_{n - 1;\alpha/2}} \leq \sigma^2 \leq \frac{(n - 1)S^2}{\chi^2_{n - 1;1 - \alpha/2}}\right) = 1 - \alpha $$

Which gives us the interval, $$ \left[\frac{(n - 1)S^2}{\chi^2_{n - 1;\alpha/2}}, \frac{(n - 1)S^2}{\chi^2_{n - 1;1 - \alpha/2}}\right] $$
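In R, qchisq gives lower-tail quantiles, so under the upper-tail convention above $\chi^2_{n-1;\alpha/2}$ is qchisq(1-alpha/2, n-1). A sketch with hypothetical data:

x<-c(12.1, 11.6, 12.4, 12.0, 11.8, 12.3)   # hypothetical normal sample
n<-length(x); s2<-var(x); alpha<-0.05
# lower bound divides by the large quantile, upper bound by the small one
c((n-1)*s2/qchisq(1-alpha/2, df=n-1), (n-1)*s2/qchisq(alpha/2, df=n-1))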

Now consider testing $H_0 : \sigma^2 = \sigma_0^2$ against $H_a : \sigma^2 \neq \sigma_0^2$. It is natural to reject $H_0$ whenever $s^2$ is too distant from $\sigma_0^2$.

We are after two constants $\ell$ and $u$ such that, $$ \alpha = P(S^2 \notin [\ell, u] \text{ when } \sigma^2 = \sigma_0^2) = P\left(\frac{(n - 1)S^2}{\sigma_0^2} \notin \left[\frac{(n - 1)\ell}{\sigma_0^2}, \frac{(n - 1)u}{\sigma_0^2}\right]\right) $$

This yields, $$ \ell = \frac{\chi^2_{n - 1;1 - \alpha/2} \sigma_0^2}{n - 1} \newline u = \frac{\chi^2_{n - 1;\alpha/2} \sigma_0^2}{n - 1} $$

The decision rule is then, $$ \text{Reject } H_0 \text{ if } s^2 \notin \left[\frac{\chi^2_{n - 1;1 - \alpha/2} \sigma_0^2}{n - 1}, \frac{\chi^2_{n - 1;\alpha/2} \sigma_0^2}{n - 1}\right] $$

One-sided CI, $$ \left[0, \frac{(n - 1)S^2}{\chi^2_{n - 1;1 - \alpha}}\right] \newline \left[\frac{(n - 1)S^2}{\chi^2_{n - 1;\alpha}}, +\infty\right] $$

One-sided test, $$ \text{For } H_a : \sigma^2 > \sigma_0^2, \text{ reject } H_0 \text{ if } s^2 > \frac{\chi^2_{n - 1;\alpha} \sigma_0^2}{n - 1} \newline \text{For } H_a : \sigma^2 < \sigma_0^2, \text{ reject } H_0 \text{ if } s^2 < \frac{\chi^2_{n - 1;1 - \alpha} \sigma_0^2}{n - 1} $$

The corresponding one-sided $p$-values are, $$ P(S^2 > s^2) = 1 - P\left(\frac{(n - 1) S^2}{\sigma^2_0} \leq \frac{(n - 1) s^2}{\sigma^2_0}\right) = 1 - P\left(\chi^2_{n - 1} \leq \frac{(n - 1) s^2}{\sigma^2_0}\right) \newline P(S^2 < s^2) = P\left(\frac{(n - 1) S^2}{\sigma^2_0} \leq \frac{(n - 1) s^2}{\sigma^2_0}\right) = P\left(\chi^2_{n - 1} \leq \frac{(n - 1) s^2}{\sigma^2_0}\right) $$
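In R, with pchisq as the $\chi^2$ CDF, a sketch with hypothetical values:

n<-6; s2<-0.09; sigma0sq<-0.05   # hypothetical: test sigma_0^2 = 0.05
stat<-(n-1)*s2/sigma0sq
1-pchisq(stat, df=n-1)           # p-value for H_a: sigma^2 > sigma_0^2
pchisq(stat, df=n-1)             # p-value for H_a: sigma^2 < sigma_0^2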

Inferences for ratio of variances/test of equality of variances

Let $X_{11}, X_{12}, \ldots, X_{1n_1}$ be a sample from population 1, and let $X_{21}, X_{22}, \ldots, X_{2n_2}$ be a sample from population 2.

The samples are independent, and we would like to know whether $\sigma_1^2 = \sigma_2^2$ or not.

Define the sample variances, $$ S_1^2 = \frac{1}{n_1 - 1} \sum_{i = 1}^{n_1} (X_{1i} - \bar{X_1})^2 \newline S_2^2 = \frac{1}{n_2 - 1} \sum_{i = 1}^{n_2} (X_{2i} - \bar{X_2})^2 $$

We can use the test statistic, $$ F = \frac{S_1^2}{S_2^2} $$

In general, there is no known exact distribution for $F$. Fortunately, if $X_{1i} \sim N(\mu_1, \sigma_1^2)$ and $X_{2i} \sim N(\mu_2, \sigma_2^2)$, then $F$ has an $F(n_1 - 1, n_2 - 1)$ distribution when the null hypothesis ($\sigma_1^2 = \sigma_2^2$) is true.

We test $H_0 : \sigma_1^2 = \sigma_2^2$ against $H_a : \sigma_1^2 \neq \sigma_2^2$.

It is natural to reject $H_0$ whenever $F$ is too big or too small. Analogously to the $\chi^2$ case, let $F_{\nu_1, \nu_2;\alpha}$ denote the value such that $P(X > F_{\nu_1, \nu_2;\alpha}) = \alpha$ for $X \sim F(\nu_1, \nu_2)$.

The decision rule is, $$ \text{Reject } H_0 \text{ if } F \notin \left[F_{n_1 - 1, n_2 - 1;1 - \alpha/2}, F_{n_1 - 1, n_2 - 1;\alpha/2}\right] $$

where the lower critical value can be obtained from tables via the identity $F_{n_1 - 1, n_2 - 1;1 - \alpha/2} = 1/F_{n_2 - 1, n_1 - 1;\alpha/2}$.

Two-sided CI for $\frac{\sigma_1^2}{\sigma_2^2}$: since $\frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \sim F(n_1 - 1, n_2 - 1)$, we have $$ P\left(F_{n_1 - 1, n_2 - 1;1 - \alpha/2} \leq \frac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2} \leq F_{n_1 - 1, n_2 - 1;\alpha/2}\right) = 1 - \alpha $$

Which yields us the interval, $$ \left[\frac{S_1^2}{S_2^2 F_{n_1 - 1, n_2 - 1;\alpha/2}}, \frac{S_1^2}{S_2^2 F_{n_1 - 1, n_2 - 1;1 - \alpha/2}}\right] $$
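In R, qf gives lower-tail $F$ quantiles, so the upper-tail point $F_{n_1-1,n_2-1;\alpha/2}$ is qf(1-alpha/2, n1-1, n2-1). A sketch reusing the control and treatment data from the earlier example:

control<-c(91, 87, 99, 77, 88, 91)
treat<-c(101, 110, 103, 93, 99, 104)
n1<-length(control); n2<-length(treat); alpha<-0.05
ratio<-var(control)/var(treat)   # observed F statistic
c(ratio/qf(1-alpha/2, n1-1, n2-1), ratio/qf(alpha/2, n1-1, n2-1))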

One-sided CI, $$ \left[0, \frac{S_1^2}{S_2^2 F_{n_1 - 1, n_2 - 1;1 - \alpha}}\right] \newline \left[\frac{S_1^2}{S_2^2 F_{n_1 - 1, n_2 - 1;\alpha}}, +\infty\right] $$

One-sided test, $$ \text{For } H_a : \sigma_1^2 > \sigma_2^2, \text{ reject } H_0 \text{ if } F > F_{n_1 - 1, n_2 - 1;\alpha} \newline \text{For } H_a : \sigma_1^2 < \sigma_2^2, \text{ reject } H_0 \text{ if } F < F_{n_1 - 1, n_2 - 1;1 - \alpha} $$

The $p$-value is, $$ P(F > f) = 1 - P(F(n_1 - 1, n_2 - 1) \leq f) \newline P(F < f) = P(F(n_1 - 1, n_2 - 1) \leq f) $$

where $f = s_1^2/s_2^2$ is the observed value of the test statistic.
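R’s built-in var.test performs this $F$ test directly and also reports a confidence interval for $\sigma_1^2/\sigma_2^2$, e.g. with the control and treat vectors from the sketch above:

var.test(control, treat, conf.level=0.95)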