MA3518

Part 2 - Basic Statistical Concepts and Tools

Table of Contents

Basic Statistics

Let’s review some basic statistical concepts and tools.

Sample Mean

The sample mean is defined as, $$ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i $$ where $x_i$ is the $i$-th observation in the sample.

If we’re drawing from a population, the population mean is denoted by $\mathbb{E}(X)$.

In R, we can calculate the sample mean using the mean() function.

> x <- 1:10
> mean(x)
[1] 5.5

> sum(x)/length(x)
[1] 5.5

If we assume the components of the data vector are independent and identically distributed, the sample mean is an unbiased estimator of the population mean.
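As a quick sanity check of the unbiasedness claim, we can simulate many samples and compare the average of the sample means to the population mean (a small sketch; the normal population, sample size, and number of replications are arbitrary choices for illustration):

set.seed(1)                    # for reproducibility
pop_mean <- 5                  # true population mean of the simulated data
sample_means <- replicate(10000, mean(rnorm(20, mean = pop_mean, sd = 2)))
mean(sample_means)             # should be very close to 5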

Sample Median

The sample median $m$ of $x_1, x_2, \ldots, x_n$ is the middle value of the ordered data set.

Let the sorted data be denoted by $x_{(1)} \leq x_{(2)} \leq \ldots \leq x_{(n)}$. Then, $$ m = \begin{cases} x_{(k+1)} & n = 2k + 1 \text{ (odd)} \newline \frac{1}{2} (x_{(k)} + x_{(k+1)}) & n = 2k \text{ (even)} \end{cases} $$

The sample median is simply median() in R. The median is a special kind of quantile, which we will talk about later. It is important to note that the median and mean can differ significantly, especially when the data is skewed.

The median is insensitive to a few unusually large or unusually small observations; this property is called robustness.
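A small illustration of this robustness (the data values are made up for the example):

x <- c(1, 2, 3, 4, 5)
median(x)      # 3
mean(x)        # 3

x_out <- c(1, 2, 3, 4, 500)    # replace the largest value with an outlier
median(x_out)  # still 3
mean(x_out)    # 102, dragged up by the single outlier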

Sample Variance

The sample variance is defined as, $$ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$ where $\bar{x}$ is the sample mean.

The sample variance is an unbiased estimator of the population variance.

This gives us an overall sense of the variation in the data, as the name suggests. NB: the variance is measured in squared units of the data, so it is not directly interpretable.

In R, we can calculate the sample variance using the var() function.
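For example, with the vector x <- 1:10 from before, var() agrees with the formula applied by hand:

x <- 1:10
var(x)                                   # 9.166667
sum((x - mean(x))^2) / (length(x) - 1)   # same value, computed from the definition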

Sample Standard Deviation

The sample standard deviation is the square root of the sample variance, $$ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

Unlike the variance, the standard deviation is in the same unit as the data, so it’s more interpretable.

In R, we can calculate the sample standard deviation using the sd() function.
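In code, sd() is just the square root of var():

x <- 1:10
sd(x)          # 3.02765
sqrt(var(x))   # identical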

If we’re drawing from a population, the population variance is defined as, $$ Var(X) = \mathbb{E}[(X - \mathbb{E}(X))^2] $$ and the population standard deviation is its square root, $\sqrt{Var(X)}$.

Sample Quantile

For $0 \leq p \leq 1$, the $p$-th quantile for the sample is the data at position $\approx pn$ in the sorted data.

When this is not an integer, a weighted average is used.

This value essentially splits the data so that (roughly) $100p\%$ is smaller and $100(1-p)\%$ is larger. The median is the $0.5$ quantile.

The $p$-th quantile is also called the $100p$-th percentile.

For a population, the $p$-th quantile for a distribution is defined as $F^{-1}(p)$, if $F$ is strictly increasing so that the inverse exists. Otherwise the definition is a little bit more complicated.

In R, we can calculate the quantiles using the quantile() function.

quantile(x, 0.25) # 25th percentile
quantile(x, c(0.25, 0.75)) # 25th and 75th percentiles
Quantiles vs. Proportions

For a data vector $x$ we can ask two related but inverse questions.

What proportion of the data is less than or equal to a specified value? Or, for a specified proportion, what value has this proportion of the data less than or equal to it?

The latter question is answered by the quantile function.
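The first question (a proportion for a given value) has no dedicated function, but it is a one-liner; a small sketch of both directions, using x <- 1:10:

x <- 1:10
mean(x <= 7)       # proportion of the data <= 7, here 0.7
quantile(x, 0.7)   # roughly the inverse question: the 0.7 quantile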

The interquartile range (IQR)

We are often interested in the range of the middle 50% of the data; this is the distance between the 75th percentile and the 25th percentile. Thus, the IQR is a single number.

In R we can calculate the IQR using the IQR() function.
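Equivalently, the IQR is the difference between the two quartiles returned by quantile():

x <- 1:10
IQR(x)                                  # 4.5
quantile(x, 0.75) - quantile(x, 0.25)   # same value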

Detecting Outliers

An outlier is an observation that is significantly different from other observations in the data set.

A rule of thumb is:

An observation is an outlier if it is more than $1.5 \times IQR$ from the closest quartile.

Also a good rule to remember is:

An outlier is said to be extreme if it is more than $3 \times IQR$ from the closest quartile.
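A minimal sketch of the $1.5 \times IQR$ rule in R (the data vector is just an example):

x <- c(2, 3, 4, 5, 6, 7, 50)    # 50 looks suspicious
q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
x[x < lower | x > upper]        # flags 50 as an outlier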

Correlation

The Pearson correlation coefficient $r$ of two data vectors $x$ and $y$ is defined as, $$ r = \text{cor}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

The value of $r$ is between -1 and 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.

For a population, when $(x_i, y_i) \overset{iid}{\sim} F_{XY}$, the population correlation coefficient is defined as, $$ \rho = \frac{\mathbb{E}[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))]}{\sqrt{Var(X)Var(Y)}} $$

In R, we can calculate the correlation coefficient using the cor() function. NB: this is the Pearson correlation coefficient; there are other types of correlation coefficients.
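A quick example with a noisy linear relationship (simulated data, so the exact value will vary):

set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)   # y is roughly a linear function of x
cor(x, y)                           # close to 1
cor(y, x)                           # same value: cor() is symmetric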

Properties of the Correlation Coefficient
  • The value of $r$ does not depend on the unit of measurement for each variable.
  • The value of $r$ does not depend on which of the two variables is labeled $x$, i.e. it is symmetric.
  • The value of $r$ is between -1 and 1.
  • The correlation coefficient is:
    • -1 only when all the points lie on a line with a negative slope.
    • 1 only when all the points lie on a line with a positive slope.
  • The value of $r$ is a measure of the extent to which $x$ and $y$ are linearly related.

Modes and Skew

A mode of a distribution is a peak, or a local maximum, in its density.

It is a visual effect and has no rigorous mathematical definition.

A data set can be characterized by its number of modes. A unimodal distribution has a single mode (e.g. the Normal distribution), a bimodal distribution has two modes, and so on.

The tails of a distribution are the very large and very small values of the distribution (NB: this is not the best definition, but it will do for now).

A distribution is called long-tailed (or heavy-tailed) if the data set contains values far from the main body of the distribution.

A distribution is skewed if one tail is significantly heavier or longer than the other. It is good to remember that the opposite of skewness is symmetry.

A distribution with a longer left tail is called left-skewed or negatively skewed; one with a longer right tail is called right-skewed or positively skewed.

Some good rules of thumb:

  1. When a distribution is skewed positively (to the right) the mean is larger than the median.

  2. When a distribution is skewed negatively (to the left) the mean is smaller than the median.

  3. When a distribution is symmetric, the mean and median are equal.
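We can check rules 1 and 2 above with simulated data; a small sketch, using rexp() for a right-skewed sample and its negative for a left-skewed one (with a sample this large the comparisons should come out as stated):

set.seed(1)
right_skew <- rexp(1000)                  # long right tail
mean(right_skew) > median(right_skew)     # TRUE: mean pulled above the median

left_skew <- -right_skew                  # mirror image: long left tail
mean(left_skew) < median(left_skew)       # TRUE: mean pulled below the median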

Distributions

Let’s review some basic distributions.

Gamma Distribution

Definition:

A random variable $X$ is said to be gamma distributed with parameters $(\alpha, \beta)$, denoted as $X \sim Gamma(\alpha, \beta)$, if its PDF is given by, $$ f(x) = \begin{cases} \frac{\beta^{\alpha} e^{-\beta x} x^{\alpha - 1}}{\Gamma(\alpha)} & x \geq 0 \newline 0 & x < 0 \end{cases} $$

The gamma function $\Gamma$ is given by, $$ \Gamma(\alpha) = \int_{0}^{\infty} t^{\alpha - 1} e^{-t} dt. $$

One important property of $\Gamma$ is that $\Gamma(\alpha) = (\alpha - 1) \Gamma(\alpha - 1)$.

For integer values of $\alpha = n$, $\Gamma(n) = (n-1)!$.

When $\alpha = 1$, the density becomes, $$ f(x) = \beta e^{-\beta x} \quad x \geq 0 $$

which we recognize as the exponential distribution. Thus, $Gamma(1, \beta) = Exp(\beta)$.

It follows that if $\alpha = n$ is an integer, $X$ can be written as a sum of $n$ i.i.d. $Exp(\beta)$ random variables.

$\alpha$ is called the shape parameter and $\beta$ is called the rate parameter.

The mean and variance of a gamma distribution are given by, $$ \mathbb{E}(X) = \frac{\alpha}{\beta} \newline Var(X) = \frac{\alpha}{\beta^2} $$

In R, we can generate random variables from a gamma distribution using the rgamma() function.
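A quick check that simulated draws match the theoretical mean and variance (the parameter values are arbitrary):

set.seed(1)
alpha <- 2; beta <- 3
x <- rgamma(100000, shape = alpha, rate = beta)
mean(x)   # close to alpha / beta = 0.667
var(x)    # close to alpha / beta^2 = 0.222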

Beta Distribution

Definition:

A random variable $X$ is said to have a beta distribution if its density is given by, $$ f(x) = \begin{cases} \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} (1 - x)^{\beta - 1} & 0 < x < 1 \newline 0 & \text{otherwise} \end{cases} $$

Where $B(\alpha, \beta)$ is the beta function, $$ B(\alpha, \beta) = \int_{0}^{1} t^{\alpha - 1} (1 - t)^{\beta - 1} dt $$

so that $f(x)$ integrates to 1.

The mean and variance of a beta distribution are given by, $$ \mathbb{E}(X) = \frac{\alpha}{\alpha + \beta} \newline Var(X) = \frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} $$

In R, we can generate random variables from a beta distribution using the rbeta() function.
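Similarly for the beta distribution (parameters chosen arbitrarily for the example):

set.seed(1)
alpha <- 2; beta <- 5
x <- rbeta(100000, shape1 = alpha, shape2 = beta)
mean(x)   # close to alpha / (alpha + beta) = 0.286
var(x)    # close to alpha * beta / ((alpha + beta)^2 * (alpha + beta + 1)) = 0.0255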

Chi-Squared Distribution

The $\chi^2(n)$ or $\chi^2_n$ distribution is a special case of the gamma distribution, with $\alpha = n/2$ and $\beta = 1/2$.

The integer $n$ is the parameter of the distribution and is sometimes called the degrees of freedom of the distribution. If $X \sim \chi^2(n)$, then $\mathbb{E}(X) = n$ and $Var(X) = 2n$.

If $Z_i \overset{iid}{\sim} N(0, 1)$, then $Z_1^2 + Z_2^2 + \ldots + Z_n^2 \sim \chi^2(n)$.

Property:

If $X$ and $Y$ are independent with $\chi^2_n$ and $\chi^2_m$ distributions, then $X + Y \sim \chi^2_{n+m}$.

In R, we can generate random variables from a chi-squared distribution using the rchisq() function.
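A small sketch verifying the sum-of-squared-normals representation against rchisq():

set.seed(1)
n <- 3
z_sq_sums <- replicate(10000, sum(rnorm(n)^2))   # Z_1^2 + ... + Z_n^2
x <- rchisq(10000, df = n)

mean(z_sq_sums); mean(x)   # both close to n = 3
var(z_sq_sums); var(x)     # both close to 2n = 6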

Student’s t-Distribution

We use the t-distribution for confidence intervals (CIs) and hypothesis testing, which we will cover in the next part.

Density, $$ f(x) = \frac{\Gamma(\frac{n+1}{2})}{\Gamma(\frac{n}{2}) \sqrt{n\pi}} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}} $$

Mean and variance, $$ \mathbb{E}(X) = 0 \quad (n > 1) \newline Var(X) = \frac{n}{n-2} \quad n > 2 $$

If $Z \sim N(0, 1)$, $X \sim \chi^2_n$, and $X$ and $Z$ are independent, then $Z / \sqrt{X/n} \sim t_n$.

$n$ is the parameter of the t-distribution and is called the degrees of freedom.

A special case is $t(1)$; it is also called the Cauchy distribution.

Density for Cauchy, $$ f(x) = \frac{1}{\pi(1 + x^2)} $$

In R, we can generate random variables from a t-distribution using the rt() function. For the Cauchy distribution, we can use the rcauchy() function.
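A quick sketch of the construction above and of the $t(1)$ / Cauchy correspondence (simulated, so the numbers will vary):

set.seed(1)
n <- 10
z <- rnorm(10000)
x <- rchisq(10000, df = n)
t_draws <- z / sqrt(x / n)   # same distribution as rt(10000, df = n)
var(t_draws)                 # close to n / (n - 2) = 1.25

rt(5, df = 1)                # t(1) draws ...
rcauchy(5)                   # ... follow the same distribution as Cauchy draws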