Lecture 1: Random Samples, Probability Inequalities, Limit Theorems
Population and Random Sample
Definition (Population): A population is the complete collection of all elements (observations, individuals, measurements) that we want to study. The population is characterized by some probability distribution with unknown parameters.
Definition (Random Sample): A random sample of size $n$ from a population is a collection of $n$ random variables $X_1, X_2, \ldots, X_n$ such that:
- Each $X_i$ has the same distribution as the population (identical distribution)
- The random variables $X_1, X_2, \ldots, X_n$ are mutually independent
We denote this as $X_1, X_2, \ldots, X_n \stackrel{\text{iid}}{\sim} F$, where $F$ is the population distribution.
Example: If we want to study the heights of all college students in the US (population), we might randomly select 100 students and measure their heights. The measurements $X_1, X_2, \ldots, X_{100}$ would form a random sample.
Sample Statistics
In statistical inference, we begin by observing data from a random sample. Let’s define some fundamental quantities that we’ll use throughout the course.
Given data $Y = (X_1, X_2, X_3, \ldots, X_n)$ where each $X_i$ is an independent observation from the same distribution, we can define:
Sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Sample variance: $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
These are our most basic statistics - functions of the observed data that help us understand the underlying population.
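As a quick illustration, here is a minimal sketch (using NumPy on simulated data; the specific distribution and sample size are illustrative assumptions) of computing these two statistics directly from the definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=170, scale=10, size=100)  # simulated heights, n = 100

n = len(x)
xbar = x.sum() / n                        # sample mean
s2 = ((x - xbar) ** 2).sum() / (n - 1)    # sample variance (note the n - 1 divisor)

# NumPy equivalents: np.mean(x) and np.var(x, ddof=1)
print(xbar, s2)
```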
Properties of Sample Mean
A natural question arises: what can we say about the behavior of $\bar{X}$? Since $\bar{X}$ is itself a random variable (being a function of random variables), it has its own distribution.
What are the mean and variance of $\bar{X}$?
\[E[\bar{X}] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \mu = E[X]\]
\[V[\bar{X}] = \frac{1}{n^2}\sum_{i=1}^{n} V[X_i] = \frac{\sigma^2}{n}\]
where $\sigma^2 = E[(X - \mu)^2]$. The first equality uses linearity of expectation; the second uses the independence of the $X_i$.
Notice something important: as $n$ increases, $V[\bar{X}]$ decreases! This tells us that larger samples give us more precise estimates of $\mu$.
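A small simulation sketch of this fact (assuming a normal population with $\mu = 5$ and $\sigma = 2$; any finite-variance distribution would behave the same way): the empirical variance of $\bar{X}$ across many repeated samples tracks $\sigma^2/n$.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 5.0, 2.0

for n in [10, 100, 1000]:
    # 20,000 independent samples of size n; one sample mean per row
    xbars = rng.normal(mu, sigma, size=(20_000, n)).mean(axis=1)
    print(n, xbars.var(), sigma**2 / n)  # empirical V[X-bar] vs. theoretical sigma^2 / n
```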
Probability Inequalities
Before we can establish the fundamental limit theorems of statistics, we need some tools to bound the probability that random variables deviate from their expected values. Probability inequalities provide these bounds and are essential for proving convergence results.
Theorem: Markov’s Inequality
Let $Y$ be a non-negative random variable. Then for any $h > 0$:
\[P(Y \geq h) \leq \frac{E[Y]}{h}\]
Proof: Assume $Y$ is a continuous RV with pdf $p(y)$.
\[\begin{aligned} E[Y] &= \int_0^{\infty} y \cdot p(y) \, dy \\ &= \int_0^h y \cdot p(y) \, dy + \int_h^{\infty} y \cdot p(y) \, dy \\ &\geq \int_h^{\infty} y \cdot p(y) \, dy \\ &\geq \int_h^{\infty} h \cdot p(y) \, dy \\ &= h \int_h^{\infty} p(y) \, dy \\ &= h \cdot P(Y \geq h) \end{aligned}\]
Therefore: $E[Y] \geq h \cdot P(Y \geq h)$ ⟹ $P(Y \geq h) \leq \frac{E[Y]}{h}$ □
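As a sanity check, here is a small simulation sketch (assuming an Exponential(1) variable, so $E[Y] = 1$; the distribution is an illustrative choice) comparing the empirical tail probability with the Markov bound $E[Y]/h$:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=1.0, size=100_000)  # E[Y] = 1

for h in [1, 2, 5]:
    empirical = (y >= h).mean()   # estimate of P(Y >= h); exact value is exp(-h)
    markov = y.mean() / h         # Markov bound E[Y] / h
    print(h, empirical, markov)
```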
Markov’s inequality gives us our first probabilistic bound, but it’s quite loose. We can do better if we know something about the variance of our random variable.
Theorem: Chebyshev’s Inequality
Let $Y$ be a random variable with finite non-zero variance $\sigma^2$. Then for any $k > 0$:
\[P(\lvert Y - \mu \rvert \geq k\sigma) \leq \frac{1}{k^2}\]Proof: \(\begin{aligned} P(\lvert Y - \mu \rvert \geq k\sigma) &= P((Y - \mu)^2 \geq k^2\sigma^2) \\ &\leq \frac{E[(Y - \mu)^2]}{k^2\sigma^2} \quad \text{(by Markov's inequality)} \\ &= \frac{\sigma^2}{k^2\sigma^2} \\ &= \frac{1}{k^2} \end{aligned}\) □
Chebyshev’s inequality is often more useful than Markov’s: it applies to any random variable with finite variance (not just non-negative ones), it bounds deviations on both sides of the mean, and the bound decays like $1/k^2$. For instance, the probability that any random variable deviates from its mean by 2 or more standard deviations is at most $1/4 = 0.25$.
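A quick numerical sketch (assuming $Y \sim N(0, 1)$, an illustrative choice) showing that the $1/k^2$ bound holds but can be quite conservative:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=200_000)  # mu = 0, sigma = 1

for k in [1, 2, 3]:
    empirical = (np.abs(y) >= k).mean()  # estimate of P(|Y - mu| >= k * sigma)
    print(k, empirical, 1 / k**2)        # Chebyshev bound 1/k^2
```

For the normal distribution, the exact value at $k = 2$ is about $0.046$, well below the bound of $0.25$.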
Moment Generating Function (MGF)
To get even tighter bounds, we can use moment generating functions. The MGF is a powerful tool that, when it exists, uniquely characterizes a distribution.
For a random variable $X$: \(M_X(t) = E[e^{tX}]\)
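For example, if $X \sim N(\mu, \sigma^2)$, then
\[M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right),\]
which exists for every $t \in \mathbb{R}$.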
\[\frac{dM_X(t)}{dt}\bigg|_{t=0} = E[X]\]
Theorem: Chernoff Bounds
Let $Y$ be a RV whose MGF $M_Y(t)$ exists for $\lvert t \rvert < h$. Then for any real $a$:
\(P(Y \geq a) \leq e^{-at} M_Y(t)\) for $0 < t < h$
\(P(Y \leq a) \leq e^{-at} M_Y(t)\) for $-h < t < 0$
Proof: \(\begin{aligned} P(Y \geq a) &= P(e^{tY} \geq e^{ta}) \quad \text{for } t > 0 \\ &\leq \frac{E[e^{tY}]}{e^{ta}} \quad \text{(by Markov's inequality)} \\ &= e^{-ta} M_Y(t) \end{aligned}\) □
Chernoff bounds can be much tighter than Chebyshev’s inequality, especially for distributions with well-behaved MGFs. The key is to optimize over the choice of $t$.
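As a worked example (for a standard normal, whose MGF is $M_Y(t) = e^{t^2/2}$ for all $t$), optimizing the bound over $t > 0$ for a fixed $a > 0$ gives
\[P(Y \geq a) \leq \min_{t > 0} e^{-at} e^{t^2/2} = e^{-a^2/2},\]
with the minimum attained at $t = a$. This bound decays exponentially in $a$, far faster than the polynomial $1/k^2$ rate from Chebyshev’s inequality.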
Laws of Large Numbers and Central Limit Theorem
Now we’re ready to establish the fundamental limit theorems that justify statistical inference. These theorems tell us what happens to sample statistics as the sample size grows large.
The Hierarchy: WLLN → SLLN → CLT
We’ll study three fundamental results. The SLLN strengthens the WLLN, and the CLT complements both by describing the distribution of the error $\bar{X}_n - \mu$:
- Weak Law of Large Numbers (WLLN): Sample mean converges in probability to population mean
- Strong Law of Large Numbers (SLLN): Sample mean converges almost surely to population mean
- Central Limit Theorem (CLT): Sample mean is approximately normally distributed for large $n$
Weak Law of Large Numbers (WLLN)
The WLLN formalizes our intuition that averages of independent observations should converge to the true population mean. But first, we need to be precise about what “converge” means.
Definition: Convergence in Probability
A sequence of random variables $X_1, X_2, \ldots$ converges in probability to a random variable $X$ if for every $\epsilon > 0$:
\[\lim_{n \to \infty} P(\lvert X_n - X \rvert \geq \epsilon) = 0\]
or equivalently:
\[\lim_{n \to \infty} P(\lvert X_n - X \rvert < \epsilon) = 1\]
Notation: $X_n \xrightarrow{P} X$
Theorem: Weak Law of Large Numbers
Let $X_1, X_2, \ldots$ be i.i.d. random variables such that $E[X_i] = \mu$ and $V[X_i] = \sigma^2 < \infty$. Define $\bar{X}_{n} = \frac{1}{n} \sum_{i=1}^{n} X_{i}$. Then $\bar{X}_n \xrightarrow{P} \mu$.
Proof:
\[\begin{aligned} P(\lvert\bar{X}_n - \mu\rvert \geq \epsilon) &= P((\bar{X}_n - \mu)^2 \geq \epsilon^2) \\ &\leq \frac{E[(\bar{X}_n - \mu)^2]}{\epsilon^2} \quad \text{(by Markov's inequality)} \\ &= \frac{V[\bar{X}_n]}{\epsilon^2} \\ &= \frac{\sigma^2/n}{\epsilon^2} \\ &= \frac{\sigma^2}{n\epsilon^2} \end{aligned}\]
Taking the limit:
\[\lim_{n \to \infty} P(\lvert\bar{X}_n - \mu\rvert \geq \epsilon) \leq \lim_{n \to \infty} \frac{\sigma^2}{n\epsilon^2} = 0\]
Therefore: $\lim_{n \to \infty} P(\lvert\bar{X}_n - \mu\rvert \geq \epsilon) = 0$ □
This is a beautiful result! It says that no matter how small $\epsilon > 0$ we choose, we can make the probability that $\bar{X}_n$ is more than $\epsilon$ away from $\mu$ arbitrarily small by taking $n$ large enough.
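A small simulation sketch of the WLLN (assuming an Exponential(1) population, so $\mu = \sigma = 1$, with $\epsilon = 0.1$; all of these are illustrative choices): the deviation probability shrinks with $n$ and stays below the $\sigma^2/(n\epsilon^2)$ bound from the proof.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, eps = 1.0, 1.0, 0.1  # Exponential(1) has mu = sigma = 1

for n in [100, 400, 1600]:
    # 5,000 replications of the sample mean for each n
    xbars = rng.exponential(scale=1.0, size=(5_000, n)).mean(axis=1)
    empirical = (np.abs(xbars - mu) >= eps).mean()  # P(|X-bar - mu| >= eps)
    bound = min(sigma**2 / (n * eps**2), 1.0)       # Chebyshev bound, capped at 1
    print(n, empirical, bound)
```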
Definition: Consistent Estimator
Let $\hat{\theta}_n$ be an estimator of $\theta$ based on a sample of size $n$. If $\hat{\theta}_n \xrightarrow{P} \theta$, then $\hat{\theta}_n$ is said to be a consistent estimator of $\theta$. In particular, the WLLN says that $\bar{X}_n$ is a consistent estimator of $\mu$.
Example: Consistency of Sample Variance
\[S_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]
\[E[S_n^2] = \sigma^2\]
If $V[S_n^2] \to 0$ as $n \to \infty$, then $S_n^2 \xrightarrow{P} \sigma^2$ (by Chebyshev's inequality applied to $S_n^2$, since $S_n^2$ is unbiased for $\sigma^2$).
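For instance (a standard special case, stated here for concreteness), if the $X_i$ are i.i.d. $N(\mu, \sigma^2)$, then $(n-1)S_n^2/\sigma^2 \sim \chi^2_{n-1}$, so
\[V[S_n^2] = \frac{2\sigma^4}{n-1} \to 0,\]
and $S_n^2$ is consistent for $\sigma^2$.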
Consistency is a fundamental property we want our estimators to have - it means they get better as we collect more data.
Theorem: Continuous Mapping Theorem
Suppose $X_n \xrightarrow{P} X$ and let $h$ be a continuous function. Then $h(X_n) \xrightarrow{P} h(X)$. For example, since $\bar{X}_n \xrightarrow{P} \mu$, taking $h(x) = x^2$ gives $\bar{X}_n^2 \xrightarrow{P} \mu^2$.
Strong Law of Large Numbers (SLLN)
While the WLLN tells us about convergence in probability, the SLLN gives us a stronger form of convergence. The difference is subtle but important.
Definition: Almost Sure Convergence
A sequence of RVs $X_1, X_2, \ldots$ converges almost surely to a random variable $X$ if for every $\epsilon > 0$:
\[P\left(\lim_{n \to \infty} \lvert X_n - X \rvert < \epsilon\right) = 1\]
Notation: $X_n \xrightarrow{a.s.} X$
Theorem: Strong Law of Large Numbers
Let $X_1, X_2, \ldots$ be i.i.d. random variables such that $E[X_i] = \mu$ and $V[X_i] = \sigma^2 < \infty$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for every $\epsilon > 0$:
\[P\left(\lim_{n \to \infty} \lvert\bar{X}_n - \mu\rvert < \epsilon\right) = 1\]
or $\bar{X}_n \xrightarrow{a.s.} \mu$
The SLLN tells us that the sample mean doesn’t just get close to $\mu$ in probability, but that it actually converges to $\mu$ with probability 1. This is a stronger statement than the WLLN.
Central Limit Theorem (CLT)
The Laws of Large Numbers tell us that $\bar{X}_n \to \mu$, but they don’t tell us about the rate of convergence or the distribution of the error $\bar{X}_n - \mu$. The Central Limit Theorem answers both questions.
Definition: Convergence in Distribution
A sequence of random variables $X_1, X_2, \ldots$ is said to converge in distribution to $X$ if:
\[\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\]
at all points where $F_X$ is continuous.
Notation: $X_n \xrightarrow{d} X$
Theorem: Central Limit Theorem
Let $X_1, X_2, \ldots$ be a sequence of i.i.d. RVs whose MGFs exist. Let $E[X_i] = \mu$ and $V[X_i] = \sigma^2$. Define $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Let $G_n(x)$ denote the CDF of the RV:
\[Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}\]
Then for any $x \in \mathbb{R}$:
\[\lim_{n \to \infty} G_n(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy\]
That is: $Z_n \xrightarrow{d} N(0, 1)$, or equivalently $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$
This is remarkable! Regardless of the original distribution of $X_i$, the standardized sample mean converges to a standard normal distribution. This universality is what makes the CLT so powerful in statistics.
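A quick simulation sketch (assuming an Exponential(1) population, an illustrative skewed choice with $\mu = \sigma = 1$): at $n = 50$ the standardized sample mean already behaves approximately like a standard normal, e.g. roughly 95% of its values fall in $(-1.96, 1.96)$.

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 50, 100_000
mu, sigma = 1.0, 1.0  # Exponential(1) has mu = sigma = 1

xbars = rng.exponential(scale=1.0, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbars - mu) / sigma   # standardized sample means Z_n

# Under the normal approximation, about 95% should land in (-1.96, 1.96)
print((np.abs(z) < 1.96).mean())
```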
Theorem: Slutsky’s Theorem
Sometimes we need to combine convergence results. Slutsky’s theorem helps us do this.
If $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{P} a$, where $a$ is a constant, then:
- $X_n + Y_n \xrightarrow{d} X + a$
- $X_n Y_n \xrightarrow{d} aX$
Theorem: Delta Method
Let $Y_n$ be a sequence of RVs that satisfies: \(\sqrt{n}(Y_n - \theta) \xrightarrow{d} N(0, \sigma^2)\)
Let $h$ be a function whose derivative $h'(\theta)$ exists and is non-zero. Then: \(\sqrt{n}(h(Y_n) - h(\theta)) \xrightarrow{d} N(0, \sigma^2[h'(\theta)]^2)\)
The Delta Method is incredibly useful when we want to find the asymptotic distribution of transformations of our estimators. For example, if we have a CLT for $\bar{X}_n$, the Delta Method gives us a CLT for $\log(\bar{X}_n)$ or $\bar{X}_n^2$.
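For example, suppose $\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$ with $\mu > 0$ and take $h(x) = \log x$, so $h'(\mu) = 1/\mu$. The Delta Method then gives
\[\sqrt{n}\left(\log \bar{X}_n - \log \mu\right) \xrightarrow{d} N\!\left(0, \frac{\sigma^2}{\mu^2}\right)\]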
Additional Notes
Point mass/Delta function/Dirac measure: a random variable $X$ s.t. $P(X = \mu) = 1$ (equivalently, $P(X \neq \mu) = 0$)
Sifting property: $\int f(x)\delta_\mu(x)dx = f(\mu)$
Discussion: In practice, $\sigma$ is usually unknown, so we replace it with a consistent estimator $\hat{\sigma} = \sqrt{S_n^2}$. By Slutsky’s theorem, this doesn’t affect the limiting distribution.
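Concretely (a short derivation of this standard consequence): since $S_n \xrightarrow{P} \sigma$ implies $\sigma/S_n \xrightarrow{P} 1$ by the Continuous Mapping Theorem, Slutsky's theorem gives
\[\frac{\sqrt{n}(\bar{X}_n - \mu)}{S_n} = \frac{\sigma}{S_n} \cdot \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1),\]
which justifies the usual large-sample interval $\bar{X}_n \pm z_{\alpha/2}\, S_n/\sqrt{n}$.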
Looking Ahead
In our next lecture, we’ll use these foundational results to study sufficiency - the idea that some statistics contain all the relevant information about unknown parameters. This will lead us toward optimal estimation procedures.
