18. LLN and CLT#
18.1. Overview#
This lecture illustrates two of the most important theorems of probability and statistics: The law of large numbers (LLN) and the central limit theorem (CLT).
These beautiful theorems lie behind many of the most fundamental results in econometrics and quantitative economic modeling.
The lecture is based around simulations that show the LLN and CLT in action.
We also demonstrate how the LLN and CLT break down when the assumptions they are based on do not hold.
In addition, we examine several useful extensions of the classical theorems, such as
The delta method, for smooth functions of random variables
The multivariate case
Some of these extensions are presented as exercises.
18.2. Relationships#
The CLT refines the LLN.
The LLN gives conditions under which sample moments converge to population moments as sample size increases.
The CLT provides information about the rate at which sample moments converge to population moments as sample size increases.
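In symbols (a compact preview, using notation that is defined formally below): for an IID sample with mean $\mu$ and finite variance $\sigma^2$, the LLN asserts that $\bar{X}_n \to \mu$, while the CLT adds that the deviations are of order $1/\sqrt{n}$, in the sense that
$$\sqrt{n} ( \bar{X}_n - \mu ) \stackrel{d}{\to} N(0, \sigma^2)$$
Both statements are made precise below.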
18.3. LLN#
We begin with the law of large numbers, which tells us when sample averages will converge to their population means.
18.3.1. The Classical LLN#
The classical law of large numbers concerns independent and identically distributed (IID) random variables.
Here is the strongest version of the classical LLN, known as Kolmogorov’s strong law.
Let $X_1, \ldots, X_n$ be independent and identically distributed scalar random variables, with common distribution $F$.
When it exists, let $\mu$ denote the common mean of this sample:
$$\mu := \mathbb{E} X = \int x F(dx)$$
In addition, let
$$\bar{X}_n := \frac{1}{n} \sum_{i=1}^n X_i$$
Kolmogorov’s strong law states that, if $\mathbb{E} |X|$ is finite, then
$$\mathbb{P} \left\{ \bar{X}_n \to \mu \text{ as } n \to \infty \right\} = 1 \tag{18.1}$$
What does this last expression mean?
Let’s think about it from a simulation perspective, imagining for a moment that our computer can generate perfect random samples (which of course it can’t).
Let’s also imagine that we can generate infinite sequences, so that the statement $\bar{X}_n \to \mu$ can be evaluated.
In this setting, (18.1) should be interpreted as meaning that the probability of the computer producing a sequence where $\bar{X}_n \to \mu$ fails to occur is zero.
18.3.2. Proof#
The proof of Kolmogorov’s strong law is nontrivial – see, for example, theorem 8.3.5 of [Dud02].
On the other hand, we can prove a weaker version of the LLN very easily and still get most of the intuition.
The version we prove is as follows: if $X_1, \ldots, X_n$ is IID with $\mathbb{E} X_i^2 < \infty$, then, for any $\epsilon > 0$, we have
$$\mathbb{P} \left\{ | \bar{X}_n - \mu | \geq \epsilon \right\} \to 0 \quad \text{as} \quad n \to \infty \tag{18.2}$$
(This version is weaker because we claim only convergence in probability rather than almost sure convergence, and assume a finite second moment)
To see that this is so, fix $\epsilon > 0$, and let $\sigma^2$ be the variance of each $X_i$.
Recall the Chebyshev inequality, which tells us that
$$\mathbb{P} \left\{ | \bar{X}_n - \mu | \geq \epsilon \right\} \leq \frac{\mathbb{E} [ (\bar{X}_n - \mu)^2 ]}{\epsilon^2} \tag{18.3}$$
Now observe that
$$\mathbb{E} [ (\bar{X}_n - \mu)^2 ]
= \mathbb{E} \left\{ \left[ \frac{1}{n} \sum_{i=1}^n (X_i - \mu) \right]^2 \right\}
= \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n \mathbb{E} (X_i - \mu)(X_j - \mu)
= \frac{1}{n^2} \sum_{i=1}^n \mathbb{E} (X_i - \mu)^2
= \frac{\sigma^2}{n}$$
Here the crucial step is at the third equality, which follows from independence.
Independence means that if $i \neq j$, then the covariance term $\mathbb{E} (X_i - \mu)(X_j - \mu)$ drops out.
As a result, $n^2 - n$ terms vanish, leading us to a final expression that goes to zero in $n$.
Combining our last result with (18.3), we come to the estimate
$$\mathbb{P} \left\{ | \bar{X}_n - \mu | \geq \epsilon \right\} \leq \frac{\sigma^2}{n \epsilon^2} \tag{18.4}$$
The claim in (18.2) is now clear.
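As a quick numerical sanity check (our addition, not part of the original argument), the sketch below compares the empirical frequency of large deviations with the bound $\sigma^2 / (n \epsilon^2)$ from (18.4); the helper name chebyshev_check is hypothetical.
using Distributions, Statistics
# Monte Carlo check of the estimate (18.4): P{ |X̄_n - μ| ≥ ε } ≤ σ² / (n ε²)
function chebyshev_check(distribution = Exponential(1); n = 100, eps = 0.2, k = 10_000)
    mu, sigma = mean(distribution), std(distribution)
    # empirical frequency of a deviation of at least ε across k independent samples
    freq = mean(abs(mean(rand(distribution, n)) - mu) >= eps for _ in 1:k)
    bound = sigma^2 / (n * eps^2)   # right hand side of (18.4)
    return (frequency = freq, bound = bound)
end
chebyshev_check()   # the frequency should sit below the bound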
Of course, if the sequence $X_1, \ldots, X_n$ is correlated, then the cross-product terms $\mathbb{E} (X_i - \mu)(X_j - \mu)$ are not necessarily zero.
While this doesn’t mean that the same line of argument is impossible, it does mean that if we want a similar result then the covariances should be “almost zero” for “most” of these terms.
In a long sequence, this would be true if, for example, $\mathbb{E} (X_i - \mu)(X_j - \mu)$ approached zero when the difference between $i$ and $j$ became large.
In other words, the LLN can still work if the sequence $X_1, \ldots, X_n$ has a kind of “asymptotic independence”, in the sense that correlation falls to zero as variables become further apart in the sequence.
This idea is very important in time series analysis, and we’ll come across it again soon enough.
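As a rough illustration of this point (our own sketch, not from the original lecture), the snippet below simulates a Gaussian AR(1) sequence, whose autocovariances decay geometrically with the distance between indices, and plots its sample mean; the helper name ar1_sample_mean is hypothetical.
using Plots, Random
# Sample mean of the correlated sequence X_{t+1} = ρ X_t + ξ_{t+1}, ξ_t ~ N(0, 1);
# Cov(X_i, X_j) decays like ρ^{|i-j|}, so the sample mean still settles down to 0.
function ar1_sample_mean(; rho = 0.8, n = 2_000)
    x = zeros(n)
    for t in 2:n
        x[t] = rho * x[t - 1] + randn()
    end
    sample_means = cumsum(x) ./ (1:n)
    plot(1:n, sample_means, linewidth = 2, label = "Sample mean")
    return hline!([0], color = :black, linestyle = :dash, label = "Mean", grid = false)
end
ar1_sample_mean()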
18.3.3. Illustration#
Let’s now illustrate the classical IID law of large numbers using simulation.
In particular, we aim to generate some sequences of IID random variables and plot the evolution of $\bar{X}_n$ as $n$ increases.
Below is a figure that does just this.
It shows IID observations from three different distributions and plots $\bar{X}_n$ against $n$ in each case.
The dots represent the underlying observations $X_i$.
In each of the three cases, convergence of $\bar{X}_n$ to $\mu$ occurs as predicted.
using LinearAlgebra, Statistics
using LaTeXStrings, Plots, Distributions, Random, Statistics
function ksl(distribution, n = 100)
title = nameof(typeof(distribution))
observations = rand(distribution, n)
sample_means = cumsum(observations) ./ (1:n)
mu = mean(distribution)
plot(repeat((1:n)', 2),
[zeros(1, n); observations'], label = "", color = :grey, alpha = 0.5)
plot!(1:n, observations, color = :grey, markershape = :circle,
alpha = 0.5, label = "", linewidth = 0)
if !isnan(mu)
hline!([mu], color = :black, linewidth = 1.5, linestyle = :dash,
grid = false,
label = "Mean")
end
plot!(1:n, sample_means, linewidth = 3, alpha = 0.6, color = :green,
label = "Sample mean")
return plot!(title = title)
end
ksl (generic function with 2 methods)
distributions = [TDist(10), Beta(2, 2), Gamma(5, 2), Poisson(4), LogNormal(0.5),
Exponential(1)]
6-element Vector{UnivariateDistribution}:
TDist{Float64}(ν=10.0)
Beta{Float64}(α=2.0, β=2.0)
Gamma{Float64}(α=5.0, θ=2.0)
Poisson{Float64}(λ=4.0)
LogNormal{Float64}(μ=0.5, σ=1.0)
Exponential{Float64}(θ=1.0)
Here is an example for the standard normal distribution
ksl(Normal())
Random.seed!(0); # reproducible results
plot(ksl.(sample(distributions, 3, replace = false))..., layout = (3, 1),
legend = false)
The three distributions are chosen at random from the distributions vector defined above.
18.3.4. Infinite Mean#
What happens if the condition $\mathbb{E} |X| < \infty$ in the statement of the LLN is not satisfied?
This might be the case if the underlying distribution is heavy tailed — the best known example is the Cauchy distribution, which has density
$$f(x) = \frac{1}{\pi (1 + x^2)} \qquad (x \in \mathbb{R})$$
The next figure shows 100 independent draws from this distribution
Random.seed!(0); # reproducible results
ksl(Cauchy())
Notice how extreme observations are far more prevalent here than in the previous figure.
Let’s now have a look at the behavior of the sample mean
Random.seed!(0); # reproducible results
function plot_means(n = 1000)
sample_mean = cumsum(rand(Cauchy(), n)) ./ (1:n)
plot(1:n, sample_mean, color = :red, alpha = 0.6, label = "Sample Mean",
linewidth = 3)
return hline!([0], color = :black, linestyle = :dash, label = "", grid = false)
end
plot_means()
Here we’ve increased $n$ to 1000, but the sequence still shows no sign of converging.
Will convergence become visible if we take $n$ even larger?
The answer is no.
To see this, recall that the characteristic function of the Cauchy distribution is
$$\phi(t) = \mathbb{E} e^{itX} = \int e^{itx} f(x) \, dx = e^{-|t|} \tag{18.5}$$
Using independence, the characteristic function of the sample mean becomes
$$\mathbb{E} e^{i t \bar{X}_n}
= \mathbb{E} \exp \left\{ i \frac{t}{n} \sum_{j=1}^n X_j \right\}
= \mathbb{E} \prod_{j=1}^n \exp \left\{ i \frac{t}{n} X_j \right\}
= \prod_{j=1}^n \mathbb{E} \exp \left\{ i \frac{t}{n} X_j \right\}
= [\phi(t/n)]^n$$
In view of (18.5), this is just $e^{-|t|}$.
Thus, in the case of the Cauchy distribution, the sample mean itself has the very same Cauchy distribution, regardless of $n$.
In particular, the sequence $\bar{X}_n$ does not converge to a point.
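To check this numerically (an added sketch, not part of the original lecture), we can compare the empirical distribution of many simulated sample means with the Cauchy cdf itself; the helper name cauchy_sample_means is hypothetical.
using Distributions, Statistics
# The sample mean of n IID Cauchy draws has the same Cauchy distribution, whatever n is.
function cauchy_sample_means(; n = 1_000, k = 10_000)
    means = [mean(rand(Cauchy(), n)) for _ in 1:k]
    grid = -5.0:1.0:5.0
    empirical = [mean(means .<= x) for x in grid]   # empirical cdf of the sample means
    theoretical = cdf.(Cauchy(), grid)              # Cauchy cdf on the same grid
    return [grid empirical theoretical]             # the last two columns should agree
end
cauchy_sample_means()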
18.4. CLT#
Next we turn to the central limit theorem, which tells us about the distribution of the deviation between sample averages and population means.
18.4.1. Statement of the Theorem#
The central limit theorem is one of the most remarkable results in all of mathematics.
In the classical IID setting, it tells us the following:
If the sequence $X_1, \ldots, X_n$ is IID, with common mean $\mu$ and common variance $\sigma^2 \in (0, \infty)$, then
$$\sqrt{n} ( \bar{X}_n - \mu ) \stackrel{d}{\to} N(0, \sigma^2) \quad \text{as} \quad n \to \infty$$
Here $\stackrel{d}{\to} N(0, \sigma^2)$ indicates convergence in distribution to a centered (i.e., zero mean) normal with standard deviation $\sigma$.
18.4.2. Intuition#
The striking implication of the CLT is that for any distribution with finite second moment, the simple operation of adding independent copies always leads to a Gaussian curve.
A relatively simple proof of the central limit theorem can be obtained by working with characteristic functions (see, e.g., theorem 9.5.6 of [Dud02]).
The proof is elegant but almost anticlimactic, and it provides surprisingly little intuition.
In fact all of the proofs of the CLT that we know are similar in this respect.
Why does adding independent copies produce a bell-shaped distribution?
Part of the answer can be obtained by investigating addition of independent Bernoulli random variables.
In particular, let $X_1, \ldots, X_n$ be IID binary random variables with $\mathbb{P}\{X_i = 0\} = \mathbb{P}\{X_i = 1\} = 0.5$, and let $Y_n := X_1 + \cdots + X_n$.
Think of $X_i = 1$ as a “success”, so that $Y_n$ is the number of successes in $n$ trials.
The next figure plots the probability mass function of $Y_n$ for $n = 1, 2, 4, 8$.
function binomial_pdf(n)
bar(0:n, pdf.(Binomial(n), 0:n),
xticks = 0:10, ylim = (0, 1), yticks = 0:0.1:1,
label = L"Binomial(%$n, 0.5)", legend = :topleft)
end
binomial_pdf (generic function with 1 method)
plot(binomial_pdf.((1, 2, 4, 8))...)
When $n = 1$, the distribution is flat — one success or no successes have the same probability.
When $n = 2$, we can either have 0, 1 or 2 successes.
Notice the peak in probability mass at the mid-point $k = 1$.
The reason is that there are more ways to get 1 success (“fail then succeed” or “succeed then fail”) than to get zero or two successes.
Moreover, the two trials are independent, so the outcomes “fail then succeed” and “succeed then fail” are just as likely as the outcomes “fail then fail” and “succeed then succeed”.
(If there was positive correlation, say, then “succeed then fail” would be less likely than “succeed then succeed”)
Here, already we have the essence of the CLT: addition under independence leads probability mass to pile up in the middle and thin out at the tails.
For $n = 4$ and $n = 8$ we again get a peak at the “middle” value (halfway between the minimum and maximum possible outcomes).
The intuition is the same — there are simply more ways to get these middle outcomes.
If we continue, the bell-shaped curve becomes ever more pronounced.
We are witnessing the binomial approximation of the normal distribution.
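To make the approximation explicit (an added sketch, not in the original text), the next snippet overlays the Binomial(n, 0.5) probability mass function with the normal density that has the same mean $n/2$ and variance $n/4$; the function name binomial_vs_normal is ours.
using Distributions, Plots
# Overlay the Binomial(n, 0.5) pmf with its normal approximation N(n/2, n/4)
function binomial_vs_normal(n = 32)
    d = Binomial(n, 0.5)
    bar(0:n, pdf.(d, 0:n), alpha = 0.5, label = "Binomial($n, 0.5)")
    return plot!(x -> pdf(Normal(mean(d), std(d)), x), 0, n, color = :black,
                 linewidth = 2, label = "Normal approximation")
end
binomial_vs_normal()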
18.4.3. Simulation 1#
Since the CLT seems almost magical, running simulations that verify its implications is one good way to build intuition.
To this end, we now perform the following simulation
1. Choose an arbitrary distribution $F$ for the underlying observations $X_i$.
2. Generate independent draws of $Y_n := \sqrt{n} ( \bar{X}_n - \mu )$.
3. Use these draws to compute some measure of their distribution — such as a histogram.
4. Compare the latter to $N(0, \sigma^2)$.
Here’s some code that does exactly this for the exponential distribution $F(x) = 1 - e^{-\lambda x}$.
(Please experiment with other choices of $F$, but remember that, to conform with the conditions of the CLT, the distribution must have a finite second moment)
using StatsPlots
function simulation1(distribution, n = 250, k = 10_000)
sigma = std(distribution)
y = rand(distribution, n, k)
y .-= mean(distribution)
y = mean(y, dims = 1)
y = √n * vec(y)
density(y, label = "Empirical Distribution")
return plot!(Normal(0, sigma), linestyle = :dash, color = :black,
label = L"Normal(0.00, %$(sigma^2))")
end
simulation1 (generic function with 3 methods)
simulation1(Exponential(0.5))
The fit to the normal density is already tight, and can be further improved by increasing $n$.
You can also experiment with other specifications of $F$.
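For instance (our suggestion, reusing the simulation1 function defined above), try
simulation1(Gamma(2, 2))              # a skewed distribution with finite second moment
simulation1(Exponential(0.5), 1_000)  # a larger n tightens the fit further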
18.4.4. Simulation 2#
Our next simulation is somewhat like the first, except that we aim to track the distribution of $Y_n := \sqrt{n} ( \bar{X}_n - \mu )$ as $n$ increases.
In the simulation we’ll be working with random variables having $\mu = 0$.
Thus, when $n = 1$, we have $Y_1 = X_1$, so the first distribution is just the distribution of the underlying random variable.
For $n = 2$, the distribution of $Y_2$ is that of $(X_1 + X_2) / \sqrt{2}$, and so on.
What we expect is that, regardless of the distribution of the underlying random variable, the distribution of $Y_n$ will smooth out into a bell-shaped curve.
The next figure shows this process for $X_i \sim f$, where $f$ is a convex combination of three shifted beta densities (in the code below, a Beta(2, 2) draw shifted by a random pick from $\{-0.5, 0.6, -1.1\}$).
(Taking a convex combination is an easy way to produce an irregular shape for $f$)
function simulation2(distribution = Beta(2, 2), n = 5, k = 10_000)
y = rand(distribution, k, n)
for col in 1:n
y[:, col] += rand([-0.5, 0.6, -1.1], k)
end
y = (y .- mean(distribution)) ./ std(distribution)
y = cumsum(y, dims = 2) ./ sqrt.(1:n)' # return grid of data: column j holds the draws of Y_j
end
simulation2 (generic function with 4 methods)
ys = simulation2()
plots = [] # would preallocate in optimized code
for i in 1:size(ys, 2)
p = density(ys[:, i], linealpha = i, title = L"n = %$i")
push!(plots, p)
end
plot(plots..., legend = false, size = (900, 500))
As expected, the distribution smooths out into a bell curve as $n$ increases.
18.4.5. The Multivariate Case#
The law of large numbers and central limit theorem work just as nicely in multidimensional settings.
To state the results, let’s recall some elementary facts about random vectors.
A random vector $\mathbf{X}$ is just a sequence of $k$ random variables $(X_1, \ldots, X_k)$.
Each realization of $\mathbf{X}$ is an element of $\mathbb{R}^k$.
A collection of random vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is called independent if, given any $n$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_n$ in $\mathbb{R}^k$, we have
$$\mathbb{P}\{\mathbf{X}_1 \leq \mathbf{x}_1, \ldots, \mathbf{X}_n \leq \mathbf{x}_n \} = \mathbb{P}\{\mathbf{X}_1 \leq \mathbf{x}_1\} \times \cdots \times \mathbb{P}\{\mathbf{X}_n \leq \mathbf{x}_n\}$$
(The vector inequality $\mathbf{X} \leq \mathbf{x}$ means that $X_j \leq x_j$ for $j = 1, \ldots, k$)
Let $\mu_j := \mathbb{E}[X_j]$ for all $j = 1, \ldots, k$.
The expectation $\mathbb{E}[\mathbf{X}]$ of $\mathbf{X}$ is defined to be the vector of expectations:
$$\mathbb{E}[\mathbf{X}] = \begin{pmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_k] \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} =: \boldsymbol{\mu}$$
The variance-covariance matrix of random vector $\mathbf{X}$ is defined as
$$\mathrm{Var}[\mathbf{X}] := \mathbb{E}[ (\mathbf{X} - \boldsymbol{\mu}) (\mathbf{X} - \boldsymbol{\mu})' ]$$
Expanding this out, we get
$$\mathrm{Var}[\mathbf{X}] = \begin{pmatrix} \mathbb{E}[(X_1 - \mu_1)(X_1 - \mu_1)] & \cdots & \mathbb{E}[(X_1 - \mu_1)(X_k - \mu_k)] \\ \vdots & \ddots & \vdots \\ \mathbb{E}[(X_k - \mu_k)(X_1 - \mu_1)] & \cdots & \mathbb{E}[(X_k - \mu_k)(X_k - \mu_k)] \end{pmatrix}$$
The $j, k$-th term is the scalar covariance between $X_j$ and $X_k$.
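As a small numerical illustration (added here, not part of the original lecture), the snippet below estimates $\mathbb{E}[ (\mathbf{X} - \boldsymbol{\mu}) (\mathbf{X} - \boldsymbol{\mu})' ]$ from simulated draws of a two-dimensional random vector and compares it with Julia’s built-in cov; the vector and the helper name vcov_example are our own choices.
using Statistics, Random
# Estimate Var[X] for X = (Z₁, Z₁ + Z₂) with Z₁, Z₂ independent N(0, 1); here Var[X] = [1 1; 1 2]
function vcov_example(; k = 100_000)
    Z1, Z2 = randn(k), randn(k)
    X = [Z1 Z1 .+ Z2]                           # each row is one realization of X
    mu = mean(X, dims = 1)
    Sigma_hat = (X .- mu)' * (X .- mu) ./ k     # sample analogue of E[(X - μ)(X - μ)']
    return Sigma_hat, cov(X, corrected = false) # the two estimates coincide
end
vcov_example()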
With this notation we can proceed to the multivariate LLN and CLT.
Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a sequence of independent and identically distributed random vectors, each one taking values in $\mathbb{R}^k$.
Let $\boldsymbol{\mu}$ be the vector $\mathbb{E}[\mathbf{X}_i]$, and let $\Sigma$ be the variance-covariance matrix of $\mathbf{X}_i$.
Interpreting vector addition and scalar multiplication in the usual way (i.e., pointwise), let
$$\bar{\mathbf{X}}_n := \frac{1}{n} \sum_{i=1}^n \mathbf{X}_i$$
In this setting, the LLN tells us that
$$\mathbb{P} \left\{ \bar{\mathbf{X}}_n \to \boldsymbol{\mu} \text{ as } n \to \infty \right\} = 1$$
Here $\bar{\mathbf{X}}_n \to \boldsymbol{\mu}$ means that $\| \bar{\mathbf{X}}_n - \boldsymbol{\mu} \| \to 0$, where $\| \cdot \|$ is the standard Euclidean norm.
The CLT tells us that, provided $\Sigma$ is finite,
$$\sqrt{n} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \stackrel{d}{\to} N(\mathbf{0}, \Sigma) \quad \text{as} \quad n \to \infty$$
18.5. Exercises#
18.5.1. Exercise 1#
One very useful consequence of the central limit theorem is as follows.
Assume the conditions of the CLT as stated above.
If $g : \mathbb{R} \to \mathbb{R}$ is differentiable at $\mu$ and $g'(\mu) \neq 0$, then
$$\sqrt{n} \{ g(\bar{X}_n) - g(\mu) \} \stackrel{d}{\to} N(0, g'(\mu)^2 \sigma^2) \quad \text{as} \quad n \to \infty$$
This theorem is used frequently in statistics to obtain the asymptotic distribution of estimators — many of which can be expressed as functions of sample means.
(These kinds of results are often said to use the “delta method”)
The proof is based on a Taylor expansion of $g$ around the point $\mu$.
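In outline (an informal sketch we add for completeness, not the full argument): differentiability of $g$ at $\mu$ gives
$$g(\bar{X}_n) - g(\mu) = g'(\mu) (\bar{X}_n - \mu) + o(|\bar{X}_n - \mu|)$$
so that
$$\sqrt{n} \{ g(\bar{X}_n) - g(\mu) \} \approx g'(\mu) \sqrt{n} (\bar{X}_n - \mu) \stackrel{d}{\to} g'(\mu) \, N(0, \sigma^2) = N(0, g'(\mu)^2 \sigma^2)$$
with the remainder term vanishing in probability.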
Taking the result as given, let the distribution $F$ of each $X_i$ be uniform on $[0, \pi / 2]$ and let $g(x) = \sin(x)$.
Derive the asymptotic distribution of $\sqrt{n} \{ g(\bar{X}_n) - g(\mu) \}$ and illustrate convergence in the same spirit as illustrate_clt.jl discussed above.
What happens when you replace $[0, \pi / 2]$ with $[0, \pi]$?
What is the source of the problem?
18.5.2. Exercise 2#
Here’s a result that’s often used in developing statistical tests, and is connected to the multivariate central limit theorem.
If you study econometric theory, you will see this result used again and again.
Assume the setting of the multivariate CLT discussed above, so that $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is a sequence of IID random vectors, each taking values in $\mathbb{R}^k$, $\boldsymbol{\mu} := \mathbb{E}[\mathbf{X}_i]$, and $\Sigma$ is the variance-covariance matrix of $\mathbf{X}_i$.
The convergence
$$\sqrt{n} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \stackrel{d}{\to} N(\mathbf{0}, \Sigma)$$
is valid.
In a statistical setting, one often wants the right hand side to be standard normal, so that confidence intervals are easily computed.
This normalization can be achieved on the basis of three observations.
First, if $\mathbf{X}$ is a random vector in $\mathbb{R}^k$ and $\mathbf{A}$ is a constant $k \times k$ matrix, then $\mathrm{Var}[\mathbf{A} \mathbf{X}] = \mathbf{A} \, \mathrm{Var}[\mathbf{X}] \, \mathbf{A}'$.
Second, by the continuous mapping theorem, if $\mathbf{Z}_n \stackrel{d}{\to} \mathbf{Z}$ in $\mathbb{R}^k$ and $\mathbf{A}$ is a constant $k \times k$ matrix, then $\mathbf{A} \mathbf{Z}_n \stackrel{d}{\to} \mathbf{A} \mathbf{Z}$.
Third, if $\mathbf{S}$ is a $k \times k$ symmetric positive definite matrix, then there exists a symmetric positive definite matrix $\mathbf{Q}$, called the inverse square root of $\mathbf{S}$, such that $\mathbf{Q} \mathbf{S} \mathbf{Q}' = \mathbf{I}$.
Here $\mathbf{I}$ is the $k \times k$ identity matrix.
Putting these things together, your first exercise is to show that if $\mathbf{Q}$ is the inverse square root of $\Sigma$, then
$$\mathbf{Z}_n := \sqrt{n} \mathbf{Q} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \stackrel{d}{\to} \mathbf{Z} \sim N(\mathbf{0}, \mathbf{I})$$
Applying the continuous mapping theorem one more time tells us that
$$\| \mathbf{Z}_n \|^2 \stackrel{d}{\to} \| \mathbf{Z} \|^2$$
Given the distribution of $\mathbf{Z}$, we conclude that
$$n \| \mathbf{Q} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \|^2 \stackrel{d}{\to} \chi^2(k) \tag{18.11}$$
where $\chi^2(k)$ is the chi-squared distribution with $k$ degrees of freedom.
(Recall that $k$ is the dimension of $\mathbf{X}_i$, the underlying random vectors)
Your second exercise is to illustrate the convergence in (18.11) with a simulation.
In doing so, let
$$\mathbf{X}_i := \begin{pmatrix} W_i \\ U_i + W_i \end{pmatrix}$$
where
- each $W_i$ is an IID draw from the uniform distribution on $[-1, 1]$
- each $U_i$ is an IID draw from the uniform distribution on $[-2, 2]$
- $U_i$ and $W_i$ are independent of each other
Hints:
- sqrt(A::AbstractMatrix{<:Number}) computes the square root of A. You still need to invert it.
- You should be able to work out $\Sigma$ from the preceding information.
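For instance (a sketch of the first hint, using a made-up positive definite matrix rather than the $\Sigma$ you are asked to derive):
using LinearAlgebra
S = [2.0 1.0       # an arbitrary symmetric positive definite matrix
     1.0 3.0]
Q = inv(sqrt(S))   # inverse square root of S
Q * S * Q'         # ≈ 2×2 identity matrix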
18.6. Solutions#
18.6.1. Exercise 1#
Here is one solution
function exercise1(distribution = Uniform(0, π / 2); n = 250, k = 10_000, g = sin,
g′ = cos)
mu, sigma = mean(distribution), std(distribution)
y = rand(distribution, n, k)
y = mean(y, dims = 1)
y = vec(y)
error_obs = sqrt(n) .* (g.(y) .- g.(mu))
density(error_obs, label = "Empirical Density")
return plot!(Normal(0, g′(mu) .* sigma), linestyle = :dash,
label = "Asymptotic",
color = :black)
end
exercise1()
What happens when you replace $[0, \pi / 2]$ with $[0, \pi]$?
In this case, the mean $\mu$ of this distribution is $\pi / 2$, and since $g' = \cos$, we have $g'(\mu) = 0$.
Hence the conditions of the delta theorem are not satisfied.
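One way to see this (an added note, using the same Taylor expansion as before): since $g'(\mu) = 0$ but $g''(\mu) = -\sin(\pi/2) = -1 \neq 0$,
$$g(\bar{X}_n) - g(\mu) \approx \frac{1}{2} g''(\mu) (\bar{X}_n - \mu)^2,$$
which is of order $1/n$, so $\sqrt{n} \{ g(\bar{X}_n) - g(\mu) \}$ converges to zero rather than to a nondegenerate normal distribution.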
18.6.2. Exercise 2#
First we want to verify the claim that
$$\sqrt{n} \mathbf{Q} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \stackrel{d}{\to} N(\mathbf{0}, \mathbf{I})$$
This is straightforward given the facts presented in the exercise.
Let
$$\mathbf{Y}_n := \sqrt{n} ( \bar{\mathbf{X}}_n - \boldsymbol{\mu} ) \quad \text{and} \quad \mathbf{Y} \sim N(\mathbf{0}, \Sigma)$$
By the multivariate CLT and the continuous mapping theorem, we have
$$\mathbf{Q} \mathbf{Y}_n \stackrel{d}{\to} \mathbf{Q} \mathbf{Y}$$
Since linear combinations of normal random variables are normal, the vector $\mathbf{Q} \mathbf{Y}$ is also normal.
Its mean is clearly $\mathbf{0}$, and its variance-covariance matrix is
$$\mathrm{Var}[\mathbf{Q} \mathbf{Y}] = \mathbf{Q} \, \mathrm{Var}[\mathbf{Y}] \, \mathbf{Q}' = \mathbf{Q} \Sigma \mathbf{Q}' = \mathbf{I}$$
In conclusion, $\mathbf{Q} \mathbf{Y}_n \stackrel{d}{\to} \mathbf{Q} \mathbf{Y} \sim N(\mathbf{0}, \mathbf{I})$, which is what we aimed to show.
Now we turn to the simulation exercise.
Our solution is as follows
function exercise2(; n = 250, k = 50_000, dw = Uniform(-1, 1), du = Uniform(-2, 2))
vw = var(dw)
vu = var(du)
Sigma = [vw vw
vw vw+vu]
Q = inv(sqrt(Sigma))
function generate_data(dw, du, n)
dw = rand(dw, n)
X = [dw dw + rand(du, n)]
return sqrt(n) * mean(X, dims = 1)
end
X = mapreduce(x -> generate_data(dw, du, n), vcat, 1:k)
X = Q * X'
X = sum(abs2, X, dims = 1)
X = vec(X)
density(X, label = "", xlim = (0, 10))
return plot!(Chisq(2), color = :black, linestyle = :dash,
label = "Chi-squared with 2 degrees of freedom", grid = false)
end
exercise2()