This post is a simulation based walkthrough, starting with simple Bernoulli trials and building up to random walks with drift. The random walk section also connects naturally to regression to the mean, which is a related idea worth understanding alongside CLT.
The central limit theorem (CLT) states that, regardless of the data generating distribution, if the sample size is increased, the repeated sampling distribution of the sample mean converges to a normal distribution. This convergence does not preserve the underlying distribution due to the aggregation of random fluctuations.
There is another concept that is closely linked to CLT, which is regression to the mean. When small fluctuations are added after each trial, they cancel each other in the long run, and there will be a course-correction on the particle (subject) to move towards the underlying mean.
The binomial distribution is given by,
\[P(X=x) = \binom{n}{x} \cdot p^x \cdot (1-p)^{n-x} = \frac{n!}{x!(n-x)!} \cdot p^x \cdot (1-p)^{n-x},\] where \(n\) is the number of trials, \(p\) is the probability of success in each trial, with the expected value \(E(x) = n \cdot p\).
Here I will generate data from binomial trials
(size = 1). Each trial will follow a Bernoulli distribution
with a probability of success, ‘p,’ assumed to be uniformly distributed
between 0 and 1.
As a side note, there will be small random fluctuations in the value
of p that follow a beta distribution, i.e., p ~ Beta(1,1).
For example,
trials <- 1e3 # number of observations, or trials
plot_beta <- function() {
replicate(trials, rbinom(n = 1, size = 1, prob = 0.5)) |>
as_tibble() |>
list() |>
list_rbind(names_to = "simulation") |>
ggplot(aes(value)) +
# kernel density
geom_histogram(aes(y = ..density..)) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 3)) +
theme(axis.title = element_blank())
}
p_1 <- plot_beta()
p_2 <- plot_beta()
p_3 <- plot_beta()
gridExtra::grid.arrange(
patchworkGrob(p_1 + p_2 + p_3),
left = "density",
bottom = "prob",
top = "Beta(1,1) Distributions for Bernoulli Trials"
)
Although the beta distribution is not uniform on the unit interval, the binomial distribution resulting from the Bernoulli trials is.
Next is a simulation for Bernoulli outcomes. At each trial, the
function Reduce would sum up the observations
to obtain the sample mean. The replicate function then
repeats this process a specified number of times.
replicate(n = trials, expr = Reduce(x = rbinom(n = trials, size = 1, prob = 0.5), sum)) |>
as_tibble() |>
list() |>
list_rbind(names_to = "simulation") |>
ggplot(aes(value)) +
geom_histogram(aes(y = ..density..),
binwidth = 1
) +
stat_density(
# sd taken from the same calculation
data = tibble(value = rnorm(n = 1e6, mean = 500, sd = 15.8)),
col = "#619CFF",
geom = "line", size = 1
) +
geom_vline(xintercept = 0.5 * trials) +
scale_x_continuous(breaks = scales::breaks_pretty(n = 5)) +
labs(title = "Distribution of Bernoulli Trial Outcomes, blue (~ Gaussian)")
The distribution has the characteristic bell shape even with a small sampling size.
You can expect similar outcomes from the following binomial trials, as compared to the simulation above:
rbinom(n = trials, size = trials, prob = 0.5) |>
as_tibble() |>
list() |>
list_rbind(names_to = "simulation") |>
ggplot(aes(value)) +
geom_histogram(aes(y = ..density..),
binwidth = 1
) +
stat_density(
# sd taken from the same calculation
data = tibble(value = rnorm(n = 1e6, mean = 500, sd = 15.8)),
col = "#619CFF",
geom = "line", size = 1
) +
geom_vline(xintercept = 0.5 * trials) +
scale_x_continuous(breaks = scales::breaks_pretty(n = 5)) +
ggtitle(label = "CLT for Binomial Trial Outcomes, blue (~ Gaussian)")
Lets take this one step further. consider a random walk with a positive drift.
\[y_t = y_{t-1} + μ + e_t,\]
where \(y_t\) is the value at time \(t\), \(y_{t-1}\) is the value at time \(t-1\), \(\mu\) is the drift, which is a constant added to each step, and \(e_t\) is the error term at time \(t\), which is a white noise series. This is a time trend \(tμ\) and a random walk process of \(e_t\).
The next piece of code generates a specified number of Bernoulli
random walk processes with a number of steps. At each step (trial) of
the process, the function rbinom() generates a random value
of either 0 or 1 with a probability of success of 0.5, equivalent to
flipping a coin. These are the random Bernoulli fluctuations that
determine a move (1) or not (0).
Here is a plot with 100 simulations with cross sections as a reference:
simulations <- 100
trials <- 300
lines <- c(10, 50, 100, 200, 300)
map(
1:simulations,
\(i) accumulate(
1:trials,
\(acc, nxt) acc + rbinom(n = 1, size = 1, prob = 0.5)
)
) |>
set_names(paste0("sim", 1:simulations)) |>
map(\(x) tibble(value = x, trial = 1:trials)) |>
list_rbind(names_to = "simulation") |>
ggplot(aes(x = trial, y = value)) +
geom_step(aes(color = simulation), alpha = 0.4) +
geom_vline(xintercept = lines, alpha = 0.5, linetype = "dashed") +
scale_x_continuous(breaks = scales::breaks_pretty(n = 20)) +
labs(
title = "Simulated Random Walks With Drift Using Bernoulli Fluctuations",
subtitle = "Vertical lines represent cross sections of the random walk process",
x = "step (Bernoulli trial)"
) +
guides(col = "none")
Here are the plots for these cross sections:
simulations <- 1e4
trials <- 300
random_walk_sim <- map(
1:simulations,
\(i) accumulate(
1:trials,
\(acc, nxt) acc + rbinom(n = 1, size = 1, prob = 0.5)
)
) |>
set_names(paste0("sim", 1:simulations)) |>
map(\(x) tibble(value = x, trial = 1:trials)) |>
list_rbind(names_to = "simulation") |>
group_by(simulation) |>
filter(trial %in% lines) |>
group_by(trial) |>
mutate(
dist_mean = mean(value),
trial = factor(trial)
)
random_walk_sim |>
ggplot(aes(value, fill = trial)) +
geom_histogram(aes(y = ..density..), binwidth = 1) +
geom_vline(
data = random_walk_sim |> distinct(dist_mean, .keep_all = T),
aes(xintercept = dist_mean, col = trial), alpha = 1, linetype = "dashed"
) +
labs(
title = "Distributions at Cross Sections for Bernoulli Random Walks ",
fill = "Cross Section",
x = "Mean value (0.5 * Bernoulli trials)"
) +
scale_x_continuous(breaks = scales::breaks_pretty(n = 10)) +
guides(col = "none")
At each cross section in the Bernoulli random walk process, the
distribution of the simulated values is expected to follow a normal
distribution, as predicted by the CLT. In our example, initially, the
distributions do not look Gaussian, but a spread about the tails starts
to form. After around 100 steps, the characteristic bell shape appears.
Once the distribution has arrived at this shape, it will get wider and
wider and flatten out.
In the context of a Bernoulli random walk with drift, the mean refers to the expected value of the process at a particular time. The presence of a drift term in a Bernoulli random walk does not change this fact. This means that extreme values are likely to be followed by values that are closer to the mean. This is an example of regression to the mean.