Central limit theorem

In the last section, we have seen that the distribution of sample means can be estimated even without actually doing the sampling, if the mean and the sd of the population are known. However, this estimate is based on the assumption (or observation) that the population is normally distributed. Normal distribution has a specific, smooth pattern, so random sampling from normally distributed data should also reflect this pattern. On the other hand, if the population is more randomly distributed, random sampling should be more unpredictable.

However, in many cases, the distribution of sample means converges to a distribution similar to normal distribution.

Let us set the population as 10,000 natural numbers randomly selected from 1 to 1000.

population <- sample(c(1:1000), 10000, replace=T)
head(population)

## [1] 747 380 200 150 817 159

mean(population)

## [1] 502.1851

var(population)

## [1] 82667.43

sd(population)

## [1] 287.5195

hist(population, col = "#ff00ff40", border = "#ff00ff", breaks =  seq(0,1000,20), prob=T)
lines(density(population), col = "#ff00ff", lwd = 2 )

If we randomly sample 5 data points from the population and calculate the mean, and repeat this procedure 1000 times:

# Initialize the data set
samplemeans <- c()
for(i in 1:100){  # Repeat the process for each number in 1:100
  # Accumulate the mean of the sample of 10 data points
  samplemeans <- c(samplemeans,mean(sample(population,5,replace=F))); 
  }

mean(samplemeans)

## [1] 500.33

var(samplemeans)

## [1] 12326.13

sd(samplemeans)

## [1] 111.0231

Compare the histograms with density lines, with the dotted line showing a normal distribution N(mean(samplemeans),sd(samplemeans)):

hist(samplemeans, col="#0000ff40", border="#0000ff", breaks=seq(0,1000,20), prob=T)
lines(density(samplemeans), col = "#0000ff", lwd = 2)
hist(population, col="#ff00ff40", border="#ff00ff", prob=T, add=T)
lines(density(population), col = "#ff00ff", lwd = 2 )
curve(dnorm(x,mean(samplemeans),sd(samplemeans)), 0, 1000, col="red", lwd=2, lty=2, add=T)

Isn’t it amazing? The sample means are roughly normally distributed!

Now, what if the population has two bumps?

population <- c(rnorm(5000,330,50),rnorm(5000,660,50))
head(population)

## [1] 310.6671 321.6043 341.7506 380.6666 357.9366 347.0233

mean(population)

## [1] 495.3489

# Variance/SD should be population variance/SD but in this section we simply make use of corrected variance var() and corrected standard deviation sd() because adding either way it won't make a big difference.
var(population)

## [1] 29749.63

sd(population)

## [1] 172.4808

hist(population, col = "#ff00ff40", border = "#ff00ff", breaks =  seq(0,1000,20), prob=T)
lines(density(population), col = "#ff00ff", lwd = 2 )

Now randomly sample 5 data points from the population and calculate the mean, and repeat this procedure 1000 times:

# Initialize the data set
samplemeans <- c()
for(i in 1:100){  # Repeat the process for each number in 1:100
  # Accumulate the mean of the sample of 10 data points
  samplemeans <- c(samplemeans,mean(sample(population,5,replace=F))); 
  # Reduce the counter (if we forget this, the loop will continue endlessly!)
  }

mean(samplemeans)

## [1] 500.8943

var(samplemeans)

## [1] 6097.825

sd(samplemeans)

## [1] 78.08857

Compare the histograms with density lines, with the dotted line showing a normal distribution N(mean(samplemeans),sd(samplemeans)):

hist(samplemeans, col="#0000ff40", border="#0000ff", breaks=seq(0,1000,20), prob=T)
lines(density(samplemeans), col = "#0000ff", lwd = 2)
hist(population, col="#ff00ff40", border="#ff00ff", prob=T, add=T)
lines(density(population), col = "#ff00ff", lwd = 2 )
curve(dnorm(x,mean(samplemeans),sd(samplemeans)), 0, 1000, col="red", lwd=2, lty=2, add=T)

Interesting, isn’t it? The sample means are almost normally distributed, except that they show a small dent at the top. This obviously reflects the two headed distribution of the population.

What if we randomly sample 50 data points (instead of 5) 1000 times?

# Initialize the data set
samplemeans <- c()
for(i in 1:100){  # Repeat the process for each number in 1:100
  # Accumulate the mean of the sample of 10 data points
  samplemeans <- c(samplemeans,mean(sample(population,50,replace=F))); 
  }

mean(samplemeans)

## [1] 495.1821

var(samplemeans)

## [1] 455.0102

sd(samplemeans)

## [1] 21.33097

Compare the histograms with density lines, with the dotted line showing a normal distribution N(mean(samplemeans),sd(samplemeans)):

hist(samplemeans, col="#0000ff40", border="#0000ff", breaks=seq(0,1000,20), prob=T)
lines(density(samplemeans), col = "#0000ff", lwd = 2)
hist(population, col="#ff00ff40", border="#ff00ff", prob=T, add=T)
lines(density(population), col = "#ff00ff", lwd = 2 )
curve(dnorm(x,mean(samplemeans),sd(samplemeans)), 0, 1000, col="red", lwd=2, lty=2, add=T)

Now the sample means are normally distributed.

What if the population follows a poisson distribution?

population <- rpois(10000,1)
head(population)

## [1] 3 2 2 2 1 0

mean(population)

## [1] 0.9946

# Variance/SD should be population variance/SD but in this section we simply make use of corrected variance var() and corrected standard deviation sd() because adding either way it won't make a big difference.
var(population)

## [1] 0.9810689

sd(population)

## [1] 0.9904892

summary(population)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  1.0000  0.9946  2.0000  6.0000

hist(population, col = "#ff00ff40", border = "#ff00ff", breaks =  seq(0,7,0.5), prob=T)

#lines(density(population), col = "#ff00ff", lwd = 2 )

Now randomly sample 5 data points from the population and calculate the mean, and repeat this procedure 1000 times:

# Initialize the data set
samplemeans <- c()
for(i in 1:100){  # Repeat the process for each number in 1:100
  # Accumulate the mean of the sample of 10 data points
  samplemeans <- c(samplemeans,mean(sample(population,5,replace=F))); 
  }

mean(samplemeans)

## [1] 0.948

var(samplemeans)

## [1] 0.1976727

sd(samplemeans)

## [1] 0.444604

Compare the histograms with density lines, with the dotted line showing a normal distribution N(mean(samplemeans),sd(samplemeans)):

hist(samplemeans, col="#0000ff40", border="#0000ff", breaks=seq(0,6,0.1), prob=T)
lines(density(samplemeans), col = "#0000ff", lwd = 2)
hist(population, col="#ff00ff40", border="#ff00ff", prob=T, add=T)
#lines(density(population), col = "#ff00ff", lwd = 2 )
curve(dnorm(x,mean(samplemeans),sd(samplemeans)), 0, 12, col="red", lwd=2, lty=2, add=T)

What if we randomly sample 50 data points (instead of 5) 1000 times?

# Initialize the data set
samplemeans <- c()
for(i in 1:100){  # Repeat the process for each number in 1:100
  # Accumulate the mean of the sample of 10 data points
  samplemeans <- c(samplemeans,mean(sample(population,50,replace=F))); 
  }

mean(samplemeans)

## [1] 0.9838

var(samplemeans)

## [1] 0.02196925

sd(samplemeans)

## [1] 0.1482203

Compare the histograms with density lines, with the dotted line showing a normal distribution N(mean(samplemeans),sd(samplemeans)):

hist(samplemeans, col="#0000ff40", border="#0000ff", breaks=seq(0,6,0.1), prob=T)
lines(density(samplemeans), col = "#0000ff", lwd = 2)
hist(population, col="#ff00ff40", border="#ff00ff", prob=T, add=T)
#lines(density(population), col = "#ff00ff", lwd = 2 )
curve(dnorm(x,mean(samplemeans),sd(samplemeans)), 0, 12, col="red", lwd=2, lty=2, add=T)

Again, the sample means are pretty much normally distributed.

So the conclusion is that, even if the population is not normally distributed, the sample means will be roughly normally distributed. And if the sample size is big enough (like, n=50), they are more or less normally distributed. This latter fact is mathematically proven, and it is called the central limit theorem. In other words, if the sample size is large, we do not have to care about the normality of the population distribution.