R Sampling

From ECLR
Jump to: navigation, search

Sampling Examples

In this document we demonstrate the workings of the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT). The example we are working with is that of sampling from the population of the UK and asking for the respondent's age. here is an illustration of the UK's age distribution:

UKagedistribution.png

As we do not have the actual age data for the entire distribution available (although one could use census data and get pretty close) we will simulate a set 65 Million (app. size of the UK population) ages.

The following function does this random simulation. there is really no need for you to understand what is happening in there, unless, of course, you want to learn how to use a self-defined function (UKpop_ages) in R (an extremely useful skill!).

UKpop_ages <- function(pop) {
  agebins <- 1:100
  n <- length(agebins)
  ages <- rep(0, pop)
    # define probabilities to replicate UK age distribution
  prob <- c(seq(4.6, 3.58, length.out=9),seq(3.6, 4.5, length.out=12),
            seq(4.5, 4.7, length.out=10),4.60,4.4,4.2,4,
            seq(4.0, 4.8, length.out=12),seq(4.8, 3.8, length.out=14),
            3.75,3.8,3.8,3.9,3.85,4,3.3,3.35,3.25,3.1,
            seq(2.9, 0.1, length.out=22),seq(0.09, 0.05, length.out=7))
    # rescale to represent probabilities
  prob <- prob/sum(prob)
  cumprob <- cumsum(prob)
  sel <- runif(pop, min = 0, max = 1)
  
  for (i in 1:pop){

    ages[i] <- sum(sel[i]>cumprob)
    
  }
  return(ages)
}

This function has been calibrated to mimic the UK population age structure which has an average age of 40. By calling UKpop_ages(n) we can draw a sample of size n from the UK population.

Demonstration of the Law of Large Numbers (LLN)

Now we call this function and draw a random sample of 2 observations and save it into a variable ages.

ages <- UKpop_ages(2)
print(ages)
## [1] 40 44
x <- 2  # this is in preparation for the following graph
y <- mean(ages)

You can see that the sample mean 42 is not really close to the population mean of 40 (if it is, it is due to luck). Let's see what happens as we add more and more observations to our sample. We start out with the two observations we have and keep adding until we have a sample size of 1000.

As we go we will plot the sample mean at every stage

plot.new()
# convert factor to numeric for convenience 
nmax <- 500

# get the range for the x and y axis 
xrange <- c(2,nmax) 
yrange <- c(10,90) 

for (i in 3:nmax){

  new_obs <- UKpop_ages(1)  # draw one extra observation
  ages <- c(ages,new_obs)
  x <- c(x, i)           # add the new sample size to vector x
  y <- c(y, mean(ages))  # add the new sample mean

}

# set up the plot 
plot(x, y, type="l", main="Law of Large Numbers", xlab = "Sample Size (n)", ylab = "Sample mean")
Sampling LLN.jpeg

As you can see the sample mean can deviate substantially from the population mean for small sample sizes, but as we increase the sample size we will get closer to the population mean of 40. The LLN basically states that we can get arbitrarily close if we just increase the sample size sufficiently.

In some sense the LLN thinks about what happens to the sample mean of one sample as it increases in sample size.

Demonstration of the Central Limit Theorem (CLT)

The CLT makes statements about the distribution of sample means as the sample size increases. We are now explicitelty recognisisng that sample means are random variables. If we draw several samples of the same sample size from the same distribution we will get a distribution of outcomes.

library(ggplot2)  # to make density plots

To get this idea bedded in, let's draw 1000 samples of size two from our UK population function and then let's look at the resulting distribution.

nsamples <- 10000  # Number of samples to draw
sampmeans <- rep(0, nsamples)
n <- 2   # Sample size

for (i in 1:nsamples){

  ages <- UKpop_ages(n)  # draw one extra observation
  sampmeans[i] <- mean(ages)

}

And here comes the distribution plot. There is absolutely no need to understand the details here. We are using the awesome ggplot2 package which is much more powerful than the mosaic package which was introduced on ECLR (but also more complicated to use).

#Sample data
dat <- data.frame(s_means = c(sampmeans,rnorm(10000, mean=mean(sampmeans), sd=sd(sampmeans))),lines = c(rep("Sample",nsamples),rep("Normal", 10000)))
#Plot.
ggplot(dat, aes(x = s_means, fill = lines)) + geom_density(alpha = 0.5) + ggtitle("Sample Size = 2")
Sampling dist1.jpeg

The differences with the normal distribution are actually quite subtle, but especially the left hand tail does not match. In our sample of 2 we get too few sample averages in the tail.. Let's see what happens if we increase the sample size to n=10.

nsamples <- 10000  # Number of samples to draw
sampmeans <- rep(0, nsamples)
n <- 10   # Sample size

for (i in 1:nsamples){

  ages <- UKpop_ages(n)  # draw one extra observation
  sampmeans[i] <- mean(ages)

}

#Sample data
dat <- data.frame(s_means = c(sampmeans,rnorm(10000, mean=mean(sampmeans), sd=sd(sampmeans))),lines = c(rep("Sample",nsamples),rep("Normal", 10000)))
#Plot.
ggplot(dat, aes(x = s_means, fill = lines)) + geom_density(alpha = 0.5)+ ggtitle("Sample Size = 10")
Sampling dist2.jpeg

And we are bang on! Here, after only 10 observations the CLT has kicked in and has delivered a sampling distribution for the sample mean that is basically a normal distribution. And that happened despite the distribution from which we sample being very non-normal.

However, our age distribution had an important feature that makes this convergence to the normal distribution pretty quick (here already after a sample size of 5); it is fairly symmetric.