POPULATIONS, SAMPLES & SAMPLING DISTRIBUTIONS

In the Introductory Section we provided some basic definitions and concepts of statistics - data, experiment, sampling, population, sample, sample space and statistic. In this section, the links from these ideas to probability distributions and random variables are made explicit.

Experiments and Populations

Random Variables and Probability

There are two aspects to the idea of an experiment. On the one hand, an experiment is any process which generates data; when it generates a sample of data, this can also be considered as sampling from a population, where the population is defined as the totality of items that we are interested in. On the other hand, the sample space of the experiment lists all the possible outcomes of the experiment. Much effort was devoted in earlier sections to establishing the properties of random variables defined on the sample space of an experiment.

From this perspective, then, an experiment generates the values of a random variable, and possibly even several random variables. The values of a random variable are assumed to occur with a known probability for a given experiment, and the collection of these probabilities constitute the probability distribution of the random variable.

Pooling the two aspects, we see that data generated by experiments can be considered both as values of a random variable and as a sample of data from a population. More succinctly, we can argue that values that occur in a population are generated by an experiment. Pursuing this argument one stage further, we can conclude that values occurring in a population can be considered as the values of a random variable.

This is the crucial idea, but it can be extended still further. The relative frequencies with which values occur in the population must equal the probability of these values as values of a random variable. So, we could argue that a population is “equivalent” (in this sense) to a random variable. This reasoning permits us to use the language of probability and probability distributions alongside that of populations and population relative frequencies.

Examples

  1. The population of January examination results for ECON101 consists of the values of some random variable.
  2. The number of cars passing an observation point on the M60 in a short interval of time is the value of some random variable.
  3. Whether or not a household in the UK owns a Tablet Computer is the value of a random variable.

The skill of the statistician lies in deciding which random variable is appropriate for describing the underlying population. This is not always easy, and usually one tries to use a well-known random variable and probability distribution, at least as a first approximation. Hence the need to discuss Binomial, Geometric, Poisson, Uniform, Exponential and Normal random variables in this course.

Populations and random variables

The discussion above helps us to see why it is that the expected value, [math]E\left[ X\right] [/math], of some random variable [math]X[/math] is simultaneously

  • the theoretical mean
  • the mean of the probability distribution of [math]X[/math]
  • the population mean.

The same applies to the variance [math]Var\left[ X\right] [/math] of [math]X:[/math] it is

  • the theoretical variance
  • the variance of the probability distribution of [math]X[/math]
  • the population variance.

Population characteristics like mean and variance are usually called population parameters, but they are also characteristics of the probability distribution of [math]X.[/math] There are other sorts of parameters that we may be interested in - for example, the population relative frequency [math]\pi [/math] in the above Tablet Computer Example. If we define [math]X[/math] to be a random variable taking on the value 1 if a household owns a Tablet Computer, and 0 otherwise, the population proportion becomes the parameter of a probability distribution as [math]\pi =\Pr \left( X=1\right)[/math].
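To make this concrete, here is a minimal Python sketch of the Tablet Computer example; the population size and the value of [math]\pi[/math] are invented purely for illustration.

```python
import random

# Hypothetical population of 10,000 households:
# 1 = owns a tablet computer, 0 = does not.
# The proportion of 1s, pi = 0.62, is invented for illustration.
population = [1] * 6_200 + [0] * 3_800

# The population relative frequency of 1s is the parameter pi,
# and it equals Pr(X = 1) for a single random draw from the population.
print(sum(population) / len(population))   # 0.62 by construction

# Empirical check: the long-run frequency of 1s in repeated random draws.
draws = [random.choice(population) for _ in range(100_000)]
print(sum(draws) / len(draws))             # close to 0.62
```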

Objective of Statistics

In most applications of Statistics in Economics, the objective is to learn about population characteristics or parameters. It is important to remember that the values of these parameters are unknown to us, and generally, we will never discover their values exactly. The idea is to get as close to the truth as possible, even though the truth may never be revealed to us. All we can do in practice is to make reasonable judgments based on the evidence (data) and our analysis of that data: this process is called statistical inference. So,

  • the objective of statistics is statistical inference on (unknown) population parameters.

Samples and Sampling

Put simply, we use a sample of data from a population to draw inferences about the unknown population parameters. However, the argument in the Section above makes it clear that this idea of sampling from a population is equivalent to sampling from the probability distribution of a random variable.

It is important to note that the way in which a sample is obtained will influence inference about population parameters. Indeed, badly drawn samples will bias such inference.

A Bad Example

Suppose that an investigator is interested in the amount of debt held by students when they graduate from a UK university. If the investigator samples only graduating students from the University of Manchester, there can be no presumption that the sample is representative of all graduating UK students.

Sampling from a population

It is easier to start by discussing appropriate sampling methods from a population, and then discuss the equivalence with sampling from probability distributions. Our objective of avoiding biased inferences is generally considered to be met if the sampling procedure satisfies two conditions:

  1. Each element of the population has an equal chance of being drawn for inclusion in the sample.
  2. Each draw from the population is independent of the preceding and succeeding draws.

A sample meeting these conditions is called a simple random sample, although frequently this is abbreviated to random sample.

How can these conditions be physically realised? Drawing names from a hat is one possible method, although not very practical for large populations. The electronic machine used to make draws for the National Lottery is apparently considered a fair way of drawing 6 or 7 numbers from 50. The use of computers to draw “pseudo-random” numbers is also a standard method.

There are some technical complications associated with this description of random sampling. One is that with a population of finite size, the conditions can be met only if an item drawn from the population is “replaced” before the next draw - this is sampling with replacement. Under sampling without replacement, the chance of any particular element being drawn changes from draw to draw. However, if we have a “large” population, and a “small” sample size relative to the population size, there is little practical difference between sampling with and sampling without replacement. [1]

Therefore these distinctions are ignored in what follows.
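The practical point in the footnote is easy to check by simulation. In the sketch below (population values, sample size and replication count are all invented for illustration), `random.choices` samples with replacement and `random.sample` samples without; for a sample of 10 from a population of 1000 the two schemes give virtually identical results.

```python
import random
import statistics

random.seed(1)  # fixed only so the sketch is reproducible

# Illustrative population of N = 1000 elements, sample size n = 10.
population = [0] * 500 + [1] * 300 + [2] * 200
n = 10

def sample_mean_distribution(with_replacement, reps=20_000):
    """Mean and spread of the sample mean over many repeated samples."""
    means = []
    for _ in range(reps):
        if with_replacement:
            s = random.choices(population, k=n)  # with replacement
        else:
            s = random.sample(population, k=n)   # without replacement
        means.append(statistics.mean(s))
    return statistics.mean(means), statistics.stdev(means)

# The two results should be almost indistinguishable when n << N.
print(sample_mean_distribution(True))
print(sample_mean_distribution(False))
```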

Sampling from a probability distribution

It is helpful to see an example of sampling from a population: the example is simple enough to make the link to sampling from a probability distribution transparent.

The population contains [math]1000[/math] elements, but only three distinct values occur in this population, [math]0,1,2,[/math] with population relative frequencies [math]p_{0},p_{1},p_{2}[/math] respectively. We can consider this population as being equivalent to a random variable [math]X[/math] taking on the values [math]0,1,2[/math] with probabilities [math]p_{0},p_{1},p_{2}.[/math] The probability distribution of [math]X[/math] can be represented as the table

Values of [math]X[/math] | Probability
[math]0[/math] | [math]p_{0}[/math]
[math]1[/math] | [math]p_{1}[/math]
[math]2[/math] | [math]p_{2}[/math]

In this population, [math]0[/math] occurs [math]1000p_{0}[/math] times, [math]1[/math] occurs [math]1000p_{1}[/math] times and [math]2[/math] occurs [math]1000p_{2}[/math] times. If we select an element from the population at random, we don’t know in advance which element will be drawn, but every element has the same chance, [math]1/1000,[/math] of being selected. What is the chance that a [math]0[/math] value is selected? It is the ratio of the number of [math]0^{\prime }s[/math] to the population size:

[math]\dfrac{1000p_{0}}{1000}=p_{0}.[/math]

Exactly the same argument applies to selecting a [math]1[/math] or a [math]2,[/math] producing selection probabilities [math]p_{1}[/math] and [math]p_{2}[/math] respectively.
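A quick simulation confirms this argument; the values of [math]p_{0},p_{1},p_{2}[/math] below are illustrative (any probabilities summing to one would do).

```python
import random
from collections import Counter

# Illustrative population of 1000 elements: 0 occurs 500 times,
# 1 occurs 300 times, 2 occurs 200 times, so p0 = 0.5, p1 = 0.3, p2 = 0.2.
population = [0] * 500 + [1] * 300 + [2] * 200

# Every element has the same chance 1/1000 of selection, so
# Pr(draw a 0) = (number of 0s) / 1000 = p0, and likewise for 1 and 2.
counts = Counter(random.choice(population) for _ in range(100_000))
for value in sorted(counts):
    print(value, counts[value] / 100_000)   # close to 0.5, 0.3, 0.2
```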

It is clear that the probability distribution of what might be drawn from the population is that of the random variable [math]X[/math] which describes this population. So, it is appropriate to define a random variable [math]X_{1},[/math] say, which describes what might be obtained on the first draw from this population. The possible values of [math]X_{1}[/math] are the three distinct values in the population, and their probabilities are equal to the probabilities of drawing these values from the population:

  • [math]X_{1}[/math] has the same probability distribution as [math]X.[/math]

By the principle of (simple) random sampling, what might be drawn at the second draw is independent of the first draw. The same values are available to draw, with the same probabilities. What might be drawn at the second drawing is also described by a random variable, [math]X_{2},[/math] say, which is independent of [math]X_{1}[/math] but has the same probability distribution as [math]X_{1}[/math].

  • [math]X_{2}[/math] is independent of [math]X_{1},[/math] and has the same distribution as [math]X[/math].

We can continue in this way until [math]n[/math] drawings have been made, resulting in a random sample of size [math]n[/math]. The [math]n[/math] random variables [math]X_{1},...,X_{n}[/math] describing what might be drawn are mutually independent and have the same probability distribution as the random variable [math]X[/math] describing the population. To use a jargon phrase, these sample random variables are independently and identically distributed.

It remains to translate this process of sampling from a population into sampling from the probability distribution of [math]X[/math]. All we need to say is that what one might get in a random sample of size 1 from the probability distribution of [math]X[/math] are the values of a random variable [math]X_{1}[/math] with the same probability distribution as [math]X[/math]. For a random sample of size 2, what we might get are the values of a pair of independent random variables [math]X_{1},X_{2},[/math] each having the same probability distribution as [math]X[/math]. For a random sample of size [math]n[/math], what we might get are the values of [math]n[/math] independent random variables [math]X_{1},...,X_{n},[/math] each having the same probability distribution as [math]X[/math].
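In code, a random sample of size [math]n[/math] from the probability distribution of [math]X[/math] is simply [math]n[/math] independent draws from that distribution. A minimal sketch, again with illustrative probabilities for the three-valued example:

```python
import random

values = [0, 1, 2]
probs = [0.5, 0.3, 0.2]   # illustrative probabilities p0, p1, p2
n = 5

# One realised random sample of size n: each draw is a realisation of
# one sample random variable X_i, independent of the others and with
# the same distribution as X.
sample = random.choices(values, weights=probs, k=n)
print(sample)   # e.g. [0, 2, 0, 1, 0] -- the values x_1, ..., x_n
```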

Although a particular population was used as an example, one can see that the description of sampling from the corresponding probability distribution yields properties that apply generally. Specifically, they apply even when the random variable used to describe a population is a continuous random variable.

To summarise, using the language of sampling from a probability distribution,

  • a random sample of size [math]n[/math] from the probability distribution of a random variable [math]X[/math]
  • consists of sample random variables [math]X_{1},...,X_{n}[/math]
  • that are mutually independent
  • and have the same probability distribution as [math]X;[/math]
  • [math]X_{1},...,X_{n}[/math] are independently and identically distributed or i.i.d. random variables.

It is important to note that this discussion relates to what might be obtained in a random sample. The sample of data consists of the values [math]x_{1},...,x_{n}[/math] of the sample random variables [math]X_{1},...,X_{n}[/math], i.e. [math]x_{i}[/math] is the realisation of the random variable [math]X_{i}[/math].

At this stage we can already see why all of this is useful. Often we will be interested in statistics such as the sample mean

[math]\overline{x} = \dfrac{1}{n} \sum_{i=1}^{n} x_i = \dfrac{1}{n}x_1 + \dfrac{1}{n}x_2 + \cdots + \dfrac{1}{n}x_n.[/math]

If we think of the sample before it is drawn, the same formula applies with the sample random variables [math]X_{1},...,X_{n}[/math] in place of their realised values: the sample mean [math]\overline{X}[/math] is a linear combination of several random variables. It is therefore itself a random variable, and its properties can be derived using the rules established in this Section. More details on this in the next Section.
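To see that the sample mean is itself a random variable, draw several random samples and compute the mean of each: the realised means differ from sample to sample. A small sketch using the illustrative three-valued distribution from above:

```python
import random
import statistics

values, probs = [0, 1, 2], [0.5, 0.3, 0.2]   # illustrative distribution of X
n = 10

# Each replication realises the sample random variables X_1,...,X_n once
# and computes one realisation of the sample mean.
xbars = [statistics.mean(random.choices(values, weights=probs, k=n))
         for _ in range(5)]
print(xbars)   # five different realised sample means
```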

Examples

Here we list three examples which outline how a sample relates to its population:

  1. A random sample of size 3 is drawn from the population described above. It consists of three i.i.d. random variables [math]X_{1},X_{2},X_{3}.[/math] Suppose that the values in the sample of data are 0, 0, 1: that is,

    [math]x_{1}=0,x_{2}=0,x_{3}=1.[/math]

  2. Pursuing the graduate debt example from the Section above, suppose that graduate debt is described by a random variable [math]X[/math] with a normal distribution, so that [math]X[/math] is distributed as [math]N\left( 5000,1000^{2}\right) [/math]:

    [math]X\sim N\left( 5000,1000^{2}\right) .[/math]

    If [math]10[/math] students are drawn at random from the population of students (i.e. using a random sample), the debt at each drawing also has this distribution. The random sample consists of [math]10[/math] random variables [math]X_{1},...,X_{10}[/math], mutually independent, and each [math]X_{i}[/math] is normally distributed:

    [math]X_{i}\sim N\left( 5000,1000^{2}\right) ,\;\;\;i=1,...,10.[/math]

    The sample of data is the values [math]x_{1},...,x_{10},[/math] for example,

    [math]\begin{aligned} x_{1}&=&5754.0,\;\;\;x_{2}=6088.0,\;\;\;x_{3}=5997.5,\;\;\;x_{4}=5572.3,\;\;\;x_{5}=4791.9, \\ x_{6}&=&4406.9,\;\;\;x_{7}=5366.1,\;\;\;x_{8}=6083.3,\;\;\;x_{9}=6507.9,\;\;\;x_{10}=4510.7.\end{aligned}[/math]

  3. An alternative version of Example 2. Suppose that [math]X\sim N\left( \mu,\sigma ^{2}\right) ,[/math] with [math]\mu [/math] and [math]\sigma ^{2}[/math] unknown: then the random sample of size [math]10[/math] consists of [math]X_{1},...,X_{10},[/math] mutually independent, and

    [math]X_{i}\sim N\left( \mu ,\sigma ^{2}\right) ,\;\;\;i=1,...,10.[/math]

    If we suppose that the sample values are as shown above, it is tempting to use these data to make a guess at [math]\mu [/math] and [math]\sigma ^{2}[/math]; a sketch of this idea follows these examples.
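Both versions of the graduate debt example are easy to mimic with a pseudo-random number generator; the draws below will of course differ from the illustrative values listed in Example 2. Computing the sample mean and sample variance as guesses at [math]\mu[/math] and [math]\sigma ^{2}[/math] anticipates the next Section.

```python
import random
import statistics

random.seed(42)  # any seed; fixed only so the sketch is reproducible

# Example 2: ten independent draws X_1,...,X_10, each N(5000, 1000^2).
sample = [random.gauss(5000, 1000) for _ in range(10)]
print([round(x, 1) for x in sample])

# Example 3: treat mu and sigma^2 as unknown and guess them from the data.
print(statistics.mean(sample))       # should be somewhere near 5000
print(statistics.variance(sample))   # should be somewhere near 1,000,000
```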

Footnotes

  1. By way of illustration, consider two bowls of M&Ms, each containing a quarter each of blue, green, yellow and brown M&Ms. One bowl contains 1,000,000 M&Ms altogether; the other just 4, one of each colour. What is the probability that the 2nd M&M we draw blindly out of a bowl is a yellow one? For the big bowl this is easy to answer: it is, to all practical purposes, 0.25 whether we put the first M&M back or not. For the small bowl, however, it clearly matters whether we replace: without replacement, the probability depends on the colour of the first M&M drawn (it is 1/3 if the first was not yellow and 0 if it was), so the two draws are not independent.
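The small-bowl claim can be checked by brute-force simulation. Note the distinction the simulation makes explicit: without replacement the unconditional probability that the second M&M is yellow is still 0.25 (by symmetry), but the conditional probability given the first draw's colour is not.

```python
import random

bowl = ["blue", "green", "yellow", "brown"]   # the 4-M&M bowl
reps = 200_000

second_yellow = first_blue = yellow_given_blue = 0
for _ in range(reps):
    first, second = random.sample(bowl, 2)    # two draws WITHOUT replacement
    second_yellow += (second == "yellow")
    if first == "blue":
        first_blue += 1
        yellow_given_blue += (second == "yellow")

print(second_yellow / reps)            # unconditional: close to 0.25
print(yellow_given_blue / first_blue)  # conditional on first blue: close to 1/3
```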