Descriptive

From ECLR
Revision as of 10:15, 6 August 2013 by Admin (talk | contribs) (Example)

Numerical Descriptive Statistics

In GraphicRep we looked at graphical summaries of data. We now describe numerical summary statistics. When such a summary is applied to a set of data, it returns a number which often has a useful interpretation.

We shall look at three categories of numerical summaries:

  • location or average
  • dispersion, spread or variance
  • association, correlation or regression. This is done in an extra section Regression.

Location

A measure of location tells us something about what a typical value from a set of observations is. We sometimes use the expression central location, central tendency or, more commonly, average. We can imagine it as the value around which the observations in the sample are distributed.

The simplest numerical summary (descriptive statistic) of location is the sample (arithmetic) mean:

[math]\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}=\frac{(x_{1}+x_{2}+\ldots +x_{n})}{n}.[/math]

It is obtained by adding up all the values in the sample and dividing this total by the sample size. It uses all the observed values in the sample and is the most popular measure of location since it is particularly easy to deal with theoretically.

Another measure, with which you may be familiar, is the sample median. This does not use all the values in the sample and is obtained by finding the middle value in the sample, once all the observations have been ordered from the smallest value to the largest. Thus, roughly [math]50\%[/math] of the observations are larger than the median and roughly [math]50\%[/math] are smaller. Since it does not use all the data, it is less influenced by extreme values (or outliers) than the sample mean. For example, when investigating income distributions it is typically found that the mean income is higher than the median income. Imagine the highest [math]10\%[/math] of earners in a country double their income, but the income of everyone else remains constant. What effect will that have on the mean on the one hand, and the median on the other? It will increase the former but leave the median unchanged.
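The income thought experiment can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the ten income values are made up purely for illustration.

```python
from statistics import mean, median

# Hypothetical annual incomes (in thousands of pounds) for ten people.
incomes = [12, 15, 18, 20, 22, 25, 28, 32, 40, 88]

mean_before = mean(incomes)
median_before = median(incomes)

# The top 10% of earners (here the single highest earner) double their income.
incomes_after = sorted(incomes)
incomes_after[-1] = 2 * incomes_after[-1]

mean_after = mean(incomes_after)
median_after = median(incomes_after)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

With these illustrative numbers the mean rises from 30 to 38.8, while the median stays at 23.5.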

In some situations, it makes more sense to use a weighted sample mean, rather than the arithmetic mean. Let’s first look at the formula:

[math]\bar{x}_w=\sum_{i=1}^{n}w_{i}x_{i}=w_{1}x_{1}+w_{2}x_{2}+\ldots+w_{n}x_{n}.[/math]

where the weights [math]\left( w_{1},\ldots ,w_{n}\right) [/math] satisfy [math]\sum_{i=1}^{n}w_{i}=1[/math]. Note that equal weights of [math]w_{i}=n^{-1}[/math], for all [math]i,[/math] give the arithmetic mean. This type of average statistic is often used in the construction of index numbers (such as the Consumer Price Index, CPI) and it comes into play whenever we have data from different categories that are of different importance. In the calculation of the CPI this is relevant as price increases for food items are more important than price increases for music, since purchases of the latter use a smaller proportion of a typical income.
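As a minimal sketch of the weighted mean formula, the following Python snippet computes a weighted average of price changes. The categories, price changes and weights are invented for illustration and are not actual CPI figures; the helper name `weighted_mean` is likewise an assumption.

```python
import math

def weighted_mean(values, weights):
    """Weighted mean sum(w_i * x_i); the weights are required to sum to one."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * x for w, x in zip(weights, values))

# Hypothetical price increases (in %) for food, housing and music.
price_changes = [4.0, 2.0, 10.0]
# Hypothetical expenditure shares of a typical income.
weights = [0.5, 0.4, 0.1]

weighted = weighted_mean(price_changes, weights)

# Equal weights 1/n reproduce the arithmetic mean.
n = len(price_changes)
equal = weighted_mean(price_changes, [1 / n] * n)

print(weighted, equal)
```

Here the large price increase for music barely moves the weighted mean (about 3.8) because its weight is small, whereas the equally weighted mean (about 5.33) treats all three categories alike.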

All the above measures of location can be referred to as an average. One must, therefore, be clear about what is being calculated. Two politicians may quote two different values for the "average income in the U.K."; both are probably right, but are computing two different measures!

There is another important difference between the arithmetic average and the median. The arithmetic average is really only applicable for continuous data (see the Data type Section), while the median can also be applied to discrete data which are ordinal (i.e. have a natural ranking). This makes the median applicable to a much wider range of data.

You may now ask what measure of central tendency is to be applied for nominal data (categorical data where the categories have no natural ordering). Neither the arithmetic average nor the median is applicable. The statistic to be used here is called the mode, and it is the category that occurs most often in your dataset. Say you have 100 observations of which 55 are female and 45 male. Here the modal category for the gender variable is female.
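The gender example can be reproduced in a few lines. This is a small sketch using `collections.Counter` from the Python standard library; the list simply encodes the 55 female and 45 male observations mentioned above.

```python
from collections import Counter

# 100 observations of a nominal variable: 55 female, 45 male.
genders = ["female"] * 55 + ["male"] * 45

counts = Counter(genders)

# most_common(1) returns a list holding the most frequent (category, count) pair.
mode_category, mode_count = counts.most_common(1)[0]

print(mode_category, mode_count)
```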

One last note. Any of our statistics is based on a set of [math]n[/math] observations. This language assumes that the observations we have are a sample only (see the Data Types Section). The statistic [math]\bar{x}[/math] is therefore often also called the sample mean. As we are usually interested in the mean of the entire population, we use the sample mean as an estimate of the unknown population mean (which is often represented by [math]\mu[/math]).

Example

Khan Academy:

  • Intro to Measures of Central Tendency: [1]
  • An example for mean, median and mode calculation: [2]
  • A short discussion of the relation between sample and population mean: [3]

Dispersion

A measure of dispersion (or variability) tells us something about how much the values in a sample differ from one another. The easiest measure of dispersion is called the range. All it measures is the difference between the largest and the smallest value in your observations. While this can be informative, it has many shortcomings. First and foremost, it is strongly influenced by outliers. Further, it only uses the two most extreme observations in the calculation. We therefore ignore a whole lot of information.
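A short sketch illustrates how sensitive the range is to a single outlier. The data values and the helper name `sample_range` are made up for illustration.

```python
def sample_range(values):
    """Difference between the largest and the smallest observation."""
    return max(values) - min(values)

# Hypothetical observations.
data = [2, 4, 4, 5, 10]
# The same observations with one extreme value added.
data_with_outlier = data + [100]

range_plain = sample_range(data)
range_outlier = sample_range(data_with_outlier)

print(range_plain, range_outlier)
```

Adding one extreme observation moves the range from 8 to 98, even though the other five observations are unchanged.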

What we are really after is a measure of how closely your observations are distributed around the central location.

We begin by defining a deviation from the arithmetic mean:

[math]d_{i}=x_{i}-\bar{x}[/math]

This is at the core of most measures of dispersion. It describes how much a value differs from the mean. As an average measure of deviation from [math]\bar{x}[/math], we could consider the arithmetic mean of deviations, but this will always be zero and therefore is not very informative. In other words the positive and negative deviations from [math]\bar{x}[/math] will always cancel each other out.
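To see why the mean of the deviations is always zero, the following short derivation uses only the definition of [math]\bar{x}[/math] given earlier:

[math]\frac{1}{n}\sum_{i=1}^{n}d_{i}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})=\frac{1}{n}\sum_{i=1}^{n}x_{i}-\frac{1}{n}\,n\,\bar{x}=\bar{x}-\bar{x}=0.[/math]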

The key to understanding this is to note that in terms of dispersion the two deviations [math]d_1=2[/math] and [math]d_2=-2[/math] are the same, although clearly [math]d_1[/math] and [math]d_2[/math] are not the same as they have different signs. We need to find ways to summarise the extent of deviation without any cancellation effect. There are two alternatives to looking at [math]d_i[/math] directly. We could either look at the absolute deviation [math]\left| d_{i}\right|[/math] (leading to the Mean Absolute Deviation - MAD - below) or at the squared deviation, [math]d_{i}^2[/math] (leading to the Mean Squared Deviation - MSD). You should convince yourself that [math]\left| d_{1}\right|=\left| d_{2}\right|[/math] and [math]d_{1}^2=d_{2}^2[/math], confirming that both [math]d_1=2[/math] and [math]d_2=-2[/math] carry the same information with respect to dispersion. Based on these two transformations of the deviations, the following two statistics are available:

  • MAD: [math]\frac{1}{n}\sum_{i=1}^{n}\left| d_{i}\right| =\frac{1}{n}\sum_{i=1}^{n}\left| x_{i}-\bar{x}\right| \geq 0[/math]
  • MSD: [math]\frac{1}{n}\sum_{i=1}^{n}d_{i}^{2}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\geq 0[/math]
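The two formulas can be computed directly from the deviations. The sketch below uses plain Python and a small made-up data set; the variable names are chosen for illustration only.

```python
# Hypothetical observations.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)

# Arithmetic mean and the deviations d_i = x_i - x_bar.
x_bar = sum(data) / n
deviations = [x - x_bar for x in data]

# Mean Absolute Deviation: average of |d_i|.
mad = sum(abs(d) for d in deviations) / n

# Mean Squared Deviation: average of d_i squared.
msd = sum(d ** 2 for d in deviations) / n

print(x_bar, mad, msd)
```

For these values the mean is 5, the MAD is 2 and the MSD is 7.2. The single large deviation of 5 contributes much more to the MSD than to the MAD, because it is squared.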

Like the arithmetic mean, the MSD is easier to work with and lends itself to theoretical treatment. A more commonly used name for MSD is the variance. As you can see from the formula, it is calculated on the basis of [math]n[/math] observations. If the data represent your relevant population (see the Data Types Section) then the formula for MSD is referred to as the population variance and it is often represented by [math]\sigma^2[/math]. If your observations represent a sample, then the formula for MSD is changed by replacing the factor [math]1/n[/math] with [math]1/(n-1)[/math]:

[math]s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}[/math]

This is what is usually called the sample variance, abbreviated as [math]s^2[/math]. The slightly changed factor ensures that [math]s^2[/math] is an unbiased estimator of the unknown [math]\sigma^2[/math], which is the quantity we are usually interested in.
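The difference between the two factors is easy to see numerically. As a sketch, Python's standard `statistics` module offers `pvariance` (factor [math]1/n[/math]) and `variance` (factor [math]1/(n-1)[/math]); the data values below are made up for illustration.

```python
from statistics import pvariance, variance

# Hypothetical observations.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
n = len(data)

# Sum of squared deviations from the arithmetic mean.
x_bar = sum(data) / n
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)

# Population variance: divide by n.
sigma2_manual = sum_sq_dev / n
# Sample variance: divide by n - 1.
s2_manual = sum_sq_dev / (n - 1)

# The library functions implement the same two formulas.
sigma2 = pvariance(data)
s2 = variance(data)

print(sigma2_manual, sigma2)
print(s2_manual, s2)
```

The sample variance (9 here) is always somewhat larger than the population formula applied to the same numbers (7.2 here), and the gap shrinks as [math]n[/math] grows.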

The variance measures have two major disadvantages:

  • The value you get for either [math]\sigma^2[/math] or [math]s^2[/math] has no easy interpretation. This is obvious when you think about the units in which the variance is measured. If your data are, say, income data, then the unit of the variance is [math]£^2[/math]. But what is the meaning of a squared pound sterling?
  • The variances for different data sets are almost impossible to compare. The reason for that is best seen by realising that the value of the variance will change if you multiply each value by 2[1]. However, if we multiply each observation by two, then really, nothing has changed about the dispersion of our data.
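The effect of doubling every observation on the variance can be confirmed with a quick sketch. It again uses `pvariance` from Python's standard `statistics` module and made-up data.

```python
from statistics import pvariance

# Hypothetical observations and the same observations multiplied by 2.
data = [2.0, 4.0, 4.0, 5.0, 10.0]
doubled = [2 * x for x in data]

var_original = pvariance(data)
var_doubled = pvariance(doubled)

# Ratio of the two variances.
ratio = var_doubled / var_original

print(var_original, var_doubled, ratio)
```

Up to floating point rounding the ratio is 4, since every deviation doubles and is then squared.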

In order to address these shortcomings we often refer to a different measure of dispersion, the standard deviation. This is related to the variance. In fact, once you have calculated the variance, you can obtain the standard deviation by taking the square root of your calculated variance:

  • Population standard deviation: [math]\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}[/math]
  • Sample standard deviation: [math]s=\sqrt{s^2}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}[/math]

Once you have calculated your sample (or population) standard deviation, you can say that the typical deviation of your observations from the mean is roughly [math]s[/math] (or [math]\sigma[/math]). As the variance is never negative, neither is its square root, the standard deviation.
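As a final sketch, both standard deviations can be obtained either by taking the square root of the corresponding variance or by calling `pstdev` and `stdev` from Python's standard `statistics` module. The data are made up for illustration.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical observations.
data = [2.0, 4.0, 4.0, 5.0, 10.0]

# Square roots of the population and sample variances.
sigma_from_var = math.sqrt(pvariance(data))
s_from_var = math.sqrt(variance(data))

# The library functions give the same values directly.
sigma = pstdev(data)
s = stdev(data)

print(sigma_from_var, sigma)
print(s_from_var, s)
```

Unlike the variance, both results are measured in the same units as the data, which is what makes them easier to interpret.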

Example

Khan Academy:

  • Intro to Measures of Dispersion: [4]
  • An example for the calculation of the population variance: [5]
  • Why do we need the sample variance? [6]
  • These two clips illustrate nicely why the sample variance uses the factor [math]1/(n-1)[/math]. First: ( [7]) which gives an intuitive explanation of why we need the factor [math]1/(n-1)[/math]. Second: ( [8]) which uses a simulation to convince you.

Footnotes

  1. You could try and confirm that in this case the variance will, in fact, increase by the factor 4.