= Correlation =
  
A commonly used measure of association is the ''sample correlation coefficient'', which is designed to tell us something about the characteristics of a scatter plot of observations on the variable <math>Y</math> against observations on the variable <math>X</math>. In particular, are higher than average values of <math>Y</math> associated with higher than average values of <math>X</math>, and vice-versa? Consider the following data set in which we observe the weight (<math>Y_i</math>) measured in pounds and the height (<math>X_i</math>) measured in inches of a sample of 12 people:
  
{| border="1" cellpadding="4" style="text-align:center;"
|-
!
! colspan="12" | <math>i</math>
|-
! ''Variable''
| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 || 12
|-
! Weight <math>=Y_i</math>
| 155 || 150 || 180 || 135 || 156 || 168 || 178 || 160 || 132 || 145 || 139 || 152
|-
! Height <math>=X_i</math>
| 70 || 63 || 72 || 60 || 66 || 70 || 74 || 65 || 62 || 67 || 65 || 68
|}
  
The best way to graphically represent the data is the following scatter plot:
  
[[File:Regression_scatter1.jpg|frameless|700px]]
  
On this graph a horizontal line at <math>y=154</math> (i.e., <math>y\cong \bar{y}</math>) and a vertical line at <math>x=67</math> (i.e., <math>x\cong \bar{x}</math>) are superimposed. Points in the resulting ''upper right'' quadrant are those for which the weight is higher than average '''and''' the height is higher than average; points in the ''lower left'' quadrant are those for which weight is lower than average '''and''' height is lower than average. Since most points lie in these two quadrants, this suggests that higher than average weight is associated with higher than average height, whilst lower than average weight is associated with lower than average height. This is the hallmark of what we call a positive relationship between <math>X</math> and <math>Y</math>. If there were no association, we would expect to see a roughly equal distribution of points in all four quadrants.
  
While it is often straightforward to see the qualitative nature of a relationship (positive, negative or unrelated), we want a numerical measure that describes this relationship so that we can also comment on its strength. The basis of such a measure is again the deviation from the sample mean (as used in the calculation of the variance and standard deviation; see the [[#Dispersion|Dispersion]] Section below), but now we have two such deviations for each observation: the deviation in the <math>Y</math> variable, <math>d_{y,i}=(y_i-\bar{y})</math>, and the deviation in the <math>X</math> variable, <math>d_{x,i}=(x_i-\bar{x})</math>. In the following graph you can see these deviations represented by the dashed lines for the third (<math>i=3</math>) observation.
  
[[File:Regression_scatter_deviation.jpg|700px]]
  
In the case of the third observation, with <math>y_3=180</math> and <math>x_3=72</math>, we can see that both values are larger than the respective sample means <math>\bar{y}</math> and <math>\bar{x}</math> and therefore both <math>d_{y,i}</math> and <math>d_{x,i}</math> are positive. In fact, this will be the case for all observations that lie in the ''upper right'' quadrant. For observations in the ''lower left'' quadrant we will find <math>d_{y,i}</math> and <math>d_{x,i}</math> to be smaller than 0. Observations in both these quadrants are reflective of a positive relationship. We therefore need to use the information in <math>d_{y,i}</math> and <math>d_{x,i}</math> in such a way that in both these cases we get a positive contribution to our statistic that numerically describes the relationship. Consider the term <math>(d_{y,i} \times d_{x,i})</math>; this term will be positive for all observations in either the ''upper right'' or ''lower left'' quadrant. For values in either the ''upper left'' or ''lower right'' quadrant, however, the terms <math>d_{y,i}</math> and <math>d_{x,i}</math> will have different signs and hence the term <math>(d_{y,i} \times d_{x,i})</math> will be negative, reflecting the fact that observations in these quadrants are representative of a negative relationship.
  
It should now be no surprise to find that our numerical measure of a relationship between two variables is based on these terms. This measure is called the ''correlation coefficient'':
  
<math>r=\frac{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right) \left( y_{i}-\bar{y}\right) }{\sqrt{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right)^{2}\,\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}}}</math>
  
If you calculate <math>r</math> for the above example you should obtain a value of 0.863. A few things are worth noting with respect to the ''correlation coefficient'':
  
* It can be shown algebraically that <math>-1\leq r\leq 1</math>.
* Positive (negative) numbers represent a positive (negative) relationship and a value of 0 represents the absence of any relationship. In our example <math>r=0.863</math>, and hence the two variables display a strong positive correlation (see the computational check after this list).
 
* The numerator contains the sum of the discussed cross products <math>d_{y,i} \times d_{x,i}=(y_i-\bar{y})(x_i-\bar{x})</math>.
 
* The term in the denominator of the equation for <math>r</math> is related to the variances of <math>Y</math> and <math>X</math>. These terms are required to ''standardise'' the statistic to be between -1 and 1.
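As a quick check of the value <math>r=0.863</math> reported above, here is a minimal Python sketch (any language would do) that applies the formula to the data in the table:

<syntaxhighlight lang="python">
# Weight (pounds) and height (inches) for the 12 people in the table above
y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
x = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
n = len(x)

y_bar = sum(y) / n   # 154.167
x_bar = sum(x) / n   # 66.833

# Sum of cross products and sums of squared deviations from the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

r = s_xy / (s_xx * s_yy) ** 0.5
print(round(r, 3))   # 0.863
</syntaxhighlight>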
 
  
In order to understand this standardisation somewhat better we can reformulate the above equation slightly by multiplying both the numerator and the denominator by the factor <math>(1/n)</math><ref>Note that this does not change the value of <math>r</math>!
</ref>:
 
  
<math>r=\frac{\frac{1}{n}\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right) \left( y_{i}-\bar{y}\right) }{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right)^{2}\,\frac{1}{n}\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}}}=\frac{\sigma_{y,x}}{\sqrt{\sigma_y^2 \sigma_x^2}}</math>
 
 
Written in this way you should recognise that the denominator is nothing but the square root of the product of the (population) variance of <math>y</math>, <math>\sigma_y^2</math>, and the variance of <math>x</math>, <math>\sigma_x^2</math>. The term in the numerator is what is called the ''population covariance'' between <math>y</math> and <math>x</math>, <math>\sigma_{y,x}</math>. The ''covariance'' is itself a measure of the relationship between these two variables, but it shares many of the shortcomings of the variance (see the [[#Dispersion|Dispersion]] Section below). Therefore we want a standardised measure (to ensure that <math>-1\leq r \leq 1</math>). This standardisation uses the square root of the two respective variances.
 
 
 
In what we just did we multiplied by the factor <math>(1/n)</math> and obtained the correlation as a function of the ''population'' covariance, <math>\sigma_{y,x}</math>, and the ''population'' variances, <math>\sigma_y^2</math> and <math>\sigma_x^2</math>. If we had instead used the factor <math>(1/(n-1))</math> we would have obtained an expression that relates the correlation to the ''sample'' covariance, <math>s_{y,x}</math>, and the sample variances <math>s_y^2</math> and <math>s_x^2</math>, as follows:
 
 
 
<math>r=\frac{s_{y,x}}{\sqrt{s_y^2 s_x^2}}.</math>
 
 
 
There are two very important limitations of the ''correlation coefficient'':
 
 
 
# In general, this sort of analysis does not imply causation, in either direction. Variables may appear to move together for a number of reasons and not because one is causally linked to the other. For example, over the period <math>1945-64</math> the number of TV licences <math>(x)</math> taken out in the UK increased steadily, as did the number of convictions for juvenile delinquency <math>\left( y\right)</math>. Thus a scatter of <math>y</math> against <math>x</math>, and the construction of the sample correlation coefficient, reveals an apparent positive relationship. However, to claim on this basis that increased exposure to TV causes juvenile delinquency would be extremely irresponsible.
 
# The sample correlation coefficient gives an index of the apparent linear relationship only. It ''assumes'' that the scatter of points must be distributed about some underlying straight line. This is discussed further below. However, the term relationship is not really confined to such linear relationships. Consider the relationship between ''age'' and ''income''. If we were to plot observations for the age and income of people in the age range of 20 to 50 we would clearly find a positive relationship. However, if we were to extend the age range to 80, we would most likely see that income decreases at the upper end of the age range. Therefore there is no linear age/income relationship across the full age range and the ''correlation coefficient'' cannot be used to describe such a relationship (see the sketch after this list).
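To see the second point concretely, here is a minimal Python sketch with made-up numbers: a relationship that is perfect, but non-linear, and whose sample correlation coefficient is nonetheless zero:

<syntaxhighlight lang="python">
# A perfect non-linear relationship with zero correlation (illustrative data)
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # y is fully determined by x, but not linearly

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

print(s_xy / (s_xx * s_yy) ** 0.5)   # 0.0 -- r misses the relationship entirely
</syntaxhighlight>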
 
 
 
Imagine drawing a straight line of ''best fit'' through the scatter of points in the above figure simply from ''visual'' inspection. You would try to make it ''go through'' the scatter, in some way, and it would probably have a positive slope. Numerically, one of the things that the correlation coefficient does is assess the slope of such a line: if <math>r>0</math>, then the slope of the line of best fit should be positive, and vice-versa. Moreover, if <math>r</math> is close to either <math>1</math> (or <math>-1</math>) then this implies that the scatter is quite closely distributed around the line of best fit. What the correlation coefficient doesn’t do, however, is tell us the exact position of the line of best fit. This is achieved using ''regression'' analysis.
 
 
 
== Additional resources ==
 
 
 
* Khan Academy: Do not confuse correlation with causation: [https://www.khanacademy.org/math/probability/regression/regression-correlation/v/correlation-and-causality]
 
* Ralf Becker: Calculation Example for covariance and correlation: [http://www.youtube.com/watch?v=cw5eTSi7xpU&feature=share&list=PLW7MJJThJQQs3djo1EL6KCRFeCa6wpfYY]
 
 
 
= Regression =
 
  
The line you have drawn on the scatter can be represented algebraically as <math>a+bx</math>. Here <math>x</math> represents the value on the horizontal axis (or height), <math>a</math> is the intercept (i.e. the value on the vertical axis at <math>x=0</math>) and <math>b</math> is the slope (i.e. the value by which the line increases as we increase <math>x</math> by one unit). The line is defined at any value of <math>x</math> and not only those at which we have actual observations.
  
However, if we were to substitute our values <math>x_i</math> we would get <math>\widehat{y}_i=a+bx_i</math>. It is important to note that the result of this operation, <math>\widehat{y}_i</math>, is not the same as <math>y_i</math>. The difference between the two is what is often called the residual
  
<math>res_i=y_i-\widehat{y}_i = y_i - a - bx_i.</math>
  
''Regression'' analysis is the statistical technique that finds the ''optimal'' values for <math>a</math> and <math>b</math>. We will soon see how to determine these values, but before we do so we want to point out what ''optimal'' means in this context. It implies that we want to minimise the values of <math>res_i</math>, for all <math>i=1,...,n</math> (where <math>n</math> is the number of observations at hand), in some sense. In fact, what we want to minimise is the ''sum of squared residuals''
  
<math>\sum_{i=1}^{n}(y_{i}-\widehat{y}_{i})^{2} = \sum_{i=1}^{n}(y_{i}-a-bx_{i})^{2}.</math>
  
This is equivalent to saying that we want to minimise the variation of our sample observations around the regression line (<math>a+bx</math>). Here is our previous scatter plot with the ''line of best fit''
  
[[File:Regression_scatter_with_line.jpg|700px]]
  
It turns out that this line is extremely close to the imaginary line of best fit I asked you to draw towards the end of the [[#Correlation|Correlation]] Section. As it turns out, our eyes are pretty good at ''visual fitting''. This is particularly easy if you have the sample means of <math>y</math> and <math>x</math> available, as we have here in the form of the dotted lines, since the line of best fit will always go through the point <math>(\bar{x}, \bar{y})</math>.
  
Let’s get a little technical. The technique of obtaining <math>a</math> and <math>b</math> in this way is also known as ''ordinary least squares'' (OLS), since it minimises the sum of squared deviations (sum of squared residuals) from the fitted line. We shall not dwell on the algebra here, but the solutions to the algebraic problem are:
  
<math>b=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}},\quad a=\bar{y}-b\bar{x};</math>
  
Applying the technique to the weight and height data yields: <math>b=616.333/191.667=3.2157</math>, and <math>a=154.167-3.2157\times 66.833=-60.746</math>, giving the smallest possible sum of squared residuals as <math>677.753.</math> This line:
  
<math>-60.746+3.2157 \times x</math>
  
is superimposed on the above figure.
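The arithmetic above is easy to reproduce. Here is a minimal Python sketch; the printed values match those reported in the text:

<syntaxhighlight lang="python">
# OLS slope and intercept for the weight/height data
y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
x = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
n = len(x)
y_bar, x_bar = sum(y) / n, sum(x) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 616.333
s_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 191.667

b = s_xy / s_xx          # 3.2157
a = y_bar - b * x_bar    # -60.746

# The minimised sum of squared residuals, approx. 677.75
ssr = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
print(b, a, ssr)
</syntaxhighlight>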
  
== Interpretation of regression equation ==
  
There are a number of issues that need to be stressed here. The first relates, yet again, to the ''sample''/''population'' issue. In most cases the data available to run a regression will be sample data. As discussed in the [[#Numerical Descriptive Statistics|Numerical Descriptive Statistics]] Section, <math>\bar{x}</math> is the sample estimate of some unknown population mean <math>\mu</math>, and the sample variance <math>s^2</math> is the sample estimate of some unknown population variance <math>\sigma^2</math>. In the same spirit, it turns out that the values of <math>a</math> and <math>b</math> that describe the line of best fit are sample estimates of some unknown population parameters (usually labelled <math>\alpha</math> and <math>\beta</math>).
  
Also note that <math>b</math> is the slope of the fitted line, <math>\hat{y}=a+bx</math>; i.e., the derivative of <math>\hat{y}</math> with respect to <math>x:</math>
  
<math>b=d\hat{y}/dx</math>
  
and measures the increase in <math>\hat{y}</math> for a unit increase in <math>x</math>.
 
 
Alternatively, it can be used to obtain an elasticity. Elementary economics tells us that if <math>y</math> is some function of <math>x</math>, <math>y=f(x)</math>, then the elasticity of <math>y</math> with respect to <math>x</math> is given by the ''logarithmic derivative:''
 
 
 
<math>\dfrac{d\log \left( y\right) }{d\log \left(x\right) }=\dfrac{dy/y}{dx/x}\cong (x/y)b</math>
 
 
 
where we have used the fact that the differential <math>d\log (y)=\dfrac{1}{y}dy</math>. Such an elasticity is often evaluated at the respective sample means; i.e., it is calculated as <math>(\bar{x}/\bar{y})b</math>.
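For the weight/height example, evaluating the elasticity at the sample means gives <math>(\bar{x}/\bar{y})b=(66.833/154.167)\times 3.2157\cong 1.39</math>: along the fitted line, a one per cent increase in height is associated with roughly a 1.39 per cent increase in weight.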
 
  
 
=== Example ===
 
  
In applied economics studies of demand, the <math>\log</math> of demand <math>(Q)</math> is regressed on the <math>\log</math> of price <math>(P)</math>, in order to obtain the fitted equation (or relationship). For example, suppose an economic model for the quantity demanded of a good, <math>Q</math>, as a function of its price, <math>P</math>, is postulated as approximately <math>Q=aP^{b}</math>, where <math>a</math> and <math>b</math> are unknown parameters, with <math>a>0</math> and <math>b<0</math> to ensure a positive, downward-sloping demand curve. Taking logs on both sides we see that <math>\log(Q)=a^{\ast }+b\log (P),</math> where <math>a^{\ast }=\log (a).</math> Thus, if <math>n</math> observations are available, (<math>q_{i},p_{i}</math>), <math>i=1,...,n,</math> a scatter plot of <math>\log (q_{i})</math> against <math>\log (p_{i})</math> should be approximately linear in nature. This suggests that a simple regression of <math>\log (q_{i})</math> on <math>\log (p_{i})</math> would provide a direct estimate of the elasticity of demand, which is given by the value of <math>b</math>.
  
== Transformations of data ==
  
Numerically, transformations of data can affect the above summary measures. For example, in the weight-height scenario, consider for yourself what would happen to the values of <math>a</math> and <math>b</math> and the correlation if we were to use kilograms and centimetres rather than pounds and inches.
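If you want to check your answer, here is a minimal Python sketch using the approximate conversions 1 lb <math>\cong</math> 0.4536 kg and 1 in = 2.54 cm. Rescaling the data leaves <math>r</math> unchanged, but changes <math>b</math> (and <math>a</math>):

<syntaxhighlight lang="python">
# Effect of a change of units on r and b (weight/height data)
y = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]  # pounds
x = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]              # inches

def r_and_b(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / (s_xx * s_yy) ** 0.5, s_xy / s_xx

print(r_and_b(x, y))                # r = 0.863, b = 3.216 (pounds per inch)

y_kg = [yi * 0.4536 for yi in y]    # pounds -> kilograms
x_cm = [xi * 2.54 for xi in x]      # inches -> centimetres
print(r_and_b(x_cm, y_kg))          # r = 0.863 (unchanged), b = 0.574 (kg per cm)
</syntaxhighlight>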
  
A more important matter arises if we find that a scatter of some variable <math>y</math> against another, <math>x</math>, does not appear to reveal a linear relationship. In such cases, linearity may be retrieved if <math>y</math> is plotted against some function of <math>x</math> (e.g., <math>\log (x)</math> or <math>x^{2}</math>, say). Indeed, there may be cases when <math>y</math> also needs to be transformed in some way. That is to say, transformations of the data (via some mathematical function) may render a non-linear relationship more linear.
 
 
== Additional resources ==
 
 
* Khan Academy: Setup of the OLS problem and a proof of the above formulae for <math>a</math> and <math>b</math> ([https://www.khanacademy.org/math/probability/regression/regression-correlation/v/squared-error-of-regression-line] and four follow-on clips - click on &quot;up next&quot; at the end of each clip). But be careful: in his videos Salman Khan uses <math>m</math> for what we call <math>b</math>, and <math>b</math> for what we call <math>a</math>. Life is never easy!
 
* Link to a full (55min) undergraduate, introductory lecture on regression [http://www.youtube.com/watch?v=AHAlqJTrPHE&list=PLW7MJJThJQQs3djo1EL6KCRFeCa6wpfYY]. After minute 22 this clip has material that is not covered here (<math>R^2</math> and statistical inference in regressions).
 
 


= Numerical Descriptive Statistics =

In [[GraphicRep]] we looked at graphical summaries of data. We now describe three numerical summary statistics. When these summaries are applied to a set of data, they return numbers which often have useful interpretations.

We shall look at three categories of numerical summaries:

* location or ''average''
* ''dispersion'', spread or ''variance''
* association, ''correlation'' or ''regression''. These are covered in the [[#Correlation|Correlation]] and [[#Regression|Regression]] Sections above.

== Location ==

A measure of ''location'' tells us something about what a ''typical'' value from a set of observations is. We sometimes use the expressions ''central location'', ''central tendency'' or, more commonly, ''average''. We can imagine it as the value around which the observations in the sample are ''distributed''.

The simplest numerical summary (descriptive statistic) of location is the ''sample (arithmetic) mean'':

<math>\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_{i}=\frac{(x_{1}+x_{2}+\ldots +x_{n})}{n}.</math>

It is obtained by adding up all the values in the sample and dividing this total by the sample size. It uses all the observed values in the sample and is the most popular measure of location, since it is particularly easy to deal with theoretically. Another measure, with which you may be familiar, is the ''sample median''. This does not use all the values in the sample: it is obtained by finding the middle value in the sample, once all the observations have been ordered from the smallest value to the largest. Thus, <math>50\%</math> of the observations are larger than the median and <math>50\%</math> are smaller. Since it does not use all the data, it is less influenced by extreme values (or outliers) than the sample mean. For example, when investigating income distributions it is found that the mean income is higher than the median income. Imagine the highest <math>10\%</math> of earners in a country double their income, while the income of everyone else remains constant. What effect will that have on the ''mean'' on the one hand, and the ''median'' on the other? It will increase the former but leave the ''median'' unchanged.
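The income thought experiment is easy to replicate. Here is a minimal Python sketch; the income figures are made up purely for illustration:

<syntaxhighlight lang="python">
# Mean versus median when the top incomes double (hypothetical data, in £1000s)
incomes = [12, 15, 18, 20, 22, 25, 28, 31, 35, 40]

def mean(v):
    return sum(v) / len(v)

def median(v):
    s = sorted(v)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

print(mean(incomes), median(incomes))   # 24.6 23.5

# Double the income of the highest 10% of earners (here: the top value)
incomes[-1] *= 2
print(mean(incomes), median(incomes))   # 28.6 23.5 -- mean up, median unchanged
</syntaxhighlight>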

In some situations, it makes more sense to use a ''weighted sample mean'', rather than the arithmetic mean. Let’s first look at the formula:

<math>\bar{x}_w=\sum_{i=1}^{n}w_{i}x_{i}=w_{1}x_{1}+w_{2}x_{2}+\ldots+w_{n}x_{n},</math>

where the weights <math>\left( w_{1},\ldots ,w_{n}\right)</math> satisfy <math>\sum_{i=1}^{n}w_{i}=1</math>. Note that equal weights of <math>w_{i}=n^{-1}</math>, for all <math>i</math>, give the arithmetic mean. This type of average statistic is often used in the construction of index numbers (such as the Consumer Price Index, CPI) and it comes into play whenever we have data from different categories that are of different importance. In the calculation of the CPI this is relevant as price increases for food items are more important than price increases for music, since purchases of the latter use a smaller proportion of a typical income.
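As an illustration, here is a minimal Python sketch of a CPI-style weighted mean; the categories, price increases and expenditure weights are hypothetical:

<syntaxhighlight lang="python">
# A CPI-style weighted mean of price increases (hypothetical numbers)
price_increases = [0.04, 0.02, 0.01]   # food, housing, music (as fractions)
weights = [0.50, 0.45, 0.05]           # expenditure shares, must sum to 1

assert abs(sum(weights) - 1) < 1e-12

x_bar_w = sum(w * x for w, x in zip(weights, price_increases))
print(x_bar_w)   # 0.0295, i.e. a weighted average price increase of 2.95%

# Equal weights w_i = 1/n reproduce the arithmetic mean
n = len(price_increases)
print(sum(x / n for x in price_increases))   # 0.0233...
</syntaxhighlight>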

All the above measures of location can be referred to as an ''average''. One must, therefore, be clear about what is being calculated. Two politicians may quote two different values for the &quot;average income in the U.K.&quot;; both are probably right, but are computing two different measures!

There is another important difference between the ''arithmetic average'' and the ''median''. The ''arithmetic average'' is really only applicable to continuous data (see the [[DataType|Data type]] Section), while the ''median'' can also be applied to discrete data which are ordinal (i.e. have a natural ranking). This makes the ''median'' applicable to a much wider range of data.

You may now ask what measure of central tendency is to be applied for nominal data (categorical data where the categories have no natural ordering). Neither the ''arithmetic average'' nor the ''median'' is applicable. The statistic to be used here is called the ''mode'' and it is the category that occurs most often in your dataset. Say you have 100 observations of which 55 are female and 45 male. Here the ''mode'' category for the gender variable is ''female''.
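A minimal Python sketch of this example:

<syntaxhighlight lang="python">
# The mode of a nominal variable
from collections import Counter

gender = ["female"] * 55 + ["male"] * 45   # 100 hypothetical observations
mode, count = Counter(gender).most_common(1)[0]
print(mode, count)   # female 55
</syntaxhighlight>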

One last note: each of our statistics is based on a set of <math>n</math> observations. This language assumes that the observations we have are only a sample (see the [[DataTypes|Data Types]] Section). The statistic <math>\bar{x}</math> is therefore often also called the ''sample mean''. As we are usually interested in the ''mean'' of the entire population, we use the ''sample mean'' as an estimate of the unknown population mean (which is often represented by <math>\mu</math>).

=== Example ===

Khan Academy:

* Intro to Measures of Central Tendency: [https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/v/statistics-intro--mean--median-and-mode]
* An example for mean, median and mode calculation: [https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/v/mean-median-and-mode]
* A short discussion of the relation between sample and population mean: [https://www.khanacademy.org/math/probability/descriptive-statistics/central_tendency/v/statistics--sample-vs--population-mean]

== Dispersion ==

A measure of ''dispersion'' (or variability) tells us something about how much the values in a sample differ from one another. The easiest measure of dispersion is the ''range''. All it measures is the difference between the largest and the smallest value in your observations. While this can be an informative piece of information, it has many shortcomings: first and foremost, it is influenced by outliers; further, it uses only the two most extreme observations in the calculation and therefore ignores a whole lot of information.

What we are really after is a measure of how closely your observations are distributed around the central location.

We begin by defining a deviation from the arithmetic mean:

<math>d_{i}=x_{i}-\bar{x}</math>

This is at the core of most measures of dispersion. It describes how much a value differs from the ''mean''. As an average measure of deviation from <math>\bar{x}</math>, we could consider the arithmetic mean of the deviations, but this will '''always''' be zero and therefore is not very informative. In other words, the positive and negative deviations from <math>\bar{x}</math> will always cancel each other out.

The key to understanding this is to note that, in terms of dispersion, the two deviations <math>d_1=2</math> and <math>d_2=-2</math> are the same, although clearly <math>d_1</math> and <math>d_2</math> are not the same as they have different signs. We need to find ways to summarise the extent of deviation without any cancellation effect. There are two alternatives to looking at <math>d_i</math> directly. We could either look at the absolute deviation <math>\left| d_{i}\right|</math> (leading to the Mean Absolute Deviation, MAD, below) or at the squared deviation, <math>d_{i}^2</math> (leading to the Mean Squared Deviation, MSD). You should convince yourself that <math>\left| d_{1}\right|=\left| d_{2}\right|</math> and <math>d_{1}^2=d_{2}^2</math>, confirming that both <math>d_1=2</math> and <math>d_2=-2</math> carry the same information with respect to dispersion. Based on these two measures of deviation, the following two statistics are available (see the sketch after this list for a computation):

* MAD: <math>\frac{1}{n}\sum_{i=1}^{n}\left| d_{i}\right| =\frac{1}{n}\sum_{i=1}^{n}\left| x_{i}-\bar{x}\right| >0</math>
* MSD: <math>\frac{1}{n}\sum_{i=1}^{n}d_{i}^{2}=\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}>0</math>
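Both statistics are straightforward to compute. A minimal Python sketch, using the height data from the [[#Correlation|Correlation]] Section above:

<syntaxhighlight lang="python">
# MAD and MSD for the height data
x = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
n = len(x)
x_bar = sum(x) / n                   # 66.833

d = [xi - x_bar for xi in x]         # deviations from the mean
print(abs(sum(d)) < 1e-9)            # True -- the raw deviations cancel out

mad = sum(abs(di) for di in d) / n   # Mean Absolute Deviation, approx. 3.33
msd = sum(di ** 2 for di in d) / n   # Mean Squared Deviation, approx. 15.97
print(mad, msd)
</syntaxhighlight>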

Like the arithmetic mean, the MSD is easier to work with and lends itself to theoretical treatment. A more commonly used name for the ''MSD'' is the ''variance''. As you can see from the formula, it is calculated on the basis of <math>n</math> observations. If the data represent your relevant population (see the [[DataTypes|Data Types]] Section) then the formula for the MSD is referred to as the ''population variance'', often represented by <math>\sigma^2</math>. If your observations represent a sample, then the formula is changed by replacing the factor <math>1/n</math> with <math>1/(n-1)</math>:

<math>s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}</math>

This is what is usually called the ''sample variance'', abbreviated <math>s^2</math>. The slightly changed factor ensures that <math>s^2</math> is an unbiased estimator of the unknown <math>\sigma^2</math>, which is the term we are usually interested in.

The variance measures have two major disadvantages:

* The value you get for either <math>\sigma^2</math> or <math>s^2</math> has no easy interpretation. This is obvious when you think about the units in which the variance is measured. If your data are, say, income data, then the unit of the variance is <math>£^2</math>. But what is the meaning of a squared pound sterling?
* The variances of different data sets are almost impossible to compare. The reason is best seen by realising that the value of the variance changes if you multiply each value by 2<ref>You could try and confirm that in this case the variance will, in fact, increase by the factor 4.</ref>. However, if we multiply each observation by two, then really, nothing has changed about the dispersion of our data.

In order to address these shortcomings we often refer to a different measure of dispersion, the ''standard deviation''. This is related to the variance: once you have calculated the variance, you can obtain the standard deviation by taking its square root.

* Population standard deviation: <math>\sigma=\sqrt{\sigma^2}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}</math>
* Sample standard deviation: <math>s=\sqrt{s^2}=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}</math>

Once you have calculated your sample (or population) standard deviation, you can say that the average deviation of your observations from the mean is <math>s</math> (or <math>\sigma</math>). As the variance is always a positive measure, so is its square root, the ''standard deviation''.
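As a quick illustration, a minimal Python sketch computing both versions for the height data:

<syntaxhighlight lang="python">
# Population versus sample standard deviation for the height data
x = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]
n = len(x)
x_bar = sum(x) / n

ss = sum((xi - x_bar) ** 2 for xi in x)   # sum of squared deviations, 191.667

sigma = (ss / n) ** 0.5       # population standard deviation, approx. 4.00
s = (ss / (n - 1)) ** 0.5     # sample standard deviation, approx. 4.17
print(sigma, s)
</syntaxhighlight>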

=== Example ===

Khan Academy:

* Intro to Measures of Dispersion: [https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/range--variance-and-standard-deviation-as-measures-of-dispersion]
* An example for the calculation of the population variance: [https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/variance-of-a-population]
* Why do we need the sample variance? [https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/sample-variance]
* These two clips illustrate nicely why the sample variance uses the factor <math>1/(n-1)</math>. The first ([https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/review-and-intuition-why-we-divide-by-n-1-for-the-unbiased-sample-variance]) gives an intuitive explanation of why we need the factor <math>1/(n-1)</math>; the second ([https://www.khanacademy.org/math/probability/descriptive-statistics/variance_std_deviation/v/simulation-showing-bias-in-sample-variance]) uses a simulation to convince you.

= Footnotes =

<references />