Regression

From ECLR
Revision as of 13:42, 5 August 2013 by Admin

Correlation

A commonly used measure of association is the sample correlation coefficient, which is designed to tell us something about the characteristics of a scatter plot of observations on the variable [math]Y[/math] against observations on the variable [math]X[/math]. In particular, are higher than average values of [math]Y[/math] associated with higher than average values of [math]X[/math], and vice versa? Consider the following data set in which we observe the weight ([math]Y_i[/math]), measured in pounds, and the height ([math]X_i[/math]), measured in inches, of a sample of 12 people:

Observation [math]i[/math]            1    2    3    4    5    6    7    8    9   10   11   12
Weight [math]Y_i[/math] (pounds)    155  150  180  135  156  168  178  160  132  145  139  152
Height [math]X_i[/math] (inches)     70   63   72   60   66   70   74   65   62   67   65   68

The best way to graphically represent the data is the following scatter plot:

[[File:regessionscatter1.jpg|frameless|500px]]

On this graph a horizontal line at [math]y=154[/math] (i.e., [math]y\approx \bar{y}[/math]) and a vertical line at [math]x=67[/math] (i.e., [math]x\approx \bar{x}[/math]) are superimposed. Points in the resulting upper right quadrant are those for which both weight and height are higher than average; points in the lower left quadrant are those for which both are lower than average. Since most points lie in these two quadrants, higher than average weight appears to be associated with higher than average height, whilst lower than average weight is associated with lower than average height. This is the hallmark of what we call a positive relationship between [math]X[/math] and [math]Y[/math]. If there were no association, we would expect to see a roughly equal distribution of points in all four quadrants.
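
The quadrant argument can be checked by simply counting points. The following is a minimal sketch in Python; the variable names and the helper `quadrant` are our own choices, and the exact sample means are used rather than the rounded values 154 and 67 drawn on the graph (with the rounded lines, observation 10 with a height of 67 would sit on the vertical line itself).

```python
# Weight (pounds) and height (inches) for the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

y_bar = sum(weight) / len(weight)   # sample mean of Y (weight)
x_bar = sum(height) / len(height)   # sample mean of X (height)


def quadrant(x, y):
    """Label a point by its position relative to the two sample means."""
    vertical = "upper" if y > y_bar else "lower"
    horizontal = "right" if x > x_bar else "left"
    return f"{vertical} {horizontal}"


# Count how many observations fall into each of the four quadrants.
counts = {}
for x, y in zip(height, weight):
    label = quadrant(x, y)
    counts[label] = counts.get(label, 0) + 1

for label, n in sorted(counts.items()):
    print(label, n)
```

Eight of the twelve observations fall in the upper right or lower left quadrant, which is what "most points" refers to in the discussion above.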

While it is often straightforward to see the qualitative nature of a relationship (positive, negative or unrelated), we want a numerical measure that describes this relationship so that we can also comment on its strength. The basis of such a measure is again the deviations from the sample mean (as for the calculation of the variance and standard deviation), but now we have two such deviations for each observation: the deviation in the [math]Y[/math] variable, [math]d_{y,i}=(y_i-\bar{y})[/math], and the deviation in the [math]X[/math] variable, [math]d_{x,i}=(x_i-\bar{x})[/math]. In the following graph you can see these deviations represented by the dashed lines for the third ([math]i=3[/math]) observation.

[[File:regressionscatterdeviation.wmf|500px]]

In the case of the third observation with [math]y_3=180[/math] and [math]x_3=72[/math] we can see that both values are larger than the respective sample means [math]\bar{y}[/math] and [math]\bar{x}[/math], and therefore both [math]d_{y,3}[/math] and [math]d_{x,3}[/math] are positive. In fact this will be the case for all observations that lie in the upper right quadrant. For observations in the lower left quadrant we will find both [math]d_{y,i}[/math] and [math]d_{x,i}[/math] to be smaller than 0. Observations in both these quadrants are reflective of a positive relationship. We therefore need to use the information in [math]d_{y,i}[/math] and [math]d_{x,i}[/math] in such a way that in both these cases we get a positive contribution to the statistic that numerically describes the relationship. Consider the term [math](d_{y,i} \times d_{x,i})[/math]; this term will be positive for all observations in either the upper right or lower left quadrant. For observations in either the upper left or lower right quadrant, however, the terms [math]d_{y,i}[/math] and [math]d_{x,i}[/math] will have different signs and hence the term [math](d_{y,i} \times d_{x,i})[/math] will be negative, reflecting the fact that observations in these quadrants are representative of a negative relationship.
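
To see the sign argument at work, the deviations and their cross products can be tabulated for all 12 observations. This is only an illustrative sketch; the list names `d_y`, `d_x` and `cross` are our own and simply mirror the notation [math]d_{y,i}[/math], [math]d_{x,i}[/math] and [math](d_{y,i} \times d_{x,i})[/math] used above.

```python
# Weight (pounds) and height (inches) for the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

n = len(weight)
y_bar = sum(weight) / n
x_bar = sum(height) / n

# Deviations from the sample means and their cross products.
d_y = [y - y_bar for y in weight]
d_x = [x - x_bar for x in height]
cross = [dy * dx for dy, dx in zip(d_y, d_x)]

for i, (dy, dx, c) in enumerate(zip(d_y, d_x, cross), start=1):
    sign = "+" if c > 0 else "-"
    print(f"i={i:2d}  d_y={dy:7.2f}  d_x={dx:6.2f}  product={c:8.2f}  ({sign})")

# Observations in the upper right or lower left quadrant give positive products.
n_positive = sum(c > 0 for c in cross)
print("positive products:", n_positive, " negative products:", n - n_positive)
print("sum of cross products:", round(sum(cross), 2))
```

The positive products outnumber and outweigh the negative ones, so their sum is positive, in line with the positive relationship visible in the scatter plot.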

It should now be no surprise to find that our numerical measure of a relationship between two variables is based on these terms. This measure is called the correlation coefficient:

[math]r=\frac{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right) \left( y_{i}-\bar{y}\right) }{\sqrt{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right)^{2}\,\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}}}[/math]

If you calculate [math]r[/math] for the above example you should obtain a value of 0.863. A few things are worth noting with respect to the correlation coefficient:

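  • It can be shown algebraically that [math]-1\leq r\leq 1[/math]. The extreme values [math]\pm 1[/math] are attained when all points lie exactly on a straight line with non-zero slope.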
  • Positive (negative) numbers represent a positive (negative) relationship and a value of 0 represents the absence of any linear relationship. In our example [math]r=0.863[/math] and hence the two variables display a strong positive correlation.
  • The numerator contains the sum of the discussed cross products [math]d_{y,i} \times d_{x,i}=(y_i-\bar{y})(x_i-\bar{x})[/math]
  • The term in the denominator of the equation for [math]r[/math] is related to the variances of [math]Y[/math] and [math]X[/math]. These terms are required to standardise the statistic to be between -1 and 1.
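
The calculation of [math]r[/math] for the weight and height data can be reproduced with a few lines of Python. This is a minimal sketch that follows the formula above term by term; the function name `correlation` and the intermediate names `s_xy`, `s_xx` and `s_yy` are our own labels, not notation from this page.

```python
import math

# Weight (pounds) and height (inches) for the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]


def correlation(xs, ys):
    """Sample correlation coefficient r of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of the cross products of the deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squared deviations.
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = correlation(height, weight)
print(round(r, 3))
```

Rounded to three decimal places this gives the value of 0.863 quoted above.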

There are two very important limitations of the correlation coefficient:

  1. In general, this sort of analysis does not imply causation, in either direction. Variables may appear to move together for a number of reasons and not because one is causally linked to the other. For example, over the period 1945 to 1964 the number of TV licences [math](x)[/math] taken out in the UK increased steadily, as did the number of convictions for juvenile delinquency [math](y)[/math]. Thus a scatter plot of [math]y[/math] against [math]x[/math], and the sample correlation coefficient, reveal an apparent positive relationship. However, to claim on that basis that increased exposure to TV causes juvenile delinquency would be extremely irresponsible.
  2. The sample correlation coefficient gives an index of the apparent linear relationship only. It assumes that the scatter of points is distributed about some underlying straight line. This is discussed further below. However, the term relationship is not really confined to such linear relationships. Consider the relationship between age and income. If we were to plot observations for the age and income of people in the age range of 20 to 50 we would clearly find a positive relationship. However, if we were to extend the age range to 80, we would most likely see that income decreases at the upper end of the age range. Therefore there is no linear age/income relationship across the full age range and the correlation coefficient cannot be used to describe such a relationship.
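
The second limitation can be illustrated with a small numerical experiment. The data below are entirely hypothetical: we assume, purely for illustration, that income rises with age up to 50 and falls symmetrically afterwards. Neither the formula for `income` nor the numbers in it come from real data, and the function `correlation` is our own helper implementing the formula for [math]r[/math] given above.

```python
import math


def correlation(xs, ys):
    """Sample correlation coefficient r of two equally long sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical, made-up data: income peaks at age 50 and declines afterwards.
ages = list(range(20, 81))
income = [40000 - 20 * (a - 50) ** 2 for a in ages]

# Restrict the sample to the age range 20 to 50, where income only rises.
ages_young = [a for a in ages if a <= 50]
income_young = [inc for a, inc in zip(ages, income) if a <= 50]

r_young = correlation(ages_young, income_young)
r_full = correlation(ages, income)

print(f"ages 20 to 50: r = {r_young:.3f}")
print(f"ages 20 to 80: r = {r_full:.3f}")
```

In this constructed example [math]r[/math] is strongly positive for the restricted age range but essentially zero for the full range, even though income is an exact function of age. A value of [math]r[/math] close to 0 therefore signals the absence of a linear relationship, not the absence of any relationship.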