Difference between revisions of "Regression"

= Correlation =
A commonly used measure of association is the ''sample correlation coefficient'', which is designed to tell us something about the characteristics of a scatter plot of observations on the variable <math>Y</math> against observations on the variable <math>X</math>. In particular, are higher than average values of <math>Y</math> associated with higher than average values of <math>X</math>, and vice versa? Consider the following data set, in which we observe the weight (<math>Y_i</math>), measured in pounds, and the height (<math>X_i</math>), measured in inches, of a sample of 12 people:
 
<table border="1">
 
<tr class="odd">
 
<td align="center"></td>
 
<td align="center">''<math>i</math>''</td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
<td align="center"></td>
 
</tr>
 
<tr class="even">
 
<td align="center">''Variable''</td>
 
<td align="center">1</td>
 
<td align="center">2</td>
 
<td align="center">3</td>
 
<td align="center">4</td>
 
<td align="center">5</td>
 
<td align="center">6</td>
 
<td align="center">7</td>
 
<td align="center">8</td>
 
<td align="center">9</td>
 
<td align="center">10</td>
 
<td align="center">11</td>
 
<td align="center">12</td>
 
</tr>
 
<tr class="odd">
 
<td align="center">Weight <math>=Y_i</math></td>
 
<td align="center">155</td>
 
<td align="center">150</td>
 
<td align="center">180</td>
 
<td align="center">135</td>
 
<td align="center">156</td>
 
<td align="center">168</td>
 
<td align="center">178</td>
 
<td align="center">160</td>
 
<td align="center">132</td>
 
<td align="center">145</td>
 
<td align="center">139</td>
 
<td align="center">152</td>
 
</tr>
 
<tr class="even">
 
<td align="center">Height <math>=X_i</math></td>
 
<td align="center">70</td>
 
<td align="center">63</td>
 
<td align="center">72</td>
 
<td align="center">60</td>
 
<td align="center">66</td>
 
<td align="center">70</td>
 
<td align="center">74</td>
 
<td align="center">65</td>
 
<td align="center">62</td>
 
<td align="center">67</td>
 
<td align="center">65</td>
 
<td align="center">68</td>
 
</tr>
 
 
 
</table>
 
 
 
The best way to graphically represent the data is the following scatter plot:
 
 
 
[[File:Regression_scatter1.jpg|frameless|700px]]
 
 
 
On this graph a horizontal line at <math>y=154</math> (i.e., <math>y\cong \bar{y}</math>) and a vertical line at <math>x=67</math> (i.e., <math>x\cong \bar{x}</math>) are superimposed. Points in the resulting ''upper right quadrant'' are those for which weight is higher than average '''and''' height is higher than average; points in the ''lower left quadrant'' are those for which weight is lower than average '''and''' height is lower than average. Since most points lie in these two quadrants, higher than average weight appears to be associated with higher than average height, and lower than average weight with lower than average height. This is the hallmark of what we call a positive relationship between <math>X</math> and <math>Y</math>. If there were no association, we would expect to see a roughly equal distribution of points across all four quadrants.
 
 
 
While it is often straightforward to see the qualitative nature of a relationship (positive, negative or unrelated), we want a numerical measure that describes this relationship so that we can also comment on its strength. The basis of such a measure is again the deviations from the sample mean (as in the calculation of the variance and standard deviation), but now we have two such deviations for each observation: the deviation in the <math>Y</math> variable, <math>d_{y,i}=(y_i-\bar{y})</math>, and the deviation in the <math>X</math> variable, <math>d_{x,i}=(x_i-\bar{x})</math>. In the following graph you can see these deviations represented by the dashed lines for the third (<math>i=3</math>) observation.
 
 
 
[[File:Regression_scatter_deviation.jpg|700px]]
 
 
 
In the case of the third observation, with <math>y_3=180</math> and <math>x_3=72</math>, we can see that both values are larger than the respective sample means <math>\bar{y}</math> and <math>\bar{x}</math>, and therefore both <math>d_{y,3}</math> and <math>d_{x,3}</math> are positive. In fact this will be the case for all observations that lie in the ''upper right'' quadrant. For observations in the ''lower left'' quadrant we will find both <math>d_{y,i}</math> and <math>d_{x,i}</math> to be smaller than 0. Observations in both these quadrants are reflective of a positive relationship. We therefore need to use the information in <math>d_{y,i}</math> and <math>d_{x,i}</math> in such a way that in both these cases we get a positive contribution to the statistic that numerically describes the relationship. Consider the term <math>(d_{y,i} \times d_{x,i})</math>; this term will be positive for all observations in either the ''upper right'' or the ''lower left'' quadrant. For values in either the ''upper left'' or the ''lower right'' quadrant, however, the terms <math>d_{y,i}</math> and <math>d_{x,i}</math> will have different signs and hence the term <math>(d_{y,i} \times d_{x,i})</math> will be negative, reflecting the fact that observations in these quadrants are representative of a negative relationship.
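The sign argument can be checked numerically. The following is a minimal sketch in Python (the language and the variable names are chosen here for illustration only and are not part of the original text); it computes the two deviations and their cross product for each of the twelve observations in the table and counts the signs:

```python
# Weight (pounds) and height (inches) of the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

n = len(weight)
y_bar = sum(weight) / n  # sample mean of Y
x_bar = sum(height) / n  # sample mean of X

# Deviations from the sample means, d_{y,i} and d_{x,i}.
d_y = [y - y_bar for y in weight]
d_x = [x - x_bar for x in height]

# Cross products d_{y,i} * d_{x,i}: positive in the upper right and
# lower left quadrants, negative in the other two quadrants.
cross = [dy * dx for dy, dx in zip(d_y, d_x)]

n_positive = sum(1 for c in cross if c > 0)
n_negative = sum(1 for c in cross if c < 0)

print(f"positive cross products: {n_positive}, negative: {n_negative}")
print(f"sum of cross products: {sum(cross):.3f}")
```

For these data eight of the twelve cross products are positive and their sum is about 616.3, which is consistent with the positive relationship visible in the scatter plot.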
 
 
 
It should now be no surprise to find that our numerical measure of a relationship between two variables is based on these terms. This measure is called the ''correlation coefficient'':
 
 
 
<math>r=\frac{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right) \left( y_{i}-\bar{y}\right) }{\sqrt{\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right)^{2}\,\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}}}</math>
 
 
 
If you calculate <math>r</math> for the above example you should obtain a value of 0.863. A few things are worth noting with respect to the ''correlation coefficient'':
 
 
 
* It can be shown algebraically that <math>-1\leq r\leq 1</math>.
 
* Positive (negative) numbers represent a positive (negative) relationship and a value of 0 represents the absence of any linear relationship. In our example <math>r=0.863</math> and hence the two variables display a strong positive correlation.
 
* The numerator contains the sum of the discussed cross products <math>d_{y,i} \times d_{x,i}=(y_i-\bar{y})(x_i-\bar{x})</math>
 
* The term in the denominator of the equation for <math>r</math> is related to the variances of <math>Y</math> and <math>X</math>. These terms are required to ''standardise'' the statistic to be between -1 and 1.
 
 
 
In order to understand the last point somewhat better, we can slightly reformulate the above equation by multiplying both numerator and denominator by the factor <math>(1/n)</math><ref>Note that this does not change the value of <math>r</math>.</ref>:
 
 
 
<math>r=\frac{\frac{1}{n}\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right) \left( y_{i}-\bar{y}\right) }{\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left( x_{i}-\bar{x}\right)^{2}\,\frac{1}{n}\sum_{i=1}^{n}\left( y_{i}-\bar{y}\right) ^{2}}}=\frac{\sigma_{y,x}}{\sqrt{\sigma_y^2 \sigma_x^2}}</math>
 
 
 
Written in this way, you should recognise that the denominator is nothing but the square root of the product of the (population) variance of <math>y</math>, <math>\sigma_y^2</math>, and the (population) variance of <math>x</math>, <math>\sigma_x^2</math>. The term in the numerator is what is called the ''population covariance'' between <math>y</math> and <math>x</math>, <math>\sigma_{y,x}</math>. The ''covariance'' is itself a measure of the relationship between these two variables, but it has many of the same shortcomings as the variance (see [[Descriptive]] Statistics). Therefore we want a standardised measure (to ensure that <math>-1\leq r \leq 1</math>). This standardisation uses the square root of the product of the two respective variances.
 
 
 
In what we just did we multiplied numerator and denominator by the term <math>(1/n)</math>, and what we obtained was the correlation expressed as a function of the ''population'' covariance, <math>\sigma_{y,x}</math>, and the ''population'' variances, <math>\sigma_y^2</math> and <math>\sigma_x^2</math>. If we had instead used the factor <math>(1/(n-1))</math>, we would have obtained an expression that relates the correlation to the ''sample'' covariance, <math>s_{y,x}</math>, and the sample variances, <math>s_y^2</math> and <math>s_x^2</math>, as follows:
 
 
 
<math>r=\frac{s_{y,x}}{\sqrt{s_y^2 s_x^2}}.</math>
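That the three ways of writing <math>r</math> give the same number can be verified directly. The sketch below (Python, with illustrative variable names that are not taken from the text) evaluates the raw-sums formula, the version with divisor <math>n</math> and the version with divisor <math>n-1</math> for the weight and height data:

```python
import math

# Weight (pounds) and height (inches) of the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

n = len(weight)
y_bar = sum(weight) / n
x_bar = sum(height) / n

# Sums of cross products and of squared deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
s_xx = sum((x - x_bar) ** 2 for x in height)
s_yy = sum((y - y_bar) ** 2 for y in weight)

# 1. Raw sums, as in the first formula for r.
r_sums = s_xy / math.sqrt(s_xx * s_yy)

# 2. Divisor n: "population" covariance and variances.
r_pop = (s_xy / n) / math.sqrt((s_xx / n) * (s_yy / n))

# 3. Divisor n - 1: sample covariance and variances.
r_sample = (s_xy / (n - 1)) / math.sqrt((s_xx / (n - 1)) * (s_yy / (n - 1)))

print(round(r_sums, 3), round(r_pop, 3), round(r_sample, 3))
```

All three expressions evaluate to the same value, about 0.863, because the common factor cancels between numerator and denominator.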
 
 
 
There are two very important limitations of the ''correlation coefficient'' :
 
 
 
# In general, this sort of analysis does not imply causation, in either direction. Variables may appear to move together for a number of reasons and not because one is causally linked to the other. For example, over the period <math>1945-64</math> the number of TV licences <math>(x)</math> taken out in the UK increased steadily, as did the number of convictions for juvenile delinquency <math>\left( y\right)</math>. Thus a scatter of <math>y</math> against <math>x</math>, and the construction of the sample correlation coefficient reveals an apparent positive relationship. However, to therefore claim that increased exposure to TV causes juvenile delinquency would be extremely irresponsible.
 
# The sample correlation coefficient gives an index of the apparent linear relationship only. It ''assumes'' that the scatter of points is distributed about some underlying straight line. This is discussed further below. However, the term relationship is not really confined to such linear relationships. Consider the relationship between ''age'' and ''income''. If we were to plot observations for the age and income of people in the age range of 20 to 50, we would clearly find a positive relationship. However, if we were to extend the age range to 80, we would most likely see that income decreases at the upper end of the age range. Therefore there is no linear age/income relationship across the full age range and the ''correlation coefficient'' cannot be used to describe such a relationship.
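The second limitation can be illustrated with a small sketch. The data below are hypothetical and are not taken from the text: <math>y</math> is an exact function of <math>x</math> (here <math>y=x^2</math>), yet over a range that is symmetric around zero the correlation coefficient is zero, while over the increasing part of the range alone it is strongly positive. The helper function <code>correlation</code> is an illustrative name.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r, computed from sums of deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: y depends on x exactly, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_full = correlation(xs, ys)            # whole range: U-shaped relationship
r_right = correlation(xs[3:], ys[3:])   # x = 0, 1, 2, 3 only: increasing part

print(f"r over the full range: {r_full:.3f}")
print(f"r over x >= 0 only:    {r_right:.3f}")
```

A value of <math>r</math> near zero therefore signals the absence of a ''linear'' relationship, not the absence of any relationship.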
 
 
 
Imagine drawing a straight line of ''best fit'' through the scatter of points in the above figure simply from ''visual'' inspection. You would try to make it ''go through'' the scatter, in some way, and it would probably have a positive slope. Numerically, one of the things that the correlation coefficient does is assess the slope of such a line: if <math>r>0</math>, then the slope of the line of best fit should be positive, and vice versa. Moreover, if <math>r</math> is close to either <math>1</math> or <math>-1</math>, then this implies that the scatter is quite closely distributed around the line of best fit. What the correlation coefficient doesn’t do, however, is tell us the exact position of the line of best fit. This is achieved using ''regression'' analysis.
 
 
 
== Additional resources ==
 
 
 
Khan Academy:
 
 
 
* Do not confuse correlation with causation: [https://www.khanacademy.org/math/probability/regression/regression-correlation/v/correlation-and-causality]
 
 
 
= Regression =
 
 
 
The line you have drawn on the scatter can be represented algebraically as <math>a+bx</math>. Here <math>x</math> represents the value on the horizontal axis (or height), <math>a</math> is the intercept (i.e. the value on the vertical axis at <math>x=0</math>) and <math>b</math> is the slope (i.e. the value by which the line increases as we increase <math>x</math> by one unit). The line is defined at any value of <math>x</math> and not only those at which we have actual observations.
 
 
 
However, if we were to substitute our values <math>x_i</math> we would get <math>\widehat{y}_i=a+bx_i</math>. It is important to note that the result of this operation, <math>\widehat{y}_i</math>, is not the same as <math>y_i</math>. The difference between the two is what is often called the ''residual'':
 
 
 
<math>res_i=y_i-\widehat{y}_i = y_i - a - bx_i.</math>
 
 
 
''Regression'' analysis is the statistical technique that finds the ''optimal'' values for <math>a</math> and <math>b</math>. We will soon see how to determine these values. Before we do so, we want to point out what ''optimal'' means in this context: we want to make the residuals <math>res_i</math>, for all <math>i=1,...,n</math> (where <math>n</math> is the number of observations at hand), as small as possible in some sense. Specifically, what we want to minimise is the ''sum of squared residuals'':
 
 
 
<math>\sum_{i=1}^{n}(y_{i}-\widehat{y}_{i})^{2} = \sum_{i=1}^{n}(y_{i}-a-bx_{i})^{2}.</math>
 
 
 
This is equivalent to saying that we want to minimise the variation of our sample observations around the regression line (<math>a+bx</math>). Here is our previous scatter plot with the ''line of best fit''
 
 
 
[[File:Regression_scatter_with_line.jpg|700px]]
 
 
 
It turns out that this line would be extremely close to the imaginary line of best fit I asked you to draw towards the end of the correlation section. Our eyes are pretty good at ''visual fitting''. This is particularly easy if you have the sample means of <math>y</math> and <math>x</math> available, as we have here in the form of the dotted lines, because the line of best fit will always go through the point <math>(\bar{x}, \bar{y})</math>.
 
 
 
Let’s get a little technical. The technique of obtaining <math>a</math> and <math>b</math> in this way is also known as ''ordinary least squares'' (OLS), since it minimises the sum of squared deviations (the sum of squared residuals) from the fitted line. We shall not dwell on the algebra here, but the solutions to the minimisation problem are:
 
 
 
<math>b=\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}},\quad a=\bar{y}-b\bar{x}.</math>
 
 
 
Applying the technique to the weight and height data yields: <math>b=616.333/191.667=3.2157</math>, and <math>a=154.167-3.2157\times 66.833=-60.746</math>, giving the smallest possible sum of squared residuals as <math>677.753.</math> This line:
 
 
 
<math>-60.746+3.2157 \times x</math>
 
 
 
is superimposed on the above figure.
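The numbers quoted above can be reproduced with a few lines of code. The following is a sketch in Python with illustrative names; the comparison line with intercept <math>-50</math> and slope <math>3</math> is a hypothetical guess made "by eye", included only to show that any other line gives a larger sum of squared residuals.

```python
# Weight (pounds) and height (inches) of the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

n = len(weight)
y_bar = sum(weight) / n
x_bar = sum(height) / n

# Numerator and denominator of the OLS slope formula.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
s_xx = sum((x - x_bar) ** 2 for x in height)

b = s_xy / s_xx          # slope
a = y_bar - b * x_bar    # intercept


def ssr(a_val, b_val):
    """Sum of squared residuals for the line a_val + b_val * x."""
    return sum((y - a_val - b_val * x) ** 2 for x, y in zip(height, weight))


ssr_ols = ssr(a, b)
ssr_guess = ssr(-50.0, 3.0)  # hypothetical line, for comparison only

print(f"b = {b:.4f}, a = {a:.3f}, SSR = {ssr_ols:.3f}")
print(f"SSR of the guessed line is larger: {ssr_guess > ssr_ols}")
print(f"fitted value at x_bar: {a + b * x_bar:.3f} (y_bar = {y_bar:.3f})")
```

The last line of output illustrates that the fitted line passes through the point of sample means.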
 
 
 
== Interpretation of regression equation ==
 
 
 
There are a number of issues that need to be stressed here. The first relates, yet again, to the ''sample''/''population'' issue. In most cases the data available to run a regression will be sample data. Recall how we previously discussed that the sample mean, <math>\bar{x}</math>, was the sample estimate of some unknown population parameter <math>\mu</math>, or that the sample variance, <math>s^2</math>, was the sample estimate of some unknown population variance, <math>\sigma^2</math>. In the same spirit, it turns out that the values of <math>a</math> and <math>b</math> that describe the line of best fit are sample estimates of some unknown population parameters (usually labelled <math>\alpha</math> and <math>\beta</math>).
 
 
 
Also note that <math>b</math> is the slope of the fitted line, <math>\hat{y}=a+bx</math>; i.e., the derivative of <math>\hat{y}</math> with respect to <math>x:</math>
 
 
 
<math>b=d\hat{y}/dx</math>
 
 
 
and measures the increase in <math>\hat{y}</math> for a unit increase in <math>x</math>.
 
 
 
Alternatively, it can be used to impute an elasticity. Elementary economics tells us that if <math>y</math> is some function of <math>x</math>, <math>y=f(x)</math>, then the elasticity of <math>y</math> with respect to <math>x</math> is given by the ''logarithmic derivative:''
 
 
 
<math>\dfrac{d\log \left( y\right) }{d\log \left(x\right) }=\dfrac{dy/y}{dx/x}\cong (x/y)b</math>
 
 
 
where we have used the fact that the differential <math>d\log (y)=\dfrac{1}{y}dy</math>. Such an elasticity is often evaluated at the respective sample means; i.e., it is calculated as <math>(\bar{x}/\bar{y})b</math>.
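As a sketch of this calculation, the elasticity at the sample means can be evaluated for the weight and height data used earlier (Python, illustrative names; applying an elasticity to weight and height is done here purely to illustrate the arithmetic):

```python
# Weight (pounds) and height (inches) of the 12 people in the table above.
weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

n = len(weight)
y_bar = sum(weight) / n
x_bar = sum(height) / n

# OLS slope b from the formula given earlier.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(height, weight))
s_xx = sum((x - x_bar) ** 2 for x in height)
b = s_xy / s_xx

# Elasticity of y with respect to x, evaluated at the sample means.
elasticity_at_means = (x_bar / y_bar) * b

print(f"elasticity at the sample means: {elasticity_at_means:.3f}")
```

The result is roughly 1.39: at the sample means, a 1% increase in height is associated with an increase of about 1.39% in fitted weight.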
 
 
 
=== Example ===
 
 
 
In applied economic studies of demand, the <math>\log </math> of demand <math>(Q)</math> is often regressed on the <math>\log </math> of price <math>(P)</math> in order to obtain the fitted equation (or relationship). For example, suppose an economic model for the quantity demanded of a good, <math>Q</math>, as a function of its price, <math>P</math>, is postulated as approximately being <math>Q=aP^{b}</math>, where <math>a</math> and <math>b</math> are unknown parameters, with <math>a>0</math> and <math>b<0</math> to ensure a positive, downward sloping demand curve. Taking logs on both sides we see that <math>\log(Q)=a^{\ast }+b\log (P),</math> where <math>a^{\ast }=\log (a).</math> Thus, if <math>n</math> observations are available, (<math>q_{i},p_{i}</math>), <math>i=1,...,n,</math> a scatter plot of <math>\log (q_{i})</math> against <math>\log (p_{i})</math> should be approximately linear in nature. This suggests that a simple regression of <math>\log (q_{i})</math> on <math>\log (p_{i})</math> would provide a direct estimate of the elasticity of demand, which is given by the value <math>b</math>.
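The log-log idea can be sketched as follows. The prices, the quantities and the parameter values <code>a_true</code> and <code>b_true</code> are hypothetical and chosen only for illustration; the quantities are generated exactly from <math>Q=aP^{b}</math> without noise, so the regression of <math>\log(q_i)</math> on <math>\log(p_i)</math> recovers <math>b</math> exactly. The helper <code>ols</code> is an illustrative name that implements the formulae for <math>a</math> and <math>b</math> given earlier.

```python
import math


def ols(xs, ys):
    """Return (a, b) of the least squares line a + b * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b


# Hypothetical demand curve Q = a * P^b (values assumed for illustration).
a_true, b_true = 100.0, -1.5
prices = [1.0, 2.0, 4.0, 8.0, 16.0]
quantities = [a_true * p ** b_true for p in prices]

# Regress log(Q) on log(P): the slope estimates the elasticity b.
log_p = [math.log(p) for p in prices]
log_q = [math.log(q) for q in quantities]
a_star, b_hat = ols(log_p, log_q)

print(f"estimated elasticity: {b_hat:.3f}")
print(f"estimated log(a):     {a_star:.3f}")
```

With real data the points would scatter around the line and <code>b_hat</code> would only be an estimate of the elasticity.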
 
 
 
== Transformations of data ==
 
 
 
Numerically, transformations of data can affect the above summary measures. For example, in the weight-height scenario, consider for yourself what would happen to the values of <math>a</math> and <math>b</math> and the correlation if we were to use kilograms and centimetres rather than pounds and inches.
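Once you have thought about this, one way to check your answer is the sketch below (Python, illustrative names). It recomputes <math>a</math>, <math>b</math> and <math>r</math> after converting the data using the exact definitions 1 lb = 0.45359237 kg and 1 in = 2.54 cm.

```python
import math

LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)
IN_TO_CM = 2.54        # centimetres per inch (exact by definition)


def summary(xs, ys):
    """Return (a, b, r): OLS intercept, OLS slope and correlation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b, s_xy / math.sqrt(s_xx * s_yy)


weight = [155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152]
height = [70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68]

a_imp, b_imp, r_imp = summary(height, weight)

weight_kg = [w * LB_TO_KG for w in weight]
height_cm = [h * IN_TO_CM for h in height]
a_met, b_met, r_met = summary(height_cm, weight_kg)

print(f"pounds/inches: a = {a_imp:.3f}, b = {b_imp:.4f}, r = {r_imp:.3f}")
print(f"kg/cm:         a = {a_met:.3f}, b = {b_met:.4f}, r = {r_met:.3f}")
```

The correlation is unchanged by the change of units, whereas the slope is rescaled by the ratio of the two conversion factors and the intercept by the weight conversion factor.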
 
 
 
A more important matter arises if we find that a scatter of some variable <math>y</math> against another, <math>x,</math> does not appear to reveal a linear relationship. In such cases, linearity may be retrieved if <math>y</math> is plotted against some function of <math>x</math> (e.g., <math>\log (x)</math> or <math>x^{2},</math> say). Indeed, there may be cases when <math>y</math> also needs to be transformed in some way. That is to say, transformations of the data (via some mathematical function) may render a non-linear relationship more linear.
 
 
 
== Regression in Matrix Form ==
 
 
 
Previously we looked at regression models that had one explanatory variable, i.e. variation in <math>y_i</math> was explained by variation in <math>x_i</math>. This setup allows for rather convenient representation of the OLS estimators for <math>\alpha</math> and <math>\beta</math> as seen above. It takes little imagination to envisage a situation where you would want to explain variation in the dependent variable (here the weight of a person), with more than one explanatory variable. In our little example we can think of a number of additional variables we may want to consider in addition to the height of a person. Potential additional explanatory variables could be the gender of the person (<math>gender_i</math>) and the amount of weekly exercise any person does (<math>exerc_i</math>). The regression model would then be represented by:
 
 
 
<math>weight_i = \alpha + \beta height_i + \gamma gender_i + \delta exerc_i + res_i</math>
 
 
 
Now we need estimates for the four coefficients <math>\alpha</math>, <math>\beta</math>, <math>\gamma</math> and <math>\delta</math>. Unfortunately, the formulae for these now become quite messy, that is unless you are prepared to represent this model in matrix form in which case everything becomes quite easy again.
 
 
 
As we like ''as easy as possible'', let’s do it!
 
 
 
To make the transition slightly more obvious, let me first restate the model slightly with different parameter names:
 
 
 
<math>weight_i = \beta_1 \times 1 + \beta_2 height_i + \beta_3 gender_i + \beta_4 exerc_i + res_i</math>
 
 
 
Note that I also multiplied the constant parameter <math>\beta_1</math> by the value 1, which of course doesn’t change anything. Now, let’s give the variables new names. We shall call the dependent variable, as before, <math>y_i</math>, and the explanatory variables (and we shall now include that constant value of 1 as one of the variables) <math>x_{1i} = 1</math>, <math>x_{2i} = height_i</math>, <math>x_{3i} = gender_i</math> and <math>x_{4i} = exerc_i</math>. Lastly, I call the residual <math>u_i</math>. Using all these new names (note that nothing of substance has changed so far), the regression model now looks as follows:
 
 
 
<math>y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + u_i</math>
 
 
 
Next we note what the index <math>i</math> really stands for. In our example sample we had 12 observations, that means that in the example <math>i</math> takes values 1 to 12, indicating that for each of the 12 people in the sample we have values for the dependent and explanatory variables (although the earlier table doesn’t include values for gender and exercise). This implies that we could write the regression model for each observation
 
 
 
<math>\begin{aligned}
y_1 &= \beta_1 x_{1,1} + \beta_2 x_{2,1} + \beta_3 x_{3,1} + \beta_4 x_{4,1} + u_1 \\
y_2 &= \beta_1 x_{1,2} + \beta_2 x_{2,2} + \beta_3 x_{3,2} + \beta_4 x_{4,2} + u_2 \\
&\;\;\vdots \\
y_{12} &= \beta_1 x_{1,12} + \beta_2 x_{2,12} + \beta_3 x_{3,12} + \beta_4 x_{4,12} + u_{12}
\end{aligned}</math>
 
 
 
The only elements that remain constant across all observations are the coefficients. We will now introduce matrix notation. Let’s define the following vectors and matrices:
 
 
 
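<math>\mathbf{y} = \left(\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_{12} \end{array}\right); \quad \mathbf{X} = \left(\begin{array}{cccc} x_{1,1} & x_{2,1} & x_{3,1} & x_{4,1} \\ x_{1,2} & x_{2,2} & x_{3,2} & x_{4,2} \\ \vdots & \vdots & \vdots & \vdots \\ x_{1,12} & x_{2,12} & x_{3,12} & x_{4,12} \end{array}\right); \quad \boldsymbol{\beta} = \left(\begin{array}{c} \beta_1 \\ \beta_2 \\ \beta_3 \\ \beta_4 \end{array}\right); \quad \mathbf{u} = \left(\begin{array}{c} u_1 \\ u_2 \\ \vdots \\ u_{12} \end{array}\right)</math>
 
With these definitions the twelve equations above can be written compactly as <math>\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}</math>.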
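As a sketch of how this looks in practice, the code below stacks the observations into a vector <code>y</code> and a matrix <code>X</code> and lets a numerical routine find the coefficient vector that minimises the sum of squared residuals. Two assumptions are made here that are not in the text: since the table contains no values for gender or exercise, <code>X</code> has only two columns (the constant and height), and NumPy's <code>np.linalg.lstsq</code> is used as the least squares solver.

```python
import numpy as np

# Weight (pounds) and height (inches) of the 12 people in the table above.
weight = np.array([155, 150, 180, 135, 156, 168, 178, 160, 132, 145, 139, 152],
                  dtype=float)
height = np.array([70, 63, 72, 60, 66, 70, 74, 65, 62, 67, 65, 68],
                  dtype=float)

# One row per observation; the first column is the constant x_{1i} = 1,
# the second column is x_{2i} = height_i.
X = np.column_stack([np.ones_like(height), height])
y = weight

# lstsq returns the beta that minimises the sum of squared residuals
# of y - X @ beta (further return values are ignored here).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta_hat
ssr = float(residuals @ residuals)

print(f"beta_1 (constant) = {beta_hat[0]:.3f}")
print(f"beta_2 (height)   = {beta_hat[1]:.4f}")
print(f"sum of squared residuals = {ssr:.3f}")
```

With only the constant and height in <code>X</code>, the result coincides with the values of <math>a</math> and <math>b</math> obtained earlier; additional explanatory variables would simply enter as further columns of <code>X</code>.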
 
 
 
== Additional resources ==
 
 
 
* Khan Academy: Setup of the OLS problem and how to prove the above formulae for <math>a</math> and <math>b</math> ([https://www.khanacademy.org/math/probability/regression/regression-correlation/v/squared-error-of-regression-line] and four follow-on clips; click on "up next" at the end of each clip). But be careful: in his videos Salman Khan uses <math>m</math> for what we call <math>b</math> and <math>b</math> for what we call <math>a</math>. Life is never easy!
 
* Link to a full (55min) undergraduate, introductory lecture on regression [http://www.youtube.com/watch?v=AHAlqJTrPHE&list=PLW7MJJThJQQs3djo1EL6KCRFeCa6wpfYY]. After minute 22 this clip has material that is not covered here (<math>R^2</math> and statistical inference in regressions).
 
 
 
 
 
= Examples =
 
 
 
You can find examples related to these topics here: [[Regression_Examples]].
 
 
 
= Footnotes =
 
 
 
<references />
 

Revision as of 10:46, 4 September 2014

<math>\mathbf{y} = \left(\begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_{12} \end{array}\right);</math>