IV in R


Introduction

In this Section we will demonstrate how to use instrumental variables (IV) estimation (or, more precisely, Two-Stage Least Squares, 2SLS) to estimate the parameters in a linear regression model. If you want some more theoretical background on why we may need to use these techniques you may want to refer to any decent econometrics textbook, or perhaps to this page.

Here we will be very short on the problem setup and big on the implementation! When you estimate a linear regression model, say

$y = \alpha_0 + \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + u$

the most crucial of all assumptions you make is that the explanatory variables $x_1$ to $x_3$ are uncorrelated with the error term $u$. Of course, the error term $u$ is unobservable and hence it is impossible to empirically test this assumption (notwithstanding a related test introduced below), and the applied econometrician ought to think very carefully about whether there is any reason to suspect that this assumption is breached. The seasoned econometrician would immediately rattle off simultaneity bias, measurement error and omitted relevant variables as the three most common causes for this to happen.

In some such situations you can actually fix the problem, e.g. by including the relevant variable in the model, but in others that is impossible and you need to accept that there is a high probability that, say, $x_3$ is correlated with $u$. We would then call $x_3$ an endogenous variable, and all those explanatory variables that do not have that issue are called exogenous. If you still persist with estimating that model by Ordinary Least Squares (OLS) you will have to accept that your estimated coefficients come from a random distribution that on average will not produce the correct (yet unknown) value; in technical lingo, the estimators are biased.

To the rescue comes instrumental variables (IV) estimation. What we need to use this technique is what is called an instrumental variable. If only $x_3$ is potentially correlated with the error term, we need at least one such variable, but more could be useful as well; you always need at least as many instruments as you have endogenous variables. These instruments need to have the following crucial properties: they need to be correlated with the endogenous variable, uncorrelated with the error term, and they should not appear in the model explaining the dependent variable $y$.

The basic idea of IV/2SLS

Here is a brief outline of what happens when you use IV, in the form of a TSLS regression.

  1. Take all of your endogenous variables and run regressions with these as the dependent variable and all other exogenous variables and all instrumental variables as explanatory variables. From these regressions you get the predicted values for all your endogenous variables, e.g. $\hat{x}_3$. These regression(s) are called first stage regressions. The idea is that, as all explanatory variables in this first stage regression are assumed to be uncorrelated with the error term, the variable $\hat{x}_3$ is also uncorrelated with the unobserved error term $u$. All variation in $x_3$ that was correlated with the error term $u$ must have ended up in the error term of this auxiliary regression.
  2. In the second stage of the procedure we return to our original regression model and replace $x_3$ with $\hat{x}_3$ and then estimate the parameters $\alpha_0$ to $\alpha_3$ by OLS. These estimators, at least for sufficiently large samples, will deliver unbiased estimates (technical lingo: consistent).

This sounds pretty easy. There is a slight complication, though: the standard errors that the second stage OLS regression delivers are incorrect, and we need to calculate different standard errors. But that will happen automatically in the procedure below.
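
To make the two stages concrete, here is a minimal sketch with simulated data; all names (z, x3, y) are illustrative and not part of the worked example below. It reproduces the 2SLS point estimate by hand and shows the OLS bias, but remember that the standard errors of the second stage regression are exactly the ones that cannot be trusted.

    # Minimal sketch of manual 2SLS with simulated data (illustrative names only)
    set.seed(123)
    n  <- 1000
    z  <- rnorm(n)                     # instrument
    u  <- rnorm(n)                     # unobserved error term
    x3 <- 0.8*z + 0.5*u + rnorm(n)     # endogenous regressor, correlated with u
    y  <- 1 + 2*x3 + u                 # true coefficient on x3 is 2
    stage1 <- lm(x3 ~ z)               # first stage regression
    x3_hat <- fitted(stage1)           # predicted values of the endogenous variable
    stage2 <- lm(y ~ x3_hat)           # second stage regression by OLS
    coef(lm(y ~ x3))                   # OLS slope estimate, biased away from 2
    coef(stage2)                       # manual 2SLS estimate, close to 2 (its standard errors are not valid)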

Implementation in R

The R package needed is the AER package, which we already recommended for use in the context of estimating robust standard errors. Included in that package is a function called ivreg which we will use. We explain how to use it by walking through an example.
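
If AER is not yet installed, install it once and load it at the start of your session (a quick sketch; loading AER also attaches the lmtest package, which provides the waldtest function used further below):

    # install.packages("AER")   # run once if the AER package is not yet installed
    library(AER)                # provides ivreg; also attaches lmtest (waldtest), sandwich, etc.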

If you use IV a lot in your work, you may well want to pack all of the following into one convenient function (just as Alan Fernihough has done here: https://diffuseprior.wordpress.com/2012/05/03/an-ivreg2-function-for-r/). But if you are applying IV for the first time it is actually very instructive to go through some of the steps in a bit more detail. It is also good to see that really there is not a lot of technical magic involved ... just a great idea!

Example

We will use the Women's Wages dataset to illustrate the use of IV regression. The dependent variable we use here is the log wage, lwage, and we are interested in whether the years of education, educ, have a positive influence on this log wage (here we mirror the analysis in Wooldridge's Example 15.1).

Let's first import the data

    setwd("YOUR/DIRECTORY/PATH")              # This sets the working directory
    mydata <- read.csv("mroz.csv", na.strings = ".")  # Opens mroz.csv from working directory

And also let's remove all observations with missing wages from the dataframe

    mydata <- subset(mydata, is.na(wage) == FALSE)   # remove observations with missing wages from dataset

An extremely simple model would be to estimate the following OLS regression which models lwage as a function of a constant and educ.

    reg_ex0 <- lm(lwage~educ,data=mydata)
    print(summary(reg_ex0))

which delivers

    Call:
    lm(formula = lwage ~ educ, data = mydata)
    Residuals:
         Min       1Q   Median       3Q      Max 
    -3.10256 -0.31473  0.06434  0.40081  2.10029 
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  -0.1852     0.1852  -1.000    0.318    
    educ          0.1086     0.0144   7.545 2.76e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Residual standard error: 0.68 on 426 degrees of freedom
      (325 observations deleted due to missingness)
    Multiple R-squared:  0.1179,	Adjusted R-squared:  0.1158 
    F-statistic: 56.93 on 1 and 426 DF,  p-value: 2.761e-13

This seems to indicate that every additional year of education increases the wage by almost 11% (recall the interpretation of a coefficient in a log-lin model!). The issue with this sort of model is that education is most likely to be correlated with individual characteristics that are important for the person's wage, but not modelled (and hence captured by the error term).

What we need is an instrument that meets the conditions outlined above. As in Wooldridge's example, we use the father's education, fatheduc, as an instrument. The way to do this is as follows:

    reg_iv0 <- ivreg(lwage~educ|fatheduc,data=mydata)
    print(summary(reg_iv0))

The ivreg function works very similarly to the lm command (as usual, use ?ivreg to get more detailed help). In fact the only difference is the specification of the instrument, |fatheduc. The instruments follow the model specification: behind the vertical line we find the instrument used to instrument the educ variable[1].

The result is

    Call:
    ivreg(formula = lwage ~ educ | fatheduc, data = mydata)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.0870 -0.3393  0.0525  0.4042  2.0677 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  0.44110    0.44610   0.989   0.3233  
    educ         0.05917    0.03514   1.684   0.0929 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.6894 on 426 degrees of freedom
    Multiple R-Squared: 0.09344,	Adjusted R-squared: 0.09131 
    Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294 

Clearly, the effect of an additional year of education has dropped substantially and is now only marginally significant. It is, of course, a common feature of IV estimation that the estimated standard errors are significantly larger than the OLS standard errors. The size of the standard error depends a lot on the strength of the relation between the endogenous explanatory variable and the instrument, which can be checked by looking at the R-squared of the regression of educ on fatheduc[2].
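
If you want to check that footnote value yourself, here is a quick sketch (assuming mydata as loaded above):

    # R-squared of the simple first stage regression of educ on fatheduc
    first_simple <- lm(educ ~ fatheduc, data = mydata)
    summary(first_simple)$r.squared   # compare with the value quoted in the footnote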

In order to illustrate the full functionality of the ivreg procedure we re-estimate the model with extra explanatory variables and more instruments than endogenous variables, which means that we are really applying a 2SLS estimation (this mirrors Wooldridge's Example 15.5). Let's start by estimating this model by OLS (we need this result later; the output is not shown here):

    reg_1 <- lm(lwage~educ+age+exper+expersq, data=mydata) # OLS estimation
    reg_1_sm <- summary(reg_1)
    print(reg_1_sm)

The estimated coefficient for educ is 0.1075 with standard error 0.0142 (the rest of the results are not shown). Then we estimate the TSLS regression with fatheduc and motheduc as instruments.

    reg_iv1 <- ivreg(lwage~educ+age+exper+expersq|fatheduc+motheduc+age+exper+expersq,data=mydata)
    print(summary(reg_iv1))

Before the vertical line we can see the model that is to be estimated, lwage~educ+age+exper+expersq. All the action is after the vertical line. First we see the instrumental variables used to instrument educ, fatheduc+motheduc; this is followed by all the explanatory variables that are considered exogenous, age+exper+expersq.

When you have a model with a lot of variables this way of calling an IV estimation can be quite unwieldy, as you have to replicate all the exogenous variables after the vertical line. A slightly different, more economical way of asking R to do the same thing is as follows:

    reg_iv1 <- ivreg(lwage~educ+age+exper+expersq|.-educ+fatheduc+motheduc,data=mydata)
    print(summary(reg_iv1))

After the vertical line you are basically telling R which variable to remove from the instrument set (the endogenous variable, -educ) and which to add (+fatheduc+motheduc). Make sure you don't forget the dot straight after the vertical line when you use this way of specifying the instruments; it stands for all the explanatory variables of the original model. What you get is the following:

    Call:
    ivreg(formula = lwage ~ educ + age + exper + expersq | . - educ + 
        fatheduc + motheduc, data = mydata)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -3.1017 -0.3216  0.0545  0.3685  2.3496 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  0.0667186  0.4591101   0.145   0.8845   
    educ         0.0609945  0.0315798   1.931   0.0541 . 
    age         -0.0003542  0.0049318  -0.072   0.9428   
    exper        0.0441960  0.0134524   3.285   0.0011 **
    expersq     -0.0008947  0.0004075  -2.196   0.0287 * 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.6756 on 423 degrees of freedom
    Multiple R-Squared: 0.1353,	Adjusted R-squared: 0.1272 
    Wald test: 6.085 on 4 and 423 DF,  p-value: 9.085e-05

Again we can see the significantly reduced size of the educ effect when compared to the OLS estimation.
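
To put the two estimates directly next to each other, a short sketch (assuming reg_1 and reg_iv1 from above are still in memory):

    # Compare the OLS and 2SLS estimates of the educ coefficient
    c(OLS = coef(reg_1)["educ"], IV = coef(reg_iv1)["educ"])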

IV related Testing procedures

One feature of IV estimation is that in general it is an inferior estimator of $\beta$ if all explanatory variables are exogenous. In that case, assuming that all other Gauss-Markov assumptions are met, the OLS estimator is the BLUE estimator. In other words, IV estimators have larger standard errors for the coefficient estimates. Therefore, one would really like to avoid having to rely on IV estimators, unless, of course, they are the only estimators that deliver consistent estimates.

So there are usually three tests one performs in this context: first, a test of whether the chosen instruments are indeed sufficiently strongly correlated with the endogenous variable (instrument relevance); second, a test of whether the potentially endogenous variable is indeed endogenous (testing for exogeneity); and finally, a test of whether the instruments are indeed exogenous (instrument validity).

Instrument relevance

The entire 2SLS procedure hinges on the instruments chosen being useful instruments. Useful here means that they are sufficiently strongly correlated with the endogenous variable.

We can use the first stage regression (described in the Introduction) to test whether that is indeed the case. So here is the first stage regression

    # First Stage
    first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)

What we now need to know is whether the instruments fatheduc and motheduc explain a sufficient amount of variation in educ. We can use a standard F-test to test this. Here is the easiest way to implement this (check how to implement F-tests here):

    > instrFtest <- waldtest(first_stage,.~.-fatheduc-motheduc)
    > print(instrFtest)
    Wald test
    Model 1: educ ~ age + exper + expersq + fatheduc + motheduc
    Model 2: educ ~ age + exper + expersq
      Res.Df Df      F    Pr(>F)    
    1    747                        
    2    749 -2 116.74 < 2.2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The value of the F-test is 116.74 with an extremely low p-value, so in this case we can clearly reject the null hypothesis that the instruments are irrelevant. (A common rule of thumb is that this first-stage F-statistic should be larger than about 10 for the instruments to be regarded as sufficiently strong.)
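
The same F-statistic can also be obtained by comparing the restricted and unrestricted first stage regressions with anova (a sketch, assuming the objects defined above):

    # Restricted first stage without the two instruments
    first_stage_r <- lm(educ ~ age + exper + expersq, data = mydata)
    anova(first_stage_r, first_stage)   # delivers the same F-statistic as the Wald test above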

Testing for exogeneity

You only want to use IV/TSLS if you are indeed dealing with endogenous explanatory variables. If the variable you suspected is not actually endogenous, then IV only has disadvantages compared to OLS; most crucially, it will deliver much larger standard errors. For this reason you want to make sure that you do have an endogeneity problem.

The celebrated test to use in this case is the Hausman test. Here we use a slightly different implementation of the original Hausman test, the so-called Wu-Hausman test.

In the end it is pretty straightforward and you only need simple regressions to implement it. In a first step you run the first stage regression(s) of the TSLS procedure. In our case this is:

    # First Stage
    first_stage <- lm(educ~age+exper+expersq+fatheduc+motheduc,data=mydata)

In a second step you add the residual(s) from this first stage to the original model:

    # Hausman
    Hausman_reg <- lm(lwage~educ+age+exper+expersq+first_stage$residuals,data=mydata)
    print(summary(Hausman_reg))


Now we need to compare this result to the one we got from the original OLS model reg_1. If educ is indeed endogenous, then the first stage regression should have isolated the variation in educ that was correlated with the error term in the residual of the first stage regression. In that case the included first_stage$residuals should be relevant in the above regression. As there may potentially be more than one endogenous variable, and hence more than one first stage residual, we use an F-test to test the null hypothesis that these residuals are irrelevant (and hence that endogeneity is not a problem).

    HausWutest <- waldtest(Hausman_reg,.~.-first_stage$residuals)
    print(HausWutest)
    Wald test
    Model 1: lwage ~ educ + age + exper + expersq + first_stage$residuals
    Model 2: lwage ~ educ + age + exper + expersq
      Res.Df Df      F  Pr(>F)  
    1    422                    
    2    423 -1 2.9051 0.08903 .
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The result is a p-value of 0.089, so at a significance level of $\alpha = 0.05$ we just fail to reject the null hypothesis that educ is exogenous.

Sargan test for instrument validity

One crucial property of instruments is that they ought to be uncorrelated with the regression error term $u$. Instrument exogeneity is set as the null hypothesis of the following test, with the alternative hypothesis being that the instruments are endogenous. This test can only be applied if you have more instruments than endogenous variables; it is therefore sometimes also called the test for overidentifying restrictions.

The test is rather simple to implement. Take the residuals from the 2SLS regression reg_iv1$residuals and use them as the dependent variable in a new regression in which you regress them on all exogenous explanatory variables and all instruments.

    Sargan_reg <- lm(reg_iv1$residuals~age+exper+expersq+fatheduc+motheduc,data=mydata)
    Sargan_reg_sm <- summary(Sargan_reg)

If the instruments are valid (the null hypothesis), they should be uncorrelated with these residuals, and hence we apply the following $\chi^2$ test: we use the $R^2$ of this regression and calculate $n \cdot R^2$.

    Sargan_test <- Sargan_reg_sm$r.squared*nrow(mydata)
    print(Sargan_test)
    print(1-pchisq(Sargan_test,1))  # prints p-value

We find that the p-value of this test is 0.5260 and hence we do not reject the null hypothesis of instrument validity. The p-value was obtained from a $\chi^2$ distribution with one degree of freedom; the degrees of freedom equal the number of instruments minus the number of endogenous variables, here 2 - 1 = 1.
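
For completeness, the ivreg summary can report the three diagnostics discussed in this section (weak instruments, Wu-Hausman and Sargan) directly when you ask for them; up to small implementation differences the reported statistics should be close to the ones computed above.

    # Diagnostic tests reported by the AER package itself
    print(summary(reg_iv1, diagnostics = TRUE))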

Footnotes

  1. The order of the variables after the vertical line doesn't matter.
  2. Which turns out to be 0.1958 if you check it.