IV in R
Introduction
In this section we will demonstrate how to use instrumental variables (IV) estimation to estimate the parameters in a linear regression model. Go straight to the implementation section if you want to skip the theory. The material will follow the notation in the Heij et al. textbook[1].
[math]\mathbf{y}=\mathbf{X\beta }+\mathbf{\varepsilon }[/math]
Before continuing it is advisable to be clear about the dimensions of certain variables. Let’s assume that [math]\mathbf{y}[/math] is a [math](n \times 1)[/math] vector containing the [math]n[/math] observations for the dependent variable. [math]\mathbf{X}[/math] is a [math](n \times k)[/math] matrix with the [math]k[/math] explanatory variables in the columns, usually containing a vector of 1s in the first column, representing a regression constant. The issue is that we may suspect (or know) that at least one of the explanatory variables is correlated with the (unobserved) error term [math]\mathbf{\varepsilon }[/math].
Reasons for such a situation include measurement error in one of the explanatory variables, endogenous explanatory variables, omitted relevant variables (correlated with included explanatory variables) or a combination of the above. The consequence is that the OLS parameter estimate of [math]\mathbf{\beta}[/math] is biased and inconsistent. Fortunately it is well established that an IV estimation of [math]\mathbf{\beta}[/math] can potentially deliver consistent parameter estimates. This does, however, require the availability of sufficient instruments [math]\mathbf{Z}[/math]. Let [math]\mathbf{Z}[/math] be a [math](n \times p)[/math] matrix with instruments. Importantly, [math]p \ge k[/math], and further [math]\mathbf{X}[/math] and [math]\mathbf{Z}[/math] may have columns in common. If so, these are explanatory variables from [math]\mathbf{X}[/math] that are judged to be certainly uncorrelated with the error term (like the constant).
It is well established that the instrumental variables in [math]\mathbf{Z}[/math] need to meet certain restrictions in order to deliver useful IV estimators of [math]\mathbf{\beta}[/math]. They need to be uncorrelated with the error terms. Further, we require [math]E(\mathbf{Z}'\mathbf{X})[/math] to have full rank. In very simple cases this boils down to the instrument [math]\mathbf{Z}[/math] and the endogenous variable [math]\mathbf{X}[/math] being correlated with each other. Further, the instruments should have no relevance for the dependent variable other than through their relation to the potentially endogenous variable (exclusion restriction).
Instrumental Variables (IV) and Two Stage Least Squares (2SLS) estimators
It is well established that the following estimator is useful in the situation in which an element of [math]\mathbf{X}[/math] is correlated with the error term
[math]\mathbf{\widehat{\beta}}_{IV} = \left(\mathbf{X}'\mathbf{P}_Z \mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{P}_Z \mathbf{y}[/math]
where [math]\mathbf{P}_Z[/math] is the projection matrix of [math]\mathbf{Z}[/math]. In situations in which [math]p\gt k[/math] this is called the 2SLS estimator and if [math]p=k[/math] this is called the IV estimator. The latter can be understood to be a special case of the former.
Without going into any detail, it is instructive to understand why this is called the two stage least squares estimator. If you understand what projection matrices are [2] then you will realise that [math]\mathbf{P}_{Z}\mathbf{X}[/math] delivers the predicted values of the explanatory variables in [math]\mathbf{X}[/math] when they are regressed on all elements in [math]\mathbf{Z}[/math]. This is like a first stage regression. These predicted values are then used in place of the original values of [math]\mathbf{X}[/math] in a second stage regression to obtain [math]\mathbf{\widehat{\beta}}_{IV}[/math].
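To see this two-stage logic in action, here is a minimal sketch using two lm() calls. The variable names (lwage, educ and fatheduc in a data frame mydata) anticipate the wage example discussed below; the sketch illustrates the logic only, since the second stage as written does not report valid IV standard errors.

cc  <- complete.cases(mydata[, c("lwage", "educ", "fatheduc")])
dat <- mydata[cc, ]                          # keep complete observations only
stage1 <- lm(educ ~ fatheduc, data = dat)    # first stage: regress the endogenous X on the instrument Z
dat$educ_hat <- fitted(stage1)               # fitted values, i.e. P_Z X
stage2 <- lm(lwage ~ educ_hat, data = dat)   # second stage: use the fitted values in place of educ
coef(stage2)                                 # point estimates equal the IV estimates
# Note: the standard errors reported by stage2 are NOT valid IV standard errors;
# use ivreg() or the formulas below for inference.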
When performing inference, the variance-covariance matrix of [math]\mathbf{\widehat{\beta}}_{IV}[/math] is of obvious interest; it is calculated as follows
[math]Var\left(\mathbf{\widehat{\beta}}_{IV} \right) = \sigma ^{2}\left( \mathbf{X}^{\prime }\mathbf{P}_{Z}\mathbf{X}\right)^{-1}[/math]
where the estimate for the error variance is given by
[math]\begin{aligned} s_{IV}^{2} &= \frac{1}{n-k}\widehat{\mathbf{\varepsilon }}_{IV}^{\prime }\widehat{\mathbf{\varepsilon }}_{IV} \\ &= \frac{1}{n-k}\left( \mathbf{y-X}\widehat{\mathbf{\beta }}_{IV}\right) ^{\prime }\left( \mathbf{y-X}\widehat{\mathbf{\beta }}_{IV}\right)\end{aligned}[/math]
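These formulas translate almost line by line into R matrix code. The following is a minimal sketch assuming objects y (an n-vector), X (an n x k matrix including a column of ones) and Z (an n x p instrument matrix, p >= k) with no missing values; these objects are not created anywhere on this page and only serve to mirror the algebra.

PZ      <- Z %*% solve(crossprod(Z)) %*% t(Z)   # projection matrix P_Z = Z (Z'Z)^{-1} Z'
XPZX    <- t(X) %*% PZ %*% X
beta_iv <- solve(XPZX, t(X) %*% PZ %*% y)       # (X'P_Z X)^{-1} X'P_Z y
res_iv  <- y - X %*% beta_iv                    # IV residuals
s2_iv   <- sum(res_iv^2) / (nrow(X) - ncol(X))  # s^2_IV = e'e / (n - k)
V_iv    <- s2_iv * solve(XPZX)                  # estimated Var(beta_IV)
se_iv   <- sqrt(diag(V_iv))                     # IV standard errors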
Implementation in R
The R package needed is the AER package, which we already recommended for use in the context of estimating robust standard errors. Included in that package is a function called ivreg which we will use.
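A minimal set-up sketch follows; the file name is only a placeholder and should be replaced with whatever copy of the Women's Wages (mroz) data you are working with.

# install.packages("AER")        # run once if the package is not yet installed
library(AER)                     # provides ivreg() and its dependencies
mydata <- read.csv("mroz.csv")   # placeholder file name for the Women's Wages data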
Example
We will use the Women's Wages dataset to illustrate the use of IV regression. The dependent variable which we use here is the log wage lwage and we are interested in whether the years of education, educ, have a positive influence on this log wage (here we mirror the analysis in Wooldridge's Example 15.1). An extremely simple model would be to estimate the following OLS regression, which models lwage as a function of a constant and educ.
reg_ex1 <- lm(lwage~educ,data=mydata)
print(summary(reg_ex1))
which delivers
Call:
lm(formula = lwage ~ educ, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
-3.10256 -0.31473  0.06434  0.40081  2.10029

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.1852     0.1852  -1.000    0.318
educ          0.1086     0.0144   7.545 2.76e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.68 on 426 degrees of freedom
  (325 observations deleted due to missingness)
Multiple R-squared: 0.1179, Adjusted R-squared: 0.1158
F-statistic: 56.93 on 1 and 426 DF,  p-value: 2.761e-13
This seems to indicate that every additional year of education increases the wage by almost 11% (recall the interpretation of a coefficient in a log-lin model!). The issue with this sort of model is that education is most likely to be correlated with individual characteristics that are important for the person's wage, but not modelled (and hence captured by the error term).
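As a quick check of that interpretation, the exact percentage effect implied by the estimated coefficient is

[math]100\cdot\left(e^{0.1086}-1\right)\approx 11.5\%[/math]

while the usual approximation [math]100\cdot\widehat{\beta}_{educ}\approx 10.9\%[/math] is what is rounded to "almost 11%" above.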
What we need is an instrument that meets the conditions outlined above. As in Wooldridge's example, we use the father's education, fatheduc, as an instrument. The way to do this is as follows:
reg_iv1 <- ivreg(lwage~educ|fatheduc,data=mydata)
print(summary(reg_iv1))
The ivreg function works very similarly to the lm command (as usual, use ?ivreg to get more detailed help). In fact the only difference is the specification of the instrument, |fatheduc. The instruments follow the model specification. Behind the vertical line we find the instrument used to instrument the educ variable[3].
The result is
Call:
ivreg(formula = lwage ~ educ | fatheduc, data = mydata)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0870 -0.3393  0.0525  0.4042  2.0677

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.44110    0.44610   0.989   0.3233
educ         0.05917    0.03514   1.684   0.0929 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6894 on 426 degrees of freedom
Multiple R-Squared: 0.09344, Adjusted R-squared: 0.09131
Wald test: 2.835 on 1 and 426 DF,  p-value: 0.09294
Clearly, the estimated effect of an additional year of education has dropped substantially and is now only marginally significant. It is, of course, a common feature of IV estimation that the estimated standard errors are significantly larger than those of the OLS estimator. The size of the standard error depends a lot on the strength of the relation between the endogenous explanatory variable and the instrument, which can be checked by looking at the R-squared of the regression of educ on fatheduc[4].
In order to illustrate the full functionality of the ivreg procedure we re-estimate the model with extra explanatory variables and more instruments than endogenous variables, which means that we are really applying a 2SLS estimation (this is the example estimated in Wooldridge's Example 15.5):
reg_iv1 <- ivreg(lwage~educ+exper+expersq|fatheduc+motheduc+exper+expersq,data=mydata)
print(summary(reg_iv1))
Before the vertical line we can see the model that is to be estimated, lwage~educ+exper+expersq. All the action is after the vertical line. First we see the instrumental variables used to instrument educ, fatheduc+motheduc; this is followed by all the explanatory variables that are considered exogenous, exper+expersq.
When you have a model with a lot of variables this way of calling an IV estimation can be quite unwieldy, as you have to repeat all the exogenous variables (here exper and expersq). A slightly different, more economical way of asking R to do the same thing is as follows:
reg_iv1 <- ivreg(lwage~educ+exper+expersq|.-educ+fatheduc+motheduc,data=mydata)
print(summary(reg_iv1))
What you get is the following
Call:
ivreg(formula = lwage ~ educ + age + exper + expersq | . - educ +
    fatheduc, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
-3.09354 -0.32798  0.05094  0.37402  2.35375

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0513505  0.4936538  -0.104  0.91720
educ         0.0701490  0.0346051   2.027  0.04328 *
age         -0.0002287  0.0049140  -0.047  0.96290
exper        0.0436778  0.0134180   3.255  0.00122 **
expersq     -0.0008790  0.0004064  -2.163  0.03111 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6727 on 423 degrees of freedom
Multiple R-Squared: 0.143, Adjusted R-squared: 0.1349
Wald test: 6.225 on 4 and 423 DF,  p-value: 7.106e-05
IV related Testing procedures

One feature of IV estimation is that, if all explanatory variables are in fact exogenous, it is in general an inferior estimator of [math]\mathbf{\beta}[/math]. In that case, assuming that all other Gauss-Markov assumptions are met, the OLS estimator is the BLUE estimator; in other words, IV estimators then have larger standard errors for the coefficient estimates. Therefore one would really like to avoid having to rely on IV estimators unless, of course, they are the only estimators that deliver consistent estimates.

For this reason any application of IV should be accompanied by evidence that establishes that it was necessary. Once that is established, one should also establish that the instruments chosen meet the necessary requirements (being correlated with the endogenous variable and being exogenous to the regression error term).
Testing for exogeneity
The null hypothesis we want to test is that the potentially endogenous variable, here educ, is exogenous, i.e. unrelated to the regression error. If that is the case, an IV estimation is not required. The procedure described here follows Chapter 15 of Wooldridge's textbook.
Step 1: Estimate [math]\mathbf{y}=\mathbf{X\beta }+\mathbf{\varepsilon}[/math] by OLS and save the residuals [math]\widehat{\mathbf{\varepsilon}}[/math].
Step 2: Estimate
[math]\mathbf{x}_{j}=\mathbf{Z\gamma }_{j}\mathbf{+v}_{j}[/math]
by OLS for all [math]\widetilde{k}[/math] elements in [math]\mathbf{X}[/math] that are possibly endogenous and save [math]\widehat{\mathbf{v}}_{j}[/math]. Collect these in the [math]\left( n\times \widetilde{k}\right) [/math] matrix [math]\widehat{\mathbf{V}}[/math].
Step 3: Estimate the auxiliary regression
[math]\widehat{\mathbf{\varepsilon }}=\mathbf{X\beta }_{0}+\widehat{\mathbf{V}}\mathbf{\delta }_{1}+\mathbf{u}[/math]
and test the following hypothesis
[math]\begin{aligned} H_{0}&: \mathbf{\delta }_{1}=0~~\mathbf{X}\text{ is exogenous} \\ H_{A}&: \mathbf{\delta }_{1}\neq 0~~\mathbf{X}\text{ is endogenous} \end{aligned}[/math]
using either a t-test or an F-test, depending on how many columns we have in [math]\widehat{\mathbf{V}}[/math].
Implementing this test does not require anything other than OLS regressions. In the following MATLAB excerpt we assume that the dependent variable is contained in vector y, the elements of [math]\mathbf{X}[/math] that are assumed to be exogenous are contained in x1, those elements that are suspected of being endogenous are in x2, and the instrument matrix is saved in z. As before, it is assumed that z contains all elements of x1.

The code uses the OLSest function for the Step 3 regression; that could easily be avoided by coding the regression directly, as is done for Steps 1 and 2. An R sketch of the same test for the wage example is given after the excerpt.
x = [x1 x2]; % Combine to one matrix x
xxi = inv(x'*x);
b = xxi*x'*y; % Step 1: OLS estimator
res = y - x*b; % Step 1: saved residuals
zzi = inv(z'*z); % Step 2: inv(Z'Z) which is used in Step 2
gam = zzi*z'*x2; % Step 2: Estimate OLS coefficients of step 2 regressions
% This works even if we have more than one element in x2
% we get as many columns of gam as we have elements in x2
vhat = x2 - z*gam; % Step 2: residuals (has as many columns as in x2
[b,bse,res,n,rss,r2] = OLSest(res,[x vhat],0); % Step 3 regression
teststat = size(res,1)*r2; % Step 3: Calculate nR^2 test stat
pval = 1 - chi2cdf(teststat,size(x2,2)); % Step 3: Calculate p-value
A function that implements this test can be found here.
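For the wage example the same three steps can be carried out in R with plain lm() calls. The sketch below treats educ as potentially endogenous, exper and expersq as exogenous, and fatheduc and motheduc as instruments; it mirrors the [math]nR^{2}[/math] statistic computed in the MATLAB excerpt and assumes the mydata data frame used earlier.

vars <- c("lwage", "educ", "exper", "expersq", "fatheduc", "motheduc")
dat  <- mydata[complete.cases(mydata[, vars]), ]        # estimation sample
ols1 <- lm(lwage ~ educ + exper + expersq, data = dat)  # Step 1: OLS of the original model
dat$ehat <- resid(ols1)                                 # Step 1: saved residuals
fs   <- lm(educ ~ exper + expersq + fatheduc + motheduc, data = dat)  # Step 2: first stage
dat$vhat <- resid(fs)                                   # Step 2: saved residuals
aux  <- lm(ehat ~ educ + exper + expersq + vhat, data = dat)          # Step 3: auxiliary regression
LM   <- nrow(dat) * summary(aux)$r.squared              # Step 3: n * R^2 test statistic
pval <- 1 - pchisq(LM, df = 1)                          # Step 3: p-value, one suspect variable
c(LM = LM, p.value = pval)

A significant test statistic indicates that educ is endogenous, so that IV estimation is indeed required.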
Sargan test for instrument validity
One crucial property of instruments is that they ought to be uncorrelated with the regression error terms [math]\mathbf{\varepsilon}[/math]. The null hypothesis of this test is that the instruments are exogenous (valid), with the alternative hypothesis being that at least some of the instruments are endogenous.
Step 1: Estimate the regression model by IV and save [math]\widehat{\mathbf{\varepsilon }}_{IV}=\mathbf{y}-\mathbf{X}\widehat{\mathbf{\beta }}_{IV}[/math]
Step 2: Regress
[math]\widehat{\mathbf{\varepsilon }}_{IV}=\mathbf{Z\gamma +u}[/math]
Step 3: Calculate [math]LM=nR^{2}[/math] from the auxiliary regression in Step 2. [math]LM[/math] is (under [math]H_{0}[/math]) [math]\chi ^{2}[/math] distributed with [math]\left( p-k\right)[/math] degrees of freedom.
The MATLAB implementation of this test relies on the availability of the IV parameter estimates, which can be calculated as indicated above. A function called IVest delivers the required IV residuals by calling:
[biv,bseiv,resiv,r2iv] = IVest(y,x,z);
The third output is the IV residuals (refer to IVest for details), which can then be used as the dependent variable in the second step regression:
[b,bse,res,n,rss,r2] = OLSest(resiv,z,0); % Step 2: calculate Step 2 regression
teststat = size(resiv,1)*r2; % Step 3: Calculates the nR^2 test statistic
pval = 1 - chi2cdf(teststat,(size(z,2)-size(x,2))); % Step 3: Calculate p-value
It should be noted that this test is only applicable in the over-identified case, i.e. when z contains more columns than x. A function that implements this test can be found here.
Instrument relevance
The last instrument property that is required is that the instruments are correlated with the potentially endogenous variables. This is tested using a standard OLS regression that uses the potentially endogenous variable as the dependent variable and all instrumental variables (i.e. z) as the explanatory variables. We then need to check whether the instruments that are excluded from the original model are jointly relevant (F-test). If they are jointly significant, the instruments are relevant. This is in fact exactly what the Step 2 regression of the exogeneity (Hausman) test above does.
Footnotes
- ↑ Heij C, de Boer P., Franses P.H., Kloek T. and van Dijk H.K (2004) Econometric Methods with Applications in Business and Economics, Oxford University Press, New York. This is an all-round good textbook that presents econometrics using matrix algebra.
- ↑ Check any econometrics textbook with a good section on matrix algebra, like the one in the above note or William Greene's Econometric Analysis.
- ↑ The order of the variables after the vertical line doesn't matter.
- ↑ Which turns out to be 0.1958 if you check it.