Introduction
In this section we will demonstrate how to use instrumental variables (IV) estimation to estimate the parameters of a linear regression model. The material follows the notation in the Heij et al. textbook[1]. The model of interest is
[math]\mathbf{y}=\mathbf{X\beta }+\mathbf{\varepsilon }[/math]
The issue is that we may suspect (or know) that the explanatory variables are correlated with the (unobserved) error term:
[math]p\lim \left( \frac{1}{n}\mathbf{X}^{\prime }\mathbf{\varepsilon }\right) \neq 0.[/math]
Reasons for such a situation include measurement error in [math]\mathbf{X}[/math], endogenous explanatory variables, omitted relevant variables, or a combination of the above. The consequence is that the OLS parameter estimate of [math]\mathbf{\beta}[/math] is biased and inconsistent. Fortunately, it is well established that IV estimation of [math]\mathbf{\beta}[/math] can deliver consistent parameter estimates. This does, however, require the availability of sufficient instruments [math]\mathbf{Z}[/math].
Before continuing it is advisable to be clear about the dimensions of certain variables. Let's assume that [math]\mathbf{y}[/math] is an [math](n \times 1)[/math] vector containing the [math]n[/math] observations of the dependent variable. [math]\mathbf{X}[/math] is an [math](n \times k)[/math] matrix with the [math]k[/math] explanatory variables in its columns, usually containing a vector of ones in the first column, representing the regression constant. Now, let [math]\mathbf{Z}[/math] be an [math](n \times p)[/math] matrix of instruments. Importantly, [math]p \ge k[/math]; further, [math]\mathbf{X}[/math] and [math]\mathbf{Z}[/math] may have columns in common. If so, these are explanatory variables from [math]\mathbf{X}[/math] that are judged to be certainly uncorrelated with the error term (like the constant).
It is well established that the instrumental variables in [math]\mathbf{Z}[/math] need to meet certain conditions in order to deliver useful IV estimators of [math]\mathbf{\beta}[/math]. They need to be uncorrelated with the error terms. Further, we require [math]E(\mathbf{Z}'\mathbf{X})[/math] to have full rank; in very simple cases this boils down to the instrument [math]\mathbf{Z}[/math] and the endogenous variable [math]\mathbf{X}[/math] being correlated with each other. Finally, the instruments should have no relevance for the dependent variable other than through their relation to the potentially endogenous variables (the exclusion restriction).
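In compact form, the two key requirements on the instruments are

[math]E\left(\mathbf{Z}'\mathbf{\varepsilon}\right) = \mathbf{0} \quad \text{(instrument exogeneity)}, \qquad \text{rank}\left(E\left(\mathbf{Z}'\mathbf{X}\right)\right) = k \quad \text{(rank condition)}.[/math]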
A number of MATLAB functions can be found here.
IV estimator
It is well established that the IV estimator of [math]\mathbf{\beta}[/math] is given by
[math]\mathbf{\widehat{\beta}}_{IV} = \left(\mathbf{X}'\mathbf{P}_Z \mathbf{X}\right)^{-1} \mathbf{X}'\mathbf{P}_Z \mathbf{y}[/math]
where [math]\mathbf{P}_Z = \mathbf{Z}\left(\mathbf{Z}'\mathbf{Z}\right)^{-1}\mathbf{Z}'[/math] is the projection matrix of [math]\mathbf{Z}[/math]. When performing inference, the variance-covariance matrix of [math]\mathbf{\widehat{\beta}}_{IV}[/math] is of obvious interest; it is calculated as
[math]Var\left(\mathbf{\widehat{\beta}}_{IV} \right) = \sigma ^{2}\left( \mathbf{X}^{\prime }\mathbf{P}_{Z}\mathbf{X}\right)^{-1}[/math]
where the estimate for the error variance comes from
[math]\begin{aligned} s_{IV}^{2} &= \frac{1}{n-k}\widehat{\mathbf{\varepsilon}}_{IV}^{\prime}\widehat{\mathbf{\varepsilon}}_{IV} \\ &= \frac{1}{n-k}\left(\mathbf{y}-\mathbf{X}\widehat{\mathbf{\beta}}_{IV}\right)^{\prime}\left(\mathbf{y}-\mathbf{X}\widehat{\mathbf{\beta}}_{IV}\right)\end{aligned}[/math]
MATLAB implementation
The following code extract assumes that the vector y contains the [math](n \times 1)[/math] vector with the dependent variable, the [math](n \times k)[/math] matrix x contains all explanatory variables, and z is an [math](n \times p)[/math] matrix (with [math]p \ge k[/math]) of instruments.
[n,k] = size(x);       % sample size and number of regressors
pz = z*inv(z'*z)*z';   % Projection matrix
xpzxi = inv(x'*pz*x);  % this is also (Xhat'Xhat)^(-1)
biv = xpzxi*x'*pz*y;   % IV estimate
res = y - x*biv;       % IV residuals
ssq = res'*res/(n-k);  % Sample variance of the IV residuals
s = sqrt(ssq);         % Sample standard deviation of the IV residuals
bse = ssq*xpzxi;       % Variance-covariance matrix of the IV estimates
bse = sqrt(diag(bse)); % Extract diagonal and take square root -> standard errors of the IV estimators
One feature of IV estimation is that, in general, it is an inferior estimator of [math]\mathbf{\beta}[/math] if all explanatory variables are exogenous. In that case, assuming that all other Gauss-Markov assumptions are met, the OLS estimator is the best linear unbiased estimator (BLUE). In other words, IV estimators have larger standard errors for the coefficient estimates. Therefore, one would like to avoid having to rely on IV estimators unless, of course, they are the only estimators that deliver consistent estimates.
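To see the estimator at work, the following sketch simulates a data set with one endogenous regressor and compares OLS and IV estimates. The data generating process (the names nobs, z2, u, err, and all coefficient values) is an assumption made purely for this illustration.

rng(42);                       % fix the random seed for reproducibility
nobs = 1000;
z2  = randn(nobs,1);           % instrument: drives x2, unrelated to the error
u   = randn(nobs,1);           % common shock creating the endogeneity
x2  = 0.8*z2 + 0.5*u + randn(nobs,1);  % endogenous regressor
err = 0.5*u + randn(nobs,1);   % regression error, correlated with x2 via u
y   = 1 + 2*x2 + err;          % true coefficients: constant 1, slope 2
x  = [ones(nobs,1) x2];        % regressor matrix (with constant)
z  = [ones(nobs,1) z2];        % instrument matrix (with constant)
bols = (x'*x)\(x'*y);          % OLS: inconsistent here (slope biased upwards)
pz   = z*((z'*z)\z');          % projection matrix of Z
biv  = (x'*pz*x)\(x'*pz*y);    % IV: consistent
disp([bols biv]);              % compare the two estimates

Because the IV estimator only uses the variation in x2 that is explained by z2, its standard errors will be larger than the OLS ones, illustrating the efficiency loss discussed above.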
For this reason any application of IV should be accompanied by evidence that establishes that it was necessary. Once that is established, one should also show that the chosen instruments meet the necessary requirements (of being correlated with the endogenous variable and being exogenous to the regression error term).
Testing for exogeneity
The null hypothesis to be tested here is that

[math]p\lim \left( \frac{1}{n}\mathbf{X}^{\prime }\mathbf{\varepsilon }\right) = 0,[/math]

i.e. that the explanatory variables are exogenous, against the alternative that this probability limit differs from zero, in which case IV estimation is required. The procedure described is as in Heij et al. and consists of the following three steps.
1. Estimate [math]\mathbf{y}=\mathbf{X\beta }+\mathbf{\varepsilon}[/math] by OLS and save the residuals [math]\widehat{\mathbf{\varepsilon}}[/math].

2. Estimate

[math]\mathbf{x}_{j}=\mathbf{Z\gamma }_{j}\mathbf{+v}_{j}[/math]

by OLS for all [math]\widetilde{k}[/math] elements of [math]\mathbf{X}[/math] that are possibly endogenous and save [math]\widehat{\mathbf{v}}_{j}[/math]. Collect these in the [math]\left( n\times \widetilde{k}\right)[/math] matrix [math]\widehat{\mathbf{V}}[/math].

3. Estimate the auxiliary regression

[math]\widehat{\mathbf{\varepsilon }}=\mathbf{X\delta }_{0}+\widehat{\mathbf{V}}\mathbf{\delta }_{1}+\mathbf{u}[/math]

and test the following hypothesis

[math]\begin{aligned} H_{0}&: \mathbf{\delta }_{1}=0\quad \mathbf{X}\text{ is exogenous} \\ H_{A}&: \mathbf{\delta }_{1}\neq 0\quad \mathbf{X}\text{ is endogenous} \end{aligned}[/math]

using the usual test statistic [math]\chi ^{2}=nR^{2}[/math] which, under [math]H_{0}[/math], is [math]\chi ^{2}\left( \widetilde{k}\right)[/math] distributed.
Implementing this test does not require anything other than the application of OLS regressions. In the following excerpt we assume that the dependent variable is contained in the vector y, the elements of [math]\mathbf{X}[/math] that are assumed to be exogenous are contained in x1, those suspected of being endogenous are in x2, and the instrument matrix is saved in z. As before, it is assumed that z contains all elements of x1.
The code below also uses the OLSest function for the Step 3 regression; however, that could easily be replaced by direct matrix computations, as for the regressions in Steps 1 and 2. A sketch of such a helper follows, after which the test itself is implemented.
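Since OLSest lives on the linked code page, here is a minimal sketch of what such a helper might compute, assuming the output order [b,bse,res,n,rss,r2] used in this article; the body below is an illustrative assumption, not the wiki's actual implementation.

function [b,bse,res,n,rss,r2] = OLSest(y,x,output)
% Minimal OLS helper sketch: y is (n x 1), x is (n x k),
% output = 0 suppresses printed results.
n   = size(y,1);
k   = size(x,2);
b   = (x'*x)\(x'*y);                 % OLS coefficients
res = y - x*b;                       % residuals
rss = res'*res;                      % residual sum of squares
s2  = rss/(n-k);                     % estimated error variance
bse = sqrt(diag(s2*inv(x'*x)));      % coefficient standard errors
r2  = 1 - rss/sum((y - mean(y)).^2); % R^2 (assumes x includes a constant)
if output ~= 0
    disp([b bse]);                   % minimal printed output
end
end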
x = [x1 x2];        % Combine to one matrix x
xxi = inv(x'*x);
b = xxi*x'*y;       % Step 1: OLS estimator
res = y - x*b;      % Step 1: saved residuals
zzi = inv(z'*z);    % Step 2: inv(Z'Z), used for the Step 2 regressions
gam = zzi*z'*x2;    % Step 2: OLS coefficients of the Step 2 regressions
                    % This works even if x2 has more than one column:
                    % gam has as many columns as x2
vhat = x2 - z*gam;  % Step 2: residuals (one column per column of x2)
[b,bse,res,n,rss,r2] = OLSest(res,[x vhat],0); % Step 3 regression
teststat = size(res,1)*r2;               % Step 3: nR^2 test statistic
pval = 1 - chi2cdf(teststat,size(x2,2)); % Step 3: p-value
A function that implements this test can be found here.
Sargan test for instrument validity
One crucial property of instruments is that they ought to be uncorrelated with the regression error terms [math]\mathbf{\varepsilon}[/math]. Instrument exogeneity is the null hypothesis of this test, with the alternative hypothesis being that the instruments are endogenous. The test proceeds in three steps.
1. Estimate the regression model by IV and save [math]\widehat{\mathbf{\varepsilon }}_{IV}=\mathbf{y}-\mathbf{X}\widehat{\mathbf{\beta }}_{IV}[/math].

2. Regress

[math]\widehat{\mathbf{\varepsilon }}_{IV}=\mathbf{Z\gamma +u}[/math]

3. Calculate [math]LM=nR^{2}[/math] from the auxiliary regression in Step 2. Under [math]H_{0}[/math], [math]LM[/math] is [math]\chi ^{2}[/math] distributed with [math]\left( p-k\right)[/math] degrees of freedom.
The MATLAB implementation of this test relies on the availability of the IV parameter estimates, which can be calculated as indicated above. In this section you can find a function called IVest that delivers the required IV residuals by calling:
[biv,bseiv,resiv,r2iv] = IVest(y,x,z);
The third output argument contains the IV residuals (refer to IVest for details), which can then be used as the dependent variable in the second-step regression:
[b,bse,res,n,rss,r2] = OLSest(resiv,z,0);           % Step 2: auxiliary regression
teststat = size(resiv,1)*r2;                        % Step 3: nR^2 test statistic
pval = 1 - chi2cdf(teststat,(size(z,2)-size(x,2))); % Step 3: p-value
It should be noted that this test is only applicable in the over-identified case, i.e. when z contains more columns than x. A function that implements this test can be found here.
Instrument relevance
The last required instrument property is that the instruments are correlated with the potentially endogenous variables. This is tested using a standard OLS regression with a potentially endogenous variable as the dependent variable and all instruments (i.e. z) as the explanatory variables. We then use an F-test of the restriction that the coefficients on all (non-constant) variables in z are jointly zero; if that restriction is rejected, the instruments are relevant. This is in fact exactly what the Step 2 regressions of the Hausman-type exogeneity test above estimate. A sketch of such a first-stage check is given below.
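The following sketch shows what such a first-stage relevance check could look like for a single potentially endogenous variable x2, reusing the variable names from the excerpts above; the F-test construction is standard, but the code itself is an illustration rather than the wiki's own function, and it assumes the first column of z is the constant.

[n,p] = size(z);               % sample size and number of instruments
gam   = (z'*z)\(z'*x2);        % first-stage OLS coefficients
v     = x2 - z*gam;            % unrestricted residuals
rss_u = v'*v;                  % unrestricted residual sum of squares
vr    = x2 - mean(x2);         % restricted residuals (constant-only model)
rss_r = vr'*vr;                % restricted residual sum of squares
q     = p - 1;                 % number of restrictions (non-constant instruments)
F     = ((rss_r - rss_u)/q)/(rss_u/(n-p)); % standard F statistic
pvalF = 1 - fcdf(F,q,n-p);     % small p-value => instruments are relevant

A commonly used rule of thumb is that a first-stage F statistic below about 10 signals weak instruments, in which case IV estimates can be badly behaved even if the instruments are formally valid.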
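Footnotes

1. Heij, C., de Boer, P., Franses, P.H., Kloek, T. and van Dijk, H.K. (2004) Econometric Methods with Applications in Business and Economics, Oxford University Press, New York [http://www.amazon.co.uk/Econometric-Methods-Applications-Business-Economics/dp/0199268010/ref=sr_1_1?s=books&ie=UTF8&qid=1354473313&sr=1-1]. This is an all-round good textbook that presents econometrics using matrix algebra.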