Regression Diagnostics for ECLR

Ralf Becker

Introduction

When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see the relevant CRAN webpage, https://cran.r-project.org/web/packages/sandwich/index.html).
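
If you have not installed the AER package yet, you need to do so once before loading it (the same applies to the FinTS package used later):

install.packages("AER")  # only needed the first time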

Heteroskedasticity

Heteroskedasticity in cross sections

One of the Gauss-Markov assumptions is that the variance of the regression error terms is constant. If it is not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity-robust standard errors to obtain valid inference on regression coefficients (see R_robust_se).

Tests for heteroskedasticity are usually based on an auxiliary regression of the estimated squared regression residuals on a set of explanatory variables that are suspected to be related to the potentially changing error variance. We continue the example we started in the Regression Section (http://eclr.humanities.manchester.ac.uk/index.php/R_Regression#A_first_example), which is replicated here; note the first line, which we include to gain access to the procedures in the AER toolbox:

library(AER)  # allow access to AER package


# This is my first R regression!
setwd("C:/Users/Ralph Becker/Dropbox/ECLR/R/RegressionDiagnostics")              # This sets the working directory
mydata <- read.csv("mroz.csv")  # Opens mroz.csv from working directory
     
# Now convert variables with "." to num with NA
mydata$wage <- as.numeric(as.character(mydata$wage))
mydata$lwage <- as.numeric(as.character(mydata$lwage))
## Warning: NAs introduced by coercion

Before we run our initial regression model we shall restrict the dataframe mydata to those data that do not have missing wage information, using the following subset command:

mydata <- subset(mydata, wage!="NA")  # select non NA data
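
For rows with a missing wage the condition wage!="NA" evaluates to NA, and subset keeps only rows where the condition is TRUE, so those rows are dropped. A more idiomatic way to express the same selection (a sketch with the same result on this data) is:

mydata <- subset(mydata, !is.na(wage))  # drop rows with missing wage information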

Now we can run our initial regression:

# Run a regression
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
reg_ex1_sm <- summary(reg_ex1)

As we have learned in R_Regression (http://eclr.humanities.manchester.ac.uk/index.php/R_Regression), these two objects now contain all the information we would want from a regression. A test for error heteroskedasticity is based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for the error variance) as the dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan, two excellent Australian econometricians (T. S. Breusch and A. R. Pagan (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation, Econometrica, Vol. 47, No. 5, pp. 1287-1294). What these variables should be depends on the problem; in general, they are any variables that the researcher suspects correlate with the error variance.

The AER toolbox contains a procedure that makes performing this test a doddle.

# Test whether residuals are homoskedastic
print(bptest(reg_ex1))
## 
##  studentized Breusch-Pagan test
## 
## data:  reg_ex1
## BP = 9.5792, df = 2, p-value = 0.008316

The p-value comes from the asymptotic distribution (Chi-Square distribution with 2 degrees of freedom) of the test statistic, assuming that the null hypothesis of homoskedasticity is valid. In this particular case the p-value is fairly small (smaller than 1%), which would lead us to reject the null hypothesis of homoskedasticity at a 1% significance level.

This was pretty easy, but I want to demonstrate what is really happening behind this test so that you know how to change it according to the problem at hand. As discussed above, we need to decide which explanatory variables to use in order to execute the test. The default choice of bptest is to use the same explanatory variables as those used in the original regression model, here exper and log(huswage). You could get the same result by running this regression yourself and calculating the test statistic [math]n \cdot R^2[/math]:

mydata$res1_sq <- (reg_ex1$residuals)^2 # add squared residuals to dataframe
hs_test <- lm(res1_sq~exper+log(huswage), data = mydata)
hs_test_sm <- summary(hs_test)
teststat <- hs_test_sm$r.squared*nrow(mydata)
print(teststat)  # test statistic n*R^2
## [1] 9.57924
print(1-pchisq(teststat,2))   # print p value
## [1] 0.008315617

which returns exactly the same test statistic and p-value. Note that pchisq(teststat,2) returns the left-hand tail probability of the Chi-Square distribution with 2 degrees of freedom, so 1-pchisq(teststat,2) delivers the p-value. In this procedure you could easily change the variables included in the auxiliary regression by changing the explanatory variables in lm(res1_sq~exper+log(huswage), data = mydata). But note that you will then have to adjust the degrees of freedom in pchisq(teststat,2) to reflect the number of variables included.
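
For instance, a hypothetical variant that also includes the square of exper in the auxiliary regression would look as follows (a sketch; the choice of variables is purely illustrative, and the degrees of freedom become 3 because the auxiliary regression now has three explanatory variables):

hs_test2 <- lm(res1_sq~exper+I(exper^2)+log(huswage), data = mydata)  # auxiliary regression with an extra term
teststat2 <- summary(hs_test2)$r.squared*nrow(mydata)  # test statistic n*R^2
print(1-pchisq(teststat2,3))  # p value, now with 3 degrees of freedom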

Heteroskedasticity in time series data

Let's load another dataset, this time a time series of exchange rates. The exchange rate is in USUK, and its log change is in logdUSUK.

ex_data <- read.csv("USDUKP.csv",skip = 12)
n <- nrow(ex_data)
summary(ex_data) # this produces summary stats
##          DATE            USUK          logdUSUK        
##  01/02/1971:    1   Min.   :1.052   Min.   :-2.156815  
##  01/02/1972:    1   1st Qu.:1.556   1st Qu.:-0.125833  
##  01/02/1973:    1   Median :1.666   Median : 0.002539  
##  01/02/1974:    1   Mean   :1.768   Mean   :-0.001520  
##  01/02/1977:    1   3rd Qu.:1.915   3rd Qu.: 0.127335  
##  01/02/1978:    1   Max.   :2.644   Max.   : 1.992773  
##  (Other)   :10813                   NA's   :1

As usual with time-series data we shall let R know what we are dealing with. We choose frequency = 1 for daily data, and tell R that we start at observation 1 (start=1) and finish at observation nrow(ex_data), which automatically checks how long our dataset is.

ex_data$USUK <- ts(ex_data$USUK,start = 1, end=nrow(ex_data),frequency = 1)
ex_data$logdUSUK <- ts(ex_data$logdUSUK,start = 1, end=nrow(ex_data),frequency = 1)

Let's estimate an AR(1) model for logdUSUK (see the time series section, http://eclr.humanities.manchester.ac.uk/index.php/R_TimeSeries, for details):

reg2 <- arima(ex_data$logdUSUK, order = c(1,0,0))

Let's say we want to perform a test for ARCH-type heteroskedasticity (say, with 12 lags). The easiest way to do that is to use the FinTS package, which has a function written for exactly that purpose. So install that package and then load it.

#install.packages("FinTS")  # only do this the first time you download the package
library(FinTS)
## Warning: package 'FinTS' was built under R version 3.1.3

And this is how you call the ArchTest function. Under the hood this will run an auxiliary regression:

[math]\hat{\epsilon}^2_t=\alpha_0 + \alpha_1 \hat{\epsilon}^2_{t-1}+ \alpha_2 \hat{\epsilon}^2_{t-2} +...+ \alpha_{12} \hat{\epsilon}^2_{t-12}+u_t[/math]

ArchTest(reg2$residuals, lags = 12)  
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  reg2$residuals
## Chi-squared = 1004.562, df = 12, p-value < 2.2e-16

The p-value of this test is extremely small, so we can easily reject the null hypothesis of no heteroskedasticity.
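
To see the link to the auxiliary regression above, here is a minimal sketch that computes the same type of statistic by hand (assuming reg2 has been estimated as above; small implementation differences may make the result differ marginally from the ArchTest output):

e2 <- as.numeric(reg2$residuals)^2  # squared residuals
emb <- embed(e2, 13)                # column 1 is e2_t, columns 2 to 13 are lags 1 to 12
arch_aux <- lm(emb[,1]~emb[,2:13])  # auxiliary regression on 12 lags
stat <- summary(arch_aux)$r.squared*nrow(emb)  # (number of usable observations)*R^2
print(1-pchisq(stat,12))            # compare with the ArchTest p-value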

Autocorrelation

When we are dealing with time series data we will usually want to know whether the error terms are uncorrelated, as this is an assumption required for OLS estimators to be efficient. The function we use to test for autocorrelation is the bgtest function from the lmtest package, which implements a Breusch-Godfrey test. The null hypothesis is the absence of autocorrelation, whereas the alternative hypothesis implies the presence of autocorrelation.

We show how to apply the test to the estimated residuals from the reg2 model, which was an AR(1) model for the log differences in the US-UK exchange rate. First we re-estimate the model. The reason for this is that the bgtest function does not accept the arima function output as an input, but it does accept the output of the lm function.

reg3 <- lm(logdUSUK[-1]~logdUSUK[-nrow(ex_data)],data = ex_data) # this is an AR(1) model for logdUSUK
bgtest(reg3,order=4)
## 
##  Breusch-Godfrey test for serial correlation of order up to 4
## 
## data:  reg3
## LM test = 2.2193, df = 4, p-value = 0.6955

This is the Breusch-Godfrey (LM) test of order up to 4. Note that logdUSUK[-1] removes the first observation and logdUSUK[-nrow(ex_data)] removes the last. This creates [math]y_t[/math] and [math]y_{t-1}[/math] respectively.
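
As an aside, for a test against first-order autocorrelation the lmtest package also provides the Durbin-Watson test via dwtest; a one-line sketch using the model above:

dwtest(reg3)  # H0: no first-order autocorrelation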

Residual Normality
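
A standard way to check residual normality is the Jarque-Bera test, which compares the skewness and kurtosis of the residuals with those of a normal distribution. A minimal sketch, assuming the tseries package is installed (the package choice is an assumption, not part of the example above):

library(tseries)                     # assumption: tseries is installed
jarque.bera.test(reg_ex1$residuals)  # H0: residuals are normally distributed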

General Misspecification - RESET test

This is a test which does not have any specific alternative. A number of reasons could trigger a rejection of the null hypothesis of correct model specification.

Let's go back to the model we estimated in reg_ex1.

resettest(reg_ex1 , power=2, type="regressor")
## 
##  RESET test
## 
## data:  reg_ex1
## RESET = 3.9845, df1 = 2, df2 = 423, p-value = 0.0193

With a p-value of about 2% we would reject the null hypothesis of correct specification at the 5% significance level, though not at the 1% level.

Structural Breaks
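
A common way to test for a structural break in R is the strucchange package; a minimal sketch of a Chow test, assuming the package is installed and using a purely illustrative break point:

library(strucchange)  # assumption: strucchange is installed
# Chow test for a break at observation 5000 (illustrative choice)
sctest(logdUSUK[-1]~logdUSUK[-nrow(ex_data)], data = ex_data, type = "Chow", point = 5000)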