R reg diag

From ECLR
Jump to: navigation, search

Regression Diagnostics for ECLR

Ralf Becker

Introduction

When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see the relevant CRAN webpage).

Heteroskedasticity

Heteroskedasticity in cross sections

One of the Gauss-Markov assumption is that the variance of the regression error terms is constant. If they are not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity robust standard errors to obtain valid inference on regression coefficients (see R_robust_se).

Tests for heteroskedasticity are usually based on an auxiliary regression of estimated squared regression residuals on a set of explanatory variables that are suspected to be related to the potentially changing error variance. We continue the example we started in the Regression Section](http://eclr.humanities.manchester.ac.uk/index.php/R_Regression#A_first_example) and which is replicated here, but note the first line which we include to gain access to the procedures in the AER toolbox:

library(AER)  # allow access to AER package


# This is my first R regression!
setwd("C:/Users/Ralph Becker/Dropbox/ECLR/R/RegressionDiagnostics")              # This sets the working directory
mydata <- read.csv("mroz.csv")  # Opens mroz.csv from working directory
     
# Now convert variables with "." to num with NA
mydata$wage <- as.numeric(as.character(mydata$wage))
mydata$lwage <- as.numeric(as.character(mydata$lwage))
## Warning: NAs introduced by coercion

Before we run our initial regression model we shall restrict the dataframe mydata to those data that do not have missing wage information, using the following subset command:

mydata <- subset(mydata, wage!="NA")  # select non NA data

Now we can run our initial regression:

# Run a regression
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
reg_ex1_sm <- summary(reg_ex1)

As we have learned in R_Regression these two objects now contain all the information we would want from a regression. A test for error heteroskedasticty is now based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for error variance) as a dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan two excellent Australian econometricians (T. S. Breusch and A. R. Pagan, (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation, Econometrica, Vol. 47, No. 5, pp. 1287-1294) What these variables should be depends on the problem. In general any variables that the researcher suspects correlate to the error variance.

The AER toolbox contains a procedure that makes performing this test a doodle.

# Test wheter residuals are homoscedasticity
print(bptest(reg_ex1))
## 
##  studentized Breusch-Pagan test
## 
## data:  reg_ex1
## BP = 9.5792, df = 2, p-value = 0.008316

The p-value comes from the asymptotic distribution (Chi-Square distribution with 2 degrees of freedom) of the test statistic assuming that the null hypothesis of homoskedasticity is valid. In this particular case this p-value is fairly small (smaller than 1%) which woul lead us to reject the null hypothesis of homoskedasticity at a 1% significance level.

This was pretty easy, but I want to demonstrate what is really happening behind this test so that you know how to change it according to your problem at hand. As discussed above, we need to decide on which explanatory variables to use in order to execute the test. The default choice of the bptest is to use the same explanatory variables as those used in the original regression model, here exper and log(huswage). You could get the same result by running this regression yourself and calculating the test statistic [math]n*R2[/math]

mydata$res1_sq <- (reg_ex1$residuals)^2 # add squared residuals to dataframe
hs_test <- lm(res1_sq~exper+log(huswage), data = mydata)
hs_test_sm <- summary(hs_test)
teststat <- hs_test_sm$r.squared*nrow(mydata)
print(teststat)  # test statistic n*R^2
## [1] 9.57924
print(1-pchisq(teststat,2))   # print p value
## [1] 0.008315617

which returns exactly the same test statistic and p-value. Note that the function pchisq(teststat,2) returns the size of the left hand tail of the Chi-Square distribution with 2 degrees of freedom. In this procedure you could easily change the variables that should be included in the auxiliary regression by changing the explanatory variables in lm(res1_sq~exper+log(huswage), data = mydata). But note that you will have to adjust the degrees of freedom in pchisq(teststat,2) to reflect the number of variables included.

Heteroskedasticity in time series data

Let's load another dataset, a time-series dataset. These are exchange rates. The exchange rate is in USUK, the log change is in logdUSUK.

ex_data <- read.csv("USDUKP.csv",skip = 12)
n <- nrow(ex_data)
summary(ex_data) # this produces summary stats
##          DATE            USUK          logdUSUK        
##  01/02/1971:    1   Min.   :1.052   Min.   :-2.156815  
##  01/02/1972:    1   1st Qu.:1.556   1st Qu.:-0.125833  
##  01/02/1973:    1   Median :1.666   Median : 0.002539  
##  01/02/1974:    1   Mean   :1.768   Mean   :-0.001520  
##  01/02/1977:    1   3rd Qu.:1.915   3rd Qu.: 0.127335  
##  01/02/1978:    1   Max.   :2.644   Max.   : 1.992773  
##  (Other)   :10813                   NA's   :1

As usual with time-series data we shall let R know what we are dealing with. We chose frequency =1 for daily data, and tell R that we start at observation 1 (start=1) and finish at observation nrow(ex_data) which automatically checks how long our dataset is.

ex_data$USUK <- ts(ex_data$USUK,start = 1, end=nrow(ex_data),frequency = 1)
ex_data$logdUSUK <- ts(ex_data$logdUSUK,start = 1, end=nrow(ex_data),frequency = 1)

Let's estimate an AR(1) model for logdUSUK (see the time series section for details:

reg2 <- arima(ex_data$logdUSUK, order = c(1,0,0))

Let's say we want to perform a test on ARCH type heteroskedasticity (say lag =12). The easiest way to do that is to use the FinTS package which has a function written for exactly that purpose. So install that package and then load it.

#install.packages("FinTS")  # only do this the first time you download the package
library(FinTS)
## Warning: package 'FinTS' was built under R version 3.1.3

And this is how you call the ArchTest function. Under the hood this will run an auxiliary regression:

[math]\hat{\epsilon}^2_t=\alpha_0 + \alpha_1 \hat{\epsilon}^2_{t-1}+ \alpha_2 \hat{\epsilon}^2_{t-2} +...+ \alpha_{12} \hat{\epsilon}^2_{t-12}+u_t[/math]

ArchTest(reg2$residuals, lags = 12)  
## 
##  ARCH LM-test; Null hypothesis: no ARCH effects
## 
## data:  reg2$residuals
## Chi-squared = 1004.562, df = 12, p-value < 2.2e-16

The p-value of this test is extremely small such that we can easily reject the null hypothesis of no heteroskedasticity.

Autocorrelation

When we are dealing with time series data we will usually want to know whether the error terms are uncorreated as this would be an assumption required for OLS estimators to be efficient. The function we use to test for autocorrelation is the bgtest function from the lmtest package which implements a Breusch-Godfrey test. The null hypothesis is the absence of autocorrelation, whereas the alternative hypothesis implies the presence of autocorrelation.

We show how to apply the test to the estimated residuals from the reg2 model, which was an AR(1) model for the log differences in the US-UK exchange rate. First we re-estimate the problem. The reason for this is that the bgtest function does not accept the arfima function output as an input, but it does accept the output of the lm function.

reg3 <- lm(logdUSUK[-1]~logdUSUK[-nrow(ex_data)],data = ex_data) # this is an AR(1) model for logdUSUK
bgtest(reg3,order=4)
## 
##  Breusch-Godfrey test for serial correlation of order up to 4
## 
## data:  reg3
## LM test = 2.2193, df = 4, p-value = 0.6955

This is the Breusch-Godfrey (LM) test of order 4. Note that logdUSUK[-1] removes the first observation and logdUSUK[-nrow(ex_data)] removes the last. This crates the [math]y_t[/math] and [math]y_{t-1}[/math] respectively.

Residual Normality

General Misspecification - RESET test

This is a test which does not have any specific alternative. A number of reasons could trigger a rejection of the null hypothesis of correct model specification.

Let's go back to the model we estimated in reg_ex1.

resettest(reg_ex1 , power=2, type="regressor")
## 
##  RESET test
## 
## data:  reg_ex1
## RESET = 3.9845, df1 = 2, df2 = 423, p-value = 0.0193

Structural Breaks