R reg diag

From ECLR
Revision as of 01:30, 2 March 2016 by Rb (talk | contribs)
Jump to: navigation, search
  1. Regression Diagnostics for ECLR

Ralf Becker

  1. Introduction

When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see the relevant [CRAN webpage](https://cran.r-project.org/web/packages/sandwich/index.html)).

  1. Heteroskedasticity
    1. Heteroskedasticity in cross sections

One of the Gauss-Markov assumption is that the variance of the regression error terms is constant. If they are not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity robust standard errors to obtain valid inference on regression coefficients (see R_robust_se).

Tests for heteroskedasticity are usually based on an auxiliary regression of estimated squared regression residuals on a set of explanatory variables that are suspected to be related to the potentially changing error variance. We continue the example we started in the Regression Section](http://eclr.humanities.manchester.ac.uk/index.php/R_Regression#A_first_example) and which is replicated here, but note the first line which we include to gain access to the procedures in the AER toolbox:


```r library(AER) # allow access to AER package ```

```

    1. Warning: package 'AER' was built under R version 3.1.3

```

```

    1. Loading required package: car

```

```

    1. Warning: package 'car' was built under R version 3.1.3

```

```

    1. Loading required package: lmtest

```

```

    1. Warning: package 'lmtest' was built under R version 3.1.3

```

```

    1. Loading required package: zoo
    2. Attaching package: 'zoo'
    3. The following objects are masked from 'package:base':
    4. as.Date, as.Date.numeric
    5. Loading required package: sandwich

```

```

    1. Warning: package 'sandwich' was built under R version 3.1.3

```

```

    1. Loading required package: survival
    2. Loading required package: splines

```

```r

  1. This is my first R regression!

setwd("C:/Users/Ralph Becker/Dropbox/ECLR/R/RegressionDiagnostics") # This sets the working directory mydata <- read.csv("mroz.csv") # Opens mroz.csv from working directory

  1. Now convert variables with "." to num with NA

mydata$wage <- as.numeric(as.character(mydata$wage)) ```

```

    1. Warning: NAs introduced by coercion

```

```r mydata$lwage <- as.numeric(as.character(mydata$lwage)) ```

```

    1. Warning: NAs introduced by coercion

```

Before we run our initial regression model we shall restrict the dataframe mydata to those data that do not have missing wage information, using the following subset command:


```r mydata <- subset(mydata, wage!="NA") # select non NA data ```

Now we can run our initial regression:


```r

  1. Run a regression

reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata) reg_ex1_sm <- summary(reg_ex1) ```


As we have learned in [R_Regression](http://eclr.humanities.manchester.ac.uk/index.php/R_Regression) these two objects now contain all the information we would want from a regression. A test for error heteroskedasticty is now based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for error variance) as a dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan two excellent Australian econometricians (T. S. Breusch and A. R. Pagan, (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation, Econometrica, Vol. 47, No. 5, pp. 1287-1294) What these variables should be depends on the problem. In general any variables that the researcher suspects correlate to the error variance.

The AER toolbox contains a procedure that makes performing this test a doodle.


```r

  1. Test wheter residuals are homoscedasticity

print(bptest(reg_ex1)) ```

```

    1. studentized Breusch-Pagan test
    2. data: reg_ex1
    3. BP = 9.5792, df = 2, p-value = 0.008316

```


The p-value comes from the asymptotic distribution (Chi-Square distribution with 2 degrees of freedom) of the test statistic assuming that the null hypothesis of homoskedasticity is valid. In this particular case this p-value is fairly small (smaller than 1%) which woul lead us to reject the null hypothesis of homoskedasticity at a 1% significance level.

This was pretty easy, but I want to demonstrate what is really happening behind this test so that you know how to change it according to your problem at hand. As discussed above, we need to decide on which explanatory variables to use in order to execute the test. The default choice of the bptest is to use the same explanatory variables as those used in the original regression model, here exper and `log(huswage)`. You could get the same result by running this regression yourself and calculating the test statistic $n???R2$


```r mydata$res1_sq <- (reg_ex1$residuals)^2 # add squared residuals to dataframe hs_test <- lm(res1_sq~exper+log(huswage), data = mydata) hs_test_sm <- summary(hs_test) teststat <- hs_test_sm$r.squared*nrow(mydata) print(teststat) # test statistic n*R^2 ```

```

    1. [1] 9.57924

```

```r print(1-pchisq(teststat,2)) # print p value ```

```

    1. [1] 0.008315617

```

which returns exactly the same test statistic and p-value. Note that the function `pchisq(teststat,2)` returns the size of the left hand tail of the Chi-Square distribution with 2 degrees of freedom. In this procedure you could easily change the variables that should be included in the auxiliary regression by changing the explanatory variables in `lm(res1_sq~exper+log(huswage), data = mydata)`. But note that you will have to adjust the degrees of freedom in `pchisq(teststat,2)` to reflect the number of variables included.

    1. Heteroskedasticity in time series data

Let's load another dataset, a time-series dataset. These are exchange rates. The exchange rate is in `USUK`, the log change is in `logdUSUK`.


```r ex_data <- read.csv("USDUKP.csv",skip = 12) n <- nrow(ex_data) summary(ex_data) # this produces summary stats ```

```

    1. DATE USUK logdUSUK
    2. 01/02/1971: 1 Min.  :1.052 Min.  :-2.156815
    3. 01/02/1972: 1 1st Qu.:1.556 1st Qu.:-0.125833
    4. 01/02/1973: 1 Median :1.666 Median : 0.002539
    5. 01/02/1974: 1 Mean  :1.768 Mean  :-0.001520
    6. 01/02/1977: 1 3rd Qu.:1.915 3rd Qu.: 0.127335
    7. 01/02/1978: 1 Max.  :2.644 Max.  : 1.992773
    8. (Other)  :10813 NA's  :1

```

As usual with time-series data we shall let R know what we are dealing with. We chose `frequency =1` for daily data, and tell R that we start at observation 1 (`start=1`) and finish at observation `nrow(ex_data)` which automatically checks how long our dataset is.


```r ex_data$USUK <- ts(ex_data$USUK,start = 1, end=nrow(ex_data),frequency = 1) ex_data$logdUSUK <- ts(ex_data$logdUSUK,start = 1, end=nrow(ex_data),frequency = 1) ```

Let's estimate an AR(1) model for `logdUSUK` (see the [time series section](http://eclr.humanities.manchester.ac.uk/index.php/R_TimeSeries) for details:


```r reg2 <- arima(ex_data$logdUSUK, order = c(1,0,0)) ```

Let's say we want to perform a test on ARCH type heteroskedasticity (say lag =12). The easiest way to do that is to use the `FinTS` package which has a function written for exactly that purpose. So install that package and then load it.


```r

  1. install.packages("FinTS") # only do this the first time you download the package

library(FinTS) ```

```

    1. Warning: package 'FinTS' was built under R version 3.1.3

```

And this is how you call the `ArchTest` function. Under the hood this will run an auxiliary regression:

$\hat{\epsilon}^2_t=\alpha_0 + \alpha_1 \hat{\epsilon}^2_{t-1}+ \alpha_2 \hat{\epsilon}^2_{t-2} +...+ \alpha_{12} \hat{\epsilon}^2_{t-12}+u_t$


```r ArchTest(reg2$residuals, lags = 12) ```

```

    1. ARCH LM-test; Null hypothesis: no ARCH effects
    2. data: reg2$residuals
    3. Chi-squared = 1004.562, df = 12, p-value < 2.2e-16

```

The p-value of this test is extremely small such that we can easily reject the null hypothesis of no heteroskedasticity.

Autocorrelation

When we are dealing with time series data we will usually want to know whether the error terms are uncorreated as this would be an assumption required for OLS estimators to be efficient. The function we use to test for autocorrelation is the `bgtest` function from the `lmtest` package which implements a Breusch-Godfrey test. The null hypothesis is the absence of autocorrelation, whereas the alternative hypothesis implies the presence of autocorrelation.

We show how to apply the test to the estimated residuals from the `reg2` model, which was an AR(1) model for the log differences in the US-UK exchange rate. First we re-estimate the problem. The reason for this is that the `bgtest` function does not accept the `arfima` function output as an input, but it does accept the output of the `lm` function.


```r reg3 <- lm(logdUSUK[-1]~logdUSUK[-nrow(ex_data)],data = ex_data) # this is an AR(1) model for logdUSUK bgtest(reg3,order=4) ```

```

    1. Breusch-Godfrey test for serial correlation of order up to 4
    2. data: reg3
    3. LM test = 2.2193, df = 4, p-value = 0.6955

```

This is the Breusch-Godfrey (LM) test of order 4. Note that `logdUSUK[-1]` removes the first observation and `logdUSUK[-nrow(ex_data)]` removes the last. This crates the $y_t$ and $y_{t-1}$ respectively.

Residual Normality

General Misspecification - RESET test

This is a test which does not have any specific alternative. A number of reasons could trigger a rejection of the null hypothesis of correct model specification.

Let's go back to the model we estimated in `reg_ex1`.


```r resettest(reg_ex1 , power=2, type="regressor") ```

```

    1. RESET test
    2. data: reg_ex1
    3. RESET = 3.9845, df1 = 2, df2 = 423, p-value = 0.0193

```

Structural Breaks