Difference between revisions of "R reg diag"
(→Heteroskedasticity) |
(→Heteroskedasticity in cross sections) |
||
(6 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
− | When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the | + | = Regression Diagnostics for ECLR = |
+ | |||
+ | Ralf Becker | ||
+ | |||
+ | = Introduction = | ||
+ | |||
+ | When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see the relevant [https://cran.r-project.org/web/packages/sandwich/index.html CRAN webpage]). | ||
= Heteroskedasticity = | = Heteroskedasticity = | ||
− | One of the Gauss-Markov assumption is that the variance of the regression error terms is constant. If they are not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity robust standard errors to obtain valid inference on regression coefficients (see | + | == Heteroskedasticity in cross sections == |
+ | |||
+ | One of the Gauss-Markov assumption is that the variance of the regression error terms is constant. If they are not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity robust standard errors to obtain valid inference on regression coefficients (see R_robust_se). | ||
+ | |||
+ | Tests for heteroskedasticity are usually based on an auxiliary regression of estimated squared regression residuals on a set of explanatory variables that are suspected to be related to the potentially changing error variance. We continue the example we started in the Regression Section](http://eclr.humanities.manchester.ac.uk/index.php/R_Regression#A_first_example) and which is replicated here, but note the first line which we include to gain access to the procedures in the AER toolbox: | ||
+ | |||
+ | <pre class="r">library(AER) # allow access to AER package</pre> | ||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | <pre class="r"># This is my first R regression! | |
+ | setwd("C:/Users/Ralph Becker/Dropbox/ECLR/R/RegressionDiagnostics") # This sets the working directory | ||
+ | mydata <- read.csv("mroz.csv") # Opens mroz.csv from working directory | ||
+ | |||
+ | # Now convert variables with "." to num with NA | ||
+ | mydata$wage <- as.numeric(as.character(mydata$wage))</pre> | ||
− | + | <pre class="r">mydata$lwage <- as.numeric(as.character(mydata$lwage))</pre> | |
+ | <pre>## Warning: NAs introduced by coercion</pre> | ||
+ | Before we run our initial regression model we shall restrict the dataframe mydata to those data that do not have missing wage information, using the following subset command: | ||
+ | <pre class="r">mydata <- subset(mydata, wage!="NA") # select non NA data</pre> | ||
Now we can run our initial regression: | Now we can run our initial regression: | ||
− | |||
− | |||
− | |||
− | |||
− | + | <pre class="r"># Run a regression | |
− | A test for error heteroskedasticty is now based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for error variance) as a dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan two excellent Australian econometricians | + | reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata) |
+ | reg_ex1_sm <- summary(reg_ex1)</pre> | ||
+ | As we have learned in [http://eclr.humanities.manchester.ac.uk/index.php/R_Regression R_Regression] these two objects now contain all the information we would want from a regression. A test for error heteroskedasticty is now based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for error variance) as a dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan two excellent Australian econometricians (T. S. Breusch and A. R. Pagan, (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation, Econometrica, Vol. 47, No. 5, pp. 1287-1294) What these variables should be depends on the problem. In general any variables that the researcher suspects correlate to the error variance. | ||
+ | |||
+ | The AER toolbox contains a procedure that makes performing this test a doodle. | ||
− | + | <pre class="r"># Test wheter residuals are homoscedasticity | |
+ | print(bptest(reg_ex1))</pre> | ||
+ | <pre>## | ||
+ | ## studentized Breusch-Pagan test | ||
+ | ## | ||
+ | ## data: reg_ex1 | ||
+ | ## BP = 9.5792, df = 2, p-value = 0.008316</pre> | ||
+ | The p-value comes from the asymptotic distribution (Chi-Square distribution with 2 degrees of freedom) of the test statistic assuming that the null hypothesis of homoskedasticity is valid. In this particular case this p-value is fairly small (smaller than 1%) which woul lead us to reject the null hypothesis of homoskedasticity at a 1% significance level. | ||
− | + | This was pretty easy, but I want to demonstrate what is really happening behind this test so that you know how to change it according to your problem at hand. As discussed above, we need to decide on which explanatory variables to use in order to execute the test. The default choice of the bptest is to use the same explanatory variables as those used in the original regression model, here exper and <code>log(huswage)</code>. You could get the same result by running this regression yourself and calculating the test statistic <math>n*R2</math> | |
− | |||
− | will | + | <pre class="r">mydata$res1_sq <- (reg_ex1$residuals)^2 # add squared residuals to dataframe |
+ | hs_test <- lm(res1_sq~exper+log(huswage), data = mydata) | ||
+ | hs_test_sm <- summary(hs_test) | ||
+ | teststat <- hs_test_sm$r.squared*nrow(mydata) | ||
+ | print(teststat) # test statistic n*R^2</pre> | ||
+ | <pre>## [1] 9.57924</pre> | ||
+ | <pre class="r">print(1-pchisq(teststat,2)) # print p value</pre> | ||
+ | <pre>## [1] 0.008315617</pre> | ||
+ | which returns exactly the same test statistic and p-value. Note that the function <code>pchisq(teststat,2)</code> returns the size of the left hand tail of the Chi-Square distribution with 2 degrees of freedom. In this procedure you could easily change the variables that should be included in the auxiliary regression by changing the explanatory variables in <code>lm(res1_sq~exper+log(huswage), data = mydata)</code>. But note that you will have to adjust the degrees of freedom in <code>pchisq(teststat,2)</code> to reflect the number of variables included. | ||
− | + | == Heteroskedasticity in time series data == | |
− | |||
− | |||
− | The | + | Let's load another dataset, a time-series dataset. These are exchange rates. The exchange rate is in <code>USUK</code>, the log change is in <code>logdUSUK</code>. |
+ | |||
+ | <pre class="r">ex_data <- read.csv("USDUKP.csv",skip = 12) | ||
+ | n <- nrow(ex_data) | ||
+ | summary(ex_data) # this produces summary stats</pre> | ||
+ | <pre>## DATE USUK logdUSUK | ||
+ | ## 01/02/1971: 1 Min. :1.052 Min. :-2.156815 | ||
+ | ## 01/02/1972: 1 1st Qu.:1.556 1st Qu.:-0.125833 | ||
+ | ## 01/02/1973: 1 Median :1.666 Median : 0.002539 | ||
+ | ## 01/02/1974: 1 Mean :1.768 Mean :-0.001520 | ||
+ | ## 01/02/1977: 1 3rd Qu.:1.915 3rd Qu.: 0.127335 | ||
+ | ## 01/02/1978: 1 Max. :2.644 Max. : 1.992773 | ||
+ | ## (Other) :10813 NA's :1</pre> | ||
+ | As usual with time-series data we shall let R know what we are dealing with. We chose <code>frequency =1</code> for daily data, and tell R that we start at observation 1 (<code>start=1</code>) and finish at observation <code>nrow(ex_data)</code> which automatically checks how long our dataset is. | ||
+ | |||
+ | <pre class="r">ex_data$USUK <- ts(ex_data$USUK,start = 1, end=nrow(ex_data),frequency = 1) | ||
+ | ex_data$logdUSUK <- ts(ex_data$logdUSUK,start = 1, end=nrow(ex_data),frequency = 1)</pre> | ||
+ | Let's estimate an AR(1) model for <code>logdUSUK</code> (see the [http://eclr.humanities.manchester.ac.uk/index.php/R_TimeSeries time series section] for details: | ||
+ | |||
+ | <pre class="r">reg2 <- arima(ex_data$logdUSUK, order = c(1,0,0))</pre> | ||
+ | Let's say we want to perform a test on ARCH type heteroskedasticity (say lag =12). The easiest way to do that is to use the <code>FinTS</code> package which has a function written for exactly that purpose. So install that package and then load it. | ||
− | + | <pre class="r">#install.packages("FinTS") # only do this the first time you download the package | |
+ | library(FinTS)</pre> | ||
+ | <pre>## Warning: package 'FinTS' was built under R version 3.1.3</pre> | ||
+ | And this is how you call the <code>ArchTest</code> function. Under the hood this will run an auxiliary regression: | ||
− | + | <math>\hat{\epsilon}^2_t=\alpha_0 + \alpha_1 \hat{\epsilon}^2_{t-1}+ \alpha_2 \hat{\epsilon}^2_{t-2} +...+ \alpha_{12} \hat{\epsilon}^2_{t-12}+u_t</math> | |
− | |||
− | |||
− | |||
− | |||
− | |||
− | + | <pre class="r">ArchTest(reg2$residuals, lags = 12) </pre> | |
+ | <pre>## | ||
+ | ## ARCH LM-test; Null hypothesis: no ARCH effects | ||
+ | ## | ||
+ | ## data: reg2$residuals | ||
+ | ## Chi-squared = 1004.562, df = 12, p-value < 2.2e-16</pre> | ||
+ | The p-value of this test is extremely small such that we can easily reject the null hypothesis of no heteroskedasticity. | ||
= Autocorrelation = | = Autocorrelation = | ||
+ | When we are dealing with time series data we will usually want to know whether the error terms are uncorreated as this would be an assumption required for OLS estimators to be efficient. The function we use to test for autocorrelation is the <code>bgtest</code> function from the <code>lmtest</code> package which implements a Breusch-Godfrey test. The null hypothesis is the absence of autocorrelation, whereas the alternative hypothesis implies the presence of autocorrelation. | ||
+ | |||
+ | We show how to apply the test to the estimated residuals from the <code>reg2</code> model, which was an AR(1) model for the log differences in the US-UK exchange rate. First we re-estimate the problem. The reason for this is that the <code>bgtest</code> function does not accept the <code>arfima</code> function output as an input, but it does accept the output of the <code>lm</code> function. | ||
+ | |||
+ | <pre class="r">reg3 <- lm(logdUSUK[-1]~logdUSUK[-nrow(ex_data)],data = ex_data) # this is an AR(1) model for logdUSUK | ||
+ | bgtest(reg3,order=4)</pre> | ||
+ | <pre>## | ||
+ | ## Breusch-Godfrey test for serial correlation of order up to 4 | ||
+ | ## | ||
+ | ## data: reg3 | ||
+ | ## LM test = 2.2193, df = 4, p-value = 0.6955</pre> | ||
+ | This is the Breusch-Godfrey (LM) test of order 4. Note that <code>logdUSUK[-1]</code> removes the first observation and <code>logdUSUK[-nrow(ex_data)]</code> removes the last. This crates the <math>y_t</math> and <math>y_{t-1}</math> respectively. | ||
= Residual Normality = | = Residual Normality = | ||
+ | = General Misspecification - RESET test = | ||
+ | |||
+ | This is a test which does not have any specific alternative. A number of reasons could trigger a rejection of the null hypothesis of correct model specification. | ||
− | + | Let's go back to the model we estimated in <code>reg_ex1</code>. | |
− | = | + | <pre class="r">resettest(reg_ex1 , power=2, type="regressor")</pre> |
− | < | + | <pre>## |
+ | ## RESET test | ||
+ | ## | ||
+ | ## data: reg_ex1 | ||
+ | ## RESET = 3.9845, df1 = 2, df2 = 423, p-value = 0.0193</pre> | ||
+ | = Structural Breaks = |
Latest revision as of 13:27, 22 July 2016
Contents
Regression Diagnostics for ECLR
Ralf Becker
Introduction
When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see the relevant CRAN webpage).
Heteroskedasticity
Heteroskedasticity in cross sections
One of the Gauss-Markov assumption is that the variance of the regression error terms is constant. If they are not, then the OLS parameter estimators will not be efficient and one needs to use heteroskedasticity robust standard errors to obtain valid inference on regression coefficients (see R_robust_se).
Tests for heteroskedasticity are usually based on an auxiliary regression of estimated squared regression residuals on a set of explanatory variables that are suspected to be related to the potentially changing error variance. We continue the example we started in the Regression Section](http://eclr.humanities.manchester.ac.uk/index.php/R_Regression#A_first_example) and which is replicated here, but note the first line which we include to gain access to the procedures in the AER toolbox:
library(AER) # allow access to AER package
# This is my first R regression! setwd("C:/Users/Ralph Becker/Dropbox/ECLR/R/RegressionDiagnostics") # This sets the working directory mydata <- read.csv("mroz.csv") # Opens mroz.csv from working directory # Now convert variables with "." to num with NA mydata$wage <- as.numeric(as.character(mydata$wage))
mydata$lwage <- as.numeric(as.character(mydata$lwage))
## Warning: NAs introduced by coercion
Before we run our initial regression model we shall restrict the dataframe mydata to those data that do not have missing wage information, using the following subset command:
mydata <- subset(mydata, wage!="NA") # select non NA data
Now we can run our initial regression:
# Run a regression reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata) reg_ex1_sm <- summary(reg_ex1)
As we have learned in R_Regression these two objects now contain all the information we would want from a regression. A test for error heteroskedasticty is now based on an auxiliary regression that uses the estimated squared regression residuals (as proxies for error variance) as a dependent variable and a range of variables as explanatory variables. This testing principle goes back to Trevor Breusch and Adrian Pagan two excellent Australian econometricians (T. S. Breusch and A. R. Pagan, (1979), A Simple Test for Heteroscedasticity and Random Coefficient Variation, Econometrica, Vol. 47, No. 5, pp. 1287-1294) What these variables should be depends on the problem. In general any variables that the researcher suspects correlate to the error variance.
The AER toolbox contains a procedure that makes performing this test a doodle.
# Test wheter residuals are homoscedasticity print(bptest(reg_ex1))
## ## studentized Breusch-Pagan test ## ## data: reg_ex1 ## BP = 9.5792, df = 2, p-value = 0.008316
The p-value comes from the asymptotic distribution (Chi-Square distribution with 2 degrees of freedom) of the test statistic assuming that the null hypothesis of homoskedasticity is valid. In this particular case this p-value is fairly small (smaller than 1%) which woul lead us to reject the null hypothesis of homoskedasticity at a 1% significance level.
This was pretty easy, but I want to demonstrate what is really happening behind this test so that you know how to change it according to your problem at hand. As discussed above, we need to decide on which explanatory variables to use in order to execute the test. The default choice of the bptest is to use the same explanatory variables as those used in the original regression model, here exper and log(huswage)
. You could get the same result by running this regression yourself and calculating the test statistic [math]n*R2[/math]
mydata$res1_sq <- (reg_ex1$residuals)^2 # add squared residuals to dataframe hs_test <- lm(res1_sq~exper+log(huswage), data = mydata) hs_test_sm <- summary(hs_test) teststat <- hs_test_sm$r.squared*nrow(mydata) print(teststat) # test statistic n*R^2
## [1] 9.57924
print(1-pchisq(teststat,2)) # print p value
## [1] 0.008315617
which returns exactly the same test statistic and p-value. Note that the function pchisq(teststat,2)
returns the size of the left hand tail of the Chi-Square distribution with 2 degrees of freedom. In this procedure you could easily change the variables that should be included in the auxiliary regression by changing the explanatory variables in lm(res1_sq~exper+log(huswage), data = mydata)
. But note that you will have to adjust the degrees of freedom in pchisq(teststat,2)
to reflect the number of variables included.
Heteroskedasticity in time series data
Let's load another dataset, a time-series dataset. These are exchange rates. The exchange rate is in USUK
, the log change is in logdUSUK
.
ex_data <- read.csv("USDUKP.csv",skip = 12) n <- nrow(ex_data) summary(ex_data) # this produces summary stats
## DATE USUK logdUSUK ## 01/02/1971: 1 Min. :1.052 Min. :-2.156815 ## 01/02/1972: 1 1st Qu.:1.556 1st Qu.:-0.125833 ## 01/02/1973: 1 Median :1.666 Median : 0.002539 ## 01/02/1974: 1 Mean :1.768 Mean :-0.001520 ## 01/02/1977: 1 3rd Qu.:1.915 3rd Qu.: 0.127335 ## 01/02/1978: 1 Max. :2.644 Max. : 1.992773 ## (Other) :10813 NA's :1
As usual with time-series data we shall let R know what we are dealing with. We chose frequency =1
for daily data, and tell R that we start at observation 1 (start=1
) and finish at observation nrow(ex_data)
which automatically checks how long our dataset is.
ex_data$USUK <- ts(ex_data$USUK,start = 1, end=nrow(ex_data),frequency = 1) ex_data$logdUSUK <- ts(ex_data$logdUSUK,start = 1, end=nrow(ex_data),frequency = 1)
Let's estimate an AR(1) model for logdUSUK
(see the time series section for details:
reg2 <- arima(ex_data$logdUSUK, order = c(1,0,0))
Let's say we want to perform a test on ARCH type heteroskedasticity (say lag =12). The easiest way to do that is to use the FinTS
package which has a function written for exactly that purpose. So install that package and then load it.
#install.packages("FinTS") # only do this the first time you download the package library(FinTS)
## Warning: package 'FinTS' was built under R version 3.1.3
And this is how you call the ArchTest
function. Under the hood this will run an auxiliary regression:
[math]\hat{\epsilon}^2_t=\alpha_0 + \alpha_1 \hat{\epsilon}^2_{t-1}+ \alpha_2 \hat{\epsilon}^2_{t-2} +...+ \alpha_{12} \hat{\epsilon}^2_{t-12}+u_t[/math]
ArchTest(reg2$residuals, lags = 12)
## ## ARCH LM-test; Null hypothesis: no ARCH effects ## ## data: reg2$residuals ## Chi-squared = 1004.562, df = 12, p-value < 2.2e-16
The p-value of this test is extremely small such that we can easily reject the null hypothesis of no heteroskedasticity.
Autocorrelation
When we are dealing with time series data we will usually want to know whether the error terms are uncorreated as this would be an assumption required for OLS estimators to be efficient. The function we use to test for autocorrelation is the bgtest
function from the lmtest
package which implements a Breusch-Godfrey test. The null hypothesis is the absence of autocorrelation, whereas the alternative hypothesis implies the presence of autocorrelation.
We show how to apply the test to the estimated residuals from the reg2
model, which was an AR(1) model for the log differences in the US-UK exchange rate. First we re-estimate the problem. The reason for this is that the bgtest
function does not accept the arfima
function output as an input, but it does accept the output of the lm
function.
reg3 <- lm(logdUSUK[-1]~logdUSUK[-nrow(ex_data)],data = ex_data) # this is an AR(1) model for logdUSUK bgtest(reg3,order=4)
## ## Breusch-Godfrey test for serial correlation of order up to 4 ## ## data: reg3 ## LM test = 2.2193, df = 4, p-value = 0.6955
This is the Breusch-Godfrey (LM) test of order 4. Note that logdUSUK[-1]
removes the first observation and logdUSUK[-nrow(ex_data)]
removes the last. This crates the [math]y_t[/math] and [math]y_{t-1}[/math] respectively.
Residual Normality
General Misspecification - RESET test
This is a test which does not have any specific alternative. A number of reasons could trigger a rejection of the null hypothesis of correct model specification.
Let's go back to the model we estimated in reg_ex1
.
resettest(reg_ex1 , power=2, type="regressor")
## ## RESET test ## ## data: reg_ex1 ## RESET = 3.9845, df1 = 2, df2 = 423, p-value = 0.0193