Regression Inference in R

Here we will discuss how to perform standard inference in regression models. When estimating regression models you will usually want to undertake some diagnostic testing. The functions we will use are all contained in the "AER" package (see [http://cran.r-project.org/web/packages/sandwich/index.html the relevant CRAN webpage]).
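
The "AER" package does not come with the base R installation, so it may have to be installed once and then loaded at the start of each session. A minimal sketch:

   install.packages("AER")   # only needed once per R installation
   library(AER)              # load the package for the current session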


Setup

We continue the example started in R_Regression#A first example, which is replicated here:

    # This is my first R regression!
    setwd("T:/ECLR/R/FirstSteps")              # This sets the working directory
    mydata <- read.csv("mroz.csv")  # Opens mroz.csv from working directory
     
    # Convert variables that use "." for missing values to numeric (these entries become NA)
    mydata$wage <- as.numeric(as.character(mydata$wage))
    mydata$lwage <- as.numeric(as.character(mydata$lwage))

Before we run our initial regression model we shall restrict the dataframe mydata to those observations that have no missing wage information, using the following subset command:

   mydata <- subset(mydata, !is.na(wage))  # keep observations with non-missing wage
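
A quick way to confirm that the subsetting has worked is to count the remaining observations and any remaining missing values:

   nrow(mydata)              # number of observations left after subsetting
   sum(is.na(mydata$wage))   # should now be 0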

Now we can run our initial regression:

    # Run a regression
    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
    reg_ex1_sm <- summary(reg_ex1)

We will use this model to illustrate standard inference procedures.

t-tests

We use t-tests to test simple restrictions on individual regression coefficients. Let's initially have a look at our regression output

   print(reg_ex1_sm)

which delivers the following regression output:

   Call:
   lm(formula = lwage ~ exper + log(huswage), data = mydata)
   Residuals:
        Min       1Q   Median       3Q      Max 
   -3.10089 -0.31219  0.02919  0.37466  2.11402 
    
   Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
   (Intercept)  0.534866   0.139082   3.846 0.000139 ***
   exper        0.016684   0.004243   3.933 9.81e-05 ***
   log(huswage) 0.236466   0.063684   3.713 0.000232 ***
   ---
   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
    
   Residual standard error: 0.7031 on 425 degrees of freedom
   Multiple R-squared: 0.05919,	Adjusted R-squared: 0.05477 
   F-statistic: 13.37 on 2 and 425 DF,  p-value: 2.338e-06 

As you can see, this output contains t-statistics and their associated p-values. These test statistics and their p-values are all associated with the following hypothesis test: [math]H_0: \beta_{i} = 0; H_A: \beta_{i} \neq 0[/math]. Here [math]\beta_{i}[/math] represents the ith unknown population parameter. If you want to test any other hypothesis (rather than the two-sided, equal-to-0 hypothesis) you will need to access the regression output in order to calculate

   [math]t = \frac{\widehat{\beta}_{i} - \beta_{i}^{0}}{se_{\widehat{\beta}_{i}}}[/math]

where [math]\beta_{i}^{0}[/math] denotes the value of [math]\beta_{i}[/math] specified under the null hypothesis.

As was discussed in R_Regression#Accessing Regression Output, the easiest way to get to these is to recognise that the coefficients element of reg_ex1_sm contains the parameter estimates in its first column and the standard errors in its second column. So let's say that we wanted to test the following hypothesis:

   [math]H_0: \beta_{exper} = 0.1; H_A: \beta_{exper} \gt 0.1[/math]

then we can calculate the relevant test statistic according to:

   t_test <- (reg_ex1_sm$coefficients[2,1] - 0.1)/reg_ex1_sm$coefficients[2,2]

where we recognise that the experience coefficient is saved in the second row of the coefficients matrix.
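
To complete the test you need to compare this statistic to the t distribution with the appropriate degrees of freedom. A minimal sketch of how to obtain the one-sided p-value and a 5% critical value, using base R's pt and qt functions:

   df <- reg_ex1$df.residual     # residual degrees of freedom (here 425)
   p_val <- 1 - pt(t_test, df)   # p-value for the one-sided alternative beta > 0.1
   crit <- qt(0.95, df)          # 5% critical value; reject H_0 if t_test > crit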

F-tests

F-tests are used to test multiple coefficient restrictions on regression coefficients.

Let's say we are interested in whether two additional variables, age and educ, should be included in the model. As a good econometrics student (or even master) you know that to calculate an F-test you need the residual sums of squares from a restricted model (here model reg_ex1) and an unrestricted model. The latter we estimate here:

   reg_ex2 <- lm(lwage~exper+log(huswage)+age+educ,data=mydata)
   reg_ex2_sm <- summary(reg_ex2)

Calculating the F-test is now very easy. We use the function anova:

   print(anova(reg_ex1,reg_ex2))

which delivers the following output:

   Analysis of Variance Table
   Model 1: lwage ~ exper + log(huswage)
   Model 2: lwage ~ exper + log(huswage) + age + educ
     Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
   1    425 210.11                                  
   2    423 188.10  2    22.004 24.741 6.895e-11 ***
   ---
   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

The table at the heart of this output delivers the individual residual sums of squares, the F-test statistic and its p-value. The p-value is extremely small, which leads us to reject the null hypothesis and to conclude that at least one of age and educ is significant. If you look at the regression output of reg_ex2 you will see that it is the education variable.
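
If you want to verify that anova really computes the textbook F statistic from the two residual sums of squares, here is a minimal sketch of the manual calculation (with q = 2 restrictions):

   rss_r <- sum(residuals(reg_ex1)^2)           # restricted RSS (210.11 above)
   rss_u <- sum(residuals(reg_ex2)^2)           # unrestricted RSS (188.10 above)
   q <- 2                                       # number of restrictions
   df_u <- reg_ex2$df.residual                  # unrestricted residual df (423)
   F_stat <- ((rss_r - rss_u)/q)/(rss_u/df_u)   # should match 24.741 up to rounding
   p_val <- 1 - pf(F_stat, q, df_u)             # p-value, approx 6.895e-11

Since loading AER also loads the car package, the same test can alternatively be run without estimating the restricted model yourself, using car's linearHypothesis function:

   linearHypothesis(reg_ex2, c("age = 0", "educ = 0"))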