Regression Inference in R
Here we will discuss how to perform standard inference in regression models.
Setup
We continue the example we started in R_Regression#A first example and which is replicated here:
# This is my first R regression!
setwd("T:/ECLR/R/FirstSteps")                    # This sets the working directory
mydata <- read.csv("mroz.csv", na.strings = ".") # Opens mroz.csv from working directory
Before we run our initial regression model we shall restrict the dataframe mydata to those observations that do not have missing wage information, using the following subset command:
mydata <- subset(mydata, !is.na(wage))  # keep only observations with non-missing wage
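As a quick sanity check (a minimal sketch, not part of the original script) you can confirm that no missing wages remain and how many observations are left:

sum(is.na(mydata$wage))  # should now be 0
nrow(mydata)             # 428 observations (425 residual df + 3 estimated coefficients, see below)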
Now we can run our initial regression:
# Run a regression
reg_ex1 <- lm(lwage ~ exper + log(huswage), data = mydata)
reg_ex1_sm <- summary(reg_ex1)
We will introduce inference in this model.
t-tests
We use t-tests to test simple coefficient restrictions on regression coefficients. Let's initially have a look at our regression output
print(reg_ex1_sm)
which delivers the following regression output
Call:
lm(formula = lwage ~ exper + log(huswage), data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10089 -0.31219  0.02919  0.37466  2.11402 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.534866   0.139082   3.846 0.000139 ***
exper        0.016684   0.004243   3.933 9.81e-05 ***
log(huswage) 0.236466   0.063684   3.713 0.000232 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7031 on 425 degrees of freedom
Multiple R-squared:  0.05919,   Adjusted R-squared:  0.05477 
F-statistic: 13.37 on 2 and 425 DF,  p-value: 2.338e-06
As you can see, this output contains t-statistics and their associated p-values. These test statistics and their p-values all relate to the following hypothesis test: [math]H_0: \beta_{i} = 0; H_A: \beta_{i} \neq 0[/math]. Here [math]\beta_{i}[/math] represents the ith unknown population parameter. If you want to test any other hypothesis (rather than the two-sided, equal-to-0 hypothesis) you will need to access the regression output in order to calculate
[math]t_{stat} = \frac{\widehat{\beta}_{i} - \beta_{i}}{se_{\widehat{\beta}_{i}}}[/math]
As was discussed in R_Regression#Accessing Regression Output, the easiest way to get to these is to recognise that the coefficients element of reg_ex1_sm contains the parameter estimates in the first column and the standard errors in the second column. So let's say that we wanted to test the following hypothesis:
[math]H_0: \beta_{exper} = 0.01; H_A: \beta_{exper} \gt 0.01[/math]
Noting that the results for the exper variable are in the 2nd row, we can calculate the relevant test statistic according to:
t_test = (reg_ex1_sm$coefficients[2,1] - 0.01)/reg_ex1_sm$coefficients[2,2]
where we recognise that the experience coefficient is saved in the 2nd row of coefficients. As it turns out, the value of this t-test is 1.5755. Sometimes, especially if you have many explanatory variables, it can be awkward to have to count in which row the relevant coefficients are. But you can also use the name of the relevant variable as follows:
t_test = (reg_ex1_sm$coefficients["exper",1] - 0.01)/reg_ex1_sm$coefficients["exper",2]
This delivers the identical result.
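To turn the statistic into a test decision you can compare it to the relevant critical value. The following is a minimal sketch, assuming a 5% significance level and using the 425 residual degrees of freedom reported in the regression output:

t_crit <- qt(0.95, df = 425)  # one-sided 5% critical value
t_test > t_crit               # FALSE: 1.5755 < approx. 1.65, so we cannot reject H0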
F-tests
F-tests are used to test multiple coefficient restrictions on regression coefficients.
Let's say we are interested in whether two additional variables age and educ should be included in the model. As a good econometrics student, or even master, you know that to calculate an F-test you need the residual sums of squares from a restricted model (that is model reg_ex1) and an unrestricted model. The latter we estimate here:
reg_ex2 <- lm(lwage ~ exper + log(huswage) + age + educ, data = mydata)
reg_ex2_sm <- summary(reg_ex2)
F-test 3 ways
As with many things in R, you can achieve the same result in several different ways. Here I will introduce three different ways to get the F-test result. Choose whichever seems most intuitive or straightforward to you.
Calculating the F-test is now very easy. What we need is a restricted and an unrestricted model. We use the function anova:
print(anova(reg_ex1,reg_ex2))
which delivers the following output:
Analysis of Variance Table

Model 1: lwage ~ exper + log(huswage)
Model 2: lwage ~ exper + log(huswage) + age + educ
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    425 210.11                                  
2    423 188.10  2    22.004 24.741 6.895e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The table at the heart of this output delivers the individual residual sums of squares, the F-test statistic and its p-value. The p-value is extremely small, which would lead us to reject the null hypothesis, concluding that at least one of age or educ is significant. If you look at the regression output of reg_ex2 you will see that it is the education variable.
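To see what anova() does behind the scenes, here is a minimal sketch that reproduces the F statistic by hand from the two residual sums of squares (the variable names are my own):

rss_r <- sum(resid(reg_ex1)^2)  # restricted model RSS: 210.11
rss_u <- sum(resid(reg_ex2)^2)  # unrestricted model RSS: 188.10
q     <- 2                      # number of restrictions (age and educ)
df_u  <- reg_ex2$df.residual    # 423 residual degrees of freedom
F_stat <- ((rss_r - rss_u)/q)/(rss_u/df_u)  # 24.741, matching the anova() output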
Slightly easier is the linear hypothesis testing function (lht) from the car package:
library(car)  # provides lht()
lht1 <- lht(reg_ex2, c("age = 0", "educ = 0"))
which requires as input the unrestricted model (here reg_ex2) and the restrictions. Here we have two restrictions, namely that the coefficients on the two variables age and educ are 0 and hence that both variables are irrelevant. You could add more restrictions by adding them to the vector c(). Also note that instead of the short version (lht) you could use the slightly longer function name (linearHypothesis); everything else remains unchanged.
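As a side note, lht() is not limited to zero restrictions; the following line is a purely illustrative sketch (this particular hypothesis is not suggested by the model above):

lht(reg_ex2, "age + educ = 0")  # tests whether the two coefficients sum to zero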
Look at the output and you will find that it is essentially the same as for the anova function. The lht function is more convenient for testing purposes, as you only need the unrestricted model and the restrictions. Internally the function then estimates the restricted model, but you as the user will not see it.
The last way to calculate an F-test is by using the waldtest function, which comes with the AER package (strictly speaking it lives in the lmtest package, which AER loads). In some ways this is the most powerful of the three options, in particular if you also want to allow for heteroskedastic error terms (see R_robust_se). All you need to do to calculate the above F-test is to call the following line:
> waldtest(reg_ex2, .~. - age - educ)
Wald test

Model 1: lwage ~ exper + log(huswage) + age + educ
Model 2: lwage ~ exper + log(huswage)
  Res.Df Df      F    Pr(>F)    
1    423                        
2    425 -2 24.741 6.895e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Don't be confused by the name "waldtest"; in the end you get the same F-test. As for the lht function, you only need the unrestricted model (reg_ex2) as the first input. The second input specifies the restriction to be tested, or more literally what the restricted model should look like relative to the unrestricted one. We start this specification with the term .~., which basically says "use the same terms to the left and right of the ~ sign", and then remove the age and education variables with - age - educ.
Of course, the results are absolutely identical.
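As a side note, waldtest() also accepts two fitted models directly, just like anova() above; this sketch delivers the same test:

waldtest(reg_ex2, reg_ex1)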
p-value
Often you will want to calculate p-values for test statistics. You need a number of ingredients to do that:
- You need to know the value of a calculated test statistic
- You need to know what the distribution of the test statistic is (assuming that the null hypothesis is true) - this includes knowledge of degrees of freedom parameters if these are required for your distribution.
- You need to know whether you are working with a two-tailed, left-tailed or right-tailed test.
Once you know all these ingredients you can use some of R's internal functions to get p-values. Let's return to the t-test we calculated earlier (and saved in t_test) to test [math]H_0: \beta_{exper} = 0.01; H_A: \beta_{exper} \gt 0.01[/math]. To get the p-value here (a right-tailed area) we can call on the function pt, which calculates probabilities from a t-distribution (here, the probability under the t-distribution with 425 degrees of freedom to the left of t_test):
p_t_test = (1-pt(t_test,425)) # right tailed p-value
The result is 0.05794. The (1 - ... ) part ensures that we calculate the right tail. Alternatively (as usual there are several ways to achieve the same result, all equally valid) we could have used an option in the pt function to ask for the right tail directly:
p_t_test = pt(t_test,425,lower.tail = FALSE)
Of course, different test statistics require different distributions, but the principle is the same. The relevant function to get p-values for an F-test is pf. Type ?pf into R to get to the help page, where you can see how to use it exactly.
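For instance, this minimal sketch reproduces the p-value of the F-test from above (F statistic 24.741 with 2 and 423 degrees of freedom, right tail):

p_F_test = pf(24.741, 2, 423, lower.tail = FALSE)  # approx. 6.895e-11, as reported above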
These probability functions exist for a range of common distributions. For a list see https://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html.