Difference between revisions of "R Regression"
Line 64: | Line 64: | ||
=== Accessing Regression Output === | === Accessing Regression Output === | ||
− | The next thing you should want to know as a buddying applied econometrician is how you actually access these regression statistics in order to use them in some further analysis. Let's see how you would access the regression residuals: | + | The next thing you should want to know as a buddying applied econometrician is how you actually access these regression statistics in order to use them in some further analysis. We previously mentioned that <source enclose=none>reg_ex1</source> was a list that contained a number of elements some of which we print by using the summary function. |
+ | |||
+ | It turns out that it is actually useful to save the summary as its own object: | ||
+ | |||
+ | reg_ex1_sm <- summary(reg_ex1) | ||
+ | |||
+ | Inspecting <source enclose=none>reg_ex1_sm</source> reveals that this is a list as well. A list is a collection of objects. Sometimes these objects in lists have names and this is one such occasion. Try <source enclose=none>names(reg_ex1_sm)</source> in your console and you find the following list of names: | ||
+ | |||
+ | [1] "call" "terms" "residuals" | ||
+ | [4] "coefficients" "aliased" "sigma" | ||
+ | [7] "df" "r.squared" "adj.r.squared" | ||
+ | [10] "fstatistic" "cov.unscaled" "na.action" | ||
+ | |||
+ | In fact if you do the same for the <source enclose=none>reg_ex1</source> list you will find the following list: | ||
+ | |||
+ | > names(reg_ex1) | ||
+ | [1] "coefficients" "residuals" "effects" | ||
+ | [4] "rank" "fitted.values" "assign" | ||
+ | [7] "qr" "df.residual" "na.action" | ||
+ | [10] "xlevels" "call" "terms" | ||
+ | [13] "model" | ||
+ | |||
+ | If you compare the lists you can see that there is some overlap between the information contained in the two lists. | ||
+ | |||
+ | The question now remains how you access particular members of these lists if you wanted to use them further down in your code. | ||
+ | |||
+ | Let's see how you would access the regression residuals: | ||
reg_ex1_resid <- residuals(reg_ex1) # assigns the residual vector to a new object | reg_ex1_resid <- residuals(reg_ex1) # assigns the residual vector to a new object |
Revision as of 21:50, 17 January 2015
Let's assume we want to run a regression with lwage
(the logarithm of the woman's wage) as dependent variable and a constant, exper
(the years of experience) and the logarithm of the husbands wage (huswage
as explanatory variables. First we should note that the logarithm of the woman's wage already exists as variable lwage
, but the logarithm of the husband's wage doesn't exist as its own variable. Hence we are yet to calculate it.
Contents
The lm()
function
The R function that does the heavy lifting for regression analysis is the lm()
function (presumably an abbreviation for "linear model") and we will have a close up look at how it works. But let's get our first regression under the belt.
A first example
The following few lines of code (which you should save in a script) import the data, convert missing data to NAs (see R_Data#Data_Types) and eventually runs a regression:
# This is my first R regression! setwd("T:/ECLR/R/FirstSteps") # This sets the working directory mydata <- read.csv("mroz.csv") # Opens mroz.csv from working directory # Now convert variables with "." to num with NA mydata$
wage <- as.numeric(as.character(mydata$
wage)) mydata$
lwage <- as.numeric(as.character(mydata$
lwage)) # Run a regression reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
So let's look at the last line in which we ask R to run a regression. Whatever comes in the parenthesis after lm
are parameters to the lm()
function. Different parameters are separated by commas. So here we have two inputs. Let's start with the second data=mydata
. This basically indicates to R that we are drawing the data for the regression from our dataframe called mydata
. That means that for the first input, in which we actually specify the model we estimate, we can refer to the variable names of the variables that are contained in mydata
. In that first input you should imagine writing down a regression model. The model we want to estimate is the following:
[math]\label{OLSModel} lwage = \beta_0 + \beta_1 * exper + \beta_2 * log(huswage) + \epsilon[/math]
The way how you tell lm()
to estimate this model is to leave the coefficients and error term away, and replace the equal sign with a ~
. This is where the bold part of
reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)
comes from. One additional note here. The log(huswage)
part of this model took the huswage
variable from our dataframe and applied the log()
function to it.
You should be familiar with the left hand side of the command, which assigns the results of the regression to a new object called reg_ex1
. You should think of this as some sort of folder in which R has now saved all the regression results. If you look at the object reg_ex1
in your environment, you will see that it is a list object with 13 elements. So far so good, but if you click on the little triangle next to its name and see the detail you will most likely scratch your head and think "What a mess!" and I think you are quite right.
Regression Output
You would be familiar with a standard regression output containing estimated coefficients, standard errors and a set of regression statistics[1]. Fortunately it is very easy to obtain this. The command is
> summary(reg_ex1) Call: lm(formula = lwage ~ exper + log(huswage), data = mydata) Residuals: Min 1Q Median 3Q Max -3.10089 -0.31219 0.02919 0.37466 2.11402 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.534866 0.139082 3.846 0.000139 *** exper 0.016684 0.004243 3.933 9.81e-05 *** log(huswage) 0.236466 0.063684 3.713 0.000232 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.7031 on 425 degrees of freedom (325 observations deleted due to missingness) Multiple R-squared: 0.05919, Adjusted R-squared: 0.05477 F-statistic: 13.37 on 2 and 425 DF, p-value: 2.338e-06
Most things should sound familiar here. One thing you should note is that, on this occasion, R automatically ignored all the missing observations, all 325 of them, which left R to estimate the model with 428 observations[2]
Accessing Regression Output
The next thing you should want to know as a buddying applied econometrician is how you actually access these regression statistics in order to use them in some further analysis. We previously mentioned that reg_ex1
was a list that contained a number of elements some of which we print by using the summary function.
It turns out that it is actually useful to save the summary as its own object:
reg_ex1_sm <- summary(reg_ex1)
Inspecting reg_ex1_sm
reveals that this is a list as well. A list is a collection of objects. Sometimes these objects in lists have names and this is one such occasion. Try names(reg_ex1_sm)
in your console and you find the following list of names:
[1] "call" "terms" "residuals" [4] "coefficients" "aliased" "sigma" [7] "df" "r.squared" "adj.r.squared" [10] "fstatistic" "cov.unscaled" "na.action"
In fact if you do the same for the reg_ex1
list you will find the following list:
> names(reg_ex1) [1] "coefficients" "residuals" "effects" [4] "rank" "fitted.values" "assign" [7] "qr" "df.residual" "na.action" [10] "xlevels" "call" "terms" [13] "model"
If you compare the lists you can see that there is some overlap between the information contained in the two lists.
The question now remains how you access particular members of these lists if you wanted to use them further down in your code.
Let's see how you would access the regression residuals:
reg_ex1_resid <- residuals(reg_ex1) # assigns the residual vector to a new object
Now you could use the new object reg_ex1_resid
for some further analysis. What other regression outputs are available in this manner? Here is a list of the most useful:
coefficients(reg_ex1) # regression coefficients na.action(reg_ex1) # information about which observations were omitted fitted.values(reg_ex1) # a vector of fitted values df.residuals(reg_ex1) # model degree of freedoms (number of obs - number of est coefs)
There are clearly a number of important omissions here
Footnotes
- ↑ This wiki is not meant to explain the econometrics, but merely the programming implementation.
- ↑ Why does the output not show this number???? Yes, a fair question. If you realise that your dataframe has 753 observations then 753-325 (missing) observations is 428. Alternatively you can infer the number from the reported degrees of freedom which are calculatesd as the number of observations minus the number of estimated coefficients, here 3.