Difference between revisions of "R Regression"

From ECLR
Jump to: navigation, search
(Accessing Regression Output)
 
(54 intermediate revisions by the same user not shown)
Line 1: Line 1:
Let's assume we want to run a regression with <source enclose=none>lwage</source>
+
[http://youtu.be/SJK1OBAiuUc?hd=1 Online Demonstration]
 +
 
 +
Let's assume we want to run a regression with <source enclose=none>lwage</source> (the logarithm of the woman's wage) as dependent variable and a constant, <source enclose=none>exper</source> (the years of experience) and the logarithm of the husbands wage (<source enclose=none>huswage</source> as explanatory variables. First we should note that the logarithm of the woman's wage already exists as variable <source enclose=none>lwage</source>, but the logarithm of the husband's wage doesn't exist as its own variable. Hence we are yet to calculate it.
 +
 
 +
== The <source enclose=none>lm()</source> function ==
 +
 
 +
The R function that does the heavy lifting for regression analysis is the <source enclose=none>lm()</source> function (presumably an abbreviation for "linear model") and we will have a close up look at how it works. But let's get our first regression under the belt.
 +
 
 +
=== A first example ===
 +
 
 +
The following few lines of code (which you should save in a script) import the data, convert missing data to NAs (see [[R_Data#Data_Types]]) and eventually runs a regression:
 +
 
 +
    # This is my first R regression!
 +
    setwd("T:/ECLR/R/FirstSteps")              # This sets the working directory
 +
    mydata <- read.csv("mroz.csv",na.strings = ".")  # Opens mroz.csv from working directory
 +
         
 +
    # Run a regression
 +
    <span style="color:#0000ff">reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)</span>
 +
 
 +
So let's look at the last line in which we ask R to run a regression. Whatever comes in the parenthesis after <source enclose=none>lm</source> are parameters to the <source enclose=none>lm()</source> function. Different parameters are separated by commas. So here we have two inputs. Let's start with the second <source enclose=none>data=mydata</source>. This basically indicates to R that we are drawing the data for the regression from our dataframe called <source enclose=none>mydata</source>. That means that for the first input, in which we actually specify the model we estimate, we can refer to the variable names of the variables that are contained in <source enclose=none>mydata</source>. In that first input you should imagine writing down a regression model. The model we want to estimate is the following:
 +
 
 +
<math>\label{OLSModel}
 +
  lwage = \beta_0 + \beta_1 * exper + \beta_2 * log(huswage) + \epsilon</math>
 +
 
 +
The way how you tell <source enclose=none>lm()</source> to estimate this model is to leave the coefficients and error term away, and replace the equal sign with a <source enclose=none>~</source>. This is where the bold part of
 +
 
 +
    reg_ex1 <- lm('''lwage~exper+log(huswage)''',data=mydata)
 +
 
 +
comes from. One additional note here. The <source enclose=none>log(huswage)</source> part of this model took the <source enclose=none>huswage</source> variable from our dataframe and applied the <source enclose=none>log()</source> function to it.
 +
 
 +
You should be familiar with the left hand side of the command, which assigns the results of the regression to a new object called <source enclose=none>reg_ex1</source>. You should think of this as some sort of folder in which R has now saved all the regression results. If you look at the object <source enclose=none>reg_ex1</source> in your environment, you will see that it is a list object with 13 elements. So far so good, but if you click on the little triangle next to its name and see the detail you will most likely scratch your head and think "What a mess!" and I think you are quite right.
 +
 
 +
=== What if I do not use a dataframe ===
 +
 
 +
While you will often use a dataframe from which you extract your variables, it may well be possible that you have your data not saved in a dataframe but in individual data vectors. In that case the regression command becomes even simpler. Let's assume that your dependent variable was saved in <source enclose=none>y</source> and your two explanatory variables in <source enclose=none>x1</source> and <source enclose=none>x2</source>. In that case you would merely have to call
 +
 
 +
    reg_ex2 <- lm(y~x1+x2)
 +
 
 +
That's it, nothing else!
 +
 
 +
=== Regression Output ===
 +
 
 +
You would be familiar with a standard regression output containing estimated coefficients, standard errors and a set of regression statistics<ref>This wiki is not meant to explain the econometrics, but merely the programming implementation.</ref>. Fortunately it is very easy to obtain this. The command is
 +
 
 +
    > summary(reg_ex1)
 +
     
 +
    Call:
 +
    lm(formula = lwage ~ exper + log(huswage), data = mydata)
 +
     
 +
    Residuals:
 +
          Min      1Q  Median      3Q      Max
 +
    -3.10089 -0.31219  0.02919  0.37466  2.11402
 +
     
 +
    Coefficients:
 +
                  Estimate Std. Error t value Pr(>|t|)   
 +
    (Intercept)  0.534866  0.139082  3.846 0.000139 ***
 +
    exper        0.016684  0.004243  3.933 9.81e-05 ***
 +
    log(huswage) 0.236466  0.063684  3.713 0.000232 ***
 +
    ---
 +
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 +
     
 +
    Residual standard error: 0.7031 on 425 degrees of freedom
 +
      (325 observations deleted due to missingness)
 +
    Multiple R-squared:  0.05919, Adjusted R-squared:  0.05477
 +
    F-statistic: 13.37 on 2 and 425 DF,  p-value: 2.338e-06
 +
 
 +
Most things should look familiar here. One thing you should note is that, on this occasion, R automatically ignored all the missing observations, all 325 of them, which left R to estimate the model with 428 observations<ref>Why does the output not show this number???? Yes, a fair question. If you realise that your dataframe has 753 observations then 753-325 (missing) observations is 428. You can also use the command <source enclose=none>nrow(mydata)</source> to get that number. Alternatively you can infer the number from the reported degrees of freedom which are calculatesd as the number of observations minus the number of estimated coefficients, here 3.</ref>
 +
 
 +
=== Accessing Regression Output ===
 +
 
 +
The next thing you should want to know as a buddying applied econometrician is how you actually access these regression statistics in order to use them in some further analysis<ref>You may, e.g., want to calculate a t-test for a null hypothesis that does not test whether the coefficient is equal to 0.</ref>. We previously mentioned that <source enclose=none>reg_ex1</source> was a list that contained a number of elements some of which we print by using the summary function.
 +
 
 +
It turns out that it is actually useful to save the summary as its own object:
 +
 
 +
          reg_ex1_sm <- summary(reg_ex1)
 +
 
 +
Inspecting <source enclose=none>reg_ex1_sm</source> reveals that this is a list as well. A list is a collection of objects. Sometimes these objects in lists have names and this is one such occasion. Try <source enclose=none>names(reg_ex1_sm)</source> in your console and you find the following list of names:
 +
 
 +
      [1] "call"          "terms"        "residuals"   
 +
      [4] "coefficients"  "aliased"      "sigma"       
 +
      [7] "df"            "r.squared"    "adj.r.squared"
 +
    [10] "fstatistic"    "cov.unscaled"  "na.action"
 +
 
 +
In fact if you do the same for the <source enclose=none>reg_ex1</source> list you will find the following list:
 +
 
 +
    > names(reg_ex1)
 +
      [1] "coefficients"  "residuals"    "effects"     
 +
      [4] "rank"          "fitted.values" "assign"     
 +
      [7] "qr"            "df.residual"  "na.action"   
 +
    [10] "xlevels"      "call"          "terms"       
 +
    [13] "model" 
 +
 
 +
If you compare the lists you can see that there is some overlap between the information contained in the two lists. In fact I find that the only list member that you would usually want to access from <source enclose=none>reg_ex1</source> directly is the <source enclose=none>"fitted.values"</source>. Most other commonly used statistics are available from the summary list. I therefore make it a habit to save the summary as above.
 +
 
 +
The question now remains how you access particular members of these lists if you wanted to use them further down in your code. There are a number of ways of accessing members of a list. This isn't really the place to discuss these in detail. [http://www.r-tutor.com/r-introduction/list/named-list-members This] is a good place to go if you want to find out more. I shall just demonstrate one way to do this.
 +
 
 +
    reg_ex1_resid <- reg_ex1_sm<source enclose=none>$</source>residuals  # assigns the residual vector to a new object
 +
 
 +
This will work in a similar manner for any of the other list elements. The <source enclose=none>$</source> sign separates the list name from the name of the list member sought.  It is worth looking at one other output in particular, the coefficients. As it turns out, while both, the regression output and the regression summary both have a list member called coefficients, they are defined slightly differently:
 +
 
 +
The version from the regression output reports the coefficients only, while the summary version also contains standard errors, t-statistics and p-values:
 +
 
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients
 +
                  Estimate  Std. Error  t value    Pr(>|t|)
 +
    (Intercept)  0.53486640 0.139081935 3.845693 1.386203e-04
 +
    exper        0.01668435 0.004242588 3.932587 9.811191e-05
 +
    log(huswage) 0.23646570 0.063684482 3.713082 2.320238e-04
 +
 
 +
The latter is a 3 by 4 matrix. If you were only interested in the p-value of the coefficient to <code>log(huswage)</code> you only need the element in the 3rd row and 4th and you would request it as in the following line:
 +
 
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients[3,4]
 +
    [1] 0.0002320238
 +
 
 +
I want to illustrate the different ways in which you can access these values. All the following commands return the standard error to the <code>exper</code> coefficient. Which one you use is really up to you:
 +
 
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients[2,2]
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients["exper",2]
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients["exper","Std. Error"]
 +
    reg_ex1_sm<source enclose=none>$</source>coefficients[2,"Std. Error"]
 +
 
 +
Only two things to mention here. The coefficients variable in the summary output is really a matrix (here with 3 rows and 4 columns) and yo can access individual elements by indicating the respective <code>[row,col]</code>. But instead of using row and column numbers you can also use the row's and column's names.
 +
 
 +
As you would often want to use values from this matrix, the kind people who developped R also devised a slightly shorter way to do this, using the <code>coef</code> method. This results in the following alternative ways to access the standard error to the exper coefficient estimate.
 +
 
 +
    coef(reg_ex1_sm)[2,2]
 +
    coef(reg_ex1_sm)["exper",2]
 +
    coef(reg_ex1_sm)["exper","Std. Error"]
 +
    coef(reg_ex1_sm)[2,"Std. Error"]
 +
 
 +
As I said above, you can choose whatever way is more convenient.
 +
 
 +
== Extra parameters to the <source enclose=none>lm()</source> function ==
 +
 
 +
So far we employed the <source enclose=none>lm()</source> function in its most basic form. It does have some bells and whistles and here I will introduce a few that may become useful on occasions. What sort of options you have available you can find out by typing <source enclose=none>?lm</source> into your console and reading the help function.
 +
 
 +
=== Regression without constant ===
 +
 
 +
By default the <source enclose=none>lm()</source> function assumes that you want to estimate a regression with a constant term. In case you want to not have a constant term in your regression model, you indicate this as follows:
 +
 
 +
    reg_ex1 <- lm(lwage~exper+log(huswage)<span style="color:#ff0000">-1</span>,data=mydata)
 +
 
 +
You add a <span style="color:#ff0000"><source enclose=none>-1</source></span> to your regression model. Try it and confirm that the results now do not contain a regression constant.
 +
 
 +
=== Regression on subsamples ===
 +
 
 +
Being able to easily repeat a regression on just a sub-sample of data is an important tool in the econometrician's toolbox. Fortunately this is done very easily in R. We use the additional parameter into the <source enclose=none>lm()</source> function, <source enclose=none>subset</source>. Let's see an example and then discuss how it works:
 +
 
 +
    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata, subset = (exper > 9))
 +
 
 +
The term that follows <source enclose=none>subset =</source> is a vector of boolean variables (FALSE/TRUE variables). Here we are asking the question (of all our available observations) whether the value of the <source enclose=none>exper</source> is larger than 9. If it is the answer is TRUE and that observation is included in the subset, if it is not the answer is FALSE and that observation will be dropped from the regression.
 +
 
 +
 
 +
 
 +
== Footnotes ==
 +
 
 +
<references />

Latest revision as of 08:56, 29 August 2015

Online Demonstration

Let's assume we want to run a regression with lwage (the logarithm of the woman's wage) as dependent variable and a constant, exper (the years of experience) and the logarithm of the husbands wage (huswage as explanatory variables. First we should note that the logarithm of the woman's wage already exists as variable lwage, but the logarithm of the husband's wage doesn't exist as its own variable. Hence we are yet to calculate it.

The lm() function

The R function that does the heavy lifting for regression analysis is the lm() function (presumably an abbreviation for "linear model") and we will have a close up look at how it works. But let's get our first regression under the belt.

A first example

The following few lines of code (which you should save in a script) import the data, convert missing data to NAs (see R_Data#Data_Types) and eventually runs a regression:

    # This is my first R regression!
    setwd("T:/ECLR/R/FirstSteps")              # This sets the working directory
    mydata <- read.csv("mroz.csv",na.strings = ".")  # Opens mroz.csv from working directory
         
    # Run a regression
    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)

So let's look at the last line in which we ask R to run a regression. Whatever comes in the parenthesis after lm are parameters to the lm() function. Different parameters are separated by commas. So here we have two inputs. Let's start with the second data=mydata. This basically indicates to R that we are drawing the data for the regression from our dataframe called mydata. That means that for the first input, in which we actually specify the model we estimate, we can refer to the variable names of the variables that are contained in mydata. In that first input you should imagine writing down a regression model. The model we want to estimate is the following:

[math]\label{OLSModel} lwage = \beta_0 + \beta_1 * exper + \beta_2 * log(huswage) + \epsilon[/math]

The way how you tell lm() to estimate this model is to leave the coefficients and error term away, and replace the equal sign with a ~. This is where the bold part of

    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)

comes from. One additional note here. The log(huswage) part of this model took the huswage variable from our dataframe and applied the log() function to it.

You should be familiar with the left hand side of the command, which assigns the results of the regression to a new object called reg_ex1. You should think of this as some sort of folder in which R has now saved all the regression results. If you look at the object reg_ex1 in your environment, you will see that it is a list object with 13 elements. So far so good, but if you click on the little triangle next to its name and see the detail you will most likely scratch your head and think "What a mess!" and I think you are quite right.

What if I do not use a dataframe

While you will often use a dataframe from which you extract your variables, it may well be possible that you have your data not saved in a dataframe but in individual data vectors. In that case the regression command becomes even simpler. Let's assume that your dependent variable was saved in y and your two explanatory variables in x1 and x2. In that case you would merely have to call

    reg_ex2 <- lm(y~x1+x2)

That's it, nothing else!

Regression Output

You would be familiar with a standard regression output containing estimated coefficients, standard errors and a set of regression statistics[1]. Fortunately it is very easy to obtain this. The command is

    > summary(reg_ex1)
     
    Call:
    lm(formula = lwage ~ exper + log(huswage), data = mydata)
     
    Residuals:
         Min       1Q   Median       3Q      Max 
    -3.10089 -0.31219  0.02919  0.37466  2.11402 
     
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.534866   0.139082   3.846 0.000139 ***
    exper        0.016684   0.004243   3.933 9.81e-05 ***
    log(huswage) 0.236466   0.063684   3.713 0.000232 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
     
    Residual standard error: 0.7031 on 425 degrees of freedom
      (325 observations deleted due to missingness)
    Multiple R-squared:  0.05919,	Adjusted R-squared:  0.05477 
    F-statistic: 13.37 on 2 and 425 DF,  p-value: 2.338e-06

Most things should look familiar here. One thing you should note is that, on this occasion, R automatically ignored all the missing observations, all 325 of them, which left R to estimate the model with 428 observations[2]

Accessing Regression Output

The next thing you should want to know as a buddying applied econometrician is how you actually access these regression statistics in order to use them in some further analysis[3]. We previously mentioned that reg_ex1 was a list that contained a number of elements some of which we print by using the summary function.

It turns out that it is actually useful to save the summary as its own object:

         reg_ex1_sm <- summary(reg_ex1)

Inspecting reg_ex1_sm reveals that this is a list as well. A list is a collection of objects. Sometimes these objects in lists have names and this is one such occasion. Try names(reg_ex1_sm) in your console and you find the following list of names:

     [1] "call"          "terms"         "residuals"    
     [4] "coefficients"  "aliased"       "sigma"        
     [7] "df"            "r.squared"     "adj.r.squared"
    [10] "fstatistic"    "cov.unscaled"  "na.action"

In fact if you do the same for the reg_ex1 list you will find the following list:

    > names(reg_ex1)
     [1] "coefficients"  "residuals"     "effects"      
     [4] "rank"          "fitted.values" "assign"       
     [7] "qr"            "df.residual"   "na.action"    
    [10] "xlevels"       "call"          "terms"        
    [13] "model"  

If you compare the lists you can see that there is some overlap between the information contained in the two lists. In fact I find that the only list member that you would usually want to access from reg_ex1 directly is the "fitted.values". Most other commonly used statistics are available from the summary list. I therefore make it a habit to save the summary as above.

The question now remains how you access particular members of these lists if you wanted to use them further down in your code. There are a number of ways of accessing members of a list. This isn't really the place to discuss these in detail. This is a good place to go if you want to find out more. I shall just demonstrate one way to do this.

    reg_ex1_resid <- reg_ex1_sm$residuals  # assigns the residual vector to a new object

This will work in a similar manner for any of the other list elements. The $ sign separates the list name from the name of the list member sought. It is worth looking at one other output in particular, the coefficients. As it turns out, while both, the regression output and the regression summary both have a list member called coefficients, they are defined slightly differently:

The version from the regression output reports the coefficients only, while the summary version also contains standard errors, t-statistics and p-values:

   reg_ex1_sm$coefficients
                  Estimate  Std. Error  t value     Pr(>|t|)
   (Intercept)  0.53486640 0.139081935 3.845693 1.386203e-04
   exper        0.01668435 0.004242588 3.932587 9.811191e-05
   log(huswage) 0.23646570 0.063684482 3.713082 2.320238e-04

The latter is a 3 by 4 matrix. If you were only interested in the p-value of the coefficient to log(huswage) you only need the element in the 3rd row and 4th and you would request it as in the following line:

   reg_ex1_sm$coefficients[3,4]
   [1] 0.0002320238

I want to illustrate the different ways in which you can access these values. All the following commands return the standard error to the exper coefficient. Which one you use is really up to you:

   reg_ex1_sm$coefficients[2,2]
   reg_ex1_sm$coefficients["exper",2]
   reg_ex1_sm$coefficients["exper","Std. Error"]
   reg_ex1_sm$coefficients[2,"Std. Error"]

Only two things to mention here. The coefficients variable in the summary output is really a matrix (here with 3 rows and 4 columns) and yo can access individual elements by indicating the respective [row,col]. But instead of using row and column numbers you can also use the row's and column's names.

As you would often want to use values from this matrix, the kind people who developped R also devised a slightly shorter way to do this, using the coef method. This results in the following alternative ways to access the standard error to the exper coefficient estimate.

   coef(reg_ex1_sm)[2,2]
   coef(reg_ex1_sm)["exper",2]
   coef(reg_ex1_sm)["exper","Std. Error"]
   coef(reg_ex1_sm)[2,"Std. Error"]

As I said above, you can choose whatever way is more convenient.

Extra parameters to the lm() function

So far we employed the lm() function in its most basic form. It does have some bells and whistles and here I will introduce a few that may become useful on occasions. What sort of options you have available you can find out by typing ?lm into your console and reading the help function.

Regression without constant

By default the lm() function assumes that you want to estimate a regression with a constant term. In case you want to not have a constant term in your regression model, you indicate this as follows:

    reg_ex1 <- lm(lwage~exper+log(huswage)-1,data=mydata)

You add a -1 to your regression model. Try it and confirm that the results now do not contain a regression constant.

Regression on subsamples

Being able to easily repeat a regression on just a sub-sample of data is an important tool in the econometrician's toolbox. Fortunately this is done very easily in R. We use the additional parameter into the lm() function, subset. Let's see an example and then discuss how it works:

    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata, subset = (exper > 9))

The term that follows subset = is a vector of boolean variables (FALSE/TRUE variables). Here we are asking the question (of all our available observations) whether the value of the exper is larger than 9. If it is the answer is TRUE and that observation is included in the subset, if it is not the answer is FALSE and that observation will be dropped from the regression.


Footnotes

  1. This wiki is not meant to explain the econometrics, but merely the programming implementation.
  2. Why does the output not show this number???? Yes, a fair question. If you realise that your dataframe has 753 observations then 753-325 (missing) observations is 428. You can also use the command nrow(mydata) to get that number. Alternatively you can infer the number from the reported degrees of freedom which are calculatesd as the number of observations minus the number of estimated coefficients, here 3.
  3. You may, e.g., want to calculate a t-test for a null hypothesis that does not test whether the coefficient is equal to 0.