Difference between revisions of "R Regression"

From ECLR

Revision as of 21:17, 17 January 2015

Let's assume we want to run a regression with lwage (the logarithm of the woman's wage) as the dependent variable and a constant, exper (the years of experience) and the logarithm of the husband's wage (huswage) as explanatory variables. First we should note that the logarithm of the woman's wage already exists as the variable lwage, but the logarithm of the husband's wage doesn't exist as its own variable. Hence we have yet to calculate it.

The lm() function

The R function that does the heavy lifting for regression analysis is the lm() function (presumably an abbreviation for "linear model") and we will have a close-up look at how it works. But first let's get our first regression under our belt.

A first example

The following few lines of code (which you should save in a script) import the data, convert missing data to NAs (see R_Data#Data_Types) and finally run a regression:

    # This is my first R regression!
    setwd("T:/ECLR/R/FirstSteps")              # This sets the working directory
    mydata <- read.csv("mroz.csv")  # Opens mroz.csv from working directory
    # Now convert variables with "." to num with NA
    mydata$wage <- as.numeric(as.character(mydata$wage))
    mydata$lwage <- as.numeric(as.character(mydata$lwage))
    # Run a regression
    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)

So let's look at the last line, in which we ask R to run a regression. Whatever comes in the parentheses after lm are parameters to the lm() function. Different parameters are separated by commas, so here we have two inputs. Let's start with the second, data=mydata. This indicates to R that we are drawing the data for the regression from our dataframe called mydata. That means that in the first input, in which we actually specify the model we estimate, we can refer to the names of the variables that are contained in mydata. In that first input you should imagine writing down a regression model. The model we want to estimate is the following:

[math]\label{OLSModel} lwage = \beta_0 + \beta_1 * exper + \beta_2 * log(huswage) + \epsilon[/math]

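The way to tell lm() to estimate this model is to leave out the coefficients and the error term, and to replace the equals sign with a ~. The constant does not need to be written down either, as lm() includes it automatically. This is where the model formula, lwage~exper+log(huswage), in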

    reg_ex1 <- lm(lwage~exper+log(huswage),data=mydata)

comes from. One additional note here. The log(huswage) part of this model took the huswage variable from our dataframe and applied the log() function to it.
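To see that applying log() inside the model formula is equivalent to calculating the logged variable first, here is a minimal sketch. It does not use mroz.csv; the dataframe toy, its made-up values and the names reg_a, reg_b and lhuswage are assumptions introduced purely for illustration.

```r
# Simulated stand-in for the mroz data (hypothetical values, not the real data)
set.seed(1)
toy <- data.frame(
  exper   = sample(0:30, 50, replace = TRUE),   # years of experience
  huswage = runif(50, min = 2, max = 20)        # husband's wage, strictly positive
)
toy$lwage <- 0.5 + 0.02 * toy$exper + 0.1 * log(toy$huswage) + rnorm(50, sd = 0.3)

# Option 1: apply log() directly inside the formula
reg_a <- lm(lwage ~ exper + log(huswage), data = toy)

# Option 2: calculate the logged variable first, then use it in the formula
toy$lhuswage <- log(toy$huswage)
reg_b <- lm(lwage ~ exper + lhuswage, data = toy)

# Both give the same estimates; only the coefficient label differs
coef(reg_a)
coef(reg_b)
```

Either route should work on the real data as well; transforming inside the formula simply saves you from adding another column to the dataframe.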

You should be familiar with the left hand side of the command, which assigns the results of the regression to a new object called reg_ex1. You should think of this as some sort of folder in which R has now saved all the regression results. If you look at the object reg_ex1 in your environment, you will most likely scratch your head and think "What a mess!" and I think you are quite right.
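Rather than browsing the raw object, you can ask R for specific pieces of it. The following is a minimal sketch using a simulated dataframe (toy and reg_toy are hypothetical names with made-up values, standing in for mydata and reg_ex1); the functions names(), coef(), residuals() and summary() are standard R functions that work on any lm object.

```r
# Simulated stand-in for the mroz data (hypothetical values, not the real data)
set.seed(2)
toy <- data.frame(
  exper   = sample(0:30, 40, replace = TRUE),
  huswage = runif(40, min = 2, max = 20)
)
toy$lwage <- 0.4 + 0.03 * toy$exper + 0.2 * log(toy$huswage) + rnorm(40, sd = 0.3)

reg_toy <- lm(lwage ~ exper + log(huswage), data = toy)

names(reg_toy)             # the items stored in the "folder"
coef(reg_toy)              # estimated coefficients only
head(residuals(reg_toy))   # first few residuals
summary(reg_toy)           # coefficient table with standard errors, R-squared etc.
```

With the real data you would replace reg_toy with reg_ex1.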

Regression Output

You would be familiar with