Predicting from Regression in R
Here we will be using the Women's Wages dataset, which is made available here.
Introduction and Data Upload
Set your working directory and load the data:
setwd("C:/Users/Ralph Becker/Dropbox/R") # This sets the working directory, ensure data file is in here mydata <- read.csv("Mroz.csv",na.strings = ".") # Opens mroz.csv from working directory
Let's look at the wage variable (Variable Descriptions):
summary(mydata$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.1282  2.2630  3.4820  4.1780  4.9710 25.0000     325
In our dataset we have 325 observations with no wage information. You will have to be clear about the reason for this. Are these women not working, or are these genuinely missing observations? The clue here is the inlf variable, which indicates whether a particular woman is in the labour force (inlf = 1) or not (inlf = 0). All observations for which the wage is missing coincide with inlf = 0. You can see that from the spreadsheet or by checking
mydata$inlf[is.na(mydata$wage)]
##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 0 0 0 0 0
The is.na() function is a very useful one. It checks whether the elements in mydata$wage are missing or not, returning "TRUE" where they are missing and "FALSE" for observations where the wage is not missing. When you put this function into square brackets after mydata$inlf, R will only return those values of the inlf variable for which the corresponding value of wage is missing (as these are the observations that produce a "TRUE" from the is.na() function).
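If you want to confirm this more compactly (a quick optional check, not part of the original text), you can tabulate the inlf values of the missing-wage observations:

table(mydata$inlf[is.na(mydata$wage)])   # all 325 missing-wage observations should show inlf = 0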
What we are about to do is the following. We shall first estimate the regression model
[math]wage_i = \alpha_0 + \alpha_1 age_i + \alpha_2 educ_i + \epsilon_i[/math]
on the basis of all observations that have wage information. Then, subsequently, we shall use the estimated model to predict the wages of those women who are not currently working. That is, we will be answering the question of how much they should expect to earn (on the basis of their age and education).
Before we do this, there are two reasons why we should take this analysis with a pinch of salt:
- Including the education variable in such an analysis is likely to be problematic as it is likely to be correlated with the error terms. This will produce biased coefficient estimates (also see the section on Instrumental variables estimation).
- The decision of whether or not a woman works is most likely not random, and therefore the working women are not a randomly selected subset. It may well be that the relationship we estimate between wage and the explanatory variables on the basis of those women who do work is not the same for women who do not work.
Regression and Prediction
Despite these reservations we shall use this example to demonstrate how to produce predictions in R. Let's start by splitting our dataframe into the two subgroups (those for whom we do have wages and those for whom we don't). We shall again use the is.na() function, together with the subset function.
mydata_est <- subset(mydata, is.na(wage) == FALSE)
mydata_pred <- subset(mydata, is.na(wage) == TRUE)
Those observations for which wage is missing go into the mydata_pred dataframe and all others into mydata_est.
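As a small sanity check (an optional step, not part of the original example), you can confirm that the two subsets together cover the full dataset:

nrow(mydata_est)                                      # observations with wage information
nrow(mydata_pred)                                     # observations with missing wages (325)
nrow(mydata_est) + nrow(mydata_pred) == nrow(mydata)  # should return TRUE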
We now run our regression on the data in the mydata_est dataframe:
lm1 <- lm(wage~age+educ, data=mydata_est)
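Before producing predictions it is usually worth looking at the estimated coefficients (an optional step; the exact numbers will depend on your data):

summary(lm1)   # coefficient estimates, standard errors, t-values and R-squared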
All the results are in lm1. We now use this estimated model to produce predicted wages for all those observations that are in mydata_pred. To do this we use the predict() function. We need to hand in the estimated model lm1 and the dataframe to which we want to apply the estimated model to produce forecasts (mydata_pred). This, of course, requires that this dataframe has an age and an educ variable.
pr1 <- predict(lm1,mydata_pred)
Check out the new object pr1. It is a vector with 325 predictions for the 325 observations in mydata_pred.
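For a quick look at these predictions (an optional check), you could inspect the first few values and their summary statistics:

head(pr1)      # the first few predicted wages
summary(pr1)   # summary statistics of the 325 predicted wages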
New Data
Sometimes you don't have actual observations for which you want to produce forecasts; rather, you have a set of conditioning values at which you want to calculate predictions.
Now we will show how you would set this up. You basically want to produce a new dataframe which contains, at the least, two variables named age and educ. You could do this as follows (try to understand what age/education combinations are being produced here):
age = seq(20, 55, 1)          # creates ages 20, 21, 22, ..., 54, 55; 36 observations
educ = rep(c(8,10,12,14),9)   # creates 9 repetitions of 8, 10, 12, 14; altogether 36 observations
new_df <- data.frame(age,educ)
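If you want to verify which combinations were created (a quick inspection, not part of the original example), look at the first few rows:

head(new_df)   # shows the first six age/educ combinations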
And now we apply the estimated model lm1 to this new dataset.
pr2 <- predict(lm1,new_df)
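It can be convenient (this is merely a suggestion; new_df_pred is just an illustrative name) to store the predictions next to the conditioning values that generated them:

new_df_pred <- cbind(new_df, wage_pred = pr2)   # combine conditioning values and predicted wages
head(new_df_pred)                               # inspect the first few combinations with their predictions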
Of course you could have asked for forecasts for very different age/education combinations. The predict() function has a few extra bells and whistles you could use. You know well (or at least you should) that when we make predictions using a linear regression model we get the conditional expectation, but there will be uncertainty around this expectation. One of the options of the predict() function allows you to obtain confidence intervals.
Let's try the following:
pr3 <- predict(lm1,new_df, level = 0.99, interval = "prediction")
Here we set the confidence level to 0.99 (if you don't set it explicitly, R will use a default of 0.95) and ask R to produce prediction intervals. Look at the resulting pr3 object: it now has three columns. Let's look at the first row of new_df and pr3.
new_df[1,]
##   age educ
## 1  20    8
pr3[1,]
##       fit       lwr       upr
##  1.397456 -6.789651  9.584564
The first of our new cases is for someone aged 20 with 8 years of education. The forecast we produced is 1.40, with the lower boundary of the interval at -6.79 and the upper boundary at 9.58. So the model tells us that there is a 99% probability that a woman aged 20 with 8 years of education would earn something between -6.79 and 9.58 dollars per hour. Clearly this is a huge interval, and in fact it reaches into negative hourly wages, which don't really make sense. There are three reasons for this.
- We set a high level of confidence, resulting in a very wide interval.
- Our conditioning values were at the lower end of the observed ranges of age and education. If we had used conditioning values closer to the respective averages we would get a narrower interval (try it!).
- We asked R for an interval for an individual (by using interval = "prediction"), which means that R also factors in the observed level of error variance.
The last point is important as it allows us to distinguish this from a slightly different type of confidence interval:
pr4 <- predict(lm1,new_df, level = 0.99, interval = "confidence")
All that has changed is the type of interval we asked for, interval = "confidence". Before discussing what the difference is, let's see how the result changes.
new_df[1,]
##   age educ
## 1  20    8
pr4[1,]
##         fit         lwr         upr
##  1.39745636 -0.05580818  2.85072091
We can see that this interval is now much narrower, although we still have the same expected value of 1.40. The question this type of interval answers is the following: assume you have many women aged 20 with 8 years of education. What should you expect the average wage of these many women to be? Answer: with 99% probability, the model tells us, this value should be between -0.06 and 2.85.
Why does predicting an average produce such a much narrower interval? As you are calculating averages, you are averaging out individual variation and hence do not need to take it into account. This tends to massively reduce the width of the interval.
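To see the difference in width directly (a small illustrative check, assuming pr3 and pr4 were created as above), you can compare the two intervals for the first conditioning case:

pr3[1, "upr"] - pr3[1, "lwr"]   # width of the 99% prediction interval for an individual
pr4[1, "upr"] - pr4[1, "lwr"]   # width of the 99% confidence interval for the average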