# Predicting from Regression in R

Here we will be using the Women's Wages dataset, which is made available here.

# Introduction and Data Upload

Set your working directory and load the data:

```r
setwd("C:/Users/Ralph Becker/Dropbox/R") # This sets the working directory; ensure the data file is in here
mydata <- read.csv("Mroz.csv", na.strings = ".") # Opens Mroz.csv from the working directory
```

Let's look at the `wage` variable (Variable Descriptions):

```r
summary(mydata$wage)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  0.1282  2.2630  3.4820  4.1780  4.9710 25.0000     325
```

In our dataset we have 325 observations that have no wage information. You will have to be clear about the reason for this. Are these women not working, or are these genuinely missing observations? The clue here is in the `inlf` variable, which indicates whether a particular woman is in the labour force (inlf = 1) or not (inlf = 0). All those observations for which the wage is missing coincide with (inlf = 0). You can see that from the spreadsheet or by checking

```r
mydata$inlf[is.na(mydata$wage)]
```

```
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [141] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [211] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [281] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 0 0 0 0 0
```

The `is.na( )` function is a very useful one. It checks whether the elements in `mydata$wage` are missing or not. It returns "TRUE" for elements that are missing and "FALSE" for observations where the wage is not missing. When you put this function into square brackets after `mydata$inlf`, R will only return those values of the `inlf` variable where the corresponding value for `wage` was missing (as this would produce a "TRUE" from the `is.na( )` function).
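To see how this logical indexing works, here is a minimal sketch on a small made-up vector (the names `w` and `lf` are purely illustrative, not part of the Mroz data):

```r
# A toy wage vector with two missing values and a matching labour-force dummy
w  <- c(3.5, NA, 4.2, NA, 2.8)
lf <- c(1, 0, 1, 0, 1)

is.na(w)      # FALSE  TRUE FALSE  TRUE FALSE
lf[is.na(w)]  # 0 0 - the lf values where w is missing
sum(is.na(w)) # 2 - a quick way to count missing observations
```

`sum(is.na( ))` works because "TRUE" counts as 1 and "FALSE" as 0, which is how you would verify the 325 missing wages above.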

What we are about to do is the following. We shall first estimate the following regression model:

[math]wage_i = \alpha_0 + \alpha_1 age_i + \alpha_2 educ_i + \epsilon_i[/math]

on the basis of all observations that have wage information. Then, subsequently, we shall use the estimated model to predict the wages of those who are not currently working, i.e. we'd be answering the question of how much they should expect to earn (on the basis of their age and education).

Before we do this, there are two reasons why we should take this analysis with a pinch of salt:

- Including the education variable in such an analysis is likely to be problematic, as it is likely to be correlated with the error terms. This will produce biased coefficient estimates (also see the section on Instrumental variables estimation).
- The decision of whether or not a woman works is most likely not random, and therefore working women are not a randomly selected subset. It may well be that the relationship between wage and the explanatory variables that we estimate on the basis of those women who do work is not the same for women who do not work.

# Regression and Prediction

Despite these reservations we shall use this example to demonstrate how we use R to produce predictions. Let's start by splitting our dataframe into the two subgroups (those for whom we do have wages and those for whom we don't). We shall again use the `is.na( )` function, together with the `subset` function.

```r
mydata_est  <- subset(mydata, is.na(wage) == FALSE)
mydata_pred <- subset(mydata, is.na(wage) == TRUE)
```

Those observations for which wage is missing go into the `mydata_pred` dataframe and all others into `mydata_est`.

We now run our regression on the data in the `mydata_est` dataframe:

```r
lm1 <- lm(wage ~ age + educ, data = mydata_est)
```

All the results are in `lm1`. We now use this estimated model to produce predicted wages for all those observations that are in `mydata_pred`. To do this we use the `predict( )` function. We need to hand in the estimated model `lm1` and then the dataframe to which we want to apply the estimated model to produce forecasts (`mydata_pred`). This, of course, requires that this dataframe has an `age` and an `educ` variable.

```r
pr1 <- predict(lm1, mydata_pred)
```

Check out the new object `pr1`. It is a vector with 325 predictions for the 325 observations in `mydata_pred`.
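If you want to experiment without the Mroz data, the whole estimate-then-predict workflow can be reproduced on simulated data. Everything below (the dataframe `sim`, the coefficients, the sample sizes) is made up for illustration only:

```r
set.seed(42)
n <- 200
sim <- data.frame(age  = sample(20:60, n, replace = TRUE),
                  educ = sample(8:17, n, replace = TRUE))
sim$wage <- 1 + 0.03 * sim$age + 0.4 * sim$educ + rnorm(n)
sim$wage[1:50] <- NA                   # pretend the first 50 wages are unobserved

sim_est  <- subset(sim, !is.na(wage))  # estimation sample (150 rows)
sim_pred <- subset(sim, is.na(wage))   # sample to predict for (50 rows)

fit <- lm(wage ~ age + educ, data = sim_est)
pr  <- predict(fit, sim_pred)          # one prediction per row of sim_pred
length(pr)                             # 50
```

Note that `predict( )` ignores the missing `wage` column in `sim_pred`; it only needs the regressors `age` and `educ`.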

## New data

Sometimes you don't have actual observations for which you want to produce forecasts, but rather you have a set of conditioning values at which you want to calculate predictions.

Now we will show how you would set this up. You would basically want to produce a new dataframe which contains, at the least, two variables named `age` and `educ`. You could do this as follows (try to understand what age/education combinations are being produced here):

```r
age  <- seq(20, 55, 1)           # creates ages 20, 21, 22, ..., 54, 55; 36 observations
educ <- rep(c(8, 10, 12, 14), 9) # creates 9 repetitions of 8, 10, 12, 14; altogether 36 observations
new_df <- data.frame(age, educ)
```

And now we apply the estimated model `lm1` to this new dataset:

```r
pr2 <- predict(lm1, new_df)
```

Of course you could have asked for forecasts for very different age/education combinations. The `predict( )` function has a few extra bells and whistles you could use. You know well (or at least you should) that when we make predictions using a linear regression model we get the conditional expectation, but there will be uncertainty around this expectation. One of the options of the `predict` function allows you to get confidence intervals.

Let's try the following:

```r
pr3 <- predict(lm1, new_df, level = 0.99, interval = "prediction")
```

Here we set the confidence `level` to 0.99 (if you don't set it explicitly R will use a default of 0.95) and ask R to produce prediction intervals. Look at the resulting `pr3` object; it now has three columns. Let's look at the 1st row of `new_df` and `pr3`:

```r
new_df[1,]
```

```
##   age educ
## 1  20    8
```

```r
pr3[1,]
```

```
##       fit       lwr       upr
##  1.397456 -6.789651  9.584564
```

The 1st of our new cases is for someone with age 20 and 8 years of education. The forecast we produced is 1.40, with the lower interval boundary at -6.79 and the upper boundary at 9.58. So the model tells us that there is a 99% probability that a woman aged 20 with 8 years of education would earn something between -6.79 and 9.58 dollars per hour. Clearly this is a huge interval, and in fact it reaches into negative hourly wages, which don't really make sense. There are three reasons for this.

- We set a high level of confidence, resulting in a very wide interval.
- Our conditioning values were at the lower range of observed ages and education levels. If we had used conditioning values closer to the respective averages we would get a narrower interval (try it!).
- We asked R to get us an interval for an individual (by using `interval = "prediction"`), which means that R also factors in the observed level of error variance.

The last point is important as it allows us to distinguish this from a slightly different type of confidence interval:

```r
pr4 <- predict(lm1, new_df, level = 0.99, interval = "confidence")
```

All that has changed is the type of interval we asked for, `interval = "confidence"`. Before discussing what the difference is, let's see how the result changes:

```r
new_df[1,]
```

```
##   age educ
## 1  20    8
```

```r
pr4[1,]
```

```
##         fit         lwr         upr
##  1.39745636 -0.05580818  2.85072091
```

We can see that the confidence interval is now much narrower, although we still have the same expected value of 1.40. The question this type of interval asks is the following. Assume you have many women aged 20 and with 8 years of education. What should you expect the average wage of these many women to be? Answer: with 99% probability, the model tells us, this value should be between -0.06 and 2.85.

Why does predicting an average produce such a much narrower interval? As you are calculating averages you are averaging out individual variation, and hence we don't need to take that variation into account. This tends to massively reduce the width of the confidence interval.
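You can check this width difference directly: at the same conditioning values, the prediction interval is always wider than the confidence interval, because it additionally accounts for the residual variance. A minimal sketch on simulated data (all names and coefficients here are made up for illustration):

```r
set.seed(1)
d <- data.frame(x = runif(100, 0, 10))
d$y <- 2 + 0.5 * d$x + rnorm(100)

m  <- lm(y ~ x, data = d)
nd <- data.frame(x = 5)

ci <- predict(m, nd, level = 0.99, interval = "confidence") # interval for the mean
pi <- predict(m, nd, level = 0.99, interval = "prediction") # interval for an individual

ci[, "upr"] - ci[, "lwr"] # width of the confidence interval
pi[, "upr"] - pi[, "lwr"] # width of the prediction interval - much larger
```

Both intervals are centred on the same `fit` value; only their widths differ, which mirrors what we saw when comparing `pr3` and `pr4`.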