Dummy Variables in R

From ECLR

In this section we explain how dummy variables can be used in regressions, utilising the Baseball wages dataset for illustration.

Dummy Variables

Econometricians think of dummy variables as binary (0/1) variables, and in some datasets you will find the data presented in exactly this form right from the start. This is, for instance, the case for the Baseball wages dataset. After importing the dataset you will find information on the position each player takes in their team: first base (frstbase), second base (scndbase), third base (thrdbase), short stop (shrtstop), outfield (outfield) and catcher (catcher). Each player is assigned exactly one of these positions.

    setwd("YOUR DIRECTORY PATH")              # This sets the working directory
    load("mlb1.RData")  # Opens mlb1 dataset from R datafile

If you now look at the data (the observations themselves are stored in data, and the variable descriptions in desc) you will find them looking something like this:

[Image Mlb1pic1.JPG: the first few rows of the mlb1 dataset]

You can see that the first player is a second base player (1 for scndbase and 0 for all other positional variables) and the second player is a short stop.

Dummy variables as independent variables

If the data come as predefined dummy variables, then it is rather straightforward to use these in regressions.

    reg_ex1 <- lm(lsalary~years+gamesyr+frstbase+scndbase+thrdbase+shrtstop+catcher,data=data)
    print(summary(reg_ex1))

Here we are running a regression in which we explain variation in log salary using the explanatory variables years of major league experience and games played per year, plus a set of dummy variables for all positions except outfield (beware the dummy variable trap!).

What we get is the following output:

    Call:
    lm(formula = lsalary ~ years + gamesyr + frstbase + scndbase + 
        thrdbase + shrtstop + catcher, data = data)
    Residuals:
         Min       1Q   Median       3Q      Max 
    -2.71524 -0.46973 -0.00695  0.45610  2.73707 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 11.222840   0.125818  89.199  < 2e-16 ***
    years        0.067257   0.012551   5.359 1.54e-07 ***
    gamesyr      0.021095   0.001412  14.935  < 2e-16 ***
    frstbase    -0.060406   0.128470  -0.470   0.6385    
    scndbase    -0.340685   0.139059  -2.450   0.0148 *  
    thrdbase     0.002862   0.142958   0.020   0.9840    
    shrtstop    -0.232334   0.124566  -1.865   0.0630 .  
    catcher      0.129668   0.126458   1.025   0.3059    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.7455 on 345 degrees of freedom
    Multiple R-squared:  0.6105,	Adjusted R-squared:  0.6026 
    F-statistic: 77.24 on 7 and 345 DF,  p-value: < 2.2e-16

As you can see, and as we surely would have expected, years of major league experience has a positive effect on salary (although we may really need to consider a quadratic effect), as does the games per year variable [1]. The included dummy variables indicate that, compared to outfield players (the base category, as that dummy variable was omitted), only second base players seem to have a significantly (at 5 per cent) different salary. The results suggest that they, ceteris paribus, earn roughly 34 per cent less than outfield players, reading the log-salary coefficient as an approximate percentage change.
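For larger coefficients this log-approximation becomes noticeably inaccurate; the exact percentage effect implied by a log-salary coefficient b is 100*(exp(b)-1). A quick sketch, plugging in the scndbase estimate from the output above:

```r
# Convert the scndbase log-salary coefficient into an exact percentage effect.
b_scndbase <- -0.340685                  # estimate from the regression output above
pct_effect <- 100 * (exp(b_scndbase) - 1)
round(pct_effect, 1)                     # about -28.9, i.e. roughly 29% lower salary
```

So the exact effect is a little smaller in magnitude than the raw coefficient of -0.34 suggests.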

Interaction terms

When you just include straight dummy variables you allow for intercept shifts according to the relevant categories (here positions on the field). This may often be inadequate, and we may really want interaction terms. These may be interactions between different dummy variables (for instance, if we were interested in whether it is really only black second base players that earn less, we would include scndbase*black)[2], or interactions between a dummy and another explanatory variable to allow for changing slope coefficients (e.g. if we wanted to figure out whether experience counts differently for catchers, we would include catcher*years).

When learning how to use interaction terms we will actually encounter another quirk of R[3]. To see this it is instructive to start with an extremely simple model, one which really makes no economic sense.

    reg_ex1 <- lm(lsalary~(years*black),data=data)
    print(summary(reg_ex1))

Intuitively we would think that this should estimate a model with a constant and one explanatory variable, years*black. But when we look at the result we can see that R has taken it upon itself to extend the model:

    Call:
    lm(formula = lsalary ~ (years * black), data = data)
    Residuals:
       Min      1Q  Median      3Q     Max 
    -3.0165 -0.7867 -0.1900  0.7537  1.9904 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 12.307118   0.117606  104.65   <2e-16 ***
    years        0.178426   0.016394   10.88   <2e-16 ***
    black        0.248952   0.214635    1.16    0.247    
    years:black -0.009502   0.027919   -0.34    0.734    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.9628 on 349 degrees of freedom
    Multiple R-squared:  0.3427,	Adjusted R-squared:  0.3371 
    F-statistic: 60.66 on 3 and 349 DF,  p-value: < 2.2e-16

It has included the simple explanatory variables years and black as well. To understand this we need to know that, in the context of model building (which is what we do here), R interprets the operator * as an invitation to include both variables individually as well as their cross term[4]. This is, at times, very convenient, as it is often exactly what you want.

But if we want to include the cross term only, we need to use the operator : instead of *. So

    reg_ex1 <- lm(lsalary~(years:black),data=data)
    print(summary(reg_ex1))

will deliver a regression model with a constant and the cross term only.[5]
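The difference between the two operators is easy to verify on a small simulated dataset. The sketch below mimics the mlb1 variable names, but the numbers are made up, so only the coefficient counts (not the estimates) are meaningful:

```r
# Simulated data: years is continuous, black is a 0/1 dummy.
set.seed(1)
df <- data.frame(years = rnorm(100), black = rbinom(100, 1, 0.5))
df$lsalary <- 12 + 0.1 * df$years + rnorm(100)

m_star  <- lm(lsalary ~ years * black, data = df)    # years, black AND years:black
m_colon <- lm(lsalary ~ years : black, data = df)    # cross term only
m_ident <- lm(lsalary ~ I(years * black), data = df) # literal product, same fit as m_colon

length(coef(m_star))   # 4 coefficients: intercept, years, black, years:black
length(coef(m_colon))  # 2 coefficients: intercept, years:black
```

Note that m_colon and m_ident produce numerically identical estimates, since both regress on the same product column.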

Using Categorical/Factor variables in regressions

As we discussed in the Data Section, when you import categorical data from csv files they will usually be imported as factor variables into R. In the data analysis section we already learned how to get frequency counts of categorical variables using the table( ) or summary( ) commands.

When using such categorical variables in regressions as explanatory variables we will use them in the form of dummy variables (binary 0/1 variables). When importing the Baseball salary dataset there were two categorical variables, playing position and ethnicity/race. But both these were already transformed to individual dummy variables as discussed above.

What would we do if, as is often the case, the categorical variable were imported as one variable rather than as separate dummies? Download this csv file, which presents the position and race variables as categorical variables and also includes lsalary, years and gamesyr (all other variables have been deleted from this datafile).

Read the csv into R,

   setwd("YOUR DIRECTORY PATH")              # This sets the working directory
   mydata <- read.csv("mlb1_cat_test.csv")

If you now look at the dataset, it will look like this:

[Image Mlb1pic2.JPG: the first few rows of the mlb1_cat_test dataset]

and inspecting the variables and their datatypes using str(mydata), we find that the variables position and race are indeed factor variables.

    'data.frame':	353 obs. of  5 variables:
     $ years   : int  12 8 5 8 12 17 4 10 4 3 ...
     $ position: Factor w/ 6 levels "catcher","first base",..: 4 5 2 6 3 3 3 1 5 3 ...
     $ race    : Factor w/ 3 levels "black","hispan",..: 3 1 3 3 1 1 2 3 2 1 ...
     $ gamesyr : num  142.1 114.8 150.2 132 99.7 ...
     $ lsalary : num  15.7 15 14.9 14.9 14.3 ...
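As a quick sanity check, levels( ) and table( ) show which categories R has detected in a factor. A toy sketch (using a made-up race factor rather than the csv itself, so the counts are illustrative only):

```r
# A small factor with the same level names as the race variable in mlb1.
race <- factor(c("black", "hispan", "white", "white", "black"))
levels(race)   # levels are ordered alphabetically by default
table(race)    # frequency count for each level
```

On the real data you would call levels(mydata$race) and table(mydata$race) instead.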

From here there are two ways to go if you want to use dummy variables based on either of these variables in a regression.

Translating into dummy variables

We can translate the factor variable into dummy variables. We can do this using the following type of commands

   mydata$frstbase <- as.numeric(mydata$position == "first base")  # as.numeric translates to numerical - here from logical

which creates a new variable in the data frame called frstbase that takes a value of 1 if the player is a first base player and 0 otherwise. Other dummy variables can be used accordingly. Once you have done this you can proceed as in the previous sections.
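If you want all dummies at once rather than one at a time, the model.matrix( ) function builds the full 0/1 matrix from a factor in a single step. A sketch with a toy factor (for the real data you would pass mydata$position):

```r
# Toy factor with four of the position labels used in mlb1.
pos <- factor(c("second base", "short stop", "first base", "outfielder"))

# "-1" removes the intercept column, so dummies for ALL levels are kept.
dummies <- model.matrix(~ pos - 1)
colnames(dummies)   # one 0/1 column per level, e.g. "posfirst base"
```

Remember that if you then use all of these columns in a regression with an intercept you must drop one of them, for the usual dummy variable trap reason.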

Using factor variables directly

One very nice aspect of R is that you can use such factor variables directly in regressions. For instance, we could estimate a regression of lsalary on years and gamesyr while also including intercept dummies for the different positions. The straightforward way to do that is as follows:

    reg_ex1 <- lm(lsalary~years+gamesyr+position,data=mydata)
    print(summary(reg_ex1))

which delivers

    Call:
    lm(formula = lsalary ~ years + gamesyr + position, data = mydata)
    Residuals:
         Min       1Q   Median       3Q      Max 
    -2.71524 -0.46973 -0.00695  0.45610  2.73707 
    
    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         11.352508   0.129846  87.430  < 2e-16 ***
    years                0.067257   0.012551   5.359 1.54e-07 ***
    gamesyr              0.021095   0.001412  14.935  < 2e-16 ***
    positionfirst base  -0.190074   0.157450  -1.207  0.22818    
    positionoutfielder  -0.129669   0.126458  -1.025  0.30590    
    positionsecond base -0.470353   0.167849  -2.802  0.00536 ** 
    positionshort stop  -0.362002   0.150584  -2.404  0.01674 *  
    positionthird base  -0.126807   0.168252  -0.754  0.45156    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.7455 on 345 degrees of freedom
    Multiple R-squared:  0.6105,	Adjusted R-squared:  0.6026 
    F-statistic: 77.24 on 7 and 345 DF,  p-value: < 2.2e-16

When you compare the summary statistics to those of the first regression we estimated in this dummy variable section, you will realise that they are identical; we essentially estimated the same model. There is, however, one difference. In the previous estimation we used outfielders as the base category (i.e. the respective dummy variable was excluded). Here we can see that R automatically includes dummy variables for all positions but one, here the catcher position. R chose to drop the catcher position as it is the position which comes first in the alphabet.

Inherent in a factor variable in R is a reference value, which by default is the level that comes first in the alphabet, as we saw in the above regression. There is, however, a way to tell R to use a different reference value. The way to do this is as follows:

    mydata$position <- relevel(mydata$position, ref = "outfielder")

This ensures that from now on R will use "outfielder" as the reference. If you now run the same regression as above

    reg_ex1 <- lm(lsalary~years+gamesyr+position,data=mydata)
    print(summary(reg_ex1))

you will find that the outfielder dummy variable will be omitted.
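What relevel( ) actually does is move the chosen level to the front of the level ordering, which is the slot lm( ) treats as the reference (omitted) category. A toy sketch:

```r
# Toy factor: by default the levels are ordered alphabetically.
pos <- factor(c("catcher", "outfielder", "short stop"))
levels(pos)[1]                           # "catcher" is the default reference

# Move "outfielder" to the front so lm() would omit it instead.
pos <- relevel(pos, ref = "outfielder")
levels(pos)[1]                           # now "outfielder"
```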

References

  1. We lay no claim that this is the best possible model to explain salary.
  2. Check the dataset to find the dummy variable black
  3. Has anyone ever seen a programming language without quirks? I haven't.
  4. See [1] for details.
  5. Alternatively you could use I(years*black), where the I() function ensures that R understands the multiplication as a literal mathematical operation.