Dummy Variables in R
In this section we explain how dummy variables can be used in regressions. We will utilise the Baseball Wages dataset for this purpose.
Econometricians think of dummy variables as binary (0/1) variables, and in some datasets you will find the data presented as such right from the start. This is, for instance, the case for the Baseball Wages dataset. Importing the dataset you will find information on the position each player takes in their team. These are first base (frstbase), second base (scndbase), third base (thrdbase), short stop (shrtstop), outfield (outfield) and catcher (catcher). Each player is given exactly one of these positions.
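Since every player holds exactly one position, the six position dummies should add up to 1 in every row. A quick sanity check (a sketch, assuming the data frame is called data as in the text):

```r
# Sum the six position dummies for each player; every row should equal 1
position_sum <- data$frstbase + data$scndbase + data$thrdbase +
                data$shrtstop + data$outfield + data$catcher
table(position_sum)  # all 353 observations should fall into the "1" category
```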
setwd("YOUR DIRECTORY PATH")   # This sets the working directory
load("mlb1.RData")             # Opens mlb1 dataset from R datafile
If you now look at the data (the data themselves are stored in data, and the variable descriptions in desc) you will find them looking something like this:
You can see that the first player is a second base player (1 for scndbase and 0 for all other positional variables) and the second player is a short stop.
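You can reproduce such a view yourself by printing the first few rows of the positional columns (the column selection here is illustrative):

```r
# Show the positional dummies for the first few players
head(data[, c("frstbase", "scndbase", "thrdbase", "shrtstop", "outfield", "catcher")])
```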
Dummy variables as independent variables
If the data come as predefined dummy variables, then it is rather straightforward to use these in regressions.
reg_ex1 <- lm(lsalary~years+gamesyr+frstbase+scndbase+thrdbase+shrtstop+catcher,data=data)
print(summary(reg_ex1))
Here we are running a regression in which we explain variation in log salary using the explanatory variables years of major league experience (years) and games played per year (gamesyr), plus a set of dummy variables for all positions but the outfield position (beware the dummy variable trap!).
What we get is the following output:
Call:
lm(formula = lsalary ~ years + gamesyr + frstbase + scndbase +
    thrdbase + shrtstop + catcher, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-2.71524 -0.46973 -0.00695  0.45610  2.73707

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.222840   0.125818  89.199  < 2e-16 ***
years        0.067257   0.012551   5.359 1.54e-07 ***
gamesyr      0.021095   0.001412  14.935  < 2e-16 ***
frstbase    -0.060406   0.128470  -0.470   0.6385
scndbase    -0.340685   0.139059  -2.450   0.0148 *
thrdbase     0.002862   0.142958   0.020   0.9840
shrtstop    -0.232334   0.124566  -1.865   0.0630 .
catcher      0.129668   0.126458   1.025   0.3059
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7455 on 345 degrees of freedom
Multiple R-squared:  0.6105,	Adjusted R-squared:  0.6026
F-statistic: 77.24 on 7 and 345 DF,  p-value: < 2.2e-16
As you can see, and as we surely would have expected, years of major league experience has a positive effect on salary (although we may really need to consider a quadratic effect), as does the games per year variable. The included dummy variables indicate that, compared to outfield players (the base category, as that dummy variable was omitted), only second base players seem to have a significantly (at 5 per cent) different salary. The results indicate that they, ceteris paribus, earn approximately 34 per cent less than outfield players.
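The 34 per cent figure reads the log-salary coefficient as an approximate percentage change. For a dummy variable the exact differential is 100*(exp(b)-1), which you can compute directly from the fitted model:

```r
# Exact percentage salary differential implied by the scndbase coefficient
b_scndbase <- coef(reg_ex1)["scndbase"]   # approximately -0.3407
100 * (exp(b_scndbase) - 1)               # roughly -29 per cent, a little smaller
                                          # than the linear approximation of -34
```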
When you just include straight dummy variables you allow for intercept shifts according to the relevant categories (here positions on the field). This may often be inadequate and we may really want interaction terms. These may be interactions between different dummy variables (for instance, if we were interested in whether it is really only black second base players that earn less, we would include scndbase*black) or interactions between a dummy and another explanatory variable to allow for changing slope coefficients (e.g. if we wanted to figure out whether experience counts differently for catchers, we would include years*catcher).
When learning how to use interaction terms we will actually encounter another quirk of R. To see this it is instructive to first start with an extremely simple model, one which would really make no economic sense.
reg_ex1 <- lm(lsalary~(years*black),data=data)
print(summary(reg_ex1))
Intuitively we would think that this should estimate a model with a constant and one explanatory variable, years*black. But when we look at the result we can see that R has taken it upon itself to extend the model:
Call:
lm(formula = lsalary ~ (years * black), data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.0165 -0.7867 -0.1900  0.7537  1.9904

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.307118   0.117606  104.65   <2e-16 ***
years        0.178426   0.016394   10.88   <2e-16 ***
black        0.248952   0.214635    1.16    0.247
years:black -0.009502   0.027919   -0.34    0.734
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9628 on 349 degrees of freedom
Multiple R-squared:  0.3427,	Adjusted R-squared:  0.3371
F-statistic: 60.66 on 3 and 349 DF,  p-value: < 2.2e-16
It has included the simple explanatory variables years and black as well as the cross term years:black. To understand this we need to know that, in the context of model building (which is what we do here), R understands the operator * as an invitation to include the variables themselves and their cross term. This is, at times, very convenient, as it is often what you want to do.
But if we want to include the cross term only, we need to use the operator : instead of *. The command

reg_ex1 <- lm(lsalary~(years:black),data=data)
print(summary(reg_ex1))
will deliver a regression model with a constant and the cross term only.
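You can confirm what each formula operator does by inspecting the design matrix R actually builds, without fitting a model at all:

```r
# Columns R generates for the two formula operators
colnames(model.matrix(lsalary ~ years * black, data = data))
# * expands to: intercept, years, black and the cross term years:black

colnames(model.matrix(lsalary ~ years : black, data = data))
# : keeps only the intercept and the cross term years:black
```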
Using Categorical/Factor variables in regressions
As we discussed in the Data Section, when you import categorical data from csv files they will usually be imported as factor variables into R. In the data analysis section we already learned how to get frequency counts of categorical variables using the table() or summary() command.
When using such categorical variables in regressions as explanatory variables we will use them in the form of dummy variables (binary 0/1 variables). When importing the Baseball salary dataset there were two categorical variables, playing position and ethnicity/race. But both these were already transformed to individual dummy variables as discussed above.
What would we do if, as is often the case, the categorical variable were imported as one variable and not as separate dummies? Download this csv file, which presents the position and race variables as categorical variables and also includes the variables lsalary, gamesyr and years (all other variables have been deleted from this datafile).
Read the csv into R,
setwd("YOUR DIRECTORY PATH")   # This sets the working directory
mydata <- read.csv("mlb1_cat_test.csv")
If you now look at the dataset it will look like this:
Inspecting the variables and their datatypes by using str(mydata) we find that the variables position and race are indeed factor variables:
'data.frame': 353 obs. of 5 variables:
$years : int 12 8 5 8 12 17 4 10 4 3 ...
$position: Factor w/ 6 levels "catcher","first base",..: 4 5 2 6 3 3 3 1 5 3 ...
$race : Factor w/ 3 levels "black","hispan",..: 3 1 3 3 1 1 2 3 2 1 ...
$gamesyr : num 142.1 114.8 150.2 132 99.7 ...
$lsalary : num 15.7 15 14.9 14.9 14.3 ...
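With the factors loaded you can get the frequency counts mentioned above directly:

```r
# Frequency counts for the two factor variables
table(mydata$position)   # players per position level
summary(mydata$race)     # for factors, summary() also returns counts per level
```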
From here there are two ways to go if you want to use dummy variables based on either of these variables in a regression.
Translating into dummy variables
We can translate the factor variable into dummy variables. We can do this using the following type of command:

mydata$frstbase <- as.numeric(mydata$position == "first base") # as.numeric translates to numerical - here from logical
This creates a new variable in the data frame called frstbase that takes a value of 1 if the player is a first base player and 0 otherwise. Other dummy variables can be created accordingly. Once you have done this you can proceed as in the previous sections.
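Rather than typing one such line per position, you could loop over the factor levels; this is a sketch, and note that the new column names simply reuse the level labels (some of which contain spaces):

```r
# Create one 0/1 dummy per position level in a single step
for (lev in levels(mydata$position)) {
  mydata[[lev]] <- as.numeric(mydata$position == lev)
}
str(mydata)  # the data frame now contains a dummy column for each position
```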
Using factor variables directly
One very nice aspect of R is that you can use such factor variables directly in regressions. For instance, we could estimate a regression with years and gamesyr as explanatory variables but also include intercept dummies for the different positions. The straightforward way to do that is as follows:
reg_ex1 <- lm(lsalary~years+gamesyr+position,data=mydata)
print(summary(reg_ex1))
Call:
lm(formula = lsalary ~ years + gamesyr + position, data = mydata)

Residuals:
     Min       1Q   Median       3Q      Max
-2.71524 -0.46973 -0.00695  0.45610  2.73707

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         11.352508   0.129846  87.430  < 2e-16 ***
years                0.067257   0.012551   5.359 1.54e-07 ***
gamesyr              0.021095   0.001412  14.935  < 2e-16 ***
positionfirst base  -0.190074   0.157450  -1.207  0.22818
positionoutfielder  -0.129669   0.126458  -1.025  0.30590
positionsecond base -0.470353   0.167849  -2.802  0.00536 **
positionshort stop  -0.362002   0.150584  -2.404  0.01674 *
positionthird base  -0.126807   0.168252  -0.754  0.45156
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7455 on 345 degrees of freedom
Multiple R-squared:  0.6105,	Adjusted R-squared:  0.6026
F-statistic: 77.24 on 7 and 345 DF,  p-value: < 2.2e-16
When you compare the summary statistics to those of the first regression we estimated in this dummy variable section, you will realise that they are identical; we essentially estimated the same model. There is, however, one difference. In the previous estimation we used outfielders as the base category (i.e. the respective dummy variable was excluded). Here we can see that R automatically includes dummy variables for the different positions, except one, here the catcher position. R chose to drop the catcher position as it is the factor level which comes first in the alphabet.
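You can see which level R treats as the reference by listing the factor levels; the first level is the base category that lm() omits:

```r
levels(mydata$position)
# "catcher" is listed first, so it is the base category dropped by lm()
```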
Inherent in a factor variable in R is that R uses one of its values as the reference value, and by default this is the value that comes first in the alphabet, as we saw in the above regression. There is, however, a way to tell R to change the reference value. The way to do this is as follows:
mydata$position <- relevel(mydata$position, ref = "outfielder")
This ensures that from now on R will use "outfielder" as the reference. If you now run the same regression as above
reg_ex1 <- lm(lsalary~years+gamesyr+position,data=mydata)
print(summary(reg_ex1))
you will find that the outfielder dummy variable will be omitted.
- We lay no claim that this is the best possible model to explain salary.
- Check the dataset to find the dummy variable
- Has anyone ever seen a programming language without quirks? I haven't.
- See  for details.
- Alternatively you could use I(years*black), where the I() function ensures that R understands the multiplication as a literal mathematical operation.