Panel in R
In this section we discuss how to handle panel data in R and how to use econometric techniques that exploit the additional structure the panel character of the data provides.
Much of this material is also covered in this YouTube clip.
The plm package
To deal efficiently with panel data we need the plm package. You should download it (install.packages("plm")) and load it into the workspace (library(plm)) in the usual manner. Details for this package can be found here.
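If you want to avoid re-installing the package every time you run a script, a common convenience pattern is to install only when the package is missing. This is just a sketch of that pattern:

```r
# Install plm only if it is not already available, then load it
if (!requireNamespace("plm", quietly = TRUE)) {
  install.packages("plm")
}
library(plm)
```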
Here we are using the Crime Statistics dataset for illustration. It is used in Example 13.9 in Wooldridge's Introductory Econometrics. The dependent variable we will be looking at is the crime rate (crmrte), and we will use a range of explanatory variables that describe features of the local enforcement setting: the probability of arrest (prbarr), the probability of conviction (prbconv), the probability of prison if convicted (prbpris), the average sentence length (avgsen) and the number of police officers per person (polpc). The data cover 90 counties in North Carolina, and for each county we have observations for the years 1981 to 1987.
Models of this type are usually estimated in log-log form to obtain elasticities, and for this reason the dataset already includes the logged variables (the above names preceded by the letter l, e.g. lcrmrte). Further, in anticipation of time-differenced series often being used, the dataset also includes the differenced log variables; these are the variables whose names are preceded by the letter c (e.g. clcrmrte).
Original data files often come with only the raw variables, without the logged and differenced versions. As it turns out, you will often not need these extra variables, since logs and differences can be computed when required.
When using panel data we will often want to allow for time period effects and for this reason the dataset also includes time dummy variables, e.g.
d83, which takes a value of 1 for observations from 1983 and 0 otherwise.
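If a dataset does not ship with ready-made time dummies, they are easy to construct from the year variable. A minimal sketch using made-up years (not the actual data):

```r
# Toy year vector; in the real data this would be mydata$year
year <- c(1981, 1982, 1983, 1983, 1984)
d83 <- as.numeric(year == 1983)  # 1 for 1983 observations, 0 otherwise
d83
# Alternatively, putting factor(year) into a model formula generates
# a full set of year dummies automatically.
```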
Set-up of Panel
Here is our initial data load-up
library(plm)
setwd("X:/ECLR/R/PanelData")  # This sets the working directory

# Opens crime4.csv from the working directory and
# converts variables with "." entries to num with NA instead of "."
mydata <- read.csv("crime4.csv", na.strings = ".")
So far we have merely uploaded the csv file. It is worth having a look at the data at this stage, for instance with summary(mydata).
The distinctive Panel feature of this dataset is that we have several periods of observations (year) for each county. At this stage R does not yet know that these are Panel Data and now we need to let it know about this feature. This is what the following function does:
pdata <- plm.data(mydata, index = c("county","year")) # defines the panel dimensions
The first input into the plm.data function is the original data frame (here mydata). The second input specifies the variable which indexes the individual (here "county") and the variable which indexes the time (here "year"). Both are collected in a list and handed over to the function as index = c("county","year").
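A quick way to check that R has understood the panel dimensions is the pdim() function from plm. The sketch below uses a small made-up panel rather than the crime data; note that newer versions of plm deprecate plm.data in favour of pdata.frame, which takes the same index argument. The sketch also shows that logs and within-county first differences can be derived on the fly, so the pre-computed l- and c-variables are a convenience rather than a necessity:

```r
library(plm)

# Made-up panel: 2 counties observed over 3 years
toy <- data.frame(county = rep(c(1, 2), each = 3),
                  year   = rep(1981:1983, times = 2),
                  crmrte = c(0.03, 0.04, 0.05, 0.02, 0.02, 0.03))
ptoy <- pdata.frame(toy, index = c("county", "year"))  # modern form of plm.data
pdim(ptoy)  # reports n = 2, T = 3, N = 6 for a balanced panel

# Logged and differenced variables can be computed when needed:
ptoy$lcrmrte <- log(ptoy$crmrte)
dl <- diff(ptoy$lcrmrte)  # differences within each county; NA in each first year
```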
This is not the right place to discuss the merits of the range of different estimation methods that exist for Panel Data sets. Let's assume we want to explain variation in the dependent variable
lcrmrte as a function of logs of all the other variables listed above (
lprbarr, lprbconv, lprbpris, lavgsen and lpolpc) and time dummy variables.
A range of estimation methods exist to make use of the panel character of data. Here I will only introduce pooled estimation and first difference estimation. If you want to use any of the other available methods you should consult the documentation of the plm package.
Pooled estimation
This is the most straightforward way to estimate a model. Essentially we throw all observations into one big pot and apply a straightforward OLS estimation. The way to do this is as follows:
pooling <- plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 + prbarr + prbconv + prbpris + avgsen + polpc, data = pdata, model = "pooling")
print(summary(pooling))
As you can see, calling a panel data estimation method using the
plm function is not unlike calling a normal OLS regression using the
lm function. The first input is the model representation (the dependent variable followed by all explanatory variables) and the second is the dataframe which is being used, and importantly here we are using the panel data version we defined previously
pdata. A difference is that here we need a third input which specifies how we estimate the Panel Data model. If we want to pool all observation then we call
model = "pooling". 
Oneway (individual) effect Pooling Model

Call:
plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 +
    prbarr + prbconv + prbpris + avgsen + polpc, data = pdata,
    model = "pooling")

Balanced Panel: n=90, T=7, N=630

Residuals:
   Min. 1st Qu.  Median 3rd Qu.    Max.
-2.0000 -0.2840  0.0328  0.3110  1.4800

Coefficients:
              Estimate Std. Error  t-value  Pr(>|t|)
(Intercept) -3.3529108  0.1385095 -24.2071 < 2.2e-16 ***
d82         -0.0118914  0.0723975  -0.1643 0.8695870
d83         -0.0465997  0.0720681  -0.6466 0.5181272
d84         -0.1524940  0.0725012  -2.1033 0.0358405 *
d85         -0.1180270  0.0728541  -1.6200 0.1057325
d86         -0.0838222  0.0723015  -1.1593 0.2467644
d87          0.0027029  0.0709855   0.0381 0.9696382
prbarr      -1.7873016  0.1163001 -15.3680 < 2.2e-16 ***
prbconv     -0.0958937  0.0126226  -7.5970 1.127e-13 ***
prbpris      0.8453609  0.2193194   3.8545 0.0001281 ***
avgsen      -0.0057214  0.0074539  -0.7676 0.4430340
polpc       56.9623752  8.1233886   7.0121 6.173e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    206.38
Residual Sum of Squares: 137.88
R-Squared:      0.33191
Adj. R-Squared: 0.32559
F-statistic: 27.9113 on 11 and 618 DF, p-value: < 2.22e-16
Here we have included 6 time dummies to allow for different intercepts for the seven different years.
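As a sanity check that "pooling" really is just OLS on the stacked observations, the sketch below (using made-up data, not the crime dataset) compares plm with model = "pooling" against a plain lm fit:

```r
library(plm)

set.seed(1)
# Made-up panel: 3 individuals observed over 4 periods
toy <- data.frame(id = rep(1:3, each = 4), t = rep(1:4, times = 3),
                  y = rnorm(12), x = rnorm(12))
ptoy <- pdata.frame(toy, index = c("id", "t"))

pooled <- plm(y ~ x, data = ptoy, model = "pooling")
ols    <- lm(y ~ x, data = toy)

# Pooled panel estimation and plain OLS give identical coefficients
all.equal(unname(coef(pooled)), unname(coef(ols)))  # TRUE
```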
One issue with simple models like the pooled model is that there is quite likely to be unobserved heterogeneity. While not explicitly modelled, and hence contained in the error term, some of this county-to-county variation can be expected to be correlated with some of the explanatory variables, which violates the zero-conditional-mean assumption.
First difference estimation
Perhaps the easiest way to deal with this is to estimate the model in (time-)differenced form, as the differencing eliminates the (time-invariant) elements of this heterogeneity. Estimating the model in differenced form is done as follows:
fd3 <- plm(lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 + lprbarr + lprbconv + lprbpris + lavgsen + lpolpc - 1, data = pdata, model = "fd")
print(summary(fd3))
Most of this function call is identical to the one above, apart from two differences. First, the method now indicates first difference estimation: model = "fd". Second, as we are taking first differences we need to estimate the model without a constant, which is why we include the term - 1 in the model specification.
This model basically replicates the model estimated in Wooldridge's Example 13.9:
Oneway (individual) effect First-Difference Model

Call:
plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 +
    lprbarr + lprbconv + lprbpris + lavgsen + lpolpc - 1,
    data = pdata, model = "fd")

Balanced Panel: n=90, T=7, N=630

Residuals:
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.65900 -0.07840  0.00296  0.07500  0.68300

Coefficients:
           Estimate Std. Error  t-value  Pr(>|t|)
d82       0.0077133  0.0170579   0.4522 0.6513202
d83      -0.0844391  0.0234564  -3.5998 0.0003484 ***
d84      -0.1246632  0.0287464  -4.3367 1.733e-05 ***
d85      -0.1215609  0.0331500  -3.6670 0.0002702 ***
d86      -0.0863332  0.0366763  -2.3539 0.0189411 *
d87      -0.0377932  0.0399728  -0.9455 0.3448481
lprbarr  -0.3274943  0.0299801 -10.9237 < 2.2e-16 ***
lprbconv -0.2381068  0.0182341 -13.0583 < 2.2e-16 ***
lprbpris -0.1650464  0.0259690  -6.3555 4.488e-10 ***
lavgsen  -0.0217606  0.0220909  -0.9850 0.3250509
lpolpc    0.3984266  0.0268820  14.8213 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares:    22.197
Residual Sum of Squares: 12.596
R-Squared:      0.43251
Adj. R-Squared: 0.4237
F-statistic: 36.6529 on 11 and 529 DF, p-value: < 2.22e-16
As you can see, we have now lost the constant. When interpreting the results you should keep in mind that all variables enter in differences. The only indication of this in the regression output is the title (First-Difference Model).
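To make the "all variables are used in differences" point concrete: model = "fd" is equivalent to running OLS on hand-differenced data. A sketch with made-up data (recall that diff() applied to a panel column differences within each individual):

```r
library(plm)

set.seed(2)
# Made-up panel: 3 individuals observed over 4 periods
toy <- data.frame(id = rep(1:3, each = 4), t = rep(1:4, times = 3),
                  y = rnorm(12), x = rnorm(12))
ptoy <- pdata.frame(toy, index = c("id", "t"))

fd <- plm(y ~ x - 1, data = ptoy, model = "fd")

# Manual first differences (NA for each individual's first period)
dy <- diff(ptoy$y)
dx <- diff(ptoy$x)
ols <- lm(dy ~ dx - 1)  # no constant, as in the fd model

all.equal(unname(coef(fd)), unname(coef(ols)))  # TRUE
```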
The marginal effects of the explanatory variables are negative, as expected, for all but the police number variable, and they match those reported in Wooldridge's Example 13.9. What differs are the estimated values for the dummy variables. Recall that the model reported here uses time differences, and hence also the time differences of the time dummy variables, which are slightly unintuitive. An alternative way to include these dummy variables is to drop one of them (so only include five here), use them in their actual levels, and add a constant back in. That is what is done in Wooldridge, but it only changes the estimated values of the dummy variable parameters, leaving the model fit and the estimated coefficients of all other explanatory variables unchanged.
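This equivalence can be illustrated on made-up data: estimating the first-difference model with differenced dummies and no constant, versus regressing hand-differenced data on the dummies in levels plus a constant, yields the same slope on the remaining regressor:

```r
library(plm)

set.seed(3)
# Made-up panel: 10 individuals over 4 periods, with period dummies
toy <- data.frame(id = rep(1:10, each = 4), t = rep(1:4, times = 10))
toy$y  <- rnorm(40)
toy$x  <- rnorm(40)
toy$d2 <- as.numeric(toy$t == 2)
toy$d3 <- as.numeric(toy$t == 3)
toy$d4 <- as.numeric(toy$t == 4)
ptoy <- pdata.frame(toy, index = c("id", "t"))

# (a) as in the main text: dummies get differenced too, no constant
fd_a <- plm(y ~ d2 + d3 + d4 + x - 1, data = ptoy, model = "fd")

# (b) Wooldridge-style: hand-differenced data, one dummy dropped,
#     remaining dummies in levels, constant kept
dy <- diff(ptoy$y)
dx <- diff(ptoy$x)
fd_b <- lm(dy ~ toy$d3 + toy$d4 + dx)

# Only the dummy parametrisation differs; the slope on x is identical
all.equal(unname(coef(fd_a)["x"]), unname(coef(fd_b)["dx"]))  # TRUE
```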
- Wooldridge, J.M. (2015) Introductory Econometrics, 6th edition
- Angrist and Pischke, Mostly Harmless Econometrics
- The documentation to the plm package found here
- Other available methods are "within", "between", "random", "fd" and "ht". For details see the plm documentation.
- The reason for that is most likely the endogeneity of that variable.