Panel in R


In this section we discuss how to handle panel data in R and how to use econometric techniques that exploit the additional information contained in the panel structure of the data.

Much of this material is also covered in this YouTube clip [1].

The plm package

To deal efficiently with panel data we will need the plm package. You need to download it (install.packages("plm")) and load it into the workspace (library(plm)) in the usual manner.

Details for this package can be found in its documentation on CRAN.

Example Data

Here we are using the Crime Statistics dataset for illustration. It is used in Example 13.9 in Wooldridge's Introductory Econometrics. The dependent variable we will be looking at here is the crime rate (crmrte), and we will use a range of explanatory variables that describe features of the local enforcement setting, such as the probability of arrest (prbarr), the probability of conviction (prbconv), the probability of prison if convicted (prbpris), the average sentence length (avgsen) and the number of police officers per capita (polpc). The data cover 90 counties in North Carolina, and for each county we have observations for the years 1981 to 1987.

Models of this type are usually estimated in log-log form to obtain elasticities, and for this reason the dataset already includes the logged variables (the above names preceded by the letter "l", e.g. lcrmrte). Further, in anticipation of time-differenced series often being used, the dataset also includes the differenced log variables. These are the variables preceded by the letter "c", e.g. clcrmrte.

Original data files often come with only the raw variables, without the logged and differenced versions. As it turns out, you will often not need these extra variables, since they are easy to create yourself.
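If your file only contains the raw variables, generating the logs yourself is a one-liner. A minimal sketch, using a made-up crime-rate vector rather than the actual dataset:

```r
# hypothetical stand-in for a raw crime-rate variable
mydata <- data.frame(crmrte = c(0.030, 0.033, 0.036))

# create the logged version, mirroring the "l" naming convention (lcrmrte)
mydata$lcrmrte <- log(mydata$crmrte)

mydata$lcrmrte
```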

When using panel data we will often want to allow for time-period effects, and for this reason the dataset also includes time dummy variables, e.g. d83, which takes the value 1 for observations from 1983 and 0 otherwise.
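If such dummies are not pre-built, they can be generated from the year variable. A sketch, assuming (as in this dataset) that the year is stored as a two-digit number such as 83:

```r
# toy year column, mimicking the dataset's two-digit coding
years <- c(81, 82, 83, 83, 84)

# d83 = 1 for 1983 observations, 0 otherwise
d83 <- as.numeric(years == 83)
d83
```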

Set-up of Panel

Here is our initial data load-up

    setwd("X:/ECLR/R/PanelData")              # This sets the working directory
    # Opens crime4.csv from working directory
    # converts variables with "." entries to num with NA instead of "."
    mydata <- read.csv("crime4.csv",na.strings = ".") 

So far we have merely loaded the csv file. It is worth having a look at the data at this stage.

[Image: Panel bic1.jpg — a preview of the loaded data frame]

The distinctive panel feature of this dataset is that we have several periods of observations (year) for each county. At this stage R does not yet know that these are panel data, so we need to tell it about this structure. This is what the following function does:

    pdata <- pdata.frame(mydata, index = c("county","year")) # defines the panel dimensions

The first input into the pdata.frame function is the original data frame (here mydata). The second input specifies which variable indexes the individual (here "county") and which variable indexes the time (here "year"). Both are collected in a vector and handed to the function as index = c("county","year").
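The effect of this indexing can be seen on a small artificial panel (two individuals observed over three years each, with made-up variable names); pdim() then reports the panel dimensions:

```r
library(plm)

# artificial balanced panel: 2 units ("A", "B") observed over 3 years
toy <- data.frame(id   = rep(c("A", "B"), each = 3),
                  year = rep(2001:2003, times = 2),
                  y    = rnorm(6))

# declare the individual and time dimensions, as for the crime data above
ptoy <- pdata.frame(toy, index = c("id", "year"))
pdim(ptoy)   # reports a balanced panel with n = 2, T = 3, N = 6
```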

Estimation Methods

This is not the right place to discuss the merits of the range of different estimation methods that exist for Panel Data sets. Let's assume we want to explain variation in the dependent variable lcrmrte as a function of logs of all the other variables listed above (lprbarr, lprbconv, lprbpris, lavgsen and lpolpc) and time dummy variables.

A range of estimation methods exist to make use of the panel character of data. Here I will only introduce pooled estimation and first difference estimation. If you want to use any of the other available methods you should consult the documentation of the plm package.

Pooled OLS

This is the most straightforward way to estimate the model. Essentially we throw all the observations into one big pot and apply a straightforward OLS estimation. This is done as follows:

    pooling <- plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 
              + prbarr + prbconv + prbpris + avgsen + polpc, 
              data = pdata, model = "pooling")

As you can see, calling a panel data estimation method with the plm function is not unlike calling a normal OLS regression with the lm function. The first input is the model formula (the dependent variable followed by all explanatory variables) and the second is the data frame being used; importantly, here we use the panel data version we defined previously, pdata. One difference is that we need a third input which specifies how to estimate the panel data model. If we want to pool all observations we set model = "pooling". [1]

This delivers

    Oneway (individual) effect Pooling Model
    plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 + prbarr + 
        prbconv + prbpris + avgsen + polpc, data = pdata, model = "pooling")
    Balanced Panel: n=90, T=7, N=630
    Residuals :
       Min. 1st Qu.  Median 3rd Qu.    Max. 
    -2.0000 -0.2840  0.0328  0.3110  1.4800 
    Coefficients :
                  Estimate Std. Error  t-value  Pr(>|t|)    
    (Intercept) -3.3529108  0.1385095 -24.2071 < 2.2e-16 ***
    d82         -0.0118914  0.0723975  -0.1643 0.8695870    
    d83         -0.0465997  0.0720681  -0.6466 0.5181272    
    d84         -0.1524940  0.0725012  -2.1033 0.0358405 *  
    d85         -0.1180270  0.0728541  -1.6200 0.1057325    
    d86         -0.0838222  0.0723015  -1.1593 0.2467644    
    d87          0.0027029  0.0709855   0.0381 0.9696382    
    prbarr      -1.7873016  0.1163001 -15.3680 < 2.2e-16 ***
    prbconv     -0.0958937  0.0126226  -7.5970 1.127e-13 ***
    prbpris      0.8453609  0.2193194   3.8545 0.0001281 ***
    avgsen      -0.0057214  0.0074539  -0.7676 0.4430340    
    polpc       56.9623752  8.1233886   7.0121 6.173e-12 ***
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Total Sum of Squares:    206.38
    Residual Sum of Squares: 137.88
    R-Squared      :  0.33191 
    Adj. R-Squared :  0.32559 
    F-statistic: 27.9113 on 11 and 618 DF, p-value: < 2.22e-16

Here we have included 6 time dummies to allow for different intercepts for the seven different years.
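In fact, pooled plm estimation is numerically identical to running lm on the stacked data. This can be checked on the Grunfeld data that ships with plm (used here only because it is built in, so the check runs without any external file):

```r
library(plm)
data("Grunfeld", package = "plm")   # built-in example panel: 10 firms, 20 years

pGrun  <- pdata.frame(Grunfeld, index = c("firm", "year"))
pooled <- plm(inv ~ value + capital, data = pGrun, model = "pooling")
ols    <- lm(inv ~ value + capital, data = Grunfeld)

# the two sets of coefficient estimates coincide
all.equal(unname(coef(pooled)), unname(coef(ols)))
```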

First Difference

One issue with simple models like the pooled model is that there is quite likely to be unobserved heterogeneity. While not explicitly modelled, and hence contained in the error term, some of this county-to-county variation can be expected to be correlated with some of the explanatory variables, which violates the zero-conditional-mean assumption.

Perhaps the easiest way to deal with this is to estimate the model in (time-)differenced form, as differencing eliminates the (time-invariant) elements of this heterogeneity. Estimating the model in differenced form is done as follows:

    fd3 <- plm(lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 
         + lprbarr + lprbconv + lprbpris + lavgsen + lpolpc - 1, 
         data = pdata, model = "fd")

Most of this function call is identical to the above call, but for two differences. First, the method now indicates that we want first difference estimation, model = "fd". Second, as we include all six differenced time dummies, we need to estimate the model without a constant (the constant would otherwise be perfectly collinear with the differenced dummies), which is why we include the term - 1 in the model specification.
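What model = "fd" does under the hood is run OLS on within-individual first differences. The diff() function applied to a column of a pdata.frame differences within each unit, returning NA for each unit's first period, so no difference is ever taken across the boundary between two counties. A sketch on a tiny artificial panel:

```r
library(plm)

# tiny panel: two units, three years each
toy <- data.frame(id   = rep(c("A", "B"), each = 3),
                  year = rep(1:3, times = 2),
                  y    = c(1, 3, 6, 10, 10, 13))
ptoy <- pdata.frame(toy, index = c("id", "year"))

# within-unit differences: NA, 2, 3 for unit A and NA, 0, 3 for unit B
diff(ptoy$y)
```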

This model basically replicates the model estimated in Wooldridge's Example 13.9:

    Oneway (individual) effect First-Difference Model
    plm(formula = lcrmrte ~ d82 + d83 + d84 + d85 + d86 + d87 + lprbarr + 
        lprbconv + lprbpris + lavgsen + lpolpc - 1, data = pdata, 
        model = "fd")
    Balanced Panel: n=90, T=7, N=630
    Residuals :
        Min.  1st Qu.   Median  3rd Qu.     Max. 
    -0.65900 -0.07840  0.00296  0.07500  0.68300 
    Coefficients :
               Estimate Std. Error  t-value  Pr(>|t|)    
    d82       0.0077133  0.0170579   0.4522 0.6513202    
    d83      -0.0844391  0.0234564  -3.5998 0.0003484 ***
    d84      -0.1246632  0.0287464  -4.3367 1.733e-05 ***
    d85      -0.1215609  0.0331500  -3.6670 0.0002702 ***
    d86      -0.0863332  0.0366763  -2.3539 0.0189411 *  
    d87      -0.0377932  0.0399728  -0.9455 0.3448481    
    lprbarr  -0.3274943  0.0299801 -10.9237 < 2.2e-16 ***
    lprbconv -0.2381068  0.0182341 -13.0583 < 2.2e-16 ***
    lprbpris -0.1650464  0.0259690  -6.3555 4.488e-10 ***
    lavgsen  -0.0217606  0.0220909  -0.9850 0.3250509    
    lpolpc    0.3984266  0.0268820  14.8213 < 2.2e-16 ***
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    Total Sum of Squares:    22.197
    Residual Sum of Squares: 12.596
    R-Squared      :  0.43251 
    Adj. R-Squared :  0.4237 
    F-statistic: 36.6529 on 11 and 529 DF, p-value: < 2.22e-16

As you can see, the constant has now disappeared. When interpreting the results you should keep in mind that all variables enter in differences; the only indication of this in the output is the note in the title of the regression output (First-Difference Model).

The estimated marginal effects of the explanatory variables are negative, as expected, for all but the police number variable[2], and they match those reported in Wooldridge's Example 13.9 exactly. What differs are the estimated values for the dummy variables. Recall that the model reported here uses time differences, and hence also the time differences of the time dummy variables, which are slightly unintuitive. An alternative way to include these dummy variables is to drop one of them (so only include five here) but use them in levels and add a constant back in. That is what is done in Wooldridge; it only changes the estimated values for the dummy variable parameters, leaving the model fit and the estimated coefficients of all other explanatory variables unchanged.
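A sketch of that alternative specification, estimated with lm using the pre-differenced "c" variables already contained in the dataset, together with the level dummies and a constant. This assumes mydata has been loaded as above and that the differenced variable names follow the stated "c" convention exactly; check the actual column names in your file with names(mydata) before running it:

```r
# differenced dependent and explanatory variables ("c" prefix) combined
# with level time dummies and an intercept, as in Wooldridge's Example 13.9
fd_alt <- lm(clcrmrte ~ d83 + d84 + d85 + d86 + d87
             + clprbarr + clprbconv + clprbpris + clavgsen + clpolpc,
             data = mydata)
summary(fd_alt)
```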


  • Wooldridge, J.M. (2015) Introductory Econometrics: A Modern Approach, 6th edition
  • Angrist, J.D. and Pischke, J.-S. (2009) Mostly Harmless Econometrics, Princeton University Press
  • The documentation of the plm package on CRAN


  1. Other available methods are "within", "between", "random", "fd" and "ht". For details see the plm documentation.
  2. The reason for that is most likely the endogeneity of that variable.