R TimeSeries

From ECLR
Jump to: navigation, search

The forecast package

In this section we will demonstrate how to do basic univariate time-series modelling with R. We will use a package written by Rob Hyndman, called "forecast". So before you get started you need to go to R and

   install.packages("forecast")

But note that this package requires R of version 3. Then at the beginning of your code you will have to import the library by adding

   library(forecast)

to your code.

Importing Data

Here we will initially use a dataset on UK CPI UKCPI.xls. Download this and save it as a csv file as this will facilitate the upload to R.

   setwd("YOUR DIRECTORY")              # This sets the working directory
   data <- read.csv("UKCPI.csv")  # Opens UKCPI.csv from working directory

which will produce a dataframe (data) with two variables, one giving dates (DATE) and the other containing the actual CPI data for 1988Q1 to 2013Q4 (UKCPI). You could find more recent data e.g. from the excellent FRED database [1].

Basic Time-Series Data Transformations

Forcing Time-Series

The first thing we need is to ensure that R knows that the data we are dealing with are actually time series data. There is a function in R you can use to check whether R realises that data are time series data or not:

   print(is.ts(data$UKCPI))  # check whether it is a time-series object

This will print either TRUE or FALSE. In this case the answer is FALSE, data$UKCPI is not recognised as a time-series. We can now force R to recognise this variable as a time-series. Use the following command:

   data$UKCPI <- ts(data$UKCPI, start=c(1988, 1), end=c(2013, 4), frequency=4)

Let's see what is happening here. ts() is the function used and as usual you can use ?ts to figure out some more details. The first input is the data series we want to be recognised as atime-series. The next input is the data of the first observation, start=c(1988, 1), and the date of the last observation, and importantly the data frequency, frequency=4 (4 representing quarterly data and 12 monthly data).

To confirm that the data are now time-series data you could use the is.ts() function again which should now return a TRUE.

Differencing data

Often you will need a differenced time series [math]\Delta y_t = y_{t} - y_{t-1}[/math]. The command to create this is a straightforward:

   dUKCPI <- diff(data$UKCPI)

You can actually have a quick look at your data by using the basic plotting function

   plot(dUKCPI)

which will reveal

DUKCPIplot.jpg


Selecting subsets of data

If you want to use a subset of your time-series data you use the following command:

   dUKCPI_subset <- window(dUKCPI, start=c(2001, 1), end=c(2012, 4))

It should be obvious what this command does. A note of caution is due, however, the window command only works for series that are not part of a dataframe. As you can see above we defined dUKCPI outside the dataframe. If you wanted to apply the window function to UKCPI series you would first have to define it outside the dataframe as follows: UKCPI <- data$UKCPI.

ARMA models

A generic ARMA(p,q) model can be written as

[math]y_t = \alpha_0 + \alpha_1 y_{t-1} + ... + \alpha_p y_{t-p} + \epsilon_t + \delta_1 \epsilon_{t-1} + ... + \delta_q \epsilon_{t-q} [/math]

If the series that is being modelled with this ARMA(p,q) process is a differenced series, e.g. dUKCPI, then we would actually call this a ARIMA(p,1,q) model. If however we model the UK CPI directly then we would also call it ARIMA(p,0,q).

It should be noted that the forecast package which we use actually estimates the following model:

[math](y_t-m) = \alpha_1 (y_{t-1}-m) + ... + \alpha_p (y_{t-p}-m) + \epsilon_t + \delta_1 \epsilon_{t-1} + ... + \delta_q \epsilon_{t-q} [/math]

in other words, we drop the constant but instead estimate a mean parameter m. This has the advantage that the estimated value for m (it will be reported as the constant) can actually be interpreted as a mean value for the modelled series.

Estimating ARMA Models

To estimate a specific ARMA(p,q) model, say a ARIMA(4,0,0) or ARMA(4,0) or AR(4) model, we use the Arima() function from the "forecast" package. It's use is very simple and straightforward:

   fit_diff_400 <- Arima(dUKCPI, order=c(4,0,0))

The first input is the time series for which the model is being estimated and the second input delivers the model order information to R, order=c(4,0,0). The first number gives the AR order, the second gives the number of times we want to difference the time series, and the third is the MA order. You got it if you realise that

   Arima(data$UKCPI, order=c(4,1,0))

delivers exactly the same results. If you don't believe it, try it!

Before we discuss the output of such an ARIMA estimation there is one further little detail that is worth mentioning. Whenever we are using autoregressive lags, e.g. [math]p[/math] lags of [math]y_t[/math], we are in some sense loosing [math]p[/math] observations. The values of [math]y_1[/math] to [math]y_p[/math] only appear on the right hand side but not on the left hand side as dependent variable as we do not have all conditioning information. If we were to estimate the model only for the [math]y_{p+1}[/math] onwards, we are essentially treating [math]y_1[/math] to [math]y_p[/math] as conditioning information only and we are not trying to fit the model to these values. This means that we are not using all information efficiently. It is not straightforward to explain how to use these data well (read James Hamilton's textbook on Time-Series Analysis if you want to know), but we are lucky, the clever people who wrote the forecast package, did the right thing. So when you use the arima function as described above all is good.

The output of this function is a list (just as the output of the LM() function. To figure out what information you can extract from this estimated model type

   names(fit_diff_400)

which in this case will produce

   [1] "coef"      "sigma2"    "var.coef"  "mask"
   [5] "loglik"    "aic"       "arma"      "residuals"
   [9] "call"      "series"    "code"      "n.cond"
   [13] "model"     "bic"       "aicc"      "x"

You should recognise a few of these. There are estimated coefficients (coef), residuals and information criterion (aic, bic and aicc).

As usual you can access individual elements using the $-sign, e.g. fit_diff_400$coef.

When estimating univariate time-series models it is crucial to get the process orders right. The most common technique is to find that process order that minimises an information criterion. You could do that manually by estimating many models of different orders and compare an information criterion or you could let R do the hard work and use the auto.arima() function. Without going into any details (you by now know how to access details about the use of functions), just try

   fit_diff <- auto.arima(dUKCPI,seasonal=FALSE)

and try to figure out what happens.

Forecasting ARMA Models

Once you have an estimated model you can use that information to produce forecasts. Again there is a convenient function that handles the details of the forecasting. You merely need to know what information needs to be handed to the forecasting function forecast(). Again we will illustrate this using an example.

   fit_diff_400f <- forecast(fit_diff_400,h=10)

The first input the forecast() function requires is the list in which we saved the model estimates, here fit_diff_400. The second input specifies how many periods we want to forecast ahead, here h=10 periods.

Again this function calculates all sorts of stuff and names(fit_diff_400f) reveals which outputs are available. The actual forecasts can be accessed from fit_diff_400f$mean. Often you will want to see a graphical representation of the forecasts. This is again easily done using the plot function. Let's look at the following example:

   plot(forecast(fit_diff_400,h=10))

The first input is our above call to the forecast function, forecast(fit_diff_400,h=10) (you could have also directly used fit_diff_400f). This will deliver the following plot:

DUKCPIf plot.jpg

As you can see this plots all data used to estimate the model plus the 10 quarters of forecasts. The shaded areas indicate the forecast confidence intervals. If you add a second input, include = 40, into the plot function

   plot(forecast(fit_diff_400,h=10), include = 40)

then we will only see the last 40 observations of the series used to estimate the model (this plot is not shown here).

Online Demo

Here is an online clip which goes through this material [2].

Additional Resources

  • A very quick intro from Quick-R can be found here [3]
  • We are using the package "forecast" authored by Rob Hyndman who has also written an online textbook on the topic of forecasting [4]
  • To access some very useful data-series in a very convenient way we will also use the QUANDL package.
  • A good intro into a date-time format which you need if you have intra-day data can be found here
  • Here is another tutorial on how to deal with time-series data .