Difference between revisions of "Example 2"

From ECLR

Revision as of 12:26, 7 October 2014

Task

In this exercise you will have to download some share prices and then use these data to calculate summary statistics for every year in the sample. We will then compare these statistics and see how they change through time.

The data you should download are the share prices of two companies, GlaxoSmithKline (GSK) and Apple (AAPL). You can get these data from http://finance.yahoo.com/. Enter the ticker symbols into the search box, press Enter and then follow the historical prices link. You should download daily data and then use the "Adjusted Close Prices". The sample period we use is from 2 January 1987 to 30 December 2011.

These are your tasks:

  1. Download the data and import into MATLAB (Date info and adjusted close prices only are required)
  2. Delete days for which you do not have observations for both stocks
  3. Calculate the daily log and simple returns for both series
  4. Calculate the following summary statistics for both stocks and for both types of returns for the full sample:
    1. Mean, standard deviation, variance, skewness and kurtosis of returns
    2. Number of positive and negative returns in the sample
    3. Correlation between the AAPL and GSK returns
    4. Average positive and negative returns in the sample
    5. The sum of the autoregressive coefficients of an AR(5) model for each series
  5. Calculate the same statistics separately for every year of data (first for 1987, then 1988 and so forth) and evaluate (by eyeballing) any significant changes through the years. This part of the exercise is discussed on Example 2b.

Implementation

In the following we will show short extracts of code that tackle small problems. You cannot just stick the bits together. You will have to think carefully about using consistent names for variables. In particular, some bits are written with generic variable names and others with names that specifically refer to one of the two stocks.

Import data in MATLAB (keyword: data import)

MATLAB has a very wide range of importing procedures (see LoadingData). The most straightforward and user-friendly is the MATLAB import wizard, although how that works precisely changes from one version to another. Here we will use the xlsread command as this works fairly consistently across versions. When you download data from Yahoo you are likely to obtain a csv file. Unfortunately, the csvread command cannot import dates, as it only reads numeric values. The xlsread command, however, does this easily. You can either convert your csv file in EXCEL to an xlsx file or you can use xlsread to import the csv file directly. This is what we do here.

%% Import Data
% this imports the csv files obtained from yahoo
% Note that the files have the inverse date orders
% adjusted close data are in the 6th column

[gsk_data, gsk_txt, gsk_raw] = xlsread('GSK.csv');
[app_data, app_txt, app_raw] = xlsread('AAPL.csv');

gsk_p = flipud(gsk_data(:,6));      % Extract the adjusted close price which is in the 6th data col
app_p = flipud(app_data(:,6));      % and flip upside down to get the right date order

dates_gsk = datenum(gsk_txt(2:end,1),'dd/mm/yyyy');
dates_app = datenum(app_txt(2:end,1),'dd/mm/yyyy');
dates_gsk = flipud(dates_gsk);
dates_app = flipud(dates_app);

The adjusted close data are in the 6th column of gsk_data and app_data respectively. The date information is in the first column of the *_txt variables (the 1st row is excluded as it contains the headers). The flipud commands reverse the data order, as Yahoo routinely returns files with the latest data at the top.
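As a small illustration of the date handling, the following sketch applies datenum and flipud to a made-up text column in the layout described above (header in the first row, newest date first). The cell array toy_txt and its dates are invented for illustration only; they are not taken from the Yahoo files.

```matlab
% Hypothetical text column as xlsread might return it: header row first,
% then dates in dd/mm/yyyy format with the most recent date on top
toy_txt = {'Date'; '06/01/1987'; '05/01/1987'; '02/01/1987'};

% Convert the date strings (skipping the header) to serial date numbers
toy_dates = datenum(toy_txt(2:end,1), 'dd/mm/yyyy');

% Reverse the order so that the oldest observation comes first
toy_dates = flipud(toy_dates);

% Display the result in a readable format
disp(datestr(toy_dates, 'dd-mmm-yyyy'));
```

If your downloaded file stores dates in a different format (e.g. yyyy-mm-dd), the format string passed to datenum has to be changed accordingly.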

At this stage we will have to save two different date vectors as there is no guarantee that both series contain data for exactly the same dates. This is what needs to be checked next.

Synchronize data

This is not a straightforward task. If you look at the length of the data vectors you will see that the file for Apple shares contains one line more than that for GSK. When you inspect the date vectors for the two shares you can easily confirm that the first and last dates are identical. This means that the discrepancy is somewhere in the middle of the dataset, and it would be good to have MATLAB find the discrepancy rather than having to find it manually. Fortunately MATLAB provides a function that is very useful in this context. What we need is the intersection of the two date vectors, as that will tell us which dates are contained in both datasets. Conveniently the name of the function is intersect[1]. Here is how we use it:

%% Delete days that are not available for both stocks
% check if there are dates that are nonsynchronous

[dates,i_app,i_gsk] = intersect(dates_app, dates_gsk);
gsk_p = gsk_p(i_gsk);
app_p = app_p(i_app);

The three outputs of this function make our job fairly straightforward. The first, here called dates, delivers the intersection of dates_app and dates_gsk. The second output, i_app, returns the row numbers of dates_app in which the elements that end up in dates can be found, and the third output, i_gsk, does the same for dates_gsk. Therefore, gsk_p = gsk_p(i_gsk); selects all those prices that correspond to days that are in dates. The same happens in app_p = app_p(i_app);. Recall that the Apple data had one observation more than the GSK data. That date will not be in dates, so its row number will not appear in i_app and the corresponding price will not appear in app_p. We have essentially eliminated the data for the day(s) for which only one of the stocks had data available.
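To see the index outputs of intersect at work, here is a minimal sketch on made-up data. The variables dA, dB, pA and pB are hypothetical stand-ins for the date and price vectors; the numbers carry no meaning.

```matlab
% Made-up serial "dates": series A has one day (103) that series B lacks
dA = [101; 102; 103; 104];
pA = [10; 11; 12; 13];          % prices for series A
dB = [101; 102; 104];
pB = [20; 21; 22];              % prices for series B

% common: dates found in both vectors
% iA, iB: row numbers of these dates in dA and dB respectively
[common, iA, iB] = intersect(dA, dB);

pA = pA(iA);                    % drops the price for day 103
pB = pB(iB);                    % unchanged, as all days of B are also in A

disp([common pA pB]);           % dates and synchronised prices side by side
```

Note that the order of the index outputs follows the order of the inputs: the second output always refers to the first input vector.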

Construction of Log-returns

Now that we have synchronised the data, we can calculate the return vectors. Log-returns are defined as [math]r_t=\ln(p_t)-\ln(p_{t-1})[/math]. Simple returns (in percent) are defined as [math]R_t=\left(\frac{p_t}{p_{t-1}} -1\right) \times 100 = \frac{(p_t - p_{t-1})}{p_{t-1}} \times 100[/math]. The first return [math]r_1[/math] is not defined, since [math]p_0[/math] is not known. In MATLAB returns can be constructed in several ways. The long way to do this:

[T, n] = size(gsk_p);
GSKlogrets = zeros(T,n);
GSKsimplerets = zeros(T,n);
for i = 2:T
    %This way r(1)=0 by construction
    GSKlogrets(i,1)=log(gsk_p(i,1))-log(gsk_p(i-1,1));
    GSKsimplerets(i,1)=(gsk_p(i,1)/gsk_p(i-1,1)-1)*100;
end

The same result can be achieved in a shorter way using the fact that MATLAB can extract submatrices from a matrix, that is gsk_p(2:end,:) will select all elements but the first row in a matrix, and gsk_p(1:end-1,:) will select all elements but the last.

GSKlogrets = zeros(T,n);
GSKsimplerets = zeros(T,n);
%This way r(1) = 0 by construction
GSKlogrets(2:end,:) = log(gsk_p(2:end,:))-log(gsk_p(1:end-1,:));
GSKsimplerets(2:end,:) = (gsk_p(2:end,:)./gsk_p(1:end-1,:)-1)*100;

You can also use a special command y = diff(x) which generates a matrix [math]y[/math], such that y(i)=x(i+1)-x(i). This will result in a vector that is one row shorter than x, as from [math]T[/math] observations we can calculate returns only for periods [math]t=2,3,...,T[/math]. In order to keep the return vectors in sync with the date vector we will therefore add a 0 return at the beginning, in place of the return at time [math]t=1[/math]. This is the code we will use to calculate the returns. From now on we will continue with the log-returns only, as this is common practice in applied Finance.

%% Return calculation (log-returns only)
gsk_r = diff(log(gsk_p));
app_r = diff(log(app_p));
gsk_r = [0;gsk_r];          % put a 0 in front for r(1)
app_r = [0;app_r];
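The two return definitions are linked exactly by [math]R_t = (e^{r_t}-1) \times 100[/math]. The following sketch checks this on a short made-up price vector p (the prices are hypothetical and chosen only so that the numbers are easy to follow).

```matlab
p = [100; 102; 101; 105];       % hypothetical price series

% Log-returns via diff, with a 0 placed in front for t = 1
r_log = [0; diff(log(p))];

% Simple returns in percent, again with a leading 0
R_simple = [0; (p(2:end)./p(1:end-1) - 1)*100];

% Exact link between the two definitions: R = (exp(r) - 1)*100
R_check = (exp(r_log) - 1)*100;

% Largest absolute difference, should be zero up to rounding error
disp(max(abs(R_simple - R_check)));
```

For small daily price changes the two measures are close once the log-return is multiplied by 100, which is one reason why log-returns are a convenient default.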

Constructing Sample moments for full sample

The formulae for mean [math]\bar r[/math], variance [math]\hat \sigma_r^2[/math], standard deviation [math]\hat \sigma_r[/math], skewness [math]\hat S_r[/math], and kurtosis [math]\hat K_r[/math] are[2]

[math]\begin{aligned} \bar r &= \frac{1}{T}\sum_{t=1}^T r_t\\ \hat \sigma_r^2 &= \frac{1}{T}\sum_{t=1}^T (r_t-\bar r)^2\\ \hat \sigma_r &= \sqrt{\hat \sigma_r^2}\\ \hat S_r &= \frac{1}{T}\sum_{t=1}^T (r_t-\bar r)^3/\hat \sigma_r^3\\ \hat K_r &= \frac{1}{T}\sum_{t=1}^T (r_t-\bar r)^4/\hat \sigma_r^4\end{aligned}[/math]

and can be implemented directly using the following code:

%% Full sample statistics
[T,n]     = size(gsk_r);
GSKMean = sum(gsk_r)/T;
GSKVar  = sum((gsk_r-GSKMean).^2)/T;
GSKStd  = sqrt(GSKVar);
GSKSkew = (sum((gsk_r-GSKMean).^3)/T)/GSKStd^3;
GSKKurt = (sum((gsk_r-GSKMean).^4)/T)/GSKStd^4;
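As a plausibility check, the manual mean, variance and standard deviation can be compared with MATLAB's built-in functions. The sketch below uses a short made-up vector x instead of the actual returns. Note that var and std divide by T-1 by default; passing 1 as the second argument switches to division by T, which matches the formulae above. (Built-in skewness and kurtosis functions exist as well, but they require the Statistics and Machine Learning Toolbox and are therefore not used here.)

```matlab
x = [0.01; -0.02; 0.015; 0.003; -0.007];   % hypothetical returns
T = length(x);

% Manual calculation, as in the formulae above (division by T)
m_manual = sum(x)/T;
v_manual = sum((x - m_manual).^2)/T;
s_manual = sqrt(v_manual);

% Differences to the built-in functions, should be zero up to rounding
disp(abs(m_manual - mean(x)));
disp(abs(v_manual - var(x,1)));     % second argument 1: divide by T
disp(abs(s_manual - std(x,1)));
```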

The same command lines can of course be replicated for the sample statistics for Apple share returns. Once we have done so we can calculate the correlation coefficient

[math]\hat \rho_{12} = \frac{\sum_{t=1}^T (r_{1t}-\bar r_1)(r_{2t}-\bar r_2)}{\sqrt{\sum_{t=1}^T (r_{1t}-\bar r_1)^2\sum_{t=1}^T (r_{2t}-\bar r_2)^2}}[/math]

This is straightforward to implement by

numcorr = sum((gsk_r-GSKMean).*(app_r-APPMean));
dencorr = sqrt(sum((gsk_r-GSKMean).^2)*sum((app_r-APPMean).^2));
GSKAPPcorr  = numcorr/dencorr;

This could have been written in one line but was separated into denominator and numerator to make it a little easier to read.
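The manual correlation can be cross-checked against the built-in corrcoef function, which returns a 2x2 correlation matrix for two input vectors. The sketch uses two made-up vectors a and b in place of the return series.

```matlab
a = [0.01; -0.02; 0.015; 0.003; -0.007];   % hypothetical returns, series 1
b = [0.004; -0.01; 0.02; -0.001; -0.003];  % hypothetical returns, series 2

% Manual calculation following the formula above
num = sum((a - mean(a)).*(b - mean(b)));
den = sqrt(sum((a - mean(a)).^2)*sum((b - mean(b)).^2));
rho_manual = num/den;

% Built-in: the off-diagonal element is the correlation between a and b
C = corrcoef(a, b);
rho_builtin = C(1,2);

disp(abs(rho_manual - rho_builtin));       % zero up to rounding error
```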

Positive and negative returns (keywords: loops, if-then-else statements, logical operations, vectorization)

The numbers ([math]T^+,T^-[/math]) and sample means ([math]r^+,r^-[/math]) of non-negative and negative returns are computed as

[math]\begin{aligned} T^+&=\sum_{t=1}^T I(r_t\ge0)\\ T^-&=\sum_{t=1}^T I(r_t\lt 0)\\ r^+&=\frac{\sum_{t=1}^T r_t I(r_t\ge0)}{T^+}\\ r^-&=\frac{\sum_{t=1}^T r_t I(r_t\lt 0)}{T^-}\end{aligned}[/math]

where [math]I(True)=1[/math], [math]I(False)=0[/math].


A long way to compute these quantities is (where ret is the relevant return vector, e.g. gsk_r):

  1. Initialize variables Tplus=0, Tminus=0, retplus=0, retminus=0.

  2. For [math]i=1[/math], check whether the [math]i[/math]th observation of returns ret(i) is greater than or equal to 0.

  3. If the check in step 2 is true, set Tplus=Tplus+1; retplus=retplus+ret(i), else
    set Tminus=Tminus+1; retminus=retminus+ret(i).

  4. Repeat steps 2 and 3 for [math]i=2,3,...,T[/math], where [math]T[/math] is the sample size

    Tplus    = 0;
    Tminus   = 0;
    retplus  = 0;
    retminus = 0;
    for i = 1:T             %starts the loop
        if ret(i)>=0
           Tplus = Tplus+1;            %counting non-negative returns
           retplus = retplus+ret(i);    %summation of non-negative returns
        else
           Tminus=Tminus+1;            %counting negative returns
           retminus=retminus+ret(i);    %summation of negative returns
        end                 %closes the if-else block
    end                     %closes the loop
    retplus=retplus/Tplus;      %computing average non-negative return
    retminus=retminus/Tminus;   %computing average negative return

This way was quite logical but there is a shorter (and quicker) way to achieve this. The following has to be kept in mind:

  1. Logical relationships also work for vectors, that is indpos = (ret>=0) generates a vector of 0s (where ret(i)<0) and 1s (where ret(i)>=0)

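  2. Logical expressions can be used for selecting subsamples from a sample, that is ret(indpos) generates a subvector of non-negative returns and ret(~indpos) generates a subvector of negative returns. The index must be of logical type, which is why we use the negation ~indpos; the expression 1-indpos would produce a numeric vector of 0s and 1s, which MATLAB does not accept as an index.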

    indpos   = (ret >= 0);
    indneg   = ~indpos;     % logical negation keeps indneg a logical index
    Tplus    = sum(indpos);
    Tminus   = sum(indneg);
    retplus  = sum(ret(indpos))/Tplus;
    retminus = sum(ret(indneg))/Tminus;
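The vectorised approach can be tried out on a short made-up return vector, where the results are easy to verify by hand. The values in ret below are hypothetical.

```matlab
ret = [0.02; -0.01; 0; 0.03; -0.04];   % hypothetical returns

indpos = (ret >= 0);            % logical vector: true for non-negative returns
indneg = ~indpos;               % logical vector: true for negative returns

Tplus    = sum(indpos);         % number of non-negative returns
Tminus   = sum(indneg);         % number of negative returns
retplus  = sum(ret(indpos))/Tplus;     % average non-negative return
retminus = sum(ret(indneg))/Tminus;    % average negative return

% Print counts as integers and averages with four decimals
fprintf('T+ = %d, T- = %d, r+ = %6.4f, r- = %6.4f\n', ...
    Tplus, Tminus, retplus, retminus);
```

Here the zero return is counted as non-negative, in line with the indicator [math]I(r_t\ge0)[/math] used above.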

Sum of AR(5) coefficients

Here we will have to call an OLS function. An OLS function is the one function every self-respecting econometrics student should have handy at all times. We shall use the OLSest.m function which we assume is saved either in the same folder you are working in or in a folder that is added to the MATLAB search path[3].

The OLSest.m function has the following structure (type help OLSest to see this in your command window)

function [b,bse,res,n,rss,r2] = OLSest(y,x,output)

and therefore requires the definition of the y and x input variables. The third input, output, determines whether we want the full regression output in the command window. We will set it to 0 as we are not really interested in that. As we want to estimate an AR(5) regression model for our returns

[math]r_t = \phi_0 + \phi_1 r_{t-1} + \phi_2 r_{t-2} + \phi_3 r_{t-3} + \phi_4 r_{t-4} + \phi_5 r_{t-5} + \epsilon_t[/math]

the variables y and x are defined as (using gsk_r as an example):

lags = 5;
ygsk = gsk_r(lags+1:end);
xgsk = [ones(T-lags,1) gsk_r(lags:end-1) gsk_r(lags-1:end-2) ...
    gsk_r(lags-2:end-3) gsk_r(lags-3:end-4) gsk_r(lags-4:end-5)];

Note that the length of these matrices is [math]T-lags[/math]. If you are doing a lot of time series modelling creating matrices like these is a common problem and it is likely that you will want to write a little function that does this job for you. Or you can use one written by someone else, e.g. the function newlagmatrix which is included in Kevin Sheppard’s MFE toolbox.
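If you do not want to type out every lag by hand, the regressor matrix can also be built in a loop. The following is a minimal sketch of that idea, not the newlagmatrix function mentioned above; the series r and the variable names y and X are hypothetical and chosen so that the result is easy to inspect.

```matlab
r = (1:8)';                     % hypothetical series: 1, 2, ..., 8
lags = 3;
T = length(r);

y = r(lags+1:end);              % dependent variable with T-lags rows
X = ones(T-lags, 1);            % first column: constant
for k = 1:lags
    % append the k-th lag of r as a new column
    X = [X r(lags+1-k:end-k)];
end

disp([y X]);                    % each row: r(t), 1, r(t-1), r(t-2), r(t-3)
```

Setting lags = 5 and replacing r by gsk_r should reproduce the ygsk and xgsk matrices defined above.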

Now we can call the OLSest function and calculate the sum [math]\sum_{i=1}^5 \phi_i[/math] as follows:

[bgsk,~,~,~,~,~] = OLSest(ygsk,xgsk,0);    % third input 0: no regression output printed
ar5gsk = sum(bgsk(2:end));  % Exclude first coefficient which is the constant

Note that OLSest is prepared to deliver 6 outputs. As we are only interested in the estimated coefficients, which are the first output, we replace all other outputs with a tilde (~). This tells MATLAB to discard these outputs rather than store them in the workspace. The sum of the five AR coefficients is then saved in ar5gsk.
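The OLSest.m function is not reproduced here. If you want to cross-check the coefficient estimates without it, MATLAB's backslash operator delivers the least-squares solution directly. The sketch below demonstrates this on made-up data without an error term, so that the known coefficients are recovered exactly; x, y and beta_true are hypothetical and unrelated to the return data.

```matlab
% Hypothetical regressor matrix: constant plus two explanatory variables
x = [ones(6,1) (1:6)' [2; 1; 4; 3; 6; 5]];
beta_true = [0.5; 2; -1];       % assumed coefficients for this illustration
y = x*beta_true;                % no error term, so OLS recovers beta_true

b = x\y;                        % least-squares coefficient estimates

% Sum of the slope coefficients, excluding the constant in b(1)
slope_sum = sum(b(2:end));
disp(b');
disp(slope_sum);
```

Applied to the AR(5) problem, xgsk\ygsk should give the same coefficients as the first output of an OLS function.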

Display full sample statistics

Let’s say we want to print the full sample statistics into a text file (’Example2.txt’). This is done by the following lines:

fid = fopen('Example2.txt','w');
fprintf(fid,'Number of obs: %6.0f \n', T);
fprintf(fid,'Return Stats for:      GSK       Apple \n');
fprintf(fid,'--------------------------------------- \n');
fprintf(fid,'Mean                %6.4f     %6.4f \n',GSKMean,APPMean);
fprintf(fid,'Standard Dev        %6.4f     %6.4f \n',GSKStd,APPStd);
fprintf(fid,'Skewness            %6.4f     %6.4f \n',GSKSkew,APPSkew);
fprintf(fid,'Kurtosis            %6.4f     %6.4f \n',GSKKurt,APPKurt);
fprintf(fid,'Mean(pos rets)      %6.4f     %6.4f \n',GSKretplus,APPretplus);
fprintf(fid,'Mean(neg rets)      %6.4f     %6.4f \n',GSKretminus,APPretminus);
fprintf(fid,'--------------------------------------- \n');
fprintf(fid,'Corr(GSK_r,APP_r)        %6.4f \n',GSKAPPcorr);
fclose(fid);

This will produce the following result:

Number of obs:   6304
Return Stats for:      GSK       Apple
---------------------------------------
Mean                0.0004     0.0007
Standard Dev        0.0183     0.0313
Skewness            -0.4095     -2.0749
Kurtosis            13.1143     56.2547
Mean(pos rets)      0.0125     0.0207
Mean(neg rets)      -0.0136     -0.0220
---------------------------------------
Corr(GSK_r,APP_r)        0.2143
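In the output above the second column shifts to the right whenever a number is negative or has two digits before the decimal point. The reason is that %6.4f reserves only six characters, while e.g. -0.4095 needs seven. A possible remedy, sketched below with two rows of the numbers shown above, is to use a wider field such as %8.4f. Setting fid = 1 prints to the command window instead of a file, which is convenient for experimenting with formats.

```matlab
% fid = 1 writes to the command window instead of a file
fid = 1;
fprintf(fid, 'Skewness          %8.4f   %8.4f \n', -0.4095, -2.0749);
fprintf(fid, 'Kurtosis          %8.4f   %8.4f \n', 13.1143, 56.2547);

% sprintf follows the same format rules but returns the text instead
s1 = sprintf('%8.4f', -0.4095);
s2 = sprintf('%8.4f', 13.1143);
disp([length(s1) length(s2)]);  % both fields have the same width
```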

Recalculate statistics in subsamples

This part of the exercise is discussed on Example_2b.

Footnotes

  1. Often it pays to look for intuitively named functions.
  2. Note that here we use the population statistic formulae (division by [math]T[/math]) for simplicity.
  3. Check doc path. MATLAB has a predefined list of folders in which it searches for a function if you call one. Unless your function is saved in one of these folders MATLAB will not find it and will generate an error message.