MaxLik

Introduction

This Section will make use of the nonlinear optimisation procedure described in the Nonlinear Optimisation Section, where it was explained how you can make MATLAB minimise a function by identifying the parameter vector that minimises an objective/target function. In this Maximum Likelihood (ML) Section the target function will be a negative (log-)likelihood function. We want to choose the set of parameters for which the likelihood of the data is maximised (or, equivalently, the negative log-likelihood minimised).

Below we will illustrate the use of maximum likelihood with a very simple example, a simple linear regression. Since we know that such a problem can be solved by means of OLS, we have a ready benchmark against which to compare the results we obtain from ML estimation. We will introduce the objective function that is to be minimised and then discuss how to obtain the relevant standard errors for the ML parameter estimates.

The example

The problem addressed here is to find the parameters of the following linear model:

[math]\begin{aligned} y_{t}&=&\beta _{0}+\beta _{1}x_{1,t}+\beta _{2}x_{2,t}+u_{t}\\ u_t&\sim&N(0,\sigma^2)\end{aligned}[/math]

It is of course well known that if the data are stationary then the above model is best estimated by OLS, which delivers an estimate for [math]\mathbf{\beta}=(\beta_0~\beta_1~\beta_2)'[/math]. Once that is obtained, an estimate for [math]\sigma^2[/math] follows from the sample variance of the estimated regression residuals.

When performing ML estimation one obtains estimates for all four parameters [math]\mathbf{\theta}=(\mathbf{\beta}'~\sigma)'[/math] in one go. It is well known (see for instance the textbooks in the Literature listing at the bottom of this page) that ML estimates of [math]\mathbf{\theta}[/math] are asymptotically normally distributed, provided the model is correctly specified.

The reason why we use this apparently simple example to introduce ML is actually a very important programming point. When you programme something new you should first apply your method to a problem to which you know the solution, so that you can check whether you actually obtain what you know to be the correct answer. Once that is established you can explore either more complicated or new problems.
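As an illustration of this point, one could simulate data from the above model with known parameter values and then check whether the ML routine recovers them. The following is only a minimal sketch: the sample size and parameter values are arbitrary assumptions, and the variable names y and x match those used later in this Section.

% Simulate data from the linear model with known (arbitrarily chosen) parameters
T     = 500;                          % sample size (an assumption)
beta  = [1; 0.5; -0.3];               % assumed true coefficients (beta_0, beta_1, beta_2)
sigma = 2;                            % assumed true error standard deviation
x     = [ones(T,1) randn(T,2)];       % constant plus two explanatory variables
y     = x*beta + sigma*randn(T,1);    % generate y according to the model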

The negative log-likelihood function

Before writing the objective function we need to recognise that the idea of ML estimation is to maximise the log-likelihood of the data conditional on the parameters, [math]lnL(\mathbf{\theta}) = \sum_t log(p(y_t|\mathbf{x}_t,\mathbf{\theta}))[/math]. For the above model it is easy to establish that [math]y_t[/math], conditional on [math]\mathbf{x}_t[/math] and [math]\mathbf{\theta}[/math], is normally distributed as follows

[math]y_t \sim N(\beta _{0}+\beta _{1}x_{1,t}+\beta _{2}x_{2,t},\sigma^2).[/math]

Therefore the density is as follows:

[math]p(y_t|\mathbf{x}_t,\mathbf{\theta})=\frac{1}{\sqrt{2 \pi \sigma^2}}\exp\left(-\frac{1}{2} \frac{(y_t-\beta _{0}-\beta _{1}x_{1,t}-\beta _{2}x_{2,t})^2}{\sigma^2}\right).[/math]

When replacing [math]y_t-\beta _{0}-\beta _{1}x_{1,t}-\beta _{2}x_{2,t}[/math] with [math]r_t[/math] and taking the log of the density we obtain:

[math]ll_t=log(p(y_t|\mathbf{x}_t,\mathbf{\theta}))=-\frac{1}{2}log(2\pi) -\frac{1}{2}log(\sigma^2)-\frac{r_t^2}{2 \sigma^2}.[/math]

This will be at the core of the negative log-likelihood function. We need the negative of the log-likelihood because the MATLAB optimisation algorithms (fminunc or fminsearch) minimise rather than maximise.

function nll = nll_lin(theta, data , vec)
% input:    (i)   theta, coefficient vector, last element is the error sd
%           (ii)  data, data matrix (T x k), col 1: y, cols 2 to end: explan. variables (incl. constant)
%           (iii) vec, 0 if the vector of negative log-likelihood contributions
%                 should be returned, 1 if their sum should be returned
beta = theta(1:end-1);
sig  = abs(theta(end))+0.000001;    % this ensures a non-zero variance
y = data(:,1);                      % dependent variable
x = data(:,2:end);                  % explanatory variables (incl. constant)
res = y - x*beta;                   % regression residuals
nll = (-0.5)*(-log(2*pi)-log(sig^2)-(res.^2/(sig^2)));   % negative log density for each observation

if vec
    nll = sum(nll);
end

end

The variable nll is a [math](T \times 1)[/math] vector that contains the negative log density for each observation. The function will either return this vector or its sum. The need to allow for both types of output will become clear when we discuss how to calculate the standard errors of the parameter estimates.

You can see from the code that this function expects to be handed one parameter vector that contains all parameters, including the error term’s standard deviation as the last element. The variance used in the definition of nll of course needs to be positive, by definition of being a variance. The MATLAB optimisation routine does not know this, and hence it is the job of the econometrician to ensure that the variance used in nll is indeed positive. This is why we define the parameter to be estimated as the standard deviation sig and then use sig^2 in the calculation. This will be positive even if MATLAB delivers a negative sig, which it may well do[1]. If this was all we wanted to do, the line in which we define sig as the last element of theta would merely be sig = theta(end). There is, however, one extra little snag: the variance should also be non-zero (as it appears in the denominator of nll). To ensure this we add a very small value to abs(theta(end)).

The above function should be saved in a file called nll_lin.m.

Calling fminunc

Calling the optimisation function works exactly as explained in the Nonlinear Optimisation Section. I assume that you have imported your data and saved them in a vector y and a matrix x, where the latter has three columns: the first a vector of ones to represent the constant, the second the values for [math]x_1[/math] and the third the values for [math]x_2[/math].

datamat = [y x];                        % define data matrix for use in nll_lin
theta0 = [mean(y); 0; 0; std(y)];       % this sets the initial parameter vector
options = optimset; % sets optimisation options to default
[thetaopt] = fminunc(@nll_lin,theta0,options,datamat,1);

Recall that we need starting values for [math]\mathbf{\theta}[/math]. As discussed in the Nonlinear Optimisation Section, you should choose reasonable starting values as otherwise there is the risk that the optimisation algorithm finds a local rather than a global minimum. Therefore we use the average value of [math]y_t[/math] as the starting value for the constant and the standard deviation of [math]y_t[/math] as the starting value for the error standard deviation. These would indeed be the correct values if [math]x_{1,t}[/math] and [math]x_{2,t}[/math] were insignificant ([math]\beta_1 = \beta_2 = 0[/math]). This is why the starting values for [math]\beta_1[/math] and [math]\beta_2[/math] are set to 0.

Also note that the last two inputs into the fminunc call are datamat, the matrix containing the data, and a 1, which lets the function know that it should return the sum of the negative log-likelihood. This complies with the definition of the inputs required by nll_lin. When you estimate this model you should obtain virtually the same parameters (in thetaopt) as if you had estimated the model by OLS.
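For the OLS comparison, a minimal sketch of the benchmark calculation is given below. The variable names bhat and bse are assumptions and not part of the original code; bse, the vector of OLS coefficient standard errors, is reused in the display code further down.

% OLS benchmark (a sketch): coefficient estimates and standard errors
[T,k] = size(x);
bhat  = x\y;                          % OLS coefficient estimates
res   = y - x*bhat;                   % OLS residuals
s2    = (res'*res)/(T-k);             % estimate of the error variance
bse   = sqrt(diag(s2*inv(x'*x)));     % OLS standard errors for the coefficients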

Estimating standard errors of parameter estimates

Please refer to any good quality econometrics textbook (see below for a short list) to establish that the asymptotic distribution of the ML estimator, [math]\widehat{\mathbf{\theta}}[/math], is a Normal distribution:

[math](\widehat{\mathbf{\theta}}-\mathbf{\theta}_0) \sim N(0,I^{-1}(\mathbf{\theta}_0)).[/math]

where [math]\mathbf{\theta}_0[/math] is the unknown population parameter vector and [math]I^{-1}(\mathbf{\theta}_0)[/math] is the inverse of the information matrix evaluated at the unknown population parameter. Here we do not wish to elaborate on the statistical details of this result[2] but rather concentrate on how it is used to obtain standard errors for the estimated parameters.

The variance-covariance matrix of [math]\mathbf{\theta}[/math], [math]V(\mathbf{\theta})[/math], is therefore the inverse of the information matrix, which equals the negative of the inverse of the expected Hessian matrix. In practice it is evaluated at [math]\widehat{\mathbf{\theta}}[/math], i.e.

[math]V(\widehat{\mathbf{\theta}})=-H^{-1}(\widehat{\mathbf{\theta}})=-\left( \frac{\partial^2 lnL(\widehat{\mathbf{\theta}})}{\partial \widehat{\mathbf{\theta}} \partial \widehat{\mathbf{\theta}}'}\right) ^{-1}[/math]

This means that in order to estimate the standard errors of the coefficient estimates we need an estimate of the second derivative of the log-likelihood function[3]. We then take the square roots of the diagonal elements of the variance-covariance matrix of the parameter estimates, [math]V(\widehat{\mathbf{\theta}})[/math].

It is desirable to have an alternative method to calculate the variance-covariance matrix (as the Hessian, evaluated at the parameter estimates, is not always invertible[4]). The alternative makes use of the gradient of the objective function. Let [math]\mathbf{g}_t = \partial nll_t(\mathbf{\theta})/\partial \mathbf{\theta}[/math] be the [math](k \times 1)[/math] vector of partial derivatives of the negative log-likelihood evaluated at observation [math]t[/math]. We can then estimate [math]V(\widehat{\mathbf{\theta}})[/math] using the following outer product of gradients (OPG) approach:

[math]V(\widehat{\mathbf{\theta}})=J^{-1}(\widehat{\mathbf{\theta}})=\left(\sum_{t=1}^T \left( \mathbf{g}_t \mathbf{g}_t' \right)\right)^{-1}[/math]

If the estimated model is correctly specified then the OPG and the Hessian approach to calculating [math]V(\widehat{\mathbf{\theta}})[/math] are asymptotically equivalent. If that is not the case then one is in the territory of quasi maximum likelihood estimation. Without going into any details we merely note that the parameters of the conditional mean function (here [math]\mathbf{\beta}[/math]) can still be estimated consistently. Further, the standard errors of the parameter estimates are then estimated with the so-called sandwich estimator:

[math]V(\widehat{\mathbf{\theta}})=\left( H(\widehat{\mathbf{\theta}})\, J^{-1}(\widehat{\mathbf{\theta}})\, H(\widehat{\mathbf{\theta}}) \right)^{-1}[/math]

It should be mentioned that the calculation of [math]J(\widehat{\mathbf{\theta}})[/math] as described above assumes that the gradient vectors [math]\mathbf{g}_t[/math] are independently distributed. If there is dependence between the gradient vectors one ought to calculate a Newey-West type version of [math]J(\widehat{\mathbf{\theta}})[/math] (see Martin et al., 2012, Chapter 9.4 for details).

MATLAB implementation

In order to calculate the standard errors for the ML parameter estimates [math]\widehat{\mathbf{\theta}}[/math] we therefore require estimates of [math]H(\widehat{\mathbf{\theta}})[/math] and [math]J(\widehat{\mathbf{\theta}})[/math]. In fact the optimisation routine fminunc can return an estimate of the Hessian, [math]H(\widehat{\mathbf{\theta}})[/math]; consult doc fminunc for details on how to have the function return this estimate. Here, however, we prefer a slightly different approach. We will use dedicated functions to calculate [math]H(\widehat{\mathbf{\theta}})[/math] (HessMp) and [math]J(\widehat{\mathbf{\theta}})[/math] (gradp). The only inputs required by these functions are a handle to your objective function and then all the inputs required by that function (the first of which should be a parameter vector). For the parameter vector we will of course use our estimated parameters. The functions then return an estimate of [math]H(\widehat{\mathbf{\theta}})[/math] and the matrix of gradients from which [math]J(\widehat{\mathbf{\theta}})[/math] is computed.

H = HessMp(@nll_lin,thetaopt,datamat,1); % this returns the negative of the Hessian
g = gradp(@nll_lin,thetaopt,datamat,0);  % this returns a (T x size(thetaopt,1)) matrix of gradients
J = (g'*g);                              % calculates the OPG

Note that the nll_lin function requires three inputs (theta, data, vec). For the first we use the estimated parameter vector, and the second is the same data matrix that was used when calling the optimisation function. As you can see from the above, the third input vec differs between the two calls. For the call to HessMp this variable is set to 1, which makes nll_lin return the sum of the negative log-likelihood, i.e. a scalar; HessMp then returns a square matrix with as many rows and columns as the parameter vector has elements (say [math]k[/math]). For the call to the gradient function, however, this value is set to 0, which implies that nll_lin returns a vector of negative log-likelihoods, one value for each observation. The gradient function then establishes how each of these values (i.e. at every [math]t[/math]) reacts to changes in the [math]k[/math] parameter values, and it therefore returns a [math](T \times k)[/math] matrix g. The line J = (g'*g); then performs the calculation [math]\sum_{t=1}^T \left( \mathbf{g}_t \mathbf{g}_t' \right)[/math].
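In relation to the Newey-West type adjustment of [math]J(\widehat{\mathbf{\theta}})[/math] mentioned earlier, a possible sketch using the gradient matrix g is given below; the truncation lag L is an arbitrary choice and not something prescribed in this Section.

% Sketch of a Newey-West type (Bartlett weighted) version of J
L    = 5;                                   % truncation lag (an assumption)
J_NW = g'*g;                                % lag-0 term, as before
for l = 1:L
    w     = 1 - l/(L+1);                    % Bartlett weight
    Gamma = g(l+1:end,:)'*g(1:end-l,:);     % sum over t of g_t * g_{t-l}'
    J_NW  = J_NW + w*(Gamma + Gamma');
end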

With that information we can then calculate the three different versions of the variance-covariance matrix discussed above and extract standard errors (the square roots of the diagonal elements) for our ML estimates [math]\widehat{\mathbf{\theta}}[/math]:

se_H = sqrt(diag(inv(H)));              % Hessian based standard errors
se_J = sqrt(diag(inv(J)));              % OPG based standard errors
se_SW = sqrt(diag(inv(H*inv(J)*H)));    % Sandwich variance covariance

Finally we can display the estimates and their standard errors:

disp('   Est     se(OLS)     se(H)     se(J)     se(SW)');
disp([thetaopt [bse;0] se_H se_J se_SW]);   % bse: OLS coefficient standard errors, padded with 0 for the error sd

If you do this for the above linear model you should of course find that the ML standard errors are fairly similar to the OLS standard errors. The advantage is that the ML procedure is available when OLS is not (you merely need to change the objective function to reflect, say, a nonlinear model). The Hessian and gradient functions are downloadable from here: HessMp.m and gradp.m. These functions come from the Eric Jondeau and Michael Rockinger webpage with MATLAB code.

If you want to perform hypothesis tests on individual parameters you can do so with a standard t-test strategy. You should just be aware that here we only know the distribution of the t-statistic asymptotically, and that it is a standard normal distribution.
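As an illustration, such t-tests against a zero null could be computed as follows, here using the Hessian based standard errors from above (the variable names are assumptions):

% t-tests for H0: theta_i = 0, using the asymptotic standard normal distribution
tstat = thetaopt./se_H;                   % t-ratios
pval  = 2*(1 - normcdf(abs(tstat)));      % two-sided p-values
disp([thetaopt se_H tstat pval]);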

Testing in a ML framework

Often you may be interested in testing multiple restrictions in the context of an ML estimation. This could be a test of whether a set of variables can be excluded from a model, but it could also involve more complicated restrictions. For the remainder of this Section assume that [math]\widehat{\theta}_u[/math] is the unrestricted ML estimate and [math]\widehat{\theta}_r[/math] the ML parameter estimate of the restricted model.

There are three different types of test that you may want to use.

Likelihood Ratio (LR) Test

The idea here is that imposing the restriction will certainly reduce the value of the log-likelihood function, i.e. [math]lnL(\widehat{\theta}_u)\gt lnL(\widehat{\theta}_r)[/math]. The real question is whether the reduction in log-likelihood, [math]lnL(\widehat{\theta}_u)-lnL(\widehat{\theta}_r)[/math], is sufficiently big to rule out the validity of the restriction. If the restriction were valid we would expect a small reduction in likelihood. The question is then what the "critical value" is, or, more precisely, what the distribution of a test statistic based on [math]lnL(\widehat{\theta}_u)-lnL(\widehat{\theta}_r)[/math] is. Without any derivation, you should note the following result:

[math]LR = 2( lnL(\widehat{\theta}_u)-lnL(\widehat{\theta}_r)) \sim \chi^2_k[/math]

where [math]k[/math] is the number of restrictions imposed in [math]\widehat{\mathbf{\theta}}_r[/math]. In essence, to perform this test you will have to estimate the unrestricted and the restricted model, obtain the respective log-likelihood values and plug them into the above test statistic.

MATLAB implementation

Assume that you estimate a linear model in which the unrestricted model has a [math](5 \times 1)[/math] dimensional coefficient vector [math]\mathbf{\beta}[/math], so that [math]\widehat{\mathbf{\theta}}_u[/math], which also contains the error standard deviation, has six elements. Further, you want to test the null hypothesis that the second and third elements of [math]\mathbf{\beta}[/math] (b2 and b3 in the code below) are equal to 0. The code below assumes that you have already estimated the unrestricted parameter vector [math]\widehat{\mathbf{\theta}}_u[/math], saved in thetaopt, and that datamat now contains y, the constant and four further explanatory variables, with the coefficients on columns 3 and 4 being the ones restricted to zero.

% Testing multiple restriction
% b2 = 0, b3 = 0

% LR test
% estimate the restricted model
theta0_r = [mean(y); 0; 0; std(y)];       % initial parameter vector for the restricted model
        % do not hand in x1 and x2 (columns 3 and 4 of datamat)
[theta_r] = fminsearch(@nll_lin,theta0_r,options,datamat(:,[1 2 5 6]),1);

L_u = -nll_lin(thetaopt,datamat,1);             % calculates unrestricted logLikelihood
L_r = -nll_lin(theta_r,datamat(:,[1 2 5 6]),1); % calculates restricted logLikelihood

LR   = 2*(L_u - L_r);
LR_p = 1-chi2cdf(LR,2);                         % p-value from the chi-squared distribution with 2 degrees of freedom

fprintf('LR test = %6.2f; p-value = %6.4f \n', LR, LR_p);

Wald Test

The second idea is useful because it only requires estimation of the unrestricted coefficient vector [math]\widehat{\mathbf{\theta}}_u[/math]. Let’s say that you want to test the hypothesis that [math]\mathbf{\theta}=\mathbf{\theta}_r[/math]. Note that, as we will not estimate the restricted model, there is no hat on [math]\mathbf{\theta}_r[/math]. At the core of the Wald test statistic is the difference between [math]\widehat{\mathbf{\theta}}_u[/math] and [math]\mathbf{\theta}_r[/math], [math](\widehat{\mathbf{\theta}}_u-\mathbf{\theta}_r)[/math]. Intuitively, the larger this difference, the more likely it is that the restriction (null hypothesis) is invalid. Two questions remain: a) how to evaluate the difference between the estimated value and the hypothesised value (which is not straightforward as we are dealing with a vector), and b) what the distribution of a suitable test statistic is under the null hypothesis of the restriction being valid.

It is best to keep the following somewhat generic. First we need to establish how we would represent a multiple restriction in a generic way. We stick to linear restrictions for starters. The general representation of a linear restriction of a parameter vector [math]\theta[/math] is:

[math]\mathbf{R} \mathbf{\theta} = \mathbf{b}[/math]

where [math]\mathbf{\theta}[/math] is a [math](q \times 1)[/math] parameter vector, [math]\mathbf{R}[/math] is an [math](r \times q)[/math] matrix and [math]\mathbf{b}[/math] is an [math](r \times 1)[/math] vector. [math]\mathbf{R}[/math] and [math]\mathbf{b}[/math] are used to define the required restrictions. By way of example, let’s say [math]q=5[/math], [math]\mathbf{\theta}=(\theta_1~\theta_2~\theta_3~\theta_4~\theta_5)'[/math], and the restrictions to be tested are [math]\theta_2 + 2\theta_3 = 1[/math] and [math]\theta_4 = 0[/math]. This set of restrictions is then represented by:

[math]\begin{aligned} \mathbf{R} \mathbf{\theta} &=& \mathbf{b}\\ \left( \begin{array}{ccccc} 0 & 1 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ \end{array} \right) \left( \begin{array}{c} \theta_1 \\ \theta_2 \\ \theta_3 \\ \theta_4 \\ \theta_5 \end{array} \right) &=& \left( \begin{array}{c} 1 \\ 0 \end{array} \right)\end{aligned}[/math]
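In MATLAB this example set of restrictions would simply be coded as:

% Restrictions: theta_2 + 2*theta_3 = 1 and theta_4 = 0
R = [0 1 2 0 0; 0 0 0 1 0];
b = [1; 0];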

After estimating the unrestricted parameter vector [math]\widehat{\mathbf{\theta}}_u[/math], the [math](r \times 1)[/math] term [math](\mathbf{R} \widehat{\mathbf{\theta}}_u - \mathbf{b})[/math] measures the extent to which the estimated parameter vector disagrees with the restriction. The test statistic used to establish whether this disagreement is statistically significant is the following:

[math]W = (\mathbf{R} \widehat{\mathbf{\theta}}_u - \mathbf{b})' (\mathbf{R} V(\widehat{\mathbf{\theta}}_u) \mathbf{R}')^{-1} (\mathbf{R} \widehat{\mathbf{\theta}}_u - \mathbf{b})[/math]

[math]V(\widehat{\mathbf{\theta}}_u)[/math] is the variance-covariance matrix of the estimated parameter vector and can be estimated in any of the ways discussed earlier. The test statistic is [math]\chi^2[/math] distributed with [math]r[/math] degrees of freedom. If you have nonlinear restrictions to test then the basic structure of the test statistic remains the same, but the [math](r \times 1)[/math] term [math](\mathbf{R} \widehat{\mathbf{\theta}}_u - \mathbf{b})[/math] is replaced by [math]C(\widehat{\mathbf{\theta}}_u)[/math] and [math]\mathbf{R}[/math] is replaced by [math]\partial C(\widehat{\mathbf{\theta}}_u)/\partial \widehat{\mathbf{\theta}}_u[/math]:

[math]W = C(\widehat{\mathbf{\theta}}_u)' \left( \left(\frac{\partial C(\widehat{\mathbf{\theta}}_u)}{\partial \widehat{\mathbf{\theta}}_u}\right) V(\widehat{\mathbf{\theta}}_u) \left( \frac{\partial C(\widehat{\mathbf{\theta}}_u)}{\partial \widehat{\mathbf{\theta}}_u} \right)'\right)^{-1} C(\widehat{\mathbf{\theta}}_u)[/math]

The differentiable function [math]C[/math] is defined such that [math]C(\widehat{\mathbf{\theta}}_u)=0[/math] and encompasses the nonlinear restrictions.
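For the linear restrictions used in the LR example above (b2 = 0 and b3 = 0), a sketch of the Wald test could look as follows. It assumes that thetaopt (here [math](6 \times 1)[/math], with the error standard deviation as the last element) and the Hessian H have been obtained for the unrestricted model as described earlier.

% Wald test of b2 = 0 and b3 = 0 (a sketch)
R = [0 1 0 0 0 0; 0 0 1 0 0 0];          % pick out b2 and b3
b = [0; 0];
V = inv(H);                              % or any of the other variance estimates discussed above
W   = (R*thetaopt - b)'*inv(R*V*R')*(R*thetaopt - b);
W_p = 1 - chi2cdf(W, size(R,1));         % r = number of restrictions
fprintf('Wald test = %6.2f; p-value = %6.4f \n', W, W_p);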

Lagrange Multiplier (LM) Test

The previous two testing strategies relied on an estimate of the unrestricted model being available (plus an estimate of the restricted model in the case of the [math]LR[/math] test). The attractive feature of this third testing strategy is that it does not require such an estimate; it merely relies on the estimate of the restricted model. In situations where estimation of the unrestricted model is complicated this can be much appreciated[5].

This test is based on the estimated restricted parameter vector [math]\widehat{\mathbf{\theta}}_r[/math]. How can we evaluate whether a restriction is valid if we only have the restricted parameter vector estimate? The main idea is that if the estimated [math]\widehat{\mathbf{\theta}}_r[/math] represented an unrestricted optimum, then the gradient of the (negative) log-likelihood would be 0, i.e. [math]G(\mathbf{\theta}_r)=\partial lnL(\mathbf{\theta}_r) / \partial \mathbf{\theta}_r = 0[/math]. We therefore have to devise a test statistic that helps us evaluate whether [math]G(\mathbf{\theta}_r)[/math] differs from 0. Without any further discussion we merely state that

[math]LM = G(\mathbf{\theta}_r)' V(\widehat{\mathbf{\theta}}_r) G(\mathbf{\theta}_r) \sim \chi^2_r[/math]

where [math]r[/math] is the number of restrictions imposed in the restricted model.
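A sketch of how this could be implemented for the earlier example (b2 = 0 and b3 = 0) is given below. It uses the OPG estimate of the variance, and the padded vector thetar_full is an assumed construction rather than part of the original code; it allows the unrestricted model to be evaluated at the restricted estimates theta_r obtained in the LR example.

% LM test of b2 = 0 and b3 = 0 (a sketch), using the OPG variance estimate
thetar_full = [theta_r(1); 0; 0; theta_r(2:end)];  % restricted estimates with zeros imposed
g_r  = gradp(@nll_lin, thetar_full, datamat, 0);   % (T x k) gradients at the restricted estimates
G    = sum(g_r)';                                  % (k x 1) score vector
J_r  = g_r'*g_r;                                   % OPG estimate of the information matrix
LM   = G'*inv(J_r)*G;                              % LM test statistic
LM_p = 1 - chi2cdf(LM, 2);                         % 2 restrictions
fprintf('LM test = %6.2f; p-value = %6.4f \n', LM, LM_p);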

Literature

Hamilton J.D. (1994) Time Series Analysis, Princeton, Chapter 5.8.

Heij, C., de Boer, P., Franses, P.H., Kloek, T. and van Dijk, H.K. (2004) Econometric Methods with Applications in Business and Economics, Oxford University Press, Chapter 4.3.

Judge G.G., W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.-C. Lee (1985) The Theory and Practice of Econometrics, John Wiley.

Martin V., Hurn S. and Harris D. (2012) Econometric Modelling with Time Series: Specification, Estimation and Testing (Themes in Modern Econometrics), Cambridge University Press, Chapters 1 to 4; Chapter 3 gives an excellent introduction to nonlinear optimisation strategies.

Footnotes

  1. The optimisation algorithms fminunc and fminsearch will explore any value (positive or negative) to minimise the objective function. If you need to constrain parameter values in ways that cannot be achieved with the simple strategy used here, you should consider using fmincon.
  2. E.g. how different assumptions on the dependence and independence properties require different Laws of Large Numbers and Central Limit Theorems to establish this result.
  3. Note from the Theory of Nonlinear Optimisation section that the Hessian is required to update the parameter vector throughout the iterations. Therefore it is available from the fminunc function. See doc fminunc for details.
  4. In fact, if this is the case, the optimisation algorithm may not have found a minimum. This is therefore an indication that perhaps not all is good with your optimisation. But for the time being we shall ignore this.
  5. In the case of a linear model this is of course not a problem.