NonlinOptTheory

From ECLR
Revision as of 12:02, 9 October 2012 by Admin (talk | contribs)
Jump to: navigation, search

Introduction

This section will give you an introduction to a very common problem in econometrics[1]. Recall what Ordinary Least Squares achieves, it minimises the sum of squared residuals [math]\sum_{t=1}^{T}u_{t}[/math] in the following sort of regression specifications

[math]y_{t}=\beta _{0}+\beta _{1}x_{1,t}+\beta _{2}x_{2,t}+u_{t}\text{.}[/math]

In other words it found those values of [math]\beta _{0}[/math] and [math]\beta _{1}[/math] that left us with the smallest variation of the observed values [math]y_{t}[/math] around its predictions. In your basic econometrics lectures you learned that, provided this specification is linear in its parameters, you can find the optimal values for [math]\mathbf{\beta }=\left( \beta _{0}~\beta _{1}\right)' [/math] analytically, meaning there is a particular formula which gives you [math]% \widehat{\mathbf{\beta }}_{OLS}[/math] as a function of the values observed for [math]% y_{t}[/math], [math]x_{1,t}[/math] and [math]x_{2,t}[/math].

Unfortunately there will be many other occasions where your model parameters cannot be found by means of OLS. These could be linear models, but you may like to find parameters that optimise a criterion other than the residual sum of squares (say you want to minimise some moment conditions) or it could be nonlinear models. What is required in such situations is a nonlinear optimisation algorithm. What does this mean? Well, essentially we will have to follow some trial and error strategy to find the parameter values which optimises the criterion function.

Nonlinear Least Squares

In order to introduce the problem we stick to the minimisation of residual sum of squares as our objective, but we want to do that in a nonlinear model. Consider the simple model:

[math]y_{t}=\beta _{0}+\beta _{1}x_{1,t}+\beta _{1}^{2}x_{2,t}+u_{t} \label{Judge}[/math]

where the [math]u_{t}[/math] are independent and identically distributed random variables with mean [math]0[/math] and variance [math]\sigma ^{2}[/math]. (I call this function the Judge Function as it appears in a book by Judge and others). The nonlinear least-squares estimate for [math]\mathbf{\beta }[/math] is defined as that value of [math]\mathbf{\beta} = (\beta_0~\beta_1)'[/math] that minimises the residual sum of squares

[math]S(\mathbf{\beta })=\sum_{t=1}^{T}\hat{u}_{t}^{2}=\sum_{t=1}^{T}\left( y_{t}-\beta _{0}-\beta _{1}x_{1,t}-\beta _{1}^{2}x_{2,t}\right) ^{2}[/math]

The smallest achievable value for [math]S(\mathbf{\beta })[/math] is known as the global minimum, and the value of [math]\mathbf{\beta}[/math] that delivers that value is the optimal parameter. But how to find it? The next two sections will introduce two fundamentally different strategies. The first (grid search) is easy to understand but not implementable in complex problems. The second (nonlinear optimisation) is more generally applicable.

Grid Search

The simplest way of finding an estimate of [math]\mathbf{\beta }[/math] is by what is known as a grid search. This procedure is a brute force computational approach. One has to propose a range of possible values for each parameter (here [math]\beta _{0}[/math] and [math]\beta _{1}[/math]), then one simply computes [math]S(\mathbf{% \beta })[/math] for all possible combinations of values of [math]\mathbf{\beta }[/math] and identifies the parameter combination for which [math]S(\mathbf{\beta })[/math] is minimized.

beta0_range = % define a (n0 x 1) vector of possible values for beta0
beta1_range = % define a (n1 x 1) vector of possible values for beta1

save_S = zeros(n1,n2);

for i0 = 1:n0
    for i1 = 1: n1

        save_S(i0,i1) = % calculate S as a function of beta0_range(n0) and beta1_range(i1)

    end
end

% find minimum value of S in save_S
% values of beta0_range and beta1_range that are related to that minimun value
% are the optimal parammeter values from the grid search

As a general proposition the grid-search approach is not attractive as it is computationally extremely intensive. It is easy to see that this procedure quickly becomes infeasible when you either have many parameters and/or use a finer/larger grid for the parameters. You will also have to rely on (aka hope that) the optimal parameter values lie inside the ranges you chose for the parameters. Fortunately there are alternative optimisation algorithms which do not posses these shortcomings.

In the Implementation Section you can find an example for a grid search. For the above reasons grid search is rarely used as the optimisation algorithm of choice. However, it often has a role to play in finding starting values for a nonlinear optimisation algorithm.

General Nonlinear Optimisation Algorithms

Here I will provide a schematic representation of the procedure used to find optimal parameter estimates when analytical solutions are not available. Such algorithms are called nonlinear optimisation algorithms. Consider [math]% S(\theta )[/math] to be the target function, the function that is to be minimised. The value of this function varies with the parameter vector [math]\theta [/math] and the goal is to find that parameter vector, [math]\widehat{\theta }[/math], which minimises [math]S(\widehat{\theta })[/math]. In the above example [math]S(\theta )[/math] was the sum of squared residuals, [math]S(\theta )=\sum_{t=1}^{T}\hat{u}_{t}^{2}[/math], and the parameter vector was [math]\theta =\mathbf{\beta }[/math], but that is only a special case. Very commonly [math]S(\theta )[/math] represents a negative likelihood function [math]S(\theta )=-L\left( \theta \right) [/math], where [math]L\left( \theta\right) [/math] is the likelihood function to be maximised (and hence [math]S\left(\theta \right) [/math] is to be minimised). More on this in a later section, for the time being you only got to realise that [math]S\left( \theta \right) [/math] is to minimised, however it is defined.

 In general, the task is to provide an iterative scheme

[math]\widehat{\theta }^{\left( 1\right) }=\widehat{\theta }^{\left( 0\right)}+\eta^{(1)} \label{theta_k1}\\ \widehat{\theta }^{\left( 2\right) }=\widehat{\theta }^{\left( 1\right)}+\eta^{(2)} \\ ...\\ \widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right)}+\eta^{(k+1)}\\ ...[/math]

The idea is that this procedure converges to [math]\widehat{\theta }[/math]. Here [math]\eta^{(i)} [/math] is a vector of the same dimension as [math]\theta[/math] and represents the change in parameter values from one iteration to the next. The algorithms to accomplish this task fall naturally into two categories, namely, those which use information contained in derivatives of the target function [math]S(\theta )[/math] (i.e. [math]\partial S\left(\theta \right) /\partial \theta [/math] and [math]\partial ^{2}S\left( \theta \right)/\partial \theta ^{2}[/math]) - gradient based algorithms - and those which rely solely on function evaluations [math]S\left( \theta \right) [/math][2].

From the above scheme it is apparent that we need three major ingredients, a starting value [math]\widehat{\theta }^{\left( 0\right)}[/math], a way to compute the vector of increments [math]\eta^{(i)}[/math] and a stopping rule which determines when the iterations should stop. The next subsections will address these issues in turn.

Iteration increments

Here it is useful to consult a little algebra in order to understand the origin of the most common algorithms. The task is to find useful values for [math]\eta [/math] in equation ([thetak1]). Such a value is in general useful if [math]S(\widehat{\theta }^{\left( k+1\right) })\lt S(\widehat{\theta }^{\left( k\right) })[/math]. The basic idea is as follows. We want to find the minimum value for [math]S(\theta )[/math] and we do know that at the minimum, [math]\theta _{0}[/math], [math]\left. \partial S\left( \theta \right) /\partial \theta \right\vert _{\theta =\theta _{0}}=0[/math]. With this in the back of our mind we will use a little trick[3]. Assume we have some initial parameter guess, [math]\widehat{% \theta }^{\left( 0\right) }[/math], then let us approximate [math]\left. \partial S\left( \theta \right) /\partial \theta \right\vert _{\theta =\theta _{0}}[/math] by means of a Taylor Series expansion around our initial parameter guess, [math]% \widehat{\theta }^{\left( 0\right) }[/math]:

[math]\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\theta _{0}}\simeq \left. \frac{\partial S\left( \theta \right) }{% \partial \theta }\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }}+\left( \theta _{0}-\widehat{\theta }^{\left( 0\right) }\right) \left. \frac{\partial ^{2}S\left( \theta \right) }{\partial \theta ^{2}}\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }} \label{Taylor}[/math]

As noted earlier, at the minimum:

[math]\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\theta _{0}}=0[/math]

Hence (slightly incorrectly using [math]=[/math] rather than [math]\simeq [/math])

[math]\begin{aligned} \theta _{0} &=&\widehat{\theta }^{\left( 0\right) }-\left( \left. \frac{% \partial ^{2}S\left( \theta \right) }{\partial \theta ^{2}}\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }}\right) ^{-1}\left. \frac{% \partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =% \widehat{\theta }^{\left( 0\right) }} \notag \\ \theta _{0} &=&\widehat{\theta }^{\left( 0\right) }-H^{-1}\left( \widehat{% \theta }^{\left( 0\right) }\right) G\left( \widehat{\theta }^{\left( 0\right) }\right) \label{sugg_step0}\end{aligned}[/math]

where

[math]\begin{aligned} H\left( \widehat{\theta }^{\left( 0\right) }\right) &=&\left. \frac{\partial ^{2}S\left( \theta \right) }{\partial \theta ^{2}}\right\vert _{\theta =% \widehat{\theta }^{\left( 0\right) }} \\ G\left( \widehat{\theta }^{\left( 0\right) }\right) &=&\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\widehat{% \theta }^{\left( 0\right) }}\end{aligned}[/math]

are the first and second derivatives of the target function evaluated at our initial guess [math]\widehat{\theta }^{\left( 0\right) }[/math]. [math]G(\widehat{\theta }% ^{\left( 0\right) })[/math] has the same dimension as the parameter vector [math]\theta [/math] and [math]H(\widehat{\theta }^{\left( 0\right) })[/math] is a square matrix of the same dimension. Equation ([suggstep0]) suggests that the optimal parameter can be calculated on the basis of our initial parameter guess and the terms [math]H(\widehat{\theta }^{\left( 0\right) })[/math] and [math]G(\widehat{\theta }% ^{\left( 0\right) })[/math]. If this were true, there would be no need for any iterative procedure as it was suggested in Figure [FigIter]. This, however, ignores that ([Taylor]) is not an equality but rather an approximation. Hence it would be more appropriate to write

[math]\widehat{\theta }^{\left( 1\right) }=\widehat{\theta }^{\left( 0\right) }-H^{-1}\left( \widehat{\theta }^{\left( 0\right) }\right) G\left( \widehat{% \theta }^{\left( 0\right) }\right) \text{.}[/math]

While now [math]\widehat{\theta }^{\left( 1\right) }[/math] is most likely not equal to [math]\theta _{0}[/math] it is, for small [math]\left\vert \widehat{\theta }^{\left( 1\right) }-\widehat{\theta }^{\left( 0\right) }\right\vert [/math], likely to fulfill our improvement criterion [math]S(\widehat{\theta }^{\left( 1\right) })\lt S(% \widehat{\theta }^{\left( 0\right) })[/math]. We can then repeat this procedure with [math]\widehat{\theta }^{\left( 1\right) }[/math] as our initial parameter and hence this suggests the general updating rule

[math]\widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right) }-H^{-1}\left( \widehat{\theta }^{\left( k\right) }\right) G\left( \widehat{% \theta }^{\left( k\right) }\right) \label{Newton}[/math]

where[math]\widehat{\theta }^{\left( k+1\right) }=[/math] updated value, [math]\left( k+1\right) th[/math] iteration.[math]\widehat{\theta }^{\left( k\right) }=[/math] old value, [math]\left( k\right) [/math]th iteration.

This algorithm, which uses both the first and second derivatives of the target function is known as Newton’s method.

One issue that arises at this stage is how to calculate [math]H(\widehat{\theta }% ^{\left( 0\right) })[/math] and [math]G(\widehat{\theta }^{\left( 0\right) })[/math]. Sometimes one can easily derive the first and second derivative of the target function [math]S\left( \theta \right) [/math]. Often, however, that is not easy and one calculates numerical derivatives. When doing so, the first (slope) and second derivatives (curvature) are approximated by using slight perturbations to the parameter value [math]\widehat{\theta }^{\left( k\right) }[/math] and evaluating the effect on [math]S(\widehat{\theta }^{\left( k\right) })[/math]. Without any further discussion it should be accepted that this information can be used to approximate the slope and the curvature of a function at a particular parameter vector [math]\widehat{\theta }^{\left( k\right) }[/math]. Most statistical software packages which utilise nonlinear optimisation routines use numerical first and second derivatives and the user will not have to worry about this. It should, however, be noted that numerically estimating the terms [math]H(\widehat{\theta }^{\left( 0\right) })[/math] and [math]G(% \widehat{\theta }^{\left( 0\right) })[/math] is not without pitfalls and sometimes it is wise to use analytical derivatives when they are available. Most statistical software packages, MATLAB included, provide the option to supply such analytical derivatives.

There are a number of common variations on the algorithm in ([Newton]) the most common being the BHHH algorithm (named after its proponents Berndt, Hall, Hall and Hausmann). This replaces [math]H(\widehat{% \theta }^{\left( 0\right) })[/math] by outer product of first derivatives. Hence the algorithm is

[math]\widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right) }-% \left[ \sum_{t}\frac{\partial S_{t}}{\partial \theta }\frac{\partial S_{t}}{% \partial \theta ^{\prime }}\right] ^{-1}\sum_{t}\frac{\partial S_{t}}{% \partial \theta } \label{BHHH}[/math]

where [math]\sum_{t}\frac{\partial S_{t}}{\partial \theta }[/math] is an empirical approximation to [math]G(\widehat{\theta }^{\left( k\right) })[/math]. The function evaluation now has a time subscript, indicating that the summations required are over the number of observations and function evaluations are calculated observation wise and not for the entire sample.[4]

Stopping rule

Starting values

Literature

Hamilton J.D. (1994) Time Series Analysis, Princeton, Section 5.7 as well as Judge G.G, W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.-C. Lee (1985) The Theory and Practice of Econometrics, John Wiley, Appendix B, give good introductions into the mechanics of nonlinear optimisation algorithms. Martin V., Hurn S. and Harris D. (2012) Econometric Modelling with Time Series: Specification, Estimation and Testing (Themes in Modern Econometrics), Chapter 3 gives an excellent introduction into nonlinear optimisation strategies.

Footnotes

  1. I am indebted to Stan Hurn, School of Economics and Finance, Queensland University of Technology, who allowed me to use some of his lecture material.
  2. There will be no discussion of such gradient-free methods. A good introduction to the most promising such algorithm, the Simplex algorithm, can be found in Judge et al. (1985)
  3. It is not immediately obvious why we do this at this stage, but as you will see a little later, this will eventually provide us with an update rule for the parameters.
  4. For the case of minimising the sum of squared residuals this would imply that we need to use the individual squared residuals [math]S\left( \widehat{% \theta }\right) _{t}=\hat{u}_{t}^{2}[/math] rather than the sum of squared residuals.