NonlinOptTheory
Contents
Introduction
This section will give you an introduction to a very common problem in econometrics[1]. Recall what Ordinary Least Squares achieves, it minimises the sum of squared residuals [math]\sum_{t=1}^{T}u_{t}^2[/math] in the following sort of regression specifications
[math]y_{t}=\beta _{0}+\beta _{1}x_{1,t}+\beta _{2}x_{2,t}+u_{t}\text{.}[/math]
In other words it found those values of [math]\beta _{0}[/math], [math]\beta _{1}[/math] and [math]\beta _{2}[/math] that left us with the smallest variation of the observed values [math]y_{t}[/math] around its predictions. In your basic econometrics lectures you learned that, provided this specification is linear in its parameters, you can find the optimal values for [math]\mathbf{\beta }=\left( \beta _{0}~\beta _{1}~\beta _{2}\right)' [/math] analytically, meaning there is a particular formula which gives you [math]% \widehat{\mathbf{\beta }}_{OLS}[/math] as a function of the values observed for [math]% y_{t}[/math], [math]x_{1,t}[/math] and [math]x_{2,t}[/math].
Unfortunately there will be many other occasions where your model parameters cannot be found by means of OLS. These could be linear models, but you may like to find parameters that optimise[2] a criterion other than the residual sum of squares (say you want to minimise some moment conditions - see below, or even the sum of absolute residuals) or it could be nonlinear models. What is required in such situations is a nonlinear optimisation algorithm. What does this mean? Well, essentially we will have to follow some trial and error strategy to find the parameter values which optimises the criterion function.
Nonlinear Least Squares
In order to introduce the problem we stick to the minimisation of residual sum of squares as our objective, but we want to do that in a nonlinear model. Consider the simple model:
[math]y_{t}=\beta _{0}+\beta _{1}x_{1,t}+\beta _{1}^{2}x_{2,t}+u_{t} \label{Judge}[/math]
where the [math]u_{t}[/math] are independent and identically distributed random variables with mean [math]0[/math] and variance [math]\sigma ^{2}[/math]. (I call this function the Judge Function as it appears in a book by Judge and others). The nonlinear least-squares estimate for [math]\mathbf{\beta }[/math] is defined as that value of [math]\mathbf{\beta} = (\beta_0~\beta_1)'[/math] that minimises the residual sum of squares
[math]S(\mathbf{\beta })=\sum_{t=1}^{T}\hat{u}_{t}^{2}=\sum_{t=1}^{T}\left( y_{t}-\beta _{0}-\beta _{1}x_{1,t}-\beta _{1}^{2}x_{2,t}\right) ^{2}[/math]
The smallest achievable value for [math]S(\mathbf{\beta })[/math] is known as the global minimum, and the value of [math]\mathbf{\beta}[/math] that delivers that value is the optimal parameter. But how to find it? The next two sections will introduce two fundamentally different strategies. The first (grid search) is easy to understand but not implementable in complex problems. The second (nonlinear optimisation) is more generally applicable.
Grid Search
The simplest way of finding an estimate of [math]\mathbf{\beta }[/math] is by what is known as a grid search. This procedure is a brute force computational approach. One has to propose a range of possible values for each parameter (here [math]\beta _{0}[/math] and [math]\beta _{1}[/math]), then one simply computes [math]S(\mathbf{% \beta })[/math] for all possible combinations of values of [math]\mathbf{\beta }[/math] and identifies the parameter combination for which [math]S(\mathbf{\beta })[/math] is minimized.
beta0_range = % define a (n0 x 1) vector of possible values for beta0
beta1_range = % define a (n1 x 1) vector of possible values for beta1
save_S = zeros(n1,n2);
for i0 = 1:n0
for i1 = 1: n1
save_S(i0,i1) = % calculate S as a function of beta0_range(n0) and beta1_range(i1)
end
end
% find minimum value of S in save_S
% values of beta0_range and beta1_range that are related to that minimun value
% are the optimal parammeter values from the grid search
As a general proposition the grid-search approach is not attractive as it is computationally extremely intensive. It is easy to see that this procedure quickly becomes infeasible when you either have many parameters and/or use a finer/larger grid for the parameters. You will also have to rely on (aka hope that) the optimal parameter values lie inside the ranges you chose for the parameters. Fortunately there are alternative optimisation algorithms which do not posses these shortcomings.
In the Implementation Section you can find an example for a grid search. For the above reasons grid search is rarely used as the optimisation algorithm of choice. However, it often has a role to play in finding starting values for a nonlinear optimisation algorithm.
General Nonlinear Optimisation Algorithms
Here I will provide a schematic representation of the procedure used to find optimal parameter estimates when analytical solutions are not available. Such algorithms are called nonlinear optimisation algorithms. Consider [math]S(\theta )[/math] to be the target function, the function that is to be minimised. The value of this function varies with the [math](k \times 1)[/math] parameter vector [math]\theta [/math] and the goal is to find that parameter vector, [math]\widehat{\theta }[/math], which minimises [math]S(\widehat{\theta })[/math]. In the above example [math]S(\theta )[/math] was the sum of squared residuals, [math]S(\theta )=\sum_{t=1}^{T}\hat{u}_{t}^{2}[/math], and the parameter vector was [math]\theta =\mathbf{\beta }[/math], but that is only a special case. Very commonly [math]S(\theta )[/math] represents a negative likelihood function [math]S(\theta )=-L\left( \theta \right) [/math], where [math]L\left( \theta\right) [/math] is the likelihood function to be maximised (and hence [math]S\left(\theta \right) [/math] is to be minimised). More on this in a later section, for the time being you only got to realise that [math]S\left( \theta \right) [/math] is to minimised, however it is defined.
In general, the task is to provide an iterative scheme
[math]\widehat{\theta }^{\left( 1\right) }=\widehat{\theta }^{\left( 0\right)}+\eta^{(1)} \\ \widehat{\theta }^{\left( 2\right) }=\widehat{\theta }^{\left( 1\right)}+\eta^{(2)} \\ ...\\ \widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right)}+\eta^{(k+1)}\\ ...[/math]
The idea is to find a sequence of [math]\eta^{(i)}[/math] such that [math]\widehat{\theta }^{\left( i\right) }[/math] converge to [math]\widehat{\theta }[/math]. Here [math]\eta^{(i)} [/math] is a vector of the same dimension as [math]\theta[/math] and represents the change in parameter values from one iteration to the next. The algorithms to accomplish this task fall naturally into two categories, namely, those which use information contained in derivatives of the target function [math]S(\theta )[/math] (i.e. [math]\partial S\left(\theta \right) /\partial \theta [/math] and [math]\partial ^{2}S\left( \theta \right)/\partial \theta \partial \theta'[/math]) - gradient based algorithms - and those which rely solely on function evaluations [math]S\left( \theta \right) [/math][3].
From the above scheme it is apparent that we need three major ingredients, a starting value [math]\widehat{\theta }^{\left( 0\right)}[/math], a way to compute the vector of increments [math]\eta^{(i)}[/math] and a stopping rule which determines when the iterations should stop. For the latter two all software packages have default values which will often suffice. Starting values, however, will always have to be supplied by the user and a wise choice is absolutely crucial. The next subsections will address these issues in turn.
Iteration increments
Here it is useful to consult a little algebra in order to understand the origin of the most common algorithms. The task is to find useful values for [math]\eta^{(i)} [/math]. Such a value is in general useful if [math]S(\widehat{\theta }^{\left( k+1\right) })\lt S(\widehat{\theta }^{\left( k\right) })[/math]. The basic idea is as follows. We want to find the minimum value for [math]S(\theta )[/math] and we do know that at the true (but unknown), [math]\theta _{0}[/math], the objective function is minimised such that [math]\left. \partial S\left( \theta \right)/\partial \theta \right\vert _{\theta =\theta _{0}}=0[/math]. With this in the back of our mind we will use a little trick[4]. Assume we have some initial parameter guess, [math]\widehat{\theta }^{\left( 0\right) }[/math], then let us approximate [math]\left. \partial S\left( \theta \right) /\partial \theta \right\vert _{\theta =\theta _{0}}[/math] by means of a Taylor Series expansion around our initial parameter guess, [math]\widehat{\theta }^{\left( 0\right) }[/math]:
[math]\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\theta _{0}}\simeq \left. \frac{\partial S\left( \theta \right) }{% \partial \theta }\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }}+ \left. \frac{\partial ^{2}S\left( \theta \right) }{\partial \theta \partial \theta'}\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }}\left( \theta _{0}-\widehat{\theta }\right) \label{Taylor}[/math]
As noted earlier, at the minimum:
[math]\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\theta _{0}}=0[/math]
Hence (slightly incorrectly using [math]=[/math] rather than [math]\simeq [/math])
[math]\begin{aligned} \theta _{0} &=&\widehat{\theta }^{\left( 0\right) }-\left( \left. \frac{\partial ^{2}S\left( \theta \right) }{\partial \theta \partial \theta'}\right\vert_{\theta =\widehat{\theta }^{\left( 0\right) }}\right) ^{-1}\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\widehat{\theta }^{\left( 0\right) }} \notag \\ \theta _{0} &=&\widehat{\theta }^{\left( 0\right) }-H^{-1} \left( \widehat{\theta}^{(0)} \right) G\left( \widehat{\theta }^{\left( 0\right) }\right) \\ (k \times 1) &=& (k \times 1) - (k \times k) (k \times 1) \end{aligned} [/math]
where the last line indicates the dimensions and
[math]\begin{aligned} H\left( \widehat{\theta }^{\left( 0\right) }\right) &=&\left. \frac{\partial ^{2}S\left( \theta \right) }{\partial \theta \partial \theta'}\right\vert _{\theta =% \widehat{\theta }^{\left( 0\right) }} \\ G\left( \widehat{\theta }^{\left( 0\right) }\right) &=&\left. \frac{\partial S\left( \theta \right) }{\partial \theta }\right\vert _{\theta =\widehat{% \theta }^{\left( 0\right) }}\end{aligned}[/math]
are the first and second derivatives of the target function evaluated at our initial guess [math]\widehat{\theta }^{\left( 0\right) }[/math]. [math]G(\widehat{\theta }^{\left( 0\right) })[/math] has the same dimension as the parameter vector [math]\theta[/math] and [math]H(\widehat{\theta }^{\left( 0\right) })[/math] is a square matrix of the same dimension. Equation ([suggstep0]) suggests that the optimal parameter can be calculated on the basis of our initial parameter guess and the terms [math]H(\widehat{\theta }^{\left( 0\right) })[/math] and [math]G(\widehat{\theta }^{\left( 0\right) })[/math]. If this were true, there would be no need for any iterative procedure. This, however, ignores that the Taylor approximation used above is not an equality but rather an approximation. Hence it would be more appropriate to write
[math]\widehat{\theta }^{\left( 1\right) }=\widehat{\theta }^{\left( 0\right) }-H^{-1}\left( \widehat{\theta }^{\left( 0\right) }\right) G\left( \widehat{% \theta }^{\left( 0\right) }\right) \text{.}[/math]
While now [math]\widehat{\theta }^{\left( 1\right) }[/math] is most likely not equal to [math]\theta _{0}[/math] it is, for small [math]\left\vert \widehat{\theta }^{\left(1\right) }-\widehat{\theta }^{\left( 0\right) }\right\vert [/math], likely to fulfill our improvement criterion [math]S(\widehat{\theta }^{\left( 1\right) })\lt S(\widehat{\theta }^{\left( 0\right) })[/math]. We can then repeat this procedure with [math]\widehat{\theta }^{\left( 1\right) }[/math] as our initial parameter and hence this suggests the general updating rule
[math]\widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right) }-H^{-1}\left( \widehat{\theta }^{\left( k\right) }\right) G\left( \widehat{% \theta }^{\left( k\right) }\right) \label{Newton}[/math]
which suggests that we can use [math]-H^{-1}\left( \widehat{\theta }^{\left( k\right) }\right) G\left( \widehat{% \theta }^{\left( k\right) }\right)[/math] as an approximation for [math]\eta^{(k+1)}[/math].
Different approximations to the Hessian
A number of different nonlinear optimisation algorithms exist and some of them differ in how they arrive at values for [math]H^{-1}\left( \widehat{\theta }^{\left( k\right) }\right)[/math]. Here we will just provide a short list (you can find an excellent and detailed discussion of the different algorithms in Chapter 3 of the Martin et al. (2012) textbook).
- Calculate the Hessian directly: Newton-Raphson algorithm.
- Approximate the Hessian with an outer product of gradients: BHHH algorithm
- Use an update algorithm for the Hessian (not unlike the one used for the parameter vector itself): BFGS algorithm
Analytical or numerical derivatives
One issue that arises at this stage is how to calculate [math]H(\widehat{\theta }^{\left( 0\right) })[/math] and [math]G(\widehat{\theta }^{\left( 0\right) })[/math][5]. Sometimes one can easily derive the first and second derivative of the target function [math]S\left( \theta \right) [/math]. Often, however, that is not easy and one calculates numerical derivatives. When doing so, the first (slope) and second derivatives (curvature) are approximated by using slight perturbations to the parameter value [math]\widehat{\theta }^{\left( k\right) }[/math] and evaluating the effect on [math]S(\widehat{\theta }^{\left( k\right) })[/math].
Without any further discussion it should be accepted that this information can be used to approximate the slope and the curvature of a function at a particular parameter vector [math]\widehat{\theta }^{\left( k\right) }[/math]. Most software packages which utilise nonlinear optimisation routines use numerical first and second derivatives and the user will not have to worry about this. It should, however, be noted that numerically estimating the terms [math]H(\widehat{\theta }^{\left( 0\right) })[/math] and [math]G(\widehat{\theta }^{\left( 0\right) })[/math] is not without pitfalls and sometimes it is wise to use analytical derivatives when they are available. If they are available this will help the optimisation algorithm to find the optimal solution. It will also compute faster, but it relies on the econometrician having derived and coded these derivatives correctly. Most statistical software packages, MATLAB included, provide the option to supply such analytical derivatives.
Line search
A more accurate representation of the iterative scheme employed by gradient-based algorithms would be
[math]\widehat{\theta }^{\left( k+1\right) }=\widehat{\theta }^{\left( k\right) }-\lambda H^{-1}\left( \widehat{\theta }^{\left( k\right) }\right) G\left( \widehat{\theta }^{\left( k\right) }\right) \qquad \lambda \in \left[ 0,1% \right][/math]
where the parameter [math]\lambda [/math] is known as the step size. In other words, embedded in every optimisation algorithm is a one-dimensional optimisation problem. Crude approaches to choosing the optimal step size would be a grid search or a variation of the grid search known as squeezing, which sets [math]\lambda =1,\lambda =\frac{1}{2},\lambda =\frac{1}{3},\lambda =\frac{1}{4},...[/math] and chooses the value which minimises the objective function. The reason why it may be optimal to use a step size smaller than one is again due to the fact that the Taylor approximation is only an approximation and not an equality. There are more sophisticated ways of choosing this step size, but we will not discuss them here.
Stopping rule
The stopping rule is also called convergence criteria, as the applied econometrician has to decide to stop the iteration and declare the result as having converged. There are a number of criteria one could use. The algorithm could be stopped if any or several of the following are met
- The improvement in the Objective Function is smaller than [math]\epsilon[/math]
- The Gradient is smaler than [math]\epsilon[/math]
- The change in parameter values is smaller than [math]\epsilon[/math][6]
where [math]\epsilon[/math] is some pre-specified tolerance level, say [math]\epsilon = 0.000001[/math]. Smaller values of [math]\epsilon[/math] make it more difficult for the algorithm to converge.
Starting values
This is the most crucial of the choices that can be made by the applied econometrician. Unfortunately, an unwise choice of starting values may lead you to what is usually called a local rather than a global minimum. A local minimum is a value for the parameter vector to which the stopping rules apply (i.e. in the vicinity of the current parameter values it appears as if they cannot be improved upon), but in fact there is a value of the parameter vector which does provide an improved value of the criterion function.
So, what does this mean in practice. Here are a few pointers:
- try a few sensible starting values and check which provides the optimal value for the criterion function
- use estimated parameters from previous work with similar data and models
- use values from a restricted model if that is easier to estimate
The less certain you are that you have very good starting values, the more important it will be that you use multiple starting values.
Criterion/Target functions
So far we used a very familiar target function for the optimisation algorithm, the sum of squared residuals, which was exactly the target function used to derive the OLS estimates. This is by no means the only target function available. This is not really the place to discuss alternative target functions, but I will merely list two very popular approaches for which parameters usually have to be estimated by means of a nonlinear optimisation algorithm:
- Maximum Likelihood Estimation: here the researcher proposes a distribution according to which some variable is distributed and then attempts to find the set of parameters that maximise the likelihood of the actually observed data.
- Generalized Method of Moments: Here the applied econometrician proposes that a set of moments (like [math]E(y_t \widehat{u}_t)=0[/math]) is met. In the context of the earlier nonlinear models this would for instance imply that the estimated residuals [math]\widehat{u}_t[/math] are uncorrelated with the observed dependent variable [math]y_t[/math]. As [math]\widehat{u_t}[/math] depends on the model parameters, one could find the model parameters that make this moment condition true[7].
Literature
Hamilton J.D. (1994) Time Series Analysis, Princeton, Section 5.7 as well as Judge G.G, W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.-C. Lee (1985) The Theory and Practice of Econometrics, John Wiley, Appendix B, give good introductions into the mechanics of nonlinear optimisation algorithms.
Martin V., Hurn S. and Harris D. (2012) Econometric Modelling with Time Series: Specification, Estimation and Testing (Themes in Modern Econometrics), Chapter 3 gives an excellent introduction into nonlinear optimisation strategies.
Footnotes
- ↑ I am indebted to Stan Hurn, School of Economics and Finance, Queensland University of Technology, who allowed me to use some of his lecture material.
- ↑ this could mean minimise or maximise, depending on the problem
- ↑ There will be no discussion of such gradient-free methods. A good introduction to the most promising such algorithm, the Simplex algorithm, can be found in Judge et al. (1985)
- ↑ It is not immediately obvious why we do this at this stage, but as you will see a little later, this will eventually provide us with an update rule for the parameters.
- ↑ This section is dedicated to my friend Adam Clements with whom I once had a fairly passionate discussion about the relative value of analytical and numerical derivatives. Go Analytical!
- ↑ As we usually estimate a vector [math]\theta[/math] this will compare some vector norm (see e.g. [[1]] against [math]\epsilon[/math].
- ↑ One will need at least as many moment requirements as parameters.