ConNonlinOptImp
Contents
Introduction
This Section is based on the Section in which we demonstrate how to implement an unconstrained optimisation (which is where the UNC in fminUNC) came from. This was called unconstrained, because we allowed the optimisation to find any parameter value as the optimal value. The optimal parameter value could be any real number on the real line.
This is appropriate on many occasions (and is equivalent to what happens when we estimate a linear regression model by OLS) but may be inappropriate in some:
- Some parameters only make sense if they are constraint to a subset of the real line. An example are variances or standard deviations which have to be positive.
- You want to estimate a restricted model where the restriction cannot be implemented by a rearrangement of the explanatory variables (e.g. the dropping of a variable from the model).
- You want to estimate a model where some model parameters are related to each other and you need to impose this relationship.
There are basically two ways in which you can implement restrictions. The first is by imposing the restrictions inside the function which is to be optimised (minimised), the second is to use a constrained optimisation routine.
Adjusting the target function
This is the easiest way to deal with constraints if at all possible. As an example we shall consider the case in which we estimate a linearasied regression function
[math]y_{t}=\beta _{0}+\beta _{1}x_{1,t}+\beta _{1}^{2}x_{2,t}+u_{t}[/math]
by Maximum Lilkelihood. In that case we will have to specify the distribution of the error terms [math]u_t[/math] as that will determine the form of the target function. Let’s assume that [math]u_t[/math] come from a normal distribution with variance [math]\sigma^2[/math]. It is well known that the density of the normal distribution is
[math]f(u_{t})=\frac{1}{\sqrt{2\pi \sigma^2}}exp\left( \frac{u_t^2}{2\sigma^2}\right)[/math]
In practice we will have to replace [math]u_t[/math] with the estimate
[math]\widehat{u}_{t}=y_{t}-\beta _{0}-\beta _{1}x_{1,t}-\beta _{1}^{2}x_{2,t}[/math]
This means that [math]f(u_{t})[/math] is a function of all three parameters ([math]\beta _{0}[/math], [math]\beta _{1}[/math] and [math]\sigma[/math]). We want to minimise the negative log-likelihood function and the following target function is written to return just that.
function nll = judgeml(theta,data)
% this returns the negative log likelihood function for the Judge function
% input: (i) theta, parameter vector, 3rd value is standard deviation
% (ii) data, data matrix
% first calculate residual for res = y - theta(1) - theta(2)*x1 -
% theta(2)^2*x2
res = data(:,1) - theta(1) - theta(2)*data(:,2) - theta(2)^2*data(:,3);
% negative log likelihood
nlogl = 0.5*log(2*pi) + 0.5*log(theta(3)^2) + (res.^2)./(2*theta(3)^2);
nll = sum(nlogl); % output for function is sum negative log likelihood
end
Why is this function an example for imposing a restriction through the target function? One of the parameters that needs to be estimated is the error variance [math]\sigma^2[/math]. That variance needs to be positive, besides, if it wasn’t the term [math]log(\sigma^2)[/math] was not computable. This is the reason why the parameter vector is defined to contain [math]\theta=(\beta _{0}~\beta _{1}~\sigma)'[/math] and not [math]\theta=(\beta _{0}~\beta _{1}~\sigma^2)'[/math]. Note that in the target function we actually never need [math]\sigma[/math], but always [math]\sigma^2[/math]. This is fortunate as in that way, [math]\sigma^2[/math] (i.e. theta(3)^ 2
) will always be positive, even if the nonlinear optimisation routine (e.g. fminunc
or fminsearch
) suggests a negative value for [math]\sigma[/math] (i.e. theta(3)
).
If we had submitted the variance into the judgeml(theta,data)
as the third element in theta
and had directly used theta(3)
in the calculations it may well have happened that fminunc
or fminsearch
had attempted to use a negative value for that parameter, leading to nonsensical results (or even an error message in the code). Remember, the optimisation routines are just "stupid" trial and error machines who have no knowledge of the meaning of the parameters. If you had submitted the variance, you could rescue the situation by not using theta(3)
directly in the calculation of the negative log likelihood, but rather abs(theta(3))
.
Useful parameter transformation
Sometimes you need parameters to be constrained to certain regions. Examples:
- A parameter should be positive (e.g. a variance parameter).
- A parameter should take a value between 0 and 1.
- A parameter should take a value between [math]a[/math] and [math]b[/math].
There are fairly straightforward ways to enforce such restrictions without having to resort to constrained optimisation algorithms (see below).
Let’s say a parameter ([math]\theta[/math]) should be positive (e.g. a variance parameter). We have seen in the above example for judgeml
how to deal with this. We define the parameter that get’s handed into the function as the standard deviation and whenever we need the variance we use the squared value of the parameter handed in as standard deviation. But what if we need the standard deviation. In that case we use either the absolute value (abs(theta)
) of the parameter handed in or we now take the square root of the variance (sqrt(theta2)
). This will ensure that even if the optimisation algorithm hands in a negative value for the standard deviation, it will be treated as if it was positive. You just need to remember that if your algorithm returned an optimised negative standard deviation, that it really is a positive value.
Next assume that you parameter should be larger than a certain value, say [math]\theta \gt 5[/math]. It is then best to take the input theta
and transform it into theta2 = 5 + exp(theta)
. Confirm yourself that theta2
will be larger than 5 for any value of theta
on the real line.
Another common restriction you may be interested in is that a parameter should take values between 0 and 1. Now you need to find a function that transposes any value on the real line into that interval. We are in uck as we should be familiar with one example of such functions, cumulative density functions. So here is the trick. The parameter handed into the function is theta
, but the parameter you really use it theta2 = normcdf(theta,0,1)
. This works as normcdf(theta,0,1)
will deliver the cumulative (standard) normal density for value theta
and that, by definition of the cdf is between 0 and 1.
One drawback of the above methods is that any transformations may complicate the calculation of standard errors. Say you restricted the parameter to be in the unit interval according to theta2 = normcdf(theta,0,1)
. This means that it really is the parameter theta2
which has some economic meaning and hence you would like any standard errors (and hence hypothesis tests) to apply to theta2
rather than theta
. However, if you use the techniques described in the MaxLik#Estimating_standard_errors_of_parameter_estimates section describing how to estimate standard errors for Maximum Likelihood parameter estimates, then these methods will automatically calculate the standard error for theta
(the parameter the optimisation black box sees) and not theta2
. This means that you need to translate the standard error for theta
into the one for theta2
. The method used is called the Delta Method.