StatFunct

From ECLR

The Statistics Toolbox

MathWorks, the company that produces and distributes MATLAB, has written a huge number of functions that do useful things in all sorts of areas. They package these functions in what are called toolboxes and sell them as add-on products to MATLAB (see the MathWorks page for a list of available toolboxes). Depending on the licence arrangement your employer has, you may (or may not) have access to these toolboxes.

One particularly useful toolbox for econometricians is the Statistics toolbox. If you want to know whether your installation of MATLAB has access to the Statistics toolbox, type the following into the command window:

which normpdf

If the result is

'normpdf' not found.

then you know that you do not have access to the Statistics toolbox. If you see a file path, then you know that you have access and you also found out where the toolboxes are installed.
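
If you want to run the same check from inside a script rather than by eye, something along the following lines should work. It is only a sketch: it assumes that normpdf is a good stand-in for the toolbox as a whole and that 'Statistics_Toolbox' is the licence feature name used by your installation. The variable names hasNormpdf and hasLicence are made up for this illustration.

```matlab
% Sketch: check for the Statistics toolbox programmatically.
% exist(...,'file') returns 2 when the name resolves to a file on the path.
hasNormpdf = (exist('normpdf','file') == 2);

% license('test',feature) returns 1 if a licence for the feature exists.
% 'Statistics_Toolbox' is the feature name assumed here.
hasLicence = (license('test','Statistics_Toolbox') == 1);

if hasNormpdf && hasLicence
    disp('Statistics toolbox functions appear to be available.');
else
    disp('Statistics toolbox not found; calls such as normpdf will fail.');
end
```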

The toolbox contains a large number of useful functions (see the complete list here) and here we will highlight a few which are extremely useful when doing econometrics (plus a few which are actually part of the general MATLAB software).

Statistical Distributions

It is standard fare for an econometrician to evaluate test statistics against a distribution. In particular, this is what delivers p-values. You need a test statistic and a hypothesised distribution; you then establish the probability, under the distribution assumed under the null hypothesis, of obtaining a value at least as extreme as the test statistic.

Fortunately MATLAB has a large range of functions that facilitate this job. They basically do the same as reading values off a distribution table. Let’s do this using an example. Let’s say we have a test statistic of value 2.34 and the test statistic is hypothesised to have a standard normal distribution. What then is the probability that we would, under the hypothesised [math]N(0,1)[/math] distribution, get a value larger than 2.34?

p = normcdf(2.34,0,1)
p = 0.9904

normcdf(x,mu,sd) produces the cumulative distribution value for the normal distribution (with mean mu and standard deviation sd) at value x. Here we get 0.9904, which is the probability of getting a value smaller than 2.34. The probability of getting a value larger than 2.34 is therefore [math](1-0.9904)=0.0096[/math]. If we were operating a right-tailed hypothesis test, this would be the p-value. If we had a two-tailed test we would have to multiply this value by two to get the p-value.
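
The p-value calculations for the different types of test can be sketched as follows. The variable names (z, p_right, p_left, p_two) are chosen for this illustration only, and the two-tailed line assumes a null distribution that is symmetric around zero, as the standard normal is.

```matlab
z = 2.34;                               % test statistic from the example

p_right = 1 - normcdf(z,0,1);           % right-tailed test: P(Z > z)
p_left  = normcdf(z,0,1);               % left-tailed test:  P(Z < z)
p_two   = 2*(1 - normcdf(abs(z),0,1));  % two-tailed test:   P(|Z| > |z|)

fprintf('right: %.4f, left: %.4f, two-tailed: %.4f\n', p_right, p_left, p_two);
```

Using abs(z) in the two-tailed line means the same code also works for negative test statistics.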

A range of MATLAB functions exist for reading probabilities. You need to watch out for the following features of these functions.

  1. Which distribution are you looking for?
  2. cdf or pdf? There exist functions to call probability densities (pdf) or cumulative probabilities (cdf). In the above example we requested a cdf. In general these work as follows: input = value, output = density (pdf) or cumulative probability (cdf).
    1. If you are calling cdf, then you sometimes have the option to call an "inverse" function. Here: input = cumulative probability, output = value. This is useful to get critical values.

Here are a few extra examples:

  • normpdf(x,mu,sd), height of the probability density function of a normal distribution with mean mu and standard deviation sd, i.e. [math]N(mu,sd^{2})[/math], at the value x.
  • normcdf(x,mu,sd), value of the cumulative distribution function of the same [math]N(mu,sd^{2})[/math] distribution at the value x.
  • norminv(p,0,1), inverse of the cumulative [math]N(0,1)[/math] distribution. The input p should be between 0 and 1 as it represents a probability. Try norminv(0.975,0,1). What should be the result?
  • tinv(p,v), inverse of the cumulative t distribution with v degrees of freedom. The input p should be between 0 and 1 as it represents a probability.
  • chi2pdf(x,v), height of the pdf of the [math]\chi^{2}[/math] distribution with v degrees of freedom at value x.
  • chi2cdf(x,v), value of the cdf of the [math]\chi^{2}[/math] distribution with v degrees of freedom at value x.
  • chi2inv(p,v), inverse of the cumulative [math]\chi^{2}[/math] distribution with v degrees of freedom. The input p should be between 0 and 1 as it represents a probability.

You can possibly start to guess the pattern behind the naming of the functions, but for more exotic distributions you will need to consult the help function.
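
As an illustration of how the inverse functions deliver critical values, here is a small sketch. The significance level alpha = 0.10 and the degrees of freedom (20 and 3) are arbitrary choices for this example, as are the variable names.

```matlab
alpha = 0.10;                          % significance level (assumed here)

z_crit = norminv(1 - alpha/2, 0, 1);   % two-tailed critical value, N(0,1)
t_crit = tinv(1 - alpha/2, 20);        % two-tailed critical value, t with 20 df
c_crit = chi2inv(1 - alpha, 3);        % right-tailed critical value, chi2 with 3 df

% Feeding a critical value back into the matching cdf recovers the
% probability we started from, which is a handy sanity check.
p_back_z = normcdf(z_crit, 0, 1);      % should equal 1 - alpha/2
p_back_c = chi2cdf(c_crit, 3);         % should equal 1 - alpha
```

Because the t distribution has fatter tails than the normal, t_crit should come out somewhat larger than z_crit.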

Random Numbers

On many occasions you may need to generate random numbers. It is important to understand, though, that no software generates truly random numbers; it generates numbers that merely appear random. This matters because, if you do it right, you will be able to recreate the same "random" numbers again. This can be exactly what you want, but may also work against you if you are not careful.

Let’s see how it works. First we need to introduce a function that calls random numbers. Let me introduce the rand(r,c) function. This will generate a matrix of random numbers (uniformly distributed on [math][0,1][/math]) with r rows and c columns. Call this line twice from your command window and you will get something like the following:

>> rand(1,4)
ans =
    0.8147    0.9058    0.1270    0.9134

>> rand(1,4)
ans =
    0.6324    0.0975    0.2785    0.5469

As you can see, the same command produced different numbers (as you would expect from random numbers). You should think of a random number generator (RNG) as an extremely long, fixed list of random-looking numbers. When you call the random number generator as above you will basically get numbers from some unspecified place on that list, and every time you call it that place will change. However, there is a way in which you can determine where exactly on the list the RNG should start. The command is

RandStream.setGlobalStream(RandStream('mt19937ar','seed',123))

You do not really need to understand exactly what this does, only that the last number, here 123, specifies the exact place where the RNG should start. This could really be any number. Now do the following:

>> RandStream.setGlobalStream(RandStream('mt19937ar','seed',123));

>> rand(1,4)
ans =
    0.6965    0.2861    0.2269    0.5513

>> RandStream.setGlobalStream(RandStream('mt19937ar','seed',123));

>> rand(1,4)
ans =
    0.6965    0.2861    0.2269    0.5513

We called the RNG seeding command twice. And we twice asked it to start at exactly the same place and therefore it twice generated exactly the same "random" numbers[1]. The mind boggles. Why would you want to do that? This command allows you to recreate results exactly when they are partially based on random numbers (e.g. simulation studies or bootstrap tests).
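
More recent MATLAB releases also offer the shorter rng function for the same purpose. The following sketch assumes your release provides rng and that the 'twister' option selects the same Mersenne Twister generator as 'mt19937ar' above. The variable names x1 and x2 are only illustrative.

```matlab
rng(123,'twister');      % seed the global RNG (Mersenne Twister, seed 123)
x1 = rand(1,4);

rng(123,'twister');      % re-seed with the same value ...
x2 = rand(1,4);          % ... and draw again

sameDraws = isequal(x1,x2);   % true if both draws are identical
disp(sameDraws)
```

In a simulation study you would typically seed once at the top of the script, so that rerunning the whole script reproduces every result.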

The calls to the RNG above always created random numbers that were uniformly distributed (on the unit interval). Often you may want to create random numbers that follow a different distribution, most commonly a normal distribution. The command randn(r,c) produces an [math](r \times c)[/math] matrix of standard normally distributed random variables. If you want to generate normally distributed random numbers with a different mean and standard deviation you can use the following lines.

mu = 3;
sd = 5;
a = mu + randn(20,1)*sd;

These lines will produce a [math](20 \times 1)[/math] vector a of random numbers that come from a normal distribution with mean 3 and standard deviation 5.
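
A quick way to convince yourself that this scaling does what it should is to draw a large sample and compare the sample moments with the target values. The sample size of 100000 below is an arbitrary choice for this sketch.

```matlab
mu = 3;
sd = 5;
n  = 100000;                  % large sample so the moments settle down

a = mu + randn(n,1)*sd;       % shift and scale standard normal draws

sample_mean = mean(a);        % should be close to mu
sample_sd   = std(a);         % should be close to sd
fprintf('mean: %.3f, sd: %.3f\n', sample_mean, sample_sd);
```

With only 20 observations, as in the lines above, the sample mean and standard deviation can deviate noticeably from 3 and 5.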

There will be occasions when you want to draw random numbers from other distributions. The Statistics toolbox provides an extremely useful function for this purpose.

a = random(name,par1,par2,par3,r,c);

For details of this function you should consult doc random, but in short name represents the name of the distribution from which you want to draw random numbers. Depending on the distribution, you may have to supply up to three parameters. The last two entries r (rows) and c (columns) indicate what size the matrix of random numbers should have. If you omit them you will get one random number. Let’s show one example for the case of a Poisson distribution with mean parameter mu = 30.

mu = 30;
a = random('Poisson',mu,10,2);

This will create a [math]10 \times 2[/math] matrix of Poisson distributed random variables. For a list of all available distributions you should consult the documentation for this function.
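
The sketch below uses random for two different distributions and compares the Poisson draws with the dedicated function poissrnd, which is also part of the Statistics toolbox. The sample size n and the variable names are assumptions made for this illustration.

```matlab
mu = 30;
n  = 10000;

a = random('Poisson',mu,n,1);      % Poisson draws via the generic function
b = poissrnd(mu,n,1);              % the same distribution via poissrnd

% For a Poisson distribution the mean and the variance both equal mu.
fprintf('random:   mean %.2f, var %.2f\n', mean(a), var(a));
fprintf('poissrnd: mean %.2f, var %.2f\n', mean(b), var(b));

% A two-parameter example: normal with mean 3 and standard deviation 5.
c = random('Normal',3,5,20,1);     % 20 x 1 vector
```

The draws in a and b will differ, since each call advances the RNG, but their sample moments should be similar.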

Footnotes

  1. In what follows we will omit the inverted commas, but you should keep in the back of your mind that the numbers are not truly random.