Joint Probability Distributions
The objective of statistics is to learn about population characteristics, as first mentioned in the Introductory Section. An EXPERIMENT is any process which generates data. It is easy to imagine an experiment (say, interviewers questioning couples in the street) that generates paired observations, for example the weekly income of the husband and the weekly income of the wife in a particular population of husbands and wives. One possible use of such sample data is to investigate the relationship between the observed values of the two variables. In the Section Correlation and Regression we discussed how correlation and regression summarise the extent and nature of any linear relationship between these observed values. The discussion of relationships between variables in this section only scratches the surface of a very large topic; subsequent courses in Econometrics take the analysis of relationships between two or more variables much further.
If these two pieces of information generated by the experiment are considered to be the values of two random variables defined on the SAMPLE SPACE of an experiment (see this [[ProbabilityIntro | Section]]), then the discussion of random variables and probability distributions needs to be extended to the multivariate case.
Let [math]X[/math] and [math]Y[/math] be the two random variables: for simplicity, they are considered to be discrete random variables. The outcome of the experiment is a pair of values [math]\left( x,y\right)[/math]. The probability of this outcome is a joint probability which can be denoted
[math]\Pr \left( X=x\cap Y=y\right) ,[/math]
emphasising the analogy with the probability of a joint event [math]\Pr \left( A\cap B\right)[/math], or, more usually, by
[math]\Pr \left( X=x,Y=y\right) .[/math]
The collection of these probabilities, for all possible combinations of [math]x[/math] and [math]y[/math], is the joint probability distribution of [math]X[/math] and [math]Y[/math], denoted
[math]p\left( x,y\right) =\Pr \left( X=x,Y=y\right) .[/math]
The Axioms of Probability discussed in this Section carry over to imply
[math]0\leqslant p\left( x,y\right) \leqslant 1,[/math]
[math]\sum_{x}\sum_{y}p\left( x,y\right) =1,[/math]
where the sum is over all [math]\left( x,y\right)[/math] combinations.
Examples
Example 1
Let [math]H[/math] and [math]W[/math] be the random variables representing the population of weekly incomes of husbands and wives, respectively, in some country. There are only three possible weekly incomes, £0, £100 or £ 200. The joint probability distribution of [math]H[/math] and [math]W[/math] is represented as a table:
| | | [math]H[/math] | | |
|---|---|---|---|---|
| Probabilities | | £ 0 | £ 100 | £ 200 |
| Values of [math]W:[/math] | £ 0 | [math]0.05[/math] | [math]0.15[/math] | [math]0.10[/math] |
| | £ 100 | [math]0.10[/math] | [math]0.10[/math] | [math]0.30[/math] |
| | £ 200 | [math]0.05[/math] | [math]0.05[/math] | [math]0.10[/math] |
Then we can read off, for example, that
[math]\Pr \left( H=0,W=0\right) =0.05,[/math]
that is, in this population, in [math]5\%[/math] of couples both husband and wife have a zero weekly income.
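For readers who want to experiment, here is a minimal Python sketch of this table (the dictionary name `joint` and the layout are our own choices); it simply transcribes the probabilities, checks the axioms and reads off [math]\Pr \left( H=0,W=0\right)[/math]:

```python
# Joint distribution of Example 1, keyed by (h, w) with incomes in pounds.
joint = {
    (0, 0): 0.05,   (100, 0): 0.15,   (200, 0): 0.10,
    (0, 100): 0.10, (100, 100): 0.10, (200, 100): 0.30,
    (0, 200): 0.05, (100, 200): 0.05, (200, 200): 0.10,
}

# The axioms: every probability lies in [0, 1] and they sum to 1.
assert all(0 <= p <= 1 for p in joint.values())
assert abs(sum(joint.values()) - 1.0) < 1e-12

print(joint[(0, 0)])   # Pr(H = 0, W = 0) = 0.05
```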
In this example, the nature of the experiment underlying the population data is not explicitly stated. However, in the next example, the experiment is described, the random variables defined in relation to the experiment, and their probability distribution deduced directly.
Example 2
Consider the following simple version of a lottery. Players in the lottery choose one number between [math]1[/math] and [math]5[/math], whilst a machine selects the lottery winners by randomly selecting one of five balls (numbered [math]1[/math] to [math]5[/math]). Any player whose chosen number coincides with the number on the selected ball is a winner. Whilst the machine selects each ball at random (so that each ball has a [math]0.2[/math] chance of selection), players of the lottery have “lucky” and “unlucky” numbers, and therefore choose their numbers not randomly, but with the following probabilities:
Number chosen by player | Probability of being chosen |
---|---|
[math]1[/math] | [math]0.40[/math] |
[math]2[/math] | [math]0.20[/math] |
[math]3[/math] | [math]0.05[/math] |
[math]4[/math] | [math]0.10[/math] |
[math]5[/math] | [math]0.25[/math] |
Total | [math]1.00[/math] |
Let [math]X[/math] denote the number chosen by a player and [math]Y[/math] the number selected by the machine. If the events [math]X=x[/math] and [math]Y=y[/math] are assumed to be independent for every possible pair of values, then
[math]\Pr \left( X=x,Y=y\right) =\Pr \left( X=x\right) \Pr \left( Y=y\right) .[/math]
The table above gives the probabilities for [math]X[/math], and [math]\Pr \left( Y = y\right) =0.2[/math] for all [math]y=1,...,5[/math], so that a table can be drawn up displaying the joint distribution [math]p\left( x,y\right)[/math] (rows show the number chosen by the player, columns the number selected by the machine):

| Chosen by player | [math]\ 1\ [/math] | [math]\ 2\ [/math] | [math]\ 3\ [/math] | [math]\ 4\ [/math] | [math]\ 5\ [/math] | Row Total |
|---|---|---|---|---|---|---|
[math]1[/math] | [math]0.08[/math] | [math]0.08[/math] | [math]0.08[/math] | [math]0.08[/math] | [math]0.08[/math] | [math]0.40[/math] |
[math]2[/math] | [math]0.04[/math] | [math]0.04[/math] | [math]0.04[/math] | [math]0.04[/math] | [math]0.04[/math] | [math]0.20[/math] |
[math]3[/math] | [math]0.01[/math] | [math]0.01[/math] | [math]0.01[/math] | [math]0.01[/math] | [math]0.01[/math] | [math]0.05[/math] |
[math]4[/math] | [math]0.02[/math] | [math]0.02[/math] | [math]0.02[/math] | [math]0.02[/math] | [math]0.02[/math] | [math]0.10[/math] |
[math]5[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.25[/math] |
Column Total | [math]0.20[/math] | [math]0.20[/math] | [math]0.20[/math] | [math]0.20[/math] | [math]0.20[/math] | [math]1.00[/math] |
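The construction of this table is easy to reproduce numerically. The sketch below (Python, with our own variable names) builds the joint distribution as the product of the marginals, exactly as assumed above:

```python
# Marginal distribution of the player's choice X (from the table above).
p_x = {1: 0.40, 2: 0.20, 3: 0.05, 4: 0.10, 5: 0.25}
# The machine's choice Y is uniform on 1..5.
p_y = {y: 0.20 for y in range(1, 6)}

# Under independence, each joint probability is the product of the marginals.
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

print(joint[(1, 3)])                    # 0.40 * 0.20 = 0.08
print(round(sum(joint.values()), 10))   # 1.0
```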
The general question of independence in joint probability distributions will be discussed later in the section.
This last example may seem somewhat strange, but it is not. When people play Lotto and choose their numbers, they often choose their lucky numbers. Lucky numbers are often dates (birthdays, wedding anniversaries, etc.), which implies that numbers between 1 and 31 (and especially between 1 and 12) are chosen more often than numbers larger than 31. Choosing such "lucky" numbers does not make you less likely to win, but if you do win you will have to share the prize with more people!
Marginal Probabilities
Given a joint probability distribution
[math]p\left( x,y\right) =\Pr \left( X=x,Y=y\right)[/math]
for the random variables [math]X[/math] and [math]Y[/math], a probability of the form [math]\Pr \left(X=x\right) [/math] or [math]\Pr \left(Y=y\right) [/math] is called a marginal probability.
The collection of these probabilities for all values of [math]X[/math] is the marginal probability distribution for [math]X[/math],
[math]p_{X}\left( x\right) =\Pr \left( X=x\right) .[/math]
If it is clear from the context, write [math]p_{X}\left( x\right) [/math] as [math]p\left(x\right) [/math]. Suppose that [math]Y[/math] takes on values [math]0,1,2[/math]. Then
[math]\Pr \left( X=x\right) =\Pr \left( X=x,Y=0\right) +\Pr \left( X=x,Y=1\right)+\Pr \left( X=x,Y=2\right) ,[/math]
the sum of all the joint probabilities favourable to [math]X=x[/math]. So, marginal probability distributions are found by summing over all the values of the other variable:
[math]p_{X}\left( x\right) =\sum_{y}p\left( x,y\right) ,\;\;\;\;\;p_{Y}\left(y\right) =\sum_{x}p\left( x,y\right) .[/math]
This can be illustrated using Example 1 again (where, for simplicity, we now represent [math]£\ 100[/math] by [math]1[/math] and [math]£\ 200[/math] by [math]2[/math]):
[math]\begin{aligned} \Pr \left( W=0\right) &=&\Pr \left( W=0,H=0\right) +\Pr \left(W=0,H=1\right) +\Pr \left( W=0,H=2\right) \\ &=&0.05+0.15+0.10 \\ &=&0.30.\end{aligned}[/math]
There is a simple recipe for finding the marginal distributions in the table of joint probabilities: find the row sums and column sums. From Example 1,
| | | [math]H[/math] | | | Row Sums |
|---|---|---|---|---|---|
| Probabilities | | [math]\mathbf{0}[/math] | [math]\mathbf{1}[/math] | [math]\mathbf{2}[/math] | [math]p_{W}\left( w\right) [/math] |
| Values of [math]W:[/math] | [math]\mathbf{0}[/math] | [math]0.05[/math] | [math]0.15[/math] | [math]0.10[/math] | [math]0.30[/math] |
| | [math]\mathbf{1}[/math] | [math]0.10[/math] | [math]0.10[/math] | [math]0.30[/math] | [math]0.50[/math] |
| | [math]\mathbf{2}[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.10[/math] | [math]0.20[/math] |
| Column Sums [math]p_{H}\left( h\right) [/math] | | [math]0.20[/math] | [math]0.30[/math] | [math]0.50[/math] | [math]1.00[/math] |
from which the marginal distributions should be written out explicitly as
| Values of [math]W[/math] | [math]p_{W}\left( w\right) [/math] | Values of [math]H[/math] | [math]p_{H}\left( h\right) [/math] |
|---|---|---|---|
| [math]0[/math] | [math]0.3[/math] | [math]0[/math] | [math]0.2[/math] |
| [math]1[/math] | [math]0.5[/math] | [math]1[/math] | [math]0.3[/math] |
| [math]2[/math] | [math]0.2[/math] | [math]2[/math] | [math]0.5[/math] |
| | [math]\mathbf{1.0}[/math] | | [math]\mathbf{1.0}[/math] |
By calculation, we can find the expected values and variances of [math]W[/math] and [math]H[/math] as
[math]\begin{aligned} E\left[ W\right] &=&0.9,\;\;\;var\left[ W\right] =0.49, \\ E\left[ H\right] &=&1.3,\;\;\;var\left[ H\right] =0.61.\end{aligned}[/math]
Notice that a marginal probability distribution has to satisfy the usual properties expected of a probability distribution (for a discrete random variable):
[math]\begin{aligned} 0 &\leq &p_{X}\left( x\right) \leq 1,\;\;\;\;\;\sum_{x}p_{X}\left( x\right)=1, \\ 0 &\leq &p_{Y}\left( y\right) \leq 1,\;\;\;\;\;\sum_{y}p_{Y}\left( y\right)=1.\end{aligned}[/math]
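The row-and-column-sum recipe, and the expected values and variances quoted above, can be checked with a short Python sketch (variable names are our own; incomes are coded 0, 1, 2 as in the table):

```python
# Joint distribution of Example 1, keyed by (h, w), with £100 coded as 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}

# Marginals: sum the joint probabilities over the other variable.
p_h = {h: sum(p for (hh, w), p in joint.items() if hh == h) for h in (0, 1, 2)}
p_w = {w: sum(p for (h, ww), p in joint.items() if ww == w) for w in (0, 1, 2)}

def mean_var(pmf):
    """Mean and variance of a discrete pmf given as {value: probability}."""
    mu = sum(v * p for v, p in pmf.items())
    return mu, sum((v - mu) ** 2 * p for v, p in pmf.items())

print({w: round(p, 2) for w, p in p_w.items()})   # {0: 0.3, 1: 0.5, 2: 0.2}
print({h: round(p, 2) for h, p in p_h.items()})   # {0: 0.2, 1: 0.3, 2: 0.5}
for name, pmf in (("W", p_w), ("H", p_h)):
    mu, var = mean_var(pmf)
    print(name, round(mu, 4), round(var, 4))       # W 0.9 0.49 ; H 1.3 0.61
```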
Functions of Two Random Variables
Given the experiment of Example 1, one can imagine defining further random variables on the sample space of this experiment. One example is the random variable [math]T[/math] representing total household income:
[math]T=H+W.[/math]
This new random variable is a (linear) function of [math]H[/math] and [math]W[/math], and we can deduce the probability distribution of [math]T[/math] from the joint distribution of [math]H[/math] and [math]W[/math]. For example,
[math]\begin{aligned} \Pr \left( T=0\right) &=&\Pr \left( H=0,W=0\right) , \\ \Pr \left( T=1\right) &=&\Pr \left( H=0,W=1\right) +\Pr (H=1,W=0).\end{aligned}[/math]
The complete probability distribution of [math]T[/math] is
| [math]\mathbf{t}[/math] | [math]\mathbf{\Pr \left( T=t\right) }[/math] | [math]\mathbf{t\times \Pr \left( T=t\right)} [/math] |
|---|---|---|
| [math]0[/math] | [math]0.05[/math] | [math]0[/math] |
| [math]1[/math] | [math]0.25[/math] | [math]0.25[/math] |
| [math]2[/math] | [math]0.25[/math] | [math]0.50[/math] |
| [math]3[/math] | [math]0.35[/math] | [math]1.05[/math] |
| [math]4[/math] | [math]0.10[/math] | [math]0.40[/math] |
| Total | [math]1.00[/math] | [math]2.20[/math] |
from which we note that [math]E\left[ T\right] =2.2[/math], indicating that the population mean income for married couples in the specific country is £ 220.
Now we consider a more formal approach. Let [math]X[/math] and [math]Y[/math] be two discrete random variables with joint probability distribution [math]p\left( x,y\right) [/math]. Let [math]V[/math] be a random variable defined as a function of [math]X[/math] and [math]Y:[/math]
[math]V=g\left( X,Y\right) .[/math]
Here, [math]g\left( X,Y\right) [/math] is not necessarily a linear function: it could be any function of two variables. In principle, we can deduce the probability distribution of [math]V[/math] from [math]p\left( x,y\right) [/math] and thus deduce the mean of [math]V[/math], [math]E\left[ V\right] [/math], just as we did for [math]T[/math] in Example 1.
However, there is a second method that works directly with the joint probability distribution [math]p\left( x,y\right):[/math] the expected value of [math]V[/math] is
[math]E\left[ V\right] =E\left[ g\left( X,Y\right) \right] =\sum_{x}\sum_{y}g\left( x,y\right) p\left( x,y\right) .[/math]
The point about this approach is that it avoids the calculation of the probability distribution of [math]V[/math].
To apply this argument to find [math]E\left[ T\right] [/math] in Example 1, it is helpful to modify the table of joint probabilities to display (in parenthesis) the value of [math]T[/math] associated with each pair of values for [math]H[/math] and [math]W[/math]:
| | | [math]\mathbf{h}[/math] | | |
|---|---|---|---|---|
| [math]\mathbf{(t)}[/math] | | [math]\mathbf{0}[/math] | [math]\mathbf{1}[/math] | [math]\mathbf{2}[/math] |
| [math]\mathbf{w}[/math] | [math]\mathbf{0}[/math] | [math]\left( 0\right) \;0.05[/math] | [math]\left( 1\right) \;0.15[/math] | [math]\left( 2\right) \;0.10[/math] |
| | [math]\mathbf{1}[/math] | [math]\left( 1\right) \;0.10[/math] | [math]\left( 2\right) \;0.10[/math] | [math]\left( 3\right) \;0.30[/math] |
| | [math]\mathbf{2}[/math] | [math]\left( 2\right) \;0.05[/math] | [math]\left( 3\right) \;0.05[/math] | [math]\left( 4\right) \;0.10[/math] |
Then, the double summation required for the calculation of [math]E\left[ T\right][/math] can be performed along each row in turn:
[math]\begin{aligned} E\left[ T\right] &=&\left( 0\right) \times 0.05+\left( 1\right) \times 0.15+\left( 2\right) \times 0.10 \\ &&+\left( 1\right) \times 0.10+\left( 2\right) \times 0.10+\left( 3\right) \times 0.30 \\ &&+\left( 2\right) \times 0.05+\left( 3\right) \times 0.05+\left( 4\right) \times 0.10 \\ &=&2.20.\end{aligned}[/math]
So, the recipe is: for each cell, multiply the implied value of [math]T[/math] in that cell by the probability in that cell, and then add up the calculated values over all the cells.
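The same double summation can be written very compactly in code. The following Python sketch (with our own function names) evaluates [math]E\left[ g\left( X,Y\right) \right] =\sum_{x}\sum_{y}g\left( x,y\right) p\left( x,y\right)[/math] for [math]g\left( h,w\right) =h+w[/math] and reproduces [math]E\left[ T\right] =2.2[/math]:

```python
# Joint distribution of Example 1, keyed by (h, w), with £100 coded as 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}

def expect(g, joint):
    """E[g(H, W)]: sum g(h, w) * p(h, w) over every cell of the joint pmf."""
    return sum(g(h, w) * p for (h, w), p in joint.items())

print(round(expect(lambda h, w: h + w, joint), 4))   # E[T] = E[H + W] = 2.2
```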
Independence, Covariance and Correlation
Joint probability distributions have interesting characteristics that are not defined when we look at a single random variable. These characteristics describe how the two individual random variables relate to each other. The relevant concepts are independence, covariance and correlation, and they will be discussed in turn.
Independence
Let’s first define the following joint probability
[math]p\left( x,y\right) =\Pr \left( X=x,Y=y\right) ,[/math]
which represents the probability that simultaneously [math]X=x[/math] and [math]Y=y[/math]. Also [math]p\left( x\right)[/math] and [math]p\left(y\right)[/math] are defined as the respective marginal probabilities.
Before discussing the definition of independence, it is worth recalling a result related to Bayes' Theorem, which was discussed earlier in this Section: a conditional probability is calculated according to
[math]p\left( x|y\right) = \dfrac{p\left( x,y\right)}{p(y)}.[/math]
From this it follows that the joint probability can be derived from
[math]p\left( x,y\right) = p\left(y\right) p\left( x|y\right).[/math]
The remarkable feature of this result is that it is always valid, regardless of how the two random variables [math]X[/math] and [math]Y[/math] relate to each other. There is, however, one special case of the relationship between two random variables which is called independence. In fact, the statistical meaning of this term is close to how we would use the word in everyday life.
The statistical definition of independence of the random variables [math]X[/math] and [math]Y[/math] is that for all values of [math]x[/math] and [math]y[/math] the following relationship holds:
[math]p\left( x,y\right) =p_{X}\left( x\right) p_{Y}\left( y\right) .[/math]
If this relationship indeed holds for all values of [math]x[/math] and [math]y[/math], the random variables [math]X[/math] and [math]Y[/math] are said to be independent:
[math]X[/math] and [math]Y[/math] are independent random variables if and only if
[math]p\left( x,y\right) =p_{X}\left( x\right) p_{Y}\left( y\right) \;\;\;\;\; \text{for \textbf{all }}x,y.[/math]
Each joint probability is the product of the corresponding marginal probabilities. Independence also means that [math]\Pr \left( Y=y\right) [/math] would not be affected by knowing that [math]X=x[/math]: knowing the value taken on by one random variable does not affect the probabilities of the outcomes of the other random variable. This is also expressed by the following relationship, which is valid only if [math]X[/math] and [math]Y[/math] are independent:
[math]p\left( x|y\right) = p_{X}\left( x\right) \;\;\;\;\;\text{for all }x,y\text{ with }p_{Y}(y)\gt 0.[/math]
Another corollary of this is that if two random variables [math]X[/math] and [math]Y[/math] are independent, then there can be no relationship of any kind, linear or non-linear, between them.
The joint probabilities and marginal probabilities for Example 1 are:

| | | [math]H[/math] | | | Row Sums |
|---|---|---|---|---|---|
| Probabilities | | [math]0[/math] | [math]1[/math] | [math]2[/math] | [math]p_{W}\left( w\right) [/math] |
| Values of [math]W:[/math] | [math]0[/math] | [math]0.05[/math] | [math]0.15[/math] | [math]0.10[/math] | [math]0.30[/math] |
| | [math]1[/math] | [math]0.10[/math] | [math]0.10[/math] | [math]0.30[/math] | [math]0.50[/math] |
| | [math]2[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.10[/math] | [math]0.20[/math] |
| Column Sums [math]p_{H}\left( h\right) [/math] | | [math]0.20[/math] | [math]0.30[/math] | [math]0.50[/math] | [math]1.00[/math] |

Here [math]p\left( 0,0\right) =0.05[/math], whilst [math]p_{W}\left( 0\right) =0.30[/math], [math]p_{H}\left( 0\right) =0.20[/math], with
[math]p\left( 0,0\right) \neq p_{W}\left( 0\right) p_{H}\left( 0\right) .[/math]
So, [math]H[/math] and [math]W[/math] cannot be independent.
For [math]X[/math] and [math]Y[/math] to be independent, [math]p\left( x,y\right) =p_{X}\left( x\right) p_{Y}\left( y\right) [/math] has to hold for all [math]x,y[/math]. Finding one pair of values [math]x,y[/math] for which this fails is sufficient to conclude that [math]X[/math] and [math]Y[/math] are not independent. However, to confirm independence one may have to check every possible pair of values: think what would be required in Example 2 above if one did not know that the joint probability distribution had been constructed using independence.
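Such a cell-by-cell check is easily automated. The Python sketch below (function and variable names are our own) compares every joint probability with the product of the corresponding marginals, confirming that Example 1 fails the test while Example 2 passes it:

```python
def is_independent(joint, tol=1e-9):
    """True if p(x, y) = p_X(x) * p_Y(y) holds (within tol) for every cell."""
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
               for x in xs for y in ys)

# Example 1 (husbands' and wives' incomes): not independent.
example1 = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}
print(is_independent(example1))   # False, e.g. 0.05 != 0.20 * 0.30 at (0, 0)

# Example 2 (the lottery): independent by construction.
p_x = {1: 0.40, 2: 0.20, 3: 0.05, 4: 0.10, 5: 0.25}
example2 = {(x, y): p_x[x] * 0.20 for x in p_x for y in range(1, 6)}
print(is_independent(example2))   # True
```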
Covariance and Correlation
A popular measure of association for random variables [math]X[/math] and [math]Y[/math] is the (population) correlation coefficient. It is the population characteristic analogous to the (sample) correlation coefficient introduced in this Section. It will be seen that this (population) correlation coefficient is really only a measure of strength of any linear relationship between the random variables.
Covariance
The first step is to define the (population) covariance as a characteristic of the joint probability distribution of [math]X[/math] and [math]Y[/math]. Let
[math]E\left[ X\right] =\mu _{X},\;\;\;\;\;E\left[ Y\right] =\mu _{Y}.[/math]
The (population) covariance is defined as
[math]\begin{aligned} cov\left[ X,Y\right] &=&E\left[ \left( X-\mu _{X}\right) \left(Y-\mu _{Y}\right) \right] \\ &=&\sigma _{XY}.\end{aligned}[/math]
Notice that by this definition, [math]cov\left[ X,Y\right] =cov\left[ Y,X\right] [/math].
There are a number of alternative expressions for the covariance. The first follows from seeing
[math]\left( X-\mu _{X}\right) \left( Y-\mu _{Y}\right)[/math]
as a function [math]g\left( X,Y\right) [/math] of [math]X[/math] and [math]Y:[/math]
[math]cov\left[ X,Y\right] =\sum_{x}\sum_{y}\left( x-\mu _{X}\right)\left( y-\mu _{Y}\right) p\left( x,y\right) .[/math]
We can see from this expression that if enough [math]\left( x,y\right)[/math] pairs have [math]x-\mu _{X}[/math] and [math]y-\mu _{Y}[/math] values with the same sign, [math]cov\left[ X,Y\right] \gt 0[/math], so that large (small) values of [math]x-\mu _{X}[/math] tend to occur with large (small) values of [math]y-\mu _{Y}[/math]. Similarly, if enough [math]\left( x,y\right) [/math] pairs have [math]x-\mu _{X}[/math] and [math]y-\mu _{Y}[/math] values with different signs, [math]cov\left[ X,Y\right] \lt 0[/math]. Here, large (small) values of [math]x-\mu _{X}[/math] tend to occur with small (large) values of [math]y-\mu _{Y}[/math].
[math]cov\left[ X,Y\right] \gt 0[/math] gives a “positive” relationship between [math]X[/math] and [math]Y[/math], [math]cov\left[ X,Y\right] \lt 0[/math] a “negative” relationship.
There is a shorthand calculation for covariance, analogous to that given for the variance in this Section:
[math]\begin{aligned} cov\left[ X,Y\right] &=&E\left[ \left( X-\mu _{X}\right) \left(Y-\mu _{Y}\right) \right] \\ &=&E\left[ XY-X\mu _{Y}-\mu _{X}Y+\mu _{X}\mu _{Y}\right] \\ &=&E\left[ XY\right] -E\left[ X\right] \mu _{Y}-\mu _{X}E\left[ Y\right]+\mu _{X}\mu _{Y} \\ &=&E\left[ XY\right] -\mu _{X}\mu _{Y}-\mu _{X}\mu _{Y}+\mu _{X}\mu _{Y} \\ &=&E\left[ XY\right] -\mu _{X}\mu _{Y}. \end{aligned}[/math]
Here, the linear function rule of the previous Section has been used to make the expected value of a sum of terms equal to the sum of expected values, and then to make, for example,
[math]E\left[ X\mu _{Y}\right] =E\left[ X\right] \mu _{Y}.[/math]
Even with this shorthand method, the calculation of the covariance is rather tedious. To calculate [math]cov\left[ W,H\right] [/math] in Example 1, the best approach is to imitate the way in which [math]E\left[ T\right] [/math] was calculated above. Rather than display the values of [math]T[/math], here we display the values of [math]W\times H[/math] in order to first calculate [math]E\left[ WH\right][/math]:
| | | [math]h[/math] | | |
|---|---|---|---|---|
| [math](w\times h)[/math] | | [math]0[/math] | [math]1[/math] | [math]2[/math] |
| [math]w[/math] | [math]0[/math] | [math]\left( 0\right) \;0.05[/math] | [math]\left( 0\right) \;0.15[/math] | [math]\left( 0\right) \;0.10[/math] |
| | [math]1[/math] | [math]\left( 0\right) \;0.10[/math] | [math]\left( 1\right) \;0.10[/math] | [math]\left( 2\right) \;0.30[/math] |
| | [math]2[/math] | [math]\left( 0\right) \;0.05[/math] | [math]\left( 2\right) \;0.05[/math] | [math]\left( 4\right) \;0.10[/math] |
Recall that a value of [math]1[/math] represents [math]£\ 100[/math] (and, similarly, [math]2[/math] represents [math]£\ 200[/math]). Therefore, using the same strategy of multiplying within cells and adding up along each row in turn, and working in pounds, we find
[math]\begin{aligned} E\left[ WH\right] &=&\left( 0\right) \times 0.05+\left( 0\right) \times 0.15+\left( 0\right) \times 0.10 \\ &&+\left( 0\right) \times 0.10+\left( 100\times 100\right) \times 0.10+\left( 100 \times 200\right) \times 0.30 \\ &&+\left( 0\right) \times 0.05+\left( 200 \times 100\right) \times 0.05+\left( 200 \times 200\right) \times 0.10 \\ &=&1000+6000+1000+4000 \\ &=&12000.\end{aligned}[/math]
Previously we found that [math]E\left[ W\right] =0.9[/math] (equivalent to [math]£\ 90[/math]) and [math]E\left[ H\right] =1.3[/math] (equivalent to [math]£\ 130[/math]), so that
[math]\begin{aligned} cov\left[ W,H\right] &=&E\left[ WH\right] -E\left[ W\right] E\left[H\right] \\ &=&12000-\left( 90\right) \left( 130\right) \\ &=&300.\end{aligned}[/math]
Strength of Association and Units of Measurement
How well does covariance measure the strength of a relationship between two random variables? Not very well, because the value of the covariance depends on the units of measurement. In Example 1 the units of measurement of both random variables [math]W[/math] and [math]H[/math] were pounds. When we calculated [math]cov\left[ W,H\right][/math] we needed to calculate terms like [math]£\ 100 \times £\ 100[/math], and hence the units of measurement of the covariance are [math]£^2[/math], which is certainly difficult to interpret.
Perhaps even more worrying, if we had calculated the covariance with [math]1[/math] and [math]2[/math] instead of [math]£\ 100[/math] and [math]£\ 200[/math], we would have obtained a value of [math]0.03[/math]. Of course, we have merely changed the unit of measurement; nothing about the relationship between the two variables has really changed.
In the Section on Correlation and Regression we only briefly mentioned covariance, for exactly these reasons. There we introduced the correlation statistic as the preferred measure of (linear) association. As discussed there, the correlation statistic is based on the covariance, but it avoids the weaknesses just discussed.
Correlation
What is required is a measure of the strength of association which is invariant to changes in units of measurement. Generalising what we have just seen, if the units of measurement of two random variables [math]X[/math] and [math]Y[/math] are changed to produce new random variables [math]\alpha X[/math] and [math]\beta Y[/math], then the covariance in the new units of measurement is related to the covariance in the original units of measurement by
[math]cov\left[ \alpha X,\beta Y\right] =\alpha \beta cov\left[ X,Y\right] .[/math]
What are the variances of [math]\alpha X[/math] and [math]\beta Y[/math] in terms of [math]var\left[ X\right][/math] and [math]var\left[ Y\right][/math]? By the results on linear functions of random variables in the previous Section, they are
[math]var\left[ \alpha X\right] =\alpha ^{2}var\left[ X\right],\;\;\;var\left[ \beta Y\right] =\beta^{2}var\left[ Y\right] .[/math]
The (population) correlation coefficient between [math]X[/math] and [math]Y[/math] is defined by
[math]\rho _{XY}=\dfrac{cov\left[ X,Y\right] }{\sqrt{var\left[ X\right] var\left[ Y\right] }}.[/math]
This is also the correlation between [math]\alpha X[/math] and [math]\beta Y[/math] (provided [math]\alpha[/math] and [math]\beta[/math] have the same sign):
[math]\begin{aligned} \rho _{\alpha X,\beta Y} &=&\dfrac{cov\left[ \alpha X,\beta Y\right] }{\sqrt{var\left[ \alpha X\right] var\left[\beta Y\right] }} \\ &=&\dfrac{\alpha \beta cov\left[ X,Y\right] }{\sqrt{\alpha^{2}\beta ^{2}var\left[ X\right] var\left[ Y\right] }} \\ &=&\rho _{XY},\end{aligned}[/math]
so that the correlation coefficient does not depend on the units of measurement.
For the income example we calculated the covariance to be [math]300[/math]. Calculating the variances with incomes measured in pounds gives [math]var[H]=6100[/math] and [math]var[W]=4900[/math]. We then get[1]
[math]\begin{aligned} \rho _{WH} &=&\dfrac{300}{\sqrt{\left( 4900\right) \left( 6100\right) }} \\ &=&0.0549.\end{aligned}[/math]
Is this indicative of a strong relationship? Just like the sample correlation coefficient of the Regression and Correlation Section, it can be shown that
- the correlation coefficient [math]\rho _{XY}[/math] always satisfies [math]-1\leqslant \rho _{XY}\leqslant 1[/math].
- The closer [math]\rho [/math] is to [math]1[/math] or [math]-1[/math], the stronger the relationship.
So, [math]\rho _{WH}=0.0549[/math] is indicative of a very weak relationship.
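These calculations, and the footnote's point about units of measurement, can be verified with a short Python sketch (function names are our own): the covariance changes from [math]0.03[/math] to [math]300[/math] when incomes are rescaled from codes to pounds, while the correlation stays at [math]0.0549[/math].

```python
def cov_corr(joint):
    """Covariance and correlation of (X, Y) from a discrete joint pmf {(x, y): p}."""
    e_x  = sum(x * p for (x, y), p in joint.items())
    e_y  = sum(y * p for (x, y), p in joint.items())
    e_xy = sum(x * y * p for (x, y), p in joint.items())
    v_x  = sum((x - e_x) ** 2 * p for (x, y), p in joint.items())
    v_y  = sum((y - e_y) ** 2 * p for (x, y), p in joint.items())
    cov  = e_xy - e_x * e_y
    return cov, cov / (v_x * v_y) ** 0.5

# Example 1, listed row by row (W = 0, 1, 2), keys are (h, w).
probs  = [0.05, 0.15, 0.10, 0.10, 0.10, 0.30, 0.05, 0.05, 0.10]
values = [(h, w) for w in (0, 1, 2) for h in (0, 1, 2)]

units  = dict(zip(values, probs))                                 # incomes coded 0, 1, 2
pounds = {(100 * h, 100 * w): p for (h, w), p in units.items()}   # incomes in pounds

cov_u, rho_u = cov_corr(units)
cov_p, rho_p = cov_corr(pounds)
print(round(cov_u, 4), round(rho_u, 4))   # 0.03   0.0549  -- covariance depends on units
print(round(cov_p, 4), round(rho_p, 4))   # 300.0  0.0549  -- correlation does not
```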
It can be shown that if [math]X[/math] and [math]Y[/math] are exactly linearly related by
[math]Y=a+bX \quad \text{with\ \ \ }b\gt 0[/math]
then [math]\rho _{XY}=1[/math] - that is, [math]X[/math] and [math]Y[/math] are perfectly correlated. [math]X[/math] and [math]Y[/math] are also perfectly correlated if they are exactly linearly related by
[math]Y=a+bX \quad \text{with\ \ \ }b\lt 0,[/math]
but [math]\rho _{XY}=-1[/math]. Thus,
- correlation measures only the strength of a linear relationship between [math]X[/math] and [math]Y[/math];
- correlation does not imply causation.
Other notations for the correlation coefficient are
[math]\rho _{XY}=\dfrac{\sigma _{XY}}{\sigma _{X}\sigma _{Y}}[/math]
which uses covariance and standard deviation notation, and
[math]\rho _{XY}=\dfrac{E\left[ \left( X-\mu _{X}\right) \left( Y-\mu _{Y}\right) \right] }{\sqrt{E\left[ \left( X-\mu _{X}\right) ^{2}\right] E\left[ \left(Y-\mu _{Y}\right) ^{2}\right] }}.[/math]
The relation between Correlation, Covariance and Independence
Non-zero correlation and covariance between random variables [math]X[/math] and [math]Y[/math] indicate some linear association between them, whilst independence of [math]X[/math] and [math]Y[/math] implies no relationship or association of any kind between them. So, it is not surprising that
- independence of [math]X[/math] and [math]Y[/math] implies zero covariance: [math]cov\left[ X,Y\right] =0;[/math]
- independence of [math]X[/math] and [math]Y[/math] implies zero correlation: [math]\rho _{XY}=0[/math].
The converse is not true, in general:
- zero covariance or correlation does not imply independence.
The reason is that there may be a relationship between [math]X[/math] and [math]Y[/math] which is not linear.
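A small hypothetical illustration (not part of the examples above) makes the point: let [math]X[/math] take the values [math]-1,0,1[/math] with equal probability and let [math]Y=X^{2}[/math]. Then [math]Y[/math] is completely determined by [math]X[/math], yet the covariance is zero because the relationship is not linear. The Python sketch below checks this:

```python
# Hypothetical example: X uniform on {-1, 0, 1} and Y = X**2.
# Y is a deterministic function of X (so X and Y are certainly not independent),
# but the relationship is non-linear and the covariance works out to zero.
joint = {(x, x ** 2): 1 / 3 for x in (-1, 0, 1)}

e_x  = sum(x * p for (x, y), p in joint.items())        # E[X]  = 0
e_y  = sum(y * p for (x, y), p in joint.items())        # E[Y]  = 2/3
e_xy = sum(x * y * p for (x, y), p in joint.items())    # E[XY] = 0
print(e_xy - e_x * e_y)                                 # covariance = 0.0, despite dependence
```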
Conditional Distributions
Previously, we looked at the joint and marginal distributions for a pair of discrete random variables. Continuing the previous discussion of relationships between variables, we are often interested in econometrics in how random variable [math]X[/math] affects random variable [math]Y[/math]. This information is contained in something called the conditional distribution of "[math]Y[/math] given [math]X[/math]". For discrete random variables, [math]X[/math] and [math]Y[/math], this distribution is defined by the following probabilities
[math]\begin{aligned} p_{Y|X}\left( y|x\right) &=&\Pr \left( Y=y|X=x\right) \\ &=&\frac{p\left( x,y\right) }{p_{X}\left( x\right) },\end{aligned}[/math]
where [math]p_{Y|X}\left( y|x\right)[/math] reads as "the probability that [math]Y[/math] takes the value [math]y[/math] given that (conditional on) [math]X[/math] takes the value [math]x[/math]". The important aspect to understand here is that [math]Y|X=x[/math] is also a random variable: it is related to the random variable [math]Y[/math], but its distribution is not identical to that of [math]Y[/math] unless [math]X[/math] and [math]Y[/math] are independent.
As with the discussion of conditional probability in the Section on conditional probabilities, these conditional probabilities are defined on a restricted sample space, that of [math]X=x[/math] (hence the rescaling by [math]p_{X}\left( x\right) [/math]), and they are calculated on a sequence of restricted sample spaces, one for each possible value of [math]x[/math] (in the discrete case).
As an illustration of the calculations, consider again Example 1 and the construction of the conditional distribution of [math]W[/math] given [math]H[/math] for which we had the following joint distribution:
| | | [math]H[/math] | | |
|---|---|---|---|---|
| Probabilities | | [math]0[/math] | [math]1[/math] | [math]2[/math] |
| Values of [math]W:[/math] | [math]0[/math] | [math]0.05[/math] | [math]0.15[/math] | [math]0.10[/math] |
| | [math]1[/math] | [math]0.10[/math] | [math]0.10[/math] | [math]0.30[/math] |
| | [math]2[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.10[/math] |
We consider, in turn, the conditional probabilities for the values of [math]W[/math] given, first, [math]H=0[/math], then [math]H=1[/math] and finally [math]H=2[/math]. Intuitively, think of the probabilities in the cells as indicating sub-areas of the entire sample space, with the latter having an area of [math]1[/math] and the former (therefore) summing to [math]1[/math]. With this interpretation, the restriction [math]H=0[/math] "occupies" [math]20\%[/math] of the entire sample space (recall the marginal probability [math]\Pr \left( H=0\right) [/math] found above). The three cells corresponding to [math]H=0[/math] now make up the restricted sample space of [math]H=0[/math], and the outcome [math]W=0[/math] takes up [math]0.05/0.2=0.25[/math] of this restricted sample space; thus [math]\Pr \left(W=0|H=0\right) =\Pr \left( W=0,H=0\right) /\Pr (H=0)=0.25[/math]. Similarly, [math]\Pr\left( W=1|H=0\right) =0.10/0.2=0.5[/math] and [math]\Pr \left( W=2|H=0\right)=0.05/0.2=0.25[/math]. Notice that [math]\sum_{j=0}^{2}\Pr \left( W=j|H=0\right) =1[/math], as it should for the restricted sample space of [math]H=0[/math]. For all possible restrictions imposed by [math]H[/math] we get the following conditional distributions for [math]W[/math] (three in all, one for each of [math]h=0[/math], [math]h=1[/math] and [math]h=2[/math]):
| | | [math]H[/math] | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Probabilities | | [math]0[/math] | [math]1[/math] | [math]2[/math] | [math]\quad[/math] | [math]\Pr\left(W=w|H=0\right) [/math] | [math]\Pr\left(W=w|H=1\right) [/math] | [math]\Pr\left(W=w|H=2\right) [/math] |
| Values of [math]W:[/math] | [math]0[/math] | [math]0.05[/math] | [math]0.15[/math] | [math]0.10[/math] | | [math]1/4[/math] | [math]1/2[/math] | [math]1/5[/math] |
| | [math]1[/math] | [math]0.10[/math] | [math]0.10[/math] | [math]0.30[/math] | | [math]1/2[/math] | [math]1/3[/math] | [math]3/5[/math] |
| | [math]2[/math] | [math]0.05[/math] | [math]0.05[/math] | [math]0.10[/math] | | [math]1/4[/math] | [math]1/6[/math] | [math]1/5[/math] |
Notice how the probabilities for particular values of [math]W[/math] change according to the restriction imposed by [math]H;[/math] for example, [math]\Pr \left( W=0|H=0\right)\neq \Pr \left( W=0|H=1\right) [/math], say. Thus knowledge of, or information about, [math]H[/math] changes probabilities concerning [math]W[/math]. Because of this, and as ascertained previously, [math]W[/math] and [math]H[/math] are NOT independent.
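The division of each column by its column sum is easy to code. The following Python sketch (our own function and variable names) reproduces the three conditional distributions of [math]W[/math] given [math]H[/math] shown in the table above:

```python
# Joint distribution of Example 1, keyed by (h, w), with £100 coded as 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}

def conditional_w_given_h(joint, h):
    """Pr(W = w | H = h) = p(h, w) / p_H(h), for each value w."""
    p_h = sum(p for (hh, w), p in joint.items() if hh == h)
    return {w: joint[(h, w)] / p_h for (hh, w) in joint if hh == h}

for h in (0, 1, 2):
    dist = conditional_w_given_h(joint, h)
    print(h, {w: round(p, 4) for w, p in dist.items()})
# 0 {0: 0.25, 1: 0.5, 2: 0.25}
# 1 {0: 0.5, 1: 0.3333, 2: 0.1667}
# 2 {0: 0.2, 1: 0.6, 2: 0.2}
```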
In general, [math]X[/math] and [math]Y[/math] are independent if and only if knowledge of the value taken by [math]X[/math] does not tell us anything about the probability that [math]Y[/math] takes any particular value. Indeed, from the definition of [math]p_{Y|X}\left( y|x\right) [/math], we see that [math]X[/math] and [math]Y[/math] are independent if and only if [math]p_{Y|X}\left( y|x\right) =p_{Y}\left( y\right) [/math] for all [math]x,y[/math].
There is a similar treatment for conditional distributions for continuous random variables.
Conditional Expectation
While correlation is a useful summary of the relationship between two random variables, in econometrics we often want to go further and explain one random variable [math]Y[/math] as a function of some other random variable [math]X[/math]. One way of doing this is to look at the properties of the distribution of [math]Y[/math] conditional on [math]X[/math], as introduced above. In general these properties, such as expectation and variance, will depend on the value of [math]X[/math], thus we can think of them as being functions of [math]X[/math]. The conditional expectation of [math]Y[/math] is denoted [math]E(Y|X=x)[/math] and tells us the expectation of [math]Y[/math] given that [math]X[/math] has taken the particular value [math]x[/math]. Since this will vary with the particular value taken by [math]X[/math] we can think of [math]E(Y|X=x)=m(x)[/math], as a function of [math]x[/math].
As an example think of the population of all working individuals and let [math]X[/math] be years of education and [math]Y[/math] be hourly wages. [math]E(Y|X=12)[/math] is the expected hourly wage for all those people who have [math]12[/math] years of education while [math]E(Y|X=16)[/math] tells us the expected hourly wage for all those who have [math]16[/math] years of education. Tracing out the values of [math]E(Y|X=x)[/math] for all values of [math]X[/math] tells us a lot about how education and wages are related.
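There is no wage data in our running example, but the same idea can be illustrated with Example 1: the sketch below (Python, our own function names) computes [math]m\left( h\right) =E\left( W|H=h\right)[/math] for each value of [math]H[/math], showing that the conditional expectation of the wife's income varies with the husband's income.

```python
# Joint distribution of Example 1, keyed by (h, w), with £100 coded as 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}

def e_w_given_h(joint, h):
    """E(W | H = h) = sum over w of w * Pr(W = w | H = h)."""
    p_h = sum(p for (hh, w), p in joint.items() if hh == h)
    return sum(w * p / p_h for (hh, w), p in joint.items() if hh == h)

print({h: round(e_w_given_h(joint, h), 4) for h in (0, 1, 2)})
# {0: 1.0, 1: 0.6667, 2: 1.0}  -- m(h) = E(W | H = h) varies with h
```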
In econometrics we typically summarise the relationship represented by [math]E(Y|X)=m(X)[/math] in the form of a simple function. For example we could use a simple linear function:
[math]E(WAGE|EDUC) = 1.05 + 0.45\ast EDUC[/math]
or a non-linear function:
[math]E(QUANTITY|PRICE)=10/PRICE,[/math]
with the latter example demonstrating the deficiencies of correlation as a measure of association (since it confines itself to the consideration of linear relationships only).
Properties of Conditional Expectation
The following properties hold for both discrete and continuous random variables.
- [math]E[c(X)|X]=c(X)[/math] for any function [math]c(X)[/math]. Functions of [math]X[/math] behave as constants when we compute expectations conditional on [math]X[/math]: if we know the value of [math]X[/math], then we know the value of [math]c(X)[/math], so it is effectively a constant.
- For functions [math]a(X)[/math] and [math]b(X)[/math], [math]E[a(X)Y+b(X)|X]=a(X)E(Y|X)+b(X)[/math]. This is an extension of the previous rule’s logic: since we are conditioning on [math]X[/math], we can treat [math]X[/math], and any function of [math]X[/math], as a constant when we take the expectation.
- If [math]X[/math] and [math]Y[/math] are independent, then [math]E(Y|X)=E(Y)[/math]. This follows immediately from the earlier discussion of conditional probability distributions: if the two random variables are independent, then knowledge of the value of [math]X[/math] should not change our view of the likelihood of any value of [math]Y[/math], and therefore should not change our view of the expected value of [math]Y[/math]. A special case is where [math]U[/math] and [math]X[/math] are independent and [math]E(U)=0[/math]; it is then clear that [math]E(U|X)=0[/math].
- [math]E[E(Y|X)]=E(Y)[/math]. This result is known as the "iterative expectations" rule. We can think of [math]E(Y|X)[/math] as a function of [math]X[/math]; since [math]X[/math] is a random variable, [math]E(Y|X)=m(X)[/math] is itself a random variable, and it makes sense to think about its distribution and hence its expected value. For example, suppose [math]E(WAGE|EDUC)=4+0.6\ast EDUC[/math] and [math]E(EDUC)=11.5[/math]. Then, according to the iterative expectations rule,
[math]E(WAGE)=E(4+0.6\ast EDUC)=4+0.6(11.5)=10.9.[/math]
- If [math]E(Y|X)=E(Y)[/math], then [math]Cov(X,Y)=0[/math].
The last two properties have immediate applications in econometric modelling: if [math]U[/math] and [math]X[/math] are random variables with [math]E(U|X)=0[/math], then [math]E(U)=0[/math] and [math]Cov(U,X)=0[/math].
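The iterative expectations rule can also be verified numerically for Example 1. In the Python sketch below (our own variable names), averaging [math]E\left( W|H=h\right)[/math] over the marginal distribution of [math]H[/math] recovers [math]E\left( W\right) =0.9[/math]:

```python
# Checking E[E(W|H)] = E(W) for Example 1; keys are (h, w), with £100 coded as 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}

p_h = {h: sum(p for (hh, w), p in joint.items() if hh == h) for h in (0, 1, 2)}
m   = {h: sum(w * p / p_h[h] for (hh, w), p in joint.items() if hh == h)
       for h in (0, 1, 2)}                                   # m(h) = E(W | H = h)

e_w_direct   = sum(w * p for (h, w), p in joint.items())     # E(W) from the joint pmf
e_w_iterated = sum(m[h] * p_h[h] for h in (0, 1, 2))         # E[E(W | H)]
print(round(e_w_direct, 4), round(e_w_iterated, 4))          # 0.9 0.9
```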
Finally, [math]E(Y|X)[/math] is often called the regression of [math]Y[/math] on [math]X[/math]. We can always write
[math]Y=E(Y|X)+U[/math]
where, by the above properties, [math]E(U|X)=0[/math]. Now consider [math]E(U^{2}|X)[/math], which is
[math]E\left( U^{2}|X\right) =E\left[ \left( Y-E\left( Y|X\right) \right) ^{2}|X\right] =var\left( Y|X\right)[/math]
the conditional variance of [math]Y[/math] given [math]X[/math].
In general, it can be shown that
[math]var\left( Y\right) =E\left[ var\left( Y|X\right) \right] +var\left[ E\left(Y|X\right) \right] .[/math]
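As a check, the following Python sketch (our own names) verifies this variance decomposition for Example 1: [math]E\left[ var\left( W|H\right) \right] +var\left[ E\left( W|H\right) \right] =0.49=var\left( W\right)[/math].

```python
# Checking var(W) = E[var(W|H)] + var[E(W|H)] for Example 1.
joint = {
    (0, 0): 0.05, (1, 0): 0.15, (2, 0): 0.10,
    (0, 1): 0.10, (1, 1): 0.10, (2, 1): 0.30,
    (0, 2): 0.05, (1, 2): 0.05, (2, 2): 0.10,
}  # keys are (h, w), with £100 coded as 1

p_h = {h: sum(p for (hh, w), p in joint.items() if hh == h) for h in (0, 1, 2)}

def cond_mean_var(h):
    """Mean and variance of the conditional distribution of W given H = h."""
    pmf = {w: joint[(h, w)] / p_h[h] for (hh, w) in joint if hh == h}
    mu = sum(w * p for w, p in pmf.items())
    return mu, sum((w - mu) ** 2 * p for w, p in pmf.items())

means     = {h: cond_mean_var(h)[0] for h in (0, 1, 2)}
variances = {h: cond_mean_var(h)[1] for h in (0, 1, 2)}

e_of_var  = sum(variances[h] * p_h[h] for h in (0, 1, 2))                  # E[var(W|H)]
mean_of_m = sum(means[h] * p_h[h] for h in (0, 1, 2))                      # = E(W) = 0.9
var_of_e  = sum((means[h] - mean_of_m) ** 2 * p_h[h] for h in (0, 1, 2))   # var[E(W|H)]
print(round(e_of_var, 4), round(var_of_e, 4), round(e_of_var + var_of_e, 4))
# 0.4667 0.0233 0.49  -- the two pieces add up to var(W)
```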
Footnotes
- ↑ You could repeat the calculations with [math]1[/math] and [math]2[/math] instead of [math]£ 100[/math] and [math]£ 200[/math] and you will get different values for the covariance and variances, but an identical final correlation result.