StatPrelim
Introduction
The subject of statistics is concerned with scientific methods for collecting, organizing, summarizing and presenting data (numerical information). The power and utility of statistics derives from being able to draw valid conclusions (inferences), and make reasonable decisions, on the basis of the available data. (The term statistics is also used in a much narrower sense when referring to the data themselves or various other numbers derived from any given set of data. Thus we hear of employment statistics (% of people unemployed), accident statistics (number of road accidents involving drunk drivers), etc.)
Data arise in many spheres of human activity and in all sorts of different contexts in the natural world about us. Such data may be obtained as a matter of course (e.g., meteorological records, daily closing prices of shares, monthly interest rates, etc), or they may be collected by survey or experiment for the purposes of a specific statistical investigation. An example of an investigation using statistics was the Survey of British Births, 1970, the aim of which was to improve the survival rate and care of British babies at or soon after birth. To this end, data on new born babies and their mothers were collected and analysed. The Family Expenditure Survey regularly collects information on household expenditure patterns - including amounts spent on lottery tickets.
In this course we shall say very little about how our data are obtained; to a large degree this shall be taken as given. Rather, this course aims simply to describe, and direct you in the use of, a set of tools which can be used to analyse a given set of data. The reason for the development of such techniques is so that evidence can be brought to bear on particular questions, for example,
- Why do consumption patterns vary from individual to individual?
- Compared to those currently available, does a newly developed medical test offer a significantly higher chance of correctly diagnosing a particular disease?
or theories/hypotheses, such as,
- “The majority of the voting population in the UK is in favour of a single European Currency”
- “Smoking during pregnancy adversely affects the birth weight of the unborn child”
- “Average real earnings of females, aged [math]30-50,[/math] have risen over the past [math]20[/math] years”
which are of interest in the social/natural/medical sciences.
Statistics is all about using the available data to shed light on such questions and hypotheses. At this level, there are a number of similarities between a statistical investigation and a judicial investigation. For the statistician the evidence comes in the form of data and these need to be interrogated in some way so that plausible conclusions can be drawn. In this course we attempt to outline some of the fundamental methods of statistical interrogation which may be applied to data. The idea is to get as close to the truth as is possible; although, as in a court of law, the truth may never be revealed and we are therefore obliged to make reasonable judgements about what the truth might be based on the evidence (the data) and our investigations (analysis) of it.
For example, think about the following. Suppose we pick at random [math]100[/math] male and [math]100[/math] female University of Manchester first year undergraduates (who entered with A-levels) and recover the A-level points score for each of the [math]200[/math] students selected. How might we use these data to say something about whether or not, in general, (a) female students achieve higher A-level grades than males, or (b) female first year undergraduates are more intelligent than first year undergraduate males? How convincing will any conclusions be?
We begin with some definitions and concepts which are commonly used in statistics:
Some Definitions and Concepts
DATA: body of numerical evidence (i.e., numbers)
EXPERIMENT: any process which generates data
For example, the following are experiments:
- select a number of individuals from the UK voting population and ask how they will vote in the forthcoming General Election
- flip a coin twice and note down whether or not you get a HEAD at each flip
- interview a number of unemployed individuals and obtain information about their personal characteristics (age, educational background, family circumstances, previous employment history, etc) and the local and national economic environment. Interview them again at regular three-monthly intervals for 2 years in order to model (i.e., say something about possible causes of) unemployment patterns
- give differing dosage levels of a particular hormone to each of [math]10[/math] rats and then record the elapsed time to observing a particular (benign) reaction (to the hormone) in each of the rats.
An experiment which generates data for use by the statistician is often referred to as sampling, with the data so generated being called a sample (of data). The reason for sampling is that it would be impossible (or at the very least too costly) to interview all unemployed individuals in order to shed light on the causes of unemployment variations, or all members of the voting population in order to say something about the outcome of a General Election. We therefore select a number of them in some way (not all of them), analyse the data on this sub-set, and then (hopefully) conclude something useful about the population of interest in general. The initial process of selection is called sampling and the conclusions drawn (about the general population from which the sample was drawn) constitute statistical inference:
- SAMPLING: the process of selecting individuals (single items) from a population.
- POPULATION: a description of the totality of items with which we are interested. It will be defined by the issue under investigation.
- Sampling/experimentation yields a SAMPLE (of items), and it is the sample which ultimately provides the data used in statistical analysis.
It is tremendously important at this early stage to reflect on this and to convince yourself that the results of sampling can not be known with certainty. That is to say, although we can propose a strategy (or method) whereby a sample is to be obtained from a given population, we can not predict exactly what the sample will look like (i.e., what the outcome will be) once the selection process, and subsequent collection of data, has been completed. For example, just consider the outcome of the following sampling process: ask [math]10[/math] people in this room whether or not they are vegetarian; record the data as [math]1[/math] for yes and [math]0[/math] for no/unsure. How many [math]1[/math]’s will you get? The answer is uncertain, but will presumably be an integer in the range [math]0[/math] to [math]10[/math] (and nothing else).
Thus, the design of the sampling process (together with the sort of data that is to be collected) will rule out certain outcomes. Consequently, although not knowing exactly the sample data that will emerge we can list or provide some representation of what could possibly be obtained and such a listing is called a sample space:
- SAMPLE SPACE: a listing, or representation, of all possible samples that could be obtained
The following example brings all of these concepts together and we shall often use this simple scenario to illustrate various concepts:
- Example:
- Population: a coin which, when flipped, lands either H (Head) or T (Tail)
- Experiment/Sampling: flip the coin twice and note down H or T
- Sample: consists of two items. The first item indicates H or T from the first flip; the second indicates H or T from the second flip
- Sample Space: {(H,H),(H,T),(T,H),(T,T)}; list of [math]4[/math] possible outcomes.
The above experiment yields a sample of size [math]2[/math] (items or outcomes) which, when obtained, is usually given a numerical code and it is this coding that defines the data. If we also add that the population is defined by a fair coin, then we can say something about how likely it is that any one of the four possible outcomes will be observed (obtained) if the experiment were to be performed. In particular, elementary probability theory (see section [math]3[/math]), or indeed plain intuition, shows that each of the [math]4[/math] possible outcomes is, in fact, equally likely to occur if the coin is fair.
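The sample space of the coin example can also be listed mechanically. The following is a minimal sketch in Python (the variable names are illustrative only). It assumes a fair coin and, additionally, that the two flips do not influence one another, so that each ordered pair is assigned probability one half times one half.

```python
from fractions import Fraction
from itertools import product

# The two possible results of a single flip.
faces = ["H", "T"]

# Sample space for two flips: every ordered pair of faces.
sample_space = list(product(faces, repeat=2))

# Assumption: fair coin, flips independent of one another, so each
# ordered pair has probability 1/2 * 1/2.
probabilities = {
    outcome: Fraction(1, 2) * Fraction(1, 2) for outcome in sample_space
}

for outcome, prob in probabilities.items():
    print(outcome, prob)
```

The listing contains the same four outcomes as the sample space given above, each with the same probability, and the four probabilities add up to one.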
Thus, in general, although the outcome of sampling can not be known with certainty we will be able to list (in some way) possible outcomes. Furthermore, if we are also willing to assume something about the population from which the sample is to be drawn then, although certainty is still not assured, we may be able to say how likely (or probable) a particular outcome is. This latter piece of analysis is an example of DEDUCTIVE REASONING and the first [math]8[/math] Chapters in this module are devoted to helping you develop the techniques of statistical deductive reasoning. As suggested above, since it addresses the question of how likely/probable the occurrence of certain phenomena is, it will necessitate a discussion of probability.
Continuing the example above, suppose now that the experiment of flipping this coin twice is repeated [math]100[/math] times. If the coin is fair then we have stated (and it can be shown) that, for each of the [math]100[/math] experiments, the [math]4[/math] possible outcomes are equally likely. Thus, it seems reasonable to predict that a (H,H) should arise about [math]25[/math] times, as should (H,T), (T,H) and (T,T) - roughly speaking. This is an example of deductive reasoning. On the other hand, a question you might now consider is the following: if this experiment is carried out [math]100[/math] times and a (T,T) outcome arises [math]50[/math] times, what evidence does this provide on the assumption that the coin is fair? The question asks you to make a judgement (an inference, we say) about the coin’s fairness based on the observation that a (T,T) occurs [math]50[/math] times. This introduces the more powerful notion of statistical inference, which is the subject matter of sections [math]9-16[/math]. The following brief discussion gives a flavour of what statistical inference can do.
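To get a feel for what "about [math]25[/math] times, roughly speaking" means, the repeated experiment can be simulated. The sketch below is illustrative only: the function name `repeat_experiment` and the seed value are assumptions made here, and the pseudo-random number generator merely stands in for a fair coin.

```python
import random
from collections import Counter

def repeat_experiment(repetitions, seed=None):
    """Flip a simulated fair coin twice, `repetitions` times, and count outcomes."""
    rng = random.Random(seed)  # seeded generator so a run can be reproduced
    counts = Counter()
    for _ in range(repetitions):
        # One experiment: two flips, each equally likely to be H or T.
        outcome = (rng.choice("HT"), rng.choice("HT"))
        counts[outcome] += 1
    return counts

counts = repeat_experiment(100, seed=1)
for outcome in [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]:
    print(outcome, counts[outcome])
```

Changing the seed gives a different set of counts. Under the fair coin assumption they will typically be scattered around [math]25[/math], which is why a count of [math]50[/math] for (T,T) would be surprising.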
Firstly, data as used in a statistical investigation are rarely presented for public consumption in raw form. Indeed, it would almost certainly be a meaningless exercise. Rather they are manipulated, summarised and, some would say, distorted! The result of such data manipulation is called a statistic:
- STATISTIC: the result of data manipulation, or any method or procedure which involves data manipulation.
Secondly, data manipulation (the production of statistics) is often performed in order to shed light on some unknown feature of the population, from which the sample was drawn. For example, consider the relationship between a sample proportion [math]\left( p\right) [/math] and the true or actual population proportion [math]\left( \Pi \right) ,[/math] for some phenomenon of interest:
[math]p=\Pi +error[/math]
where, say, [math]p[/math] is the proportion of students in a collected sample of [math]100[/math] who are vegetarian and [math]\Pi [/math] is the proportion of all Manchester University students who are vegetarian. ([math]\Pi [/math] is the upper case Greek letter pi; it is not used here to denote the number Pi [math]=3.14159...[/math].) The question is “what can [math]p[/math] tell us about [math]\Pi [/math]?”, when [math]p[/math] is observed but [math]\Pi [/math] isn’t.
For [math]p[/math] to approximate [math]\Pi [/math] it seems obvious that the [math]error[/math] is required to be ‘small’, in some sense. However, the [math]error[/math] is unknown; if it were known then an observed [math]p[/math] would pinpoint [math]\Pi [/math] exactly. In statistics we characterise situations in which we believe it is highly likely that the [math]error[/math] is small (i.e., less than some specified amount). We then make statements which claim that it is highly likely that the observed [math]p[/math] is close to the unknown [math]\Pi [/math]. Here, again, we are drawing conclusions about the nature of the population from which the observed sample was taken; it is statistical inference. Suppose, for example, that based on [math]100[/math] students the observed proportion of vegetarians is [math]p=0.3.[/math] The theory of statistics (as developed in this course) then permits us to infer that there is a [math]95\%[/math] chance (it is [math]95\%[/math] likely) that the interval [math](0.21,0.39)[/math] contains the unknown true proportion [math]\Pi [/math]. Notice that this interval is symmetric about the value [math]p=0.3[/math] and allows for a margin of error of [math]\pm 0.09[/math] about the observed sample proportion. The term margin of error is often quoted when newspapers report the results of political opinion polls; technically it is called a sampling error - the error which arises from just looking at a subset of the population in which we are interested and not the whole population.
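The text does not say here how the interval [math](0.21,0.39)[/math] is obtained. One standard construction that reproduces these numbers is the normal-approximation interval [math]p\pm 1.96\sqrt{p(1-p)/n}[/math]. The sketch below assumes that formula; the function name `proportion_interval` and the multiplier [math]1.96[/math] are assumptions made for illustration, not something stated in these notes.

```python
import math

def proportion_interval(p, n, z=1.96):
    """Approximate interval for a population proportion.

    Assumes the normal-approximation formula p +/- z * sqrt(p * (1 - p) / n);
    z = 1.96 is the usual multiplier for a 95% interval.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    margin = z * standard_error
    return p - margin, p + margin, margin

# Observed proportion of vegetarians in a sample of 100 students.
lower, upper, margin = proportion_interval(0.3, 100)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Rounded to two decimal places, this gives the interval and the margin of error quoted above.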
For this sort of thing to work it is clearly important that the obtained sample is a fair (or typical) reflection of the population (not atypical): such samples are termed (simple) random samples. For example, we would not sample students as they left a vegetarian restaurant in order to say something about the University student population as a whole! We shall signal that samples are random in this way by using expressions like: ‘consider a random sample of individuals’; ‘observations were randomly sampled’; ‘a number of individuals were selected at random from an underlying population’, etc.
In order to understand the construction of statistics, and their use in inference, you need some basic tools of the trade, which are now described.
Notation and Tools of the Trade
Variables
- a variable is a label, with description, for an event or phenomenon of interest. To denote a variable, we use upper case letters. For example, [math]X,\,\,Y,\,\,Z,[/math] etc are often used as labels.
- a lower case letter, [math]x,[/math] is used to denote an observation obtained (actual number) on the variable [math]X.[/math] (Lower case [math]y[/math] denotes an observation on the variable [math]Y,[/math] etc.)
Example:
Let [math]X[/math] = A-level points score. If we sample [math]4[/math] individuals we obtain [math]4[/math] observations and we label these 4 observations as [math]x_{1}[/math] = ; [math]x_{2}[/math] = ; [math]x_{3}[/math]= ; [math]x_{4}[/math] = . (You can fill in four numbers here.) [math]x_{1}[/math] denotes the first listed number (in this case, A-level points score), [math]x_{2}[/math] the second score, [math]x_{3}[/math] the third and [math]x_{4}[/math] the fourth.
In general, then, [math]x_{i}[/math] is used to denote a number - the [math]i^{th}[/math] observation (or value) for the variable [math]X,[/math] which is read simply as “[math]x[/math] [math]i[/math]”. The subscript [math]i[/math] is usually a positive integer ([math]1,2,3,4,[/math] etc), although terms like [math]x_{0}[/math] and [math]x_{-1}[/math] can occur in more sophisticated types of analyses. Similarly, we use [math]y_{i}[/math] for values of the variable with label [math]Y.[/math] Other possibilities are [math]z_{i},x_{j},y_{k}[/math] etc. Thus, the label we may use for the variable is essentially arbitrary as is the subscript we use to denote a particular observation on that variable.
- the values [math]x_{1},x_{2},x_{3},x_{4},\ldots ,x_{n}[/math] denote a sample of [math]n[/math] observations ([math]n[/math] numbers or values) for the variable [math]X[/math]. The “dots” indicate that the sequence continues until the subscript [math]n[/math] is reached (e.g., if [math]n=10,[/math] there are another [math]6[/math] numbers in the sequence after [math]x_{4}).[/math] For ease of notation we usually write this simply as [math]x_{1},...,x_{n}.[/math] Similarly, [math]y_{1},\ldots ,y_{m}[/math] denotes a sample of [math]m[/math] observations for the variable [math]Y.[/math]
Summation notation
The summation notation is used to signify the addition of a set of numbers. Let an arbitrary set of numbers be denoted [math]x_{1},[/math] ..., [math]x_{n}.[/math]
- the symbol [math]\sum [/math] is used: the Greek letter capital sigma. English equivalent is S, for sum.
And we have the following definition of [math]\sum :[/math]
[math]x_{1}+x_{2}+\ldots +x_{n}\equiv \sum_{i=1}^{n}x_{i},[/math] or [math]\sum\limits_{i=1}^{n}x_{i}[/math]
which means add up the [math]n[/math] numbers, [math]x_{1}[/math] to [math]x_{n}[/math]. The expression [math]\sum_{i=1}^{n}x_{i}[/math] is read as “the sum from [math]i[/math] equals [math]1[/math] to [math]n[/math] of [math]x_{i}[/math]”, or “the sum over [math]i,[/math] [math]x_{i}[/math]”.
For example, the total A-level points score from the [math]4[/math] individuals is:
[math]x_{1}+x_{2}+x_{3}+x_{4}\equiv \sum_{i=1}^{4}x_{i}=\qquad .[/math]
Note that for any [math]n[/math] numbers, [math]x_{1},\ldots ,x_{n}[/math], the procedure which is defined by [math]\sum_{i=1}^{n}x_{i}[/math] is exactly the same as that defined by [math]\sum_{j=1}^{n}x_{j},[/math] because
[math]\sum_{i=1}^{n}x_{i}=x_{1}+x_{2}+\ldots +x_{n}=\sum_{j=1}^{n}x_{j}.[/math]
You might also encounter [math]\sum_{i=1}^{n}x_{i}[/math] expressed in the following ways:
- [math]\sum\limits_{i}x_{i}[/math] or [math]\sum_{i}x_{i};\qquad \sum x_{i};\qquad [/math] or even, [math]\sum x[/math].
Moreover, since the labelling of variables, and corresponding observations, is arbitrary, the sum of [math]n[/math] observations on a particular variable, can be denoted equivalently as [math]\sum_{k=1}^{n}y_{k}=y_{1}+y_{2}+\ldots +y_{n},[/math] if we were to label the variable as [math]Y[/math] rather than [math]X[/math] and use observational subscript of [math]k[/math] rather than [math]i.[/math]
Rules of Summation
- Let [math]c[/math] be some fixed number (e.g., let [math]c=2[/math]) and let [math]x_{1},\ldots ,x_{n}[/math] denote a sample of [math]n[/math] observations for the variable [math]X[/math]:
[math]\sum_{i=1}^{n}(cx_{i})=cx_{1}+cx_{2}+\ldots +cx_{n}=c(x_{1}+x_{2}+\ldots +x_{n})=c\left( \sum_{i=1}^{n}x_{i}\right)[/math]
In the above sense [math]c[/math] is called a constant (it does not have a subscript attached). It is constant in relation to the variable [math]X,[/math] whose values (denoted [math]x_{i}[/math]) are allowed to vary across observations (over [math]i[/math]).
Notice that when adding numbers together, the order in which we add them is irrelevant. With this in mind we have the following result:
Let [math]y_{1},\ldots ,y_{n}[/math] be a sample of [math]n[/math] observations on the variable labelled [math]Y[/math]:
[math]\begin{aligned} \sum_{i=1}^{n}(x_{i}+y_{i}) &=&(x_{1}+y_{1})+(x_{2}+y_{2})+\ldots +(x_{n}+y_{n}) \\ &=&(x_{1}+x_{2}+\ldots +x_{n})+(y_{1}+y_{2}+\ldots +y_{n}) \\ &=&\left( \sum_{i=1}^{n}x_{i}\right) +\left( \sum_{i=1}^{n}y_{i}\right) \\ &=&\sum_{i=1}^{n}x_{i}+\sum_{i=1}^{n}y_{i}.\end{aligned}[/math]
Combining the above two results we obtain the following:
If [math]d[/math] is another constant then
[math]\begin{aligned} \sum_{i=1}^{n}(cx_{i}+dy_{i}) &=&c\left( \sum_{i=1}^{n}x_{i}\right) +d\left( \sum_{i=1}^{n}y_{i}\right) \\ &=&c\sum_{i=1}^{n}x_{i}+d\sum_{i=1}^{n}y_{i}.\end{aligned}[/math]
[math]cX+dY[/math] is known as a linear combination (of the variable [math]X[/math] and the variable [math]Y)[/math] and is an extremely important concept in the study of statistics.
And, finally
[math]\begin{aligned} \sum_{i=1}^{n}(x_{i}+c) &=&\left( x_{1}+c\right) +\left( x_{2}+c\right) +...+\left( x_{n}+c\right) \\ &=&\left( \sum_{i=1}^{n}x_{i}\right) +\left( n\times c\right)\end{aligned}[/math]
These sorts of results can be illustrated using the following simple example:
Example:
[math]\begin{array}{c|cccc} i & 1 & 2 & 3 & 4 \\ \hline x_{i} & 3 & 3 & 4 & 1 \\ y_{i} & 4 & 4 & 2 & 3 \end{array}[/math] with [math]c=2[/math] and [math]d=3.[/math]
[math]\begin{array}{ll} \text{(i)} & \sum_{i=1}^{4}cx_{i}=c\sum_{i=1}^{4}x_{i}; \\ \text{(ii)} & \sum_{i=1}^{4}(x_{i}+y_{i})=\sum_{i=1}^{4}x_{i}+\sum_{i=1}^{4}y_{i}; \\ \text{(iii)} & \sum_{i=1}^{4}(cx_{i}+dy_{i})=c\sum_{i=1}^{4}x_{i}+d\sum_{i=1}^{4}y_{i}.\end{array}[/math]
You should be able to verify these for yourselves as follows. Firstly, the left hand side of (i) is
[math]\begin{aligned} \sum_{i=1}^{4}cx_{i} &=&\left( 2\times 3\right) +\left( 2\times 3\right) +\left( 2\times 4\right) +\left( 2\times 1\right) \\ &=&6+6+8+2 \\ &=&22\end{aligned}[/math]
and the right hand side of (i) is
[math]\begin{aligned} c\sum_{i=1}^{4}x_{i} &=&2\times \left( 3+3+4+1\right) \\ &=&2\times 11=22.\end{aligned}[/math]
Establishing (ii) and (iii) follows in a similar way (try it by working out separately the left hand side and right hand side of each of (ii) and (iii)).
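The three results, together with the final rule for adding a constant to each observation, can also be checked numerically. The following sketch uses the example data ([math]x_{i}[/math]: 3, 3, 4, 1; [math]y_{i}[/math]: 4, 4, 2, 3; [math]c=2[/math]; [math]d=3[/math]); the variable names are chosen here purely for illustration.

```python
# Example data from the text.
x = [3, 3, 4, 1]
y = [4, 4, 2, 3]
c, d = 2, 3
n = len(x)

# (i) a constant factor can be taken outside the sum.
lhs_i = sum(c * xi for xi in x)
rhs_i = c * sum(x)

# (ii) the sum of (x_i + y_i) equals the sum of x_i plus the sum of y_i.
lhs_ii = sum(xi + yi for xi, yi in zip(x, y))
rhs_ii = sum(x) + sum(y)

# (iii) linear combination: combines (i) and (ii).
lhs_iii = sum(c * xi + d * yi for xi, yi in zip(x, y))
rhs_iii = c * sum(x) + d * sum(y)

# Adding the constant c to each of the n observations adds n * c to the sum.
lhs_const = sum(xi + c for xi in x)
rhs_const = sum(x) + n * c

print(lhs_i, rhs_i)          # 22 22
print(lhs_ii, rhs_ii)        # 24 24
print(lhs_iii, rhs_iii)      # 61 61
print(lhs_const, rhs_const)  # 19 19
```

In each case the left hand side and right hand side are computed separately, which mirrors the hand calculation suggested above.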
HEALTH WARNING! However, beware of “casual” application of the summation notation (think about what you’re trying to achieve). In the above example
[math]\begin{array}{ll} \text{(iv)} & \sum_{i=1}^{4}x_{i}y_{i}\neq \left( \sum_{i=1}^{4}x_{i}\right) \left( \sum_{i=1}^{4}y_{i}\right) \\ \text{(v)} & \sum_{i=1}^{4}\left( \dfrac{x_{i}}{y_{i}}\right) \neq \dfrac{\sum_{i=1}^{4}x_{i}}{\sum_{i=1}^{4}y_{i}}.\end{array}[/math]
The left hand side of (iv) is the “sum of the products”
[math]\sum_{i=1}^{4}x_{i}y_{i}=12+12+8+3=35,[/math]
whilst the right hand side is the “product of the sums”
[math]\left( \sum_{i=1}^{4}x_{i}\right) \left( \sum_{i=1}^{4}y_{i}\right) =\left( 3+3+4+1\right) \left( 4+4+2+3\right) =11\times 13=143\neq 35.[/math]
Now, show that the left hand side of (v) is not equal to the right hand side of (v).
Also, the square of a sum is not (in general) equal to the sum of the squares. By which we mean:
[math]\left( \sum_{i=1}^{n}x_{i}\right) ^{2}\neq \sum_{i=1}^{n}x_{i}^{2}[/math]
where [math]x_{i}^{2}=(x_{i})^{2},[/math] the squared value of [math]x_{i}.[/math] This is easily verified since, for example, [math](-1+1)^{2}\neq (-1)^{2}+1^{2}.[/math] Or, using the preceding example, [math]\sum_{i=1}^{4}x_{i}^{2}=9+9+16+1=35,[/math] whilst [math]\left( \sum_{i=1}^{4}x_{i}\right) ^{2}=\left( 3+3+4+1\right) ^{2}=11^{2}=121.[/math]
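The warnings can be confirmed with the same example data. The sketch below is one way to check your hand calculations, including the exercise on (v); exact fractions are used for the ratios so that no rounding is involved. The variable names are illustrative assumptions.

```python
from fractions import Fraction

# Example data from the text.
x = [3, 3, 4, 1]
y = [4, 4, 2, 3]

# (iv) sum of the products versus product of the sums.
sum_of_products = sum(xi * yi for xi, yi in zip(x, y))
product_of_sums = sum(x) * sum(y)

# (v) sum of the ratios versus ratio of the sums (exact arithmetic).
sum_of_ratios = sum(Fraction(xi, yi) for xi, yi in zip(x, y))
ratio_of_sums = Fraction(sum(x), sum(y))

# Sum of the squares versus square of the sum.
sum_of_squares = sum(xi ** 2 for xi in x)
square_of_sum = sum(x) ** 2

print(sum_of_products, product_of_sums)  # 35 143
print(sum_of_ratios, ratio_of_sums)      # 23/6 11/13
print(sum_of_squares, square_of_sum)     # 35 121
```

In all three pairs the two quantities differ, so none of these operations can be interchanged with the summation in general.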
- Example: Consider a group of [math]10[/math] students (i.e., [math]n=10[/math]) out celebrating on a Friday night, in a particular pub, and each buy their own drinks. Let [math]x_{i}[/math] denote the number of pints of beer consumed by individual [math]i;[/math] [math]y_{i},[/math] the number of glasses of white wine; [math]z_{i},[/math] the number of bottles of lager. If only beer, wine and lager are consumed at prices (in pence) of [math]a[/math] for a pint of beer, [math]b[/math] for a glass of white wine, [math]c[/math] for a bottle of lager, then the expenditure on drinks by individual [math]i,[/math] denoted [math]e_{i},[/math] is: [math]e_{i}=ax_{i}+by_{i}+cz_{i}.[/math] Total expenditure on drinks is then: [math]\sum_{i=1}^{10}e_{i}=a\sum_{i=1}^{10}x_{i}+b\sum_{i=1}^{10}y_{i}+c\sum_{i=1}^{10}z_{i}.[/math]
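The expenditure example can be illustrated with numbers. In the sketch below the prices and the quantities consumed are entirely hypothetical, invented here only to show that summing the individual expenditures [math]e_{i}[/math] gives the same total as applying the linear combination rule.

```python
# Hypothetical prices in pence (not taken from the text).
a, b, c = 250, 300, 280  # pint of beer, glass of white wine, bottle of lager

# Hypothetical quantities consumed by each of the n = 10 students.
x = [2, 0, 1, 3, 0, 2, 1, 0, 4, 1]  # pints of beer
y = [0, 2, 1, 0, 3, 0, 1, 2, 0, 0]  # glasses of white wine
z = [1, 1, 0, 0, 0, 2, 1, 0, 0, 3]  # bottles of lager

# Expenditure by individual i: e_i = a*x_i + b*y_i + c*z_i.
e = [a * xi + b * yi + c * zi for xi, yi, zi in zip(x, y, z)]

# Total expenditure computed two ways.
total_direct = sum(e)
total_by_rule = a * sum(x) + b * sum(y) + c * sum(z)

print(total_direct, total_by_rule)
```

The two totals agree whatever prices and quantities are used, which is exactly what the summation rules assert.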