Difference between revisions of "DataTypes"

From ECLR

Revision as of 15:03, 11 April 2013

Types of data

Broadly speaking, by ‘data’ we mean numerical values associated with some variable of interest. However, we must not be overly complacent about such a broad definition; we must be aware of different types of data that may need special treatment when it comes to statistical analysis. For this reason it is important to be able to distinguish a few key features. A (random) variable can produce data that are of either a continuous or a discrete nature (see below for examples). Another level at which variables differ is whether they are sampled over time or in a cross-section.

Discrete data

The variable [math]X[/math] is said to be discrete if it can only ever yield isolated values, some (if not all) of which are often repeated in the sample. It is, however, important to note that there are different types of discrete data:

  • ORDINAL. Here the categories have a natural ordering.
    Examples: Football Leagues: Premier League, Championship, etc.
  • NOMINAL. Here there is no natural ordering to the categories.
    Examples: Gender: Male, Female
  • COUNT. A variable that represents the counts of certain events.
    Examples: Number of children in household: 0, 1, 2, 3, etc.
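
The three types above call for different handling in practice. The following is a minimal Python sketch under stated assumptions: the sample values are invented for illustration, and "League One" is added here only to extend the league example. It shows that an ordinal variable needs its ordering supplied explicitly, that a nominal variable supports only frequency counts, and that a count variable supports arithmetic.

```python
from collections import Counter

# Hypothetical sample values, invented purely for illustration.
leagues = ["Championship", "Premier League", "League One", "Premier League"]
genders = ["Female", "Male", "Female", "Female"]
children = [0, 2, 1, 2, 0, 3, 2]

# ORDINAL: the ordering is part of the variable's definition and must be
# supplied explicitly; sorting the labels alphabetically would be wrong.
league_order = ["Premier League", "Championship", "League One"]
league_rank = {name: rank for rank, name in enumerate(league_order)}
sorted_leagues = sorted(leagues, key=league_rank.__getitem__)

# NOMINAL: only frequencies are meaningful, not order or arithmetic.
gender_counts = Counter(genders)

# COUNT: values are whole numbers, so a summary such as the mean is meaningful.
mean_children = sum(children) / len(children)

print(sorted_leagues)
print(gender_counts)
print(mean_children)
```

Note that every one of the three samples contains repeated values, which is the typical signature of discrete data.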

Continuous data

Additional Material:
Khan Academy: Discrete and Continuous Variables (https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/discrete-and-continuous-random-variables)

The variable [math]Y[/math] is said to be continuous if it can assume any value (more or less) from a continuum, that is, an interval or range of numbers. A nice way to distinguish between a discrete and a continuous variable is to consider the possibility of listing its possible values. It is theoretically impossible even to begin listing all possible values that a continuous variable [math]Y[/math] could assume. However, this is not so with a discrete variable; you may not always be able to finish the list, but at least you can make a start.

For example, the birth-weight of babies is a continuous variable. There is no reason why a baby should not have a birth-weight of [math]2500.0234[/math] grams, even though it wouldn’t be measured as such! Try to list all possible weights (in theory), bearing in mind that for any two weights that you write down, there will always be another possibility halfway between. We see, then, that for a continuous variable an observation is recorded as the result of applying some measurement, but that this inevitably gives rise to a rounding (up or down) of the actual value. (No such rounding occurs when recording observations on a discrete variable.)
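
The "halfway between" argument can be made concrete with a small sketch. The function name `midpoints` and the two starting weights are hypothetical choices made for illustration; `Fraction` is used so that the arithmetic is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction


def midpoints(low, high, steps):
    """Repeatedly take the value halfway between low and the previous midpoint."""
    values = []
    for _ in range(steps):
        high = (low + high) / 2
        values.append(high)
    return values


# Two weights in grams (illustrative); Fraction keeps every midpoint exact.
low, high = Fraction(2500), Fraction(2501)
between = midpoints(low, high, 5)

for value in between:
    print(float(value))
```

Each pass yields a new weight strictly between the two starting values, and the loop could be continued indefinitely, which is why the list of possible weights can never be written out.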

A variable can be continuous even though it is defined on a limited scale. For instance, the weight variable has a limited scale, as weights cannot be negative.

Finally, note that for a continuous variable, it is unlikely that values will be repeated frequently in the sample, unless rounding occurs.
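
The effect of rounding on repeated values can be checked by simulation. This is only a sketch: the normal distribution, its parameters (mean 3400 grams, standard deviation 500 grams), the sample size and the rounding to the nearest 10 grams are all assumptions made for illustration, not properties of any real data set.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so that the run is repeatable

# Simulated "actual" birth-weights in grams (assumed distribution, see above).
actual = [random.gauss(3400, 500) for _ in range(1000)]

# Recorded observations: rounded to the nearest 10 grams.
recorded = [round(w, -1) for w in actual]


def repeated(values):
    """Number of observations whose value occurs more than once in the sample."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c > 1)


print(repeated(actual), "of", len(actual), "actual values are repeated")
print(repeated(recorded), "of", len(recorded), "recorded values are repeated")
```

The unrounded values should show essentially no repeats, while the rounded observations repeat often, so the recorded data behave much like discrete data.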

Other examples of continuous data include: heights of people; volume of water in a reservoir; and, to a workable approximation, Government Expenditure. One could argue that the last of these is discrete (due to the finite divisibility of monetary units). However, when the amounts involved are of the order of millions of pounds, changes at the level of individual pence are hardly discernible and so it is sensible to treat the variable as continuous.

Cross-section data

Cross-section data comprise observations on a particular variable taken at a single point in time (or within a single period). For example: annual crime figures recorded by Police regions for the year 1999; the birth-weight of babies born, in a particular maternity unit, during the month of April 1998; initial salaries of graduates from the University of Manchester, 2000. Note that the defining feature is that there is no natural ordering in the data.

Time-series data

On the other hand, time-series data are observations on a particular variable recorded over a period of time, at regular intervals. For example: personal crime figures for Greater Manchester recorded annually over 1980-99; monthly household expenditure on food; the daily closing price of a certain stock. In this case, the data do have a natural ordering, since they are measured from one time period to the next.
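
The practical consequence of the ordering can be sketched as follows. All figures are invented for illustration and are not real crime or salary statistics. Period-to-period changes in a time series only make sense once the observations are in time order, whereas summaries of a cross-section are unaffected by shuffling.

```python
import random

# Invented annual figures, for illustration only.
series = {1980: 120, 1981: 135, 1982: 128, 1983: 150}

# Time-series: put the periods in order, then compute period-to-period changes.
years = sorted(series)
changes = {later: series[later] - series[earlier]
           for earlier, later in zip(years, years[1:])}
print(changes)

# Cross-section: invented starting salaries for a single graduating cohort.
salaries = [21000, 24500, 19800, 23000]
shuffled = salaries[:]
random.shuffle(shuffled)

# Order-free summaries such as the mean do not depend on the arrangement.
mean_original = sum(salaries) / len(salaries)
mean_shuffled = sum(shuffled) / len(shuffled)
print(mean_original == mean_shuffled)
```

Shuffling the time series in the same way would make the computed changes meaningless, which is one reason the two kinds of data are analysed with different tools.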
