Difference between revisions of "R Analysis"

From ECLR

Revision as of 15:26, 4 August 2015

In this section we shall demonstrate how to do some basic data analysis on data in a dataframe. Here is an online demonstration of some of the material covered on this page.

Data Upload and Introduction

We shall continue working with the same dataset as in R_Data, the mroz.xls dataset. It is easiest to import the data as we learned in R_Data#Converting_to_NAs_during_import, taking care of missing values (which in the csv datafile are represented by ".") during the data import process

    setwd("X:/Your/full/Path")   # This sets the working directory, ensure data file is in here
    mydata <- read.csv("mroz.csv",na.strings = ".")  # Opens mroz.csv from working directory

This will upload a dataframe mydata into your work environment.
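To see what the na.strings option does without needing the data file, here is a minimal sketch. The three rows of CSV text are made up for illustration and only stand in for mroz.csv; the object names csv_text and toy are likewise hypothetical.

```r
# Toy CSV text standing in for mroz.csv (the values are made up)
csv_text <- "hours,wage
1610,3.354
0,.
480,4.5918"

# na.strings = "." tells read.csv to treat a lone "." as a missing value
toy <- read.csv(text = csv_text, na.strings = ".")

str(toy)               # wage is read as a numeric column
sum(is.na(toy$wage))   # number of missing wage values
```

Without na.strings = "." the "." entries would prevent R from reading wage as a numeric variable.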

Here we will learn how to do very basic descriptive statistics in R. The main tasks are going to be how to apply certain statistics to certain parts of the data. That may be certain variables in our dataframe, but it may also involve us selecting certain rows/observations.

One more warning before we get started. As R is open-source software, there are many ways to do the same thing in R. In fact, many packages have been written to achieve the same thing in slightly different ways. That can be a slightly frustrating aspect of working with R, but as long as you remember that there are many solutions to the same problem you will be fine!

Here we decided to use some functionality from a package called mosaic and you may want to install and load it here:

    install.packages("mosaic")   # only needed once on each computer
    library(mosaic)              # needed at the beginning of every script that uses functions from mosaic

Summary Statistics - Take 1

The easiest way to find basic summary statistics on your variables contained in a dataframe is the following command:

    > summary(mydata)
               inlf            hours           kidslt6          kidsge6           age             educ            wage 
     Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000   Min.   :30.00   Min.   : 5.00   Min.   : 0.1282   
     1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 2.2626   
     Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000   Median :43.00   Median :12.00   Median : 3.4819 
     Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353   Mean   :42.54   Mean   :12.29   Mean   : 4.1777   
     3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 4.9707   
     Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000   Max.   :60.00   Max.   :17.00   Max.   :25.0000   
                                                                                                         NA's   :325            

As you can see, this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as does the dataframe based on mroz.xls, this output can be somewhat lengthy (and therefore isn't shown completely here), but in the next section we learn how to apply this to selected variables.

This output should be fairly self-explanatory. The one interesting aspect is that in the column that describes the wage variable we also find the additional information that there are 325 missing observations (NA's).

Another extremely useful statistic is the correlation between different variables. This is achieved with the cor( ) function.

    > cor(mydata)
                     inlf       hours      kidslt6      kidsge6         age        educ wage      repwage      
    inlf      1.000000000  0.74114539 -0.213749303 -0.002424231 -0.08049811  0.18735285   NA  0.634048541 
    hours     0.741145387  1.00000000 -0.222063296 -0.090632070 -0.03311418  0.10596042   NA  0.606916375 
    kidslt6  -0.213749303 -0.22206330  1.000000000  0.084159872 -0.43394869  0.10869022   NA -0.134908831 
    kidsge6  -0.002424231 -0.09063207  0.084159872  1.000000000 -0.38541134 -0.05889891   NA -0.068680213 
    age      -0.080498109 -0.03311418 -0.433948687 -0.385411341  1.00000000 -0.12022299   NA -0.058314931 
    educ      0.187352846  0.10596042  0.108690218 -0.058898912 -0.12022299  1.00000000   NA  0.267574542 
    wage               NA          NA           NA           NA          NA          NA    1           NA  
    repwage   0.634048541  0.60691637 -0.134908831 -0.068680213 -0.05831493  0.26757454   NA  1.000000000 
    hushrs   -0.065182605 -0.05634759  0.024292257  0.099377891 -0.08437157  0.07891592   NA -0.070797198 
    husage   -0.072820048 -0.03108875 -0.442991438 -0.350199434  0.88813797 -0.13352150   NA -0.055398862 

This is only the top left corner of a huge correlation matrix. We can, for instance, see that the correlation between the variables "hours" and "age" is -0.0331, so slightly negative. But in this table you can also see that all correlations that involve the variable wage are shown as NA or "not available". In this case the reason for this is that there are missing observations for that variable, i.e. some respondents did not report a wage (in fact all those that are categorised as not being in the labour force, inlf = 0). When you start exploring your data it is useful to understand data features like this. Take some time to get familiar with your dataset.
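One quick way to find out which variables contain missing values is to count the NAs in each column. The following is a small sketch on a made-up dataframe (the name toy and its values are assumptions, not part of the mroz data):

```r
# Made-up data: wage is missing for two of the five respondents
toy <- data.frame(educ = c(12, 16, 12, 17, 10),
                  wage = c(3.35, NA, 4.59, NA, 2.10))

na_counts <- colSums(is.na(toy))   # number of NAs in each variable
na_counts

r <- cor(toy)   # correlations involving wage come out as NA
r
```

The same colSums(is.na(...)) call could be applied to mydata to see at a glance which variables will cause NA entries in the correlation matrix.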

Selecting variables

Say you are only interested in the summary statistics for two of the variables hours and husage, then you would want to select these two variables only. The way to do that is the following:

    summary(mydata[c("hours","husage")])

This will produce the following output:

        hours            husage     
    Min.   :   0.0   Min.   :30.00  
    1st Qu.:   0.0   1st Qu.:38.00  
    Median : 288.0   Median :46.00  
    Mean   : 740.6   Mean   :45.12  
    3rd Qu.:1516.0   3rd Qu.:52.00  
    Max.   :4950.0   Max.   :60.00

Let's say we want the correlation between educ, motheduc, fatheduc, then we use in the same manner:

    cor(mydata[c("educ","motheduc","fatheduc")])

resulting in the following correlation matrix

                  educ  motheduc  fatheduc
   educ     1.0000000 0.4353365 0.4424582
   motheduc 0.4353365 1.0000000 0.5730717
   fatheduc 0.4424582 0.5730717 1.0000000

In what we did above we selected a small number of variables from a larger dataset (saved in a dataframe). The way we did that was to call the dataframe and then indicate in square brackets which variables we wanted to select. To understand what this does, go to your console and call

    test1 = mydata[c("hours")]

which will create a new dataframe which includes only the one variable hours. This is very useful, as some functions need to be applied to a dataframe (see for example the "empirical" function in R_Packages).

There is another way to select the hours variable from the dataframe. Try:

    test2 = mydata$hours

This will also select the hours variable. But if you check your environment tab you will see that the data have now been saved in a different type of R object, a vector rather than a dataframe. Some functions will require such an object as input (see for example the "sd" function below).
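The difference between the two ways of selecting a variable can be checked with the class function. This is a small sketch on a made-up dataframe; the names toy, as_df and as_vec are hypothetical.

```r
# Made-up data with the same variable names as in the text
toy <- data.frame(hours = c(1610, 0, 480), husage = c(34, 30, 40))

as_df  <- toy["hours"]   # square brackets keep a dataframe with one column
as_vec <- toy$hours      # the $ operator extracts the column itself

class(as_df)    # "data.frame"
class(as_vec)   # "numeric"

sd(as_vec)      # sd works on the vector
```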

Introducing the subset function

There is yet another way of selecting certain variables. It is by using the subset function. This function deserves its own little section as it is extremely useful and powerful and you will get to know it well.

Try the following

    mydata_set1 <- subset(mydata, select=c("hours","husage"))

and look at the outcome in your Environment. You have created a new dataframe called mydata_set1 that consists of only the two variables hours and husage. Now you could apply the summary function to just this dataframe

    > summary(mydata_set1)
         hours            husage     
     Min.   :   0.0   Min.   :30.00  
     1st Qu.:   0.0   1st Qu.:38.00  
     Median : 288.0   Median :46.00  
     Mean   : 740.6   Mean   :45.12  
     3rd Qu.:1516.0   3rd Qu.:52.00  
     Max.   :4950.0   Max.   :60.00  
    > cor(mydata_set1)
                 hours      husage
    hours   1.00000000 -0.03108875
    husage -0.03108875  1.00000000

In this way you can ensure that you only see those statistics on the screen which you are really interested in. The subset function will also be used when we select rows/observations rather than columns/variables. But we will soon get to that.

Dealing with missing observations

So far all is hunky dory. Let's show some difficulties/issues. Suppose we want to calculate the correlation between educ and wage:

    cor(mydata[c("educ","wage")])

The output we get is:

        educ wage
   educ    1   NA
   wage   NA    1

The reason for R's inability to calculate a correlation between these two variables can be seen here:

    > summary(mydata[c("hours","wage")])
         hours             wage
    Min.   :   0.0   Min.   : 0.1282
    1st Qu.:   0.0   1st Qu.: 2.2626
    Median : 288.0   Median : 3.4819
    Mean   : 740.6   Mean   : 4.1777
    3rd Qu.:1516.0   3rd Qu.: 4.9707
    Max.   :4950.0   Max.   :25.0000
                     NA's   :325

The important information is that the variable wage has 325 missing observations (NA). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing ?cor. The help will pop up in the "Help" tab on the right hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful in writing clearly and it is often most useful to go to the bottom of the help where you can usually find some examples. If you do that you will find that the solution to our problem is the following:

    > cor(mydata[c("educ","wage")],use = "complete")
              educ      wage
    educ 1.0000000 0.3419544
    wage 0.3419544 1.0000000

It is perhaps worth adding a word of explanation here. cor( ) is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, mydata[c("educ","wage")]. Most functions also have what are called parameters. These are like little dials and levers to the function which change how the function works. And one of these levers can be used to tell the function to only use observations that are complete, i.e. don't have missing observations, use = "complete". Read the help function to see what other levers are at your disposal.
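To make concrete what this lever does, the sketch below compares it with removing the incomplete rows by hand using na.omit. The data are made up, and the option is written out in full as "complete.obs" (the shorter "complete" used above is accepted because R matches the beginning of the option name).

```r
# Made-up data: two of the six respondents have no wage
toy <- data.frame(educ = c(12, 16, 12, 17, 10, 14),
                  wage = c(3.35, NA, 4.59, 6.20, 2.10, NA))

r_use  <- cor(toy, use = "complete.obs")   # let cor drop incomplete rows
r_drop <- cor(na.omit(toy))                # drop incomplete rows ourselves first

all.equal(r_use, r_drop)   # checks that both routes give the same matrix
```

Both routes use only the four respondents for whom educ and wage are both available.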

Using Subsets of Data

Often you will want to perform some analysis on a subset of data. The way to do this in R is to use the subset function, together with a logical (boolean) statement. I will first write down the statement and then explain what it does:

    mydata.sub1 <- subset(mydata, hours > 0)

On the left hand side of <- we have a new object named mydata.sub1. On the right hand side of <- we can see how that new object is defined. We are using the function subset() which has been designed to select observations and/or columns from a dataframe such as mydata. This function needs at least two inputs. The first input is the dataframe from which we are selecting observations and variables. Here we are selecting from mydata. The second element indicates which observations/rows we want to select. hours > 0 tells R to select all those observations for which the variable hours is larger than 0.

Often (if not always) you will not remember how exactly a function works. The internet is then usually a good source, but in your console you could also type ?subset which would open a help function. There you could see that you could add a third input to the subset function which would indicate which variables you want to include (e.g. select = c(hours, wage) which would only select these two variables). By not using this third input we indicate to R that it should select all variables in mydata.
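As a small sketch of the subset function with both a row condition and the optional select input, consider the following made-up dataframe (toy and working are hypothetical names):

```r
# Made-up data with four respondents, two of whom did not work
toy <- data.frame(hours = c(1610, 0, 480, 0),
                  wage  = c(3.35, NA, 4.59, NA),
                  age   = c(32, 30, 35, 34))

# keep only rows with positive hours, and only the columns hours and wage
working <- subset(toy, hours > 0, select = c(hours, wage))

working
nrow(working)    # number of selected observations
names(working)   # names of the selected variables
```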

Logical/Boolean Statements

The way in which we selected the observations, i.e. by using the logical statement hours > 0, is worth dwelling on for a moment. These types of logical statements create variables in R that are given the logical data type. Sometimes these are also called boolean variables.

To see what is special about these go to your console and just type something like 5>9 and then press ENTER. You will realise that R is a clever little thing and will tell you that in fact 5 is not larger than 9 by returning the answer FALSE. When we provide R with hours > 0, the software checks, for all our 753 observations, whether the value of the hours variable is larger than 0 or not. It will create a variable (vector) with 753 entries and in each entry there will be either TRUE or FALSE, depending on whether the respective value is larger than 0 or not.
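The following sketch shows such a logical vector for five made-up values of hours. Since R treats TRUE as 1 and FALSE as 0 in arithmetic, sum and mean give the number and the share of observations that satisfy the condition.

```r
# Five made-up values standing in for the hours variable
hours <- c(1610, 0, 480, 0, 1980)

works <- hours > 0   # one TRUE/FALSE entry per observation
works                # TRUE FALSE TRUE FALSE TRUE
class(works)         # "logical"

sum(works)           # 3, the number of TRUE entries
mean(works)          # 0.6, the share of TRUE entries
```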

You can create logical variables on the basis of more complicated logical statements as well. You can combine statements by noting that & represents AND, and | represents OR. You will want to use one of the following relational operators: == checks whether two things are equal; != will check if two things are unequal; > and < take their well known roles. To figure out how these work, try the following statements in your console and see whether you can guess the right answers:

    (3 > 2) & (3 > 1)
    (3 > 2) & (3 > 6)
    (3 > 5) & (3 > 6)
    (3 > 2) | (3 > 1)
    (3 > 2) | (3 > 6)
    (3 > 5) | (3 > 6)
    ((3 == 5) & (3 > 2)) | (3 > 1)

Being comfortable with these logical statements will make the life of every programmer much easier.

Summary Statistics - Take 2

Especially when you are dealing with categorical data it is often useful to look at frequency tables, i.e. tables with counts of all possible values. The function that achieves this is the table function. Try:

    > table(mydata$kidslt6)

and you will get:

      0   1   2   3 
    606 118  26   3 

which tells you that there were 118 women with one child younger than 6. It turns out that you could also use the table function with the alternative way of selecting a variable, i.e. mydata[c("kidslt6")]. There is, unfortunately, no easy way of knowing which function works with which way of selecting variables from a dataframe. You will just have to try or read the relevant help function.

You can also produce a cross tabulation by adding a second variable:

    > table(mydata$kidslt6,mydata$kidsge6)

which produces:

          0   1   2   3   4   5   6   7   8
      0 229 144 121  75  26   9   0   1   1
      1  17  35  36  24   3   3   0   0   0
      2  11   5   5   3   1   0   1   0   0
      3   1   1   0   1   0   0   0   0   0

There are 24 women that have three children at least 6 years old and one younger child.
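Cross tabulations are easier to read when rows and columns are labelled and totals are shown. The sketch below uses a made-up dataframe; the labels young and older are hypothetical names chosen here, and addmargins is a base R function that appends row and column sums.

```r
# Made-up data with the same variable names as in the text
toy <- data.frame(kidslt6 = c(0, 1, 0, 2, 1, 0),
                  kidsge6 = c(2, 0, 2, 1, 0, 3))

# naming the inputs labels the two dimensions of the table
tab <- table(young = toy$kidslt6, older = toy$kidsge6)
tab

addmargins(tab)   # same table with row and column totals added
```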

There are a number of basic summary statistics that are part of every basic data toolbox, such as means, medians and standard deviations for a set of data. Let's take a particular variable, the wage variable. Try the following command:

    mean(mydata$wage)

You could replace mean with median, sd, var, min or max (which all represent obvious sample summary statistics). The result is always that you will find an unpleasant NA. Why is this? If you look again at the wage data you will see that there are missing data in here (as we already discovered above). Now check the details of any of these functions by, say, typing ?mean into the console. Reading through the help function you will find that you will need to add the parameter na.rm=TRUE to your function call. So:

    mean(mydata$wage,na.rm=TRUE)

will deliver the sample mean of 4.177682. This additional parameter essentially instructs the function mean to remove all NAs.

While you could produce all sorts of summary statistics individually as just indicated, you could also obtain all in one go by using

    summary(mydata$wage,na.rm=TRUE)

which will return the mean, median, max, min, and quartiles (but annoyingly not the standard deviation).
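If you want the standard deviation alongside the other statistics, one option is to apply several functions in one go. This is only a sketch on made-up wage values; the names wage and wage_stats are hypothetical.

```r
# Made-up wage values with two missing observations
wage <- c(3.35, NA, 4.59, 6.20, 2.10, NA)

mean(wage)                 # NA, because of the missing values
mean(wage, na.rm = TRUE)   # mean of the four available values

# apply mean, median and sd in one go, each with na.rm = TRUE
wage_stats <- sapply(list(mean = mean, median = median, sd = sd),
                     function(f) f(wage, na.rm = TRUE))
wage_stats
```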

Re-classifying categorical/factor variables

When you have categorical data you may often want to re-classify your categories into new, usually broader categories. In the current dataset this isn't really an issue, but let's say we did have an ethnicity variable in our dataframe. For argument's sake assume that these data are in mydata$Ethnicity and that they are encoded as a factor variable.

The reason for re-classifying (or re-coding) is that sometimes some categories will be too small. To find your frequencies you can use the table(mydata$Ethnicity) or summary(mydata$Ethnicity) command. If you do that you may find something like:

         Asian         Black   Mixed Asian
           120           254             2 
    Mixed Black  Mixed White         White    
            15            12           350   

Let's say you want to amalgamate the Mixed categories into one big "Mixed" category. Here is the easiest way to do this. We create a new variable in our dataframe

   mydata$Eth_cat <- as.character(0)  # new variable is called Eth_cat, initially as character variable

Now we need to define the values this new variable should take:

    mydata$Eth_cat[mydata$Ethnicity == "Asian"] <- "Asian"
    mydata$Eth_cat[mydata$Ethnicity == "Black" ] <- "Black"
    mydata$Eth_cat[mydata$Ethnicity == "Mixed Asian" ] <- "Mixed"
    mydata$Eth_cat[mydata$Ethnicity == "Mixed Black" ] <- "Mixed"
    mydata$Eth_cat[mydata$Ethnicity == "Mixed White" ] <- "Mixed"
    mydata$Eth_cat[mydata$Ethnicity == "White"] <- "White"

In each line we are selecting all rows in the dataframe for which the Ethnicity variable takes a certain value, e.g. mydata$Eth_cat[mydata$Ethnicity == "Asian"], all rows with Asian respondents. Then we assign <- "Asian" to these rows. We do this for all possible categories in Ethnicity. What we have created at this stage is a new variable with all the desired values. It is, however, at this stage a text-based variable and it may be of advantage to transform it to a factor (categorical) variable. This is very straightforward:

    mydata$Eth_cat <- as.factor(mydata$Eth_cat)

and you are good to go!
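As an alternative to the line-by-line assignments above, a factor can also be re-coded by renaming its levels: levels that are given the same name are merged. The sketch below is not the method used above but a different route to the same result, shown on a made-up factor (eth and eth_cat are hypothetical names).

```r
# Made-up factor with the categories from the example above
eth <- factor(c("Asian", "Mixed Asian", "White",
                "Mixed Black", "Black", "Mixed White"))

eth_cat <- eth
# rename every level that starts with "Mixed"; identical names are merged
levels(eth_cat)[grepl("^Mixed", levels(eth_cat))] <- "Mixed"

levels(eth_cat)
table(eth_cat)
```

The result is already a factor, so no conversion with as.factor is needed afterwards.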