Difference between revisions of "R Analysis"

  
 
This will also select the <source enclose=none>hours</source> variable. But if you check your environment tab you will see that the data have now been saved in a different type of R object, a numeric vector rather than a dataframe. Some functions will require such an object as input (see for example the "sd" function below).
 
 
== Dealing with missing observations ==
 
 
So far all is hunky dory. Let's show some difficulties/issues. Suppose we want to calculate the correlation between <source enclose=none>educ</source> and <source enclose=none>wage</source>:
 
 
    cor(mydata[c("educ","wage")])
 
 
The output we get is:
 
 
         educ wage
    educ    1   NA
    wage   NA    1
 
 
The reason for R's inability to calculate a correlation between these two variables can be seen here:
 
 
    > summary(mydata[c("hours","wage")])
        hours             wage
    Min.   :   0.0   Min.   : 0.1282
    1st Qu.:   0.0   1st Qu.: 2.2626
    Median : 288.0   Median : 3.4819
    Mean   : 740.6   Mean   : 4.1777
    3rd Qu.:1516.0   3rd Qu.: 4.9707
    Max.   :4950.0   Max.   :25.0000
                     NA's   :325
 
 
The important information is that the variable <source enclose=none>wage</source> has 325 missing observations (NA). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing <source enclose=none>?cor</source> into the console. The help will pop up in the "Help" tab on the right hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful at writing clearly, and it is often most useful to go to the bottom of the help page, where you can usually find some examples. If you do that you will find that the solution to our problem is the following:
 
 
    > cor(mydata[c("educ","wage")],use = "complete")
              educ      wage
    educ 1.0000000 0.3419544
    wage 0.3419544 1.0000000
 
 
It is perhaps worth adding a word of explanation here. <source enclose=none>cor( )</source> is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, <source enclose=none>mydata[c("educ","wage")]</source>. Most functions also have what are called parameters. These are like little dials and levers on the function which change how the function works. One of these levers can be used to tell the function to only use observations that are complete, i.e. don't have missing values: <source enclose=none>use = "complete"</source>. Read the help page to see what other levers are at your disposal.
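To see this lever at work without the full dataset, here is a minimal sketch. The dataframe <source enclose=none>toy</source> and its values are made up for illustration (they are not the mroz data). The sketch uses the full option name <source enclose=none>"complete.obs"</source>; the shorter <source enclose=none>"complete"</source> used above works because R matches it to that option.

```r
# Toy dataframe; the values are invented for illustration only.
toy <- data.frame(educ = c(12, 16, 10, 14, 12),
                  wage = c(3.5, 6.0, NA, 4.8, 2.9))

# Default behaviour: the missing wage makes the educ/wage correlation NA.
cor_default <- cor(toy)
cor_default

# Only use rows in which no variable is missing.
cor_complete <- cor(toy, use = "complete.obs")
cor_complete

# The same number by hand: drop incomplete rows first, then correlate.
keep <- complete.cases(toy)
cor_by_hand <- cor(toy$educ[keep], toy$wage[keep])
cor_by_hand
```

The last three lines are only a cross-check: removing the incomplete rows yourself and then calling <source enclose=none>cor( )</source> should give the same correlation as the <source enclose=none>use</source> parameter.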
 
 
== Using Subsets of Data ==
 
 
Often you will want to perform some analysis on a subset of data. The way to do this in R is to use the subset function, together with a logical (boolean) statement. I will first write down the statement and then explain what it does:
 
 
    mydata.sub1 <- subset(mydata, hours > 0)
 
 
On the left hand side of <source enclose=none><-</source> we have a new object named <source enclose=none>mydata.sub1</source>. On the right hand side of <source enclose=none><-</source> we can see how that new object is defined. We are using the function <source enclose=none>subset()</source> which has been designed to select observations and/or columns from a dataframe such as <source enclose=none>mydata</source>. This function needs at least two inputs. The first input is the dataframe from which we are selecting observations and variables. Here we are selecting from <source enclose=none>mydata</source>. The second element indicates which observations/rows we want to select. <source enclose=none>hours > 0</source> tells R to select all those observations for which the variable <source enclose=none>hours</source> is larger than 0.
 
 
Often (if not always) you will not remember how exactly a function works. The internet is then usually a good source, but in your console you could also type <source enclose=none>?subset</source>, which opens the help page for that function. There you will see that you can add a third input to the subset function which indicates which variables you want to include (e.g. <source enclose=none>select = c(hours, wage)</source>, which would only select these two variables). By not using this third input we indicate to R that it should select all variables in <source enclose=none>mydata</source>.
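A small sketch of both variants, with and without the third input. The dataframe <source enclose=none>toy</source> and the object names <source enclose=none>toy.sub1</source> and <source enclose=none>toy.sub2</source> are made up for illustration.

```r
# Toy dataframe; the values are invented for illustration only.
toy <- data.frame(hours = c(0, 1200, 0, 800),
                  wage  = c(NA, 4.5, NA, 3.2),
                  educ  = c(12, 16, 10, 14))

# Rows with positive hours, all variables kept.
toy.sub1 <- subset(toy, hours > 0)

# Same rows, but only the hours and wage variables.
toy.sub2 <- subset(toy, hours > 0, select = c(hours, wage))

toy.sub1
toy.sub2
```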
 
 
=== Logical/Boolean Statements ===
 
 
The way in which we selected the observations, i.e. by using the logical statement <source enclose=none>hours > 0</source>, is worth dwelling on for a moment. These types of logical statements create variables in R that are given the <source enclose=none>logical</source> data type. Sometimes these are also called boolean variables.
 
 
To see what is special about these go to your console and just type something like <source enclose=none>5>9</source> and then press ENTER. You will realise that R is a clever little thing and will tell you that in fact 5 is not larger than 9 by returning the answer <source enclose=none>FALSE</source>. When we provide R with <source enclose=none>hours > 0</source>, the software checks, for each of our 753 observations, whether the value of the <source enclose=none>hours</source> variable is larger than 0 or not. It will create a variable (vector) with 753 entries, and each entry will be either <source enclose=none>TRUE</source> or <source enclose=none>FALSE</source>, depending on whether the respective value is larger than 0 or not.
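The same thing on a small scale, as a sketch with four made-up values instead of the 753 observations (the names <source enclose=none>hours</source> and <source enclose=none>works</source> here are stand-alone vectors created for this example):

```r
hours <- c(0, 1200, 0, 800)   # invented values for illustration

works <- hours > 0            # the comparison is done entry by entry
works                         # FALSE  TRUE FALSE  TRUE
class(works)                  # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE entries.
n_working <- sum(works)
n_working                     # 2
```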
 
 
You can create logical variables on the basis of more complicated logical statements as well. You can combine statements by noting that <source enclose=none>&</source> represents AND, <source enclose=none>|</source> represents OR and <source enclose=none>==</source> checks whether two things are equal. To figure out how these work, try the following statements in your console and see whether you can guess the right answers:
 
 
    (3 > 2) & (3 > 1)
 
    (3 > 2) & (3 > 6)
 
    (3 > 5) & (3 > 6)
 
    (3 > 2) | (3 > 1)
 
    (3 > 2) | (3 > 6)
 
    (3 > 5) | (3 > 6)
 
    ((3 == 5) & (3 > 2)) | (3 > 1)
 
 
Being comfortable with these logical statements will make the life of every programmer much easier.
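These operators can also be used directly inside <source enclose=none>subset()</source> to pick observations that meet several conditions. A minimal sketch, again with a made-up dataframe <source enclose=none>toy</source> and made-up object names:

```r
# Toy dataframe; the values are invented for illustration only.
toy <- data.frame(hours = c(0, 1200, 0, 800, 2000),
                  educ  = c(12, 16, 10, 14, 9))

# AND: positive hours and more than 12 years of education.
both <- subset(toy, hours > 0 & educ > 12)

# OR: at least one of the two conditions holds.
either <- subset(toy, hours > 0 | educ > 12)

# ==: exactly 12 years of education.
exact <- subset(toy, educ == 12)

nrow(both)     # 2
nrow(either)   # 3
nrow(exact)    # 1
```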
 
 
== Some basic summary statistics ==
 
 
There are a number of basic summary statistics that are part of every basic data toolbox, such as means, medians and standard deviations for a set of data. Let's take a particular variable, the <source enclose=none>wage</source> variable. Try any of the following three commands:
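The three commands are not shown in this revision. Given the statistics just named, they are presumably calls to <source enclose=none>mean()</source>, <source enclose=none>median()</source> and <source enclose=none>sd()</source> on <source enclose=none>mydata$wage</source>; that is an assumption. The sketch below runs them on a made-up dataframe <source enclose=none>toy</source> that, like the real data, has a missing wage.

```r
# Toy dataframe with one missing wage; values invented for illustration.
toy <- data.frame(wage = c(3.5, 6.0, NA, 4.8, 2.9))

m  <- mean(toy$wage)     # NA
md <- median(toy$wage)   # NA
s  <- sd(toy$wage)       # NA

c(mean = m, median = md, sd = s)
```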
 
 
If you try any of these you will find an unpleasant <source enclose=none>NA</source> as a result. Why is this? If you look again at the <source enclose=none>wage</source> data you will see that there are missing data in here. Now check the details of any of these functions by, say, typing <source enclose=none>?sd</source> into the console. Reading through the help page you will find that you need to add the parameter <source enclose=none>na.rm=TRUE</source> to your function call. So:
 
 
    sd(mydata$wage,na.rm=TRUE)
 
 
will deliver the sample standard deviation of <source enclose=none>3.310282</source>.
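The same parameter is available for <source enclose=none>mean()</source> and <source enclose=none>median()</source>. A short sketch on a made-up vector <source enclose=none>wage</source> (not the mroz data), so that the results are easy to check by hand:

```r
wage <- c(3.5, 6.0, NA, 4.8, 2.9)   # invented values, one missing

sum(is.na(wage))                     # number of missing values: 1

wage_mean   <- mean(wage, na.rm = TRUE)     # (3.5 + 6.0 + 4.8 + 2.9) / 4 = 4.3
wage_median <- median(wage, na.rm = TRUE)   # (3.5 + 4.8) / 2 = 4.15
wage_sd     <- sd(wage, na.rm = TRUE)       # standard deviation of the 4 remaining values

c(mean = wage_mean, median = wage_median, sd = wage_sd)
```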
 

Revision as of 23:32, 16 January 2015

In this section we shall demonstrate how to do some basic data analysis on data in a dataframe.

Basic Data Analysis

The easiest way to find basic summary statistics on your variables contained in a dataframe is the following command:

    summary(mydata)

You will find that this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as does the dataframe based on mroz.xls, this output can be somewhat lengthy. Say you are only interested in the summary statistics for two of the variables, hours and husage; then you would want to select these two variables only. The way to do that is the following:

    summary(mydata[c("hours","husage")])

This will produce the following output:

        hours            husage     
    Min.   :   0.0   Min.   :30.00  
    1st Qu.:   0.0   1st Qu.:38.00  
    Median : 288.0   Median :46.00  
    Mean   : 740.6   Mean   :45.12  
    3rd Qu.:1516.0   3rd Qu.:52.00  
    Max.   :4950.0   Max.   :60.00

Another extremely useful statistic is the correlation between different variables. This is calculated with the cor( ) function. Let's say we want the correlations between educ, motheduc and fatheduc; we then select the variables in the same manner:

    cor(mydata[c("educ","motheduc","fatheduc")])

resulting in the following correlation matrix

                  educ  motheduc  fatheduc
   educ     1.0000000 0.4353365 0.4424582
   motheduc 0.4353365 1.0000000 0.5730717
   fatheduc 0.4424582 0.5730717 1.0000000

Selecting variables

In what we did above we selected a small number of variables from a larger dataset (saved in a dataframe). The way we did that was to call the dataframe and then indicate in square brackets which variables we wanted to select. To understand what this does, go to your console and call

    test1 = mydata[c("hours")]

which will create a new dataframe which includes only the one variable hours. This is very useful, as some functions need to be applied to a dataframe (see for example the "empirical" function in R_Packages).

There is another way to select the hours variable from the dataframe. Try:

    test2 = mydata$hours

This will also select the hours variable. But if you check your environment tab you will see that the data have now been saved in a different type of R object, a numeric vector rather than a dataframe. Some functions will require such an object as input (see for example the "sd" function below).
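You can also check the type of object directly in the console with class( ). A minimal sketch, using a made-up dataframe toy in place of mydata:

```r
# Toy dataframe; the values are invented for illustration only.
toy <- data.frame(hours = c(0, 1200, 0, 800),
                  wage  = c(NA, 4.5, NA, 3.2))

test1 <- toy["hours"]   # square brackets: a dataframe with one variable
test2 <- toy$hours      # dollar sign: a plain numeric vector

class(test1)            # "data.frame"
class(test2)            # "numeric"
```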