Difference between revisions of "R Analysis"
Revision as of 23:33, 16 January 2015
In this section we shall demonstrate how to do some basic data analysis on data in a dataframe.
Basic Data Analysis
The easiest way to obtain basic summary statistics for the variables in a dataframe is the following command:
summary(mydata)
You will find that this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as does the dataframe based on mroz.xls, this output can be somewhat lengthy. Say you are only interested in the summary statistics for two of the variables, hours and husage; then you would want to select only these two variables. The way to do that is the following:
summary(mydata[c("hours","husage")])
This will produce the following output:
     hours            husage
 Min.   :   0.0   Min.   :30.00
 1st Qu.:   0.0   1st Qu.:38.00
 Median : 288.0   Median :46.00
 Mean   : 740.6   Mean   :45.12
 3rd Qu.:1516.0   3rd Qu.:52.00
 Max.   :4950.0   Max.   :60.00
Another extremely useful statistic is the correlation between different variables. This is calculated with the cor( ) function. Let's say we want the correlations between educ, motheduc and fatheduc; then we proceed in the same manner:
cor(mydata[c("educ","motheduc","fatheduc")])
resulting in the following correlation matrix:
              educ  motheduc  fatheduc
educ     1.0000000 0.4353365 0.4424582
motheduc 0.4353365 1.0000000 0.5730717
fatheduc 0.4424582 0.5730717 1.0000000
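For readers who do not have mroz.xls to hand, the same selection pattern can be tried on a small made-up dataframe. This is only a sketch: the name toydata and all of its values are invented for illustration and are not taken from the Mroz data.

```r
# Hypothetical toy data; the values are invented for illustration only
toydata <- data.frame(
  educ     = c(12, 16, 10, 14, 12),
  motheduc = c(10, 12,  8, 12,  9),
  fatheduc = c( 9, 14,  8, 10, 12)
)

# Summary statistics for two selected variables only
summary(toydata[c("educ", "motheduc")])

# Correlation matrix for all three variables
cmat <- cor(toydata[c("educ", "motheduc", "fatheduc")])
print(cmat)
```

The diagonal of the resulting matrix is always 1, since every variable is perfectly correlated with itself.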
Selecting variables
In what we did above we selected a small number of variables from a larger dataset (saved in a dataframe). We did that by calling the dataframe and then indicating in square brackets which variables we wanted to select. To understand what this does, go to your console and call
test1 = mydata[c("hours")]
which will create a new dataframe that includes only the one variable hours. This is very useful, as some functions need to be applied to a dataframe (see for example the "empirical" function in R_Packages).
There is another way to select the hours variable from the dataframe. Try:
test2 = mydata$hours
This will also select the hours variable. But if you check your environment tab you will see that the data have now been saved in a different type of R object, a vector rather than a dataframe. Some functions will require such an object as input (see for example the "sd" function below).
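The difference between the two ways of selecting can be checked directly with class( ). The sketch below uses a small invented dataframe (toydata is a hypothetical name, and the values are made up) rather than the Mroz data.

```r
# Hypothetical toy data; the values are invented for illustration only
toydata <- data.frame(hours = c(0, 288, 1516), wage = c(NA, 3.5, 5.0))

test1 <- toydata[c("hours")]   # square brackets: a dataframe with one column
test2 <- toydata$hours         # dollar sign: a plain numeric vector

class(test1)   # "data.frame"
class(test2)   # "numeric"

# sd( ) is an example of a function that is applied to a vector
sd(test2)
```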
Dealing with missing observations
So far all is hunky-dory. Let's look at some difficulties. Suppose we want to calculate the correlation between educ and wage:
cor(mydata[c("educ","wage")])
The output we get is:
     educ wage
educ    1   NA
wage   NA    1
The reason for R's inability to calculate a correlation between these two variables can be seen here:
> summary(mydata[c("hours","wage")])
     hours             wage
 Min.   :   0.0   Min.   : 0.1282
 1st Qu.:   0.0   1st Qu.: 2.2626
 Median : 288.0   Median : 3.4819
 Mean   : 740.6   Mean   : 4.1777
 3rd Qu.:1516.0   3rd Qu.: 4.9707
 Max.   :4950.0   Max.   :25.0000
                  NA's   :325
The important information is that the variable wage has 325 missing observations (NA). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing ?cor. The help will pop up in the "Help" tab on the right-hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful at writing clearly, and it is often most useful to go to the bottom of the help page, where you can usually find some examples. If you do that you will find that the solution to our problem is the following:
> cor(mydata[c("educ","wage")],use = "complete")
          educ      wage
educ 1.0000000 0.3419544
wage 0.3419544 1.0000000
It is perhaps worth adding a word of explanation here. cor( ) is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, mydata[c("educ","wage")]. Most functions also have what are called parameters. These are like little dials and levers that change how the function works. One of these levers, use = "complete", tells the function to use only complete observations, i.e. those without missing values. Read the help page to see what other levers are at your disposal.
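The effect of the use parameter can be reproduced on a small invented dataframe. In the sketch below, toydata and its values are made up for illustration. Note that "complete" is accepted because R matches it to the full option name "complete.obs"; the full name is spelled out here.

```r
# Hypothetical toy data with two missing wages; values are invented
toydata <- data.frame(
  educ = c(12, 16, 10, 14, 12, 17),
  wage = c(3.5, NA, 2.1, 4.8, NA, 6.0)
)

# Count the missing values in each variable
na_counts <- colSums(is.na(toydata))
print(na_counts)

# Default behaviour: the missing values make the educ/wage correlation NA
cor_default <- cor(toydata)
print(cor_default)

# Use only the rows in which no variable is missing
cor_complete <- cor(toydata, use = "complete.obs")
print(cor_complete)
```

With more than two variables, use = "pairwise.complete.obs" is another of the levers listed in ?cor: it drops missing values separately for each pair of variables rather than dropping every row that has any missing value.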