R Packages

From ECLR
Revision as of 11:27, 16 January 2015 by Rb

In this section we shall demonstrate how to do some basic data analysis on data in a dataframe. Eventually we will use this task to also introduce how packages are used in R.

Basic Data Analysis

The easiest way to find basic summary statistics on your variables contained in a dataframe is the following command:

    summary(mydata)

You will find that this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as does the dataframe based on mroz.xls, this output can be somewhat lengthy. Say you are only interested in the summary statistics for two of the variables, hours and husage; then you would want to select these two variables only. The way to do that is the following:

    summary(mydata[c("hours","husage")])

This will produce the following output:

        hours            husage     
    Min.   :   0.0   Min.   :30.00  
    1st Qu.:   0.0   1st Qu.:38.00  
    Median : 288.0   Median :46.00  
    Mean   : 740.6   Mean   :45.12  
    3rd Qu.:1516.0   3rd Qu.:52.00  
    Max.   :4950.0   Max.   :60.00
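The column selection can be tried without mroz.xls. The following is a minimal sketch on a small made-up dataframe; the name toy and all of its values are invented for illustration, and only the column names are borrowed from the text above.

```r
# Made-up data for illustration only; these are not the mroz.xls values
toy <- data.frame(
  hours  = c(0, 0, 288, 1516, 4950),
  husage = c(30, 38, 46, 52, 60),
  educ   = c(12, 10, 16, 12, 17)
)

# Select two columns by name; the result is still a dataframe
two_vars <- toy[c("hours", "husage")]

# Summary statistics for the selected columns only
print(summary(two_vars))

# An equivalent selection using the [rows, columns] form
same_vars <- toy[, c("hours", "husage")]
print(summary(same_vars))
```

Both forms return a dataframe with just the two named columns, so summary( ) reports on those columns only.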

Another extremely useful statistic is the correlation between different variables. This is calculated with the cor( ) function. Let's say we want the correlations between educ, motheduc and fatheduc; then we use cor( ) in the same manner:

    cor(mydata[c("educ","motheduc","fatheduc")])

resulting in the following correlation matrix:

                  educ  motheduc  fatheduc
   educ     1.0000000 0.4353365 0.4424582
   motheduc 0.4353365 1.0000000 0.5730717
   fatheduc 0.4424582 0.5730717 1.0000000

Dealing with missing observations

So far all is hunky-dory. Let's show some difficulties/issues. Suppose we want to calculate the correlation between educ and wage:

    cor(mydata[c("educ","wage")])

The output we get is:

        educ wage
   educ    1   NA
   wage   NA    1

The reason for R's inability to calculate a correlation between these two variables can be seen here:

    > summary(mydata[c("hours","wage")])
         hours             wage        
    Min.   :   0.0   Min.   : 0.1282  
    1st Qu.:   0.0   1st Qu.: 2.2626  
    Median : 288.0   Median : 3.4819  
    Mean   : 740.6   Mean   : 4.1777  
    3rd Qu.:1516.0   3rd Qu.: 4.9707  
    Max.   :4950.0   Max.   :25.0000  
                     NA's   :325 

The important information is that the variable wage has 325 missing observations (NA). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing ?cor. The help will pop up in the "Help" tab on the right hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful in writing clearly, and it is often most useful to go to the bottom of the help page, where you can usually find some examples. If you do that you will find that the solution to our problem is the following:

    > cor(mydata[c("educ","wage")],use = "complete")
              educ      wage
    educ 1.0000000 0.3419544
    wage 0.3419544 1.0000000

It is perhaps worth adding a word of explanation here. cor( ) is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, mydata[c("educ","wage")]. Most functions also have what are called parameters. These are like little dials and levers which change how the function works. One of these levers, use = "complete" (short for "complete.obs"), tells the function to only use observations that are complete, i.e. rows that don't have missing values. Read the help page to see what other levers are at your disposal.
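The effect of this lever can be seen on a small example. The following is a minimal sketch; the dataframe toy and its values are made up for illustration and are not the mroz.xls data.

```r
# Made-up data for illustration only; wage is missing for two rows
toy <- data.frame(
  educ = c(12, 10, 16, 12, 17, 14),
  wage = c(3.5, NA, 6.0, 2.8, NA, 4.1)
)

# Default behaviour: the missing values make the educ/wage correlation NA
r_default <- cor(toy)

# Use only the rows in which no variable is missing
r_complete <- cor(toy, use = "complete.obs")

# The same result, obtained by removing incomplete rows by hand first
r_manual <- cor(na.omit(toy))

print(r_default)
print(r_complete)
```

With more than two variables, use = "pairwise.complete.obs" is another option: for each pair of variables it uses all rows in which both are observed, rather than dropping every row that has a missing value anywhere.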

Using Subsets of Data

Often you will want to perform some analysis on a subset of data. The way to do this in R is to use the subset function, together with a logical (boolean) statement. I will first write down the statement and then explain what it does:

    mydata.sub1 <- subset(mydata, hours > 0)
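As a rough sketch of what this statement does: subset( ) keeps the rows for which the logical condition is TRUE. The example below uses a made-up dataframe called toy (invented values, not the mroz.xls data) so that it can be run on its own.

```r
# Made-up data for illustration only
toy <- data.frame(
  hours = c(0, 0, 288, 1516, 4950),
  wage  = c(NA, NA, 3.5, 4.2, 6.1)
)

# Keep only the rows for which hours > 0 is TRUE
toy.sub1 <- subset(toy, hours > 0)

# The same selection written with logical indexing
toy.sub2 <- toy[toy$hours > 0, ]

# Conditions can be combined with &, and select = picks columns
toy.sub3 <- subset(toy, hours > 0 & wage > 4, select = wage)

print(toy.sub1)
print(toy.sub3)
```

One difference worth knowing: subset( ) drops rows where the condition evaluates to NA, whereas logical indexing returns a row of NAs for them. Here hours has no missing values, so both approaches give the same result.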

IN PROGRESS

Packages

The basic R software has some basic functionality, but the real power of R comes from the ability to use code that other people have written to perform statistical and econometric techniques. These additional pieces of software are called packages, and the next step will be to learn how to use them.

Such packages do not come pre-installed with R, but luckily they are easily installed and used. The process you need to go through to use them is the following:

  1. Install the package: install.packages("NAME_OF_PACKAGE"). The name comes in inverted commas. This only needs to be done once on any computer.
  2. Load the package into your particular code. You need to do this every time you start a new R session. So, if you are working in a script you would have the following command at the beginning: library(NAME_OF_PACKAGE). Now you use the package name without inverted commas.
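The two steps can be sketched as follows. This is only an illustration: the package tools is used in place of NAME_OF_PACKAGE because it ships with every R installation (so the sketch runs without an internet connection) but is not loaded by default. The check with requireNamespace( ) is an addition, not part of the steps above; it simply avoids re-installing a package that is already there.

```r
# "tools" stands in for NAME_OF_PACKAGE; it ships with R but is not
# attached by default, so this sketch runs offline
pkg <- "tools"

# Step 1: install the package, but only if it is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# Step 2: load the package; character.only = TRUE is needed because the
# name is held in a variable rather than typed directly
library(pkg, character.only = TRUE)

# Check that the package is now on the search path
is_attached <- "package:tools" %in% search()
print(is_attached)
```

In an ordinary script you would simply write library(tools) at the top, with the package name typed directly and without inverted commas.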

The most difficult task is often to find the right package you want to use. Usually Dr Google or Prof Bing will be the right people to ask.

Let's say you want to show an empirical frequency distribution for a categorical variable.

Empirical distribution

When googling you will often find solutions that show how to do everything yourself. If you want a pre-programmed solution, add the word package to the search term, e.g. "r empirical probability distributions package".

Such a search will lead you to the prob package, which contains a function called empirical.
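For comparison, a do-it-yourself version needs only base R. The following is a minimal sketch using table( ) and prop.table( ); it does not use the prob package, and the variable region and its values are made up for illustration.

```r
# Made-up categorical variable for illustration only
region <- c("north", "south", "north", "west", "north", "south")

# Absolute frequencies: how often each category occurs
counts <- table(region)

# Relative frequencies: the counts divided by the number of observations
rel_freq <- prop.table(counts)

print(counts)
print(rel_freq)
```

The relative frequencies sum to one, so they can be read as the empirical probability of each category.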