R Packages
In this section we shall demonstrate how to do some basic data analysis on data in a dataframe. Eventually we will use this task to also introduce how packages are used in R.
Basic Data Analysis
The easiest way to find basic summary statistics on your variables contained in a dataframe is the following command:
summary(mydata)
You will find that this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as the dataframe based on mroz.xls does, this output can be somewhat lengthy. Say you are only interested in the summary statistics for two of the variables, hours and husage; then you would want to select these two variables only. The way to do that is the following:
summary(mydata[c("hours","husage")])
This will produce the following output:
     hours            husage
 Min.   :   0.0   Min.   :30.00
 1st Qu.:   0.0   1st Qu.:38.00
 Median : 288.0   Median :46.00
 Mean   : 740.6   Mean   :45.12
 3rd Qu.:1516.0   3rd Qu.:52.00
 Max.   :4950.0   Max.   :60.00
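If you do not have the mroz data to hand, the same selection can be tried on a small made-up dataframe. The sketch below is only an illustration: the object names toy and two_vars and all the values are assumptions, not part of the original data.

```r
# Toy dataframe standing in for mydata (values are made up for illustration)
toy <- data.frame(
  hours  = c(0, 0, 288, 1516, 4950),
  husage = c(30, 38, 46, 52, 60),
  educ   = c(12, 10, 12, 16, 17)
)

# Select two columns by name, then summarise only those
two_vars <- toy[c("hours", "husage")]
summary(two_vars)

# The selection is itself a dataframe containing just the requested columns
names(two_vars)
```

Selecting with a vector of column names keeps the result a dataframe, which is why summary( ) still reports one block per variable.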
Another extremely useful statistic is the correlation between different variables. This is calculated with the cor( ) function. Let's say we want the correlations between educ, motheduc and fatheduc; then we select the variables in the same manner as before:
cor(mydata[c("educ","motheduc","fatheduc")])
resulting in the following correlation matrix:
              educ  motheduc  fatheduc
educ     1.0000000 0.4353365 0.4424582
motheduc 0.4353365 1.0000000 0.5730717
fatheduc 0.4424582 0.5730717 1.0000000
Dealing with missing observations
So far all is hunky-dory. Let's show some difficulties/issues. Say we want to calculate the correlation between educ and wage:
cor(mydata[c("educ","wage")])
The output we get is:
     educ wage
educ    1   NA
wage   NA    1
The reason for R's inability to calculate a correlation between these two variables can be seen here:
> summary(mydata[c("hours","wage")])
     hours             wage
 Min.   :   0.0   Min.   : 0.1282
 1st Qu.:   0.0   1st Qu.: 2.2626
 Median : 288.0   Median : 3.4819
 Mean   : 740.6   Mean   : 4.1777
 3rd Qu.:1516.0   3rd Qu.: 4.9707
 Max.   :4950.0   Max.   :25.0000
                  NA's   :325
The important information is that the variable wage has 325 missing observations (NA). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing ?cor. The help will pop up in the "Help" tab on the right hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful at writing clearly, and it is often most useful to go to the bottom of the help page, where you can usually find some examples. If you do that you will find that the solution to our problem is the following:
> cor(mydata[c("educ","wage")],use = "complete")
          educ      wage
educ 1.0000000 0.3419544
wage 0.3419544 1.0000000
It is perhaps worth adding a word of explanation here. cor( ) is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, mydata[c("educ","wage")]. Most functions also have what are called parameters. These are like little dials and levers on the function which change how the function works. One of these levers can be used to tell the function to use only observations that are complete, i.e. don't have missing values: use = "complete". Read the help page to see what other levers are at your disposal.
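The effect of the use lever can be reproduced on a small made-up dataframe. In this sketch the data and the object names (toy, cor_default, cor_complete, keep) are assumptions for illustration only. The value "complete" used above works because cor( ) accepts abbreviations; it is spelled out in full as "complete.obs" here.

```r
# Toy data with two missing wages (values are made up for illustration)
toy <- data.frame(
  educ = c(12, 10, 12, 16, 17, 14),
  wage = c(3.5, NA, 2.9, 6.1, NA, 4.2)
)

# Default behaviour: any NA in a variable makes its correlations NA
cor_default <- cor(toy[c("educ", "wage")])
cor_default

# Drop incomplete rows before computing the correlation
cor_complete <- cor(toy[c("educ", "wage")], use = "complete.obs")
cor_complete

# The same number by hand: keep only rows without missing values
keep <- complete.cases(toy)
cor(toy$educ[keep], toy$wage[keep])
```

With more than two variables, "complete.obs" drops a row if any of the selected variables is missing, so it can pay to select only the variables you need.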
Using Subsets of Data
Often you will want to perform some analysis on a subset of data. The way to do this in R is to use the subset function, together with a logical (boolean) statement. I will first write down the statement and then explain what it does:
mydata.sub1 <- subset(mydata, hours > 0)
On the left hand side of <- we have a new object named mydata.sub1. On the right hand side of <- we can see how that new object is defined. We are using the function subset(), which has been designed to select observations and/or columns from a dataframe such as mydata. This function needs at least two inputs. The first input is the dataframe from which we are selecting observations and variables. Here we are selecting from mydata. The second input indicates which observations/rows we want to select. hours > 0 tells R to select all those observations for which the variable hours is larger than 0.
Often (if not always) you will not remember how exactly a function works. The internet is then usually a good source, but in your console you could also type ?subset, which opens the help page for that function. There you will see that you can add a third input to the subset function which indicates which variables you want to include (e.g. select = c(hours, wage), which would only select these two variables). By not using this third input we indicate to R that it should select all variables in mydata.
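The difference between using and omitting the third input can be seen on a small made-up dataframe. The object names toy, toy.sub1 and toy.sub2 and all the values are assumptions for illustration.

```r
# Toy dataframe standing in for mydata (values are made up for illustration)
toy <- data.frame(
  hours = c(0, 1200, 0, 800),
  wage  = c(NA, 4.5, NA, 3.2),
  educ  = c(12, 16, 10, 12)
)

# Rows with positive hours, all variables kept
toy.sub1 <- subset(toy, hours > 0)

# Same rows, but only the hours and wage variables
toy.sub2 <- subset(toy, hours > 0, select = c(hours, wage))

dim(toy.sub1)   # rows and columns of the first subset
dim(toy.sub2)   # same rows, fewer columns
```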
Logical/Boolean Statements
The way in which we selected the observations, i.e. by using the logical statement hours > 0, is worth dwelling on for a moment. These types of logical statements create variables in R that are given the logical data type. Sometimes these are also called boolean variables.
To see what is special about these, go to your console and just type something like 5>9 and then press ENTER. You will realise that R is a clever little thing and will tell you that in fact 5 is not larger than 9 by returning the answer FALSE. When R is provided with hours > 0, the software checks, for all our 753 observations, whether the value of the hours variable is larger than 0 or not. It will create a variable (vector) with 753 entries, and each entry will be either TRUE or FALSE, depending on whether the respective value is larger than 0 or not.
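A short sketch with a made-up hours vector (five values instead of 753) shows what such a logical vector looks like and what you can do with it. The names hours and worked are assumptions for illustration.

```r
# Made-up hours for five observations
hours <- c(0, 1200, 0, 800, 300)

# One TRUE/FALSE entry per observation
worked <- hours > 0
worked          # FALSE TRUE FALSE TRUE TRUE
class(worked)   # "logical"

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE entries
sum(worked)     # 3

# A logical vector can be used to pick out the matching entries
hours[worked]   # 1200 800 300
```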
You can create logical variables on the basis of more complicated logical statements as well. You can combine statements by noting that & represents AND, | represents OR and == checks whether two things are equal. To figure out how these work, try the following statements in your console and see whether you can guess the right answers:
(3 > 2) & (3 > 1)
(3 > 2) & (3 > 6)
(3 > 5) & (3 > 6)
(3 > 2) | (3 > 1)
(3 > 2) | (3 > 6)
(3 > 5) | (3 > 6)
((3 == 5) & (3 > 2)) | (3 > 1)
Being comfortable with these logical statements will make the life of every programmer much easier.
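The same operators work on whole variables, entry by entry, which is what makes them useful for selecting observations. The sketch below uses a made-up dataframe; the names toy, both and either are assumptions for illustration.

```r
# Toy dataframe (values are made up for illustration)
toy <- data.frame(
  hours = c(0, 1200, 0, 800),
  educ  = c(12, 16, 10, 12)
)

# AND: positive hours and exactly 12 years of education
both <- toy$hours > 0 & toy$educ == 12
both     # FALSE FALSE FALSE TRUE

# OR: at least one of the two conditions holds
either <- toy$hours > 0 | toy$educ == 12
either   # TRUE TRUE FALSE TRUE

# Combined conditions can be passed straight to subset()
subset(toy, hours > 0 & educ == 12)
```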
Packages
The R software has some basic functionality, but the power of R comes from the ability to use code that other people have written to perform statistical and econometric techniques. These additional pieces of software are called packages. Packages usually include a host of functions that perform tasks related to a particular problem type (say, using probability distributions, estimating GARCH models, performing Bayesian inference etc.). The next step will be to learn how to find them, install them and use them.
The most difficult task is often to find the right package you want to use. Usually Dr Google or Prof Bing will be the right people to ask.
Let's say you want to show an empirical frequency distribution for a categorical variable, like the number of children younger than 6 or at least 6 years old (kidslt6, kidsge6). It is a bit of an art to find the right package (and there may be several packages which do the job). I googled the following term: "R empirical probability distributions package". If you don't include the term "package" you will tend to find solutions for programming it yourself, but if you want pre-written code, including "package" helps.
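For comparison, the "programme it yourself" route needs no extra package for a simple case. The sketch below uses base R's table( ) and prop.table( ) rather than any package function, and the kids vector is made up for illustration.

```r
# Made-up number of young children for eight households
kids <- c(0, 1, 0, 2, 0, 1, 0, 0)

# Absolute frequencies of each distinct value
counts <- table(kids)
counts

# Relative frequencies: each count divided by the total number of observations
rel_freq <- prop.table(counts)
rel_freq
```

A dedicated package may still be preferable when you need more than a simple tabulation.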
Scanning the results, there appears a link to a package called "prob", and if you open the pdf file (to which my search engine linked) you will find a list of the functions contained in that package. You will soon see an "empirical" function which appears to be designed to do the job. Keep this file open as we will need to consult it to understand how to use the function. But first we need to make this code available to our script file.
Such packages do not come pre-installed into R, but luckily, they are easily installed and used. If you first want to check which packages are already installed on your computer, you can use the following command:
ip <- installed.packages(.Library)
This produces an object ip which contains the list of all the packages that are already installed. The process you need to go through to use a new package is the following:
- Install the package: install.packages("NAME_OF_PACKAGE"). The name comes in inverted commas. This only needs to be done once on any computer. The very nice thing is that you will not need to download anything yourself. R will do all the work for you! [1]
- Load the package into your particular code. You need to do this every time you start a new R session. So, if you are working in a script you would have the following command at the beginning: library(NAME_OF_PACKAGE). Now you use the package name without inverted commas.
In our example this means installing and loading the prob package, in which we find the empirical function.
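The two steps can be wrapped into one small helper so that a script installs a package only when it is missing. This is a sketch, not a built-in R feature: load_or_install is a hypothetical name chosen here, and the demonstration uses the stats package, which ships with R, so that nothing is actually downloaded.

```r
# Is a given package among the installed ones? Package names are the row names.
"stats" %in% rownames(installed.packages())

# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs an internet connection; once per computer
  }
  # character.only = TRUE because pkg holds the name as a character string
  library(pkg, character.only = TRUE)
}

# stats is already installed, so this only loads it
load_or_install("stats")
```

For the example in the text you would call load_or_install("prob") instead, which downloads the package the first time it is run.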
Additional Resources for Packages
- A list of available packages can be found on the CRAN webpage.
- If you are looking for something particular it may be a good idea to look at the CRAN Task Views which are short paragraphs introducing useful packages for particular topic areas.
Footnotes
- ↑ The first time you do that on your computer, R will ask you from which mirror you want to download the package and will offer a list. Choose the one that is geographically closest to you.