Difference between revisions of "R Analysis"
Revision as of 21:55, 4 August 2015
In this section we shall demonstrate how to do some basic data analysis on data in a dataframe. Here is an online demonstration (https://youtu.be/0pwRxhxG0tg?hd=1) of some of the material covered on this page.
Data Upload and Introduction
We shall continue working with the same dataset as in R_Data, the mroz.xls dataset. It is easiest to import the data as we learned in R_Data#Converting_to_NAs_during_import, taking care of missing values (which in the csv datafile are represented by ".") during the data import process:
setwd("X:/Your/full/Path")  # This sets the working directory, ensure data file is in here
mydata <- read.csv("mroz.csv",na.strings = ".")  # Opens mroz.csv from working directory
This will upload a dataframe mydata into your work environment.
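If you want to see what na.strings does without needing the mroz.csv file, here is a minimal sketch that reads a tiny CSV from a text string. The values and the object names csv_text, raw and toy are made up for illustration.

```r
# A tiny made-up CSV in which "." marks a missing wage (illustrative values only)
csv_text <- "hours,wage
1610,3.354
0,.
1980,4.5918"

# Without na.strings, the "." forces the whole wage column to be read as text
raw <- read.csv(text = csv_text)
class(raw$wage)     # "character" in R >= 4.0.0, "factor" in older versions

# With na.strings = ".", the "." becomes NA and wage is read as a number
toy <- read.csv(text = csv_text, na.strings = ".")
class(toy$wage)     # "numeric"
is.na(toy$wage)     # FALSE  TRUE FALSE
```

The same logic applies to the real file: once "." is declared as the missing-value code, R can treat the column as numeric.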
Here we will learn how to do very basic descriptive statistics in R. The main tasks are going to be how to apply certain statistics to certain parts of the data. That may be certain variables in our dataframe, but it may also involve us selecting certain rows/observations.
One more warning before we get started. As R is open-source software, there are many ways to do the same things in R. In fact there are many packages that have been written to achieve the same thing in a slightly different manner. That can be a slightly frustrating aspect of working with R, but as long as you remember that there are many solutions to the same problem you will be fine!
Here we decided to use some functionality from a package called mosaic, and you may want to install and load it here:
install.packages("mosaic")  # only needed once on each computer
library(mosaic)  # needed at the beginning of every script that uses functions from mosaic
Summary Statistics - Take 1
The easiest way to find basic summary statistics on your variables contained in a dataframe is the following command:
> summary(mydata)
      inlf            hours           kidslt6          kidsge6           age             educ            wage
 Min.   :0.0000   Min.   :   0.0   Min.   :0.0000   Min.   :0.000   Min.   :30.00   Min.   : 5.00   Min.   : 0.1282
 1st Qu.:0.0000   1st Qu.:   0.0   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:36.00   1st Qu.:12.00   1st Qu.: 2.2626
 Median :1.0000   Median : 288.0   Median :0.0000   Median :1.000   Median :43.00   Median :12.00   Median : 3.4819
 Mean   :0.5684   Mean   : 740.6   Mean   :0.2377   Mean   :1.353   Mean   :42.54   Mean   :12.29   Mean   : 4.1777
 3rd Qu.:1.0000   3rd Qu.:1516.0   3rd Qu.:0.0000   3rd Qu.:2.000   3rd Qu.:49.00   3rd Qu.:13.00   3rd Qu.: 4.9707
 Max.   :1.0000   Max.   :4950.0   Max.   :3.0000   Max.   :8.000   Max.   :60.00   Max.   :17.00   Max.   :25.0000
                                                                                                    NA's   :325
As you can see, this provides a range of summary statistics for each variable (minimum and maximum, quartiles, mean and median). If the dataframe contains a lot of variables, as does the dataframe based on mroz.xls, this output can be somewhat lengthy (and therefore isn't shown completely here), but in the next section we learn how to apply this to selected variables.
This output should be fairly self-explanatory. The one interesting aspect is that in the column that describes the wage variable we also find the additional information that there are 325 missing observations (NA's).
Another extremely useful statistic is the correlation between different variables. This is achieved with the cor( ) function.
> cor(mydata)
                inlf       hours      kidslt6      kidsge6         age        educ wage      repwage
inlf     1.000000000  0.74114539 -0.213749303 -0.002424231 -0.08049811  0.18735285   NA  0.634048541
hours    0.741145387  1.00000000 -0.222063296 -0.090632070 -0.03311418  0.10596042   NA  0.606916375
kidslt6 -0.213749303 -0.22206330  1.000000000  0.084159872 -0.43394869  0.10869022   NA -0.134908831
kidsge6 -0.002424231 -0.09063207  0.084159872  1.000000000 -0.38541134 -0.05889891   NA -0.068680213
age     -0.080498109 -0.03311418 -0.433948687 -0.385411341  1.00000000 -0.12022299   NA -0.058314931
educ     0.187352846  0.10596042  0.108690218 -0.058898912 -0.12022299  1.00000000   NA  0.267574542
wage              NA          NA           NA           NA          NA          NA    1           NA
repwage  0.634048541  0.60691637 -0.134908831 -0.068680213 -0.05831493  0.26757454   NA  1.000000000
hushrs  -0.065182605 -0.05634759  0.024292257  0.099377891 -0.08437157  0.07891592   NA -0.070797198
husage  -0.072820048 -0.03108875 -0.442991438 -0.350199434  0.88813797 -0.13352150   NA -0.055398862
This is only the top left corner of a huge correlation matrix. We can, for instance, see that the correlation between the variables "hours" and "age" is -0.0331, so slightly negative. But in this table you can also see that all correlations that involve the variable wage are shown as NA, or "not available". The reason for this is that there are missing observations for that variable, i.e. some respondents did not report a wage (in fact, all those who are categorised as not being in the labour force, inlf = 0). When you start exploring your data it is useful to understand data features like this. Take some time to get familiar with your dataset.
Selecting variables
Say you are only interested in the summary statistics for two of the variables, hours and husage; then you would want to select these two variables only. One way to do that is the following:
summary(mydata[c("hours","husage")])
This will produce the following output:
     hours            husage
 Min.   :   0.0   Min.   :30.00
 1st Qu.:   0.0   1st Qu.:38.00
 Median : 288.0   Median :46.00
 Mean   : 740.6   Mean   :45.12
 3rd Qu.:1516.0   3rd Qu.:52.00
 Max.   :4950.0   Max.   :60.00
Let's say we want the correlations between educ, motheduc and fatheduc; then we use cor in the same manner:
cor(mydata[c("educ","motheduc","fatheduc")])
resulting in the following correlation matrix:
              educ  motheduc  fatheduc
educ     1.0000000 0.4353365 0.4424582
motheduc 0.4353365 1.0000000 0.5730717
fatheduc 0.4424582 0.5730717 1.0000000
Above, we selected a small number of variables from a larger dataset (saved in a dataframe). We did that by calling the dataframe and then indicating in square brackets which variables we wanted to select. To understand what this does, go to your console and call
test1 = mydata[c("hours")]
which will create a new dataframe that includes only the one variable hours. This is very useful, as some functions need to be applied to a dataframe (see for example the "empirical" function in R_Packages).
There is another way to select the hours variable from the dataframe. Try:
test2 = mydata$hours
This will also select the hours variable. But if you check your environment tab you will see that the data have now been saved in a different type of R object, a vector rather than a dataframe. Some functions will require such an object as input (see for example the "sd" function below).
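The difference between the two ways of selecting can be checked directly in the console. Here is a small sketch on a made-up dataframe called toy (a stand-in for mydata; the values are invented):

```r
# Toy dataframe standing in for mydata (values are made up)
toy <- data.frame(hours = c(1610, 0, 1980), husage = c(34, 30, 40))

test1 <- toy[c("hours")]   # single square brackets keep the dataframe structure
test2 <- toy$hours         # the dollar sign extracts the column itself

class(test1)   # "data.frame"
class(test2)   # "numeric"

sd(test2)      # works: sd() expects a numeric vector
```

If a function complains about its input, checking class() of the object you handed in is often the quickest way to find out why.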
Introducing the subset function
There is yet another way of selecting certain variables. It is by using the subset function. This function deserves its own little section as it is extremely useful and powerful and you will get to know it well.
Try the following
mydata.sub0 <- subset(mydata, select=c("hours","husage"))
and look at the outcome in your Environment. You have created a new dataframe called mydata.sub0 that consists of only the two variables hours and husage. Now you could apply the summary and cor functions to just this dataframe:
> summary(mydata.sub0)
     hours            husage
 Min.   :   0.0   Min.   :30.00
 1st Qu.:   0.0   1st Qu.:38.00
 Median : 288.0   Median :46.00
 Mean   : 740.6   Mean   :45.12
 3rd Qu.:1516.0   3rd Qu.:52.00
 Max.   :4950.0   Max.   :60.00
> cor(mydata.sub0)
             hours      husage
hours   1.00000000 -0.03108875
husage -0.03108875  1.00000000
In this way you can ensure that you only see those statistics on the screen which you are really interested in. The subset function will also be used when we select rows/observations rather than columns/variables. But we will soon get to that.
Dealing with missing observations
So far all is hunky-dory. Let's show some difficulties/issues. Suppose we want to calculate the correlation between educ and wage:
cor(mydata[c("educ","wage")])
The output we get is:
     educ wage
educ    1   NA
wage   NA    1
The reason for R's inability to calculate a correlation between these two variables is that the variable wage has 325 missing observations (NA, see above). It is not immediately obvious how to tackle this issue. We need to consult either Dr. Google or the R help function. The latter is done by typing ?cor. The help will pop up in the "Help" tab on the right hand side. You will need to read through it to find a solution to the issue. Frankly, the clever people who write the R software are not always the most skilful at writing clearly, and it is often most useful to go to the bottom of the help where you can usually find some examples. If you do that you will find that the solution to our problem is the following:
> cor(mydata[c("educ","wage")],use = "complete")
          educ      wage
educ 1.0000000 0.3419544
wage 0.3419544 1.0000000
It is perhaps worth adding a word of explanation here. cor( ) is what is called a function. It needs some inputs to work. The first input is the data for which to calculate correlations, mydata[c("educ","wage")]. Most functions also have what are called parameters. These are like little dials and levers on the function which change how the function works. One of these levers can be used to tell the function to only use observations that are complete, i.e. that don't have missing values: use = "complete". Read the help function to see what other levers are at your disposal.
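To see the effect of the use lever in isolation, here is a minimal sketch with made-up numbers (the dataframe toy is not the mroz data). It also checks that the short form "complete" gives the same result as the full option name "complete.obs" listed in ?cor, to which it is partially matched.

```r
# Made-up data with one missing wage
toy <- data.frame(educ = c(12, 12, 16, 17, 10),
                  wage = c(3.2, NA, 6.1, 7.4, 2.5))

cor(toy)                          # the educ-wage correlation is NA
cor(toy, use = "complete.obs")    # rows containing an NA are dropped first

# use = "complete" is partially matched to "complete.obs"
all.equal(cor(toy, use = "complete"), cor(toy, use = "complete.obs"))   # TRUE
```

Writing the option name out in full is a little more typing but makes the code easier to read later.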
Using Subsets of Data
Often you will want to perform some analysis on a subset of data. The way to do this in R is to use the subset function, together with a logical (boolean) statement. I will first write down the statement and then explain what it does:
mydata.sub1 <- subset(mydata, hours > 0)
On the left hand side of <- we have a new object named mydata.sub1. On the right hand side of <- we can see how that new object is defined. We are using the function subset(), which has been designed to select observations and/or columns from a dataframe such as mydata. This function needs at least two inputs. The first input is the dataframe from which we are selecting observations and variables. Here we are selecting from mydata. The second input indicates which observations/rows we want to select. hours > 0 tells R to select all those observations for which the variable hours is larger than 0.
Often (if not always) you will not remember exactly how a function works. The internet is then usually a good source, but in your console you could also type ?subset, which opens the help page. There you can see that you can add a third input to the subset function which indicates which variables you want to include (e.g. select = c(hours, wage), which would only select these two variables; see the section above!). By not using this third input we indicate to R that it should select all variables in mydata.
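Here is a small sketch of both uses of subset on a made-up dataframe (toy, sub_rows and sub_both are illustrative names, and the values are invented):

```r
# Toy dataframe standing in for mydata (values are made up)
toy <- data.frame(hours = c(1610, 0, 1980, 0),
                  wage  = c(3.35, NA, 4.59, NA),
                  age   = c(32, 30, 35, 41))

# Rows only: keep observations with positive hours, all columns
sub_rows <- subset(toy, hours > 0)

# Rows and columns together: the third input picks the variables
sub_both <- subset(toy, hours > 0, select = c(hours, wage))

nrow(sub_rows)    # 2
names(sub_both)   # "hours" "wage"
```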
Logical/Boolean Statements
The way in which we selected the observations, i.e. by using the logical statement hours > 0, is worth dwelling on for a moment. These types of logical statements create variables in R that are given the logical data type. Sometimes these are also called boolean variables.
To see what is special about these, go to your console, type something like 5>9 and then press ENTER. You will realise that R is a clever little thing and will tell you that in fact 5 is not larger than 9 by returning the answer FALSE. When R was provided with hours > 0, the software checked, for all our 753 observations, whether the value of the hours variable was larger than 0 or not. It created a variable (vector) with 753 entries, each of which is either TRUE or FALSE, depending on whether the respective value is larger than 0 or not.
You can create logical variables on the basis of more complicated logical statements as well. You can combine statements by noting that & represents AND, and | represents OR. You will want to use one of the following relational operators: == checks whether two things are equal; != checks whether two things are unequal; > and < take their well known roles. To figure out how these work, try the following statements in your console and see whether you can guess the right answers:
(3 > 2) & (3 > 1)
(3 > 2) & (3 > 6)
(3 > 5) & (3 > 6)
(3 > 2) | (3 > 1)
(3 > 2) | (3 > 6)
(3 > 5) | (3 > 6)
((3 == 5) & (3 > 2)) | (3 > 1)
Being comfortable with these logical statements will make the life of every programmer much easier.
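The same comparisons work on whole vectors, which is exactly what happens inside subset. Here is a minimal sketch with made-up values (hours, age and works are illustrative vectors, not columns of mydata):

```r
hours <- c(1610, 0, 1980, 0, 456)   # made-up values
age   <- c(32, 30, 35, 41, 58)

works <- hours > 0          # one TRUE/FALSE per observation
works                       # TRUE FALSE TRUE FALSE TRUE
sum(works)                  # 3, because TRUE counts as 1 and FALSE as 0

# Combined condition: working AND older than 33
hours[works & age > 33]     # 1980 456
```

Summing a logical vector is a handy way to count how many observations meet a condition before you actually select them.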
Summary Statistics - Take 2
Now that we have at least a dark yellow belt in selecting subsets of data, we shall return to the issue of actually calculating descriptive statistics. And we can get slightly more adventurous than just using the summary function.
Categorical Variables
We mentioned the mosaic package earlier, and if you haven't done so yet, you should load it now: library(mosaic).
Especially when you are dealing with categorical data it is often useful to look at contingency tables, i.e. tables with counts of all possible values. There are a number of functions that do this, e.g. the table function which is part of the basic R software. Here we shall use the tally function, which is part of the mosaic package. Try the following line; I will explain it once you see the output:
> tally(~kidslt6, data=mydata, margins = TRUE)
If you get the error message Error: could not find function "tally", R is telling you that it has no idea what the "tally" function is. The most likely reason is that you haven't executed the library(mosaic) command yet. There is no better time to do it than now!
The output for the tally function is
> tally(~kidslt6, data=mydata, margins = TRUE)
    0     1     2     3 Total
  606   118    26     3   753
which tells you that there were 118 women with one child younger than 6 and that in total there were 753 responses. Let me go back to the command and explain what is happening. We called the function tally and we used three inputs[1]. Let's look at the second input first, data=mydata. This indicated to the function tally that it should be using data in the dataframe called mydata. The first input, ~kidslt6, told the function to create a contingency table for the variable that counts the number of children under 6 years. Don't worry about the use of the ~ in front of the variable name. You don't need to understand why it is there, but it will make more sense after you have learned about running regressions and producing graphs. The last input, margins = TRUE, ensured that we would get marginal counts, here the "Total". Some of these switches that are used to fine-tune how these functions work are operated with Boolean variables, TRUE meaning on and FALSE meaning off.
The table above shows the counts of the variable. If you want to produce a table with percentages or proportions you can do that by adding a 4th input to the function call, either format = "percent" or format = "prop". Try it yourself and confirm that 3.45 per cent of women had 2 children younger than 6.
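As mentioned, base R has its own table function. If mosaic is not available, a rough base R counterpart to counts, totals and percentages can be sketched as follows. This is an alternative route, not what tally does internally, and the values are made up:

```r
kidslt6 <- c(0, 0, 1, 0, 2, 0, 1, 0)   # made-up values, not the mroz data

counts <- table(kidslt6)       # counts per value
props  <- prop.table(counts)   # proportions summing to 1

addmargins(counts)             # appends a "Sum" entry, similar to margins = TRUE
round(100 * props, 2)          # percentages
```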
In our dataset we have a number of categorical variables and in such cases you are often interested in cross-tabulations.
> tally(~kidsge6+kidslt6, data=mydata, margins = TRUE, format = "prop")
       kidslt6
kidsge6           0           1           2           3       Total
  0     0.304116866 0.022576361 0.014608234 0.001328021 0.342629482
  1     0.191235060 0.046480744 0.006640106 0.001328021 0.245683931
  2     0.160690571 0.047808765 0.006640106 0.000000000 0.215139442
  3     0.099601594 0.031872510 0.003984064 0.001328021 0.136786189
  4     0.034528552 0.003984064 0.001328021 0.000000000 0.039840637
  5     0.011952191 0.003984064 0.000000000 0.000000000 0.015936255
  6     0.000000000 0.000000000 0.001328021 0.000000000 0.001328021
  7     0.001328021 0.000000000 0.000000000 0.000000000 0.001328021
  8     0.001328021 0.000000000 0.000000000 0.000000000 0.001328021
  Total 0.804780876 0.156706507 0.034528552 0.003984064 1.000000000
The output is a contingency table that reports proportions (as we set format = "prop"). The important aspect here is how we indicated to R that it should be producing a cross table, ~kidsge6+kidslt6. The second variable was just added with a plus sign. Easy peasy, and you could actually add a 3rd variable and check out what sort of output you get (it gets quite unwieldy).
The last class of tables we will introduce here are tables that use data that meet certain conditions. Let's see an example:
> tally(~kidsge6|educ>=16, data=mydata, margins = TRUE,format = "percent")
       educ >= 16
kidsge6        TRUE       FALSE
  0      41.7475728  33.0769231
  1      21.3592233  25.0769231
  2      26.2135922  20.7692308
  3       9.7087379  14.3076923
  4       0.9708738   4.4615385
  5       0.0000000   1.8461538
  6       0.0000000   0.1538462
  7       0.0000000   0.1538462
  8       0.0000000   0.1538462
  Total 100.0000000 100.0000000
You see that R delivered percentages of women with certain numbers of children aged 6 or older, but it delivered two such columns of percentages, one for women who had at least 16 years of education (equivalent to some postgrad degree) and another for those with fewer years of education. The way in which we instructed R to condition on this was by using the following variable definition: ~kidsge6|educ>=16, which implied that we wanted percentages for the kidsge6 variable, but conditional on whether the woman's education was at least 16 years.
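The idea of conditioning can also be sketched in base R, where each column of a cross table is turned into percentages separately. This is an alternative to the tally call, not what mosaic does internally. The vectors are made up, and high_educ is just a label chosen for the logical condition:

```r
kidsge6 <- c(0, 1, 2, 0, 0, 3, 1, 0)          # made-up values
educ    <- c(17, 12, 12, 16, 10, 12, 16, 11)

# Cross-tabulate kidsge6 against the logical condition educ >= 16
tab <- table(kidsge6, high_educ = educ >= 16)

# margin = 2 computes proportions within each column, so each column sums to 100
col_pct <- 100 * prop.table(tab, margin = 2)
round(col_pct, 1)
```

Note that base R orders the columns FALSE then TRUE, whereas the tally output above shows TRUE first.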
Continuous variables
There are a number of basic summary statistics that are part of every basic data toolbox, such as means, medians and standard deviations for a set of data. Let's take a particular variable, the wage variable. Try the following command:
mean(~wage, data=mydata)
You could replace mean with median, sd, var, min or max (which all represent obvious sample summary statistics); the result is always an unpleasant NA. Why is this? As we already discovered above, the wage variable has missing observations. Reading through the help function (?mean and then choosing the mosaic version) you will find that you need to add the option na.rm=TRUE to your function call. So:
mean(~wage, data=mydata, na.rm=TRUE)
will deliver the sample mean of 4.177682. This additional parameter essentially instructs the function mean to remove all NAs.
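The na.rm switch is not specific to mosaic. Here is a minimal sketch with the base R versions of these functions, applied to a made-up vector rather than via the formula interface used above:

```r
wage <- c(3.2, NA, 6.1, 7.4, NA)   # made-up values

mean(wage)                 # NA: one missing value is enough
mean(wage, na.rm = TRUE)   # mean of the three observed values
sd(wage, na.rm = TRUE)     # the same switch works for sd, median, var, min, max

sum(is.na(wage))           # 2 missing observations
```

Counting the NAs first, as in the last line, tells you how many observations the statistic is actually based on.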
We already discussed how you can use the summary function to obtain a range of summary statistics, but annoyingly not the sample standard deviation. The good people who programmed the mosaic package got so annoyed by that that they have given us an alternative way to get a range of summary statistics which includes the standard deviation.
> favstats(~wage, data=mydata)
    min     Q1 median      Q3 max     mean       sd   n missing
 0.1282 2.2626 3.4819 4.97075  25 4.177682 3.310282 428     325
This is really nice as it disregarded the missing values by default. If you want these statistics for several variables, there are really two ways of doing this: either you merely replicate this line of code for all the variables for which you want the stats, or you use the code in the next subsection.
Advanced: using the dfapply function
This is really a little advanced and I only recommend it for the adventurous amongst you. Here is the code and output first:
> mydata.sub2 <- subset(mydata,select=c("hours","husage","wage","huswage"))
> dfapply(mydata.sub2,favstats)
$hours
 min Q1 median   Q3  max     mean       sd   n missing
   0  0    288 1516 4950 740.5764 871.3142 753       0
$husage
 min Q1 median Q3 max     mean       sd   n missing
  30 38     46 52  60 45.12085 8.058793 753       0
$wage
    min     Q1 median      Q3 max     mean       sd   n missing
 0.1282 2.2626 3.4819 4.97075  25 4.177682 3.310282 428     325
$huswage
    min     Q1 median     Q3    max     mean       sd   n missing
 0.4121 4.7883 6.9758 9.1667 40.509 7.482179 4.230559 753       0
You can see from the output that we produced four sets of summary statistics for the four variables which we subselected into mydata.sub2. The magic happens via the dfapply function. This function applies the chosen function (here the favstats function) to all elements in a dataframe. And the dataframe we handed in was mydata.sub2, which contained the four chosen variables.
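The same "apply a function to every column" idea exists in base R as sapply. Here is a sketch under the assumption that a handful of statistics is enough: my_stats is a hypothetical helper written for this example, a much reduced stand-in for favstats, and toy is a made-up dataframe.

```r
# Toy dataframe standing in for mydata.sub2 (values are made up)
toy <- data.frame(hours = c(1610, 0, 1980, 456),
                  wage  = c(3.35, NA, 4.59, 2.10))

# my_stats is a hypothetical helper, a much reduced stand-in for favstats
my_stats <- function(x) {
  c(mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    n       = sum(!is.na(x)),
    missing = sum(is.na(x)))
}

# sapply calls my_stats once per column and binds the results into a matrix
res <- sapply(toy, my_stats)
res
```

The result has one column per variable and one row per statistic, which is often easier to scan than a list of separate tables.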
Re-classifying categorical/factor variables
When you have categorical data you may often want to re-classify your categories into new, usually broader categories. In the current dataset this isn't really an issue, but let's say we did have an ethnicity variable in our dataframe and, for argument's sake, assume that these data are in mydata$Ethnicity and that they are encoded as a factor variable.
The reason for re-classifying (or re-coding) is that sometimes some categories will be too small. To find your frequencies you can use the table(mydata$Ethnicity) or summary(mydata$Ethnicity) command. If you do that you may find something like:
      Asian       Black Mixed Asian
        120         254           2
Mixed Black Mixed White       White
         15          12         350
Let's say you want to amalgamate the Mixed categories into one big "Mixed" category. Here is the easiest way to do this. We create a new variable in our dataframe
mydata$Eth_cat <- as.character(0) # new variable is called Eth_cat, initially as character variable
Now we need to define the values this new variable should take:
mydata$Eth_cat[mydata$Ethnicity == "Asian"] <- "Asian"
mydata$Eth_cat[mydata$Ethnicity == "Black"] <- "Black"
mydata$Eth_cat[mydata$Ethnicity == "Mixed Asian"] <- "Mixed"
mydata$Eth_cat[mydata$Ethnicity == "Mixed Black"] <- "Mixed"
mydata$Eth_cat[mydata$Ethnicity == "Mixed White"] <- "Mixed"
mydata$Eth_cat[mydata$Ethnicity == "White"] <- "White"
In each line we are selecting all rows in the dataframe for which the Ethnicity variable takes a certain value, e.g. mydata$Eth_cat[mydata$Ethnicity == "Asian"] selects all rows with Asian respondents. Then we assign a value to these rows, e.g. <- "Asian". We do this for all possible categories in Ethnicity. What we have created at this stage is a new variable with all the desired values. It is, however, at this stage a text-based variable and it may be advantageous to transform it into a factor (categorical) variable. This is very straightforward:
mydata$Eth_cat <- as.factor(mydata$Eth_cat)
and you are good to go!
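If the variable is already a factor, there is a shorter route that renames the levels directly. This is a different technique from the line-by-line assignment above, sketched here on a made-up factor eth standing in for mydata$Ethnicity:

```r
# Made-up factor standing in for mydata$Ethnicity
eth <- factor(c("Asian", "Mixed Asian", "White", "Mixed Black", "Black", "Mixed White"))

eth_cat <- eth   # work on a copy so the original stays untouched
# Rename every level starting with "Mixed"; identical level names are merged
levels(eth_cat)[grepl("^Mixed", levels(eth_cat))] <- "Mixed"

table(eth_cat)
```

The line-by-line approach is more explicit and easier to audit, while the level-renaming approach is shorter when many categories share a common prefix.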
Footnotes
- ↑ As usual, you can check ?tally to find out all about this function and how to use it.