R AnalysisTidy
Introduction
A video walk-through of this exercise can be found here
In this little project we will demonstrate how to use the mightily powerful packages of the "tidyverse" to perform some data analysis. Some basic data analysis is also described here, but the power of the procedures shown here lies in the more advanced data preparation that can be done. In particular we learn how to perform more advanced filtering and grouping tasks such that data analysis can then be applied to a range of different data slices. Those of you who have some Excel experience may be familiar with pivot tables, and we are aiming to perform tasks that are similar to what pivot tables can do.
So before we do anything else you should install the tidyverse package and then load it:
library(tidyverse)
By the way, at this stage you should take five minutes to learn about Hadley Wickham, a real hero for data nerds. And if you think at the end of this section "Wow, that is powerful and quite straightforward", you have him to thank for it.
Loading a dataset
Let's get a dataset to look at. We shall use the Baseball wages dataset, which covers 353 baseball players in 1993 (get the data file from the ECLR page).
mydata <- read.csv("YOURPATH/mlb1.csv")
Let's check out what variables we have in this data-file
names(mydata)
## [1] "salary" "teamsal" "nl" "years" "games" "atbats" ## [7] "runs" "hits" "doubles" "triples" "hruns" "rbis" ## [13] "bavg" "bb" "so" "sbases" "fldperc" "frstbase" ## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher" "yrsallst" ## [25] "hispan" "black" "whitepop" "blackpop" "hisppop" "pcinc" ## [31] "gamesyr" "hrunsyr" "atbatsyr" "allstar" "slugavg" "rbisyr" ## [37] "sbasesyr" "runsyr" "percwhte" "percblck" "perchisp" "blckpb" ## [43] "hispph" "whtepw" "blckph" "hisppb" "lsalary"
You can find short variable descriptions here and of course you need to understand what data types the variables represent (check str(mydata) to confirm the R data types).
You can perhaps see that the positional information is organised in individual positional variables ("frstbase" "scndbase" "shrtstop" "thrdbase" "outfield" "catcher") that take the value 1 if a player plays in a particular position.
To confirm that each player is only assigned one position we calculate the following:
temp <- rowSums(mydata[,c("frstbase","scndbase","shrtstop","thrdbase","outfield","catcher")])
min(temp)
## [1] 1
max(temp)
## [1] 1
As the result is 1 for both the minimum and the maximum, we have confirmed that every player has been assigned exactly one position.
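A slightly more compact version of the same check is the following sketch, which should return TRUE only if every row sums to exactly 1:
# TRUE only if every player has exactly one position dummy equal to 1
all(temp == 1)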
A similar situation exists with the ethnicity variable. We have two variables ("hispan" and "black") which are 1 if the respective player is either Hispanic or black. If both are 0 the player is white.
Let us now create two variables ("position" and "race") which summarise the respective information in one variable each.
mydata$position <- "First Base" mydata$position[mydata$scndbase == 1] <- "Second Base" mydata$position[mydata$shrtstop == 1] <- "Short Stop" mydata$position[mydata$thrdbase == 1] <- "Third Base" mydata$position[mydata$outfield == 1] <- "Outfield" mydata$position[mydata$catcher == 1] <- "Catcher" mydata$position <- as.factor(mydata$position) # now ensure it is a factor variable mydata$race <- "White" mydata$race[mydata$hispan == 1] <- "Hispanic" mydata$race[mydata$black == 1] <- "Black" mydata$race <- as.factor(mydata$race) # now ensure it is a factor variable
What data dimensions are you interested in?
Almost the most difficult task in data analysis, in particular if you have a dataset with as many different variables as the one here, is to know what you are interested in. Once you know that, you have to find ways to slice the data into the right bits before you analyse them. That is the main task to learn here.
A flashback
Remember a few basic commands before we proceed. If you want a quick summary of a particular variable in the data frame, say salary, you use:
summary(mydata$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  109000  253600  675000 1346000 2250000 6329000
summary(mydata$position)
##     Catcher  First Base    Outfield Second Base  Short Stop  Third Base
##          52          45         136          37          49          34
If you know exactly which statistic you are after, you can calculate it immediately, as such:
max(mydata$salary)
## [1] 6329213
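If you also want to know which row that maximum belongs to, the base R function which.max() returns its index. A small sketch (the selected columns are just an illustrative choice):
# Show salary, position and race of the highest-paid player in the data
mydata[which.max(mydata$salary), c("salary", "position", "race")]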
First pipe!
So let's learn by doing.
Let's say we want to see the average salary for each position. First we'll see how to do it, and afterwards we explain what happened.
mydata %>% group_by(position) %>% summarise(mean(salary))
## # A tibble: 6 × 2
##      position `mean(salary)`
##        <fctr>          <dbl>
## 1     Catcher       892519.2
## 2  First Base      1586781.5
## 3    Outfield      1539324.3
## 4 Second Base      1309640.9
## 5  Short Stop      1069210.7
## 6  Third Base      1382647.1
Here we used the %>% piping operator. What this does is best described in words. Here we did the following: "Take the dataset mydata, group the data by position and then summarise the data by presenting the mean salary for each group".
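If it helps, the same operation can also be written without the pipe. The nested version below should produce exactly the same table; it is just much harder to read once more steps are involved:
# Nested (non-piped) equivalent of the pipe above
summarise(group_by(mydata, position), mean(salary))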
Let's show a few variations here:
mydata %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     52   892519.2
## 2  First Base     45  1586781.5
## 3    Outfield    136  1539324.3
## 4 Second Base     37  1309640.9
## 5  Short Stop     49  1069210.7
## 6  Third Base     34  1382647.1
Here we added another aspect of the above groups to the final display, namely the number of observations. By checking length(salary) we are basically finding out how many group members there are. Here, for instance, we see that there are 52 catchers in the database.
Also, by writing avg.salary = mean(salary) in summarise rather than just mean(salary), we can rename the column in which the salary mean is displayed.
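The same logic extends to further summary columns. As a sketch (the column names med.salary and sd.salary are just illustrative choices), you could report the median and standard deviation alongside the mean; note that n() is dplyr's shorthand for the group size, equivalent to length(salary) here:
# Add median and standard deviation of salary next to the mean, per position
mydata %>% group_by(position) %>% summarise(number = n(), avg.salary = mean(salary), med.salary = median(salary), sd.salary = sd(salary))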
Simple pivot tables
Let's start with what I call simple pivot tables: tables where we group the data by one variable.
The core tools
Now we look at each of the main tools in our toolbox.
group_by()
The main work in the example above was done by the group_by command. The variables by which we group will typically be categorical variables. Often these will be defined as factor variables, but they could also be, for instance, int variables, such as black.
mydata %>% group_by(black) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   black `length(salary)` `mean(salary)`
##   <int>            <int>          <dbl>
## 1     0              245        1209602
## 2     1              108        1654350
Interestingly this would suggest that black players earn higher salaries. However,
mydata %>% group_by(hispan) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   hispan `length(salary)` `mean(salary)`
##    <int>            <int>          <dbl>
## 1      0              289        1410990
## 2      1               64        1050723
reveals that it is Hispanic players who earned significantly less than the others, and the full picture is only revealed by using our race variable:
mydata %>% group_by(race) %>% summarise(length(salary),mean(salary))
## # A tibble: 3 × 3
##       race `length(salary)` `mean(salary)`
##     <fctr>            <int>          <dbl>
## 1    Black              108        1654350
## 2 Hispanic               64        1050723
## 3    White              181        1265780
At face value these results suggest that, on average, black players earn the most and Hispanic players the least. Of course there are a number of other factors at play which these very simple summary statistics do not take account of, and the three groups very likely differ in other aspects that are relevant for player salary.
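A small aside before we move on: when a grouping variable is a 0/1 dummy like black, you can turn it into a labelled factor on the fly inside group_by(). This is only a sketch and the labels are my own choice:
# Relabel the 0/1 dummy inside the pipe (0 -> "not black", 1 -> "black")
mydata %>% group_by(black = factor(black, labels = c("not black", "black"))) %>% summarise(number = length(salary), avg.salary = mean(salary))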
filter()
The filter() command allows us to restrict the analysis to a subset of the data. Here is how we could use this command if we only wanted to look at players that have not (by 1993) been an All Star player (yrsallst == 0).
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  First Base     31   827747.3
## 3    Outfield     93   858689.3
## 4 Second Base     25   717133.3
## 5  Short Stop     38   687741.2
## 6  Third Base     21   701785.7
When comparing this table to the table above we can of course see that we are now looking at fewer players and their salaries are lower.
We can look at all All Stars (yrsallst > 0) by changing the input into the filter command:
mydata %>% filter(yrsallst > 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     10    2175000
## 2  First Base     14    3267500
## 3    Outfield     43    3011395
## 4 Second Base     12    2544032
## 5  Short Stop     11    2387014
## 6  Third Base     13    2482500
We immediately see that All Stars attract significantly higher salaries (note, this is not a causal relationship!). They are All Stars because they are good players, and it is being a good player that earns them a high salary. Of course there may still be a premium for All Stars, but you cannot conclude this from the above statistics.
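Note that filter() happily takes several conditions at once (separated by commas, which act like a logical AND). The following sketch, for instance, would restrict the summary to All Stars who play in the outfield:
# Two conditions in one filter(): All Stars playing in the outfield
mydata %>% filter(yrsallst > 0, position == "Outfield") %>% summarise(number = length(salary), avg.salary = mean(salary))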
arrange()
Let's say you wanted to arrange the table such that positions with lower salaries are shown first. The arrange command is the tool you need.
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary)) %>% arrange(avg.salary)
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  Short Stop     38   687741.2
## 3  Third Base     21   701785.7
## 4 Second Base     25   717133.3
## 5  First Base     31   827747.3
## 6    Outfield     93   858689.3
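If you wanted the best-paid positions at the top instead, you wrap the sorting variable into desc(). The following is the same pipe with reversed ordering (a sketch):
# Same table as above, but sorted from highest to lowest average salary
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary), avg.salary = mean(salary)) %>% arrange(desc(avg.salary))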
Double pivot tables
These are tables where we group the data by at least two dimensions, say position and race. So in the end we want a table that has positions in rows, race in columns and the respective group averages in the cells.
group_by() for more than one group
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary))
## Source: local data frame [18 x 3]
## Groups: position [?]
## 
##       position     race avg.salary
##         <fctr>   <fctr>      <dbl>
## 1      Catcher    Black   736000.0
## 2      Catcher Hispanic   970214.3
## 3      Catcher    White   887151.2
## 4   First Base    Black  1582916.7
## 5   First Base Hispanic   977833.3
## 6   First Base    White  1799057.7
## 7     Outfield    Black  1728032.4
## 8     Outfield Hispanic  1344531.6
## 9     Outfield    White  1319637.0
## 10 Second Base    Black  1715208.2
## 11 Second Base Hispanic  1315357.1
## 12 Second Base    White  1160343.0
## 13  Short Stop    Black  2007097.7
## 14  Short Stop Hispanic   682710.5
## 15  Short Stop    White  1103049.6
## 16  Third Base    Black  1019888.9
## 17  Third Base Hispanic  1309722.3
## 18  Third Base    White  1540992.4
As you can see it is pretty straightforward to group by more than one variable (you merely add another variable to the group_by() command), but we would like to display the result differently (positions in rows and race in columns).
spread() and arrange()
At this stage it is useful to notice that R returned the above tables in what are known as tibbles, which are a type of data frame. The above result had three variables: position, race and avg.salary, the last being the new display variable we created containing the grouped averages.
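Like any other data frame, such a result can be stored in an object and reused later. A small sketch (the name tab.posrace is just an illustrative choice):
# Store the grouped result for later use and check its class
tab.posrace <- mydata %>% group_by(position, race) %>% summarise(avg.salary = mean(salary))
class(tab.posrace)  # should include "tbl_df" and "data.frame"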
Rearranging the data display such that variation on one of the grouping variables is shown across different columns is achieved as follows:
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary)
## Source: local data frame [6 x 4]
## Groups: position [6]
## 
##      position   Black  Hispanic     White
## *      <fctr>   <dbl>     <dbl>     <dbl>
## 1     Catcher  736000  970214.3  887151.2
## 2  First Base 1582917  977833.3 1799057.7
## 3    Outfield 1728032 1344531.6 1319637.0
## 4 Second Base 1715208 1315357.1 1160343.0
## 5  Short Stop 2007098  682710.5 1103049.6
## 6  Third Base 1019889 1309722.3 1540992.4
As you see we merely added the spread command at the end, meaning that we send the previous result to the spread command. The spread command takes as its first input the variable that should form the columns (here race) and as its second input the variable that should fill the cells (here avg.salary).
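A side note: in more recent versions of the tidyverse, spread() has been superseded by pivot_wider(), which does the same job with more explicit argument names. If your tidyr installation already provides it, the equivalent call would look roughly like this (a sketch):
# pivot_wider() equivalent of the spread() call above (requires tidyr >= 1.0)
mydata %>% group_by(position, race) %>% summarise(avg.salary = mean(salary)) %>% pivot_wider(names_from = race, values_from = avg.salary)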
To illustrate that you can also group by more than two variables we first create a new variable AS, which is a boolean variable (TRUE or FALSE) depending on whether a player was an All Star in 1993. Then we merely add this new variable to our list of group_by variables.
mydata$AS <- (mydata$yrsallst > 0)
mydata %>% group_by(AS,position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary) %>% arrange(AS)
## Source: local data frame [12 x 5]
## Groups: AS, position [12]
## 
##       AS    position     Black  Hispanic     White
##    <lgl>      <fctr>     <dbl>     <dbl>     <dbl>
## 1  FALSE     Catcher  172000.0  238300.0  647152.8
## 2  FALSE  First Base  625694.5  521500.0 1014194.4
## 3  FALSE    Outfield  831628.8  762221.4  931295.3
## 4  FALSE Second Base  708750.0 1014000.0  626458.3
## 5  FALSE  Short Stop  269375.0  510718.8  938064.8
## 6  FALSE  Third Base  553166.7  456250.0  808153.8
## 7   TRUE     Catcher 1300000.0 2800000.0 2121428.6
## 8   TRUE  First Base 3018750.0 2575000.0 3565000.0
## 9   TRUE    Outfield 3136666.6 2975000.0 2678833.3
## 10  TRUE Second Base 2721666.5 2068750.0 2584035.5
## 11  TRUE  Short Stop 4324061.3 1600000.0 1696994.6
## 12  TRUE  Third Base 1953333.3 3016667.0 2599537.0
We smuggled one extra tool into this analysis. The last command here is arrange(AS). This merely told R to order the rows in the display table according to the variable AS. The rows are ordered in ascending order (as AS is a boolean variable that means from FALSE to TRUE). If you wanted a reversed ordering of AS and in addition a secondary ordering according to position name, you would achieve this by using arrange(desc(AS),position) instead.
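For completeness, the full pipe with this reversed ordering would look like this (a sketch; it only changes the final arrange() step):
# Reversed ordering on AS, then alphabetical ordering by position
mydata %>% group_by(AS, position, race) %>% summarise(avg.salary = mean(salary)) %>% spread(race, avg.salary) %>% arrange(desc(AS), position)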
Summary
Through this small exercise you got a taste of the piping technique. Once you understand the architecture of the commands you will realise just how powerful this approach is.
Extra Reading
I learned this initially from this website.