R AnalysisTidy

From ECLR
Jump to: navigation, search

Introdution

A video walk through this exercise can be found here

In this little project we will demonstrate how to use the mightily powerful packages of the "tidyverse" to perform some data analysis. Some basic data analysis is also described here but what the power of the procedures shown here lies in the more advanced data prparation that can be done. In particular we learn how to perform more advanced filtering and grouping tasks such that data analysis can then be applied to a range of different daa slices. Those of you who have some Excel experience may be familiar with pivot tables, and we are aiming to perform tasks that are similar to what pivot tables can do.

So before we do anything else you should install the tidyverse package and then load it:

library(tidyverse)

By the way, at this stage you should take five minuted to learn about Hadley Wickham a real hero for data nerds. And if you think at the end of this section "Wow, that is powerful and quite straightforward" you got him to thank for it.

Loading a dataset

Let's get a dataset to look at. We shall use the Baseball wages dataset, including 353 Baseball Players in 1993 (get the datafile from the ECLR page).

mydata <- read.csv("YOURPATH/mlb1.csv")

Let's check out what variables we have in this data-file

names(mydata)
##  [1] "salary"   "teamsal"  "nl"       "years"    "games"    "atbats"  
##  [7] "runs"     "hits"     "doubles"  "triples"  "hruns"    "rbis"    
## [13] "bavg"     "bb"       "so"       "sbases"   "fldperc"  "frstbase"
## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher"  "yrsallst"
## [25] "hispan"   "black"    "whitepop" "blackpop" "hisppop"  "pcinc"   
## [31] "gamesyr"  "hrunsyr"  "atbatsyr" "allstar"  "slugavg"  "rbisyr"  
## [37] "sbasesyr" "runsyr"   "percwhte" "percblck" "perchisp" "blckpb"  
## [43] "hispph"   "whtepw"   "blckph"   "hisppb"   "lsalary"

You can find short variable descriptions here and of course you need to understand what data types the variables represent (check str(mydata) to confirm the R datatypes.)

You can perhaps see that the positional information is organised in individual positional variables ("frstbase" "scndbase" "shrtstop" "thrdbase" "outfield" "catcher") that take the value 1 if a player plays in a particular position.

To confirm that each player is only assigned one position we calculate the following:

temp <- rowSums(mydata[,c("frstbase","scndbase","shrtstop","thrdbase","outfield","catcher")])
min(temp)
## [1] 1
max(temp)
## [1] 1

As the result is one for both min and max value we have confirmed that every player has been assigned exactly one position.

A similar situation exists with teh ethnicity variable. We have two variables ("hispan" "black") which are 1 if the respective player is ither black or hispanic. If both are 0 the player is white.

Let us now create two variables ("position" and "race") which summarise the respective information in one variable each.

mydata$position <- "First Base"
mydata$position[mydata$scndbase == 1] <- "Second Base"
mydata$position[mydata$shrtstop == 1] <- "Short Stop"
mydata$position[mydata$thrdbase == 1] <- "Third Base"
mydata$position[mydata$outfield == 1] <- "Outfield"
mydata$position[mydata$catcher == 1] <- "Catcher"
mydata$position <- as.factor(mydata$position)  # now ensure it is a factor variable

mydata$race <- "White"
mydata$race[mydata$hispan == 1] <- "Hispanic"
mydata$race[mydata$black == 1] <- "Black"
mydata$race <- as.factor(mydata$race)   # now ensure it is a factor variable

What data dimensions are you interested in?

Almost the most difficult task in data analysis, in particular if you have data with so many different variables as the dataset here, is to know what you are interested in. Once you know that you have to find ways to slice the data into the right bits before you analyse them. That is the main task to learn here.

A flashback

Remember a few basis commands before we proceed. If you want a quick summaries for a particular variable in the data frame, say salary you use:

summary(mydata$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  109000  253600  675000 1346000 2250000 6329000
summary(mydata$position)
##     Catcher  First Base    Outfield Second Base  Short Stop  Third Base 
##          52          45         136          37          49          34

If you know exectly the particular statistic you are afte, you can immediately calculate it as such

max(mydata$salary)
## [1] 6329213

First pipe!

So let's learn by doing.

Let's say we want to see the average salary for each position. First we'll see how we do it and we explain what happened afterwards.

mydata %>% group_by(position) %>% summarise(mean(salary))
## # A tibble: 6 × 2
##      position `mean(salary)`
##        <fctr>          <dbl>
## 1     Catcher       892519.2
## 2  First Base      1586781.5
## 3    Outfield      1539324.3
## 4 Second Base      1309640.9
## 5  Short Stop      1069210.7
## 6  Third Base      1382647.1

Here we used the %>% piping operator. What this does is best described in words. Here we did the following: "Thake the dataset mydata, group the data by position and then summarise the data by presenting the mean salary for each group".

Let's show a few variations here:

mydata %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     52   892519.2
## 2  First Base     45  1586781.5
## 3    Outfield    136  1539324.3
## 4 Second Base     37  1309640.9
## 5  Short Stop     49  1069210.7
## 6  Third Base     34  1382647.1

Here we added another aspect of the above groups to the final display, namely the number of observations. By cchecking length(salary) we are basically finding out how many group members there are. Here, for instance, we see that there are 52 catchers in the database.

Also by not just, in summarise, saying mean(salary) but rather avg.salary = mean(salary) we can rename the column in which the salary mean is displayed.

Simple pivot tables

Let's start with what I call simple pivot tables. Tables where we group by one variable.

The core tools

Now we look at each of the main tools in our toolbox

group_by

The main work in the example above was done by the group_by command. The variables by which we group will typically be categorical variables. Often these will be defined as factor variables. But they could also be, for instance, int variables, such as black.

mydata %>% group_by(black) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   black `length(salary)` `mean(salary)`
##   <int>            <int>          <dbl>
## 1     0              245        1209602
## 2     1              108        1654350

Interestingly this would suggest that black players earn higher salaries. However,

mydata %>% group_by(hispan) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   hispan `length(salary)` `mean(salary)`
##    <int>            <int>          <dbl>
## 1      0              289        1410990
## 2      1               64        1050723

reveals that it is hispanics that earned significantly less than the others and the full variety is only revealed by using our race variable:

mydata %>% group_by(race) %>% summarise(length(salary),mean(salary))
## # A tibble: 3 × 3
##       race `length(salary)` `mean(salary)`
##     <fctr>            <int>          <dbl>
## 1    Black              108        1654350
## 2 Hispanic               64        1050723
## 3    White              181        1265780

On face value these resuts suggest that, on average, black players earn most and hispanic players the least. Of course there are a numer of other factors at play which this very simple summary statistics does not take account of and the three groups very likely differ in other aspects taht are relevant for player salary.

filter()

The filter_by command allows us to remove a subset of the data. Here is how we could use this command if we only wanted to look at players that have not (by 1993) been an all star player (yrsallst == 0).

mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  First Base     31   827747.3
## 3    Outfield     93   858689.3
## 4 Second Base     25   717133.3
## 5  Short Stop     38   687741.2
## 6  Third Base     21   701785.7

When comparing this table to the table above we can of course see that we are now looking at fewer players and their salaries are lower.

We can look at all All Stars (yrsallst > 0) by changing the input into the filter command:

mydata %>% filter(yrsallst > 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     10    2175000
## 2  First Base     14    3267500
## 3    Outfield     43    3011395
## 4 Second Base     12    2544032
## 5  Short Stop     11    2387014
## 6  Third Base     13    2482500

immediately seeing that All Starts attract significantly higher salaries (note, this is not a causal relationship!). They are All Stars because they are good players and it is being a good player that earns them a high salary. Of course there may still be a premium for All Stars, but you cannot conclude this from the above statistics.

arrange()

Let's say you wanted to arrange the table such that positions with lower salaries are shown first. The arrange command is the tool you need.

mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary)) %>% arrange(avg.salary)
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  Short Stop     38   687741.2
## 3  Third Base     21   701785.7
## 4 Second Base     25   717133.3
## 5  First Base     31   827747.3
## 6    Outfield     93   858689.3

Double pivot tables

These are tables where we group the data by at least two dimensions, say position and race. So in the end we want a table that has positions in rows, race in columns and the respective group averages in the cells.

group_by() for more than one group

mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary))
## Source: local data frame [18 x 3]
## Groups: position [?]
## 
##       position     race avg.salary
##         <fctr>   <fctr>      <dbl>
## 1      Catcher    Black   736000.0
## 2      Catcher Hispanic   970214.3
## 3      Catcher    White   887151.2
## 4   First Base    Black  1582916.7
## 5   First Base Hispanic   977833.3
## 6   First Base    White  1799057.7
## 7     Outfield    Black  1728032.4
## 8     Outfield Hispanic  1344531.6
## 9     Outfield    White  1319637.0
## 10 Second Base    Black  1715208.2
## 11 Second Base Hispanic  1315357.1
## 12 Second Base    White  1160343.0
## 13  Short Stop    Black  2007097.7
## 14  Short Stop Hispanic   682710.5
## 15  Short Stop    White  1103049.6
## 16  Third Base    Black  1019888.9
## 17  Third Base Hispanic  1309722.3
## 18  Third Base    White  1540992.4

As you can see it is pretty straightforward to group by more than one variable (you merely add another variable to the group_by() command), but we would like to display the result differently (positions in rows and race in columns).

spread() and arrange()

At this stage it is useful to notice that R returned the above tables in what are known as tibbles, which are a type of dataframe. The above result had three variables: position, race and avg.salary, the last being the new display variable we created containing the grouped averages.

Rearranging the data display such that variation on one of the grouping variables is shown across different columns is achieved as follows:

mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary)
## Source: local data frame [6 x 4]
## Groups: position [6]
## 
##      position   Black  Hispanic     White
## *      <fctr>   <dbl>     <dbl>     <dbl>
## 1     Catcher  736000  970214.3  887151.2
## 2  First Base 1582917  977833.3 1799057.7
## 3    Outfield 1728032 1344531.6 1319637.0
## 4 Second Base 1715208 1315357.1 1160343.0
## 5  Short Stop 2007098  682710.5 1103049.6
## 6  Third Base 1019889 1309722.3 1540992.4

As you see we merely added the spread command at the end, meaning that we send the previous result to the spread command. The spread command takes as the first input the variable that should form the colums (here race) and as the second input the variable that should show in the cells (here `avg.salary').

To illustrate that you can also group by more than two variables we first create a new variable AS which is a boolean variable (TRUE or FALSE) depending on whether a player was an all start in 1993. Then we merely add this new variable into our list of group_by variables.

mydata$AS <- (mydata$yrsallst>0)
mydata %>% group_by(AS,position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary) %>% arrange(AS)
## Source: local data frame [12 x 5]
## Groups: AS, position [12]
## 
##       AS    position     Black  Hispanic     White
##    <lgl>      <fctr>     <dbl>     <dbl>     <dbl>
## 1  FALSE     Catcher  172000.0  238300.0  647152.8
## 2  FALSE  First Base  625694.5  521500.0 1014194.4
## 3  FALSE    Outfield  831628.8  762221.4  931295.3
## 4  FALSE Second Base  708750.0 1014000.0  626458.3
## 5  FALSE  Short Stop  269375.0  510718.8  938064.8
## 6  FALSE  Third Base  553166.7  456250.0  808153.8
## 7   TRUE     Catcher 1300000.0 2800000.0 2121428.6
## 8   TRUE  First Base 3018750.0 2575000.0 3565000.0
## 9   TRUE    Outfield 3136666.6 2975000.0 2678833.3
## 10  TRUE Second Base 2721666.5 2068750.0 2584035.5
## 11  TRUE  Short Stop 4324061.3 1600000.0 1696994.6
## 12  TRUE  Third Base 1953333.3 3016667.0 2599537.0

We smuggled one extra tool into this analysis. The last command here is arrange(AS). This merely told R to order the rows in the display table according to the variable AS. The rows are ordered in ascending order (as AS is a boolean variable that means from FALSE to TRUE). If you wanted a reversed ordering of AS and in addition a secondary ordering according to position name you would achieve this by using arrange(desc(AS),position) instead.

Summary

Through this small exercise you got a taste of how to use the mighty piping technique. Once you understand the architecture of the commands you will realise that this is an almighty technique.

Extra Reading

I learned this initially from this website.