R AnalysisTidy

RalfBecker
25 December 2017

Introduction


In this little project we will demonstrate how to use the mightily powerful packages of the "tidyverse" to perform some data analysis. Some basic data analysis is also described at http://eclr.humanities.manchester.ac.uk/index.php/R_Analysis, but the power of the procedures shown here lies in the more advanced data preparation that can be done. In particular we learn how to perform more advanced filtering and grouping tasks such that data analysis can then be applied to a range of different data slices. Those of you who have some Excel experience may be familiar with pivot tables, and we are aiming to perform tasks that are similar to what pivot tables can do.

So before we do anything else you should install the tidyverse package and then load it:
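A minimal sketch of this step, assuming a standard CRAN setup. The installation line only needs to be run once, so it is commented out here:

```r
# install.packages("tidyverse")  # one-off installation from CRAN
library(tidyverse)               # attaches dplyr, tidyr and the other core packages
```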


By the way, at this stage you should take five minutes to learn about Hadley Wickham (https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/), a real hero for data nerds. And if you think at the end of this section "Wow, that is powerful and quite straightforward", you have him to thank for it.

Loading a dataset

Let's get a dataset to look at. We shall use the baseball wages dataset, which covers 353 baseball players in 1993.

mydata <- read.csv("C:/Users/msassrb2/Dropbox (The University of Manchester)/ECLR/R/SummaryStatsTidyverse/mlb1.csv")

Let's check which variables we have in this data file:

names(mydata)

##  [1] "salary"   "teamsal"  "nl"       "years"    "games"    "atbats"  
##  [7] "runs"     "hits"     "doubles"  "triples"  "hruns"    "rbis"    
## [13] "bavg"     "bb"       "so"       "sbases"   "fldperc"  "frstbase"
## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher"  "yrsallst"
## [25] "hispan"   "black"    "whitepop" "blackpop" "hisppop"  "pcinc"   
## [31] "gamesyr"  "hrunsyr"  "atbatsyr" "allstar"  "slugavg"  "rbisyr"  
## [37] "sbasesyr" "runsyr"   "percwhte" "percblck" "perchisp" "blckpb"  
## [43] "hispph"   "whtepw"   "blckph"   "hisppb"   "lsalary"

You can find short variable descriptions here, and of course you need to understand what data types the variables represent (check str(mydata) to confirm the R data types).

You can perhaps see that the positional information is organised in individual positional variables ("frstbase" "scndbase" "shrtstop" "thrdbase" "outfield" "catcher") that take the value 1 if a player plays in a particular position.

To confirm that each player is only assigned one position we calculate the following:

temp <- rowSums(mydata[,c("frstbase","scndbase","shrtstop","thrdbase","outfield","catcher")])
min(temp)
## [1] 1
max(temp)
## [1] 1

As the result is one for both min and max value we have confirmed that every player has been assigned exactly one position.
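The same check can be sketched on a small made-up data frame. The data and the names toy, pos_count and all_one are purely illustrative and not part of the baseball dataset. A row sum other than 1 would flag a player with no position or with more than one:

```r
# Hypothetical toy data: three players, three position dummies
toy <- data.frame(frstbase = c(1, 0, 0),
                  outfield = c(0, 1, 0),
                  catcher  = c(0, 0, 1))

# Number of positions recorded for each player
pos_count <- rowSums(toy[, c("frstbase", "outfield", "catcher")])

# TRUE only if every player has exactly one position
all_one <- all(pos_count == 1)
all_one
```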

A similar situation exists with the ethnicity variable. We have two variables ("hispan" "black") which are 1 if the respective player is either Hispanic or black. If both are 0 the player is white.

Let us now create two variables ("position" and "race") which summarise the respective information in one variable each.

mydata$position <- "First Base"
mydata$position[mydata$scndbase == 1] <- "Second Base"
mydata$position[mydata$shrtstop == 1] <- "Short Stop"
mydata$position[mydata$thrdbase == 1] <- "Third Base"
mydata$position[mydata$outfield == 1] <- "Outfield"
mydata$position[mydata$catcher == 1] <- "Catcher"
mydata$position <- as.factor(mydata$position)  # now ensure it is a factor variable

mydata$race <- "White"
mydata$race[mydata$hispan == 1] <- "Hispanic"
mydata$race[mydata$black == 1] <- "Black"
mydata$race <- as.factor(mydata$race)   # now ensure it is a factor variable
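As an alternative, the same kind of recoding can be sketched in tidyverse style with mutate() and case_when(). This is only an illustration on made-up data (the data frame toy is hypothetical). Since case_when() uses the first condition that matches, black is checked first to mirror the order of the assignments above:

```r
library(dplyr)

# Hypothetical toy data with the same dummy coding as the baseball data
toy <- data.frame(hispan = c(0, 1, 0), black = c(0, 0, 1))

toy <- toy %>%
  mutate(race = case_when(
    black == 1  ~ "Black",     # checked first, like the last assignment above
    hispan == 1 ~ "Hispanic",
    TRUE        ~ "White"      # fallback when both dummies are 0
  )) %>%
  mutate(race = as.factor(race))  # ensure it is a factor variable
```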

What data dimensions are you interested in?

Perhaps the most difficult task in data analysis, in particular if you have data with as many variables as the dataset here, is knowing what you are interested in. Once you know that, you have to find ways to slice the data into the right bits before you analyse them. That is the main task to learn here.

A flashback

Remember a few basic commands before we proceed. If you want a quick summary of a particular variable in the data frame, say salary or position, you use:

summary(mydata$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  109000  253600  675000 1346000 2250000 6329000
summary(mydata$position)
##     Catcher  First Base    Outfield Second Base  Short Stop  Third Base 
##          52          45         136          37          49          34

If you know exactly which statistic you are after, you can calculate it directly, for instance the maximum salary:

max(mydata$salary)
## [1] 6329213

First pipe!

So let's learn by doing.

Let's say we want to see the average salary for each position. Let's first see how we do it and then explain

mydata %>% group_by(position) %>% summarise(mean(salary))
## # A tibble: 6 × 2
##      position `mean(salary)`
##        <fctr>          <dbl>
## 1     Catcher       892519.2
## 2  First Base      1586781.5
## 3    Outfield      1539324.3
## 4 Second Base      1309640.9
## 5  Short Stop      1069210.7
## 6  Third Base      1382647.1

Here we used the %>% piping operator. What this does is best described in words. Here we did the following: "Take the dataset mydata, group the data by position and then summarise the data by presenting the mean salary for each group".
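To see what the pipe does mechanically, here is a small sketch on made-up data (the data frame toy and the names piped and nested are illustrative only). The pipe passes the result on its left as the first argument to the function on its right, so the piped and the nested versions give the same result:

```r
library(dplyr)

# Hypothetical toy data, not the baseball dataset
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield"),
                  salary   = c(100, 300, 500))

# Piped version: read from left to right
piped <- toy %>% group_by(position) %>% summarise(avg = mean(salary))

# Equivalent nested version: read from the inside out
nested <- summarise(group_by(toy, position), avg = mean(salary))

identical(piped$avg, nested$avg)
```

The piped version keeps the steps in the order in which you would describe them in words, which is why it is used throughout this page.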

Let's show a few variations here:

mydata %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     52   892519.2
## 2  First Base     45  1586781.5
## 3    Outfield    136  1539324.3
## 4 Second Base     37  1309640.9
## 5  Short Stop     49  1069210.7
## 6  Third Base     34  1382647.1

Here we added another aspect of the above groups. By checking length(salary) we are basically finding out how many group members there are. Here, for instance, we see that there are 52 catchers in the database.

Also, by writing avg.salary = mean(salary) in summarise, rather than just mean(salary), we can name the column in which the salary mean is displayed.
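As a side note, dplyr also offers the helper n(), which counts the rows in each group without referring to any particular variable. A minimal sketch on made-up data (toy and counts are illustrative names):

```r
library(dplyr)

# Hypothetical toy data, not the baseball dataset
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield"),
                  salary   = c(100, 300, 500))

counts <- toy %>%
  group_by(position) %>%
  summarise(number = n(),                # number of rows in each group
            avg.salary = mean(salary))   # named summary column
counts
```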

Simple pivot tables

Let's start with what I call simple pivot tables, that is, tables where we group by one variable.

The core tools

Now we look at each of the main tools in our toolbox


The main work in the example above was done by the group_by command. The variables by which we group will typically be categorical variables. Often these will be defined as factor variables. But they could also be, for instance, int variables, such as black.

mydata %>% group_by(black) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   black `length(salary)` `mean(salary)`
##   <int>            <int>          <dbl>
## 1     0              245        1209602
## 2     1              108        1654350

Interestingly this would suggest that black players earn higher salaries. However,

mydata %>% group_by(hispan) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   hispan `length(salary)` `mean(salary)`
##    <int>            <int>          <dbl>
## 1      0              289        1410990
## 2      1               64        1050723

reveals that it is Hispanic players who earned considerably less than the others, and the full picture is only revealed by using our race variable:

mydata %>% group_by(race) %>% summarise(length(salary),mean(salary))
## # A tibble: 3 × 3
##       race `length(salary)` `mean(salary)`
##     <fctr>            <int>          <dbl>
## 1    Black              108        1654350
## 2 Hispanic               64        1050723
## 3    White              181        1265780

Without any further analysis one should not draw any early conclusions from this yet.


The filter command allows us to keep only a subset of the data, namely the rows that satisfy a condition. Here is how we could use this command if we only wanted to look at players that have not (by 1993) been an All Star player.

mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  First Base     31   827747.3
## 3    Outfield     93   858689.3
## 4 Second Base     25   717133.3
## 5  Short Stop     38   687741.2
## 6  Third Base     21   701785.7

When comparing this table to the table above we can of course see that we are now looking at fewer players and their salaries are lower.

We can look at all All Stars by

mydata %>% filter(yrsallst > 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     10    2175000
## 2  First Base     14    3267500
## 3    Outfield     43    3011395
## 4 Second Base     12    2544032
## 5  Short Stop     11    2387014
## 6  Third Base     13    2482500

immediately seeing that All Stars attract considerably higher salaries (note, this is not a causal relationship!). They are All Stars because they are good players, and it is being a good player that earns them a high salary.
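filter can also take several conditions at once. A minimal sketch on made-up data (toy and allstar_outfield are illustrative names), showing that conditions separated by commas are combined with a logical AND:

```r
library(dplyr)

# Hypothetical toy data, not the baseball dataset
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield", "Outfield"),
                  yrsallst = c(0, 2, 0, 1),
                  salary   = c(100, 300, 500, 700))

# Keep only rows that satisfy both conditions
allstar_outfield <- toy %>% filter(yrsallst > 0, position == "Outfield")

nrow(allstar_outfield)
```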


Let's say you wanted to arrange the table such that positions with lower salaries are shown first. The arrange command is the tool you need.

mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary)) %>% arrange(avg.salary)
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  Short Stop     38   687741.2
## 3  Third Base     21   701785.7
## 4 Second Base     25   717133.3
## 5  First Base     31   827747.3
## 6    Outfield     93   858689.3
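By default arrange sorts in ascending order. If you want the highest values first, wrapping the variable in desc() should do the job. A small sketch on made-up numbers (toy and sorted are illustrative names):

```r
library(dplyr)

# Hypothetical toy summary table, not the real averages
toy <- data.frame(position   = c("Catcher", "Outfield", "Short Stop"),
                  avg.salary = c(587, 859, 688))

# Sort from highest to lowest average salary
sorted <- toy %>% arrange(desc(avg.salary))
sorted
```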

Double pivot tables

These are tables where we group the data by at least two dimensions, say position and race. So in the end we want a table that has positions in rows, race in columns and the respective group averages in the cells.

mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary))
## Source: local data frame [18 x 3]
## Groups: position [?]
##       position     race avg.salary
##         <fctr>   <fctr>      <dbl>
## 1      Catcher    Black   736000.0
## 2      Catcher Hispanic   970214.3
## 3      Catcher    White   887151.2
## 4   First Base    Black  1582916.7
## 5   First Base Hispanic   977833.3
## 6   First Base    White  1799057.7
## 7     Outfield    Black  1728032.4
## 8     Outfield Hispanic  1344531.6
## 9     Outfield    White  1319637.0
## 10 Second Base    Black  1715208.2
## 11 Second Base Hispanic  1315357.1
## 12 Second Base    White  1160343.0
## 13  Short Stop    Black  2007097.7
## 14  Short Stop Hispanic   682710.5
## 15  Short Stop    White  1103049.6
## 16  Third Base    Black  1019888.9
## 17  Third Base Hispanic  1309722.3
## 18  Third Base    White  1540992.4

As you can see it is pretty straightforward to group by more than one variable (you merely add another variable to the group_by() command), but we would like to display the result differently (positions in rows and race in columns).

This is achieved as follows:

mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary)
## Source: local data frame [6 x 4]
## Groups: position [6]
##      position   Black  Hispanic     White
## *      <fctr>   <dbl>     <dbl>     <dbl>
## 1     Catcher  736000  970214.3  887151.2
## 2  First Base 1582917  977833.3 1799057.7
## 3    Outfield 1728032 1344531.6 1319637.0
## 4 Second Base 1715208 1315357.1 1160343.0
## 5  Short Stop 2007098  682710.5 1103049.6
## 6  Third Base 1019889 1309722.3 1540992.4

As you see we merely added the spread command at the end, meaning that we send the previous result to the spread command. The spread command takes as the first input the variable that should form the columns and as the second the variable that should show in the cells.
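In more recent versions of tidyr (1.0.0 and later) spread still works but has been superseded by pivot_wider. The following sketch shows the same reshaping on made-up data, assuming a recent installation of dplyr and tidyr (the data frame toy and the result wide are illustrative names):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, not the baseball dataset
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield", "Outfield"),
                  race     = c("Black", "White", "Black", "White"),
                  salary   = c(100, 300, 500, 700))

wide <- toy %>%
  group_by(position, race) %>%
  summarise(avg.salary = mean(salary), .groups = "drop") %>%  # drop grouping after summarising
  pivot_wider(names_from = race,          # variable that forms the columns
              values_from = avg.salary)   # variable that fills the cells
wide
```

Here names_from plays the role of the first input to spread and values_from that of the second.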

To illustrate that you can also group by more than two variables, we first create a new variable AS, which is a boolean variable (TRUE or FALSE) indicating whether a player has been an All Star by 1993. Then we merely add this new variable to our list of group_by variables.

mydata$AS <- (mydata$yrsallst>0)
mydata %>% group_by(AS,position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary) %>% arrange(AS)
## Source: local data frame [12 x 5]
## Groups: AS, position [12]
##       AS    position     Black  Hispanic     White
##    <lgl>      <fctr>     <dbl>     <dbl>     <dbl>
## 1  FALSE     Catcher  172000.0  238300.0  647152.8
## 2  FALSE  First Base  625694.5  521500.0 1014194.4
## 3  FALSE    Outfield  831628.8  762221.4  931295.3
## 4  FALSE Second Base  708750.0 1014000.0  626458.3
## 5  FALSE  Short Stop  269375.0  510718.8  938064.8
## 6  FALSE  Third Base  553166.7  456250.0  808153.8
## 7   TRUE     Catcher 1300000.0 2800000.0 2121428.6
## 8   TRUE  First Base 3018750.0 2575000.0 3565000.0
## 9   TRUE    Outfield 3136666.6 2975000.0 2678833.3
## 10  TRUE Second Base 2721666.5 2068750.0 2584035.5
## 11  TRUE  Short Stop 4324061.3 1600000.0 1696994.6
## 12  TRUE  Third Base 1953333.3 3016667.0 2599537.0