R AnalysisTidy
RalfBecker, 25 December 2017
Revision as of 22:49, 26 December 2017
Introduction
In this little project we will demonstrate how to use the mightily powerful packages of the "tidyverse" to perform some data analysis. Some basic data analysis is also described at http://eclr.humanities.manchester.ac.uk/index.php/R_Analysis but the power of the procedures shown here lies in the more advanced data preparation that can be done. In particular we learn how to perform more advanced filtering and grouping tasks such that data analysis can then be applied to a range of different data slices. Those of you who have some Excel experience may be familiar with pivot tables, and we are aiming to perform tasks that are similar to what pivot tables can do.
So before we do anything else you should install the tidyverse package (install.packages("tidyverse"), which is only needed once) and then load it:
library(tidyverse)
By the way, at this stage you should take five minutes to learn about Hadley Wickham (https://priceonomics.com/hadley-wickham-the-man-who-revolutionized-r/), a real hero for data nerds. And if you think at the end of this section "Wow, that is powerful and quite straightforward", you have him to thank for it.
Loading a dataset
Let's get a dataset to look at. We shall use the Baseball wages dataset, including 353 Baseball Players in 1993.
mydata <- read.csv("C:/Users/msassrb2/Dropbox (The University of Manchester)/ECLR/R/SummaryStatsTidyverse/mlb1.csv")
Let's check out what variables we have in this data file:
names(mydata)
##  [1] "salary"   "teamsal"  "nl"       "years"    "games"    "atbats"
##  [7] "runs"     "hits"     "doubles"  "triples"  "hruns"    "rbis"
## [13] "bavg"     "bb"       "so"       "sbases"   "fldperc"  "frstbase"
## [19] "scndbase" "shrtstop" "thrdbase" "outfield" "catcher"  "yrsallst"
## [25] "hispan"   "black"    "whitepop" "blackpop" "hisppop"  "pcinc"
## [31] "gamesyr"  "hrunsyr"  "atbatsyr" "allstar"  "slugavg"  "rbisyr"
## [37] "sbasesyr" "runsyr"   "percwhte" "percblck" "perchisp" "blckpb"
## [43] "hispph"   "whtepw"   "blckph"   "hisppb"   "lsalary"
You can find short variable descriptions here, and of course you need to understand what data types the variables represent (check str(mydata) to confirm the R data types).
You can perhaps see that the positional information is organised in individual positional variables ("frstbase" "scndbase" "shrtstop" "thrdbase" "outfield" "catcher") that take the value 1 if a player plays in a particular position.
To confirm that each player is only assigned one position we calculate the following:
temp <- rowSums(mydata[,c("frstbase","scndbase","shrtstop","thrdbase","outfield","catcher")])
min(temp)
## [1] 1
max(temp)
## [1] 1
As the result is one for both min and max value we have confirmed that every player has been assigned exactly one position.
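The min/max check can also be written as a single logical test. The following is only a minimal sketch on a small made-up data frame (`toy` is hypothetical and is not the mlb1 data), using two of the indicator column names for illustration:

```r
# Toy data (made up for illustration): each row should contain exactly one 1
toy <- data.frame(frstbase = c(1, 0, 0),
                  catcher  = c(0, 1, 1))

# Number of positions assigned to each player
n_positions <- rowSums(toy[, c("frstbase", "catcher")])

# TRUE only if every player has exactly one position
all_one <- all(n_positions == 1)
print(all_one)
```

If the result were FALSE, table(n_positions) would show how many players have zero, one or several positions.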
A similar situation exists with the ethnicity variables. We have two variables ("hispan" "black") which are 1 if the respective player is hispanic or black, respectively. If both are 0 the player is white.
Let us now create two variables ("position" and "race") which summarise the respective information in one variable each.
mydata$position <- "First Base"
mydata$position[mydata$scndbase == 1] <- "Second Base"
mydata$position[mydata$shrtstop == 1] <- "Short Stop"
mydata$position[mydata$thrdbase == 1] <- "Third Base"
mydata$position[mydata$outfield == 1] <- "Outfield"
mydata$position[mydata$catcher == 1] <- "Catcher"
mydata$position <- as.factor(mydata$position)  # now ensure it is a factor variable

mydata$race <- "White"
mydata$race[mydata$hispan == 1] <- "Hispanic"
mydata$race[mydata$black == 1] <- "Black"
mydata$race <- as.factor(mydata$race)  # now ensure it is a factor variable
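The same kind of recoding can also be expressed in tidyverse style with mutate() and case_when() from dplyr. This is only a sketch of an alternative, not the approach used on this page; the data frame `toy` is made up for illustration and uses just two of the indicator columns:

```r
library(dplyr)

# Toy data (made up for illustration), with two of the indicator columns
toy <- data.frame(scndbase = c(0, 1, 0),
                  catcher  = c(0, 0, 1))

toy <- toy %>%
  mutate(position = case_when(
    scndbase == 1 ~ "Second Base",
    catcher  == 1 ~ "Catcher",
    TRUE          ~ "First Base"   # default when no indicator is 1
  )) %>%
  mutate(position = as.factor(position))  # ensure it is a factor variable

print(toy)
```

The conditions are checked in order, and the first one that is TRUE determines the value.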
What data dimensions are you interested in?
Perhaps the most difficult task in data analysis, in particular if you have data with as many different variables as the dataset here, is to know what you are interested in. Once you know that, you have to find ways to slice the data into the right bits before you analyse them. That is the main task to learn here.
A flashback
Remember a few basic commands before we proceed. If you want a quick summary for a particular variable in the data frame, say salary, you use:
summary(mydata$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  109000  253600  675000 1346000 2250000 6329000
summary(mydata$position)
##     Catcher  First Base    Outfield Second Base  Short Stop  Third Base
##          52          45         136          37          49          34
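If you know exactly which statistic you are after, you can calculate it directly. Note that summary() displays rounded values, which is why the exact maximum below differs slightly from the one shown above.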
max(mydata$salary)
## [1] 6329213
First pipe!
So let's learn by doing.
Let's say we want to see the average salary for each position. Let's first see how we do it and then explain:
mydata %>% group_by(position) %>% summarise(mean(salary))
## # A tibble: 6 × 2
##      position `mean(salary)`
##        <fctr>          <dbl>
## 1     Catcher       892519.2
## 2  First Base      1586781.5
## 3    Outfield      1539324.3
## 4 Second Base      1309640.9
## 5  Short Stop      1069210.7
## 6  Third Base      1382647.1
Here we used the %>% piping operator. What this does is best described in words. Here we did the following: "Take the dataset mydata, group the data by position and then summarise the data by presenting the mean salary for each group".
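To see what the pipe does mechanically, it can help to compare a piped command with the equivalent nested function calls. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the baseball data); the column name `avg` is likewise chosen only for illustration:

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield"),
                  salary   = c(100, 300, 500))

# With the pipe: read from left to right
piped <- toy %>% group_by(position) %>% summarise(avg = mean(salary))

# Without the pipe: the same calls, nested from the inside out
nested <- summarise(group_by(toy, position), avg = mean(salary))

print(piped)
print(nested)
```

Both versions produce the same table; the pipe simply passes the result on its left as the first argument of the function on its right.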
Let's show a few variations here:
mydata %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     52   892519.2
## 2  First Base     45  1586781.5
## 3    Outfield    136  1539324.3
## 4 Second Base     37  1309640.9
## 5  Short Stop     49  1069210.7
## 6  Third Base     34  1382647.1
Here we added another aspect of the above groups. By checking length(salary) we are basically finding out how many group members there are. Here, for instance, we see that there are 52 catchers in the database.
Also, by writing avg.salary = mean(salary) in summarise, rather than just mean(salary), we can name the column in which the salary mean is displayed.
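Counting group members with length(salary) works, but dplyr also provides the helper n() for exactly this purpose. A minimal sketch on made-up data (`toy` is hypothetical, not the baseball data):

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield"),
                  salary   = c(100, 300, 500))

# n() returns the number of rows in the current group
counts <- toy %>%
  group_by(position) %>%
  summarise(number = n(), avg.salary = mean(salary))

print(counts)
```

Unlike length(salary), n() does not refer to any particular column, which makes the intention (counting rows) a little clearer.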
Simple pivot tables
Let's start with what I call simple pivot tables: tables where we group by one variable.
The core tools
Now we look at each of the main tools in our toolbox.
group_by()
The main work in the example above was done by the group_by command. The variables by which we group will typically be categorical variables. Often these will be defined as factor variables. But they could also be, for instance, int variables, such as black.
mydata %>% group_by(black) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   black `length(salary)` `mean(salary)`
##   <int>            <int>          <dbl>
## 1     0              245        1209602
## 2     1              108        1654350
Interestingly this would suggest that black players earn higher salaries. However,
mydata %>% group_by(hispan) %>% summarise(length(salary),mean(salary))
## # A tibble: 2 × 3
##   hispan `length(salary)` `mean(salary)`
##    <int>            <int>          <dbl>
## 1      0              289        1410990
## 2      1               64        1050723
reveals that it is hispanic players who earned less on average than the others, and the full picture is only revealed by using our race variable:
mydata %>% group_by(race) %>% summarise(length(salary),mean(salary))
## # A tibble: 3 × 3
##       race `length(salary)` `mean(salary)`
##     <fctr>            <int>          <dbl>
## 1    Black              108        1654350
## 2 Hispanic               64        1050723
## 3    White              181        1265780
Without any further analysis one should not draw any early conclusions from this yet.
filter()
The filter command allows us to keep only the subset of the data that satisfies a condition. Here is how we could use this command if we only wanted to look at players who have not (by 1993) been an All Star player.
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  First Base     31   827747.3
## 3    Outfield     93   858689.3
## 4 Second Base     25   717133.3
## 5  Short Stop     38   687741.2
## 6  Third Base     21   701785.7
When comparing this table to the table above we can of course see that we are now looking at fewer players and their salaries are lower.
We can look at all All Stars by
mydata %>% filter(yrsallst > 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary))
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     10    2175000
## 2  First Base     14    3267500
## 3    Outfield     43    3011395
## 4 Second Base     12    2544032
## 5  Short Stop     11    2387014
## 6  Third Base     13    2482500
immediately seeing that All Stars attract significantly higher salaries (note, this is not a causal relationship!). They are All Stars because they are good players and it is being a good player that earns them a high salary.
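filter() also accepts several conditions at once. The following sketch, on made-up data (`toy` is hypothetical, not the baseball data), illustrates how conditions separated by commas are combined with a logical AND, while | gives a logical OR:

```r
library(dplyr)

# Toy data (made up for illustration)
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield", "Outfield"),
                  yrsallst = c(0, 2, 0, 3),
                  salary   = c(100, 300, 500, 900))

# Comma-separated conditions must all hold (logical AND)
allstar_catchers <- toy %>% filter(yrsallst > 0, position == "Catcher")

# Use | when either condition is sufficient (logical OR)
either <- toy %>% filter(yrsallst > 0 | salary > 400)

print(allstar_catchers)
print(either)
```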
arrange()
Let's say you wanted to arrange the table such that positions with lower salaries are shown first. The arrange
command is the tool you need.
mydata %>% filter(yrsallst == 0) %>% group_by(position) %>% summarise(number = length(salary),avg.salary = mean(salary)) %>% arrange(avg.salary)
## # A tibble: 6 × 3
##      position number avg.salary
##        <fctr>  <int>      <dbl>
## 1     Catcher     42   587166.7
## 2  Short Stop     38   687741.2
## 3  Third Base     21   701785.7
## 4 Second Base     25   717133.3
## 5  First Base     31   827747.3
## 6    Outfield     93   858689.3
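By default arrange sorts in ascending order. To show the highest values first, the sorting variable can be wrapped in desc(). A minimal sketch on made-up data (`toy` and its values are hypothetical):

```r
library(dplyr)

# Toy summary table (made up for illustration)
toy <- data.frame(position   = c("Catcher", "Outfield", "Short Stop"),
                  avg.salary = c(587, 859, 688))

# desc() reverses the sort order: largest avg.salary first
highest_first <- toy %>% arrange(desc(avg.salary))

print(highest_first)
```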
Double pivot tables
These are tables where we group the data by at least two dimensions, say position and race. So in the end we want a table that has positions in rows, race in columns and the respective group averages in the cells.
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary))
## Source: local data frame [18 x 3]
## Groups: position [?]
##
##       position     race avg.salary
##         <fctr>   <fctr>      <dbl>
## 1      Catcher    Black   736000.0
## 2      Catcher Hispanic   970214.3
## 3      Catcher    White   887151.2
## 4   First Base    Black  1582916.7
## 5   First Base Hispanic   977833.3
## 6   First Base    White  1799057.7
## 7     Outfield    Black  1728032.4
## 8     Outfield Hispanic  1344531.6
## 9     Outfield    White  1319637.0
## 10 Second Base    Black  1715208.2
## 11 Second Base Hispanic  1315357.1
## 12 Second Base    White  1160343.0
## 13  Short Stop    Black  2007097.7
## 14  Short Stop Hispanic   682710.5
## 15  Short Stop    White  1103049.6
## 16  Third Base    Black  1019888.9
## 17  Third Base Hispanic  1309722.3
## 18  Third Base    White  1540992.4
As you can see it is pretty straightforward to group by more than one variable (you merely add another variable to the group_by()
command), but we would like to display the result differently (positions in rows and race in columns).
This is achieved as follows:
mydata %>% group_by(position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary)
## Source: local data frame [6 x 4]
## Groups: position [6]
##
##      position   Black  Hispanic     White
## *      <fctr>   <dbl>     <dbl>     <dbl>
## 1     Catcher  736000  970214.3  887151.2
## 2  First Base 1582917  977833.3 1799057.7
## 3    Outfield 1728032 1344531.6 1319637.0
## 4 Second Base 1715208 1315357.1 1160343.0
## 5  Short Stop 2007098  682710.5 1103049.6
## 6  Third Base 1019889 1309722.3 1540992.4
As you see we merely added the spread command at the end, meaning that we send the previous result to the spread command. The spread command takes as the first input the variable that should form the columns and as the second the variable that should show in the cells.
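In more recent versions of tidyr, spread still works but has been superseded by pivot_wider. The sketch below shows the equivalent call on made-up data (`toy` is hypothetical, not the baseball data), assuming tidyr 1.0.0 or later and dplyr 1.0.0 or later (for the .groups argument):

```r
library(dplyr)
library(tidyr)

# Toy data (made up for illustration)
toy <- data.frame(position = c("Catcher", "Catcher", "Outfield", "Outfield"),
                  race     = c("Black", "White", "Black", "White"),
                  salary   = c(100, 300, 500, 900))

wide <- toy %>%
  group_by(position, race) %>%
  summarise(avg.salary = mean(salary), .groups = "drop") %>%
  # names_from: variable forming the columns; values_from: variable shown in the cells
  pivot_wider(names_from = race, values_from = avg.salary)

print(wide)
```

The two named arguments play the same roles as the first and second inputs of spread.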
To illustrate that you can also group by more than two variables we first create a new variable AS, which is a boolean variable (TRUE or FALSE) indicating whether a player had been an All Star by 1993. Then we merely add this new variable to our list of group_by variables.
mydata$AS <- (mydata$yrsallst > 0)
mydata %>% group_by(AS,position,race) %>% summarise(avg.salary = mean(salary)) %>% spread(race,avg.salary) %>% arrange(AS)
## Source: local data frame [12 x 5]
## Groups: AS, position [12]
##
##       AS    position     Black  Hispanic     White
##    <lgl>      <fctr>     <dbl>     <dbl>     <dbl>
## 1  FALSE     Catcher  172000.0  238300.0  647152.8
## 2  FALSE  First Base  625694.5  521500.0 1014194.4
## 3  FALSE    Outfield  831628.8  762221.4  931295.3
## 4  FALSE Second Base  708750.0 1014000.0  626458.3
## 5  FALSE  Short Stop  269375.0  510718.8  938064.8
## 6  FALSE  Third Base  553166.7  456250.0  808153.8
## 7   TRUE     Catcher 1300000.0 2800000.0 2121428.6
## 8   TRUE  First Base 3018750.0 2575000.0 3565000.0
## 9   TRUE    Outfield 3136666.6 2975000.0 2678833.3
## 10  TRUE Second Base 2721666.5 2068750.0 2584035.5
## 11  TRUE  Short Stop 4324061.3 1600000.0 1696994.6
## 12  TRUE  Third Base 1953333.3 3016667.0 2599537.0