Scraping in R

The challenge

The internet is a very rich source of data and here we demonstrate how to tap it. Well, at least we provide an introduction to how this can be done if the data are structured nicely.

The example we shall use here is the following. Schools that educate children from low-income families receive an extra amount of money for educating these children. In educational jargon these pupils are called Free School Meal (FSM) pupils. We shall investigate how the proportion of FSM children correlates with the actual expenditure per child.

The source of this information is going to be the Department for Education's School Performance Database. Say we are interested in Claycots School in Slough (which is close to being the largest primary school in the country). Once you have searched for that school you will find a large amount of information. At this stage I want you to have a good look at the url of that page:

http://www.education.gov.uk/cgi-bin/schools/performance/school.pl?urn=132089

The last six digits in that url (132089) are the Unique Reference Number (URN). This will be important later.

But for starters, have a look at that website and you should be able to find some information on the percentage of pupils eligible for Free School Meals, here 25.2% (as of November 2015). Also browse this website and check what other info may be of interest to you. Clearly we will want to look at the Total income (£ per pupil) and Total expenditure (£ per pupil) information. And we may also be interested in the actual number of pupils (Total number of pupils on roll (all ages)).

What we want our end result to be is a spreadsheet with these bits of information for all Primary schools (and perhaps their location, Local Authority etc). I should warn you, you will need some patience and you will have to learn a little programming and html to achieve this.

The website structure

Before you can think about extracting data from websites you will have to understand the rough structure of the website. First, given we want to extract information for several thousand schools, you want to understand what changes on this website from school to school. Go to the url in your browser and change the URN from 132089 to 101193 (which is for the Manor Infants School in Barking, another huge primary school). As you change the URN in the url and press enter you should see that nothing changes in the structure of the website, only the entries in the table. That is an important feature which enables us to automate the scraping (collection) of data.

Next, you need to understand a little bit of the html code that is used to produce the website. The way to do that (in Windows at least) is the following. Put your cursor on the part of the page you are interested in, say the table row with the info on the FSM pupil percentage. Then right-click, select "Inspect Element" and then ... refrain from panicking! You will see something like

InspectElement.JPG

This is how tables look in html. <table> and </table> mark the beginning and end of each table. Each row begins with <tr> and ends with </tr>. These markers are called tags and we will soon use them to our advantage. Tags are your friends! In essence we will later ask R to find the row in the table that has the entry "Percentage of pupils eligible for FSM at any time during the past 6 years" and then take the entry in the second column of that same row to be the number we want. Sounds easy, right? Conceptually yes; of course, making it happen is another story.

Prep Work

Start with setting your working directory

# make sure this points to your working directory
setwd("C:/Users/Ralf/Dropbox/R/scraping/DeptEduc")

We will require the RCurl and XML packages, so let's install these (if you haven't already done so) and load them using the library command.

x <- c("RCurl","XML")
# install.packages(x) # warning: uncommenting this may take a number of minutes
lapply(x, library, character.only = TRUE) # load the required packages
## Loading required package: bitops

Get your reference spreadsheet

We said we wanted this info for all primary schools in the country. So what we need is a list of these and, importantly, their URNs. Fortunately this exists. Go to the homepage of the UK's Department for Education and then click on EduBase. In the data downloads part of the website you will find the following spreadsheet: "All EduBase data.csv", which you should download into your working directory. Have a look at the spreadsheet. You should easily be able to identify the column named URN. This is the list we are after. Also note that there are more than 44,000 entries in that spreadsheet. A lot of websites to look at!

# Read in school admin data
schooldata <- read.csv("edubasealldata20151029.csv")   # make sure the name matches that of your csv file, the date changes frequently

The spreadsheet is now available as the dataframe schooldata.

Let's rename a few variables as some of the names are extremely clumsy:

names(schooldata)[names(schooldata)=="PhaseOfEducation..name."] <- "EducPhase"
names(schooldata)[names(schooldata)=="EstablishmentStatus..name."] <- "EstabStatus"
names(schooldata)[names(schooldata)=="TypeOfEstablishment..name."] <- "EstabType"

And now we will select the relevant schools (Primary schools that are still open)

# Select relevant schools
primary_data <- schooldata[schooldata$EducPhase == "Primary",]  # only primary schools
primary_data <- primary_data[primary_data$EstabStatus == "Open",]  # only open schools

This leaves us with 17,988 primary schools.
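If you want to double-check the selection in R, a few quick commands like these will do (a sketch only; the exact count will change as the EduBase extract is updated):

# quick sanity checks on the selection (counts change as EduBase is updated)
table(schooldata$EducPhase)   # which phases of education appear in the data
nrow(primary_data)            # number of open primary schools, roughly 18,000 here
head(primary_data$URN)        # the URNs we will append to the base url later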

Setup of required info

Now we start preparing for the real work. First we specify what the url (apart from the school's URN) looks like:

# Base url from which to find the data - URN needs to be appended
url_base_2014 <-  "http://www.education.gov.uk/cgi-bin/schools/performance/school.pl?urn="

We will soon combine this with the URN information from the reference spreadsheet.

Let's prepare to get information from the website. As we argued before, our search strategy is going to be to find table rows whose entries identify the data we are interested in. So let's collect these pieces of identifying information in two lists (although we will not really use them later, they focus the mind):

# this identifies the rows of interest
get_info = rbind("Percentage of pupils eligible for FSM at any time during the past 6 years",
                 "Total income",
                 "Total expenditure")

# these will be the variable names for the respective data
get_info_new = rbind("eFSM",
                 "TotalIncome",
                 "TotalExp")

Now we prepare the dataframe into which we will save the scraped information. We are basically creating a dummy entry in a new dataframe called save_data:

save_data <- data.frame("URN"=-9999, "eFSM"=NA, "TotalIncome"=NA, "TotalExp"=NA)

Getting the data from the website

In the end we will have to create a loop which repeats the information extraction for all the schools in primary_data. But first we shall investigate how this works for one particular school, say the first school in the primary_data dataframe, which is the Sir John Cass's Foundation Primary School (URN = 100000). Later we will extract this information from all schools.

id <- primary_data$URN[1]  # get first URN from primary_data
url <- paste0(url_base_2014,id) # assemble the complete url
print(url)
## [1] "http://www.education.gov.uk/cgi-bin/schools/performance/school.pl?urn=100000"
SOURCE <-  getURL(url,encoding="UTF-8") #Download the page
PARSED <- htmlParse(SOURCE) #Format the html code 

The getURL function retrieves what you saw when you looked at the html code of the website url and saves it into the SOURCE object. At this stage this is basically a very long and unstructured piece of text. The next line htmlParse(SOURCE) then takes this text and divides it into little bits using the tags (like <table> and <tr>) which we saw earlier. This will facilitate our search process.

Print SOURCE and PARSED to see the difference.
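Printing SOURCE in full will flood your console, so you may want to look at just a snippet; something along these lines works:

substr(SOURCE, 1, 500)   # SOURCE is one long character string; show the first 500 characters only
class(PARSED)            # PARSED is a parsed html document (an HTMLInternalDocument object)
PARSED                   # printing it shows the nested tag structure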

We continue by creating a new object in which we will save all the new school's info. Eventually we will add this to save_data.

newschool <- NULL   # create object that will be attached to dataframe
newschool["URN"] <- id

Finding the relevant information

While PARSED is more structured than SOURCE, it is still massive. But there is a very convenient function (xpathSApply) that allows us to quickly identify the relevant bits of the website.

Finding the FSM data

Let's quote the line first and explain after:

temp_tr <- xpathSApply(PARSED, "//tr[count(td)=1]")  # get all rows with one td tag count, as results are in Table

The first input is the object PARSED; the second input is something like a search criterion for tags. This is when you have to go back to the html info to understand what you are looking for. The FSM info is in a table with two columns, the first being a header column ( <th> ... </th> ) and the second being an ordinary column ( <td> ... </td> ). So we look for all table rows ( <tr> ... </tr> ) that contain exactly one <td> ... </td> entry. If you want to understand the syntax of this search criterion and how to adjust it to your problem you should check this website.
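To give a flavour of how the search criterion can be varied, here are a couple of illustrative lines (only the two filters shown in this tutorial are actually needed):

# some illustrative variations of the XPath search criterion
temp_all <- xpathSApply(PARSED, "//tr")   # every table row on the page
length(temp_all)                          # how many rows the page contains in total
length(temp_tr)                           # how many rows have exactly one <td> cell
# "//tr[count(td)=4]" will be used below to pick rows with four <td> cells (the finance table)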

Look at the first two elements to see what temp_tr delivers:

head(temp_tr,2)
## [[1]]
## <tr>
##   <th>
##     <label>Street</label>
##   </th>
##   <td class="num "> St James's Passage</td>
## </tr> 
## 
## [[2]]
## <tr>
##   <th>
##     <label>Town</label>
##   </th>
##   <td class="num "> London</td>
## </tr>

Each element is an html row and indeed contains one <td> ... </td> element (which is in fact preceded by a <th> ... </th> element). We will now loop through all these elements and check whether we can find the ones we are interested in.

  # search through selected rows to find the ones we need
  if (length(temp_tr)>0){   # case in which school data are available
    for (j in 1:length(temp_tr)){
      temp <- temp_tr[[j]]              # this picks out the current row (including all tags)
      temp1 <- xmlSApply(temp,xmlValue) # this extracts the values from the tags
      # check if we found the value we are after
      if (temp1[[1]] == "Percentage of pupils eligible for FSM at any time during the past 6 years") {
        var <- "eFSM"                      # sets the variable name we found
        temp2  <- gsub("%","",temp1[[2]])  # gsub is a function to substitute text
        newschool[var] <- temp2            # places the value in newschool
      }
    }
  } 

To check what we have achieved so far we look at newschool

print(newschool)
##      URN     eFSM 
## "100000"  " 33.2"

As you can see we have added the percentage (without the percentage sign) of children eligible for Free School Meals.

Just a few extra notes on the previous block of code. It starts with an if condition. This has to be added as some schools' webpages have no relevant data (no elements in temp_tr), and in that case the code would fail. When I first wrote this code I did not have this condition; while the code worked well for the first 100-odd schools, it then failed when it hit a school with no available information. When you write code like this you will have to do this a lot: proofing the code against unexpected behaviour. This is why code never works on the first go, and of course code can fail in all sorts of different ways. Think of yourself as a detective who needs to figure out how it fails and then think about how to fix it.
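A related piece of proofing, not used in the code above but worth knowing, is to wrap the download itself in tryCatch so that one unreachable page does not stop a long loop. A minimal sketch:

# sketch only: download defensively so a failed request returns NA instead of stopping
SOURCE <- tryCatch(getURL(url, encoding="UTF-8"),
                   error = function(e) NA)
if (!is.na(SOURCE)) {          # only parse and extract if the download succeeded
  PARSED <- htmlParse(SOURCE)
  # ... extraction code as above ...
}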

In the previous block of code I used two functions, xmlSApply and gsub. I added comments to explain what these functions do, but of course you can use ?xmlSApply and ?gsub to find out more details. How do you find these functions in the first place? Dr Google is your friend. For instance, google "How can I extract values from XML elements in R?" and after some searching you will find someone who had the same problem as you, and in general you will also find someone providing a solution.

Finding the financial data

The financial data we are after, Total income and Total expenditure, are in tables that have a slightly different structure. Open one of the schools' sites on the performance tables and go to the finance table (note that some schools have little such information published). Then right-click, select "Inspect Element", and you should see something like this:

InspectElement2.JPG

Here you can see that in this table each row has one <th> ... </th> element and four <td> ... </td> elements. This means that we have to change our filtering strategy:

temp_tr <- xpathSApply(PARSED, "//tr[count(td)=4]")  # get all rows with four td tag count, as results are in Table

Now we have filtered all rows from tables with four columns (plus a header column although we didn't specify this).

What follows now is the search for the rows with Total income and Total expenditure, but this works in almost exactly the same way as for FSM. See whether you can spot the slight differences.

  # search through selected rows to find the ones we need
  if (length(temp_tr)>0){   # case in which school data are available
    for (j in 1:length(temp_tr)){
      temp <- temp_tr[[j]]              # this picks out the current row (including all tags)
      temp1 <- xmlSApply(temp,xmlValue) # this extracts the values from the tags
      # check if we found the value we are after
      if (temp1[[1]] == "Total income") {
        var <- "TotalIncome"      # sets the variable name we found
        temp2  <- temp1[[2]]      # the school's finance data is in column 2
        newschool[var] <- temp2   # places the value in newschool
      }
      if (temp1[[1]] == "Total expenditure") {
        var <- "TotalExp"         # sets the variable name we found
        temp2  <- temp1[[2]]      # the school's finance data is in column 2
        newschool[var] <- temp2   # places the value in newschool
      }
    }
  } 

Clearly we now looked for two pieces of information rather than just one, which is why we had two if conditions. If you have many more you may want to learn about the switch function instead (see the sketch below). We also didn't need to strip any "%" signs out of our values, so the definition of temp2 is a little more straightforward.
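To illustrate, here is a minimal sketch of how switch could replace the two if conditions above (an alternative, not what the code above actually does):

# sketch: look up the variable name with switch instead of several if conditions
var <- switch(temp1[[1]],
              "Total income"      = "TotalIncome",
              "Total expenditure" = "TotalExp",
              NA)                      # NA when the row label is not one we want
if (!is.na(var)) newschool[var] <- temp1[[2]]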

To check what we have achieved so far we look at newschool

print(newschool)
##         URN        eFSM TotalIncome    TotalExp 
##    "100000"     " 33.2"     " 9578"     " 9628"

You can see that we have now also added the financial data to our newschool entry. We have collected everything we want. Let's add that new school to our dataframe with new data, save_data. For this we use the rbind command (row binding) which attaches the newschool entry to the already existing save_data dataframe.

  # add the newschool entry to the dataframe
  save_data <- rbind(save_data,newschool)

We made sure that we used identical variable names in both. You can just print save_data at this stage to convince yourself that we have achieved what we wanted.

print(save_data)
##      URN  eFSM TotalIncome TotalExp
## 1  -9999  <NA>        <NA>     <NA>
## 2 100000  33.2        9578     9628

As you can see we have now added the info of the school with URN = 100000 to our dataframe.

Putting it all together

So far we have learned how to get the required information for one particular school. Now we want to do that for all the schools in our database, i.e. all the URNs in primary_data. Recall that previously we defined the id variable as follows:

id <- primary_data$URN[1]  # get first URN from primary_data

Once we had defined id, it determined which website we called up, and all the rest followed automatically. We now need to write a loop which repeats everything we have done so far for all schools in our database primary_data.

for (i in 1:nrow(primary_data)){

    id <- primary_data$URN[i]  # get URN from primary_data
    url <- paste0(url_base_2014,id) # assemble the complete url
    
    # ADD all the parsing and information extraction code here
    
    # save_data <- rbind(save_data,newschool)
  }

nrow(primary_data) counts how many entries there are in primary_data and allows the loop to run through all rows and pick out the URN of every school. After the line that defines url you will have to paste all the code we wrote above to download the webpage, extract the required information and save it into newschool. The last line in the loop should be save_data <- rbind(save_data,newschool).
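To make this concrete, here is a condensed sketch of the assembled loop; the body is exactly the code from the previous sections, pasted in:

for (i in 1:nrow(primary_data)){

  id  <- primary_data$URN[i]        # get URN from primary_data
  url <- paste0(url_base_2014,id)   # assemble the complete url

  SOURCE <- getURL(url,encoding="UTF-8")  # download the page
  PARSED <- htmlParse(SOURCE)             # format the html code

  newschool <- NULL                 # object that will be attached to the dataframe
  newschool["URN"] <- id

  # rows with one td cell (contain the FSM information)
  temp_tr <- xpathSApply(PARSED, "//tr[count(td)=1]")
  if (length(temp_tr)>0){
    for (j in 1:length(temp_tr)){
      temp1 <- xmlSApply(temp_tr[[j]],xmlValue)
      if (temp1[[1]] == "Percentage of pupils eligible for FSM at any time during the past 6 years"){
        newschool["eFSM"] <- gsub("%","",temp1[[2]])
      }
    }
  }

  # rows with four td cells (contain the finance information)
  temp_tr <- xpathSApply(PARSED, "//tr[count(td)=4]")
  if (length(temp_tr)>0){
    for (j in 1:length(temp_tr)){
      temp1 <- xmlSApply(temp_tr[[j]],xmlValue)
      if (temp1[[1]] == "Total income")      newschool["TotalIncome"] <- temp1[[2]]
      if (temp1[[1]] == "Total expenditure") newschool["TotalExp"]    <- temp1[[2]]
    }
  }

  save_data <- rbind(save_data,newschool)  # attach the new school to save_data
}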

Once the loop has completed save_data will contain all the extracted information.

Tidying up and merging

Changing variable types

Let's look at the structure of save_data (after only scraping the info from one school, i.e. before implementing the previous loop).

str(save_data)
## 'data.frame':    2 obs. of  4 variables:
##  $ URN        : chr  "-9999" "100000"
##  $ eFSM       : chr  NA " 33.2"
##  $ TotalIncome: chr  NA " 9578"
##  $ TotalExp   : chr  NA " 9628"

You can see that all four variables included in the dataframe are characterised as chr variables, or text/string variables. That means you cannot really perform calculations with these. We want to transform them into numerical variables, where that makes sense. This is done now:

save_data["URN"]<-lapply(save_data["URN"], as.numeric)
save_data["eFSM"]<-lapply(save_data["eFSM"], as.numeric)
save_data["TotalIncome"]<-lapply(save_data["TotalIncome"], as.numeric)
save_data["TotalExp"]<-lapply(save_data["TotalExp"], as.numeric)

The lapply(x,fctname) (list apply) function applies a particular function (fctname) to all members of a list x. So we convert each element of save_data["URN"] to a numeric value. If we now look again

str(save_data)
## 'data.frame':    2 obs. of  4 variables:
##  $ URN        : num  -9999 100000
##  $ eFSM       : num  NA 33.2
##  $ TotalIncome: num  NA 9578
##  $ TotalExp   : num  NA 9628

we notice that all variables are now numeric variables.

Merging data

We want to add the scraped info to the existing database of primary schools.

# primary_data_merged_2014 <- merge(primary_data,save_data,by="URN")

The merge command does what it says: it merges all the info. The variable by which it merges is URN, which exists in both primary_data and save_data. Check out ?merge for more details on merging two dataframes.

Saving data

# save(primary_data_merged_2014,file = "primary_data_merged_2014.Rda")

Things to look out for

When you access data from the internet you should make sure that you are not acting against any terms and conditions you may have inadvertently agreed to. In our example we are all good under the Open Government Licence 3.0 (see the link at the bottom of the Department for Education's website). The issue of scraping has indeed exercised the courts. Here is a link to an article that describes how Ryanair has been allowed to restrict scraping activities.

eBay is another website that does not allow scraping of data, but it does offer access to the data via an Application Programming Interface (API). As it turns out, many websites have APIs and allow free usage of these. Examples are eBay, Twitter and Google. But often you will need techniques very similar to the ones discussed here to access the data and bring them into a usable format.