On most occasions you would want to use data which already exist in some electronic form (lucky you that you did not study in the 70s when you had to trawl through paper back-copies of some statistical agency and copy data by hand and then enter manually into some spreadsheet). The question then is how to import these data into R and use them for your statistical or econometric analysis.
Upload a data file to your working directory
In the first instance I want you to download this mroz.xls Excel file that contains a dataset which we will use for our first steps in R. It is a well used cross-sectional dataset with 753 observations of female members of the labour force in the US (in 1975). It contains variables such as the number of children, the wage, the hours worked etc. A bit more detail on the data and the variables can be found in this file.
Make sure that you note down in which folder you save this file. Save it in a folder in which you want to save your work. We shall soon call this folder our working directory. At this stage we have not yet made the data available to R. This will come soon!
R makes it really easy to import data if they are already in the R data format (see later) or indeed if they are in csv (comma separated values) format. This is a rather short list and, importantly, it does not include Excel files, which is the format in which most data files will land in your inbox.
csv File Import
Now, as R is such a popular package, clever and busy programmers have written an extension (or better a package in R speak) that does import data directly into R, but unfortunately this package (gdata) requires that some other software is installed on your computer and ... it just gets too messy. Good thing that it is really easy to turn your Excel file into a csv file. Open your data file in Excel and then "Save as ..." the file again and change the extension from an Excel file to a csv file.
Once you have done this with the "mroz.xls" file you should have, besides the Excel file, a "mroz.csv" file in your folder. It is now time to let R do some work. Return to your Firststeps.R script, or open a new script with the following first lines:
# This is my first R script!
setwd("FULL_PATH_OF_YOUR_DIRECTORY")   # This sets the working directory
mydata <- read.csv("mroz.csv")         # Opens mroz.csv from working directory
Then press CTRL+SHIFT+S or the Source button to run the code. You should then see, in the "Workspace" window, a new entry called mydata, informing you that you imported an object with 753 observations and 22 variables.
By double clicking on "mydata" you will open this object and see that it basically contains a spreadsheet with all your data. This sort of object is called a data frame and R basically understands that you have 753 observations and that for each of them you have values for 22 variables.
The above sequence of commands assumed that the first row in your csv file is a row with variable names, which are then used in your data frame. In case there is no header row in your csv file, you should replace the read.csv command from above with:
mydata <- read.csv("mroz.csv", header = FALSE)
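To see what the header option changes, here is a small sketch that does not need the mroz file. It writes a made-up two-variable csv file to a temporary location (the file contents and the object names with_header and no_header are only illustrations) and imports it both ways.

```r
# Hypothetical toy data written to a temporary file (not the mroz data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("hours,wage", "1610,3.35", "1656,1.39"), tmp)

with_header <- read.csv(tmp)                  # first row supplies the variable names
no_header   <- read.csv(tmp, header = FALSE)  # first row is treated as data

names(with_header)  # "hours" "wage"
names(no_header)    # "V1" "V2" (default names invented by R)
nrow(with_header)   # 2
nrow(no_header)     # 3, because the name row is now counted as an observation
```

If you import a file that does have a header row with header = FALSE, the names end up in the data and the columns will not be read as numbers, so it is worth checking names() straight after an import.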
Saving to and Importing from a R Data file
R has its own way of saving data, its own data format. This means you don't always have to import a data file from a csv file. If you are frequently using the same dataset you may want to save the data in this data format.
In the above we created an R object, a dataframe, called mydata. The way in which you save this in an R data file is by using the following command:
save(mydata, file = "test.RData")
This will save the object mydata in the file "test.RData". If you had another object in your workspace, it could be another dataframe, or a variable, say called testvar, then you could save both by using:
save(mydata, testvar, file = "test.RData")
You can then load the data back into your workspace with the following command:
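load("test.RData")    # assuming that the file is in your working directory
This will load all the objects that have been saved in "test.RData" into your workspace. Note that the file name has to be spelled exactly as it was when saving, as file names are case sensitive on many systems.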
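If you want to convince yourself that saving and loading really restores your objects, you could try a round trip like the following sketch. The objects here are small made-up stand-ins for mydata and testvar, and the file is written to R's temporary folder rather than your working directory.

```r
# Hypothetical objects standing in for mydata and testvar
mydata  <- data.frame(hours = c(1610, 1656), kidslt6 = c(1L, 0L))
testvar <- 42

rdata_file <- file.path(tempdir(), "test.RData")
save(mydata, testvar, file = rdata_file)   # write both objects to one file

rm(mydata, testvar)                        # remove them from the workspace
loaded <- load(rdata_file)                 # load() returns the names it restored

print(loaded)     # the names of the two restored objects
exists("mydata")  # TRUE, the data frame is back
```

Note that load() brings the objects back under the names they had when they were saved, so it will silently overwrite objects of the same name that are currently in your workspace.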
As you work with dataframes, especially those that contain many different variables, the following commands may come in very handy:
names(mydata)   # displays the names of all variables in the dataframe
head(mydata)    # displays the first 6 observations for all variables
tail(mydata)    # displays the last 6 observations for all variables
str(mydata)     # will give a list of all variables with their data types
These tend to be commands you would use straight in the Console as you would use them to remind yourself of some basic data features (like names) or types. Try the head and tail commands yourself. But let's look at the output of the str(mydata) command:
'data.frame': 753 obs. of 22 variables:
 $ inlf    : int 1 1 1 1 1 1 1 1 1 1 ...
 $ hours   : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
 $ kidslt6 : int 1 0 1 0 1 0 0 0 0 0 ...
 $ kidsge6 : int 0 2 3 3 2 0 2 0 2 2 ...
 $ age     : int 32 30 35 34 31 54 37 54 48 39 ...
 $ educ    : int 12 12 12 12 14 12 16 12 12 12 ...
 $ wage    : Factor w/ 374 levels ".","0.1282","0.1616",..: 191 35 268 ...
 $ repwage : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
 $ hushrs  : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
 $ husage  : int 34 30 40 53 32 57 37 53 52 43 ...
 $ huseduc : int 12 9 12 10 12 11 12 8 4 12 ...
 $ huswage : num 4.03 8.44 3.58 3.54 10 ...
 $ faminc  : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
 $ mtr     : num 0.722 0.661 0.692 0.781 0.622 ...
 $ motheduc: int 12 7 12 7 12 14 14 3 7 7 ...
 $ fatheduc: int 7 7 7 7 14 7 7 3 7 7 ...
 $ unem    : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
 $ city    : int 0 1 0 0 1 1 0 0 0 0 ...
 $ exper   : int 14 5 15 6 7 33 11 35 24 21 ...
 $ nwifeinc: num 10.9 19.5 12 6.8 20.1 ...
 $ lwage   : Factor w/ 374 levels "-0.0392607","-0.1505904",..: 176 35 253 ...
 $ expersq : int 196 25 225 36 49 1089 121 1225 576 441 ...
The interesting things in this list are the data types. R attempts to figure out automatically what type of data you have supplied. The types you are most likely to come across are:
- int, which is an integer. E.g. kidslt6, which tells us how many kids younger than 6 years a woman has.
- num, which are real numbers. E.g. huswage, which gives us the hourly wage of the husband
- factor, this, in R, indicates a categorical variable (either nominal as in this case, or ordered factor for ordinal variables). At first sight it is not clear why wage should be a factor. Soon more on this.
- logical, these are TRUE or FALSE variables. Not used here, although inlf and city could be treated as logical variables.
- character, which are text variables. These are not used here, but they could be names.
- NA, these indicate missing values in R.
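The following sketch builds a tiny made-up data frame (called toy, with invented variable names and values, not taken from the mroz data) that contains one variable of each of the types listed above, so you can see how R reports them.

```r
# A hypothetical data frame with one variable per data type
toy <- data.frame(
  kids    = c(1L, 0L, 2L),                          # integer
  huswage = c(4.03, 8.44, NA),                      # numeric, with one missing value
  region  = factor(c("north", "south", "north")),   # factor (categorical)
  city    = c(TRUE, FALSE, TRUE),                   # logical
  name    = c("Ann", "Bea", "Cat"),                 # character
  stringsAsFactors = FALSE                          # keep text as character
)

sapply(toy, class)   # reports the type of every variable
is.na(toy$huswage)   # FALSE FALSE TRUE, flags the missing value
```

The missing value does not change the type of huswage: NA can sit inside a numeric (or any other) variable, which is exactly what we will want for wage later on.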
Renaming variables in dataframes
Often you may find the variable names coming from an imported file somewhat clumsy and would therefore prefer to rename your variables in a dataframe. This is achieved with the following command:
names(mydata)[names(mydata)=="inlf"] <- "inLabForce"
Here we are making use of the previously mentioned names() function. In words this command does the following: get all the variable names of the dataframe mydata [then pick that name that is "inlf"] <- and replace it with "inLabForce".
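It can help to take this command apart. The sketch below uses a small made-up data frame called toy (not the mroz data) and first shows the TRUE/FALSE vector that does the picking, before carrying out the renaming.

```r
# Hypothetical two-variable data frame
toy <- data.frame(inlf = c(1L, 0L), hours = c(1610L, 0L))

names(toy) == "inlf"    # TRUE FALSE, marks the position of the name to change

# Assign the new name only at the position marked TRUE
names(toy)[names(toy) == "inlf"] <- "inLabForce"

names(toy)              # "inLabForce" "hours"
```

If no variable is called "inlf", the comparison returns FALSE everywhere and the command changes nothing, without any error message, so do check the result with names().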
Dealing with missing data
Let's try and explore why wage and lwage are treated as categorical variables. The 374 levels merely indicate that R found 374 different values in these variables. Something seems to be special about these variables. Go back to the spreadsheet view of mydata and scroll down and keep an eye out for the values in the wage column.
Once you get to row 429 you will see that there are missing values. Where there were missing values the particular spreadsheet we imported had a full stop, ".". R interpreted this as a text value and subsequently interpreted the combination of numbers and text values as a categorical variable and hence the factor type.
At this stage it is useful to learn another little programming trick in R. You could have just printed all values of the wage variable. As wage is one of 22 variables in mydata, you could have used one of the following two commands:
mydata$wage
mydata[, 7]    # as wage is the 7th variable
Try them both in your Console. We will soon learn that it is important to know how to address individual variables from a dataframe.
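Here is a short sketch of the most common ways to address a single variable, again using a small made-up data frame (toy) in which wage happens to be the 2nd variable. The object names on the left are only illustrations.

```r
# Hypothetical data frame in which wage is the 2nd variable
toy <- data.frame(hours = c(1610L, 1656L), wage = c(3.35, 1.39))

by_name   <- toy$wage        # by name, using the $ operator
by_name2  <- toy[["wage"]]   # by name, using double square brackets
by_pos    <- toy[, 2]        # by position: all rows, 2nd column
as_column <- toy[2]          # single brackets without comma keep a data frame

identical(by_name, by_pos)    # TRUE, both are plain vectors of values
identical(by_name, by_name2)  # TRUE
is.data.frame(as_column)      # TRUE, this is a data frame with one variable
```

Addressing variables by name is usually the safer habit, as a position such as 2 or 7 will point to a different variable as soon as the order of columns in your file changes.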
Converting to NAs in R
So what we want is to convert the "." into NAs and the whole variables into data of the numerical type.
As it turns out, converting factors to num in R isn't the easiest of tasks. So, the easy solution is to go to your csv file and use Excel to replace all cells that contain nothing but a full stop, ".", with "NA". But be careful to choose the replacement option "Match entire cell contents"! In any case here is how you do it in R. And I shall describe it as it introduces your best helper: GOOGLE (other search engines exist!). Type "R how to convert factor to numeric" into your search engine and you will find the solution.
mydata$wage <- as.numeric(as.character(mydata$wage))
What this does is the following. mydata$wage <- indicates that we are assigning something to the wage variable of our mydata dataset. The rest we need to read from the inside out. as.character(mydata$wage) translates the factor variable to a character variable and as.numeric(...) takes the result of this and translates it into a numeric variable. Don't ask why mydata$wage <- as.numeric(mydata$wage) doesn't work and why we can't translate to numeric directly from the factor mydata$wage. And if you do ask, ask Dr Google.
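For the curious, here is a short sketch of what goes wrong with the direct route. A factor stores an integer code for each level, and as.numeric() applied to a factor returns these codes rather than the numbers you see printed. The values below are made up for illustration. Also note that from R version 4.0.0 onwards read.csv no longer turns text columns into factors by default, so on a recent installation wage may show up as chr rather than Factor; the conversion via as.numeric(as.character(...)) works in either case.

```r
# Hypothetical wage values as they might arrive, with "." for a missing value
wage_f <- factor(c("3.35", ".", "1.39", "4.5"))

levels(wage_f)   # the distinct values, stored as text

# Direct conversion returns the internal level codes, not the wages
codes <- as.numeric(wage_f)
codes

# Going via character recovers the numbers; "." cannot be read and becomes NA
# (suppressWarnings only hides the "NAs introduced by coercion" warning here)
values <- suppressWarnings(as.numeric(as.character(wage_f)))
values           # 3.35 NA 1.39 4.50
```

The dangerous part of the direct route is that it produces no error and no warning, just wrong numbers.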
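When you perform this transformation (via the character type) R will give you a warning:
Warning message: NAs introduced by coercion
This is expected here: the "." entries cannot be read as numbers and are therefore turned into NA. Apply the same conversion to the lwage variable. Once you have done that you could test whether the data types have indeed changed by using str(mydata). Your script should now look as follows: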
# This is my first R script!
setwd("T:/ECLR/R/FirstSteps")     # This sets the working directory
mydata <- read.csv("mroz.csv")    # Opens mroz.csv from working directory
# Now convert variables with "." to num with NA
mydata$wage <- as.numeric(as.character(mydata$wage))
mydata$lwage <- as.numeric(as.character(mydata$lwage))
Converting to NAs during import
The issue R had as it imported the csv file was that it saw entries with just decimal points and it decided (quite rightly) that these were not numbers and hence gave the variables a factor type. Fortunately there is an easier way to deal with this. We can let R know that when it is importing data it should immediately assign a NA to all values of the type ".". This is done by means of an additional and optional input into the read.csv function:
mydata <- read.csv("mroz.csv",na.strings = ".")
It is the na.strings option that tells R to immediately assign a NA to all entries that are merely a decimal point. Let's say that there were some missing values that, in the original spreadsheet, were labelled as "-999" and some as "."; then the additional input would have been na.strings = c(".", "-999"), remembering that we can combine several values into a vector using the c( ) function.
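You can try the effect of na.strings without touching the mroz file. The sketch below writes a small made-up csv file, in which missing wages are coded once as "." and once as "-999", to a temporary location and imports it with and without the option. The file contents and the object names raw and clean are only illustrations.

```r
# Hypothetical toy file with two different codes for missing values
tmp <- tempfile(fileext = ".csv")
writeLines(c("hours,wage", "1610,3.35", "0,.", "1656,-999"), tmp)

raw   <- read.csv(tmp)                               # "." forces wage to be text
clean <- read.csv(tmp, na.strings = c(".", "-999"))  # both codes become NA

class(raw$wage)     # not numeric: character (or factor before R 4.0.0)
class(clean$wage)   # "numeric"
clean$wage          # 3.35 NA NA
```

Keep in mind that na.strings applies to every variable in the file, so a code like "-999" should only be listed if it never occurs as a genuine value anywhere in your data.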
Your script could now look much simpler:
# This is my first R script!
setwd("T:/ECLR/R/FirstSteps")                      # This sets the working directory
mydata <- read.csv("mroz.csv",na.strings = ".")    # Opens mroz.csv from working directory
Confirm that all variables have the correct type using str(mydata).
Sometimes it may be appropriate to remove all observations with any missing data from any further analysis. But make sure you think carefully about this. Only if the missing values are missing at random would this be uncontroversial.
Once you have decided that it is appropriate to do so, you can do this with the following command:
mydata <- na.omit(mydata) # Removes observations with missing values
R will remove every observation (row) that has a missing value in at least one variable.
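Before you drop anything it is a good idea to check how many observations you are about to lose. A minimal sketch, using a small made-up data frame (toy) with one missing wage:

```r
# Hypothetical data frame with one missing value
toy <- data.frame(hours = c(1610L, 0L, 1656L), wage = c(3.35, NA, 1.39))

colSums(is.na(toy))          # missing values per variable: hours 0, wage 1
sum(!complete.cases(toy))    # 1 row has at least one missing value

toy_complete <- na.omit(toy) # keep only the complete rows
nrow(toy_complete)           # 2
```

Saving the result under a new name, as done here with toy_complete, keeps the original data frame intact, which is useful if only some of your later analysis requires complete observations.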