R Data
On most occasions you will want to use data that already exist in some electronic form (lucky you that you did not study in the 70s, when you had to trawl through paper back-copies from some statistical agency, copy the data by hand and then enter them manually into a spreadsheet). The question then is how to import these data into R and use them for your statistical or econometric analysis.
== Upload a data file to your working directory ==
In the first instance I want you to download this mroz.xls Excel file that contains a dataset which we will use for our first steps in R. It is a well-used cross-sectional dataset with 753 observations on married women in the US (in 1975), not all of whom were in the labour force. It contains variables such as the number of children, the wage, the hours worked, etc. A bit more detail on the data and the variables can be found in this file. See also [1].
Make sure that you note down in which folder you save this file. Save it in a folder in which you want to save your work. We shall soon call this folder our working directory. At this stage we have not yet made the data available to R. This will come soon!
== Data Imports ==
R makes it really easy to import data if they are already in the R data format (see later) or if they are in csv (comma-separated values) format. This is a rather short list and, importantly, it does not include Excel files, which is the format in which most data files will land in your inbox.
=== csv File Import ===
Now, as R is such a popular package, clever and busy programmers have written an extension (or better, a package in R speak) that imports Excel files directly into R. Unfortunately this package (gdata) requires that some other software is installed on your computer and ... it just gets too messy. The good thing is that it is really easy to turn your Excel file into a csv file. Open your data file in Excel, choose "Save as ..." and change the file type from an Excel file to a csv file.
Once you have done this with the "mroz.xls" file you should have, besides the Excel file, a "mroz.csv" file in your folder. It is now time to let R do some work. Return to your Firststeps.R script, or open a new script with the following lines:
<source>
# This is my first R script!
setwd("O:/ECLR/R") # This sets the working directory (use your own folder here)
mydata <- read.csv("mroz.csv") # Opens mroz.csv from working directory
</source>
Then press CTRL+SHIFT+S or the Source button ([[File:R_sourcebutton.JPG|frameless|500px]]) to run the code. You should then see, in the "Workspace" window, a new entry called mydata, informing you that you imported an object with 753 observations and 22 variables.
[[File:R_Workspace1.JPG|frameless|500px]]
By double-clicking on "mydata" you will open this object and see that it basically contains a spreadsheet with all your data. This sort of object is called a data frame, and R understands that you have 753 observations, each with values for 22 variables.
[[File:R_mydata1.JPG|frameless|500px]]
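If you prefer the Console to clicking, the size and the variable names of a data frame can also be queried with a few commands. The sketch below uses a small made-up data frame called toy (not part of the mroz data) so that it runs on its own; applied to the imported data, dim(mydata) should report 753 rows and 22 columns.

<source>
# A small made-up data frame standing in for mydata
toy <- data.frame(hours = c(1610, 1656, 1980),
                  age   = c(32, 30, 35))

dim(toy)     # rows (observations) and columns (variables): 3 2
nrow(toy)    # 3
ncol(toy)    # 2
names(toy)   # "hours" "age"
</source>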
The read.csv command in the script assumed that the first row in your csv file is a row with variable names, which are then used as the variable names in your data frame. In case there is no header row in your csv file, you should replace the read.csv command with:
<source>
mydata <- read.csv("mroz.csv", header = FALSE)
</source>
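The effect of the header argument is easiest to see on a tiny file. The following sketch writes a three-line csv file to a temporary location and reads it both ways; the file contents and the names tmp, with_header and no_header are made up for illustration.

<source>
# Write a tiny csv file with a header row to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("hours,age", "1610,32", "1656,30"), tmp)

# Default (header = TRUE): the first row supplies the variable names
with_header <- read.csv(tmp)
names(with_header)   # "hours" "age"
nrow(with_header)    # 2

# header = FALSE: every row is treated as data and R uses the names V1, V2, ...
no_header <- read.csv(tmp, header = FALSE)
names(no_header)     # "V1" "V2"
nrow(no_header)      # 3
</source>

Note that in the second case the former header row ends up as the first observation, which is why header = FALSE should only be used when the file really has no header row.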
=== Data Import from an R Data File ===
to be written
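A minimal sketch of how data in R's own file formats are typically written and read, using the base R functions saveRDS/readRDS and save/load. The toy data frame and the file names are assumptions made for illustration; they are not files supplied with this page.

<source>
# A small made-up data frame
toy <- data.frame(hours = c(1610, 1656), age = c(32, 30))

# Option 1: store a single object; readRDS() returns it, so you choose the name
rds_file <- file.path(tempdir(), "toy.rds")
saveRDS(toy, rds_file)
toy_again <- readRDS(rds_file)

# Option 2: store one or more objects together with their names;
# load() recreates them under the names they were saved with
rdata_file <- file.path(tempdir(), "toy.RData")
save(toy, file = rdata_file)
rm(toy)            # remove the object from the workspace ...
load(rdata_file)   # ... and bring it back as 'toy'

all.equal(toy, toy_again)   # TRUE
</source>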
== Basic Analysis and Data Types ==
As you work with data frames, especially those that contain many different variables, the following commands may come in very handy:
<source>
head(mydata) # displays the first 6 observations for all variables
tail(mydata) # displays the last 6 observations for all variables
str(mydata) # will give a list of all variables with their data types
</source>
These tend to be commands you would type straight into the Console, as you use them to remind yourself of some basic data features (like variable names or types). Try the head and tail commands yourself. But let's look at the output of the str(mydata) command:
<source>
'data.frame': 753 obs. of 22 variables:
inlf : int 1 1 1 1 1 1 1 1 1 1 ...
hours : int 1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
kidslt6 : int 1 0 1 0 1 0 0 0 0 0 ...
kidsge6 : int 0 2 3 3 2 0 2 0 2 2 ...
age : int 32 30 35 34 31 54 37 54 48 39 ...
educ : int 12 12 12 12 14 12 16 12 12 12 ...
wage : Factor w/ 374 levels ".","0.1282","0.1616",..: 191 35 268 ...
repwage : num 2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
hushrs : int 2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
husage : int 34 30 40 53 32 57 37 53 52 43 ...
huseduc : int 12 9 12 10 12 11 12 8 4 12 ...
huswage : num 4.03 8.44 3.58 3.54 10 ...
faminc : int 16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
mtr : num 0.722 0.661 0.692 0.781 0.622 ...
motheduc: int 12 7 12 7 12 14 14 3 7 7 ...
fatheduc: int 7 7 7 7 14 7 7 3 7 7 ...
unem : num 5 11 5 5 9.5 7.5 5 5 3 5 ...
city : int 0 1 0 0 1 1 0 0 0 0 ...
exper : int 14 5 15 6 7 33 11 35 24 21 ...
nwifeinc: num 10.9 19.5 12 6.8 20.1 ...
lwage : Factor w/ 374 levels "-0.0392607","-0.1505904",..: 176 35 253 ...
expersq : int 196 25 225 36 49 1089 121 1225 576 441 ...
</source>
The interesting things in this list are the data types. R attempts to figure out automatically what type of data you have supplied. The types you are most likely to meet are:
* int, which is an integer. E.g. kidslt6, which tells us how many kids younger than 6 years a woman has.
* num, which are real numbers. E.g. huswage, which gives us the hourly wage of the husband.
* factor, which in R indicates a categorical variable (either nominal, as in this case, or an ordered factor for ordinal variables). At first sight it is not clear why wage should be a factor. Soon more on this.
* logical, these are TRUE or FALSE variables. Not used here, although inlf and city could be treated as logical variables.
* character, which are text variables. These are not used here, but they could be names.
* NA, these indicate missing values in R.
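The types in this list can be explored with the class function. The short sketch below builds one made-up vector of each type (none of these values come from the mroz data) and also shows how missing values behave.

<source>
x_int <- c(1L, 0L, 1L)                          # integer (note the L suffix)
x_num <- c(4.03, 8.44, 3.58)                    # numeric (real numbers)
x_fac <- factor(c("north", "south", "north"))   # categorical (nominal)
x_ord <- factor(c("low", "high", "low"),
                levels = c("low", "high"),
                ordered = TRUE)                 # categorical (ordinal)
x_log <- c(TRUE, FALSE, TRUE)                   # logical
x_chr <- c("Ann", "Beth", "Cleo")               # character

class(x_int)   # "integer"
class(x_num)   # "numeric"
class(x_fac)   # "factor"
class(x_ord)   # "ordered" "factor"
class(x_log)   # "logical"
class(x_chr)   # "character"

# NA marks a missing value and can appear in a vector of any type
x_miss <- c(4.03, NA, 3.58)
is.na(x_miss)                # FALSE TRUE FALSE
mean(x_miss, na.rm = TRUE)   # average of the non-missing values only
</source>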
Let's try and explore why wage and lwage are treated as categorical variables. The 374 levels merely indicate that R found 374 different values in each of these variables. Something seems to be special about these variables. Go back to the spreadsheet view of mydata, scroll down and keep an eye on the values in the wage column.
[[File:R_mydata2.JPG|frameless|500px]]
Once you get to row 429 you will see that there are missing values. Where there were missing values the particular spreadsheet we imported had a full stop, ".". R interpreted this as a text value and subsequently interpreted the combination of numbers and text values as a categorical variable and hence the factor type.
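One way to avoid this problem at the import stage is the na.strings argument of read.csv, which tells R which text should be read as a missing value. The sketch below demonstrates this on a small made-up file (the values and the names tmp, raw and clean are illustrative only). Note also that from R version 4.0.0 onwards read.csv imports such a column as character (chr) rather than as a factor, unless you set stringsAsFactors = TRUE.

<source>
# A tiny made-up csv file in which "." marks a missing wage
tmp <- tempfile(fileext = ".csv")
writeLines(c("hours,wage", "1610,3.354", "0,.", "1980,4.5918"), tmp)

# Plain import: the "." forces the whole wage column to be read as text
# (a factor in older versions of R, character from R 4.0.0 onwards)
raw <- read.csv(tmp)
is.numeric(raw$wage)     # FALSE

# Declaring "." as the missing value code keeps the column numeric
clean <- read.csv(tmp, na.strings = ".")
is.numeric(clean$wage)   # TRUE
is.na(clean$wage)        # FALSE TRUE FALSE
</source>

For the data used here the corresponding call would be read.csv("mroz.csv", na.strings = "."), assuming the full stop is the only missing value code in the file.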
At this stage it is useful to learn another little programming trick in R. You could have just printed all values of the wage variable. As wage is one of 22 variables in mydata, you could have used one of the following two commands:
<source>
mydata$wage # the wage variable, addressed by name
mydata[7] # as wage is the 7th variable (returns a data frame with one column)
</source>
Try them both in your Console. We will soon learn that it is important to know how to address individual variables from a data frame.
As it turns out, converting factors to num in R isn't the easiest of tasks. So, the easy solution is to go to your csv file and use Excel to replace all cells that contain only a full stop, ".", with "NA". But be careful to choose the replacement option "Match entire cell contents"! In any case, here is how you do it in R. I shall describe it as it introduces your best helper: GOOGLE (other search engines exist!). Type "R how to convert factor to numeric" into your search engine and you will find the solution.
<source>
mydata$wage <- as.numeric(as.character(mydata$wage))
</source>
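The detour via as.character matters. The following sketch uses a small made-up factor (the name wage_f and its values are illustrative) to show what happens with and without it.

<source>
# A small factor that mimics the imported wage column (values are made up)
wage_f <- factor(c("3.354", ".", "4.5918", "3.354"))

# Pitfall: as.numeric() applied directly to a factor returns the internal
# level codes (small integers), not the numbers you see printed
as.numeric(wage_f)

# Going via as.character() recovers the numbers; "." cannot be converted
# and becomes NA (R normally issues a warning, suppressed here)
wage_n <- suppressWarnings(as.numeric(as.character(wage_f)))
wage_n               # 3.354, NA, 4.5918, 3.354
sum(is.na(wage_n))   # 1
</source>

The same conversion would be needed for lwage, which was imported as a factor for the same reason.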