Python/Data
Revision as of 11:02, 25 July 2014
Introduction
Here we will give a brief introduction to how best to handle data when using Python to solve econometric problems. We will use a tool called pandas, which is based on NumPy arrays. So first you have to make sure that the NumPy [1] and pandas [2] modules are available.
You can use pandas for any of the following:
- Merge data-sets
- Filter data-sets
- Calculate summary statistics
We will do this by way of an example. Here are two datafiles:
- S&P500: SP500.xlsx
- IBM: IBM.xlsx
These are data files downloaded from Yahoo (http://www.yahoo.com/finance) which contain information about the S&P500 share price index and the IBM share price. But let’s use Python and pandas to explore the data.
Data Import
Use the following code:
import numpy as np # import modules for use
import pandas as pd
xlsxfile = pd.ExcelFile('IBM.xlsx') # Loads an Excel Table
data_IBM = xlsxfile.parse('IBM') # Read an Excel table into DataFrame
xlsxfile = pd.ExcelFile('SP500.xlsx')
data_SP = xlsxfile.parse('SP500')
Importing Excel files works in two steps: first the Excel file is loaded (which is what the pd.ExcelFile('IBM.xlsx') command does). The second line, data_IBM = xlsxfile.parse('IBM'), puts the data into what is called a DataFrame. We’ll see what that is in a minute.
If you have the data in a csv file, then this process is even simpler! Then you only need one line, data_IBM = pd.read_csv('IBM.csv'), to achieve the same. The importing process will also take significantly longer from an xlsx file than from a csv file, so if you have large datasets it may be much better to import them from a csv file.
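The one-line csv route can be sketched as follows. To keep the example self-contained it reads from an in-memory buffer with a few made-up rows rather than from an IBM.csv on disk, but the call is the same:

```python
import io
import pandas as pd

# A small stand-in for a Yahoo-style csv file; the values are assumed for illustration
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2014-07-18,192.00,193.00,191.50,192.50,3000000,192.50
2014-07-17,191.80,192.60,191.20,192.49,2800000,192.49
"""

# read_csv does in one step what ExcelFile + parse do for Excel files;
# parse_dates tells pandas to convert the Date column to datetime64
data_IBM = pd.read_csv(io.StringIO(csv_text), parse_dates=['Date'])

print(data_IBM.dtypes)   # Date is datetime64, the price columns are float64
print(len(data_IBM))     # 2 rows in this toy file
```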
So let’s see what we have at this stage. If you use the following command
print(data_SP)
you will obtain the following
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16240 entries, 0 to 16239
Data columns (total 7 columns):
Date 16240 non-null values
Open 16240 non-null values
High 16240 non-null values
Low 16240 non-null values
Close 16240 non-null values
Volume 16240 non-null values
Adj Close 16240 non-null values
dtypes: datetime64[ns](1), float64(6)
You can see that we get 16240 daily entries. They have index 0 to 16239; as usual in Python, we start counting at 0. The index is actually crucial to understanding DataFrames: each entry has an index, and it is the index by which we can recall that entry. The central role of indices will become obvious when we join datasets together a little later, and we will then see how to change the index. We also find out that we have 7 data columns, or variables, called Date, Open, High, Low, Close, Volume and Adj Close. These names are automatically taken from the first row of the Excel file. In the last line we see the data types contained in the DataFrame: one of the variables is a date, the others are floating point numbers. You can get a glimpse of the actual data as follows
print(data_SP.values)
which will deliver the following output
[[datetime.datetime(2014, 7, 18, 0, 0) 1961.54 1979.91 ..., 1978.22 3106060000.0 1978.22]
[datetime.datetime(2014, 7, 17, 0, 0) 1979.75 1981.8 ..., 1958.12 3381680000.0 1958.12]
[datetime.datetime(2014, 7, 16, 0, 0) 1976.35 1983.94 ..., 1981.57 3390950000.0 1981.57]
...,
[datetime.datetime(1950, 1, 5, 0, 0) 16.93 16.93 ..., 16.93 2550000.0 16.93]
[datetime.datetime(1950, 1, 4, 0, 0) 16.85 16.85 ..., 16.85 1890000.0 16.85]
[datetime.datetime(1950, 1, 3, 0, 0) 16.66 16.66 ..., 16.66 1260000.0 16.66]]
Here you can see the data, the first column being a vector of dates (starting with the most recent observation, from 18 July 2014). This is followed by the opening value of the S&P500 index on that date (1961.54), then the highest price (1979.91) on that day. The last three columns shown (some are hidden from display) are the Close price, the Volume and the Adjusted Close.
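How a DataFrame couples an index with named columns can be sketched with a tiny hand-built example (the values are assumed for illustration, not real quotes):

```python
import pandas as pd

# A toy DataFrame with the same kind of structure as data_SP
df = pd.DataFrame({'Open': [1961.54, 1979.75],
                   'Adj Close': [1978.22, 1958.12]})

print(df.index.tolist())     # default integer index: [0, 1]
print(df.columns.tolist())   # variable names: ['Open', 'Adj Close']
print(df.loc[0, 'Open'])     # recall entry 0 of the Open column: 1961.54
```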
You can get to individual variables/columns using the following referencing style: data_SP['Adj Close'], which will select the Adjusted Close variable.
Extracting/Slicing data
Selecting variables
You can select the data according to column/variable information and/or according to index information. Let’s start with columnwise selection. The following command
data_SP['High']
will show just the High price information. If you want to actually use the data for some operation outside pandas you can do any of the following:
test1 = data_SP['Adj Close'].copy()
test2 = data_SP['Adj Close'].values
test3 = data_SP.as_matrix(['Adj Close'])
It is interesting (some may say confusing) to see what type of objects these come back as.
In [77]: type(test1)
Out[77]: pandas.core.series.Series
In [78]: type(test2)
Out[78]: numpy.ndarray
In [79]:type(test3)
Out[79]: numpy.ndarray
The first turns out to be a special type of pandas DataFrame called a Series. It is basically the same as a DataFrame (i.e. it has an index). test2 and test3 are numpy arrays. But to add to the confusion, let’s actually look at the objects:
In [80]:test1
Out[80]:
0 1978.22
1 1958.12
2 1981.57
...
16237 16.93
16238 16.85
16239 16.66
Name: Adj Close, Length: 16240, dtype: float64
In [81]: test2
Out[81]: array([ 1978.22, 1958.12, 1981.57, ..., 16.93, 16.85, 16.66])
In [82]: test3
Out[82]:
array([[ 1978.22],
[ 1958.12],
[ 1981.57],
...,
[ 16.93],
[ 16.85],
[ 16.66]])
To see the difference we investigate the shape of test2 and test3:
In [83]: shape(test2)
Out[83]: (16240L,)
In [84]: shape(test3)
Out[84]: (16240L, 1L)
It turns out that test2 is a 1-dimensional array (with 16240 elements), whereas test3 is a two-dimensional array with 16240 rows and 1 column.
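The distinction matters when you pass these arrays on to routines that expect either a flat vector or a column vector. A minimal NumPy sketch (with a short assumed array standing in for the price data):

```python
import numpy as np

# flat mimics test2 (1-dimensional), col mimics test3 (a single-column 2-D array)
flat = np.array([1978.22, 1958.12, 1981.57])
col = flat.reshape(-1, 1)      # -1 lets numpy infer the number of rows

print(flat.shape)              # (3,)
print(col.shape)               # (3, 1)
print(col.ravel().shape)       # ravel flattens back to (3,)
```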
If you want to select several variables, the extension is quite straightforward. Try the following:
test1 = data_SP[['Adj Close','High']].copy()
test2 = data_SP[['Adj Close','High']].values
test3 = data_SP.as_matrix([['Adj Close','High']])
These commands will deliver two variables. Now both test2 and test3 will deliver numpy arrays with dimension (16240L, 2L).
Selecting rows/entries
You can select individual entries by their index, e.g.
data_SP[3:6].values
will return an array that contains entries 3 to 5 from the DataFrame data_SP. Even if you only want to select one entry, say entry 3, you need to use data_SP[3:4].values rather than data_SP[3].values, which will not work. You can also select entries on conditions. Say you want to select those days on which more than 10 billion shares were traded in the S&P500. They are selected as follows.
data_SP[data_SP.Volume>10000000000]
Attach .values to get an array with these entries or .copy() to obtain a new DataFrame. You will find that only three of the 16240 entries meet this criterion.
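The same boolean selection can be tried on a toy DataFrame (the values are assumed for illustration):

```python
import pandas as pd

# Toy data: only some rows have more than 10 billion shares traded
df = pd.DataFrame({'Adj Close': [1978.22, 899.22, 1128.15],
                   'Volume': [3.1e9, 11.5e9, 10.6e9]})

# The boolean condition keeps only the rows for which it is True
heavy = df[df.Volume > 1e10]

print(len(heavy))                      # 2 of the 3 rows qualify
print(heavy['Adj Close'].tolist())     # [899.22, 1128.15]
```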
Selecting on rows and variables
To select on two dimensions we use the .ix method which is part of DataFrames. Test this command out:
In [136]: data_SP.ix[data_SP.Volume>10000000000,['Date','Adj Close','Volume']].values
Out[136]:
array([[datetime.datetime(2010, 5, 6, 0, 0), 1128.15, 10617809600.0],
[datetime.datetime(2008, 10, 10, 0, 0), 899.22, 11456230400.0],
[datetime.datetime(2008, 9, 18, 0, 0), 1206.51, 10082689600.0]], dtype=object)
This returns an array with three rows, each of which contains the date, the adjusted close and the volume of shares traded on the three days with the highest volume. As you can see, the most recent of these days was in 2010.
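Note that in current pandas versions the .ix indexer has been removed; the equivalent two-dimensional selection is done with .loc. A sketch with assumed toy data modelled on the three high-volume days above:

```python
import pandas as pd

# Toy stand-in for the high-volume days (values assumed for illustration)
df = pd.DataFrame({'Date': pd.to_datetime(['2010-05-06', '2008-10-10', '2008-09-18']),
                   'Adj Close': [1128.15, 899.22, 1206.51],
                   'Volume': [10617809600.0, 11456230400.0, 10082689600.0]})

# .loc[row_condition, column_list] selects on rows and columns at once
picked = df.loc[df.Volume > 1.05e10, ['Date', 'Adj Close']]

print(len(picked))                     # 2 rows pass the volume threshold
print(picked['Adj Close'].tolist())    # [1128.15, 899.22]
```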
Joining Datasets
If you check out the pandas documentation on merging DataFrames you will find a bewildering array of ways to join different DataFrames. This site makes no claim to deliver exhaustive documentation, so we will unashamedly restrict our demonstration to the example at hand. Earlier we uploaded two datasets into DataFrames, data_SP, the one which we were using so far, and data_IBM.
The following command will join the two DataFrames:
data = pd.concat({'SP': data_SP, 'IBM': data_IBM},axis=1,join='inner')
Let’s investigate the elements of this command. First, the pd. ensures that Python looks in the pandas module for the concat command. The second argument, axis=1, indicates that we want to join by creating new variables/columns (as opposed to new entries, which would be axis=0). But this by itself will cause a problem, as both DataFrames that are to be merged have identical column names, and this will cause a conflict. With this in the back of our minds, the first argument will make most sense. It is presented in curly brackets, which in Python indicates a dictionary [3]. {'SP': data_SP, 'IBM': data_IBM} is a dictionary which associates data_SP with the label SP and data_IBM with IBM; the reason why the two DataFrames are presented to the concatenation function in this way will become obvious very shortly. The last argument in the above command is join='inner'. This ensures that the new records are only those for which we have available data for all variables. The IBM datafile was somewhat shorter (it only had 13,227 daily entries) and therefore the new DataFrame has 13,227 entries (indexed from 0 to 13226). If you are happy to create partially empty records you can leave this last argument out (or use join='outer'). How this works can be seen as follows:
print(data)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 13227 entries, 0 to 13226
Data columns (total 14 columns):
(IBM, Date) 13227 non-null values
(IBM, Open) 13227 non-null values
(IBM, High) 13227 non-null values
(IBM, Low) 13227 non-null values
(IBM, Close) 13227 non-null values
(IBM, Volume) 13227 non-null values
(IBM, Adj Close) 13227 non-null values
(SP, Date) 13227 non-null values
(SP, Open) 13227 non-null values
(SP, High) 13227 non-null values
(SP, Low) 13227 non-null values
(SP, Close) 13227 non-null values
(SP, Volume) 13227 non-null values
(SP, Adj Close) 13227 non-null values
dtypes: datetime64[ns](2), float64(12)
The result is obvious. The columns/variables have now been given a two-layer name, the first layer indicating whether the data are from SP or IBM and the second indicating the variable. We can call a particular variable as follows:
data['SP']['Adj Close'].plot()
This selects the S&P500 adjusted close prices and plots them in a line graph.
How concat works (like most operations on DataFrames) is that it merges the entries from data_SP and data_IBM with the same index. There is no guarantee that entries with the same index actually come from the same day; the indexes were allocated according to the initial Excel file. As the files from Yahoo come with the latest observation in the first row, the observations from 18 July 2014 had index 0, those from 17 July 2014 index 1, and so forth. And as in both files the latest observation was from 18 July 2014, we got lucky and actually merged data from the same day.
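That concat matches on the index, not on position, can be seen with two toy DataFrames of unequal length (values assumed for illustration):

```python
import pandas as pd

# Two toy frames with default integer indices; b is shorter than a
a = pd.DataFrame({'Adj Close': [10.0, 11.0, 12.0]})   # index 0, 1, 2
b = pd.DataFrame({'Adj Close': [100.0, 101.0]})       # index 0, 1 only

merged = pd.concat({'A': a, 'B': b}, axis=1, join='inner')

# The inner join keeps only the indices present in both frames
print(len(merged))                          # 2
print(merged['A']['Adj Close'].tolist())    # [10.0, 11.0]
```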
But if your input data come from files with either different orderings or different sample lengths, then you need to be careful. The key to this is to set the date as the index.
Setting the Date as Index
For most time series econometric problems it will be best to set the date as the index. If you have cross-sectional data you may want a person identifier as the index, and if you have panel data you should be using the pandas Panel structure.
First we shall change the individual DataFrames’ indices to the date information. You can see from the earlier information that Date is listed as one of the variables. We can set this variable to be the index as follows:
data_SP2 = data_SP.set_index('Date')
data_IBM2 = data_IBM.set_index('Date')
If we now look at the DataFrame info we get
In [107]: data_SP2
Out[107]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16240 entries, 2014-07-18 00:00:00 to 1950-01-03 00:00:00
Data columns (total 6 columns):
Open 16240 non-null values
High 16240 non-null values
Low 16240 non-null values
Close 16240 non-null values
Volume 16240 non-null values
Adj Close 16240 non-null values
dtypes: float64(6)
and we can see that Date as a variable has disappeared, but the date info now appears as the index.
If we now concatenate the two DataFrames they will again be matched along the Index information:
data2 = pd.concat({'SP': data_SP2, 'IBM': data_IBM2},axis=1,join='inner')
As the index now is the date information, this will ensure that we join the datasets along the correct dates. Even if the individual DataFrames had different sample sizes and/or orderings, this will still ensure the correct data merge.
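The effect can be checked with two toy date-indexed frames stored in opposite row orders (values assumed for illustration):

```python
import pandas as pd

dates = pd.to_datetime(['2014-07-16', '2014-07-17', '2014-07-18'])

# Same three days; ibm will be fed in with its rows reversed
sp = pd.DataFrame({'Adj Close': [1981.57, 1958.12, 1978.22]}, index=dates)
ibm = pd.DataFrame({'Adj Close': [192.36, 192.49, 192.50]}, index=dates)

data2 = pd.concat({'SP': sp, 'IBM': ibm.iloc[::-1]}, axis=1, join='inner')

# Despite the reversed row order, each date pairs the right observations
print(data2.loc['2014-07-18', ('SP', 'Adj Close')])    # 1978.22
print(data2.loc['2014-07-18', ('IBM', 'Adj Close')])   # 192.5
```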
For some reason (yes, sometimes you can spend an hour figuring out why without result), just printing the DataFrame info doesn’t work for the DataFrame merged on dates. However, all the data are available as you can verify with
In [109]: data2['IBM']['Adj Close']
Out[109]:
Date
2014-07-18 192.50
2014-07-17 192.49
2014-07-16 192.36
...
1962-01-04 2.45
1962-01-03 2.48
1962-01-02 2.45
Name: Adj Close, Length: 13227, dtype: float64
Literature
The pandas documentation can be found here [4]. I found the tutorials particularly useful [5].