Difference between revisions of "Python/Data"
Line 3: | Line 3: | ||
= Introduction = | = Introduction = | ||
− | Here we will give a brief introduction into how to best handle data when using Python to solve Econometric problems. Here we will use a tool called Pandas. They are based on Numpy arrays. So first you got to make sure that the Numpy[http://www.numpy.org/] and Panda[http://pandas.pydata.org/] modules are available. | + | Here we will give a brief introduction into how to best handle data when using Python to solve Econometric problems. Here we will use a tool called Pandas. They are based on Numpy arrays. So first you got to make sure that the Numpy [http://www.numpy.org/] and Panda [http://pandas.pydata.org/] modules are available. |
You can use Pandas to any of the following: | You can use Pandas to any of the following: | ||
Line 63: | Line 63: | ||
[datetime.datetime(1950, 1, 3, 0, 0) 16.66 16.66 ..., 16.66 1260000.0 16.66]]</source> | [datetime.datetime(1950, 1, 3, 0, 0) 16.66 16.66 ..., 16.66 1260000.0 16.66]]</source> | ||
Here you can see the data, the first being a vector of dates (starting with the last observation from 18 July 2014). This is followed by the ''Open''ing value of the S&P500 index on that date (1961.54), then the ''High''est price (1979.91) on that day. The last three columns we can see (some are hidden from display) are the ''Close'' price, the ''Volume'' and the ''Adj''usted ''Close''. | Here you can see the data, the first being a vector of dates (starting with the last observation from 18 July 2014). This is followed by the ''Open''ing value of the S&P500 index on that date (1961.54), then the ''High''est price (1979.91) on that day. The last three columns we can see (some are hidden from display) are the ''Close'' price, the ''Volume'' and the ''Adj''usted ''Close''. | ||
+ | |||
+ | You can get to individual variables/columns using the following referencing style: <code>dataSP[’Adj Close’]</code> which will select the ''Adjusted Close'' variable. | ||
+ | |||
+ | = Joining Datasets = | ||
+ | |||
+ | = Extracting/Slicing data = | ||
+ | |||
+ | You can select the data according to column/variable information and/or according to index information. Let’s start with columnwise selection. The following command | ||
+ | |||
+ | <source>data_SP['High']</source> | ||
+ | will show just the High price information. If you want to actually use the data for some operation outside the Panda you can do any of the following: | ||
+ | |||
+ | <source>test1 = data_SP['Adj Close'].copy() | ||
+ | test2 = data_SP['Adj Close'].values | ||
+ | test3 = data_SP.as_matrix(['Adj Close'])</source> | ||
+ | it is interesting (some may say confusing) to see as what type of objects these come back as. | ||
+ | |||
+ | <source>In [77]: type(test1) | ||
+ | Out[77]: pandas.core.series.Series | ||
+ | |||
+ | In [78]: type(test2) | ||
+ | Out[78]: numpy.ndarray | ||
+ | |||
+ | In [79]:type(test3) | ||
+ | Out[79]: numpy.ndarray</source> | ||
+ | The first turns out to be a special type of a Panda DataFrame called a series. It is basically the same as a DataFrame (i.e. it has an index). <code>temp2</code> and <code>temp3</code> are numpy arrays. But to add to the confusion, let’s actually look at the objects: | ||
+ | |||
+ | <source>In [80]:test1 | ||
+ | Out[80]: | ||
+ | 0 1978.22 | ||
+ | 1 1958.12 | ||
+ | 2 1981.57 | ||
+ | ... | ||
+ | 16237 16.93 | ||
+ | 16238 16.85 | ||
+ | 16239 16.66 | ||
+ | Name: Adj Close, Length: 16240, dtype: float64 | ||
+ | |||
+ | In [81]: test2 | ||
+ | Out[81]: array([ 1978.22, 1958.12, 1981.57, ..., 16.93, 16.85, 16.66]) | ||
+ | |||
+ | In [82]: test3 | ||
+ | Out[82]: | ||
+ | array([[ 1978.22], | ||
+ | [ 1958.12], | ||
+ | [ 1981.57], | ||
+ | ..., | ||
+ | [ 16.93], | ||
+ | [ 16.85], | ||
+ | [ 16.66]])</source> | ||
+ | To see the difference we investigate the shape of <code>test2</code> and <code>test3</code>: | ||
+ | |||
+ | <source>In [83]: shape(test2) | ||
+ | Out[83]: (16240L,) | ||
+ | |||
+ | In [84]: shape(test3) | ||
+ | Out[84]: (16240L, 1L)</source> | ||
+ | It turns out that <code>test2</code> is a 1-dimensional array (with 16240 elements), whereas <code>test3</code> is a two-dimensional array with 16240 rows and 1 column. | ||
= Literature = | = Literature = | ||
+ | |||
+ | The Panda Documentation can be found here [http://pandas.pydata.org/pandas-docs/stable/pandas.pdf]. |
Revision as of 13:54, 24 July 2014
Introduction
Here we will give a brief introduction into how to best handle data when using Python to solve Econometric problems. Here we will use a tool called Pandas. They are based on Numpy arrays. So first you got to make sure that the Numpy [1] and Panda [2] modules are available.
You can use Pandas to any of the following:
- Merge data-sets
- Filter data-sets
- Calculate summary statistics
We will do this by way of an example. Here are two datafiles:
- S&P500: SP500.xlsx
- IBM: IBM.xlsx
These are csy files downloaded from [[Yahoo|http://www.yahoo.com/finance Yahoo]] which contain information about the S&P500 share price index and the IBM share prices. But let’s use Python and Pandas to explore the data.
Data Import
Use the following code:
import numpy as np # import modules for use
import pandas as pd
xlsxfile = pd.ExcelFile('IBM.xlsx') # Loads an Excel Table
data_IBM = xlsxfile.parse('IBM') # Read an Excel table into DataFrame
xlsxfile = pd.ExcelFile('SP500.xlsx')
data_SP = xlsxfile.parse('SP500')
Importing Excel files works in the two steps of first uploading an Excel file (which is what the pd.ExcelFile(’IBM.xlsx’)
command does). The second line, dataIBM = xlsxfile.parse(’IBM’)
puts the data into what is called a DataFrame. We’ll see what that is in a minute.
If you have the data in a csv file, then this process is even simpler! Then you only need one line dataIBM = pd.readcsv(’IBM.csv’)
to achieve the same. The importing process will also be significantly longer from an xlx file, when compared to a csv file. So if you have large datasets it may be much better to import them from a csv file.
So let’s see what we have at this stage. If you use the following command
print(data_SP)
you will obtain the following
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16240 entries, 0 to 16239
Data columns (total 7 columns):
Date 16240 non-null values
Open 16240 non-null values
High 16240 non-null values
Low 16240 non-null values
Close 16240 non-null values
Volume 16240 non-null values
Adj Close 16240 non-null values
dtypes: datetime64[ns](1), float64(6)
You can see that we get 16240 daily entries. They have index 0 to 16239, as usual in Python we start counting at 0. Then we find out that we have 7 data columns, or variables. They are called Date, Open, High, Low, Close, Volume, Adj Close. These names are automatically taken from the first row of the Excel file. In the last line we see the data-types contained in the DataFrame. One of the variables is a date, the others are floating variables. You can get a glimpse at the actual data as follows
print(data_SP.values)
which will deliver the following output
[[datetime.datetime(2014, 7, 18, 0, 0) 1961.54 1979.91 ..., 1978.22 3106060000.0 1978.22]
[datetime.datetime(2014, 7, 17, 0, 0) 1979.75 1981.8 ..., 1958.12 3381680000.0 1958.12]
[datetime.datetime(2014, 7, 16, 0, 0) 1976.35 1983.94 ..., 1981.57 3390950000.0 1981.57]
...,
[datetime.datetime(1950, 1, 5, 0, 0) 16.93 16.93 ..., 16.93 2550000.0 16.93]
[datetime.datetime(1950, 1, 4, 0, 0) 16.85 16.85 ..., 16.85 1890000.0 16.85]
[datetime.datetime(1950, 1, 3, 0, 0) 16.66 16.66 ..., 16.66 1260000.0 16.66]]
Here you can see the data, the first being a vector of dates (starting with the last observation from 18 July 2014). This is followed by the Opening value of the S&P500 index on that date (1961.54), then the Highest price (1979.91) on that day. The last three columns we can see (some are hidden from display) are the Close price, the Volume and the Adjusted Close.
You can get to individual variables/columns using the following referencing style: dataSP[’Adj Close’]
which will select the Adjusted Close variable.
Joining Datasets
Extracting/Slicing data
You can select the data according to column/variable information and/or according to index information. Let’s start with columnwise selection. The following command
data_SP['High']
will show just the High price information. If you want to actually use the data for some operation outside the Panda you can do any of the following:
test1 = data_SP['Adj Close'].copy()
test2 = data_SP['Adj Close'].values
test3 = data_SP.as_matrix(['Adj Close'])
it is interesting (some may say confusing) to see as what type of objects these come back as.
In [77]: type(test1)
Out[77]: pandas.core.series.Series
In [78]: type(test2)
Out[78]: numpy.ndarray
In [79]:type(test3)
Out[79]: numpy.ndarray
The first turns out to be a special type of a Panda DataFrame called a series. It is basically the same as a DataFrame (i.e. it has an index). temp2
and temp3
are numpy arrays. But to add to the confusion, let’s actually look at the objects:
In [80]:test1
Out[80]:
0 1978.22
1 1958.12
2 1981.57
...
16237 16.93
16238 16.85
16239 16.66
Name: Adj Close, Length: 16240, dtype: float64
In [81]: test2
Out[81]: array([ 1978.22, 1958.12, 1981.57, ..., 16.93, 16.85, 16.66])
In [82]: test3
Out[82]:
array([[ 1978.22],
[ 1958.12],
[ 1981.57],
...,
[ 16.93],
[ 16.85],
[ 16.66]])
To see the difference we investigate the shape of test2
and test3
:
In [83]: shape(test2)
Out[83]: (16240L,)
In [84]: shape(test3)
Out[84]: (16240L, 1L)
It turns out that test2
is a 1-dimensional array (with 16240 elements), whereas test3
is a two-dimensional array with 16240 rows and 1 column.
Literature
The Panda Documentation can be found here [3].