Python/Data

From ECLR
Revision as of 21:34, 21 July 2014 by Rb (talk | contribs)
Jump to: navigation, search


Introduction

Here we will give a brief introduction into how to best handle data when using Python to solve Econometric problems. Here we will use a tool called Pandas. They are based on Numpy arrays. So first you got to make sure that the [[Numpy|Numpy]] and [[Panda|Panda]] modules are available.

You can use Pandas to any of the following:

  • Merge data-sets
  • Filter data-sets
  • Calculate summary statistics

We will do this by way of an example. Here are two datafiles:

  1. S&P500: SP500.xlsx
  2. IBM: IBM.xlsx

These are csy files downloaded from [[Yahoo|http://www.yahoo.com/finance Yahoo]] which contain information about the S&P500 share price index and the IBM share prices. But let’s use Python and Pandas to explore the data.

Data Import

Use the following code:

import numpy as np      # import modules for use
import pandas as pd

xlsxfile = pd.ExcelFile('IBM.xlsx') # Loads an Excel Table
data_IBM = xlsxfile.parse('IBM')    # Read an Excel table into DataFrame

xlsxfile = pd.ExcelFile('SP500.xlsx')
data_SP = xlsxfile.parse('SP500')

Importing Excel files works in the two steps of first uploading an Excel file (which is what the pd.ExcelFile(’IBM.xlsx’) command does). The second line, dataIBM = xlsxfile.parse(’IBM’) puts the data into what is called a DataFrame. We’ll see what that is in a minute.

If you have the data in a csv file, then this process is even simpler! Then you only need one line dataIBM = pd.readcsv(’IBM.csv’) to achieve the same. The importing process will also be significantly longer from an xlx file, when compared to a csv file. So if you have large datasets it may be much better to import them from a csv file.

So let’s see what we have at this stage. If you use the following command

print(data_SP)

you will obtain the following

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16240 entries, 0 to 16239
Data columns (total 7 columns):
Date         16240  non-null values
Open         16240  non-null values
High         16240  non-null values
Low          16240  non-null values
Close        16240  non-null values
Volume       16240  non-null values
Adj Close    16240  non-null values
dtypes: datetime64[ns](1), float64(6)

You can see that we get 16240 daily entries. They have index 0 to 16239, as usual in Python we start counting at 0. Then we find out that we have 7 data columns, or variables. They are called Date, Open, High, Low, Close, Volume, Adj Close. These names are automatically taken from the first row of the Excel file. In the last line we see the data-types contained in the DataFrame. One of the variables is a date, the others are floating variables. You can get a glimpse at the actual data as follows

print(data_SP.values)

which will deliver the following output

[[datetime.datetime(2014, 7, 18, 0, 0) 1961.54 1979.91 ..., 1978.22 3106060000.0 1978.22]
 [datetime.datetime(2014, 7, 17, 0, 0) 1979.75 1981.8 ..., 1958.12 3381680000.0 1958.12]
 [datetime.datetime(2014, 7, 16, 0, 0) 1976.35 1983.94 ..., 1981.57 3390950000.0 1981.57]
 ...,
 [datetime.datetime(1950, 1, 5, 0, 0) 16.93 16.93 ..., 16.93 2550000.0 16.93]
 [datetime.datetime(1950, 1, 4, 0, 0) 16.85 16.85 ..., 16.85 1890000.0 16.85]
 [datetime.datetime(1950, 1, 3, 0, 0) 16.66 16.66 ..., 16.66 1260000.0 16.66]]

Literature