Practical Business Python

Taking care of business, one python script at a time

Tue 14 October 2014

Simple Interactive Data Analysis with Python

Posted by Chris Moffitt in articles   

Interactive Python

Python is a language that allows you to create quick and simple code to do relatively complex tasks. It is very common to use the interactive python interpreter to enter a few commands in order to “figure out” how they work. If you’ve done any kind of basic python tutorial, there will be a step early in the process that asks you to type python in your command line.

The python command opens up an interpreter which allows you to type commands and get real time feedback on the results. Here is a very simple example from powerful one-liners:

$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pprint
>>> pprint.pprint(zip(('Byte', 'KByte', 'MByte', 'GByte', 'TByte'), (1 << 10*i for i in xrange(5))))
[('Byte', 1),
 ('KByte', 1024),
 ('MByte', 1048576),
 ('GByte', 1073741824),
 ('TByte', 1099511627776)]
>>>

While this interactive environment is really useful, it is not very conducive for more thorough exploration of python. Very early in your python journey, you’ll probably hear about IPython. IPython provides many useful features, including:

  • tab completion
  • object exploration
  • command history

You can invoke ipython in a similar way but you’ll immediately notice a little different interface:

$ ipython
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
Type "copyright", "credits" or "license" for more information.

IPython 2.3.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import pprint

In [2]: pprint.pprint(zip(('Byte', 'KByte', 'MByte', 'GByte', 'TByte'), (1 << 10*i for i in xrange(5))))
[('Byte', 1),
 ('KByte', 1024),
 ('MByte', 1048576),
 ('GByte', 1073741824),
 ('TByte', 1099511627776)]

In [3]: help(pprint)


In [4]: pprint.
pprint.PrettyPrinter  pprint.isrecursive    pprint.pprint         pprint.warnings
pprint.isreadable     pprint.pformat        pprint.saferepr

In [4]: pprint.

In the example, I ran the same commands to get the same output but also tried the help function as well as used TAB completion after typing pprint. The other command I used was the up arrow to scroll through the history of commands, edit them and execute the results:

In [4]: pprint.pprint(zip(('Byte', 'KiloByte', 'MegaByte', 'GigaByte', 'TeraByte'), (1 << 10*i for i in xrange(5))))
[('Byte', 1),
 ('KiloByte', 1024),
 ('MegaByte', 1048576),
 ('GigaByte', 1073741824),
 ('TeraByte', 1099511627776)]

In [5]: pprint.pprint(zip(('Byte', 'KByte', 'MByte', 'GByte', 'TByte'), (1 << 10*i for i in xrange(5))))
[('Byte', 1),
 ('KByte', 1024),
 ('MByte', 1048576),
 ('GByte', 1073741824),
 ('TByte', 1099511627776)]

IPython also makes it easy to learn more about the objects you are using. If you ever get stuck, try using the ? to learn more about something:

In [9]: s = {'1','2'}

In [10]: s?
Type:        set
String form: set(['1', '2'])
Length:      2
Docstring:
set() -> new empty set object
set(iterable) -> new set object

Build an unordered collection of unique elements.

In [11]:

The functionality provided by IPython is really cool and useful and I encourage you to install it on your system and play with the various features to learn more about it.

IPython Notebook

IPython is very useful and I have used it over the years when working on Django projects. Sometime in 2011, they introduced the concept of the IPython notebook to this powerful tool. For some reason I’m late to the party but now that I’ve had the chance to use them and play with them, I can see their immense power.

The simplest way to describe an IPython Notebook is that it is a super cool way to provide the IPython console in a browser. However, it does not just provide IPython-like features in a browser, it makes it very simple to record your steps and share them with others. In the context of business applications, there are two main points to keep in mind:

  • Notebooks allow you to easily interact with and explore your data
  • The exploration is almost self-documenting and allows you to easily share and train others on what you are doing

Imagine you are working with Excel, and have just created a pivot table or done some other analysis. If you would like to explain to someone how to do it, what would you do? Cut and paste screen shots into Word? Record the session via some sort of screen recording tool? Hand them the Excel file and tell them to go figure it out?

None of those options are particularly good but are certainly the standard in most places where Excel rules the ad-hoc analysis world. IPython Notebooks in coordination with pandas provide a robust way to analyze large amounts of data and share your process with your teammates.

Python Data Analysis Library

The Python Data Analysis Library aka pandas is a “BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.” Pandas is a very sophisticated program and you can do some wildly complex math with it. In future articles, I’ll go through it in more detail but I wanted to do a quick sample analysis using the same data I used in my sets article.

Starting Up The Environment

Start a python notebook session:

$ ipython notebook

Your browser should then automatically open and redirect to the notebook server. Here is what the main screen looks like (yours will probably be empty but this shows a few example notebooks):

notebook main screen

Clicking on the New Notebook button starts a new environment for you to code:

notebook main entry screen

You’ll notice that the input cell looks very much like the IPython command prompt we looked at earlier.

For the rest of this article, I will show the various commands I have entered in the cells. I have chosen to download the entire session via reST so it integrates more seamlessly with my blog work flow. If people would like the actual notebook and/or data files, let me know and I’ll post them.

Additionally, the IPython Notebook has a lot of powerful features. If you’d like me to talk through it in more detail - add your input in the comments. I’m open to giving more insight into using this application.

Very Quick Data Analysis with Pandas

Now that I am up and running with my notebook, I can do some pretty powerful analysis.

First, we need to import the standard pandas libraries

import pandas as pd
import numpy as np

Next, we can read in the sample data and get a summary of how it looks.

SALES=pd.read_csv("sample-sales.csv")
SALES.head()
Account Number Account Name sku category quantity unit price ext price date
0 803666 Fritsch-Glover HX-24728 Belt 1 98.98 98.98 2014-09-28 11:56:02
1 64898 O’Conner Inc LK-02338 Shirt 9 34.80 313.20 2014-04-24 16:51:22
2 423621 Beatty and Sons ZC-07383 Shirt 12 60.24 722.88 2014-09-17 17:26:22
3 137865 Gleason, Bogisich and Franecki QS-76400 Shirt 5 15.25 76.25 2014-01-30 07:34:02
4 435433 Morissette-Heathcote RU-25060 Shirt 19 51.83 984.77 2014-08-24 06:18:12

Now, we can use the pivot table function to summarize the sales and turn the rows of data into something useful. We will start with something very simple

report = SALES.pivot_table(values=['quantity'],index=['Account Name'],columns=['category'], aggfunc=np.sum)
report.head(n=10)
quantity
category Belt Shirt Shoes
Account Name
Abbott PLC NaN NaN 19
Abbott, Rogahn and Bednar NaN 18 NaN
Abshire LLC NaN 18 2
Altenwerth, Stokes and Paucek NaN 13 NaN
Ankunding-McCullough NaN 2 NaN
Armstrong, Champlin and Ratke 7 36 NaN
Armstrong, McKenzie and Greenholt NaN NaN 4
Armstrong-Williamson 19 NaN NaN
Aufderhar and Sons NaN NaN 2
Aufderhar-O’Hara NaN NaN 11

This command shows us the number of products each customer purchased - all in one command! As impressive as this is, you’ll notice that there are a bunch of NaN’s in the output. This means “Not a Number” and represents places where there is no value.

Wouldn’t it be nicer if the value was a 0 instead? That’s where fill_value comes in:

report = SALES.pivot_table(values=['quantity'],index=['Account Name'],columns=['category'], fill_value=0, aggfunc=np.sum)
report.head(n=10)
quantity
category Belt Shirt Shoes
Account Name
Abbott PLC 0 0 19
Abbott, Rogahn and Bednar 0 18 0
Abshire LLC 0 18 2
Altenwerth, Stokes and Paucek 0 13 0
Ankunding-McCullough 0 2 0
Armstrong, Champlin and Ratke 7 36 0
Armstrong, McKenzie and Greenholt 0 0 4
Armstrong-Williamson 19 0 0
Aufderhar and Sons 0 0 2
Aufderhar-O’Hara 0 0 11

This looks much cleaner! We will do one more thing with this example to show some of the power of the pivot_table. Let’s see how much in sales we did as well:

report = SALES.pivot_table(values=['ext price','quantity'],index=['Account Name'],columns=['category'], fill_value=0,aggfunc=np.sum)
report.head(n=10)
ext price quantity
category Belt Shirt Shoes Belt Shirt Shoes
Account Name
Abbott PLC 0.00 0.00 755.44 0 0 19
Abbott, Rogahn and Bednar 0.00 615.60 0.00 0 18 0
Abshire LLC 0.00 720.18 90.34 0 18 2
Altenwerth, Stokes and Paucek 0.00 843.31 0.00 0 13 0
Ankunding-McCullough 0.00 132.30 0.00 0 2 0
Armstrong, Champlin and Ratke 587.30 786.73 0.00 7 36 0
Armstrong, McKenzie and Greenholt 0.00 0.00 125.04 0 0 4
Armstrong-Williamson 1495.87 0.00 0.00 19 0 0
Aufderhar and Sons 0.00 0.00 193.54 0 0 2
Aufderhar-O’Hara 0.00 0.00 669.57 0 0 11

If we want, we can even output this to Excel. We have to convert it back to a DataFrame, then we can write it out to excel

report.to_excel('report.xlsx', sheet_name='Sheet1')

Showing the version of pandas in use since some syntax has changed in the more recent versions.

pd.__version__
'0.14.1'

Closing Thoughts

The purpose of this article was to give you a basic understanding of a few interactive python tools and how you can use these to do some complex analysis in a very quick and repeatable way. I plan to spend more time going through examples such as this to show how useful this toolset can be and to continue to let people know that there are alternatives to Excel when it comes to complex data analysis!

If you would like to learn more about pivot tables, please look at the Pandas Pivot Table Explained article for much more detail.

Updates

10-21-2014:
  • Cleaned up an extra line in the Excel writing function
  • Also showing the pandas version used in this example
  • Added a link to the sample data
6-17-2015:

 
       Vote on Hacker News          

Comments