Practical Business Python

Taking care of business, one python script at a time

Mon 26 October 2015

Pandas 0.17 Release and Other Notes

Posted by Chris Moffitt in articles   

Introduction

As many of you know, pandas released version 0.17.0 on October 9th. In typical pandas fashion there are a bunch of updates, bug fixes and new features which I encourage you to read all about here. I do not plan to go through all of the changes but there are a couple of key things that I think will be useful to me in my daily work that I will explore briefly in this article. In addition, I am including a couple of other tips and tricks for pandas that I use on a frequent basis and hope will be useful to you.

Excel MultiIndex

Strangely, one of the most exciting things about 0.17 is that MultiIndex supports to_excel again. I say again because it used to work just fine for me (prior to 0.16.2). However in that release (and up to 0.17) the method would raise a NotImplementedError . With 0.17, all works as expected. Yay!

In order to illustrate in a little more detail what this supports, here is a short example:

from __future__ import print_function
import pandas as pd

sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true')
sales_df["prod_group"] = sales_df["sku"].str[0:2]
print(sales_df.head())

Here is our Data:

account numbername sku quantity unit price ext pricedate prod_group
0 740150Barton LLC B1-20000 39 86.69 3380.912014-01-01 07:21:51B1
1 714466Trantow-Barrows S2-77896 -1 63.16 -63.162014-01-01 10:00:47S2
2 218895Kulas Inc B1-69924 23 90.7 2086.1 2014-01-01 13:24:58B1
3 307599Kassulke, Ondricka and MetzS1-65481 41 21.05 863.052014-01-01 15:05:22S1
4 412290Jerde-Hilpert S2-34077 6 83.21 499.262014-01-01 23:26:55S2

We are using our simple sales data again to show purchase history over several months. The prod_group column is just a way to break down the products into different categories based on their SKU.

Now, let’s create a pivot table of the data. The key here is that the pivot table creates a MultiIndex , which will cause problems when we try to save to excel.

sales_summary = pd.pivot_table(sales_df, index=["name"],
                               columns=["prod_group"],
                               values=["ext price", "quantity"])
ext price quantity
prod_group B1 S1 S2 B1 S1 S2
name
Barton LLC 1171.640278 1306.237600 1647.786190 20.777778 28.040000 28.190476
Cronin, Oberbrunner and Spencer 1387.616842 1342.598571 1302.787407 26.210526 25.428571 23.740741
Frami, Hills and Schmidt 1475.132143 1439.311875 1401.317857 25.071429 31.062500 25.142857
Fritsch, Russel and Anderson 1372.360286 1188.012857 1569.353200 23.714286 26.047619 29.400000
Halvorson, Crona and Champlin 1356.640000 1267.756667 983.963158 25.857143 22.333333 17.842105

All looks good until you try to save this using to_excel

sales_summary.to_excel('sampleout.xlsx')

In prior versions of pandas you’ll see something like this:

Traceback (most recent call last):
  File "pandas_version.py", line 25, in <module>
    sales_summary.to_excel('sampleout.xlsx')
  File "/home/chris/miniconda3/envs/pbpython2/lib/python2.7/site-packages/pandas/core/frame.py", line 1252, in to_excel
    raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.

Thankfully, pandas 0.17 works again and you get a clean Excel file that you would expect. The other reason I wanted to specifically call this out is that some of the code in my older blog posts stopped working as a result of this change. Now, I can point people here to get the solution - upgrade to pandas 0.17!

Rounding Data

One simple (but useful) new feature is the addition of a round method to a DataFrame. Prior to this release, if you wanted to round data in a column you had to use np.round or change the display using display.float_format .

In this example, let’s just look at the first 5 rows of data and round two columns:

sales_df = sales_df[0:5]
sales_df.round({'ext price':0, 'unit price': 0})
account number name sku quantity unit price ext price date prod_group
0 740150 Barton LLC B1-20000 39 87 3381 2014-01-01 07:21:51 B1
1 714466 Trantow-Barrows S2-77896 -1 63 -63 2014-01-01 10:00:47 S2
2 218895 Kulas Inc B1-69924 23 91 2086 2014-01-01 13:24:58 B1
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21 863 2014-01-01 15:05:22 S1
4 412290 Jerde-Hilpert S2-34077 6 83 499 2014-01-01 23:26:55 S2

I can control the number of decimal places to round as well as specify which columns to round by including in the dictionary of column names. This is pretty useful.

Troubleshooting Merges

In my experience manipulating and combining data, pandas tends to just do what I would expect. One of the areas where I can find myself getting tripped up is when I do a merge of DataFrames. Sometimes I need to play around with whether or not I really want a left, right, outer or inner join. To help troubleshoot these sometimes pesky problems, there is a new indicator argument that can help you figure out why your merge may not be doing what you expect.

For this example, let’s add in a description to our data. Here is the sales_groups DataFrame that looks like this:

prod_group Desc
0 B1 Belt-Large
1 B2 Belt-Medium
2 S1 Shirt-Large
3 S2 Shirt-Medium
4 H1 Hat-Large

If we want to merge the sales_groups data with our sales data, we could do something like this:

pd.merge(sales_df, sales_groups, on='prod_group', how='left', indicator=True)

Notice the _merge column shows that the data is only included if it is in both DataFrames.

account number name sku quantity unit price ext price date prod_group Desc _merge
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 B1 Belt-Large both
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 S2 Shirt-Medium both
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 B1 Belt-Large both
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 S1 Shirt-Large both
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 S2 Shirt-Medium both

Let’s see what happens when we do a right join:

pd.merge(sales_df, sales_groups, on='prod_group', how='right', indicator=True)
account number name sku quantity unit price ext price date prod_group Desc _merge
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51 B1 Belt-Large both
1 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58 B1 Belt-Large both
2 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47 S2 Shirt-Medium both
3 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55 S2 Shirt-Medium both
4 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22 S1 Shirt-Large both
5 NaN NaN NaN NaN NaN NaN NaN B2 Belt-Medium right_only
6 NaN NaN NaN NaN NaN NaN NaN H1 Hat-Large right_only

Given this small dataset, the indicator is not required but you could imagine that if you had thousands of rows, this could be really helpful to make sure you are getting the results you expect.

Sorting API

One warning you will probably see after upgrading to pandas 0.17 is something like this:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
sales_df.sort(columns=["name", "sku"])

To fix it:

sales_df.sort_values(by=["name", "sku"])

The old syntax will continue to work but it’s best to proactively make these changes now so that upgrades will be simpler.

The next couple items are not related to this latest release but include some quick notes on working with pandas DataFrames and Series.

Saving Series to Excel

Pandas has a couple of functions I use quite a bit that return a Series of data. One that I use is value_counts . What is sometimes frustrating is that there is no to_excel function for a series.

Here is an example of using value_counts (I included all my data but truncated the results for brevity):

sales_df["sku"].value_counts()
S2-77896    73
S1-82801    60
S2-10342    59
S1-47412    58
S1-93683    57
B1-38851    56
S2-82423    56
S1-50961    55
S1-30248    55
B1-53636    53
S1-06532    53
S1-27722    53
B1-20000    53
S2-34077    51
S2-83881    51
........

These results are useful and wouldn’t it be nice to dump it to Excel? Unfortunately when you try, you get a nasty AttributeError: 'Series' object has no attribute to_excel :

sales_df["sku"].value_counts().to_excel("values.xlsx")
Traceback (most recent call last):
  File "pandas_version.py", line 50, in <module>
    sales_df["sku"].value_counts().to_excel("values.xlsx")
  File "/home/chris/miniconda3/lib/python3.4/site-packages/pandas/core/generic.py", line 2246, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'to_excel'

A handy shortcut to fix this is to use to_frame to force it to a DataFrame which can be saved:

sales_df["sku"].value_counts().to_frame("frequency").to_excel("values.xlsx")

You can pass one argument to to_frame - the label to use for your data. It is optional but I find it handy to include.

Tabulate

Pandas has robust options to export DataFrames. However, I have had a few instances where I want to display a DataFrame in a more ASCII friendly way. I was pleasantly surprised to find the tabulate program understands pandas DataFrames.

Here is a sample usage:

print(tabulate(sales_df, tablefmt="fancy_grid"))
╒═══╤════════╤═════════════════════════════╤══════════╤════╤═══════╤═════════╤═════════════════════╤════╕
│ 0740150 │ Barton LLC                  │ B1-20000 │ 39 │ 86.69 │ 3380.91 │ 2014-01-01 07:21:51 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 1714466 │ Trantow-Barrows             │ S2-77896 │ -1 │ 63.16 │  -63.16 │ 2014-01-01 10:00:47 │ S2 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 2218895 │ Kulas Inc                   │ B1-69924 │ 23 │ 90.7  │ 2086.1  │ 2014-01-01 13:24:58 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 3307599 │ Kassulke, Ondricka and Metz │ S1-65481 │ 41 │ 21.05 │  863.05 │ 2014-01-01 15:05:22 │ S1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 4412290 │ Jerde-Hilpert               │ S2-34077 │  6 │ 83.21 │  499.26 │ 2014-01-01 23:26:55 │ S2 │
╘═══╧════════╧═════════════════════════════╧══════════╧════╧═══════╧═════════╧═════════════════════╧════╛

One extra step is required if you want to display headers:

headers = list(sales_df)
print(tabulate(sales_df, headers, tablefmt="fancy_grid"))
╒════╤══════════════════╤═════════════════════════════╤══════════╤════════════╤══════════════╤═════════════╤═════════════════════╤══════════════╕
│    │   account number │ name                        │ sku      │   quantity │   unit price │   ext price │ date                │ prod_group   │
╞════╪══════════════════╪═════════════════════════════╪══════════╪════════════╪══════════════╪═════════════╪═════════════════════╪══════════════╡
│  0740150 │ Barton LLC                  │ B1-20000 │         39 │        86.69 │     3380.91 │ 2014-01-01 07:21:51 │ B1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  1714466 │ Trantow-Barrows             │ S2-77896 │         -1 │        63.16 │      -63.16 │ 2014-01-01 10:00:47 │ S2           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  2218895 │ Kulas Inc                   │ B1-69924 │         23 │        90.7  │     2086.1  │ 2014-01-01 13:24:58 │ B1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  3307599 │ Kassulke, Ondricka and Metz │ S1-65481 │         41 │        21.05 │      863.05 │ 2014-01-01 15:05:22 │ S1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  4412290 │ Jerde-Hilpert               │ S2-34077 │          6 │        83.21 │      499.26 │ 2014-01-01 23:26:55 │ S2           │
╘════╧══════════════════╧═════════════════════════════╧══════════╧════════════╧══════════════╧═════════════╧═════════════════════╧══════════════╛

I realize the need for tabulate+pandas may be a little niche but I think it is useful to understand all the options that might be available to you.

Thanks for reading. Enjoy using your shiny new pandas 0.17 install!


 
       Vote on Hacker News          

Comments