Mon 26 October 2015

Pandas 0.17 Release and Other Notes

Introduction

As many of you know, pandas released version 0.17.0 on October 9th. In typical pandas fashion there are a bunch of updates, bug fixes and new features which I encourage you to read all about here. I do not plan to go through all of the changes but there are a couple of key things that I think will be useful to me in my daily work that I will explore briefly in this article. In addition, I am including a couple of other tips and tricks for pandas that I use on a frequent basis and hope will be useful to you.

Excel MultiIndex

Strangely, one of the most exciting things about 0.17 is that MultiIndex supports to_excel again. I say again because it used to work just fine for me (prior to 0.16.2). However in that release (and up to 0.17) the method would raise a NotImplementedError . With 0.17, all works as expected. Yay!

In order to illustrate in a little more detail what this supports, here is a short example:

from __future__ import print_function
import pandas as pd

sales_df = pd.read_excel('https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true')
sales_df["prod_group"] = sales_df["sku"].str[0:2]
print(sales_df.head())

Here is our Data:

	account number	name	sku	quantity	unit price	ext price	date	prod_group
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51	B1
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47	S2
2	218895	Kulas Inc	B1-69924	23	90.7	2086.1	2014-01-01 13:24:58	B1
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22	S1
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55	S2

We are using our simple sales data again to show purchase history over several months. The prod_group column is just a way to break down the products into different categories based on their SKU.

Now, let’s create a pivot table of the data. The key here is that the pivot table creates a MultiIndex , which will cause problems when we try to save to excel.

sales_summary = pd.pivot_table(sales_df, index=["name"],
                               columns=["prod_group"],
                               values=["ext price", "quantity"])

	ext price			quantity
prod_group	B1	S1	S2	B1	S1	S2
name
Barton LLC	1171.640278	1306.237600	1647.786190	20.777778	28.040000	28.190476
Cronin, Oberbrunner and Spencer	1387.616842	1342.598571	1302.787407	26.210526	25.428571	23.740741
Frami, Hills and Schmidt	1475.132143	1439.311875	1401.317857	25.071429	31.062500	25.142857
Fritsch, Russel and Anderson	1372.360286	1188.012857	1569.353200	23.714286	26.047619	29.400000
Halvorson, Crona and Champlin	1356.640000	1267.756667	983.963158	25.857143	22.333333	17.842105

All looks good until you try to save this using to_excel

sales_summary.to_excel('sampleout.xlsx')

In prior versions of pandas you’ll see something like this:

Traceback (most recent call last):
  File "pandas_version.py", line 25, in <module>
    sales_summary.to_excel('sampleout.xlsx')
  File "/home/chris/miniconda3/envs/pbpython2/lib/python2.7/site-packages/pandas/core/frame.py", line 1252, in to_excel
    raise NotImplementedError("Writing as Excel with a MultiIndex is "
NotImplementedError: Writing as Excel with a MultiIndex is not yet implemented.

Thankfully, pandas 0.17 works again and you get a clean Excel file that you would expect. The other reason I wanted to specifically call this out is that some of the code in my older blog posts stopped working as a result of this change. Now, I can point people here to get the solution - upgrade to pandas 0.17!

Rounding Data

One simple (but useful) new feature is the addition of a round method to a DataFrame. Prior to this release, if you wanted to round data in a column you had to use np.round or change the display using display.float_format .

In this example, let’s just look at the first 5 rows of data and round two columns:

sales_df = sales_df[0:5]
sales_df.round({'ext price':0, 'unit price': 0})

	account number	name	sku	quantity	unit price	ext price	date	prod_group
0	740150	Barton LLC	B1-20000	39	87	3381	2014-01-01 07:21:51	B1
1	714466	Trantow-Barrows	S2-77896	-1	63	-63	2014-01-01 10:00:47	S2
2	218895	Kulas Inc	B1-69924	23	91	2086	2014-01-01 13:24:58	B1
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21	863	2014-01-01 15:05:22	S1
4	412290	Jerde-Hilpert	S2-34077	6	83	499	2014-01-01 23:26:55	S2

I can control the number of decimal places to round as well as specify which columns to round by including in the dictionary of column names. This is pretty useful.

Troubleshooting Merges

In my experience manipulating and combining data, pandas tends to just do what I would expect. One of the areas where I can find myself getting tripped up is when I do a merge of DataFrames. Sometimes I need to play around with whether or not I really want a left, right, outer or inner join. To help troubleshoot these sometimes pesky problems, there is a new indicator argument that can help you figure out why your merge may not be doing what you expect.

For this example, let’s add in a description to our data. Here is the sales_groups DataFrame that looks like this:

	prod_group	Desc
0	B1	Belt-Large
1	B2	Belt-Medium
2	S1	Shirt-Large
3	S2	Shirt-Medium
4	H1	Hat-Large

If we want to merge the sales_groups data with our sales data, we could do something like this:

pd.merge(sales_df, sales_groups, on='prod_group', how='left', indicator=True)

Notice the _merge column shows that the data is only included if it is in both DataFrames.

	account number	name	sku	quantity	unit price	ext price	date	prod_group	Desc	_merge
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51	B1	Belt-Large	both
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47	S2	Shirt-Medium	both
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58	B1	Belt-Large	both
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22	S1	Shirt-Large	both
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55	S2	Shirt-Medium	both

Let’s see what happens when we do a right join:

pd.merge(sales_df, sales_groups, on='prod_group', how='right', indicator=True)

	account number	name	sku	quantity	unit price	ext price	date	prod_group	Desc	_merge
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51	B1	Belt-Large	both
1	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58	B1	Belt-Large	both
2	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47	S2	Shirt-Medium	both
3	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55	S2	Shirt-Medium	both
4	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22	S1	Shirt-Large	both
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	B2	Belt-Medium	right_only
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	H1	Hat-Large	right_only

Given this small dataset, the indicator is not required but you could imagine that if you had thousands of rows, this could be really helpful to make sure you are getting the results you expect.

Sorting API

One warning you will probably see after upgrading to pandas 0.17 is something like this:

FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
sales_df.sort(columns=["name", "sku"])

To fix it:

sales_df.sort_values(by=["name", "sku"])

The old syntax will continue to work but it’s best to proactively make these changes now so that upgrades will be simpler.

The next couple items are not related to this latest release but include some quick notes on working with pandas DataFrames and Series.

Saving Series to Excel

Pandas has a couple of functions I use quite a bit that return a Series of data. One that I use is value_counts . What is sometimes frustrating is that there is no to_excel function for a series.

Here is an example of using value_counts (I included all my data but truncated the results for brevity):

sales_df["sku"].value_counts()

S2-77896    73
S1-82801    60
S2-10342    59
S1-47412    58
S1-93683    57
B1-38851    56
S2-82423    56
S1-50961    55
S1-30248    55
B1-53636    53
S1-06532    53
S1-27722    53
B1-20000    53
S2-34077    51
S2-83881    51
........

These results are useful and wouldn’t it be nice to dump it to Excel? Unfortunately when you try, you get a nasty AttributeError: 'Series' object has no attribute to_excel :

sales_df["sku"].value_counts().to_excel("values.xlsx")

Traceback (most recent call last):
  File "pandas_version.py", line 50, in <module>
    sales_df["sku"].value_counts().to_excel("values.xlsx")
  File "/home/chris/miniconda3/lib/python3.4/site-packages/pandas/core/generic.py", line 2246, in __getattr__
    (type(self).__name__, name))
AttributeError: 'Series' object has no attribute 'to_excel'

A handy shortcut to fix this is to use to_frame to force it to a DataFrame which can be saved:

sales_df["sku"].value_counts().to_frame("frequency").to_excel("values.xlsx")

You can pass one argument to to_frame - the label to use for your data. It is optional but I find it handy to include.

Tabulate

Pandas has robust options to export DataFrames. However, I have had a few instances where I want to display a DataFrame in a more ASCII friendly way. I was pleasantly surprised to find the tabulate program understands pandas DataFrames.

Here is a sample usage:

print(tabulate(sales_df, tablefmt="fancy_grid"))

╒═══╤════════╤═════════════════════════════╤══════════╤════╤═══════╤═════════╤═════════════════════╤════╕
│ 0 │ 740150 │ Barton LLC                  │ B1-20000 │ 39 │ 86.69 │ 3380.91 │ 2014-01-01 07:21:51 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 1 │ 714466 │ Trantow-Barrows             │ S2-77896 │ -1 │ 63.16 │  -63.16 │ 2014-01-01 10:00:47 │ S2 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 2 │ 218895 │ Kulas Inc                   │ B1-69924 │ 23 │ 90.7  │ 2086.1  │ 2014-01-01 13:24:58 │ B1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 3 │ 307599 │ Kassulke, Ondricka and Metz │ S1-65481 │ 41 │ 21.05 │  863.05 │ 2014-01-01 15:05:22 │ S1 │
├───┼────────┼─────────────────────────────┼──────────┼────┼───────┼─────────┼─────────────────────┼────┤
│ 4 │ 412290 │ Jerde-Hilpert               │ S2-34077 │  6 │ 83.21 │  499.26 │ 2014-01-01 23:26:55 │ S2 │
╘═══╧════════╧═════════════════════════════╧══════════╧════╧═══════╧═════════╧═════════════════════╧════╛

One extra step is required if you want to display headers:

headers = list(sales_df)
print(tabulate(sales_df, headers, tablefmt="fancy_grid"))

╒════╤══════════════════╤═════════════════════════════╤══════════╤════════════╤══════════════╤═════════════╤═════════════════════╤══════════════╕
│    │   account number │ name                        │ sku      │   quantity │   unit price │   ext price │ date                │ prod_group   │
╞════╪══════════════════╪═════════════════════════════╪══════════╪════════════╪══════════════╪═════════════╪═════════════════════╪══════════════╡
│  0 │           740150 │ Barton LLC                  │ B1-20000 │         39 │        86.69 │     3380.91 │ 2014-01-01 07:21:51 │ B1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  1 │           714466 │ Trantow-Barrows             │ S2-77896 │         -1 │        63.16 │      -63.16 │ 2014-01-01 10:00:47 │ S2           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  2 │           218895 │ Kulas Inc                   │ B1-69924 │         23 │        90.7  │     2086.1  │ 2014-01-01 13:24:58 │ B1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  3 │           307599 │ Kassulke, Ondricka and Metz │ S1-65481 │         41 │        21.05 │      863.05 │ 2014-01-01 15:05:22 │ S1           │
├────┼──────────────────┼─────────────────────────────┼──────────┼────────────┼──────────────┼─────────────┼─────────────────────┼──────────────┤
│  4 │           412290 │ Jerde-Hilpert               │ S2-34077 │          6 │        83.21 │      499.26 │ 2014-01-01 23:26:55 │ S2           │
╘════╧══════════════════╧═════════════════════════════╧══════════╧════════════╧══════════════╧═════════════╧═════════════════════╧══════════════╛

I realize the need for tabulate+pandas may be a little niche but I think it is useful to understand all the options that might be available to you.

Thanks for reading. Enjoy using your shiny new pandas 0.17 install!

Practical Business Python

Pandas 0.17 Release and Other Notes

Introduction

Excel MultiIndex

Rounding Data

Troubleshooting Merges

Sorting API

Saving Series to Excel

Tabulate

Comments

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure