Mon 31 July 2017

Pandas Grouper and Agg Functions Explained

Posted by Chris Moffitt in articles

Introduction

Every once in a while it is useful to take a step back, look at pandas’ functions and see if there is a new or better way to do things. I was recently working on a problem and noticed that pandas had a Grouper function that I had never used before. I looked into how it can be used, and it turns out it is useful for the type of summary analysis I do on a frequent basis.

In addition to functions that have been around a while, pandas continues to provide new and improved capabilities with every release. The updated agg function is another very useful and intuitive tool for summarizing data.

This article will walk through how and why you may want to use the Grouper and agg functions on your own data. Along the way, I will include a few tips and tricks on how to use them most effectively.

Grouping Time Series Data

Pandas’ origins are in the financial industry so it should not be a surprise that it has robust capabilities to manipulate and summarize time series data. Just look at the extensive time series documentation to get a feel for all the options. I encourage you to review it so that you’re aware of the concepts.

To illustrate this concept, I will walk through an example of sales data and some simple operations to get total sales by month, day, year, etc.

For this example, I’ll use my trusty transaction data that I’ve used in other articles. You can follow along in the notebook as well.

import pandas as pd

# Read the sample sales transactions directly from GitHub
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
# Make sure the date column is a true datetime instead of a string
df["date"] = pd.to_datetime(df["date"])
df.head()
   account number                         name       sku  quantity  unit price  ext price                date
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91 2014-01-01 07:21:51
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16 2014-01-01 10:00:47
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10 2014-01-01 13:24:58
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05 2014-01-01 15:05:22
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26 2014-01-01 23:26:55

Before I go much further, it’s useful to become familiar with Offset Aliases. These strings are used to represent various common time frequencies like days vs. weeks vs. years. I always forget what these are called and how to use the more esoteric ones so make sure to bookmark the link!
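
For reference, here are a few of the aliases I reach for most often. This is only a sampling of the full table in the documentation, written up as a quick cheat sheet:

# A few common offset aliases (see the pandas docs for the complete list):
#   'D'  calendar day
#   'W'  weekly
#   'M'  month end
#   'Q'  quarter end
#   'A'  year end ('A-DEC' explicitly ends the year in December)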

For example, if you were interested in summarizing all of the sales by month, you could use the resample function. The tricky part about using resample is that it only operates on an index. In this data set, the data is not indexed by the date column, so resample would not work without restructuring the data. To make it work, use set_index to make the date column the index and then resample:

df.set_index('date').resample('M')["ext price"].sum()
date
2014-01-31    185361.66
2014-02-28    146211.62
2014-03-31    203921.38
2014-04-30    174574.11
2014-05-31    165418.55
2014-06-30    174089.33
2014-07-31    191662.11
2014-08-31    153778.59
2014-09-30    168443.17
2014-10-31    171495.32
2014-11-30    119961.22
2014-12-31    163867.26
Freq: M, Name: ext price, dtype: float64
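
To see how the offset aliases come into play, here is a quick sketch that swaps 'M' for 'Q' to roll the same data up by quarter instead of month (output omitted):

# Quarterly totals instead of monthly - only the offset alias changes
df.set_index('date').resample('Q')["ext price"].sum()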

This is a fairly straightforward way to summarize the data, but it gets a little more challenging if you would like to group the data as well. If you want to see the monthly results for each customer, you could do this (results truncated to 20 rows):

df.set_index('date').groupby('name')["ext price"].resample("M").sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

This certainly works but it feels a bit clunky. Fortunately, Grouper makes this a little more streamlined. Instead of having to play around with reindexing, we can use our normal groupby syntax but provide a little more info on how to group the data in the date column:

df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
name                             date
Barton LLC                       2014-01-31     6177.57
                                 2014-02-28    12218.03
                                 2014-03-31     3513.53
                                 2014-04-30    11474.20
                                 2014-05-31    10220.17
                                 2014-06-30    10463.73
                                 2014-07-31     6750.48
                                 2014-08-31    17541.46
                                 2014-09-30    14053.61
                                 2014-10-31     9351.68
                                 2014-11-30     4901.14
                                 2014-12-31     2772.90
Cronin, Oberbrunner and Spencer  2014-01-31     1141.75
                                 2014-02-28    13976.26
                                 2014-03-31    11691.62
                                 2014-04-30     3685.44
                                 2014-05-31     6760.11
                                 2014-06-30     5379.67
                                 2014-07-31     6020.30
                                 2014-08-31     5399.58
Name: ext price, dtype: float64

Since groupby is one of my standard functions, this approach seems simpler to me and it is more likely to stick in my brain.

The nice benefit of this capability is that if you are interested in looking at data summarized in a different time frame, just change the freq parameter to one of the valid offset aliases. For instance, an annual summary using December as the last month would look like this:

df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name                             date
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

If your annual sales were on a non-calendar basis, then the summary can easily be changed by modifying the freq parameter (see the sketch below). I encourage you to play around with different offsets to get a feel for how it works. When summarizing time series data, this is incredibly handy. To put this in perspective, try doing this in Excel. It is certainly possible (using pivot tables and custom grouping) but I do not think it is nearly as intuitive as the pandas approach.
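
For instance, here is a sketch of what a fiscal year ending in June would look like; 'A-JUN' is a standard offset alias, and the totals would roll up to periods ending June 30 rather than the calendar-year totals shown above (output omitted):

# Fiscal years ending in June instead of calendar years
df.groupby(['name', pd.Grouper(key='date', freq='A-JUN')])['ext price'].sum()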

New and Improved Aggregate Function

Pandas 0.20.1 added a new agg function that makes it much simpler to summarize data in a manner similar to the groupby API.

To illustrate the functionality, let’s say we need the total of the ext price and quantity columns as well as the average of the unit price. Without agg, the process is not very convenient:

df[["ext price", "quantity"]].sum()
ext price    2018784.32
quantity       36463.00
dtype: float64
df["unit price"].mean()
55.007526666666664

This works but it’s a bit messy. The new agg makes this simpler:

df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
         ext price      quantity    unit price
sum   2.018784e+06  36463.000000  82511.290000
mean  1.345856e+03     24.308667     55.007527

The results are good, but including the sum of the unit price is not really that useful. Fortunately, we can pass a dictionary to agg and specify which operations to apply to each column:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
          quantity     ext price  unit price
mean     24.308667  1.345856e+03   55.007527
sum   36463.000000  2.018784e+06         NaN

I find this approach really handy when I want to summarize several columns of data. In the past, I would run the individual calculations and build up the resulting dataframe a row at a time. It was tedious. This is a much better approach.

As an added bonus, you can define your own functions. For instance, I frequently find myself needing to aggregate data with a mode function that works on text. I found a lambda function that uses value_counts to do what I need, and I frequently use this get_max function:

# Return the most frequent value in the column (counting NaN as a candidate)
get_max = lambda x: x.value_counts(dropna=False).index[0]

Then, if I want to include the most frequent sku in my summary table:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
              quantity       sku     ext price  unit price
<lambda>           NaN  S2-77896           NaN         NaN
mean         24.308667       NaN  1.345856e+03   55.007527
sum       36463.000000       NaN  2.018784e+06         NaN

This is pretty cool but there is one thing that has always bugged me about this approach. The fact that the row label says “<lambda>” bothers me. Ideally I want it to say “most frequent.” In the past, I would jump through some hoops to rename it. But while working on this article, I stumbled on another approach: explicitly defining the name of the lambda function.

get_max.__name__ = "most frequent"

Now, when I do the aggregation:

df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
                   quantity       sku     ext price  unit price
most frequent           NaN  S2-77896           NaN         NaN
mean              24.308667       NaN  1.345856e+03   55.007527
sum            36463.000000       NaN  2.018784e+06         NaN

I get a much nicer label! It’s a small thing but I am definitely glad I finally figured that out.
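
If you would rather skip the __name__ assignment altogether, an equivalent approach is a regular named function, since agg picks up a function’s name automatically. A minimal sketch (note the label will show the underscore, so the rename trick above still gives the prettiest output):

# A def gets a readable __name__ ("most_frequent") for free
def most_frequent(x):
    return x.value_counts(dropna=False).index[0]

df.agg({'sku': [most_frequent]})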

As one final bonus, here’s one other trick. The aggregate function using a dictionary is useful, but one challenge is that it does not preserve order. If you want to make sure your columns are in a specific order, you can use an OrderedDict:

import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
                  ext price      quantity       sku
mean           1.345856e+03     24.308667       NaN
most frequent           NaN           NaN  S2-77896
sum            2.018784e+06  36463.000000       NaN
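
One caveat on this trick: it matters most on older Pythons. On Python 3.7 and later, regular dictionaries preserve insertion order, so a plain dict should behave the same way on modern versions; this is an assumption about newer releases, not something the 0.20-era docs promise:

# On Python 3.7+ a plain dict keeps insertion order, so this should match
f = {'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'sku': [get_max]}
df.agg(f)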

Conclusion

The pandas library continues to grow and evolve over time. Sometimes it is worth stepping back to make sure there isn’t a simpler approach to the problems you solve most frequently. Pandas’ Grouper function and the updated agg function are really useful when aggregating and summarizing data. I hope this article will be useful to you in your data analysis. Are there any other pandas functions that you just learned about or that might be useful to others? Feel free to give your input in the comments.
