Pandas Grouper and Agg Functions Explained
Posted by Chris Moffitt in articles

Introduction
Every once in a while it is useful to take a step back and look at pandas’ functions and see if there is a new or better way to do things. I was recently working on a problem and noticed that pandas had a Grouper function that I had never used before. I looked into how it can be used and it turns out it is useful for the type of summary analysis I tend to do on a frequent basis.
In addition to functions that have been around a while, pandas continues to provide new and improved capabilities with every release. The updated agg function is another very useful and intuitive tool for summarizing data.
This article will walk through how and why you may want to use the
Grouper
and
agg
functions on your own data. Along the way, I will include a few tips
and tricks on how to use them most effectively.
Grouping Time Series Data
Pandas’ origins are in the financial industry so it should not be a surprise that it has robust capabilities to manipulate and summarize time series data. Just look at the extensive time series documentation to get a feel for all the options. I encourage you to review it so that you’re aware of the concepts.
In order to illustrate this particular concept better, I will walk through an example of sales data and some simple operations to get total sales by month, day, year, etc.
For this example, I’ll use my trusty transaction data that I’ve used in other articles. You can follow along in the notebook as well.
import pandas as pd
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
df["date"] = pd.to_datetime(df['date'])
df.head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
Before I go much further, it’s useful to become familiar with Offset Aliases. These strings are used to represent various common time frequencies like days vs. weeks vs. years. I always forget what these are called and how to use the more esoteric ones so make sure to bookmark the link!
For example, if you were interested in summarizing all of the sales by month, you could use the
resample
function. The tricky part about using resample is that it only
operates on an index. In this data set, the data is not indexed by the date column
so resample would not work without restructuring the data. In order to make it work,
use
set_index
to make the date column an index and then resample:
df.set_index('date').resample('M')["ext price"].sum()
date 2014-01-31 185361.66 2014-02-28 146211.62 2014-03-31 203921.38 2014-04-30 174574.11 2014-05-31 165418.55 2014-06-30 174089.33 2014-07-31 191662.11 2014-08-31 153778.59 2014-09-30 168443.17 2014-10-31 171495.32 2014-11-30 119961.22 2014-12-31 163867.26 Freq: M, Name: ext price, dtype: float64
This is a fairly straightforward way to summarize the data but it gets a little more challenging if you would like to group the data as well. If we would like to see the monthly results for each customer, then you could do this (results truncated to 20 rows):
df.set_index('date').groupby('name')["ext price"].resample("M").sum()
name date Barton LLC 2014-01-31 6177.57 2014-02-28 12218.03 2014-03-31 3513.53 2014-04-30 11474.20 2014-05-31 10220.17 2014-06-30 10463.73 2014-07-31 6750.48 2014-08-31 17541.46 2014-09-30 14053.61 2014-10-31 9351.68 2014-11-30 4901.14 2014-12-31 2772.90 Cronin, Oberbrunner and Spencer 2014-01-31 1141.75 2014-02-28 13976.26 2014-03-31 11691.62 2014-04-30 3685.44 2014-05-31 6760.11 2014-06-30 5379.67 2014-07-31 6020.30 2014-08-31 5399.58 Name: ext price, dtype: float64
This certainly works but it feels a bit clunky. Fortunately
Grouper
makes
this a little more streamlined. Instead of having to play around with reindexing, we
can use our normal
groupby
syntax but provide a little more info on how
to group the data in the date column:
df.groupby(['name', pd.Grouper(key='date', freq='M')])['ext price'].sum()
name date Barton LLC 2014-01-31 6177.57 2014-02-28 12218.03 2014-03-31 3513.53 2014-04-30 11474.20 2014-05-31 10220.17 2014-06-30 10463.73 2014-07-31 6750.48 2014-08-31 17541.46 2014-09-30 14053.61 2014-10-31 9351.68 2014-11-30 4901.14 2014-12-31 2772.90 Cronin, Oberbrunner and Spencer 2014-01-31 1141.75 2014-02-28 13976.26 2014-03-31 11691.62 2014-04-30 3685.44 2014-05-31 6760.11 2014-06-30 5379.67 2014-07-31 6020.30 2014-08-31 5399.58 Name: ext price, dtype: float64
Since
groupby
is one of my standard functions, this approach seems simpler
to me and it is more likely to stick in my brain.
The nice benefit of this capability is that if you are interested in looking at
data summarized in a different time frame, just change the
freq
parameter
to one of the valid offset aliases. For instance, an annual summary using December
as the last month would look like this:
df.groupby(['name', pd.Grouper(key='date', freq='A-DEC')])['ext price'].sum()
name date Barton LLC 2014-12-31 109438.50 Cronin, Oberbrunner and Spencer 2014-12-31 89734.55 Frami, Hills and Schmidt 2014-12-31 103569.59 Fritsch, Russel and Anderson 2014-12-31 112214.71 Halvorson, Crona and Champlin 2014-12-31 70004.36 Herman LLC 2014-12-31 82865.00 Jerde-Hilpert 2014-12-31 112591.43 Kassulke, Ondricka and Metz 2014-12-31 86451.07 Keeling LLC 2014-12-31 100934.30 Kiehn-Spinka 2014-12-31 99608.77 Koepp Ltd 2014-12-31 103660.54 Kuhn-Gusikowski 2014-12-31 91094.28 Kulas Inc 2014-12-31 137351.96 Pollich LLC 2014-12-31 87347.18 Purdy-Kunde 2014-12-31 77898.21 Sanford and Sons 2014-12-31 98822.98 Stokes LLC 2014-12-31 91535.92 Trantow-Barrows 2014-12-31 123381.38 White-Trantow 2014-12-31 135841.99 Will LLC 2014-12-31 104437.60 Name: ext price, dtype: float64
If your annual sales were on a non-calendar basis, then the data can be easily
changed by modifying the
freq
parameter. I encourage you to play around
with different offsets to get a feel for how it works. When dealing with summarizing
time series data, this is incredibly handy. To put this in perspective, try doing
this in Excel. It is certainly possible (using pivot tables and
custom grouping) but I do not think it is nearly as intuitive as the pandas approach.
New and improved aggregate function
In pandas 0.20.1, there was a new
agg
function added that makes it a lot simpler
to summarize data in a manner similar to the
groupby
API.
To illustrate the functionality, let’s say we need to get the total of the
ext price
and
quantity
column as well as the average of the
unit price
. The process
is not very convenient:
df[["ext price", "quantity"]].sum()
ext price 2018784.32 quantity 36463.00 dtype: float64
df["unit price"].mean()
55.007526666666664
This works but it’s a bit messy. The new
agg
makes this simpler:
df[["ext price", "quantity", "unit price"]].agg(['sum', 'mean'])
ext price | quantity | unit price | |
---|---|---|---|
sum | 2.018784e+06 | 36463.000000 | 82511.290000 |
mean | 1.345856e+03 | 24.308667 | 55.007527 |
The results are good but including the sum of the unit price is not really that
useful. Fortunately we can pass a dictionary to
agg
and specify what
operations to apply to each column.
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean']})
quantity | ext price | unit price | |
---|---|---|---|
mean | 24.308667 | 1.345856e+03 | 55.007527 |
sum | 36463.000000 | 2.018784e+06 | NaN |
I find this approach really handy when I want to summarize several columns of data. In the past, I would run the individual calculations and build up the resulting dataframe a row at a time. It was tedious. This is a much better approach.
As an added bonus, you can define your own functions. For instance, I frequently
find myself needing to aggregate data and use a mode function that works on text.
I found a lambda function that uses
value_counts
to do what I need and
frequently use this
get_max
function:
get_max = lambda x: x.value_counts(dropna=False).index[0]
Then, if I want to include the most frequent sku in my summary table:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantity | sku | ext price | unit price | |
---|---|---|---|---|
<lambda> | NaN | S2-77896 | NaN | NaN |
mean | 24.308667 | NaN | 1.345856e+03 | 55.007527 |
sum | 36463.000000 | NaN | 2.018784e+06 | NaN |
This is pretty cool but there is one thing that has always bugged me about this approach. The fact that the column says “<lambda>” bothers me. Ideally I want it to say “most frequent.” In the past I’d jump through some hoops to rename it. But, when working on this article I stumbled on another approach - explicitly defining the name of the lambda function.
get_max.__name__ = "most frequent"
Now, when I do the aggregation:
df.agg({'ext price': ['sum', 'mean'], 'quantity': ['sum', 'mean'], 'unit price': ['mean'], 'sku': [get_max]})
quantity | sku | ext price | unit price | |
---|---|---|---|---|
most frequent | NaN | S2-77896 | NaN | NaN |
mean | 24.308667 | NaN | 1.345856e+03 | 55.007527 |
sum | 36463.000000 | NaN | 2.018784e+06 | NaN |
I get a much nicer label! It’s a small thing but I am definitely glad I finally figured that out.
As a final final bonus, here’s one other trick. The aggregate function using a
dictionary is useful but one challenge is that it does not preserve order. If
you want to make sure your columns are in a specific order, you can use an
OrderedDict
:
import collections
f = collections.OrderedDict([('ext price', ['sum', 'mean']), ('quantity', ['sum', 'mean']), ('sku', [get_max])])
df.agg(f)
ext price | quantity | sku | |
---|---|---|---|
mean | 1.345856e+03 | 24.308667 | NaN |
most frequent | NaN | NaN | S2-77896 |
sum | 2.018784e+06 | 36463.000000 | NaN |
Conclusion
The pandas library continues to grow and evolve over time. Sometimes it is useful to make sure there aren’t simpler approaches to some of the frequent approaches you may use to solve your problems. Pandas’ Grouper function and the updated agg function are really useful when aggregating and summarizing data. I hope this article will be useful to you in your data analysis. Are there any other pandas functions that you just learned about or might be useful to others? Feel free to give your input in the comments.
Comments