
Mon 26 September 2022

Pandas Groupby Warning

Posted by Chris Moffitt in articles   


Introduction

One of the reasons I like using pandas instead of Excel for data analysis is that it is easier to avoid certain types of copy-paste Excel errors. As great as pandas is, though, there is still plenty of opportunity to make mistakes with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you’re not careful. I’m writing about it because I’ve run into it in the past, yet it still bit me big time just recently. I hope this article can help a few of you avoid the same mistake.

The Problem

To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:

    Customer ID  Customer Name                    Region  Segment    Sales
0   740150       Barton LLC                       US      Gold      215000
1   714466       Trantow-Barrows                  EMEA    Silver    430000
2   218895       Kulas Inc                        EMEA    Platinum  410000
3   307599       Kassulke, Ondricka and Metz      EMEA    NaN        91000
4   412290       Jerde-Hilpert                    EMEA    Gold      630000
5   729833       Koepp Ltd                        US      NaN       230000
6   737550       Fritsch, Russel and Anderson     US      Gold      630000
7   146832       Kiehn-Spinka                     US      Silver    615000
8   688981       Keeling LLC                      US      Platinum  515000
9   786968       Frami, Hills and Schmidt         US      Gold      215000
10  239344       Stokes LLC                       US      Silver    230000
11  672390       Kuhn-Gusikowski                  APAC    Platinum  630000
12  141962       Herman LLC                       APAC    Gold      215000
13  424914       White-Trantow                    US      NaN       230000
14  527099       Sanford and Sons                 US      Platinum  615000
15  642753       Pollich LLC                      US      Gold      419000
16  383080       Will LLC                         US      Silver    415000
17  257198       Cronin, Oberbrunner and Spencer  US      Platinum  425000
18  604255       Halvorson, Crona and Champlin    US      NaN       430000
19  163416       Purdy-Kunde                      APAC    Silver    410000

The data looks pretty simple. Sales is the only meaningful numeric column, so let’s see what it totals to.

import pandas as pd

# Load the sample sales data from the pbpython repo
df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
8000000

We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.

Let’s do some simple analysis to summarize sales by region:

df.groupby(['Region']).agg({'Sales': 'sum'})
          Sales
Region
APAC    1255000
EMEA    1561000
US      5184000

We can double-check the math:

df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

Looks good. That’s what we expect. Now let’s see what sales look like by Region and Segment:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Which yields this table:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000

This looks good. No errors, and the table seems reasonable. We should just continue our analysis, right?

Nope. There’s a potentially subtle issue here. Let’s sum the data to double-check:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales    7019000
dtype: int64

This time the total is only $7,019,000. Where did the other $981,000 go? Is pandas broken?
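
One quick way to track down the difference is to sum Sales for just the rows where the Segment is missing:

# Sales for rows with a missing Segment - these are the rows
# the default groupby silently drops
df[df["Segment"].isna()]["Sales"].sum()
981000

Those four customers with no Segment account for exactly the missing $981,000.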

The issue becomes clear if we use the dropna=False parameter to explicitly include NaN values in the results:

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})

Now we can see the NaN Segment groups within the EMEA and US regions:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
       NaN         91000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000
       NaN        890000

If we check the sum, we can see it totals to $8M.

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

The pandas documentation is very clear on this:

dropna: bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

The takeaway is that if your groupby columns contain any NA values, you need to make a conscious decision about whether or not to include those rows in the grouped results.

If you are ok dropping those values, use the default dropna=True.

However, if you want to ensure that all values (Sales in this particular case) are included, make sure to use dropna=False in your groupby.
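
Either way, a cheap guard against this class of error is to compare the grouped total to the original total, so dropped rows fail loudly instead of silently. A minimal sketch:

# Sanity check: the grouped total should match the ungrouped total.
# With the default dropna=True, this assertion fails on our data,
# which is exactly the warning we want.
grouped = df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
assert grouped['Sales'].sum() == df['Sales'].sum(), 'Sales lost during groupby'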

An ounce of prevention

The main way to deal with this potential issue is to know whether you have any NaN values in your data in the first place. There are a couple of ways to check.

You can use pure pandas:

df.isnull().sum()
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64
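
The same dropna default shows up in value_counts, so passing dropna=False there is another quick way to surface the gap. Something like this (exact ordering and labels may vary by pandas version):

# Count customers per Segment, keeping the NaN bucket visible
df['Segment'].value_counts(dropna=False)
Gold        6
Silver      5
Platinum    5
NaN         4
Name: Segment, dtype: int64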

There are other tools like missingno that provide a more robust interface for exploring the data.
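
For example, a one-line missingno chart makes the gaps in the Segment column jump out (a minimal sketch, assuming matplotlib is available):

import missingno as msno

# Bar chart of non-null counts per column - the Segment bar comes up short
msno.bar(df)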

I’m partial to sidetable. Here’s how to use it once it’s installed and imported:

import sidetable

df.stb.missing()
               missing  total  percent
Segment              4     20     20.0
Customer ID          0     20      0.0
Customer Name        0     20      0.0
Region               0     20      0.0
Sales                0     20      0.0

Regardless of the approach you use, it’s worth keeping in mind that you need to know whether you have any null or NaN values in your data and how you want to handle them in your analysis.

The other alternative to using dropna is to explicitly fill in the missing values using fillna:

df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Now the unknown values are explicitly called out:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
       unknown     91000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000
       unknown    890000
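
One caveat with the blanket df.fillna('unknown'): it fills NaN values in every column, not just Segment. If other columns could legitimately contain NaN, passing a dictionary to fillna limits the fill to the grouping column, a small variation on the above:

# Fill only the Segment column, leaving any other NaN values alone
df.fillna({'Segment': 'unknown'}).groupby(['Region', 'Segment']).agg({'Sales': 'sum'})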

Conclusion

When working with pandas groupby, the results can be surprising if you have NaN values in the columns you group by. The default behavior is to drop those groups, which means you can effectively “lose” some of your data along the way.

I have been bitten by this behavior several times in the past. In some cases, it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.

Have you seen this before? Let me know in the comments below.
