
Mon 26 September 2022

Pandas Groupby Warning

Posted by Chris Moffitt in articles   


Introduction

One of the reasons I like using pandas instead of Excel for data analysis is that it is easier to avoid certain types of copy-paste Excel errors. As great as pandas is, though, there is still plenty of opportunity to make mistakes with pandas code. This article discusses a subtle issue with pandas groupby code that can lead to big errors if you’re not careful. I’m writing about it because I’ve run into it in the past, yet it still bit me big time just recently. I hope this article can help a few of you avoid the same mistake.

The Problem

To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers and includes their region and an internal customer segment designation of Platinum, Gold or Silver. Here is the full data set:

    Customer ID  Customer Name                    Region  Segment    Sales
0   740150       Barton LLC                       US      Gold      215000
1   714466       Trantow-Barrows                  EMEA    Silver    430000
2   218895       Kulas Inc                        EMEA    Platinum  410000
3   307599       Kassulke, Ondricka and Metz      EMEA    NaN        91000
4   412290       Jerde-Hilpert                    EMEA    Gold      630000
5   729833       Koepp Ltd                        US      NaN       230000
6   737550       Fritsch, Russel and Anderson     US      Gold      630000
7   146832       Kiehn-Spinka                     US      Silver    615000
8   688981       Keeling LLC                      US      Platinum  515000
9   786968       Frami, Hills and Schmidt         US      Gold      215000
10  239344       Stokes LLC                       US      Silver    230000
11  672390       Kuhn-Gusikowski                  APAC    Platinum  630000
12  141962       Herman LLC                       APAC    Gold      215000
13  424914       White-Trantow                    US      NaN       230000
14  527099       Sanford and Sons                 US      Platinum  615000
15  642753       Pollich LLC                      US      Gold      419000
16  383080       Will LLC                         US      Silver    415000
17  257198       Cronin, Oberbrunner and Spencer  US      Platinum  425000
18  604255       Halvorson, Crona and Champlin    US      NaN       430000
19  163416       Purdy-Kunde                      APAC    Silver    410000

The data looks pretty simple. Sales is the only meaningful numeric column, so let’s see what it totals to.

import pandas as pd

# Load the sample sales data from the pbpython repo
df = pd.read_excel('https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx')
df["Sales"].sum()
8000000

We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.

Let’s do some simple analysis to summarize sales by region:

df.groupby(['Region']).agg({'Sales': 'sum'})
          Sales
Region
APAC    1255000
EMEA    1561000
US      5184000

We can double-check the math:

df.groupby(['Region']).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

Looks good. That’s what we expect. Now let’s see what sales look like by Region and Segment:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Which yields this table:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000

This looks good. No errors, and the table seems reasonable. We should just continue our analysis, right?

Nope. There’s a potentially subtle issue here. Let’s sum the data to double-check:

df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'}).sum()
Sales    7019000
dtype: int64

This time the total is only $7,019,000. Where did the other $981,000 go? Is pandas broken?
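
One quick way to track down the difference is to sum Sales for just the rows where the Segment is missing:

# Sales for rows with a missing Segment - these are the rows
# the default groupby silently drops
df[df["Segment"].isna()]["Sales"].sum()
981000

Those four customers with no Segment account for exactly the missing $981,000.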

The issue becomes clear if we use the dropna=False parameter to explicitly include NaN values in the results:

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'})

Now we can see the NaN Segment groups within the EMEA and US regions:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
       NaN         91000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000
       NaN        890000

If we check the sum, we can see it totals to $8M.

df.groupby(['Region', 'Segment'], dropna=False).agg({'Sales': 'sum'}).sum()
Sales    8000000
dtype: int64

The pandas documentation is very clear on this:

dropna: bool, default True
If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

The takeaway is that if your groupby columns contain any NA values, you need to make a conscious decision about whether or not to include those rows in the grouped results.

If you are ok dropping those values, use the default dropna=True.

However, if you want to ensure that all values (Sales in this particular case) are included, make sure to use dropna=False in your groupby.
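
Either way, a cheap guard against this class of error is to compare the grouped total to the original total, so dropped rows fail loudly instead of silently. A minimal sketch:

# Sanity check: the grouped total should match the ungrouped total.
# With the default dropna=True, this assertion fails on our data,
# which is exactly the warning we want.
grouped = df.groupby(['Region', 'Segment']).agg({'Sales': 'sum'})
assert grouped['Sales'].sum() == df['Sales'].sum(), 'Sales lost during groupby'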

An ounce of prevention

The main way to deal with this potential issue is to know whether you have any NaN values in your data in the first place. There are a couple of ways to check.

You can use pure pandas:

df.isnull().sum()
Customer ID      0
Customer Name    0
Region           0
Segment          4
Sales            0
dtype: int64
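
The same dropna default shows up in value_counts, so passing dropna=False there is another quick way to surface the gap. Something like this (exact ordering and labels may vary by pandas version):

# Count customers per Segment, keeping the NaN bucket visible
df['Segment'].value_counts(dropna=False)
Gold        6
Silver      5
Platinum    5
NaN         4
Name: Segment, dtype: int64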

There are other tools like missingno that provide a more robust interface for exploring the data.
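
For example, a one-line missingno chart makes the gaps in the Segment column jump out (a minimal sketch, assuming matplotlib is available):

import missingno as msno

# Bar chart of non-null counts per column - the Segment bar comes up short
msno.bar(df)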

I’m partial to sidetable. Here’s how to use it once it’s installed and imported:

import sidetable

df.stb.missing()
               missing  total  percent
Segment              4     20     20.0
Customer ID          0     20      0.0
Customer Name        0     20      0.0
Region               0     20      0.0
Sales                0     20      0.0

Regardless of the approach you use, it’s worth keeping in mind that you need to know whether you have any null or NaN values in your data and how you want to handle them in your analysis.

The other alternative to using dropna is to explicitly fill in the missing values using fillna:

df.fillna('unknown').groupby(['Region', 'Segment']).agg({'Sales': 'sum'})

Now the unknown values are explicitly called out:

                   Sales
Region Segment
APAC   Gold       215000
       Platinum   630000
       Silver     410000
EMEA   Gold       630000
       Platinum   410000
       Silver     430000
       unknown     91000
US     Gold      1479000
       Platinum  1555000
       Silver    1260000
       unknown    890000
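
One caveat with the blanket df.fillna('unknown'): it fills NaN values in every column, not just Segment. If other columns could legitimately contain NaN, passing a dictionary to fillna limits the fill to the grouping column, a small variation on the above:

# Fill only the Segment column, leaving any other NaN values alone
df.fillna({'Segment': 'unknown'}).groupby(['Region', 'Segment']).agg({'Sales': 'sum'})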

Conclusion

When working with pandas groupby, the results can be surprising if you have NaN values in the columns you group by. The default behavior is to drop those groups, which means you can effectively “lose” some of your data along the way.

I have been bitten by this behavior several times in the past. In some cases, it might not be a big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.

Have you seen this before? Let me know in the comments below.
