Tue 04 April 2017

Understanding the Transform Function in Pandas

Introduction

One of the compelling features of pandas is that it has a rich library of methods for manipulating data. However, there are times when it is not clear what the various functions do and how to use them. If you are approaching a problem from an Excel mindset, it can be difficult to translate the planned solution into the unfamiliar pandas command. One of those “unknown” functions is the transform method. Even after using pandas for a while, I have never had the chance to use this function so I recently took some time to figure out what it is and how it could be helpful for real world analysis. This article will walk through an example where transform can be used to efficiently summarize data.

What is transform?

I have found the best coverage of this topic in Jake VanderPlas’ excellent Python Data Science Handbook. I plan to write a review on this book in the future but the short and sweet is that it is a great resource that I highly recommend.

As described in the book, transform is an operation used in conjunction with groupby (which is one of the most useful operations in pandas). I suspect most pandas users likely have used aggregate , filter or apply with groupby to summarize data. However, transform is a little more difficult to understand - especially coming from an Excel world. Since Jake made all of his book available via jupyter notebooks it is a good place to start to understand how transform is unique:

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean.

With that basic definition, I will go through another example that can explain how this is useful in other instances outside of centering data.

Problem Set

For this example, we will analyze some fictitious sales data. In order to keep the dataset small, here is a sample of 12 sales transactions for our company:

	account	name	order	sku	quantity	unit price	ext price
0	383080	Will LLC	10001	B1-20000	7	33.69	235.83
1	383080	Will LLC	10001	S1-27722	11	21.12	232.32
2	383080	Will LLC	10001	B1-86481	3	35.99	107.97
3	412290	Jerde-Hilpert	10005	S1-06532	48	55.82	2679.36
4	412290	Jerde-Hilpert	10005	S1-82801	21	13.62	286.02
5	412290	Jerde-Hilpert	10005	S1-06532	9	92.55	832.95
6	412290	Jerde-Hilpert	10005	S1-47412	44	78.91	3472.04
7	412290	Jerde-Hilpert	10005	S1-27722	36	25.42	915.12
8	218895	Kulas Inc	10006	S1-27722	32	95.66	3061.12
9	218895	Kulas Inc	10006	B1-33087	23	22.55	518.65
10	218895	Kulas Inc	10006	B1-33364	3	72.30	216.90
11	218895	Kulas Inc	10006	B1-20000	-1	72.18	-72.18

You can see in the data that the file contains 3 different orders (10001, 10005 and 10006) and that each order consists has multiple products (aka skus).

The question we would like to answer is: “What percentage of the order total does each sku represent?”

For example, if we look at order 10001 with a total of $576.12, the break down would be:

B1-20000 = $235.83 or 40.9%
S1-27722 = $232.32 or 40.3%
B1-86481 = $107.97 or 18.7%

The tricky part in this calculation is that we need to get a total for each order and combine it back with the transaction level detail in order to get the percentages. In Excel, you could try to use some version of a subtotal to try to calculate the values.

First Approach - Merging

If you are familiar with pandas, your first inclination is going to be trying to group the data into a new dataframe and combine it in a multi-step process. Here’s what that approach would look like.

Import all the modules we need and read in our data:

import pandas as pd

df = pd.read_excel("sales_transactions.xlsx")

Now that the data is in a dataframe, determining the total by order is simple with the help of the standard groupby aggregation.

df.groupby('order')["ext price"].sum()

order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64

Here is a simple image showing what is happening with the standard groupby

The tricky part is figuring out how to combine this data back with the original dataframe. The first instinct is to create a new dataframe with the totals by order and merge it back with the original. We could do something like this:

order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]

	account	name	order	sku	quantity	unit price	ext price	order total	Order_Total	Percent_of_Order
0	383080	Will LLC	10001	B1-20000	7	33.69	235.83	576.12	576.12	0.409342
1	383080	Will LLC	10001	S1-27722	11	21.12	232.32	576.12	576.12	0.403249
2	383080	Will LLC	10001	B1-86481	3	35.99	107.97	576.12	576.12	0.187409
3	412290	Jerde-Hilpert	10005	S1-06532	48	55.82	2679.36	8185.49	8185.49	0.327330
4	412290	Jerde-Hilpert	10005	S1-82801	21	13.62	286.02	8185.49	8185.49	0.034942
5	412290	Jerde-Hilpert	10005	S1-06532	9	92.55	832.95	8185.49	8185.49	0.101759
6	412290	Jerde-Hilpert	10005	S1-47412	44	78.91	3472.04	8185.49	8185.49	0.424170
7	412290	Jerde-Hilpert	10005	S1-27722	36	25.42	915.12	8185.49	8185.49	0.111798
8	218895	Kulas Inc	10006	S1-27722	32	95.66	3061.12	3724.49	3724.49	0.821890
9	218895	Kulas Inc	10006	B1-33087	23	22.55	518.65	3724.49	3724.49	0.139254
10	218895	Kulas Inc	10006	B1-33364	3	72.30	216.90	3724.49	3724.49	0.058236
11	218895	Kulas Inc	10006	B1-20000	-1	72.18	-72.18	3724.49	3724.49	-0.019380

This certainly works but there are several steps needed to get the data combined in the manner we need.

Second Approach - Using Transform

Using the original data, let’s try using transform and groupby and see what we get:

df.groupby('order')["ext price"].transform('sum')

0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
dtype: float64

You will notice how this returns a different size data set from our normal groupby functions. Instead of only showing the totals for 3 orders, we retain the same number of items as the original data set. That is the unique feature of using transform .

The final step is pretty simple:

df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]

	account	name	order	sku	quantity	unit price	ext price	order total	Order_Total	Percent_of_Order
0	383080	Will LLC	10001	B1-20000	7	33.69	235.83	576.12	576.12	0.409342
1	383080	Will LLC	10001	S1-27722	11	21.12	232.32	576.12	576.12	0.403249
2	383080	Will LLC	10001	B1-86481	3	35.99	107.97	576.12	576.12	0.187409
3	412290	Jerde-Hilpert	10005	S1-06532	48	55.82	2679.36	8185.49	8185.49	0.327330
4	412290	Jerde-Hilpert	10005	S1-82801	21	13.62	286.02	8185.49	8185.49	0.034942
5	412290	Jerde-Hilpert	10005	S1-06532	9	92.55	832.95	8185.49	8185.49	0.101759
6	412290	Jerde-Hilpert	10005	S1-47412	44	78.91	3472.04	8185.49	8185.49	0.424170
7	412290	Jerde-Hilpert	10005	S1-27722	36	25.42	915.12	8185.49	8185.49	0.111798
8	218895	Kulas Inc	10006	S1-27722	32	95.66	3061.12	3724.49	3724.49	0.821890
9	218895	Kulas Inc	10006	B1-33087	23	22.55	518.65	3724.49	3724.49	0.139254
10	218895	Kulas Inc	10006	B1-33364	3	72.30	216.90	3724.49	3724.49	0.058236
11	218895	Kulas Inc	10006	B1-20000	-1	72.18	-72.18	3724.49	3724.49	-0.019380

As an added bonus, you could combine into one statement if you did not want to show the individual order totals:

df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')

Here is a diagram to show what is happening:

After taking the time to understand transform , I think you will agree that this tool can be very powerful - even if it is a unique approach as compared to the standard Excel mindset.

Conclusion

I am continually amazed at the power of pandas to make complex numerical manipulations very efficient. Despite working with pandas for a while, I never took the time to figure out how to use transform. Now that I understand how it works, I am sure I will be able to use it in future analysis and hope that you will find this useful as well.

Practical Business Python

Understanding the Transform Function in Pandas

Introduction

What is transform?

Problem Set

First Approach - Merging

Second Approach - Using Transform

Conclusion

Comments

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure