Practical Business Python

Taking care of business, one python script at a time

Tue 04 April 2017

Understanding the Transform Function in Pandas

Posted by Chris Moffitt in articles   

article header image

Introduction

One of the compelling features of pandas is that it has a rich library of methods for manipulating data. However, there are times when it is not clear what the various functions do and how to use them. If you are approaching a problem from an Excel mindset, it can be difficult to translate the planned solution into the unfamiliar pandas command. One of those “unknown” functions is the transform method. Even after using pandas for a while, I have never had the chance to use this function so I recently took some time to figure out what it is and how it could be helpful for real world analysis. This article will walk through an example where transform can be used to efficiently summarize data.

What is transform?

I have found the best coverage of this topic in Jake VanderPlas’ excellent Python Data Science Handbook. I plan to write a review on this book in the future but the short and sweet is that it is a great resource that I highly recommend.

As described in the book, transform is an operation used in conjunction with groupby (which is one of the most useful operations in pandas). I suspect most pandas users likely have used aggregate , filter or apply with groupby to summarize data. However, transform is a little more difficult to understand - especially coming from an Excel world. Since Jake made all of his book available via jupyter notebooks it is a good place to start to understand how transform is unique:

While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean.

With that basic definition, I will go through another example that can explain how this is useful in other instances outside of centering data.

Problem Set

For this example, we will analyze some fictitious sales data. In order to keep the dataset small, here is a sample of 12 sales transactions for our company:

account name order sku quantity unit price ext price
0 383080 Will LLC 10001 B1-20000 7 33.69 235.83
1 383080 Will LLC 10001 S1-27722 11 21.12 232.32
2 383080 Will LLC 10001 B1-86481 3 35.99 107.97
3 412290 Jerde-Hilpert 10005 S1-06532 48 55.82 2679.36
4 412290 Jerde-Hilpert 10005 S1-82801 21 13.62 286.02
5 412290 Jerde-Hilpert 10005 S1-06532 9 92.55 832.95
6 412290 Jerde-Hilpert 10005 S1-47412 44 78.91 3472.04
7 412290 Jerde-Hilpert 10005 S1-27722 36 25.42 915.12
8 218895 Kulas Inc 10006 S1-27722 32 95.66 3061.12
9 218895 Kulas Inc 10006 B1-33087 23 22.55 518.65
10 218895 Kulas Inc 10006 B1-33364 3 72.30 216.90
11 218895 Kulas Inc 10006 B1-20000 -1 72.18 -72.18

You can see in the data that the file contains 3 different orders (10001, 10005 and 10006) and that each order consists has multiple products (aka skus).

The question we would like to answer is: “What percentage of the order total does each sku represent?”

For example, if we look at order 10001 with a total of $576.12, the break down would be:

  • B1-20000 = $235.83 or 40.9%
  • S1-27722 = $232.32 or 40.3%
  • B1-86481 = $107.97 or 18.7%

The tricky part in this calculation is that we need to get a total for each order and combine it back with the transaction level detail in order to get the percentages. In Excel, you could try to use some version of a subtotal to try to calculate the values.

First Approach - Merging

If you are familiar with pandas, your first inclination is going to be trying to group the data into a new dataframe and combine it in a multi-step process. Here’s what that approach would look like.

Import all the modules we need and read in our data:

import pandas as pd

df = pd.read_excel("sales_transactions.xlsx")

Now that the data is in a dataframe, determining the total by order is simple with the help of the standard groupby aggregation.

df.groupby('order')["ext price"].sum()
order
10001     576.12
10005    8185.49
10006    3724.49
Name: ext price, dtype: float64

Here is a simple image showing what is happening with the standard groupby

Groupby Example

The tricky part is figuring out how to combine this data back with the original dataframe. The first instinct is to create a new dataframe with the totals by order and merge it back with the original. We could do something like this:

order_total = df.groupby('order')["ext price"].sum().rename("Order_Total").reset_index()
df_1 = df.merge(order_total)
df_1["Percent_of_Order"] = df_1["ext price"] / df_1["Order_Total"]
account name order sku quantity unit price ext price order total Order_Total Percent_of_Order
0 383080 Will LLC 10001 B1-20000 7 33.69 235.83 576.12 576.12 0.409342
1 383080 Will LLC 10001 S1-27722 11 21.12 232.32 576.12 576.12 0.403249
2 383080 Will LLC 10001 B1-86481 3 35.99 107.97 576.12 576.12 0.187409
3 412290 Jerde-Hilpert 10005 S1-06532 48 55.82 2679.36 8185.49 8185.49 0.327330
4 412290 Jerde-Hilpert 10005 S1-82801 21 13.62 286.02 8185.49 8185.49 0.034942
5 412290 Jerde-Hilpert 10005 S1-06532 9 92.55 832.95 8185.49 8185.49 0.101759
6 412290 Jerde-Hilpert 10005 S1-47412 44 78.91 3472.04 8185.49 8185.49 0.424170
7 412290 Jerde-Hilpert 10005 S1-27722 36 25.42 915.12 8185.49 8185.49 0.111798
8 218895 Kulas Inc 10006 S1-27722 32 95.66 3061.12 3724.49 3724.49 0.821890
9 218895 Kulas Inc 10006 B1-33087 23 22.55 518.65 3724.49 3724.49 0.139254
10 218895 Kulas Inc 10006 B1-33364 3 72.30 216.90 3724.49 3724.49 0.058236
11 218895 Kulas Inc 10006 B1-20000 -1 72.18 -72.18 3724.49 3724.49 -0.019380

This certainly works but there are several steps needed to get the data combined in the manner we need.

Second Approach - Using Transform

Using the original data, let’s try using transform and groupby and see what we get:

df.groupby('order')["ext price"].transform('sum')
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
dtype: float64

You will notice how this returns a different size data set from our normal groupby functions. Instead of only showing the totals for 3 orders, we retain the same number of items as the original data set. That is the unique feature of using transform .

The final step is pretty simple:

df["Order_Total"] = df.groupby('order')["ext price"].transform('sum')
df["Percent_of_Order"] = df["ext price"] / df["Order_Total"]
account name order sku quantity unit price ext price order total Order_Total Percent_of_Order
0 383080 Will LLC 10001 B1-20000 7 33.69 235.83 576.12 576.12 0.409342
1 383080 Will LLC 10001 S1-27722 11 21.12 232.32 576.12 576.12 0.403249
2 383080 Will LLC 10001 B1-86481 3 35.99 107.97 576.12 576.12 0.187409
3 412290 Jerde-Hilpert 10005 S1-06532 48 55.82 2679.36 8185.49 8185.49 0.327330
4 412290 Jerde-Hilpert 10005 S1-82801 21 13.62 286.02 8185.49 8185.49 0.034942
5 412290 Jerde-Hilpert 10005 S1-06532 9 92.55 832.95 8185.49 8185.49 0.101759
6 412290 Jerde-Hilpert 10005 S1-47412 44 78.91 3472.04 8185.49 8185.49 0.424170
7 412290 Jerde-Hilpert 10005 S1-27722 36 25.42 915.12 8185.49 8185.49 0.111798
8 218895 Kulas Inc 10006 S1-27722 32 95.66 3061.12 3724.49 3724.49 0.821890
9 218895 Kulas Inc 10006 B1-33087 23 22.55 518.65 3724.49 3724.49 0.139254
10 218895 Kulas Inc 10006 B1-33364 3 72.30 216.90 3724.49 3724.49 0.058236
11 218895 Kulas Inc 10006 B1-20000 -1 72.18 -72.18 3724.49 3724.49 -0.019380

As an added bonus, you could combine into one statement if you did not want to show the individual order totals:

df["Percent_of_Order"] = df["ext price"] / df.groupby('order')["ext price"].transform('sum')

Here is a diagram to show what is happening:

Groupby Example

After taking the time to understand transform , I think you will agree that this tool can be very powerful - even if it is a unique approach as compared to the standard Excel mindset.

Conclusion

I am continually amazed at the power of pandas to make complex numerical manipulations very efficient. Despite working with pandas for a while, I never took the time to figure out how to use transform. Now that I understand how it works, I am sure I will be able to use it in future analysis and hope that you will find this useful as well.


 
       Vote on Hacker News          

Comments