Taking care of business, one python script at a time

Tue 18 November 2014

Creating a Waterfall Chart in Python

Posted by Chris Moffitt in articles

Introduction

Waterfall charts can be a really useful tool to for certain types of data plotting. Not surprisingly, we can use pandas and matplotlib to create a repeatable waterfall chart.

Before I go any further, I want to level set with everyone about which type of chart I’m referring to. I will be building a 2-D waterfall chart described in this wikipedia article.

A fairly typical use for a chart such as this is to show what the + and - values are that “bridge” between a start and end value. For this reason, finance folks will sometimes refer to this as a bridge. Like the other examples I’ve used, this type of plot is not easy to generate in Excel. There are certainly ways to do it but it is not easy to remember.

The key thing to keep in mind with a waterfall chart is: at its heart it is a stacked bar chart. The “special sauce” is that you have a blank bottom bar so the top bar “floats” in space. Let’s get started.

Creating the Chart

Execute the standard imports and make sure IPython will display matplot plots.

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```
```%matplotlib inline
```

Setup the data we want to waterfall chart and load it into a dataframe.

The data needs to start with your starting value but you leave out the final total. We will calculate it.

```index = ['sales','returns','credit fees','rebates','late charges','shipping']
data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}
trans = pd.DataFrame(data=data,index=index)
```

I am using the handy ``` display``` function in IPython to make it easier to control what I want to display.

```from IPython.display import display
display(trans)
```
amount
sales 350000
returns -30000
credit fees -7500
rebates -25000
late charges 95000
shipping -7000

The biggest trick with a waterfall plot is figuring out what the bottom stacked bar chart should be. I learned a lot from this stackoverflow discussion

First, let’s get the cumulative sum.

```display(trans.amount.cumsum())
```
```sales           350000
returns         320000
credit fees     312500
rebates         287500
late charges    382500
shipping        375500
Name: amount, dtype: int64
```

This looks good but we need to shift the data one place to the right.

```blank=trans.amount.cumsum().shift(1).fillna(0)
display(blank)
```
```sales                0
returns         350000
credit fees     320000
rebates         312500
late charges    287500
shipping        382500
Name: amount, dtype: float64
```

We need to add a net total amount to the ``` trans``` and ``` blank``` dataframe.

```total = trans.sum().amount
trans.loc["net"] = total
blank.loc["net"] = total
display(trans)
display(blank)
```
amount
sales 350000
returns -30000
credit fees -7500
rebates -25000
late charges 95000
shipping -7000
net 375500
```sales                0
returns         350000
credit fees     320000
rebates         312500
late charges    287500
shipping        382500
net             375500
Name: amount, dtype: float64
```

Create the steps we use to show the changes.

```step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan
display(step)
```
```0         0
0       NaN
0    350000
1    350000
1       NaN
1    320000
2    320000
2       NaN
2    312500
3    312500
3       NaN
3    287500
4    287500
4       NaN
4    382500
5    382500
5       NaN
5    375500
6    375500
6       NaN
6       NaN
Name: amount, dtype: float64
```

For the net row, we need to make sure the blank value is 0 so we don’t double stack.

```blank.loc["net"] = 0
```

Plot it and see what it looks like

```my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
```

That looks pretty good but let’s try formatting the y-axis to make it more readable. We use ``` FuncFormatter``` and some of the python 2.7+ syntax to truncate decimals and add a comma to the format.

```def money(x, pos):
'The two args are the value and tick position'
return "\${:,.0f}".format(x)
```
```from matplotlib.ticker import FuncFormatter
formatter = FuncFormatter(money)
```

Pull it all together

```my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")
my_plot.yaxis.set_major_formatter(formatter)
```

Full Script

The basic graph works but I wanted to add labels and make some minor formatting changes. Here is my final script:

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

#Use python 2.7+ syntax to format currency
def money(x, pos):
'The two args are the value and tick position'
return "\${:,.0f}".format(x)
formatter = FuncFormatter(money)

#Data to plot. Do not include a total, it will be calculated
index = ['sales','returns','credit fees','rebates','late charges','shipping']
data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}

#Store data and create a blank series to use for the waterfall
trans = pd.DataFrame(data=data,index=index)
blank = trans.amount.cumsum().shift(1).fillna(0)

#Get the net total number for the final element in the waterfall
total = trans.sum().amount
trans.loc["net"]= total
blank.loc["net"] = total

#The steps graphically show the levels as well as used for label placement
step = blank.reset_index(drop=True).repeat(3).shift(-1)
step[1::3] = np.nan

#When plotting the last element, we want to show the full bar,
#Set the blank to 0
blank.loc["net"] = 0

#Plot and label
my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(10, 5), title="2014 Sales Waterfall")
my_plot.plot(step.index, step.values,'k')
my_plot.set_xlabel("Transaction Types")

#Format the axis for dollars
my_plot.yaxis.set_major_formatter(formatter)

#Get the y-axis position for the labels
y_height = trans.amount.cumsum().shift(1).fillna(0)

#Get an offset so labels don't sit right on top of the bar
max = trans.max()
neg_offset = max / 25
pos_offset = max / 50
plot_offset = int(max / 15)

#Start label loop
loop = 0
for index, row in trans.iterrows():
# For the last item in the list, we don't want to double count
if row['amount'] == total:
y = y_height[loop]
else:
y = y_height[loop] + row['amount']
# Determine if we want a neg or pos offset
if row['amount'] > 0:
y += pos_offset
else:
y -= neg_offset
my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
loop+=1

#Scale up the y axis so there is room for the labels
my_plot.set_ylim(0,blank.max()+int(plot_offset))
#Rotate the labels
my_plot.set_xticklabels(trans.index,rotation=0)
my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
```

Running the script will generate this nice looking chart:

Final Thoughts

If you were not familiar with waterfall charts, hopefully this example will show you how useful they can be. I imagine that some people may think that this is an awful lot of scripting for one chart. I agree - in some respects. If you will only do one waterfall chart and never touch it again - stick with an Excel solution.

However, what if the chart is really useful and now you need to replicate it for 100 customer? What would you do next? Using Excel would be a challenge. Using this script to create 100 different charts would be fairly easy. Once again, the real value of this process is building an easily repeatable process when you need to scale the solution.

I am really enjoying learning more about pandas, matplotlib and IPython. I am pretty happy with how this solution has turned out and I hope others can learn a little bit too and apply this lessons to their daily jobs.