Automated Report Generation with Papermill: Part 1
Posted by Duarte O.Carmo in articles

Introduction
This guest post that walks through a great example of using python to automate a report generating process. I think PB Python readers will enjoy learning from this real world example using python, jupyter notebooks, papermill and several other tools.
Before we get started, I would like to introduce the author:
My name is Duarte Carmo and I’m a product manager and digital consultant. Originally from Lisbon - Portugal, but currently living and working in Copenhagen - Denmark. Find more about my work and leisure in my website.
Part 1 - Tool roundup
Welcome to part 1 of this two-part series post about automating report generation using python, jupyter, papermill, and a couple of other tools.
In the first part, we will cover 4 main important workflows that are part of the automation process. In the second and final part, we will bring everything together and build our own report automation system.
Note: This code was written in python 3.7. You might have to adapt the code for older versions of python.
All of the code for this article is available on github.
Alright, let’s get to work.
Automating report generation with Python - Why?
Not everyone can code. This might seem like an obvious statement, but once you start using python to automate or analyze things around you, you start to encounter a big problem: reproducibility. Not everyone knows how to run your scripts, use your tools, or even use a modern browser.
Let us say you built a killer script. How exactly do you make someone who has never heard the word “python” use it? You could teach them python, but that would take a long time.
In this series, we will teach you how you can automatically generate shareable Html reports from any excel file using a combination of tools, centered around python.
Creating a Jupyter Notebook reports from Excel files
Let us say you have an excel file
sales_january.xlsx
with a list
of the sales generated by a group of employees. Just like this:

Let’s start by using a jupyter notebook
sales_january.ipynb
to
create a very simple analysis of that sales data.
We start by importing the pandas and
maplotlib libraries. After that, we
specify the name of our file using the
filename
variable. Finally,
we use the
read_excel
function to read our data into a pandas DataFrame.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline # so plots are printed automatically
filename = "sales_january.xlsx"
data = pd.read_excel(filename, index_col=0)
When printing the
data
dataframe, we get the following:

After that, we plot the data using pandas:
data.plot(kind="bar", title=f"Sales report from {filename}")
And we get the following:

And that’s it! We have a jupyter notebook that analyzes (a very simple analysis let us say) a sales report in excel. Now let’s say we want to share that report with other people in the organization, what do we do?
Automating report generation using papermill
Papermill is a handy tool that allows us to “parameterize and execute” Jupyter Notebooks. This basically means that papermill allows you to execute the same jupyter notebook, with different variables defined outside its context.
To install it, run
pip install papermill
, or follow the more
complete installation instructions.
Let us say we want to generate the same report as above, but with
another excel file:
sales_february.xlsx
. You should have in your
directory, the following:
├── sales_february.xlsx ├── sales_january.html ├── sales_january.ipynb └── sales_january.xlsx
The first step is to parameterize our notebook, to do this, let us
create a
template.ipynb
file. This notebook is very similar to
sales_january.ipynb
but with a small difference: a new cell with a
tag
parameters
. Just like this:

(If you have trouble adding a tag to your notebook, visit this link)
The cell with the
parameters
tag, will allow you to run this
notebook from another python script while feeding the
filename
variable, any value you would like.
Your directory should look like this:
├── sales_february.xlsx ├── sales_january.html ├── sales_january.ipynb ├── sales_january.xlsx └── template.ipynb
You can always browse the code in the github repo.
Now that we have everything in place, let’s generate a report for a new
february_sales.xlsx
excel file.
To do it, in a new python file, or python console, run the following:
import papermill as pm
pm.execute_notebook(
'template.ipynb',
'sales_february.ipynb',
parameters=dict(filename="sales_february.xlsx")
)
Let’s break this down. The
pm.execute_notebook
function takes 3
arguments. The first,
template.ipynb
is the name of the file what we
will use as a base to run our notebook, the one with the
parameters
tag. The second argument is the name of the new notebook that we will
generate with the new arguments. Finally,
parameters
is a dictionary
of the variables that we want to insert into our template, in this case,
the
filename
variable, that will now point to our February sales report.
After running the above code, you will notice a new file in your directory:
├── sales_february.ipynb <- This one! ├── sales_february.xlsx ├── sales_january.html ├── sales_january.ipynb ├── sales_january.xlsx └── template.ipynb
Which means, that Papermill has generated a new notebook for us, based
on the
sales_february.xlsx
sales report. When opening this
notebook, we see a new graph with the new february numbers:

This is pretty handy! We could have a continuous script that always runs this notebook with different sales reports from different months. But how can we automate the process even more? Stay tuned to learn how!
In the second part of this series, you will learn how to bring all of this together to build a full report automation workflow that your colleagues can use! Sign up to the mailing list to make sure you are alerted when the next part comes out!
Updates
29-July-2019: Part 2 of this article is now available.
Comments