Practical Business Python

Taking care of business, one python script at a time

Mon 23 December 2019

Creating Interactive Dashboards from Jupyter Notebooks

Posted by Duarte O. Carmo in articles

Introduction

I am pleased to have another guest post from Duarte O. Carmo. He wrote a series of posts in July on report generation with Papermill that were very well received. In this article, he will explore how to use Voilà and Plotly Express to convert a Jupyter notebook into a standalone interactive web site. In addition, this article will show examples of collecting data through an API endpoint, performing sentiment analysis on that data, and deploying the dashboard in multiple ways.

The Goal

Jupyter notebooks are one of my favorite tools for working with data; they are simple to use, fast to set up, and flexible. However, they do have their disadvantages: source control, collaboration, and reproducibility are just some of them. As I illustrated in my prior post, I tend to enjoy seeing what I can accomplish with them.

An increasing need is the sharing of our notebooks. Sure, you can export your notebooks to html or pdf, or even use something like nbviewer to share them. But what if your data changes constantly? What if you expect to see something different every time you run your notebook? How can you go about sharing something like that?

In this article, I’ll show you how to create a Jupyter notebook that fetches live data and builds an interactive plot, and then how to deploy it as a live dashboard. When you want to share the dashboard, all you need to give someone is a link.

Let’s have some fun with the data first.

Getting live Reddit data

We will use Reddit as the source of data for our dashboard. Reddit is a tremendous source of information, and there are a million ways to get access to it. One of my favorite ways to access the data is through a small API called pushshift. The documentation is right here.

Let’s say you wanted the most recent comments mentioning the word “python”. In Python, you could use requests to get a JSON version of the data:

import requests
url = "https://api.pushshift.io/reddit/search/comment/?q=python"
request = requests.get(url)
json_response = request.json()

You can add a multitude of parameters to this request, such as:

  • in a certain subreddit
  • after a certain date
  • sorted by upvotes
  • and many more (see the sketch after this list)
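
For example, here is a hedged sketch of the same request with a few of those query parameters passed through requests’ params argument (the parameter names subreddit, after, sort_type, and sort come from the pushshift documentation linked above):

import requests

# the same search, restricted to one subreddit and a time window, sorted by score
url = "https://api.pushshift.io/reddit/search/comment/"
payload = {
    "q": "python",               # comments mentioning 'python'
    "subreddit": "learnpython",  # only in /r/learnpython
    "after": "30d",              # posted in the last 30 days
    "sort_type": "score",        # sort by score...
    "sort": "desc",              # ...descending
}
response = requests.get(url, params=payload)
json_response = response.json()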

To make my life easier, I built a small wrapper function for calling this API:

def get_pushshift_data(data_type, **kwargs):
    """
    Gets data from the pushshift api.

    data_type can be 'comment' or 'submission'
    The rest of the args are interpreted as payload.

    Read more: https://github.com/pushshift/api
    """

    base_url = f"https://api.pushshift.io/reddit/search/{data_type}/"
    payload = kwargs
    request = requests.get(base_url, params=payload)
    return request.json()

Using kwargs as the request payload, I can pass any query parameters I wish as keyword arguments. For example, this call:

get_pushshift_data(data_type="comment",     # give me comments
                   q="python",              # that mention 'python'
                   after="48h",             # in the last 48 hours
                   size=1000,               # maximum 1000 comments
                   sort_type="score",       # sort them by score
                   sort="desc")             # sort descending

returns the JSON response. Pretty sweet, right?
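
The shape of that response matters for everything that follows: search results live in a list under a top-level "data" key (and aggregations, which we’ll use below, live under an "aggs" key). A quick sketch:

# navigating the response, based on the keys we use throughout this article
response = get_pushshift_data(data_type="comment", q="python", after="48h", size=10)
comments = response.get("data")     # search results: a list of comment dicts
print(len(comments))                # up to 10 comments
print(comments[0]["body"][:80])     # peek at the first comment's text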

Analyzing the data with Plotly Express

In what subreddits does the word ‘python’ appear most?

To answer the above question, we start by getting the data with our function:

data = get_pushshift_data(data_type="comment",
                          q="python",
                          after="48h",
                          size=1000,
                          aggs="subreddit")

The aggs keyword asks pushshift to return an aggregation into subreddits, which basically means: group the results by subreddit (read about it in the documentation).

Since the JSON response is pretty nested, we’ll need to navigate a bit inside the dictionary.
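
For reference, here is a rough sketch of the shape we are navigating, inferred from the indexing below and the counts in the table that follows (not verbatim pushshift output):

# inferred shape of the aggregation response (not verbatim API output):
# {
#     "aggs": {
#         "subreddit": [
#             {"doc_count": 352, "key": "learnpython"},
#             {"doc_count": 220, "key": "AskReddit"},
#             ...
#         ]
#     }
# }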

data = data.get("aggs").get("subreddit")

And we transform the list of dictionaries returned into a pandas DataFrame, and get the top 10.

import pandas

df = pandas.DataFrame.from_records(data)[0:10]

Here’s what our DataFrame looks like:

   doc_count  key
0        352  learnpython
1        220  AskReddit
2        177  Python
3        139  learnprogramming

These are the subreddits whose comments mention the word python most frequently 🎉 🎉!

Let’s plot our results with the Plotly Express library. Plotly Express is great, in my opinion, if you:

  • want to create figures quickly. 🚅
  • want figures that are a bit more interactive than matplotlib’s. 👋
  • don’t mind a bit more installation and (imo) a bit less documentation. 😰

Here’s all the code you need:

import plotly.express as px

px.bar(df,              # our dataframe
       x="key",         # x will be the 'key' column of the dataframe
       y="doc_count",   # y will be the 'doc_count' column of the dataframe
       title="Subreddits with most activity - comments with 'python' in the last 48h",
       labels={"doc_count": "# comments","key": "Subreddits"}, # the axis names
       color_discrete_sequence=["blueviolet"], # the colors used
       height=500,
       width=800)

Yes, perhaps a bit more verbose than matplotlib, but you get an interactive chart!
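
One note: in a notebook the figure renders inline on its own. If you run the same code as a plain script, assign the figure and call its show() method (a standard Plotly call) to open it in a browser:

fig = px.bar(df, x="key", y="doc_count")  # same call as above, trimmed for brevity
fig.show()                                # opens the interactive chart in a browser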

(Figure: interactive bar chart of the subreddits with the most ‘python’ comments.)

All of the details are included in the notebook for this article.

What are the most upvoted comments with the word ‘python’?

To answer this question, our function will again come in handy. Let’s aggregate things a bit.

Don’t get scared: this is a handful of one-liners that will produce a result similar to the one above:

# get the data we need using the function
data = get_pushshift_data(data_type="comment",
                          q="python",
                          after="7d",
                          size=10,
                          sort_type="score",
                          sort="desc").get("data")

# we only care about certain columns
df = pandas.DataFrame.from_records(data)[["author", "subreddit", "score", "body", "permalink"]]

# we only keep the first X characters of the body of the comment (sometimes they are too big)
df['body'] = df['body'].str[0:400] + "..."

# we append the string to all the permalink entries so that we have a link to the comment
df['permalink'] = "https://reddit.com" + df['permalink'].astype(str)

# style the last column to be clickable and print
df.style.format({'permalink': make_clickable})

To make a DataFrame column clickable, you can apply the following function to it:

def make_clickable(val):
    """ Makes a pandas column clickable by wrapping it in some html.
    """
    return '<a href="{}">Link</a>'.format(val)
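
df.style.format(...) returns a pandas Styler, which renders as HTML when it is the last expression in a notebook cell. If you ever need the raw HTML outside a notebook, something like this hedged sketch should work (Styler.to_html needs pandas 1.3 or newer):

# hedged sketch: get the rendered HTML as a string (requires pandas >= 1.3)
html = df.style.format({"permalink": make_clickable}).to_html()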

Putting it all together, the code returns the top 10 most upvoted comments of the last 7 days:

   author            subreddit             score  body                                                  permalink
0  Saiboo            learnpython             111  Suppose you create the following python file calle…  Link
1  Kompakt           ProgrammerHumor          92  Some languages don’t have switch statements…look…    Link
2  clown_world_2020  MrRobot                  47  Just goes to show that Esmail isn’t the only brill…  Link
3  Leebertysauce     AnimalsBeingBros         28  They wont even be mad when the python decide to ta…  Link
4  Kompakt           ProgrammerHumor          23  Yep it’s true, and depending on the design of the …  Link
5  niceboy4431       Cringetopia              23  I have a theory (someone prove me wrong if you kno…  Link
6  kingguru          Denmark                  22  Brug af Python: +1 Brug af Python 3: +2               Link
7  MintyAroma        totalwar                 20  We really need Bretonnian Men-At-Arms shouting Mon…  Link
8  aspiringtobeme    gifsthatkeepongiving     19  Amazing. Brought [this Monty Python clip](…           Link
9  CrimsonSpooker    TwoBestFriendsPlay       19  “Why can’t Three Houses be gritty and “realistic” …  Link

In the notebook, you can click the link column to be taken right into the comment. Hooray! 🎉

What is the sentiment in /r/python across time? Introducing TextBlob

Alright, the final analysis is a bit more complicated. We want to see the sentiment in the /r/python subreddit on some sort of timeline.

First, we already know how to retrieve the most upvoted comments of the past 2 days:

# get the data with our function
data = get_pushshift_data(data_type="comment",
                          after="2d",
                          size=1000,
                          sort_type="score",
                          sort="desc",
                          subreddit="python").get("data")

# define a list of columns we want to keep
columns_of_interest = ["author", "body", "created_utc", "score", "permalink"]

# transform the response into a dataframe
df = pandas.DataFrame.from_records(data)[columns_of_interest]

This gives us a pandas DataFrame with the columns specified in columns_of_interest. But how do we get the sentiment of every comment?

Enter TextBlob. A simple library that makes it ridiculously easy to get the sentiment of a sentence. TextBlob returns two values: the sentiment polarity (-1 is negative; 0 is neutral; and 1 is positive) and the sentiment subjectivity (0 is objective and 1 is subjective).

Here’s an example:

import textblob

sentence1 = "Portugal is a horrible country. People drive like crazy animals."
print(textblob.TextBlob(sentence1).sentiment)
# -> Sentiment(polarity=-0.8, subjectivity=0.95)
# negative and subjective

sentence2 = "Portugal is the most beautiful country in the world because beaches face west."
print(textblob.TextBlob(sentence2).sentiment)
# -> Sentiment(polarity=0.675, subjectivity=0.75)
# positive and less subjective

Read more about the library here.

Now that we know how to extract sentiment from a piece of text, we can easily create some other columns for our DataFrame of comments:

# create a column with sentiment polarity
df["sentiment_polarity"] = df.apply(lambda row: textblob.TextBlob(row["body"]).sentiment.polarity, axis=1)

# create a column with sentiment subjectivity
df["sentiment_subjectivity"] = df.apply(lambda row: textblob.TextBlob(row["body"]).sentiment.subjectivity, axis=1)

# create a column with 'positive' or 'negative' depending on sentiment_polarity
df["sentiment"] = df.apply(lambda row: "positive" if row["sentiment_polarity"] >= 0 else "negative", axis=1)

# create a column with a text preview that shows the first 50 characters
df["preview"] = df["body"].str[0:50]

# take the created_utc parameter and transform it into a datetime column
df["date"] = pandas.to_datetime(df['created_utc'], unit='s')

Finally, it’s time to plot our figure with the help of Plotly Express:

px.scatter(df, x="date", # date on the x axis
               y="sentiment_polarity", # sentiment on the y axis
               hover_data=["author", "permalink", "preview"], # data to show on hover
               color_discrete_sequence=["lightseagreen", "indianred"], # colors to use
               color="sentiment", # what should the color depend on?
               size="score", # the more votes, the bigger the circle
               size_max=10, # not too big
               labels={"sentiment_polarity": "Comment positivity", "date": "Date comment was posted"}, # axis names
               title="Comment sentiment in /r/python for the past 48h", # title of figure
          )

And here’s the output!

(Figure: scatter plot of comment sentiment in /r/python over the past 48 hours.)

In this view, we can see the comments made in /r/python in the last 48 hours. Most comments are rather on the positive side, but some are also negative. In your own notebook you’ll notice that you can hover over the comments and read the preview to see why they were classified as negative or positive.

The cool thing here is that if you run the same script tomorrow, you’ll get a different output.

So how can we put this somewhere that is “automatically” updated whenever we view it?

Creating a live dashboard with Voilà

Voilà has a simple premise: “Voilà turns Jupyter notebooks into standalone web applications.”

Let’s back up a bit and get everything you need running on your system. The first step is to have a working setup with everything above; for that, follow these instructions.

Once that is done, you should be able to launch the dashboard with:

(env) $ voila notebooks/Dashboard.ipynb

Now you should see the notebook we created rendered as a web application in a new browser tab!

Feel free to modify this notebook according to your interests. You’ll notice that I have created some general variables in the first notebook cell, so you can fire up Jupyter Lab, modify them, and see what comes out!

Here are the modifiable variables:

COMMENT_COLOR         = "blueviolet"    # color for your comment graph
SUBMISSION_COLOR      = "darkorange"    # color for your submission graph
TEXT_PREVIEW_SIZE     = 240             # how long should the preview be?
TERM_OF_INTEREST      = "python"        # maybe you are interested in some other term?
SUBREDDIT_OF_INTEREST = "python"        # maybe you are interested in some other subreddit?
TIMEFRAME             = "48h"           # you can define another timeframe
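
To give you an idea of how these variables feed the rest of the notebook, here is a hypothetical sketch of one of the earlier calls rewritten to use them (the exact wiring lives in the notebook itself):

# hypothetical sketch: the earlier aggregation call, driven by the variables above
data = get_pushshift_data(data_type="comment",
                          q=TERM_OF_INTEREST,
                          after=TIMEFRAME,
                          size=1000,
                          aggs="subreddit")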

Once you have modified your dashboard, you can launch Voilà again to see the results.

The most important thing about Voilà is that every time it runs, it actually re-runs your whole code. Yes, this makes things a bit slow, but it also means that the results get updated every time the page is refreshed! 🎉

Deploying your notebook to the web

First Option: Using Binder

Binder helps you turn a simple GitHub repo into an interactive notebook environment. It does this by using Docker images to reproduce your GitHub repo’s setup.

We don’t really care about all that. We just want to publish our Voilà dashboard. To do that, follow these steps:

  • Create a public GitHub repo
  • Add the notebooks you want to publish as dashboards to it
  • Add a requirements.txt file, just as I have in the example repo, with all of your dependencies (a sketch follows after this list).
  • Go to mybinder.org
  • In the GitHub field add your repo’s URL.
  • In the GitHub branch, tag, or commit field, add “master” (if you need a different branch, you probably know what you are doing).
  • In the Path to a notebook field, add voila/render/path/to/notebook.ipynb, where path/to/notebook.ipynb is the location of your notebook in your repo. In the example, this results in voila/render/notebooks/Dashboard.ipynb
  • Next to the Path to a notebook field, toggle URL (instead of the default File option)
  • Hit launch
  • Your dashboard will automatically launch 😮 🎉
  • You can share the link with others and they will have access to the dashboard as well.
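
As mentioned above, here is a hedged sketch of what that requirements.txt might contain, based only on the libraries used in this article (the example repo has the authoritative list):

voila
jupyter
requests
pandas
plotly
textblob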

Here is the running example of our reddit dashboard. (It takes a bit to build the first time.)

Second Option: Using an Ubuntu server in a hacky way with tmux 🙈

WARNING: This option is not 100% safe, so make sure to only use it for testing or proof of concepts, particularly if you are dealing with sensitive data!

If you want to have your dashboard running on a typical URL (such as mycooldash.com), you probably want to deploy it on a Linux server.

Here are the steps I used to accomplish that:

  • Set up your virtual private server - this Linode guide is a good start.
  • Make sure port 80 (the regular http port) is open
$ sudo iptables -A INPUT -p tcp --dport 80 -j ACCEPT
  • Once you have your repo in GitHub or somewhere else, clone it to your server.
$ git clone https://github.com/your_username/your_awesome_repo.git
  • Install your dependencies, activate your virtual environment, and start Voilà on port 80:
(env) $ voila YourNoteBook.ipynb --port=80

You can probably navigate to your server’s IP and see the dashboard. However, as soon as you exit your server’s SSH session, your dashboard will stop working. This is where a nifty trick with a tool called tmux comes in.

Tmux is a “terminal multiplexer” (wow, that’s a big word). It basically allows us to create multiple terminal sessions at the same time, and then (yes, you guessed it) keep them running indefinitely. If this sounds confusing, let’s just get to it.

  • Install tmux:
$ sudo apt-get install tmux
  • Once installed, we create a new named terminal session:
$ tmux new -s voila
  • You are now inside a new terminal session. Let’s get Voilà running there.
$ cd my_repo            # navigate to the repo
$ . env/bin/activate    # activate the environment
(env) $ voila MyNotebook.ipynb --port=80 # start the dashboard on port 80
  • You should see the dashboard in your browser
  • And now, for the magic, in your terminal hit ctrl + b and then d on your keyboard. This will “detach” you from that terminal where Voilà is running.
  • You are now back in your original terminal session. Notice that your dashboard is still running; that’s because your voila terminal session is still running.
  • You can see it by listing the terminal sessions with:
$ tmux ls
  • And then attach to it via:
$ tmux attach -t voila
  • And you’ll see your Voilà logs streaming again.

This is arguably a bit of a hack to have things running, but it works - so no complaints there.

Tmux is an awesome tool, and you should definitely learn more about it here.

Using Heroku or Google Cloud Platform

There are a million ways of deploying, and Voilà also has good documentation on these.

Conclusion

That was a long post! But we are finally done! Let’s summarize everything we learned:

  • We learned how to wrap an API endpoint in a function using **kwargs
  • We learned how to analyze Reddit data with Python and Plotly Express
  • We learned how to analyze the sentiment of sentences with TextBlob
  • We learned how to transform a Jupyter notebook into a dashboard using Voilà
  • We learned how to deploy those dashboards with Binder
  • We learned how to use tmux to deploy these kinds of tools on a server

That was a lot of stuff, and there are probably a few bugs in my notebook or my explanation, so make sure to let me know if you find any.

Hope you enjoyed it!
