Mon 14 September 2020

Reading HTML tables with Pandas

The pandas read_html() function is a quick and convenient way to turn an HTML table into a pandas DataFrame. This function can be useful for quickly incorporating tables from various websites without figuring out how to scrape the site’s HTML. However, there can be some challenges in cleaning and formatting the data before analyzing it. In this article, I will discuss how to use pandas read_html() to read and clean several Wikipedia HTML tables so that you can use them for further numeric analysis.

Tue 02 June 2020

sidetable - Create Simple Summary Tables in Pandas

Posted by Chris Moffitt in articles

Today I am happy to announce the release of a new pandas utility library called sidetable. This library makes it easy to build a frequency table and simple summary of missing values in a DataFrame. I have found it to be a useful tool when starting data exploration on a new data set and I hope others find it useful as well.

This project is also an opportunity to illustrate how to use pandas new API to register custom DataFrame accessors. This API allows you to build custom functions for working with pandas DataFrames and Series and could be really useful for building out your own library of custom pandas accessor functions.

Tue 18 February 2020

Python Tools for Record Linking and Fuzzy Matching

Posted by Chris Moffitt in articles

Record linking and fuzzy matching are terms used to describe the process of joining two data sets together that do not have a common unique identifier. Examples include trying to join files based on people’s names or merging data that only have organization’s name and address.

This problem is a common business challenge and difficult to solve in a systematic way - especially when the data sets are large. A naive approach using Excel and vlookup statements can work but requires a lot of human intervention. Fortunately, python provides two libraries that are useful for these types of problems and can support complex matching algorithms with a relatively simple API.

Mon 23 December 2019

Creating Interactive Dashboards from Jupyter Notebooks

Posted by Duarte O.Carmo in articles

I am pleased to have another guest post from Duarte O.Carmo. He wrote series of posts in July on report generation with Papermill that were very well received. In this article, he will explore how to use Voilà and Plotly Express to convert a Jupyter notebook into a standalone interactive web site. In addition, this article will show examples of collecting data through an API endpoint, performing sentiment analysis on that data and show multiple approaches to deploying the dashboard.

Mon 16 December 2019

Finding Natural Breaks in Data with the Fisher-Jenks Algorithm

Posted by Chris Moffitt in articles

This article is inspired by a tweet from Peter Baumgartner. In the tweet he mentioned the Fisher-Jenks algorithm and showed a simple example of ranking data into natural breaks using the algorithm. Since I had never heard about it before, I did some research.

After learning more about it, I realized that it is very complimentary to my previous article on Binning Data and it is intuitive and easy to use in standard pandas analysis. It is definitely an approach I would have used in the past if I had known it existed.

I suspect many people are like me and have never heard of the concept of natural breaks before but have probably done something similar on their own data. I hope this article will expose this simple and useful approach to others so that they can add it to their python toolbox.

The rest of this article will discuss what the Jenks optimization method (or Fisher-Jenks algorithm) is and how it can be used as a simple tool to cluster data using “natural breaks”.

Tue 26 November 2019

Tips for Selecting Columns in a DataFrame

Posted by Chris Moffitt in articles

This article will discuss several tips and shortcuts for using iloc to work with a data set that has a large number of columns. Even if you have some experience with using iloc you should learn a couple of helpful tricks to speed up your own analysis and avoid typing lots of column names in your code.

Mon 28 October 2019

Cleaning Up Currency Data with Pandas

Posted by Chris Moffitt in articles

The other day, I was using pandas to clean some messy Excel data that included several thousand rows of inconsistently formatted currency values. When I tried to clean it up, I realized that it was a little more complicated than I first thought. Coincidentally, a couple of days later, I followed a twitter thread which shed some light on the issue I was experiencing. This article summarizes my experience and describes how to clean up messy currency fields and convert them into a numeric value for further analysis. The concepts illustrated here can also apply to other types of pandas data cleanup tasks.

Mon 14 October 2019

Binning Data with Pandas qcut and cut

Posted by Chris Moffitt in articles

When dealing with continuous numeric data, it is often helpful to bin the data into multiple buckets for further analysis. There are several different terms for binning including bucketing, discrete binning, discretization or quantization. Pandas supports these approaches using the cut and qcut functions. This article will briefly describe why you may want to bin your data and how to use the pandas functions to convert continuous data to a set of discrete buckets. Like many pandas functions, cut and qcut may seem simple but there is a lot of capability packed into those functions. Even for more experience users, I think you will learn a couple of tricks that will be useful for your own analysis.

Tue 17 September 2019

Happy Birthday Practical Business Python!

Posted by Chris Moffitt in articles

On September 17th, 2014, I published my first article which means that today is the 5th birthday of Practical Business Python. Thank you to all my readers and all those that have supported me through this process! It has been a great journey and I look forward to seeing what the future holds.

This 5 year anniversary gives me the opportunity to reflect on the blog and what will be coming next. I figured I would use this milestone to walk through a few of the stats and costs associated with running this blog for the past 5 years. This post will not be technical but I am hopeful that my readers as well as current and aspiring bloggers going down this path will find it helpful. Finally, please use the comments to let me know what content you would like to see in the future.

Mon 26 August 2019

Combine Multiple Excel Worksheets Into a Single Pandas Dataframe

Posted by Chris Moffitt in articles

One of the most commonly used pandas functions is read_excel. This short article shows how you can read in all the tabs in an Excel workbook and combine them into a single pandas dataframe using one command.

For those of you that want the TLDR, here is the command:

df = pd.concat(pd.read_excel('2018_Sales_Total.xlsx', sheet_name=None), ignore_index=True)

Read on for an explanation of when to use this and how it works.

Practical Business Python

Reading HTML tables with Pandas

sidetable - Create Simple Summary Tables in Pandas

Python Tools for Record Linking and Fuzzy Matching

Creating Interactive Dashboards from Jupyter Notebooks

Finding Natural Breaks in Data with the Fisher-Jenks Algorithm

Tips for Selecting Columns in a DataFrame

Cleaning Up Currency Data with Pandas

Binning Data with Pandas qcut and cut

Happy Birthday Practical Business Python!

Combine Multiple Excel Worksheets Into a Single Pandas Dataframe

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure