In 2016, Microsoft launched Windows Subsystem for Linux (WSL) which brought robust unix functionality to Windows. In May 2019, Microsoft announced the release of WSL 2 which includes an updated architecture that improved many aspects of WSL - especially file system performance. I have been following WSL for a while but now that WSL 2 is nearing general release, I decided to install it and try it out. In the few days I have been using it, I have really enjoyed the experience. The combo of using Windows 10 and a full Linux distro like Ubuntu is a really powerful development solution that works surprisingly well.
Record linking and fuzzy matching are terms used to describe the process of joining two data sets together that do not have a common unique identifier. Examples include trying to join files based on people’s names or merging data that only have organization’s name and address.
This problem is a common business challenge and difficult to solve in a systematic way - especially when the data sets are large. A naive approach using Excel and vlookup statements can work but requires a lot of human intervention. Fortunately, python provides two libraries that are useful for these types of problems and can support complex matching algorithms with a relatively simple API.
As part of managing the PB Python newsletter, I wanted to develop a simple way to write emails once using plain text and turn them into responsive HTML emails for the newsletter. In addition, I needed to maintain a static archive page on the blog that links to the content of each newsletter. This article shows how to use python tools to transform a markdown file into a responsive HTML email suitable for a newsletter as well as a standalone page integrated into a pelican blog.
I am pleased to have another guest post from Duarte O.Carmo. He wrote series of posts in July on report generation with Papermill that were very well received. In this article, he will explore how to use Voilà and Plotly Express to convert a Jupyter notebook into a standalone interactive web site. In addition, this article will show examples of collecting data through an API endpoint, performing sentiment analysis on that data and show multiple approaches to deploying the dashboard.
This article is inspired by a tweet from Peter Baumgartner. In the tweet he mentioned the Fisher-Jenks algorithm and showed a simple example of ranking data into natural breaks using the algorithm. Since I had never heard about it before, I did some research.
After learning more about it, I realized that it is very complimentary to my previous article on Binning Data and it is intuitive and easy to use in standard pandas analysis. It is definitely an approach I would have used in the past if I had known it existed.
I suspect many people are like me and have never heard of the concept of natural breaks before but have probably done something similar on their own data. I hope this article will expose this simple and useful approach to others so that they can add it to their python toolbox.
The rest of this article will discuss what the Jenks optimization method (or Fisher-Jenks algorithm) is and how it can be used as a simple tool to cluster data using “natural breaks”.