Tue 04 April 2017

Understanding the Transform Function in Pandas

One of the compelling features of pandas is that it has a rich library of methods for manipulating data. However, there are times when it is not clear what the various functions do and how to use them. If you are approaching a problem from an Excel mindset, it can be difficult to translate the planned solution into the unfamiliar pandas command. One of those “unknown” functions is the transform method. Even after using pandas for a while, I have never had the chance to use this function so I recently took some time to figure out what it is and how it could be helpful for real world analysis. This article will walk through an example where transform can be used to efficiently summarize data.

Mon 06 March 2017

Forecasting Website Traffic Using Facebook’s Prophet Library

Posted by Chris Moffitt in articles

A common business analytics task is trying to forecast the future based on known historical data. Forecasting is a complicated topic and relies on an analyst knowing the ins and outs of the domain as well as knowledge of relatively complex mathematical theories. Because the mathematical concepts can be complex, a lot of business forecasting approaches are “solved” with a little linear regression and “intuition.” More complex models would yield better results but are too difficult to implement.

Given that background, I was very interested to see that Facebook recently open sourced a python and R library called prophet which seeks to automate the forecasting process in a more sophisticated but easily tune-able model. In this article, I’ll introduce prophet and show how to use it to predict the volume of traffic in the next year for Practical Business Python. To make this a little more interesting, I will post the prediction through the end of March so we can take a look at how accurate the forecast is.

Tue 21 February 2017

Populating MS Word Templates with Python

Posted by Chris Moffitt in articles

In a previous post, I covered one approach for generating documents using HTML templates to create a PDF. While PDF is great, the world still relies on Microsoft Word for document creation. In reality, it will be much simpler for a business user to create the desired template that supports all the custom formatting they need in Word versus trying to use HTML+CSS. Fortunately, there is a a package that supports doing a MS Word mailmerge purely within python. This approach has the advantage of running on any system - even if Word is not installed. The benefit to using python for the merge (vs. an Excel sheet) is that you are not limited in how you retrieve or process the data. The full flexibility and power of the python ecosystem is at your finger tips. This should be a useful tool to keep in mind any time you need to automate document creation.

Mon 06 February 2017

Guide to Encoding Categorical Values in Python

Posted by Chris Moffitt in articles

In many practical Data Science activities, the data set will contain categorical variables. These variables are typically stored as text values which represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”) or geographic designations (State or Country). Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not. Therefore, the analyst is faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing.

As with many other aspects of the Data Science world, there is no single answer on how to approach this problem. Each approach has trade-offs and has potential impact on the outcome of the analysis. Fortunately, the python tools of pandas and scikit-learn provide several approaches that can be applied to transform the categorical data into suitable numeric values. This article will be a survey of some of the various common (and a few more complex) approaches in the hope that it will help others apply these techniques to their real world problems.

Tue 17 January 2017

Data Science Challenge - Predicting Baseball Fanduel Points

Posted by Chris Moffitt in articles

Several months ago, I participated in my first crowd-sourced Data Science competition in the Twin Cities run by Analyze This!. In my previous post, I described the benefits of working through the competition and how much I enjoyed the process. I just completed the second challenge and had another great experience that I wanted to share and (hopefully) encourage others to try these types of practical challenges to build their Data Science/Analytics skills.

In this second challenge, I felt much more comfortable with the actual process of cleaning the data, exploring it and building and testing models. I found that the python tools continue to serve me well. However, I also identified a lot of things that I need to do better in future challenges or projects in order to be more systematic about my process. I am curious if the broader community has tips or tricks they can share related to some of the items I will cover below. I will also highlight a few of the useful python tools I used throughout the process. This post does not include any code but is focused more on the process and python tools for Data Science.

Mon 19 December 2016

Building a Financial Model with Pandas - Version 2

Posted by Chris Moffitt in articles

In my last article, I discussed building a financial model in pandas that could be used for multiple amortization scenarios. Unfortunately, I realized that I made a mistake in that approach so I had to rethink how to solve the problem. Thanks to the help of several individuals, I have a new solution that resolves the issues and produces the correct results.

In addition to posting the updated solution, I have taken this article as an opportunity to take a step back and examine what I should have done differently in approaching the original problem. While it is never fun to make a mistake in front of thousands of people, I’ll try to swallow my pride and learn from it.

Mon 21 November 2016

Building a Financial Model with Pandas

Posted by Chris Moffitt in articles

In my previous articles, I have discussed how to use pandas as a replacement for Excel when it comes to data wrangling. In many cases, a python + pandas solution is superior to the highly manual processes many people use for manipulating data in Excel. However, Excel is used for many scenarios in a business environment - not just data wrangling. This specific post will discuss how to do financial modeling in pandas instead of Excel. For this example, I will build a simple amortization table in pandas and show how to model various outcomes.

In some ways, building the model is easier in Excel (there are many examples just a google search away). However, as an exercise in learning about pandas, it is useful because it forces you to think about how to use pandas strengths to solve a problem in a way different from the Excel solution. In my opinion the solution is more powerful because you can build on it to run multiple scenarios, easily chart various outcomes and focus on aggregating the data in a way most useful for your needs.

Tue 06 September 2016

Creating Pandas DataFrames from Lists and Dictionaries

Posted by Chris Moffitt in articles

Whenever I am doing analysis with pandas my first goal is to get data into a panda’s DataFrame using one of the many available options. For the vast majority of instances, I use read_excel, read_csv, or read_sql.

However, there are instances when I just have a few lines of data or some calculations that I want to include in my analysis. In these cases it is helpful to know how to create DataFrames from standard python data structures such as lists or dictionaries. The basic process is not difficult but because there are several different options it is helpful to understand how each works. I can never remember whether I should use from_dict, from_records, from_items or the default DataFrame constructor. Normally, through some trial and error, I figure it out. Since it is still confusing to me, I thought I would walk through several examples below to clarify the different approaches. At the end of the article, I briefly show how this can be useful when generating Excel reports.

Mon 29 August 2016

Introduction to Data Visualization with Altair

Posted by Chris Moffitt in articles

Despite being over 1 year old, one of the most popular articles I have written is Overview of Python Visualization Tools. After these many months, it is one of my most frequently searched for, linked to and read article on this site. I think this fact speaks to hunger in the python community for one visualization tool to rise above the rest. I am not sure I want (or need) one to “win” but I do continue to watch the changes in this space with interest.

All of the tools I mentioned in the original article are still alive and many have changed quite a bit over the past year or so. Anyone looking for a visualization tool should investigate the options and see which ones meet their needs. They all have something to offer and different use-cases will drive different solutions.

In the spirit of keeping up with the latest options in this space, I recently heard about Altair which calls itself a “declarative statistical visualization library for Python.” One of the things that peaked my interest was that it is developed by Brian Granger and Jake Vanderplas. Brian is a core developer in the IPython project and very active in the scientific python community. Jake is also active in the scientific python community and has written a soon to be released O’Reilly book called Python Data Science Handbook. Both of these individuals are extremely accomplished and knowledgeable about python and the various tools in the python scientific ecosystem. Because of their backgrounds, I was very curious to see how they approached this problem.

Tue 23 August 2016

Lessons Learned from Analyze This! Challenge

Posted by Chris Moffitt in articles

I recently had the pleasure of participating in a crowd-sourced data science competition in the Twin Cities called Analyze This! I wanted to share some of my thoughts and experiences on the process - especially how this challenge helped me learn more about how to apply data science theory and open source tools to real world problems.

I also hope this article can encourage others in the Twin Cities to participate in future events. For those of you not in the Minneapolis-St. Paul metro area, then maybe this can help motivate you to start up a similar event in your area. I thoroughly enjoyed the experience and got a lot out of the process. Read on for more details.

Practical Business Python

Understanding the Transform Function in Pandas

Forecasting Website Traffic Using Facebook’s Prophet Library

Populating MS Word Templates with Python

Guide to Encoding Categorical Values in Python

Data Science Challenge - Predicting Baseball Fanduel Points

Building a Financial Model with Pandas - Version 2

Building a Financial Model with Pandas

Creating Pandas DataFrames from Lists and Dictionaries

Introduction to Data Visualization with Altair

Lessons Learned from Analyze This! Challenge

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure