Mon 13 July 2015

Best Practices for Managing Your Code Library

Introduction

Over time you have probably developed a set of python scripts that you use on a frequent basis to make your daily work more effective. However, as you start to collect a bunch of python files, the time you take take to manage them can increase greatly. Your once simple development environment can become an unmanageable mess; especially if you do not try to have some consistency and common patterns for your development process. This article will discuss some best practices to manage your python code base so that you can sustain and maintain it over the years without pulling your hair out in the process.

The Scope

I am targeting this article towards a certain problem domain. Many of the points apply universally but I’m generally going to be talking about situations where:

You are the sole maintainer (or only a very small group of people use the code in question).
There are no broader department/company wide procedures in place that cover your code.
The code is used to solve internal problems and is not meant to be used as part of a marketable product.
The code base tends to include a lot of standalone scripts without many multi-file python apps.
Most files contain between 30-300 lines of python code.
The code may be one off or used as part of a periodic reporting/analysis project.
The code will mostly be used to solve small to medium sized data problems. The data is typically hard to manipulate in Excel but does not require dedicated hardware.

The Problem

Because python is so expressive, you can do some very complex activities in a very small number of lines of code. In my particular case, I have been using pandas for a while now and have developed a nice library of scripts I can use to manipulate the data I work with on a daily basis. As you start to develop your own repository, you’ll find that you will end up with dozens of scripts that work great. However, if you use them on an infrequent basis, maintenance starts to consume more and more of your time.

When It All Works

I have tried to apply these ideas to my internal projects and have had good success. However, nothing is perfect so I’m interested to see what others say.

Before I go into specifics, let me give one example that just happened a week ago. I think it illustrates my point well.

I received a request to produce a summary report of some data. It was an exploratory data request for some sales information over a period of time and I had a good idea of how to pull it together (including other scripts that did many of the actions I needed). I figured it would have taken me 10-20 minutes of Excel manipulation to get the report. I also knew that I could put in about 1 hour and have a python script to pull the data and output it into an Excel file. What to do?

I decided to spend a little extra time and create a python script.

Fast forward a couple of days when I was discussing the report. The group had some good ideas for how to modify it. For instance, we wanted to look at prior 6 months of sales instead of 12 months. We also wanted to look at units instead of revenue.

I looked at my script and in a total of less than 5 minutes, I made all those changes and re-ran it. The new output was finished in a fraction of the time it would have taken me to manipulate it in Excel. I also know there will be more changes and that it is super easy to re-run it again if I need to. That extra time and attention I spent in the beginning will save me way more time in the long run.

This is one small example but once you start rolling, I am confident you will have many of your own. I am hopeful that these guidelines will be applicable to your situations as well.

General Guidelines

One of the biggest pieces of advice I have is to treat your code like an open source project. I do not mean that you release all your code to the world but that you should use the best practices and conventions in the open source world to make your repository more manageable. Always think about how you would hand this code off to someone else in your organization.

Use Version Control

Whatever you do, make sure you check the code into a version control system. Even if you think you’ll be the only person using it and that the code won’t grow and change that much - do it. There is no excuse. If you have an internal system, use it. If not, then look for alternatives. I know that most people use github but bitbucket is actually a little more attractive if you want to use it for private repos. You can have unlimited private repos and up to 5 users in the free plan.

The one caveat I would have is to make sure you understand your company’s policy on this type of hosting. Some companies may have restrictions on using external version control systems. I will leave it to you to navigate that.

One word of caution I would give you is to keep any confidential data sorted locally and not stored in an external repo. Your code is going to be mostly useless without the data files so I would feel most comfortable with that approach. Please make sure you understand your organization’s policies.

Even if you find yourself unable to host code externally, you can still set up a local mercurial repo. The other really cool thing is that you can use hg serve to have a local webserver that allows you to browse your repo and view changesets. This is a really useful feature for an individual developer.

Once you have a repo set up, then you should start to manage the code in the repo like you would an open source project. What does that mean?

Document the code with a README file. If you create a README, this has the advantage of giving you a nice summary of what is in the repo. In my README, I include certain items like:
- Overview of the required python versions and how to get them.
- Description of major packages (Pandas, SQL Alchemy, etc)
- Description of each file, including working files, log files, configs.
- Notes on upgrading the environment or configuring certain items.
- What the directories are used for. As mentioned above, I don’t keep the external files in a remote repo but do want to keep a record of the various directories I use and how I get the files.
- Notes on when specific files need to be run (daily, monthly, etc).
- Reminders to yourself about how to update packages or any dependencies.
Have good commit notes. It’s so easy to put in commit notes like “Minor formatting changes” or “Fixes for Joe”. However, those notes will not help you when you’re trying to figure out why you made a change many months ago. This post is a good summary of what your commits should look like.
Consider using the ticket functionality. If your code is in bitbucket or github then you get ticket functionality for free. Go ahead and use it. Anything you can do to help pull your thoughts and history together in one place is a good idea. If you use tickets, make sure to reference them in your commit notes.

Document Your Code

Even when you have only a few lines of code in a file, you should still make sure to follow good coding practices. One of the most important is good documentation. For the particular class of problems we are solving, I want to cover a couple of specific methods that have worked well for me.

I like to include a couple of items in the docstring header of my file that look something like this:

# -*- coding: utf-8 -*-
"""
Created on Tue Jun 30 11:12:50 2015

Generate 12 month sales trends of Product X, Y and Z
Report was requested by Jane Smith in marketing
Expect this to be an ad-hoc report to support new product launch in Q3 2015

Source data is from SAP ad-hoc report generated by Fred Jones
Output is summarized Excel report
"""

In the example above, I include a creation date as well as a summary of what the script is for. I also find it incredibly useful to include who is asking for it and then some idea if this is a one off request or something I intend to run frequently. Finally, I include descriptions of any input and output files. If I am working on a file that someone gave to me, I need to make sure I understand how to get it again.

In addition to summary info, I wanted to give a couple of specific examples of inline code comments.

For instance, if you have any code that you are using based on a stack overflow answer or blog post, I recommend you provide a link back to the original answer or post. In one particular case, I wanted to merge two dictionaries together. Since I was not sure the best approach, I searched the web and found a detailed stack overflow answer.

Here is what my comment looked like:

# Create one dict of all promo codes
# We know keys are unique so this is ok approach
# http://stackoverflow.com/questions/38987/how-can-i-merge-two-python-dictionaries-in-a-single-expression
all_promos = sold_to.copy()
all_promos.update(regional)

Another important item to comment would be business rationale around certain assumptions. For instance, the following code is straightforward pandas and wouldn’t warrant a comment except for understanding why are we choosing the number 3.

# Also filter out any accounts with less than 3 units.
# These accounts are just noise for this analysis.
# These are typically smaller accounts with no consistent business
all_data = all_data[all_data["Volume"] >= 3]

Code Style

Fortunately python has many tools to help you enforce the style of your code. If you want to read lots of opinions there is a reddit discussion on the options. I personally think that pyflakes is useful for the style of coding we’re discussing. I think the actual choice matters less than the fact that you do make a choice.

I encourage you to use an editor that has some sort of integration with one of these tools. I find that it helps me make sure my spacing is consistent and that I don’t have imported but unused modules. It won’t guarantee bug free code but the consistency really helps when you look at code that is several months/years old.

I also encourage you to read and follow the Python Code Style Guide. It contains a bunch of useful examples for best practices in python coding. You should refer to it often and try to incorporate these guidelines in your code, no matter how small the script.

Managing Input and Outputs

Many of the scripts will have multiple input and output files. I try to keep all of the files in one input directory and one output directory. I also include a date (and sometimes) time stamp in my files so that I can run them multiple times and have some record of the old ones. If you need to run them multiple times per day, you would need to include the time as well as the date.

Here is a code snippet I frequently use in my scripts:

# Data files are stored relative to the current script
INCOMING = os.path.join(os.getcwd(), "in")
OUTGOING = os.path.join(os.getcwd(), "out")
default_file_name = "Marketing-Launch-Report-{:%m-%d-%Y}.xlsx".format(date.today())
save_file = os.path.join(OUTGOING, default_file_name)
input_file = os.path.join(INCOMING, "inputfile.xlsx")
df = pd.read_excel(input_file)
# Do more stuff with pandas here
# Save the data to excel by creating a writer so that we can easily add
# multiple sheets
writer = pd.ExcelWriter(save_file)
df.to_excel(writer)
writer.save()

In addition to this code, here are a couple of recommendations on managing your input files:

Try not to make any alterations by hand to the file. Keep it as close to the original as possible.
Don’t delete old input files, move them to an archive directory so you don’t lose them.

Choosing 3rd Party Apps

One of the huge advantages of python is the rich ecosystem of 3rd party modules that are readily available and easy to install. The downside to this is that you can sometimes get stuck trying to figure out which ones to install. Choice is good but too much time can be wasted evaluating all the options.

In general, I try to solve my problems by:

Making sure python’s standard library can’t do it already
Use a package from Anaconda (my distro of choice)
Choosing a package that is easy to install with pip install

One example is argparse. I frequently use code like this:

import argparse

# Functions here

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Create Marketing Report')
    parser.add_argument('--accounts',
                        action='store_true',
                        help='Process Account Data)
    parser.add_argument('--sales',
                        action='store_true',
                        help='Process Sales Data')
    args = parser.parse_args()
    # Process accounts
    if args.accounts:
        # Do something
    # Process sales
    if args.sales:
        # Do Something

For me, this is perfectly serviceable for my simple scripts. I don’t need anything more.

However, python has a bunch of different Command line frameworks. For more complex command line apps, you may need some of these but if you are just getting started, the basic argparse module is going to work just fine.

If I had to use another framework, I would first take a look at the options included in Anaconda and make a choice from there.

Finally, if I really had the need for something else I would evaluate the options available to me via pip .

My point with this discussion is that just because there is a third party module, don’t assume you need it right away. Try to work your way up the decision process I outlined and use the module when you need it.

The main reason for limiting the number of additional modules is that it makes maintaining and re-creating your environment that much easier. By minimizing the number of extra modules, you minimize your setup confusion. Obviously, there is some judgement here. Certain modules are defacto standards or absolutely required for your needs. Just make sure you do need it before you blindly assume you do.

Best Practices with Pandas

If you are in a business setting (and reading this blog) you are probably using pandas in your scripts. Over time, I’ve found a few things that I consistently do that make the overall development and maintenance process easier.

Name your columns

When reading in data via excel or csv, you will end up with really long column names or columns in different dataframes that represent the same item but are labeled differently.

For instance, “State”, “st”, “state”, “US State” may all mean the same thing but are labeled differently depending on your inputs.

The other issue is when you have overly long column names like, “Assistant to the Regional Manager”. If you have to type that over and over you may want to just shorten it to “Asst”.

One solution is to define the labels when you read in the data by using the names argument:

labels = ["number", "name", "region", "territory", "Rep_Name", "state"]
df = pd.read_excel(input_file, names=labels)

The other option is to rename all the columns after you have your dataframe:

df.columns = ["number", "name", "region", "territory", "Rep_Name", "state"]

This option is best when you have a smaller number of columns and want to change all of their names.

The final option is to rename the columns using df.rename :

column_names = {'Number': 'number',
                'Account name': 'name',
                'Geographical Region': 'region',
                'Sales Representative Name': 'Rep_name',
                'Home State': 'state'
                }
df.rename(columns=column_names, inplace=True)

Using rename is useful because you can just pass in the specific columns you want to rename. It will leave any other column names untouched.

It is best to do this as early in the script as possible so that you can take advantage of it in the rest of your code.

Check your data types

Pandas is really good at reading in your data and building a useable dataframe. The downside is that you might think you have a date column but it is instead stored as an object. This could become a problem when you start trying to do calculations. I find it is best to make sure the columns you need to be a certain type are in that type so your numeric and data calculations work as expected.

For instance, if you have a numeric column and see that it is of a dtype “object” then pandas does not think it is a number.

df.dtypes # all columns have the object datatype

a    object
b    object
c    object

If you want to force the column to another type, you can use astype

df.a = df.astype(float)

If you want to force a column to a datetime type:

df.timestamp = pd.to_datetime(df.timestamp)

Pandas also has a helpful convert_objects which will convert an entire dataframe.

df = df.convert_objects(convert_numeric=True, convert_dates=True)

There is one other datatype that can be handy and that pandas will not automatically create but that is easy for you to create: Categorical. This blog post gives some good reasons why you might want to use this data type and shows how simple it is to do and how much more efficient it can make your code.

If you have a set of data in a column that falls into a small discrete number of items, categoricals might make sense. In a business setting examples of categorial items include:

Geographies like countries and states
People like management hierarchy
Products (shoes, shirts, etc.)
Attributes (colors, size, etc)

Creating a category is as simple as:

df["size"] = df["size"].astype('category')

Categoricals are also nice because you can define their order. This attribute will make certain types of output simpler. For cases where categories make sense to be ordered, go ahead and do it.

df["size"] = df["size"].cat.reorder_categories(['s', 'm', 'l', 'xl'])

Format at the end

If you generate Excel files as output, hopefully you are using XlsxWriter to format them. One of the things I recommend is that you format your data at the very end of the process. It is very frustrating to get all the way through and then realize you need to add a couple of new columns and your formatting is all messed up. I use a function call to modify the writer. I don’t actually put anything in here until I am 95%+ confident that the output looks like I want it to. Here is how I set up my formatting:

def format_excel(writer):
    """
    Go through the output Excel file and do some column formatting to make the
    numeric formats correct and ensure columns are wide enough to easily view.
    """
    workbook = writer.book
    money_fmt = workbook.add_format({'num_format': '$#,##0'})
    for tab in ["All", "Summary"]:
        worksheet = writer.sheets[tab]
        worksheet.set_column('B:G', 20)
        worksheet.set_column('I:J', 10, money_fmt)
        worksheet.set_column('L:M', 10, money_fmt)

if __name__ == "__main__":
    # Lots of code here
    writer = pd.ExcelWriter(save_file, engine='xlsxwriter')
    summary_df.to_excel(writer, "Summary")
    all_data.to_excel(writer, "All", index=False)
    format_excel(writer)
    writer.save()

Conclusion

This post is getting long, so congrats if you have made it all the way to the end. I am hoping that people will find this useful and even contribute some of their own ideas. I have a few other things I may put into a future article but, for now, I hope this is useful. Please add anything to the comments if you have some of your own ideas and tips. I will try to include them in future updates.

Practical Business Python

Best Practices for Managing Your Code Library

Introduction

The Scope

The Problem

When It All Works

General Guidelines

Use Version Control

Document Your Code

Code Style

Managing Input and Outputs

Choosing 3rd Party Apps

Best Practices with Pandas

Name your columns

Check your data types

Format at the end

Conclusion

Comments

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure