Taking care of business, one python script at a time

Sun 05 October 2014

# Using Sets for Data Analysis

Posted by Chris Moffitt in articles

## Standard Library

One of the really wonderful aspects of Python is that it has a very rich standard library of modules that allow you to do some complex programming without having to install any of the third party modules available in the python ecosystem. If you have a valid install of python on your system, you have access to all of these tools.

In this particular case, I will talk about using two modules:

• sets
A data type that allows you to manipulate collections of unique items.
• csv
A module that allows you to read and write csv files

## Sets

Think back to the Venn diagrams you learned about in grade school. They are used to represent sets of items. Here is an image based on the example in the stdlib documentation:

This diagram shows that there are two groups of people: engineers and managers. The intersection of the two groups is engineering management. Here is some very simple python code that shows how to work with these sets:

```engineers = set(['John', 'Jane', 'Jack', 'Janice'])
managers = set(['Jane', 'Jack', 'Susan', 'Zack'])
engineering_management = engineers & managers
print engineering_management
```

Here is the output when you run it:

```\$ python venn.py
set(['Jane', 'Jack'])
```

Now we know that Jane and Jack are our members of engineering management. This is a very simple example but we can use the same principal to manage a very large set of data.

## A More Complex Problem

The basic set seems very simple and straightforward, but I assert that there are simple uses of this tool that you would be hard pressed to replicate in Excel.

In this example, let’s say that you are an analyst at a company and are very proficient in Excel. You know how to write formulas, you can make pivot tables and massage data to make great presentations. One day, you’re confronted with a large data set of your company’s sales that looks something like this:

The data set is about 1000 lines and shows all your company’s sales data for a period of time. Marketing is interested in understanding customer pruchase patterns. You are interested in answering questions like these:

• How many customers have purchased shoes?
• How many customers have purchased belts?
• Who has purchased shoes but not belts?
• Which customers have purchases shoes, belts and shirts?

Normally, you would start to sort and manage the data in Excel. Using Excel’s data filters are certainly simple but they start to get more challenging when you want to get a unique list of customers. Since a customer makes multiple purchases over a period of time, you would have to sort/pivot the data and do an additional filtering step. Assuming you have a way to generate the report and get the data, how easily could you repeat it next time? How would you show a co-worker how to do it?

## Using Python to solve the problem

Using the venn diagram, here is a picture of what we would like to see:

Using the csv module, it is relatively simple to read in the data, and add to the shirts, belts and socks sets, then do some quick analysis:

• First, import the csv module:
```import csv
```
• Next open up the csv file:
```f = open("sample-sales.csv",'rt')
```
• Initialize the sets to empty values:
```shoes = set()
belts = set()
shirts = set()
```
• Read each line of the file, get the customer information and add to the appropriate set. There is one interesting fact about the set. Notice how we don’t check to see if a product is in a set or not before we add it. The set only contains unique items. This means if I add a company three times, it only shows in the list once. This feature makes the code very simple:
```for row in reader:
customer = (row[0],row[1])
category = row[3]
if category == "Shoes":
if category == "Belt":
if category == "Shirt":
f.close()
```
• Now, we can do some analysis:
```print "%s customers have purchased shoes" % len(shoes)
print "%s customers have purchased belts" % len(belts)
print "%s customers have purchased shoes but not belts" % len(shoes - belts)
print "%s customers have purchased shoes and belts" % len(shoes & belts)
print "%s customers have purchases shoes and shirts" % len(shoes & shirts)
print "%s customers have purchased shoes, belts and shirts" % len(shoes & belts & shirts)
print "The following customers are our most valued. They have purchased shoes & belts & shirts:"
for customer in shoes & belts & shirts:
print customer
```
• When you run the program, here is what you get:
```\$ python sales-analysis.py
271 customers have purchased shoes
170 customers have purchased belts
240 customers have purchased shoes but not belts
31 customers have purchased shoes and belts
79 customers have purchases shoes and shirts
9 customers have purchased shoes, belts and shirts
The following customers are our most valued. They have purchased shoes & belts & shirts:
('173693', 'Halvorson PLC')
('727064', 'Bashirian, Beier and Watsica')
('434955', 'Ledner-Kling')
('803666', 'Fritsch-Glover')
('527078', 'Kuvalis-Roberts')
('041711', 'Koepp-McLaughlin')
('151300', "Upton, Runolfsson and O'Reilly")
('004450', 'Beier-Bosco')
('343481', 'Mills Inc')
```

This program just prints out the results but you could modify this to print the data into a csv file for import into Excel or other tools.

## Final Thoughts

Here is the final fully functional program:

```import csv

f = open("sample-sales.csv",'rt')
shoes = set()
belts = set()
shirts = set()
customer = (row[0],row[1])
category = row[3]
if category == "Shoes":
if category == "Belt":
if category == "Shirt":
f.close()

print "%s customers have purchased shoes" % len(shoes)
print "%s customers have purchased belts" % len(belts)
print "%s customers have purchased shoes but not belts" % len(shoes - belts)
print "%s customers have purchased shoes and belts" % len(shoes & belts)
print "%s customers have purchases shoes and shirts" % len(shoes & shirts)
print "%s customers have purchased shoes, belts and shirts" % len(shoes & belts & shirts)
print "The following customers are our most valued. They have purchased shoes & belts & shirts:"
for customer in shoes & belts & shirts:
print customer
```

There are are certainly other options for getting this result. In fact, I may go over some of them in future articles. However, the idea here is to start to get folks to think about solving problems with simple python scripts instead of multiple steps in Excel.

There are some really great reasons why this script is better than an Excel solution:

• You have a record of what you used to create the file. Next time you need to run this analysis, just run the script and you are good to go.
• You can hand this off to someone else to run for you. No need to do a detailed document describing the steps in Excel.
• If the data changes over time and you need to clean it up, you have a good start.
• The solutions can scale to really large files. Running the script on hundres of thousands of lines is not be a problem. I ran it on a 1M line file in less than 20 seconds on a very low end laptop.

I hope this helps you get some ideas about the tremendous power Python offers you in a business setting. This very simple example starts to show you the potential for Python in your business setting.

EDIT:

• 10/4/2014: Updated based on comments from laurencrowe. I have decided not to use the {} syntax. However, I would recommend using it in your code as you get more comfortable with python.