Sun 09 November 2014

Using Pandas To Create an Excel Diff

Introduction

As part of my continued exploration of pandas, I am going to walk through a real world example of how to use pandas to automate a process that could be very difficult to do in Excel. My business problem is that I have two Excel files that are structured similarly but have different data and I would like to easily understand what has changed between the two files.

Basically, I want an Excel diff tool.

Here is a snapshot of the type of data I’m looking at:

account number	name	street	city	state	postal code
935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118
371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838
548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394
296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052
132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785

In this example, I have two customer address lists and I would like to understand:

which customers are new
which customers are removed
which customers have changed information between the two files

You can envision this being fairly useful when auditing changes in a system or potentially providing a list of changes so you can have your sales team contact new customers.

Research

My first thought was that I wanted to evaluate existing tools that could easily perform a diff on two Excel files. I did some google searching and found a stack overflow discussion on the topic.

There are some decent solutions in the thread but nothing that I felt would meet my requirements. One of my requirements is that I’d like to make it as easy as possible to replicate for someone that may not be very technically inclined. Before pandas, I might have created a script to loop through each file and do my comparison. However, I thought that I might be able to come up with a better solution using pandas. In hindsight, this was a useful exercise to help me understand more about working with pandas.

Once I decided to work work pandas, I did another search and found stack overflow thread that looked like a good start.

First Attempt

Like I did in my previous article, I am using an IPython notebook to test out my solution. If you would like to follow along, here are sample-address-1 and sample-address-2

The first step, is my normal imports:

import pandas as pd
import numpy as np

Next, read in both of our excel files into dataframes

df1 = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])

Order by account number and reindex so that it stays this way.

df1.sort(columns="account number")
df1=df1.reindex()
df2.sort(columns="account number")
df2=df2.reindex()

Create a diff function to show what the changes are.

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

Merge the two datasets together in a Panel . I will admit that I haven’t fully grokked the panel concept yet but the only way to learn is to keep pressing on!

diff_panel = pd.Panel(dict(df1=df1,df2=df2))

Once the data is in a panel, we use the report_diff function to highlight all the changes. I think this is a very intuitive way (for this data set) to show changes. It is relatively simple to see what the old value is and the new one. For example, someone could easily check and see why that postal code changed for account number 880043.

diff_output = diff_panel.apply(report_diff, axis=0)
diff_output.tail()

	account number	name	street	city	state	postal code
95	677936	Hodkiewicz-Koch	604 Lemke Knoll Suite 661	East Laurence	Wisconsin	98576
96	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318 —-> 64918
97	899885	Kessler and Sons	356 Johnson Isle Suite 991	Casiehaven	Wyoming	37996
98	704567	Yundt-Abbott	8338 Sauer Highway	Jennyfort	Wyoming	19932
99	880729	Huels PLC	695 Labadie Lakes Apt. 256	Port Orland	Wyoming	42977

One of the things we want to do is flag rows that have changes so it is easier to see the changes. We will create a has_change function and use apply to run the function against each row.

def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"


diff_output['has_change'] = diff_output.apply(has_change, axis=1)
diff_output.tail()

	account number	name	street	city	state	postal code	has_change
95	677936	Hodkiewicz-Koch	604 Lemke Knoll Suite 661	East Laurence	Wisconsin	98576	N
96	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318 —-> 64918	Y
97	899885	Kessler and Sons	356 Johnson Isle Suite 991	Casiehaven	Wyoming	37996	N
98	704567	Yundt-Abbott	8338 Sauer Highway	Jennyfort	Wyoming	19932	N
99	880729	Huels PLC	695 Labadie Lakes Apt. 256	Port Orland	Wyoming	42977	N

It is simple to show all the columns with a change:

diff_output[(diff_output.has_change == 'Y')]

	account number	name	street	city	state	postal code	has_change
24	595932	Kuhic, Eichmann and West	4059 Tobias Inlet —-> 4059 Tobias St	New Rylanfurt	Illinois	89271	Y
30	558879	Watsica Group	95616 Enos Grove Suite 139 —-> 829 Big street	West Atlas —-> Smithtown	Iowa —-> Ohio	47419 —-> 47919	Y
96	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318 —-> 64918	Y

Finally, let’s write it out to an Excel file:

diff_output[(diff_output.has_change == 'Y')].to_excel('my-diff.xlsx')

Here is a simple program that does what I’ve just shown:

import pandas as pd
import numpy as np

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# We want to be able to easily tell which rows have changes
def has_change(row):
    if "--->" in row.to_string():
        return "Y"
    else:
        return "N"

# Read in both excel files
df1 = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])

# Make sure we order by account number so the comparisons work
df1.sort(columns="account number")
df1=df1.reindex()
df2.sort(columns="account number")
df2=df2.reindex()

# Create a panel of the two dataframes
diff_panel = pd.Panel(dict(df1=df1,df2=df2))

#Apply the diff function
diff_output = diff_panel.apply(report_diff, axis=0)

# Flag all the changes
diff_output['has_change'] = diff_output.apply(has_change, axis=1)

#Save the changes to excel but only include the columns we care about
diff_output[(diff_output.has_change == 'Y')].to_excel('my-diff-1.xlsx',index=False,columns=["account number",
                                                      "name","street","city","state","postal code"])

Scaling Up

I have to be honest, I was feeling pretty good so I decided to run this on a more complex dataset and see what happened. I’ll spare you the steps but show you the output:

	account number	name	street	city	state	postal code
19	878977.0 —-> 869125	Swift PLC —-> Wiza LLC	5605 Hodkiewicz Views —-> 9824 Noemi Harbors	Summerfurt —-> North Tristin	Vermont —-> Maine	98029.0 —-> 98114
20	880043.0 —-> 875910	Beatty Inc —-> Lowe, Tremblay and Bruen	3641 Schaefer Isle Suite 171 —-> 3722 Tatyana…	North Gardnertown —-> Selmafurt	Wyoming —-> NorthDakota	64318.0 —-> 17496
21	880729.0 —-> 878977	Huels PLC —-> Swift PLC	695 Labadie Lakes Apt. 256 —-> 5605 Hodkiewic…	Port Orland —-> Summerfurt	Wyoming —-> Vermont	42977.0 —-> 98029
22	nan —-> 880043	nan —-> Beatty Inc	nan —-> 3641 Schaefer Isle Suite 171	nan —-> North Gardnertown	nan —-> Wyoming	nan —-> 64318
23	nan —-> 880729	nan —-> Huels PLC	nan —-> 695 Labadie Lakes Apt. 256	nan —-> Port Orland	nan —-> Wyoming	nan —-> 42977

Hmmm. This isn’t going to work is it?

I am going to rethink this and see if I can come up with an approach that will scale on a bigger data set.

Second Attempt

I will use a similar approach but build it out to show more details on the changes and make the solution more robust for bigger data sets. Here are the data sets for those interested: sample-address-new and sample-address-old.

Start with the standard imports.

import pandas as pd
import numpy as np

We will define our report_diff function like we did in the previous exercise.

def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

Read in the values in the two different sheets

old = pd.read_excel('sample-address-old.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-new.xlsx', 'Sheet1', na_values=['NA'])

Label the two data sets so that when we combine them, we know which is which

old['version'] = "old"
new['version'] = "new"

We can look at the data to see what the format looks like and how many records we ended up with.

new.head()

	account number	name	street	city	state	postal code	version
0	935480	Bruen and Jones Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	new
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	new
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	new
3	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	6278	new
4	985603	Bosco-Upton	89 Big Street	Small Town	Texas	19033	new

old.head()

	account number	name	street	city	state	postal code	version
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	old
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	old
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	old
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old

len(old)

len(new)

We will add all the data together into a new table

full_set = pd.concat([old,new],ignore_index=True)

As expected, the full set includes 46 records.

full_set.count()

account number    46
name              46
street            46
city              46
state             46
postal code       46
version           46
dtype: int64

full_set.head(5)

	account number	name	street	city	state	postal code	version
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	old
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	old
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	old
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old

full_set.tail(5)

	account number	name	street	city	state	postal code	version
41	869125	Wiza LLC	9824 Noemi Harbors	North Tristin	Maine	98114	new
42	875910	Lowe, Tremblay and Bruen	3722 Tatyana Springs Apt. 464	Selmafurt	NorthDakota	17496	new
43	878977	Swift PLC	5605 Hodkiewicz Views	Summerfurt	Vermont	98029	new
44	880043	Beatty Inc	3641 Schaefer Isle Suite 171	North Gardnertown	Wyoming	64318	new
45	880729	Huels PLC	695 Labadie Lakes Apt. 256	Port Orland	Wyoming	42977	new

We use drop_duplicates to get rid of the obvious columns where there has not been any change. Note that we keep the last one using take_last=True so we can tell which accounts have been removed in the new data set.

One interesting note about drop_duplicates , you can specify which columns you care about. This functionality is really useful if you have extra columns (say sales, or notes) that you expect to change but don’t really care about for these purposes.

changes = full_set.drop_duplicates(subset=["account number","name","street","city","state","postal code"],take_last=True)

We have cut down our data set to 28 records.

len(changes)

Sort and take a look at what the data looks like. If you look at account number 132971, you can get an idea for how the data is structured.

changes.sort(columns="account number").head(5)

	account number	name	street	city	state	postal code	version
27	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	new
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old
25	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	6278	new
28	214098	Goodwin, Homenick and Jerde	649 Cierra Forks Apt. 078	Rosaberg	Colorado	47743	new
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old

Use the get_duplicates function to get a list of all the account numbers which are duplicated.

dupe_accts = changes.set_index('account number').index.get_duplicates()
dupe_accts

[132971, 935480, 985603]

Get a list of all the dupes into one frame using isin .

dupes = changes[changes["account number"].isin(dupe_accts)]
dupes

	account number	name	street	city	state	postal code	version
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	old
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old
5	985603	Bosco-Upton	03369 Moe Way	Port Casandra	Arkansas	86014	old
22	935480	Bruen and Jones Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	new
25	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	6278	new
26	985603	Bosco-Upton	89 Big Street	Small Town	Texas	19033	new

We need two data frames of the same size so split them into a new and old version.

change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

Drop the version columns since we don’t need them any more.

change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'],axis=1)
change_old

	account number	name	street	city	state	postal code
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785
5	985603	Bosco-Upton	03369 Moe Way	Port Casandra	Arkansas	86014

Index on the account number.

change_new.set_index('account number',inplace=True)

change_new

	name	street	city	state	postal code
account number
935480	Bruen and Jones Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118
132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	6278
985603	Bosco-Upton	89 Big Street	Small Town	Texas	19033

change_old.set_index('account number',inplace=True)

change_old

	name	street	city	state	postal code
account number
935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118
132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785
985603	Bosco-Upton	03369 Moe Way	Port Casandra	Arkansas	86014

Run our diff process like we did in our first attempt now that we have the data structured in the way we need to.

diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)
diff_output

	name	street	city	state	postal code
account number
935480	Bruen Group —-> Bruen and Jones Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118
132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785 —-> 6278
985603	Bosco-Upton	03369 Moe Way —-> 89 Big Street	Port Casandra —-> Small Town	Arkansas —-> Texas	86014 —-> 19033

Looks pretty good!

We know our diff, now we need to figure out which accounts were removed in the new list. We need to find records from the “old” version that are no longer in the “new” version.

changes['duplicate']=changes["account number"].isin(dupe_accts)
removed_accounts = changes[(changes["duplicate"] == False) & (changes["version"] == "old")]
removed_accounts

	account number	name	street	city	state	postal code	version	duplicate
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old	False

The final portion is figuring out which accounts are new.

We will go back to the full set and take only the first duplicate row.

new_account_set = full_set.drop_duplicates(subset=["account number","name","street","city","state","postal code"],take_last=False)
new_account_set.head()

	account number	name	street	city	state	postal code	version
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	old
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	old
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	old
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old

Add a duplicate column again.

new_account_set['duplicate']=new_account_set["account number"].isin(dupe_accts)
new_account_set.head()

	account number	name	street	city	state	postal code	version	duplicate
0	935480	Bruen Group	5131 Nienow Viaduct Apt. 290	Port Arlie	Alabama	14118	old	True
1	371770	Cruickshank-Boyer	839 Lana Expressway Suite 234	South Viviana	Alabama	57838	old	False
2	548367	Spencer, Grady and Herman	65387 Lang Circle Apt. 516	Greenholtbury	Alaska	58394	old	False
3	296620	Schamberger, Hagenes and Brown	26340 Ferry Neck Apt. 612	McCulloughstad	Alaska	74052	old	False
4	132971	Williamson, Schumm and Hettinger	89403 Casimer Spring	Jeremieburgh	Arkansas	62785	old	True

We want to find the accounts that aren’t duplicated and are only in the new data set.

added_accounts = new_account_set[(new_account_set["duplicate"] == False) & (new_account_set["version"] == "new")]

Let’s look at all the new accounts we have added:

added_accounts

	account number	name	street	city	state	postal code	version	duplicate
27	121213	Bauch-Goldner	7274 Marissa Common	Shanahanchester	California	49681	new	False
28	214098	Goodwin, Homenick and Jerde	649 Cierra Forks Apt. 078	Rosaberg	Colorado	47743	new	False
29	566618	Greenfelder, Wyman and Harris	17557 Romaguera Field	South Tamica	Colorado	50037	new	False

Finally we can save all of this into three different sheets in an Excel file.

writer = pd.ExcelWriter("my-diff-2.xlsx")
diff_output.to_excel(writer,"changed")
removed_accounts.to_excel(writer,"removed",index=False,columns=["account number",
                                         "name","street","city","state","postal code"])
added_accounts.to_excel(writer,"added",index=False,columns=["account number",
                                         "name","street","city","state","postal code"])
writer.save()

Here is a full streamlined code example:

import pandas as pd
import numpy as np

# Define the diff function to show the changes in each field
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

# Read in the two files but call the data old and new and create columns to track
old = pd.read_excel('sample-address-old.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-new.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"

#Join all the data together and ignore indexes so it all gets added
full_set = pd.concat([old,new],ignore_index=True)

# Let's see what changes in the main columns we care about
changes = full_set.drop_duplicates(subset=["account number","name","street","city","state","postal code"],take_last=True)

#We want to know where the duplicate account numbers are, that means there have been changes
dupe_accts = changes.set_index('account number').index.get_duplicates()

#Get all the duplicate rows
dupes = changes[changes["account number"].isin(dupe_accts)]

#Pull out the old and new data into separate dataframes
change_new = dupes[(dupes["version"] == "new")]
change_old = dupes[(dupes["version"] == "old")]

#Drop the temp columns - we don't need them now
change_new = change_new.drop(['version'], axis=1)
change_old = change_old.drop(['version'], axis=1)

#Index on the account numbers
change_new.set_index('account number',inplace=True)
change_old.set_index('account number',inplace=True)

#Now we can diff because we have two data sets of the same size with the same index
diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)

#Diff'ing is done, we need to get a list of removed items

#Flag all duplicated account numbers
changes['duplicate']=changes["account number"].isin(dupe_accts)

#Identify non-duplicated items that are in the old version and did not show in the new version
removed_accounts = changes[(changes["duplicate"] == False) & (changes["version"] == "old")]

# We have the old and diff, we need to figure out which ones are new

#Drop duplicates but keep the first item instead of the last
new_account_set = full_set.drop_duplicates(subset=["account number","name","street","city","state","postal code"],take_last=False)

#Identify dupes in this new dataframe
new_account_set['duplicate']=new_account_set["account number"].isin(dupe_accts)

#Identify added accounts
added_accounts = new_account_set[(new_account_set["duplicate"] == False) & (new_account_set["version"] == "new")]

#Save the changes to excel but only include the columns we care about
writer = pd.ExcelWriter("my-diff-2.xlsx")
diff_output.to_excel(writer,"changed")
removed_accounts.to_excel(writer,"removed",index=False,columns=["account number",
                                         "name","street","city","state","postal code"])
added_accounts.to_excel(writer,"added",index=False,columns=["account number",
                                         "name","street","city","state","postal code"])
writer.save()

Here is the final output excel file: my-diff-2

Conclusion

I would not be surprised if someone looks at this and finds a simpler way to do this. However, the final code is relatively straightforward and with minimal tweaks could be applied to your custom data set. I also think this was a good exercise for me to walk through and learn more about the various pandas functions and how to use them to solve my real world problem.

I hope it is as helpful to you as it was to me!

Changes

28-Jan-2019: New and updated code is available in a new article

Practical Business Python

Using Pandas To Create an Excel Diff

Introduction

Research

First Attempt

Scaling Up

Second Attempt

Conclusion

Changes

Comments

Subscribe to the mailing list

Social

Submit a Topic

Popular

Article Roadmap

Feeds

Disclosure