Practical Business Python

Taking care of business, one python script at a time

Mon 08 December 2014

Common Excel Tasks Demonstrated in Pandas - Part 2

Posted by Chris Moffitt in articles   

Introduction

I have been very excited by the response to the first post in this series. Thank you to all for the positive feedback. I want to keep the series going by highlighting some other tasks that you commonly execute in Excel and show how you can perform similar functions in pandas.

In the first article, I focused on common math tasks in Excel and their pandas counterparts. In this article, I’ll focus on some common selection and filtering tasks and illustrate how to do the same thing in pandas.

Getting Set Up

If you would like to follow along, you can download the Excel file.

Import the pandas and numpy modules.

import pandas as pd
import numpy as np

Load in the Excel data that represents a year’s worth of sales for our sample company.

df = pd.read_excel("sample-salesv3.xlsx")

Take a quick look at the data types to make sure everything came through as expected.

df.dtypes
account number      int64
name               object
sku                object
quantity            int64
unit price        float64
ext price         float64
date               object
dtype: object

You’ll notice that our date column is showing up as a generic object. We are going to convert it to a datetime object to make some future selections a little easier.

df['date'] = pd.to_datetime(df['date'])
df.head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55
df.dtypes
account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

The date is now a datetime object which will be useful in future steps.
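
If some of the date strings might be malformed, the conversion can be made more forgiving. The following is a minimal sketch, assuming a tiny made-up DataFrame (`sample` and its values are hypothetical, not part of the sales file). It uses the `errors="coerce"` option of `to_datetime` so that bad values become `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical three-row stand-in for the sales data; dates arrive as text.
sample = pd.DataFrame({
    "name": ["Barton LLC", "Kulas Inc", "Jerde-Hilpert"],
    "date": ["2014-01-01 07:21:51", "2014-01-01 13:24:58", "not a date"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

# Count how many values failed to parse so they can be inspected.
bad_dates = int(sample["date"].isna().sum())
print(sample.dtypes)
print(bad_dates)
```

Checking the count of `NaT` values right after the conversion is a cheap way to confirm the column came through as expected.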

Filtering the data

I think one of the handiest features in Excel is the filter. I imagine that almost anytime someone gets an Excel file of any size and they want to filter the data, they use this function.

Here is an image of using it for this data set:

Excel filter example

Similar to the filter function in Excel, you can use pandas to filter and select certain subsets of data.

For instance, if we want to just see a specific account number, we can easily do that with Excel or with pandas.

Here is the Excel filter solution:

Excel filter example

It is relatively straightforward to do in pandas. Note, I am going to use the head function to show the top results. This is purely for the purposes of keeping the article shorter.

df[df["account number"]==307599].head()
account number name sku quantity unit price ext price date
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
13 307599 Kassulke, Ondricka and Metz S2-10342 17 12.44 211.48 2014-01-04 07:53:01
34 307599 Kassulke, Ondricka and Metz S2-78676 35 33.04 1156.40 2014-01-10 05:26:31
58 307599 Kassulke, Ondricka and Metz B1-20000 22 37.87 833.14 2014-01-15 16:22:22
70 307599 Kassulke, Ondricka and Metz S2-10342 44 96.79 4258.76 2014-01-18 06:32:31

You could also do the filtering based on numeric values. I am not going to show any more Excel-based samples. I am sure you get the idea.

df[df["quantity"] > 22].head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.88 2014-01-04 08:57:48
15 239344 Stokes LLC S1-06532 34 71.51 2431.34 2014-01-04 11:34:58
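
It can help to see what happens inside the brackets. The comparison produces a Series of True/False values (a boolean mask), and indexing the DataFrame with that mask keeps only the True rows. Here is a small sketch on made-up data; the `sample` frame and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature version of the sales data.
sample = pd.DataFrame({
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc"],
    "quantity": [39, -1, 23],
})

# The comparison alone returns one True/False value per row.
mask = sample["quantity"] > 22
print(mask.tolist())

# Indexing with the mask keeps the rows where it is True.
big_orders = sample[mask]

# .loc accepts the same mask and can pick columns at the same time.
big_order_names = sample.loc[mask, "name"]
print(big_order_names.tolist())
```

Storing the mask in a variable makes longer filters easier to read and reuse.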

If we want to do more complex filtering, we can use map to apply a function to every value in a column and filter on the result. In this example, let’s look for items whose sku starts with B1.

df[df["sku"].map(lambda x: x.startswith('B1'))].head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.88 2014-01-04 08:57:48
17 239344 Stokes LLC B1-50809 14 16.23 227.22 2014-01-04 22:14:32
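
An alternative to map with a lambda is the vectorized `.str` accessor, which offers `startswith` directly. The sketch below uses a made-up `sample` frame (hypothetical data) that includes a missing sku, since a plain lambda calling `x.startswith` would fail on a missing value.

```python
import pandas as pd

# Hypothetical skus, including one missing value.
sample = pd.DataFrame({"sku": ["B1-20000", "S2-77896", "B1-69924", None]})

# .str.startswith checks every element; na=False counts missing
# values as "does not match" so the result is a clean boolean mask.
starts_b1 = sample["sku"].str.startswith("B1", na=False)

b1_rows = sample[starts_b1]
print(b1_rows)
```

On clean data both approaches should select the same rows, so this is mostly a matter of robustness and readability.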

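It’s easy to chain two or more conditions together using the & operator. Wrap comparisons such as df["quantity"] > 22 in parentheses, because & binds more tightly than the comparison operators.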

df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.88 2014-01-04 08:57:48
26 737550 Fritsch, Russel and Anderson B1-53636 42 42.06 1766.52 2014-01-08 00:02:11
31 714466 Trantow-Barrows B1-33087 32 19.56 625.92 2014-01-09 10:16:32
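
Besides & (and), boolean masks can be combined with | (or) and inverted with ~ (not). The following sketch shows all three on a small made-up frame; the data and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the sales data.
sample = pd.DataFrame({
    "sku": ["B1-20000", "S2-77896", "B1-65551", "S1-65481"],
    "quantity": [39, -1, 2, 41],
})

is_b1 = sample["sku"].str.startswith("B1")
is_large = sample["quantity"] > 22

both = sample[is_b1 & is_large]     # B1 skus with a large quantity
either = sample[is_b1 | is_large]   # B1 skus or any large quantity
not_b1 = sample[~is_b1]             # everything that is not a B1 sku

print(both.index.tolist(), either.index.tolist(), not_b1.index.tolist())
```

Naming each mask first sidesteps the parentheses issue, since the comparison has already been evaluated by the time the masks are combined.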

Another useful function that pandas supports is called isin. It allows us to define a list of values we want to look for.

In this case, we look for all records that include two specific account numbers.

df[df["account number"].isin([714466,218895])].head()
account number name sku quantity unit price ext price date
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
5 714466 Trantow-Barrows S2-77896 17 87.63 1489.71 2014-01-02 10:07:15
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23
8 714466 Trantow-Barrows S1-50961 22 84.09 1849.98 2014-01-03 11:29:02
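
The same isin mask can be inverted with ~ to get everything that is not in the list, much like unchecking boxes in the Excel filter. A minimal sketch on made-up data (the `sample` frame and the `wanted` list are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature version of the sales data.
sample = pd.DataFrame({
    "account number": [740150, 714466, 218895, 307599],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc",
             "Kassulke, Ondricka and Metz"],
})

wanted = [714466, 218895]
in_list = sample["account number"].isin(wanted)

selected = sample[in_list]    # rows whose account is in the list
excluded = sample[~in_list]   # rows whose account is not in the list

print(selected["name"].tolist())
print(excluded["name"].tolist())
```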

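Pandas supports another function called query, which lets you select subsets of data with a compact expression string. Depending on your pandas version it may require numexpr; recent releases use numexpr when it is available and otherwise fall back to a pure Python engine, so installing it is still worthwhile for larger data sets.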

If you would like to get a list of customers by name, you can do that with a query, similar to the python syntax shown above.

df.query('name == ["Kulas Inc","Barton LLC"]').head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23
33 218895 Kulas Inc S1-06532 3 22.36 67.08 2014-01-09 23:58:27
36 218895 Kulas Inc S2-34077 16 73.04 1168.64 2014-01-10 12:07:30

The query function allows you to do more than just this simple example, but for the purposes of this discussion, I’m showing it so you are aware that it is out there for your needs.
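
As a taste of what else query can do, here is a small sketch on made-up data. It assumes a reasonably recent pandas, where column names containing spaces can be wrapped in backticks and local Python variables can be referenced with an @ prefix. The `sample` frame, `customers` and `min_qty` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature version of the sales data.
sample = pd.DataFrame({
    "account number": [740150, 714466, 218895],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc"],
    "quantity": [39, -1, 23],
})

customers = ["Kulas Inc", "Barton LLC"]
min_qty = 20

# @customers and @min_qty refer to the local Python variables above.
by_name = sample.query("name in @customers and quantity > @min_qty")

# Backticks let the expression refer to a column name with a space.
by_account = sample.query("`account number` == 218895")

print(by_name["name"].tolist())
print(by_account["name"].tolist())
```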

Working with Dates

Using pandas, you can do complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results return what you are expecting.

df = df.sort_values('date')
df.head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

The python filtering syntax shown before works with dates.

df[df['date'] >='20140905'].head()
account number name sku quantity unit price ext price date
1042 163416 Purdy-Kunde B1-38851 41 98.69 4046.29 2014-09-05 01:52:32
1043 714466 Trantow-Barrows S1-30248 1 37.16 37.16 2014-09-05 06:17:19
1044 729833 Koepp Ltd S1-65481 48 16.04 769.92 2014-09-05 08:54:41
1045 729833 Koepp Ltd S2-11481 6 26.50 159.00 2014-09-05 16:33:15
1046 737550 Fritsch, Russel and Anderson B1-33364 4 76.44 305.76 2014-09-06 08:59:08

One of the really nice features of pandas is that it understands dates so it will allow us to do partial filtering. If we want to only look for data more recent than a specific month, we can do so.

df[df['date'] >='2014-03'].head()
account number name sku quantity unit price ext price date
242 163416 Purdy-Kunde S1-30248 19 65.03 1235.57 2014-03-01 16:07:40
243 527099 Sanford and Sons S2-82423 3 76.21 228.63 2014-03-01 17:18:01
244 527099 Sanford and Sons B1-50809 8 70.78 566.24 2014-03-01 18:53:09
245 737550 Fritsch, Russel and Anderson B1-50809 20 50.11 1002.20 2014-03-01 23:47:17
246 688981 Keeling LLC B1-86481 -1 97.16 -97.16 2014-03-02 01:46:44

Of course, you can chain the criteria.

df[(df['date'] >='20140701') & (df['date'] <= '20140715')].head()
account number name sku quantity unit price ext price date
778 737550 Fritsch, Russel and Anderson S1-65481 35 70.51 2467.85 2014-07-01 00:21:58
779 218895 Kulas Inc S1-30248 9 16.56 149.04 2014-07-01 00:52:38
780 163416 Purdy-Kunde S2-82423 44 68.27 3003.88 2014-07-01 08:15:52
781 672390 Kuhn-Gusikowski B1-04202 48 99.39 4770.72 2014-07-01 11:12:13
782 642753 Pollich LLC S2-23246 1 51.29 51.29 2014-07-02 04:02:39

Because pandas understands date columns, you can express the date value in multiple formats and it will give you the results you expect.

df[df['date'] >= 'Oct-2014'].head()
account number name sku quantity unit price ext price date
1168 307599 Kassulke, Ondricka and Metz S2-23246 6 88.90 533.40 2014-10-08 06:19:50
1169 424914 White-Trantow S2-10342 25 58.54 1463.50 2014-10-08 07:31:40
1170 163416 Purdy-Kunde S1-27722 22 34.41 757.02 2014-10-08 09:01:18
1171 163416 Purdy-Kunde B1-33087 7 79.29 555.03 2014-10-08 15:39:13
1172 672390 Kuhn-Gusikowski B1-38851 30 94.64 2839.20 2014-10-09 00:22:33
df[df['date'] >= '10-10-2014'].head()
account number name sku quantity unit price ext price date
1174 257198 Cronin, Oberbrunner and Spencer S2-34077 13 12.24 159.12 2014-10-10 02:59:06
1175 740150 Barton LLC S1-65481 28 53.00 1484.00 2014-10-10 15:08:53
1176 146832 Kiehn-Spinka S1-27722 15 64.39 965.85 2014-10-10 18:24:01
1177 257198 Cronin, Oberbrunner and Spencer S2-16558 3 35.34 106.02 2014-10-11 01:48:13
1178 737550 Fritsch, Russel and Anderson B1-53636 10 56.95 569.50 2014-10-11 10:25:53
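
One detail worth keeping in mind with ranges: a date without a time means midnight at the start of that day, so an upper bound written as <= July 15 leaves out rows stamped later on July 15. The sketch below, on made-up rows (all names and values are hypothetical), shows that effect and one way around it using an exclusive upper bound on the following day.

```python
import pandas as pd

# Hypothetical rows around the edges of a July 1 to July 15 range.
sample = pd.DataFrame({
    "name": ["Purdy-Kunde", "Kulas Inc", "Pollich LLC", "Barton LLC"],
    "date": pd.to_datetime([
        "2014-06-30 23:10:00", "2014-07-01 00:52:38",
        "2014-07-15 09:00:00", "2014-07-16 08:00:00",
    ]),
})

# <= midnight on July 15 drops the 09:00 order placed that day.
midnight_cutoff = sample[sample["date"] <= pd.Timestamp("2014-07-15")]

# Half-open range: from July 1 up to, but not including, July 16.
start = pd.Timestamp("2014-07-01")
end = pd.Timestamp("2014-07-16")
in_range = sample[(sample["date"] >= start) & (sample["date"] < end)]

print(midnight_cutoff["name"].tolist())
print(in_range["name"].tolist())
```

Using explicit `pd.Timestamp` values also removes any doubt about how a string such as '10-10-2014' is interpreted.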

When working with time series data, if we convert the data to use the date as the index, we can do some more filtering variations.

Set the new index using set_index.

df2 = df.set_index(['date'])
df2.head()
account number name sku quantity unit price ext price
date
2014-01-01 07:21:51 740150 Barton LLC B1-20000 39 86.69 3380.91
2014-01-01 10:00:47 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16
2014-01-01 13:24:58 218895 Kulas Inc B1-69924 23 90.70 2086.10
2014-01-01 15:05:22 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05
2014-01-01 23:26:55 412290 Jerde-Hilpert S2-34077 6 83.21 499.26

We can slice the data to get a range.

df2["20140101":"20140201"].head()
account number name sku quantity unit price ext price
date
2014-01-01 07:21:51 740150 Barton LLC B1-20000 39 86.69 3380.91
2014-01-01 10:00:47 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16
2014-01-01 13:24:58 218895 Kulas Inc B1-69924 23 90.70 2086.10
2014-01-01 15:05:22 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05
2014-01-01 23:26:55 412290 Jerde-Hilpert S2-34077 6 83.21 499.26

Once again, we can use various date representations to remove any ambiguity around date naming conventions.

df2["2014-Jan-1":"2014-Feb-1"].head()
account number name sku quantity unit price ext price
date
2014-01-01 07:21:51 740150 Barton LLC B1-20000 39 86.69 3380.91
2014-01-01 10:00:47 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16
2014-01-01 13:24:58 218895 Kulas Inc B1-69924 23 90.70 2086.10
2014-01-01 15:05:22 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05
2014-01-01 23:26:55 412290 Jerde-Hilpert S2-34077 6 83.21 499.26
df2["2014-Jan-1":"2014-Feb-1"].tail()
account number name sku quantity unit price ext price
date
2014-01-31 22:51:18 383080 Will LLC B1-05914 43 80.17 3447.31
2014-02-01 09:04:59 383080 Will LLC B1-20000 7 33.69 235.83
2014-02-01 11:51:46 412290 Jerde-Hilpert S1-27722 11 21.12 232.32
2014-02-01 17:24:32 412290 Jerde-Hilpert B1-86481 3 35.99 107.97
2014-02-01 19:56:48 412290 Jerde-Hilpert B1-20000 23 78.90 1814.70
df2.loc["2014"].head()
account number name sku quantity unit price ext price
date
2014-01-01 07:21:51 740150 Barton LLC B1-20000 39 86.69 3380.91
2014-01-01 10:00:47 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16
2014-01-01 13:24:58 218895 Kulas Inc B1-69924 23 90.70 2086.10
2014-01-01 15:05:22 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05
2014-01-01 23:26:55 412290 Jerde-Hilpert S2-34077 6 83.21 499.26
df2.loc["2014-Dec"].head()
account number name sku quantity unit price ext price
date
2014-12-01 20:15:34 714466 Trantow-Barrows S1-82801 3 77.97 233.91
2014-12-02 20:00:04 146832 Kiehn-Spinka S2-23246 37 57.81 2138.97
2014-12-03 04:43:53 218895 Kulas Inc S2-77896 30 77.44 2323.20
2014-12-03 06:05:43 141962 Herman LLC B1-53102 20 26.12 522.40
2014-12-03 14:17:34 642753 Pollich LLC B1-53636 19 71.21 1352.99
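
To make the date-index selections easy to experiment with, here is a self-contained sketch on a made-up frame with a sorted DatetimeIndex. It uses `.loc`, which current pandas expects for selecting rows by a partial date string. The `sample` data is hypothetical.

```python
import pandas as pd

# Hypothetical quantities indexed by order timestamp.
sample = pd.DataFrame(
    {"quantity": [39, 7, 23, 3]},
    index=pd.to_datetime([
        "2014-01-01 07:21:51", "2014-01-31 22:51:18",
        "2014-02-01 19:56:48", "2014-12-01 20:15:34",
    ]),
).sort_index()

# A partial string selects every row in that month.
january = sample.loc["2014-01"]

# In a slice, the end date covers the whole day, so the
# February 1 evening order is included.
jan_to_feb1 = sample.loc["2014-01-01":"2014-02-01"]

december = sample.loc["2014-12"]

print(len(january), len(jan_to_feb1), len(december))
```

This matches the tail output shown earlier, where orders placed during February 1 appear in the slice ending on that date.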

As you can see, there are a lot of options when it comes to sorting and filtering based on dates.

Additional String Functions

Pandas has support for vectorized string functions as well.

If we want to identify all the skus that contain a certain value, we can use str.contains. In this case, we know that the sku is always represented in the same way, so B1 only shows up at the front of the sku. You need to understand your data to make sure you are getting back what you expect.

df[df['sku'].str.contains('B1')].head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
6 218895 Kulas Inc B1-65551 2 31.10 62.20 2014-01-02 10:57:23
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.88 2014-01-04 08:57:48
17 239344 Stokes LLC B1-50809 14 16.23 227.22 2014-01-04 22:14:32

We can string queries together and use sort_values to control how the data is ordered.

df[(df['sku'].str.contains('B1-531')) & (df['quantity']>40)].sort_values(by=['quantity','name'], ascending=[False, True])
account number name sku quantity unit price ext price date
684 642753 Pollich LLC B1-53102 46 26.07 1199.22 2014-06-08 19:33:33
792 688981 Keeling LLC B1-53102 45 41.19 1853.55 2014-07-04 21:42:22
176 383080 Will LLC B1-53102 45 89.22 4014.90 2014-02-11 04:14:09
1213 604255 Halvorson, Crona and Champlin B1-53102 41 55.05 2257.05 2014-10-18 19:27:01
1215 307599 Kassulke, Ondricka and Metz B1-53102 41 93.70 3841.70 2014-10-18 23:25:10
1128 714466 Trantow-Barrows B1-53102 41 55.68 2282.88 2014-09-27 10:42:48
1001 424914 White-Trantow B1-53102 41 81.25 3331.25 2014-08-26 11:44:30
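
If the data were less tidy, str.contains could match more than intended, because by default it treats the pattern as a regular expression and matches anywhere in the string. The sketch below uses made-up skus (hypothetical values chosen to show the edge cases) to compare a plain match, a match anchored to the start with ^, and a literal match with regex=False.

```python
import pandas as pd

# Hypothetical skus: one has "B1" in the middle, one contains a dot.
sample = pd.DataFrame(
    {"sku": ["B1-20000", "S2-77B10", "B1-53102", "S1.65481"]}
)

# Matches "B1" anywhere, including the middle of "S2-77B10".
anywhere = sample["sku"].str.contains("B1")

# ^ anchors the pattern to the start of the string.
at_start = sample["sku"].str.contains(r"^B1")

# regex=False treats "." as a literal dot rather than "any character".
literal_dot = sample["sku"].str.contains(".", regex=False)

print(anywhere.tolist())
print(at_start.tolist())
print(literal_dot.tolist())
```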

Bonus Task

I frequently find myself trying to get a list of unique items in a long list within Excel. It is a multi-step process to do this in Excel but is fairly simple in pandas. Here is one way to do this using the Advanced Filter in Excel.

Excel filter example

In pandas, we use the unique function on a column to get the list.

df["name"].unique()
array([u'Barton LLC', u'Trantow-Barrows', u'Kulas Inc',
       u'Kassulke, Ondricka and Metz', u'Jerde-Hilpert', u'Koepp Ltd',
       u'Fritsch, Russel and Anderson', u'Kiehn-Spinka', u'Keeling LLC',
       u'Frami, Hills and Schmidt', u'Stokes LLC', u'Kuhn-Gusikowski',
       u'Herman LLC', u'White-Trantow', u'Sanford and Sons',
       u'Pollich LLC', u'Will LLC', u'Cronin, Oberbrunner and Spencer',
       u'Halvorson, Crona and Champlin', u'Purdy-Kunde'], dtype=object)

If we wanted to include the account number, we could use drop_duplicates.

df.drop_duplicates(subset=["account number","name"]).head()
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

We are obviously pulling in more data than we need and getting some non-useful information, so select only the first and second columns by position using iloc.

df.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1]]
account number name
0 740150 Barton LLC
1 714466 Trantow-Barrows
2 218895 Kulas Inc
3 307599 Kassulke, Ondricka and Metz
4 412290 Jerde-Hilpert
7 729833 Koepp Ltd
9 737550 Fritsch, Russel and Anderson
10 146832 Kiehn-Spinka
11 688981 Keeling LLC
12 786968 Frami, Hills and Schmidt
15 239344 Stokes LLC
16 672390 Kuhn-Gusikowski
18 141962 Herman LLC
20 424914 White-Trantow
21 527099 Sanford and Sons
30 642753 Pollich LLC
37 383080 Will LLC
51 257198 Cronin, Oberbrunner and Spencer
67 604255 Halvorson, Crona and Champlin
106 163416 Purdy-Kunde

I think this single command is easier to maintain than trying to remember the Excel steps every time.
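
Another way to arrive at the same customer list is to select the two columns by name first and then drop the duplicates, which avoids relying on column positions. This is only a sketch on made-up data; the `sample` frame and the `customers` name are hypothetical.

```python
import pandas as pd

# Hypothetical orders with one repeated customer.
sample = pd.DataFrame({
    "account number": [740150, 714466, 740150, 218895],
    "name": ["Barton LLC", "Trantow-Barrows", "Barton LLC", "Kulas Inc"],
    "quantity": [39, -1, 28, 23],
})

# Pick the columns by name, remove repeated pairs, and renumber the rows.
customers = (
    sample[["account number", "name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(customers)
```

Selecting by name keeps working even if the column order in the spreadsheet changes later.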

If you would like to view the notebook, feel free to download it.

Conclusion

After I posted my first article, Dave Proffer retweeted my post and said “Good tips 2 break ur #excel addiction”. I think this is an accurate way to describe how Excel is frequently used today. So many people reach for it right away without realizing how limiting it can be. I hope this series helps people understand that there are alternatives out there and that python+pandas is an extremely powerful combination.


 