Common Excel Tasks Demonstrated in Pandas - Part 2

Introduction

I have been very excited by the response to the first post in this series. Thank you to all for the positive feedback. I want to keep the series going by highlighting some other tasks that you commonly execute in Excel and showing how you can perform similar functions in pandas.

In the first article, I focused on common math tasks in Excel and their pandas counterparts. In this article, I’ll focus on some common selection and filtering tasks and illustrate how to do the same thing in pandas.

Getting Set Up

If you would like to follow along, you can download the Excel file.

Import the pandas and numpy modules.

import pandas as pd
import numpy as np

Load in the Excel data that represents a year’s worth of sales for our sample company.

df = pd.read_excel("sample-salesv3.xlsx")

Take a quick look at the data types to make sure everything came through as expected.

df.dtypes

account number      int64
name               object
sku                object
quantity            int64
unit price        float64
ext price         float64
date               object
dtype: object

You’ll notice that our date column is showing up as a generic object. We are going to convert it to a datetime object to make some future selections a little easier.

df['date'] = pd.to_datetime(df['date'])
df.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

df.dtypes

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
dtype: object

The date is now a datetime object which will be useful in future steps.
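
If a date column contains values that cannot be parsed, to_datetime raises an error by default. Here is a minimal sketch of one way to handle that. The sample frame and its rows are made up for illustration and are not part of the sales file.

```python
import pandas as pd

# Hypothetical three-row sample; the last date is deliberately invalid.
sample = pd.DataFrame({
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc"],
    "date": ["2014-01-01 07:21:51", "2014-01-01 10:00:47", "not a date"],
})

# errors="coerce" turns unparseable values into NaT instead of raising.
sample["date"] = pd.to_datetime(sample["date"], errors="coerce")

print(sample.dtypes)
print(sample["date"].isna().sum())  # how many values failed to parse
```

Counting the NaT values right after the conversion is a quick way to check that everything came through as expected.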

Filtering the data

I think one of the handiest features in Excel is the filter. I imagine that almost anytime someone gets an Excel file of any size and they want to filter the data, they use this function.

[Image: the Excel filter applied to this data set]

Similar to the filter function in Excel, you can use pandas to filter and select certain subsets of data.

For instance, if we want to just see a specific account number, we can easily do that with Excel or with pandas.

[Image: the Excel filter solution for a single account number]

It is relatively straightforward to do in pandas. Note, I am going to use the head function to show the top results. This is purely for the purposes of keeping the article shorter.

df[df["account number"]==307599].head()

	account number	name	sku	quantity	unit price	ext price	date
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
13	307599	Kassulke, Ondricka and Metz	S2-10342	17	12.44	211.48	2014-01-04 07:53:01
34	307599	Kassulke, Ondricka and Metz	S2-78676	35	33.04	1156.40	2014-01-10 05:26:31
58	307599	Kassulke, Ondricka and Metz	B1-20000	22	37.87	833.14	2014-01-15 16:22:22
70	307599	Kassulke, Ondricka and Metz	S2-10342	44	96.79	4258.76	2014-01-18 06:32:31

You could also do the filtering based on numeric values. I am not going to show any more Excel-based samples. I am sure you get the idea.

df[df["quantity"] > 22].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
15	239344	Stokes LLC	S1-06532	34	71.51	2431.34	2014-01-04 11:34:58
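
It may help to see what is happening inside the square brackets. The comparison produces a boolean Series, and pandas keeps only the rows where it is True. This is a small sketch on a made-up four-row frame (the sales and mask names are mine).

```python
import pandas as pd

# Hypothetical sample rows modeled on the sales data.
sales = pd.DataFrame({
    "account number": [740150, 714466, 218895, 307599],
    "quantity": [39, -1, 23, 41],
})

# The comparison returns one True/False value per row.
mask = sales["quantity"] > 22
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True.
big_orders = sales[mask]
print(big_orders)
```

Storing the mask in a variable also makes it easy to reuse the same condition in later steps.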

If we want to do more complex filtering, we can use map to filter on various criteria. In this example, let’s look for items with SKUs that start with B1.

df[df["sku"].map(lambda x: x.startswith('B1'))].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
17	239344	Stokes LLC	B1-50809	14	16.23	227.22	2014-01-04 22:14:32

It’s easy to chain two or more statements together using the &.

df[df["sku"].map(lambda x: x.startswith('B1')) & (df["quantity"] > 22)].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
26	737550	Fritsch, Russel and Anderson	B1-53636	42	42.06	1766.52	2014-01-08 00:02:11
31	714466	Trantow-Barrows	B1-33087	32	19.56	625.92	2014-01-09 10:16:32
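
As a sketch of the same idea, the map and lambda can be swapped for the vectorized str.startswith, and conditions can be combined with | (or) and ~ (not) as well as &. The frame below is made up for illustration. Note that each comparison needs its own parentheses when written inline, because & binds more tightly than >.

```python
import pandas as pd

# Hypothetical sample rows modeled on the sales data.
sales = pd.DataFrame({
    "sku": ["B1-20000", "S2-77896", "B1-69924", "B1-65551"],
    "quantity": [39, -1, 23, 2],
})

is_b1 = sales["sku"].str.startswith("B1")  # vectorized version of the lambda
is_large = sales["quantity"] > 22

both = sales[is_b1 & is_large]    # B1 sku AND quantity over 22
either = sales[is_b1 | is_large]  # B1 sku OR quantity over 22
not_b1 = sales[~is_b1]            # everything that is NOT a B1 sku

print(both)
print(either)
print(not_b1)
```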

Another useful function that pandas supports is called isin. It allows us to define a list of values we want to look for.

In this case, we look for all records that include two specific account numbers.

df[df["account number"].isin([714466,218895])].head()

	account number	name	sku	quantity	unit price	ext price	date
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
5	714466	Trantow-Barrows	S2-77896	17	87.63	1489.71	2014-01-02 10:07:15
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
8	714466	Trantow-Barrows	S1-50961	22	84.09	1849.98	2014-01-03 11:29:02
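
The reverse question, everything except those accounts, can be sketched by negating the isin result with ~. The rows and the accounts list below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows modeled on the sales data.
sales = pd.DataFrame({
    "account number": [740150, 714466, 218895, 307599, 714466],
    "quantity": [39, -1, 23, 41, 17],
})

accounts = [714466, 218895]

# Rows whose account number is in the list.
selected = sales[sales["account number"].isin(accounts)]

# ~ flips the boolean mask, so this keeps every other account.
others = sales[~sales["account number"].isin(accounts)]

print(selected)
print(others)
```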

Pandas supports another function called query which allows you to efficiently select subsets of data. It can use numexpr to speed up evaluation on larger data sets. In recent pandas versions numexpr is optional, and query falls back to a plain python engine when it is not installed.

If you would like to get a list of customers by name, you can do that with a query, similar to the python syntax shown above.

df.query('name == ["Kulas Inc","Barton LLC"]').head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
33	218895	Kulas Inc	S1-06532	3	22.36	67.08	2014-01-09 23:58:27
36	218895	Kulas Inc	S2-34077	16	73.04	1168.64	2014-01-10 12:07:30

The query function allows you to do more than just this simple example but for the purposes of this discussion, I’m showing it so you are aware that it is out there for your needs.
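
As one example of what else query can do, here is a minimal sketch that refers to python variables with @ and to a column name containing a space with backticks. It assumes a reasonably recent pandas version, and the sample rows and the target and min_qty variables are made up.

```python
import pandas as pd

# Hypothetical sample rows modeled on the sales data.
sales = pd.DataFrame({
    "account number": [740150, 714466, 218895, 218895],
    "name": ["Barton LLC", "Trantow-Barrows", "Kulas Inc", "Kulas Inc"],
    "quantity": [39, -1, 23, 2],
})

target = 218895
min_qty = 10

# Backticks quote a column name with a space; @ pulls in a local variable.
result = sales.query("`account number` == @target and quantity > @min_qty")
print(result)
```

Keeping the values in variables avoids building the query string by hand each time the criteria change.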

Working with Dates

Using pandas, you can do complex filtering on dates. Before doing anything with dates, I encourage you to sort by the date column to make sure the results return what you are expecting.

df = df.sort_values(by=['date'])
df.head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

The python filtering syntax shown before works with dates.

df[df['date'] >='20140905'].head()

	account number	name	sku	quantity	unit price	ext price	date
1042	163416	Purdy-Kunde	B1-38851	41	98.69	4046.29	2014-09-05 01:52:32
1043	714466	Trantow-Barrows	S1-30248	1	37.16	37.16	2014-09-05 06:17:19
1044	729833	Koepp Ltd	S1-65481	48	16.04	769.92	2014-09-05 08:54:41
1045	729833	Koepp Ltd	S2-11481	6	26.50	159.00	2014-09-05 16:33:15
1046	737550	Fritsch, Russel and Anderson	B1-33364	4	76.44	305.76	2014-09-06 08:59:08

One of the really nice features of pandas is that it understands dates so it will allow us to do partial filtering. If we want to only look for data more recent than a specific month, we can do so.

df[df['date'] >='2014-03'].head()

	account number	name	sku	quantity	unit price	ext price	date
242	163416	Purdy-Kunde	S1-30248	19	65.03	1235.57	2014-03-01 16:07:40
243	527099	Sanford and Sons	S2-82423	3	76.21	228.63	2014-03-01 17:18:01
244	527099	Sanford and Sons	B1-50809	8	70.78	566.24	2014-03-01 18:53:09
245	737550	Fritsch, Russel and Anderson	B1-50809	20	50.11	1002.20	2014-03-01 23:47:17
246	688981	Keeling LLC	B1-86481	-1	97.16	-97.16	2014-03-02 01:46:44

Of course, you can chain the criteria.

df[(df['date'] >='20140701') & (df['date'] <= '20140715')].head()

	account number	name	sku	quantity	unit price	ext price	date
778	737550	Fritsch, Russel and Anderson	S1-65481	35	70.51	2467.85	2014-07-01 00:21:58
779	218895	Kulas Inc	S1-30248	9	16.56	149.04	2014-07-01 00:52:38
780	163416	Purdy-Kunde	S2-82423	44	68.27	3003.88	2014-07-01 08:15:52
781	672390	Kuhn-Gusikowski	B1-04202	48	99.39	4770.72	2014-07-01 11:12:13
782	642753	Pollich LLC	S2-23246	1	51.29	51.29	2014-07-02 04:02:39
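
One detail to watch with ranges like this: a date string without a time is treated as midnight, so a <= comparison against the last day leaves out anything later on that day. The sketch below uses made-up rows to show the difference between that and a half-open range that ends before the following day.

```python
import pandas as pd

# Hypothetical sample rows around the July 1 to July 15 window.
sales = pd.DataFrame({
    "date": pd.to_datetime([
        "2014-06-30 23:59:00",
        "2014-07-01 00:21:58",
        "2014-07-15 14:30:00",
        "2014-07-16 09:00:00",
    ]),
    "quantity": [5, 35, 9, 44],
})

# "2014-07-15" means 2014-07-15 00:00:00, so the 14:30 row is excluded.
closed = sales[(sales["date"] >= "2014-07-01") & (sales["date"] <= "2014-07-15")]

# Ending before the next day keeps all of July 15.
half_open = sales[(sales["date"] >= "2014-07-01") & (sales["date"] < "2014-07-16")]

print(closed)
print(half_open)
```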

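Because pandas understands date columns, you can express the date value in multiple formats. Be careful with strings that leave out the day, though. In the first output below, the rows returned for 'Oct-2014' start on 2014-10-08 rather than 2014-10-01, which suggests the string was not interpreted as the first of the month. A fully specified date avoids that kind of surprise.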

df[df['date'] >= 'Oct-2014'].head()

	account number	name	sku	quantity	unit price	ext price	date
1168	307599	Kassulke, Ondricka and Metz	S2-23246	6	88.90	533.40	2014-10-08 06:19:50
1169	424914	White-Trantow	S2-10342	25	58.54	1463.50	2014-10-08 07:31:40
1170	163416	Purdy-Kunde	S1-27722	22	34.41	757.02	2014-10-08 09:01:18
1171	163416	Purdy-Kunde	B1-33087	7	79.29	555.03	2014-10-08 15:39:13
1172	672390	Kuhn-Gusikowski	B1-38851	30	94.64	2839.20	2014-10-09 00:22:33

df[df['date'] >= '10-10-2014'].head()

	account number	name	sku	quantity	unit price	ext price	date
1174	257198	Cronin, Oberbrunner and Spencer	S2-34077	13	12.24	159.12	2014-10-10 02:59:06
1175	740150	Barton LLC	S1-65481	28	53.00	1484.00	2014-10-10 15:08:53
1176	146832	Kiehn-Spinka	S1-27722	15	64.39	965.85	2014-10-10 18:24:01
1177	257198	Cronin, Oberbrunner and Spencer	S2-16558	3	35.34	106.02	2014-10-11 01:48:13
1178	737550	Fritsch, Russel and Anderson	B1-53636	10	56.95	569.50	2014-10-11 10:25:53

When working with time series data, if we convert the data to use the date as the index, we can do some more filtering variations.

Set the new index using set_index.

df2 = df.set_index(['date'])
df2.head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

We can slice the data to get a range.

df2["20140101":"20140201"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

Once again, we can use various date representations to remove any ambiguity around date naming conventions.

df2["2014-Jan-1":"2014-Feb-1"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

df2["2014-Jan-1":"2014-Feb-1"].tail()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-31 22:51:18	383080	Will LLC	B1-05914	43	80.17	3447.31
2014-02-01 09:04:59	383080	Will LLC	B1-20000	7	33.69	235.83
2014-02-01 11:51:46	412290	Jerde-Hilpert	S1-27722	11	21.12	232.32
2014-02-01 17:24:32	412290	Jerde-Hilpert	B1-86481	3	35.99	107.97
2014-02-01 19:56:48	412290	Jerde-Hilpert	B1-20000	23	78.90	1814.70

df2.loc["2014"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-01-01 07:21:51	740150	Barton LLC	B1-20000	39	86.69	3380.91
2014-01-01 10:00:47	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16
2014-01-01 13:24:58	218895	Kulas Inc	B1-69924	23	90.70	2086.10
2014-01-01 15:05:22	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05
2014-01-01 23:26:55	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26

df2.loc["2014-Dec"].head()

	account number	name	sku	quantity	unit price	ext price
date
2014-12-01 20:15:34	714466	Trantow-Barrows	S1-82801	3	77.97	233.91
2014-12-02 20:00:04	146832	Kiehn-Spinka	S2-23246	37	57.81	2138.97
2014-12-03 04:43:53	218895	Kulas Inc	S2-77896	30	77.44	2323.20
2014-12-03 06:05:43	141962	Herman LLC	B1-53102	20	26.12	522.40
2014-12-03 14:17:34	642753	Pollich LLC	B1-53636	19	71.21	1352.99
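
Here is a small self-contained sketch of these date index lookups using .loc, which is how recent pandas versions expect partial date strings to be passed. The four rows and the ts name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with a sorted DatetimeIndex.
ts = pd.DataFrame(
    {"quantity": [39, 7, 23, 3]},
    index=pd.to_datetime([
        "2014-01-01 07:21:51",
        "2014-02-01 09:04:59",
        "2014-02-01 19:56:48",
        "2014-12-01 20:15:34",
    ]),
)
ts.index.name = "date"

# A slice of date strings includes all of the end day (Feb 1 here).
jan_through_feb1 = ts.loc["2014-01-01":"2014-02-01"]

# A partial string selects everything in that month or year.
december = ts.loc["2014-12"]
whole_year = ts.loc["2014"]

print(jan_through_feb1)
print(december)
print(len(whole_year))
```

Unlike the column comparisons earlier, the slice on the index keeps every record on the end date, which matches the tail output shown above.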

As you can see, there are a lot of options when it comes to sorting and filtering based on dates.

Additional String Functions

Pandas has support for vectorized string functions as well.

If we want to identify all the SKUs that contain a certain value, we can use str.contains. In this case, we know that the sku is always represented in the same way, so B1 only shows up at the front of the sku. You need to understand your data to make sure you are getting back what you expected.

df[df['sku'].str.contains('B1')].head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
6	218895	Kulas Inc	B1-65551	2	31.10	62.20	2014-01-02 10:57:23
14	737550	Fritsch, Russel and Anderson	B1-53102	23	71.56	1645.88	2014-01-04 08:57:48
17	239344	Stokes LLC	B1-50809	14	16.23	227.22	2014-01-04 22:14:32

We can string queries together and use sort_values to control how the data is ordered.

df[(df['sku'].str.contains('B1-531')) & (df['quantity'] > 40)].sort_values(by=['quantity', 'name'], ascending=[False, True])

	account number	name	sku	quantity	unit price	ext price	date
684	642753	Pollich LLC	B1-53102	46	26.07	1199.22	2014-06-08 19:33:33
792	688981	Keeling LLC	B1-53102	45	41.19	1853.55	2014-07-04 21:42:22
176	383080	Will LLC	B1-53102	45	89.22	4014.90	2014-02-11 04:14:09
1213	604255	Halvorson, Crona and Champlin	B1-53102	41	55.05	2257.05	2014-10-18 19:27:01
1215	307599	Kassulke, Ondricka and Metz	B1-53102	41	93.70	3841.70	2014-10-18 23:25:10
1128	714466	Trantow-Barrows	B1-53102	41	55.68	2282.88	2014-09-27 10:42:48
1001	424914	White-Trantow	B1-53102	41	81.25	3331.25	2014-08-26 11:44:30
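
To illustrate why it pays to understand your data, here is a sketch of how str.contains differs from str.startswith. By default str.contains matches anywhere in the string and treats the pattern as a regular expression. The SKU values below are invented to trigger the differences and do not come from the sales file.

```python
import pandas as pd

# Hypothetical SKUs chosen to show the edge cases.
skus = pd.Series(["B1-53102", "S2-77896", "S1-0B1-9", "B1.5-100"])

# Matches "B1" anywhere, including the middle of the third value.
anywhere = skus.str.contains("B1")

# Matches only values that begin with "B1".
prefix = skus.str.startswith("B1")

# As a regex, "." matches any character, so "B1.5" also matches "B1-5".
as_regex = skus.str.contains("B1.5")

# regex=False treats the pattern as literal text.
as_literal = skus.str.contains("B1.5", regex=False)

print(anywhere.tolist())
print(prefix.tolist())
print(as_regex.tolist())
print(as_literal.tolist())
```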

Bonus Task

I frequently find myself trying to get a list of unique items in a long list within Excel. It is a multi-step process to do this in Excel but is fairly simple in pandas. Here is one way to do this using the Advanced Filter in Excel.

In pandas, we use the unique function on a column to get the list.

df["name"].unique()

array([u'Barton LLC', u'Trantow-Barrows', u'Kulas Inc',
       u'Kassulke, Ondricka and Metz', u'Jerde-Hilpert', u'Koepp Ltd',
       u'Fritsch, Russel and Anderson', u'Kiehn-Spinka', u'Keeling LLC',
       u'Frami, Hills and Schmidt', u'Stokes LLC', u'Kuhn-Gusikowski',
       u'Herman LLC', u'White-Trantow', u'Sanford and Sons',
       u'Pollich LLC', u'Will LLC', u'Cronin, Oberbrunner and Spencer',
       u'Halvorson, Crona and Champlin', u'Purdy-Kunde'], dtype=object)

If we wanted to include the account number, we could use drop_duplicates.

df.drop_duplicates(subset=["account number","name"]).head()

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

We are obviously pulling in more data than we need and getting some non-useful information, so select only the first and second columns using iloc.

df.drop_duplicates(subset=["account number","name"]).iloc[:,[0,1]]

	account number	name
0	740150	Barton LLC
1	714466	Trantow-Barrows
2	218895	Kulas Inc
3	307599	Kassulke, Ondricka and Metz
4	412290	Jerde-Hilpert
7	729833	Koepp Ltd
9	737550	Fritsch, Russel and Anderson
10	146832	Kiehn-Spinka
11	688981	Keeling LLC
12	786968	Frami, Hills and Schmidt
15	239344	Stokes LLC
16	672390	Kuhn-Gusikowski
18	141962	Herman LLC
20	424914	White-Trantow
21	527099	Sanford and Sons
30	642753	Pollich LLC
37	383080	Will LLC
51	257198	Cronin, Oberbrunner and Spencer
67	604255	Halvorson, Crona and Champlin
106	163416	Purdy-Kunde
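
An alternative sketch, if you prefer column names over positions, is to select the two columns first and then drop the duplicates. The rows below are made up, and resetting the index is just my preference for a clean 0, 1, 2 numbering.

```python
import pandas as pd

# Hypothetical sample rows with repeated customers.
sales = pd.DataFrame({
    "account number": [740150, 714466, 740150, 218895, 714466],
    "name": ["Barton LLC", "Trantow-Barrows", "Barton LLC",
             "Kulas Inc", "Trantow-Barrows"],
    "quantity": [39, -1, 28, 23, 17],
})

# Select by name, drop repeated pairs, then renumber the rows.
pairs = (
    sales[["account number", "name"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(pairs)
print(sales["name"].nunique())  # count of distinct customer names
```

Selecting by name keeps working if the column order in the spreadsheet ever changes.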

I think this single command is easier to maintain than trying to remember the Excel steps every time.

If you would like to view the notebook, feel free to download it.

Conclusion

After I posted my first article, Dave Proffer retweeted my post and said “Good tips 2 break ur #excel addiction”. I think this is an accurate way to describe how Excel is frequently used today. So many people reach for it right away without realizing how limiting it can be. I hope this series helps people understand that there are alternatives out there and that python+pandas is an extremely powerful combination.

Changes

29-Nov-2020: Updated code to represent using sort_values and removing reference to ix

Practical Business Python
