Practical Business Python

Taking care of business, one python script at a time

Mon 07 January 2019

Using The Pandas Category Data Type

Posted by Chris Moffitt in articles   

article header image

Introduction

In my previous article, I wrote about pandas data types; what they are and how to convert data to the appropriate type. This article will focus on the pandas categorical data type and some of the benefits and drawbacks of using it.

Pandas Category Data Type

To refresh your memory, here is a summary table of the various pandas data types (aka dtypes).

Pandas dtype mapping
Pandas dtype Python type NumPy type Usage
object str string_, unicode_ Text
int64 int int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 Integer numbers
float64 float float_, float16, float32, float64 Floating point numbers
bool bool bool_ True/False values
datetime64 NA datetime64[ns] Date and time values
timedelta[ns] NA NA Differences between two datetimes
category NA NA Finite list of text values

This article will focus on categorical data. As a quick refresher, categorical data is data which takes on a finite number of possible values. For example, if we were talking about a physical product like a t-shirt, it could have categorical variables such as:

  • Size (X-Small, Small, Medium, Large, X-Large)
  • Color (Red, Black, White)
  • Style (Short sleeve, long sleeve)
  • Material (Cotton, Polyester)

Attributes such as cost, price, quantity are typically integers or floats.

The key take away is that whether or not a variable is categorical depends on its application. Since we only have 3 colors of shirts, then that is a good categorical variable. However, “color” could represent thousands of values in other situations so it would not be a good choice.

There is no hard and fast rule for how many values a categorical value should have. You should apply your domain knowledge to make that determination on your own data sets. In this article, we will look at one approach for identifying categorical values.

The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and to more efficiently store the data.

At the end of the day why do we care about using categorical values? There are 3 main reasons:

  • We can define a custom sort order which can improve summarizing and reporting the data. In the example above, “X-Small” < “Small” < “Medium” < “Large” < “X-Large”. Alphabetical sorting would not be able to reproduce that order.
  • Some of the python visualization libraries can interpret the categorical data type to apply approrpiate statistical models or plot types.
  • Categorical data uses less memory which can lead to performance improvements.

While categorical data is very handy in pandas. It is not necessary for every type of analysis. In fact, there can be some edge cases where defining a column of data as categorical then manipulating the dataframe can lead to some surprising results. Care must be taken to understand the data set and the necessary analysis before converting columns to categorical data types.

Data Preparation

One of the main use cases for categorical data types is more efficient memory usage. In order to demonstrate, we will use a large data set from the US Centers for Medicare and Medicaid Services. This data set includes a 500MB+ csv file that has information about research payments to doctors and hospital in fiscal year 2017.

First, set up imports and read in all the data:

import pandas as pd
from pandas.api.types import CategoricalDtype

df_raw = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', low_memory=False)

I have included the low_memory=False parameter in order to surpress this warning:

interactiveshell.py:2728: DtypeWarning: Columns (..) have mixed types. Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)

Feel free to read more about this parameter in the pandas read_csv documentation.

One interesting thing about this data set is that it has over 176 columns but many of them are empty. I found a stack overflow solution to quickly drop all the columns where at least 90% of the data is empty. I thought this might be handy for others as well.

drop_thresh = df_raw.shape[0]*.9
df = df_raw.dropna(thresh=drop_thresh, how='all', axis='columns').copy()

Let’s take a look at the size of these various dataframes. Here is the original data set:

df_raw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Columns: 176 entries, Change_Type to Context_of_Research
dtypes: float64(34), int64(3), object(139)
memory usage: 816.2+ MB

The 500MB csv file fills about 816MB of memory. This seems large but even a low-end laptop has several gigabytes of RAM so we are nowhere near the need for specialized processing tools.

Here is the data set we will use for the rest of the article:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 607865 entries, 0 to 607864
Data columns (total 33 columns):
Change_Type                                                         607865 non-null object
Covered_Recipient_Type                                              607865 non-null object
.....
Payment_Publication_Date                                            607865 non-null object
dtypes: float64(2), int64(3), object(28)
memory usage: 153.0+ MB

Now that we only have 33 columns, taking 153MB of memory, let’s take a look at which columns might be good candidates for a categorical data type.

In order to make this a little easier, I created a small helper function to create a dataframe showing all the unique values in a column.

unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])
Column_Name Num_Unique
0 Change_Type 1
27 Delay_in_Publication_Indicator 1
31 Program_Year 1
32 Payment_Publication_Date 1
29 Dispute_Status_for_Publication 2
26 Preclinical_Research_Indicator 2
22 Related_Product_Indicator 2
25 Form_of_Payment_or_Transfer_of_Value 3
1 Covered_Recipient_Type 4
14 Principal_Investigator_1_Country 4
15 Principal_Investigator_1_Primary_Type 6
6 Recipient_Country 9
21 Applicable_Manufacturer_or_Applicable_GPO_Maki… 20
4 Recipient_State 53
12 Principal_Investigator_1_State 54
17 Principal_Investigator_1_License_State_code1 54
16 Principal_Investigator_1_Specialty 243
24 Date_of_Payment 365
18 Submitting_Applicable_Manufacturer_or_Applicab… 478
19 Applicable_Manufacturer_or_Applicable_GPO_Maki… 551
20 Applicable_Manufacturer_or_Applicable_GPO_Maki… 557
11 Principal_Investigator_1_City 4101
3 Recipient_City 4277
8 Principal_Investigator_1_First_Name 8300
5 Recipient_Zip_Code 12826
28 Name_of_Study 13015
13 Principal_Investigator_1_Zip_Code 13733
9 Principal_Investigator_1_Last_Name 21420
10 Principal_Investigator_1_Business_Street_Addre… 29026
7 Principal_Investigator_1_Profile_ID 29696
2 Recipient_Primary_Business_Street_Address_Line1 38254
23 Total_Amount_of_Payment_USDollars 141959
30 Record_ID 607865

This table highlights a couple of items that will help determine which values should be categorical. First, there is a big jump in unique values once we get above 557 unique values. This should be a useful threshold for this data set.

In addition, the date fields should not be converted to categorical.

The simplest way to convert a column to a categorical type is to use astype('category') . We can use a loop to convert all the columns we care about using astype('category')

cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
for col in df.columns:
    if df[col].nunique() < 600 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')

If we use df.info() to look at the memory usage, we have taken the 153 MB dataframe down to 82.4 MB. This is pretty impressive. We have cut the memory usage almost in half just by converting to categorical values for the majority of our columns.

There is one other feature we can use with categorical data - defining a custom order. To illustrate, let’s do a quick summary of the total payments made by the form of payment:

df.groupby('Form_of_Payment_or_Transfer_of_Value')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
Total_Amount_of_Payment_USDollars
Covered_Recipient_Type
Covered Recipient Physician 7.912815e+07
Covered Recipient Teaching Hospital 1.040372e+09
Non-covered Recipient Entity 3.536595e+09
Non-covered Recipient Individual 2.832901e+06

If we want to change the order of the Covered_Recipient_Type , we need to define a custom CategoricalDtype :

cats_to_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician", "Non-covered Recipient Individual"]
covered_type = CategoricalDtype(categories=cats_to_order, ordered=True)

Then, explicitly re_order the category:

df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(cats_to_order, ordered=True)

Now, we can see the sort order in effect with the groupby:

df.groupby('Form_of_Payment_or_Transfer_of_Value')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
Total_Amount_of_Payment_USDollars
Covered_Recipient_Type
Non-covered Recipient Entity 3.536595e+09
Covered Recipient Teaching Hospital 1.040372e+09
Covered Recipient Physician 7.912815e+07
Non-covered Recipient Individual 2.832901e+06

If you have this same type of data file that you will be processing repeatedly, you can specify this conversion when reading the csv by passing a dictionary of column names and types via the dtype : parameter.

df_raw_2 = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv',
                        dtype={'Covered_Recipient_Type':covered_type})

Performance

We’ve shown that the size of the dataframe is reduced by converting values to categorical data types. Does this impact other areas of performance? The answer is yes.

Here is an example of a groupby operation on the categorical vs. object data types. First, perform the analysis on the original input dataframe.

%%timeit
df_raw.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
40.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now, on the dataframe with categorical data:

%%timeit
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
4.51 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In this case we sped up the code by 10x, going from 40.3 ms to 4.51 ms. You can imagine that on much larger data sets, the speedup could be even greater.

Watch Outs

Caution sign

Photo credit: Frans Van Heerden

Categorical data seems pretty nifty. It saves memory and speeds up code, so why not use it everywhere? Well, Donald Knuth is correct when he warns about premature optimization:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

In the examples above, the code is faster but it really does not matter when it is used for quick summary actions that are run infrequently. In addition, all the work to figure out and convert to categorical data is probably not worth it for this data set and this simple analysis.

In addition, categorical data can yield some surprising behaviors in real world usage. The examples below will illustrate a couple of issues.

Let’s build a simple dataframe with one ordered categorical variable that represents the status of the customer. This trivial example will highlight some potential subtle errors when dealing with categorical values. It is worth noting that this example shows how to use astype() to convert to the ordered category in one step instead of the two step process used earlier.

import pandas as pd
from pandas.api.types import CategoricalDtype

sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Status': 'Silver', 'Jan': 50,  'Feb': 90,  'Mar': 95 }]
df_1 = pd.DataFrame(sales_1)
status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
df_1['Status'] = df_1['Status'].astype(status_type)

This yields a simple dataframe that looks like this:

Feb Jan Mar Status account
0 200 150 140 Gold Jones LLC
1 210 200 215 Gold Alpha Co
2 90 50 95 Silver Blue Inc

We can inspect the categorical column in more detail:

df_1['Status']
0      Gold
1      Gold
2    Silver
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]

All looks good. We see the data is all there and that Gold is > then Silver.

Now, let’s bring in another dataframe and apply the same category to the status column:

sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
         {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)
df_2['Status'] = df_2['Status'].astype(status_type)
Feb Jan Mar Status account
0 100 100 70 Silver Smith Co
1 65 310 80 NaN Bingo

Hmm. Something happened to our status. If we just look at the column in more detail:

df_2['Status']
0    Silver
1       NaN
Name: Status, dtype: category
Categories (2, object): [Silver < Gold]

We can see that since we did not define “Bronze” as a valid status, we end up with an NaN value. Pandas does this for a perfectly good reason. It assumes that you have defined all of the valid categories and in this case, “Bronze” is not valid. You can just imagine how confusing this issue could be to troubleshoot if you were not looking out for it.

This scenario is relatively easy to see but what would you do if you had 100’s of values and the data was not cleaned and normalized properly?

Here’s another tricky example where you can “lose” the category object:

sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
         {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
         {'account': 'Blue Inc',  'Status': 'Silver', 'Jan': 50,  'Feb': 90,  'Mar': 95 }]
df_1 = pd.DataFrame(sales_1)

# Define an unordered category
df_1['Status'] = df_1['Status'].astype('category')

sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
     {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
df_2 = pd.DataFrame(sales_2)

df_2['Status'] = df_2['Status'].astype('category')

# Combine the two dataframes into 1
df_combined = pd.concat([df_1, df_2])
Feb Jan Mar Status account
0 200 150 140 Gold Jones LLC
1 210 200 215 Gold Alpha Co
2 90 50 95 Silver Blue Inc
0 100 100 70 Silver Smith Co
1 65 310 80 Bronze Bingo

Everything looks ok but upon further inspection, we’ve lost our category data type:

df_combined['Status']
0      Gold
1      Gold
2    Silver
0    Silver
1    Bronze
Name: Status, dtype: object

In this case, the data is still there but the type has been converted to an object. Once again, this is pandas attempt to combine the data without throwing errors but not making assumptions. If you want to convert to a category data type now, you can use astype('category') .

General Guidelines

Now that you know about these gotchas, you can watch out for them. But I will give a few guidelines for how I recommend using categorical data types:

  1. Do not assume you need to convert all categorical data to the pandas category data type.
  2. If the data set starts to approach an appreciable percentage of your useable memory, then consider using categorical data types.
  3. If you have very significant performance concerns with operations that are executed frequently, look at using categorical data.
  4. If you are using categorical data, add some checks to make sure the data is clean and complete before converting to the pandas category type. Additionally, check for NaN values after combining or converting dataframes.

I hope this article was helpful. Categorical data types in pandas can be very useful. However, there are a few issues that you need to keep an eye out for so that you do not get tripped up in subsequent processing. Feel free to add any additional tips or questions in the comments section below.


 
       Vote on Hacker News          

Comments