In many practical data science activities, the data set will contain categorical
variables. These variables are typically stored as text values that represent
various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”),
or geographic designations (state or country). Regardless of
what the value is used for, the challenge is determining how to use this data in the analysis.
Many machine learning algorithms support categorical values without
further manipulation, but many more do not. The analyst is therefore
faced with the challenge of turning these text attributes into
numerical values for further processing.
As with many other aspects of the data science world, there is no single answer
on how to approach this problem. Each approach has trade-offs and can
affect the outcome of the analysis. Fortunately, the Python tools pandas
and scikit-learn provide several approaches for transforming
categorical data into suitable numeric values.
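As a first taste of what such a transformation looks like, here is a minimal sketch using pandas only. The toy DataFrame, the column names, and the ordering chosen for the sizes are assumptions made for illustration; they reuse the color and size examples from above and do not come from a real data set.

```python
import pandas as pd

# Hypothetical toy data built from the color and size examples in the text.
df = pd.DataFrame({
    "color": ["Red", "Yellow", "Blue", "Red"],
    "size": ["Small", "Large", "Medium", "Small"],
})

# Approach 1: integer codes for a variable with a natural order.
# The order Small < Medium < Large is an assumption we state explicitly.
size_order = ["Small", "Medium", "Large"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# Approach 2: one-hot (dummy) columns for a variable with no natural order.
# The original "color" column is replaced by one indicator column per value.
encoded = pd.get_dummies(df, columns=["color"], prefix="color")

print(encoded)
```

Integer codes keep the data compact but imply an ordering, so they suit values like size; one-hot columns avoid implying any order at the cost of adding a column per distinct value.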
This article surveys some common (and a few more complex)
approaches in the hope that it will help others apply these techniques to their
real-world problems.