Practical Business Python (pbpython.com) — Introduction to Polars — Chris Moffitt — 2024-01-14<p class="first last">After a hiatus from the blog, I’m back with a new post. While I’ve still been using Python
and Pandas, I wanted to explore some new technologies. In this post, I’ll delve into
polars. This article will cover some basic polars concepts, pointing out both its
strengths and differences compared to Pandas. While I’m not ditching Pandas completely,
I’ve found potential in polars for enhancing performance and capabilities in specific
scenarios. Join me on this exploration of alternative tools and frameworks; perhaps
polars might find a place in your toolkit too.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>It’s been a while since I’ve posted anything on the blog. One of the primary reasons for the
hiatus is that I have been using python and pandas but not to do anything very new or different.</p>
<p>In order to shake things up and hopefully get back into the blog a bit, I’m going to write
about <a class="reference external" href="https://pola.rs/">polars</a>. This article assumes you know how to use pandas and are interested in
determining if polars can fit into your workflow. I will cover some
basic polars concepts that should get you started on your journey.</p>
<p>Along the way I will point out some of the things I liked and some of the differences that
might limit your usage of polars if you’re coming from pandas.</p>
<p>Ultimately, I do like polars and what it is trying to do. I’m not ready to throw out all my
pandas code and move over to polars. However, I can see where polars could fit into my
toolkit and provide some performance and capability that is missing from pandas.</p>
<p>As you evaluate the choice for yourself, it is important to try other frameworks and tools
and evaluate them on their merits as they apply to your needs. Even if you decide polars doesn’t
meet your needs, it is good to evaluate options and learn along the way. Hopefully this
article will get you started down that path.</p>
</div>
<div class="section" id="polars">
<h2>Polars</h2>
<p>As mentioned above, pandas has been the go-to data analysis tool for python for many years.
Wes McKinney started the initial work on <a class="reference external" href="https://en.wikipedia.org/wiki/Pandas_(software)">pandas</a> in 2008 and the 1.0 <a class="reference external" href="https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html">release</a> was in January
2020. Pandas has been around a long time and will continue to be.</p>
<p>While pandas is great, it has its warts. Wes McKinney wrote about several of
these <a class="reference external" href="https://wesmckinney.com/blog/apache-arrow-pandas-internals/">challenges</a>. There are many other criticisms online but most will boil down to
two items: performance and awkward/complex <span class="caps">API</span>.</p>
<p>Polars was initially developed by Richie Vink to solve these issues. His 2021
<a class="reference external" href="https://pola.rs/posts/i-wrote-one-of-the-fastest-dataframe-libraries/">blog post</a> does a thorough job of laying out metrics to back up his claims
on the performance improvements and the underlying design that leads to these benefits in polars.</p>
<p>The <a class="reference external" href="https://docs.pola.rs/user-guide/">user guide</a> concisely lays out the polars philosophy:</p>
<blockquote>
<p>The goal of Polars is to provide a lightning fast DataFrame library that:</p>
<ul class="simple">
<li>Utilizes all available cores on your machine.</li>
<li>Optimizes queries to reduce unneeded work/memory allocations.</li>
<li>Handles datasets much larger than your available <span class="caps">RAM</span>.</li>
<li>Has an <span class="caps">API</span> that is consistent and predictable.</li>
<li>Has a strict schema (data-types should be known before running the query).</li>
</ul>
<p>Polars is written in Rust which gives it C/C++ performance and allows it to fully control
performance critical parts in a query engine.</p>
<p>As such Polars goes to great lengths to:</p>
<ul class="simple">
<li>Reduce redundant copies.</li>
<li>Traverse memory cache efficiently.</li>
<li>Minimize contention in parallelism.</li>
<li>Process data in chunks.</li>
<li>Reuse memory allocations.</li>
</ul>
</blockquote>
<p>Clearly performance is an important goal in the development of polars and a key reason why
you might consider using it.</p>
<p>This article won’t discuss performance but will focus on the polars <span class="caps">API</span>. The main reason is
that for the type of work I do, the data easily fits in <span class="caps">RAM</span> on a business-class laptop.
The data would fit in Excel, but working with it there is slow and inefficient on a standard computer.
I rarely find myself waiting on pandas once I have read in the data and done the basic
data pre-processing.</p>
<p>Of course performance matters but it’s not everything. If you’re trying to make a choice
between pandas, polars, or other tools, don’t make a choice based on general notions of
“performance improvement” but based on what works for your specific needs.</p>
</div>
<div class="section" id="getting-started">
<h2>Getting started</h2>
<p>For this article, I’ll be using data from an <a class="reference external" href="https://pbpython.com/dataframe-gui-overview.html">earlier post</a> which you can find
on <a class="reference external" href="https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx">github</a>.</p>
<p>I would recommend following the latest polars installation instructions in the <a class="reference external" href="https://docs.pola.rs/user-guide/">user guide</a>.</p>
<p>I chose to install polars with all of the dependencies:</p>
<div class="highlight"><pre><span></span><span class="n">python</span> <span class="o">-</span><span class="n">m</span> <span class="n">pip</span> <span class="n">install</span> <span class="n">polars</span><span class="p">[</span><span class="nb">all</span><span class="p">]</span>
</pre></div>
<p>Once installed, reading the downloaded Excel file is straightforward:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">polars</span> <span class="k">as</span> <span class="nn">pl</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pl</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span>
<span class="n">source</span><span class="o">=</span><span class="s2">"2018_Sales_Total_v2.xlsx"</span><span class="p">,</span> <span class="n">schema_overrides</span><span class="o">=</span><span class="p">{</span><span class="s2">"date"</span><span class="p">:</span> <span class="n">pl</span><span class="o">.</span><span class="n">Datetime</span><span class="p">}</span>
<span class="p">)</span>
</pre></div>
<p>When I read this specific file, I found that the date column did not come
through as a <code class="code">
Datetime</code>
type, so I used the <code class="code">
schema_overrides</code>
argument to make
sure the data was properly typed.</p>
<p>Since data typing is so important, here’s one quick way to check on it:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">schema</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">OrderedDict</span><span class="p">([(</span><span class="s1">'account number'</span><span class="p">,</span> <span class="n">Int64</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'name'</span><span class="p">,</span> <span class="n">Utf8</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'sku'</span><span class="p">,</span> <span class="n">Utf8</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'quantity'</span><span class="p">,</span> <span class="n">Int64</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'unit price'</span><span class="p">,</span> <span class="n">Float64</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'ext price'</span><span class="p">,</span> <span class="n">Float64</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'date'</span><span class="p">,</span> <span class="n">Datetime</span><span class="p">(</span><span class="n">time_unit</span><span class="o">=</span><span class="s1">'us'</span><span class="p">,</span> <span class="n">time_zone</span><span class="o">=</span><span class="kc">None</span><span class="p">))])</span>
</pre></div>
<p>A lot of the standard pandas commands such as <code class="code">
head</code>
, <code class="code">
tail</code>
, <code class="code">
describe</code>
work as expected with a little extra output sprinkled in:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 1300px; height: auto; max-width: 100%;">
<img alt="polars head function" src="https://pbpython.com/images/polars-head.png" style="width: 1300px; height: auto; max-width: 100%;"/>
</div>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 1408px; height: auto; max-width: 100%;">
<img alt="polars describe function" src="https://pbpython.com/images/polars-describe.png" style="width: 1408px; height: auto; max-width: 100%;"/>
</div>
<p>The polars output has a couple of notable features:</p>
<ul class="simple">
<li>The <code class="code">
shape</code>
is included which is useful to make sure you’re not dropping rows or columns inadvertently</li>
<li>Underneath each column name is a data type which is another useful reminder</li>
<li>There are no index numbers</li>
<li>The string columns are displayed with quotation marks (&quot;) around the values</li>
</ul>
<p>Overall, I like this output and do find it useful for analyzing the data and making sure
the data is stored in the way I expect.</p>
</div>
<div class="section" id="basic-concepts-selecting-and-filtering-rows-and-columns">
<h2>Basic concepts - selecting and filtering rows and columns</h2>
<p>Polars introduces the concept of <a class="reference external" href="https://docs.pola.rs/user-guide/concepts/expressions/">Expressions</a> to help you work with your data. There are four
main expressions you need to understand when working with data in polars:</p>
<ul class="simple">
<li><code class="code">
select</code>
to choose the subset of <em>columns</em> you want to work with</li>
<li><code class="code">
filter</code>
to choose the subset of <em>rows</em> you want to work with</li>
<li><code class="code">
with_columns</code>
to create <em>new</em> columns</li>
<li><code class="code">
group_by</code>
to group data together</li>
</ul>
<p>Choosing or reordering columns is straightforward with <code class="code">
select()</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"quantity"</span><span class="p">,</span> <span class="s2">"sku"</span><span class="p">))</span>
</pre></div>
<div class="figure" style="width: 266px; height: auto; max-width: 100%;">
<img alt="polars select expression" src="https://pbpython.com/images/polars-select.png" style="width: 266px; height: auto; max-width: 100%;"/>
</div>
<p>The <code class="code">
pl.col()</code>
function is used to create column expressions. You will want to use it
any time you specify one or more columns for an action. There are shortcuts where
you can pass column names without <code class="code">
pl.col()</code>
but I’m choosing to show the recommended way.</p>
<p>Filtering is a similar process (note the use of <code class="code">
pl.col()</code>
again):</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span> <span class="o">></span> <span class="mi">50</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 1267px; height: auto; max-width: 100%;">
<img alt="polars filter expression" src="https://pbpython.com/images/polars-filter.png" style="width: 1267px; height: auto; max-width: 100%;"/>
</div>
<p>Coming from pandas, I found selecting columns and filtering rows to be intuitive.</p>
</div>
<div class="section" id="basic-concepts-adding-columns">
<h2>Basic concepts - adding columns</h2>
<p>The next expression, <code class="code">
with_columns</code>
, takes a little more getting used to. The easiest way
to think about it is that any time you want to add a new column to your data, you need to
use <code class="code">
with_columns</code>
.</p>
<p>To illustrate, I will add a month name column, which will also show how to work with dates and strings.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">((</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">"%b"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"month_name"</span><span class="p">)))</span>
</pre></div>
<div class="figure" style="width: 1476px; height: auto; max-width: 100%;">
<img alt="polars with-columns expression" src="https://pbpython.com/images/polars-with-columns.png" style="width: 1476px; height: auto; max-width: 100%;"/>
</div>
<p>This command does a few things to create a new column:</p>
<ul class="simple">
<li>Select the <code class="code">
date</code>
column</li>
<li>Access the underlying date with <code class="code">
dt</code>
 and convert it to the 3-character month name using <code class="code">
strftime</code>
</li>
<li>Name the newly created column <code class="code">
month_name</code>
using the <code class="code">
alias</code>
function</li>
</ul>
<p>As a brief aside, I like using <code class="code">
alias</code>
to rename columns. As I played with polars,
this made a lot of sense to me.</p>
<p>Here’s another example to drive the point home.</p>
<p>Let’s say we want to understand what percentage of the total unit volume for the year
each individual order line contributes:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
<span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span> <span class="o">/</span> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">())</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"pct_total"</span><span class="p">)</span>
<span class="p">)</span>
</pre></div>
<div class="figure" style="width: 1431px; height: auto; max-width: 100%;">
<img alt="polars with-columns expression" src="https://pbpython.com/images/polars-with-columns-2.png" style="width: 1431px; height: auto; max-width: 100%;"/>
</div>
<p>In this example we divide the line item quantity by the total quantity <code class="code">
pl.col("quantity").sum()</code>
and label it as <code class="code">
pct_total</code>
.</p>
<p>You may have noticed that the previous <code class="code">
month_name</code>
column is not there. That’s because
none of the operations we have done are in-place. If we want to persist a new column,
we need to assign it to a new variable. I will do so in a moment.</p>
<p>I briefly mentioned working with strings but here’s another example.</p>
<p>Let’s say that any of the sku data with an “S” at the front is a special product and we want to
indicate that for each item. We use <code class="code">
str</code>
in a way very similar to the pandas <code class="code">
str</code>
accessor.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">starts_with</span><span class="p">(</span><span class="s2">"S"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"special"</span><span class="p">))</span>
</pre></div>
<div class="figure" style="width: 1392px; height: auto; max-width: 100%;">
<img alt="polars with-columns expression" src="https://pbpython.com/images/polars-with-columns-3.png" style="width: 1392px; height: auto; max-width: 100%;"/>
</div>
<p>Polars has a useful function <code class="code">
when</code>
<code class="code">
then</code>
<code class="code">
otherwise</code>
which can replace
pandas <code class="code">
mask</code>
or <code class="code">
np.where</code>
.</p>
<p>Let’s say we want to create a column that flags a special product, or includes the original
sku if it’s not a special product.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">starts_with</span><span class="p">(</span><span class="s2">"S"</span><span class="p">))</span>
<span class="o">.</span><span class="n">then</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">lit</span><span class="p">(</span><span class="s2">"Special"</span><span class="p">))</span>
<span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">))</span>
<span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"sales_status"</span><span class="p">)</span>
<span class="p">)</span>
</pre></div>
<p>Which yields:</p>
<div class="figure" style="width: 1448px; height: auto; max-width: 100%;">
<img alt="polars with-columns expression" src="https://pbpython.com/images/polars-with-columns-when.png" style="width: 1448px; height: auto; max-width: 100%;"/>
</div>
<p>This is somewhat analogous to an if-then-else statement in python. I personally like this
syntax because I always struggle with the pandas equivalents.</p>
<p>This example also introduces <code class="code">
pl.lit()</code>
which we use to assign a literal value to
the columns.</p>
</div>
<div class="section" id="basic-concepts-grouping-data">
<h2>Basic concepts - grouping data</h2>
<p>The pandas <code class="code">
groupby</code>
and polars <code class="code">
group_by</code>
 function similarly but the key
difference is that polars does not have the concept of an index or multi-index.</p>
<p>There are pros and cons to this approach which I will briefly touch on later in this article.</p>
<p>Here’s a simple polars <code class="code">
group_by</code>
example to total the unit amount by sku by customer.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"qty-total"</span><span class="p">))</span>
</pre></div>
<div class="figure" style="width: 412px; height: auto; max-width: 100%;">
<img alt="polars group_by expression" src="https://pbpython.com/images/polars-groupby-1.png" style="width: 412px; height: auto; max-width: 100%;"/>
</div>
<p>The syntax is similar to pandas <code class="code">
groupby</code>
 with the <code class="code">
agg</code>
dictionary approach I
<a class="reference external" href="https://pbpython.com/groupby-agg.html">have mentioned</a> before. You will notice that we continue to use <code class="code">
pl.col()</code>
to
reference our column of data and then <code class="code">
alias()</code>
to assign a custom name.</p>
<p>The other big change here is that the data does not have a multi-index; the result is
roughly the same as using <code class="code">
as_index=False</code>
with a pandas groupby. The benefit of this
approach is that it is easy to work with this data without flattening it or resetting an index.</p>
<p>The downside is that you cannot use <code class="code">
unstack</code>
and <code class="code">
stack</code>
to make the data
wider or narrower as needed.</p>
<p>When working with date/time data, you can group data similar to the pandas <a class="reference external" href="https://pbpython.com/pandas-grouper-agg.html">grouper function</a>
by using <code class="code">
group_by_dynamic</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">group_by_dynamic</span><span class="p">(</span><span class="s2">"date"</span><span class="p">,</span> <span class="n">every</span><span class="o">=</span><span class="s2">"1mo"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"qty-total-month"</span><span class="p">)</span>
<span class="p">)</span>
</pre></div>
<div class="figure" style="width: 374px; height: auto; max-width: 100%;">
<img alt="polars group_by expression" src="https://pbpython.com/images/polars-groupby-2.png" style="width: 374px; height: auto; max-width: 100%;"/>
</div>
<p>There are a couple of items to note:</p>
<ul class="simple">
<li>Polars asks that you sort the data by the grouping column before doing the <code class="code">
group_by_dynamic</code>
</li>
<li>The <code class="code">
every</code>
argument allows you to specify what date/time level to aggregate to</li>
</ul>
<p>To expand on this example, what if we wanted to show the month name and year, instead of the
date time? We can chain together the <code class="code">
group_by_dynamic</code>
and add a new column by using
<code class="code">
with_columns</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">group_by_dynamic</span><span class="p">(</span><span class="s2">"date"</span><span class="p">,</span> <span class="n">every</span><span class="o">=</span><span class="s2">"1mo"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"qty-total-month"</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">"%b-%Y"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"month_name"</span><span class="p">))</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"month_name"</span><span class="p">,</span> <span class="s2">"qty-total-month"</span><span class="p">)</span>
<span class="p">)</span>
</pre></div>
<div class="figure" style="width: 317px; height: auto; max-width: 100%;">
<img alt="polars group_by expression" src="https://pbpython.com/images/polars-groupby-3.png" style="width: 317px; height: auto; max-width: 100%;"/>
</div>
<p>This example starts to show the <span class="caps">API</span> expressiveness of polars. Once you understand the basic
concepts, you can chain them together in a way that is generally more straightforward than
doing so with pandas.</p>
<p>To summarize this example:</p>
<ul class="simple">
<li>Grouped the data by month</li>
<li>Totaled the quantity and assigned the column name to <code class="code">
qty-total-month</code>
</li>
<li>Changed the date label to be more readable and assigned the name <code class="code">
month_name</code>
</li>
<li>Then down-selected to show the two columns I wanted to focus on</li>
</ul>
</div>
<div class="section" id="chaining-expressions">
<h2>Chaining expressions</h2>
<p>We have touched on chaining expressions but I wanted to give one full example
below to act as a reference.</p>
<p>Combining multiple expressions is available in pandas but it’s not required.
This <a class="reference external" href="https://tomaugspurger.net/posts/method-chaining/">post from Tom Augspurger</a> shows a nice example of how to use
different pandas functions to chain operations together. This is also a common topic
that <a class="reference external" href="https://twitter.com/__mharrison__?lang=en">Matt Harrison (@__mharrison__)</a> discusses.</p>
<p>Chaining expressions together is a first-class citizen in polars, so it is intuitive and
an essential part of working with polars.</p>
<p>Here is an example combining several concepts we showed earlier in the article:</p>
<div class="highlight"><pre><span></span><span class="n">df_month</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
<span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">month</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"month"</span><span class="p">)),</span>
<span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"date"</span><span class="p">)</span><span class="o">.</span><span class="n">dt</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s2">"%b"</span><span class="p">)</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"month_name"</span><span class="p">)),</span>
<span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span> <span class="o">/</span> <span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">())</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"pct_total"</span><span class="p">),</span>
<span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">starts_with</span><span class="p">(</span><span class="s2">"S"</span><span class="p">))</span>
<span class="o">.</span><span class="n">then</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">lit</span><span class="p">(</span><span class="s2">"Special"</span><span class="p">))</span>
<span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">))</span>
<span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"sales_status"</span><span class="p">)</span>
<span class="p">),</span>
<span class="p">)</span><span class="o">.</span><span class="n">select</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span>
<span class="s2">"name"</span><span class="p">,</span> <span class="s2">"quantity"</span><span class="p">,</span> <span class="s2">"sku"</span><span class="p">,</span> <span class="s2">"month"</span><span class="p">,</span> <span class="s2">"month_name"</span><span class="p">,</span> <span class="s2">"sales_status"</span><span class="p">,</span> <span class="s2">"pct_total"</span>
<span class="p">)</span>
<span class="p">)</span>
<span class="n">df_month</span>
</pre></div>
<div class="figure" style="width: 1125px; height: auto; max-width: 100%;">
<img alt="polars chaining" src="https://pbpython.com/images/polars-chaining.png" style="width: 1125px; height: auto; max-width: 100%;"/>
</div>
<p>I made this graphic to show how the pieces of code interact with each other:</p>
<div class="figure" style="width: 3840px; height: auto; max-width: 100%;">
<img alt="polars chaining example" src="https://pbpython.com/images/polars-example.png" style="width: 3840px; height: auto; max-width: 100%;"/>
</div>
<p>The image is small on the blog but if you open it in a new window, it should be more legible.</p>
<p>It may take a little time to wrap your head around this approach to programming. But the
results should pay off in more maintainable and performant code.</p>
</div>
<div class="section" id="additional-notes">
<h2>Additional notes</h2>
<p>As you work with pandas and polars, there are convenience functions for moving back and
forth between the two. Here’s an example of creating a pandas dataframe from a polars one:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">with_columns</span><span class="p">(</span>
<span class="n">pl</span><span class="o">.</span><span class="n">when</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">starts_with</span><span class="p">(</span><span class="s2">"S"</span><span class="p">))</span>
<span class="o">.</span><span class="n">then</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">lit</span><span class="p">(</span><span class="s2">"Special"</span><span class="p">))</span>
<span class="o">.</span><span class="n">otherwise</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">lit</span><span class="p">(</span><span class="s2">"Standard"</span><span class="p">))</span>
<span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"sales_status"</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">to_pandas</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 1619px; height: auto; max-width: 100%;">
<img alt="polars to pandas" src="https://pbpython.com/images/polars-to-pandas.png" style="width: 1619px; height: auto; max-width: 100%;"/>
</div>
<p>Having this capability means you can adopt polars gradually and fall back to pandas
when something in polars doesn’t quite work the way you expect.</p>
<p>If you need to work the other way, you can convert a pandas dataframe to a polars one using
<code class="code">
from_pandas()</code>.
</p>
<p>Finally, one other item I noticed when working with polars is that there are some nice
convenience features when saving your polars dataframe to Excel. By default the dataframe
is stored in a table and you can make a lot of changes to the output by tweaking the
parameters to the <code class="code">
write_excel()</code>
method. I recommend reviewing the <a class="reference external" href="https://docs.pola.rs/py-polars/html/reference/api/polars.DataFrame.write_excel.html">official <span class="caps">API</span> docs</a> for the details.</p>
<p>To give you a quick flavor, here is an example of some simple configuration:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">group_by</span><span class="p">(</span><span class="s2">"name"</span><span class="p">,</span> <span class="s2">"sku"</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">pl</span><span class="o">.</span><span class="n">col</span><span class="p">(</span><span class="s2">"quantity"</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">alias</span><span class="p">(</span><span class="s2">"qty-total"</span><span class="p">))</span><span class="o">.</span><span class="n">write_excel</span><span class="p">(</span>
<span class="s2">"sample.xlsx"</span><span class="p">,</span>
<span class="n">table_style</span><span class="o">=</span><span class="p">{</span>
<span class="s2">"style"</span><span class="p">:</span> <span class="s2">"Table Style Medium 2"</span><span class="p">,</span>
<span class="p">},</span>
<span class="n">autofit</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span>
<span class="n">sheet_zoom</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span>
<span class="p">)</span>
</pre></div>
<div class="figure" style="width: 450px; height: auto; max-width: 100%;">
<img alt="polars excel output" src="https://pbpython.com/images/polars-excel-output-sm.png" style="width: 450px; height: auto; max-width: 100%;"/>
</div>
<p>There are a lot of configuration options available, but I generally find this default output
easier to work with than pandas’ default output.</p>
</div>
<div class="section" id="additional-resources">
<h2>Additional resources</h2>
<p>I have only touched on the bare minimum of capabilities in polars. If there is interest,
I’ll write some more. In the meantime, I recommend you check out the following resources:</p>
<ul class="simple">
<li>Official <a class="reference external" href="https://docs.pola.rs/user-guide/">user guide</a></li>
<li><a class="reference external" href="https://kevinheavey.github.io/modern-polars/">Modern Polars</a> from Kevin Heavey</li>
</ul>
<p>The Modern Polars resource goes into a much more detailed look at how to work with pandas
and polars with code examples side by side. It’s a top notch resource. You should definitely
check it out.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>Pandas has been the go-to data analysis tool in the python ecosystem for over a decade.
Over that time it has grown and evolved and the surrounding ecosystem has changed. As a result
some of the core parts of pandas might be showing their age.</p>
<p>Polars brings a new approach to working with data. It is still in the early phases of its
development but I am impressed with how far it has come in the first few years. As of this
writing, polars is moving to a <a class="reference external" href="https://github.com/pola-rs/polars/releases/tag/py-0.20.0">1.0 release</a>. This milestone means that there will be
fewer breaking changes going forward and the <span class="caps">API</span> will stabilize. It’s a good time to jump
in and learn more for yourself.</p>
<p>I’ve only spent a few hours with polars so I’m still developing my long-term view on where
it fits. Here are a few of my initial observations:</p>
<p>Polars pros:</p>
<ul class="simple">
<li>Performant design from the ground up that takes advantage of modern hardware and minimizes
memory usage</li>
<li>Clean, consistent and expressive <span class="caps">API</span> for chaining methods</li>
<li>Not having indices simplifies many cases</li>
<li>Useful improvement in displaying output, saving excel files, etc.</li>
<li>Good <span class="caps">API</span> and user documentation</li>
<li>No built-in plotting library</li>
</ul>
<p>Regarding plotting, I think it’s better to rely on the existing visualization libraries than
to build one into polars. There is a <code class="code">
plot</code>
<a class="reference external" href="https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.plot.html">namespace</a> in polars but it defers to
other libraries to do the plotting.</p>
<p>Polars cons:</p>
<ul class="simple">
<li>Still newer code base with breaking <span class="caps">API</span> changes</li>
<li>Not as much third party documentation</li>
<li>Not as seamlessly integrated with other libraries (although it is improving)</li>
<li>Some pandas functions like stacking and unstacking are not as mature in polars</li>
</ul>
<p>Pandas pros:</p>
<ul class="simple">
<li>Tried and tested code base that has been improved significantly over the years</li>
<li>The multi-index support provides helpful shortcuts for re-shaping data</li>
<li>Strong integrations with the rest of the python data ecosystem</li>
<li>Good official documentation as well as lots of 3rd party sources for tips and tricks</li>
</ul>
<p>Pandas cons:</p>
<ul class="simple">
<li>Some cruft in the <span class="caps">API</span> design. There’s more than one way to do things in many cases.</li>
<li>Performance for large data sets can get bogged down</li>
</ul>
<p>This is not necessarily exhaustive but I think hits the highlights. At the end of the
day, diversity in tools and approaches is helpful. I intend to continue evaluating the
integration of polars into my analysis - especially in cases where performance becomes an
issue or the pandas code gets to be too messy. However, I don’t think pandas is going
away any time soon and I continue to be excited about pandas’ evolution.</p>
<p>I hope this article helps you get started. As always, if you have experiences, thoughts or
comments on the article, let me know below.</p>
</div>
Pandas Groupby Warning2022-09-26T07:25:00-05:002022-09-26T07:25:00-05:00Chris Moffitttag:pbpython.com,2022-09-26:/groupby-warning.html<p class="first last">One of the reasons I like using pandas instead of Excel for data analysis is that it is
easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there
is still plenty of opportunity to make errors with pandas code. This article discusses a
subtle issue with pandas <code>groupby</code> code that can lead to big errors if you’re not
careful. I’m writing this because I have happened upon this in the past but it still bit
me big time just recently. I hope this article can help a few of you avoid this mistake.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>One of the reasons I like using pandas instead of Excel for data analysis is that it is
easier to avoid certain types of copy-paste Excel errors. As great as pandas is, there
is still plenty of opportunity to make errors with pandas code. This article discusses a
subtle issue with pandas <code class="code">
groupby</code>
code that can lead to big errors if you’re not
careful. I’m writing this because I have happened upon this in the past but it still bit
me big time just recently. I hope this article can help a few of you avoid this mistake.</p>
</div>
<div class="section" id="the-problem">
<h2>The Problem</h2>
<p>To illustrate this problem, we’ll use a simple data set that shows sales for 20 customers
and includes their region and an internal customer segment designation of Platinum, Gold
or Silver. Here is the <a class="reference external" href="https://github.com/chris1610/pbpython/blob/master/data/sales_9_2022.xlsx">full data set</a>:</p>
<table border="1" class="table table-condense docutils">
<colgroup>
<col width="6%"/>
<col width="18%"/>
<col width="43%"/>
<col width="11%"/>
<col width="13%"/>
<col width="10%"/>
</colgroup>
<thead valign="bottom">
<tr><th class="head"><!-- -->
</th>
<th class="head">Customer <span class="caps">ID</span></th>
<th class="head">Customer Name</th>
<th class="head">Region</th>
<th class="head">Segment</th>
<th class="head">Sales</th>
</tr>
</thead>
<tbody valign="top">
<tr><td>0</td>
<td>740150</td>
<td>Barton <span class="caps">LLC</span></td>
<td><span class="caps">US</span></td>
<td>Gold</td>
<td>215000</td>
</tr>
<tr><td>1</td>
<td>714466</td>
<td>Trantow-Barrows</td>
<td><span class="caps">EMEA</span></td>
<td>Silver</td>
<td>430000</td>
</tr>
<tr><td>2</td>
<td>218895</td>
<td>Kulas Inc</td>
<td><span class="caps">EMEA</span></td>
<td>Platinum</td>
<td>410000</td>
</tr>
<tr><td>3</td>
<td>307599</td>
<td>Kassulke, Ondricka and Metz</td>
<td><span class="caps">EMEA</span></td>
<td> </td>
<td>91000</td>
</tr>
<tr><td>4</td>
<td>412290</td>
<td>Jerde-Hilpert</td>
<td><span class="caps">EMEA</span></td>
<td>Gold</td>
<td>630000</td>
</tr>
<tr><td>5</td>
<td>729833</td>
<td>Koepp Ltd</td>
<td><span class="caps">US</span></td>
<td> </td>
<td>230000</td>
</tr>
<tr><td>6</td>
<td>737550</td>
<td>Fritsch, Russel and Anderson</td>
<td><span class="caps">US</span></td>
<td>Gold</td>
<td>630000</td>
</tr>
<tr><td>7</td>
<td>146832</td>
<td>Kiehn-Spinka</td>
<td><span class="caps">US</span></td>
<td>Silver</td>
<td>615000</td>
</tr>
<tr><td>8</td>
<td>688981</td>
<td>Keeling <span class="caps">LLC</span></td>
<td><span class="caps">US</span></td>
<td>Platinum</td>
<td>515000</td>
</tr>
<tr><td>9</td>
<td>786968</td>
<td>Frami, Hills and Schmidt</td>
<td><span class="caps">US</span></td>
<td>Gold</td>
<td>215000</td>
</tr>
<tr><td>10</td>
<td>239344</td>
<td>Stokes <span class="caps">LLC</span></td>
<td><span class="caps">US</span></td>
<td>Silver</td>
<td>230000</td>
</tr>
<tr><td>11</td>
<td>672390</td>
<td>Kuhn-Gusikowski</td>
<td><span class="caps">APAC</span></td>
<td>Platinum</td>
<td>630000</td>
</tr>
<tr><td>12</td>
<td>141962</td>
<td>Herman <span class="caps">LLC</span></td>
<td><span class="caps">APAC</span></td>
<td>Gold</td>
<td>215000</td>
</tr>
<tr><td>13</td>
<td>424914</td>
<td>White-Trantow</td>
<td><span class="caps">US</span></td>
<td> </td>
<td>230000</td>
</tr>
<tr><td>14</td>
<td>527099</td>
<td>Sanford and Sons</td>
<td><span class="caps">US</span></td>
<td>Platinum</td>
<td>615000</td>
</tr>
<tr><td>15</td>
<td>642753</td>
<td>Pollich <span class="caps">LLC</span></td>
<td><span class="caps">US</span></td>
<td>Gold</td>
<td>419000</td>
</tr>
<tr><td>16</td>
<td>383080</td>
<td>Will <span class="caps">LLC</span></td>
<td><span class="caps">US</span></td>
<td>Silver</td>
<td>415000</td>
</tr>
<tr><td>17</td>
<td>257198</td>
<td>Cronin, Oberbrunner and Spencer</td>
<td><span class="caps">US</span></td>
<td>Platinum</td>
<td>425000</td>
</tr>
<tr><td>18</td>
<td>604255</td>
<td>Halvorson, Crona and Champlin</td>
<td><span class="caps">US</span></td>
<td> </td>
<td>430000</td>
</tr>
<tr><td>19</td>
<td>163416</td>
<td>Purdy-Kunde</td>
<td><span class="caps">APAC</span></td>
<td>Silver</td>
<td>410000</td>
</tr>
</tbody>
</table>
<p>The data looks pretty simple. There’s only one numeric column so let’s see what it
totals to.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'https://github.com/chris1610/pbpython/raw/master/data/sales_9_2022.xlsx'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"Sales"</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="mi">8000000</span>
</pre></div>
<p>We have $8,000,000 in sales in the spreadsheet. Keep that number in mind.</p>
<p>Let’s do some simple analysis to summarize sales by region:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
</pre></div>
<div style="max-height:1000px;max-width:500px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr style="text-align: right;">
<th></th>
<th>Sales</th>
</tr>
<tr>
<th>Region</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th><span class="caps">APAC</span></th>
<td>1255000</td>
</tr>
<tr>
<th><span class="caps">EMEA</span></th>
<td>1561000</td>
</tr>
<tr>
<th><span class="caps">US</span></th>
<td>5184000</td>
</tr>
</tbody>
</table>
</div><p>We can double check the math:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Sales</span> <span class="mi">8000000</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</pre></div>
<p>Looks good. That’s what we expect. Now let’s see what sales look like by Segment:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">,</span> <span class="s1">'Segment'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
</pre></div>
<p>Which yields this table:</p>
<div style="max-height:1000px;max-width:400px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Sales</th>
</tr>
<tr>
<th>Region</th>
<th>Segment</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top"><span class="caps">APAC</span></th>
<th>Gold</th>
<td>215000</td>
</tr>
<tr>
<th>Platinum</th>
<td>630000</td>
</tr>
<tr>
<th>Silver</th>
<td>410000</td>
</tr>
<tr>
<th rowspan="3" valign="top"><span class="caps">EMEA</span></th>
<th>Gold</th>
<td>630000</td>
</tr>
<tr>
<th>Platinum</th>
<td>410000</td>
</tr>
<tr>
<th>Silver</th>
<td>430000</td>
</tr>
<tr>
<th rowspan="3" valign="top"><span class="caps">US</span></th>
<th>Gold</th>
<td>1479000</td>
</tr>
<tr>
<th>Platinum</th>
<td>1555000</td>
</tr>
<tr>
<th>Silver</th>
<td>1260000</td>
</tr>
</tbody>
</table>
</div><p>This looks good. No errors and the table seems reasonable. We should continue our analysis, right?</p>
<p>Nope. There’s a potentially subtle issue here. Let’s sum the data to double check:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">,</span> <span class="s1">'Segment'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Sales</span> <span class="mi">7019000</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</pre></div>
<p>This only includes $7,019,000. Where did the other $981,000 go? Is pandas broken?</p>
<p>You can see the issue clearly if we use the <code class="code">
dropna=False</code>
parameter to explicitly
include <code class="code">
NaN</code>
values in our results:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">,</span> <span class="s1">'Segment'</span><span class="p">],</span> <span class="n">dropna</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
</pre></div>
<p>Now we can see the <code class="code">
NaN</code>
combinations with <span class="caps">EMEA</span> and the <span class="caps">US</span> groupings:</p>
<div style="max-height:1000px;max-width:400px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Sales</th>
</tr>
<tr>
<th>Region</th>
<th>Segment</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top"><span class="caps">APAC</span></th>
<th>Gold</th>
<td>215000</td>
</tr>
<tr>
<th>Platinum</th>
<td>630000</td>
</tr>
<tr>
<th>Silver</th>
<td>410000</td>
</tr>
<tr>
<th rowspan="4" valign="top"><span class="caps">EMEA</span></th>
<th>Gold</th>
<td>630000</td>
</tr>
<tr>
<th>Platinum</th>
<td>410000</td>
</tr>
<tr>
<th>Silver</th>
<td>430000</td>
</tr>
<tr>
<th>NaN</th>
<td>91000</td>
</tr>
<tr>
<th rowspan="4" valign="top"><span class="caps">US</span></th>
<th>Gold</th>
<td>1479000</td>
</tr>
<tr>
<th>Platinum</th>
<td>1555000</td>
</tr>
<tr>
<th>Silver</th>
<td>1260000</td>
</tr>
<tr>
<th>NaN</th>
<td>890000</td>
</tr>
</tbody>
</table>
</div><p>If we check the sum, we can see it totals to $8M.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">,</span> <span class="s1">'Segment'</span><span class="p">],</span> <span class="n">dropna</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Sales</span> <span class="mi">8000000</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</pre></div>
<p>The pandas <a class="reference external" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html">documentation</a> is very clear on this:</p>
<blockquote>
<dl class="docutils">
<dt><code class="code">
dropna:</code>
bool, default True</dt>
<dd>If True, and if group keys contain <span class="caps">NA</span> values, <span class="caps">NA</span> values together with row/column will
be dropped. If False, <span class="caps">NA</span> values will also be treated as the key in groups.</dd>
</dl>
</blockquote>
<div class="alert alert-info compound">
<p>The take away is that if your <code class="code">
groupby</code>
columns contain any <span class="caps">NA</span> values,
then you need to make a conscious decision about whether or not you want to include those
values in the grouped results.</p>
</div>
<p>If you are ok dropping those values, then use the default <code class="code">
dropna=True</code>
.</p>
<p>However, if you want to ensure that all values (Sales in this particular case) are included, then
make sure to use <code class="code">
dropna=False</code>
in your <code class="code">
groupby</code>.
</p>
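<p>Here is a minimal reproduction of this behavior with synthetic data (not the article’s dataset), showing how the default silently shrinks the total:</p>

```python
import numpy as np
import pandas as pd

# Small synthetic frame with one missing Segment value
df = pd.DataFrame(
    {
        "Region": ["US", "US", "EMEA"],
        "Segment": ["Gold", np.nan, "Silver"],
        "Sales": [100, 50, 75],
    }
)

# Default dropna=True: the row with a NaN Segment disappears from the result
dropped = df.groupby(["Region", "Segment"]).agg({"Sales": "sum"})
print(dropped["Sales"].sum())  # 175, not the raw total of 225

# dropna=False keeps the NaN group, so the grouped total reconciles
kept = df.groupby(["Region", "Segment"], dropna=False).agg({"Sales": "sum"})
print(kept["Sales"].sum())  # 225
```

Comparing the grouped total against <code class="code">
df["Sales"].sum()</code>
is a cheap sanity check worth adding to any groupby-heavy analysis.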
</div>
<div class="section" id="an-ounce-of-prevention">
<h2>An ounce of prevention</h2>
<p>The main way to deal with this potential issue is to understand if you have any <code class="code">
NaN</code>
values
in your data. There are a couple of ways to do this.</p>
<p>You can use pure pandas:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">Customer</span> <span class="n">ID</span> <span class="mi">0</span>
<span class="n">Customer</span> <span class="n">Name</span> <span class="mi">0</span>
<span class="n">Region</span> <span class="mi">0</span>
<span class="n">Segment</span> <span class="mi">4</span>
<span class="n">Sales</span> <span class="mi">0</span>
<span class="n">dtype</span><span class="p">:</span> <span class="n">int64</span>
</pre></div>
<p>There are other tools like <a class="reference external" href="https://github.com/ResidentMario/missingno">missingno</a> which provide a more robust interface for exploring
the data.</p>
<p>I’m partial to <a class="reference external" href="https://github.com/chris1610/sidetable">sidetable</a>. Here’s how to use it after it’s installed and imported:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">stb</span><span class="o">.</span><span class="n">missing</span><span class="p">()</span>
</pre></div>
<div style="max-height:1000px;max-width:400px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr style="text-align: right;">
<th></th>
<th>missing</th>
<th>total</th>
<th>percent</th>
</tr>
</thead>
<tbody>
<tr>
<th>Segment</th>
<td>4</td>
<td>20</td>
<td>20.0</td>
</tr>
<tr>
<th>Customer <span class="caps">ID</span></th>
<td>0</td>
<td>20</td>
<td>0.0</td>
</tr>
<tr>
<th>Customer Name</th>
<td>0</td>
<td>20</td>
<td>0.0</td>
</tr>
<tr>
<th>Region</th>
<td>0</td>
<td>20</td>
<td>0.0</td>
</tr>
<tr>
<th>Sales</th>
<td>0</td>
<td>20</td>
<td>0.0</td>
</tr>
</tbody>
</table>
</div><p>Regardless of the approach you use, it’s worth keeping in mind that you need to know if
you have any null or <code class="code">
NaN</code>
values in your data and how you would like to handle them in your analysis.</p>
<p>The other alternative to using the <code class="code">
dropna</code>
parameter is to explicitly fill in the
values using <code class="code">
fillna</code>:
</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s1">'unknown'</span><span class="p">)</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'Region'</span><span class="p">,</span> <span class="s1">'Segment'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'Sales'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
</pre></div>
<p>Now the unknown values are explicitly called out:</p>
<div style="max-height:1000px;max-width:400px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr style="text-align: right;">
<th></th>
<th></th>
<th>Sales</th>
</tr>
<tr>
<th>Region</th>
<th>Segment</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th rowspan="3" valign="top"><span class="caps">APAC</span></th>
<th>Gold</th>
<td>215000</td>
</tr>
<tr>
<th>Platinum</th>
<td>630000</td>
</tr>
<tr>
<th>Silver</th>
<td>410000</td>
</tr>
<tr>
<th rowspan="4" valign="top"><span class="caps">EMEA</span></th>
<th>Gold</th>
<td>630000</td>
</tr>
<tr>
<th>Platinum</th>
<td>410000</td>
</tr>
<tr>
<th>Silver</th>
<td>430000</td>
</tr>
<tr>
<th>unknown</th>
<td>91000</td>
</tr>
<tr>
<th rowspan="4" valign="top"><span class="caps">US</span></th>
<th>Gold</th>
<td>1479000</td>
</tr>
<tr>
<th>Platinum</th>
<td>1555000</td>
</tr>
<tr>
<th>Silver</th>
<td>1260000</td>
</tr>
<tr>
<th>unknown</th>
<td>890000</td>
</tr>
</tbody>
</table>
</div></div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>When working with pandas <code class="code">
groupby</code>
, the results can be surprising if you have
<code class="code">
NaN</code>
values in your dataframe columns. The default behavior is to drop those values
which means you can effectively “lose” some of your data during the process.</p>
<p>I have been bit by this behavior several times in the past. In some cases, it might not be a
big deal. In others, you might need to sheepishly explain why your numbers aren’t adding up.</p>
<p>Have you seen this before? Let me know in the comments below.</p>
</div>
Using Document Properties to Track Your Excel Reports2022-06-13T14:25:00-05:002022-06-13T14:25:00-05:00Chris Moffitttag:pbpython.com,2022-06-13:/excel-properties.html<p class="first last">When doing analysis with Jupyter Notebooks, you will frequently find yourself
generating ad-hoc Excel reports to distribute to your end-users. After time, you might
end up with dozens (or hundreds) of notebooks and it can be challenging to
remember which notebook generated which Excel report. I have started using Excel
document properties to track which notebooks generate specific Excel files. Now,
when a user asks for a refresh of a 6 month old report, I can easily find the notebook
file and re-run the analysis. This simple process can save a lot of frustration for your
future self. This brief article will walk through how to set these properties and give some
shortcuts for using <span class="caps">VS</span> Code to simplify the process.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>When doing analysis with Jupyter Notebooks, you will frequently find yourself
generating ad-hoc Excel reports to distribute to your end-users. After time, you might
end up with dozens (or hundreds) of notebooks and it can be challenging to
remember which notebook generated which Excel report. I have started using Excel
document properties to track which notebooks generate specific Excel files. Now,
when a user asks for a refresh of a 6 month old report, I can easily find the notebook
file and re-run the analysis. This simple process can save a lot of frustration for your
future self. This brief article will walk through how to set these properties and give some
shortcuts for using <span class="caps">VS</span> Code to simplify the process.</p>
</div>
<div class="section" id="background">
<h2>Background</h2>
<p>How often has this happened to you? You get an email from a colleague asking you to
refresh some analysis you did for them many months ago. You can tell that you created the
Excel file from a notebook but can’t remember which notebook you used. Despite trying to
be as organized as <a class="reference external" href="https://pbpython.com/notebook-process.html">possible</a>, it is inevitable that you will waste time trying to find the
originating notebook.</p>
<p>The nice aspect of the Excel document properties is that most people don’t change them.
So, even if a user renames the file, the properties you set will be easily visible and should
point the way to where the original code sits on your system.</p>
</div>
<div class="section" id="adding-properties">
<h2>Adding Properties</h2>
<p>If you’re using pandas and <a class="reference external" href="https://xlsxwriter.readthedocs.io/example_doc_properties.html">xlsxwriter</a>, adding properties is relatively simple.</p>
<p>Here’s a simple notebook showing how I typically structure my analysis:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">datetime</span>
<span class="n">today</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
<span class="n">report_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'reports'</span> <span class="o">/</span> <span class="sa">f</span><span class="s1">'sales_report_</span><span class="si">{</span><span class="n">today</span><span class="si">:</span><span class="s1">%b-%d-%Y</span><span class="si">}</span><span class="s1">.xlsx'</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
</pre></div>
<p>The important point is that I try to always use a standard naming convention that includes
the date in the name as well as a standard directory structure.</p>
<p>Now, I’ll do a <code class="code">
groupby</code>
to show sales by month for each account:</p>
<div class="highlight"><pre><span></span><span class="n">sales_summary</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'name'</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">'M'</span><span class="p">)])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span>
<span class="s1">'ext price'</span><span class="p">:</span>
<span class="s1">'sum'</span>
<span class="p">})</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span>
</pre></div>
<p>Here’s what the basic DataFrame output looks like:</p>
<div class="figure" style="width: 2406px; height: auto; max-width: 100%;">
<img alt="Sales summary" src="https://pbpython.com/images/sales-summary-example.png" style="width: 2406px; height: auto; max-width: 100%;"/>
</div>
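<p>One quirk of this approach is that the unstacked DataFrame has MultiIndex columns, which carry through to the Excel output. In this example the writer's <code class="code">
datetime_format</code>
argument handles the date display, but if you ever want plain string headers instead, you can flatten the columns yourself. Here is a small sketch using made-up data shaped like the summary above:</p>

```python
import pandas as pd

# Stand-in for the summary DataFrame: MultiIndex columns of
# (measure, month-end date), as produced by groupby + unstack
cols = pd.MultiIndex.from_product(
    [["ext price"], pd.to_datetime(["2018-01-31", "2018-02-28"])]
)
sales_summary = pd.DataFrame([[100.0, 200.0]], index=["Fake Co"], columns=cols)

# Drop the outer 'ext price' level and format each month like 'Jan-2018'
sales_summary.columns = [f"{month:%b-%Y}" for _, month in sales_summary.columns]
```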
<p>The final step is to save the DataFrame to Excel using the <code class="code">
pd.ExcelWriter</code>
context manager
and set the document properties:</p>
<div class="highlight"><pre><span></span><span class="k">with</span> <span class="n">pd</span><span class="o">.</span><span class="n">ExcelWriter</span><span class="p">(</span><span class="n">report_file</span><span class="p">,</span>
<span class="n">engine</span><span class="o">=</span><span class="s1">'xlsxwriter'</span><span class="p">,</span>
<span class="n">date_format</span><span class="o">=</span><span class="s1">'mmm-yyyy'</span><span class="p">,</span>
<span class="n">datetime_format</span><span class="o">=</span><span class="s1">'mmm-yyyy'</span><span class="p">)</span> <span class="k">as</span> <span class="n">writer</span><span class="p">:</span>
<span class="n">sales_summary</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="n">writer</span><span class="p">,</span> <span class="n">sheet_name</span><span class="o">=</span><span class="s1">'2018-sales'</span><span class="p">)</span>
<span class="n">workbook</span> <span class="o">=</span> <span class="n">writer</span><span class="o">.</span><span class="n">book</span>
<span class="n">workbook</span><span class="o">.</span><span class="n">set_properties</span><span class="p">({</span>
<span class="s1">'category'</span><span class="p">:</span> <span class="sa">r</span><span class="s1">'c:\Users\cmoffitt\Documents\notebooks\customer_analysis'</span><span class="p">,</span>
<span class="s1">'title'</span> <span class="p">:</span> <span class="s1">'2018 Sales Summary'</span><span class="p">,</span>
<span class="s1">'subject'</span><span class="p">:</span> <span class="s1">'Analysis for Anne Analyst'</span><span class="p">,</span>
<span class="s1">'author'</span><span class="p">:</span> <span class="s1">'1-Excel-Properties.ipynb'</span><span class="p">,</span>
<span class="s1">'status'</span><span class="p">:</span> <span class="s1">'Initial draft'</span><span class="p">,</span>
<span class="s1">'comments'</span><span class="p">:</span> <span class="s1">'src_dir: customer_analysis'</span><span class="p">,</span>
<span class="s1">'keywords'</span><span class="p">:</span> <span class="s1">'notebook-generated'</span>
<span class="p">})</span>
</pre></div>
<p>Once this is done, you can view the properties in a few different ways.</p>
<p>First, you can hover over the filename and get a quick view:</p>
<div class="figure" style="width: 982px; height: auto; max-width: 100%;">
<img alt="Excel property hover details" src="https://pbpython.com/images/excel-properties-hover.png" style="width: 982px; height: auto; max-width: 100%;"/>
</div>
<p>You can also view the details without opening Excel:</p>
<div class="figure" style="width: 545px; height: auto; max-width: 100%;">
<img alt="Excel property details" src="https://pbpython.com/images/excel-properties-detail.png" style="width: 545px; height: auto; max-width: 100%;"/>
</div>
<p>Finally, you can view the properties from within Excel:</p>
<div class="figure" style="width: 1145px; height: auto; max-width: 100%;">
<img alt="Excel property details" src="https://pbpython.com/images/excel-properties-detail-4.png" style="width: 1145px; height: auto; max-width: 100%;"/>
</div>
<p>As you can see from the example, there are a handful of <a class="reference external" href="https://xlsxwriter.readthedocs.io/example_doc_properties.html">options</a> for the properties.
I encourage you to adjust these based on your own needs. For example, I save all of my
work in a notebooks directory, so it is most useful to me to record the <code class="code">
src_dir</code>
in
the <code class="code">
Comments</code>
field. That quickly points me to the right directory, and the <code class="code">
Authors</code>
property tells me which specific notebook file generated the output.</p>
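<p>As a side note, you can also read these properties back programmatically with no Excel involved. An <code class="code">
xlsx</code>
file is just a zip archive, and the core properties live in <code class="code">
docProps/core.xml</code>
. The sketch below uses only the standard library; the element names come from the OOXML core-properties schema, and the function name is my own:</p>

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespaces defined by the OOXML core-properties part
CP = "{http://schemas.openxmlformats.org/package/2006/metadata/core-properties}"
DC = "{http://purl.org/dc/elements/1.1/}"


def read_core_properties(xlsx_path):
    """Return selected document properties stored in docProps/core.xml."""
    with zipfile.ZipFile(xlsx_path) as zf:
        root = ET.fromstring(zf.read("docProps/core.xml"))

    def text(tag):
        el = root.find(tag)
        return el.text if el is not None else None

    return {
        "title": text(f"{DC}title"),
        "subject": text(f"{DC}subject"),
        "author": text(f"{DC}creator"),
        "comments": text(f"{DC}description"),
        "category": text(f"{CP}category"),
        "keywords": text(f"{CP}keywords"),
        "status": text(f"{CP}contentStatus"),
    }
```

<p>This can be handy if you want to scan a whole reports directory and list which notebook produced each file.</p>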
<p>Observant readers will notice that this example also shows how to adjust the date
formats of the Excel output. As you can see below, the output is formatted so that
only the month and year are shown in the header. I find this much easier than going in
and adjusting each date by hand.</p>
<p>Here’s what it looks like now:</p>
<div class="figure" style="width: 1550px; height: auto; max-width: 100%;">
<img alt="Excel property details" src="https://pbpython.com/images/excel-properties-date.png" style="width: 1550px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="using-vs-code-snippets">
<h2>Using <span class="caps">VS</span> Code Snippets</h2>
<p>If you find this helpful, you may want to set up a snippet in <span class="caps">VS</span> Code to make this easier. I
covered how to create snippets in this <a class="reference external" href="https://pbpython.com/vscode-notebooks.html">article</a> so refer back to that for a refresher.</p>
<p>Here is a starter snippet to save the file to Excel and populate some properties:</p>
<div class="highlight"><pre><span></span><span class="s2">"Write Excel"</span><span class="p">:</span> <span class="p">{</span>
<span class="s2">"prefix"</span><span class="p">:</span> <span class="s2">"we"</span><span class="p">,</span>
<span class="s2">"body"</span><span class="p">:</span> <span class="p">[</span>
<span class="s2">"# Excelwriter"</span><span class="p">,</span>
<span class="s2">"with pd.ExcelWriter(report_file, engine='xlsxwriter', date_format='mmm-yyyy', datetime_format='mmm-yyyy') as writer:"</span><span class="p">,</span>
<span class="s2">"</span><span class="se">\t</span><span class="s2">$1.to_excel(writer, sheet_name='$2')"</span><span class="p">,</span>
<span class="s2">"</span><span class="se">\t</span><span class="s2">workbook = writer.book"</span><span class="p">,</span>
<span class="s2">"</span><span class="se">\t</span><span class="s2">workbook.set_properties({'category': r'$TM_DIRECTORY', 'author': '$TM_FILENAME'})"</span><span class="p">,</span>
<span class="p">],</span>
<span class="s2">"description"</span><span class="p">:</span> <span class="s2">"Write Excel file"</span>
<span class="p">}</span>
</pre></div>
<p>One nice benefit of using the snippet is that you can access <span class="caps">VS</span> Code <a class="reference external" href="https://code.visualstudio.com/docs/editor/variables-reference">variables</a> such as
<code class="code">
$TM_DIRECTORY</code>
and <code class="code">
$TM_FILENAME</code>
to pre-populate the current path and name.</p>
</div>
<div class="section" id="conclusion">
<h2>Conclusion</h2>
<p>When working with Jupyter Notebooks it is important to have a consistent process for
organizing and naming your files and directories. Otherwise the development process can
get very chaotic. Even with good organization skills, it is easy to lose track of which
scripts generate which outputs. Using the Excel document properties can be a quick and
relatively painless way to lay out some breadcrumbs so that it is easy to recreate your analysis.</p>
<p>Let me know in the comments if you have any other tips you’ve learned over the years.</p>
</div>
16 Reasons to Use VS Code for Developing Jupyter Notebooks2021-11-15T07:55:00-06:002021-11-15T07:55:00-06:00Chris Moffitttag:pbpython.com,2021-11-15:/vscode-notebooks.html<p class="first">Visual Studio Code is one of the most popular text editors with a track record of
continual improvements. One area where <span class="caps">VS</span> Code has been recently <a class="reference external" href="https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/">innovating</a> is its
Jupyter Notebook support. The early releases of <span class="caps">VS</span> Code sought to replicate existing
Jupyter Notebook features in <span class="caps">VS</span> Code. Recent <span class="caps">VS</span> Code releases have continued to develop
notebook features that provide an experience that in many cases is better than the
traditional Jupyter Notebook experience.</p>
<p>I am a big fan of using Jupyter Notebooks for python analysis - even though there are limitations.
For the type of ad hoc analysis I do, the notebook combination of code and visualizations is
superior to working with ad hoc Excel files. That being said, there are times when I wish
I had a more full-featured editor for my notebook code.</p>
<p class="last">In this article I will cover 16 reasons why you should consider using <span class="caps">VS</span> Code as your editor
of choice when working with python in Jupyter Notebooks. I am not including them in any
particular order but think number 11 is one of my favorites.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>Visual Studio Code is one of the most popular text editors with a track record of
continual improvements. One area where <span class="caps">VS</span> Code has been recently <a class="reference external" href="https://devblogs.microsoft.com/python/notebooks-are-getting-revamped/">innovating</a> is its
Jupyter Notebook support. The early releases of <span class="caps">VS</span> Code sought to replicate existing
Jupyter Notebook features in <span class="caps">VS</span> Code. Recent <span class="caps">VS</span> Code releases have continued to develop
notebook features that provide an experience that in many cases is better than the
traditional Jupyter Notebook experience.</p>
<p>I am a big fan of using Jupyter Notebooks for python analysis - even though there are limitations.
For the type of ad hoc analysis I do, the notebook combination of code and visualizations is
superior to working with ad hoc Excel files. That being said, there are times when I wish
I had a more full-featured editor for my notebook code.</p>
<p>In this article I will cover 16 reasons why you should consider using <span class="caps">VS</span> Code as your editor
of choice when working with python in Jupyter Notebooks. I am not including them in any
particular order but think number 11 is one of my favorites.</p>
</div>
<div class="section" id="a-single-editor-for-many-languages">
<h2>1 - A single editor for many languages</h2>
<p>It is very likely that your workflow includes working with multiple file or language types.
If you are working with <span class="caps">HTML</span>, <span class="caps">YAML</span>, <span class="caps">JSON</span>, <span class="caps">CSS</span> or Javascript files, then it is beneficial to have
one editor.</p>
<p>In addition, <span class="caps">VS</span> Code is extremely customizable, so you can configure your themes, colors,
fonts and much more to make your development environment your own. If you’re already
editing text files with <span class="caps">VS</span> Code, why not start using it for notebook development?</p>
</div>
<div class="section" id="supports-multiple-python-file-types">
<h2>2 - Supports multiple python file types</h2>
<p>If you are working with python, you have three distinct options for editing files:</p>
<ul class="simple">
<li>standalone .py files</li>
<li>python <a class="reference external" href="https://pbpython.com/notebook-alternative.html">code cells</a></li>
<li>Jupyter Notebooks (<code class="code">
.ipynb</code>
)</li>
</ul>
<p><span class="caps">VS</span> Code supports all editing approaches so you can build streamlit apps as standalone files
or prototype your work in a notebook - all from the same editor.</p>
</div>
<div class="section" id="execution-time">
<h2>3 - Execution time</h2>
<p>One simple but handy benefit is that each cell shows a moving progress bar when executing code
and shows how many seconds it takes to execute. If you have processes that take seconds or
longer to run, this little feature is very helpful and is available out of the box.</p>
<div class="figure" style="width: 1224px; height: auto; max-width: 100%;">
<img alt="Data loading" src="https://pbpython.com/images/data-load-progress.png" style="width: 1224px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="outline-mode">
<h2>4 - Outline mode</h2>
<p>A big challenge with standard notebooks is that they can be difficult to navigate. <span class="caps">VS</span> Code
includes an outline mode that makes it easy to build a table of contents with Markdown.</p>
<p>If you define a Markdown cell and use Markdown formatting for <a class="reference external" href="https://www.markdownguide.org/basic-syntax/">headings</a>, you can jump
to a section of your code by clicking on the link in the outline panel.</p>
<div class="figure" style="width: 1698px; height: auto; max-width: 100%;">
<img alt="Outline mode" src="https://pbpython.com/images/outline-mode.png" style="width: 1698px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="jupyter-variable-explorer">
<h2>5 - Jupyter variable explorer</h2>
<p>Do you forget your variable names? Did you call the customer variable <code class="code">
cust</code>
or <code class="code">
customers</code>
?
I spend a lot of time scrolling through notebooks trying to remember names. The Jupyter
variable explorer fixes this problem and shows additional helpful info about the size and
type of each variable.</p>
<div class="figure" style="width: 1329px; height: auto; max-width: 100%;">
<img alt="Outline mode" src="https://pbpython.com/images/variable_explorer_2.png" style="width: 1329px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="data-viewer">
<h2>6 - Data viewer</h2>
<p>The variable explorer also allows you to view a DataFrame or Series in a separate tab. I find this
really useful for remembering column names or to quickly inspect and filter data. Without
the viewer, I would normally export data to Excel and inspect it. The viewer removes much
of that need.</p>
<div class="figure" style="width: 1338px; height: auto; max-width: 100%;">
<img alt="Data Viewer" src="https://pbpython.com/images/data-viewer.png" style="width: 1338px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="code-formatting">
<h2>7 - Code formatting</h2>
<p>I really like using a consistent code formatter like <a class="reference external" href="https://github.com/google/yapf">yapf</a> or <a class="reference external" href="https://github.com/psf/black">black</a> to format some of the
more complex pandas code. <span class="caps">VS</span> Code will apply the formatter of choice to clean up your
nested code. Consistent, readable code makes the debugging process much easier.</p>
<div class="figure" style="width: 2515px; height: auto; max-width: 100%;">
<img alt="Code formatting" src="https://pbpython.com/images/format_code_2.png" style="width: 2515px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="cell-debugging">
<h2>8 - Cell debugging</h2>
<p>If you want to invoke a rich debugging environment, <span class="caps">VS</span> Code provides one. One of the
easiest ways to start is to press F10 to run a multi-line cell line by line.</p>
<div class="figure" style="width: 2243px; height: auto; max-width: 100%;">
<img alt="Debugging" src="https://pbpython.com/images/debug.png" style="width: 2243px; height: auto; max-width: 100%;"/>
</div>
<p>You now have access to the debugging environment for more complex problems.</p>
</div>
<div class="section" id="split-editors">
<h2>9 - Split editors</h2>
<p>Sometimes you may want to have multiple windows with the code visible. <span class="caps">VS</span> Code allows you
to split and configure your editors in as many configurations as you can imagine. It’s
not as easy to do this with the standard notebook interface.</p>
<p>Here is one example with multiple panes open for one notebook.</p>
<div class="figure" style="width: 1672px; height: auto; max-width: 100%;">
<img alt="Split Editors" src="https://pbpython.com/images/split-editors.png" style="width: 1672px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="git-integration">
<h2>10 - Git integration</h2>
<p><span class="caps">VS</span> Code integrates seamlessly with git. For instance, you can see a timeline view of your
commit history.</p>
<div class="figure" style="width: 564px; height: auto; max-width: 100%;">
<img alt="Timeline view" src="https://pbpython.com/images/timeline-view.png" style="width: 564px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="better-diffs">
<h2>11 - Better diffs</h2>
<p>This may be one of the biggest reasons to consider using <span class="caps">VS</span> Code. Your notebook diffs
are easier to decipher! One of the biggest complaints about notebook files is that there is
a lot of extra metadata and output information that makes it really difficult to see
diffs correctly. <span class="caps">VS</span> Code does some clever work to make diffs useful for notebooks.</p>
<div class="figure" style="width: 1663px; height: auto; max-width: 100%;">
<img alt="Enhanced diffs" src="https://pbpython.com/images/enhanced-diff.png" style="width: 1663px; height: auto; max-width: 100%;"/>
</div>
<p>One of the configuration options lets you hide the differences in metadata or
output so you can focus on just the code changes.</p>
<div class="figure" style="width: 336px; height: auto; max-width: 100%;">
<img alt="Enhanced diff customization" src="https://pbpython.com/images/diff-customization.png" style="width: 336px; height: auto; max-width: 100%;"/>
</div>
<p>In my opinion this diff feature is really a game changer for working with notebooks and git.</p>
</div>
<div class="section" id="intellisense">
<h2>12 - Intellisense</h2>
<p><span class="caps">VS</span> Code will try its best to help you complete your code and show documentation right in
your editor. If you can’t remember if the parameter is <code class="code">
sheet</code>
or <code class="code">
sheet_name</code>
then Intellisense will help you avoid many of those unnecessary Google searches.</p>
<div class="figure" style="width: 891px; height: auto; max-width: 100%;">
<img alt="Intellisense" src="https://pbpython.com/images/intellisense-2.png" style="width: 891px; height: auto; max-width: 100%;"/>
</div>
<p>Intellisense can also help you use some of those pandas functions you just can’t remember
without looking them up:</p>
<div class="figure" style="width: 798px; height: auto; max-width: 100%;">
<img alt="Intellisense" src="https://pbpython.com/images/intellisense-1.png" style="width: 798px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="variable-peeking">
<h2>13 - Variable peeking</h2>
<p>Variable peeking allows you to see how a variable is defined without having to scroll through
your code. In this example, if you can’t remember what <code class="code">
sku_filter</code>
was set to, you
can highlight <code class="code">
sku_filter</code>
and press Alt+F12 to see this summary overlay.</p>
<div class="figure" style="width: 827px; height: auto; max-width: 100%;">
<img alt="F12 peeking" src="https://pbpython.com/images/F12-peek2.png" style="width: 827px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="gather-code">
<h2>14 - Gather code</h2>
<p>Outside of diffs, one of the biggest complaints about the notebook environment is that it
is too easy to get your order of execution out of sync. If you are doing some ad-hoc analysis
and want to recreate a specific output, you can use the gather code function to filter
down the notebook to the specific code that is used to derive the output in a cell.</p>
<div class="figure" style="width: 372px; height: auto; max-width: 100%;">
<img alt="Gather code" src="https://pbpython.com/images/gather-code.png" style="width: 372px; height: auto; max-width: 100%;"/>
</div>
<p>That code is then shown in a separate notebook.</p>
<div class="figure" style="width: 1564px; height: auto; max-width: 100%;">
<img alt="Gather code" src="https://pbpython.com/images/gathered-output.png" style="width: 1564px; height: auto; max-width: 100%;"/>
</div>
<p>This is really useful if your notebook execution order gets way out of order.</p>
</div>
<div class="section" id="snippets">
<h2>12 - Snippets</h2>
<p>Software development and data analysis involve a lot of repetitive code (i.e. copying and pasting).
The <span class="caps">VS</span> Code snippets <a class="reference external" href="https://code.visualstudio.com/docs/editor/userdefinedsnippets">functionality</a> can streamline some of this process. Here is a very simple
snippet that will include two imports whenever you type <code class="code">
si</code>
. You can configure
more complex examples too.</p>
<p>Access python snippets:</p>
<div class="figure" style="width: 902px; height: auto; max-width: 100%;">
<img alt="Snippet setup" src="https://pbpython.com/images/setup_snippets.png" style="width: 902px; height: auto; max-width: 100%;"/>
</div>
<p>Create the snippet:</p>
<div class="figure" style="width: 901px; height: auto; max-width: 100%;">
<img alt="Create snippet" src="https://pbpython.com/images/snippets-import.png" style="width: 901px; height: auto; max-width: 100%;"/>
</div>
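<p>For readers who prefer text to screenshots, a snippet along these lines would do it. The exact imports are an assumption on my part (the screenshots suggest a basic pandas/numpy pair):</p>

```json
"Simple imports": {
    "prefix": "si",
    "body": [
        "import pandas as pd",
        "import numpy as np"
    ],
    "description": "Insert common imports"
}
```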
<p>Snippet in action:</p>
<div class="figure" style="width: 1322px; height: auto; max-width: 100%;">
<img alt="Gather code" src="https://pbpython.com/images/simple-import.png" style="width: 1322px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="change-kernels">
<h2>12 - Change kernels</h2>
<p>If you are using conda or virtual environments, it is very useful to be able to quickly change
your notebook’s environment.</p>
<div class="figure" style="width: 1609px; height: auto; max-width: 100%;">
<img alt="Kernels" src="https://pbpython.com/images/kernels.png" style="width: 1609px; height: auto; max-width: 100%;"/>
</div>
<p>With the October 2021 release, you can also filter this list (if you want to get rid of the
“trash” environments). You can get there by searching the Command Palette for Jupyter: Filter kernel.</p>
<div class="figure" style="width: 917px; height: auto; max-width: 100%;">
<img alt="Filter kernels" src="https://pbpython.com/images/filter-kernels.png" style="width: 917px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="connect-to-remote-server">
<h2>13 - Connect to remote server</h2>
<p>If you have a remote Jupyter server, you can connect to that as well.</p>
<div class="figure" style="width: 502px; height: auto; max-width: 100%;">
<img alt="Connect to remote server" src="https://pbpython.com/images/remote-server.png" style="width: 502px; height: auto; max-width: 100%;"/>
</div>
<div class="figure" style="width: 906px; height: auto; max-width: 100%;">
<img alt="Connect to remote server" src="https://pbpython.com/images/remote-server2.png" style="width: 906px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="supports-wsl">
<h2>14 - Supports <span class="caps">WSL</span></h2>
<p>I have written about <a class="reference external" href="https://pbpython.com/wsl-python.html">using <span class="caps">WSL</span></a> in the past. <span class="caps">VS</span> Code integrates well with <span class="caps">WSL</span> so you
can develop on Windows or Linux with a simple integrated approach. With Windows 11, the
<span class="caps">WSL</span> installation process is even easier than before.</p>
</div>
<div class="section" id="viewing-plots">
<h2>15 - Viewing plots</h2>
<p><span class="caps">VS</span> Code supports visualizations just like a standard notebook. In addition, you can view
all of the different plots in a separate tab. I find it helpful to have all plots in one
place for side by side analysis.</p>
<div class="figure" style="width: 1614px; height: auto; max-width: 100%;">
<img alt="Plot Viewer" src="https://pbpython.com/images/plot-viewer-1.png" style="width: 1614px; height: auto; max-width: 100%;"/>
</div>
<p>There are additional options for saving and viewing the plots.</p>
<div class="figure" style="width: 464px; height: auto; max-width: 100%;">
<img alt="Plot viewer" src="https://pbpython.com/images/plot-viewer-2.png" style="width: 464px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="plugins">
<h2>16 - Plugins</h2>
<p><span class="caps">VS</span> Code has hundreds (maybe thousands) of additional plugins that you may want to use in your environment.
One of the ones I like quite a bit is <a class="reference external" href="https://marketplace.visualstudio.com/items?itemName=alefragnani.project-manager">Project Manager</a>. I use this to group my
various projects together and quickly launch <span class="caps">VS</span> Code with all the code in one place.</p>
<div class="figure" style="width: 561px; height: auto; max-width: 100%;">
<img alt="Project manager plugin" src="https://pbpython.com/images/project-manager.png" style="width: 561px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>I have been using <span class="caps">VS</span> Code for notebook development for the past several weeks. It has taken
some time to get used to the new workflow but I think I am going to continue with this approach.</p>
<p>Many of the features in this article have only been in place for a few months,
so I am excited to see how they evolve and what new features the community
will develop in the future.</p>
<p>Let me know in the comments if you have any other experience with using some of these new
features in your own development process. Also, if you have any plugins that help with notebook
development, I’d be interested in hearing about them.</p>
</div>
Efficiently Cleaning Text with Pandas2021-02-16T07:25:00-06:002021-02-16T07:25:00-06:00Chris Moffitttag:pbpython.com,2021-02-16:/text-cleaning.html<p class="first">It’s no secret that data cleaning is a large portion of the data analysis process. When
using pandas, there are multiple techniques for cleaning text fields to prepare for
further analysis. As data sets grow large, it is important to find efficient methods that
perform in a reasonable time and are maintainable since text cleaning is a process that
evolves over time.</p>
<p class="last">This article will show examples of cleaning text fields in a large data file and illustrates
tips for how to efficiently clean unstructured text fields.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>It’s no secret that data cleaning is a large portion of the data analysis process. When
using pandas, there are multiple techniques for cleaning text fields to prepare for
further analysis. As data sets grow large, it is important to find efficient methods that
perform in a reasonable time and are maintainable since the text cleaning process evolves
over time.</p>
<p>This article will show examples of cleaning text fields in a large data file and illustrates
tips for how to efficiently clean unstructured text fields using Python and pandas.</p>
</div>
<div class="section" id="the-problem">
<h2>The problem</h2>
<p>For the sake of this article, let’s say you have a brand new craft whiskey that you would
like to sell. Your territory includes Iowa and there just happens to be an <a class="reference external" href="https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy">open data set</a>
that shows all of the liquor sales in the state. This seems like a great opportunity for
you to use your analysis skills to see who the biggest accounts are in the state. Armed
with that data, you can plan your sales process for each of the accounts.</p>
<p>Excited about the opportunity, you download the data and realize it’s pretty large. The
data set for this case is a <span class="caps">565MB</span> <span class="caps">CSV</span> file with 24 columns and 2.3M rows. This is not big
data by any means but it is big enough that it can make Excel crawl. It’s also big enough
that some of the pandas approaches will be relatively slow on your laptop.</p>
<p>For this article, I’ll be using data that includes all of 2019 sales. The file is too large
to bundle with this article, but you can <a class="reference external" href="https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy">download it</a> from the state site for this or any other time period.</p>
<p>Let’s get started by importing our modules and reading the data. I will also use the
<a class="reference external" href="https://github.com/chris1610/sidetable">sidetable</a> package to summarize the data. It’s not required for the cleaning but I wanted
to highlight how useful it can be for these data exploration scenarios.</p>
</div>
<div class="section" id="the-data">
<h2>The data</h2>
<p>Let’s get our data:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">sidetable</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s1">'2019_Iowa_Liquor_Sales.csv'</span><span class="p">)</span>
</pre></div>
<p>Here’s what the data looks like.</p>
<div class="figure" style="width: 1503px; height: auto; max-width: 100%;">
<img alt="DataFrame view" src="https://pbpython.com/images/text-clean-data.png" style="width: 1503px; height: auto; max-width: 100%;"/>
</div>
<p>The first thing we might want to do is see how much each store purchases and rank them
from the largest to the smallest. We have limited resources so we should focus on those places
where we get the best bang for the buck. It will be easier for us to call on a couple of
big corporate accounts instead of a lot of mom and pop stores.</p>
<p><code class="code">
sidetable</code>
is a shortcut to summarize the data in a readable format. The alternative
is doing a <code class="code">
groupby</code>
plus additional manipulation.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">stb</span><span class="o">.</span><span class="n">freq</span><span class="p">([</span><span class="s1">'Store Name'</span><span class="p">],</span> <span class="n">value</span><span class="o">=</span><span class="s1">'Sale (Dollars)'</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">cum_cols</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 702px; height: auto; max-width: 100%;">
<img alt="sidetable summary" src="https://pbpython.com/images/text-clean-summary.png" style="width: 702px; height: auto; max-width: 100%;"/>
</div>
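<p>For comparison, the plain pandas version of that summary takes a <code class="code">
groupby</code>
plus a few extra steps. A rough equivalent (illustrated here with a tiny made-up frame in place of the full data set) looks like this:</p>

```python
import pandas as pd

# Tiny stand-in for the Iowa liquor sales data
df = pd.DataFrame({
    "Store Name": ["Hy-Vee #3", "Costco", "Hy-Vee #3"],
    "Sale (Dollars)": [100.0, 300.0, 50.0],
})

# Total sales per store, largest first, plus percent of total
summary = (
    df.groupby("Store Name")["Sale (Dollars)"]
    .sum()
    .sort_values(ascending=False)
    .to_frame()
)
summary["Percent"] = summary["Sale (Dollars)"] / summary["Sale (Dollars)"].sum() * 100
```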
<p>One thing that’s apparent is that the store names are unique per location in most cases.
Ideally we would like to see all the sales for Hy-Vee, Costco, Sam’s, etc. grouped together.</p>
<p>Looks like we need to clean the data.</p>
</div>
<div class="section" id="cleaning-attempt-1">
<h2>Cleaning attempt #1</h2>
<p>The first approach we can investigate is using <code class="code">
.loc</code>
plus a boolean filter with
the <code class="code">
str</code>
accessor to search for the relevant string in the <code class="code">
Store Name</code>
column.</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Hy-Vee'</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Store_Group_1'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Hy-Vee'</span>
</pre></div>
<p>This code will search for the string ‘Hy-Vee’ using a case insensitive search and store the value
“Hy-Vee” in a new column called <code class="code">
Store_Group_1</code>
. This code will effectively convert names
like “Hy-Vee #3 / <span class="caps">BDI</span> / Des Moines” or “Hy-Vee Food Store / Urbandale” into a common “Hy-Vee”.</p>
<p>Here’s what <code class="code">
%%timeit</code>
tells us about this performance:</p>
<pre class="literal-block">
1.43 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
<p>Normally we don’t want to optimize too early in the process but one thing we can do is use
the <code class="code">
regex=False</code>
parameter to give a speedup:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Hy-Vee'</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Store_Group_1'</span><span class="p">]</span> <span class="o">=</span> <span class="s1">'Hy-Vee'</span>
</pre></div>
<pre class="literal-block">
804 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
<p>Here are the counts in the new column:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span><span class="o">.</span><span class="n">value_counts</span><span class="p">(</span><span class="n">dropna</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
<pre class="literal-block">
NaN 1617777
Hy-Vee 762568
Name: Store_Group_1, dtype: int64
</pre>
<p>We’ve cleaned up Hy-Vee but now there are a lot of other values we need to tackle.</p>
<p>The <code class="code">
.loc</code>
approach contains a lot of code and can be slow. We can use this concept
but look for some alternatives that are quicker to execute and easier to maintain.</p>
</div>
<div class="section" id="cleaning-attempt-2">
<h2>Cleaning attempt #2</h2>
<p>Another approach that is very performant and flexible is to use <code class="code">
np.select</code>
to run multiple
matches and apply a specified value upon match.</p>
<p>There are several good resources that I used to learn how to use <code class="code">
np.select</code>
. This
<a class="reference external" href="https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/">article</a> from Dataquest is a good overview. I also found this <a class="reference external" href="https://docs.google.com/presentation/d/1X7CheRfv0n4_I21z4bivvsHt6IDxkuaiAuCclSzia1E/edit#slide=id.g635adc05c1_1_1840">presentation</a> from Nathan Cheever
very interesting and informative. I encourage you to check both of these out.</p>
<p>The simplest explanation for what <code class="code">
np.select</code>
does is that it evaluates a list of conditions and
applies a corresponding list of values if the condition is true.</p>
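<p>Before applying this to the full data set, here is a minimal, self-contained sketch (using made-up store names, not the real data) of how <code class="code">np.select</code> pairs a list of boolean conditions with a list of values:</p>

```python
import numpy as np
import pandas as pd

# Toy series with made-up store names (not from the real data set)
ser = pd.Series(['Hy-Vee #3', 'Costco #44', 'Corner Shop'])

conditions = [
    ser.str.contains('Hy-Vee', regex=False),
    ser.str.contains('Costco', regex=False),
]
values = ['Hy-Vee', 'Costco']

# The first matching condition wins; 'other' is the fallback
result = np.select(conditions, values, 'other')
# result -> ['Hy-Vee', 'Costco', 'other']
```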
<p>In our case, our conditions will be different string lookups and the normalized string we
want to use instead will be the value.</p>
<p>After looking through the data, here’s a list of conditions and values in the <code class="code">
store_patterns</code>
list. Each tuple in this list is a <code class="code">
str.contains()</code>
lookup and the corresponding
text value we want to use to group like accounts together.</p>
<div class="highlight"><pre><span></span><span class="n">store_patterns</span> <span class="o">=</span> <span class="p">[</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Hy-Vee'</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Hy-Vee'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Central City'</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Central City'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">"Smokin' Joe's"</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s2">"Smokin' Joe's"</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Walmart|Wal-Mart'</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Wal-Mart'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Fareway Stores'</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Fareway Stores'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">"Casey's"</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s2">"Casey's General Store"</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">"Sam's Club"</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s2">"Sam's Club"</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Kum & Go'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Kum & Go'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'CVS'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'CVS Pharmacy'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Walgreens'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Walgreens'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Yesway'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Yesway Store'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Target Store'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Target'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Quik Trip'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Quik Trip'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Circle K'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Circle K'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Hometown Foods'</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span>
<span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Hometown Foods'</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s2">"Bucky's"</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s2">"Bucky's Express"</span><span class="p">),</span>
<span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s1">'Kwik'</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">),</span> <span class="s1">'Kwik Shop'</span><span class="p">)</span>
<span class="p">]</span>
</pre></div>
<p>One of the big challenges when working with <code class="code">
np.select</code>
is that it is easy to get the
conditions and values mismatched. To more easily keep track of the matches, I’ve decided to
combine each condition and its value into a tuple.</p>
<p>Because of this data structure, we need to break the list of tuples into two separate lists.
Using <code class="code">
zip</code>
we can take the <code class="code">
store_patterns</code>
and break into <code class="code">
store_criteria</code>
and <code class="code">
store_values</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">store_criteria</span><span class="p">,</span> <span class="n">store_values</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">store_patterns</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">store_criteria</span><span class="p">,</span> <span class="n">store_values</span><span class="p">,</span> <span class="s1">'other'</span><span class="p">)</span>
</pre></div>
<p>This code will fill in each match with the text value. If there is no match, we’ll assign
it the value ‘other’.</p>
<p>Here’s what it looks like now:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">stb</span><span class="o">.</span><span class="n">freq</span><span class="p">([</span><span class="s1">'Store_Group_1'</span><span class="p">],</span> <span class="n">value</span><span class="o">=</span><span class="s1">'Sale (Dollars)'</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">cum_cols</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 465px; height: auto; max-width: 100%;">
<img alt="sidetable summary" src="https://pbpython.com/images/text-clean-summary-2.png" style="width: 465px; height: auto; max-width: 100%;"/>
</div>
<p>This looks better but there is still 32.28% of our revenue in “other” accounts.</p>
<p>It would be nicer if, for an account that doesn’t match, we used the original <code class="code">
Store Name</code>
instead of lumping everything together in “other”. Here’s how we do that:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">store_criteria</span><span class="p">,</span> <span class="n">store_values</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span><span class="o">.</span><span class="n">combine_first</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">])</span>
</pre></div>
<p>This uses the <code class="code">
combine_first</code>
function to fill in all the <code class="code">
None</code>
values with
the <code class="code">
Store Name</code>
. This is a handy trick to keep in mind when cleaning your data.</p>
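<p>As a quick illustration of the trick, here is a small sketch (with made-up values) showing how <code class="code">combine_first</code> fills the gaps in one series from another:</p>

```python
import pandas as pd

# Made-up example: grouped has gaps (None) where no pattern matched
grouped = pd.Series(['Hy-Vee', None, 'Costco'])
original = pd.Series(['Hy-Vee #3', 'Corner Shop', 'Costco #44'])

# Missing values in grouped are filled from original, row by row
filled = grouped.combine_first(original)
# filled -> ['Hy-Vee', 'Corner Shop', 'Costco']
```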
<p>Let’s check our data:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">stb</span><span class="o">.</span><span class="n">freq</span><span class="p">([</span><span class="s1">'Store_Group_1'</span><span class="p">],</span> <span class="n">value</span><span class="o">=</span><span class="s1">'Sale (Dollars)'</span><span class="p">,</span> <span class="n">style</span><span class="o">=</span><span class="kc">True</span><span class="p">,</span> <span class="n">cum_cols</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 675px; height: auto; max-width: 100%;">
<img alt="sidetable summary" src="https://pbpython.com/images/text-clean-summary-3.png" style="width: 675px; height: auto; max-width: 100%;"/>
</div>
<p>This looks better because we can continue to refine the groupings as needed. For instance,
we may want to build a string lookup for Costco.</p>
<p>Performance is not too bad for a large data set:</p>
<pre class="literal-block">
13.2 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
<p>The benefit of this approach is that you can use <code class="code">
np.select</code>
for numeric analysis
as well as the text examples shown here. It is very flexible.</p>
<p>The one challenge with this approach is that there is a lot of code. If you had a large data set
to clean, the data and the code get intermixed in this solution, which can make it harder to maintain.</p>
<p>Is there another approach that might have similar performance but be a little cleaner?</p>
</div>
<div class="section" id="cleaning-attempt-3">
<h2>Cleaning attempt #3</h2>
<p>The next solution is based on this excellent <a class="reference external" href="https://www.metasnake.com/blog/pydata-assign.html">code example</a> from Matt Harrison who developed
a <code class="code">
generalize</code>
function that does the matching and cleaning for us. I’ve made some changes to make
it consistent with this example but want to give Matt credit. I would never have thought of
this solution without him doing 99% of the work!</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">generalize</span><span class="p">(</span><span class="n">ser</span><span class="p">,</span> <span class="n">match_name</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="kc">False</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="kc">False</span><span class="p">):</span>
<span class="sd">""" Search a series for text matches.</span>
<span class="sd"> Based on code from https://www.metasnake.com/blog/pydata-assign.html</span>
<span class="sd"> ser: pandas series to search</span>
<span class="sd"> match_name: tuple containing text to search for and text to use for normalization</span>
<span class="sd"> default: If no match, use this to provide a default value, otherwise use the original text</span>
<span class="sd"> regex: Boolean to indicate if match_name contains a regular expression</span>
<span class="sd"> case: Case sensitive search</span>
<span class="sd"> Returns a pandas series with the matched value</span>
<span class="sd"> """</span>
<span class="n">seen</span> <span class="o">=</span> <span class="kc">None</span>
<span class="k">for</span> <span class="n">match</span><span class="p">,</span> <span class="n">name</span> <span class="ow">in</span> <span class="n">match_name</span><span class="p">:</span>
<span class="n">mask</span> <span class="o">=</span> <span class="n">ser</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="n">match</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="n">case</span><span class="p">,</span> <span class="n">regex</span><span class="o">=</span><span class="n">regex</span><span class="p">)</span>
<span class="k">if</span> <span class="n">seen</span> <span class="ow">is</span> <span class="kc">None</span><span class="p">:</span>
<span class="n">seen</span> <span class="o">=</span> <span class="n">mask</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">seen</span> <span class="o">|=</span> <span class="n">mask</span>
<span class="n">ser</span> <span class="o">=</span> <span class="n">ser</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="o">~</span><span class="n">mask</span><span class="p">,</span> <span class="n">name</span><span class="p">)</span>
<span class="k">if</span> <span class="n">default</span><span class="p">:</span>
<span class="n">ser</span> <span class="o">=</span> <span class="n">ser</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">seen</span><span class="p">,</span> <span class="n">default</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">ser</span> <span class="o">=</span> <span class="n">ser</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">seen</span><span class="p">,</span> <span class="n">ser</span><span class="o">.</span><span class="n">values</span><span class="p">)</span>
<span class="k">return</span> <span class="n">ser</span>
</pre></div>
<p>This function can be called on a pandas series and expects a list of tuples. The first tuple
item is the value to search for and the second is the value to fill in for the matched value.</p>
<p>Here’s the equivalent pattern list:</p>
<div class="highlight"><pre><span></span><span class="n">store_patterns_2</span> <span class="o">=</span> <span class="p">[(</span><span class="s1">'Hy-Vee'</span><span class="p">,</span> <span class="s1">'Hy-Vee'</span><span class="p">),</span> <span class="p">(</span><span class="s2">"Smokin' Joe's"</span><span class="p">,</span> <span class="s2">"Smokin' Joe's"</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Central City'</span><span class="p">,</span> <span class="s1">'Central City'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Costco Wholesale'</span><span class="p">,</span> <span class="s1">'Costco Wholesale'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Walmart'</span><span class="p">,</span> <span class="s1">'Walmart'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Wal-Mart'</span><span class="p">,</span> <span class="s1">'Walmart'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Fareway Stores'</span><span class="p">,</span> <span class="s1">'Fareway Stores'</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"Casey's"</span><span class="p">,</span> <span class="s2">"Casey's General Store"</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"Sam's Club"</span><span class="p">,</span> <span class="s2">"Sam's Club"</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Kum & Go'</span><span class="p">,</span> <span class="s1">'Kum & Go'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'CVS'</span><span class="p">,</span> <span class="s1">'CVS Pharmacy'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Walgreens'</span><span class="p">,</span> <span class="s1">'Walgreens'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Yesway'</span><span class="p">,</span> <span class="s1">'Yesway Store'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Target Store'</span><span class="p">,</span> <span class="s1">'Target'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Quik Trip'</span><span class="p">,</span> <span class="s1">'Quik Trip'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Circle K'</span><span class="p">,</span> <span class="s1">'Circle K'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'Hometown Foods'</span><span class="p">,</span> <span class="s1">'Hometown Foods'</span><span class="p">),</span>
<span class="p">(</span><span class="s2">"Bucky's"</span><span class="p">,</span> <span class="s2">"Bucky's Express"</span><span class="p">),</span> <span class="p">(</span><span class="s1">'Kwik'</span><span class="p">,</span> <span class="s1">'Kwik Shop'</span><span class="p">)]</span>
</pre></div>
<p>A useful benefit of this solution is that it is much easier to maintain this list than
the earlier <code class="code">
store_patterns</code>
example.</p>
<p>The other change I made with the <code class="code">
generalize</code>
function is that the original value will be preserved
if there is no default value provided. Instead of using <code class="code">
combine_first</code>
, the
function will take care of it all. Finally, I turned off the regex match by default for a
small performance improvement.</p>
<p>Now that the data is all set up, calling it is simple:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_2'</span><span class="p">]</span> <span class="o">=</span> <span class="n">generalize</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">],</span> <span class="n">store_patterns_2</span><span class="p">)</span>
</pre></div>
<p>How about performance?</p>
<pre class="literal-block">
15.5 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
<p>It is a little slower but I think it’s a more elegant solution and what I would use in the
future if I had to do a similar text cleanup.</p>
<p>The downside to this approach is that it is designed for string cleaning. The <code class="code">
np.select</code>
solution is more broadly useful since it can be applied to numeric values as well.</p>
</div>
<div class="section" id="what-about-data-types">
<h2>What about data types?</h2>
<p>In recent versions of pandas there is a dedicated <code class="code">
string</code>
type. I tried converting
the <code class="code">
Store Name</code>
to a pandas string type to see if there was any performance improvement.
I did not notice any changes. However, it’s possible there will be speed improvements in the future
so keep that in mind.</p>
<p>While the string type did not make a difference, the <code class="code">
category</code>
type showed a lot of
promise on this data set. Refer to my <a class="reference external" href="https://pbpython.com/pandas_dtypes_cat.html">previous article</a> for details on the category data type.</p>
<p>We can convert the data to a category using <code class="code">
astype</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">'category'</span><span class="p">)</span>
</pre></div>
<p>Now re-run the <code class="code">
np.select</code>
example exactly as we did earlier:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_3'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">store_criteria</span><span class="p">,</span> <span class="n">store_values</span><span class="p">,</span> <span class="kc">None</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_3'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_1'</span><span class="p">]</span><span class="o">.</span><span class="n">combine_first</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">])</span>
</pre></div>
<pre class="literal-block">
786 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
<div class="alert alert-success compound">
<p>We went from 13s to less than 1 second by making one simple change. Amazing!</p>
</div>
<p>The reason this works is pretty straightforward. When pandas converts a column to a
categorical type, pandas will only call the expensive <code class="code">
str.contains()</code>
function
on each unique text value. Because this data set has a lot of repeated data, we get a huge
performance boost.</p>
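<p>A small sketch (with made-up data) illustrates the idea: even though the series has many rows, the categorical version only carries a handful of distinct categories for the string operation to work through:</p>

```python
import pandas as pd

# Many repeated rows, but only two distinct store names (made-up data)
ser = pd.Series(['Hy-Vee #3'] * 1000 + ['Costco #44'] * 1000)
cat = ser.astype('category')

n_rows = len(cat)                   # 2000 rows in total
n_unique = len(cat.cat.categories)  # but only 2 unique categories
```

With the categorical dtype, the expensive string matching only has to consider the unique categories rather than every row, which is where the speedup comes from.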
<p>Let’s see if this works for our <code class="code">
generalize</code>
function:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store_Group_4'</span><span class="p">]</span> <span class="o">=</span> <span class="n">generalize</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">],</span> <span class="n">store_patterns_2</span><span class="p">)</span>
</pre></div>
<p>Unfortunately we get this error:</p>
<div class="highlight"><pre><span></span><span class="ne">ValueError</span><span class="p">:</span> <span class="n">Cannot</span> <span class="n">setitem</span> <span class="n">on</span> <span class="n">a</span> <span class="n">Categorical</span> <span class="k">with</span> <span class="n">a</span> <span class="n">new</span> <span class="n">category</span><span class="p">,</span> <span class="nb">set</span> <span class="n">the</span> <span class="n">categories</span> <span class="n">first</span>
</pre></div>
<p>That error highlights some of the challenges I have had in the past when dealing with Categorical
data. When merging and joining categorical data, you can run into these types of errors.</p>
<p>I tried to find a good way to modify <code class="code">
generalize()</code>
to work with categorical data but could not. Bonus points to any reader who figures it out.</p>
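<p>One possible direction (just a sketch, not a full fix for <code class="code">
generalize()</code>
) is to register the new labels with <code class="code">
cat.add_categories()</code>
before assigning them:</p>

```python
import pandas as pd

s = pd.Series(["a", "b", "a"], dtype="category")
# Register the new label first, then the assignment succeeds
s = s.cat.add_categories(["c"])
s.iloc[0] = "c"
print(list(s))  # ['c', 'b', 'a']
```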
<p>However, there is a way we can replicate the Category approach by building a lookup table.</p>
</div>
<div class="section" id="lookup-table">
<h2>Lookup table</h2>
<p>As we learned with the Categorical approach, this data set has a lot of duplicated data.
We can build a lookup table and run the resource-intensive function only once per unique string.</p>
<p>To illustrate how this works on strings, let’s convert the value back to a string type instead
of the category:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s1">'string'</span><span class="p">)</span>
</pre></div>
<p>First we build a lookup DataFrame that contains all the unique values and run the <code class="code">
generalize</code>
function:</p>
<div class="highlight"><pre><span></span><span class="n">lookup_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">lookup_df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">]</span><span class="o">.</span><span class="n">unique</span><span class="p">()</span>
<span class="n">lookup_df</span><span class="p">[</span><span class="s1">'Store_Group_5'</span><span class="p">]</span> <span class="o">=</span> <span class="n">generalize</span><span class="p">(</span><span class="n">lookup_df</span><span class="p">[</span><span class="s1">'Store Name'</span><span class="p">],</span> <span class="n">store_patterns_2</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 759px; height: auto; max-width: 100%;">
<img alt="Lookup table approach" src="https://pbpython.com/images/text-clean-lookup-table.png" style="width: 759px; height: auto; max-width: 100%;"/>
</div>
<p>We can merge it back into a final DataFrame:</p>
<div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">lookup_df</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">)</span>
</pre></div>
<pre class="literal-block">
1.38 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
</pre>
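<p>Since this merge relies on the lookup table having exactly one row per store name, a defensive sketch (with hypothetical data) can use the <code class="code">
validate</code>
argument to catch accidental duplicates:</p>

```python
import pandas as pd

df = pd.DataFrame({"Store Name": ["A", "B", "A"]})
lookup_df = pd.DataFrame({"Store Name": ["A", "B"],
                          "Store_Group_5": ["G1", "G2"]})

# validate="m:1" raises MergeError if lookup_df has duplicate Store Name rows
merged = pd.merge(df, lookup_df, how="left", validate="m:1")
print(merged["Store_Group_5"].tolist())  # ['G1', 'G2', 'G1']
```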
<p>It is slower than the <code class="code">
np.select</code>
approach on categorical data, but the performance impact
might be outweighed by the easier maintenance of the lookup list.</p>
<p>Also, the intermediate <code class="code">
lookup_df</code>
could be a great output to share with an analyst
that can help you clean up more of the data. That savings could be measured in hours of work!</p>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>This <a class="reference external" href="https://counting.substack.com/p/data-cleaning-is-analysis-not-grunt">newsletter</a> by Randy Au is a good discussion of the importance of data cleaning and
the love-hate relationship many data scientists have with this task. I agree with Randy’s premise
that data cleaning is analysis.</p>
<p>In my experience, you can learn a lot about your underlying data by taking up the kind of
cleaning activities outlined in this article.</p>
<p>I suspect you are going to find lots of cases in your day to day analysis where you need
to do text cleaning similar to what I’ve shown in this article.</p>
<p>Here is a quick summary of the solutions we looked at:</p>
<table border="1" class="colwidths-given table docutils">
<caption>Text Cleaning Options</caption>
<colgroup>
<col width="30%"/>
<col width="20%"/>
<col width="50%"/>
</colgroup>
<thead valign="bottom">
<tr><th class="head">Solution</th>
<th class="head">Execution time</th>
<th class="head">Notes</th>
</tr>
</thead>
<tbody valign="top">
<tr><td><code class="code">
np.select</code>
</td>
<td>13s</td>
<td>Can work for non-text analysis</td>
</tr>
<tr><td><code class="code">
generalize</code>
</td>
<td>15s</td>
<td>Text only</td>
</tr>
<tr><td>Category Data and <code class="code">
np.select</code>
</td>
<td>786ms</td>
<td>Categorical data can get tricky when merging and joining</td>
</tr>
<tr><td>Lookup table and <code class="code">
generalize</code>
</td>
<td>1.3s</td>
<td>A lookup table can be maintained by someone else</td>
</tr>
</tbody>
</table>
<p>For some data sets, performance is not an issue so pick what clicks with your brain.</p>
<p>However, as the data grows in size (imagine doing this analysis for 50 states’ worth of data),
you will need to understand how to use pandas efficiently for text cleaning. My
hope is that you bookmark this article and come back to it when you face a similar problem.</p>
<p>As always, if you have some other tips that might be useful to folks, let me know in the
comments. If you figure out how to make my <code class="code">
generalize</code>
function work with categorical data,
let me know too.</p>
</div>
Case Study: Automating Excel File Creation and Distribution with Pandas and Outlook2021-01-18T07:25:00-06:002021-01-18T07:25:00-06:00Chris Moffitttag:pbpython.com,2021-01-18:/excel-email.html<p class="first last">I enjoy hearing from readers that have used concepts from this blog to solve their own problems.
It always amazes me when I see examples where only a few lines of python code can solve
a real business problem and save organizations a lot of time and money. I am also impressed
when people figure out how to do this with no formal training - just with some hard work and
willingness to persevere through the learning curve.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>I enjoy hearing from readers that have used concepts from this blog to solve their own problems.
It always amazes me when I see examples where only a few lines of python code can solve
a real business problem and save organizations a lot of time and money. I am also impressed
when people figure out how to do this with no formal training - just with some hard work and
willingness to <a class="reference external" href="https://pbpython.com/plateau-of-productivity.html">persevere</a> through the learning curve.</p>
<p>This example comes from Mark Doll. I’ll turn it over to him to give his background:</p>
<blockquote>
I have been learning/using Python for about 3 years to help automate business processes
and reporting. I’ve never had any formal training in Python, but found it to be a reliable
tool that has helped me in my work.</blockquote>
<p>Read on for more details on how Mark used Python to automate a very manual process of collecting and
sorting Excel files to email to 100’s of users.</p>
</div>
<div class="section" id="the-problem">
<h2>The Problem</h2>
<p>Here’s Mark’s overview of the problem:</p>
<blockquote>
<p>A business need arose to send out emails with Excel attachments to a list of
~500 users and presented us with a large task to complete manually. Making this task harder
was the fact that we had to split data up by user from a master Excel file to create their
own specific file, then email that file out to the correct user.</p>
<p>Imagine the time it would take to manually filter, cut and paste the data into a file,
then save it and email it out - 500 times! Using this Python approach we were able to
automate the entire process and save valuable time.</p>
</blockquote>
<p>I have seen this type of problem multiple times in my experience. If you don’t have experience
with a programming language, then it can seem daunting. With Python, it’s very feasible to
automate this tedious process. Here’s a graphical view of what Mark was able to do:</p>
<div class="figure" style="width: 1219px; height: auto; max-width: 100%;">
<img alt="File paths" src="https://pbpython.com/images/email-case-study-process-transparent.png" style="width: 1219px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="solving-the-problem">
<h2>Solving the Problem</h2>
<p>The first step is getting the imports in place:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">win32com.client</span> <span class="k">as</span> <span class="nn">win32</span>
</pre></div>
<p>Now we will set up some strings with the current date and our directory structure:</p>
<div class="highlight"><pre><span></span><span class="c1">## Set Date Formats</span>
<span class="n">today_string</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">today</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%m</span><span class="si">%d</span><span class="s1">%Y_%I%p'</span><span class="p">)</span>
<span class="n">today_string2</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">today</span><span class="p">()</span><span class="o">.</span><span class="n">strftime</span><span class="p">(</span><span class="s1">'%b </span><span class="si">%d</span><span class="s1">, %Y'</span><span class="p">)</span>
<span class="c1">## Set Folder Targets for Attachments and Archiving</span>
<span class="n">attachment_path</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'data'</span> <span class="o">/</span> <span class="s1">'attachments'</span>
<span class="n">archive_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'archive'</span>
<span class="n">src_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'data'</span> <span class="o">/</span> <span class="s1">'Example4.xlsx'</span>
</pre></div>
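<p>One assumption in this setup is that the <code class="code">
data/attachments</code>
and <code class="code">
archive</code>
folders already exist. A small safeguard (my addition, not part of Mark’s original script) creates them if needed:</p>

```python
from pathlib import Path

attachment_path = Path.cwd() / "data" / "attachments"
archive_dir = Path.cwd() / "archive"

# Create the folders if they do not already exist
for folder in (attachment_path, archive_dir):
    folder.mkdir(parents=True, exist_ok=True)
print(attachment_path.exists() and archive_dir.exists())  # True
```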
<p>Let’s take a look at the data file we need to process:</p>
<div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">src_file</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 896px; height: auto; max-width: 100%;">
<img alt="Excel file view" src="https://pbpython.com/images/case-study-email-1.png" style="width: 896px; height: auto; max-width: 100%;"/>
</div>
<p>The next step is to group all of the <code class="code">
CUSTOMER_ID</code>
transactions together. We start by
doing a <code class="code">
groupby</code>
on <code class="code">
CUSTOMER_ID</code>
.</p>
<div class="highlight"><pre><span></span><span class="n">customer_group</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'CUSTOMER_ID'</span><span class="p">)</span>
</pre></div>
<p>It might not be apparent to you what <code class="code">
customer_group</code>
is in this case.
A loop shows how we can process this grouped object:</p>
<div class="highlight"><pre><span></span><span class="k">for</span> <span class="n">ID</span><span class="p">,</span> <span class="n">group_df</span> <span class="ow">in</span> <span class="n">customer_group</span><span class="p">:</span>
<span class="nb">print</span><span class="p">(</span><span class="n">ID</span><span class="p">)</span>
</pre></div>
<div class="highlight"><pre><span></span><span class="n">A1000</span>
<span class="n">A1001</span>
<span class="n">A1002</span>
<span class="n">A1005</span>
</pre></div>
<p>Here’s the last <code class="code">
group_df</code>
that shows all of the transactions for customer A1005:</p>
<div class="figure" style="width: 887px; height: auto; max-width: 100%;">
<img alt="Excel file view" src="https://pbpython.com/images/case-study-email-cust-group.png" style="width: 887px; height: auto; max-width: 100%;"/>
</div>
<p>We have everything we need to create an Excel file for each customer and store in a directory
for future use:</p>
<div class="highlight"><pre><span></span><span class="c1">## Write each ID, Group to Individual Excel files and use ID to name each file with Today's Date</span>
<span class="n">attachments</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">ID</span><span class="p">,</span> <span class="n">group_df</span> <span class="ow">in</span> <span class="n">customer_group</span><span class="p">:</span>
<span class="n">attachment</span> <span class="o">=</span> <span class="n">attachment_path</span> <span class="o">/</span> <span class="sa">f</span><span class="s1">'</span><span class="si">{</span><span class="n">ID</span><span class="si">}</span><span class="s1">_</span><span class="si">{</span><span class="n">today_string</span><span class="si">}</span><span class="s1">.xlsx'</span>
<span class="n">group_df</span><span class="o">.</span><span class="n">to_excel</span><span class="p">(</span><span class="n">attachment</span><span class="p">,</span> <span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="n">attachments</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">ID</span><span class="p">,</span> <span class="nb">str</span><span class="p">(</span><span class="n">attachment</span><span class="p">)))</span>
</pre></div>
<p>The <code class="code">
attachments</code>
list contains the customer <span class="caps">ID</span> and the full path to the file:</p>
<div class="highlight"><pre><span></span><span class="p">[(</span><span class="s1">'A1000'</span><span class="p">,</span>
<span class="s1">'c:</span><span class="se">\\</span><span class="s1">Users</span><span class="se">\\</span><span class="s1">chris</span><span class="se">\\</span><span class="s1">notebooks</span><span class="se">\\</span><span class="s1">2020-10</span><span class="se">\\</span><span class="s1">data</span><span class="se">\\</span><span class="s1">attachments</span><span class="se">\\</span><span class="s1">A1000_01162021_12PM.xlsx'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'A1001'</span><span class="p">,</span>
<span class="s1">'c:</span><span class="se">\\</span><span class="s1">Users</span><span class="se">\\</span><span class="s1">chris</span><span class="se">\\</span><span class="s1">notebooks</span><span class="se">\\</span><span class="s1">2020-10</span><span class="se">\\</span><span class="s1">data</span><span class="se">\\</span><span class="s1">attachments</span><span class="se">\\</span><span class="s1">A1001_01162021_12PM.xlsx'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'A1002'</span><span class="p">,</span>
<span class="s1">'c:</span><span class="se">\\</span><span class="s1">Users</span><span class="se">\\</span><span class="s1">chris</span><span class="se">\\</span><span class="s1">notebooks</span><span class="se">\\</span><span class="s1">2020-10</span><span class="se">\\</span><span class="s1">data</span><span class="se">\\</span><span class="s1">attachments</span><span class="se">\\</span><span class="s1">A1002_01162021_12PM.xlsx'</span><span class="p">),</span>
<span class="p">(</span><span class="s1">'A1005'</span><span class="p">,</span>
<span class="s1">'c:</span><span class="se">\\</span><span class="s1">Users</span><span class="se">\\</span><span class="s1">chris</span><span class="se">\\</span><span class="s1">notebooks</span><span class="se">\\</span><span class="s1">2020-10</span><span class="se">\\</span><span class="s1">data</span><span class="se">\\</span><span class="s1">attachments</span><span class="se">\\</span><span class="s1">A1005_01162021_12PM.xlsx'</span><span class="p">)]</span>
</pre></div>
<p>To make the processing easier, we convert the list to a DataFrame:</p>
<div class="highlight"><pre><span></span><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">attachments</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'CUSTOMER_ID'</span><span class="p">,</span> <span class="s1">'FILE'</span><span class="p">])</span>
</pre></div>
<div class="figure" style="width: 453px; height: auto; max-width: 100%;">
<img alt="File paths" src="https://pbpython.com/images/case-study-email-file-paths.png" style="width: 453px; height: auto; max-width: 100%;"/>
</div>
<p>The final data prep stage is to generate a list of files with their email addresses by
merging the DataFrames together:</p>
<div class="highlight"><pre><span></span><span class="n">email_merge</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">merge</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">df2</span><span class="p">,</span> <span class="n">how</span><span class="o">=</span><span class="s1">'left'</span><span class="p">)</span>
<span class="n">combined</span> <span class="o">=</span> <span class="n">email_merge</span><span class="p">[[</span><span class="s1">'CUSTOMER_ID'</span><span class="p">,</span> <span class="s1">'EMAIL'</span><span class="p">,</span> <span class="s1">'FILE'</span><span class="p">]]</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">()</span>
</pre></div>
<p>Which gives this simple DataFrame:</p>
<div class="figure" style="width: 570px; height: auto; max-width: 100%;">
<img alt="File paths" src="https://pbpython.com/images/case-study-email-address-path.png" style="width: 570px; height: auto; max-width: 100%;"/>
</div>
<p>We’ve gathered the list of customers, their emails and the attachments. Now we need to send
an email with Outlook. Refer to <a class="reference external" href="https://pbpython.com/windows-com.html">this article</a> for additional explanation of this code:</p>
<div class="highlight"><pre><span></span><span class="c1"># Email Individual Reports to Respective Recipients</span>
<span class="k">class</span> <span class="nc">EmailsSender</span><span class="p">:</span>
<span class="k">def</span> <span class="fm">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="bp">self</span><span class="o">.</span><span class="n">outlook</span> <span class="o">=</span> <span class="n">win32</span><span class="o">.</span><span class="n">Dispatch</span><span class="p">(</span><span class="s1">'outlook.application'</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">send_email</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">to_email_address</span><span class="p">,</span> <span class="n">attachment_path</span><span class="p">):</span>
<span class="n">mail</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">outlook</span><span class="o">.</span><span class="n">CreateItem</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">mail</span><span class="o">.</span><span class="n">To</span> <span class="o">=</span> <span class="n">to_email_address</span>
<span class="n">mail</span><span class="o">.</span><span class="n">Subject</span> <span class="o">=</span> <span class="n">today_string2</span> <span class="o">+</span> <span class="s1">' Report'</span>
<span class="n">mail</span><span class="o">.</span><span class="n">Body</span> <span class="o">=</span> <span class="s2">"""Please find today's report attached."""</span>
<span class="n">mail</span><span class="o">.</span><span class="n">Attachments</span><span class="o">.</span><span class="n">Add</span><span class="p">(</span><span class="n">Source</span><span class="o">=</span><span class="n">attachment_path</span><span class="p">)</span>
<span class="c1"># Use this to show the email</span>
<span class="c1">#mail.Display(True)</span>
<span class="c1"># Uncomment to send</span>
<span class="c1">#mail.Send()</span>
</pre></div>
<p>We can use this simple class to generate the emails and attach the Excel file.</p>
<div class="highlight"><pre><span></span><span class="n">email_sender</span> <span class="o">=</span> <span class="n">EmailsSender</span><span class="p">()</span>
<span class="k">for</span> <span class="n">index</span><span class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">combined</span><span class="o">.</span><span class="n">iterrows</span><span class="p">():</span>
<span class="n">email_sender</span><span class="o">.</span><span class="n">send_email</span><span class="p">(</span><span class="n">row</span><span class="p">[</span><span class="s1">'EMAIL'</span><span class="p">],</span> <span class="n">row</span><span class="p">[</span><span class="s1">'FILE'</span><span class="p">])</span>
</pre></div>
<div class="figure" style="width: 873px; height: auto; max-width: 100%;">
<img alt="Outlook Email" src="https://pbpython.com/images/case-study-email-outlook.png" style="width: 873px; height: auto; max-width: 100%;"/>
</div>
<p>The last step is to move the files to our archive directory:</p>
<div class="highlight"><pre><span></span><span class="c1"># Move the files to the archive location</span>
<span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">attachments</span><span class="p">:</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">move</span><span class="p">(</span><span class="n">f</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">archive_dir</span><span class="p">)</span>
</pre></div>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>This example does a nice job of automating a highly manual process where someone
likely did a lot of copying and pasting and manual file manipulation. I hope the solution that
Mark developed can help you figure out how to automate some of the more painful parts of
your job.</p>
<p>I encourage you to use this example to identify similar challenges in your day to day work.
Maybe you don’t have to work with 100’s of files but you might have a manual process you
run once a week. Even if that process only takes 1 hour, use that as a jumping off point to
figure out how to use Python to make it easier. There is no better way to learn Python
than to apply it to one of your own problems.</p>
<p>Thanks again to Mark for taking the time to walk us through this content example!</p>
</div>
Pandas DataFrame Visualization Tools2021-01-11T07:25:00-06:002021-01-11T07:25:00-06:00Chris Moffitttag:pbpython.com,2021-01-11:/dataframe-gui-overview.html<p class="first">I have talked quite a bit about how pandas is a great alternative to Excel for many tasks.
One of Excel’s benefits is that it offers an intuitive and powerful graphical interface for
viewing your data. In contrast, pandas + a Jupyter notebook offers a lot of programmatic
power but limited abilities to graphically display and manipulate a DataFrame view.</p>
<p>There are several tools in the Python ecosystem that are designed to fill this gap. They range
in complexity from simple JavaScript libraries to complex, full-featured data analysis engines.
The one common denominator is that they all provide a way to view and selectively filter
your data in a graphical format. From this point of commonality they diverge quite a bit in
design and functionality.</p>
<p class="last">This article will review several of these options in order to give you an idea of the landscape
and evaluate which ones might be useful for your analysis process.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>I have talked quite a bit about how pandas is a great alternative to Excel for many tasks.
One of Excel’s benefits is that it offers an intuitive and powerful graphical interface for
viewing your data. In contrast, pandas + a Jupyter notebook offers a lot of programmatic
power but limited abilities to graphically display and manipulate a DataFrame view.</p>
<p>There are several tools in the Python ecosystem that are designed to fill this gap. They range
in complexity from simple JavaScript libraries to complex, full-featured data analysis engines.
The one common denominator is that they all provide a way to view and selectively filter
your data in a graphical format. From this point of commonality they diverge quite a bit in
design and functionality.</p>
<p>This article will review several of these DataFrame visualization options in order to give
you an idea of the landscape and evaluate which ones might be useful for your analysis process.</p>
</div>
<div class="section" id="background">
<h2>Background</h2>
<p>For this article, we will use a sample sales data set we have used in the past. Here is a
view of the data in a traditional notebook:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">df</span>
</pre></div>
<div class="figure" style="width: 824px; height: auto; max-width: 100%;">
<img alt="DataFrame in Notebook" src="https://pbpython.com/images/df-gui-notebook.png" style="width: 824px; height: auto; max-width: 100%;"/>
</div>
<p>Here’s a similar view in Excel with a filter applied to all columns:</p>
<div class="figure" style="width: 803px; height: auto; max-width: 100%;">
<img alt="DataFrame in Excel" src="https://pbpython.com/images/df-gui-excel.png" style="width: 803px; height: auto; max-width: 100%;"/>
</div>
<p>This familiar view in Excel allows you to easily see all your data. You can filter and sort to inspect
the data and dive deeper into the details where needed. This type of functionality is most useful
when you are exploring a new dataset or tackling a new problem on an existing dataset.</p>
<p>Obviously this is not feasible with millions of rows of data. However, even if you have large
datasets and are a pandas expert, I expect you still dump DataFrames to Excel and view subsets
of data. I know I do.</p>
<p>Part of the reason I use Excel + python is that the ad-hoc abilities
to inspect the data in Excel are much better than the vanilla DataFrame views.</p>
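<p>That workflow is usually just a one-liner. Here is a sketch with made-up data (writing the file requires <code class="code">
openpyxl</code>
, so that call is commented out):</p>

```python
import pandas as pd

# Made-up data standing in for a larger DataFrame
df = pd.DataFrame({"name": ["Smith", "Jones", "Lee"],
                   "quantity": [5, 20, 35]})
subset = df[df["quantity"] > 10]
# subset.to_excel("inspect_me.xlsx", index=False)  # open in Excel to browse
print(len(subset))  # 2
```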
<p>With that background, let’s look at some of the options for replicating this easy viewing
capability we have in Excel.</p>
</div>
<div class="section" id="javascript-tools">
<h2>JavaScript tools</h2>
<p>The simplest approach is to use a JavaScript library to add some interactivity to the DataFrame
view in a notebook.</p>
<div class="section" id="qgrid">
<h3>Qgrid</h3>
<p>The first one we will look at is <a class="reference external" href="https://github.com/quantopian/qgrid">Qgrid</a> from Quantopian. This Jupyter notebook widget uses
the SlickGrid component to add interactivity to your DataFrame.</p>
<p>Once it is installed, you can display a version of your DataFrame that supports sorting and
filtering data.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">qgrid</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">widget</span> <span class="o">=</span> <span class="n">qgrid</span><span class="o">.</span><span class="n">show_grid</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
<span class="n">widget</span>
</pre></div>
<div class="figure" style="width: 1092px; height: auto; max-width: 100%;">
<img alt="Qgrid example" src="https://pbpython.com/images/qgrid-example.png" style="width: 1092px; height: auto; max-width: 100%;"/>
</div>
<p>Qgrid supports intuitive filtering using various widgets based on the underlying data types.
In addition, you can configure some of the rendering features and then read the selected
data back into a DataFrame, which is a pretty useful feature.</p>
<p>Qgrid does not perform any visualization nor does it allow you to use pandas expressions to
filter and select data.</p>
<p>Overall, Qgrid works well for simple data manipulation and inspection.</p>
</div>
<div class="section" id="pivottablejs">
<h3>PivottableJs</h3>
<p>The next option isn’t really for viewing a DataFrame, but I think it’s a really useful tool
for summarizing data, so I’m covering it.</p>
<p>The <a class="reference external" href="https://github.com/nicolaskruchten/jupyter_pivottablejs">pivottablejs</a> module uses a pivot table JavaScript library for interactive data
pivoting and summarizing.</p>
<p>Once it is installed, usage is simple:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pivottablejs</span> <span class="kn">import</span> <span class="n">pivot_ui</span>
<span class="n">pivot_ui</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
<p>In this example, I summarized quantity purchased for each customer by clicking and dragging.</p>
<div class="figure" style="width: 1089px; height: auto; max-width: 100%;">
<img alt="Pivot Table Example" src="https://pbpython.com/images/pivot-table-example.png" style="width: 1089px; height: auto; max-width: 100%;"/>
</div>
<p>In addition to basic sum functions, you can do some visualization and statistical analysis as well.</p>
<div class="figure" style="width: 237px; height: auto; max-width: 100%;">
<img alt="Pivot Table Example" src="https://pbpython.com/images/pivot-table-example-2.png" style="width: 237px; height: auto; max-width: 100%;"/>
</div>
<p>This widget is not useful for filtering a raw DataFrame but is really powerful for pivoting
and summarizing data. One of the nice features is that you can filter the data once
you build your pivot table.</p>
<p>The other downside with this widget is that it does not leverage any of the pandas pivoting
or selecting functions. Still, pivottablejs is a really useful tool for quick pivots and summaries.</p>
</div>
</div>
<div class="section" id="data-analysis-applications">
<h2>Data Analysis Applications</h2>
<p>The second category of <span class="caps">GUI</span> applications are full-fledged applications typically using
a web back-end like Flask or a separate application based on Qt. These applications vary
in complexity and capability from simple table views and plotting capabilities to robust
statistical analysis. One aspect that is unique about these tools is that they closely
integrate with pandas so you can use pandas code to filter the data and interact with
these applications.</p>
<div class="section" id="pandasgui">
<h3>PandasGUI</h3>
<p>The first application I will discuss is <a class="reference external" href="https://github.com/adamerose/pandasgui">PandasGUI</a>. This application is unique in that
it is a standalone app built with Qt that can be invoked from a Jupyter notebook.</p>
<p>Using the same data from the previous example, import the <code class="code">
show</code>
command:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pandasgui</span> <span class="kn">import</span> <span class="n">show</span>
<span class="n">show</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
<p>If everything works, you will end up with a separate <span class="caps">GUI</span>. Because it is a standalone application,
you can configure the view quite a bit. For example, I have moved a couple of the tabs
around to show more of the capability on one page.</p>
<p>In this example, I’m filtering the data using pandas <a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query">query syntax</a> to show one customer and
purchase quantities > 15.</p>
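<p>The same filter can be expressed directly in pandas, since PandasGUI passes the query
string through to <code class="code">DataFrame.query</code>. A minimal sketch with
hypothetical column names mirroring the screenshot:</p>

```python
import pandas as pd

# Hypothetical columns mirroring the filtered view in the screenshot
df = pd.DataFrame({
    'name': ['Bobs Burgers', 'Bobs Burgers', 'Pour House'],
    'quantity': [20, 10, 25],
})

# One customer, purchase quantities > 15 - pandas query syntax
filtered = df.query("name == 'Bobs Burgers' and quantity > 15")
```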
<div class="figure" style="width: 2413px; height: auto; max-width: 100%;">
<img alt="Pandas GUI" src="https://pbpython.com/images/pandas-gui-1.png" style="width: 2413px; height: auto; max-width: 100%;"/>
</div>
<p>PandasGUI integrates with Plotly and allows you to build visualizations as well. Here is
an example histogram of the unit price:</p>
<div class="figure" style="width: 1687px; height: auto; max-width: 100%;">
<img alt="Pandas GUI Plot" src="https://pbpython.com/images/pandas-gui-2.png" style="width: 1687px; height: auto; max-width: 100%;"/>
</div>
<p>One nice capability of PandasGUI is that the filters are in effect for the DataFrame across all
the tabs. You can use this feature to try different views of the data when plotting or
transforming the data.</p>
<p>The other capability that PandasGUI has is that you can reshape the data by pivoting
or melting it. Here’s a summary of the unit sales by <span class="caps">SKU</span>.</p>
<div class="figure" style="width: 1030px; height: auto; max-width: 100%;">
<img alt="Pandas GUI Pivot" src="https://pbpython.com/images/pandas-gui-3.png" style="width: 1030px; height: auto; max-width: 100%;"/>
</div>
<p>Here’s what the resulting view looks like:</p>
<div class="figure" style="width: 929px; height: auto; max-width: 100%;">
<img alt="Pandas GUI Pivot View" src="https://pbpython.com/images/pandas-gui-4.png" style="width: 929px; height: auto; max-width: 100%;"/>
</div>
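<p>Both reshape operations correspond to one-liners in pandas. A minimal sketch, again
using hypothetical column names in place of the article’s sales data:</p>

```python
import pandas as pd

# Hypothetical sales data with a SKU column
df = pd.DataFrame({
    'sku': ['S1-001', 'S1-001', 'S2-002'],
    'quantity': [10, 20, 25],
    'ext price': [100.0, 200.0, 500.0],
})

# pivot_table: summarize unit sales by SKU, like the PandasGUI reshape tab
by_sku = pd.pivot_table(df, index='sku', values='quantity', aggfunc='sum')

# melt: the complementary reshape, turning wide columns into long key/value rows
long_df = df.melt(id_vars='sku', value_vars=['quantity', 'ext price'])
```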
<p>PandasGUI is an impressive application. I like how it keeps track of all the changes and
is just a small wrapper over standard pandas functionality. The program is under active
development so I will be following it closely to see how it improves and grows over time.</p>
<p>If you are curious to see more functionality, <a class="reference external" href="https://www.youtube.com/watch?v=NKXdolMxW2Y">this video</a> shows another good walk through.</p>
</div>
<div class="section" id="tabloo">
<h3>Tabloo</h3>
<p>This one gets the award for the name that makes me smile every time I see it. Hopefully a
certain large commercial visualization tool doesn’t get too upset about the similarity!</p>
<p>Anyway, <a class="reference external" href="https://github.com/bluenote10/tabloo">Tabloo</a> uses a Flask backend to provide a simple visualization tool for DataFrames
as well as plotting capability similar to PandasGUI.</p>
<p>Using Tabloo is very similar to PandasGUI:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">tabloo</span>
<span class="n">tabloo</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 1175px; height: auto; max-width: 100%;">
<img alt="Tabloo View" src="https://pbpython.com/images/tabloo-1.png" style="width: 1175px; height: auto; max-width: 100%;"/>
</div>
<p>Tabloo supports the same query syntax as PandasGUI, but I could not figure out how to
chain multiple filters the way I did in PandasGUI.</p>
<p>Finally, Tabloo does have some basic plotting functionality as well but it is not as rich
as PandasGUI.</p>
<p>Tabloo has some interesting concepts but does not have as much capability as PandasGUI.
It has not been updated in a while, so the project may be dormant, but I wanted to include
it here to make this survey as complete as possible.</p>
</div>
<div class="section" id="dtale">
<h3>Dtale</h3>
<p>The final application is <a class="reference external" href="https://github.com/man-group/dtale">Dtale</a> and it is the most sophisticated of the options. Dtale’s
architecture is similar to Tabloo in that it uses a Flask back-end but includes a robust
React front-end as well. Dtale is a mature project with a lot of documentation and a lot
of functionality. I will only cover a small subset of capabilities in this post.</p>
<p>Getting started with Dtale is similar to the other applications in this category:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">dtale</span>
<span class="n">dtale</span><span class="o">.</span><span class="n">show</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 1074px; height: auto; max-width: 100%;">
<img alt="Dtale View" src="https://pbpython.com/images/dtale-1.png" style="width: 1074px; height: auto; max-width: 100%;"/>
</div>
<p>This view gives you a hint that Dtale is much more than a DataFrame viewer. It is a very
robust statistical toolset. I cannot cover all of the enhanced functionality here, but here is
a quick example showing a histogram of the <code class="code">
unit price</code>
column:</p>
<div class="figure" style="width: 1057px; height: auto; max-width: 100%;">
<img alt="Dtale View" src="https://pbpython.com/images/dtale-2.png" style="width: 1057px; height: auto; max-width: 100%;"/>
</div>
<p>One of the features I really like about Dtale is that you can export the code and see what
it is doing. This is a really powerful feature that differentiates an Excel + Python solution
from vanilla Excel.</p>
<p>Here is an example of the code export from the visualization above.</p>
<div class="highlight"><pre><span></span><span class="c1"># DISCLAIMER: 'df' refers to the data you passed in when calling 'dtale.show'</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">DatetimeIndex</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">MultiIndex</span><span class="p">)):</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">to_frame</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
<span class="c1"># remove any pre-existing indices for ease of use in the D-Tale code, but this is not required</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s1">'index'</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s1">'ignore'</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">c</span><span class="p">)</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">columns</span><span class="p">]</span> <span class="c1"># update columns to strings in case they are numbers</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="o">~</span><span class="n">pd</span><span class="o">.</span><span class="n">isnull</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'</span><span class="si">{col}</span><span class="s1">'</span><span class="p">])][[</span><span class="s1">'</span><span class="si">{col}</span><span class="s1">'</span><span class="p">]]</span>
<span class="n">chart</span><span class="p">,</span> <span class="n">labels</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="n">bins</span><span class="o">=</span><span class="mi">20</span><span class="p">)</span>
<span class="kn">import</span> <span class="nn">scipy.stats</span> <span class="k">as</span> <span class="nn">sts</span>
<span class="n">kde</span> <span class="o">=</span> <span class="n">sts</span><span class="o">.</span><span class="n">gaussian_kde</span><span class="p">(</span><span class="n">s</span><span class="p">[</span><span class="s1">'unit price'</span><span class="p">])</span>
<span class="n">kde_data</span> <span class="o">=</span> <span class="n">kde</span><span class="o">.</span><span class="n">pdf</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="n">labels</span><span class="o">.</span><span class="n">min</span><span class="p">(),</span> <span class="n">labels</span><span class="o">.</span><span class="n">max</span><span class="p">()))</span>
<span class="c1"># main statistics</span>
<span class="n">stats</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s1">'unit price'</span><span class="p">]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span><span class="o">.</span><span class="n">to_frame</span><span class="p">()</span><span class="o">.</span><span class="n">T</span>
</pre></div>
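<p>Stripped of Dtale’s templating (the <code class="code">{col}</code> placeholder), the
exported logic is ordinary pandas and NumPy. Here is a minimal runnable sketch on toy data,
leaving out the optional scipy <span class="caps">KDE</span> step; the values are made up
to stand in for the article’s <code class="code">unit price</code> column:</p>

```python
import numpy as np
import pandas as pd

# Toy stand-in for the article's 'unit price' column
df = pd.DataFrame({'unit price': [10.5, 22.0, 13.75, 55.0, 22.0, 31.5]})

# Drop missing values, then bin the remaining values for the histogram
s = df[~pd.isnull(df['unit price'])][['unit price']]
counts, bin_edges = np.histogram(s, bins=3)

# Main statistics backing the chart
stats = df['unit price'].describe().to_frame().T
```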
<p>On the topic of filtering data, Dtale also allows you to do formatting of the data. In
the example below, I formatted the currency and date columns to be a little easier to read.</p>
<div class="figure" style="width: 1092px; height: auto; max-width: 100%;">
<img alt="Dtale View" src="https://pbpython.com/images/dtale-3.png" style="width: 1092px; height: auto; max-width: 100%;"/>
</div>
<p>As I said earlier, Dtale is a robust tool with a lot of capability. If you are interested,
I encourage you to check it out and see if it works for you.</p>
<p>One aspect to watch out for is that you may run into Windows Firewall issues when trying
to run Dtale. On a locked down corporate machine, this might be a problem. Refer to the
<a class="reference external" href="https://github.com/man-group/dtale#getting-started">documentation</a> for more details on the various installation options.</p>
<p>Regardless of that issue, I think it’s definitely worth checking out Dtale - even if it is
just to see all the functionality available to you.</p>
</div>
</div>
<div class="section" id="ide-variable-viewers">
<h2><span class="caps">IDE</span> Variable Viewers</h2>
<p>If you are doing development in a tool such as <span class="caps">VS</span> Code or Spyder, you have access to a
simple DataFrame variable viewer.</p>
<p>For example, here is the view of our DataFrame using Spyder’s variable explorer:</p>
<div class="figure" style="width: 1926px; height: auto; max-width: 100%;">
<img alt="Spyder View" src="https://pbpython.com/images/spyder-1.png" style="width: 1926px; height: auto; max-width: 100%;"/>
</div>
<p>This viewer is very convenient if you are using Spyder. You don’t have any ability to filter
the data in the <span class="caps">GUI</span> but you can change the sort order.</p>
<p><span class="caps">VS</span> Code has a similar feature. You can review my <a class="reference external" href="https://pbpython.com/notebook-alternative.html">previous article</a> if you want to see how
to use <span class="caps">VS</span> Code + Python.</p>
<p>Here is a simple view showing how you can filter the data:</p>
<div class="figure" style="width: 1318px; height: auto; max-width: 100%;">
<img alt="VS Code view" src="https://pbpython.com/images/vscode-variable-viewer.png" style="width: 1318px; height: auto; max-width: 100%;"/>
</div>
<p>Both of these features are useful if you are already doing your work in Spyder or <span class="caps">VS</span> code.
However, they do not have nearly the power of Dtale when it comes to complex filtering or
sophisticated data analysis.</p>
<p>I am hopeful though that <span class="caps">VS</span> Code will continue to improve their DataFrame viewer. It looks
like <span class="caps">VS</span> Code can do just about <a class="reference external" href="https://marketplace.visualstudio.com/items?itemName=hediet.vscode-drawio">anything</a> these days so I’ll be interested to see how this
feature evolves.</p>
</div>
<div class="section" id="excel">
<h2>Excel</h2>
<p>Recently, there has been a lot of interest in an <a class="reference external" href="https://towardsdatascience.com/python-jupyter-notebooks-in-excel-5ab34fc6439">article</a> describing how to use Jupyter
notebooks in Excel. If we want to combine the benefits of Excel and Pandas, maybe this
is a good option?</p>
<div class="section" id="pyxll">
<h3>PyXLL</h3>
<p>The previously mentioned article requires the <a class="reference external" href="https://www.pyxll.com/">PyXLL package</a> which is a commercial application.
I have no issues with a company developing a commercial product. I think it is critical
for the success of the Python ecosystem. However, a paid option means you probably need
to get more buy-in to bring it into your organization. Fortunately you can try it for free
for 30 days and see if it meets your needs.</p>
<p>With that caveat aside, let’s try it with our example data set:</p>
<div class="figure" style="width: 2359px; height: auto; max-width: 100%;">
<img alt="PyXLL Demo" src="https://pbpython.com/images/pyxll-1.png" style="width: 2359px; height: auto; max-width: 100%;"/>
</div>
<p>The real power is that you can have the notebook side by side with Excel and use Jupyter
magic commands to exchange data between the notebook and Excel. In this example, using
<code class="code">
%xl_set df</code>
will place the DataFrame directly into the Excel file. Then, you can
work with Excel in a hybrid mode.</p>
<p>PyXLL has a lot of <a class="reference external" href="https://www.pyxll.com/docs/introduction.html">different capabilities</a> for integrating Python and Excel so it’s difficult
to compare it to the earlier discussed frameworks. In general, I like the idea of using the
visual components of Excel plus the power of Python programming. If you are interested
in this combination of Python and Excel you should definitely check out PyXLL.</p>
</div>
<div class="section" id="xlwings">
<h3>xlwings</h3>
<p>xlwings has been around for a while; in fact, I wrote an <a class="reference external" href="https://pbpython.com/xlwings-pandas-excel.html">old article</a> about it in 2016.
Like PyXLL, xlwings is backed by a commercial company. However, it offers an open source
community edition alongside the paid Pro version, and the example here uses the community
edition. The full Pro package has several additional features for integrating Excel and Python.</p>
<p>While xlwings does not integrate directly with a Jupyter notebook, you can populate an
Excel spreadsheet with a DataFrame in real time and use Excel for analysis.</p>
<p>Here is a short code snippet:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">xlwings</span> <span class="k">as</span> <span class="nn">xw</span>
<span class="n">url</span> <span class="o">=</span> <span class="s1">'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="c1"># Create a new workbook and add the DataFrame to Sheet1</span>
<span class="n">xw</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="n">df</span><span class="p">)</span>
</pre></div>
<p>This code will open up a new Excel instance and place the <code class="code">
df</code>
into cell A1. Here is
what it looks like:</p>
<div class="figure" style="width: 2266px; height: auto; max-width: 100%;">
<img alt="xlwings" src="https://pbpython.com/images/xlwings-1.png" style="width: 2266px; height: auto; max-width: 100%;"/>
</div>
<p>This can be a quick shortcut instead of saving and re-opening an Excel file to look at your
data. It’s simple enough that I will likely use it more in my own data analysis.</p>
</div>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>This article has covered a lot of ground. Here’s an image that summarizes all the options
we discussed.</p>
<div class="figure" style="width: 2550px; height: auto; max-width: 100%;">
<img alt="DataFrame GUI Overview" src="https://pbpython.com/images/df-gui-header-2.png" style="width: 2550px; height: auto; max-width: 100%;"/>
</div>
<p>Is there one solution that works for everyone? I don’t think so. Part of the reason I
wanted to write this article is that I wanted to generate discussion about the “optimal”
solution. I am hoping that you will take this opportunity to check out some of these solutions
and see if they fit into your analysis process. Each of these solutions addresses different
aspects of the problem in different ways. I suspect that users will likely combine several
of these together - depending on the problem they are trying to solve.</p>
<p>I predict we will continue to see evolution in this space. I am hopeful that we can find
a solution that leverages some of the interactive intuitive aspects of Excel plus the power and
transparency associated with using Python and pandas for data manipulation. With
<a class="reference external" href="https://techcrunch.com/2020/11/12/python-creator-guido-van-rossum-joins-microsoft/">Guido van Rossum</a> joining Microsoft, maybe we will see some more activity in this space?</p>
<p>I don’t know where we will ultimately land but I am excited to see what the community
develops. If I have missed anything or if you have thoughts, let me know in the comments.
It’s always appreciated.</p>
</div>
<div class="section" id="changes">
<h2>Changes</h2>
<ul class="simple">
<li>12-Jan-2021: Update the xlwings example to use a simpler version - <code class="code">
xw.view()</code>
</li>
</ul>
</div>
Comprehensive Guide to Grouping and Aggregating with Pandas2020-11-09T07:25:00-06:002020-11-09T07:25:00-06:00Chris Moffitttag:pbpython.com,2020-11-09:/groupby-agg.html<p class="first">One of the most basic analysis functions is grouping and aggregating data. In some cases,
this level of analysis may be sufficient to answer business questions. In other instances,
this activity might be the first step in a more complex data science analysis. In pandas,
the <code>groupby</code> function can be combined with one or more aggregation
functions to quickly and easily summarize data. This concept is deceptively simple and most new
pandas users will understand this concept. However, they might be surprised at how useful complex
aggregation functions can be for supporting sophisticated analysis.</p>
<p class="last">This article will quickly summarize the basic pandas aggregation functions and show examples
of more complex custom aggregations. Whether you are a new or more experienced pandas user,
I think you will learn a few things from this article.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>One of the most basic analysis functions is grouping and aggregating data. In some cases,
this level of analysis may be sufficient to answer business questions. In other instances,
this activity might be the first step in a more complex data science analysis. In pandas,
the <code class="code">
groupby</code>
function can be combined with one or more aggregation
functions to quickly and easily summarize data. This concept is deceptively simple and most new
pandas users will understand this concept. However, they might be surprised at how useful complex
aggregation functions can be for supporting sophisticated analysis.</p>
<p>This article will quickly summarize the basic pandas aggregation functions and show examples
of more complex custom aggregations. Whether you are a new or more experienced pandas user,
I think you will learn a few things from this article.</p>
</div>
<div class="section" id="aggregating">
<h2>Aggregating</h2>
<p>In the context of this article, an aggregation function is one which takes multiple individual
values and returns a summary. In the majority of the cases, this summary is a single value.</p>
<p>The most common aggregation functions are a simple average or summation of values. As of
pandas 0.20, you may call an aggregation function on one or more columns of a DataFrame.</p>
<p>Here’s a quick example of calculating the total and average fare using the Titanic dataset
(loaded from seaborn):</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">sns</span><span class="o">.</span><span class="n">load_dataset</span><span class="p">(</span><span class="s1">'titanic'</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">([</span><span class="s1">'sum'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">])</span>
</pre></div>
<pre class="literal-block">
sum 28693.949300
mean 32.204208
Name: fare, dtype: float64
</pre>
<p>This simple concept is a necessary building block for more complex analysis.</p>
<p>One area that needs to be discussed is that there are multiple ways to call an aggregation
function. As shown above, you may pass a list of functions to apply to one or more columns
of data.</p>
<p>What if you want to perform the analysis on only a subset of columns? There are two other
options for aggregations: using a dictionary or a <a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html">named aggregation</a>.</p>
<p>Here is a comparison of the three options:</p>
<div class="figure" style="width: 791px; height: auto; max-width: 100%;">
<img alt="Pandas aggregation options" src="https://pbpython.com/images/agg-options.png" style="width: 791px; height: auto; max-width: 100%;"/>
</div>
<p>It is important to be aware of these options and know which one to use when.</p>
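<p>The three call styles look like this on a small, made-up sample using the same column
names as the Titanic examples later in the article:</p>

```python
import pandas as pd

# Small made-up sample with the same column names as the Titanic examples
df = pd.DataFrame({
    'embark_town': ['Southampton', 'Cherbourg', 'Southampton', 'Cherbourg'],
    'fare': [7.25, 71.28, 8.05, 30.00],
})

# List: apply the same function(s) to one column
list_style = df.groupby('embark_town')['fare'].agg(['sum', 'mean'])

# Dictionary: map each column to its own list of functions
dict_style = df.groupby('embark_town').agg({'fare': ['sum', 'mean']})

# Named aggregation (tuples): flat output with your own column names
named_style = df.groupby('embark_town').agg(
    fare_total=('fare', 'sum'),
    fare_avg=('fare', 'mean'),
)
```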
<div class="panel panel-info">
<div class="panel-heading">
Choosing an aggregation approach</div>
<div class="panel-body">
As a general rule, I prefer to use dictionaries for aggregations.</div>
</div>
<p>The tuple approach is limited by only being able to apply one aggregation at a time to a
specific column. If I need to rename columns, then I will use the <code class="code">
rename</code>
function
after the aggregations are complete. In some specific instances, the list approach is a useful
shortcut. I will reiterate, though, that the dictionary approach is the most robust choice
for the majority of situations.</p>
</div>
<div class="section" id="groupby">
<h2>Groupby</h2>
<p>Now that we know how to use aggregations, we can combine this with <code class="code">
groupby</code>
to summarize data.</p>
<div class="section" id="basic-math">
<h3>Basic math</h3>
<p>The most common built in aggregation functions are basic math functions including sum, mean,
median, minimum, maximum, standard deviation, variance, mean absolute deviation and product.</p>
<p>We can apply all these functions to the <code class="code">
fare</code>
while grouping by the <code class="code">
embark_town</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_math</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'fare'</span><span class="p">:</span>
<span class="p">[</span><span class="s1">'sum'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">,</span> <span class="s1">'median'</span><span class="p">,</span> <span class="s1">'min'</span><span class="p">,</span> <span class="s1">'max'</span><span class="p">,</span> <span class="s1">'std'</span><span class="p">,</span> <span class="s1">'var'</span><span class="p">,</span> <span class="s1">'mad'</span><span class="p">,</span> <span class="s1">'prod'</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_math</span><span class="p">)</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 879px; height: auto; max-width: 100%;">
<img alt="Basic math functions" src="https://pbpython.com/images/agg_func_math.png" style="width: 879px; height: auto; max-width: 100%;"/>
</div>
<p>This is all relatively straightforward math.</p>
<p>As an aside, I have not found a good usage for the <code class="code">
prod</code>
function which computes the
product of all the values in a group. For the sake of completeness, I am including it.</p>
<p>One other useful shortcut is to use <code class="code">
describe</code>
to run multiple built-in aggregations
at one time:</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_describe</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'describe'</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_describe</span><span class="p">)</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 654px; height: auto; max-width: 100%;">
<img alt="Basic math functions" src="https://pbpython.com/images/agg-describe.png" style="width: 654px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="counting">
<h3>Counting</h3>
<p>After basic math, counting is the next most common aggregation I perform on grouped data.
In some ways, it can be a little trickier than the basic math. Here are three examples
of counting:</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_count</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'embark_town'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'count'</span><span class="p">,</span> <span class="s1">'nunique'</span><span class="p">,</span> <span class="s1">'size'</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'deck'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_count</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 291px; height: auto; max-width: 100%;">
<img alt="Basic math functions" src="https://pbpython.com/images/agg_func_count.png" style="width: 291px; height: auto; max-width: 100%;"/>
</div>
<p>The major distinction to keep in mind is that <code class="code">
count</code>
will not include <code class="code">
NaN</code>
values whereas <code class="code">
size</code>
will. Depending on the data set, this may or may not be a
useful distinction. In addition, the <code class="code">
nunique</code>
function will exclude <code class="code">
NaN</code>
values
in the unique counts. Keep reading for an example of how to include <code class="code">
NaN</code>
in the
unique value counts.</p>
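<p>A tiny example makes the <code class="code">NaN</code> behavior concrete:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'deck': ['A', 'A', 'B'],
    'embark_town': ['Southampton', np.nan, 'Cherbourg'],
})

out = df.groupby('deck')['embark_town'].agg(['count', 'nunique', 'size'])
# Deck A: count=1 and nunique=1 (NaN excluded), but size=2 (NaN included)
```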
</div>
<div class="section" id="first-and-last">
<h3>First and last</h3>
<p>In this example, we can select the highest and lowest fare by town of embarkation. One important
point to remember is that you must sort the data first if you want <code class="code">
first</code>
and <code class="code">
last</code>
to pick the max and min values.</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_selection</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'first'</span><span class="p">,</span> <span class="s1">'last'</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">],</span>
<span class="n">ascending</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span>
<span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_selection</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 311px; height: auto; max-width: 100%;">
<img alt="Basic math functions" src="https://pbpython.com/images/agg_func_first_last.png" style="width: 311px; height: auto; max-width: 100%;"/>
</div>
<p>In the example above, I would recommend using <code class="code">
max</code>
and <code class="code">
min</code>
but I am including
<code class="code">
first</code>
and <code class="code">
last</code>
for the sake of completeness. In other applications (such as
time series analysis) you may want to select the first and last values for further analysis.</p>
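<p>As an illustration of the time series case, here is a minimal sketch with made-up price data (the <code class="code">ticker</code>, <code class="code">date</code>, and <code class="code">close</code> columns are hypothetical, not part of the article's data set):</p>

```python
import pandas as pd

# Hypothetical daily closing prices; not from the Titanic data
prices = pd.DataFrame({
    'ticker': ['AAA', 'AAA', 'BBB', 'BBB'],
    'date': pd.to_datetime(['2024-01-02', '2024-01-01',
                            '2024-01-01', '2024-01-02']),
    'close': [11.5, 10.0, 20.0, 19.0],
})

# Sort by date first so that first/last mean earliest/latest close
summary = (prices.sort_values('date')
                 .groupby('ticker')['close']
                 .agg(['first', 'last']))
print(summary)
```

<p>Without the <code class="code">sort_values</code> call, <code class="code">first</code> and <code class="code">last</code> would simply reflect the original row order.</p>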
<p>Another selection approach is to use <code class="code">
idxmax</code>
and <code class="code">
idxmin</code>
to select the index value
that corresponds to the maximum or minimum value.</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_max_min</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'idxmax'</span><span class="p">,</span> <span class="s1">'idxmin'</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_max_min</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 305px; height: auto; max-width: 100%;">
<img alt="Max and Min index" src="https://pbpython.com/images/agg_idxmin_max.png" style="width: 305px; height: auto; max-width: 100%;"/>
</div>
<p>We can check the results:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[[</span><span class="mi">258</span><span class="p">,</span> <span class="mi">378</span><span class="p">]]</span>
</pre></div>
<div class="figure" style="width: 1263px; height: auto; max-width: 100%;">
<img alt="Idxmax" src="https://pbpython.com/images/idxmax_details.png" style="width: 1263px; height: auto; max-width: 100%;"/>
</div>
<p>Here’s another shortcut trick you can use to see the rows with the max <code class="code">
fare</code>
:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'class'</span><span class="p">)[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">idxmax</span><span class="p">()]</span>
</pre></div>
<div class="figure" style="width: 1283px; height: auto; max-width: 100%;">
<img alt="Idxmax" src="https://pbpython.com/images/idxmax_details_shortcut.png" style="width: 1283px; height: auto; max-width: 100%;"/>
</div>
<p>The above example is one of those places where the list-based aggregation is a useful shortcut.</p>
</div>
<div class="section" id="other-libraries">
<h3>Other libraries</h3>
<p>You are not limited to the aggregation functions in pandas. For instance, you could use
stats functions from <a class="reference external" href="https://docs.scipy.org/doc/scipy/reference/stats.html">scipy</a> or <a class="reference external" href="https://numpy.org/doc/stable/reference/routines.statistics.html">numpy</a>.</p>
<p>Here is an example of calculating the mode and skew of the fare data.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">skew</span><span class="p">,</span> <span class="n">mode</span>
<span class="n">agg_func_stats</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="n">skew</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">mode</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_stats</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 455px; height: auto; max-width: 100%;">
<img alt="Stats functions" src="https://pbpython.com/images/agg_stats.png" style="width: 455px; height: auto; max-width: 100%;"/>
</div>
<p>The mode results are interesting. The scipy.stats mode function returns
the most frequent value as well as the count of occurrences. If you just want the most
frequent value, use <code class="code">
pd.Series.mode</code>
.</p>
<p>The key point is that you can use any function you want, as long as it knows how to interpret
an array of pandas values and return a single value.</p>
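<p>For instance, a plain function that reduces a Series to one number can be mixed with the built-in names. A minimal sketch with made-up data; <code class="code">fare_range</code> is a hypothetical helper, not from the article:</p>

```python
import pandas as pd

# Hypothetical mini version of the Titanic fares
df = pd.DataFrame({
    'embark_town': ['Southampton', 'Southampton', 'Cherbourg', 'Cherbourg'],
    'fare': [7.25, 71.28, 26.55, 8.05],
})

# Any callable that takes an array of values and returns one value works
def fare_range(s):
    return s.max() - s.min()

result = df.groupby('embark_town')['fare'].agg(['median', fare_range])
print(result)
```

<p>The custom function's <code class="code">__name__</code> becomes the output column label, just as with the built-in names.</p>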
</div>
<div class="section" id="working-with-text">
<h3>Working with text</h3>
<p>When working with text, the counting functions will work as expected. You can also use
scipy’s mode function on text data.</p>
<p>One interesting application is that if you have a small number of distinct values, you can
use python’s <code class="code">
set</code>
function to display the full list of unique values.</p>
<p>This summary of the <code class="code">
class</code>
and <code class="code">
deck</code>
shows how this approach can be useful for some data sets.</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_text</span> <span class="o">=</span> <span class="p">{</span><span class="s1">'deck'</span><span class="p">:</span> <span class="p">[</span> <span class="s1">'nunique'</span><span class="p">,</span> <span class="n">mode</span><span class="p">,</span> <span class="nb">set</span><span class="p">]}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'class'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_text</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 450px; height: auto; max-width: 100%;">
<img alt="Stats functions" src="https://pbpython.com/images/agg_text.png" style="width: 450px; height: auto; max-width: 100%;"/>
</div>
</div>
<div class="section" id="custom-functions">
<h3>Custom functions</h3>
<p>The pandas standard aggregation functions and pre-built functions from the python ecosystem
will meet many of your analysis needs. However, you will likely want to create your own
custom aggregation functions. There are four methods for creating your own functions.</p>
<p>To illustrate the differences, let’s calculate the 25th percentile of the data using
four approaches:</p>
<p>First, we can use a <a class="reference external" href="https://docs.python.org/3/library/functools.html">partial</a> function:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">functools</span> <span class="kn">import</span> <span class="n">partial</span>
<span class="c1"># Use partial</span>
<span class="n">q_25</span> <span class="o">=</span> <span class="n">partial</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="o">.</span><span class="n">quantile</span><span class="p">,</span> <span class="n">q</span><span class="o">=</span><span class="mf">0.25</span><span class="p">)</span>
<span class="n">q_25</span><span class="o">.</span><span class="vm">__name__</span> <span class="o">=</span> <span class="s1">'25%'</span>
</pre></div>
<p>Next, we define our own function (which is a small wrapper around <code class="code">
quantile</code>
):</p>
<div class="highlight"><pre><span></span><span class="c1"># Define a function</span>
<span class="k">def</span> <span class="nf">percentile_25</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">25</span><span class="p">)</span>
</pre></div>
<p>We can define a lambda function and give it a name:</p>
<div class="highlight"><pre><span></span><span class="c1"># Define a lambda function</span>
<span class="n">lambda_25</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">25</span><span class="p">)</span>
<span class="n">lambda_25</span><span class="o">.</span><span class="vm">__name__</span> <span class="o">=</span> <span class="s1">'lambda_25%'</span>
</pre></div>
<p>Or, define the lambda inline:</p>
<div class="highlight"><pre><span></span><span class="c1"># Use a lambda function inline</span>
<span class="n">agg_func</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="n">q_25</span><span class="p">,</span> <span class="n">percentile_25</span><span class="p">,</span> <span class="n">lambda_25</span><span class="p">,</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">25</span><span class="p">)]</span>
<span class="p">}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func</span><span class="p">)</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 598px; height: auto; max-width: 100%;">
<img alt="Custom agg functions" src="https://pbpython.com/images/agg_custom_funcs.png" style="width: 598px; height: auto; max-width: 100%;"/>
</div>
<p>As you can see, the results are the same but the column labels are all slightly
different. This comes down to programmer preference, but I encourage you to be familiar with
the options since you will encounter most of them in online solutions.</p>
<div class="panel panel-info">
<div class="panel-heading">
Choosing a custom function style</div>
<div class="panel-body">
I prefer to use custom functions or inline lambdas.</div>
</div>
<p>Like many other areas of programming, this is an element of style and preference but I
encourage you to pick one or two approaches and stick with them for consistency.</p>
</div>
<div class="section" id="custom-function-examples">
<h3>Custom function examples</h3>
<p>As shown above, there are multiple approaches to developing custom aggregation functions.
I will go through a few specific useful examples to highlight how they are frequently used.</p>
<p>In most cases, the functions are lightweight wrappers around built-in pandas functions.
Part of the reason you need these wrappers is that there is no way to pass arguments to the
functions in an aggregation. Some examples should clarify this point.</p>
<p>If you want to count the number of null values, you could use this <a class="reference external" href="https://medium.com/escaletechblog/writing-custom-aggregation-functions-with-pandas-96f5268a8596">function</a>:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">count_nulls</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">size</span> <span class="o">-</span> <span class="n">s</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</pre></div>
<p>If you want to include <code class="code">
NaN</code>
values in your unique counts, you need to pass
<code class="code">
dropna=False</code>
to the <code class="code">
nunique</code>
function.</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">unique_nan</span><span class="p">(</span><span class="n">s</span><span class="p">):</span>
<span class="k">return</span> <span class="n">s</span><span class="o">.</span><span class="n">nunique</span><span class="p">(</span><span class="n">dropna</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span>
</pre></div>
<p>Here is a summary of all the values together:</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_custom_count</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'embark_town'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'count'</span><span class="p">,</span> <span class="s1">'nunique'</span><span class="p">,</span> <span class="s1">'size'</span><span class="p">,</span> <span class="n">unique_nan</span><span class="p">,</span> <span class="n">count_nulls</span><span class="p">,</span> <span class="nb">set</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'deck'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_custom_count</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 865px; height: auto; max-width: 100%;">
<img alt="Custom agg functions" src="https://pbpython.com/images/agg_multiple_custom_funcs.png" style="width: 865px; height: auto; max-width: 100%;"/>
</div>
<p>If you want to calculate the 90th percentile, use <code class="code">
quantile</code>
:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">percentile_90</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">.</span><span class="n">quantile</span><span class="p">(</span><span class="o">.</span><span class="mi">9</span><span class="p">)</span>
</pre></div>
<p>If you want to calculate a trimmed mean where 10% of the observations are trimmed from
each end of the distribution, use the scipy stats function <code class="code">
trim_mean</code>
:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">scipy.stats</span> <span class="kn">import</span> <span class="n">trim_mean</span>
<span class="k">def</span> <span class="nf">trim_mean_10</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">trim_mean</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">)</span>
</pre></div>
<p>If you want the largest value, regardless of the sort order (see notes above about <code class="code">
first</code>
and
<code class="code">
last</code>
):</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">largest</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">.</span><span class="n">nlargest</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
</pre></div>
<p>This is equivalent to <code class="code">
max</code>
but I will show another example of <code class="code">
nlargest</code>
below
to highlight the difference.</p>
<p>I wrote about sparklines <a class="reference external" href="https://pbpython.com/styling-pandas.html">before</a>. Refer to that article for install instructions.
Here’s how to incorporate them into an aggregate function for a unique view of the data:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">sparklines</span> <span class="kn">import</span> <span class="n">sparklines</span>
<span class="k">def</span> <span class="nf">sparkline_str</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">bins</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">histogram</span><span class="p">(</span><span class="n">x</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">sl</span> <span class="o">=</span> <span class="s1">''</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">sparklines</span><span class="p">(</span><span class="n">bins</span><span class="p">))</span>
<span class="k">return</span> <span class="n">sl</span>
</pre></div>
<p>Here they are all put together:</p>
<div class="highlight"><pre><span></span><span class="n">agg_func_largest</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="n">percentile_90</span><span class="p">,</span> <span class="n">trim_mean_10</span><span class="p">,</span> <span class="n">largest</span><span class="p">,</span> <span class="n">sparkline_str</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'class'</span><span class="p">,</span> <span class="s1">'embark_town'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_largest</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 777px; height: auto; max-width: 100%;">
<img alt="Sparkline function" src="https://pbpython.com/images/agg_sparkline.png" style="width: 777px; height: auto; max-width: 100%;"/>
</div>
<p>The <code class="code">
nlargest</code>
and <code class="code">
nsmallest</code>
functions can be useful for summarizing the data
in various scenarios. Here is code to show the total fares for the top 10 and bottom 10 individuals:</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">top_10_sum</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">.</span><span class="n">nlargest</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">bottom_10_sum</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span><span class="o">.</span><span class="n">nsmallest</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="n">agg_func_top_bottom_sum</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="n">top_10_sum</span><span class="p">,</span> <span class="n">bottom_10_sum</span><span class="p">]</span>
<span class="p">}</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'class'</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">agg_func_top_bottom_sum</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 382px; height: auto; max-width: 100%;">
<img alt="Custom agg functions" src="https://pbpython.com/images/agg_top_bottom_10.png" style="width: 382px; height: auto; max-width: 100%;"/>
</div>
<p>Using this approach can be useful when applying the <a class="reference external" href="https://en.wikipedia.org/wiki/Pareto_principle">Pareto principle</a> to your own data.</p>
</div>
<div class="section" id="custom-functions-with-multiple-columns">
<h3>Custom functions with multiple columns</h3>
<p>If you have a scenario where you want to run multiple aggregations across columns, then
you may want to use the <code class="code">
groupby</code>
combined with <code class="code">
apply</code>
as described in
this <a class="reference external" href="https://stackoverflow.com/questions/14529838/apply-multiple-functions-to-multiple-groupby-columns/47103408#47103408">Stack Overflow</a> answer.</p>
<p>Using this method, you will have access to all of the columns of the data and can choose
the appropriate aggregation approach to build up your resulting DataFrame
(including the column labels):</p>
<div class="highlight"><pre><span></span><span class="k">def</span> <span class="nf">summary</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="n">result</span> <span class="o">=</span> <span class="p">{</span>
<span class="s1">'fare_sum'</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">sum</span><span class="p">(),</span>
<span class="s1">'fare_mean'</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">mean</span><span class="p">(),</span>
<span class="s1">'fare_range'</span><span class="p">:</span> <span class="n">x</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">max</span><span class="p">()</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">]</span><span class="o">.</span><span class="n">min</span><span class="p">()</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">result</span><span class="p">)</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'class'</span><span class="p">])</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">summary</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 419px; height: auto; max-width: 100%;">
<img alt="Custom agg functions" src="https://pbpython.com/images/agg-apply.png" style="width: 419px; height: auto; max-width: 100%;"/>
</div>
<p>Using <code class="code">
apply</code>
with <code class="code">
groupby</code>
gives maximum flexibility over all aspects of
the results. However, there is a downside. The <code class="code">
apply</code>
function is slow, so this approach
should be used sparingly.</p>
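<p>When each output can be computed from a single column, named aggregation (added in pandas 0.25) can build the same kind of labeled summary without <code class="code">apply</code>. A minimal sketch with made-up data standing in for the Titanic set:</p>

```python
import pandas as pd

# Hypothetical mini version of the Titanic data
df = pd.DataFrame({
    'class': ['First', 'First', 'Third', 'Third'],
    'fare': [71.28, 52.0, 7.25, 8.05],
})

# Named aggregation: each keyword becomes an output column,
# each value is a (column, function) pair
summary = df.groupby('class').agg(
    fare_sum=('fare', 'sum'),
    fare_mean=('fare', 'mean'),
    fare_range=('fare', lambda x: x.max() - x.min()),
).round(0)
print(summary)
```

<p>This runs through pandas' optimized per-column paths, so it is generally faster than the row-wise <code class="code">apply</code> shown above, at the cost of only seeing one column per output.</p>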
</div>
</div>
<div class="section" id="working-with-group-objects">
<h2>Working with group objects</h2>
<p>Once you group and aggregate the data, you can do additional calculations on the grouped objects.</p>
<p>For the first example, we can figure out what percentage of the total fares sold
can be attributed to each <code class="code">
embark_town</code>
and <code class="code">
class</code>
combination. We use
<code class="code">
assign</code>
and a <code class="code">
lambda</code>
function to add a <code class="code">
pct_total</code>
column:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">,</span> <span class="s1">'class'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span>
<span class="s1">'fare'</span><span class="p">:</span> <span class="s1">'sum'</span>
<span class="p">})</span><span class="o">.</span><span class="n">assign</span><span class="p">(</span><span class="n">pct_total</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="o">/</span> <span class="n">x</span><span class="o">.</span><span class="n">sum</span><span class="p">())</span>
</pre></div>
<div class="figure" style="width: 438px; height: auto; max-width: 100%;">
<img alt="Percent of total" src="https://pbpython.com/images/agg_pct_total.png" style="width: 438px; height: auto; max-width: 100%;"/>
</div>
<p>One important thing to keep in mind is that you can actually do this more simply using a
<code class="code">
pd.crosstab</code>
as described in my <a class="reference external" href="https://pbpython.com/pandas-crosstab.html">previous article</a>:</p>
<div class="highlight"><pre><span></span><span class="n">pd</span><span class="o">.</span><span class="n">crosstab</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'embark_town'</span><span class="p">],</span>
<span class="n">df</span><span class="p">[</span><span class="s1">'class'</span><span class="p">],</span>
<span class="n">values</span><span class="o">=</span><span class="n">df</span><span class="p">[</span><span class="s1">'fare'</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="s1">'sum'</span><span class="p">,</span>
<span class="n">normalize</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 440px; height: auto; max-width: 100%;">
<img alt="Crosstab example" src="https://pbpython.com/images/agg_crosstab.png" style="width: 440px; height: auto; max-width: 100%;"/>
</div>
<p>While we are talking about <code class="code">
crosstab</code>
, a useful concept to keep in mind is that agg
functions can be combined with pivot tables too.</p>
<p>Here’s a quick example:</p>
<div class="highlight"><pre><span></span><span class="n">pd</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">df</span><span class="p">,</span>
<span class="n">index</span><span class="o">=</span><span class="p">[</span><span class="s1">'embark_town'</span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span class="s1">'class'</span><span class="p">],</span>
<span class="n">aggfunc</span><span class="o">=</span><span class="n">agg_func_top_bottom_sum</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 716px; height: auto; max-width: 100%;">
<img alt="Custom agg functions with a pivot table" src="https://pbpython.com/images/agg_pivot_table.png" style="width: 716px; height: auto; max-width: 100%;"/>
</div>
<p>Sometimes you will need to chain multiple groupby operations to answer your question. For instance,
if we wanted to see a cumulative total of the fares, we can group and aggregate by town
and class then group the resulting object and calculate a cumulative sum:</p>
<div class="highlight"><pre><span></span><span class="n">fare_group</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">,</span> <span class="s1">'class'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'fare'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span>
<span class="n">fare_group</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">cumsum</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 351px; height: auto; max-width: 100%;">
<img alt="Custom agg functions with cumulative sum" src="https://pbpython.com/images/agg_cumsum.png" style="width: 351px; height: auto; max-width: 100%;"/>
</div>
<p>This may be a little tricky to understand. Here’s a summary of what we are doing:</p>
<div class="figure" style="width: 800px; height: auto; max-width: 100%;">
<img alt="Multiple groupby with cumulative sums" src="https://pbpython.com/images/multiple-groupby.png" style="width: 800px; height: auto; max-width: 100%;"/>
</div>
<p>Here’s another example where we want to summarize daily sales data and convert it to a
cumulative daily and quarterly view. Refer to the <a class="reference external" href="https://pbpython.com/pandas-grouper-agg.html">Grouper article</a> if you are not familiar with
using <code class="code">
pd.Grouper()</code>
:</p>
<p>In the first example, we want to include total daily sales as well as a cumulative quarterly amount:</p>
<div class="highlight"><pre><span></span><span class="n">sales</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="s1">'https://github.com/chris1610/pbpython/blob/master/data/2018_Sales_Total_v2.xlsx?raw=True'</span><span class="p">)</span>
<span class="n">daily_sales</span> <span class="o">=</span> <span class="n">sales</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">'D'</span><span class="p">)</span>
<span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">daily_sales</span><span class="o">=</span><span class="p">(</span><span class="s1">'ext price'</span><span class="p">,</span>
<span class="s1">'sum'</span><span class="p">))</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="n">daily_sales</span><span class="p">[</span><span class="s1">'quarter_sales'</span><span class="p">]</span> <span class="o">=</span> <span class="n">daily_sales</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span>
<span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">'Q'</span><span class="p">))</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'daily_sales'</span><span class="p">:</span> <span class="s1">'cumsum'</span><span class="p">})</span>
</pre></div>
<p>To understand this, you need to look at the quarter boundary (end of March through start of April)
to get a good sense of what is going on.</p>
<div class="figure" style="width: 768px; height: auto; max-width: 100%;">
<img alt="Cumulative total" src="https://pbpython.com/images/cumulative_total.png" style="width: 768px; height: auto; max-width: 100%;"/>
</div>
<p>If you just want a cumulative quarterly total, you can chain multiple groupby functions.</p>
<p>First, group the daily results, then group those results by quarter and use a cumulative sum:</p>
<div class="highlight"><pre><span></span><span class="n">sales</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">key</span><span class="o">=</span><span class="s1">'date'</span><span class="p">,</span> <span class="n">freq</span><span class="o">=</span><span class="s1">'D'</span><span class="p">)</span>
<span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">daily_sales</span><span class="o">=</span><span class="p">(</span><span class="s1">'ext price'</span><span class="p">,</span> <span class="s1">'sum'</span><span class="p">))</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span>
<span class="n">pd</span><span class="o">.</span><span class="n">Grouper</span><span class="p">(</span><span class="n">freq</span><span class="o">=</span><span class="s1">'Q'</span><span class="p">))</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span>
<span class="s1">'daily_sales'</span><span class="p">:</span> <span class="s1">'cumsum'</span>
<span class="p">})</span><span class="o">.</span><span class="n">rename</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="p">{</span><span class="s1">'daily_sales'</span><span class="p">:</span> <span class="s1">'quarterly_sales'</span><span class="p">})</span>
</pre></div>
<div class="figure" style="width: 279px; height: auto; max-width: 100%;">
<img alt="Cumulative quarterly total" src="https://pbpython.com/images/cumulative_quarterly.png" style="width: 279px; height: auto; max-width: 100%;"/>
</div>
<p>In this example, I included the named aggregation approach to rename the variable to clarify
that it is now daily sales. I then group again and use the cumulative sum to get a running
sum for the quarter. Finally, I rename the column to quarterly sales.</p>
<p>Admittedly, this is a bit tricky to understand. However, if you take it step by step, build out the function, and inspect the results at each stage, you will start to get the hang of it.
Don’t be discouraged!</p>
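<p>Here is a minimal, self-contained sketch of the chained approach using hypothetical sales figures that straddle the Q1/Q2 boundary. It calls <code class="code">.cumsum()</code> directly on the second groupby, which is equivalent to the <code class="code">agg</code> spelling shown above; notice how the running total resets when the quarter changes:</p>

```python
import pandas as pd

# Hypothetical sales rows spanning the Q1/Q2 boundary
sales = pd.DataFrame({
    'date': pd.to_datetime(['2024-03-30', '2024-03-31',
                            '2024-04-01', '2024-04-02']),
    'ext price': [100.0, 50.0, 25.0, 75.0],
})

# Step 1: daily totals
daily = sales.groupby(pd.Grouper(key='date', freq='D')).agg(
    daily_sales=('ext price', 'sum'))

# Step 2: running sum that restarts at each quarter boundary
daily['quarterly_sales'] = daily.groupby(
    pd.Grouper(freq='Q'))['daily_sales'].cumsum()
```

<p>Recent pandas versions prefer the <code class="code">'QE'</code> alias for quarter-end frequency; <code class="code">'Q'</code> still works but may emit a deprecation warning.</p>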
</div>
<div class="section" id="flattening-hierarchical-column-indices">
<h2>Flattening Hierarchical Column Indices</h2>
<p>By default, pandas creates a hierarchical column index on the summary DataFrame.
Here is what I am referring to:</p>
<div class="highlight"><pre><span></span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">,</span> <span class="s1">'class'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'sum'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">]})</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
</pre></div>
<div class="figure" style="width: 381px; height: auto; max-width: 100%;">
<img alt="Hierarchical index" src="https://pbpython.com/images/hierarchical_index.png" style="width: 381px; height: auto; max-width: 100%;"/>
</div>
<p>At some point in the analysis process you will likely want to “flatten” the columns so that there
is a single row of names.</p>
<p>I have found that the following approach works best for me. I use the parameter
<code class="code">
as_index=False</code>
when grouping, then build a new collapsed column name.</p>
<p>Here is the code:</p>
<div class="highlight"><pre><span></span><span class="n">multi_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'embark_town'</span><span class="p">,</span> <span class="s1">'class'</span><span class="p">],</span>
<span class="n">as_index</span><span class="o">=</span><span class="kc">False</span><span class="p">)</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'fare'</span><span class="p">:</span> <span class="p">[</span><span class="s1">'sum'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">]})</span>
<span class="n">multi_df</span><span class="o">.</span><span class="n">columns</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'_'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">col</span><span class="p">)</span><span class="o">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s1">'_'</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">multi_df</span><span class="o">.</span><span class="n">columns</span><span class="o">.</span><span class="n">values</span>
<span class="p">]</span>
</pre></div>
<p>Here is a picture showing what the flattened frame looks like:</p>
<div class="figure" style="width: 1024px; height: auto; max-width: 100%;">
<img alt="Flatten hierarchical columns" src="https://pbpython.com/images/column_flatten.png" style="width: 1024px; height: auto; max-width: 100%;"/>
</div>
<p>I prefer to use <code class="code">
_</code>
as my separator but you could use other values. Just keep in mind
that it will be easier for your subsequent analysis if the resulting column names
do not have spaces.</p>
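<p>You can confirm how the flattening behaves on a tiny, hypothetical frame. The grouping columns keep their original names because <code class="code">rstrip('_')</code> removes the trailing separator left by their empty second level:</p>

```python
import pandas as pd

# Tiny hypothetical stand-in for the titanic data used above
df = pd.DataFrame({
    'embark_town': ['Southampton', 'Southampton', 'Cherbourg'],
    'class': ['First', 'Third', 'First'],
    'fare': [80.0, 8.0, 76.0],
})

multi_df = df.groupby(['embark_town', 'class'],
                      as_index=False).agg({'fare': ['sum', 'mean']})
# Collapse ('fare', 'sum') -> 'fare_sum'; rstrip drops the trailing '_'
# left by the empty second level on the grouping columns
multi_df.columns = [
    '_'.join(col).rstrip('_') for col in multi_df.columns.values
]
```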
</div>
<div class="section" id="subtotals">
<h2>Subtotals</h2>
<p>One process that is not straightforward with grouping and aggregating in pandas is adding
a subtotal. If you want to add subtotals, I recommend the <a class="reference external" href="https://github.com/chris1610/sidetable">sidetable</a> package. Here is how
you can summarize <code class="code">
fares</code>
by <code class="code">
class</code>
, <code class="code">
embark_town</code>
and <code class="code">
sex</code>
with a subtotal at each level as well as a grand total at the bottom:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sidetable</span>
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'class'</span><span class="p">,</span> <span class="s1">'embark_town'</span><span class="p">,</span> <span class="s1">'sex'</span><span class="p">])</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'fare'</span><span class="p">:</span> <span class="s1">'sum'</span><span class="p">})</span><span class="o">.</span><span class="n">stb</span><span class="o">.</span><span class="n">subtotal</span><span class="p">()</span>
</pre></div>
<div class="figure" style="width: 726px; height: auto; max-width: 100%;">
<img alt="Subtotal" src="https://pbpython.com/images/agg-subtotal.png" style="width: 726px; height: auto; max-width: 100%;"/>
</div>
<p>sidetable also allows customization of the subtotal levels and resulting labels. Refer
to the package documentation for more examples of how sidetable can summarize your data.</p>
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>Thanks for reading this article. There is a lot of detail here but that is due to how
many different uses there are for grouping and aggregating data with pandas. My hope is
that this post becomes a useful resource that you can bookmark and come back to when you
get stuck with a challenging problem of your own.</p>
<p>If you have other common techniques you use frequently please let me know in the comments.
If I get some broadly useful ones, I will include them in this post or in an updated article.</p>
<p>image credit: <a class="reference external" href="https://pixabay.com/users/hermann-130146/">Herman Traub</a></p>
</div>
Reading Poorly Structured Excel Files with Pandas2020-10-19T07:25:00-05:002020-10-19T07:25:00-05:00Chris Moffitttag:pbpython.com,2020-10-19:/pandas-excel-range.html<p class="first last">With pandas it is easy to read Excel files and convert the data into a DataFrame.
Unfortunately Excel files in the real world are often poorly constructed. In those
cases where the data is scattered across the worksheet, you may need to customize the way
you read the data. This article will discuss how to use pandas and openpyxl to read these types
of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>With pandas it is easy to read Excel files and convert the data into a DataFrame.
Unfortunately Excel files in the real world are often poorly constructed. In those
cases where the data is scattered across the worksheet, you may need to customize the way
you read the data. This article will discuss how to use pandas and openpyxl to read these types
of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.</p>
</div>
<div class="section" id="the-problem">
<h2>The Problem</h2>
<p>The pandas <code class="code">
read_excel</code>
<a class="reference external" href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html">function</a> does an excellent job of reading Excel worksheets.
However, in cases where the data is not a continuous table starting at cell A1, the results may not
be what you expect.</p>
<p>If you try to read in this sample spreadsheet using <code class="code">
read_excel(src_file)</code>
:</p>
<div class="figure" style="width: 1050px; height: auto; max-width: 100%;">
<img alt="Excel" src="https://pbpython.com/images/excel_ranges.png" style="width: 1050px; height: auto; max-width: 100%;"/>
</div>
<p>You will get something that looks like this:</p>
<div class="figure" style="width: 1416px; height: auto; max-width: 100%;">
<img alt="Excel" src="https://pbpython.com/images/excel_range_dataframe.png" style="width: 1416px; height: auto; max-width: 100%;"/>
</div>
<p>These results include a lot of <code class="code">
Unnamed</code>
columns, header labels within a row as
well as several extra columns we don’t need.</p>
</div>
<div class="section" id="pandas-solutions">
<h2>Pandas Solutions</h2>
<p>The simplest solution for this data set is to use the <code class="code">
header</code>
and <code class="code">
usecols</code>
arguments
to <code class="code">
read_excel()</code>
. The <code class="code">
usecols</code>
parameter, in particular, can be very useful
for controlling the columns you would like to include.</p>
<p>If you would like to follow along with these examples, the file is on <a class="reference external" href="https://github.com/chris1610/pbpython/blob/master/data/shipping_tables.xlsx">github</a>.</p>
<p>Here is one alternative approach to read only the data we need.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">src_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'shipping_tables.xlsx'</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">src_file</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="s1">'B:F'</span><span class="p">)</span>
</pre></div>
<p>The resulting DataFrame only contains the data we need. In this example, we purposely
exclude the notes column and date field:</p>
<div class="figure" style="width: 564px; height: auto; max-width: 100%;">
<img alt="Clean DataFrame" src="https://pbpython.com/images/excel_range_dataframe_clean_v2.png" style="width: 564px; height: auto; max-width: 100%;"/>
</div>
<p>The logic is relatively straightforward. <code class="code">
usecols</code>
can accept Excel ranges such as <code class="code">
B:F</code>
and read in only those columns. The <code class="code">
header</code>
parameter expects a single integer that defines
the header row. This value is 0-indexed, so we pass in <code class="code">
1</code>
even though this is
row 2 in Excel.</p>
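<p>The same 0-indexed <code class="code">header</code> semantics apply to <code class="code">read_csv</code>, so here is a quick sketch with hypothetical in-memory data where the real header sits on the second line:</p>

```python
import io
import pandas as pd

# header=1 promotes the *second* line (Excel row 2) to column names
# and discards everything above it
raw = io.StringIO("junk,junk,junk\nitem_type,state,priority\npen,MN,hi\n")
df = pd.read_csv(raw, header=1)
```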
<p>In some instances, we may want to define the columns as a list of numbers. In this example,
we could pass the equivalent list of integers:</p>
<div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">src_file</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">])</span>
</pre></div>
<p>This approach might be useful if you have some sort of numerical pattern you want to follow
for a large data set (e.g. every third column or only even-numbered columns).</p>
<p>The pandas <code class="code">
usecols</code>
can also take a list of column names. This code will create an
equivalent DataFrame:</p>
<div class="highlight"><pre><span></span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span>
<span class="n">src_file</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">usecols</span><span class="o">=</span><span class="p">[</span><span class="s1">'item_type'</span><span class="p">,</span> <span class="s1">'order id'</span><span class="p">,</span> <span class="s1">'order date'</span><span class="p">,</span> <span class="s1">'state'</span><span class="p">,</span> <span class="s1">'priority'</span><span class="p">])</span>
</pre></div>
<p>Using a list of column names is helpful if the column order changes but you know
the names will not change.</p>
<p>Finally, <code class="code">
usecols</code>
can take a callable function. Here’s a simple long-form example
that excludes unnamed columns as well as the priority column.</p>
<div class="highlight"><pre><span></span><span class="c1"># Define a more complex function:</span>
<span class="k">def</span> <span class="nf">column_check</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">if</span> <span class="s1">'unnamed'</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">():</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="k">if</span> <span class="s1">'priority'</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">():</span>
<span class="k">return</span> <span class="kc">False</span>
<span class="k">if</span> <span class="s1">'order'</span> <span class="ow">in</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">():</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="k">return</span> <span class="kc">True</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">src_file</span><span class="p">,</span> <span class="n">header</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">usecols</span><span class="o">=</span><span class="n">column_check</span><span class="p">)</span>
</pre></div>
<p>The key concept to keep in mind is that the function is called with each column name and must
return a <code class="code">
True</code>
or <code class="code">
False</code>
for each column. Those columns that evaluate to
<code class="code">
True</code>
will be included.</p>
<p>Another approach to using a callable is to include a <code class="code">
lambda</code>
expression. Here is an example
where we want to include only a defined list of columns. We normalize the names
by converting them to lower case for comparison purposes.</p>
<div class="highlight"><pre><span></span><span class="n">cols_to_use</span> <span class="o">=</span> <span class="p">[</span><span class="s1">'item_type'</span><span class="p">,</span> <span class="s1">'order id'</span><span class="p">,</span> <span class="s1">'order date'</span><span class="p">,</span> <span class="s1">'state'</span><span class="p">,</span> <span class="s1">'priority'</span><span class="p">]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_excel</span><span class="p">(</span><span class="n">src_file</span><span class="p">,</span>
<span class="n">header</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
<span class="n">usecols</span><span class="o">=</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">cols_to_use</span><span class="p">)</span>
</pre></div>
<p>Callable functions give us a lot of flexibility for dealing with the real world
messiness of Excel files.</p>
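<p>As a quick sanity check, <code class="code">read_csv</code> accepts the same <code class="code">usecols</code> callable, so the lambda can be exercised against hypothetical in-memory data without an Excel file:</p>

```python
import io
import pandas as pd

# Hypothetical messy header row with an unnamed column
raw = io.StringIO(
    "Order ID,Unnamed: 2,Priority,State\n100,x,high,MN\n101,y,low,WI\n")
cols_to_use = ['order id', 'state']
# The callable decides inclusion based on the lower-cased name
df = pd.read_csv(raw, usecols=lambda x: x.lower() in cols_to_use)
```

<p>Note that the selected columns keep their original capitalization and file order; the callable only decides inclusion.</p>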
</div>
<div class="section" id="ranges-and-tables">
<h2>Ranges and Tables</h2>
<p>In some cases, the data could be even more obfuscated in Excel. In this example, we have
a table called <code class="code">
ship_cost</code>
that we want to read. If you must work with a file like this,
it might be challenging to read in with the pandas options we have discussed so far.</p>
<div class="figure" style="width: 830px; height: auto; max-width: 100%;">
<img alt="Excel table" src="https://pbpython.com/images/excel_named_table-2.png" style="width: 830px; height: auto; max-width: 100%;"/>
</div>
<p>In this case, we can use <a class="reference external" href="https://openpyxl.readthedocs.io/en/stable/">openpyxl</a> directly to parse the file and convert the data into
a pandas DataFrame. The fact that the data is in an Excel table can make this process a
little easier.</p>
<p>Here’s how to use openpyxl (once it is installed) to read the Excel file:</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">openpyxl</span> <span class="kn">import</span> <span class="n">load_workbook</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="n">src_file</span> <span class="o">=</span> <span class="n">src_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'shipping_tables.xlsx'</span>
<span class="n">wb</span> <span class="o">=</span> <span class="n">load_workbook</span><span class="p">(</span><span class="n">filename</span> <span class="o">=</span> <span class="n">src_file</span><span class="p">)</span>
</pre></div>
<p>This loads the whole workbook. If we want to see all the sheets:</p>
<div class="highlight"><pre><span></span><span class="n">wb</span><span class="o">.</span><span class="n">sheetnames</span>
</pre></div>
<pre class="literal-block">
['sales', 'shipping_rates']
</pre>
<p>To access the specific sheet:</p>
<div class="highlight"><pre><span></span><span class="n">sheet</span> <span class="o">=</span> <span class="n">wb</span><span class="p">[</span><span class="s1">'shipping_rates'</span><span class="p">]</span>
</pre></div>
<p>To see a list of all the named tables:</p>
<div class="highlight"><pre><span></span><span class="n">sheet</span><span class="o">.</span><span class="n">tables</span><span class="o">.</span><span class="n">keys</span><span class="p">()</span>
</pre></div>
<pre class="literal-block">
dict_keys(['ship_cost'])
</pre>
<p>This key corresponds to the name we assigned in Excel to the table. Now we access the table
to get the equivalent Excel range:</p>
<div class="highlight"><pre><span></span><span class="n">lookup_table</span> <span class="o">=</span> <span class="n">sheet</span><span class="o">.</span><span class="n">tables</span><span class="p">[</span><span class="s1">'ship_cost'</span><span class="p">]</span>
<span class="n">lookup_table</span><span class="o">.</span><span class="n">ref</span>
</pre></div>
<pre class="literal-block">
'C8:E16'
</pre>
<p>This worked. We now know the range of data we want to load. The final step is to convert that
range to a pandas DataFrame. Here is a short code <a class="reference external" href="https://stackoverflow.com/questions/54211828/how-do-i-convert-range-of-openpyxl-cells-to-pandas-dataframe-without-looping-tho">snippet</a> to loop through each row and convert to
a DataFrame:</p>
<div class="highlight"><pre><span></span><span class="c1"># Access the data in the table range</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">sheet</span><span class="p">[</span><span class="n">lookup_table</span><span class="o">.</span><span class="n">ref</span><span class="p">]</span>
<span class="n">rows_list</span> <span class="o">=</span> <span class="p">[]</span>
<span class="c1"># Loop through each row and get the values in the cells</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">data</span><span class="p">:</span>
<span class="c1"># Get a list of all columns in each row</span>
<span class="n">cols</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">row</span><span class="p">:</span>
<span class="n">cols</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">col</span><span class="o">.</span><span class="n">value</span><span class="p">)</span>
<span class="n">rows_list</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">cols</span><span class="p">)</span>
<span class="c1"># Create a pandas dataframe from the rows_list.</span>
<span class="c1"># The first row is the column names</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="o">=</span><span class="n">rows_list</span><span class="p">[</span><span class="mi">1</span><span class="p">:],</span> <span class="n">index</span><span class="o">=</span><span class="kc">None</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="n">rows_list</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
</pre></div>
<p>Here is the resulting DataFrame:</p>
<div class="figure" style="width: 415px; height: auto; max-width: 100%;">
<img alt="Excel shipping table" src="https://pbpython.com/images/excel_shipping_dataframe.png" style="width: 415px; height: auto; max-width: 100%;"/>
</div>
<p>Now we have the clean table and can use it for further calculations.</p>
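<p>The loop-and-append step can also be condensed into a comprehension. This sketch uses a hypothetical stand-in for openpyxl cells (anything with a <code class="code">.value</code> attribute behaves the same way):</p>

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for openpyxl cell objects
Cell = namedtuple('Cell', 'value')
data = [
    (Cell('carrier'), Cell('rate')),   # header row of the table range
    (Cell('ground'), Cell(5.0)),
    (Cell('air'), Cell(12.5)),
]

# The explicit nested loop, condensed into a comprehension
rows_list = [[cell.value for cell in row] for row in data]
df = pd.DataFrame(data=rows_list[1:], columns=rows_list[0])
```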
</div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>In an ideal world, the data we use would be in a simple consistent format. See <a class="reference external" href="https://www.tandfonline.com/doi/full/10.1080/00031305.2017.1375989">this paper</a>
for a nice discussion of what good spreadsheet practices look like.</p>
<p>In the examples in this article, you could easily delete rows and columns to make this more
well-formatted. However, there are times where this is not feasible or advisable. The good
news is that pandas and openpyxl give us all the tools we need to read Excel data - no
matter how crazy the spreadsheet gets.</p>
</div>
<div class="section" id="changes">
<h2>Changes</h2>
<ul class="simple">
<li>21-Oct-2020: Clarified that we don’t want to include the notes column</li>
</ul>
</div>
Case Study: Processing Historical Weather Pattern Data2020-10-12T07:25:00-05:002020-10-12T07:25:00-05:00Chris Moffitttag:pbpython.com,2020-10-12:/weather-data.html<p class="first">The main purpose of this blog is to show people how to use Python to solve real world problems.
Over the years, I have been fortunate enough to hear from readers about how they have used tips
and tricks from this site to solve their own problems. In this post, I am extremely delighted to present
a real world case study. I hope it will give you some ideas about how you can apply these
concepts to your own problems.</p>
<p class="last">This example comes from Michael Biermann from Germany. He had the challenging task of trying to
gather detailed historical weather data in order to do analysis on the relationship between
air temperature and power consumption. This article will show how he used a pipeline of Python
programs to automate the process of collecting, cleaning and processing gigabytes of weather
data in order to perform his analysis.</p>
<div class="section" id="introduction">
<h2>Introduction</h2>
<p>The main purpose of this blog is to show people how to use Python to solve real world problems.
Over the years, I have been fortunate enough to hear from readers about how they have used tips
and tricks from this site to solve their own problems. In this post, I am extremely delighted to present
a real world case study. I hope it will give you some ideas about how you can apply these
concepts to your own problems.</p>
<p>This example comes from Michael Biermann from Germany. He had the challenging task of trying to
gather detailed historical weather data in order to do analysis on the relationship between
air temperature and power consumption. This article will show how he used a pipeline of Python
programs to automate the process of collecting, cleaning and processing gigabytes of weather
data in order to perform his analysis.</p>
</div>
<div class="section" id="problem-background">
<h2>Problem Background</h2>
<p>I will turn it over to Michael to give the background for this problem.</p>
<blockquote>
<p>Hi, I’m Michael, <span class="caps">CEO</span> of a company providing services to energy providers, especially focusing on
electrical power and gas. I wanted to do an ex-post analysis to get deeper insights into
the deviation of the power consumption of electrical heating systems in comparison to the
air temperature. Since we provide power to other companies, we need to have a good grasp
on the power consumption, which correlates to the air temperature. In short, I needed to
know how well I can predict the long term temperatures and how much deviation is to be expected.</p>
<p>To be able to do this analysis, I needed historical temperatures, which are supplied by the
German weather service, <span class="caps">DWD</span>. Since it would be really time consuming to download all the
files and extract them by hand, I decided to give this a shot with Python. I know a few
things about programming, but I am pretty far from a professional programmer. The process
was a lot of trial and error, but this project turned out to be exactly the right fit for
this approach. I use a lot of hardcore Excel analysis, fetching and munching data with
Power Query M, but this was clearly over the limit to what can be done in Excel.</p>
<p>I am really happy with the results. There is hardly anything as satisfying as letting the
computer do the hard work for the next 20 min, while grabbing a cup of coffee.</p>
<p>I am also really happy to have learned a few more things about web scraping, because I can
use it in future projects to automate data fetching.</p>
</blockquote>
<p>Here is a visual to help understand the process Michael created:</p>
<div class="figure" style="width: 707px; height: auto; max-width: 100%;">
<img alt="Data Processing Pipeline" src="https://pbpython.com/images/data-pipeline.png" style="width: 707px; height: auto; max-width: 100%;"/>
</div>
<p>If you are interested in following along, all of the code examples are available <a class="reference external" href="https://github.com/chris1610/pbpython/tree/master/notebooks/case_study_weather">here</a>.</p>
</div>
<div class="section" id="downloading-the-data">
<h2>Downloading the Data</h2>
<p>The first notebook in the pipeline is <code class="code">
1-dwd_konverter_download</code>
. This notebook pulls
historical temperature data from the German Weather Service (<span class="caps">DWD</span>) server and formats it for
future use in other projects.</p>
<p>The data is delivered at an hourly frequency in a .zip file for each of the available weather
stations. To use the data, we need everything in a single .csv file with all stations side by side.
We also need the daily average.</p>
<p>To reduce computing time, we also crop all data earlier than 2007.</p>
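<p>The hourly-to-daily conversion and the 2007 cutoff boil down to a resample and a slice. Here is a sketch with hypothetical readings:</p>

```python
import pandas as pd

# Hypothetical hourly temperature readings for one station
hourly = pd.DataFrame(
    {'temp': [1.0, 3.0, 2.0, 4.0]},
    index=pd.to_datetime(['2006-12-31 12:00', '2007-01-01 00:00',
                          '2007-01-01 12:00', '2007-01-02 00:00']),
)
daily = hourly.resample('D').mean()   # hourly -> daily average
daily = daily.loc['2007':]            # crop everything earlier than 2007
```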
<p>For the purposes of this article, I have limited the download to only 10 files but the
full data set is over 600 files.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="c1"># Set base values</span>
<span class="n">download_folder</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'download'</span>
<span class="n">base_url</span> <span class="o">=</span> <span class="s1">'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/historical/'</span>
<span class="c1"># Initiate Session and get the Index-Page</span>
<span class="k">with</span> <span class="n">requests</span><span class="o">.</span><span class="n">Session</span><span class="p">()</span> <span class="k">as</span> <span class="n">s</span><span class="p">:</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">base_url</span><span class="p">)</span>
<span class="c1"># Parse the Index-Page for all relevant <a href></span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">resp</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
<span class="n">links</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s2">"a"</span><span class="p">,</span> <span class="n">href</span><span class="o">=</span><span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s2">"stundenwerte_TU_.*_hist.zip"</span><span class="p">))</span>
<span class="c1"># For testing, only download 10 files</span>
<span class="n">file_max</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">dl_count</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1">#Download the .zip files to the download_folder</span>
<span class="k">for</span> <span class="n">link</span> <span class="ow">in</span> <span class="n">links</span><span class="p">:</span>
<span class="n">zip_response</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">base_url</span> <span class="o">+</span> <span class="n">link</span><span class="p">[</span><span class="s1">'href'</span><span class="p">],</span> <span class="n">stream</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Limit the downloads while testing</span>
<span class="n">dl_count</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">dl_count</span> <span class="o">></span> <span class="n">file_max</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">download_folder</span><span class="p">)</span> <span class="o">/</span> <span class="n">link</span><span class="p">[</span><span class="s1">'href'</span><span class="p">],</span> <span class="s1">'wb'</span><span class="p">)</span> <span class="k">as</span> <span class="n">file</span><span class="p">:</span>
<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">zip_response</span><span class="o">.</span><span class="n">iter_content</span><span class="p">(</span><span class="n">chunk_size</span><span class="o">=</span><span class="mi">128</span><span class="p">):</span>
<span class="n">file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
<span class="nb">print</span><span class="p">(</span><span class="s1">'Done'</span><span class="p">)</span>
</pre></div>
<p>This portion of code parses the download page, finds all of the zip files whose names match
<code class="code">stundenwerte_TU</code>, and saves them in a <code class="code">download</code> directory.</p>
</div>
<div class="section" id="extracting-the-data">
<h2>Extracting the Data</h2>
<p>After the first step is completed, the download directory contains multiple zip files.
The second notebook in the process, <code class="code">2-dwd_konverter_extract</code>, searches each
zip file for the .txt file that contains the actual temperature values.</p>
<p>The program then extracts each file and moves it to the <code class="code">import</code> directory for further processing.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">from</span> <span class="nn">zipfile</span> <span class="kn">import</span> <span class="n">ZipFile</span>
<span class="c1"># Folder definitions</span>
<span class="n">download_folder</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'download'</span>
<span class="n">import_folder</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s1">'import'</span>
<span class="c1"># Find all .zip files and generate a list</span>
<span class="n">unzip_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'download/stundenwerte_TU_*_hist.zip'</span><span class="p">)</span>
<span class="c1"># Set the name pattern of the file we need</span>
<span class="n">regex_name</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s1">'produkt.*'</span><span class="p">)</span>
<span class="c1"># Open all files, look for files that match ne regex pattern, extract to 'import'</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">unzip_files</span><span class="p">:</span>
<span class="k">with</span> <span class="n">ZipFile</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="s1">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">zipObj</span><span class="p">:</span>
<span class="n">list_of_filenames</span> <span class="o">=</span> <span class="n">zipObj</span><span class="o">.</span><span class="n">namelist</span><span class="p">()</span>
<span class="n">extract_filename</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">filter</span><span class="p">(</span><span class="n">regex_name</span><span class="o">.</span><span class="n">match</span><span class="p">,</span> <span class="n">list_of_filenames</span><span class="p">))[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">zipObj</span><span class="o">.</span><span class="n">extract</span><span class="p">(</span><span class="n">extract_filename</span><span class="p">,</span> <span class="n">import_folder</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="s1">'Done'</span><span class="p">)</span>
</pre></div>
<p>After running this script, the <code class="code">import</code> directory will contain text files that
look like this:</p>
<div class="highlight"><pre><span></span>STATIONS_ID<span class="p">;</span>MESS_DATUM<span class="p">;</span>QN_9<span class="p">;</span>TT_TU<span class="p">;</span>RF_TU<span class="p">;</span>eor
<span class="m">3</span><span class="p">;</span><span class="m">1950040101</span><span class="p">;</span> <span class="m">5</span><span class="p">;</span> <span class="m">5</span>.7<span class="p">;</span> <span class="m">83</span>.0<span class="p">;</span>eor
<span class="m">3</span><span class="p">;</span><span class="m">1950040102</span><span class="p">;</span> <span class="m">5</span><span class="p">;</span> <span class="m">5</span>.6<span class="p">;</span> <span class="m">83</span>.0<span class="p">;</span>eor
<span class="m">3</span><span class="p">;</span><span class="m">1950040103</span><span class="p">;</span> <span class="m">5</span><span class="p">;</span> <span class="m">5</span>.5<span class="p">;</span> <span class="m">83</span>.0<span class="p">;</span>eor
<span class="m">3</span><span class="p">;</span><span class="m">1950040104</span><span class="p">;</span> <span class="m">5</span><span class="p">;</span> <span class="m">5</span>.5<span class="p">;</span> <span class="m">83</span>.0<span class="p">;</span>eor
<span class="m">3</span><span class="p">;</span><span class="m">1950040105</span><span class="p">;</span> <span class="m">5</span><span class="p">;</span> <span class="m">5</span>.8<span class="p">;</span> <span class="m">85</span>.0<span class="p">;</span>eor
</pre></div>
</div>
<div class="section" id="building-the-dataframe">
<h2>Building the DataFrame</h2>
<p>Now that we have isolated the data we need, we must format it for further analysis.</p>
<p>There are three steps in this notebook, <code class="code">3-dwd_konverter_build_df</code>:</p>
<div class="section" id="process-individual-files">
<h3>Process Individual Files</h3>
<p>The files are imported into a single DataFrame, stripped of unnecessary columns, and filtered by date.
Then we set a <code class="code">DateTimeIndex</code> and concatenate them into the <code class="code">main_df</code>. Because the loop takes a
long time, we output some status messages to confirm the process is still running.</p>
</div>
<div class="section" id="process-the-concatenated-main-df">
<h3>Process the concatenated main_df</h3>
<p>Then we display some info about the <code class="code">main_df</code> so we can verify that there are no errors, mainly
that all data types were recognized correctly. We also drop duplicate entries, in case
some of the .csv files were accidentally duplicated during the development process.</p>
</div>
<div class="section" id="unstack-and-export">
<h3>Unstack and export</h3>
<p>For the final step, we unstack the <code class="code">main_df</code> and save it to a .csv and a .pkl file for the
next step in the analysis process. We also display some output to verify the results.</p>
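<p>To make the effect of <code class="code">unstack</code> concrete before reading the full notebook code, here is a minimal, hypothetical example (not from the notebook) with two stations and two timestamps:</p>

```python
import pandas as pd

# Hypothetical two stations (3 and 44) measured at two timestamps
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(['2007-01-01', '2007-01-02']), [3, 44]],
    names=['MESS_DATUM', 'STATIONS_ID'])
df = pd.DataFrame({'TT_TU': [11.4, 3.9, 12.0, 4.1]}, index=idx)

# unstack pivots the STATIONS_ID index level into columns,
# yielding one row per date and one TT_TU column per station
wide = df.unstack('STATIONS_ID')
print(wide.shape)  # (2, 2)
```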
<p>Now let’s look at the code:</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">clear_output</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">import</span> <span class="nn">glob</span>
<span class="n">import_files</span> <span class="o">=</span> <span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="s1">'import/*'</span><span class="p">)</span>
<span class="n">out_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s2">"export_uncleaned"</span> <span class="o">/</span> <span class="s2">"to_clean"</span>
<span class="n">obsolete_columns</span> <span class="o">=</span> <span class="p">[</span>
<span class="s1">'QN_9'</span><span class="p">,</span>
<span class="s1">'RF_TU'</span><span class="p">,</span>
<span class="s1">'eor'</span>
<span class="p">]</span>
<span class="n">main_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">()</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">for</span> <span class="n">file</span> <span class="ow">in</span> <span class="n">import_files</span><span class="p">:</span>
<span class="c1"># Read in the next file</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="n">file</span><span class="p">,</span> <span class="n">delimiter</span><span class="o">=</span><span class="s2">";"</span><span class="p">)</span>
<span class="c1"># Prepare the df before merging (Drop obsolete, convert to datetime, filter to date, set index)</span>
<span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">columns</span><span class="o">=</span><span class="n">obsolete_columns</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">df</span><span class="p">[</span><span class="s2">"MESS_DATUM"</span><span class="p">]</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">to_datetime</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s2">"MESS_DATUM"</span><span class="p">],</span> <span class="nb">format</span><span class="o">=</span><span class="s2">"%Y%m</span><span class="si">%d</span><span class="s2">%H"</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="n">df</span><span class="p">[</span><span class="s1">'MESS_DATUM'</span><span class="p">]</span><span class="o">>=</span> <span class="s2">"2007-01-01"</span><span class="p">]</span>
<span class="n">df</span><span class="o">.</span><span class="n">set_index</span><span class="p">([</span><span class="s1">'MESS_DATUM'</span><span class="p">,</span> <span class="s1">'STATIONS_ID'</span><span class="p">],</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Merge to the main_df</span>
<span class="n">main_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">concat</span><span class="p">([</span><span class="n">main_df</span><span class="p">,</span> <span class="n">df</span><span class="p">])</span>
<span class="c1"># Display some status messages</span>
<span class="n">clear_output</span><span class="p">(</span><span class="n">wait</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="s1">'Finished file: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">file</span><span class="p">),</span> <span class="s1">'This is file </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">i</span><span class="p">))</span>
<span class="n">display</span><span class="p">(</span><span class="s1">'Shape of the main_df is: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">main_df</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
<span class="n">i</span><span class="o">+=</span><span class="mi">1</span>
<span class="c1"># Check if all types are correct</span>
<span class="n">display</span><span class="p">(</span><span class="n">main_df</span><span class="p">[</span><span class="s1">'TT_TU'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">type</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="o">.</span><span class="vm">__name__</span><span class="p">)</span><span class="o">.</span><span class="n">value_counts</span><span class="p">())</span>
<span class="c1"># Make sure that to files or observations a duplicates, eg. scan the index for duplicate entries.</span>
<span class="c1"># The ~ is a bitwise operation, meaning it flips all bits.</span>
<span class="n">main_df</span> <span class="o">=</span> <span class="n">main_df</span><span class="p">[</span><span class="o">~</span><span class="n">main_df</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">duplicated</span><span class="p">(</span><span class="n">keep</span><span class="o">=</span><span class="s1">'last'</span><span class="p">)]</span>
<span class="c1"># Unstack the main_df</span>
<span class="n">main_df</span> <span class="o">=</span> <span class="n">main_df</span><span class="o">.</span><span class="n">unstack</span><span class="p">(</span><span class="s1">'STATIONS_ID'</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="s1">'Shape of the main_df is: </span><span class="si">{}</span><span class="s1">'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">main_df</span><span class="o">.</span><span class="n">shape</span><span class="p">))</span>
<span class="c1"># Save main_df to a .csv file and a pickle to continue working in the next step</span>
<span class="n">main_df</span><span class="o">.</span><span class="n">to_pickle</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">out_file</span><span class="p">)</span><span class="o">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s1">'.pkl'</span><span class="p">))</span>
<span class="n">main_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">Path</span><span class="p">(</span><span class="n">out_file</span><span class="p">)</span><span class="o">.</span><span class="n">with_suffix</span><span class="p">(</span><span class="s1">'.csv'</span><span class="p">),</span> <span class="n">sep</span><span class="o">=</span><span class="s2">";"</span><span class="p">)</span>
<span class="n">display</span><span class="p">(</span><span class="n">main_df</span><span class="o">.</span><span class="n">head</span><span class="p">())</span>
<span class="n">display</span><span class="p">(</span><span class="n">main_df</span><span class="o">.</span><span class="n">describe</span><span class="p">())</span>
</pre></div>
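<p>One design note on the loop above: growing a DataFrame with <code class="code">pd.concat</code> on every iteration copies the accumulated data each time. A common alternative, sketched below with a hypothetical list of per-file frames, is to collect the pieces in a list and concatenate once at the end:</p>

```python
import pandas as pd

# Hypothetical stand-in for the per-file DataFrames built in the loop
frames = [pd.DataFrame({'TT_TU': [5.7 + i]}, index=[i]) for i in range(3)]

# Concatenate once at the end instead of growing main_df each iteration
main_df = pd.concat(frames)
print(len(main_df))  # 3
```

For a handful of files either approach is fine; for hundreds of stations the single concatenation is usually noticeably faster.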
<p>As this program runs, here is some of the progress output:</p>
<pre class="literal-block">
'Finished file: import/produkt_tu_stunde_20041101_20191231_00078.txt'
'This is file 10'
'Shape of the main_df is: (771356, 1)'
float 771356
Name: TT_TU, dtype: int64
'Shape of the main_df is: (113952, 9)'
</pre>
<p>Here is what the final DataFrame looks like:</p>
<div style="max-height:1000px;max-width:1500px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr>
<th></th>
<th colspan="9" halign="left">TT_TU</th>
</tr>
<tr>
<th>STATIONS_ID</th>
<th>3</th>
<th>44</th>
<th>71</th>
<th>73</th>
<th>78</th>
<th>91</th>
<th>96</th>
<th>102</th>
<th>125</th>
</tr>
<tr>
<th>MESS_DATUM</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2007-01-01 00:00:00</th>
<td>11.4</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.0</td>
<td>9.4</td>
<td>NaN</td>
<td>9.7</td>
<td>NaN</td>
</tr>
<tr>
<th>2007-01-01 01:00:00</th>
<td>12.0</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>11.4</td>
<td>9.6</td>
<td>NaN</td>
<td>10.4</td>
<td>NaN</td>
</tr>
<tr>
<th>2007-01-01 02:00:00</th>
<td>12.3</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>9.4</td>
<td>10.0</td>
<td>NaN</td>
<td>9.9</td>
<td>NaN</td>
</tr>
<tr>
<th>2007-01-01 03:00:00</th>
<td>11.5</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>9.3</td>
<td>9.7</td>
<td>NaN</td>
<td>9.5</td>
<td>NaN</td>
</tr>
<tr>
<th>2007-01-01 04:00:00</th>
<td>9.6</td>
<td>NaN</td>
<td>NaN</td>
<td>NaN</td>
<td>8.6</td>
<td>10.2</td>
<td>NaN</td>
<td>8.9</td>
<td>NaN</td>
</tr>
</tbody>
</table>
</div><p>At the end of this step, we have the file in a condensed format we can use for analysis.</p>
</div>
</div>
<div class="section" id="final-processing">
<h2>Final Processing</h2>
<p>The data contains some errors, which need to be cleaned. You can see, by looking at the
output of <code class="code">main_df.describe()</code>, that the minimum temperature at some
stations is -999, which means there is no plausible measurement for that particular
hour. We change these values to <code class="code">np.nan</code> so that we can safely calculate the average daily value
in the next step.</p>
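<p>As an aside, the sentinel could also be handled at read time: <code class="code">pd.read_csv</code> accepts an <code class="code">na_values</code> argument that converts -999 to <code class="code">NaN</code> while the file is parsed. A small sketch with inline sample data (not the actual DWD files):</p>

```python
import io

import pandas as pd

# Inline sample mimicking the DWD file layout; -999 marks missing data
csv_data = "STATIONS_ID;MESS_DATUM;TT_TU\n3;1950040101;-999\n3;1950040102;5.6\n"
df = pd.read_csv(io.StringIO(csv_data), delimiter=";", na_values=[-999])
print(df['TT_TU'].isna().sum())  # 1
```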
<p>Once these values are corrected, we need to resample to daily measurements. Pandas <code class="code">resample</code> makes this really simple.</p>
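<p>As a minimal illustration with synthetic data (not the station files), <code class="code">resample('D').mean()</code> collapses hourly rows into one daily average per column:</p>

```python
import pandas as pd

# 48 synthetic hourly readings spanning two days
hours = pd.date_range('2012-01-01', periods=48, freq='h')
temps = pd.Series(range(48), index=hours, dtype=float)

# One mean value per calendar day
daily = temps.resample('D').mean()
print(daily.tolist())  # [11.5, 35.5]
```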
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="c1"># Import and export paths</span>
<span class="n">pkl_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s2">"export_uncleaned"</span> <span class="o">/</span> <span class="s2">"to_clean.pkl"</span>
<span class="n">cleaned_file</span> <span class="o">=</span> <span class="n">Path</span><span class="o">.</span><span class="n">cwd</span><span class="p">()</span> <span class="o">/</span> <span class="s2">"export_cleaned"</span> <span class="o">/</span> <span class="s2">"cleaned.csv"</span>
<span class="c1"># Read in the pickle file from the last cell</span>
<span class="n">cleaning_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_pickle</span><span class="p">(</span><span class="n">pkl_file</span><span class="p">)</span>
<span class="c1"># Replace all values with "-999", which indicate missing data</span>
<span class="n">cleaning_df</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="n">to_replace</span><span class="o">=-</span><span class="mi">999</span><span class="p">,</span> <span class="n">value</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">nan</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="kc">True</span><span class="p">)</span>
<span class="c1"># Resample to daily frequency</span>
<span class="n">cleaning_df</span> <span class="o">=</span> <span class="n">cleaning_df</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">'D'</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">round</span><span class="p">(</span><span class="n">decimals</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="c1"># Save as .csv</span>
<span class="n">cleaning_df</span><span class="o">.</span><span class="n">to_csv</span><span class="p">(</span><span class="n">cleaned_file</span><span class="p">,</span> <span class="n">sep</span><span class="o">=</span><span class="s2">";"</span><span class="p">,</span> <span class="n">decimal</span><span class="o">=</span><span class="s2">","</span><span class="p">)</span>
<span class="c1"># Show some results for verification</span>
<span class="n">display</span><span class="p">(</span><span class="n">cleaning_df</span><span class="o">.</span><span class="n">loc</span><span class="p">[</span><span class="s1">'2011-12-31'</span><span class="p">:</span><span class="s1">'2012-01-04'</span><span class="p">])</span>
<span class="n">display</span><span class="p">(</span><span class="n">cleaning_df</span><span class="o">.</span><span class="n">describe</span><span class="p">())</span>
<span class="n">display</span><span class="p">(</span><span class="n">cleaning_df</span><span class="p">)</span>
</pre></div>
<p>Here is the final DataFrame with daily average values for the stations:</p>
<div style="max-height:1000px;max-width:1500px;overflow:auto;">
<table border="1" class="table table-condensed">
<thead>
<tr>
<th></th>
<th colspan="9" halign="left">TT_TU</th>
</tr>
<tr>
<th>STATIONS_ID</th>
<th>3</th>
<th>44</th>
<th>71</th>
<th>73</th>
<th>78</th>
<th>91</th>
<th>96</th>
<th>102</th>
<th>125</th>
</tr>
<tr>
<th>MESS_DATUM</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<th>2011-12-31</th>
<td>NaN</td>
<td>3.88</td>
<td>2.76</td>
<td>1.19</td>
<td>4.30</td>
<td>2.43</td>
<td>NaN</td>
<td>3.80</td>
<td>NaN</td>
</tr>
<tr>
<th>2012-01-01</th>
<td>NaN</td>
<td>10.90</td>
<td>8.14</td>
<td>4.03</td>
<td>10.96</td>
<td>10.27</td>
<td>NaN</td>
<td>9.01</td>
<td>NaN</td>
</tr>
<tr>
<th>2012-01-02</th>
<td>NaN</td>
<td>7.41</td>
<td>6.18</td>
<td>4.77</td>
<td>7.57</td>
<td>7.77</td>
<td>NaN</td>
<td>6.48</td>
<td>4.66</td>
</tr>
<tr>
<th>2012-01-03</th>
<td>NaN</td>
<td>6.14</td>
<td>3.61</td>
<td>4.46</td>
<td>6.38</td>
<td>5.28</td>
<td>NaN</td>
<td>5.63</td>
<td>3.51</td>
</tr>
<tr>
<th>2012-01-04</th>
<td>NaN</td>
<td>5.80</td>
<td>2.48</td>
<td>4.45</td>
<td>5.46</td>
<td>4.57</td>
<td>NaN</td>
<td>5.85</td>
<td>1.94</td>
</tr>
</tbody>
</table>
</div></div>
<div class="section" id="summary">
<h2>Summary</h2>
<p>There are several aspects of this case study that I really like.</p>
<ul class="simple">
<li>Michael was not an expert programmer and decided to dedicate himself to learning the
Python necessary for solving this problem.</li>
<li>It took some time for him to learn how to accomplish multiple tasks but he persevered
through all the challenges and built a complete solution.</li>
<li>This was a real-world problem that would be difficult to solve with other tools but could
be automated with very few lines of Python code.</li>
<li>The process can be time-consuming to run, so it’s broken down into multiple stages.
This is a great idea to apply to other problems. This previous <a class="reference external" href="https://pbpython.com/notebook-process.html">article</a> actually served
as the inspiration for many of the techniques used in the solution.</li>
<li>This solution brings together many different concepts including web scraping, downloading files,
working with zip files and cleaning <span class="amp">&</span> analyzing data with pandas.</li>
<li>Michael now has a new skill that he can apply to other problems in his business.</li>
</ul>
<p>Finally, I love this quote from Michael:</p>
<blockquote>
There is hardly anything as satisfying as letting the computer do the hard work for
the next 20 min, while grabbing a cup of coffee.</blockquote>
<p>I agree 100%. Thank you Michael for taking the time to share such a great example! I hope
it gives you some ideas to apply to your own projects.</p>
</div>