{ "metadata": { "name": "", "signature": "sha256:9a829f4dec3ab792525a6e2c4858eead2c7089be44aed38ecf8b9c5bcb8bf5a7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most common tasks for pandas and python is to automate the process to aggregate data from multiple spreadsheets and files.\n", "\n", "This article will walk through the basic flow required to parse multiple excel files, combine some data, clean it up and analyze it.\n", "\n", "Please refer to [this post](http://pbpython.com/excel-file-combine.html) for the full post." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Collecting the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import pandas and numpy" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "import numpy as np" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a look at the files in our input directory, using the convenient shell commands in ipython." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls ../in" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "address-state-example.xlsx report.xlsx\t\t sample-address-new.xlsx\r\n", "customer-status.xlsx\t sales-feb-2014.xlsx sample-address-old.xlsx\r\n", "excel-comp-data.xlsx\t sales-jan-2014.xlsx sample-diff-1.xlsx\r\n", "my-diff-1.xlsx\t\t sales-mar-2014.xlsx sample-diff-2.xlsx\r\n", "my-diff-2.xlsx\t\t sample-address-1.xlsx sample-salesv3.xlsx\r\n", "my-diff.xlsx\t\t sample-address-2.xlsx\r\n", "pricing.xlsx\t\t sample-address-3.xlsx\r\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a lot of files, but we only want to look at the sales .xlsx files." ] }, { "cell_type": "code", "collapsed": false, "input": [ "!ls ../in/sales*.xlsx" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "../in/sales-feb-2014.xlsx ../in/sales-jan-2014.xlsx ../in/sales-mar-2014.xlsx\r\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the python glob module to easily list out the files we need" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import glob" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "glob.glob(\"../in/sales*.xlsx\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "['../in/sales-jan-2014.xlsx',\n", " '../in/sales-mar-2014.xlsx',\n", " '../in/sales-feb-2014.xlsx']" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This gives us what we need, let's import each of our files and combine them into one file. \n", "\n", "Panda's concat and append can do this for us. I'm going to use append in this example.\n", "\n", "The code snippet below will initialize a blank DataFrame then append all of the individual files into the all_data DataFrame." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data = pd.DataFrame()\n", "for f in glob.glob(\"../in/sales*.xlsx\"):\n", " df = pd.read_excel(f)\n", " all_data = all_data.append(df,ignore_index=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure you data looks good." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data.describe()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numberquantityunit priceext price
count 1742.000000 1742.000000 1742.000000 1742.000000
mean 485766.487945 24.319173 54.985454 1349.229392
std 223750.660792 14.502759 26.108490 1094.639319
min 141962.000000 -1.000000 10.030000 -97.160000
25% 257198.000000 12.000000 32.132500 468.592500
50% 527099.000000 25.000000 55.465000 1049.700000
75% 714466.000000 37.000000 77.607500 2074.972500
max 786968.000000 49.000000 99.850000 4824.540000
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " account number quantity unit price ext price\n", "count 1742.000000 1742.000000 1742.000000 1742.000000\n", "mean 485766.487945 24.319173 54.985454 1349.229392\n", "std 223750.660792 14.502759 26.108490 1094.639319\n", "min 141962.000000 -1.000000 10.030000 -97.160000\n", "25% 257198.000000 12.000000 32.132500 468.592500\n", "50% 527099.000000 25.000000 55.465000 1049.700000\n", "75% 714466.000000 37.000000 77.607500 2074.972500\n", "max 786968.000000 49.000000 99.850000 4824.540000" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alot of this data may not make much sense for this data set but I'm most interested in the count row to make sure the number of data elements makes sense." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedate
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " account number name sku quantity \\\n", "0 740150 Barton LLC B1-20000 39 \n", "1 714466 Trantow-Barrows S2-77896 -1 \n", "2 218895 Kulas Inc B1-69924 23 \n", "3 307599 Kassulke, Ondricka and Metz S1-65481 41 \n", "4 412290 Jerde-Hilpert S2-34077 6 \n", "\n", " unit price ext price date \n", "0 86.69 3380.91 2014-01-01 07:21:51 \n", "1 63.16 -63.16 2014-01-01 10:00:47 \n", "2 90.70 2086.10 2014-01-01 13:24:58 \n", "3 21.05 863.05 2014-01-01 15:05:22 \n", "4 83.21 499.26 2014-01-01 23:26:55 " ] } ], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not critical in this example but the best practice is to convert the date column to a date time object." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data['date'] = pd.to_datetime(all_data['date'])" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Combining Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.\n", "\n", "First, we read in the data." ] }, { "cell_type": "code", "collapsed": false, "input": [ "status = pd.read_excel(\"../in/customer-status.xlsx\")\n", "status" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernamestatus
0 740150 Barton LLC gold
1 714466 Trantow-Barrows silver
2 218895 Kulas Inc bronze
3 307599 Kassulke, Ondricka and Metz bronze
4 412290 Jerde-Hilpert bronze
5 729833 Koepp Ltd silver
6 146832 Kiehn-Spinka silver
7 688981 Keeling LLC silver
8 786968 Frami, Hills and Schmidt silver
9 239344 Stokes LLC gold
10 672390 Kuhn-Gusikowski silver
11 141962 Herman LLC gold
12 424914 White-Trantow silver
13 527099 Sanford and Sons bronze
14 642753 Pollich LLC bronze
15 257198 Cronin, Oberbrunner and Spencer gold
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " account number name status\n", "0 740150 Barton LLC gold\n", "1 714466 Trantow-Barrows silver\n", "2 218895 Kulas Inc bronze\n", "3 307599 Kassulke, Ondricka and Metz bronze\n", "4 412290 Jerde-Hilpert bronze\n", "5 729833 Koepp Ltd silver\n", "6 146832 Kiehn-Spinka silver\n", "7 688981 Keeling LLC silver\n", "8 786968 Frami, Hills and Schmidt silver\n", "9 239344 Stokes LLC gold\n", "10 672390 Kuhn-Gusikowski silver\n", "11 141962 Herman LLC gold\n", "12 424914 White-Trantow silver\n", "13 527099 Sanford and Sons bronze\n", "14 642753 Pollich LLC bronze\n", "15 257198 Cronin, Oberbrunner and Spencer gold" ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to merge this data with our concatenated data set of sales. We use panda's merge function and tell it to do a left join which is similar to Excel's vlookup function." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st = pd.merge(all_data, status, how='left')\n", "all_data_st.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0 740150 Barton LLC B1-20000 39 86.69 3380.912014-01-01 07:21:51 gold
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.162014-01-01 10:00:47 silver
2 218895 Kulas Inc B1-69924 23 90.70 2086.102014-01-01 13:24:58 bronze
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.052014-01-01 15:05:22 bronze
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.262014-01-01 23:26:55 bronze
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 11, "text": [ " account number name sku quantity \\\n", "0 740150 Barton LLC B1-20000 39 \n", "1 714466 Trantow-Barrows S2-77896 -1 \n", "2 218895 Kulas Inc B1-69924 23 \n", "3 307599 Kassulke, Ondricka and Metz S1-65481 41 \n", "4 412290 Jerde-Hilpert S2-34077 6 \n", "\n", " unit price ext price date status \n", "0 86.69 3380.91 2014-01-01 07:21:51 gold \n", "1 63.16 -63.16 2014-01-01 10:00:47 silver \n", "2 90.70 2086.10 2014-01-01 13:24:58 bronze \n", "3 21.05 863.05 2014-01-01 15:05:22 bronze \n", "4 83.21 499.26 2014-01-01 23:26:55 bronze " ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks pretty good but let's look at a specific account." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st[all_data_st[\"account number\"]==737550].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
9 737550 Fritsch, Russel and Anderson S2-82423 14 81.92 1146.882014-01-03 19:07:37 NaN
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.882014-01-04 08:57:48 NaN
26 737550 Fritsch, Russel and Anderson B1-53636 42 42.06 1766.522014-01-08 00:02:11 NaN
32 737550 Fritsch, Russel and Anderson S1-27722 20 29.54 590.802014-01-09 13:20:40 NaN
42 737550 Fritsch, Russel and Anderson S1-93683 22 71.68 1576.962014-01-11 23:47:36 NaN
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " account number name sku quantity \\\n", "9 737550 Fritsch, Russel and Anderson S2-82423 14 \n", "14 737550 Fritsch, Russel and Anderson B1-53102 23 \n", "26 737550 Fritsch, Russel and Anderson B1-53636 42 \n", "32 737550 Fritsch, Russel and Anderson S1-27722 20 \n", "42 737550 Fritsch, Russel and Anderson S1-93683 22 \n", "\n", " unit price ext price date status \n", "9 81.92 1146.88 2014-01-03 19:07:37 NaN \n", "14 71.56 1645.88 2014-01-04 08:57:48 NaN \n", "26 42.06 1766.52 2014-01-08 00:02:11 NaN \n", "32 29.54 590.80 2014-01-09 13:20:40 NaN \n", "42 71.68 1576.96 2014-01-11 23:47:36 NaN " ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This account number was not in our status file, so we have a bunch of NaN's. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st['status'].fillna('bronze',inplace=True)\n", "all_data_st.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0 740150 Barton LLC B1-20000 39 86.69 3380.912014-01-01 07:21:51 gold
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.162014-01-01 10:00:47 silver
2 218895 Kulas Inc B1-69924 23 90.70 2086.102014-01-01 13:24:58 bronze
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.052014-01-01 15:05:22 bronze
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.262014-01-01 23:26:55 bronze
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ " account number name sku quantity \\\n", "0 740150 Barton LLC B1-20000 39 \n", "1 714466 Trantow-Barrows S2-77896 -1 \n", "2 218895 Kulas Inc B1-69924 23 \n", "3 307599 Kassulke, Ondricka and Metz S1-65481 41 \n", "4 412290 Jerde-Hilpert S2-34077 6 \n", "\n", " unit price ext price date status \n", "0 86.69 3380.91 2014-01-01 07:21:51 gold \n", "1 63.16 -63.16 2014-01-01 10:00:47 silver \n", "2 90.70 2086.10 2014-01-01 13:24:58 bronze \n", "3 21.05 863.05 2014-01-01 15:05:22 bronze \n", "4 83.21 499.26 2014-01-01 23:26:55 bronze " ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the data just to make sure we're all good." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st[all_data_st[\"account number\"]==737550].head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
9 737550 Fritsch, Russel and Anderson S2-82423 14 81.92 1146.882014-01-03 19:07:37 bronze
14 737550 Fritsch, Russel and Anderson B1-53102 23 71.56 1645.882014-01-04 08:57:48 bronze
26 737550 Fritsch, Russel and Anderson B1-53636 42 42.06 1766.522014-01-08 00:02:11 bronze
32 737550 Fritsch, Russel and Anderson S1-27722 20 29.54 590.802014-01-09 13:20:40 bronze
42 737550 Fritsch, Russel and Anderson S1-93683 22 71.68 1576.962014-01-11 23:47:36 bronze
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 14, "text": [ " account number name sku quantity \\\n", "9 737550 Fritsch, Russel and Anderson S2-82423 14 \n", "14 737550 Fritsch, Russel and Anderson B1-53102 23 \n", "26 737550 Fritsch, Russel and Anderson B1-53636 42 \n", "32 737550 Fritsch, Russel and Anderson S1-27722 20 \n", "42 737550 Fritsch, Russel and Anderson S1-93683 22 \n", "\n", " unit price ext price date status \n", "9 81.92 1146.88 2014-01-03 19:07:37 bronze \n", "14 71.56 1645.88 2014-01-04 08:57:48 bronze \n", "26 42.06 1766.52 2014-01-08 00:02:11 bronze \n", "32 29.54 590.80 2014-01-09 13:20:40 bronze \n", "42 71.68 1576.96 2014-01-11 23:47:36 bronze " ] } ], "prompt_number": 14 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Using Categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the relatively new functions in pandas is support for categorical data. From the pandas, documentation -\n", "\n", "\"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.\"\n", "\n", "For our purposes, the status field is a good candidate for a category type.\n", "\n", "You must make sure you have a recent version of pandas installed for this example to work." ] }, { "cell_type": "code", "collapsed": false, "input": [ "pd.__version__" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 15, "text": [ "'0.15.2'" ] } ], "prompt_number": 15 }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we typecast it to a category using astype." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st[\"status\"] = all_data_st[\"status\"].astype(\"category\")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn't immediately appear to change anything yet." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0 740150 Barton LLC B1-20000 39 86.69 3380.912014-01-01 07:21:51 gold
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.162014-01-01 10:00:47 silver
2 218895 Kulas Inc B1-69924 23 90.70 2086.102014-01-01 13:24:58 bronze
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.052014-01-01 15:05:22 bronze
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.262014-01-01 23:26:55 bronze
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ " account number name sku quantity \\\n", "0 740150 Barton LLC B1-20000 39 \n", "1 714466 Trantow-Barrows S2-77896 -1 \n", "2 218895 Kulas Inc B1-69924 23 \n", "3 307599 Kassulke, Ondricka and Metz S1-65481 41 \n", "4 412290 Jerde-Hilpert S2-34077 6 \n", "\n", " unit price ext price date status \n", "0 86.69 3380.91 2014-01-01 07:21:51 gold \n", "1 63.16 -63.16 2014-01-01 10:00:47 silver \n", "2 90.70 2086.10 2014-01-01 13:24:58 bronze \n", "3 21.05 863.05 2014-01-01 15:05:22 bronze \n", "4 83.21 499.26 2014-01-01 23:26:55 bronze " ] } ], "prompt_number": 17 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Buy you can see that it is a new data type." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.dtypes" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "account number int64\n", "name object\n", "sku object\n", "quantity int64\n", "unit price float64\n", "ext price float64\n", "date datetime64[ns]\n", "status category\n", "dtype: object" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically. " ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.sort(columns=[\"status\"]).head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
1741 642753 Pollich LLC B1-04202 8 95.86 766.882014-02-28 23:47:32 bronze
1232 218895 Kulas Inc S1-06532 29 42.75 1239.752014-09-21 11:27:55 bronze
579 527099 Sanford and Sons S1-27722 41 87.86 3602.262014-04-14 18:36:11 bronze
580 383080 Will LLC B1-20000 40 51.73 2069.202014-04-14 22:44:58 bronze
581 383080 Will LLC S2-10342 15 76.75 1151.252014-04-15 02:57:43 bronze
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ " account number name sku quantity unit price \\\n", "1741 642753 Pollich LLC B1-04202 8 95.86 \n", "1232 218895 Kulas Inc S1-06532 29 42.75 \n", "579 527099 Sanford and Sons S1-27722 41 87.86 \n", "580 383080 Will LLC B1-20000 40 51.73 \n", "581 383080 Will LLC S2-10342 15 76.75 \n", "\n", " ext price date status \n", "1741 766.88 2014-02-28 23:47:32 bronze \n", "1232 1239.75 2014-09-21 11:27:55 bronze \n", "579 3602.26 2014-04-14 18:36:11 bronze \n", "580 2069.20 2014-04-14 22:44:58 bronze \n", "581 1151.25 2014-04-15 02:57:43 bronze " ] } ], "prompt_number": 19 }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering." ] }, { "cell_type": "code", "collapsed": false, "input": [ " all_data_st[\"status\"].cat.set_categories([ \"gold\",\"silver\",\"bronze\"],inplace=True)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can sort it so that gold shows on top." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.sort(columns=[\"status\"]).head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
account numbernameskuquantityunit priceext pricedatestatus
0 740150 Barton LLC B1-20000 39 86.69 3380.912014-01-01 07:21:51 gold
1193 257198 Cronin, Oberbrunner and Spencer S2-82423 23 52.90 1216.702014-09-09 03:06:30 gold
1194 141962 Herman LLC B1-86481 45 52.78 2375.102014-09-09 11:49:45 gold
1195 257198 Cronin, Oberbrunner and Spencer B1-50809 30 51.96 1558.802014-09-09 21:14:31 gold
1197 239344 Stokes LLC B1-65551 43 15.24 655.322014-09-10 11:10:02 gold
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ " account number name sku quantity \\\n", "0 740150 Barton LLC B1-20000 39 \n", "1193 257198 Cronin, Oberbrunner and Spencer S2-82423 23 \n", "1194 141962 Herman LLC B1-86481 45 \n", "1195 257198 Cronin, Oberbrunner and Spencer B1-50809 30 \n", "1197 239344 Stokes LLC B1-65551 43 \n", "\n", " unit price ext price date status \n", "0 86.69 3380.91 2014-01-01 07:21:51 gold \n", "1193 52.90 1216.70 2014-09-09 03:06:30 gold \n", "1194 52.78 2375.10 2014-09-09 11:49:45 gold \n", "1195 51.96 1558.80 2014-09-09 21:14:31 gold \n", "1197 15.24 655.32 2014-09-10 11:10:02 gold " ] } ], "prompt_number": 21 }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st[\"status\"].describe()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 22, "text": [ "count 1742\n", "unique 3\n", "top bronze\n", "freq 764\n", "Name: status, dtype: object" ] } ], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, if you want to take a quick look at how your top tier customers are performaing compared to the bottom. Use groupby to give us the average of the values." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].mean()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quantityunit priceext price
status
gold 24.680723 52.431205 1325.566867
silver 23.814241 55.724241 1339.477539
bronze 24.589005 55.470733 1367.757736
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ " quantity unit price ext price\n", "status \n", "gold 24.680723 52.431205 1325.566867\n", "silver 23.814241 55.724241 1339.477539\n", "bronze 24.589005 55.470733 1367.757736" ] } ], "prompt_number": 23 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Of course, you can run multiple aggregation functions on the data to get really useful information " ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].agg([np.sum,np.mean, np.std])" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
quantityunit priceext price
summeanstdsummeanstdsummeanstd
status
gold 8194 24.680723 14.478670 17407.16 52.431205 26.244516 440088.20 1325.566867 1074.564373
silver 15384 23.814241 14.519044 35997.86 55.724241 26.053569 865302.49 1339.477539 1094.908529
bronze 18786 24.589005 14.506515 42379.64 55.470733 26.062149 1044966.91 1367.757736 1104.129089
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 24, "text": [ " quantity unit price \\\n", " sum mean std sum mean std \n", "status \n", "gold 8194 24.680723 14.478670 17407.16 52.431205 26.244516 \n", "silver 15384 23.814241 14.519044 35997.86 55.724241 26.053569 \n", "bronze 18786 24.589005 14.506515 42379.64 55.470733 26.062149 \n", "\n", " ext price \n", " sum mean std \n", "status \n", "gold 440088.20 1325.566867 1074.564373 \n", "silver 865302.49 1339.477539 1094.908529 \n", "bronze 1044966.91 1367.757736 1104.129089 " ] } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.\n", "\n", "Maybe we should look at how many bronze customers we have and see what is going on.\n", "\n", "What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.\n", "\n", "I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean." ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_data_st.drop_duplicates(subset=[\"account number\",\"name\"]).ix[:,[0,1,7]].groupby([\"status\"])[\"name\"].count()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 25, "text": [ "status\n", "gold 4\n", "silver 7\n", "bronze 9\n", "Name: name, dtype: int64" ] } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 customers. That is probably why the volumes are so skewed towards our bronze customers." ] } ], "metadata": {} } ] }