{
 "metadata": {
  "name": "",
  "signature": "sha256:9a829f4dec3ab792525a6e2c4858eead2c7089be44aed38ecf8b9c5bcb8bf5a7"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Introduction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One of the most common tasks for pandas and python is to automate the process to aggregate data from multiple spreadsheets and files.\n",
      "\n",
      "This article will walk through the basic flow required to parse multiple excel files, combine some data, clean it up and analyze it.\n",
      "\n",
      "Please refer to [this post](http://pbpython.com/excel-file-combine.html) for the full post."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Collecting the Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Import pandas and numpy"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Let's take a look at the files in our input directory, using the convenient shell commands in ipython."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls ../in"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "address-state-example.xlsx  report.xlsx\t\t   sample-address-new.xlsx\r\n",
        "customer-status.xlsx\t    sales-feb-2014.xlsx    sample-address-old.xlsx\r\n",
        "excel-comp-data.xlsx\t    sales-jan-2014.xlsx    sample-diff-1.xlsx\r\n",
        "my-diff-1.xlsx\t\t    sales-mar-2014.xlsx    sample-diff-2.xlsx\r\n",
        "my-diff-2.xlsx\t\t    sample-address-1.xlsx  sample-salesv3.xlsx\r\n",
        "my-diff.xlsx\t\t    sample-address-2.xlsx\r\n",
        "pricing.xlsx\t\t    sample-address-3.xlsx\r\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "There are a lot of files, but we only want to look at the sales .xlsx files."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!ls ../in/sales*.xlsx"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "../in/sales-feb-2014.xlsx  ../in/sales-jan-2014.xlsx  ../in/sales-mar-2014.xlsx\r\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Use the python glob module to easily list out the files we need"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import glob"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "glob.glob(\"../in/sales*.xlsx\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 5,
       "text": [
        "['../in/sales-jan-2014.xlsx',\n",
        " '../in/sales-mar-2014.xlsx',\n",
        " '../in/sales-feb-2014.xlsx']"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This gives us what we need, let's import each of our files and combine them into one file. \n",
      "\n",
      "Panda's concat and append can do this for us. I'm going to use append in this example.\n",
      "\n",
      "The code snippet below will initialize a blank DataFrame then append all of the individual files into the all_data DataFrame."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data = pd.DataFrame()\n",
      "for f in glob.glob(\"../in/sales*.xlsx\"):\n",
      "    df = pd.read_excel(f)\n",
      "    all_data = all_data.append(df,ignore_index=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure you data looks good."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data.describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>count</th>\n",
        "      <td>   1742.000000</td>\n",
        "      <td> 1742.000000</td>\n",
        "      <td> 1742.000000</td>\n",
        "      <td> 1742.000000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>mean</th>\n",
        "      <td> 485766.487945</td>\n",
        "      <td>   24.319173</td>\n",
        "      <td>   54.985454</td>\n",
        "      <td> 1349.229392</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>std</th>\n",
        "      <td> 223750.660792</td>\n",
        "      <td>   14.502759</td>\n",
        "      <td>   26.108490</td>\n",
        "      <td> 1094.639319</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>min</th>\n",
        "      <td> 141962.000000</td>\n",
        "      <td>   -1.000000</td>\n",
        "      <td>   10.030000</td>\n",
        "      <td>  -97.160000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>25%</th>\n",
        "      <td> 257198.000000</td>\n",
        "      <td>   12.000000</td>\n",
        "      <td>   32.132500</td>\n",
        "      <td>  468.592500</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>50%</th>\n",
        "      <td> 527099.000000</td>\n",
        "      <td>   25.000000</td>\n",
        "      <td>   55.465000</td>\n",
        "      <td> 1049.700000</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>75%</th>\n",
        "      <td> 714466.000000</td>\n",
        "      <td>   37.000000</td>\n",
        "      <td>   77.607500</td>\n",
        "      <td> 2074.972500</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>max</th>\n",
        "      <td> 786968.000000</td>\n",
        "      <td>   49.000000</td>\n",
        "      <td>   99.850000</td>\n",
        "      <td> 4824.540000</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 7,
       "text": [
        "       account number     quantity   unit price    ext price\n",
        "count     1742.000000  1742.000000  1742.000000  1742.000000\n",
        "mean    485766.487945    24.319173    54.985454  1349.229392\n",
        "std     223750.660792    14.502759    26.108490  1094.639319\n",
        "min     141962.000000    -1.000000    10.030000   -97.160000\n",
        "25%     257198.000000    12.000000    32.132500   468.592500\n",
        "50%     527099.000000    25.000000    55.465000  1049.700000\n",
        "75%     714466.000000    37.000000    77.607500  2074.972500\n",
        "max     786968.000000    49.000000    99.850000  4824.540000"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Alot of this data may not make much sense for this data set but I'm most interested in the count row to make sure the number of data elements makes sense."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 740150</td>\n",
        "      <td>                  Barton LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 39</td>\n",
        "      <td> 86.69</td>\n",
        "      <td> 3380.91</td>\n",
        "      <td> 2014-01-01 07:21:51</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 714466</td>\n",
        "      <td>             Trantow-Barrows</td>\n",
        "      <td> S2-77896</td>\n",
        "      <td> -1</td>\n",
        "      <td> 63.16</td>\n",
        "      <td>  -63.16</td>\n",
        "      <td> 2014-01-01 10:00:47</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 218895</td>\n",
        "      <td>                   Kulas Inc</td>\n",
        "      <td> B1-69924</td>\n",
        "      <td> 23</td>\n",
        "      <td> 90.70</td>\n",
        "      <td> 2086.10</td>\n",
        "      <td> 2014-01-01 13:24:58</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 307599</td>\n",
        "      <td> Kassulke, Ondricka and Metz</td>\n",
        "      <td> S1-65481</td>\n",
        "      <td> 41</td>\n",
        "      <td> 21.05</td>\n",
        "      <td>  863.05</td>\n",
        "      <td> 2014-01-01 15:05:22</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 412290</td>\n",
        "      <td>               Jerde-Hilpert</td>\n",
        "      <td> S2-34077</td>\n",
        "      <td>  6</td>\n",
        "      <td> 83.21</td>\n",
        "      <td>  499.26</td>\n",
        "      <td> 2014-01-01 23:26:55</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "   account number                         name       sku  quantity  \\\n",
        "0          740150                   Barton LLC  B1-20000        39   \n",
        "1          714466              Trantow-Barrows  S2-77896        -1   \n",
        "2          218895                    Kulas Inc  B1-69924        23   \n",
        "3          307599  Kassulke, Ondricka and Metz  S1-65481        41   \n",
        "4          412290                Jerde-Hilpert  S2-34077         6   \n",
        "\n",
        "   unit price  ext price                 date  \n",
        "0       86.69    3380.91  2014-01-01 07:21:51  \n",
        "1       63.16     -63.16  2014-01-01 10:00:47  \n",
        "2       90.70    2086.10  2014-01-01 13:24:58  \n",
        "3       21.05     863.05  2014-01-01 15:05:22  \n",
        "4       83.21     499.26  2014-01-01 23:26:55  "
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "It is not critical in this example but the best practice is to convert the date column to a date time object."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data['date'] = pd.to_datetime(all_data['date'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Combining Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now that we have all of the data into one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company's customer segmentation strategy or some other mechanism for identifying their customers.\n",
      "\n",
      "First, we read in the data."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "status = pd.read_excel(\"../in/customer-status.xlsx\")\n",
      "status"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0 </th>\n",
        "      <td> 740150</td>\n",
        "      <td>                      Barton LLC</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1 </th>\n",
        "      <td> 714466</td>\n",
        "      <td>                 Trantow-Barrows</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2 </th>\n",
        "      <td> 218895</td>\n",
        "      <td>                       Kulas Inc</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3 </th>\n",
        "      <td> 307599</td>\n",
        "      <td>     Kassulke, Ondricka and Metz</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4 </th>\n",
        "      <td> 412290</td>\n",
        "      <td>                   Jerde-Hilpert</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>5 </th>\n",
        "      <td> 729833</td>\n",
        "      <td>                       Koepp Ltd</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>6 </th>\n",
        "      <td> 146832</td>\n",
        "      <td>                    Kiehn-Spinka</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>7 </th>\n",
        "      <td> 688981</td>\n",
        "      <td>                     Keeling LLC</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>8 </th>\n",
        "      <td> 786968</td>\n",
        "      <td>        Frami, Hills and Schmidt</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td> 239344</td>\n",
        "      <td>                      Stokes LLC</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>10</th>\n",
        "      <td> 672390</td>\n",
        "      <td>                 Kuhn-Gusikowski</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>11</th>\n",
        "      <td> 141962</td>\n",
        "      <td>                      Herman LLC</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>12</th>\n",
        "      <td> 424914</td>\n",
        "      <td>                   White-Trantow</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>13</th>\n",
        "      <td> 527099</td>\n",
        "      <td>                Sanford and Sons</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>14</th>\n",
        "      <td> 642753</td>\n",
        "      <td>                     Pollich LLC</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>15</th>\n",
        "      <td> 257198</td>\n",
        "      <td> Cronin, Oberbrunner and Spencer</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "    account number                             name  status\n",
        "0           740150                       Barton LLC    gold\n",
        "1           714466                  Trantow-Barrows  silver\n",
        "2           218895                        Kulas Inc  bronze\n",
        "3           307599      Kassulke, Ondricka and Metz  bronze\n",
        "4           412290                    Jerde-Hilpert  bronze\n",
        "5           729833                        Koepp Ltd  silver\n",
        "6           146832                     Kiehn-Spinka  silver\n",
        "7           688981                      Keeling LLC  silver\n",
        "8           786968         Frami, Hills and Schmidt  silver\n",
        "9           239344                       Stokes LLC    gold\n",
        "10          672390                  Kuhn-Gusikowski  silver\n",
        "11          141962                       Herman LLC    gold\n",
        "12          424914                    White-Trantow  silver\n",
        "13          527099                 Sanford and Sons  bronze\n",
        "14          642753                      Pollich LLC  bronze\n",
        "15          257198  Cronin, Oberbrunner and Spencer    gold"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We want to merge this data with our concatenated data set of sales. We use panda's merge function and tell it to do a left join which is similar to Excel's vlookup function."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st = pd.merge(all_data, status, how='left')\n",
      "all_data_st.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 740150</td>\n",
        "      <td>                  Barton LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 39</td>\n",
        "      <td> 86.69</td>\n",
        "      <td> 3380.91</td>\n",
        "      <td>2014-01-01 07:21:51</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 714466</td>\n",
        "      <td>             Trantow-Barrows</td>\n",
        "      <td> S2-77896</td>\n",
        "      <td> -1</td>\n",
        "      <td> 63.16</td>\n",
        "      <td>  -63.16</td>\n",
        "      <td>2014-01-01 10:00:47</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 218895</td>\n",
        "      <td>                   Kulas Inc</td>\n",
        "      <td> B1-69924</td>\n",
        "      <td> 23</td>\n",
        "      <td> 90.70</td>\n",
        "      <td> 2086.10</td>\n",
        "      <td>2014-01-01 13:24:58</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 307599</td>\n",
        "      <td> Kassulke, Ondricka and Metz</td>\n",
        "      <td> S1-65481</td>\n",
        "      <td> 41</td>\n",
        "      <td> 21.05</td>\n",
        "      <td>  863.05</td>\n",
        "      <td>2014-01-01 15:05:22</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 412290</td>\n",
        "      <td>               Jerde-Hilpert</td>\n",
        "      <td> S2-34077</td>\n",
        "      <td>  6</td>\n",
        "      <td> 83.21</td>\n",
        "      <td>  499.26</td>\n",
        "      <td>2014-01-01 23:26:55</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "   account number                         name       sku  quantity  \\\n",
        "0          740150                   Barton LLC  B1-20000        39   \n",
        "1          714466              Trantow-Barrows  S2-77896        -1   \n",
        "2          218895                    Kulas Inc  B1-69924        23   \n",
        "3          307599  Kassulke, Ondricka and Metz  S1-65481        41   \n",
        "4          412290                Jerde-Hilpert  S2-34077         6   \n",
        "\n",
        "   unit price  ext price                date  status  \n",
        "0       86.69    3380.91 2014-01-01 07:21:51    gold  \n",
        "1       63.16     -63.16 2014-01-01 10:00:47  silver  \n",
        "2       90.70    2086.10 2014-01-01 13:24:58  bronze  \n",
        "3       21.05     863.05 2014-01-01 15:05:22  bronze  \n",
        "4       83.21     499.26 2014-01-01 23:26:55  bronze  "
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This looks pretty good but let's look at a specific account."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st[all_data_st[\"account number\"]==737550].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S2-82423</td>\n",
        "      <td> 14</td>\n",
        "      <td> 81.92</td>\n",
        "      <td> 1146.88</td>\n",
        "      <td>2014-01-03 19:07:37</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>14</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> B1-53102</td>\n",
        "      <td> 23</td>\n",
        "      <td> 71.56</td>\n",
        "      <td> 1645.88</td>\n",
        "      <td>2014-01-04 08:57:48</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>26</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> B1-53636</td>\n",
        "      <td> 42</td>\n",
        "      <td> 42.06</td>\n",
        "      <td> 1766.52</td>\n",
        "      <td>2014-01-08 00:02:11</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>32</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S1-27722</td>\n",
        "      <td> 20</td>\n",
        "      <td> 29.54</td>\n",
        "      <td>  590.80</td>\n",
        "      <td>2014-01-09 13:20:40</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>42</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S1-93683</td>\n",
        "      <td> 22</td>\n",
        "      <td> 71.68</td>\n",
        "      <td> 1576.96</td>\n",
        "      <td>2014-01-11 23:47:36</td>\n",
        "      <td> NaN</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "    account number                          name       sku  quantity  \\\n",
        "9           737550  Fritsch, Russel and Anderson  S2-82423        14   \n",
        "14          737550  Fritsch, Russel and Anderson  B1-53102        23   \n",
        "26          737550  Fritsch, Russel and Anderson  B1-53636        42   \n",
        "32          737550  Fritsch, Russel and Anderson  S1-27722        20   \n",
        "42          737550  Fritsch, Russel and Anderson  S1-93683        22   \n",
        "\n",
        "    unit price  ext price                date status  \n",
        "9        81.92    1146.88 2014-01-03 19:07:37    NaN  \n",
        "14       71.56    1645.88 2014-01-04 08:57:48    NaN  \n",
        "26       42.06    1766.52 2014-01-08 00:02:11    NaN  \n",
        "32       29.54     590.80 2014-01-09 13:20:40    NaN  \n",
        "42       71.68    1576.96 2014-01-11 23:47:36    NaN  "
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This account number was not in our status file, so we have a bunch of NaN's. We can decide how we want to handle this situation. For this specific case, let's label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st['status'].fillna('bronze',inplace=True)\n",
      "all_data_st.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 740150</td>\n",
        "      <td>                  Barton LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 39</td>\n",
        "      <td> 86.69</td>\n",
        "      <td> 3380.91</td>\n",
        "      <td>2014-01-01 07:21:51</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 714466</td>\n",
        "      <td>             Trantow-Barrows</td>\n",
        "      <td> S2-77896</td>\n",
        "      <td> -1</td>\n",
        "      <td> 63.16</td>\n",
        "      <td>  -63.16</td>\n",
        "      <td>2014-01-01 10:00:47</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 218895</td>\n",
        "      <td>                   Kulas Inc</td>\n",
        "      <td> B1-69924</td>\n",
        "      <td> 23</td>\n",
        "      <td> 90.70</td>\n",
        "      <td> 2086.10</td>\n",
        "      <td>2014-01-01 13:24:58</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 307599</td>\n",
        "      <td> Kassulke, Ondricka and Metz</td>\n",
        "      <td> S1-65481</td>\n",
        "      <td> 41</td>\n",
        "      <td> 21.05</td>\n",
        "      <td>  863.05</td>\n",
        "      <td>2014-01-01 15:05:22</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 412290</td>\n",
        "      <td>               Jerde-Hilpert</td>\n",
        "      <td> S2-34077</td>\n",
        "      <td>  6</td>\n",
        "      <td> 83.21</td>\n",
        "      <td>  499.26</td>\n",
        "      <td>2014-01-01 23:26:55</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 13,
       "text": [
        "   account number                         name       sku  quantity  \\\n",
        "0          740150                   Barton LLC  B1-20000        39   \n",
        "1          714466              Trantow-Barrows  S2-77896        -1   \n",
        "2          218895                    Kulas Inc  B1-69924        23   \n",
        "3          307599  Kassulke, Ondricka and Metz  S1-65481        41   \n",
        "4          412290                Jerde-Hilpert  S2-34077         6   \n",
        "\n",
        "   unit price  ext price                date  status  \n",
        "0       86.69    3380.91 2014-01-01 07:21:51    gold  \n",
        "1       63.16     -63.16 2014-01-01 10:00:47  silver  \n",
        "2       90.70    2086.10 2014-01-01 13:24:58  bronze  \n",
        "3       21.05     863.05 2014-01-01 15:05:22  bronze  \n",
        "4       83.21     499.26 2014-01-01 23:26:55  bronze  "
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Check the data just to make sure we're all good."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st[all_data_st[\"account number\"]==737550].head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>9 </th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S2-82423</td>\n",
        "      <td> 14</td>\n",
        "      <td> 81.92</td>\n",
        "      <td> 1146.88</td>\n",
        "      <td>2014-01-03 19:07:37</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>14</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> B1-53102</td>\n",
        "      <td> 23</td>\n",
        "      <td> 71.56</td>\n",
        "      <td> 1645.88</td>\n",
        "      <td>2014-01-04 08:57:48</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>26</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> B1-53636</td>\n",
        "      <td> 42</td>\n",
        "      <td> 42.06</td>\n",
        "      <td> 1766.52</td>\n",
        "      <td>2014-01-08 00:02:11</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>32</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S1-27722</td>\n",
        "      <td> 20</td>\n",
        "      <td> 29.54</td>\n",
        "      <td>  590.80</td>\n",
        "      <td>2014-01-09 13:20:40</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>42</th>\n",
        "      <td> 737550</td>\n",
        "      <td> Fritsch, Russel and Anderson</td>\n",
        "      <td> S1-93683</td>\n",
        "      <td> 22</td>\n",
        "      <td> 71.68</td>\n",
        "      <td> 1576.96</td>\n",
        "      <td>2014-01-11 23:47:36</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "    account number                          name       sku  quantity  \\\n",
        "9           737550  Fritsch, Russel and Anderson  S2-82423        14   \n",
        "14          737550  Fritsch, Russel and Anderson  B1-53102        23   \n",
        "26          737550  Fritsch, Russel and Anderson  B1-53636        42   \n",
        "32          737550  Fritsch, Russel and Anderson  S1-27722        20   \n",
        "42          737550  Fritsch, Russel and Anderson  S1-93683        22   \n",
        "\n",
        "    unit price  ext price                date  status  \n",
        "9        81.92    1146.88 2014-01-03 19:07:37  bronze  \n",
        "14       71.56    1645.88 2014-01-04 08:57:48  bronze  \n",
        "26       42.06    1766.52 2014-01-08 00:02:11  bronze  \n",
        "32       29.54     590.80 2014-01-09 13:20:40  bronze  \n",
        "42       71.68    1576.96 2014-01-11 23:47:36  bronze  "
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability."
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Using Categories"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "One of the relatively new functions in pandas is support for categorical data. From the pandas, documentation -\n",
      "\n",
      "\"Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.\"\n",
      "\n",
      "For our purposes, the status field is a good candidate for a category type.\n",
      "\n",
      "You must make sure you have a recent version of pandas installed for this example to work."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.__version__"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 15,
       "text": [
        "'0.15.2'"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "First, we typecast it to a category using astype."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st[\"status\"] = all_data_st[\"status\"].astype(\"category\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This doesn't immediately appear to change anything yet."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 740150</td>\n",
        "      <td>                  Barton LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 39</td>\n",
        "      <td> 86.69</td>\n",
        "      <td> 3380.91</td>\n",
        "      <td>2014-01-01 07:21:51</td>\n",
        "      <td>   gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 714466</td>\n",
        "      <td>             Trantow-Barrows</td>\n",
        "      <td> S2-77896</td>\n",
        "      <td> -1</td>\n",
        "      <td> 63.16</td>\n",
        "      <td>  -63.16</td>\n",
        "      <td>2014-01-01 10:00:47</td>\n",
        "      <td> silver</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 218895</td>\n",
        "      <td>                   Kulas Inc</td>\n",
        "      <td> B1-69924</td>\n",
        "      <td> 23</td>\n",
        "      <td> 90.70</td>\n",
        "      <td> 2086.10</td>\n",
        "      <td>2014-01-01 13:24:58</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 307599</td>\n",
        "      <td> Kassulke, Ondricka and Metz</td>\n",
        "      <td> S1-65481</td>\n",
        "      <td> 41</td>\n",
        "      <td> 21.05</td>\n",
        "      <td>  863.05</td>\n",
        "      <td>2014-01-01 15:05:22</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 412290</td>\n",
        "      <td>               Jerde-Hilpert</td>\n",
        "      <td> S2-34077</td>\n",
        "      <td>  6</td>\n",
        "      <td> 83.21</td>\n",
        "      <td>  499.26</td>\n",
        "      <td>2014-01-01 23:26:55</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 17,
       "text": [
        "   account number                         name       sku  quantity  \\\n",
        "0          740150                   Barton LLC  B1-20000        39   \n",
        "1          714466              Trantow-Barrows  S2-77896        -1   \n",
        "2          218895                    Kulas Inc  B1-69924        23   \n",
        "3          307599  Kassulke, Ondricka and Metz  S1-65481        41   \n",
        "4          412290                Jerde-Hilpert  S2-34077         6   \n",
        "\n",
        "   unit price  ext price                date  status  \n",
        "0       86.69    3380.91 2014-01-01 07:21:51    gold  \n",
        "1       63.16     -63.16 2014-01-01 10:00:47  silver  \n",
        "2       90.70    2086.10 2014-01-01 13:24:58  bronze  \n",
        "3       21.05     863.05 2014-01-01 15:05:22  bronze  \n",
        "4       83.21     499.26 2014-01-01 23:26:55  bronze  "
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Buy you can see that it is a new data type."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.dtypes"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 18,
       "text": [
        "account number             int64\n",
        "name                      object\n",
        "sku                       object\n",
        "quantity                   int64\n",
        "unit price               float64\n",
        "ext price                float64\n",
        "date              datetime64[ns]\n",
        "status                  category\n",
        "dtype: object"
       ]
      }
     ],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically. "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.sort(columns=[\"status\"]).head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>1741</th>\n",
        "      <td> 642753</td>\n",
        "      <td>      Pollich LLC</td>\n",
        "      <td> B1-04202</td>\n",
        "      <td>  8</td>\n",
        "      <td> 95.86</td>\n",
        "      <td>  766.88</td>\n",
        "      <td>2014-02-28 23:47:32</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1232</th>\n",
        "      <td> 218895</td>\n",
        "      <td>        Kulas Inc</td>\n",
        "      <td> S1-06532</td>\n",
        "      <td> 29</td>\n",
        "      <td> 42.75</td>\n",
        "      <td> 1239.75</td>\n",
        "      <td>2014-09-21 11:27:55</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>579 </th>\n",
        "      <td> 527099</td>\n",
        "      <td> Sanford and Sons</td>\n",
        "      <td> S1-27722</td>\n",
        "      <td> 41</td>\n",
        "      <td> 87.86</td>\n",
        "      <td> 3602.26</td>\n",
        "      <td>2014-04-14 18:36:11</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>580 </th>\n",
        "      <td> 383080</td>\n",
        "      <td>         Will LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 40</td>\n",
        "      <td> 51.73</td>\n",
        "      <td> 2069.20</td>\n",
        "      <td>2014-04-14 22:44:58</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>581 </th>\n",
        "      <td> 383080</td>\n",
        "      <td>         Will LLC</td>\n",
        "      <td> S2-10342</td>\n",
        "      <td> 15</td>\n",
        "      <td> 76.75</td>\n",
        "      <td> 1151.25</td>\n",
        "      <td>2014-04-15 02:57:43</td>\n",
        "      <td> bronze</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 19,
       "text": [
        "      account number              name       sku  quantity  unit price  \\\n",
        "1741          642753       Pollich LLC  B1-04202         8       95.86   \n",
        "1232          218895         Kulas Inc  S1-06532        29       42.75   \n",
        "579           527099  Sanford and Sons  S1-27722        41       87.86   \n",
        "580           383080          Will LLC  B1-20000        40       51.73   \n",
        "581           383080          Will LLC  S2-10342        15       76.75   \n",
        "\n",
        "      ext price                date  status  \n",
        "1741     766.88 2014-02-28 23:47:32  bronze  \n",
        "1232    1239.75 2014-09-21 11:27:55  bronze  \n",
        "579     3602.26 2014-04-14 18:36:11  bronze  \n",
        "580     2069.20 2014-04-14 22:44:58  bronze  \n",
        "581     1151.25 2014-04-15 02:57:43  bronze  "
       ]
      }
     ],
     "prompt_number": 19
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      " all_data_st[\"status\"].cat.set_categories([ \"gold\",\"silver\",\"bronze\"],inplace=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 20
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Now, we can sort it so that gold shows on top."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.sort(columns=[\"status\"]).head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>account number</th>\n",
        "      <th>name</th>\n",
        "      <th>sku</th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "      <th>date</th>\n",
        "      <th>status</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0   </th>\n",
        "      <td> 740150</td>\n",
        "      <td>                      Barton LLC</td>\n",
        "      <td> B1-20000</td>\n",
        "      <td> 39</td>\n",
        "      <td> 86.69</td>\n",
        "      <td> 3380.91</td>\n",
        "      <td>2014-01-01 07:21:51</td>\n",
        "      <td> gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1193</th>\n",
        "      <td> 257198</td>\n",
        "      <td> Cronin, Oberbrunner and Spencer</td>\n",
        "      <td> S2-82423</td>\n",
        "      <td> 23</td>\n",
        "      <td> 52.90</td>\n",
        "      <td> 1216.70</td>\n",
        "      <td>2014-09-09 03:06:30</td>\n",
        "      <td> gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1194</th>\n",
        "      <td> 141962</td>\n",
        "      <td>                      Herman LLC</td>\n",
        "      <td> B1-86481</td>\n",
        "      <td> 45</td>\n",
        "      <td> 52.78</td>\n",
        "      <td> 2375.10</td>\n",
        "      <td>2014-09-09 11:49:45</td>\n",
        "      <td> gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1195</th>\n",
        "      <td> 257198</td>\n",
        "      <td> Cronin, Oberbrunner and Spencer</td>\n",
        "      <td> B1-50809</td>\n",
        "      <td> 30</td>\n",
        "      <td> 51.96</td>\n",
        "      <td> 1558.80</td>\n",
        "      <td>2014-09-09 21:14:31</td>\n",
        "      <td> gold</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1197</th>\n",
        "      <td> 239344</td>\n",
        "      <td>                      Stokes LLC</td>\n",
        "      <td> B1-65551</td>\n",
        "      <td> 43</td>\n",
        "      <td> 15.24</td>\n",
        "      <td>  655.32</td>\n",
        "      <td>2014-09-10 11:10:02</td>\n",
        "      <td> gold</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 21,
       "text": [
        "      account number                             name       sku  quantity  \\\n",
        "0             740150                       Barton LLC  B1-20000        39   \n",
        "1193          257198  Cronin, Oberbrunner and Spencer  S2-82423        23   \n",
        "1194          141962                       Herman LLC  B1-86481        45   \n",
        "1195          257198  Cronin, Oberbrunner and Spencer  B1-50809        30   \n",
        "1197          239344                       Stokes LLC  B1-65551        43   \n",
        "\n",
        "      unit price  ext price                date status  \n",
        "0          86.69    3380.91 2014-01-01 07:21:51   gold  \n",
        "1193       52.90    1216.70 2014-09-09 03:06:30   gold  \n",
        "1194       52.78    2375.10 2014-09-09 11:49:45   gold  \n",
        "1195       51.96    1558.80 2014-09-09 21:14:31   gold  \n",
        "1197       15.24     655.32 2014-09-10 11:10:02   gold  "
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st[\"status\"].describe()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 22,
       "text": [
        "count       1742\n",
        "unique         3\n",
        "top       bronze\n",
        "freq         764\n",
        "Name: status, dtype: object"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For instance, if you want to take a quick look at how your top tier customers are performaing compared to the bottom. Use groupby to give us the average of the values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].mean()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>quantity</th>\n",
        "      <th>unit price</th>\n",
        "      <th>ext price</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>status</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gold</th>\n",
        "      <td> 24.680723</td>\n",
        "      <td> 52.431205</td>\n",
        "      <td> 1325.566867</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>silver</th>\n",
        "      <td> 23.814241</td>\n",
        "      <td> 55.724241</td>\n",
        "      <td> 1339.477539</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>bronze</th>\n",
        "      <td> 24.589005</td>\n",
        "      <td> 55.470733</td>\n",
        "      <td> 1367.757736</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 23,
       "text": [
        "         quantity  unit price    ext price\n",
        "status                                    \n",
        "gold    24.680723   52.431205  1325.566867\n",
        "silver  23.814241   55.724241  1339.477539\n",
        "bronze  24.589005   55.470733  1367.757736"
       ]
      }
     ],
     "prompt_number": 23
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Of course, you can run multiple aggregation functions on the data to get really useful information "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.groupby([\"status\"])[\"quantity\",\"unit price\",\"ext price\"].agg([np.sum,np.mean, np.std])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr>\n",
        "      <th></th>\n",
        "      <th colspan=\"3\" halign=\"left\">quantity</th>\n",
        "      <th colspan=\"3\" halign=\"left\">unit price</th>\n",
        "      <th colspan=\"3\" halign=\"left\">ext price</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th></th>\n",
        "      <th>sum</th>\n",
        "      <th>mean</th>\n",
        "      <th>std</th>\n",
        "      <th>sum</th>\n",
        "      <th>mean</th>\n",
        "      <th>std</th>\n",
        "      <th>sum</th>\n",
        "      <th>mean</th>\n",
        "      <th>std</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>status</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>gold</th>\n",
        "      <td>  8194</td>\n",
        "      <td> 24.680723</td>\n",
        "      <td> 14.478670</td>\n",
        "      <td> 17407.16</td>\n",
        "      <td> 52.431205</td>\n",
        "      <td> 26.244516</td>\n",
        "      <td>  440088.20</td>\n",
        "      <td> 1325.566867</td>\n",
        "      <td> 1074.564373</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>silver</th>\n",
        "      <td> 15384</td>\n",
        "      <td> 23.814241</td>\n",
        "      <td> 14.519044</td>\n",
        "      <td> 35997.86</td>\n",
        "      <td> 55.724241</td>\n",
        "      <td> 26.053569</td>\n",
        "      <td>  865302.49</td>\n",
        "      <td> 1339.477539</td>\n",
        "      <td> 1094.908529</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>bronze</th>\n",
        "      <td> 18786</td>\n",
        "      <td> 24.589005</td>\n",
        "      <td> 14.506515</td>\n",
        "      <td> 42379.64</td>\n",
        "      <td> 55.470733</td>\n",
        "      <td> 26.062149</td>\n",
        "      <td> 1044966.91</td>\n",
        "      <td> 1367.757736</td>\n",
        "      <td> 1104.129089</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 24,
       "text": [
        "       quantity                       unit price                        \\\n",
        "            sum       mean        std        sum       mean        std   \n",
        "status                                                                   \n",
        "gold       8194  24.680723  14.478670   17407.16  52.431205  26.244516   \n",
        "silver    15384  23.814241  14.519044   35997.86  55.724241  26.053569   \n",
        "bronze    18786  24.589005  14.506515   42379.64  55.470733  26.062149   \n",
        "\n",
        "         ext price                            \n",
        "               sum         mean          std  \n",
        "status                                        \n",
        "gold     440088.20  1325.566867  1074.564373  \n",
        "silver   865302.49  1339.477539  1094.908529  \n",
        "bronze  1044966.91  1367.757736  1104.129089  "
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks backwards.\n",
      "\n",
      "Maybe we should look at how many bronze customers we have and see what is going on.\n",
      "\n",
      "What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.\n",
      "\n",
      "I'm purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous articles and play with this command yourself to understand what all these commands mean."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "all_data_st.drop_duplicates(subset=[\"account number\",\"name\"]).ix[:,[0,1,7]].groupby([\"status\"])[\"name\"].count()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 25,
       "text": [
        "status\n",
        "gold      4\n",
        "silver    7\n",
        "bronze    9\n",
        "Name: name, dtype: int64"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 customers. That is probably why the volumes are so skewed towards our bronze customers."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}