This article is a review of Chris Albon’s book, Machine Learning with Python Cookbook. This book is in the tradition of other O’Reilly “cookbook” series in that it contains short “recipes” for dealing with common machine learning scenarios in python. It covers the full spectrum of tasks from simple data wrangling and pre-processing to more complex machine learning model development and deep learning implementations. Since this is such a fast moving and broad topic, it is nice to get a new book that covers the latest topics and presents them in a compact but very useful format. Bottom line, I enjoyed reading this book and think it will be a useful resource to have on my python bookshelf. Read on for some more details about the book and who will benefit most from reading it.
Where does this book fit?
As data science, machine learning and AI have become more and more popular, there has been a proliferation of books that try to cover these topics in differing manners. Some books go very deep into the math and theory behind the various machine learning algorithms. Others try to cover a lot of content but do not provide a quick reference resource with code examples for solving real world problems. Machine Learning with Python Cookbook fills this code-heavy niche with lots of examples. There are very few paragraphs with math equations or details behind the implementation of machine learning algorithms. Instead, Chris Albon breaks the topics down into bite-size chunks that each solve a very specific problem. Each of the nearly 200 recipes follows a similar format:
- Problem definition
- Solution
- Discussion (optional)
- Additional resources (optional)
In most cases, the problem definition is as simple as “You want to multiply two matrices” or “You need to visualize a model created by a decision tree learning algorithm.” This organization makes it convenient to look at the table of contents, and find the relevant section with ease.
Each solution is fully self-contained and can be copied and pasted into a standalone script or jupyter notebook and executed. In addition, the code sample includes all the necessary imports as well as sample data sets (e.g. Iris, Titanic, MNIST). They are all around 12-20 lines of code with comments included so they are easy to dissect and understand.
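To give a feel for the format, here is a minimal sketch of what a recipe-style solution to the "multiply two matrices" problem could look like. This is my own illustration written in the spirit of the book, not code copied from it, and the variable names and matrix values are made up for the example.

```python
# Load library
import numpy as np

# Create two small matrices (illustrative values, not from the book)
matrix_a = np.array([[1, 1],
                     [1, 2]])
matrix_b = np.array([[1, 3],
                     [1, 2]])

# Multiply the matrices (matrix product, not element-wise multiplication)
product = matrix_a @ matrix_b

# Show the result
print(product)
```

The `@` operator performs the matrix product; `matrix_a * matrix_b` would instead multiply element by element.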
In some cases, there is further discussion about the approach as well as hints and tips related to the solutions. In many cases, topics like performance for larger and more complex data sets are discussed and options are presented for managing those situations.
Finally, the author also includes links to more details that might be useful when you need to dive into the problem in more depth.
Who should read it?
The author is very clear that this book is not an introduction to python or machine learning. Since the recipes are short, the actual python code is fairly simple. There’s no need to understand complex python data structures or programming constructs outside of lists and dictionaries. You should know how to install python libraries such as numpy, pandas and scikit-learn.
More importantly, you should have at least some experience using these libraries to load and manipulate data. I also highly recommend that you have done some work with building predictive models with scikit-learn. A lot of the value I gained from this book was related to learning solutions to problems I encountered in my own work.
Finally, some basic understanding of supervised and unsupervised machine learning algorithms is going to be really helpful. For example, if you do not know the types of problems where you would use linear vs. logistic regression or why you might need to use dimensionality reduction, then this book (especially chapters 9 and higher) might not make sense.
How should you read it?
Because the book is a cookbook, it’s not necessary to read it from page 1 through 340. However, I do think it is best to skim through it in order to understand what content is available. For instance, I felt very comfortable with the content in chapter 2 (Loading Data) and chapter 3 (Data Wrangling) so I skimmed those chapters. For other chapters, I felt like I got a lot more out of reading the examples in depth since I did not have as much experience with those topics.
Ultimately though, this is a resource that is meant to sit beside your computer and provide a quick lookup for a specific problem. With that goal in mind, it achieves its aim admirably.
The book only has 340 pages of content but it is broken down into 21 chapters. In my opinion, this is a good structure because each chapter provides a concise introduction of a topic and specific code examples that solve common problems.
The chapters start with basic numpy functions, then move to more complex pandas and scikit-learn functions and close out with some keras examples. Here’s a list of each chapter along with its primary focus:
- Vectors, Matrices and Arrays [numpy]
- Loading Data [scikit-learn, pandas]
- Data Wrangling [pandas]
- Handling Numerical Data [pandas, scikit-learn]
- Handling Categorical Data [pandas, scikit-learn]
- Handling Text [NLTK, scikit-learn]
- Handling Dates and Times [pandas]
- Handling Images [OpenCV, matplotlib]
- Dimensionality Reduction Using Feature Extraction [scikit-learn]
- Dimensionality Reduction Using Feature Selection [scikit-learn]
- Model Evaluation [scikit-learn]
- Model Selection [scikit-learn]
- Linear Regression [scikit-learn]
- Trees and Forests [scikit-learn]
- K-Nearest Neighbors [scikit-learn]
- Logistic Regression [scikit-learn]
- Support Vector Machines [scikit-learn]
- Naive Bayes [scikit-learn]
- Clustering [scikit-learn]
- Neural Networks [keras]
- Saving and Loading Trained Models [scikit-learn, keras]
To illustrate how the chapters work, let’s look at chapter 15 which covers K-Nearest Neighbors (KNN). In this case, the introduction recipe (15.0) gives a concise summary of KNN and why it is a popular tool.
Now that we remember what KNN is used for, we’re likely going to want to apply it to our data. First, we will want “to find an observation’s nearest observations (neighbors).” Recipe 15.1 contains specific code as well as some more detail around the various algorithm parameters we can tweak such as the distance metrics (Euclidean, Manhattan or Minkowski).
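As a rough illustration of the kind of problem this recipe addresses, the sketch below finds the two nearest neighbors of a new observation using scikit-learn's `NearestNeighbors`. It is my own example and not the book's code; the new flower measurements are hypothetical values chosen only for demonstration.

```python
# Load libraries
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load the iris features
iris = load_iris()
features = iris.data

# Standardize features so no single measurement dominates the distance
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features)

# Fit a two-nearest-neighbors search; metric could also be
# "manhattan" or "minkowski" (the latter takes a power parameter p)
nearest_neighbors = NearestNeighbors(n_neighbors=2, metric="euclidean")
nearest_neighbors.fit(features_standardized)

# Hypothetical new observation, scaled the same way as the training data
new_observation = scaler.transform([[5.0, 3.4, 1.5, 0.2]])

# Find distances to, and row indices of, the two closest observations
distances, indices = nearest_neighbors.kneighbors(new_observation)
print(indices)
print(distances)
```

Swapping the `metric` argument is all it takes to compare how the different distance definitions change which neighbors are returned.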
Next, recipe 15.2 shows how to take some unknown data and predict its class based on neighbors. This recipe uses the iris data set but also includes important caveats about scaling data when using KNN.
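Here is a minimal sketch of that idea, assuming a pipeline that standardizes the features before handing them to `KNeighborsClassifier`. Again, this is my own example and not the recipe itself; the choice of five neighbors and the two new observations are assumptions made for illustration.

```python
# Load libraries
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris data
iris = load_iris()

# Scale first, then classify: KNN relies on distances, so features
# measured on larger scales would otherwise dominate the result
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(iris.data, iris.target)

# Two hypothetical flowers with unknown class
new_observations = [[5.0, 3.4, 1.5, 0.2],
                    [6.7, 3.0, 5.2, 2.3]]

# Predict the class and the share of neighbors voting for each class
predicted = model.predict(new_observations)
probabilities = model.predict_proba(new_observations)
predicted_names = [str(iris.target_names[label]) for label in predicted]
print(predicted_names)
print(probabilities)
```

Putting the scaler inside the pipeline means new observations are scaled with the statistics learned from the training data, which is easy to forget when scaling by hand.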
Recipe 15.3 then moves on to cover a common challenge with KNN, specifically how do you select the best value for k? This recipe uses scikit-learn to conduct a cross-validation of KNN classifiers with different values of k. The code is simple to comprehend and easy to extend to your own data sources.
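One way to sketch this kind of search is with scikit-learn's `GridSearchCV`. Choosing `GridSearchCV`, the candidate range of 1 through 10 and the 5-fold split are my assumptions for this example, so treat it as an illustration of the technique and not a reproduction of the recipe.

```python
# Load libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load the iris data
iris = load_iris()

# Scale and classify in one pipeline so scaling is refit on each fold
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("knn", KNeighborsClassifier())])

# Candidate values of k to compare (assumed range for illustration)
candidate_k = list(range(1, 11))
search_space = {"knn__n_neighbors": candidate_k}

# Run a 5-fold cross-validated search over the candidates
search = GridSearchCV(pipeline, search_space, cv=5)
search.fit(iris.data, iris.target)

# Report the best k and its mean cross-validated accuracy
best_k = search.best_params_["knn__n_neighbors"]
best_score = search.best_score_
print(best_k, best_score)
```

The `knn__n_neighbors` key follows the pipeline's `step__parameter` naming convention, so the same pattern extends to tuning other steps or other parameters.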
The point is that each chapter can be consumed at the individual recipe level or read more broadly to understand the concept in more detail. I really like this approach because so many topics are covered at a quick pace. If I feel the need to dive into the mathematical rationale for an approach, I can use these recipes as a jumping off point for further review.
The only criticism I can offer is that I wish more topics were covered. In particular, I would have liked to see coverage of ensemble methods as well as a discussion of xgboost.
In some cases, it might be useful to understand some of the additional libraries in the python eco-system. From a NLP perspective, I know that NLTK is the standard but have heard good things about spaCy as well so would be curious where it fits in this space. The neural network space is changing rapidly so I think keras was a good choice but it might be interesting to learn about some of the other options like PyTorch.
I am sure there are a lot of other potential topics that were considered so I can imagine it was really tough to decide what was in and out of scope. All of my suggestions are based on topics that sprang to my mind and are meant only as potential ideas for another edition (if that is the plan).
Originally, I had some concerns about using the basic data sets (Titanic, Iris, etc) in most examples. However, now that I have reflected on it, I like that the examples are so self-contained and think it would be much more difficult to create such a great resource if there needed to be more explanation of the data.
Also, it would be nice if the code examples were available online so you could do some quick copying and pasting instead of typing them all in by hand. They may already be available, so if I find them, I’ll be sure to update this review.
The final comment I have is related to the price of the book. The current US list price is $59.99 which may seem steep for a 340 page book. However, I think the book is worth it and encourage those interested to purchase it. The content is great and I see it being very useful to those using pandas + scikit-learn on a frequent basis. It is clear that Chris knows what he is talking about and he explains the details well. I predict that this book will become well broken in as I frequently refer to it.
Another reason it is important to purchase books like this one is so that authors and publishers know that the python community values this type of content. I cannot imagine how long it took Chris to write this book. I can only guess that the royalties will probably not afford him an early retirement any time soon! Still, I do want to make sure he gets at least some compensation for this valuable resource and want to provide encouragement to him for a job well done.
Overall, the Machine Learning with Python Cookbook is an extremely useful book which is aptly described in the tag line as “Practical Solutions From Preprocessing to Deep Learning.” Chris has done a fabulous job of collecting a lot of the most common machine learning problems and summarizing solutions. I definitely encourage those of you using any of the libraries mentioned here to pick up this book. I have added this book to my recommended resources page so please check it out and see if any of the other recommendations might be useful. Also, let me know if you find this review useful.