I recently had the pleasure of participating in a crowd-sourced data science competition in the Twin Cities called Analyze This! I wanted to share some of my thoughts and experiences on the process - especially how this challenge helped me learn more about how to apply data science theory and open source tools to real world problems.
I also hope this article can encourage others in the Twin Cities to participate in future events. For those of you not in the Minneapolis-St. Paul metro area, maybe this can help motivate you to start a similar event in your area. I thoroughly enjoyed the experience and got a lot out of the process. Read on for more details.
Analyze This! is a crowd-sourced data science competition. Think of it as a mashup of an in-person Kaggle competition and a data science user group, mixed with a little bit of Toastmasters. The result is a really cool series of events that accomplishes two things. First, it helps individuals build their data science skills on a real-world problem. Second, it helps an organization get insight into its data challenges.
The process starts when the Analyze This organizers partner with a host organization to identify a real-world problem that could be solved with data analytics. Once the problem is defined and the data gathered, it is turned over to a group of eager volunteers who spend a couple of months analyzing the data and developing insights and actionable next steps for solving the defined problem. Along the way, there are periodic group meetings where experts share their knowledge on a specific data science topic. The process culminates in a friendly competition where the teams present the results to the group. The host organization and event organizers judge the results based on a pre-defined rubric. A final winning team typically wins a modest financial reward (more than enough for a dinner but not enough to pay the rent for the month).
In this specific case, Analyze This! partnered with the Science Museum of Minnesota to gather and de-identify data related to membership activity. The goal of the project was to develop a model to predict whether or not a member would renew their membership and use this information to increase membership renewal rates for the museum.
As I mentioned earlier, the entire process was really interesting, challenging and even fun. Here are a few of my learnings and observations that I took away from the events that I can apply to future challenges and real life data science projects:
The Best Way to Learn is By Doing
I came into the event with a good familiarity with python but not as much real-world experience with machine learning algorithms. I have spent time learning about various ML tools and have played with some models but at some point, you can only look at Titanic or Iris data sets for so long!
The best analogy I can think of is that it is like taking a math class and looking at the solution in the answer key. You may think you understand how to get to the solution but “thinking you can” is never the same as spending time wrestling with the problem on your own and “knowing you can.”
Because the data set was brand new to us all, it forced us all to dig in and struggle with understanding the data and divining insights. There was no “right answer” that we could look at in advance. The only way to gain insight was to wrestle with the data and figure it out with your team. This meant researching the problem and developing working code examples.
Descriptive Analytics Still Matter
Many people have seen some variation of the chart that looks like this:
Because I wanted to learn about ML, I tended to jump ahead in this chart and go straight for the predictive model without spending time on the descriptive analytics. After sitting through the presentations from each group, I realized that I should have spent more time looking at the data from a standard stats perspective and used some of those basic insights to help inform the eventual model. I also realized that the descriptive analytics were really useful in helping to tell the story around the final recommendations. In other words, it’s not all about a fancy predictive model.
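To give a flavor of what I mean by descriptive analytics, here is a minimal pandas sketch. The data is a tiny, hand-made example and the column names (membership_level, visits_last_year, renewed) are my own assumptions for illustration; this is not the museum's data or any team's actual code.

```python
import pandas as pd

# Hypothetical, hand-made membership records (not the real data set).
members = pd.DataFrame({
    "membership_level": ["basic", "basic", "family", "family", "family", "premium"],
    "visits_last_year": [1, 4, 2, 6, 8, 5],
    "renewed": [0, 1, 0, 1, 1, 1],
})

# Simple descriptive summary: group size, renewal rate and average
# visits for each membership level.
summary = members.groupby("membership_level").agg(
    members=("renewed", "size"),
    renewal_rate=("renewed", "mean"),
    avg_visits=("visits_last_year", "mean"),
)

print(summary)
```

A table like this is easy to explain to a non-technical audience and can hint at which features are worth feeding into a model later.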
Speaking of Models
In this specific case, all the teams developed models to predict a member's likely renewal based on various traits. Across the group, the teams tried pretty much every model available in the python or R ecosystems. Despite how fancy everyone tried to get, a simple logistic regression model won out. I think the moral of the story is that sometimes a relatively simple model with good results beats a complex model with marginally better results.
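For readers who have not built one, a logistic regression baseline in scikit-learn takes only a few lines. The sketch below runs on synthetic data that I generate on the spot; the features (visits and years as a member) and the rule that produces the renewal labels are assumptions made up for illustration, not the winning team's model or the museum's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic members: the features and the renewal rule are invented.
rng = np.random.default_rng(0)
n_members = 500
visits = rng.poisson(3, size=n_members)
years_as_member = rng.integers(1, 10, size=n_members)

# Assumed relationship: more visits and longer tenure raise the
# probability of renewal.
log_odds = -2.0 + 0.5 * visits + 0.2 * years_as_member
renewed = (rng.random(n_members) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = np.column_stack([visits, years_as_member])
X_train, X_test, y_train, y_test = train_test_split(
    X, renewed, test_size=0.25, random_state=0
)

# Fit the baseline and check accuracy on the held-out members.
model = LogisticRegression()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print("Held-out accuracy:", round(accuracy, 3))
print("Coefficients (visits, years):", model.coef_[0])
```

One nice property of this kind of model is that the coefficients are easy to read: a positive coefficient means the feature pushes the prediction toward renewal, which makes the results much easier to present.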
Python Served Me Well
My team (and several others) used python for much of the analysis. In addition to pandas and scikit-learn, I leveraged jupyter notebooks for a lot of exploratory data analysis. Of course, I used conda to set up a python3 virtual environment for this project, which made it really nice to play around with various tools without messing up other python environments.
I experimented with folium to visualize geographic data. I found it fairly simple to build interesting, data-rich maps with this tool. If there is interest, I may write about it more in the future.
I also took TPOT for a spin. It worked well and I think it generated some useful models. We eventually used a different model but I plan to continue learning more about TPOT and look forward to seeing how it continues to improve.
Presenting Results is a Skill
One of the key aspects of the Analyze This challenge that I enjoyed is that each team had to present their solutions during a 10 minute presentation. Because we had all spent time with the same data set, we were all starting from a similar baseline. It was extremely interesting to see how the teams presented their results and used various visualizations to explain their process and provide actionable insight. We all tended to identify several common features that drove renewal rates but it was interesting to see how different teams attacked a similar problem from different angles.
Several of the groups scored results that were very close to each other. The scoring rubric put more weight on the presentation than on the actual model results, which I think is a wise move and separates this challenge from something like a kaggle competition.
The other interesting/challenging part of presenting the results was the wide range of knowledge in the room. On one end of the spectrum there were PhDs, Data Scientists and very experienced statisticians. On the other end were people who were just learning some of these concepts and had little or no training in data science or statistics. This wide spread of knowledge meant that each group had to think carefully about how to present their information in a way that would appeal to the entire audience.
Community is Important
One of the goals of the Analyze This organizers is to foster a community for data science learning. I felt like they did a really nice job of making everyone feel welcome. Even though this was a competition, the more experienced members were supportive of the less knowledgeable individuals. There was a lot of formal and informal knowledge sharing.
I have seen several variations of this venn diagram to describe data scientists.
During the competition, I noticed that the pool of participants fit into many of these categories. We had everything from people that do data science as a full time job to web developers to people just interested in learning more. The really great thing was that it was a supportive group and people were willing to share knowledge and help others.
My experience with this cross-section of people reinforced my belief that the “perfect data scientist” does lie at the intersection of these multiple functions.
I hope the Analyze This! group can continue building on the success of this competition and encourage even more people to participate in the process.
I am really excited about the people I met through this process. I ended up working with a great group of guys on my team. I also got to learn a little more about how others are doing Data Science in the Twin Cities. Of course, I used this as an opportunity to expand my network.
I am sure you can tell that I’m a big supporter of Analyze This!, its mission and the people that are leading the program. Pedro, Kevin, Jake, Mitchell, Daniel and Justin did a tremendous amount of work to make this happen. I am very impressed with their knowledge and dedication. They are doing this to help others and build up the community. They receive no pay for the countless hours of work they put into it.
The process was a great way to learn more about data science and hone my skills in a real-world test. I got to meet some smart people and help a worthy organization (hopefully) improve their membership renewal rates. I highly encourage those of you who might be at FARCON 2016 to stop by and listen to the group presentations. I also encourage you to look for the next challenge and find some time to participate. I am confident you will find it time well spent.