Today we’re excited to feature a guest post by entrepreneur/engineer Nick Rose. When overcrowding at his gym became a problem, he dove into machine learning to find a solution. This is a great tale of using ML to solve real-world problems, tackling the challenges of learning without an immediate tutor, and crowdsourcing for help.

Nick’s original post can be found on Medium, along with his follow-up, “Machine Learning as a Service (MLaaS) with Sklearn and Algorithmia.” Enjoy!

It’s 9 AM on a Tuesday, and there’s a horrific sign blocking the entrance to the weight room — “One In, One Out.” When this happens, it means the campus gym is so crowded that staff can only let you in once someone else walks out. How could this happen today? I thought I was safe this early in the morning. The gym isn’t supposed to be packed right now!

The UC Berkeley weight room located in the RSF (Recreational Sports Facility) has a maximum capacity of about 200 students at any given time. This presents a crowding problem for a university of some 27,000 undergrads. In 2014, Ollie O’Donnell and I set out to alleviate this problem by creating an app that would tell you exactly how crowded the gym is before you go.

What can we learn from all the data gathered from the gym over the past year, and what kinds of predictions can we make about how crowded the gym will be in the future? Machine learning is the perfect tool for this task as it can incorporate many different features into the answers it gives, from time of day and temperature to whether or not it’s a holiday.

Full disclaimer before moving on: I am by no means an expert in machine learning. It’s quite new to me and, like these algorithms, I’m learning more every day. Please feel free to correct me on any mistakes I inevitably make so I can adjust my weight vector.

Great machine learning intro here, if you’re new to this.

### The Data

Over the past year we collected more than 29,000 people counts. Using Pandas, I merged those counts with some other helpful variables like weather and holiday information. I fetched the weather data using a handy API called DarkSky (formerly Forecast.io). This presented a great format for reading historical data, but I wanted to be able to use all this history to predict the future as well.
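As a rough illustration of that merge step, here is a minimal sketch using Pandas. The column names (`timestamp`, `number_people`, `temperature`), the sample rows, and the holiday list are all made up for this example; they are not taken from the real dataset or from the weather API's response format.

```python
import pandas as pd

# Hypothetical people counts, one row per reading (column names are assumptions).
counts = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-18 09:05", "2016-01-18 17:40", "2016-01-19 09:10"]
    ),
    "number_people": [35, 142, 61],
})

# Hypothetical hourly weather readings, e.g. parsed from a weather API response.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2016-01-18 09:00", "2016-01-18 17:00", "2016-01-19 09:00"]
    ),
    "temperature": [52.1, 60.3, 49.8],
})

# Hypothetical holiday calendar (midnight timestamps, one per holiday).
holidays = pd.to_datetime(["2016-01-18"])

# Attach the most recent weather reading at or before each people count.
merged = pd.merge_asof(
    counts.sort_values("timestamp"),
    weather.sort_values("timestamp"),
    on="timestamp",
    direction="backward",
)

# Derive calendar features from the timestamp.
merged["hour"] = merged["timestamp"].dt.hour
merged["day_of_week"] = merged["timestamp"].dt.dayofweek  # Monday = 0
merged["is_holiday"] = merged["timestamp"].dt.normalize().isin(holidays).astype(int)

print(merged)
```

`merge_asof` is used here because counts and weather readings rarely share exact timestamps; a plain `merge` on an hour or date column would also work if both tables were rounded to the same granularity first.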

Machine learning models are great for learning from large amounts of data. The general idea is to train your algorithm on about 70% of the data, test it on the other 30% to judge how accurate it is, and then use your trained model to make predictions. For a regression model in scikit-learn, the score on the test set is the coefficient of determination (R²): the share of the variation in the actual people counts that the model's predictions explain. A score of 1.0 means the predictions match the test set perfectly, 0 means the model does no better than always guessing the average count, and a negative score means it does even worse than that.
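The train/test workflow can be sketched in a few lines of scikit-learn. The data below is synthetic and the plain linear model is only a placeholder, so treat this as an illustration of the 70/30 split rather than the setup used for the gym data. Note that for regressors, scikit-learn's `score` method reports the coefficient of determination (R²).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gym data (an assumption, not the real dataset):
# the people count grows with the hour and dips slightly with temperature.
rng = np.random.default_rng(0)
hour = rng.integers(6, 24, size=200)
temperature = rng.uniform(40, 90, size=200)
people = 50 + 10 * hour - 0.5 * temperature + rng.normal(0, 5, size=200)

X = np.column_stack([hour, temperature])
y = people

# Hold out 30% of the rows for testing; train on the remaining 70%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)

# For regressors, score() returns R^2 on the data it is given.
r2 = model.score(X_test, y_test)
print(f"train rows: {len(X_train)}, test rows: {len(X_test)}, test R^2: {r2:.3f}")
```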

### Attempt 1: Learning Alone

I started with my favorite language, Python, and found a handy ML library called scikit-learn. The library even lays out a map of which model to use given your objectives.

Since I wanted to predict how crowded the gym will be (a number of people between zero and who knows), I was dealing with predicting a continuous variable, not a class. This means I was in regression land. I tried every different model within that bubble, from a Stochastic Gradient Descent (SGD) Regressor to a Support Vector Regression (SVR).

Here’s an example of using scikit-learn to create a machine learning model:

    from sklearn.svm import SVR

    model = SVR(C=1, cache_size=500, epsilon=1, kernel='rbf')

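The arguments (C=1, cache_size=500, etc.) are the model's hyperparameters. Settings such as C and epsilon make your model more or less accurate, while cache_size only affects memory use and training speed. Let’s see what kind of effect C had on my score: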

Clearly, my model wasn’t doing too hot. A score of about 0.70 means that the model explained only about 70% of the variation in how crowded the gym gets. Increasing C does trend towards increasing the score, but every tenfold increase in C added hours to training the model while only marginally increasing the score. I could let my computer spend hours training with C=1,000,000, but the score wouldn’t pass 0.72.
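A sweep over C like the one described above might look roughly like this. The data is synthetic (a single made-up `hour` feature with an evening peak), so the scores it prints say nothing about the real gym dataset; the sketch only shows the mechanics of refitting an SVR for each value of C and comparing test scores.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic data (an assumption): crowd size peaks around 17:00.
rng = np.random.default_rng(0)
hour = rng.uniform(6, 24, size=300)
people = 150 * np.exp(-((hour - 17) ** 2) / 18) + rng.normal(0, 10, size=300)

X = hour.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(
    X, people, test_size=0.3, random_state=0
)

# Refit the model for each candidate C and record its test-set R^2.
scores = {}
for C in [0.1, 1, 10, 100, 1000]:
    model = SVR(C=C, cache_size=500, epsilon=1, kernel="rbf")
    model.fit(X_train, y_train)
    scores[C] = model.score(X_test, y_test)
    print(f"C={C:>6}: test R^2 = {scores[C]:.3f}")
```

On a dataset of tens of thousands of rows, each fit takes far longer than it does here, which is why large values of C become expensive to explore.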

My SVR model with a high score of 0.72 wasn’t good enough, and it was the best I could do. It was time to enlist the help of people who actually knew what they were doing.

### Attempt 2: Crowdsourcing on Kaggle

Kaggle, a collaborative platform for data science competitions and machine learning, was the perfect place to upload my dataset and let people take shots at it. You can find it here: https://www.kaggle.com/nsrose7224/crowdedness-at-the-campus-gym/

I shared the link on social media feeds, promising to buy coffee for anyone who could beat my accuracy score. I knew competition and caffeine would go a long way towards convincing people to help, but I was overwhelmed by the responses and help I got. Within a few days, my dataset was the #1 hottest featured set on Kaggle.

Several kind souls on Kaggle made kernels (online code notebooks) breaking down my features, testing various prediction models, and transforming my data. I was impressed by the knowledge and skill of these anonymous data science wizards and in awe of the sheer volume of information I didn’t know yet about machine learning.

Finally, at a friend’s suggestion, I increased my accuracy score from 0.72 to 0.86 by switching my ML model to a Random Forest Regressor. So Ryan, I owe you a cup of Philz coffee when you come to the Bay Area.

Enough of that. Let’s get to the good stuff and why most of you came here — when should I go to the gym to avoid the crowds?

### The Results

1. Features: An essential part of machine learning is figuring out which of your features are important to determining your predictions and which are redundant. I’ll let user Donyoe from Kaggle take over at this point:

I found it super interesting that whether or not it’s a federal holiday doesn’t really seem to have much of an effect on the number of people in the gym. Next, Donyoe broke down how people counts are related to time of day, day of the week, and the temperature.

2. Trends: User Demetri P generated some really fascinating heat maps.

The next heat map Demetri made shows the gradient of gym population (change over time).

So according to Demetri, the best time to lift is around 8–10 A.M. Coincidentally, this is the time I’ve been going since freshman year, so I guess I nailed it by accident.

3. Accuracy: By switching to a Random Forest Regressor model, my test score climbed from the SVR’s 0.72 to around 0.88. Fine-tuning the `n_estimators` hyperparameter of the model (the number of trees in the forest) helps even more.
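To make the last two points concrete, here is a small sketch that fits a `RandomForestRegressor` with a few values of `n_estimators` and then reads off the fitted model's `feature_importances_`. The three features and the synthetic target are assumptions chosen for illustration (the target is built to depend mostly on the hour and not at all on the holiday flag), so the numbers are not the ones from the Kaggle kernels.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): hour drives the crowd, temperature nudges it,
# and the holiday flag has no effect at all.
rng = np.random.default_rng(0)
n = 500
hour = rng.uniform(6, 24, size=n)
temperature = rng.uniform(40, 90, size=n)
is_holiday = rng.integers(0, 2, size=n)
people = (
    150 * np.exp(-((hour - 17) ** 2) / 18)
    + 0.3 * temperature
    + rng.normal(0, 8, size=n)
)

feature_names = ["hour", "temperature", "is_holiday"]
X = np.column_stack([hour, temperature, is_holiday])
X_train, X_test, y_train, y_test = train_test_split(
    X, people, test_size=0.3, random_state=0
)

# Try a few forest sizes and record the test-set R^2 for each.
rf_scores = {}
for n_estimators in [5, 20, 100]:
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X_train, y_train)
    rf_scores[n_estimators] = model.score(X_test, y_test)
    print(f"n_estimators={n_estimators:>3}: test R^2 = {rf_scores[n_estimators]:.3f}")

# Importances of the last (largest) forest; they sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {value:.3f}")
```

More trees usually give a steadier score at the cost of training time, so it is common to stop increasing `n_estimators` once the test score levels off.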

I finally had a model with an accuracy score I could be proud of. What can I do now with these predictions?

You can see all these kernels and more here.

### The Future

Up until now, my app had been making its predictions by learning only from the previous week. It had no notion of temperature, semester effects, or any other complex features. Now, I want to see if we can do better. I’ve deployed my Random Forest Regressor on Algorithmia to generate predictions as an API call. Starting now, the web app will offer beta testing of machine-learned predictions for the campus gym.
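Whatever service hosts the model, the prediction step behind such an API call boils down to turning a request (a time, a forecast temperature, a holiday flag) into the same feature row the model was trained on. The sketch below is a hypothetical stand-in: `make_features`, `predict_crowd`, the feature layout, and the synthetic training data are all assumptions for illustration, and it does not use the Algorithmia client or the deployed model.

```python
from datetime import datetime

import numpy as np
from sklearn.ensemble import RandomForestRegressor


def make_features(when, temperature, is_holiday):
    """Build one feature row: [hour of day, weekday, temperature, holiday flag]."""
    return [when.hour + when.minute / 60, when.weekday(), temperature, int(is_holiday)]


# Train on synthetic history (an assumption): crowds peak around 17:00.
rng = np.random.default_rng(0)
n = 400
hours = rng.uniform(6, 24, size=n)
weekdays = rng.integers(0, 7, size=n)
temps = rng.uniform(40, 90, size=n)
holiday_flags = rng.integers(0, 2, size=n)
X = np.column_stack([hours, weekdays, temps, holiday_flags])
y = 150 * np.exp(-((hours - 17) ** 2) / 18) + rng.normal(0, 8, size=n)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)


def predict_crowd(when, temperature, is_holiday=False):
    """Return the predicted number of people for a single future moment."""
    row = np.array([make_features(when, temperature, is_holiday)])
    return float(model.predict(row)[0])


morning = predict_crowd(datetime(2017, 3, 7, 9, 0), temperature=55)
evening = predict_crowd(datetime(2017, 3, 7, 17, 0), temperature=60)
print(f"9 AM: {morning:.0f} people, 5 PM: {evening:.0f} people")
```

Keeping the feature construction in one function means the training code and the serving code cannot drift apart in how they encode a timestamp.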

I would like to hear from students if these are indeed accurate and how helpful they are so that I can incorporate their feedback. There will be mistakes and off-days, but I’ll continue to improve the accuracy of the model and add richer features. If this goes well, I’ll add machine learning to the other campus areas as well. My objective is to learn as much as these algorithms do so I can build more and more intelligent services for people to use.

Thanks to everyone who helped out on my quest for data science knowledge: Sherdil Niyaz, Allen Guo, Ryan Peden, and Kaggle users Donyoe, Demetri P, Nirajverma, and Hamidreza Omidvar.