Algorithmia Blog - Deploying AI at scale

Developing your own machine learning projects 

Tower of books with overlaid text "Communication is the skill that sets the data tinkerers apart from those who influence business decisions"

Photo by Lysander Yuen

Data scientists and machine learning engineers often encounter a disconnect between what they learned (in school, a bootcamp, or independently) and how this knowledge is applied in their work. For instance, you may be proficient in R or Python, but still be unsure how the code or the libraries you’re pulling from relate to actual use cases.

Machine learning is more than its techniques

Machine learning techniques and principles are interesting to learn, but like many technical disciplines, they do not exist simply for the sake of existing. 

Machine learning is the technical foundation of data science; practitioners use their knowledge of statistics combined with computer science to develop predictive models and uncover patterns in data. These models help businesses with tasks like pricing, developing new products, or identifying the best customer for a service.  

So how does one go from coding in a classroom to advising executives on data-driven decision making? It starts with practice—designing and implementing your own machine learning models using real-world datasets. As you gain experience with more projects, you’ll begin to have a better understanding of which algorithms and methods are appropriate for specific types of questions. 

Read on to learn more about machine learning projects and how you design them. 

 

Cereal O's in a milky bowl

Photo by David Streit

What are some machine learning projects?

Some classic machine learning projects that you may have already been exposed to as a student include: 

  • Boston Housing – This project is meant to teach simple linear regression. You predict the value of a home based on independent variables like the number of rooms in a dwelling, a town’s crime rate, and the number of owner-occupied homes nearby.
  • Iris Flowers – This project teaches basic classification techniques. The purpose is to classify an observation into one of three species of iris based on four attributes: sepal length, sepal width, petal length, and petal width.
  • Handwriting recognition – Using the Modified National Institute of Standards and Technology (MNIST) dataset, the goal of this project is to identify individual digits from images of handwriting. This is a good project to attempt if you are interested in neural networks and deep learning.
  • Breakfast cereals – The breakfast cereal data set contains nutrition and consumer rating information from 80 products. Because of its breadth of categorical and numerical variables, you can develop projects that include many machine learning techniques including dimension reduction and clustering.

All of the previous projects have corresponding datasets on Kaggle, a community where you can compete to solve real-world machine learning problems. It’s also useful because it lets you see different people’s approaches to the same question.

Where do I find data?

While Kaggle is one place to start, there are plenty of other sources of datasets. As you conduct more machine learning projects, you may find yourself wanting to analyze more obscure or lesser-known datasets. Here are a few places to look:

  • UC Irvine Machine Learning Repository – The UCI repository maintains 488 datasets that range in topics from smartwatch activity to forest fire tracking. This is also the home of the Iris dataset mentioned above.
  • Data.gov – Multiple US federal agencies house their data here. This is an open-data portal where you can find datasets related to social sciences, public health, economics, and education.
  • Quandl – Quandl is a platform that houses finance and economics datasets created by hundreds of publishers. It offers free and paid services.
  • FiveThirtyEight – This data-driven reporting outlet makes the data behind its stories available on GitHub. Topics include sports, politics, and pop culture.

How do you make a machine learning model from scratch?

If you really want to challenge yourself with a machine learning project, you can develop a model from scratch. As a reminder, a machine learning model is the equation or computation you get by fitting the parameters of an algorithm to sample data. An algorithm is a set of pre-defined rules.

Although the specifics will be different based on your actual project, here are the general steps of developing a machine learning model: 

Find a problem to solve

Think about subjects that are interesting to you. Within that category, find a problem that could be solved, or at least initially approached, through data analysis. For example, let’s say we’re interested in healthcare and we’d like to explore what it means to be a “good” hospital in the United States.

Surgeons in the OR

Photo by Abraham Popocatl

Find relevant data and refine the question

It’s unlikely that we’ll find a dataset that would directly answer such a broad question, but we have an idea of where we should start. 

Medicare publishes data on hospital quality. One dataset in particular is a patient survey that lists an overall rating (1-5) based on dimensions like nurse communication, physician communication, and quietness of the hospitals.

Using this data, we can form more precise questions. For example: 

Which components of patient satisfaction influence a hospital’s overall score?

Or

Given a hospital’s nurse/physician/quietness/etc. data, what is its predicted overall quality score?

Import the data

Data can come in multiple formats, including JSON, XML, and CSV. How you import the data will depend on whether you want to conduct your analysis in R, Python, or a proprietary platform.
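As a rough illustration of the import step in Python, here is a minimal sketch that reads CSV and JSON with the standard library only. The column names and sample records are made up for this example and are not taken from the actual Medicare files; a real project would more likely read from files on disk or use a data-frame library.

```python
import csv
import io
import json

# Made-up sample records for illustration only; not real Medicare data.
CSV_TEXT = """hospital,nurse_communication,overall_rating
Hospital A,4,4
Hospital B,3,3
"""

JSON_TEXT = '[{"hospital": "Hospital C", "nurse_communication": 5, "overall_rating": 5}]'


def load_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)

# csv.DictReader returns every value as a string, so convert numeric fields.
for row in csv_rows:
    row["nurse_communication"] = int(row["nurse_communication"])
    row["overall_rating"] = int(row["overall_rating"])

all_rows = csv_rows + json_rows
print(len(all_rows), "rows loaded")
```

Note that CSV values arrive as strings while JSON keeps its numeric types, which is why only the CSV rows need converting here.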

Explore and clean the data 

Now you’ll want to clean the data. This means getting rid of missing, null, and/or nonsensical values and possibly removing unnecessary columns. At this point you’ll also want to do some data visualizations to see if there are any interesting associations to explore further. 
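A cleaning pass might look something like the sketch below. It assumes a hypothetical survey table in which ratings should be whole numbers from 1 to 5; the rows, column names, and valid range are invented for illustration, and dropping bad rows is only one option (imputing values is another).

```python
# Made-up survey rows for illustration; "" and None stand in for missing values.
raw_rows = [
    {"hospital": "Hospital A", "nurse_communication": "4", "overall_rating": "4"},
    {"hospital": "Hospital B", "nurse_communication": "", "overall_rating": "3"},
    {"hospital": "Hospital C", "nurse_communication": "5", "overall_rating": "9"},
    {"hospital": "Hospital D", "nurse_communication": "2", "overall_rating": None},
    {"hospital": "Hospital E", "nurse_communication": "3", "overall_rating": "2"},
]

# Hypothetical column names; adjust to match your own dataset.
NUMERIC_COLUMNS = ("nurse_communication", "overall_rating")


def clean_rows(rows, low=1, high=5):
    """Keep rows whose numeric columns are present and fall within [low, high]."""
    cleaned = []
    for row in rows:
        try:
            values = {col: int(row[col]) for col in NUMERIC_COLUMNS}
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        if all(low <= v <= high for v in values.values()):
            cleaned.append({"hospital": row["hospital"], **values})
    return cleaned


clean = clean_rows(raw_rows)
print([row["hospital"] for row in clean])
```

Here Hospital B and Hospital D are dropped for missing values and Hospital C for an out-of-range rating, leaving two usable rows.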

Visualizing the data may also help you figure out if your machine learning project is a supervised or unsupervised learning task. In brief, a supervised task has a target outcome (number or a category), while an unsupervised task does not. Unsupervised tasks are often used to find patterns and associations in the data.

If you are considering a supervised method, then you will need to determine if a classification or regression algorithm is most appropriate. This will depend on the question you are trying to answer.

Develop and refine the model

This is where you begin to experiment and use your outside knowledge and intuition to make adjustments to your model. To train the model, and then validate it later, you will need to split the data into training, validation, and test datasets.
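One simple way to do the split is to shuffle the records and slice them, as in this sketch. The 60/20/20 proportions, the fixed seed, and the function name split_dataset are assumptions chosen for the example, not a rule.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of rows and split it into train, validation, and test lists."""
    if not 0 < train_frac + val_frac < 1:
        raise ValueError("train_frac + val_frac must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, val, test


rows = list(range(10))  # placeholder records standing in for real data
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Shuffling before slicing matters when the source file is sorted (for example by state or by rating), because otherwise the three sets would not be comparable.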

Many data scientists start with the most basic algorithms when developing a model and move up from there:

  • Linear regression for basic regression.
  • Logistic regression for basic classification.
  • K-means clustering for an unsupervised task. 
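To make the first item in the list concrete, here is a minimal from-scratch sketch of linear regression with a single predictor, fitted by ordinary least squares. The function names and the toy data points are invented for illustration; in practice you would usually reach for an established library.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("xs must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Toy points that lie exactly on y = 2x + 1, invented for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept, predict(5, slope, intercept))
```

The slope and intercept are the model's parameters, and fitting them to the sample data is what turns the algorithm into a model.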

Once you think that you’ve developed the best model, you can evaluate its performance on the held-out test data with a metric suited to the task, such as R-squared for regression or the area under the ROC curve (AUC) for classification.
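Both metrics are short enough to write out by hand, which can help show what they measure. The sketch below assumes binary labels coded as 0 and 1, and computes AUC by comparing every positive/negative pair, which is fine for small examples but slow on large datasets. The input values are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("y_true has zero variance")
    return 1 - ss_res / ss_tot


def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up values for illustration.
perfect_fit = r_squared([1, 2, 3], [1, 2, 3])
mean_only_fit = r_squared([1, 2, 3], [2, 2, 2])
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(perfect_fit, mean_only_fit, auc)
```

An R-squared of 1 means the predictions match the data exactly, while 0 means the model does no better than always predicting the mean. An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.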

Communicate your results

Communication is the skill that sets the data tinkerers apart from those who influence business decisions. A crucial part of a data scientist’s job is communicating what they’ve uncovered to a company’s leaders. And they’re doing more than just reporting; they’re also offering recommendations based on what they observed in the data.

Say, for example (using our Medicare data scenario), you find that nurse communication is the variable most strongly correlated with the overall patient satisfaction score. What do you do with that information? Your recommendation might be to brief healthcare leaders on the importance of conducting further research into the effects of nurse communication at the highest and lowest rated hospitals.
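A finding like that could come from ranking each survey dimension by its Pearson correlation with the overall score, as in the sketch below. The numbers are made up for five hypothetical hospitals and are not real Medicare results. Keep in mind that correlation alone does not show that one variable causes the other.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five hypothetical hospitals; not real Medicare data.
survey = {
    "nurse_communication": [2, 3, 3, 4, 5],
    "physician_communication": [3, 2, 4, 3, 5],
    "quietness": [4, 2, 5, 1, 3],
}
overall = [1, 2, 3, 4, 5]

# Rank dimensions by the strength (absolute value) of their correlation.
ranked = sorted(
    ((name, pearson(values, overall)) for name, values in survey.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:.2f}")
```

A ranked list like this is easier to present to decision makers than a table of raw coefficients, and it points directly at which dimension deserves follow-up research.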

It’s also good practice to post the results and the conclusions of your machine learning projects on your personal blog or to GitHub to share with the larger data science community. It’s an opportunity for you to help others learn, receive feedback, and possibly publicize a solution to a previously unsolved problem.