Algorithmia Blog - Deploying AI at scale

5 machine learning models you should know

[Image: k-means clustering, a depiction of a clustering model (Medium)]

Getting started with machine learning means understanding the how and why behind employing particular methods. We’ve chosen five of the most commonly used machine learning models as the basis for this discussion.

AI taxonomy 

Before diving too deep, we thought we’d define some important terms that are often confused when discussing machine learning. 

  • Algorithm – A set of predefined rules used to solve a problem. For example, simple linear regression is a prediction algorithm used to find a target value (y) based on an independent variable (x). 
  • Model – The actual equation or computation that results from applying sample data to the algorithm’s parameters. To continue the simple linear regression example, the model is the equation of the line of best fit for the x and y values in the sample set (see the short sketch after this list).
  • Neural network – A multilayered algorithm that consists of an input layer, an output layer, and one or more hidden layers in between. The hidden layers are stacked computations that successively transform the input until the network produces a final output. Neural networks are sometimes referred to as “black box” algorithms because humans don’t have a clear, structured view of how the network reaches its decisions. 
  • Deep learning – Machine learning methods based on neural network architectures. “Deep” refers to the large number of hidden layers (sometimes more than 100).
  • Data science – A discipline that combines math, computer science, and business/domain knowledge. 
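
To make the algorithm-versus-model distinction concrete, here is a minimal sketch in Python using NumPy and invented sample data: the simple linear regression algorithm is the fitting procedure, and the model is the slope and intercept it produces.

```python
import numpy as np

# Invented sample data: x is the independent variable, y the target
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 7.0 + rng.normal(scale=2.0, size=50)

# The *algorithm* is ordinary least squares; applying it to the sample
# data produces the *model*: the slope b and intercept c of y = bx + c.
b, c = np.polyfit(x, y, deg=1)
print(f"fitted model: y = {b:.2f}x + {c:.2f}")

# The model, not the algorithm, is what scores new observations
print("prediction at x = 4:", b * 4 + c)
```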

Machine learning methods

Machine learning methods are often broken down into two broad categories: supervised learning and unsupervised learning.

Supervised learning – Supervised learning methods are used to find a specific target, which must also exist in the data. The main categories of supervised learning include classification and regression. 

  • Classification – Classification models often have a binary target, sometimes phrased as a “yes” or “no.” A variation on this model is probability estimation, in which the target is how likely a new observation is to fall into a particular category. 
  • Regression – Regression models always have a numeric target. They model the relationship between a dependent variable and one or more independent variables. 

Unsupervised learning – Unsupervised learning methods are used when there is no specific target to find. Their purpose is to form groupings within the dataset or make observations about similarities. Further interpretation would be needed to make any decisions on these results. 

  • Clustering – Clustering models look for subgroups within a dataset that share similarities. Observations in these natural groupings are similar to each other but different from those in other groups. The groupings may or may not have any actual significance. 
  • Dimension reduction – These models reduce the number of variables in a dataset by grouping similar or correlated attributes.

It’s important to note that individual models are not necessarily used in isolation. It often takes a combination of supervised and unsupervised methods to solve a data science problem. For example, one might use a dimension-reduction method on a large dataset and then use the new variables in a regression model. 

To that end, model pipelining splits machine learning workflows into modular, reusable parts that can be coupled with other model applications to build more powerful software over time. 
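
As a rough illustration, the sketch below uses scikit-learn’s Pipeline to chain a dimension-reduction step into a regression step; the dataset is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic dataset: 200 observations, 10 input variables
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.1, size=200)

# A modular pipeline: an unsupervised reduction step feeds a supervised model
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=3)),    # unsupervised: dimension reduction
    ("regress", LinearRegression()),    # supervised: regression on the components
])
pipeline.fit(X, y)
print("R^2 on training data:", pipeline.score(X, y))
```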

What are the most popular machine learning algorithms? 

Below we’ve detailed some of the most common machine learning algorithms. They’re often mentioned in introductory data science courses and books and are a good place to begin. We’ve also provided some examples of how these algorithms are used in a business context. 

Linear regression

Linear regression is a method in which you predict an output variable using one or more input variables. It is represented in the form of a line: y = bx + c. The Boston Housing Dataset is one of the most commonly used resources for learning to model using linear regression. With it, you can predict the median value of a home in the Boston area from 13 attributes, including the crime rate per town, the student/teacher ratio per town, and the number of rooms in the house. 
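
Here is a minimal sketch of fitting such a model with scikit-learn. The features and data are synthetic stand-ins for housing attributes (the Boston dataset itself is no longer bundled with recent scikit-learn releases), so treat the numbers as illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic, housing-style features: crime rate, rooms, student/teacher ratio
rng = np.random.default_rng(1)
crime = rng.uniform(0, 10, 300)
rooms = rng.uniform(3, 9, 300)
ptratio = rng.uniform(12, 22, 300)
X = np.column_stack([crime, rooms, ptratio])
# Median home value (in made-up units) rises with rooms, falls with crime and ptratio
y = 20 + 5.0 * rooms - 1.5 * crime - 0.8 * ptratio + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("R^2 on held-out data:", model.score(X_test, y_test))
```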

K-means clustering

K-means clustering is a method that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the individual conducting the analysis. Clustering is often used as a market segmentation approach to uncover similarities among customers or identify an entirely new segment altogether. 
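
A minimal market-segmentation-style sketch with scikit-learn, using invented customer spend and visit data; the choice of k = 3 is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Invented customer data: annual spend and store visits for three loose groups
rng = np.random.default_rng(7)
spend = np.concatenate([rng.normal(200, 30, 100), rng.normal(800, 80, 100), rng.normal(1500, 120, 100)])
visits = np.concatenate([rng.normal(2, 1, 100), rng.normal(10, 2, 100), rng.normal(25, 4, 100)])
X = StandardScaler().fit_transform(np.column_stack([spend, visits]))

# The analyst chooses k; here we assume three customer segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("centroids (standardized units):")
print(kmeans.cluster_centers_)
```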

[Image: k-means clustering (Medium)]

Principal component analysis (PCA)

PCA is a dimension-reduction technique used to reduce the number of variables in a dataset by grouping together variables that are measured on the same scale and are highly correlated. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability. 

A common application of PCA is aiding in the interpretation of surveys that have a large number of questions or attributes. For example, global surveys about culture, behavior or well-being are often broken down into principal components that are easy to explain in a final report. In the Oxford Internet Survey, researchers found that their 14 survey questions could be distilled down to four independent factors. 
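
A rough sketch of the idea with scikit-learn: the synthetic “survey” below is generated from a few hidden factors (not the actual Oxford Internet Survey data), and PCA recovers a small number of components that explain most of the variability.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic "survey": 14 questions whose answers are driven by 4 hidden factors
rng = np.random.default_rng(3)
factors = rng.normal(size=(500, 4))              # underlying drivers
loadings = rng.normal(size=(4, 14))              # how each question relates to them
answers = factors @ loadings + rng.normal(scale=0.5, size=(500, 14))

# PCA distills the 14 correlated questions into a handful of components
pca = PCA(n_components=4)
scores = pca.fit_transform(StandardScaler().fit_transform(answers))
print("variance explained by 4 components:", round(pca.explained_variance_ratio_.sum(), 3))
```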

K-nearest neighbors (k-NN)

Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It compares the distance (often Euclidean) between a new observation and those already in a dataset. The “k” is the number of neighbors to compare against and is usually chosen, often via cross-validation, to minimize the chance of overfitting or underfitting the data. 

In a classification scenario, the class held by the majority of a new observation’s k nearest neighbors determines which class it is assigned to. For this reason, k is often an odd number to prevent ties. For a prediction model, the average of the neighbors’ target attribute becomes the predicted value for the new observation. 
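
A minimal sketch with scikit-learn, using synthetic two-class data; the candidate values of k and the cross-validation setup are illustrative choices, not prescriptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the existing observations
X, y = make_classification(n_samples=400, n_features=5, random_state=0)

# Try odd values of k and keep the one that cross-validates best
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9, 11]}, cv=5)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])

# A new observation is classified by majority vote among its k nearest neighbors
new_observation = X[:1]
print("predicted class:", search.predict(new_observation)[0])
```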

[Image: k-nearest neighbors depicted (ResearchGate)]

Classification and regression trees (CART)

Decision trees are a transparent way to separate observations and place them into subgroups. CART is a well-known version of a decision tree that can be used for classification or regression. You choose a response variable and make partitions through the predictor variables. The algorithm typically limits the number of partitions (for example, via stopping criteria or pruning) to prevent underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon for their lack of explainability, because interested parties need to see the entire process behind a decision. 

[Image: decision tree on the Titanic survivor dataset (community.jmp)]
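
A minimal sketch of a small classification tree with scikit-learn. The Titanic-style features and survival rule below are invented for illustration; the point is that the fitted tree can be printed and inspected end to end.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented Titanic-style data: passenger class, age, and sex
rng = np.random.default_rng(5)
pclass = rng.integers(1, 4, 500)
age = rng.uniform(1, 70, 500)
is_female = rng.integers(0, 2, 500)
# Survival odds loosely favor women and first-class passengers (made-up rule)
survived = (rng.random(500) < 0.2 + 0.4 * is_female + 0.15 * (pclass == 1)).astype(int)

X = np.column_stack([pclass, age, is_female])
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, survived)

# Unlike a "black box" model, the fitted tree can be printed and audited in full
print(export_text(tree, feature_names=["pclass", "age", "is_female"]))
```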

How do I choose the best model for machine learning?

The model you choose for machine learning depends greatly on the question you are trying to answer or the problem you are trying to solve. Additional factors to consider include the type of data you are analyzing (categorical, numerical, or maybe a mixture of both) and how you plan on presenting your results to a larger audience. 

The five model types discussed herein do not represent the full collection of model types out there, but are commonly used for most business use cases we see today. Using the above methods, companies can conduct complex analysis (predict, forecast, find patterns, classify, etc.) to automate workflows.

To learn more about machine learning models, visit algorithmia.com.

Shorten deployment time with Algorithmia


Every year, millions of dollars are wasted planning, cleaning, and training machine learning (ML) models that will never get to production. In fact, more than half of data science projects are never fully deployed, and some never will be, resulting in zero generated revenue.

When organizations are asked about their machine learning business challenges, deployment difficulties are cited as the biggest obstacle. 

Reduce waste; increase value

The solution looks simple at first:

  • Make it fast and simple to deploy ML models
  • Reduce the learning curve
  • Stop asking data scientists to do DevOps
  • Automate where possible and measure the results
  • Reduce model deployment time from months (or years) to minutes

But let’s deep dive into how to make these solutions feasible.

Remove barriers to deployment

If a data science department is isolated, it may not have a reliable DevOps team and must learn to deploy its own models. However, when data scientists are tasked with deploying models, they face a number of challenges: 

  • They must learn a wide range of DevOps-related skills that are not part of their core competencies.
  • They will spend a lot of time learning to properly containerize models, implement complex serving frameworks, and design CI/CD workflows.

This pulls them away from their primary mission of designing and building models, and their work on the challenges above often meets with varying degrees of success.

But let’s say an IT or DevOps team is available; now the data scientists are faced with a new set of challenges:

  • IT is used to working with conventional application deployments; ML models differ in a number of ways, often requiring a unique “snowflake” environment for each individual model. 
  • Information security restrictions further complicate deployment, requiring various levels of isolation and auditing. Because ML models are opaque, and data science teams follow non-standard development practices, IT is often unable to add sufficient tooling, such as fine-grained logging or error-handling. 

From there, developers are typically borrowed from other teams to help solve these problems—for example, writing wrapper code to add company-standard authentication—which can cause further slowdowns and resource consumption.


Reduce the learning curve

To succeed in ML efforts, companies must reduce the breadth of knowledge each individual team is responsible for, allowing them to specialize in their own practices. When they are able to do so, the learning curve for each team can be reduced and they can quickly scale up their own activities and projects.

Stop asking data scientists to do DevOps

A key mechanism for enabling this is a managed platform for the deployment, serving, and management of ML models. This platform provides the following benefits:

  • Separation of concerns: data scientists can focus on model building and app developers can focus on integration.
  • Low DevOps: managed platforms require minimal oversight and DevOps never need to be involved in the deployment of an individual model.
  • Reduced manual tool-building: authentication, data connectors, monitoring, and logging are built-in.

Selecting the right platform for ML is critical. At minimum, it should:

  • Provide deployment, serving, and model management within a single environment to enhance accessibility and reduce tool thrashing.
  • Allow app developers and other consumers to quickly locate, test, and integrate appropriate models.
  • Support any language and framework so data scientists are not constrained in their development.
  • Allow data scientists to write code, not containers, so they can remain focused at the model level.
  • Not require any deep proprietary changes to models, cause vendor lock-in, or tie model-serving to a specific model-training platform.
  • Embed within a choice of private cloud, with full portability between infrastructure providers.
  • Work with existing choices for CI/CD and other DevOps tooling.


Go the last mile

With the problems of model deployment and serving contained, your company can focus on creating and tracking value and ROI.

By providing application developers with a searchable, structured way of locating models, cross-departmental communication barriers can be reduced to zero. The instant that the ML team releases a model for general consumption, it becomes discoverable and usable by developers. Developers can consume the model with a global API and have cut-and-paste code ready to drop into any application or service.
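
As a rough sketch of what that cut-and-paste integration can look like, the snippet below uses Algorithmia’s Python client; the API key, algorithm path, and input format are placeholders, not a real published model.

```python
import Algorithmia

# Placeholder credentials and a hypothetical model path
client = Algorithmia.client("YOUR_API_KEY")
algo = client.algo("your_org/churn_predictor/1.0.0")

# The same call can be dropped into any application or service that can reach the API
response = algo.pipe({"customer_id": 12345, "features": [0.2, 3, 17.5]})
print(response.result)
```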

As models are added and consumed, oversight becomes key. Monitoring, logging, and showbacks allow for seamless continuous operation while demonstrating the value of each model. Companies can properly allocate resources across teams and prove ROI on each individual project.


Start deploying today

Don’t become the alchemy Gartner warned about: “Through 2020, 80 percent of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization” (Gartner). 

Take stock of your company-wide ML initiatives. If you’re not deploying all of your models into production, Algorithmia can help you get there. If your data science teams are running their own DevOps, or your IT team is being overloaded with ML needs, our managed solution is the right tool to get your model productionization back on track.

Algorithmia is the leader in machine learning deployment, serving, and management. Our product deploys and manages models for Fortune 100 companies, US intelligence agencies, and the United Nations. Our public algorithm marketplace has helped more than 90,000 engineers and data scientists deploy their models.

Continuous deployment for machine learning—a re:Invent talk by Diego Oppenheimer

Algorithmia coming to AWS re:Invent

Meet with us!

AWS re:Invent is next month, and we are pleased to announce that Algorithmia CEO, Diego Oppenheimer, will be speaking on the new software development lifecycle (SDLC) for machine learning. Often we get variations on this question: how can we adapt our infrastructure, operations, staffing, and training to meet the challenges of ML without throwing away everything that already works? Diego is prepared with answers. His talk will cover how machine learning (ML) will fundamentally change the way we build and maintain applications.

Currently, many data science and ML deployment teams are struggling to fit an ML workflow into tools that don’t make sense for the job. This session will help clarify the differences between traditional and ML-driven SDLCs, cover common challenges that need to be overcome to derive value from ML, and provide answers to questions about current technological trends in ML software. Finally, Diego will outline how to build a process and tech stack to bring efficiency to your company’s ML development. 

Diego’s talk will be on 4 December at 1:40pm in the Nuvola Theater in the Aria. 

Coming soon: the 2020 State of Enterprise Machine Learning Report

Additionally, Diego will share insights from our upcoming 2020 State of Enterprise Machine Learning survey report, which will be an open-source guide for how the ML landscape is evolving. The report will focus on these findings:

  1. Shifts in the number of data scientists employed at companies in all industries and what that portends for the future of ML
  2. Use case complexity and customer-centric applications in smaller organizations
  3. ML operationalization (having a deployed ML lifecycle) capabilities (and struggles) across all industries
  4. Trends in ML challenges: scale, version-control, model reproducibility, and aligning a company for ML goals
  5. Time to model deployment and wasted time
  6. What determines ML success at the producer level (data scientist and engineer) and at the director and VP level

Pick up a copy of the report at Algorithmia’s booth.

Diego and his team will be available throughout the week at Booth 311 to answer questions about infrastructure specifics, ML solutions, and new use cases.

Meet with our team

If you or your team will be in Las Vegas for re:Invent this year, we want to meet with you. Our sales engineers would love to tailor a demo of Algorithmia’s product to your specific needs and demonstrate our latest features. Book some time with us!

Read the full press report here.

Customer churn prediction with machine learning


Why is churn prediction important? 

Defined loosely, churn is the process by which customers cease doing business with a company. Preventing a loss in profits is one clear motivation for reducing churn, but other subtleties may underlie a company’s quest to quell it. Most strikingly, the cost of customer acquisition usually far outweighs the cost of customer retention, so stamping out churn is also compelling from a more subtle financial perspective. 

While churn presents an obvious difficulty to businesses, its remedy is not always immediately clear. In many cases, and without descriptive data, companies are at a loss as to what drives it. Luckily, machine learning provides effective methods for identifying churn’s underlying factors and prescriptive tools for addressing it.

Methods for solving high churn rate

As with any machine learning task, the first, and often the most crucial step, is gathering data. Typical datasets used in customer churn prediction tasks will often curate customer data such as time spent on a company website, links clicked, products purchased, demographic information of users, text analysis of product reviews, tenure of the customer-business relationship, etc. The key here is that the data be high quality, reliable, and plentiful. 

Good results can often still be obtained with sparse data, but obviously more data is usually better. Once the data has been chosen, the problem must be formulated and the data featurization chosen. It’s important that this stage be undertaken with an attention to detail, as churn can mean different things to different enterprises. 

Types of churn

For some, the problem is best characterized as predicting the ratio of churn to retention. For others, predicting the percentage risk that an individual customer will churn is desired. And for many more, identifying churn might constitute taking a global perspective and predicting the future rate at which customers might exit the business relationship. All of these are valid formulations, but whichever one is chosen must be applied consistently across the customer churn prediction pipeline.

Once the data has been chosen, prepped, and cleaned, modeling can begin. While identifying the most suitable prediction model can be more of an art than a science, we’re usually dealing with a classification problem (predicting whether a given individual will churn) for which certain models are standards of practice. 

For classification problems such as this, both decision trees and logistic regression are desirable for their ease of use, training and inference speed, and interpretable outputs. These should be the go-to methods in any practitioner’s toolbox for establishing a baseline accuracy before moving onto more complex modeling choices. 

For decision trees, the model can be further improved by experimenting with ensemble variants such as random forests, bagging, and boosting. Beyond these two choices, convolutional neural networks, support vector machines, linear discriminant analysis, and quadratic discriminant analysis can all serve as viable prediction models to try. 
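
A minimal sketch of such a baseline using scikit-learn; the churn dataset here is synthetic and imbalanced on purpose, and the specific models and settings are illustrative rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced churn-like data: roughly 20 percent of customers churn
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Interpretable baseline first...
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic regression accuracy:", baseline.score(X_test, y_test))

# ...then a tree ensemble as a step up in complexity
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("random forest accuracy:", forest.score(X_test, y_test))
```

Comparing the two gives a quick read on how much accuracy the extra complexity actually buys before reaching for heavier models.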

Defining metrics with customer data 

Once a model has been chosen, it needs to be evaluated against a consistent and measurable benchmark. One way to do this is to examine the model’s ROC (Receiver Operating Characteristic) curve when applied to a test set. Such a curve plots the True Positive rate against the False Positive rate. By looking to maximize the AUC (area under the curve), one can tune a model’s performance or assess tradeoffs between different models. 

Another useful metric is the precision-recall curve, which, you guessed it, plots precision vs. recall. It’s useful in problems where one class is more qualitatively interesting than the other, which is the case with churn: we’re more interested in the smaller proportion of customers looking to leave than in those who aren’t (although we do care about them as well). 
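
A short sketch of computing both metrics with scikit-learn on synthetic churn-like data; the model and dataset are placeholders for whatever pipeline you actually trained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic churn-like test bed; swap in your own model and hold-out set
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
churn_scores = model.predict_proba(X_test)[:, 1]   # predicted probability of churning

# Area under the ROC curve and under the precision-recall curve
print("ROC AUC:", roc_auc_score(y_test, churn_scores))
print("PR AUC (average precision):", average_precision_score(y_test, churn_scores))
```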

In this case, a business would hope to identify potential churners with high precision so as to target potential interventions at them. For example, one such intervention might involve an email blast offering coupons or discounts to those most likely to churn. By carefully selecting which customers to target, businesses can contain the cost of these retention measures and increase their effectiveness.

Sifting through insights from model output 

Once the selected model has been tuned, a post-hoc analysis can be conducted. An examination of which input data features were most informative to the model’s success could suggest areas to target and improve. The total pool of customers can even be divided into segments, perhaps by using a clustering algorithm such as k-means. 

This allows businesses to home in on the particular markets where they may be struggling and tailor their churn-prevention approaches to meet those markets’ individual needs. They can also tap into the high interpretability of their prediction model (if such an interpretable model was selected) and use it to identify the factors that led those customers to churn.
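
A rough sketch of this kind of post-hoc analysis with scikit-learn; the feature names and data are invented, and the three-segment split is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic churn-like data with made-up feature names
feature_names = ["tenure", "site_visits", "support_tickets", "monthly_spend"]
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.8, 0.2], random_state=0)

# Which features drove the model's churn predictions?
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, importance in sorted(zip(feature_names, forest.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.2f}")

# Segment customers into groups that may warrant different interventions
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("customers per segment:", np.bincount(segments))
```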

Combating churn with machine learning

While churn prediction can look like a daunting task, it’s actually not all that different from any machine learning problem. When looked at generally, the overall workflow looks much the same. However, special care must be given to the feature selection, model interpretation, and post-hoc analysis phases so that appropriate measures can be taken to alleviate churn. 

In this way, the key skill in adapting machine learning to churn prediction lies not in any particular model specialized for the task, but in the domain knowledge of the practitioner and that person’s ability to make knowledgeable business decisions given the black-box nature of a model’s output.

Perfect order fulfillment: a Tevec case study


Read the case study

Algorithmia is fortunate to work with companies across many industries with varied use cases as they develop machine learning programs. We are delighted to showcase the great work one of our customers is doing and how the AI Layer is able to power their machine learning lifecycle.

Tevec is a Brazil-based company that hosts Tevec.AI, a supply chain recommendation platform that uses machine learning to forecast demand and suggest optimized replenishment/fulfillment orders for logistics chains. Put simply, Tevec ensures retailers and goods transport companies deliver their products to the right place at the right time.

In founder Bento Ribeiro’s own words, the “Tevec Platform is a pioneer in the application of machine learning for the recognition of demand behavior patterns, automating the whole process of forecasting and calculation of ideal product restocking lots at points of sale and distribution centers, allowing sales planning control, service level, and regulatory stocks.”

Tevec runs forecasting and inventory-optimization models and customizes user permissions so they can adjust the parameters of their inventory routine, such as lead times, delivery dates, minimum inventory, and service levels. Users can fine-tune the algorithms and adapt for specific uses or priorities. 

The challenge: serving and managing at scale

Initially, Tevec was embedding ML models directly into its platform, causing several issues:

  • Updating: models and applications were on drastically different update cycles, with models changing many times between application updates
  • Versioning: iterating on models and ensuring all apps were calling the most appropriate version was difficult to track and prone to error
  • Data integrations: manual integrations and multi-team involvement made customization difficult
  • Model management: models were interacting with myriad endpoints such as ERP, PoS systems, and internal platforms, which was cumbersome to manage

“Algorithmia provides the ability to not worry about infrastructure and guarantees that models we put in production will be versioned and production-quality.”

Luiz Andrade, CTO, Tevec

The solution: model hosting made simple with serverless microservices

Tevec decoupled model development from app development using the AI Layer so it can seamlessly integrate API endpoints, and users can maintain a callable library of every model version. Tevec’s architecture and data science teams now avoid costly and time-consuming DevOps tasks; that extra time can be spent on building valuable new models in Python, “the language of data science,” Andrade reasons. That said, with the AI Layer, Tevec can run models from any framework, programming language, or data connector—future-proofing Tevec’s ML program.

With Algorithmia in place, Tevec’s data scientists can test and iterate models with dependable product continuity, and can customize apps for customers without touching models, calling only the version needed for testing. 

Algorithmia’s serverless architecture ensures the scalability Tevec needs to meet its customers’ demands without the costs of other autoscaling systems, and Tevec pays only for the compute resources it actually uses.

Looking ahead

Tevec continues to enjoy 100-percent year-on-year growth, and as it scales so will its ML architecture deployed on Algorithmia’s AI Layer. Tevec is planning additional products beyond perfect order forecasts and it is evaluating new frameworks for specific ML use cases—perfect for the tool-agnostic AI Layer. Tevec will continue to respond to customer demands as it increases the scale and volume of its service so goods and products always arrive on time at their destinations.

“Algorithmia is the whole production system, and we really grabbed onto the concept of serverless microservices so we don’t have to wait for a whole chain of calls to receive a response.”

Luiz Andrade, CTO, Tevec

Read the full Tevec case study.