
Machine learning methods with R

Iris virginica in a field (Wikispecies)

R is an excellent language for machine learning. R primarily excels in the following three areas:

  1. Data manipulation
  2. Plotting
  3. Built-in libraries

R manipulates data using a built-in data structure called a data frame. Let’s load some data now and see how R can help us work with it.

The data

Luckily for us, R comes with some built-in datasets that we can simply load and have ready to go. We’ll use one such dataset, called iris, to test some of R’s machine learning capabilities. This dataset contains 150 flowers, 50 from each of three species: iris setosa, iris versicolor, and iris virginica. Each data point is a row in a data frame with four measurement attributes we can use for classification: sepal length, sepal width, petal length, and petal width. Let’s load the data and get started.

data("iris")
head(iris)

The R head command lets us inspect the first few rows of a dataset. You’ll see that the first rows all have the same class label, given by the Species column. When training a machine learning model, we want our data to be randomized, so let’s fix that.

Shuffling the data

The elements of our data frame are its rows, so to shuffle the data we need to permute the rows. To do this we can use the built-in R functions nrow and sample: nrow(iris) returns the number of rows in the iris dataset, and applying sample to that number returns a random permutation of the row indices 1 through nrow(iris). We can then index into the dataset using the shuffled row indices as shown below.

shuffled <- iris[sample(nrow(iris)),]
head(shuffled)

Now, when we take a look at the head of the dataset, we see that we’re getting a mix of different flowers, which is great!

Train and validation splits

Now that we’ve got the data shuffled, let’s partition it into a training and validation set. We’ll hold out 20 percent of the data points for the validation set. We can get the two partitions as follows:

n <- nrow(shuffled)
split <- floor(.8 * n)                # 80 percent of the rows for training
train <- shuffled[1:split,]           # first 80 percent
validate <- shuffled[(split + 1):n,]  # remaining 20 percent held out for validation

Defining a model

Once we have our data divided into train and validation splits, we can start training a model. We’ll first explore boosted decision trees for classification with an R package called xgboost. Let’s install it first.

install.packages("xgboost")
require(xgboost)

Now we can construct the model. The XGBoost interface requires that we separate the training data into the feature columns (X) and the species labels. To do this, we just select the corresponding columns from the data frame. We also need to convert the Species labels into the numerical values 0, 1, and 2, since XGBoost only handles numeric class labels. We’ll do the same for the validation set.

# Training set: feature columns and numeric class labels
dcols <- c(1:4)
lcols <- c(5)
train$Species <- as.character(train$Species)
train$Species[train$Species == "setosa"] <- 0
train$Species[train$Species == "virginica"] <- 1
train$Species[train$Species == "versicolor"] <- 2
X <- train[,dcols]
labels <- as.numeric(train[,lcols])   # xgboost expects numeric labels

# Validation set: same columns and label encoding
validate$Species <- as.character(validate$Species)
validate$Species[validate$Species == "setosa"] <- 0
validate$Species[validate$Species == "virginica"] <- 1
validate$Species[validate$Species == "versicolor"] <- 2
Xv <- validate[,dcols]
labelsV <- as.numeric(validate[,lcols])

Training the model

Now that we’ve gotten that out of the way, we can begin training the model. The interface for xgboost is super simple; we can train the model with a single call.

booster <- xgboost(data = as.matrix(X), label = labels, max.depth = 2, eta = 1, nthread = 2, nrounds = 2, objective = "multi:softmax", num.class = 3)
[1] train-merror:0.000000 
[2] train-merror:0.000000 

The parameters we used deserve some explanation.

  1. We set max.depth = 2 to use decision trees that have a maximum depth of 2. Deeper trees can give higher accuracy but are also higher in variance. Depth 2 should be a good balance for our purposes.
  2. eta specifies the learning rate for our model. This is a hyperparameter that requires some experimentation and tuning to get right (see the cross-validation sketch after this list). We’ll use 1 for this example.
  3. nthread specifies the number of CPU threads to use while training the model. This is a small dataset so we don’t need many threads, but we would for larger datasets.
  4. nrounds specifies the number of training epochs (passes) to perform over the data. We’ll start with 2.
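
One lightweight way to do that experimentation, rather than guessing values for eta and nrounds, is the xgboost package’s xgb.cv helper, which cross-validates a parameter set on the training data. A minimal sketch, assuming the X and labels objects built above (the grid of eta values here is purely illustrative):

# Cross-validate a few candidate learning rates on the training data.
# The eta grid below is arbitrary, chosen only for illustration.
for (eta in c(0.1, 0.3, 1)) {
  cv <- xgb.cv(params = list(objective = "multi:softmax",
                             num_class = 3,
                             max_depth = 2,
                             eta = eta),
               data = as.matrix(X),
               label = labels,
               nrounds = 10,
               nfold = 5,
               verbose = 0)
  # evaluation_log holds the mean cross-validated error for each round
  best <- min(cv$evaluation_log$test_merror_mean)
  print(paste("eta =", eta, "best test merror =", best))
}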

Evaluating model performance

Now that we’ve trained our boosted tree model, we can evaluate its performance on the validation set. To do this, we can use the predict function that comes packaged with xgboost in order to generate our predictions.

preds <- predict(booster, as.matrix(Xv))
head(preds)
[1] 1 1 0 0 1 0

You can see that our boosted model is predicting nicely. All that is left is to calculate its accuracy. The validation error is simply the fraction of entries of preds that do not match the corresponding entries of labelsV, which we can compute as the mean of the comparison vector.

err <- mean(preds != labelsV)
print(paste("validation-error=", err))
[1] "validation-error= 0.0495867768595041"
print(paste("validation-accuracy=", 1 - err))
[1] "validation-accuracy= 0.950413223140496"

Nice! The validation accuracy is about 95 percent, which is great performance from the model.
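
Overall accuracy hides which species get confused with which. As a quick optional check, base R’s table function turns the prediction and label vectors into a confusion matrix:

# Rows are predicted classes, columns are true classes
# (0 = setosa, 1 = virginica, 2 = versicolor, matching the encoding above).
table(predicted = preds, actual = labelsV)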

Visualizing our results

To finish off the project, we’ll visualize our predictions. First, we will plot the original data using a library called ggvis. Before that, though, we must install the package.

install.packages("ggvis")

Now, since we’re visually limited to two dimensions, we will choose to plot our classes vs. the Sepal Length and Width attributes. This can be done as follows.

library(ggvis)
data <- iris
data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()

To visualize how our model predicts in comparison, we’ll quickly run it across all data points, not just the validation set. To combine the training and validation data, we use the rbind function, which just joins the two data frames vertically. We also undo the work we did before, converting from numeric labels back to species names.

all_data <- rbind(X, Xv)
preds <- predict(booster, as.matrix(all_data))
all_data["Species"] <- preds
all_data$Species[all_data$Species == 0] <- "setosa"
all_data$Species[all_data$Species == 1] <- "virginica"
all_data$Species[all_data$Species == 2] <- "versicolor"
head(all_data)

Everything looks good! Let’s plot the final result, using the same code as before.

all_data %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>% layer_points()

We can visually compare the true plot with the plot of our predictions. There should only be a few points misclassified between the two!

To learn more about machine learning models in R, visit https://algorithmia.com/developers/algorithm-development/languages/r.

Algorithmia at Big Data London

People walking on the Millennium Bridge in London, designed by architect Norman Foster

Recently, Algorithmia ventured from Seattle to London to discover what was happening at the Big Data London (BDL) conference in Kensington. We had great conversations with data engineers, data analysts, and business leaders about how Algorithmia makes it easy to deploy machine learning models into production. Our platform handles the MLOps portion of the data science pipeline, so data scientists can focus instead on solving business problems with data.

Highlights from the booth

At BDL, we got the opportunity to talk with many companies about where they are in their ML journeys. While some are just starting to evaluate use cases and consider infrastructure requirements, it was encouraging to hear how they are planning to put their models into production. This step is important and often overlooked. You don’t want to choose a training platform, for instance, that locks you into a specific ecosystem. It’s better to use the best possible platform or service for each stage of your data science pipeline than to get locked into one that tries to do everything without excelling at any part of it.

We also talked to many data scientists who are at the stage where they have several models sitting on laptops waiting to be utilized in production but don’t know where to go from there. This is a very common scenario, and Algorithmia has white glove customer support to help you get models off laptops and into operation.

Of course, there are also engineers and business owners who are experiencing the same friction points that Algorithmia helps address in the MLOps workflow. These include versioning, model updating, centralized repositories, and of course dependency management and scaling.

If any of these stages of the ML roadmap resonate with you, come talk to us at AWS re:Invent where we can go into more detail about getting your models deployed to a scalable, reliable infrastructure today.

Special topics in big data

There were several core themes at the conference; the most popular turned out to be Data Governance, Self-Service Analytics, DataOps, Customer Experience Analytics, and of course Machine Learning and AI.

Crowd favorites included A GDPR Retrospective: Implementation by a Large-Scale Data Organization in Reality, which covered GDPR compliance from a technical standpoint rather than the business point of view taken by some of the other talks in that track. Another popular Data Governance talk, Data Governance as a Customer Service, focused on how data management is a customer service story, not just a technical one. Here at Algorithmia we feel the same way about model management!

As expected, there were some standout talks in the Keynote Theater. One of our favorites was from EY’s Chief Scientist Harvey Lewis, a leader in applied ML in business, who talked about the need for humans in the loop in the AI pipeline. Lewis covered use cases showing how important it is to pair humans with machine learning algorithms to ensure that inferences are accurate when it comes to safety, compliance, and risk in accounting, auditing, and consultancy firms.

Another big hit in the Keynote Theater was Making Everyone A Data Person At Lloyd’s. This talk focused on empowering users across teams within an organization to be more data-informed. The speakers discussed the Data Lab, an initiative within Lloyd’s Data that focuses on making everyone in the company data-literate through mentorship and training.

Our tracks of interest

One of the tracks with the longest queues was Self-Service Analytics. We know because the Algorithmia booth was right next to it, so we got the chance to chat with many folks waiting in line. A crowd favorite came from our friends at Tableau, who served up a great talk on how to explore data and gain actionable insights with natural language processing.

And of course, our favorite track: AILab, which hosted talks on everything from ethics in AI, to extracting actionable insights from machine learning. It also covered infrastructure and scaling modern machine learning systems. 

We’ve thought a lot about these subjects too, so be sure to read up on racial bias in AI, gaining insights from customer churn data, and scaling your machine learning models.

What was missing from the talks was substance on the difficulty of the deployment cycle. While scaling is important, making sure you can automate your ML deployment lifecycle is crucial. We’ve covered everything from shortening your deployment time to what makes cloud infrastructure crucial to machine learning.

That wraps up our take on our first experience at the Big Data London conference. And if you’re going to re:Invent next month, check out Diego Oppenheimer’s talk on continuous deployment, and don’t forget to set up a meeting to see how Algorithmia can enable your model deployment, serving, and management at scale.

5 machine learning models you should know

Depiction of a k-means clustering model (Medium)

Getting started with machine learning starts with understanding the how and why behind employing particular methods. We’ve chosen five of the most commonly used machine learning models on which to base the discussion.

AI taxonomy 

Before diving too deep, we thought we’d define some important terms that are often confused when discussing machine learning. 

  • Algorithm – A set of predefined rules used to solve a problem. For example, simple linear regression is a prediction algorithm used to find a target value (y) based on an independent variable (x). 
  • Model – The actual equation or computation that is developed by applying sample data to the parameters of the algorithm. To continue the simple linear regression example, the model is the equation of the line of best fit of the x and y values in the sample set plotted against each other.
  • Neural network – A multilayered algorithm consisting of an input layer, an output layer, and one or more hidden layers in between. The hidden layers are a series of stacked computations that iterate until the computer arrives at a final output. Neural networks are sometimes referred to as “black box” algorithms because humans don’t have a clear, structured view of how the computer is making its decisions.
  • Deep learning – Machine learning methods based on neural network architectures. “Deep” refers to the large number of hidden layers stacked between the input and output layers (sometimes more than 100).
  • Data science – A discipline that combines math, computer science, and business/domain knowledge. 

Machine learning methods

Machine learning methods are often broken down into two broad categories: supervised learning and unsupervised learning.

Supervised learning – Supervised learning methods are used to find a specific target, which must also exist in the data. The main categories of supervised learning include classification and regression. 

  • Classification – Classification models often have a binary target sometimes phrased as a “yes” or “no.” A variation on this model is probability estimation in which the target is how likely a new observation is to fall into a particular category. 
  • Regression – Regression models always have a numeric target. They model the relationship between a dependent variable and one or more independent variables. 

Unsupervised learning – Unsupervised learning methods are used when there is no specific target to find. Their purpose is to form groupings within the dataset or make observations about similarities. Further interpretation would be needed to make any decisions on these results. 

  • Clustering – Clustering models look for subgroups within a dataset that share similarities. These natural groupings are similar to each other, but different than other groups. They may or may not have any actual significance. 
  • Dimension reduction – These models reduce the number of variables in a dataset by grouping similar or correlated attributes.

It’s important to note that individual models are not necessarily used in isolation. It often takes a combination of supervised and unsupervised methods to solve a data science problem. For example, one might use a dimension-reduction method on a large dataset and then use the new variables in a regression model. 

To that end, model pipelining is the practice of splitting machine learning workflows into modular, reusable parts that can be coupled with other model applications to build more powerful software over time.

What are the most popular machine learning algorithms? 

Below we’ve detailed some of the most common machine learning algorithms. They’re often mentioned in introductory data science courses and books and are a good place to begin. We’ve also provided some examples of how these algorithms are used in a business context. 

Linear regression

Linear regression is a method in which you predict an output variable using one or more input variables. This is represented in the form of a line: y = bx + c. The Boston Housing Dataset is one of the most commonly used resources for learning to model using linear regression. With it, you can predict the median value of a home in the Boston area from 13 attributes, including crime rate per town, student/teacher ratio per town, and the average number of rooms per home.
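
As a concrete sketch in R: the MASS package bundles the Boston housing data as a data frame named Boston, with medv holding the median home value, so fitting and using a linear regression takes only a few lines (package and column names as found in MASS; adjust if your data comes from a different source):

# Fit a linear regression predicting median home value (medv)
# from all other attributes in the Boston housing data.
library(MASS)                  # provides the Boston data frame
fit <- lm(medv ~ ., data = Boston)
summary(fit)                   # coefficients, R-squared, etc.
predict(fit, Boston[1:3, ])    # predictions for the first three rows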

K-means clustering

K-means clustering is a method that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the individual conducting the analysis. Clustering is often used as a market segmentation approach to uncover similarity among customers or uncover an entirely new segment altogether. 

k-means clustering (Medium)
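
Base R’s kmeans function covers the simple case. A minimal sketch on the four iris measurements, using three clusters since we happen to know there are three species:

set.seed(42)                               # kmeans starts from random centroids
clusters <- kmeans(iris[, 1:4], centers = 3, nstart = 20)
table(cluster = clusters$cluster, species = iris$Species)  # compare clusters to true species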

Principal component analysis (PCA)

PCA is a dimension-reduction technique used to reduce the number of variables in a dataset by combining highly correlated variables (typically standardized to a common scale first) into a smaller set of components. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability.

A common application of PCA is aiding in the interpretation of surveys that have a large number of questions or attributes. For example, global surveys about culture, behavior or well-being are often broken down into principal components that are easy to explain in a final report. In the Oxford Internet Survey, researchers found that their 14 survey questions could be distilled down to four independent factors. 
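
A quick sketch with base R’s prcomp, again using the iris measurements; scale. = TRUE standardizes the variables before the components are extracted:

# Principal components of the four iris measurements.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # scores on the first two principal components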

K-nearest neighbors (k-NN)

Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It compares the distance (often Euclidean) between a new observation and the observations already in the dataset. The “k” is the number of neighbors to compare against and is usually tuned, for example by cross-validation, to minimize the chance of overfitting or underfitting the data.

In a classification scenario, the class held by the majority of the k nearest neighbors determines the class of the new observation; for this reason, k is often an odd number, to prevent ties. For a prediction model, the average of the targeted attribute across the neighbors becomes the predicted value for the new observation.

k-nearest neighbors depicted (ResearchGate)
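
The class package, which ships with standard R distributions, provides a simple knn function. A minimal sketch on an 80/20 split of the iris data, with k chosen arbitrarily for illustration:

library(class)                                  # provides knn()
set.seed(42)
idx <- sample(nrow(iris), 0.8 * nrow(iris))     # 80 percent of rows for training
train_x <- iris[idx, 1:4]
test_x <- iris[-idx, 1:4]
# Each test row is assigned the majority class among its 5 nearest training rows
preds <- knn(train = train_x, test = test_x, cl = iris$Species[idx], k = 5)
mean(preds == iris$Species[-idx])               # accuracy on the held-out rows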

Classification and regression trees (CART)

Decision trees are a transparent way to separate observations and place them into subgroups. CART is a well-known version of a decision tree that can be used for classification or regression. You choose a response variable and make partitions through the predictor variables. The computer typically chooses the number of partitions to prevent underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon due to inexplicability, because interested parties need to see the entire process behind a decision. 

Titanic survivor dataset (community.jmp)
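
The rpart package is the usual CART implementation in R. A small sketch growing a classification tree on the iris data; the printed tree shows every split, which is exactly the transparency described above:

# install.packages("rpart") first if it is not already available
library(rpart)
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                                   # each split, readable as text
predict(tree, iris[1:3, ], type = "class")    # predicted species for three rows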

How do I choose the best model for machine learning?

The model you choose for machine learning depends greatly on the question you are trying to answer or the problem you are trying to solve. Additional factors to consider include the type of data you are analyzing (categorical, numerical, or maybe a mixture of both) and how you plan on presenting your results to a larger audience. 

The five model types discussed herein do not represent the full collection of model types out there, but are commonly used for most business use cases we see today. Using the above methods, companies can conduct complex analysis (predict, forecast, find patterns, classify, etc.) to automate workflows.

To learn more about machine learning models, visit algorithmia.com.

Shorten deployment time with Algorithmia

Pie graph of ML projects deployed

Every year, millions of dollars are wasted planning, cleaning, and training machine learning (ML) models that will never get to production. More than half of data science projects are not fully deployed, and some never will be, resulting in zero generated revenue.

When organizations are asked about their machine learning business challenges, deployment difficulties are cited as the biggest obstacle. 

Reduce waste; increase value

The solution looks simple at first:

  • Make it fast and simple to deploy ML models
  • Reduce the learning curve
  • Stop asking data scientists to do DevOps
  • Automate where possible and measure the results
  • Reduce model deployment time from months (or years) to minutes

But let’s deep dive into how to make these solutions feasible.

Remove barriers to deployment

If a data science department is isolated, it may not have a reliable DevOps team and must learn to deploy its own models. However, when data scientists are tasked with deploying models, they face a number of challenges: 

  • They must learn a wide range of DevOps-related skills that are not part of their core competencies.
  • They will spend a lot of time learning to properly containerize models, implement complex serving frameworks, and design CI/CD workflows.

This pulls them away from their primary mission of designing and building models, and their efforts on the challenges above often meet with varying degrees of success.

But let’s say an IT or DevOps team is available; now the data scientists face a new set of challenges:

  • IT is used to working with conventional application deployments, which differ from ML in a number of ways, often requiring a unique “snowflake” environment for each individual model. 
  • Information security restrictions further complicate deployment, requiring various levels of isolation and auditing. Because ML models are opaque, and data science teams follow non-standard development practices, IT is often unable to add sufficient tooling, such as fine-grained logging or error-handling. 

From there, developers are typically borrowed from other teams to help solve these problems—for example, writing wrapper code to add company-standard authentication—which can cause further slowdowns and resource consumption.

ML infrastructure layout

Reduce the learning curve

To succeed in ML efforts, companies must reduce the breadth of knowledge each individual team is responsible for, allowing them to specialize in their own practices. When they are able to do so, the learning curve for each team can be reduced and they can quickly scale up their own activities and projects.

Stop asking data scientists to do DevOps

A key mechanism for enabling this is a managed platform for the deployment, serving, and management of ML models. This platform provides the following benefits:

  • Separation of concerns: data scientists can focus on model building and app developers can focus on integration.
  • Low DevOps: managed platforms require minimal oversight and DevOps never need to be involved in the deployment of an individual model.
  • Reduced manual tool-building: authentication, data connectors, monitoring, and logging are built-in.

Selecting the right platform for ML is critical. At minimum, it should:

  • Provide deployment, serving, and model management within a single environment to enhance accessibility and reduce tool thrashing.
  • Allow app developers and other consumers to quickly locate, test, and integrate appropriate models.
  • Support any language and framework so data scientists are not constrained in their development.
  • Allow data scientists to write code, not containers, so they can remain focused at the model level.
  • Not require any deep proprietary changes to models, cause vendor lock-in, or tie model-serving to a specific model-training platform.
  • Embed within a choice of private cloud, with full portability between infrastructure providers.
  • Work with existing choices for CI/CD and other DevOps tooling.

ML infrastructure diagram

Go the last mile

With the problems of model deployment and serving contained, your company can focus on creating and tracking value and ROI.

By providing application developers with a searchable, structured way of locating models, cross-departmental communication barriers can be reduced to zero. The instant that the ML team releases a model for general consumption, it becomes discoverable and usable by developers. Developers can consume the model with a global API and have cut-and-paste code ready to drop into any application or service.

As models are added and consumed, oversight becomes key. Monitoring, logging, and showbacks allow for seamless continuous operation while demonstrating the value of each model. Companies can properly allocate resources across teams and prove ROI on each individual project.

ML infrastructure plugging into apps and code

Start deploying today

Don’t become the alchemy Gartner warned about: “Through 2020, 80 percent of AI projects will remain alchemy, run by wizards whose talents will not scale in the organization” (Gartner). 

Take stock of your company-wide ML initiatives. If you’re not deploying all of your models into production, Algorithmia can help you get there. If your data science teams are running their own DevOps, or your IT team is being overloaded with ML needs, our managed solution is the right tool to get your model productionization back on track.

Algorithmia is the leader in machine learning deployment, serving, and management. Our product deploys and manages models for Fortune 100 companies, US intelligence agencies, and the United Nations. Our public algorithm marketplace has helped more than 90,000 engineers and data scientists deploy their models.

Continuous deployment for machine learning—a re:Invent talk by Diego Oppenheimer

Algorithmia coming to AWS re:Invent

Meet with us!

AWS re:Invent is next month, and we are pleased to announce that Algorithmia CEO Diego Oppenheimer will be speaking on the new software development lifecycle (SDLC) for machine learning. We often get variations on this question: how can we adapt our infrastructure, operations, staffing, and training to meet the challenges of ML without throwing away everything that already works? Diego is prepared with answers. His talk will cover how machine learning will fundamentally change the way we build and maintain applications.

Currently, many data science and ML deployment teams are struggling to fit an ML workflow into tools that don’t make sense for the job. This session will help clarify the differences between traditional and ML-driven SDLCs, cover common challenges that need to be overcome to derive value from ML, and provide answers to questions about current technological trends in ML software. Finally, Diego will outline how to build a process and tech stack to bring efficiency to your company’s ML development. 

Diego’s talk will be on 4 December at 1:40pm in the Nuvola Theater in the Aria. 

Coming soon: the 2020 State of Enterprise Machine Learning Report

Additionally, Diego will share insights from our upcoming 2020 State of Enterprise Machine Learning survey report, which will be an open-source guide for how the ML landscape is evolving. The report will focus on these findings:

  1. Shifts in the number of data scientists employed at companies in all industries and what that portends for the future of ML
  2. Use case complexity and customer-centric applications in smaller organizations
  3. ML operationalization (having a deployed ML lifecycle) capabilities (and struggles) across all industries
  4. Trends in ML challenges: scale, version-control, model reproducibility, and aligning a company for ML goals
  5. Time to model deployment and wasted time
  6. What determines ML success at the producer level (data scientist and engineer) and at the director and VP level

Pick up a copy of the report at Algorithmia’s booth.

Diego and his team will be available throughout the week at Booth 311 to answer questions about infrastructure specifics, ML solutions, and new use cases.

Meet with our team

If you or your team will be in Las Vegas for re:Invent this year, we want to meet with you. Our sales engineers would love to cater a demo of Algorithmia’s product for your specific needs and demonstrate our latest features. Book some time with us!

Read the full press report here.