Algorithmia Blog - Deploying AI at scale

Protecting your machine learning system from attack

Algorithmia castle with a moat defending model tampering arrows.

Machine learning model security is not discussed enough. In serverless GPU-attached environments, object storage solutions like S3 are dependable for persisting your model files. Other than the URI, no information about the model file is saved in the source code.

This exposes an interesting angle to potentially attack an ML system in production. An exposed API key or backdoor could allow hackers to replace a model file with their own. Below we’ll talk about why we can’t trust models coming from untrusted sources. We will also demonstrate a quick and simple process to authenticate models before loading them into memory.

What are the security implications?

Using open-source tools like TensorFlow potentially exposes your platform to cyber attacks. Even though open-source communities are known to quickly patch up bugs, this time delta may be more than enough time for hackers to initiate an attack or drastically affect business operations.

If your models are analyzing sensitive information like credit card information for fraud detection or scanning legal documents to help with discovery for a court case, hackers could use an exploit to export this information back to themselves.

The question of why models haven’t been better protected until now has a simple answer. As with most emerging technologies, most companies do not or cannot determine what these exploits might be. As the industry matures, these security measures need to be implemented quickly, especially for models that process highly sensitive data.

According to TensorFlow’s official documentation, TensorFlow models are essentially programs and aren’t sandboxed within TensorFlow. A sandboxed application wouldn’t have access to files outside its environment and wouldn’t be able to communicate over the network. As it is, TensorFlow can read and write files, send and receive data over the network, and spawn additional processes, all of which can be exploited in an attack.

The documentation summarizes that “TensorFlow models are programs, and need to be treated as such from a security perspective.”

Authenticating model metrics

Another place where authentication can take place is during continuous integration, where metrics like F1 scores are calculated. This ensures that the model being tested in CI and the model being deployed in production are the same. Authenticating models prevents data scientists from accidentally overwriting model files and prevents fraudulent models from getting into production.

Authentication by computing model hash

Model inference source code is version-controlled via git. Since model files can be huge (several GBs), they must be stored in scalable object/blob storage systems, such as S3. Even though some object-storage services do offer the benefit of keeping track of file hashing, this isn’t the standard case across all services and may not be exposed through the service you’re using to deploy your model. 

Simple file-based hash authentication

Because of the potential for model tampering, it makes sense to calculate the model file hash right after training and hard-code it into the source code to detect tampering with the model file in flight. This allows the inference service to verify the model file at runtime, before executing any TensorFlow model code. This is especially important when the only thing the source code records about the model file is its filename.
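The check described above can be sketched with Python’s standard library. This is a minimal illustration, not the notebook’s actual code: the function names are hypothetical, SHA-256 is an assumed choice of hash, and the expected digest is assumed to have been recorded right after training.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_hex):
    """Raise ValueError unless the file's digest matches the hard-coded one.

    Hypothetical helper: call it before handing the file to any model loader.
    """
    actual = sha256_of_file(path)
    # compare_digest does a constant-time comparison of the two hex strings.
    if not hmac.compare_digest(actual, expected_hex):
        raise ValueError(f"model hash mismatch for {path}")
    return path
```

The inference service would call verify_model_file on the downloaded file and load the model only if no exception is raised.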

Advanced weight-based hash authentication

Another way to calculate hashes is to use the weights that are provided in model files. The benefit to this approach is that it would be independent of model format and would work across different frameworks.

Fingerprinting models in this approach would provide consistency, reliability, and reproducibility, and protect an ML system from vulnerabilities. 
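As a rough sketch of the weight-based idea, the snippet below hashes the weights themselves rather than the file bytes. It assumes the weights have already been extracted from the model into a mapping of layer name to a flat list of floats; that extraction step, the function name, and the choice of 64-bit little-endian encoding are all assumptions made for illustration.

```python
import hashlib
import struct


def weights_fingerprint(weights):
    """Hash a mapping of layer name -> flat list of float weights.

    Layers are visited in sorted name order and values are packed as
    little-endian 64-bit floats, so the digest does not depend on the
    file format or on dictionary ordering.
    """
    digest = hashlib.sha256()
    for name in sorted(weights):
        values = weights[name]
        digest.update(name.encode("utf-8"))
        # Include the length so layer boundaries cannot be shifted unnoticed.
        digest.update(struct.pack("<Q", len(values)))
        digest.update(struct.pack(f"<{len(values)}d", *values))
    return digest.hexdigest()
```

One caveat with this approach: two frameworks must export the weights at the same precision and in the same layout for their fingerprints to agree.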

Model authentication demonstration on Algorithmia

Documentation walkthrough of model authentication on Algorithmia

We have implemented the simple file-based hash authentication method into our Jupyter notebook. The example trains a simple MNIST model, saves and calculates the hash of the model, deploys the model with the hash, and runs model authentication at runtime before running the model.

Ensure your ML models in production haven’t been hotswapped without anyone noticing. 

Model security with Algorithmia 

As machine learning becomes part of standard software development, vulnerabilities and new methods of attack are surfacing. Fortunately, Algorithmia has built-in security features to prevent model tampering and we are committed to stewarding these practices in the enterprise for everyone to benefit from. Algorithmia aims to empower every organization to achieve its full potential through the use of artificial intelligence and machine learning. 

Read more about ML security

Robust Physical-World Attacks on Deep Learning Visual Classification (arxiv) 

Adversarial Examples for Malware Protection (Patrick McDaniel)

Federated learning and securing data (Digitalist) 

Taking a closer look at machine learning techniques 

dendrogram graph outline

Analytic thinking has become a necessary skill for almost everyone working in a business environment. Although data scientists and analysts may be more intimately involved in handling and manipulating data, managers, executives, and other business leaders will be the ones making decisions based on a technical team’s insights and findings. 

Becoming a data-driven business requires everyone in an organization to understand the principles behind data science, including the machine learning techniques that transform raw data into insightful information. The purpose of this piece is to provide managers and aspiring data scientists an overview of the different methods that can be used to solve business questions.

What is a machine learning model?

In discussions about data science and machine learning, the term “model” is often thrown around. A machine learning model is the actual computation or equation that a data scientist develops after applying training data to an algorithm. It is the result of later steps in a data science project.

This piece will focus on machine learning techniques or ways of approaching problems along with examples of algorithms that fall into these categories. We’ll also mention a few real-world machine learning models for clearer examples of how algorithms have been applied in the enterprise. 

What data does machine learning use?

Machine learning models can be developed with almost any kind of data you can imagine. It can be numerical or categorical. More specifically, text-based data can be used for sentiment analysis, while images can be used to develop facial emotion recognition models.  

Data for machine learning comes from various sources including internal databases within an organization (most likely proprietary), and public or open-source datasets. What kinds of data an individual will use to develop a machine learning model will depend on the business question they are trying to solve.

What are some popular machine learning methods?

Machine learning techniques are often broken down into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised and unsupervised learning are more commonly used in the business context, so our focus will be on techniques from these two categories. 

Supervised learning methods are used to find a specific target (numerical or categorical), which must also exist in the data. Unsupervised methods are employed when there is no specific target in mind. They are often used to uncover patterns or natural groupings in the data. We’ll note that there are some algorithms that could fall into either category depending on the specificity of the question being asked. 

Supervised learning: classification vs regression 

Regression and classification are the two main subcategories of supervised learning. While both are predictive methods, regression has a numerical output, while classification predicts the category that a new observation would fall into. This is often a binary output, but you can create models for more than two categories. A variation of classification known as class probability estimation is a numerical output (from 0 to 1) of how likely it is that a new observation will fall into a particular category.

  • Linear regression With linear regression, you can predict an output variable using one or more input variables. This is represented in the form of a line: y=bx+c. Linear regression models are one of the most familiar types of models, as many people have been exposed to linear equations as a part of their math education.
  • Support vector machine (SVM) SVM can be used for regression or classification. Linear SVM works by maximizing the distance between classes and drawing a line down the middle. New data is categorized by which side of that line it falls on. Non-linear SVM is used for more complex boundaries (like those with exponents) to more accurately find the widest margin between classes.
  • Logistic regression Despite the name, logistic regression is a classification algorithm; more specifically, it performs a class probability estimation task. The output of a linear equation is interpreted as the log-odds (a number that ranges from -∞ to ∞) of a new event being a member of a particular class. A logistic function then translates the log-odds into the probability (a number from 0 to 1) of a new item being a member of the class.
  • Decision tree Decision trees are a supervised segmentation technique that places observations in the data into subgroups. 
    • CART is a well-known version of a decision tree that can be used for classification or regression. Once the data scientist chooses a response variable, the computer program will make partitions through the predictor variables. The program automatically chooses the number of partitions to prevent underfitting or overfitting the data to the model. Decision trees are useful in situations where  interested parties need to see the entire logical reasoning behind a decision.
  • Random forest Simply put, a random forest is a group of decision trees that all have the same response variable, but slightly different predictor variables. The output of a random forest model is calculated by taking a “vote” of the predicted classification for each tree and having the forest output the majority opinion.  
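The majority vote in a random forest can be illustrated in a few lines. In this sketch each "tree" is a hypothetical stand-in (a one-rule function) rather than a trained decision tree, and the feature names are invented for the example.

```python
from collections import Counter


def forest_predict(trees, observation):
    """Collect one predicted class per tree and return the majority class."""
    votes = [tree(observation) for tree in trees]
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical one-rule "trees", each looking at a different predictor.
trees = [
    lambda obs: "approve" if obs["income"] > 50_000 else "decline",
    lambda obs: "approve" if obs["debt_ratio"] < 0.4 else "decline",
    lambda obs: "approve" if obs["years_employed"] >= 2 else "decline",
]

applicant = {"income": 60_000, "debt_ratio": 0.5, "years_employed": 3}
print(forest_predict(trees, applicant))  # two of three trees vote "approve"
```

Using an odd number of trees avoids ties in a two-class problem; with a tie, this sketch simply returns whichever tied class was voted for first.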

Unsupervised methods 

  • Clustering Clustering refers to machine learning techniques that are used to find natural groupings among observations in a dataset. Also known as unsupervised segmentation, clustering techniques have two main types: hierarchical and k-means.
    • Hierarchical clustering This method produces a tree-shaped structure known as a dendrogram. Each node in the dendrogram is a cluster based on the similarity of the observations in it. Agglomerative hierarchical clustering is a bottom-up approach that starts with each observation as its own cluster. As you move up the tree, the number of clusters becomes smaller until the top node contains every observation. The opposite is divisive clustering, in which all observations begin in one cluster and you divide downward until you reach the desired number of clusters. One of the most well-known hierarchical visualizations is the “Tree of Life” dendrogram that charts all life on earth.
    • K-means clustering K-means clustering is a machine learning algorithm that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the individual conducting the analysis based on domain knowledge. This type of clustering is often used in marketing and market research as an approach to uncover similarity among customers or to uncover a previously unknown segment.
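To make the centroid idea concrete, here is a minimal k-means sketch in plain Python. It assumes two-dimensional points and that the analyst supplies the starting centroids (and therefore k); production libraries choose starting points more carefully and stop on convergence rather than after a fixed number of passes.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

With these six points the two centroids settle near the middle of each visible group, which is the "geometric center" the description refers to.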

Other machine learning techniques  

The following machine learning techniques can be applied to regression or classification problems. 

Data reduction 

Data reduction algorithms reduce the number of variables in a data set by grouping similar or correlated attributes.

  • Principal Component Analysis (PCA) PCA is a commonly used dimension reduction technique that groups together variables that are measured on the same scale and are highly correlated. Its purpose is to distill the dataset down to a new set of variables that can still explain most of its variability. PCA is often used in the analysis of large survey datasets. This technique makes interpreting these kinds of surveys much simpler and allows researchers to make assertions about behaviors.
  • Similarity matching A similarity matching algorithm attempts to find similar individuals or observations based on the information that is already known about them. For example, a bank might use a similarity matching algorithm to find customers best suited for a new credit card based on the attributes of customers who already have the card.
  • K-Nearest Neighbor (KNN) Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It is a comparison of distance (often Euclidean or Manhattan) between a new observation and those already in a dataset. The “k” is the number of neighbors to compare and is usually chosen by the computer to minimize the chance of overfitting or underfitting the data. In a classification scenario, the new observation is assigned the class held by the majority of its k nearest neighbors. For this reason, k is often an odd number to prevent ties. For a prediction model, an average of the targeted attribute of the neighbors predicts the value for the new observation. In the previous banking and credit card scenario, a classification output might be a simple yes or no to extend an offer. A prediction output might be the initial credit card limit offered to the customer.
  • Link prediction This method tries to predict the likelihood and strength of a connection between two items. It is often used for recommendations on social networking and e-commerce platforms. For example, if two unconnected people share a large number of mutual connections, a link-prediction model may suggest that these two people connect.
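The nearest-neighbor reasoning from the list above can be sketched as follows. The data layout (a list of feature-tuple and target pairs) and the function names are assumptions for illustration, and Euclidean distance is used throughout.

```python
import math
from collections import Counter


def k_nearest(train, new_point, k):
    """Return the k training pairs whose features are closest to new_point."""
    return sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]


def knn_classify(train, new_point, k=3):
    """Classification: majority class among the k nearest neighbors."""
    neighbors = k_nearest(train, new_point, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def knn_predict(train, new_point, k=3):
    """Prediction: average of the numeric target among the k nearest neighbors."""
    neighbors = k_nearest(train, new_point, k)
    return sum(target for _, target in neighbors) / len(neighbors)
```

In the credit card scenario, knn_classify would answer the yes-or-no question of whether to extend an offer, while knn_predict would estimate the initial credit limit.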

Combined methods 

Business problems are complex, and you may find that you’ll need to use multiple machine learning techniques to achieve your goal. An important part of data science is understanding how these algorithms work together to answer questions. For example, a data scientist might use PCA in the development of a regression model by first combining similar variables to make the analysis more manageable. 

What are the most important machine learning algorithms?

It’s hard to say what is the most important or best machine learning algorithm or whether there even is one. The methods you use will depend on your specific project needs and the data you have available. A critical skill for anyone interested in using machine learning in the business environment is knowing how to organize a data science project and thinking about which algorithms and techniques it should be approached with. 

Keep Learning

A deeper dive into supervised and unsupervised learning.

A look at random forest algorithms.

Open-source machine learning tools for use in data science projects.

How machine learning works

Eye drawing over a brain outline

The early stages of machine learning saw experiments involving theories of computers recognizing patterns in data and learning from them. Today, after building upon those foundational experiments, machine learning is more complex. 

While machine learning algorithms have been around for a long time, the ability to apply complex algorithms to big data applications more rapidly and effectively is a more recent development. Being able to do these things with some degree of sophistication can set a company ahead of its competitors.   

How does machine learning work?

Machine learning is a form of artificial intelligence (AI) that teaches computers to think in a similar way to how humans do: learning and improving upon past experiences. It works by exploring data and identifying patterns, and it involves minimal human intervention.

Almost any task that can be completed with a data-defined pattern or set of rules can be automated with machine learning. This allows companies to transform processes that were previously only possible for humans to perform—think responding to customer service calls, bookkeeping, and reviewing resumes. 

two machine learning techniques: supervised and unsupervised

Machine learning uses two main techniques:

  • Supervised learning trains a model on labeled data, where each example is paired with a known outcome, so that the model can predict that outcome for new data. Supervised learning is exciting because it works in much the same way humans actually learn.

In supervised tasks, we present the computer with a collection of labeled data points called a training set (for example, a set of readouts from a system of train terminals, with markers showing where delays occurred in the last three months).

  • Unsupervised machine learning helps you find all kinds of unknown patterns in data. In unsupervised learning, the algorithm tries to learn some inherent structure in the data with only unlabeled examples. Two common unsupervised learning tasks are clustering and dimensionality reduction.

In clustering, we attempt to group data points into meaningful clusters such that elements within a given cluster are similar to each other but dissimilar to those from other clusters. Clustering is useful for tasks such as market segmentation.

Dimension reduction models reduce the number of variables in a dataset by grouping similar or correlated attributes for better interpretation (and more effective model training).

How is machine learning used?

From automating tedious manual data entry, to more complex use cases like insurance risk assessments or fraud detection, machine learning has many applications, including client-facing functions like customer service, product recommendations (see Amazon product suggestions or Spotify’s playlisting algorithms), and internal applications inside organizations to help speed up processes and reduce manual workloads.  

A major part of what makes machine learning so valuable is its ability to detect what the human eye misses. Machine learning models are able to catch complex patterns that would have been overlooked during human analysis. 

Thanks to cognitive technology like natural language processing, machine vision, and deep learning, machine learning is freeing up human workers to focus on tasks like product innovation and perfecting service quality and efficiency. 

You might be good at sifting through a massive but organized spreadsheet and identifying a pattern, but thanks to machine learning and artificial intelligence, algorithms can examine much larger sets of data and understand patterns much more quickly.

How machine learning mimics human analysis.

What is the best programming language for machine learning?

Most data scientists are at least familiar with how R and Python programming languages are used for machine learning, but of course, there are plenty of other language possibilities as well, depending on the type of model or project needs. Machine learning and AI tools are often software libraries, toolkits, or suites that aid in executing tasks. However, because of its widespread support and multitude of libraries to choose from, Python is considered the most popular programming language for machine learning. 

In fact, according to GitHub, Python is number one on the list of the top machine learning languages on their site. Python is often used for data mining and data analysis and supports the implementation of a wide range of machine learning models and algorithms. 

Supported algorithms in Python include classification, regression, clustering, and dimensionality reduction. Though Python is the leading language in machine learning, there are several others that are very popular. Because some ML applications use models written in different languages, frameworks like Algorithmia’s serverless microservices architecture come into play to allow models to be built in multiple languages and seamlessly pipelined together.

The bottom line

Machine learning can provide value to consumers as well as to enterprises. An enterprise can gain insights into its competitive landscape and customer loyalty and forecast sales or demand in real time with machine learning. 

Our machine learning platform has built-in tools for versioning, deployment, pipelining, and integrating with customers’ current workflows. Algorithmia integrates with any technology your organization is currently using, fitting in seamlessly to make machine learning deployment a breeze, getting you from model building to productionization much faster. 

If you’re already implementing machine learning in your enterprise or you’d like to start, see how Algorithmia can help.

Further Reading

Machine Learning Use Cases

More on types of supervised and unsupervised models

Unsupervised learning and emotion recognition

Instant, repeatable ML model deployment using DevOps principles

Algorithmia is DevOps for ML

DevOps engineers are delighted when they find a product or service that fits in with their already refined CI/CD processes. When choosing a new pipeline tool, engineers weigh a number of factors. Ease of use, repeatable processes, and a solid support model are key. Choosing a continuous integration tool for machine learning model deployment is no different.

Algorithmia puts DevOps up front

When a company starts with a DevOps perspective from the onset, the finished product can more easily complement new deployment challenges every day. Algorithmia recognizes model deployment obstacles and takes those key aspects of DevOps to heart in its ML platform, which offers instant, repeatable deployments, thereby removing the burden that typically falls on data scientists. 

By using existing DevOps concepts centered on deployment, security, and scalability, Algorithmia overcomes many of the hurdles necessary to decrease the time it takes for ML models to make it to production.

Algorithmia also adheres to DevOps principles in providing a fully scalable production environment for hosting completed machine learning models. Since much of the underlying technology is run on tried and true products and processes, important functions like auditing and traceability are in place up front.

Future-proofing model management 

From the start you will notice that models are maintained in a familiar repository structure (Git). These models can be consumed through most major programming languages data scientists use, such as Python, R, and Java, as well as through Algorithmia’s own CLI client.

Most big data spans vast amounts of storage, and giving your models access to that data should not force you to move from one cloud to another. In addition to providing your own, secure area for Hosted Data Collections, the data you already host in Amazon S3, Google Cloud Storage, Azure, and even Dropbox is easily accessible by your models from within your Algorithmia account.

Model collaboration made easy

To complement its deployment capabilities, Algorithmia has included a number of other features meant to enable collaboration inside (or even outside) your organization. Publishing your models and sharing them with other users can help build more advanced applications and is paramount to preventing tech silos in decentralized organizations. When you manage your model portfolio on Algorithmia, you control how and where your data is shared.

DevOps engineers strive to remove barriers that block innovation in all aspects of software engineering. Now faced with an additional task of deploying a myriad of AI models, that same attitude will ensure data science products will open up even more opportunities for data exploration and use. 

Remove deployment barriers

Luckily, deploying machine learning models with Algorithmia can be just another process intrinsic to the tenets of DevOps engineers. Algorithmia recognizes that there would be significant challenges in the field of machine learning model deployment otherwise. Data scientists would take on unnecessary aspects of infrastructure (that could be easily handled by the platform) to ensure their models would complement existing DevOps procedures. On the DevOps side, Algorithmia recognized that without a deployment platform, DevOps engineers might be deploying something foreign to them. For that reason, the Algorithmia platform is the natural approach to the new and ever-evolving field of machine learning for DevOps engineers and data scientists alike.

Further resources

The Algorithmia Learning and Training Center

Machine learning infrastructure best practices

Permissioning your algorithms on Algorithmia

Multiclass classification in machine learning

Ice cubes being sorted by opacity. Result: Clear ice cubes in one tray and opaque ice cubes in another.

What is multiclass classification? 

Multiclass classification is a classification task that consists of more than two classes (e.g., using a model to identify animal types in images from an encyclopedia). In multiclass classification, a sample can only have one class (i.e., an elephant is only an elephant; it is not also a lemur).

Outside of regression, multiclass classification is probably the most common machine learning task. In classification, we are presented with a number of training examples divided into K separate classes, and we build a machine learning model to predict to which of those classes previously unseen data belongs (e.g., the animal types from the example above). In seeing the training data, the model learns patterns specific to each class and uses those patterns to predict the membership of future data.

Multiclass classification use cases

For example, a cybersecurity company might want to be able to monitor a user’s email inbox and classify incoming emails as either potential phishing attempts or not. To do so, it might train a classification model on the email texts and inbound email addresses and learn to predict from which sorts of URLs threatening emails tend to originate.

As another example, a marketing company might serve an online ad and want to predict whether a given customer will click on it. (This is a binary classification problem.)

How classifier machine learning works

Hundreds of models exist for classification. In fact, it’s often possible to take a model that works for regression and make it into a classification model. This is basically how logistic regression works. We model a linear response WX + b to an input and turn it into a probability value between 0 and 1 by feeding that response into a sigmoid function. We then predict that an input belongs to class 1 if the model outputs a probability greater than 0.5 and belongs to class 0 otherwise.
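A minimal sketch of this step, assuming the weights W and bias b have already been learned and taking class 1 to be the class whose probability the model outputs (the function names and the example numbers are illustrative only):

```python
import math


def sigmoid(z):
    """Map a real-valued linear response to a probability between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(weights, bias, features, threshold=0.5):
    """Return (predicted class, probability of class 1) for one input."""
    z = sum(w * x for w, x in zip(weights, features)) + bias  # WX + b
    p = sigmoid(z)
    return (1 if p > threshold else 0), p


label, probability = predict_class([2.0, -1.0], 0.5, [1.0, 1.0])
```

Here the linear response is 1.5, so the probability is above 0.5 and the input is assigned to class 1.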

Another common model for classification is the support vector machine (SVM). An SVM works by separating the data into classes with a hyperplane, often after implicitly projecting the data into a higher-dimensional space. A single SVM does binary classification and can differentiate between two classes. To differentiate between K classes, one common approach (one-vs-rest) uses K SVMs, each predicting membership in one of the K classes versus all the others.
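One common way to combine binary classifiers into a multiclass one is the one-vs-rest scheme: train one binary scorer per class and pick the class whose scorer is most confident. The sketch below assumes the scorers are already trained; the linear scorers, weights, and animal labels are hypothetical stand-ins for real SVM decision functions.

```python
def linear_scorer(weights, bias):
    """Build a scoring function w.x + b, standing in for a trained binary SVM."""
    def score(features):
        return sum(w * x for w, x in zip(weights, features)) + bias
    return score


def one_vs_rest_predict(scorers, features):
    """scorers maps class label -> scoring function; a higher score means
    'more likely this class than the rest'. Return the best-scoring label."""
    return max(scorers, key=lambda label: scorers[label](features))


# Hypothetical scorers for three classes over two features.
scorers = {
    "elephant": linear_scorer([1.0, 0.0], -5.0),
    "lemur": linear_scorer([-1.0, 1.0], 0.0),
    "zebra": linear_scorer([0.0, -1.0], 1.0),
}
print(one_vs_rest_predict(scorers, [9.0, 1.0]))
```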

Naive Bayes in ML classifiers

Within the realm of natural language processing and text classification, the Naive Bayes model is quite popular. Its popularity arises in large part from how simple it is and how quickly it trains. In the Naive Bayes classifier, we use Bayes’ Theorem to break down the joint probability of membership in a class into a series of conditional probabilities.

The model makes the naive assumption (hence Naive Bayes) that all the input features to the model are mutually independent given the class. While this is rarely true in practice, it’s often a good enough approximation to get the results we want. The probability of class membership then breaks down into a product of probabilities, and we just classify an input X as class k if k maximizes this product.
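The product-of-probabilities rule can be sketched for text classification as below. This is an illustrative word-count variant with add-one smoothing, which is an assumption not spelled out in the text; the product is computed as a sum of logarithms so that many small probabilities do not underflow.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents):
    """documents: list of (list_of_words, label). Returns the counts needed to predict."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in documents:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, words):
    """Return the class k that maximizes P(k) times the product of P(word | k)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log P(k)
        total_words = sum(word_counts[label].values())
        for word in words:
            # Add-one smoothing so an unseen word does not zero out the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Training is just counting, which is why the model trains so quickly.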

Deep learning classification examples

There also exist plenty of deep learning models for classification. Almost any neural network can be made into a classifier by simply tacking a softmax function onto the last layer. The softmax function creates a probability distribution over K classes, and produces an output vector of length K. Each element of the vector is the probability that the input belongs to the corresponding class. The most likely class is chosen by selecting the index of that vector having the highest probability.
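The softmax step described above is small enough to write out directly. In this sketch the input list stands for the raw outputs of a network’s last layer (often called logits); the function names are illustrative.

```python
import math


def softmax(logits):
    """Turn K raw scores into a probability distribution over K classes."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_index(logits):
    """Return (index of the most likely class, full probability vector)."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs


index, probs = predict_index([1.0, 3.0, 0.5])
```

For these scores the second element is largest, so index is 1, and the entries of probs sum to 1.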

While many neural network architectures can be used, some work better than others. Convolutional Neural Networks (CNNs) typically fare very well on classification tasks, especially for images and text. A CNN extracts useful features from data, particularly ones that are robust to translation and, to a lesser degree (often with the help of data augmentation), to scaling and rotation. This helps it recognize images that may be rotated, shrunken, or off-center, allowing it to achieve higher accuracy in image classification tasks.

Unsupervised classification

While nearly all typical classification models are supervised, you can think of unsupervised classification as a clustering problem. In this setting, we want to assign data into one of K groups without having labeled examples ahead of time (just as in unsupervised learning). Classic clustering algorithms such as k-means, k-medoids, or hierarchical clustering perform well at this task.

Keep learning

A guide to reinforcement learning

What is sentiment analysis

How do microservices work?