All posts by Besir Kurtulmus

Protecting your machine learning system from attack


Machine learning model security is not discussed enough. In serverless, GPU-attached environments, object storage solutions like S3 are the dependable choice for persisting model files. Other than the URI, no other relevant information about the model file is saved in the source code.

This exposes an interesting angle for attacking an ML system in production. An exposed API key or a backdoor could allow hackers to replace a model file with one of their own. Below we’ll talk about why model files from unverified sources can’t be trusted, and demonstrate a quick and simple process for authenticating models before loading them into memory.

What are the security implications?

Using open-source tools like TensorFlow potentially exposes your platform to cyber attacks. Even though open-source communities are known to patch bugs quickly, the window between a vulnerability’s discovery and its fix may be more than enough time for hackers to initiate an attack or drastically disrupt business operations.

If your models are analyzing sensitive information, like credit card transactions for fraud detection or legal documents to help with discovery in a court case, hackers could use an exploit to exfiltrate that information.

The question of why models haven’t been better protected until now has a simple answer: as with most emerging technologies, most companies cannot yet anticipate what these exploits will look like. As the industry matures, these security measures need to be implemented quickly, especially for models that process highly sensitive data.

According to TensorFlow’s official security documentation, TensorFlow models are basically programs and aren’t sandboxed within TensorFlow. A sandboxed application wouldn’t have access to files outside its environment and wouldn’t be able to communicate over the network. As it stands, TensorFlow can read and write files, send and receive data over the network, and spawn additional processes, all of which a malicious model could abuse.

The documentation summarizes that “TensorFlow models are programs, and need to be treated as such from a security perspective.”

Authenticating model metrics

Another place where authentication can take place is during continuous integration, where metrics like F1 scores are calculated. Verifying the hash there ensures that the model tested in CI and the model deployed to production are one and the same. Authenticating models prevents data scientists from accidentally overwriting model files and keeps fraudulent models from reaching production.
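For instance, a CI job can pin the digest recorded at training time and fail the build whenever the artifact under test differs. The sketch below assumes pytest; the model path and expected hash are hypothetical placeholders, not values from our notebook.

```python
# test_model_integrity.py -- a minimal sketch of a CI-time authenticity check.
# MODEL_PATH and EXPECTED_SHA256 are illustrative placeholders: record the
# real digest when training finishes and commit it with the inference code.
import hashlib

MODEL_PATH = "models/mnist_model.h5"
EXPECTED_SHA256 = "replace-with-digest-recorded-at-training-time"

def sha256_of_file(path, chunk_size=1 << 16):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def test_model_matches_training_digest():
    # Fails the pipeline if the artifact differs from the one we trained.
    assert sha256_of_file(MODEL_PATH) == EXPECTED_SHA256
```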

Authentication by computing model hash

Model inference source code is version-controlled via git, but since model files can be huge (several gigabytes), they must be stored in scalable object/blob storage systems such as S3. Even though some object-storage services do keep track of file checksums, this isn’t standard across all services and may not be exposed through the service you’re using to deploy your model.

Simple file-based hash authentication

Because of the potential for tampering, it makes sense to calculate the model file’s hash right after training and hard-code it into the source code, preventing the file from being swapped out in flight. The inference service can then verify the model file at runtime, before executing any TensorFlow model code. This matters most when the filename is otherwise the only reference to the model file in the source code.
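Here is a minimal sketch of that runtime gate, assuming a Keras-format model file; the filename and pinned digest are placeholders for your own values.

```python
# Verify the artifact's digest before any TensorFlow code touches it.
import hashlib

MODEL_FILE = "mnist_model.h5"                       # fetched from object storage
PINNED_SHA256 = "hard-coded-digest-from-training"   # placeholder value

def file_sha256(path, chunk_size=1 << 16):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if file_sha256(MODEL_FILE) != PINNED_SHA256:
    raise RuntimeError("Model file failed authentication; refusing to load.")

# Only after the check passes do we hand the file to TensorFlow.
from tensorflow.keras.models import load_model
model = load_model(MODEL_FILE)
```

Hashing in streamed chunks keeps memory use flat even for multi-gigabyte model files.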

Advanced weight-based hash authentication

Another way to calculate hashes is to use the weights themselves, as stored in the model file. The benefit of this approach is that it is independent of model format and works across different frameworks.

Fingerprinting models this way provides consistency, reliability, and reproducibility, and helps protect an ML system from tampering.
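A sketch of how such a fingerprint could be computed, assuming a Keras model for concreteness; any framework that exposes its weights as arrays would work the same way.

```python
# Fingerprint the weight tensors rather than the serialized file.
import hashlib
import numpy as np

def weights_sha256(model):
    digest = hashlib.sha256()
    for tensor in model.get_weights():       # list of numpy arrays
        arr = np.ascontiguousarray(tensor)   # fix memory layout for stable bytes
        digest.update(arr.tobytes())
    return digest.hexdigest()
```

Because the digest is derived from the tensors themselves, re-serializing the same weights into a different file format leaves the fingerprint unchanged, provided layer order and dtypes are preserved.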

Model authentication demonstration on Algorithmia


We have implemented the simple file-based hash authentication method in a Jupyter notebook. The example trains a simple MNIST model, saves it and calculates its hash, deploys the model along with that hash, and authenticates the model at runtime before running inference.

Ensure your ML models in production haven’t been hot-swapped without anyone noticing.

Model security with Algorithmia 

As machine learning becomes part of standard software development, vulnerabilities and new methods of attack are surfacing. Fortunately, Algorithmia has built-in security features to prevent model tampering, and we are committed to championing these practices so that every enterprise can benefit. Algorithmia aims to empower every organization to achieve its full potential through the use of artificial intelligence and machine learning.

Read more about ML security

Robust Physical-World Attacks on Deep Learning Visual Classification (arXiv)

Adversarial Examples for Malware Protection (Patrick McDaniel)

Federated learning and securing data (Digitalist) 

From Jupyter Notebook to a Fully Scalable Model

We spend a lot of time focused on giving data scientists the best experience for deploying their machine learning models. We think they should not only use the best tools for the job but also be able to integrate their work easily with other tools. Today we’ll highlight one such integration: Jupyter Notebook.

When we work in Jupyter Notebook, an open-source tool for creating and sharing documents that contain live code, visualizations, and markdown text, we are reminded how easy it is to use our API to deploy a machine learning model from within a notebook.

About Jupyter Notebook

These notebooks are popular among data scientists and are used both internally and externally to share information about data exploration, model training, and model deployment. Jupyter Notebook supports running both Python and R, two of the most common languages used by data scientists today.


How We Use It

We don’t think you should be limited to creating algorithms and models solely through our UI. Instead, to give data scientists a better and more comprehensive experience, we built an API endpoint that gives more control over your algorithms. You can create, update, and publish algorithms using our Python client, which lets data scientists deploy their models directly from their existing Jupyter Notebook workflows.

We put together the following walkthrough to guide users through the process of deploying from within a notebook. The first few steps are shown below:
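The snippet is a condensed sketch, assuming the Algorithmia Python client; the API key, username, data collection, and algorithm name are placeholders, and the algorithm-management calls are left as comments since their exact signatures vary by client version (see the API guide).

```python
import Algorithmia

# Hypothetical credentials and names; substitute your own.
client = Algorithmia.client("YOUR_API_KEY")

# 1. Upload the trained model file to a hosted data collection.
client.dir("data://your_username/demo_models").create()   # first run only
client.file("data://your_username/demo_models/text_classifier.pkl") \
      .putFile("text_classifier.pkl")

# 2. Create and publish the serving algorithm through the same client.
algo = client.algo("your_username/text_classifier_demo")
# algo.create(...)   # management calls; consult the API guide
# algo.publish(...)  # for the authoritative arguments.

# 3. Once published, the model is callable like any other algorithm.
result = algo.pipe("this film was a delight").result
print(result)
```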

In this example, after setting up credentials, we download and prep a dataset and build and deploy a text classifier model. You can see the full example notebook here. And for more information about our API, please visit our guide.

More of a Visual Learner?

Watch this short demo that goes through the full workflow.

Results of the First Trustless Machine Learning Contract

Two of the most exciting technological developments of our lifetime are unfolding at once: the rise of machine learning and the blockchain revolution.

Machine Learning (ML) systems have been able to surpass humans in many problem domains. These systems are now better at lip reading, speech recognition, location tagging, playing Go, image classification, and more.

With the invention of the blockchain and bitcoin, we’ve seen a wave of new cryptocurrencies and distributed applications built on these new blockchains.

The DanKu protocol sits at the intersection of blockchain and machine learning, facilitating the exchange of ML models on the Ethereum blockchain. We even published a whitepaper about it here. You can read more about the DanKu protocol in our previous blog post.

Read More…

Trustless Machine Learning Contracts: Evaluating and Exchanging Machine Learning Models on the Ethereum Blockchain

Machine Learning algorithms are being developed and improved at an incredible rate, but are not necessarily getting more accessible to the broader community. That’s why today Algorithmia is announcing DanKu, a new blockchain-based protocol for evaluating and purchasing ML models on a public blockchain such as Ethereum. DanKu enables anyone to get access to high quality, objectively measured machine learning models. At Algorithmia, we believe that widespread access to algorithms and deployment solutions is going to be a fundamental building block of a balanced future for AI, and DanKu is a step towards that vision.

The DanKu protocol utilizes blockchain technology via smart contracts. The contract allows anyone to post a data set, an evaluation function, and a monetary reward for whoever provides the best trained machine learning model for the data. Participants train deep neural networks to model the data and submit their trained networks to the blockchain. The blockchain then executes these neural network models to evaluate submissions and ensures that payment goes to the best model.

The contract allows for the creation of a decentralized and trustless marketplace for exchanging ML models. This gives ML practitioners an opportunity to monetize their skills directly. It also allows any participant or organization to solicit machine learning models from all over the world. This will incentivize the creation of better machine learning models and make AI more accessible to companies and software agents. Anyone with a dataset, including software agents, can create DanKu contracts.
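To make that flow concrete, here is a toy, off-chain illustration of the contract lifecycle described above. It is a simplification for intuition only; the real protocol runs as an Ethereum smart contract, and every name in the sketch is hypothetical.

```python
# A toy model of the DanKu lifecycle: post data, an evaluation function,
# and a reward; collect submissions; pay out to the best-scoring model.
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

@dataclass
class DanKuContractSketch:
    dataset: Any                    # data posted by the organizer
    evaluate: Callable              # evaluation function, e.g. accuracy
    reward: float                   # payout for the best submission
    submissions: List[Tuple[str, Any]] = field(default_factory=list)

    def submit(self, participant: str, model: Any) -> None:
        # Participants submit trained networks during the submission period.
        self.submissions.append((participant, model))

    def settle(self) -> str:
        # Score every submission with the posted evaluation function;
        # the reward goes to the best-scoring participant.
        winner, _ = max(self.submissions,
                        key=lambda sub: self.evaluate(sub[1], self.dataset))
        return winner
```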

We’re also launching the first DanKu competition for a machine learning problem. Read More…

Advanced Algorithm Design

We host more than 4000 algorithms for over 50k developers. Here is a list of best practices we’ve identified for designing advanced algorithms. We hope this can help you and your team. Read More…