Pipelines have been growing in popularity, and now they are everywhere you turn in data science, ranging from simple data pipelines to complex machine learning pipelines. The overarching purpose of a pipeline is to streamline processes in data analytics and machine learning. Now let’s dive in a little deeper.
What is a machine learning pipeline?
One definition of a machine learning pipeline is a means of automating the machine learning workflow: data is transformed and fed into a model, and the model's outputs are then analyzed. This type of ML pipeline makes the process of inputting data into the ML model fully automated.
Another type of ML pipeline is the practice of splitting your machine learning workflow into independent, reusable, modular parts that can then be pipelined together to create models. This type of ML pipeline makes building models more efficient and simpler, cutting out redundant work.
This goes hand in hand with the recent push for microservices architectures, building on the idea that splitting your application into basic, siloed parts lets you build more powerful software over time. Operating systems like Linux and Unix are founded on the same principle: simple utilities like `grep` and `cat` can accomplish impressive tasks when they are piped together.
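The same philosophy translates directly into code. As a minimal sketch (the helper `compose` and the step functions here are illustrative, not from any particular library), small single-purpose functions can be chained the way Unix commands are piped:

```python
from functools import reduce

def compose(*steps):
    """Chain single-purpose functions left to right, like a Unix pipe."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

# Small, siloed parts -- each does one thing, like `grep` or `cat`.
strip_blank = lambda lines: [l for l in lines if l.strip()]
match_error = lambda lines: [l for l in lines if "ERROR" in l]
count       = lambda lines: len(lines)

count_errors = compose(strip_blank, match_error, count)
print(count_errors(["ERROR: disk full", "", "INFO: ok", "ERROR: timeout"]))  # → 2
```

Each piece stays trivially simple on its own; the power comes from composing them.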
Why pipelining is so important
To understand why pipelining is so important to machine learning performance and design, consider a typical ML workflow: data is extracted, cleaned and prepared, modeled, and deployed.
In a mainstream system design, all of these tasks would be run together in a monolith: the same script extracts the data, cleans and prepares it, models it, and deploys it. Since machine learning models usually consist of far less code than other software applications, keeping all of the assets in one place seems to make sense.
However, when trying to scale a monolithic architecture, three significant problems arise:
- Volume: when deploying multiple versions of the same model, you have to run the whole workflow for each version, even though the first steps of ingestion and preparation are identical.
- Variety: when you expand your model portfolio, you have to copy and paste code from the beginning stages of the workflow, which is inefficient and a well-known anti-pattern in software development.
- Versioning: when you change the configuration of a data source or another commonly used part of your workflow, you have to manually update every script that touches it, which is time-consuming and creates room for error.
With ML pipelining, each part of your workflow is abstracted into an independent service. Each time you design a new workflow, you pick the elements you need and use them where you need them; any change to a service is made once, in one place, rather than in every workflow that uses it.
A pipelining architecture solves the problems that arise at scale:
- Volume: only call parts of the workflow when you need them, and cache or store results that you plan on reusing.
- Variety: when you expand your model portfolio, you can use pieces of the beginning stages of the workflow by simply pipelining them into the new models without replicating them.
- Versioning: when services are stored in a central location and pipelined together into various models, there is only one copy of each piece to update. All instances of that code will update when you update the original.
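As a sketch of how those three properties fall out of the design (the step and model names below are illustrative): each stage becomes its own function, the expensive shared stages are cached, and two different models reuse them without duplicated code.

```python
from functools import lru_cache

@lru_cache(maxsize=None)          # Volume: ingestion runs once, its result is reused
def ingest(source):
    # Stand-in for a real data pull; returns raw records.
    return ("raw data from %s" % source,)

def prepare(records):
    # Shared cleaning step. Versioning: change it here, and every
    # model that pipelines it in gets the update automatically.
    return [r.upper() for r in records]

def model_a(features):
    return "A:" + features[0]

def model_b(features):
    return "B:" + features[0]

# Variety: both models pipeline in the same upstream services
# instead of copy-pasting the ingestion and preparation code.
shared = prepare(ingest("warehouse"))
print(model_a(shared), model_b(shared))
```

Repeated calls to `ingest("warehouse")` return the cached result, so adding a second model does not re-run the first stages of the workflow.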
How ML pipelines benefit performance and organization
This type of ML pipeline improves the performance and organization of the entire model portfolio, getting models into production more quickly and making machine learning models easier to manage.
Scheduling and runtime optimization
As your machine learning portfolio scales, you'll see that many parts of your pipeline are heavily reused across the entire team. Knowing this, you can optimize your deployment for those common algorithm-to-algorithm calls, getting the right algorithms running seamlessly, reducing compute time, and avoiding cold starts.
Language and framework agnosticism
In a monolithic architecture, you have to be consistent in the programming language you use and load all of your dependencies together. But since a pipeline communicates through API endpoints, different parts can be written in different languages, each with its own framework. This is a key strength when scaling ML initiatives, since it allows pieces of models to be reused across the technology stack, regardless of language or framework.
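The reason this works is that the only contract between stages is the request and response format, not the implementation language. A sketch of that idea with JSON as the interchange format (the service below is a hypothetical local stand-in; a real deployment would POST the same payload to an HTTP endpoint):

```python
import json

def tokenize_service(request_body: str) -> str:
    """Pretend this runs in another language behind an API endpoint:
    it sees only JSON in and JSON out, so its internals are opaque
    to every caller."""
    payload = json.loads(request_body)
    tokens = payload["text"].split()
    return json.dumps({"tokens": tokens})

# The caller's language doesn't matter either -- it just builds JSON.
response = tokenize_service(json.dumps({"text": "pipelines scale ML"}))
print(json.loads(response)["tokens"])  # → ['pipelines', 'scale', 'ML']
```

Swap the Python internals for a Java or R implementation behind the same endpoint and no caller has to change.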
Broader applicability and fit
With the ability to reuse pieces of models in other workflows, each chain of functions can be applied broadly throughout the ML portfolio. Two models may have different end goals yet require the same step near the beginning; with pipelining, that step can serve both models, because any service can fit into any application.
Use cases of an ML pipeline
Here are a couple of use cases that illustrate why pipelining is important for scaling machine learning teams.
Natural Language Processing
Tasks in natural language processing often involve multiple repeatable steps. To illustrate, here’s an example of a Twitter sentiment analysis workflow.
This workflow consists of data being ingested from Twitter, cleaned for punctuation and whitespace, tokenized and lemmatized, and then sent through a sentiment analysis algorithm that classifies the text.
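A minimal sketch of that workflow in Python, with each stage as its own function (the cleaning rules, the stubbed lemmatizer, and the toy sentiment lexicon are all simplifications standing in for real components such as NLTK or spaCy and a trained classifier):

```python
import string

def clean(tweet: str) -> str:
    # Strip punctuation and collapse whitespace.
    no_punct = tweet.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split()).lower()

def tokenize(text: str) -> list:
    return text.split()

def lemmatize(tokens: list) -> list:
    # Stub: a real lemmatizer maps each word to its dictionary form.
    suffixes = ("ing", "ed", "s")
    return [next((t[:-len(s)] for s in suffixes
                  if t.endswith(s) and len(t) > len(s) + 2), t)
            for t in tokens]

def sentiment(tokens: list) -> str:
    # Toy lexicon classifier standing in for a trained model.
    positive, negative = {"love", "great", "good"}, {"hate", "bad", "awful"}
    score = sum((t in positive) - (t in negative) for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment(lemmatize(tokenize(clean("I love this!! Great stuff.")))))  # → positive
```

Because each stage is independent, any one of them can be swapped or upgraded without touching the others.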
Keeping all of these functions together makes sense at first, but as you begin to apply more analyses to this dataset, it pays to modularize the workflow.
With monolithic structures, each additional analysis of this data repeats the ingestion, cleaning, and preprocessing stages; with pipelined components, every analysis calls the same shared upstream services.
With this architecture, it’s easy to swap in other algorithms, update the cleaning or preprocessing steps, or scrape tweets from a different user without breaking the other elements of your workflow. There is no copying and pasting changes into every iteration, and this simplified structure, with fewer overall pieces, will run more smoothly.
Challenges of an ML pipeline
The challenge organizations face when implementing a pipelining architecture in their machine learning systems is that this type of system is a huge investment to build internally. It often seems easier to stick with whatever architecture the organization is using now, and if building the structure in-house were the only option, that would probably be true.
However, there is another way to invest in ML pipelining without spending the time and money that it takes to build it. Algorithmia offers this system to organizations to make it easier to scale their machine learning endeavors. It doesn’t have to be a challenge to implement this technology into your organization’s workflows.
Pipelining with Algorithmia
Algorithmia was built from the ground up with machine learning at scale in mind. Pipelining is a key part of any full-scale deployment solution: teams need to be able to productionize models as parts of a whole.
With Algorithmia, pipelining machine learning is simple:
- Algorithms are packaged as microservices with API endpoints: calling any algorithm or function is as easy as `algorithm.pipe(input)`
- Pipelines can be input agnostic, since multiple languages and frameworks can be pipelined together
- You can set permissions for models and choose to allow a model to call other algorithms
A lot of important aspects of pipelining happen on the backend, too. When you define your pipeline, Algorithmia optimizes scheduling behind the scenes to make your runtime faster and more efficient. If one algorithm consistently calls another, the system will pre-start the dependent models to reduce compute time and save you money.
Pipelining is just one of the features that Algorithmia has to offer. Learn more about automating your DevOps for machine learning by watching a demo video of Algorithmia.
You can read more case studies and information about pipelining ML in our whitepaper “Pipelining machine learning models together.”