Algorithmia Blog - Deploying AI at scale

Algorithmia and BERT language modeling


Natural language processing has been one of the most prominent and visible uses of machine learning in recent years. From early recurrent neural network architectures that could detect the first named entity pairings to today’s transformers, which can look at an entire paragraph or book at once using parallel processing on GPUs, we’ve clearly seen some tremendous improvements.

However, nothing has been quite as dramatic for the field as a new architecture: Bidirectional Encoder Representations from Transformers, or BERT.

In this post, we’ll walk through what BERT is and provide an easy way to use it on Algorithmia.

How recurrent neural networks work

Before we talk about BERT itself, let’s start with one of the cornerstone building blocks of many machine learning architectures—Recurrent Neural Networks (RNNs).

RNN model depiction

Many datasets containing signals we’d like to extract are sequential, meaning that for each element x(t) in an input sequence (in the graphic above), the output y(t) depends not only on x(t) but also on earlier elements, from x(t-1) back to x(t-n).

A great example of this is language. Imagine this sentence: The quick brown fox jumps over the lazy ___. You may have an idea what that last word is supposed to be. This is due to how language is constructed: each word adds context and morphs the meaning of the sentence. If you considered any of the words individually, without context, it would be difficult to predict that the last word was dog.

Using context for more accurate predictions 

Recurrent neural networks are a unique architecture that allows the state of previous operations to be preserved inside the model. This means that we could design an RNN model that, given each word individually (the, quick, brown, fox, …), could be trained to successfully predict dog and many other things. That is a simplistic description of how RNNs work. Let’s take a look at some of the downsides of recurrent architectures.
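To make the idea of preserved state concrete, here is a minimal sketch of a single recurrent unit in plain Python. The function name rnn_forward and the weights w_x and w_h are illustrative assumptions, not values from a trained model. The only point is that the hidden state at each step depends on everything that came before it.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers.

    h(t) = tanh(w_x * x(t) + w_h * h(t-1) + bias)
    The weights are arbitrary illustrative values, not trained ones.
    """
    hidden = 0.0
    states = []
    for x in inputs:
        # The new state mixes the current input with the previous state.
        hidden = math.tanh(w_x * x + w_h * hidden + bias)
        states.append(hidden)
    return states

# Both sequences end with the same input (1.0), but the history differs.
states_a = rnn_forward([0.0, 0.0, 1.0])
states_b = rnn_forward([1.0, 1.0, 1.0])
print(states_a[-1], states_b[-1])
```

Even though the last input is identical, the two final states differ, because the second sequence carried earlier context forward.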

Challenges in recurrent architectures

Vanishing gradient problem

One drawback is called the vanishing gradient problem, which stems from how information is stored in RNNs. As mentioned, information from x(t-n) is stored in the network to help predict y(t). However, when n gets very large, that information eventually starts to leak out, and the training signal that links y(t) back to x(t-n) shrinks toward zero.

There have been improvements to reduce this impact, such as Long Short-Term Memory (LSTM) layers and Gated Recurrent Units (GRUs), but the problem persists in very long-range information sharing.
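One way to see the effect is to track how strongly the state at step t still depends on the state at step 0. The sketch below does this for a one-unit tanh cell using the chain rule. The weight w_h = 0.8 and the constant input x = 0.5 are assumptions chosen for illustration; real networks have many units, but the multiplication of many small factors is the same.

```python
import math

def gradient_through_time(steps, w_h=0.8, x=0.5):
    """Return d h(steps) / d h(0) for h(t) = tanh(w_h * h(t-1) + x)."""
    hidden = 0.0
    grad = 1.0
    for _ in range(steps):
        hidden = math.tanh(w_h * hidden + x)
        # Chain rule: the derivative of tanh(u) is 1 - tanh(u)^2, and du/dh = w_h.
        grad *= w_h * (1.0 - hidden ** 2)
    return grad

for n in (1, 5, 20, 50):
    print(n, gradient_through_time(n))
```

Each step multiplies the gradient by a factor smaller than one, so the influence of distant inputs decays roughly exponentially with distance.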

Information processing

The second problem stems from how information is processed. As mentioned, information from x(t-n) is used to help predict y(t). This means we need to calculate the value of y(t-n) before we can even start work on y(t), which can make parallelizing the training/inference processes quite difficult if not impossible for many tasks. 

This isn’t always a problem, especially for smaller architectures, but if you intend to use deep learning models at scale, you will very quickly run into a brick wall in how fast you can train the model.

This is one of the reasons why researchers have historically preferred to focus on other ML projects like image processing, as the power of deep learning was unable to provide any value to many RNN models.

Transfer learning

The third problem is a difficulty with transfer learning. Transfer learning is the process of taking an ML model pre-trained on some generic dataset and re-training it on a specialized dataset for your specific project or problem.
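As a rough sketch of that process, the toy example below keeps a "pre-trained" feature function frozen and trains only a small new classifier on top of it. Everything here is hypothetical: pretrained_features stands in for a real pre-trained network, and the four labeled examples stand in for a specialized dataset.

```python
import math

def pretrained_features(x):
    """Stand-in for a frozen, pre-trained network: maps a number to two features."""
    return [math.tanh(x), math.tanh(x - 1.0)]

def train_head(examples, epochs=100, lr=0.5):
    """Train only a logistic-regression head; the feature function is never updated."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            feats = pretrained_features(x)
            z = sum(w * f for w, f in zip(weights, feats)) + bias
            pred = 1.0 / (1.0 + math.exp(-z))
            error = pred - label  # gradient of the log loss with respect to z
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    feats = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, feats)) + bias
    return 1 if z > 0 else 0

# Hypothetical specialized dataset: (input, label) pairs.
examples = [(-2.0, 0), (-1.0, 0), (2.0, 1), (3.0, 1)]
weights, bias = train_head(examples)
predictions = [predict(weights, bias, x) for x, _ in examples]
print(predictions)
```

Only the three head parameters are learned here, which is why transfer learning can save so much training time when the frozen features already suit the new task.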

This kind of process is very common in the image processing world but has proven to be quite challenging for even relatively standard sequential tasks, such as Natural Language Processing. This is because any model you are planning to use for transfer learning must have been trained with the same type of objective as the one you plan on tackling. 

Transfer learning requires a shared set of necessary transformations between model objectives, which is where we see benefits in training time and model accuracy.

In the field of image classification, we’re almost always looking for objects in an image, generally a natural photograph (like family vacation pictures from the Bahamas). However, if you attempted to reuse a general classification model to classify artifacts in x-ray stereographs, your model would really struggle to provide any value.

This kind of scenario has plagued NLP algorithms since their inception, as many NLP tasks are disparate and have objectives (such as named entity recognition or text prediction) that make it very difficult to transfer learning from one task to another.

This is where BERT comes in, and why it’s so special. BERT uses a multi-headed objective system that takes the most common NLP objectives and trains a model that’s capable of being successful in all of them. We’ll look at BERT models more in-depth below.

Other types of RNNs 

Attention networks

A new architecture was created by Google researchers a couple of years ago that approaches sequential problems in a different way.

A depiction of a recurrent neural network with an attention layer

With attention networks, we’re processing every variable in our sequence (x(0) all the way to x(t)) at once, rather than one at a time. 

We’re able to do this because the attention layer is able to view all the data at once using its limited number of weights to focus on only the parts of the input that matter for the next prediction. This means we’re able to parallelize training our model and also take advantage of GPUs.
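The sketch below shows one common way to compute attention, the scaled dot-product form. This is an assumption on our part, since the post does not say which variant it has in mind, and the query, keys, and values are made-up vectors. Note that the scores for every position are computed independently of each other, which is what makes the operation easy to parallelize.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over a whole sequence."""
    scale = math.sqrt(len(query))
    # One score per sequence position; no position depends on another.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical 2-dimensional vectors for a three-element sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attn_weights, attn_output = attention([2.0, 0.0], keys, values)
print(attn_weights, attn_output)
```

The first key points in the same direction as the query, so it receives the largest weight, and the output is dominated by the first value.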

Transformer networks

As a progression on attention networks, transformers have multiple “sets” of weights per attention layer that are able to focus on different parts of an attention vector. These are called transformer heads (more commonly known as attention heads).

Other than that, the big difference between attention and transformer networks is the concept of stacking attention/linear layers on top of each other (while taking some concepts from residual network architectures) in a similar way to convolutional neural networks. This creates the paradigm of deep learning, which allows us to avoid the vanishing gradient problem by ensuring that information from previous layers always bubbles up to the last layer of the network. 
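The residual idea mentioned above can be sketched in a few lines: each block adds its layer’s output to its own input, so the input always has a direct path to the output. The names residual_block and dead_layer are hypothetical, and the "layer" here is a deliberately useless one that outputs zeros, to show the worst case.

```python
def residual_block(x, layer):
    """Return x + layer(x), so the input always reaches the output."""
    return [xi + yi for xi, yi in zip(x, layer(x))]

def dead_layer(x):
    # Hypothetical worst case: a layer that has learned nothing and outputs zeros.
    return [0.0 for _ in x]

signal = [0.3, -1.2, 0.7]

# Without a skip connection, one dead layer wipes out the signal.
plain = dead_layer(signal)

# With skip connections, the signal survives a stack of twelve blocks.
stacked = signal
for _ in range(12):
    stacked = residual_block(stacked, dead_layer)

print(plain, stacked)
```

This is the sense in which information from earlier layers bubbles up to the last layer even in a very deep stack.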

These networks have become state of the art for natural language processing. Combined with the fact that they can be trained effectively on GPUs and TPUs, this allows researchers to make them even deeper.

A depiction of a transformer network model

Bidirectional Encoder Representations from Transformers (BERT)

Attention architectures allow us to solve two of the biggest problems of working with RNNs. The parallelization that attention models provide lets us train much faster, and with the introduction of transformers, using residual connections and multiple transformer heads, we can avoid the vanishing gradient problem, allowing us to construct deeper models and take advantage of the deep learning paradigm.

But we’re still missing something; we haven’t addressed a third problem—NLP models are terrible for transfer learning.

This is where BERT comes in. It’s trained on two different objectives so that its parameters become more general-purpose. Like many NLP architectures, the model is trained to predict missing (masked) words from their surrounding context. It is also trained to predict whether one sentence follows another, which pushes it to encode whole sentences into a useful internal representation.
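To illustrate the missing-word objective, here is a simplified sketch of how training examples can be built by hiding random tokens. The function name mask_tokens is hypothetical, and this version always substitutes a [MASK] token; BERT’s actual procedure is somewhat more involved, so treat this as an outline of the idea rather than the real preprocessing.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens and record what the model must predict."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(token)   # the model is scored on recovering this word
        else:
            masked.append(token)
            labels.append(None)    # no prediction needed at this position
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(tokens, mask_rate=0.3)
print(masked)
print(labels)
```

Because the hidden word can sit anywhere in the sentence, the model has to use context from both sides of it to make a prediction.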

Unlike typical training systems, however, BERT is provided with not just one representation of a block of text, but two: one right-to-left, the other left-to-right. Hence it’s a bidirectional encoder.

This phrase-embedding encoder is also much deeper and contains significantly more parameters than earlier encoding systems such as word2vec or GloVe.

Besides that, the encoding of a word is not independent of its context, which allows BERT to have a very deep and rich understanding of the vocabulary used in the training corpus.
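The difference between a context-free and a context-dependent encoding can be shown with a toy example. The vectors in STATIC are invented, and contextual_encode simply averages each word with its neighbors; BERT uses stacked attention layers instead, so this is only a stand-in for the concept.

```python
# Hypothetical 2-dimensional word vectors, one fixed vector per word.
STATIC = {"river": [1.0, 0.0], "bank": [0.5, 0.5], "money": [0.0, 1.0]}

def static_encode(tokens):
    """Context-free lookup: a word always gets the same vector."""
    return [STATIC[token] for token in tokens]

def contextual_encode(tokens):
    """Toy context-dependent encoding: average each word with its neighbors."""
    vectors = static_encode(tokens)
    encoded = []
    for i in range(len(vectors)):
        window = vectors[max(0, i - 1): i + 2]
        encoded.append([sum(v[d] for v in window) / len(window) for d in range(2)])
    return encoded

river_bank = ["river", "bank"]
money_bank = ["money", "bank"]

print(static_encode(river_bank)[1], static_encode(money_bank)[1])
print(contextual_encode(river_bank)[1], contextual_encode(money_bank)[1])
```

With the static lookup, "bank" is identical in both phrases. With the context-dependent version, it leans toward "river" in one and toward "money" in the other.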

Diagrams of BERT models in semi- and supervised learning environments

Once the internal word encoder model is trained, a classifier is stacked on top of the model, which can be trained for a variety of tasks. In the pre-trained examples, a simple Spam/Not Spam binary classifier is constructed, but this could be used for other systems as well, such as named entity recognition or sentiment analysis, to name a few.

BERT and Algorithmia

A big benefit of BERT is that it generates very rich encodings of word representations that can be used for tasks involving large documents with many sentences. This is helpful because one model can be used to construct many downstream applications of varying complexity, such as document classification or semi-supervised document topic clustering.

Algorithmia has deployed two examples of BERT models, one in TensorFlow and the other in PyTorch. The source code is also available through the new GitHub for Algorithmia integration, which makes it easier to use the code you’d like.

Both of these models are able to provide rich representations of a sentence, and can be used as a first stage for many NLP downstream tasks that are specialized for your business case.

What is artificial intelligence engineering?

According to LinkedIn’s 2020 Emerging Jobs Report, the demand for “Artificial Intelligence Specialists” (a category comprising a few related roles) has grown 74 percent in the last four years. With more companies than ever (even those outside of tech) relying on AI as part of their everyday business, demand for practitioners with this skill set will only rise.

In our 2020 state of enterprise machine learning report, we noted that the number of data science–related workers is relatively low but the demand for those types of skills is great and growing exponentially. 

If you’ve been curious about how to become an AI engineer or if you’re interested in shifting your current engineering role into one more focused on AI, you’ve come to the right place. 

By the end of this post you’ll understand: 

  • The role of an AI engineer.
  • The educational requirements to be an AI engineer.
  • The knowledge requirements to be an AI engineer.
  • The AI engineering career landscape. 

What is an AI engineer?

An artificial intelligence engineer is an individual who works with traditional machine learning techniques like natural language processing and neural networks to build models that power AI–based applications. 

The types of applications created by AI engineers include:

  • Contextual advertising based on sentiment analysis
  • Language translation 
  • Visual identification or perception

Is an AI engineer a data engineer or scientist? 

You may be wondering how the role of an AI engineer differs from that of a data engineer or a data scientist. While all three roles work together within a business, they do differ in several ways: 

  • Data engineers write programs to extract data from sources and transform it so that it can be manipulated and analyzed. They also optimize and maintain data pipelines.  
  • Data scientists build machine learning models meant to support business decision making. They are often looking at the business from a higher strategic point than an AI engineer typically would.

What does it take to be an AI engineer?

AI engineering is a relatively new field, and those who currently hold this title come from a range of backgrounds. The following are some of the traits that many have in common. 


Many AI engineers moved over from previous technical roles and often have undergraduate or graduate degrees in fields that are required for those jobs. These include: 

  • Computer science
  • Statistics
  • Applied mathematics
  • Linguistics 
  • Cognitive science 

Most of the above degrees have some relevance to artificial intelligence and machine learning. 

Technical skills

Two of the most important technical skills for an AI engineer to master are programming and math/statistics. 

  • Programming: Software developers moving into an AI role or developers with a degree in computer science likely already have a grasp on a few programming languages. Two of the most commonly used languages in AI, and specifically machine learning, are Python and R. Any aspiring AI engineer should at least be familiar with these two languages and their most commonly used libraries and packages.
  • Math/statistics: AI engineering is more than just coding. Machine learning models are based on mathematical concepts like statistics and probability. You will also need to have a firm grasp on concepts like statistical significance when you are determining the validity and accuracy of your models.
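As a small example of why statistical significance matters when judging models, the sketch below compares the accuracy of two models with a two-proportion z-test. The choice of test, the function name, and the counts are all assumptions made for illustration; other tests may suit your evaluation setup better.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Return (z, two-sided p-value) for the difference between two accuracies."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical results: 90% vs. 85% accuracy on small and large test sets.
z_small, p_small = two_proportion_z_test(90, 100, 85, 100)
z_large, p_large = two_proportion_z_test(900, 1000, 850, 1000)
print(p_small, p_large)
```

The accuracy gap is the same in both comparisons, but only the larger test set gives strong evidence that the first model is really better.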

Soft skills

AI engineers don’t work in a vacuum. So while technical skills will be what you need for modeling, you’ll also need the following soft skills to get your ideas across to the entire organization. 

  • Creativity – AI engineers should always be on the lookout for tasks that humans do inefficiently and machines could do better. You should stay abreast of new AI applications within and outside of your industry and consider if they could be used in your company. In addition, you shouldn’t be afraid to try out-of-the-box ideas. 
  • Business knowledge – It’s important to remember that your role as an AI engineer is meant to provide value to your company. You can’t provide value if you don’t really understand your company’s interests and needs at a strategic and tactical level. A cool AI application doesn’t mean much if it isn’t relevant to your company or can’t improve business operations in any way. You’ll need to understand your company’s business model, who its target customers are, and whether it has any long- or short-term product plans.
  • Communication – In the role of an AI engineer, you’ll have the opportunity to work with groups all over your organization, and you’ll need to be able to speak their language. For example, for one project you’ll have to: 
    • Discuss your needs with data engineers so they can deliver the right data sources to you.
    • Explain to finance/operations how the AI application you’re developing will save costs in the long run or bring in more revenue.
    • Work with marketing to develop customer-focused collateral explaining the value of a new application.
  • Prototyping – Your ideas aren’t necessarily going to be perfect on the first attempt. Success will depend on your ability to quickly test and modify models until you find something that works.

Can I turn my current engineering role into an AI role?

Yes. Experienced software developers are well-suited to make the transition into AI engineering. You presumably have command of more than one programming language and the foundational knowledge to learn another. It’s also likely that you’ve already worked with machine learning models in some capacity, possibly by incorporating them into other applications.

If you are interested in pursuing an AI engineering role within an organization where you already work, your knowledge of the business and knowledge of how the engineering team works will be crucial. 

How much does an artificial intelligence engineer earn in salary?

Artificial intelligence engineers are in high demand, and the salaries that they command reflect that. According to estimates from job sites like Indeed and ZipRecruiter, an AI engineer can make anywhere between $90,000 and $200,000 (and possibly more) depending on their qualifications and experience.

Another factor that will determine salary is location. According to the LinkedIn Emerging Jobs Report mentioned earlier, most AI engineering jobs are located in the San Francisco Bay area, Los Angeles, Seattle, Boston, and New York City. 


Big data and artificial intelligence: a quick comparison


The big data industry has grown at an incredible rate as businesses realize the importance of insightful data analysis. But what exactly is big data, and how does it correspond to artificial intelligence? We will compare the two realms, what they are, their differences, and how the combination of both leads to results beyond traditional human capability.

What are big data and artificial intelligence?

Big data

Big data is a field focused on managing large amounts of data from a variety of sources. Big data comes into play when the volume of data is too large for traditional data management practices to be effective. Companies have long collected massive amounts of information about consumers, pricing, transactions, and product security, but eventually the volume of data collected proved too much for humans to manually analyze.

The three V's of big data: volume, velocity, and variety

The essence of big data can be broken into “the three v’s of big data”:

  • Volume: The amount of data being collected
  • Velocity: The rate at which data is received and acted upon
  • Variety: The different forms of data collected (structured and unstructured data sources)

Artificial intelligence

Artificial intelligence (AI) is the development and implementation of computer systems that are capable of logic, reasoning, and decision making. This self-learning technology uses visual perception, emotion recognition, and language translation to analyze data and output information in a more efficient manner than human-driven methods.

In fact, you likely already interact with AI systems on a daily basis. The largest companies in the world, such as Amazon, Google, and Facebook, use artificial intelligence in their user interfaces. AI is what powers personal assistants like Siri, Alexa, and Bixby, and allows websites to recommend products, videos, or articles that might interest you. These targeted suggestions aren’t a coincidence; they are a result of artificial intelligence.

What is the difference between big data and artificial intelligence?

The difference between artificial intelligence and big data lies in the output of each. Artificial intelligence analyzes inputs to learn and improve its sorting or patterning processes over time, using data that it gathers to provide a more accurate diagnostic. 


In contrast, big data is the overarching pool of information that is accumulated from various data sources, to then be analyzed by artificial intelligence. Big data and artificial intelligence are often used in conjunction with one another, but each fulfills a very different role: one is information and the other is a treatment of that information.

How big data and artificial intelligence work together

Big data and artificial intelligence are interdependent. Although each discipline is distinct, the presence of each is crucial in allowing for the other to function at its highest degree. AI does use data, but its ability to analyze and learn from this data is limited by the quantity of information that is fed into the system. Big data provides a vast sample of this information, making it the gas that fuels top-end artificial intelligence systems.

By harnessing big data resources, artificial intelligence systems can make more informed decisions, provide better user recommendations, and find ever-improving efficiencies in your models. However, an agreed-upon ruleset for data collection and data structure must be in place prior to AI implementation to ensure production of the best data possible.

Some benefits of AI and big data:

  1. Less labor-intensive data analytics
  2. Machine learning helps to relieve common data problems
  3. Continued importance of humans in the analytic process
  4. More predictable and prescriptive analytics

Algorithmia understands big data and AI challenges

The world of big data and artificial intelligence can be overwhelming, but these processes are crucial for enterprises to have in place to stay competitive. However, implementation of effective systems comes with its own set of challenges. Algorithmia understands these needs and hosts a serverless microservices architecture that allows enterprises to easily deploy and manage machine learning models at scale. 

See how Algorithmia can help you build better software for your organization in our video demo.



What is the difference between a data scientist and a data engineer?


Data scientists and data engineers fulfill different positions within an organization, but often work in conjunction with one another. Below we will discuss the differences between these roles, including job responsibilities, typical projects, and the technical skills needed for each.

What is a data scientist?

Data scientists analyze massive amounts of structured and unstructured data, often including big data and data mining, with the goal of extracting knowledge and insights to be used for crucial business decisions. Data scientists have the ability to understand and translate the meaning of incoming data and tell compelling stories that explain the implications of their findings to key stakeholders.

What is a data engineer?

Data engineers employ tools and programming languages to design, build, test, and maintain Application Programming Interfaces, or APIs. This process stack is then used to accumulate, store, and process large amounts of data in real time. Data scientists rely heavily on effective architecture to bring structure and format to ever-changing datasets. Data engineers are responsible for creating these systems.

Data-driven organizations highly value people with these capabilities, since a well-made infrastructure can provide significant competitive advantages over companies with less-than-optimal data collection.

Role requirements

Data scientists and data engineers can have bachelor’s degrees, master’s degrees, or PhDs in computer science. However, this requirement is beginning to be overlooked if a candidate exhibits the necessary skills for a position. Beyond a common field of study, those involved with data engineering frequently have a programming background and use languages like Python, Java, or Scala, while data scientists often pursue education or training in mathematics, statistics, economics, or physics.

Differences between data scientists and data engineers

Next, let’s discuss some of the critical skills that employers seek when hiring for these positions, followed by the typical projects that these roles can expect to participate in.

Skills and requirements

Data scientists

  • Proficiency in Python, Java, R, and SQL 
  • Ability to organize, present, and analyze data
  • Understand how to apply best practices to data mining and cleansing
  • Experience in manipulating datasets and building statistical models
  • Experience in big data tools like Hadoop, Hive, and Pig
  • Experience with unstructured data management efficiency
  • Strong collaboration skills 

Data engineers

  • Proficiency in Python, R, C/C++, Ruby, Perl, and Java
  • Ability to build and design large-scale applications
  • In-depth knowledge of database solutions, especially SQL (Cassandra and Bigtable are also beneficial)
  • Database architecture, data warehousing, data modeling, and data mining experience
  • Capable of distributed computing and pipelining algorithms to yield predictive accuracy
  • Understanding of Hadoop-based analytics (e.g. HBase, Hive, Pig, and MapReduce)
  • Vast knowledge of operating systems, particularly UNIX, Linux, and Solaris
  • Experience with ETL (Extract, Transform, Load) tools such as StitchData or Segment
  • Note: Although machine learning is traditionally used by data scientists, having an understanding of ML can also be helpful to data engineers when constructing useful solutions for analysts.
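For readers new to ETL, the three steps can be sketched with nothing but the Python standard library. The CSV text, the purchases table, and the cleaning rule below are all hypothetical; production pipelines typically rely on dedicated tools like the ones listed above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with stray whitespace and a missing amount.
RAW_CSV = "user_id,amount\n1, 19.99\n2,\n3,5.00\n"

def extract(text):
    """Extract: read rows from the raw source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean values and drop rows that cannot be used."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        clean.append((int(row["user_id"]), float(amount)))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count, total = conn.execute("SELECT COUNT(*), SUM(amount) FROM purchases").fetchone()
print(row_count, total)
```

Keeping the three steps as separate functions makes each one easy to test and replace, which matters once a pipeline grows beyond a single file.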

Typical projects for each role

Data scientists

  • Prototype ideas and create custom statistical models/algorithms (includes research and testing)
  • Utilize clean data for the analysis, testing, creation, and presentation of results
  • Understand company needs to better help in strategic planning and development of products/solutions
  • Present results to internal stakeholders and external clients in a compelling way
  • Collaborate (when needed) with data engineers to create AI/ML models
  • Work closely with your team to communicate analyses in an easily understood manner

Data engineers

  • Design, build, test, and maintain big data infrastructures and processing systems
  • Create and maintain optimal data pipeline architecture
  • Build analytics tools that utilize data pipelines to deliver actionable insights
  • Gather and clean raw data to prepare for analysis
  • Automate manual processes and optimize data delivery
  • Assemble complex datasets to fulfill functional and non-functional company needs

Demand for data practitioners

Demand for data scientists far exceeds supply in the current state of the industry. The increase of use cases and proven results from machine learning applications has caused a hiring fervor as companies look to their data to provide insights into customer behavior and cost reduction opportunities. 


Additionally, data engineers were reported to have the leading number of job postings in the tech field, with an 88 percent increase during 2018.

Demand is high for both of these positions, creating opportunities for those with the necessary skills and commitment to innovation in computer science. Companies are always looking for people who can make an impact in how their business collects, analyzes, and implements data-centric insights. 

Can data engineers become data scientists?

Data engineers can become data scientists, but the transition may be challenging. Though the technical skills needed to be a data scientist may be covered by a data engineer’s experience, the non-technical skills, like knowing how to analyze data and extract valuable information from it, might need refinement. One benefit of having experience in both fields is that collaboration between the two positions could be made easier, leading to more efficient architecture and analysis.

Algorithmia understands data scientists and data engineers

Algorithmia understands the needs and challenges of both roles, which is why we created a serverless microservices architecture that allows organizations to deploy and govern their machine learning models at scale with ease.

See how Algorithmia helps data scientists and data engineers build better software for your organization in our video demo.



Webinar: 2020 state of enterprise machine learning


Last week, our CEO, Diego Oppenheimer, and CEO of ArthurAI, Adam Wenchel, hosted a webinar on the state of enterprise machine learning in 2020. The webinar was moderated by Algorithmia VP of Engineering, Ken Toole.


Diego and Adam leverage their knowledge of AI and machine learning and offer their enterprise experience making these technologies available for companies to automate their business operations. This is a great opportunity to learn from industry leaders what they see happening in the AI/ML space and what is likely ahead for companies deciding to incorporate AI and ML into their workflows.

Webinar Overview

The talk started with a look at our 2020 state of enterprise machine learning report, which was published in December. The report focused on seven key findings:

  • The role of the data scientist and the rise of data science arsenals at companies to prepare for data value extraction via machine learning models.
  • The most common challenges to developing mature machine learning programs are deployment, versioning, and aligning stakeholders within an organization.
  • Investment in AI/ML in the enterprise is growing swiftly, with several industries leading the charge.
  • Most companies are spending more than 8 days, and sometimes up to a year, deploying a single model.
  • The majority of companies undertaking ML initiatives are in relatively early stages (i.e., developing use cases, building models, or working on deployment).
  • There is a discrepancy in determining what ML success looks like across industries and roles within an organization.
  • Business use cases for machine learning vary, but the most common ones are for gaining customer insight and for reducing costs.

2020 report findings

One of the topics of discussion surrounded how DevOps, engineering, and data science teams are organizing around machine learning. Diego and Adam both mention the blending of roles and the morphing of resources across business units. About this change, Adam said:

“Having to change the way groups are organized in order to be successful is something we see over and over again.” – Adam Wenchel, CEO ArthurAI

Model deployment challenges

A topic that Algorithmia cares about deeply is the time to deployment for machine learning models in the enterprise. We talk to a massive number of companies that say they spend between 8 and 90 days deploying, and an alarming number of companies who spend more than 90 days, and we think that’s unnecessary and a waste of valuable resources.

“Time to deployment is where we see a giant gap; the fact that it could potentially take 90 days, or even more in some cases, to deploy a single model is scary because the cost balloons during that time and it’s unacceptable to the C-suite.” – Diego Oppenheimer, CEO Algorithmia

Listen to the full story

The webinar covers many trends in the AI/ML space, and it’s a great opportunity to hear from three leaders in enterprise machine learning. Watch the full webinar for the complete discussion.