Following the release of the 2020 State of Enterprise Machine Learning report, we created an interactive data visualization so anyone can explore the survey data, conduct analysis, and see how a company’s machine learning efforts compare to others like it.
The State of Enterprise Machine Learning (ML) experience shares eight questions that were posed in our survey and the associated results. After exploring the data, download the full report to read our assessments and predictions about where ML development is headed.
Explore the data three different ways
Our report shares findings from nearly 750 survey respondents whom we polled in the fall of 2019. However, if you want to see how other companies of a similar size to yours are using machine learning, the interactive experience allows you to test your own hypotheses and arrive at findings tailored to you.
The interactive experience lets you filter by industry, company size, or job title to see “slices” of the data.
Using the interactive experience
- Scroll down the page to see the graphic visualizations and how the data break down generally.
- Then apply a filter (specify it further, if desired). Scroll to see how the data change.
- Hover over graphs for specific percentages.
- Refer to the report to glean more insight into the current and future state of ML.
We intend the interactive experience to provide details about how long it takes a specific job type or industry to deploy a machine learning model, what the state of ML maturity is for companies of a specific size, and whether or not certain industries are leading the ML charge.
Insights like these can be the impetus for new machine learning stories and journeys.
Where is machine learning development headed?
The 2020 survey data and report confirm that ML in the enterprise is progressing at a lightning pace. Though the majority of companies are still in the early stages of developing ML maturity, it would be a mistake to think there is time to delay ML development at your company.
Algorithmia is committed to adding to this interactive experience every year after we conduct our State of Enterprise Machine Learning survey. Read the full report for a year-over-year comparison with 2018; patterns are already starting to emerge. And stay tuned for next year’s data.
If your organization is not currently ML–oriented, know that your competitors are. Now is the time to future-proof your organization with AI/ML. Get ahead of the competition with Algorithmia.
In the last year alone, there have been countless developments in machine learning (ML) tooling and applications. Facial recognition and other computer vision applications are more sophisticated, natural language processing applications like sentiment analysis are increasingly complex, and the number of ML models in development is staggering.
In 2019, we spoke with thousands of companies in various stages of machine learning maturity, and we developed hypotheses about the state of machine learning and the traction it’s gaining in the enterprise across all industries. In October, we undertook a massive survey effort, polling nearly 750 business decision makers from organizations thinking about, developing, and implementing robust machine learning efforts.
We analyzed the data we gathered, gleaning insight into various ML use cases, roadmaps, and the changes companies had seen in recent months in budget, R&D, and head count.
Data science: modern-day gold rush
We put together seven key findings from our analysis and published them in our 2020 State of Enterprise Machine Learning report. The first finding is likely not at all surprising: the field of data science is undergoing tremendous flux as word of demand, potential salaries, quick bootcamps, and open positions bounce around the internet.
But let’s dig into what we found in our survey data to get a better picture of what’s happening in the field.
The rise of the data science arsenal
One of the pieces of data we collected was the number of data scientists employed at the respondent’s place of work. We hear repeatedly from companies that management is prioritizing hiring for the data science role above many others, including traditional software engineering, IT, and DevOps.
Half of the people polled said their companies employ between one and 10 data scientists. This is actually down from 2018 (we polled in 2018 as well), when 58 percent of respondents said their companies employed between one and 10 data scientists. Like us, you might wonder why. We would have expected more companies to have one to 10 data scientists because investment in AI and ML is known to be growing (Gartner).
Movement in the data science realm
In 2018, 18 percent of companies employed 11 or more data scientists. This year, however, 39 percent of companies have 11 or more, suggesting that organizations are ramping up their hiring efforts to build data science arsenals of more than 10 people.
Another observation from 2018 was that barely 2 percent of companies had more than 1,000 data scientists; today that number is just over 3 percent, indicating small but significant growth. Companies in this data science bracket are likely the big FAANG tech giants—Facebook, Apple, Amazon, Netflix, and Google (Yahoo); their large data science teams are working hard to derive sophisticated insight from the vast amounts of data they store.
Demand for data scientists
Between 2012 and 2017, the number of data scientist jobs on LinkedIn increased by more than 650 percent (KDnuggets). The talent deficit and high demand for data science skills mean hiring and maintaining data science teams will only become more difficult for small and mid-sized companies that cannot offer the same salary and benefits packages as the FAANG companies.
As demand for data scientists grows, we may see a trend of junior-level hires having less opportunity to structure data science and machine learning efforts within their teams, as much of the structuring and program scoping may have already been done by predecessors who overcame the initial hurdles.
New roles, the same data science
We will likely also see the merging of traditional business intelligence and data science roles in order to fill immediate requirements in the latter talent pool since both domains use data modeling (BI work uses statistical methods to analyze past performance, and data science makes predictions about future events or performance).
Gartner predicts that the overall lack of data science resources will result in an increasing number of developers becoming involved in creating and managing machine learning models (Gartner CIO survey). This blending of roles will likely lead to another phenomenon related to this finding: more names and job titles for the same work. We are seeing an influx of new job titles in data science such as Machine Learning Engineer, ML Developer, ML Architect, Data Engineer, Machine Learning Operations (ML Ops), and AI Ops as the industry expands and companies attempt to distinguish themselves and their talent from the pack.
The 2020 report and predicting an ML future
The strategic takeaway from the 2020 State of Enterprise Machine Learning survey for us was that a growing number of companies are entering the early stages of ML development, but those that have moved beyond the initial stages are encountering challenges in deployment, scaling, versioning, and other sophistication efforts. As a result, we will likely see a boom in the number of ML companies providing services to overcome these obstacles in the near term.
We will do a deeper dive into the other key findings in the coming weeks. In the meantime, we invite you to read the full report and to interact with our survey data in our 2020 State of Enterprise Machine Learning interactive experience.
AI software enters business workflow
When we hear the term AI software, some of us think of a futuristic world where machine learning has taken artificial intelligence to extreme levels. Fortunately, today’s AI services provide tools for all types of businesses to interact with complex data.
AI software examples
Natural language processing (NLP) is a kind of AI software that interprets voice commands in home automation devices and provides the intelligence behind language translation.
Facial recognition is a machine learning use case that is used by social media platforms to accurately tag photos. Open-Source Facial Recognition is a deep learning model that recognizes not only that a face exists but also who the face belongs to.
The open availability of these and other models allows for data scientists to be immediately productive in their use of AI software for data analysis.
Infrastructure changes ahead for machine learning workflows
As more and more aspects of AI become mainstream, software and business services will include it as a critical part of their roadmaps. Existing infrastructure will have additional requirements geared more toward new problems a business is trying to solve with an AI software implementation.
The future-reaching nature and highly adaptable features of a centralized repository of machine learning models have already provided solutions to a large number of analytic problems with big data.
Algorithmia is leading the way to a machine learning–oriented future by providing a scalable deployment infrastructure that handles critical aspects of the machine learning lifecycle: deployment, manageability, scalability, and security. In this way, data scientists and DevOps can focus on using their expertise to do their intended jobs while Algorithmia seamlessly handles the rest. Designed to complement existing processes, Algorithmia will easily become your central hub for ML developments.
Typical languages for AI software development
Many programming languages used for AI software development are familiar to those accustomed to using powerful programs and scripting tools to automate various tasks. For instance, DevOps engineers use Python to manipulate data beyond normal read, write, and update routines.
Python is conducive to AI software creation because of its familiar object-oriented design, extensive libraries, and fast development time, which support neural network, NLP, and other solutions.
Scala is a prominent machine learning language and is gaining popularity because Spark, a big data processor, is written in Scala. Scala is a compiled language and offers flexibility and scalability, which lends itself well to big data projects.
Of course, Java is popular for its ease of use and ability for data scientists to debug and package models used. Large-scale projects take advantage of Java’s simplified workflow, and it has aspects that make it desired for graphical representations of data.
In addition to these languages, Algorithmia provides a treasure trove of pre-developed machine learning models in most major AI software languages, such as Python, R, Rust, Go, Swift, and Scala.
AI software should “just work”
Before tools, processes, and infrastructure matured, DevOps engineers were busy pioneering methods to automate products and services all the way to production. Key aspects of this CI/CD pipeline include source code management, building, packaging, and deployment, all of which must be done in a secure, repeatable manner with little to no human interaction necessary.
This usually involves loosely tying a number of different products and technologies together. The easiest approach is using an existing AI platform; there is no need to reinvent the wheel.
Frictionless AI and ML model management
Algorithmia handles everything that would normally require close collaboration between data scientists and DevOps engineers. Oftentimes, data scientists serve dual purposes: developing new tools and workflows in addition to solving critical business problems.
Moreover, DevOps likely has never had to deploy an ML model. By incorporating an auto-scaling, serverless platform, Algorithmia allows for consistent deployment of your models for internal or external consumption.
As with all problem-solving initiatives that involve large data sets, accessing that data quickly and without the need to migrate to alternate formats is paramount. In addition to data hosted in the AI Platform, data stored with major cloud providers connects to the project with ease using an intuitive interface. By using the concept of “collections,” the Algorithmia AI Platform’s Data Model Layer allows teams of customers to work in a private subset of models, moderate model publishing, and organize models into logical groups based on teams.
Avoiding AI software engineering and infrastructure pitfalls
Another critical aspect of a successful AI model deployment pipeline is quality documentation. The need to achieve fast results while also gaining the confidence of stakeholders is only possible if the team is aware of the full capabilities of the AI platform they choose.
The scalability of the Algorithmia platform is the product of much development in cloud computing. After pushing your model’s code with Git, Algorithmia takes over. It not only handles the DevOps aspects of publishing your model as an API, it controls all aspects of preparing the model for scale.
This advancement in AI software engineering enables data scientists to deliver solutions in a fraction of the time while providing tried and true DevOps processes that will not be foreign to an established team.
Start your machine learning journey on the right foot
Choosing the right AI platform for your team is probably the most influential factor in determining the direction in which your ML model development will mature.
Many companies that offer solutions in the AI software realm also offer a myriad of other services; Algorithmia only does AI software. For a demo of what Algorithmia can do for your company’s ML program, sign up here.
Data scientists and machine learning engineers often encounter a disconnect between what they learned (in school, a bootcamp, or independently) and how this knowledge is applied in their work. For instance, you may be proficient in R or Python, but still be unsure how the code or the libraries you’re pulling from relate to actual use cases.
Machine learning is more than its techniques
Machine learning techniques and principles are interesting to learn, but like many technical disciplines, they do not exist simply for the sake of existing.
Machine learning is the technical foundation of data science; practitioners use their knowledge of statistics combined with computer science to develop predictive models and uncover patterns in data. These models help businesses with tasks like pricing, developing new products, or identifying the best customer for a service.
So how does one go from coding in a classroom to advising executives on data-driven decision making? It starts with practice—designing and implementing your own machine learning models using real-world datasets. As you gain experience with more projects, you’ll begin to have a better understanding of which algorithms and methods are appropriate for specific types of questions.
Read on to learn more about machine learning projects and how you design them.
What are some machine learning projects?
Some classic machine learning projects that you may have already been exposed to as a student include:
- Boston Housing – This project is meant to teach simple linear regression. You predict the value of a home based on independent variables like the number of rooms in a dwelling, a town’s crime rate, and the number of owner-occupied homes nearby.
- Iris Flowers – This project teaches basic classification techniques. The purpose is to classify an observation into one of three species of iris based on four attributes: sepal length, sepal width, petal length, and petal width.
- Handwriting recognition – Using the Modified National Institute of Standards and Technology (MNIST) dataset, the goal of this project is to identify individual digits from handwritten images. This is a good project to attempt if you are interested in neural networks and deep learning AI.
- Breakfast cereals – The breakfast cereal data set contains nutrition and consumer rating information from 80 products. Because of its breadth of categorical and numerical variables, you can develop projects that include many machine learning techniques including dimension reduction and clustering.
All of the previous projects have corresponding datasets on Kaggle, a community where you can compete to solve real-world machine learning problems. It is also useful because it lets you see different people’s approaches to the same question.
Where do I find data?
While Kaggle is one place to start, there are plenty of other sources with which to find datasets. As you conduct more machine learning projects, you may find yourself wanting to analyze more obscure or lesser-known datasets. Here are a few places to look:
- UC Irvine Machine Learning Repository – The UCI repository maintains 488 datasets that range in topics from smartwatch activity to forest fire tracking. This is also the home of the Iris dataset we spoke about above.
- Data.gov – Multiple US federal agencies house their data here. This is an open-data resource for finding datasets related to social sciences, public health, economics, and education.
- Quandl – Quandl is a platform that houses finance and economics datasets created by hundreds of publishers. It offers free and paid services.
- FiveThirtyEight – This data-driven reporting outlet makes all of its sources available on GitHub. Topics include sports, politics, and pop culture.
How do you make a machine learning model from scratch?
If you really want to challenge yourself with a machine learning project, you can develop a model from scratch. As a reminder, a machine learning model is the equation or computation you develop by applying sample data to the parameters of an algorithm. An algorithm is a set of pre-defined rules.
Although the specifics will be different based on your actual project, here are the general steps of developing a machine learning model:
Find a problem to solve
Think about subjects that are interesting to you. Within that category find a problem that could be solved or at least initially approached through data analysis. For example, let’s say we’re interested in healthcare and we’d like to explore what it means to be a “good” hospital in the United States.
Find relevant data and refine the question
It’s unlikely that we’ll find a dataset that would directly answer such a broad question, but we have an idea of where we should start.
Medicare publishes data on hospital quality. One dataset in particular is a patient survey that lists an overall rating (1-5) based on dimensions like nurse communication, physician communication, and quietness of the hospitals.
Using this data, we can form more precise questions. For example:
Which components of patient satisfaction influence a hospital’s overall score?
Given a hospital’s nurse/physician/quietness/etc. data, what is its predicted overall quality score?
Import the data
Data can come in multiple formats, including JSON, XML, and CSV. How you choose to import the data will depend on if you want to conduct your analysis in R, Python, or a proprietary platform.
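As a minimal sketch of this step, the Python snippet below reads the same small table from CSV text and from JSON text using only the standard library. The column names (`hospital`, `nurse_communication`, `overall_rating`) and the values are made up for illustration; they are not taken from the actual Medicare dataset.

```python
import csv
import io
import json

# Hypothetical sample data; column names and values are illustrative only.
CSV_TEXT = """hospital,nurse_communication,overall_rating
A,4,4
B,2,3
"""
JSON_TEXT = (
    '[{"hospital": "A", "nurse_communication": 4, "overall_rating": 4},'
    ' {"hospital": "B", "nurse_communication": 2, "overall_rating": 3}]'
)


def load_csv(text):
    """Parse CSV text into a list of dicts, converting the numeric columns to int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "hospital": row["hospital"],
            "nurse_communication": int(row["nurse_communication"]),
            "overall_rating": int(row["overall_rating"]),
        })
    return rows


def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)
print(csv_rows == json_rows)  # both formats yield the same records
```

Note that CSV carries no type information, so numeric columns have to be converted explicitly, while JSON preserves numbers as numbers.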
Explore and clean the data
Now you’ll want to clean the data. This means getting rid of missing, null, and/or nonsensical values and possibly removing unnecessary columns. At this point you’ll also want to do some data visualizations to see if there are any interesting associations to explore further.
Visualizing the data may also help you figure out if your machine learning project is a supervised or unsupervised learning task. In brief, a supervised task has a target outcome (number or a category), while an unsupervised task does not. Unsupervised tasks are often used to find patterns and associations in the data.
If you are considering a supervised method, then you will need to determine if a classification or regression algorithm is most appropriate. This will depend on the question you are trying to answer.
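The cleaning step described above might look something like the following sketch. The rows, the `notes` column, and the assumption that every score is an integer from 1 to 5 are hypothetical; a real dataset will need its own validity rules.

```python
# Hypothetical survey rows; None and -1 stand in for missing or nonsensical values.
raw_rows = [
    {"hospital": "A", "nurse_communication": 4, "overall_rating": 4, "notes": "n/a"},
    {"hospital": "B", "nurse_communication": None, "overall_rating": 3, "notes": ""},
    {"hospital": "C", "nurse_communication": 5, "overall_rating": -1, "notes": ""},
    {"hospital": "D", "nurse_communication": 3, "overall_rating": 2, "notes": ""},
]

# Columns we assume are needed for the analysis; everything else is dropped.
KEEP_COLUMNS = ("hospital", "nurse_communication", "overall_rating")


def is_valid_score(value):
    """A score is treated as valid when it is an integer from 1 to 5 (an assumption)."""
    return isinstance(value, int) and 1 <= value <= 5


def clean(rows):
    """Drop unneeded columns, then drop rows with missing or out-of-range scores."""
    cleaned = []
    for row in rows:
        trimmed = {key: row[key] for key in KEEP_COLUMNS}
        if is_valid_score(trimmed["nurse_communication"]) and is_valid_score(
            trimmed["overall_rating"]
        ):
            cleaned.append(trimmed)
    return cleaned


clean_rows = clean(raw_rows)
print([row["hospital"] for row in clean_rows])
```

Dropping rows is the simplest option; depending on the question, imputing missing values may be a better choice.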
Develop and refine the model
This is where you begin to experiment and use your outside knowledge and intuition to make adjustments to your model. To train the model, and then validate it later, you will need to split the data into training, validation, and test datasets.
Many data scientists start with the most basic algorithms when developing a model and move up from there:
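One simple way to produce the three datasets is a shuffled split. The sketch below assumes a 60/20/20 ratio and a fixed random seed; both are illustrative choices, not recommendations from the survey or from any particular library.

```python
import random


def split_dataset(rows, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle a copy of rows and split it into train, validation, and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains after the first two splits
    return train, validation, test


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))
```

The model is fit on the training set, tuned against the validation set, and evaluated once on the test set, which it should never see during development.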
- Linear regression for basic regression.
- Logistic regression for basic classification.
- K-means clustering for an unsupervised task.
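To make the first of these concrete, here is a from-scratch sketch of simple linear regression using the closed-form least-squares solution for one input variable. The data points are hypothetical and were chosen to lie exactly on a line so the result is easy to check by hand.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance of x and y).
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (proportional to the variance of x).
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```

In this framing, the algorithm is the least-squares rule, and the model is the fitted pair of numbers (slope and intercept) it produces from the sample data.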
Communicate your results
Communication is the skill that sets the data tinkerers apart from those who influence business decisions. A crucial part of a data scientist’s job is communicating what they’ve uncovered to a company’s leaders. And they’re doing more than just reporting; they’re also offering recommendations based on what they observed in the data.
Say, for example, that in our Medicare data scenario you find that nurse communication is the variable most strongly correlated with the overall patient satisfaction score. What do you do with that information? Your recommendation might be to brief healthcare leaders on the importance of conducting further research into the effects of nurse communication at the highest- and lowest-rated hospitals.
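A finding like this could come from comparing the Pearson correlation of each survey dimension with the overall score. The sketch below uses invented scores for five hypothetical hospitals; the numbers are not Medicare data and exist only to show the calculation.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-hospital scores (one list entry per hospital).
survey = {
    "nurse_communication": [2, 3, 4, 5, 5],
    "physician_communication": [3, 3, 4, 4, 5],
    "quietness": [4, 2, 5, 3, 4],
}
overall_rating = [1, 2, 4, 5, 5]

correlations = {
    name: pearson(values, overall_rating) for name, values in survey.items()
}
# Pick the dimension with the largest absolute correlation.
strongest = max(correlations, key=lambda name: abs(correlations[name]))
print(strongest, round(correlations[strongest], 2))
```

Correlation alone does not establish cause, which is one reason the recommendation above is further research rather than an immediate policy change.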
It’s also good practice to post the results and the conclusions of your machine learning projects on your personal blog or to GitHub to share with the larger data science community. It’s an opportunity for you to help others learn, receive feedback, and possibly publicize a solution to a previously unsolved problem.
Machine learning model security is not discussed enough. In serverless GPU–attached environments, object storage solutions like S3 are dependable for persisting your model files. Other than the URI, no relevant information about the model file is saved in the source code.
This exposes an interesting angle to potentially attack an ML system in production. An exposed API key or backdoor could allow hackers to replace a model file with their own. Below we’ll talk about why we can’t trust models coming from untrusted sources. We will also demonstrate a quick and simple process to authenticate models before loading them into memory.
What are the security implications?
Using open-source tools like TensorFlow potentially exposes your platform to cyber attacks. Even though open-source communities are known to quickly patch up bugs, this time delta may be more than enough time for hackers to initiate an attack or drastically affect business operations.
If your models are analyzing sensitive information like credit card information for fraud detection or scanning legal documents to help with discovery for a court case, hackers could use an exploit to export this information back to themselves.
The question of why models have not been better protected until now has a simple answer. As with most emerging technologies, most companies do not or cannot determine what these exploits can be. Therefore, as the industry matures, these security measures need to be implemented quickly, especially for models that process highly sensitive data.
According to TensorFlow’s official documentation, TensorFlow models are basically programs and aren’t sandboxed within TensorFlow. A sandboxed application wouldn’t have access to files outside its environment and wouldn’t be able to communicate over the network. As it is, TensorFlow can read and write files, send and receive data over the network, and spawn additional processes, all of which are potentially vulnerable to attack.
The documentation summarizes that “TensorFlow models are programs, and need to be treated as such from a security perspective.”
Authenticating model metrics
Another place where authentication can take place is during continuous integration, where metrics like F1 scores are calculated. This will ensure that the model being tested in CI and the model being deployed in production are the same. Authenticating models prevents data scientists from accidentally overwriting model files and prevents fraudulent models from getting into production.
Authentication by computing model hash
Model inference source code is version-controlled via git. Since model files can be huge (several GBs), they must be stored in scalable object/blob storage systems, such as S3. Even though some object-storage services do offer the benefit of keeping track of file hashing, this isn’t the standard case across all services and may not be exposed through the service you’re using to deploy your model.
Simple file-based hash authentication
Because of the potential for model tampering, it makes sense to calculate the model file hash right after training and hard-code it into the source code, so that tampering with the model file in flight can be detected. This allows the inference service to verify the model file at runtime, before executing any TensorFlow model code. This is especially important when the only information about the model file that is hard-coded in the source code is its filename.
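A minimal sketch of this check is shown below. It is not the code from the notebook mentioned later; the choice of SHA-256, the function names, and the temporary stand-in file are all assumptions made for illustration.

```python
import hashlib
import hmac
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model_file(path, expected_hash):
    """Raise ValueError if the file's hash differs from the one recorded after training."""
    actual_hash = sha256_of_file(path)
    if not hmac.compare_digest(actual_hash, expected_hash):
        raise ValueError("model file hash mismatch; refusing to load " + path)


# Demo with a temporary stand-in for a real model file.
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "model.bin")
    with open(model_path, "wb") as handle:
        handle.write(b"trained weights")
    # In practice this value would be computed once after training
    # and hard-coded in the inference source code.
    expected = sha256_of_file(model_path)

    verify_model_file(model_path, expected)  # untouched file: passes silently

    with open(model_path, "wb") as handle:
        handle.write(b"tampered weights")  # simulate a swapped model file
    try:
        verify_model_file(model_path, expected)
        tamper_detected = False
    except ValueError:
        tamper_detected = True

print(tamper_detected)
```

The important design point is ordering: the hash check runs before the model file is handed to the framework's loader, so a swapped file is rejected without any of its code being executed.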
Advanced weight-based hash authentication
Another way to calculate hashes is to use the weights that are provided in model files. The benefit to this approach is that it would be independent of model format and would work across different frameworks.
Fingerprinting models in this approach would provide consistency, reliability, and reproducibility, and protect an ML system from vulnerabilities.
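One way such a fingerprint could be computed is sketched below, assuming the weights have already been extracted from the model file into a plain mapping of layer name to a flat list of floats. That extraction step is framework-specific and is not shown; the layer names and values here are hypothetical.

```python
import hashlib
import struct


def weights_fingerprint(weights):
    """Hash layer names and values in a fixed order with a fixed binary encoding."""
    digest = hashlib.sha256()
    for name in sorted(weights):  # sort so dictionary order does not affect the hash
        values = weights[name]
        encoded_name = name.encode("utf-8")
        # Length prefixes keep names and values from running into each other.
        digest.update(struct.pack("<Q", len(encoded_name)))
        digest.update(encoded_name)
        digest.update(struct.pack("<Q", len(values)))
        # Encode every value as a little-endian 64-bit float.
        digest.update(struct.pack("<%dd" % len(values), *values))
    return digest.hexdigest()


# Hypothetical weights for a tiny model.
original = {"dense/kernel": [0.1, -0.2, 0.3], "dense/bias": [0.0]}
reordered = {"dense/bias": [0.0], "dense/kernel": [0.1, -0.2, 0.3]}
tampered = {"dense/kernel": [0.1, -0.2, 0.31], "dense/bias": [0.0]}

same_after_reorder = weights_fingerprint(original) == weights_fingerprint(reordered)
same_after_tamper = weights_fingerprint(original) == weights_fingerprint(tampered)
print(same_after_reorder, same_after_tamper)
```

For this to work across frameworks, every exporter would have to agree on layer naming, value ordering, and numeric precision; any difference in those conventions changes the hash even when the weights are numerically equivalent.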
Model authentication demonstration on Algorithmia
We have implemented the simple file-based hash authentication method in a Jupyter notebook. The example trains a simple MNIST model, saves the model and calculates its hash, deploys the model with the hash, and authenticates the model at runtime before running it.
Ensure your ML models in production haven’t been hotswapped without anyone noticing.
Model security with Algorithmia
As machine learning becomes part of standard software development, vulnerabilities and new methods of attack are surfacing. Fortunately, Algorithmia has built-in security features to prevent model tampering and we are committed to stewarding these practices in the enterprise for everyone to benefit from. Algorithmia aims to empower every organization to achieve its full potential through the use of artificial intelligence and machine learning.
Read more about ML security
Adversarial Examples for Malware Protection (Patrick McDaniel)
Federated learning and securing data (Digitalist)