YData and Algorithmia: High-quality data meets enterprise MLOps
Editor’s note: Today’s post is a guest post by Fabiana Clemente, Founder and Chief Data Officer at YData, in collaboration with Aslı Sabancı, Applied Machine Learning Engineer at Algorithmia. It was originally published on the YData blog on April 6, 2021.
According to the 2021 enterprise trends in machine learning report by Algorithmia, 83% of all organizations have increased their AI/ML budgets year-on-year, and the average number of data scientists employed has grown by 76% over the same period. However, the process of developing machine learning–based solutions goes far beyond model development—and you need a lot more than just the right budget and staffing resources to succeed.
For ML models to deliver good results and improve over time, they need to be trained on high-quality data. They must also be rebuilt and tuned continuously, so that the insights they deliver stay accurate and relevant and performance holds up in production. And of course, you won’t get any value from ML unless your models are actually put into production.
Many organizations struggle to extract the full value from their ML investments for these reasons. The same Algorithmia report revealed a number of challenges that organizations face throughout the ML lifecycle, especially with ML governance and integration with ML technologies. In fact, the report found that the time required to deploy a model has actually increased, due in large part to operational and tooling concerns such as these.
That’s why we’re excited to share a new integration between YData and Algorithmia. YData is the first data development platform for improved data quality. Algorithmia is the enterprise machine learning operations (MLOps) platform. It manages all stages of the production ML lifecycle within existing operational processes, so you can put models into production quickly, securely, and cost-effectively—unlocking the value contained in them for your business.
This blog post explains how you can combine these two machine learning platforms, not only to improve the quality of the data you use to train your ML models, but also to enable those models to deliver useful insights in a production environment.
How YData and Algorithmia work together
As mentioned previously, poor data quality is a major challenge preventing organizations from extracting the full value from their ML investments. It might sound simple to solve data issues such as missing data, imbalanced datasets, and absence of labels, but those who work in data science know that good data is an endangered species in enterprise environments. Many factors can lead to training models with poor data, from scarcity of data to a lack of technical expertise within data science teams.
And while improving the quality of your training datasets is very important to ensure better outcomes and generalization from ML models, it’s just as important to ensure an easy path to model deployment. Combining YData’s platform and Algorithmia can not only increase the productivity of data science teams and improve the value delivered by ML models, but also relieve the pain points and bottlenecks of deploying solutions to production environments.
YData allows for seamless integration with Algorithmia, enabling organizations to speed up their path to production for ML models built with high-quality data.
Improving data quality with YData
Imbalanced datasets are a reality in industry, where the distribution of the represented classes is often biased or skewed. Imbalanced classification problems are not always straightforward to solve. One solution, however, is to balance the classes by augmenting the less-represented ones.
In this demo, we’ll walk through the end-to-end process of developing a classification model for a highly imbalanced dataset: using YData’s open-source synthetic data generation tooling to augment the fraudulent events in the Kaggle Credit Card Fraud dataset, then serving the trained model on Algorithmia.
Image: Preview of the credit card dataset.
Image: Plot of V2 and V4 original dataset eigenvectors.
This dataset is heavily biased towards “Normal” transactions, with a very low proportion of real “Fraud” transactions to train on. This severely hampers the model’s performance, because nearly all the samples it sees during training come from the majority “Normal” class.
Image: The dataset is highly imbalanced, with a ratio of about 1 fraudulent event to 580 normal events.
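To make the imbalance concrete, here is a minimal sketch of how such a ratio could be measured from a label column. The helper name `imbalance_summary` and the toy labels are illustrative assumptions, not part of the original demo; with the real dataset you would pass in its label column instead.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return per-class counts and the number of majority samples per minority sample."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to measure imbalance")
    majority_n = max(counts.values())
    minority_n = min(counts.values())
    return counts, majority_n / minority_n


# Toy labels for illustration only: 0 = normal, 1 = fraud.
toy_labels = [0] * 5800 + [1] * 10
counts, ratio = imbalance_summary(toy_labels)
print(dict(counts), f"about {ratio:.0f} normal transactions per fraud")
```

Checking this number before training is a cheap way to decide whether balancing is worth considering at all.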
We train an XGBoost classifier on the data. XGBoost is a popular library among machine learning practitioners, known for its high-performance, memory-efficient implementation of gradient-boosted decision trees, in which each boosting round incrementally improves on the previous ones.
Image: Baseline results using a tree-based algorithm on the imbalanced dataset.
YData’s Python SDK allows you to generate synthetic data samples that share the characteristics of the original data. Here, we generate 400 new samples of “Fraud” transactions to balance the dataset, since the number of “Fraud” transactions as a percentage of total transactions is too low to form a good training set. These synthetic samples are then added to the original training data.
Image: The synthetic generated frauds vs. the original ones.
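The step of adding synthetic samples to the training data can be sketched as follows. This does not show the YData SDK itself: the `synthetic` rows below are hand-written stand-ins for whatever a generator returns, and the function name `add_synthetic_frauds` is a hypothetical helper. The sketch only illustrates that the new rows go into the training split, leaving the test split as real data.

```python
def add_synthetic_frauds(X_train, y_train, synthetic_rows, fraud_label=1):
    """Append synthetic minority-class rows to the training split.

    Returns new lists; the inputs are left unchanged.
    """
    if synthetic_rows and X_train and len(synthetic_rows[0]) != len(X_train[0]):
        raise ValueError("synthetic rows must have the same number of features")
    X_aug = list(X_train) + list(synthetic_rows)
    y_aug = list(y_train) + [fraud_label] * len(synthetic_rows)
    return X_aug, y_aug


# Toy training split for illustration: three normal rows, one fraud row.
X_train = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4], [2.5, 3.1]]
y_train = [0, 0, 0, 1]

# Stand-in for generator output (assumption: same feature layout as X_train).
synthetic = [[2.4, 3.0], [2.6, 3.3]]

X_aug, y_aug = add_synthetic_frauds(X_train, y_train, synthetic)
print(sum(y_aug), "fraud rows out of", len(y_aug))
```

Keeping the test split free of synthetic rows means the reported metrics still describe how the model handles real frauds.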
Plotting the synthetic fraud samples shows that the nature of the generated data is similar to the original fraud samples in the training set.
By adding samples of the “Fraud” class, the synthetic data reduces the bias in the original dataset and improves the model’s accuracy and precision, without changing the inherent nature of the data samples or introducing skewed and unrealistic data points.
Note the improved model metrics after the synthetic data was added to the training set.
Image: Improvement of 7% in accuracy when detecting frauds in the test dataset.
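When comparing a baseline against an augmented model on data this imbalanced, it can help to look at precision and recall for the fraud class alongside accuracy. The sketch below uses toy predictions, not the demo's results, and the helper `binary_metrics` is an illustrative assumption. It shows how a model that never flags fraud can still score high accuracy.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy test set for illustration: 990 normal rows followed by 10 fraud rows.
y_true = [0] * 990 + [1] * 10

# A model that always predicts "normal" looks accurate but finds no fraud.
always_normal = binary_metrics(y_true, [0] * 1000)

# A model with 3 false alarms that catches 7 of the 10 frauds.
some_detection = binary_metrics(y_true, [1] * 3 + [0] * 987 + [1] * 7 + [0] * 3)

print(always_normal)
print(some_detection)
```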
Serving the model on Algorithmia
Our next step is to serve our trained model on Algorithmia. We start by initializing the Algorithmia client and uploading our serialized model file to Algorithmia’s hosted data collection.
Image: Initialize Algorithmia client and upload the serialized model files.
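As a rough sketch of this step, the helper below creates a hosted data collection if it does not exist and uploads a model file to it. It relies on the Algorithmia Python client's `dir(...).exists()`, `dir(...).create()`, and `file(...).putFile(...)` calls; the function name `upload_model`, the collection path, and the file name are placeholders chosen for illustration, not values from the original demo.

```python
def upload_model(client, local_path, collection_uri, file_name):
    """Create the hosted data collection if needed, then upload the model file.

    `client` is expected to be an Algorithmia client object.
    Returns the data URI the file was uploaded to.
    """
    collection = client.dir(collection_uri)
    if not collection.exists():
        collection.create()
    remote_uri = f"{collection_uri.rstrip('/')}/{file_name}"
    client.file(remote_uri).putFile(local_path)
    return remote_uri


# Usage (needs the `algorithmia` package and a real API key; values are placeholders):
#   import Algorithmia
#   client = Algorithmia.client("YOUR_API_KEY")
#   upload_model(client, "model.pkl", "data://YOUR_USERNAME/fraud_models", "model.pkl")
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real account.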
After uploading the model, we continue by creating a Python algorithm that will be built on Algorithmia’s Python 3.7 runtime environment.
Image: Create the Algorithm to serve the model.
After creating the algorithm, we list its dependencies in a requirements.txt file and put our model-serving code in a .py file. To push these files to our algorithm’s Git repository, we can use YData Lab’s Git tools or Algorithmia’s Web IDE.
Once the algorithm files are pushed to the repo, we build our algorithm and our serving endpoint is ready. We can then send it some test input to make sure the model predictions are as expected.
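For orientation, model-serving code on Algorithmia is typically a module exposing an `apply(input)` function. The sketch below is one possible shape for it, under several assumptions that are not taken from the original demo: the model was serialized with pickle, it exposes a scikit-learn-style `predict`, requests look like `{"rows": [[...], ...]}`, and `MODEL_URI` is a placeholder path. The matching requirements.txt would presumably list at least `algorithmia` and `xgboost`.

```python
import pickle

# Placeholder location of the uploaded model (assumption, not from the demo).
MODEL_URI = "data://YOUR_USERNAME/fraud_models/model.pkl"

_model = None  # cached so the model is loaded once, not on every request


def _load_model():
    """Download and deserialize the model on first use, then reuse it."""
    global _model
    if _model is None:
        import Algorithmia  # imported lazily; present on the Algorithmia runtime

        client = Algorithmia.client()
        local_path = client.file(MODEL_URI).getFile().name
        with open(local_path, "rb") as fh:
            _model = pickle.load(fh)  # assumes a pickle-serialized model
    return _model


def predict_rows(model, payload):
    """Validate the request payload and return one prediction per row."""
    rows = payload.get("rows") if isinstance(payload, dict) else None
    if not rows or not all(isinstance(r, list) for r in rows):
        raise ValueError('expected input like {"rows": [[feature, ...], ...]}')
    # A real XGBoost model may need a NumPy array or DataFrame here; convert if so.
    predictions = model.predict(rows)
    return {"predictions": [int(p) for p in predictions]}


def apply(input):
    """Entry point that Algorithmia calls for each request."""
    return predict_rows(_load_model(), input)
```

Splitting validation and prediction into `predict_rows` lets that logic be checked locally with a stub model, without any platform access.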
Image: Test the Algorithm by passing the data sample.
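Calling the endpoint from Python can be sketched as below, using the Algorithmia client's `algo(...).pipe(...)` call and the `result` attribute of its response. The helper name `call_algorithm`, the algorithm reference, and the payload shape are hypothetical placeholders; the real input format depends on how the serving code parses its input.

```python
def call_algorithm(client, algo_ref, payload):
    """Send one payload to a published algorithm and return its result."""
    response = client.algo(algo_ref).pipe(payload)
    return response.result


# Usage (needs the `algorithmia` package and a real API key; values are placeholders):
#   import Algorithmia
#   client = Algorithmia.client("YOUR_API_KEY")
#   sample = {"rows": [[0.1, 0.2, 0.3]]}  # hypothetical payload shape
#   print(call_algorithm(client, "YOUR_USERNAME/fraud_detector/0.1.0", sample))
```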
Get started with the YData-Algorithmia integration
Many challenges prevent organizations from unlocking the full value of their ML investments. The better your AI infrastructure covers the data science lifecycle, the more time your data science teams have to focus on building business value into the models. When used together, YData and Algorithmia can help you build models based on high-quality data and deploy them rapidly to production, freeing up your data scientists to focus on innovation and achieve greater results for your business.
For a step-by-step walkthrough of this integration, you can check out the repository and implement similar workflows for yourself.