We hear more and more every day that businesses are sitting on troves of valuable data. It’s compared to precious metals, unrefined oil, or cash in a vault. But those items aren’t valuable simply because they exist. Their value comes from what is created out of them. The same holds true for data. Rows full of numbers and text only become useful when you can tell stories and draw insights from them.
For those less familiar with data-driven business initiatives, the path from raw data, to extracting insights, to making decisions based on those insights may seem like a black hole. But like any process of turning a raw material into a valuable product, there is a system to follow and a way to avoid the black hole. In the case of data, it comes in the form of data science projects.
The intent of this article is to guide you through the process of creating and executing a data science project, including selecting machine learning models most appropriate for your goals. While this is written in the business context, this process is relevant to those working on personal projects as well.
What is a data science project?
A data science project is the structured process of using data to answer a question or solve a business problem. Such projects are becoming more common as companies grow more proactive about finding value in the data they have been storing. Common goals for these projects include:
- Developing more targeted and effective marketing campaigns
- Increasing internal operational efficiency
- Revenue forecasting
- Predicting likelihood of default (banking/financial services)
Prompting a data science project
There are two common scenarios in which a data science project might start. The first begins at the top of an organization with directives from senior management. They may have outlined specific problems to be explored and are looking for employees to find opportunities for improvement through the use of data. It’s common for organizations like this to have data scientists or senior analysts embedded in divisions of the organization. This helps them combine their technical skills with the business knowledge needed to draw out relevant insights.
Data science projects can also begin at the individual level. It’s not uncommon for an employee to notice a problem or inefficiency and want to fix it. If they have access to the company’s data warehouse and analytics tools, they may begin their investigation alone before bringing others in on the project.
An example of a data science project
A data scientist at a brick-and-mortar retailer may be tasked with developing a predictive model to judge the likely success of new store locations. The business goal of this project is for the company’s real estate and facilities teams to understand what makes established locations successful and to use that knowledge to guide decision making in future transactions.
Note that we will use this retail location example and variations on it for the entirety of this piece to further emphasize points.
How does machine learning fit into a data science project?
Before getting too far into this discussion, we need to define a few terms. There is often some confusion between machine learning and data science, with some individuals believing that one is “better” than the other, or that they are somehow mutually exclusive.
Data science is an encompassing term that refers to a discipline whose main pillars are:
- Mathematics, specifically statistics
- Computer science
- Business acumen and domain knowledge
Machine learning is a subfield of artificial intelligence. It is the process of using algorithms to learn from large amounts of data and then make predictions in response to specific questions. Machine learning modeling is where math and computer science intersect, as it takes compute power and a knowledge of programming to develop and build on these statistical models.
From these definitions, it should be clear that machine learning is a vital component of data science. It is the bridge between raw data and solving business problems. You will need to build models and validate them before drawing any conclusions or providing recommendations.
The data science workflow and project process
When beginning your data science project, it’s useful to frame it as a series of questions that we will discuss in detail.
What business problem am I trying to solve?
While your personal projects don’t necessarily require a specific focus, businesses are looking to reach certain targets like increasing revenue, cutting costs, operating more efficiently, decreasing customer churn, etc.
With that in mind, consider how the answer to your project question would influence the business. Ideally, it would give the company the information it needs to develop a plan of action.
Let’s illustrate this using our retail store example. Instead of asking “Which store brought in the most revenue during Q2?” frame it as “Why did store 123 bring in the most revenue in Q2?” The first question gives you a simple answer that probably can’t be acted upon without further research. The second question suggests that recommendations can easily be extracted from the answer.
If you are unsure of the question you want to ask, it’s helpful to first engage in exploratory analysis—making visualizations and small manipulations of the raw data, especially in your area of the business. If anything jumps out, or looks like an opportunity for further research, you can begin your question there.
Do I have all of the data I need to answer this question?
To develop a predictive model about retail store success, you probably need some of the following information:
- Store address
- Type of location (In a mall? Standalone building?)
- Revenue by period
- Square footage
- Daily traffic
- Number of employees per location
Your company likely has all of this information, but it’s probably stored within various SaaS applications and databases. In addition, you may need some information from publicly available data sources like demographics, population, and weather trends, to round out your picture of the location.
How will I put everything together in a manageable form?
Combining data sources into a form that you can analyze usually involves the ETL (Extract, Transform, Load) process through the use of one or multiple tools.
Here’s an overview of ETL:
- Extraction – The process of pulling data from various sources (relational databases, SaaS applications, etc.).
- Transformation – Data undergoes a series of changes based on rules that meet the requirements needed for analysis. This step includes data cleaning and normalization (putting numerical values in standard units).
- Load – Extracted and transformed data is sent to the end system, usually a data warehouse where it can be linked to an analytics tool.
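As a rough illustration, the three ETL steps above can be sketched in plain Python. The store records, field names, and revenue figures here are invented; in practice, extraction would query databases or SaaS APIs, and loading would target a real data warehouse.

```python
# Minimal ETL sketch in plain Python. All records, field names, and figures
# below are hypothetical.

def extract():
    # Extract: pull raw records from two hypothetical sources
    # (a point-of-sale system and a facilities database).
    pos_system = [{"store_id": "123", "q2_revenue_usd": "1,250,000"}]
    facilities = [{"store_id": "123", "square_feet": 12000, "type": "mall"}]
    return pos_system, facilities

def transform(pos_system, facilities):
    # Transform: clean revenue strings into numbers and join on store_id.
    by_store = {r["store_id"]: dict(r) for r in facilities}
    for r in pos_system:
        row = by_store.setdefault(r["store_id"], {"store_id": r["store_id"]})
        row["q2_revenue_usd"] = float(r["q2_revenue_usd"].replace(",", ""))
    return list(by_store.values())

def load(rows, warehouse):
    # Load: append the cleaned rows to the target table.
    warehouse.setdefault("store_metrics", []).extend(rows)

warehouse = {}
load(transform(*extract()), warehouse)
print(warehouse["store_metrics"])
```

Real pipelines add error handling, incremental loads, and scheduling, but the shape stays the same: pull, clean and join, then write to the system your analytics tool reads from.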
How will I approach the analysis?
Before deciding on the machine learning model you will use (we’ll get into some actual use cases in the next section), think about how you would frame the answer to your question. Maybe you’re going to make a prediction or possibly uncover segments. What you choose to do will depend on the type of data available to you and your business goals.
How will I communicate my results to a broader audience?
In other words, what do you plan to do with the results of your data science project? Will you create a dashboard, send a monthly report to interested parties, or only discuss the results when asked? Remember, you are trying to provide value to the business. This is a particularly important point to keep in mind for self-directed projects.
Which algorithms are used for machine learning?
Machine learning algorithms can be broken down broadly into two methods: supervised learning and unsupervised learning. A supervised method requires a defined target and labeled data to learn from. An unsupervised method does not have any specific target.
Let’s illustrate this difference with two questions related to retail stores in our hypothetical example.
- Unsupervised: Do our retail stores fall into natural groupings?
- Supervised: How can we identify stores with a high likelihood of converting customers into store credit card holders?
The supervised question has an explicit target: we want to find stores that share a business-specific characteristic. The unsupervised grouping isn’t looking for anything in particular.
It’s important to note that neither of these methods is “better” or more useful than the other. Their value depends completely on business goals. An unsupervised method is particularly useful when trying to uncover segments that don’t appear obvious by just looking at data laid out in spreadsheets.
In our retail store example, once stores are placed in natural groupings, business teams might be able to use their domain knowledge and intuition to infer something about these stores that is not explicitly laid out in the data. The supervised example is useful for a company that has a goal in mind and wants to bring all stores up to the level of the successful ones.
Supervised machine learning methods
- Regression – This is a predictive data science algorithm that explores the relationship between a dependent variable and one or more independent variables. The output is always a numeric value. Continuing with our example, you could use a linear regression to predict a new location’s potential revenue, given a set of numeric variables.
- Classification – This is a predictive method used to determine which category a new observation belongs to. The target output is two or more categories, often framed simply as “yes” or “no.” Example: Given the data we have about other store locations, and our definition of success, should we open a new store in this location? Yes/No.
- Class probability estimation – A binary classification is not always useful in every situation. Even our retail store example requires more nuance than a simple yes or no. This is the advantage of class probability estimation, which predicts the likelihood that a new observation belongs to a specific class. Example: Given the data we have about other stores, and our definition of success, what is the likelihood this new store will be successful? The output is a numeric estimate between 0 and 1.
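To make the regression case concrete, here is a toy one-variable least-squares fit in plain Python. The square-footage and revenue figures are invented for illustration; a real project would use a statistics library and many more variables.

```python
# Toy one-variable linear regression fit with the closed-form least-squares
# solution. The store sizes and revenue figures are invented.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Square footage of four existing stores vs. annual revenue in $k.
sq_ft = [8000, 10000, 12000, 15000]
revenue = [900, 1100, 1300, 1600]

slope, intercept = fit_line(sq_ft, revenue)
# Numeric prediction for a proposed 11,000 sq ft location.
predicted = slope * 11000 + intercept
print(round(predicted))  # 1200
```

Note the output is a number, not a category; that is the defining trait of regression versus the classification and class probability methods above.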
Unsupervised machine learning methods
- Clustering – The unsupervised question examples earlier would probably lead a data scientist to develop a clustering model. Clustering means grouping observations based on similarities. It’s also a form of exploratory data analysis. When interpreting clusters, you will need to look at the underlying components of each group, compute summary statistics, and compare this information to other groups. It’s important to determine if these clusters have any significant meaning based on your knowledge of the business.
- Dimension reduction – When attempting to analyze multiple large data sets, you can run into the problem of having too many intercorrelated variables. Dimension reduction is the process of eliminating redundant variables, shrinking the number of variables in a data set. Breaking down data into vital components can be analysis in and of itself, or it can be a first step in refining linear regression models. A commonly used dimension reduction algorithm is principal component analysis (PCA).
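As a sketch of the clustering idea, here is a minimal one-dimensional k-means (k=2) in plain Python. The per-store revenue values and starting centers are invented; real projects would use a library implementation over many features at once.

```python
# Minimal one-dimensional k-means (k=2) to illustrate clustering stores
# into natural groupings. Revenue values and starting centers are invented.

def kmeans_1d(values, centers, iterations=10):
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = {c: [] for c in centers}
        for v in values:
            nearest = min(centers, key=lambda c: abs(v - c))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [sum(vs) / len(vs) for vs in clusters.values() if vs]
    return sorted(centers)

annual_revenue = [0.9, 1.0, 1.1, 3.0, 3.2, 3.4]  # $M per store
centers = kmeans_1d(annual_revenue, centers=[0.0, 5.0])
print(centers)
```

The algorithm surfaces two natural groupings (roughly the $1M stores and the $3M stores); whether those groupings mean anything is the business-knowledge question discussed above.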
Neural networks and how they fit into data science algorithms
Neural networks have come in and out of fashion in the computer science and cognitive computing communities for the past seven decades. They have seen a resurgence recently because of an increase in compute power and more practical applications of the technology. Neural networks are also the underlying architecture of deep learning AI.
While neural networks are really their own discipline, we’ll discuss them briefly here. Neural networks have three parts: the input layer, the hidden layer, and the output layer. The input and output layers are part of almost any algorithm—you provide data, and the computer returns some information. The hidden layer is the interesting part. You can think of it as a stack of algorithms (supervised or unsupervised) that build on each other until they reach a final output.
Neural networks are often referred to as “black boxes,” meaning you don’t really have an understanding of the “thought” process. In some situations it may be fine not to know, but in other business contexts like financial services and credit scoring, this lack of transparency can be problematic. Keep this in mind if you are considering incorporating neural networks into your data science project.
An additional risk of neural networks is that they can fit training data too well (overfitting) and then perform poorly when applied to general population data.
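To make the input/hidden/output structure concrete, here is a minimal forward pass through a network with one two-neuron hidden layer, in plain Python. The weights are fixed by hand purely to show the mechanics; in a real network they would be learned from training data.

```python
# Minimal forward pass through a one-hidden-layer network in plain Python.
# The weights below are hand-picked only to show the mechanics.

import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs followed by a sigmoid activation.
    z = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 / (1 + math.exp(-z))

def forward(x):
    # Input layer: two features. Hidden layer: two neurons. Output: one value.
    h1 = neuron(x, [1.0, -1.0], 0.0)
    h2 = neuron(x, [-1.0, 1.0], 0.0)
    return neuron([h1, h2], [2.0, 1.0], -1.0)

score = forward([0.5, 0.2])
print(round(score, 3))
```

The “black box” problem is already visible here: the output is just arithmetic, but nothing in the intermediate values explains, in business terms, why the network produced that score.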
The importance of data structures and algorithms in data science
As we mentioned earlier, the technical component of data science skills is where math and computer science meet. Having a foundation in statistical methods is essential to data science, as is having an understanding of not just programming, but computer science itself.
Data structures and algorithms are the foundation of computer science. A data structure is an organized way of storing data and using it efficiently. And as discussed, an algorithm is an unambiguous, finite, step-by-step procedure to reach a desired output.
So why is this important to a data scientist? For one, developing algorithms for data science projects is not a one-time task. You will be constantly refining the model with new variables and rows of data. More data means greater demands on processors and longer access times for records. Large-scale data science projects cannot be efficiently modified or replicated without a base understanding of how data is organized and processed in a computer. Data scientists should not be reinventing the wheel every time they develop an algorithm. Instead, they should be thinking about how an algorithm can be easily scaled and reproduced.
People often confuse data science and machine learning, but they are in fact separate entities, despite what the memes say. Let’s clarify the differences between data science and machine learning and give some examples of how each is used in business settings.
Is data science the same as machine learning?
Data science and machine learning are similar but not the same thing. Data science is a broad category of work that deals with data and computing. Machine learning falls into that category, but not all data science is machine learning. It’s like how all squares are rectangles but not all rectangles are squares: all machine learning is data science, but not all data science is machine learning.
What is data science?
Data science includes programming skills and knowledge of mathematics and statistics with the goal of gaining meaningful insights from data. Data analysis, information engineering, artificial intelligence, and machine learning all fall under the category of data science.
What is machine learning?
Machine learning is a form of data analysis that automates analytical model building. As a branch of artificial intelligence, machine learning is based on the notion that systems can learn from datasets, identify patterns within them, and make decisions with minimal human intervention. In machine learning projects, a data scientist builds a model programmed to find patterns according to certain rules. Then, the model is fed training data and its results are quality-checked. Once it is properly trained, the machine learning model is ready to perform its function without the help of humans.
How is data science used in the enterprise?
Data science has a wide range of uses, involving all parts of the enterprise from marketing to finance. Data science has proved its value, and data scientists are always finding new ways to implement solutions in the enterprise. The most data-driven businesses tend to win, so companies today cannot expect to be successful without leveraging data. Here are a few of the ways data science is being used in the enterprise.
- Product Development: There is a lot of information that needs to go into product decisions. Data science makes it easier to analyze all the relevant data to come to the best conclusion possible. Data science makes product development not only more efficient but also metrics-based, a smart way to conduct business.
- Price Optimization: Keeping prices competitive is crucial in industries such as ecommerce. Data science can be used to scrape prices from competing sites and implement dynamic pricing to keep prices lower than the competition.
- Product recommendations: Recommended products often drive upsells on retail sites, and these are made using data science to analyze customer interactions with the website to glean behavioral trends and make recommendations.
- Customer Segmentation: Data analysis can be used to segment customers into different audiences. Companies have been segmenting customers for decades, but with data science, it is becoming a more robust practice.
How is machine learning used in the enterprise?
Machine learning is a more recent development in business. Some companies are just beginning to fully grasp the potential for machine learning at the enterprise level. The possibilities really are endless for machine learning use cases. Business processes and decisions that until recently required humans to crunch numbers and review data can now be handled by artificial intelligence algorithms. Here are some of the popular ways companies are using machine learning, but remember, there are always new solutions being developed.
- Fraud Detection: Models can be trained to analyze transaction details in real time and classify them as either legitimate or fraudulent, alerting the team when there is suspicious activity.
- Medical Diagnosis: Machine learning is now being used in healthcare diagnostics to identify patterns in images and other data. ML models can analyze MRIs, CAT scans, physician notes, and more.
- Demand Forecasting: Predictive models can make forecasts for future demand as well as other business metrics such as customer churn, customer retention, and sales forecasts.
- Image and Speech Recognition: Companies like Google use image recognition to classify images and for reverse image search and speech recognition for their virtual digital assistants and voice activated applications.
Algorithmia can help
Machine learning and data science are important innovations in the business world. Algorithmia understands the value of implementing machine learning at the enterprise level, which is why we created the AI Layer.
The AI Layer allows data scientists to focus on training models rather than infrastructure and deployment challenges. Machine learning models can be difficult to get into production, but with the AI Layer in place from the beginning, productionizing ML is painless.
The AI Layer empowers ML leadership, data scientists, and devops teams to deploy and serve machine learning models quickly, giving them valuable time back for focusing on evaluating model output and health.
Currently, data scientists are spending the majority of their time on infrastructure tasks—not their core roles. The AI Layer is a serverless microservices architecture that makes deploying, serving, and scaling challenge-free.
Get a demo of the AI Layer to see how it can benefit your organization.
Customers have an abundance of options when it comes to products for purchase. This excess of options, however, increases the risk of poor customer retention. Since acquiring new customers costs much more than keeping current customers, a higher retention rate is always better.
Customer retention represents the number of customers who continue purchasing from a company after their first purchase. This is usually measured as the customer retention rate, which is the percentage of customers your company has retained over a certain time period. The opposite of retention rate is churn rate, which represents the percentage of customers a company has lost over a given time period.
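The retention and churn arithmetic described above can be sketched as follows; the customer counts are invented:

```python
# Retention and churn rate from customer counts. The figures are invented.

def retention_rate(start_customers, end_customers, new_customers):
    # Retained customers are those at period end who were already
    # customers at the start (new sign-ups don't count as retained).
    return (end_customers - new_customers) / start_customers * 100

start, end, new = 200, 190, 30   # counts over the period
retained_pct = retention_rate(start, end, new)
churned_pct = 100 - retained_pct
print(retained_pct, churned_pct)  # 80.0 20.0
```

Subtracting new customers first matters: a company can grow its total customer count while still churning heavily underneath.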
Customer retention analytics can be done through machine learning, allowing companies to base their product and marketing strategies on predictive customer analytics rather than less reliable predictions made manually.
In a survey of more than 500 business decision-makers that Algorithmia conducted in the fall of 2018, 59 percent of large companies said that customer retention was their primary use case for machine learning technology.
What Is Customer Retention Analysis?
Customer retention analysis is the application of statistics to understand how long customers are retained before churning and to identify trends in customer retention. This type of analysis discerns how long customers usually stick around, whether seasonality affects customer retention, and which behaviors and factors differentiate retained customers from churned customers.
Why Is Customer Retention Analysis Important For Your Company?
Customer retention analysis is important for your company because it helps you understand which personas have higher retention rates and discern which features impact retention. This provides actionable insights that can help you make more effective product and marketing decisions.
It can be difficult for a product or sales team to know how well a product is actually performing with the target audience. They may think that features and messaging are on brand and clear because acquisition numbers are growing. However, just because new customers are purchasing a product does not necessarily mean customers like the product or service enough to stick around.
That is where customer retention analytics comes in. Every company needs data in order to make effective business and marketing decisions. Machine learning makes this easier than it has ever been before, which is great news for companies that wish to leverage this data.
How Do You Analyze Customer Retention?
Machine learning for customer retention analytics uses past customer data to predict future customer behavior. This is done using big data. In today’s data-driven world, companies can track hundreds of data points about thousands of customers. Therefore, the input data for the customer retention model could be any combination of the following:
- Customer demographics
- Membership/loyalty rewards
- Transaction/purchase history
- Email/phone call history
- Any other relevant customer data
During the model training process, this data will be used to find correlations and patterns to create the final trained model to predict customer retention. Not only does this tell you the overall churn risk of your customer base, but it can determine churn risk down to the individual customer level. You could use this data to proactively market to those customers with higher churn risk or find ways to improve your product, customer service, messaging, etc. in order to lower your overall churn rate.
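As a toy sketch of this idea, the snippet below “trains” a segment-level churn model by counting churn frequency in historical records, then uses those frequencies as risk scores. The records and the single loyalty-membership feature are invented; a real model would combine many features with a proper learning algorithm.

```python
# Toy segment-level churn model: count churn frequency per segment in
# historical records, then use the frequencies as risk scores. All records
# and the single loyalty-membership feature are invented.

from collections import defaultdict

history = [
    # (loyalty_member, churned)
    (True, False), (True, False), (True, True), (True, False),
    (False, True), (False, True), (False, False), (False, True),
]

counts = defaultdict(lambda: [0, 0])  # segment -> [churned, total]
for loyalty, churned in history:
    counts[loyalty][0] += int(churned)
    counts[loyalty][1] += 1

churn_risk = {seg: c / n for seg, (c, n) in counts.items()}
print(churn_risk)  # {True: 0.25, False: 0.75}
```

Here non-members churn at three times the rate of loyalty members, which is exactly the kind of finding that would steer proactive marketing toward high-risk customers.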
How Do You Improve Retention?
To improve retention, you have to first understand the cause of your retention issues. As discussed, machine learning models are a very efficient way to analyze customer retention to determine risks and solutions.
Data science teams can build the machine learning models necessary for this type of predictive analytics, but there are challenges associated with developing machine learning processes. For example, deploying models written in different languages is not easy, to say the least. Algorithmia’s AI Layer solves these issues using a serverless microservice architecture, which allows each service to be deployed independently with options to pipeline them together.
Another challenge is the cost of time lost to building, training, testing, deploying, and managing a model, let alone multiple models in a machine learning program.
Improving customer retention is one of the main uses Algorithmia’s early adopters focused on because it is one of the simpler machine learning models to build and use, and it’s even easier with the serverless microservices framework provided by the AI Layer. Our platform has built-in tools for versioning, deployment, pipelining, and integrating with customers’ current workflows. The AI Layer integrates with the technology your organization is currently using, fitting in seamlessly to make machine learning easier and getting you from data collection to model deployment and analysis much faster.
To learn more about how the AI Layer can benefit your company, watch a demo to see how much easier your machine learning projects can be.
TensorFlow 2.0 shipped today, 30 September 2019, with new features, such as faster debugging and iteration with Eager Execution, a TensorFlow-enhanced implementation of the Keras API, and simplification and compatibility improvements across its APIs. TensorFlow 2.0 is a major upgrade that should increase performance, streamline workflows, and provide more compatibility for new or updated models.
We offer day 1 support
At Algorithmia, we believe data scientists should be able to deploy and serve models from any framework and keep up with the pace of tool development. To that end, we’re eager to announce that we support model deployments in the TensorFlow 2.0 framework—Google’s latest version that was released today.
Our Enterprise customers will receive the same support in their next product update.
While TensorFlow 2.0 includes a conversion tool for existing 1.x models, those conversions will not be fully automatic. Rest assured that the AI Layer will remain fully backward-compatible with all previous versions of TensorFlow—and the more than 15 other frameworks we support.
We won’t stop there. We want to provide users with the freedom to choose the best tool for every job, and that means immediate support for future versions of TensorFlow and other frameworks in development. If you have any questions about framework support or our rollout schedule, please contact your account manager or send us a message.
Happy model deployment!
While consumer-facing applications of machine learning (ML) have gotten a lot of attention (Netflix, Uber, and Amazon), the back office deserves some recognition. Enterprise-level systems that run the business—think finance, robo-advisors, accounting, operations, human resources, and procurement—tend to be large, complex, and process-centric. But they also use and produce large amounts of both structured and unstructured data that can be handled in new ways to save time and money.
Machine learning combined with solution-specific software can dramatically improve the speed, accuracy, and effectiveness of back-office operations and help organizations reimagine how back-office work gets done.
A current trend among mid-size and large organizations is to implement Robotic Process Automation (RPA) in the back office to minimize manual tasks and achieve efficiencies. While there are specific use cases for which RPA is an appropriate technology, a machine learning approach differs from it in significant ways.
Robotic Process Automation and artificial intelligence
Robotic Process Automation is software that mimics human actions, while AI is built to simulate human intelligence. As an example, an RPA bot can be programmed to receive invoices via email (triggered on a specific subject line), download each invoice, and put it in a specific folder. An AI activity would be to “read” the invoice and extract the pertinent information, such as the amount, invoice number, supplier name, and due date.
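As a rough sketch of the “read the invoice” step, the snippet below pulls fields out of invoice text with regular expressions. The invoice contents and field formats are assumptions; production systems use OCR and NLP models rather than fixed patterns, precisely because real invoices don’t follow one layout.

```python
# Regular-expression sketch of extracting fields from invoice text.
# The invoice contents and field formats are assumptions.

import re

invoice_text = """
Invoice No: INV-2041
Supplier: Acme Fixtures Ltd.
Due Date: 2019-11-15
Amount Due: $4,250.00
"""

fields = {
    "invoice_number": re.search(r"Invoice No:\s*(\S+)", invoice_text).group(1),
    "due_date": re.search(r"Due Date:\s*(\d{4}-\d{2}-\d{2})", invoice_text).group(1),
    "amount": float(
        re.search(r"Amount Due:\s*\$([\d,]+\.\d{2})", invoice_text)
        .group(1)
        .replace(",", "")
    ),
}
print(fields)
```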
One of the more interesting downsides of RPA, as outlined by Gartner in its Magic Quadrant report on RPA vendors, is that RPA automations create long-term technical debt rather than overcoming it.
As you overlay RPA onto current technology and tasks, you are locking yourself into those technologies instead of updating and evolving.
Organizations must manually track the systems, screens, and fields that each automation touches in each third-party application and update the RPA scripts as those systems change. This is very challenging in a SaaS world, in which product updates happen much more regularly than they do on-premises.
As such, the shift toward AI, and specifically ML, is to improve process, not just speed. Here are five specific applications of ML that can be used to improve back-office operations:
Account reconciliation (finance)
Account reconciliations are painful and error-prone. They are also critical to every business to ensure the proper controls are in place to close the books accurately and on-time. Many companies do this manually (which really means using Excel, macros, pivot tables, and Visual Basic) or have invested in RPA, which doesn’t get you very far, or in a Boolean rules-based system, which is expensive to set up and not super accurate.
Challenges in Account Reconciliation
An ML approach is ideal for account reconciliations, specifically matching reconciliations, because you have ground-truth data—previous successful matched transactions and consistent fields in subsequent reconciliations. The challenge has been that for large and complex data sets, the combinatorial problem of matching is really hard. Companies like Sigma IQ have focused on this problem and solved it with a combination of machine learning and point-solution software as a hosted platform.
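A heavily simplified sketch of matching reconciliation: pair each bank transaction with a not-yet-matched ledger entry of the same amount. The IDs and amounts are hypothetical, and real systems also weigh dates, references, and the hard one-to-many combinations mentioned above.

```python
# Simplified matching reconciliation: pair each bank transaction with an
# unmatched ledger entry of the same amount. IDs and amounts are invented.

bank = [("b1", 1200.00), ("b2", 575.50), ("b3", 89.99)]
ledger = [("l1", 575.50), ("l2", 1200.00), ("l3", 42.00)]

matches, unmatched = [], []
remaining = dict(ledger)  # ledger id -> amount, consumed as we match
for tx_id, amount in bank:
    hit = next((lid for lid, amt in remaining.items() if amt == amount), None)
    if hit is not None:
        matches.append((tx_id, hit))
        del remaining[hit]
    else:
        unmatched.append(tx_id)

print(matches)    # [('b1', 'l2'), ('b2', 'l1')]
print(unmatched)  # ['b3']
```

An ML approach earns its keep where this exact-match greedy pass fails: fuzzy amounts, split payments, and many-to-many pairings learned from previously matched transactions.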
Invoice processing/accounts payable (accounting)
We introduced invoice processing earlier in this article as a use case for ML in the back office as a way to understand the differences between RPA and ML. The reality is that every business deals with invoices at some level, and as natural language processing (NLP) and ML advance, these improvements will roll down from the enterprise level to small businesses.
Aberdeen Group indicates that well-implemented accounts payable systems can reduce time and costs by 75 percent, decrease error rates by 85 percent, and improve process efficiency by 300 percent, so it makes sense to pursue, right?
Using ML to augment accounting
Companies like AODocs are extending their NLP and ML capabilities to take some of the pain out of invoice management by automatically capturing information from invoices and triggering the appropriate workflow. These types of solutions can greatly reduce or eliminate manual data entry, increase accuracy, and match invoice to purchase order.
Employee attrition detection (HR)
There are many applications of AI in the HR function, including applicant tracking and recruiting (resume scanning and skills analysis), attracting talent before hiring, individual skills management/performance development (primarily via regular assessment analysis), and enterprise resource management.
Using ML to track employee satisfaction
One interesting use case from an ML/NLP perspective is employee attrition. Hiring is expensive, and retaining employees and keeping them happy is imperative to sustainable growth. Identifying attrition risk requires source data—like a consistently applied employee survey that uses unstructured data analysis for the open-field comments. Overlaying this data with factors such as tenure, time since last pay raise or promotion, sick days used, scores on performance reviews, skill set competitiveness with the market, and generally available employment market data can help assess an employee’s attrition risk.
Predicting repairs and upkeep for machinery (operations)
The influx of sensors into all types of equipment, including trucks, oil rigs, assembly lines, and trains, means an explosion of data on the usage and wear of such equipment. Pairing this data with historical records of when certain types of equipment need preemptive maintenance means expensive machinery can be scheduled for downtime and repair based not just on the number of hours used or miles driven, but on actual usage.
Predix, from General Electric, powers industrial apps that process the historical performance data of equipment. Its sensors and signals can be used to discern a variety of operational outcomes, such as when machinery might fail, so that you can plan for—or even prevent—major malfunctions and downtime.
Predictive analytics for stock in transit (procurement)
For companies that spend a lot of money on hard goods that need to be moved for either input into manufacturing or delivery to a retail shelf, stock in transit is a major source of opportunity for applying ML models to predict when goods will arrive at a destination.
Item tracking has improved dramatically with sensors, but it is only a point-in-time solution that doesn’t predict when the goods will arrive or when they should arrive. Weather, traffic, type of transport, risk probabilities, and historical performance are all part of the data that can help operations nail the flow of goods for optimal process timing.
Stock in Transit
SAP S/4HANA has an entire module dedicated to making trade-off predictions between different options for stock in transit solutions to meet customer order objectives.
Further opportunities for back office machine learning
These are just five of the hundreds of use cases in which ML, paired with solution-specific software, can improve the way the back office functions. Whether it is cutting down on manual tasks, improving accuracy, reducing costs, or helping teams change their critical processes wholesale, machine learning can augment nearly every back-office process.