
Analytic thinking has become a necessary skill for almost everyone working in a business environment. Although data scientists and analysts may be more intimately involved in handling and manipulating data, managers, executives, and other business leaders will be the ones making decisions based on a technical team’s insights and findings. 

Becoming a data-driven business requires everyone in an organization to understand the principles behind data science, including the machine learning techniques that transform raw data into insightful information. The purpose of this piece is to provide managers and aspiring data scientists an overview of the different methods that can be used to solve business questions.

What is a machine learning model?

In discussions about data science and machine learning, the term "model" is often thrown around. A machine learning model is the actual computation or equation that a data scientist develops after applying training data to an algorithm; it is the output of the later stages of a data science project.

This piece will focus on machine learning techniques or ways of approaching problems along with examples of algorithms that fall into these categories. We’ll also mention a few real-world machine learning models for clearer examples of how algorithms have been applied in the enterprise. 

What data does machine learning use?

Machine learning models can be developed with almost any kind of data you can imagine, numerical or categorical. For example, text-based data can be used for sentiment analysis, while images can be used to develop facial emotion recognition models.

Data for machine learning comes from various sources including internal databases within an organization (most likely proprietary), and public or open-source datasets. What kinds of data an individual will use to develop a machine learning model will depend on the business question they are trying to solve.

What are some popular machine learning methods?

Machine learning techniques are often broken down into three categories: supervised learning, unsupervised learning, and reinforcement learning. Supervised and unsupervised learning are more commonly used in the business context, so our focus will be on techniques from these two categories. 

Supervised learning methods are used to find a specific target (numerical or categorical), which must also exist in the data. Unsupervised methods are employed when there is no specific target in mind. They are often used to uncover patterns or natural groupings in the data. We’ll note that there are some algorithms that could fall into either category depending on the specificity of the question being asked. 

Supervised learning: classification vs regression 

Regression and classification are the two main subcategories of supervised learning. While both are predictive methods, regression has a numerical output, while classification predicts the category that a new observation would fall into. This is often a binary output, but you can create models for more than two categories. A variation of classification known as class probability estimation is a numerical output (from 0 to 1) of how likely it is that a new observation will fall into a particular category.
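The three kinds of supervised output, a number, a category, and a class probability, can be contrasted in a small sketch. The toy linear model, the logistic squashing of its score, and the 0.5 threshold are all illustrative assumptions:

```python
import math

def regression_output(x, slope=2.0, intercept=1.0):
    """Regression: a numerical prediction from a toy linear model."""
    return slope * x + intercept

def class_probability(x, slope=2.0, intercept=-4.0):
    """Class probability estimation: squash a linear score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(slope * x + intercept)))

def classification_output(x, threshold=0.5):
    """Classification: turn the probability into a binary category."""
    return "positive" if class_probability(x) >= threshold else "negative"

print(regression_output(3.0))        # numerical output: 7.0
print(class_probability(3.0))        # a probability between 0 and 1
print(classification_output(3.0))    # binary output: "positive"
```

The same underlying score can therefore feed all three output styles; which one you want depends on the business question.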

  • Linear regression: With linear regression, you can predict an output variable using one or more input variables. The relationship is represented in the form of a line: y = bx + c. Linear regression models are one of the most familiar types of models, as many people have been exposed to linear equations as part of their math education.
  • Support vector machine (SVM): SVM can be used for regression or classification. Linear SVM works by maximizing the distance between classes and drawing a line down the middle; new data is categorized by where it falls relative to that line. Non-linear SVM uses kernel functions to handle more complex, curved boundaries between classes.
  • Logistic regression: Despite the name, logistic regression is a classification algorithm; more specifically, it performs a class probability estimation task. A logistic function is applied to a linear equation, and the output is interpreted as the log-odds (a number ranging from -∞ to ∞) of a new event being a member of a particular class. The log-odds can then be translated into the probability (a number from 0 to 1) of a new item being a member of the class.
  • Decision tree: Decision trees are a supervised segmentation technique that divides the observations in a dataset into subgroups.
    • CART: CART (Classification and Regression Trees) is a well-known version of a decision tree that can be used for classification or regression. Once the data scientist chooses a response variable, the program makes partitions through the predictor variables, automatically choosing the number of partitions to avoid underfitting or overfitting the data. Decision trees are useful in situations where interested parties need to see the entire logical reasoning behind a decision.
  • Random forest: Simply put, a random forest is a group of decision trees that all have the same response variable but slightly different predictor variables. The output of a random forest model is calculated by taking a "vote" of the predicted classification from each tree and having the forest output the majority opinion.
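The majority-vote idea behind a random forest can be sketched in a few lines of plain Python. The three "trees" below are hand-written stumps standing in for trained decision trees, and the customer attributes and thresholds are purely hypothetical:

```python
from collections import Counter

# Hand-written stumps standing in for trained decision trees.
# Each "tree" classifies a hypothetical credit applicant.
def tree_income(customer):
    return "approve" if customer["income"] > 50_000 else "deny"

def tree_debt(customer):
    return "approve" if customer["debt_ratio"] < 0.4 else "deny"

def tree_history(customer):
    return "approve" if customer["years_of_history"] >= 3 else "deny"

def random_forest_predict(trees, customer):
    """Each tree votes; the forest returns the majority opinion."""
    votes = Counter(tree(customer) for tree in trees)
    return votes.most_common(1)[0][0]

applicant = {"income": 62_000, "debt_ratio": 0.55, "years_of_history": 5}
# Two of the three trees vote "approve", so the forest outputs "approve".
print(random_forest_predict([tree_income, tree_debt, tree_history], applicant))
```

In a real random forest, each tree is trained on a random sample of the data and a random subset of the predictor variables, which is what makes the ensemble more robust than any single tree.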

Unsupervised methods 

  • Clustering: Clustering refers to machine learning techniques used to find natural groupings among the observations in a dataset. Also known as unsupervised segmentation, clustering has two main types: hierarchical and k-means.
    • Hierarchical clustering: This method produces a tree-shaped structure known as a dendrogram, in which each node is a cluster based on the similarity of the observations it contains. Agglomerative hierarchical clustering is a bottom-up approach that starts with each observation as its own cluster; as you move up the tree, clusters merge until the top node contains every observation. Divisive clustering is the opposite: all observations begin in one cluster, which is divided downward until you reach the desired number of clusters. One of the most well-known hierarchical visualizations is the "Tree of Life" dendrogram that charts all life on Earth.
    • K-means clustering: K-means clustering is a machine learning algorithm that forms groups of observations around geometric centers called centroids. The "k" refers to the number of clusters, which is chosen by the individual conducting the analysis based on domain knowledge. This type of clustering is often used in marketing and market research to uncover similarity among customers or to reveal a previously unknown segment.
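The core k-means loop, assign each point to its nearest centroid, then move each centroid to the mean of its points, fits in a short sketch. The 2-D points below are made-up "customers" (e.g. age vs. monthly spend), and the naive initialization is a simplification; real implementations pick starting centroids more carefully:

```python
import math

def kmeans(points, k, iterations=10):
    """Bare-bones k-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its assigned points."""
    centroids = list(points[:k])  # naive init; real libraries seed smarter
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties out
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Two obvious groups of hypothetical 2-D customers.
points = [(1, 2), (1.5, 1.8), (1.2, 2.1), (8, 8), (8.5, 7.9), (7.8, 8.2)]
centroids, clusters = kmeans(points, k=2)
print(centroids)  # one centroid settles near each group
```

After a few iterations the centroids stop moving, which is the algorithm's convergence criterion in practice.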

Other machine learning techniques  

The following machine learning techniques can be applied to regression or classification problems. 

Data reduction 

Data reduction algorithms reduce the number of variables in a data set by grouping similar or correlated attributes.

  • Principal Component Analysis (PCA): PCA is a commonly used dimension reduction technique that groups together variables measured on the same scale that are highly correlated. Its purpose is to distill the dataset down to a smaller set of new variables that can still explain most of its variability. PCA is often used in the analysis of large survey datasets, where it makes interpretation much simpler and allows researchers to make assertions about the behaviors underlying the survey items.
  • Similarity matching: A similarity matching algorithm attempts to find similar individuals or observations based on the information that is already known about them. For example, a bank might use a similarity matching algorithm to find customers best suited for a new credit card based on the attributes of customers who already have the card.
  • K-Nearest Neighbor (KNN): Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It compares the distance (often Euclidean or Manhattan) between a new observation and those already in the dataset. The "k" is the number of neighbors to compare and is usually chosen to balance the risks of overfitting and underfitting, often via cross-validation. In a classification scenario, the class held by the majority of the k nearest neighbors determines the class of the new observation; for this reason, k is often an odd number to prevent ties. For a prediction model, an average of the neighbors' values for the target attribute predicts the value for the new observation. In the earlier banking and credit card scenario, a classification output might be a simple yes or no to extend an offer, while a prediction output might be the initial credit card limit offered to the customer.
  • Link prediction: This method tries to predict the existence and strength of a connection between two items. It is often used to power recommendations on social networking and e-commerce platforms. For example, if two unconnected people share a large number of mutual connections, a link-prediction model may suggest that they connect.
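The KNN classification described above, measure distances, take the k nearest labeled points, and let them vote, can be sketched directly. The (age, monthly spend) customers and their labels below are invented for illustration:

```python
import math
from collections import Counter

def knn_classify(new_point, labeled_points, k=3):
    """Classify by majority vote among the k nearest labeled neighbors,
    using Euclidean distance."""
    neighbors = sorted(labeled_points,
                       key=lambda pl: math.dist(new_point, pl[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, monthly spend) -> accepted the card offer?
history = [
    ((25, 200), "no"),  ((30, 250), "no"),  ((28, 220), "no"),
    ((45, 900), "yes"), ((50, 950), "yes"), ((48, 880), "yes"),
]
print(knn_classify((47, 910), history, k=3))  # → "yes"
```

Note that k=3 (odd) avoids ties in this two-class setting, echoing the point above; a prediction variant would average the neighbors' values for a numeric target instead of voting.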

Combined methods 

Business problems are complex, and you may find that you’ll need to use multiple machine learning techniques to achieve your goal. An important part of data science is understanding how these algorithms work together to answer questions. For example, a data scientist might use PCA in the development of a regression model by first combining similar variables to make the analysis more manageable. 
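That PCA-then-regression pipeline can be sketched with NumPy on synthetic data. Everything here, the shared factor, the noise levels, and the three correlated "survey items", is invented to make the combination visible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three highly correlated variables driven by one shared factor,
# and a response that depends on that factor.
n = 200
factor = rng.normal(size=n)
X = np.column_stack([factor + 0.1 * rng.normal(size=n) for _ in range(3)])
y = 2.0 * factor + 0.1 * rng.normal(size=n)

# Step 1 (PCA): center the data and project onto the first principal component.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]  # one new variable standing in for three correlated ones

# Step 2 (regression): fit y against the single component via least squares.
A = np.column_stack([pc1, np.ones(n)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"slope on PC1: {coef[0]:.2f}")
```

Because the three inputs are nearly copies of one factor, a single component captures almost all of their variability, and the downstream regression works with one predictor instead of three.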

What are the most important machine learning algorithms?

It’s hard to say which machine learning algorithm is the most important or best, or whether a single one even exists. The methods you use will depend on your specific project needs and the data you have available. A critical skill for anyone interested in using machine learning in a business environment is knowing how to organize a data science project and determine which algorithms and techniques are best suited to it.

Keep Learning

A deeper dive into supervised and unsupervised learning.

A look at random forest algorithms.

Open-source machine learning tools for use in data science projects.