(Depiction of a clustering model, Medium)
Getting started with machine learning means understanding how and why particular methods are used. We’ve chosen five of the most commonly used machine learning models on which to base the discussion.
Before diving too deep, we thought we’d define some important terms that are often confused when discussing machine learning.
- Algorithm – A set of predefined rules used to solve a problem. For example, simple linear regression is a prediction algorithm used to find a target value (y) based on an independent variable (x).
- Model – The actual equation or computation that is developed by applying sample data to the parameters of the algorithm. To continue the simple linear regression example, the model is the equation of the line of best fit of the x and y values in the sample set plotted against each other.
- Neural network – A multilayered algorithm that consists of an input layer, an output layer, and one or more hidden layers in between. Each hidden layer transforms the output of the previous layer and passes it along until the network produces a final output. Neural networks are sometimes referred to as “black box” algorithms because humans don’t have a clear and structured idea of how the computer is making its decisions.
- Deep learning – Machine learning methods based on neural network architecture. “Deep” refers to the large number of hidden layers stacked between the input and the output (sometimes more than 100).
- Data science – A discipline that combines math, computer science, and business/domain knowledge.
Machine learning methods
Machine learning methods are often broken down into two broad categories: supervised learning and unsupervised learning.
Supervised learning – Supervised learning methods are used to find a specific target, which must also exist in the data. The main categories of supervised learning include classification and regression.
- Classification – Classification models often have a binary target sometimes phrased as a “yes” or “no.” A variation on this model is probability estimation in which the target is how likely a new observation is to fall into a particular category.
- Regression – Regression models always have a numeric target. They model the relationship between a dependent variable and one or more independent variables.
Unsupervised learning – Unsupervised learning methods are used when there is no specific target to find. Their purpose is to form groupings within the dataset or make observations about similarities. Further interpretation would be needed to make any decisions on these results.
- Clustering – Clustering models look for subgroups within a dataset that share similarities. Observations within a group are similar to each other but different from those in other groups. The groupings may or may not have any actual significance.
- Dimension reduction – These models reduce the number of variables in a dataset by grouping similar or correlated attributes.
It’s important to note that individual models are not necessarily used in isolation. It often takes a combination of supervised and unsupervised methods to solve a data science problem. For example, one might use a dimension-reduction method on a large dataset and then use the new variables in a regression model.
To that end, model pipelining splits machine learning workflows into modular, reusable parts that can be coupled with other model applications to build more powerful software over time.
What are the most popular machine learning algorithms?
Below we’ve detailed some of the most common machine learning algorithms. They’re often mentioned in introductory data science courses and books and are a good place to begin. We’ve also provided some examples of how these algorithms are used in a business context.
Linear regression
Linear regression is a method in which you predict an output variable using one or more input variables. With a single input, this is represented in the form of a line: y=bx+c, where b is the slope and c is the intercept. The Boston Housing Dataset is one of the most commonly used resources for learning to model using linear regression. Its 14 attributes include the median value of a home in the Boston area, which you predict from the other 13, such as crime rate per town, student/teacher ratio per town, and the average number of rooms per home.
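To make the line y=bx+c concrete, here is a minimal sketch of a least-squares fit with a single input variable. The room counts and prices are invented for illustration and are not taken from the Boston Housing Dataset. The helper names `fit_line` and `predict` are ours.

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope b and intercept c for y = b*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    c = y_bar - b * x_bar
    return b, c

def predict(x, b, c):
    """Apply the fitted model to a new observation."""
    return b * x + c

# Invented sample: number of rooms vs. home value in thousands of dollars.
rooms = [4, 5, 6, 7, 8]
value = [150, 200, 250, 300, 350]

b, c = fit_line(rooms, value)
estimate = predict(6.5, b, c)
print(f"y = {b:.1f}x + {c:.1f}; estimated value for 6.5 rooms: {estimate:.1f}")
```

In the terms defined earlier, `fit_line` plays the role of the algorithm, and the pair (b, c) it returns for this sample is the model.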
K-means clustering
K-means clustering is a method that forms groups of observations around geometric centers called centroids. The “k” refers to the number of clusters, which is determined by the individual conducting the analysis. Clustering is often used as a market segmentation approach to uncover similarity among customers or to uncover an entirely new segment.
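The sketch below shows one simple way the centroid idea can work: assign each observation to its nearest centroid, move each centroid to the mean of its members, and repeat until nothing changes. Starting from the first k points is a simplifying assumption on our part (real implementations usually pick starting centroids more carefully), and the customer data is made up.

```python
from math import dist

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            for p in points]

def k_means(points, k, iterations=10):
    # Assumption: naive initialization from the first k points.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids, assign(points, centroids)

# Made-up customers: (visits per month, average basket size).
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = k_means(customers, k=2)
print(centroids, labels)
```

Here the analyst supplies k=2, and the two resulting groups could be read as occasional and frequent shoppers, although that interpretation is up to the person doing the analysis.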
Principal component analysis (PCA)
PCA is a dimension-reduction technique that reduces the number of variables in a dataset by combining variables that are highly correlated and measured on (or standardized to) the same scale. Its purpose is to distill the dataset down to a smaller set of new variables, the principal components, that can still explain most of its variability.
A common application of PCA is aiding in the interpretation of surveys that have a large number of questions or attributes. For example, global surveys about culture, behavior or well-being are often broken down into principal components that are easy to explain in a final report. In the Oxford Internet Survey, researchers found that their 14 survey questions could be distilled down to four independent factors.
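As an illustrative sketch, the code below finds the first principal component of two correlated survey answers and reports how much of the total variability it explains. It uses power iteration on the covariance matrix, which is one simple way to get the leading component and not necessarily how a statistics package does it. The responses are invented, and the function names are ours.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix of the columns in rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def first_component(cov, iterations=100):
    """Power iteration: repeatedly multiply by cov and renormalize.
    Assumes the starting vector is not orthogonal to the leading component."""
    size = len(cov)
    v = [1.0] * size
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v (the leading eigenvalue).
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(size))
                   for i in range(size))
    return v, variance

# Invented answers from four respondents to two related questions.
answers = [(1, 2), (2, 1), (3, 4), (4, 3)]
cov = covariance_matrix(answers)
component, variance = first_component(cov)
explained = variance / sum(cov[i][i] for i in range(len(cov)))
print(component, round(explained, 2))
```

In this toy case a single new variable that weights both questions equally keeps about 80 percent of the variability, which is the kind of trade-off an analyst weighs when deciding how many components to keep.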
K-nearest neighbors (k-NN)
Nearest-neighbor reasoning can be used for classification or prediction depending on the variables involved. It compares the distance (often Euclidean) between a new observation and those already in a dataset. The “k” is the number of neighbors to compare and is usually tuned automatically, for example by trying several values with cross-validation, to minimize the chance of overfitting or underfitting the data.
In a classification scenario, the class held by the majority of the k nearest neighbors determines the class of the new observation. For this reason, k is often an odd number to prevent ties between two classes. For a prediction model, the average of the target attribute across the neighbors predicts the value for the new observation.
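Both uses can be sketched in a few lines. The example below measures Euclidean distance with `math.dist`, takes a majority vote for classification, and averages the neighbors' targets for prediction. The data points, labels, and function names are hypothetical.

```python
from collections import Counter
from math import dist

def nearest(train, query, k):
    """The k training items closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda item: dist(item[0], query))[:k]

def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbors."""
    votes = Counter(label for _, label in nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k=3):
    """Average target value of the k nearest neighbors."""
    neighbors = nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

# Hypothetical labeled observations: (features, class).
labeled = [((1, 1), "low"), ((1, 2), "low"), ((2, 2), "low"),
           ((6, 6), "high"), ((7, 6), "high"), ((6, 7), "high")]
predicted_class = knn_classify(labeled, (5, 5), k=3)

# Hypothetical numeric targets: (features, value).
numeric = [((1,), 10), ((2,), 20), ((3,), 30), ((10,), 100)]
predicted_value = knn_regress(numeric, (2.5,), k=2)

print(predicted_class, predicted_value)
```

With k=3 the new point at (5, 5) takes the class of its three closest neighbors, and with k=2 the predicted value is the mean of the two closest targets.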
Classification and regression trees (CART)
Decision trees are a transparent way to separate observations and place them into subgroups. CART is a well-known version of a decision tree that can be used for classification or regression. You choose a response variable and make partitions through the predictor variables. The computer typically chooses the number of partitions to prevent underfitting or overfitting the model. CART is useful in situations where “black box” algorithms may be frowned upon due to inexplicability, because interested parties need to see the entire process behind a decision.
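To show what a single partition looks like, the sketch below searches one predictor for the threshold that best separates a yes/no response, scoring each candidate with Gini impurity. This is only the first split of a CART-style tree, not a full implementation: a complete tree would repeat the search inside each subgroup and decide when to stop. The income figures and function names are invented.

```python
def gini(labels):
    """Gini impurity: 0.0 means every label in the group is the same."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Threshold on one predictor with the lowest weighted impurity."""
    best = None
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2  # midpoint between adjacent values
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

# Invented data: applicant income (thousands) and whether a loan was approved.
income = [20, 25, 30, 45, 50, 60]
approved = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(income, approved)
print(f"split at income <= {threshold}, weighted impurity {impurity}")
```

The resulting rule can be read directly by a non-specialist, which is the transparency that makes tree methods attractive when the reasoning behind a decision has to be shown.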
How do I choose the best model for machine learning?
The model you choose for machine learning depends greatly on the question you are trying to answer or the problem you are trying to solve. Additional factors to consider include the type of data you are analyzing (categorical, numerical, or maybe a mixture of both) and how you plan on presenting your results to a larger audience.
The five model types discussed herein do not represent the full collection of model types out there, but are commonly used for most business use cases we see today. Using the above methods, companies can conduct complex analysis (predict, forecast, find patterns, classify, etc.) to automate workflows.
To learn more about machine learning models, visit algorithmia.com.