What Is Machine Learning?
JeanFrancoisPuget | May 18 2016
Can you explain to me what machine learning is? I often get this question from colleagues and customers, and answering it is tricky. The hard part is conveying the intuition behind what machine learning is really useful for.
I'll review common answers and give you my preferred one.
The first category of answers to the question is what IBM calls cognitive computing. It is about building machines (computers, software, robots, web sites, mobile apps, devices, etc.) that do not need to be programmed explicitly. This view of machine learning can be traced back to Arthur Samuel's definition from 1959:
Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel is one of the pioneers of machine learning. While at IBM he developed a program that learned to play checkers better than he could.
Samuel's definition is a great one, but maybe a little too vague. Tom Mitchell, another well-regarded machine learning researcher, proposed a more precise definition in 1998:
Well posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Let's take an example for the sake of clarity. Let's assume we are developing a credit card fraud detection system. The task T of that system is to flag credit card transactions as fraudulent or not. The performance measure P could be the percentage of fraudulent transactions that are detected. The system learns if the percentage of fraudulent transactions that are detected increases over time. Here the experience E is the set of already processed transaction records. Once a transaction is processed, we know whether it was fraudulent or not, and we can feed that information to the system for it to learn.
Note that the choice of the performance measure is critical. The one we chose is too simplistic: if the system flags all transactions as fraudulent, it achieves 100% on that measure, yet such a system would be useless! We need something more sensible, like detecting as much fraud as possible while flagging as few honest transactions as possible. Fortunately, there are ways to capture this double goal, but we won't discuss them here. The point is that once we have a performance metric, we can tell whether the system learns from experience or not.
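For readers who want a concrete picture of why the simplistic measure fails, here is a minimal Python sketch. The transaction data is made up for illustration, and the pairing of "recall" with "precision" is one common way to express the double goal, not necessarily the one the author has in mind.

```python
# Hypothetical toy data: True means the transaction really was fraudulent.
actual = [True, False, False, True, False, False, False, False, False, False]

def recall(actual, flagged):
    """Share of fraudulent transactions that were flagged (the measure P from the text)."""
    true_positives = sum(a and f for a, f in zip(actual, flagged))
    return true_positives / sum(actual)

def precision(actual, flagged):
    """Share of flagged transactions that really were fraudulent."""
    true_positives = sum(a and f for a, f in zip(actual, flagged))
    return true_positives / sum(flagged)

# A useless "detector" that flags every single transaction as fraud.
flag_everything = [True] * len(actual)

print(recall(actual, flag_everything))     # 1.0: every fraud is caught
print(precision(actual, flag_everything))  # 0.2: but only 2 of 10 flags are right
```

The flag-everything system scores perfectly on the first measure and poorly on the second, which is why a single percentage of detected fraud is not enough. The sketch assumes at least one fraudulent and one flagged transaction; otherwise the divisions would fail.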
Machine Learning Algorithms
The above definitions are great as they set a clear goal for machine learning. However, they do not tell us how to achieve that goal. We should make our definition more specific. This brings us to the second category of definitions, which describe machine learning algorithms. Here are some of the most popular ones. In each case the algorithm is given a set of examples to learn from.
Supervised Learning. The algorithm is given training data which contains the "correct answer" for each example. For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions. For each transaction, the training data would contain a flag that says if it is fraudulent or not.
Unsupervised Learning. The algorithm looks for structure in the training data, like finding which examples are similar to each other and grouping them in clusters.
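The difference between the two families can be sketched in a few lines of Python. Everything here is an illustrative assumption: the transaction amounts are invented, the supervised part uses a nearest-neighbor rule, and the unsupervised part uses a tiny one-dimensional two-cluster k-means. Neither is claimed to be what a real fraud system would use.

```python
# Hypothetical transaction amounts; the labels are only used in the supervised case.
amounts = [12.0, 15.0, 9.0, 980.0, 1020.0, 14.0]
labels = ["honest", "honest", "honest", "fraud", "fraud", "honest"]

def nearest_neighbor_predict(train_x, train_y, x):
    """Supervised: answer with the label of the closest training example."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

def two_means(values, iterations=10):
    """Unsupervised: split values into two clusters (1-D k-means with k=2).

    Assumes the values are not all identical, so both clusters stay non-empty.
    """
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(cluster_low) / len(cluster_low)
        high = sum(cluster_high) / len(cluster_high)
    return cluster_low, cluster_high

# Supervised: the "correct answers" in labels drive the prediction.
print(nearest_neighbor_predict(amounts, labels, 995.0))  # fraud

# Unsupervised: no labels at all, only structure found in the amounts.
print(two_means(amounts))  # ([12.0, 15.0, 9.0, 14.0], [980.0, 1020.0])
```

Note that the clustering finds two groups but cannot name them: deciding that the high-amount cluster means "fraud" requires the labels that only the supervised setting provides.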
We have more concrete definitions, but still no clue about what to do next.
Machine Learning Problems
If defining categories of machine learning algorithms isn't good enough, then can we be more specific? One possible way is to refine the task of machine learning by looking at classes of problems it can solve. Here are some common ones:
Regression. A supervised learning problem where the answer to be learned is a continuous value. For instance, the algorithm could be fed with a record of house sales with their price, and it learns how to set prices for houses.
Classification. A supervised learning problem where the answer to be learned is one of finitely many possible values. For instance, in the credit card example the algorithm must learn how to find the right answer between 'fraud' and 'honest'. When there are only two possible values we say it is a binary classification problem.
Segmentation. An unsupervised learning problem where the structure to be learned is a set of clusters of similar examples. For instance, market segmentation aims at grouping customers in clusters of people with similar buying behavior.
Network analysis. An unsupervised learning problem where the structure to be learned is information about the importance and the role of nodes in the network. For instance, the PageRank algorithm analyzes the network made of web pages and their hyperlinks, and finds which pages are the most important. This is used in web search engines like Google. Other network analysis problems include social network analysis.
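As an illustration of the network analysis case, here is a simplified power-iteration sketch of the PageRank idea in Python. The four-page web is hypothetical, the damping factor of 0.85 is the conventionally quoted value, and the sketch assumes every page has at least one outgoing link. It is a teaching sketch, not the algorithm as run by any search engine.

```python
# Hypothetical four-page web: each page maps to the pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    Assumes every page has at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share, then shares from its in-links.
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
most_important = max(ranks, key=ranks.get)
print(most_important)  # C: it is linked to by A, B and D
```

No page is labeled as important in the input; the ranking emerges purely from the link structure, which is what makes this an unsupervised problem.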
The list of problem types where machine learning can help is much longer, but I'll stop here because this isn't helping us that much. We still don't have a definition that tells us what to do, even if we're getting closer.
Machine Learning Workflow
The issue with the above definitions is that developing a machine learning algorithm isn't good enough to get a system that learns. Indeed, there is a gap between a machine learning algorithm and a learning system. I discussed this gap in Machine Learning Algorithm != Learning Machine where I derived this machine learning workflow:
A machine learning algorithm is used in the 'Train' step of the workflow. Its output (a trained model) is then used in the 'Predict' part of the workflow. What differentiates a good machine learning algorithm from a bad one is the quality of the predictions we will get in the 'Predict' step. This leads us to yet another definition of machine learning:
The purpose of machine learning is to learn from training data in order to make predictions that are as good as possible on new, unseen data.
This is my favorite definition, as it links the 'Train' step to the 'Predict' step of the machine learning workflow.
One thing I like about the above definition is that it explains why machine learning is hard. We need to build a model that defines the answer as a function of the example features. So far so good. The issue is that we must build a model that leads to good predictions on unforeseen data. If you think about it, this seems like an impossible task. How can we evaluate the quality of a model without looking at the data on which we will make predictions? Answering that question is what keeps machine learning researchers busy. The general idea is that we assume that unforeseen data is similar to the data we can see. If a model is good on the data we can see, then it should be good on unforeseen data. Of course, the devil is in the details, and relying blindly on the data we can see can lead to major issues known as overfitting. I'll come back to this later, and I recommend reading Florian Dahms' What is "overfitting"? in the meantime.
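One common way to approximate "unforeseen data" is to hide part of the data we can see during training and evaluate on it afterwards. The Python sketch below illustrates that idea under loud assumptions: the (surface, price) pairs are invented, the split is hand-picked rather than random, and the two models (a memorizer and a price-per-square-foot rule) are hypothetical stand-ins chosen to make overfitting visible.

```python
# Hypothetical (surface, price) pairs, invented for this sketch.
seen = [(1000, 105_000), (1500, 148_000), (2000, 205_000),
        (2500, 247_000), (3000, 304_000), (3500, 352_000)]

# Hide two examples from training; in practice the split is usually random.
held_out = [seen[1], seen[4]]
train = [example for example in seen if example not in held_out]

def fit_memorizer(examples):
    """Overfitting model: recalls training prices exactly, else guesses the mean."""
    table = dict(examples)
    fallback = sum(table.values()) / len(table)
    return lambda surface: table.get(surface, fallback)

def fit_price_per_sqft(examples):
    """Simple model: a single price per square foot."""
    rate = sum(price for _, price in examples) / sum(surface for surface, _ in examples)
    return lambda surface: rate * surface

def mean_abs_error(model, examples):
    return sum(abs(model(surface) - price) for surface, price in examples) / len(examples)

memorizer = fit_memorizer(train)
simple = fit_price_per_sqft(train)

print(mean_abs_error(memorizer, train), mean_abs_error(memorizer, held_out))  # 0.0 78000.0
print(mean_abs_error(simple, train), mean_abs_error(simple, held_out))        # 3500.0 2250.0
```

The memorizer looks perfect on the data it was trained on and fails badly on the hidden examples, while the simpler rule is slightly wrong everywhere but holds up on data it has not seen.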
A Simple Example
Let me explain the definition a bit. Data comes in as a table (a 2D matrix) with one example per row. Examples are described by features, with one feature per column. There is a special column which contains the 'correct answer' (the ground truth) for each example. The following is an example of such a data set, coming from past house sales:
Name      Surface   Rooms   Pool   Price
House1    2,000     4       0      270,000
House5    3,500     6       1      510,000
House12   1,500     4       0      240,000
There are 3 examples, each described by 4 features: a name, the surface, the number of rooms, and the presence of a pool. The target is the price, represented in the last column. The goal is to find a function that relates the price to the features, for instance:
price = 100 * surface + 20,000 * pool + 15,000 * num_room
Once we have that function, then we can use it with new data. For instance, when we get a new house, say house22 with 2,000 sq. feet, 3 rooms, and no pool, we can compute a price:
price(house22) = 100 * 2,000 + 20,000 * 0 + 15,000 * 3 = 245,000
Let's assume that house22 is sold at 255,000. Our predicted price is off by 10,000. This is the prediction error that we want to minimize. Another formula for price definition may lead to more accurate price predictions. The goal of machine learning is to find a price formula that leads to the most accurate predictions for future house sales.
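The arithmetic above can be reproduced with a short Python sketch. The function name predict_price and the tuple layout are illustrative choices; the coefficients, the table rows, and the house22 figures come from the text.

```python
# The three past sales from the table: (name, surface, rooms, pool, price).
sales = [
    ("House1", 2000, 4, 0, 270_000),
    ("House5", 3500, 6, 1, 510_000),
    ("House12", 1500, 4, 0, 240_000),
]

def predict_price(surface, rooms, pool):
    """The candidate formula from the text."""
    return 100 * surface + 20_000 * pool + 15_000 * rooms

# New house: 2,000 sq. feet, 3 rooms, no pool, later sold at 255,000.
predicted = predict_price(surface=2000, rooms=3, pool=0)
error = abs(255_000 - predicted)
print(predicted, error)  # 245000 10000

# The same formula applied to the houses we have already seen.
errors = [abs(price - predict_price(surface, rooms, pool))
          for _, surface, rooms, pool, price in sales]
print(errors)  # [10000, 50000, 30000]
```

A different set of coefficients would give different errors, and comparing such errors is how one candidate formula can be preferred over another.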
In practice, we will look for formulas that provide good predictions on the data we can see, i.e. the above table. I say formulas, but machine learning is not limited to formulas. Machine learning models can be much more complex. The point is that a machine learning model can be used to compute a target (here the price) from example features. The goal of machine learning is to find a model that leads to good predictions in the future.
Some of the definitions listed above are taken from Andrew Ng's Stanford machine learning course. I recommend this course (or the updated version available for free on Coursera) for those willing to dive deep into machine learning.
I found a more formal statement of my favorite definition in this presentation by Peter Prettenhofer and Gilles Louppe (if a reader knows when this definition was first used, then please let me know):