How Do Machines Learn?
Machine learning has become one of the hottest buzzwords in business. You see it everywhere: everyone wants it, everyone is talking about it, and nearly every day a slew of new companies or concepts claims to use it to improve your life in some game-changing way. But how, exactly, do machines “learn”? What does it mean when we say that a machine is learning?
Put simply, machine learning is any mechanism by which computers make determinations without being explicitly programmed to produce them.
We do this when it would be too difficult or too constraining to code a solution directly. For example, if I simply want to know the distance between two points on the Earth’s surface, it is better to program the haversine formula (which calculates distance over a sphere, like the Earth) and compute the kilometers between two places. But if I want to know the likelihood that a given sentence contains a reference to a male name, I would not want to try to define every combination and variable needed to determine this. Instead, I want my machine to learn from examples and figure out the likely answer without explicit programming.
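The haversine case is exactly the kind of problem where hand-written code wins. A minimal sketch in Python (the function name `haversine_km` and the sample coordinates are my own illustration):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points on Earth, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Paris to London: roughly 340-345 km
print(haversine_km(48.8566, 2.3522, 51.5074, -0.1278))
```

A few lines of deterministic math, done. The name-classification problem has no such closed-form answer, which is why we turn to learning instead.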
Neural networks have become the hot topic in this arena, but they are only one of many methods of learning. Let’s begin with standard statistical learning, which still dominates much of the landscape. To keep things simple, I’ll refrain from using math, but it is important to know that traditional machine learning is rooted in computational statistics: it uses large quantities of related data to infer information about the whole.
Let’s consider an easy example: I want to know if a name is likely to be male or female.
First, we need to assemble some training data: the large set of examples we will use to teach the computer. In a supervised model (which is common), this training data must be annotated, meaning that every example in the set has a corresponding correct answer. A few example entries for such a training set are:
MICHAEL: male
STACEY: female
PAUL: male
DAVID: male
JENNI: female
Because these training sets need to be quite large, it is a good idea to get this data from open sources if they’re available. For example, the US Census could provide you with the most popular male and female names in the United States for a given time period (for free). If you can’t get any public data for your particular problem, however, you’ll need to curate your training set yourself, which can be one of the most time-consuming parts of machine learning.
Once you have your training set, you next need to define your features. Features are bits of information extracted from the data and fed to the learning algorithm. The best way to think about features is to ask: what about the data is likely to help make a determination? Feature engineering is another challenging element of machine learning, and one of the most critical steps in a supervised model (it is also one reason the industry is migrating away from supervised workflows, but more on that later).
For our name example, let’s consider just three features:
The last letter of the name.
The second to last letter of the name.
Whether or not the last letter is a vowel.
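These three features can be sketched as a small extraction function (a hypothetical helper of my own; note that, as in the examples that follow, Y is counted as a vowel):

```python
def name_features(name):
    """Extract the three features described above from an uppercase name."""
    vowels = set("AEIOUY")  # Y counted as a vowel, matching the examples below
    return {
        "last_letter": name[-1],
        "second_to_last": name[-2],
        "ends_in_vowel": name[-1] in vowels,
    }

print(name_features("MICHAEL"))
# {'last_letter': 'L', 'second_to_last': 'E', 'ends_in_vowel': False}
```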
These may seem silly, but when you look at just the five provided examples, you can see a pattern start to emerge almost immediately:
MICHAEL: L, E, false
STACEY: Y, E, true
PAUL: L, U, false
DAVID: D, I, false
JENNI: I, N, true
Do you see the pattern? You’ll notice that the three male names all have consonants for the last letter, and both female names end in vowels (counting Y as a vowel). So now you may ask: why don’t we just write a function that determines gender based solely on whether the last letter is a vowel? Because we are only looking at five examples. What about, say, Karen? That is a female name that ends in a consonant, so it would fail our simple heuristic. However, because we are also tracking other features (the last letter and the second-to-last letter), we can uncover other patterns. For example, how likely is it that male names end in L or D? Is that statistically significant? What about names with D as the second-to-last letter that end in a vowel? What about names whose last two letters are E and Y? And so on…
By creating a very large training set and running these features through a learning algorithm, we can teach our machines to detect and remember statistically significant patterns in the data. Then, when the model sees a new piece of data without any annotation, it can make a “guess” at the answer. This guess is usually accompanied by a confidence score, a number between 0 and 1 that represents how confident the model is in its guess.
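One simple way to turn such features into a learner is a naive Bayes classifier, sketched below. The tiny training set, the helper names, and the test name are invented for illustration; a real system would use a far larger training set and a library such as scikit-learn:

```python
from collections import Counter, defaultdict

# Toy annotated training set (illustrative only)
TRAIN = [
    ("MICHAEL", "male"), ("PAUL", "male"), ("DAVID", "male"),
    ("STACEY", "female"), ("JENNI", "female"), ("KAREN", "female"),
]

VOWELS = set("AEIOUY")  # treating Y as a vowel, as in the examples above

def features(name):
    return (name[-1], name[-2], name[-1] in VOWELS)

def train(data):
    label_counts = Counter(label for _, label in data)
    feat_counts = defaultdict(Counter)  # (feature index, value) -> label counts
    for name, label in data:
        for i, value in enumerate(features(name)):
            feat_counts[(i, value)][label] += 1
    return label_counts, feat_counts

def classify(name, label_counts, feat_counts):
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        p = count / total  # prior probability of the label
        for i, value in enumerate(features(name)):
            # Laplace smoothing so unseen feature values don't zero out a label
            p *= (feat_counts[(i, value)][label] + 1) / (count + 2)
        scores[label] = p
    best = max(scores, key=scores.get)
    # Normalize the winning score into a 0-1 confidence value
    return best, scores[best] / sum(scores.values())

model = train(TRAIN)
print(classify("DANIEL", *model))  # guesses "male", with a confidence score
```

Even with six examples, the L ending and the E second-to-last letter push DANIEL toward the male label; the confidence score is just the winning label’s share of the total probability mass.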
This is a greatly simplified example, but the same line of thinking applies to very large problems, such as self-driving cars. By feeding sample data into a learning algorithm, we can train a car to learn the likely best response to new, unrecognized situations that share characteristics with known ones. Arguably, with the proper training set, you could teach a computer to predict and behave properly in almost any condition, for almost any problem. This is how machine learning will revolutionize our lives: we no longer need to craft concrete solutions for every problem. We just need to observe the problem, document it, collect data, and let the machine figure out the details for us.