This is the latest installment in our Step By Step series, where analysts and research assistants showcase how we use cutting-edge tools and resources to complete projects. ESI uses all of these programs and utilities to better serve our clients and solve problems.
Artificial intelligence (AI) is the science of programming computers to perceive their environment and make rational, cognitive decisions in order to achieve a goal. It is one of the most rapidly progressing and sought-after technologies in the world. It is, however, a rather general term. When most people talk about artificial intelligence, they are usually talking about machine learning. At its most basic, machine learning is a method of teaching computers to make predictions based on data. It is a branch of artificial intelligence developed in the 1950s, and the methods in use today aren’t fundamentally different from the methods invented decades ago. Why, then, all the interest and investment right now?
What distinguishes these past few years from what came before is the prevalence of data. Unlike in the past, people today have become collectors of data through smartphones, Fitbits, credit card purchases and the like. Many machine learning algorithms automatically improve based on this kind of data, as long as computers have the processing power to handle it (they do now). In his book, The Master Algorithm, Pedro Domingos offers a nice, simple way of understanding machine learning. He explains: “Every algorithm has an input and an output: the data goes into the computer, the algorithm does what it will with it, and out comes the result. Machine learning turns this around: in goes the data and the desired result and out comes the algorithm that turns one into the other.” Generally speaking, machine learning algorithms are trained to find statistical relationships in the data that allow them to make good guesses when presented with new examples.
Early AI programs typically excelled at just one thing. For example, the AI Deep Blue could play chess at a championship level, but that’s all it could do (you couldn’t have it play Atari, for example). Over time we realized that we needed programs that could solve many problems without needing to be rewritten. This blog will walk through this kind of machine learning in a very simple example and explain why its development is important. To begin, I will give you a problem that sounds easy but is effectively impossible to solve without machine learning: Is it possible to write code to tell the difference between an apple and an orange? Imagine you wrote a program that takes an image file as input, does some analysis, and outputs the type of fruit. How can you solve this? You would have to start by writing lots of manual rules. For example, you could write code to count how many orange pixels there are and compare that to the number of red ones. That ratio would give you a hint about the type of fruit.
# lots of code
That works fine on simple images. But as you dive deeper into the problem, you will find that the real world is messy, and the rules you write start to break. How would you write code to handle black and white photos or images with no apples or oranges in them at all? In fact, for just about any rule you write, I can find an image where it won’t work. You would need to write tons of rules, and that’s just to tell the difference between apples and oranges. If I gave you a new problem, you would need to start all over again. To solve this problem, we can create an algorithm that can figure out the rules for us, so we don’t have to write them by hand. And for that, we’re going to train what’s called a classifier. You can think of a classifier as a function. It takes some data as input and assigns a label to it as output. For example, I could have a picture and want to classify it as an apple or an orange. Or I could have an email and want to classify it as spam or not spam. The technique to write the classifier automatically is called supervised learning.
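To make the fragility of hand-written rules concrete, here is a hypothetical sketch (not from the original post) of the pixel-counting approach. The image is mocked up as a simple list of RGB tuples, and the thresholds are invented for illustration:

```python
# Hypothetical illustration of the manual-rule approach.
# An "image" here is just a list of (R, G, B) pixel tuples.

def classify_fruit(pixels):
    # Count reddish pixels (strong red, little green)
    red = sum(1 for (r, g, b) in pixels if r > 150 and g < 100)
    # Count orangish pixels (strong red, moderate green)
    orange = sum(1 for (r, g, b) in pixels if r > 150 and 100 <= g < 200)
    return "apple" if red > orange else "orange"

# Works on a clean, idealized image of a red apple...
apple_pixels = [(200, 30, 40)] * 90 + [(255, 255, 255)] * 10
print(classify_fruit(apple_pixels))  # "apple"

# ...but a black-and-white photo defeats the rule entirely:
grayscale_apple = [(120, 120, 120)] * 100
print(classify_fruit(grayscale_apple))  # "orange" -- wrong
```

Every new edge case (lighting, grayscale, occlusion) demands another hand-written rule, which is exactly the problem supervised learning avoids.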
To use supervised learning, we will follow a simple procedure with a few standard steps. The first step is collecting training data. These are essentially examples of the problem we want to solve. For our problem, we are going to write a function to classify a piece of fruit. For starters, it will take a description of the fruit and predict whether it’s an apple or orange based on features like its weight and texture.
To collect our training data, assume we head out to an orchard and collect some data. During our trip we look at different apples and oranges and write down measurements that describe them in a table. In machine learning, these measurements are called features. To keep things simple, here we’ve used just two – how much each fruit weighs in grams and its texture, which can be bumpy or smooth.
Each row in our training data is an example. It describes one piece of fruit. The last column is known as the label. It identifies what type of fruit is in each row, and in this case there are just two possibilities – apples or oranges. You can think of these as all the examples we want the classifier to learn from. The more training data you have, the better a classifier you can create.
To code this program up, we will work with scikit-learn, a free machine learning library for Python (before installing scikit-learn you need to have Python 3.6+ installed, which you can download from python.org). There are many ways to install scikit-learn, but the easiest way is to use Anaconda. This makes it easy to get all the dependencies set up and works well on Windows, macOS, or Linux.
Run a simple Python program and load libraries
To run the simple Python program, you will need to start a Python interactive shell, which can be accessed from your local computer (if you have Anaconda installed, you will want to use the Anaconda Prompt). The command you will need to use to enter the Python interactive console for your default version of Python is:

python
Set up training data
Before we set up our training data, we need to make sure the scikit-learn package is installed and available to import.
Now let’s write down our training data in code. We will use two variables – features and labels.
features = [[140, "smooth"], [130, "smooth"], [150, "bumpy"], [170, "bumpy"]]
labels = ["apple", "apple", "orange", "orange"]
In the above code, the features variable contains the first two columns, and labels contains the last. You can think of the features as inputs to the classifier and the labels as the output we want. Because scikit-learn works with numeric values rather than strings, we’re going to encode the texture feature as an integer – using 0 for bumpy and 1 for smooth.
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
We will do the same for our labels – using 0 for apple and 1 for orange.
labels = [0, 0, 1, 1]
Training the classifier
The next step involves using these example features to train a classifier. The type of classifier we will use is called a decision tree.
There are many different types of classifiers, but for the purposes of this tutorial and simplicity, you can think of a classifier as a box of rules. Regardless of the type of classifier, however, the input and output types are always the same. Before we use our classifier, we will need to import the decision tree into the environment.
from sklearn import tree
Then on the next line in our script, we will create the classifier.
clf = tree.DecisionTreeClassifier()
At this point, the classifier is just an empty box of rules since it doesn’t know anything about apples or oranges yet. To train it, we’ll need a learning algorithm. If you think of a classifier as a box of rules, you can think of a learning algorithm as the procedure that creates them. It does so by finding patterns in the training data. For example, it might notice oranges tend to weigh more, so it will create a rule saying that the heavier the fruit is, the more likely it is to be an orange. In scikit-learn, the training algorithm is included in the classifier object, and it’s called fit. You can think of fit as being a synonym for “find patterns in data”.
clf = clf.fit(features, labels)
At this point, we have a trained classifier. Let’s test it out and use it to classify a new fruit. The input to the classifier is the features for a new example. Let’s say the fruit we want to classify is 160 grams and bumpy.
print(clf.predict([[160, 0]]))
The output will be 0 if it’s an apple or 1 if it’s an orange. Before we hit enter and see what the classifier predicts, let’s think for a second. If you had to guess, what would you say the output should be? To figure that out, compare this fruit to our training data. It looks similar to an orange because it’s heavy and bumpy. When we hit enter, we can see that it’s what our classifier predicts as well. Recall that we used 0 for apple, and 1 for orange.
If everything worked for you, then you have completed your first machine learning project in Python. You can create a new classifier for a new problem just by changing the training data. That makes this approach far more reusable than writing new rules for each problem. We have covered much of the basic theory underlying supervised machine learning here, but of course we have only scratched the surface, and building real applications requires a much deeper understanding of the topics discussed. Fortunately, with the abundance of open source libraries and resources available today, programming with machine learning is becoming easier and more accessible every day. But to get it right, you will also have to understand a few important concepts, such as:
- How much training data do you need?
- How is the tree created?
- What makes a good feature?
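To illustrate the point about reusability, the same three lines of scikit-learn code can learn an entirely different problem just by swapping the training data. The link-count and exclamation-mark features below are made up purely for illustration:

```python
from sklearn import tree

# Hypothetical spam data: [number of links, number of exclamation marks]
features = [[8, 5], [7, 6], [0, 1], [1, 0]]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Same classifier, same fit call -- only the data changed
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# An email with many links and exclamation marks
print(clf.predict([[6, 4]]))  # [1], i.e. classified as spam
```

No rules about links or punctuation were ever written by hand; the learning algorithm inferred them from the examples.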
However you approach it, it is clear that machine learning is an incredibly powerful tool. It promises to help us address some of our most pressing problems, as well as open up whole new worlds of opportunity. That promise becomes more real every day as big data continues to grow.
Full source code:
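Putting all of the snippets above together, the complete script looks like this:

```python
from sklearn import tree

# Training data: [weight in grams, texture (1 = smooth, 0 = bumpy)]
features = [[140, 1], [130, 1], [150, 0], [170, 0]]
# Labels: 0 = apple, 1 = orange
labels = [0, 0, 1, 1]

# Create the decision tree classifier and train it on the examples
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# Classify a new fruit: 160 grams and bumpy
print(clf.predict([[160, 0]]))  # [1], i.e. an orange
```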
Carlos Bonilla is a Research Analyst at Econsult Solutions, Inc. with a focus in economic and environmental issues in urban areas. At ESI, Carlos applies a strong background in spatial analysis, cartography, and data visualization. Prior to joining the team, Carlos worked as a fellow for Azavea’s Summer of Maps, a program that pairs students with local and national non-profits to perform geospatial analysis.