Decision trees and classification trees
Decision trees are often used to visualize the different choices to be made, the uncertainty under which they are made, and their outcomes. They are easy to visualize and to understand, even for a non-technical audience.
Let’s see this with a very easy example. You are running a car-selling business and you want your employees to bring each customer to the car they are most likely to buy.
Basically, the decision tree splits the customers according to some criteria to maximize the homogeneity of the groups in the final nodes. The assumption is that if the individuals have been split intelligently, the groups in the final nodes will behave similarly.
From decision trees to classification trees
In the example, the splits were done according to common sense. Now, we want to learn the splits from training data. The goal of a classification tree is to predict which class a given observation belongs to. We want the computer to answer these kinds of questions:
– Do Mr. Smith and his five children belong to the MPV buyers segment or the SUV one?
– Which digit between 0 and 9 is it?
– Was a woman in 3rd class more likely to survive the Titanic than a man in 1st class?
Let’s say we have four segments of customers: SUV, MPV, sedan and sports car buyers. The decision criterion chosen at each split will be the one minimizing impurity in the two resulting sets. Impurity can be computed using different criteria such as the Gini index or entropy. Hence the algorithm will group together people whose characteristics are shared with people from the same classes.
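To make the impurity idea concrete, here is a small illustrative helper in R (this is just a sketch of the Gini index for one node, not code from the rpart package):

```r
# Gini impurity of a node: 1 - sum of squared class proportions.
# 0 means the node is pure; higher values mean more mixed classes.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

gini(c("SUV", "SUV", "SUV"))        # pure node -> 0
gini(c("SUV", "MPV", "SUV", "MPV")) # 50/50 mix of two classes -> 0.5
```

A split is good when the weighted impurity of the two child nodes is much lower than the impurity of the parent node.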
Split after split, the tree will get deeper. To choose the right tree size, there are two ways:
• Growing the tree deep and then deleting the nodes that are not useful (pruning)
• Stopping the growth of the tree once a criterion has been reached
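Both strategies are available in rpart. As a sketch (using the kyphosis data set that ships with the package, not the car example), one can grow a deep tree with a loose complexity parameter and then prune it back using the cross-validated error reported in the complexity table:

```r
library(rpart)

# Grow a deliberately deep tree: small cp and minsplit let almost any
# split happen (kyphosis is a demo data set bundled with rpart)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = rpart.control(cp = 0.001, minsplit = 2))

# Prune back to the sub-tree whose cp minimizes the cross-validated
# error ("xerror") in the complexity table
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```

The alternative, stopping growth early, amounts to passing stricter values (larger cp, larger minsplit) to rpart.control in the first place.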
Once the tree has been trained, it can be used to predict the class of a new observation.
Predicting Titanic survival with a classification tree
Now, we will use R and the rpart package to predict whether a passenger would have survived the Titanic disaster or not. To do this, we use the Titanic data from Kaggle to train a tree. The input variables are:
• The gender of the passenger
• The Class of the passenger’s cabin
• The port where the passenger embarked
• The number of siblings and spouses aboard
• The age of the passenger
The learned tree is plotted below:
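The training call can be sketched as follows. The real analysis reads Kaggle’s train.csv; to keep the snippet self-contained, a tiny made-up stand-in data frame with the same column names is used here:

```r
library(rpart)

# Made-up stand-in for Kaggle's Titanic train.csv (illustration only)
titanic <- data.frame(
  Survived = factor(c(1, 1, 0, 0, 1, 0, 0, 1)),
  Sex      = c("female", "female", "male", "male",
               "female", "male", "male", "female"),
  Pclass   = c(1, 2, 3, 3, 3, 1, 2, 1),
  Age      = c(29, 35, 22, 40, 28, 54, 30, 19),
  SibSp    = c(0, 1, 0, 1, 1, 0, 0, 0),
  Embarked = c("S", "C", "S", "S", "Q", "S", "C", "C")
)

# method = "class" asks rpart for a classification (not regression) tree;
# the loose control settings are only needed because the toy data is tiny
fit <- rpart(Survived ~ Sex + Pclass + Embarked + SibSp + Age,
             data = titanic, method = "class",
             control = rpart.control(minsplit = 2, cp = 0.001))

# Predict the class of a new passenger
predict(fit,
        newdata = data.frame(Sex = "female", Pclass = 1, Age = 30,
                             SibSp = 0, Embarked = "C"),
        type = "class")
```

With the full Kaggle data, the same fit can be drawn with plot(fit); text(fit) or with the rpart.plot package to obtain the tree shown below.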
Let’s do some prediction:
• A woman in the 2nd or 1st class would have a 94% chance of surviving and is predicted to have survived. As we could expect, women are more likely to survive since they had priority access to the lifeboats.
• On the other hand, a female passenger in the third class, older than 28, who embarked in Southampton with at least one sibling, would only have a 30% chance of surviving. Hence the model predicts that this population would have perished in the disaster.
Why Classification trees are great
Classification trees are not the most accurate learning algorithm. State-of-the-art algorithms such as random forests, gradient boosting and neural networks tend to perform significantly better. However, classification trees have a huge advantage: they are very easy to understand and visualize. The algorithms mentioned above are black boxes for non-technical audiences, which can create reluctance to use them and to apply their recommendations.
Having a good algorithm is nice, having an algorithm that will be used is great and far more useful.