Machine Learning: Decision Tree

A thorough introduction to Decision Trees

Posted by Yingfan on April 20, 2022

Decision Trees can perform both classification and regression tasks, and even multioutput tasks.

Decision Trees don’t require feature scaling or centering.

Classification

Training and Visualization

import os

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# visualize the fitted tree with graphviz
from graphviz import Source
from sklearn.tree import export_graphviz

IMAGES_PATH = "."  # directory where the .dot file is written

export_graphviz(
    tree_clf,
    out_file=os.path.join(IMAGES_PATH, "iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Source.from_file(os.path.join(IMAGES_PATH, "iris_tree.dot"))

  • samples - how many training instances the node applies to
  • value - how many training instances of each class the node applies to
  • class - the class the node predicts
  • gini - the node’s impurity (defined below)

Gini impurity

  • This measures how pure your node is.
  • The Gini impurity of node $i$ is \(G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2\), where $p_{i,k}$ is the ratio of class k instances among the training instances in the ith node (see the sketch after this list).
  • A node is pure (gini = 0) if all the training instances it applies to belong to the same class, for example the left node at depth 1.
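As a quick sanity check of the formula, here is a minimal sketch that computes the Gini impurity from a node's class counts (the counts below are illustrative):

import numpy as np

def gini(class_counts):
    """Gini impurity G_i = 1 - sum_k p_{i,k}^2 for a single node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts / counts.sum()      # class ratios p_{i,k}
    return 1.0 - np.sum(p ** 2)

print(gini([50, 0, 0]))   # pure node -> 0.0 (like the depth-1 left node)
print(gini([0, 50, 50]))  # node split evenly between two classes -> 0.5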

Entropy

  • By default, the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to “entropy” (see the sketch after this list)
  • A set’s entropy is zero when it contains instances of only one class
  • Gini impurity is slightly faster to compute, so it is a good default
  • Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees
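For reference, the entropy of node $i$ is \(H_i = -\sum_{k,\, p_{i,k} \neq 0} p_{i,k}\log_2 p_{i,k}\). Switching the impurity measure is a one-line change; a minimal sketch reusing the iris petal features X, y loaded above:

from sklearn.tree import DecisionTreeClassifier

# same tree as before, but trained with the entropy impurity measure
tree_clf_entropy = DecisionTreeClassifier(max_depth=2, criterion="entropy",
                                          random_state=42)
tree_clf_entropy.fit(X, y)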

Estimating class probabilities

To estimate the probability that an instance belongs to class k, the tree first traverses down to the leaf node for that instance, and then returns the ratio of training instances of class k in that node.
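For example, using the tree_clf fitted above and an illustrative flower with petal length 5 cm and petal width 1.5 cm:

# per-class ratios in the leaf this instance falls into
print(tree_clf.predict_proba([[5, 1.5]]))
# predicted class = the class with the highest ratio in that leaf
print(tree_clf.predict([[5, 1.5]]))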

The CART Training algorithm

Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees (also called growing trees).

  • It first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$
  • The pair $(k, t_k)$ is chosen to minimize the cost function $J(k,t_k)=\frac{m_{left}}{m}G_{left}+\frac{m_{right}}{m}G_{right}$, where $m_{left/right}$ is the number of instances in the left/right subset and $G_{left/right}$ is its impurity (a toy version of this cost appears after this list)
  • Once the CART algorithm has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively
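A toy sketch of evaluating this cost for candidate thresholds on a single feature (not Scikit-Learn's actual implementation; it reuses X and y from the classification example above):

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def cart_cost(X_col, y, threshold):
    """Weighted Gini impurity J(k, t_k) for splitting one feature at `threshold`."""
    left = X_col <= threshold
    m, m_left, m_right = len(y), left.sum(), (~left).sum()
    if m_left == 0 or m_right == 0:
        return np.inf  # degenerate split, no information gained
    return m_left / m * gini(y[left]) + m_right / m * gini(y[~left])

# pick the best threshold for petal length (feature 0 of X above)
thresholds = np.unique(X[:, 0])
best_t = min(thresholds, key=lambda t: cart_cost(X[:, 0], y, t))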

Regularization

  • max_depth - the maximum depth of the tree
  • min_samples_split - the minimum number of samples a node must have before it can be split
  • min_samples_leaf - the minimum number of samples a leaf node must have
  • min_weight_fraction_leaf - same as min_samples_leaf but expressed as a fraction of the total number of weighted instances
  • max_leaf_nodes - the maximum number of leaf nodes
  • max_features - the maximum number of features that are evaluated for splitting at each node

Notice that increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model.
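A minimal sketch of the effect of min_samples_leaf, using the moons dataset purely as an illustrative example (it is not part of this post):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

# no restrictions: the tree grows until every leaf is pure and likely overfits
tree_unreg = DecisionTreeClassifier(random_state=42)
# regularized: every leaf must contain at least 4 training instances
tree_reg = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)

tree_unreg.fit(Xm, ym)
tree_reg.fit(Xm, ym)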

Regression

Visualization

The main difference is that instead of predicting a class in each node, it predicts a value.
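A minimal sketch of fitting such a tree. The noisy quadratic dataset below is an assumption (the post does not show how its training data was generated), but it illustrates the setup:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# noisy quadratic dataset (an assumption, not the post's original data)
np.random.seed(42)
m = 200
X_reg = np.random.rand(m, 1)
y_reg = 4 * (X_reg[:, 0] - 0.5) ** 2 + np.random.randn(m) / 10

tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X_reg, y_reg)

tree_reg.predict([[0.6]])  # mean target value of the leaf that x1 = 0.6 falls into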

For example, suppose you want to make a prediction for a new instance with x1 = 0.6.

You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value = 0.111.

This prediction is the average target value of the 110 training instances associated with this leaf node, and it results in a mean squared error equal to 0.015 over these 110 instances.

The fitted regression line is shown below.

CART algorithm

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

\[J(k,t_k)=\frac{m_{left}}{m}MSE_{left}+\frac{m_{right}}{m}MSE_{right}\]
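The same toy split-cost sketch as before, with MSE in place of Gini impurity (again, not Scikit-Learn's actual implementation):

import numpy as np

def mse(y):
    """MSE of a node that predicts the mean of its targets."""
    return np.mean((y - y.mean()) ** 2)

def cart_cost_reg(X_col, y, threshold):
    """Weighted MSE J(k, t_k) for splitting one feature at `threshold`."""
    left = X_col <= threshold
    m, m_left, m_right = len(y), left.sum(), (~left).sum()
    if m_left == 0 or m_right == 0:
        return np.inf
    return m_left / m * mse(y[left]) + m_right / m * mse(y[~left])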

Some characteristics

  • Decision Trees don’t require feature scaling or centering.
  • easy to overfit if left unconstrained
  • sensitive to training set rotation
    • because its decision boundaries are orthogonal (all splits are perpendicular to an axis)
  • very sensitive to small variations in the training data
  • Random Forests can limit this instability by averaging predictions over many trees
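A minimal sketch of that idea, using RandomForestClassifier on the same iris petal features (an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# averaging the predictions of many trees reduces the variance of a single tree
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(iris.data[:, 2:], iris.target)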