Decision Tree Learning: Pruning in C4.5 Algorithm

Pruning is a way to make a decision tree simpler so that it generalizes better to unseen data, giving more accurate predictions and a more readable model; in other words, it helps avoid overfitting. There are several pruning methods; the one used by the C4.5 algorithm is called pessimistic pruning and is based on a statistical correction. It uses only the training set to build the tree, so no validation or test set is needed to estimate errors while pruning. The pessimistic estimate of the error rate is the upper confidence limit (one-tailed), calculated with this formula:

$$e=\frac{f+\frac{z^2}{2N}+z \sqrt{\frac{f}{N}-\frac{f^2}{N}+\frac{z^2}{4N^2}}}{1+\frac{z^2}{N}}$$

where $N$ is the number of samples, $f=\frac{E}{N}$ is the observed error rate, $E$ is the number of errors, and $z$ is the z-score for the confidence level $c$. By default, C4.5 uses a 25% confidence level, so $z=0.67$ (you can look the z-score up in a standard normal table).
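To make the formula concrete, here is a small Python sketch of the estimate (not C4.5's actual code; the helper name pessimistic_error is just mine for illustration):

```python
import math

def pessimistic_error(E, N, z=0.67):
    """Upper confidence limit (one-tailed) on the error rate for a leaf that
    makes E errors on N training examples; z = 0.67 is C4.5's default 25%."""
    f = E / N  # observed error rate on the training set
    numerator = (f + z**2 / (2 * N)
                 + z * math.sqrt(f / N - f**2 / N + z**2 / (4 * N**2)))
    return numerator / (1 + z**2 / N)

# For example, 2 errors out of 6 examples gives roughly 0.47:
print(round(pessimistic_error(2, 6), 2))
```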

I will reuse the decision tree from this post to give an example of how pruning is conducted. The decision tree below shows the pruning process.

First, we start from the lowest-level node, health plan contribution, and compare the errors before and after pruning.

Node: Health Plan Contribution

Before Pruning

| Branch | $E$ | $N$ | $f$ | $e$ |
|--------|-----|-----|------|------|
| None   | 2   | 6   | 0.33 | 0.47 |
| Half   | 1   | 2   | 0.5  | 0.72 |
| Full   | 2   | 6   | 0.33 | 0.47 |

So instead of the 33%, 50%, and 33% error rates observed on each branch, we use the pessimistic estimates of 47%, 72%, and 47%, respectively. The final error before pruning is the weighted average of the estimates from the three branches (leaves):

$\text{Final error}=\frac{6}{14}\times 47\% + \frac{2}{14}\times 72\% + \frac{6}{14}\times 47\% \approx 51\%$
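As a quick sanity check, the same numbers can be reproduced with the pessimistic_error sketch from above (the exact decimals shift slightly depending on how z is rounded):

```python
# (E, N) for the none, half, and full branches of health plan contribution
branches = [(2, 6), (1, 2), (2, 6)]

estimates = [pessimistic_error(E, N) for E, N in branches]  # roughly 0.47, 0.71, 0.47
total = sum(N for _, N in branches)                         # 14 examples reach this node
before_pruning = sum(N / total * e for (_, N), e in zip(branches, estimates))
print(round(before_pruning, 2))                             # roughly 0.5, i.e. the ~51% above
```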

After Pruning

We now compute the estimated error if this node is replaced by a leaf with the majority class. The node covers 9 bad examples and 5 good examples, so the majority class is bad and such a leaf would misclassify the 5 good examples ($E=5$).

| Node | $E$ | $N$ | $f$ | $e$ |
|------|-----|-----|------|------|
| Health Plan Contribution | 5 | 14 | 0.36 | 0.46 |

You can see that the estimated error after pruning (0.46) is lower than before pruning (0.51), which means the pruning is worth it. So this node is replaced by a leaf with the majority class, which is bad.
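In code, the pruning decision for this node is just a comparison of the two estimates (reusing pessimistic_error and before_pruning from the sketches above):

```python
# A "bad" leaf over the node's 14 examples (9 bad, 5 good) makes 5 errors.
after_pruning = pessimistic_error(5, 14)   # roughly 0.45-0.46

if after_pruning <= before_pruning:
    print("Prune: replace the health plan contribution subtree with a 'bad' leaf")
```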

Node: Working Hours per Week

Before Pruning

| Branch | $E$ | $N$ | $f$ | $e$ |
|--------|-----|-----|------|------|
| <=36   | 1   | 2   | 0.5  | 0.72 |
| >36    | 5   | 14  | 0.36 | 0.46 |

$\text{Final error}=\frac{2}{16}\times 72\% + \frac{14}{16}\times 46\% \approx 49\%$

After Pruning

| Node | $E$ | $N$ | $f$ | $e$ |
|------|-----|-----|-------|------|
| Working Hours per Week | 6 | 16 | 0.375 | 0.46 |

Since the estimated error after pruning is again lower than before pruning, this node is also replaced by a leaf with the majority class, bad.
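The same comparison is applied bottom-up at every internal node. Below is a minimal sketch of that loop over a toy Node class of my own invention; C4.5's real implementation works on its own data structures and also considers operations such as subtree raising, which I ignore here. It reuses pessimistic_error from above.

```python
class Node:
    """Toy decision tree node: a leaf has children=None; an internal node has a
    list of child nodes. class_counts maps class label -> number of training
    examples of that class reaching the node."""
    def __init__(self, class_counts, children=None):
        self.class_counts = class_counts
        self.children = children

    @property
    def n(self):            # number of training examples reaching the node
        return sum(self.class_counts.values())

    @property
    def leaf_errors(self):  # errors made if the node became a majority-class leaf
        return self.n - max(self.class_counts.values())

def subtree_error(node, z=0.67):
    """Weighted pessimistic error over all leaves below node."""
    if node.children is None:
        return pessimistic_error(node.leaf_errors, node.n, z)
    return sum(child.n / node.n * subtree_error(child, z) for child in node.children)

def prune(node, z=0.67):
    """Bottom-up pessimistic pruning: collapse a subtree into a leaf whenever
    the leaf's estimated error is not worse than the subtree's."""
    if node.children is None:
        return node
    node.children = [prune(child, z) for child in node.children]
    if pessimistic_error(node.leaf_errors, node.n, z) <= subtree_error(node, z):
        node.children = None  # replace the subtree with a majority-class leaf
    return node
```

On the example above, this sketch should first collapse health plan contribution and then working hours per week into bad leaves, matching the two tables.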

Note

Although this method is fast and works reasonably well in practice, it often does not prune enough. We can improve the quality of the decision tree by tuning its parameter: the underlying confidence level (25% by default), which controls how strict the pruning is through the z-score. Lowering the confidence level makes z larger, which inflates the pessimistic estimates more strongly for small leaves than for their parent node; the error before pruning then more easily exceeds the error after pruning, and more subtrees get pruned. Raising the confidence level makes z smaller, so the estimates stay close to the raw training error, fewer subtrees are pruned, and the tree tends to overfit. Simply put, a confidence level that is too high gives an overfit tree, and one that is too low gives an underfit tree.
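To see how the confidence level maps to the z-score, you can query the standard normal quantile directly; a small sketch using scipy:

```python
from scipy.stats import norm

# One-tailed upper limit: z is chosen so that P(Z >= z) equals the confidence level c.
for c in (0.10, 0.25, 0.50):
    print(f"c = {c:.0%}  ->  z = {norm.ppf(1 - c):.2f}")
# c = 10%  ->  z = 1.28   (more pessimistic estimates, more pruning)
# c = 25%  ->  z = 0.67   (C4.5's default)
# c = 50%  ->  z = 0.00   (estimates equal the training error, almost no pruning)
```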

 

Reference:

Data Mining: Practical Machine Learning Tools and Techniques
