Pruning is a way to make a decision tree simpler so it makes more generalization of the data better, leading to more accurate prediction and more readable interpretation. By this way, we can avoid overfitting. There are some methods in pruning, one of them is what C4.5 algorithm does. It’s called **pessimistic pruning** and based on statistical correction. Moreover, it takes only training set to build the tree and doesn’t need validation set or test set to estimate errors while pruning. We can make pessimistic estimate of the error rate by using the upper confidence limit (one-tail), which calculated with this formula:

$$e=\frac{f+\frac{z^2}{2N}+z \sqrt{\frac{f}{N}-\frac{f^2}{N}+\frac{z^2}{4N^2}}}{1+\frac{z^2}{N}}$$

where $N$ is the number of samples, $f=\frac{E}{N}$, $E$ is number of errors, and $z$ is z-score for confidence interval $c$. As the default, C4.5 will use 25% confidence interval so that $z=0.67$ (you can get the z-score by looking at Z table).

I will recall a decision tree from this post to make example how pruning is conducted. DT below is the process of pruning.

First, we start from the lowest level of node, which is health plan contribution. We’re going to compare errors before and after pruning.

### Node: Health Plan Contribution

#### Before Pruning

Branch | $E$ | $N$ | $f$ | $e$ |

None | 2 | 6 | 0.33 | 0.47 |

Half | 1 | 2 | 0.5 | 0.72 |

Full | 2 | 6 | 0.33 | 0.47 |

So instead of 33%, 50%, and 33% error rates from each branch, we will use the pessimistic estimate of 47%, 72%, 47% respectively. The final error is weighted error from three branches or leaves.

$\text{Final error}=\frac{6}{14}\times 33\text{%} + \frac{2}{14}\times 50\text{%} + \frac{6}{14}\times 33\text{%}=51%$

#### After Pruning

We’ll compute estimated error if this node is replaced by a majority class. This node has 9 *bad* examples and 5 *good* examples.

Node | $E$ | $N$ | $f$ | $e$ |

Health Plan Contribution | 5 | 14 | 0.36 | 0.46 |

You can see the estimated error after pruned is lower than before pruned, it means that the pruning is worth it. So this node will be replaced by majority class, which is *bad*.

### Node: Working Hours per Week

#### Before Pruning

Branch | $E$ | $N$ | $f$ | $e$ |

<=36 | 1 | 2 | 0.5 | 0.72 |

>36 | 5 | 14 | 0.36 | 0.46 |

$\text{Final error}=\frac{2}{16}\times 72\text{%} + \frac{14}{16}\times 46\text{%}=50%$

#### After Pruning

Node | $E$ | $N$ | $f$ | $e$ |

Working Hours per Week | 6 | 16 | 0.375 | 0.48 |

Since the estimated error after pruned is lower than before pruned, the node is replaced by a majority class, *bad*.

### Note

Although this method gives better performance and quite fast, it often doesn’t prune enough. But we can improve the quality of decision tree by tuning parameter. In this case, we can change the underlying confidence interval (25%), into a more better number. This confidence interval also plays control on determining how strict our pruning would be. For example, increasing confidence interval will make z-score bigger and in turn the estimated error become more larger. As consequence, it’s harder to prune the tree (error before pruning >> error after pruning) and the decision tree will overfit. Simply put, too higher CI = overfit and too lower CI = underfit.

Reference:

Data Mining: Practical Machine Learning Tools and Techniques

boleh tau nama yang punya website.. soalnya buat daftar pustaka TA