Decision Tree
Decision trees (DTs) are powerful ML algorithms that can perform both classification and regression tasks. They are also easy to interpret, which is why they are often called white box models.
The decision tree starts with a root node, which represents the entire dataset. It then recursively splits the data based on certain features to create branches or child nodes. Each split is determined by selecting the feature that best separates or classifies the data, typically using metrics like information gain or Gini impurity.
The process continues until a certain stopping criterion is met. This criterion could be reaching a maximum depth, having a minimum number of samples in each leaf node, or other conditions.
The leaf nodes of the tree represent the final predictions or outcomes.
Node inspection
- first row: the splitting criterion
- samples: how many training samples the node applies to
- value: how many training samples of each class the node applies to
- class: for a leaf node it is the prediction; for the root and intermediate nodes, it is the most common label
- gini: a measure of impurity (gini = 0 means the node is pure), computed as $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$, where $p_{i,k}$ is the ratio of class-k samples among the training instances of the i-th node. Example: the Gini impurity of the depth-1 left node is: $$1 - \left[\left(\tfrac{32}{1855}\right)^2 + \left(\tfrac{1823}{1855}\right)^2\right] \approx 0.0339$$
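The fields listed above can also be read programmatically from a trained scikit-learn tree. Below is a minimal sketch, using the Iris dataset purely as a stand-in for the course dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Text view of the tree: internal nodes show the splitting criterion,
# leaves show the predicted class.
print(export_text(clf, feature_names=load_iris().feature_names))

# The same per-node information (samples, value, gini) via clf.tree_:
tree = clf.tree_
for node_id in range(tree.node_count):
    print(
        f"node {node_id}: samples={tree.n_node_samples[node_id]}, "
        f"value={tree.value[node_id]}, gini={tree.impurity[node_id]:.4f}"
    )
```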
The CART algorithm
Scikit-Learn uses the CART algorithm to train DTs.
CART splits the training set of each node using a single feature k and a threshold $t_k$. The algorithm chooses the pair $(k, t_k)$ that produces the purest subsets, i.e. the lowest weighted Gini impurity.
CART cost function for classification:
$$J(k, t_k) = \frac{m_{left}}{m} \cdot G_{left} + \frac{m_{right}}{m} \cdot G_{right}$$
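To make the cost function concrete, here is a small sketch that evaluates $J(k, t_k)$ for one hypothetical candidate split; the per-class counts are made up for illustration (the left child reuses the 32/1823 counts from the Gini example above).

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical split: left child has [32, 1823] samples per class,
# right child has [480, 15] samples per class.
left, right = [32, 1823], [480, 15]
m_left, m_right = sum(left), sum(right)
m = m_left + m_right

# Weighted-impurity cost J(k, t_k) of this candidate split
J = (m_left / m) * gini(left) + (m_right / m) * gini(right)
print(f"G_left={gini(left):.4f}, G_right={gini(right):.4f}, J={J:.4f}")
```

CART evaluates this cost for every candidate feature/threshold pair and keeps the one with the lowest J.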
Regularisation methods
CART recursively splits each subset until the impurity cannot be reduced with further splits or one of the following conditions is met:
- max_depth: max number of layers of the tree
- min_samples_split: minimum number of samples required to split an internal node in a decision tree
- example: if min_samples_split is set to 50, then a node must contain at least 50 samples to be considered for splitting; nodes with fewer than 50 samples automatically become leaves
- min_samples_leaf: minimum number of samples required to be at a leaf node
- example: if min_samples_leaf is set to 20, then every leaf node in the tree must contain at least 20 samples. Even if a node has enough samples to satisfy min_samples_split, it will not be split if the split would produce a child node with fewer than 20 samples (see the sketch after this list)
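A minimal sketch of how these hyperparameters are set in scikit-learn; the synthetic dataset and the parameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unregularised = DecisionTreeClassifier(random_state=0)
regularised = DecisionTreeClassifier(
    max_depth=5,            # at most 5 layers of splits
    min_samples_split=50,   # a node needs >= 50 samples to be split
    min_samples_leaf=20,    # every leaf keeps >= 20 samples
    random_state=0,
)

for name, clf in [("no regularisation", unregularised),
                  ("regularised", regularised)]:
    clf.fit(X_train, y_train)
    print(f"{name}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```

The unregularised tree typically reaches near-perfect training accuracy while the regularised one trades a little training accuracy for better generalisation.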
Limitations of DTs:
- Sensitive to training set rotation
- Sensitive to small variations of the training set (adding or removing training samples might result in totally different splits)
Random Forest can limit DT instability by averaging the prediction of many DTs.
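A quick sketch of this idea, comparing a single tree to a Random Forest on the same (illustrative) data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# The forest averages the predictions of 100 trees trained on
# bootstrapped samples, which smooths out single-tree instability.
print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```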
DTs for regression
A Decision Tree for regression is a supervised machine learning algorithm used to predict continuous numerical values.
As an example, let’s consider an anomaly detection system that predicts the average size of incoming network packets.
The CART algorithm works as before, but it now tries to split the training set in a way that minimises the MSE instead of the Gini impurity.
Making predictions
Once the tree is trained, each leaf node represents a partitioned subset of the data. The prediction for a new data point falling into that leaf node is typically the mean of the target variable within that subset.
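A minimal regression-tree sketch for the packet-size example; the feature names and synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical features: [flow duration (s), packets per second]
X = rng.uniform(0, 10, size=(500, 2))
# Hypothetical target: average packet size in bytes, with noise
y = 200 + 80 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 20, size=500)

# Default criterion is the squared error (MSE)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The prediction for a new flow is the mean target value
# of the leaf node the flow falls into.
print(reg.predict([[5.0, 2.0]]))
```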
Considerations on DTs
Decision trees are easy to understand and interpret, as the resulting tree structure resembles a flowchart with clear decision rules. The learned rules are also easy to transform into IDS rules.
However, decision trees are prone to overfitting: with no regularisation (default parameters), they may become too complex and fit the training data too closely, resulting in poor generalisation to unseen data.
Let’s take a look at the accuracy metrics of our multi-class decision tree!