Decision Tree
Decision trees (DTs) are powerful ML algorithms that can perform both classification and regression tasks. They are also easy to interpret, which is why they are often called white box models.
The decision tree starts with a root node, which represents the entire dataset. It then recursively splits the data based on certain features to create branches or child nodes. Each split is determined by selecting the feature that best separates or classifies the data, typically using metrics like information gain or Gini impurity.
The process continues until a certain stopping criterion is met. This criterion could be reaching a maximum depth, having a minimum number of samples in each leaf node, or other conditions.
The leaf nodes of the tree represent the final predictions or outcomes.
Node inspection
- first row: the splitting criterion
- samples: how many training samples the node applies to
- value: how many training samples of each class the node applies to
- class: for a leaf node it is the prediction; for the root and intermediate nodes, it is the most common label
- gini: a measure of impurity (gini = 0 means the node is pure), computed as $G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$, where $p_{i,k}$ is the ratio of class-k samples among the training instances of the i-th node. Example: the Gini impurity of the depth-1 left node is: $$1 - \left[\left(\tfrac{32}{1855}\right)^2 + \left(\tfrac{1823}{1855}\right)^2\right] \approx 0.0339$$
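The fields listed above can also be read programmatically from a trained scikit-learn tree. Below is a minimal sketch, using the Iris dataset purely as a stand-in for the course dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# Text view of the tree: internal nodes show the splitting criterion,
# leaves show the predicted class.
print(export_text(clf, feature_names=load_iris().feature_names))

# The same per-node information (samples, value, gini) via clf.tree_:
tree = clf.tree_
for node_id in range(tree.node_count):
    print(
        f"node {node_id}: samples={tree.n_node_samples[node_id]}, "
        f"value={tree.value[node_id]}, gini={tree.impurity[node_id]:.4f}"
    )
```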
The CART algorithm
Scikit-Learn uses the CART algorithm to train DTs.
CART splits the training set of each node using a single feature k and a threshold $t_k$. The algorithm chooses the pair $(k, t_k)$ that produces the purest subsets, i.e. the lowest weighted Gini impurity.
CART cost function for classification:
$$J(k, t_k) = \frac{m_{left}}{m} \cdot G_{left} + \frac{m_{right}}{m} \cdot G_{right}$$
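To make the cost function concrete, here is a small sketch that evaluates $J(k, t_k)$ for one hypothetical candidate split; the per-class counts are made up for illustration (the left child reuses the 32/1823 counts from the Gini example above).

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Hypothetical split: left child has [32, 1823] samples per class,
# right child has [480, 15] samples per class.
left, right = [32, 1823], [480, 15]
m_left, m_right = sum(left), sum(right)
m = m_left + m_right

# Weighted-impurity cost J(k, t_k) of this candidate split
J = (m_left / m) * gini(left) + (m_right / m) * gini(right)
print(f"G_left={gini(left):.4f}, G_right={gini(right):.4f}, J={J:.4f}")
```

CART evaluates this cost for every candidate feature/threshold pair and keeps the one with the lowest J.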
Regularisation methods
CART recursively splits each subset until the impurity cannot be reduced with further splits or one of the following conditions is met:
- max_depth: max number of layers of the tree
- min_samples_split: minimum number of samples required to split an internal node in a decision tree
- example: if min_samples_split is set to 50, then a node must contain at least 50 samples to be considered for splitting; nodes with fewer than 50 samples automatically become leaves
- min_samples_leaf: minimum number of samples required to be at a leaf node
- example: if min_samples_leaf is set to 20, then every leaf node in the tree must contain at least 20 samples. Even if a node has enough samples to satisfy min_samples_split, it will not be split if the split would produce a child node with fewer than 20 samples (see the sketch after this list)
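A minimal sketch of how these hyperparameters are set in scikit-learn; the synthetic dataset and the parameter values are illustrative, not tuned:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unregularised = DecisionTreeClassifier(random_state=0)
regularised = DecisionTreeClassifier(
    max_depth=5,            # at most 5 layers of splits
    min_samples_split=50,   # a node needs >= 50 samples to be split
    min_samples_leaf=20,    # every leaf keeps >= 20 samples
    random_state=0,
)

for name, clf in [("no regularisation", unregularised),
                  ("regularised", regularised)]:
    clf.fit(X_train, y_train)
    print(f"{name}: train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")
```

The unregularised tree typically reaches near-perfect training accuracy while the regularised one trades a little training accuracy for better generalisation.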
Limitations of DTs:
- Sensitive to training set rotation
- Sensitive to small variations of the training set (adding or removing training samples might result in totally different splits)
Random Forest can limit DT instability by averaging the prediction of many DTs.
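A quick sketch of this idea, comparing a single tree to a Random Forest on the same (illustrative) data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# The forest averages the predictions of 100 trees trained on
# bootstrapped samples, which smooths out single-tree instability.
print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```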
DTs for regression
A Decision Tree for regression is a supervised machine learning algorithm used to predict continuous numerical values.
As an example, let’s consider an anomaly detection system that predicts the average size of incoming network packets.
The CART algorithm works as before, but it now tries to split the training set in a way that minimises the MSE instead of the Gini impurity.
Making predictions
Once the tree is trained, each leaf node represents a partitioned subset of the data. The prediction for a new data point falling into that leaf node is typically the mean of the target variable within that subset.
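A minimal regression-tree sketch for the packet-size example; the feature names and synthetic data below are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Hypothetical features: [flow duration (s), packets per second]
X = rng.uniform(0, 10, size=(500, 2))
# Hypothetical target: average packet size in bytes, with noise
y = 200 + 80 * X[:, 0] - 30 * X[:, 1] + rng.normal(0, 20, size=500)

# Default criterion is the squared error (MSE)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The prediction for a new flow is the mean target value
# of the leaf node the flow falls into.
print(reg.predict([[5.0, 2.0]]))
```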
Considerations on DTs
Decision trees are easy to understand and interpret, as the resulting tree structure resembles a flowchart with clear decision rules. The learned rules are also easy to transform into IDS rules.
However, decision trees are prone to overfitting: with no regularisation (default parameters), they may become too complex and fit the training data too closely, resulting in poor generalisation to unseen data.
Let’s take a look at the accuracy metrics of our multi-class decision tree!