Ensemble Learning

Ensemble learning is a machine learning technique that combines the predictions of multiple models to produce a more accurate overall prediction

  • This is done by training multiple models on different subsets of the training data, or by using different machine learning algorithms ("base learners" or "weak learners")
  • The predictions of the individual models are then aggregated to produce the final prediction

Voting systems

Hard voting (majority)

Hard Voting, or Majority Voting (Classification): In hard voting, each base model in the ensemble independently makes a prediction, and the final prediction is determined by taking a majority vote. The class that receives the most votes becomes the ensemble’s predicted class
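
A minimal sketch of hard voting with scikit-learn's VotingClassifier (voting="hard") is shown below; the particular base models and the iris dataset are illustrative choices, not requirements of the method.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each base model casts one vote; the class with the most votes wins
hard_vote = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(random_state=42)),
    ],
    voting="hard",
)
hard_vote.fit(X_train, y_train)
print("Hard-voting accuracy:", hard_vote.score(X_test, y_test))
```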

Soft voting (averaging)

Classification: In soft voting, each base model provides a probability distribution over the possible classes for a given input. These distributions are averaged, and the final prediction is the class with the highest average probability. Example with 3 base models:

Class 1: (M1: 0.9 + M2: 0.7 + M3: 0.8) / 3 = 0.8

Class 2: (M1: 0.1 + M2: 0.3 + M3: 0.2) / 3 = 0.2
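
The arithmetic above can be reproduced in a few lines; the NumPy sketch below simply averages the three probability vectors from the example and picks the class with the highest average (scikit-learn's VotingClassifier with voting="soft" applies the same idea to each model's predict_proba output).

```python
import numpy as np

# Probabilities for [Class 1, Class 2] from the three base models in the example
p_m1 = np.array([0.9, 0.1])
p_m2 = np.array([0.7, 0.3])
p_m3 = np.array([0.8, 0.2])

avg = (p_m1 + p_m2 + p_m3) / 3         # -> [0.8, 0.2]
predicted = int(np.argmax(avg))        # index 0, i.e. Class 1
print("average probabilities:", avg, "-> predicted class index:", predicted)
```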

Regression: For regression tasks, soft voting involves averaging the predicted values from individual base models.

Weighted voting

In weighted voting, each base model is assigned a weight that reflects its relative importance or performance. It works for both classification and regression tasks (a short sketch follows the list):

  • Train multiple base models on the same training data
  • Assign a weight to each base model based on its performance on the validation data
  • For each new data point, have each base model predict the class label (or the value)
  • Classification: For each class, sum the weights of the base models that predicted that class, and assign the data point to the class with the highest total weight
  • Regression: Multiply each base model's predicted value by its weight, sum these weighted predictions, and divide by the sum of the weights to obtain the weighted average prediction
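
As a rough sketch of this procedure (the weights, class labels, and predicted values below are hypothetical):

```python
import numpy as np

# Hypothetical validation-based weights for three base models M1, M2, M3
weights = np.array([0.5, 0.3, 0.2])

# Classification: sum the weights of the models voting for each class
votes = ["cat", "dog", "cat"]                      # predictions from M1, M2, M3
totals = {c: weights[[i for i, v in enumerate(votes) if v == c]].sum()
          for c in set(votes)}
print(max(totals, key=totals.get))                 # "cat" (0.5 + 0.2 = 0.7 > 0.3)

# Regression: weighted average of the predicted values
preds = np.array([10.0, 12.0, 11.0])
print(np.average(preds, weights=weights))          # (0.5*10 + 0.3*12 + 0.2*11) / 1.0 = 10.8
```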

Bagging and pasting

Definition: Bagging (bootstrap aggregating) and pasting are two ensemble learning techniques that train multiple base models on different subsets of the training data and combine their predictions into a final prediction.

The main difference between the two techniques is how they create these subsets:

  • Bagging: Bagging creates subsets of the training data by randomly sampling with replacement. This means that a data point may appear more than once within the same subset, while others may be left out entirely
  • Pasting: Pasting creates subsets of the training data by randomly sampling without replacement. This means that each data point can appear at most once in a given subset, so subsets never contain duplicate instances (the two schemes are contrasted in the sketch below)
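
A minimal sketch of both schemes, assuming scikit-learn's BaggingClassifier as the ensemble wrapper: bootstrap=True draws samples with replacement (bagging), while bootstrap=False draws without replacement (pasting). The base estimator, dataset, and max_samples fraction are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: each tree sees a bootstrap sample (sampling with replacement)
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.8, bootstrap=True, random_state=42)

# Pasting: same idea, but each subset is drawn without replacement
pasting = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.8, bootstrap=False, random_state=42)

for name, model in [("bagging", bagging), ("pasting", pasting)]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```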

Robustness to noisy data

Both bagging and pasting can be used to improve the performance of machine learning models on a variety of tasks, such as classification, regression, and anomaly detection

  • They are particularly useful for tasks where the training data is limited or noisy
  • Ensemble methods often use different subsets of data or different algorithms for their base models
    • Each base model will learn a slightly different representation of the training data
    • By averaging or voting on the predictions of the base models, we can reduce overfitting

Ability to reduce overfitting

Bagging and pasting can reduce overfitting by training multiple base models on different subsets of the training data

  • These models can be different algorithms or the same algorithm trained on different subsets of data
  • Each model has its strengths and weaknesses and might overfit different parts of the data
  • By combining these diverse models, the ensemble captures a broader range of patterns in the data, reducing the risk of overfitting to any one specific pattern, as the short comparison below illustrates
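
As a rough illustration of this effect, the sketch below compares a single unconstrained decision tree with a bagged ensemble of such trees on a noisy synthetic dataset (the make_moons data and noise level are assumptions chosen for the example); the ensemble typically cross-validates better because averaging reduces the variance of the individual trees.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (noise level is an illustrative choice)
X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

single_tree = DecisionTreeClassifier(random_state=42)   # unconstrained tree, prone to overfitting
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                 n_estimators=100, random_state=42)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```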

Out-of-bag score

  • The out-of-bag (OOB) score is a performance metric for a machine learning model, specifically for bagged ensemble models such as random forests
  • It is calculated, for each base model, using the training samples that were not drawn into that model's bootstrap sample; these are called out-of-bag samples
  • Each tree of a random forest is trained on a different bootstrap sample of the data, so every tree has its own set of OOB samples
  • The OOB score is calculated by making predictions on each OOB sample with all the trees in the forest for which the sample is OOB. The OOB score is obtained with majority vote (classification) or averaging (regression)
  • The OOB score is a good estimate of the generalisation error of the random forest model, because the out-of-bag samples effectively act as a held-out validation set (see the sketch below)
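
A minimal sketch, assuming scikit-learn's RandomForestClassifier: passing oob_score=True makes the forest compute this estimate during fitting and expose it as the oob_score_ attribute. The dataset and hyperparameters are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True evaluates each training sample only with the trees that did not see it
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X, y)
print("OOB score:", forest.oob_score_)
```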