Handling Imbalanced Data

Handling imbalanced data is a common challenge in machine learning, especially in classification tasks where the classes are not represented equally.

Imbalanced training data can lead to biased models that perform poorly on the minority class.

Remember that early stopping in Keras monitors val_loss by default, but you can select val_accuracy instead, which is affected by imbalanced training and validation sets (as sketched below).
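A minimal sketch of configuring the monitored metric for Keras early stopping (the model, data, and parameter values are illustrative placeholders):

```python
from tensorflow import keras

# Stop training once the monitored metric stops improving.
# "val_loss" is the Keras default; "val_accuracy" can be misleading
# when the validation set is imbalanced.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",          # or "val_accuracy"
    patience=5,                  # epochs to wait without improvement
    restore_best_weights=True,   # roll back to the best epoch
)

# model.fit(X_train, y_train,
#           validation_data=(X_val, y_val),
#           epochs=100,
#           callbacks=[early_stop])
```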

Resampling methods

  • Oversampling: Increase the number of instances in the minority class by duplicating samples
  • Undersampling: Decrease the number of instances in the majority class by randomly removing samples. This can be risky as it may lead to loss of valuable information
  • Combining Over- and Under-sampling: Apply a combination of oversampling the minority class and undersampling the majority class to balance the dataset effectively (see the sketch below)
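A minimal sketch of random over- and under-sampling using the imbalanced-learn package (assumed to be installed alongside scikit-learn; the toy dataset is purely illustrative):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Illustrative imbalanced dataset: roughly 90% majority / 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:     ", Counter(y))

# Oversampling: duplicate minority-class samples until the classes are balanced
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print("oversampled:  ", Counter(y_over))

# Undersampling: randomly drop majority-class samples (risk of losing information)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled: ", Counter(y_under))
```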

Data-level techniques

  • Collect More Data: If possible, collect more data for the minority class to balance the dataset
  • Data Augmentation: Introduce variations into the existing minority class samples to create new instances, similar to oversampling
    • commonly used in machine learning and deep learning to artificially increase the size of a training dataset by applying various transformations to existing data samples
      • Example: image data augmentation might involve rotation, flipping, scaling, noise addition, and brightness/contrast adjustments of the original training images to produce new images (see the sketch below)
      • In cyber-security, data augmentation is often used for adversarial training, a technique to make machine learning models more robust to adversarial machine learning (AML) attacks
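A brief sketch of on-the-fly image augmentation with Keras preprocessing layers (the layer names are standard Keras, but the transformation strengths are arbitrary choices):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Random transformations applied to each training image on the fly,
# producing new variants of (e.g. minority-class) samples.
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal"),   # random horizontal flip
    layers.RandomRotation(0.1),        # rotate by up to ±10% of a full turn
    layers.RandomZoom(0.1),            # random zoom (scaling)
    layers.RandomContrast(0.2),        # random contrast adjustment
])

# Typically applied as the first layers of the model, or inside a tf.data pipeline:
# augmented_images = data_augmentation(images, training=True)
```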

Ensemble models

Create an ensemble of multiple models and combine their predictions (e.g., using Random Forest) to improve generalisation and handle class imbalance

  • Bagging techniques (random samples with replacement, such as Random Forest)
    • Random resampling: Some of the subsets may contain more minority class samples due to the randomness in sampling, allowing the model to give more attention to the minority class

In the aggregation step, bagging combines the predictions from multiple base learners

  • If some base learners correctly classify minority class instances, their predictions contribute positively to the final ensemble decision (see the sketch below)
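A short sketch of bagging with scikit-learn; each base learner is trained on a bootstrap sample and the ensemble aggregates their predictions (the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Illustrative imbalanced dataset
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Bagging: each base tree sees a different bootstrap sample (random sample with
# replacement), so some trees see proportionally more minority-class instances.
bagging = BaggingClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Random Forest = bagging of decision trees with extra per-split feature randomness
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

# Per-class precision/recall shows how well the minority class is handled
print(classification_report(y_test, forest.predict(X_test)))
```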

Algorithmic techniques

  • Change Algorithm: Some algorithms inherently handle imbalanced data better than others. For instance, ensemble methods like Random Forest and gradient boosting tend to perform well
  • Class Weighting: Assign different weights to the classes in the training data to make the learning algorithm pay more attention to the minority class
    • Class weighting is particularly useful in situations where simple oversampling or undersampling techniques are not sufficient to balance the class distribution
    • Many machine learning algorithms allow assigning different weights to classes. Adjusting class weights can penalise misclassifications in the minority class more heavily, making the model focus on it
    • Gradient Descent: During the training process, the gradients of the loss function are calculated to update the model parameters (weights). Class weighting ensures that gradients related to minority class instances have a more significant impact on the weight updates. For illustration, the batch gradient descent update for linear regression with the MSE loss (a class-weighting code sketch follows the equation):

$$\theta_j \leftarrow \theta_j - \eta \frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) \qquad \frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(\theta^\top \mathbf{x}^{(i)} - y^{(i)}\right) x_j^{(i)}$$
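A minimal class-weighting sketch with scikit-learn's logistic regression; class_weight="balanced" reweights each sample's contribution to the loss (and therefore to the gradients) inversely to its class frequency (the dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative imbalanced dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# "balanced" sets w_c = n_samples / (n_classes * n_c), so errors on the
# minority class contribute more to the loss and its gradients.
clf_balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit weights are also accepted, e.g. penalise class-1 errors five times more
clf_manual = LogisticRegression(class_weight={0: 1, 1: 5}, max_iter=1000).fit(X, y)
```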

Polynomial features

Polynomial features are a type of feature engineering technique used in machine learning to capture nonlinear relationships between variables.

Overfitting: Adding high-degree polynomial features can make the model overly complex, leading to overfitting, especially if the dataset is small.
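A minimal sketch of polynomial feature expansion with scikit-learn (the quadratic toy data and degree are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy quadratic data: y = 0.5*x^2 + x + 2 + noise
rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(scale=1.0, size=100)

# degree=2 adds x^2 as an extra feature so a linear model can capture the curve;
# a much higher degree would risk overfitting, especially on a small dataset.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X, y)
print(model.predict([[1.5]]))   # prediction for a new input value
```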

NOTE: See the previous examples related to linear and logistic regression

Bag-of-Words

The bag-of-words (BoW) technique can be used for feature pre-processing, especially to convert text data into numerical vectors that machine learning algorithms can understand

  • Tokenisation: Split the text into individual words or tokens. This step is essential to break down the text into meaningful units
  • Vocabulary Creation: Create a vocabulary of unique words from the entire corpus of text data. Each unique word in the vocabulary will become a feature in the BoW model
  • Vectorisation: For each document or text sample, create a numerical vector representing the frequency of words in the vocabulary. Each element in the vector corresponds to the count of the respective word in the document
  • Sparse Matrix Representation: The BoW representation results in a sparse matrix where most of the elements are zero because a typical text document only contains a subset of the entire vocabulary
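A short sketch of BoW vectorisation with scikit-learn's CountVectorizer (the corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one document
corpus = [
    "the quick brown fox",
    "the lazy dog",
    "the quick dog jumps over the lazy dog",
]

vectorizer = CountVectorizer()          # tokenises the text and builds the vocabulary
X = vectorizer.fit_transform(corpus)    # sparse matrix of word counts (documents x vocabulary)

print(vectorizer.get_feature_names_out())  # the vocabulary: one feature per unique word
print(X.toarray())                         # dense view: rows = documents, columns = word counts
```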