Dataset splitting
Dataset splitting involves dividing a given dataset into distinct subsets for the purpose of training, validating, and testing machine learning models. This process is essential for assessing the performance and generalisation ability of these models. The three main subsets typically created during dataset splitting are:
- Training Set: the largest portion of the dataset, used to train the machine learning model. This is where the model learns the underlying patterns and relationships within the data. A well-designed training set is crucial for building a robust and accurate model.
- Validation Set: used to fine-tune and optimise the model during the training process. It helps in selecting hyperparameters, such as learning rates or model architectures, and in monitoring the model's performance as it trains. This set provides an estimate of how well the model might perform on unseen data.
- Test Set: a completely separate portion of the dataset that the model has never seen during training or validation. It serves as an independent evaluation set used to assess the model's generalisation performance. The test set is crucial for determining how well the model is likely to perform in real-world applications.
Common practice
The dataset must be split into training, validation, and test sets.
- It is common to use 90% of the data for training and 10% for testing, with 10% of the training set held out for validation.
- More precisely: training set 81%, validation set 9%, test set 10%.
The proportions depend on the size of the dataset: in a dataset with 10 million samples, devoting 1% of the data (100,000 samples) to testing can be enough.
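As a minimal sketch of this split (using synthetic data for illustration, and assuming scikit-learn is available), the 81/9/10 proportions can be obtained with two consecutive calls to `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only: 1,000 samples, 5 features, binary labels.
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split off 10% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

# Then reserve 10% of the remaining training data for validation,
# yielding 81% training / 9% validation / 10% test overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.10, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 810 90 100
```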
Approaches to dataset splitting
The process of dataset splitting should be done carefully to ensure that the subsets are representative of the overall dataset and that they maintain the same distribution of data as the original dataset. Common techniques for splitting datasets include:
- Random sampling: data points are selected at random from the original dataset to populate each subset.
- Stratified sampling: ensures that each class or category is represented proportionally in each subset (illustrated in the sketch after this list).
- k-fold cross-validation: the dataset is divided into multiple subsets for iterative training and validation.
- Temporal sampling: the dataset is divided chronologically, ensuring that the test set is composed of the most recent samples.
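To make the difference between random and stratified sampling concrete, here is a small sketch (synthetic data, scikit-learn assumed) showing how the `stratify` argument of `train_test_split` preserves class proportions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels, for illustration: 90% class 0, 10% class 1.
X = np.random.rand(1000, 5)
y = np.array([0] * 900 + [1] * 100)

# Random sampling: class proportions in the test set may drift from 90/10.
_, _, _, y_test_rand = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified sampling: class proportions are preserved in the test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(np.bincount(y_test_rand))   # counts may deviate from (180, 20)
print(np.bincount(y_test_strat))  # exactly (180, 20)
```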
K-fold cross-validation
When data is limited, dividing the dataset into training and validation sets may exclude data points with useful information from the training procedure, causing the model to fail to learn the data distribution properly.
- One way to solve this problem is cross-validation. The following procedure is followed for each of the k "folds": a model is trained using k−1 of the folds as training data; the resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy).
- The performance of the model is computed as the average of the validation scores obtained on the k different splits.
- Drawback: the training time is multiplied by the number of folds k.
Illustration of 4-fold cross-validation on a dataset (each row is one iteration; in turn, one fold is held out as the test set while the remaining folds form the training set):

| Iteration | Fold 1 | Fold 2 | Fold 3 | Fold 4 |
|---|---|---|---|---|
| 1 | Test | Training | Training | Training |
| 2 | Training | Test | Training | Training |
| 3 | Training | Training | Test | Training |
| 4 | Training | Training | Training | Test |
The number of folds k is usually set between 5 and 10.
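A minimal sketch of k-fold cross-validation with scikit-learn (the synthetic data and logistic regression model are chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic classification data, for illustration only.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 5 folds: each fold serves once as the validation set,
# while the other 4 folds are used for training.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# The model's performance is the average of the k validation scores.
print(scores.mean(), scores.std())
```

Note that `cross_val_score` trains k separate models here, which is exactly where the k-fold increase in training time comes from.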
Stratified k-fold
Stratified k-fold ensures that the folds are made by preserving the percentage of samples of each class. Thus, if the training set is balanced with 50% benign samples and 50% DDoS samples, each of the folds will be balanced in the same way.
Stratified k-fold is the default setting in the Python libraries for cross-validation, e.g., in the scikit-learn method GridSearchCV (which applies stratified k-fold for classification tasks).
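As a sketch (synthetic balanced data, scikit-learn assumed), `StratifiedKFold` can be used to verify that every fold keeps the 50/50 class ratio described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Synthetic balanced data, for illustration: 50% of each class,
# mirroring the benign/DDoS example above.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.5, 0.5], flip_y=0, random_state=0
)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold preserves the 50/50 class ratio
    # (here, 40 samples of each class per fold).
    print(np.bincount(y[val_idx]))
```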