Training NN

Hyperparameter tuning

Hyperparameter tuning is the process of systematically adjusting the hyperparameters of a machine learning model to optimize its performance on a specific task.

Hyperparameters are values set before training the model and are not learned from the data during the training process.

Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, the batch size, etc.

Goals of hyperparameter tuning

  • The goal of hyperparameter tuning is to find the set of hyperparameters that result in the best performance of the model on the task of interest
  • This is typically done by training multiple models with different combinations of hyperparameters and evaluating their performance using a validation set
  • The set of hyperparameters that results in the highest performance (e.g., accuracy, F1 Score, Recall, etc.) is then used to train the final model

Formulation

Hyperparameter optimization can be represented in equation form as:

$$\bar{\lambda} = \operatorname{argmax}_{\lambda \in \Lambda} \, f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$$

where $\Lambda$ is the space of hyperparameter combinations and $f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$ measures the performance of the machine learning system with the combination of hyperparameters $\lambda$

The performance is usually defined as the accuracy but any other metric can also be used (e.g., F1 Score, Recall, Precision, etc.)

  • The optimisation algorithms try to find the combination of hyperparameters $\bar{\lambda}$ that maximises $f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$
  • In the case of regression problems, the solution is the set of hyperparameters that minimises the RMSE between the predicted values and the values in the validation set:

$$\bar{\lambda} = \operatorname{argmin}_{\lambda \in \Lambda} \, f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$$

Architectural Hyperparameters

  • Number of Layers: Determines the depth of the neural network
  • Number of Neurons (Units) in Each Layer: Specifies the number of nodes in each layer
  • Network Topology: Specifies how neurons are interconnected (e.g., fully connected, convolutional layers in CNNs, recurrent connections in RNNs)
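
As a minimal sketch (in Keras, consistent with the code later in this slideset), the depth and width of the network can be exposed as arguments of a model-building function so they can be tuned like any other hyperparameter; the helper name and default values below are illustrative only:

import tensorflow as tf

def build_model(n_hidden_layers=4, n_units=64):
    """Illustrative helper: depth and width are passed in as hyperparameters."""
    model = tf.keras.Sequential()
    for _ in range(n_hidden_layers):                                   # number of layers
        model.add(tf.keras.layers.Dense(n_units, activation='relu'))   # neurons per layer
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))          # binary output
    return model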

Activation function: usually the ReLU activation function works well for the hidden layers. Other options for hidden layers are Leaky ReLU, ELU, among others

  • Leaky ReLU and ELU try to avoid the problem of dying neurons. ELU converges faster but is slower than ReLU at test time (the exponential).
  • Leaky ReLU is a generalisation of ReLU, with the slope coefficient α applied to negative inputs as a hyperparameter.
  • α is also a hyperparameter for ELU.

For the output layers, the activation function depends on the problem (e.g., sigmoid for binary problems, softmax for multi-class problems)
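
A minimal Keras sketch of these choices (the layer sizes are arbitrary examples):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),    # ReLU: usual default for hidden layers
    tf.keras.layers.Dense(64, activation='elu'),     # ELU alternative
    tf.keras.layers.Dense(64),                       # linear layer followed by...
    tf.keras.layers.LeakyReLU(),                     # ...a Leaky ReLU activation layer
    tf.keras.layers.Dense(1, activation='sigmoid')   # sigmoid output for a binary problem
    # for a multi-class problem: tf.keras.layers.Dense(n_classes, activation='softmax')
])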

  • Learning Rate: Determines the step size during optimisation. Too high a learning rate can cause overshooting, and too low a learning rate can lead to slow convergence or getting stuck in local minima
  • Batch Size: Number of training examples used in one iteration of training. Larger batch sizes can speed up training, but very large batches may lead to memory issues
  • Epochs: Number of times the entire dataset is passed forward and backward through the neural network during training
  • Optimizer: Specifies the optimization algorithm used (e.g., Adam, SGD, etc.)
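
These hyperparameters map directly onto the Keras compile/fit calls; a sketch, assuming model, X_train, y_train, X_val and y_val are already defined:

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # optimizer and learning rate
    loss='binary_crossentropy',
    metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=1024,   # batch size
                    epochs=50)         # number of epochs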

Regularization Hyperparameters

  • Weight Decay: Penalty term added to the loss function to prevent large weights (e.g., L1 and L2 norms)
  • Dropout Rate: Regularization technique where random neurons are dropped during training to prevent overfitting
  • Early Stopping: Training is stopped when the validation performance starts to degrade, preventing overfitting

More details about dropout will follow when we discuss Regularisation Techniques


Problem-specific hyperparameters

  • Maximum number of packets/flow: the input layer has a fixed shape that you must set
    • This shape is determined by the flow representation of choice
    • In the case of packet-based representation, the maximum number of packets/flow must be set in advance
  • Time window: in real-world application scenarios, the network traffic is collected for a certain time window and then sent to the IDS for analysis
    • The size of the time window must also be set, as it determines how the traffic flows are collected
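
A sketch of how the maximum number of packets/flow fixes the input shape, assuming each packet is described by a feature vector (the feature count and names below are hypothetical):

import numpy as np

MAX_PACKETS = 10   # maximum number of packets/flow (hyperparameter)
N_FEATURES = 11    # hypothetical number of features extracted per packet

def to_fixed_shape(flow):
    """Pad with zeros or truncate a flow (list of per-packet feature vectors)
    so that every sample has shape (MAX_PACKETS, N_FEATURES)."""
    sample = np.zeros((MAX_PACKETS, N_FEATURES), dtype=np.float32)
    n = min(len(flow), MAX_PACKETS)
    sample[:n] = np.asarray(flow[:n], dtype=np.float32)
    return sample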


Hyperparameter tuning techniques

  • Manual tuning: In this approach, the data scientist adjusts the hyperparameters based on their understanding of the model and the problem at hand
    • While manual tuning can be slow and tedious, it allows for a better understanding of how the hyperparameters affect the model.
  • Automated tuning: This approach uses algorithms to search for the best hyperparameter values more efficiently
    • One popular method is grid search, where a grid of hyperparameter values is defined, and the model is trained and evaluated for each combination of values
    • Other advanced techniques, such as random search, Bayesian optimization, and genetic algorithms, can also be used for automated hyperparameter tuning

Grid search

  • Given a set of hyperparameters and a list of possible choices for each of them (defined by the user), a grid search algorithm will train a model multiple times, each time with a different combination of hyperparameters. E.g.,
  • LR = [0.1, 0.01, 0.001], BATCH_SIZE = [1024, 2048], HIDDEN_LAYERS = [1, 2, 4, 8, 16], PACKETS_PER_FLOW = [1, 2, 3, 4, 5, 10, 20, 50, 100]
  • The algorithm will train the model 3 × 2 × 5 × 9 = 270 times
  • The best combination is the one that produces the highest score on the validation set
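
A sketch of the grid search loop for the example above; train_and_evaluate is a hypothetical helper that trains the model with the given hyperparameters and returns its score on the validation set:

import itertools

LR = [0.1, 0.01, 0.001]
BATCH_SIZE = [1024, 2048]
HIDDEN_LAYERS = [1, 2, 4, 8, 16]
PACKETS_PER_FLOW = [1, 2, 3, 4, 5, 10, 20, 50, 100]

best_score, best_params = -1.0, None
for lr, bs, layers, pkts in itertools.product(LR, BATCH_SIZE, HIDDEN_LAYERS, PACKETS_PER_FLOW):
    score = train_and_evaluate(lr, bs, layers, pkts)  # hypothetical helper (validation score)
    if score > best_score:
        best_score, best_params = score, (lr, bs, layers, pkts)
print(best_params, best_score)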

Random search

Source: Bergstra, J., & Bengio, Y. (2012). Random search for hyperparameter optimization. Journal of machine learning research, 13(2).

  • Grid search is a good approach when the number of combinations of hyperparameters is limited and the model size permits fast training
  • In random search, instead of trying every combination of hyperparameters, random combinations of hyperparameters are sampled from a specified distribution
  • With random search, one can set the maximum number of combinations to test, so as to control the total training time
  • As with grid search, the best combination (among those randomly selected) is the one that produces the highest score on the validation set
  • Random search is a stochastic method, meaning that the results will vary each time it is run
    • To obtain reliable results, it is usually recommended to run the search multiple times and average the results
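
A sketch of random search under the same assumptions (train_and_evaluate is the same hypothetical helper; the distributions and bounds are only examples):

import random

MAX_TRIALS = 50  # upper bound on the number of combinations, i.e., on the total training time

best_score, best_params = -1.0, None
for _ in range(MAX_TRIALS):
    lr = 10 ** random.uniform(-4, -1)                        # log-uniform in [1e-4, 1e-1]
    bs = random.choice([1024, 2048])                         # discrete choice
    layers = random.randint(1, 16)                           # integer-uniform in [1, 16]
    pkts = random.choice([1, 2, 3, 4, 5, 10, 20, 50, 100])
    score = train_and_evaluate(lr, bs, layers, pkts)         # validation score
    if score > best_score:
        best_score, best_params = score, (lr, bs, layers, pkts)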

Randomized values

Random search samples a random value for each hyperparameter from a statistical distribution defined for that hyperparameter

Automated tuning steps

  • Identify the hyperparameters: Start by identifying the hyperparameters of your machine learning model that need to be optimized
  • Define the range of values: For each hyperparameter, define the range of possible values that you want to consider. This can be done based on prior knowledge or through experimentation
  • Choose a distribution: For each hyperparameter, choose a distribution to sample from (e.g., uniform, Gaussian, etc.), or the list of values in the case of a grid search
  • Set the number of trials (random search): Specify the number of random trials to run during the search. This will determine the number of combinations of hyperparameters that will be sampled
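
These steps can be summarised as a search-space specification; a sketch using scipy.stats distributions (the ranges, distributions and number of trials below are only examples):

import random
from scipy.stats import loguniform, randint

search_space = {
    'learning_rate': loguniform(1e-4, 1e-1),   # continuous, log-uniform distribution
    'hidden_layers': randint(1, 17),           # integer-uniform in [1, 16]
    'batch_size':    [1024, 2048],             # explicit list of values (grid-style choice)
}
N_TRIALS = 50  # number of random trials

# draw one random combination from the search space
combo = {k: (v.rvs() if hasattr(v, 'rvs') else random.choice(v)) for k, v in search_space.items()}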

Recap Kfold Cross-validation

When we have limited data, splitting the dataset into training and validation sets may exclude data points carrying useful information from the training procedure, and the model may fail to learn the data distribution properly

  • One way to solve this problem is called cross-validation. The following procedure is followed for each of the k “folds”:
    • A model is trained using k−1 of the folds as training data
    • The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy)

  • The performance of the model is computed as the average validation score obtained on the k different splits
    • Drawback: the training time is multiplied by the number of validation sets k

(Figure: 4-fold cross-validation — the dataset is split into a training set and a test set, and each of the 4 folds of the training set is used once as the validation set)

The number of folds k is usually set between 5 and 10
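
A sketch of k-fold cross-validation with scikit-learn, assuming X and y are NumPy arrays and build_model is a hypothetical helper that returns a compiled Keras model:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = build_model()                                        # hypothetical helper
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=1024, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)   # loss, accuracy
    scores.append(acc)
print(np.mean(scores))   # average validation score over the k folds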

Regularisation techniques

  • Regularisation techniques are essential in training deep learning models to prevent overfitting, where a model performs well on the training data but poorly on unseen data
  • Regularization methods add constraints to the optimisation problem, discouraging overly complex models that might fit the training data too closely
  • Common regularisation techniques are:
    • Early stopping
    • L1 regularisation
    • L2 regularisation
    • Dropout

Early stopping

Given a combination of hyperparameters, when do we stop training and jump to the next combination?

Early stopping is a strategy to stop the training process when the error on the validation set reaches its minimum value: this is the point where the model starts overfitting on the training data

Usually the curves are not smooth and it can be hard to understand whether we have reached the minimum

Patience: one solution is to stop only if the error stays above the minimum for a pre-defined number of epochs
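
In Keras, early stopping with patience is available as a callback; a sketch, assuming model and the training/validation data are already defined:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch the error on the validation set
    patience=10,                  # stop only after 10 epochs without improvement
    restore_best_weights=True)    # roll back to the weights with the lowest val_loss

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=1000,            # upper bound; training usually stops much earlier
          callbacks=[early_stop])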

L1/L2 regularisation

Add a penalty term to the loss function. This penalty term discourages the model from assigning excessively large weights to the input features

$$\text{L1\_penalty} = \lambda \sum_{i=1}^{n} |w_i| \qquad \text{L2\_penalty} = \lambda \sum_{i=1}^{n} w_i^2$$

Where: L1_penalty and L2_penalty are the L1/L2 regularisation terms added to the loss function, $\lambda$ is the regularisation parameter, controlling the strength of the regularisation, and $w_i$ are the individual weights in the model.

A higher value of $\lambda$ leads to stronger regularisation

Binary cross-entropy with L1/L2 regularisation

$$J_W(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{m} \sum_{i=1}^m \left( \mathbf{y}_i \log(\hat{\mathbf{y}}_i) + (1 - \mathbf{y}_i) \log(1 - \hat{\mathbf{y}}_i) \right) + \lambda \sum_{l=1}^{L} \sum_{j=1}^{n^{(l)}} |\mathbf{w}_j^{(l)}|$$

$$J_W(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{m} \sum_{i=1}^m \left( \mathbf{y}_i \log(\hat{\mathbf{y}}_i) + (1 - \mathbf{y}_i) \log(1 - \hat{\mathbf{y}}_i) \right) + \lambda \sum_{l=1}^{L} \sum_{j=1}^{n^{(l)}} (\mathbf{w}_j^{(l)})^2$$

Where: $m$ is the number of samples in a mini-batch of training samples, $\mathbf{y}_i$ represents the true label (either 0 or 1) of the $i$-th sample, $\hat{\mathbf{y}}_i$ represents the predicted probability of the $i$-th sample being in class 1, $L$ is the total number of layers in the neural network, $n^{(l)}$ is the total number of weights in layer $l$, $\mathbf{w}_j^{(l)}$ represents the individual weights in layer $l$, and $\lambda$ is the regularization parameter, controlling the strength of the L1/L2 regularization

A higher value of $\lambda$ leads to stronger regularization
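
In Keras, the penalty is attached per layer through a regularizer object; a sketch (the value 0.01 plays the role of λ and is only an example):

import tensorflow as tf

# L2 penalty on the weights of this layer; use tf.keras.regularizers.l1(0.01) for L1
layer = tf.keras.layers.Dense(128, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))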

Dropout

  • Dropout works by randomly deactivating (or “dropping out”) a fraction of neurons during training, meaning these neurons are ignored during forward and backward passes
  • At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
  • After training, neurons don’t get dropped anymore

Dropout (source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.)

Why Dropout Helps Prevent Overfitting

  • Dropout improves the model’s generalization ability by introducing noise into the learning process
    • This process is similar to training multiple smaller networks and averaging their outputs, which makes the final network less likely to rely on any single feature or neuron
  • Implicit Ensemble Learning: Dropout effectively makes a neural network behave like an ensemble of different, “smaller” networks because each mini-batch during training is processed by a different subset of the full network
    • At inference time, the averaged behaviour of these subnetworks is approximated by the entire network, which is typically more robust

The Syn Flood use-case (see the slideset on Datasets)

Dropout in practice

  • Dropout is implemented as an additional layer in neural network frameworks like TensorFlow, PyTorch, and Keras
  • The dropout rate (a hyperparameter) specifies the proportion of neurons to drop
    • Typical values range from 0.2 to 0.5, depending on the network architecture and the complexity of the problem

Dropout (source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.)

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))  # Dropout layer with a rate of 0.5 (50%)
model.add(tf.keras.layers.Dense(64, activation='relu'))


Technical detail about Dropout

  • Suppose p = 50%, in which case during inference a neuron would be connected to twice as many input neurons as it would be (on average) during training
    • To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training
    • If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well
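
An equivalent and widely used alternative is inverted dropout, where the kept activations are scaled up by 1/(1 − p) during training so that no rescaling is needed at inference time; a minimal NumPy sketch:

import numpy as np

def dropout_train(activations, p=0.5, rng=None):
    """Inverted dropout: drop each unit with probability p and scale the
    survivors by 1/(1 - p); at inference the activations are used as-is."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

This is also how typical framework implementations (e.g., the Keras Dropout layer used above) behave, so no manual weight rescaling is needed in practice.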

Limitations of Dropout

  • Not in Output Layers: Dropout is generally not applied to output layers, especially in networks for tasks like classification, as it might cause unstable predictions
  • Training Time: Dropout can slow down training since the model needs to learn redundant representations
  • Optimization Challenges: If dropout is too high, it can lead to underfitting, as the model cannot learn enough meaningful patterns