Training NN

Hyperparameter tuning

Hyperparameter tuning is the process of systematically adjusting the hyperparameters of a machine learning model to optimize its performance on a specific task.

Hyperparameters are values set before training the model and are not learned from the data during the training process.

Examples of hyperparameters include the learning rate, the number of hidden layers in a neural network, the batch size, etc.

Goals of hyperparameter tuning

  • The goal of hyperparameter tuning is to find the set of hyperparameters that result in the best performance of the model on the task of interest
  • This is typically done by training multiple models with different combinations of hyperparameters and evaluating their performance using a validation set
  • The set of hyperparameters that results in the highest performance (e.g., accuracy, F1 Score, Recall, etc.) is then used to train the final model

Formulation

Hyperparameter optimization can be represented in equation form as:

$$\bar{\lambda} = \operatorname{argmax}_{\lambda \in \Lambda} \, f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$$

where $\Lambda$ is the space of hyperparameter combinations and $f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$ measures the performance of the machine learning system with the combination of hyperparameters $\lambda$

The performance is usually defined as the accuracy but any other metric can also be used (e.g., F1 Score, Recall, Precision, etc.)

  • The optimisation algorithms try to find the combination of hyperparameters $\bar{\lambda}$ that maximises $f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$
  • In the case of regression problems, the solution is the set of hyperparameters that minimises the RMSE between the predicted values and the values in the validation set:

$$\bar{\lambda} = \operatorname{argmin}_{\lambda \in \Lambda} \, f_\lambda(\mathbf{y}, \hat{\mathbf{y}}_\lambda)$$

Architectural Hyperparameters

  • Number of Layers: Determines the depth of the neural network
  • Number of Neurons (Units) in Each Layer: Specifies the number of nodes in each layer
  • Network Topology: Specifies how neurons are interconnected (e.g., fully connected, convolutional layers in CNNs, recurrent connections in RNNs)
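
As a minimal sketch (in Keras, consistent with the code later in this slideset), the depth and width of the network can be exposed as arguments of a model-building function so they can be tuned like any other hyperparameter; the helper name and default values below are illustrative only:

import tensorflow as tf

def build_model(n_hidden_layers=4, n_units=64):
    """Illustrative helper: depth and width are passed in as hyperparameters."""
    model = tf.keras.Sequential()
    for _ in range(n_hidden_layers):                                   # number of layers
        model.add(tf.keras.layers.Dense(n_units, activation='relu'))   # neurons per layer
    model.add(tf.keras.layers.Dense(1, activation='sigmoid'))          # binary output
    return model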

Activation function: usually the ReLU activation function works well for the hidden layers. Other options for hidden layers are Leaky ReLU, ELU, among others

  • Leaky ReLU and ELU try to avoid the problem of dying neurons. ELU converges faster but is slower than ReLU at test time (the exponential).
  • Leaky ReLU is a generalisation of ReLU, with the slope coefficient α applied to negative inputs as a hyperparameter.
  • α is also a hyperparameter for ELU.

For the output layers, the activation function depends on the problem (e.g., sigmoid for binary problems, softmax for multi-class problems)
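
A minimal Keras sketch of these choices (the layer sizes are arbitrary examples):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),    # ReLU: usual default for hidden layers
    tf.keras.layers.Dense(64, activation='elu'),     # ELU alternative
    tf.keras.layers.Dense(64),                       # linear layer followed by...
    tf.keras.layers.LeakyReLU(),                     # ...a Leaky ReLU activation layer
    tf.keras.layers.Dense(1, activation='sigmoid')   # sigmoid output for a binary problem
    # for a multi-class problem: tf.keras.layers.Dense(n_classes, activation='softmax')
])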

  • Learning Rate: Determines the step size during optimisation. Too high a learning rate can cause overshooting, and too low a learning rate can lead to slow convergence or getting stuck in local minima
  • Batch Size: Number of training examples used in one iteration of training. Larger batch sizes can speed up training, but very large batches may lead to memory issues
  • Epochs: Number of times the entire dataset is passed forward and backward through the neural network during training
  • Optimizer: Specifies the optimization algorithm used (e.g., Adam, SGD, etc.)
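
These hyperparameters map directly onto the Keras compile/fit calls; a sketch, assuming model, X_train, y_train, X_val and y_val are already defined:

import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # optimizer and learning rate
    loss='binary_crossentropy',
    metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=1024,   # batch size
                    epochs=50)         # number of epochs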

Regularization Hyperparameters

  • Weight Decay: Penalty term added to the loss function to prevent large weights (e.g., L1 and L2 norms)
  • Dropout Rate: Regularization technique where random neurons are dropped during training to prevent overfitting
  • Early Stopping: Training is stopped when the validation performance starts to degrade, preventing overfitting

More details about dropout will follow when we discuss Regularisation Techniques


Problem-specific hyperparameters

  • Maximum number of packets/flow: the input layer has a fixed shape that you must set
    • This shape is determined by the flow representation of choice
    • In the case of packet-based representation, the maximum number of packets/flow must be set in advance
  • Time window: in real-world application scenarios, the network traffic is collected for a certain time window and then sent to the IDS for analysis
    • The size of the time window must also be set, as it determines how the traffic flows are collected
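
A sketch of how the maximum number of packets/flow fixes the input shape, assuming each packet is described by a feature vector (the feature count and names below are hypothetical):

import numpy as np

MAX_PACKETS = 10   # maximum number of packets/flow (hyperparameter)
N_FEATURES = 11    # hypothetical number of features extracted per packet

def to_fixed_shape(flow):
    """Pad with zeros or truncate a flow (list of per-packet feature vectors)
    so that every sample has shape (MAX_PACKETS, N_FEATURES)."""
    sample = np.zeros((MAX_PACKETS, N_FEATURES), dtype=np.float32)
    n = min(len(flow), MAX_PACKETS)
    sample[:n] = np.asarray(flow[:n], dtype=np.float32)
    return sample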


Hyperparameter tuning techniques

  • Manual tuning: In this approach, the data scientist adjusts the hyperparameters based on their understanding of the model and the problem at hand
    • While manual tuning can be slow and tedious, it allows for a better understanding of how the hyperparameters affect the model.
  • Automated tuning: This approach uses algorithms to search for the best hyperparameter values more efficiently
    • One popular method is grid search, where a grid of hyperparameter values is defined, and the model is trained and evaluated for each combination of values
    • Other advanced techniques, such as random search, Bayesian optimization, and genetic algorithms, can also be used for automated hyperparameter tuning

Grid search

  • Given a set of hyperparameters and a list of possible choices for each of them (defined by the user), a grid search algorithm will train a model multiple times, each time with a different combination of hyperparameters. E.g.,
  • LR = [0.1, 0.01, 0.001], BATCH_SIZE = [1024, 2048], HIDDEN_LAYERS = [1, 2, 4, 8, 16], PACKETS_PER_FLOW = [1, 2, 3, 4, 5, 10, 20, 50, 100]
  • The algorithm will train the model 3 × 2 × 5 × 9 = 270 times
  • The best combination is the one that produces the highest score on the validation set
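
A sketch of the grid search loop for the example above; train_and_evaluate is a hypothetical helper that trains the model with the given hyperparameters and returns its score on the validation set:

import itertools

LR = [0.1, 0.01, 0.001]
BATCH_SIZE = [1024, 2048]
HIDDEN_LAYERS = [1, 2, 4, 8, 16]
PACKETS_PER_FLOW = [1, 2, 3, 4, 5, 10, 20, 50, 100]

best_score, best_params = -1.0, None
for lr, bs, layers, pkts in itertools.product(LR, BATCH_SIZE, HIDDEN_LAYERS, PACKETS_PER_FLOW):
    score = train_and_evaluate(lr, bs, layers, pkts)  # hypothetical helper (validation score)
    if score > best_score:
        best_score, best_params = score, (lr, bs, layers, pkts)
print(best_params, best_score)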

Random search

Source: Bergstra, J., & Bengio, Y. (2012). Random search for hyperparameter optimization. Journal of machine learning research, 13(2).

  • Grid search is a good approach when the number of combinations of hyperparameters is limited and the model size permits fast training
  • In random search, instead of trying every combination of hyperparameters, random combinations of hyperparameters are sampled from a specified distribution
  • With random search, one can set the maximum number of combinations to test, so as to control the total training time
  • As with grid search, the best combination (among those randomly selected) is the one that produces the highest score on the validation set
  • Random search is a stochastic method, meaning that the results will vary each time it is run
    • To obtain reliable results, it is usually recommended to run the search multiple times and average the results
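
A sketch of random search under the same assumptions (train_and_evaluate is the same hypothetical helper; the distributions and bounds are only examples):

import random

MAX_TRIALS = 50  # upper bound on the number of combinations, i.e., on the total training time

best_score, best_params = -1.0, None
for _ in range(MAX_TRIALS):
    lr = 10 ** random.uniform(-4, -1)                        # log-uniform in [1e-4, 1e-1]
    bs = random.choice([1024, 2048])                         # discrete choice
    layers = random.randint(1, 16)                           # integer-uniform in [1, 16]
    pkts = random.choice([1, 2, 3, 4, 5, 10, 20, 50, 100])
    score = train_and_evaluate(lr, bs, layers, pkts)         # validation score
    if score > best_score:
        best_score, best_params = score, (lr, bs, layers, pkts)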

Randomized values

Random search samples a random value for each hyperparameter from a statistical distribution defined for that hyperparameter

Automated tuning steps

  • Identify the hyperparameters: Start by identifying the hyperparameters of your machine learning model that need to be optimized
  • Define the range of values: For each hyperparameter, define the range of possible values that you want to consider. This can be done based on prior knowledge or through experimentation
  • Choose a distribution: For each hyperparameter, choose a distribution to sample from (e.g., uniform, Gaussian, etc.), or the list of values in the case of a grid search
  • Set the number of trials (random search): Specify the number of random trials to run during the search. This will determine the number of combinations of hyperparameters that will be sampled
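
These steps can be summarised as a search-space specification; a sketch using scipy.stats distributions (the ranges, distributions and number of trials below are only examples):

import random
from scipy.stats import loguniform, randint

search_space = {
    'learning_rate': loguniform(1e-4, 1e-1),   # continuous, log-uniform distribution
    'hidden_layers': randint(1, 17),           # integer-uniform in [1, 16]
    'batch_size':    [1024, 2048],             # explicit list of values (grid-style choice)
}
N_TRIALS = 50  # number of random trials

# draw one random combination from the search space
combo = {k: (v.rvs() if hasattr(v, 'rvs') else random.choice(v)) for k, v in search_space.items()}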

Recap Kfold Cross-validation

When we have limited data, splitting the dataset into training and validation sets may exclude data points carrying useful information from the training procedure, and the model may fail to learn the data distribution properly

  • One way to solve this problem is called cross-validation. The following procedure is followed for each of the k “folds”:
    • A model is trained using k−1 of the folds as training data
    • The resulting model is validated on the remaining fold (i.e., it is used as a test set to compute a performance measure such as accuracy)

  • The performance of the model is computed as the average validation score obtained on the k different splits
    • Drawback: the training time is multiplied by the number of validation sets k

(Figure: 4-fold cross-validation — the dataset is split into a training set and a test set, and each of the 4 folds of the training set is used once as the validation set)

The number of folds k is usually set between 5 and 10
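
A sketch of k-fold cross-validation with scikit-learn, assuming X and y are NumPy arrays and build_model is a hypothetical helper that returns a compiled Keras model:

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for train_idx, val_idx in kf.split(X):
    model = build_model()                                        # hypothetical helper
    model.fit(X[train_idx], y[train_idx], epochs=10, batch_size=1024, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)   # loss, accuracy
    scores.append(acc)
print(np.mean(scores))   # average validation score over the k folds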

Regularisation techniques

  • Regularisation techniques are essential in training deep learning models to prevent overfitting, where a model performs well on the training data but poorly on unseen data
  • Regularization methods add constraints to the optimisation problem, discouraging overly complex models that might fit the training data too closely
  • Common regularisation techniques are:
    • Early stopping
    • L1 regularisation
    • L2 regularisation
    • Dropout

Early stopping

Given a combination of hyperparameters, when do we stop training and jump to the next combination?

Early stopping is a strategy to stop the training process when the error on the validation set reaches its minimum value: this is the point where the model starts overfitting on the training data

Usually the curves are not smooth and it can be hard to understand whether we have reached the minimum

Patience: one solution is to stop only if the error stays above the minimum for a pre-defined number of epochs
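
In Keras, early stopping with patience is available as a callback; a sketch, assuming model and the training/validation data are already defined:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',           # watch the error on the validation set
    patience=10,                  # stop only after 10 epochs without improvement
    restore_best_weights=True)    # roll back to the weights with the lowest val_loss

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=1000,            # upper bound; training usually stops much earlier
          callbacks=[early_stop])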

L1/L2 regularisation

Add a penalty term to the loss function. This penalty term discourages the model from assigning excessively large weights to the input features

$$\text{L1\_penalty} = \lambda \sum_{i=1}^{n} |w_i| \qquad \text{L2\_penalty} = \lambda \sum_{i=1}^{n} w_i^2$$

Where: L1_penalty and L2_penalty are the L1/L2 regularisation terms added to the loss function, $\lambda$ is the regularisation parameter, controlling the strength of the regularisation, and $w_i$ are the individual weights in the model.

A higher value of $\lambda$ leads to stronger regularisation

Binary cross-entropy with L1/L2 regularisation

$$J_W(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{m} \sum_{i=1}^m \left( \mathbf{y}_i \log(\hat{\mathbf{y}}_i) + (1 - \mathbf{y}_i) \log(1 - \hat{\mathbf{y}}_i) \right) + \lambda \sum_{l=1}^{L} \sum_{j=1}^{n^{(l)}} |\mathbf{w}_j^{(l)}|$$

$$J_W(\mathbf{y}, \hat{\mathbf{y}}) = -\frac{1}{m} \sum_{i=1}^m \left( \mathbf{y}_i \log(\hat{\mathbf{y}}_i) + (1 - \mathbf{y}_i) \log(1 - \hat{\mathbf{y}}_i) \right) + \lambda \sum_{l=1}^{L} \sum_{j=1}^{n^{(l)}} (\mathbf{w}_j^{(l)})^2$$

Where: $m$ is the number of samples in a mini-batch of training samples, $\mathbf{y}_i$ represents the true label (either 0 or 1) of the $i$-th sample, $\hat{\mathbf{y}}_i$ represents the predicted probability of the $i$-th sample being in class 1, $L$ is the total number of layers in the neural network, $n^{(l)}$ is the total number of weights in layer $l$, $\mathbf{w}_j^{(l)}$ represents the individual weights in layer $l$, and $\lambda$ is the regularization parameter, controlling the strength of the L1/L2 regularization

A higher value of $\lambda$ leads to stronger regularization
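
In Keras, the penalty is attached per layer through a regularizer object; a sketch (the value 0.01 plays the role of λ and is only an example):

import tensorflow as tf

# L2 penalty on the weights of this layer; use tf.keras.regularizers.l1(0.01) for L1
layer = tf.keras.layers.Dense(128, activation='relu',
                              kernel_regularizer=tf.keras.regularizers.l2(0.01))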

Dropout

  • Dropout works by randomly deactivating (or “dropping out”) a fraction of neurons during training, meaning these neurons are ignored during forward and backward passes
  • At every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step
  • After training, neurons don’t get dropped anymore

Dropout (source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.)

Why Dropout Helps Prevent Overfitting

  • Dropout improves the model’s generalization ability by introducing noise into the learning process
    • This process is similar to training multiple smaller networks and averaging their outputs, which makes the final network less likely to rely on any single feature or neuron
  • Implicit Ensemble Learning: Dropout effectively makes a neural network behave like an ensemble of different, “smaller” networks because each mini-batch during training is processed by a different subset of the full network
    • At inference time, the averaged behaviour of these subnetworks is approximated by the entire network, which is typically more robust

The Syn Flood use-case (see the slideset on Datasets)

Dropout in practice

  • Dropout is implemented as an additional layer in neural network frameworks like TensorFlow, PyTorch, and Keras
  • The dropout rate (a hyperparameter) specifies the proportion of neurons to drop
    • Typical values range from 0.2 to 0.5, depending on the network architecture and the complexity of the problem

Dropout (source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media, Inc.)

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))  # Dropout layer with a rate of 0.5 (50%)
model.add(tf.keras.layers.Dense(64, activation='relu'))


Technical detail about Dropout

  • Suppose p = 50%, in which case during inference a neuron would be connected to twice as many input neurons as it would be (on average) during training
    • To compensate for this fact, we need to multiply each neuron’s input connection weights by 0.5 after training
    • If we don’t, each neuron will get a total input signal roughly twice as large as what the network was trained on and will be unlikely to perform well
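
An equivalent and widely used alternative is inverted dropout, where the kept activations are scaled up by 1/(1 − p) during training so that no rescaling is needed at inference time; a minimal NumPy sketch:

import numpy as np

def dropout_train(activations, p=0.5, rng=None):
    """Inverted dropout: drop each unit with probability p and scale the
    survivors by 1/(1 - p); at inference the activations are used as-is."""
    rng = rng if rng is not None else np.random.default_rng()
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)

This is also how typical framework implementations (e.g., the Keras Dropout layer used above) behave, so no manual weight rescaling is needed in practice.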

Limitations of Dropout

  • Not in Output Layers: Dropout is generally not applied to output layers, especially in networks for tasks like classification, as it might cause unstable predictions
  • Training Time: Dropout can slow down training since the model needs to learn redundant representations
  • Optimization Challenges: If dropout is too high, it can lead to underfitting, as the model cannot learn enough meaningful patterns