Autoencoders
- Introduction to Autoencoders
- Dimensionality reduction with linear autoencoders
- Stacked autoencoders
- Unsupervised pre-training with stacked autoencoders
- Convolutional autoencoders
- Network anomaly detection with autoencoders
An autoencoder is a type of artificial neural network capable of learning representations of the input data (called latent representations or codings) without any supervision (unsupervised learning).
An autoencoder is trained to encode input data into a compact representation and then decode it back to its original form.
The encoding process forces the autoencoder to capture the most important features of the input data in the reduced-dimensional representation.
Autoencoders are often used for tasks such as:
- image denoising
- anomaly detection
- data compression
Main concepts
The architecture of an autoencoder consists of an encoder and a decoder
The encoder takes the input data and maps it to a lower-dimensional representation
The decoder reconstructs the original data from this representation
- The training objective is to minimize the reconstruction error (or loss), which measures the difference between the input and the output of the autoencoder
- The outputs are often called the reconstructions because the autoencoder tries to reconstruct the inputs
- The cost function (e.g., MSE) contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs
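As a small numeric illustration (pure NumPy with made-up toy values, not from the original slides), the MSE reconstruction loss is simply the mean squared difference between the inputs and their reconstructions:

import numpy as np

x = np.array([[0.2, 0.7, 0.1],          # two toy inputs with three features each
              [0.9, 0.4, 0.5]])
x_hat = np.array([[0.25, 0.65, 0.15],   # the autoencoder's reconstructions of x
                  [0.80, 0.45, 0.40]])

# MSE reconstruction loss: average of the squared differences over all features and samples
reconstruction_loss = np.mean(np.square(x - x_hat))
print(reconstruction_loss)   # 0.005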
Autoencoders can be viewed as a type of self-supervised learning, employing a supervised learning approach with automatically generated labels. In this context, the labels are essentially identical to the inputs themselves.
- When the internal representation has a lower dimensionality than the input data, the autoencoder is said to be undercomplete
- An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs
- It is forced to learn the most important features in the input data (and drop the unimportant ones)
PCA with undercomplete linear autoencoders
If the autoencoder uses only linear activations and the cost function is the MSE, then it ends up performing Principal Component Analysis (PCA)
Recap on PCA
- Principal Component Analysis (PCA) is a technique for dimensionality reduction
- Principal components are new variables that are constructed as linear combinations of the initial variables
- The main idea behind principal components is to select the axis that preserves the maximum amount of variance, i.e. the line that maximises the average of the squared distances from the projected points to the origin (for centred data)
- The second axis is the one that accounts for the largest amount of the remaining variance, and so on…
- Dimensionality Reduction: PCA reduces the number of variables, making computations more efficient and effective
- Visualisation: Data with reduced dimensions can be visualised more easily, allowing for better understanding and interpretation
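As a quick, illustrative sketch (scikit-learn and the toy array are assumptions added here, not part of the original slides), PCA can reduce a 21-feature dataset to 2 dimensions as follows:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 21)            # toy dataset: 200 samples, 21 features
pca = PCA(n_components=2)              # keep the two axes preserving the most variance
X_2d = pca.fit_transform(X)            # project the data onto the principal components

print(pca.components_.shape)           # (2, 21): each row is a principal component
print(pca.explained_variance_ratio_)   # fraction of variance captured by each axis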
Dimensionality reduction with autoencoders
Use an autoencoder with a bottleneck layer (the layer with the smallest number of neurons) whose size corresponds to the desired dimensionality of the PCA space
- The activation functions in both the encoder and decoder layers are linear (e.g., identity activation function)
- The linear activation functions in the encoder and decoder effectively create a linear transformation between the input and the reduced-dimensional representation
Training
The linear autoencoder is trained with the goal of minimizing the reconstruction error (MSE), which is the difference between the input and the output
During training, the autoencoder will learn a linear mapping from the original space to a lower-dimensional space and back
Encoder Weights as Principal Components
Once the autoencoder is trained, the weights of the encoder’s layer (the layer that corresponds to the bottleneck layer) can be interpreted as the principal components of your data
Dimensionality Reduction
To perform dimensionality reduction on new data, you can use the encoder part of the trained linear autoencoder to map the data to the lower-dimensional space
2D data representation
Like PCA, a linear autoencoder can be used for visualising the training data, e.g., by reducing it from 21 features to only 2, as in the example below
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define the autoencoder (here `data` is assumed to be an array with 21 features per sample)
autoencoder = Sequential([
    Dense(2, activation='linear', input_shape=(21,)),  # Encoding layer (bottleneck)
    Dense(data.shape[1], activation='linear')          # Decoding layer
])
# Compile the model
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
# Train the autoencoder (the inputs are also the targets)
autoencoder.fit(data, data, epochs=100, batch_size=32, shuffle=True)
# Extract the encoder part for dimensionality reduction
encoder = Sequential([autoencoder.layers[0]])  # The first layer is the encoder
encoded_data_2d = encoder.predict(data)
Source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.
Linear Autoencoder and PCA
- The linear autoencoder essentially learns a linear transformation that captures the most important features of the data in a reduced-dimensional space, similar to PCA
- However, keep in mind that the learned components might NOT be the same as those obtained from traditional PCA, but they should capture similar linear relationships in the data
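As a hedged sketch of this point (reusing the `data` and `autoencoder` objects from the example above; nothing here is prescribed by the slides), one can compare the encoder's weights with the components found by scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_2d = pca.fit_transform(data)           # PCA projection of the same data

# The Dense kernel has shape (21, 2); PCA's components_ has shape (2, 21)
encoder_weights = autoencoder.layers[0].get_weights()[0]
print(encoder_weights.shape, pca.components_.shape)

# The two 2D representations span similar linear subspaces, but the autoencoder's
# axes are generally neither orthonormal nor ordered by explained variance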
Stacked autoencoders
A stacked autoencoder, also known as a deep autoencoder, is an autoencoder with multiple hidden layers
- The output of each encoder layer is fed into the next one, until the last encoder layer feeds its output into the chain of decoder layers
- This allows a step-by-step compression and decompression of the input data
- The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer)
- The activation function in the output layer can be the sigmoid (for input data normalised to [0, 1]) or the linear function (for non-normalised data)
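A minimal Keras sketch of such a symmetrical stacked autoencoder (the layer sizes, the 21-feature input and the [0, 1] scaling are illustrative assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 21   # illustrative: e.g., 21 traffic features per flow

stacked_encoder = Sequential([
    Dense(64, activation='relu', input_shape=(n_features,)),
    Dense(32, activation='relu'),
    Dense(10, activation='relu')])               # central coding layer
stacked_decoder = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(64, activation='relu'),
    Dense(n_features, activation='sigmoid')])    # sigmoid assumes inputs scaled to [0, 1]

stacked_ae = Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(optimizer='adam', loss='mse')
# stacked_ae.fit(X_train, X_train, epochs=20, batch_size=32)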
Problem: training a NIDS (Network Intrusion Detection System) with unlabelled network data
Having plenty of unlabelled data and little labelled data is common:
- Building a large unlabelled dataset is often cheap
- Labelling network flows can usually be done reliably only by humans, which is error-prone, time-consuming and costly
Solution: use a pre-trained stacked autoencoder to learn the meaningful representations from the network data without relying on labelled information
Unsupervised pre-training of a NIDS with stacked autoencoders
The approach involves initial unsupervised training of an autoencoder on the entire dataset, incorporating both labelled and unlabelled samples, where the labels are disregarded during this phase
- The key strategy is to pre-train the autoencoder to learn meaningful representations from the data without relying on labelled information
- Subsequently, we leverage the pre-trained encoder to construct a new neural network for building the NIDS
- Finally, we train the NIDS on the few labelled samples, keeping the encoder layers frozen
- The rationale behind unsupervised pre-training with stacked autoencoders lies in the ability of the initial layers to autonomously capture general features and patterns present in the data
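A hedged sketch of this two-phase workflow (all names, layer sizes and the binary benign/attack output are assumptions for illustration, not the original lab code):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Phase 1: unsupervised pre-training on ALL flows, labels ignored
# X_all = labelled + unlabelled flows; X_few, y_few = the small labelled subset
encoder = Sequential([
    Dense(64, activation='relu', input_shape=(X_all.shape[1],)),
    Dense(32, activation='relu'),
    Dense(10, activation='relu')])
decoder = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(64, activation='relu'),
    Dense(X_all.shape[1], activation='linear')])
autoencoder = Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_all, X_all, epochs=20, batch_size=64)

# Phase 2: reuse the pre-trained encoder, freeze it, and train a small classifier on top
encoder.trainable = False
nids = Sequential([encoder,
                   Dense(16, activation='relu'),
                   Dense(1, activation='sigmoid')])   # benign vs. attack
nids.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
nids.fit(X_few, y_few, epochs=20, batch_size=32)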
Convolutional Autoencoders
A Convolutional Autoencoder (CAE) is a type of autoencoder that uses convolutional layers in its architecture
- Convolutional autoencoders are specifically designed for tasks involving grid-like data
- Similar to traditional autoencoders, convolutional autoencoders consist of an encoder and a decoder
- The encoder uses convolutional layers to capture spatial patterns and features in the input data
- The decoder uses upsampling layers to reconstruct the input from the encoded representation
- Convolutional autoencoders find applications in image denoising and image compression
- Convolutional autoencoders can also be used for network anomaly detection, when the network flows are organised in grid-like structures of traffic features
Sample code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Reshape, Conv2D, MaxPool2D, UpSampling2D

conv_encoder = Sequential(name='ConvEncoder', layers=[
    # add a channel dimension to the 2D grid of traffic features
    Reshape([X_train.shape[1], X_train.shape[2], 1],
            input_shape=[X_train.shape[1], X_train.shape[2]]),
    Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    MaxPool2D(pool_size=(2, 2)),
    Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    MaxPool2D(pool_size=(2, 2))])

conv_decoder = Sequential(name='ConvDecoder', layers=[
    # the hard-coded input shape [25, 5, 32] implies 100x20 input grids (after two 2x2 poolings)
    Conv2D(16, kernel_size=3, padding="same", activation="relu", input_shape=[25, 5, 32]),
    UpSampling2D((2, 2)),
    Conv2D(1, kernel_size=3, padding="same", activation="sigmoid"),
    UpSampling2D((2, 2)),
    Reshape([X_train.shape[1], X_train.shape[2]])])

conv_ae = Sequential([conv_encoder, conv_decoder])
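As a usage sketch (the epoch count, batch size and validation split are arbitrary assumptions, and X_train is assumed to be scaled to [0, 1] to match the sigmoid output), the chained model is trained like any other Keras model:

conv_ae.compile(optimizer='adam', loss='mse')
conv_ae.fit(X_train, X_train, epochs=10, batch_size=64, validation_split=0.1)
reconstructions = conv_ae.predict(X_train)   # same shape as X_train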
Network anomaly detection with autoencoders
Autoencoders are a type of artificial neural network commonly used for anomaly detection in several domains, including network security
Autoencoders leverage unsupervised learning to identify deviations from normal patterns in the network traffic data
Training phase:
- The autoencoder is trained on a dataset containing only benign network traffic
- The network learns to map the input data to itself, effectively learning a compressed representation that captures the essential features of benign flows
- The loss function measures the difference between the input and the reconstructed output
- MSE is commonly used as the loss function
Testing phase
- During the testing phase, the network is given new data, including normal and potentially anomalous network flows
- The reconstruction error is calculated by comparing the input data with its reconstructed output
- Higher reconstruction errors indicate that the input deviates from the learned patterns of benign flows
- Threshold Setting:
- A threshold is set on the reconstruction error
- Flows with reconstruction errors above this threshold are considered anomalies
- The threshold can be determined based on statistical methods or domain knowledge
- It’s a trade-off between false positives and false negatives
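A hedged sketch of the reconstruction-error threshold logic (it assumes an autoencoder already trained on benign flows, plus NumPy arrays X_benign and X_test; the 99th-percentile rule is just one of the statistical options mentioned above):

import numpy as np

# Per-flow reconstruction error (MSE) on held-out benign traffic
benign_recon = autoencoder.predict(X_benign)
benign_errors = np.mean(np.square(X_benign - benign_recon), axis=1)

# Threshold, e.g. the 99th percentile of benign errors (trade-off between FP and FN)
threshold = np.percentile(benign_errors, 99)

# Testing phase: flag flows whose reconstruction error exceeds the threshold
test_recon = autoencoder.predict(X_test)
test_errors = np.mean(np.square(X_test - test_recon), axis=1)
is_anomaly = test_errors > threshold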
Challenges and considerations
- Choosing Architecture: The architecture of the autoencoder, including the number of layers and nodes, must be carefully chosen based on the characteristics of the data
- Balancing False Positives and False Negatives: Setting an appropriate threshold is crucial to balance between detecting anomalies and avoiding false alarms
- Concept drift: In the presence of concept drift, the patterns of normal behaviour may change over time
- If the autoencoder is trained on historical data with outdated normal patterns, it may struggle to accurately reconstruct and detect anomalies in the new data
- Regularly retraining the autoencoder on updated data is one strategy