Autoencoders
- Introduction to Autoencoders
- Dimensionality reduction with linear autoencoders
- Stacked autoencoders
- Unsupervised pre-training with stacked autoencoders
- Convolutional autoencoders
- Network anomaly detection with autoencoders
An autoencoder is a type of artificial neural network capable of learning representations of the input data (called latent representations or codings) without any supervision (unsupervised learning).
An autoencoder is trained to encode input data into a compact representation and then decode it back to its original form.
The encoding process forces the autoencoder to capture the most important features of the input data in the reduced-dimensional representation.
Autoencoders are often used for tasks such as:
- image denoising
- anomaly detection
- data compression
Main concepts
The architecture of an autoencoder consists of an encoder and a decoder
The encoder takes the input data and maps it to a lower-dimensional representation
The decoder reconstructs the original data from this representation
- The training objective is to minimize the reconstruction error (or loss), which measures the difference between the input and the output of the autoencoder
- The outputs are often called the reconstructions because the autoencoder tries to reconstruct the inputs
- The cost function (e.g., MSE) contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs
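As a small numeric illustration (pure NumPy with made-up toy values, not from the original slides), the MSE reconstruction loss is simply the mean squared difference between the inputs and their reconstructions:

import numpy as np

x = np.array([[0.2, 0.7, 0.1],          # two toy inputs with three features each
              [0.9, 0.4, 0.5]])
x_hat = np.array([[0.25, 0.65, 0.15],   # the autoencoder's reconstructions of x
                  [0.80, 0.45, 0.40]])

# MSE reconstruction loss: average of the squared differences over all features and samples
reconstruction_loss = np.mean(np.square(x - x_hat))
print(reconstruction_loss)   # 0.005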
Autoencoders can be viewed as a type of self-supervised learning, employing a supervised learning approach with automatically generated labels. In this context, the labels are essentially identical to the inputs themselves.
- When the internal representation has a lower dimensionality than the input data, the autoencoder is said to be undercomplete
- An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs
- It is forced to learn the most important features in the input data (and drop the unimportant ones)
PCA with undercomplete linear autoencoders
If the autoencoder uses only linear activations and the cost function is the MSE, then it ends up performing Principal Component Analysis (PCA)
Recap on PCA
- Principal Component Analysis (PCA) is a technique for dimensionality reduction
- Principal components are new variables that are constructed as linear combinations of the initial variables
- The main idea behind principal components is to select the axis that preserves the maximum amount of variance, i.e. the line that maximises the average of the squared distances from the projected points to the origin (for centred data)
- The second axis is the one that accounts for the largest amount of the remaining variance, and so on…
- Dimensionality Reduction: PCA reduces the number of variables, making computations more efficient and effective
- Visualisation: Data with reduced dimensions can be visualised more easily, allowing for better understanding and interpretation
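As a quick, illustrative sketch (scikit-learn and the toy array are assumptions added here, not part of the original slides), PCA can reduce a 21-feature dataset to 2 dimensions as follows:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 21)            # toy dataset: 200 samples, 21 features
pca = PCA(n_components=2)              # keep the two axes preserving the most variance
X_2d = pca.fit_transform(X)            # project the data onto the principal components

print(pca.components_.shape)           # (2, 21): each row is a principal component
print(pca.explained_variance_ratio_)   # fraction of variance captured by each axis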
Dimensionality reduction with autoencoders
Use an autoencoder with a bottleneck layer (the layer with the smallest number of neurons) whose size corresponds to the desired dimensionality of the PCA space
- The activation functions in both the encoder and decoder layers are linear (e.g., identity activation function)
- The linear activation functions in the encoder and decoder effectively create a linear transformation between the input and the reduced-dimensional representation
Training
The linear autoencoder is trained with the goal of minimizing the reconstruction error (MSE), which is the difference between the input and the output
During training, the autoencoder will learn a linear mapping from the original space to a lower-dimensional space and back
Encoder Weights as Principal Components
Once the autoencoder is trained, the weights of the encoder’s layer (the layer that corresponds to the bottleneck layer) can be interpreted as the principal components of your data
Dimensionality Reduction
To perform dimensionality reduction on new data, you can use the encoder part of the trained linear autoencoder to map the data to the lower-dimensional space
2D data representation
Like PCA, a linear autoencoder can be used for visualising the training data, e.g., by reducing it from 21 features to only 2, as in the example below
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Define the autoencoder (here `data` is assumed to be an array with 21 features per sample)
autoencoder = Sequential([
    Dense(2, activation='linear', input_shape=(21,)),  # Encoding layer (bottleneck)
    Dense(data.shape[1], activation='linear')          # Decoding layer
])
# Compile the model
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
# Train the autoencoder (the inputs are also the targets)
autoencoder.fit(data, data, epochs=100, batch_size=32, shuffle=True)
# Extract the encoder part for dimensionality reduction
encoder = Sequential([autoencoder.layers[0]])  # The first layer is the encoder
encoded_data_2d = encoder.predict(data)
Source: Géron, A. (2022). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.
Linear Autoencoder and PCA
- The linear autoencoder essentially learns a linear transformation that captures the most important features of the data in a reduced-dimensional space, similar to PCA
- However, keep in mind that the learned components might NOT be the same as those obtained from traditional PCA, but they should capture similar linear relationships in the data
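As a hedged sketch of this point (reusing the `data` and `autoencoder` objects from the example above; nothing here is prescribed by the slides), one can compare the encoder's weights with the components found by scikit-learn's PCA:

import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca_2d = pca.fit_transform(data)           # PCA projection of the same data

# The Dense kernel has shape (21, 2); PCA's components_ has shape (2, 21)
encoder_weights = autoencoder.layers[0].get_weights()[0]
print(encoder_weights.shape, pca.components_.shape)

# The two 2D representations span similar linear subspaces, but the autoencoder's
# axes are generally neither orthonormal nor ordered by explained variance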
Stacked autoencoders
A stacked autoencoder, also known as a deep autoencoder, is an autoencoder with multiple hidden layers
- The output of each encoder layer is fed into the next one, until the last encoder layer feeds its output into the chain of decoder layers
- This allows a step-by-step compression and decompression of the input data
- The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer)
- The activation function in the output layer can be the sigmoid (for input data normalised to [0, 1]) or the linear function (for non-normalised data)
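A minimal Keras sketch of such a symmetrical stacked autoencoder (the layer sizes, the 21-feature input and the [0, 1] scaling are illustrative assumptions):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_features = 21   # illustrative: e.g., 21 traffic features per flow

stacked_encoder = Sequential([
    Dense(64, activation='relu', input_shape=(n_features,)),
    Dense(32, activation='relu'),
    Dense(10, activation='relu')])               # central coding layer
stacked_decoder = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(64, activation='relu'),
    Dense(n_features, activation='sigmoid')])    # sigmoid assumes inputs scaled to [0, 1]

stacked_ae = Sequential([stacked_encoder, stacked_decoder])
stacked_ae.compile(optimizer='adam', loss='mse')
# stacked_ae.fit(X_train, X_train, epochs=20, batch_size=32)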
Problem: training a NIDS (Network Intrusion Detection System) with unlabelled network data
Having plenty of unlabelled data and little labelled data is common:
- Building a large unlabelled dataset is often cheap
- Labelling network flows can usually be done reliably only by humans, which is error-prone, time-consuming and costly
Solution: use a pre-trained stacked autoencoder to learn the meaningful representations from the network data without relying on labelled information
Unsupervised pre-training of a NIDS with stacked autoencoders
The approach involves initial unsupervised training of an autoencoder on the entire dataset, incorporating both labelled and unlabelled samples, where the labels are disregarded during this phase
- The key strategy is to pre-train the autoencoder to learn meaningful representations from the data without relying on labelled information
- Subsequently, we leverage the pre-trained encoder to construct a new neural network for building the NIDS
- Finally, we train the NIDS on the few labelled samples, keeping the encoder layers frozen
- The rationale behind unsupervised pre-training with stacked autoencoders lies in the ability of the initial layers to autonomously capture general features and patterns present in the data
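A hedged sketch of this two-phase workflow (all names, layer sizes and the binary benign/attack output are assumptions for illustration, not the original lab code):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Phase 1: unsupervised pre-training on ALL flows, labels ignored
# X_all = labelled + unlabelled flows; X_few, y_few = the small labelled subset
encoder = Sequential([
    Dense(64, activation='relu', input_shape=(X_all.shape[1],)),
    Dense(32, activation='relu'),
    Dense(10, activation='relu')])
decoder = Sequential([
    Dense(32, activation='relu', input_shape=(10,)),
    Dense(64, activation='relu'),
    Dense(X_all.shape[1], activation='linear')])
autoencoder = Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_all, X_all, epochs=20, batch_size=64)

# Phase 2: reuse the pre-trained encoder, freeze it, and train a small classifier on top
encoder.trainable = False
nids = Sequential([encoder,
                   Dense(16, activation='relu'),
                   Dense(1, activation='sigmoid')])   # benign vs. attack
nids.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
nids.fit(X_few, y_few, epochs=20, batch_size=32)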
Convolutional Autoencoders
A Convolutional Autoencoder (CAE) is a type of autoencoder that uses convolutional layers in its architecture
- Convolutional autoencoders are specifically designed for tasks involving grid-like data
- Similar to traditional autoencoders, convolutional autoencoders consist of an encoder and a decoder
- The encoder uses convolutional layers to capture spatial patterns and features in the input data
- The decoder uses upsampling layers to reconstruct the input from the encoded representation
- Convolutional autoencoders find applications in image denoising and image compression
- Convolutional autoencoders can also be used for network anomaly detection, when the network flows are organised in grid-like structures of traffic features
Sample code
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Reshape, Conv2D, MaxPool2D, UpSampling2D

conv_encoder = Sequential(name='ConvEncoder', layers=[
    # add a channel dimension to the 2D grid of traffic features
    Reshape([X_train.shape[1], X_train.shape[2], 1],
            input_shape=[X_train.shape[1], X_train.shape[2]]),
    Conv2D(16, kernel_size=3, padding="same", activation="relu"),
    MaxPool2D(pool_size=(2, 2)),
    Conv2D(32, kernel_size=3, padding="same", activation="relu"),
    MaxPool2D(pool_size=(2, 2))])

conv_decoder = Sequential(name='ConvDecoder', layers=[
    # the hard-coded input shape [25, 5, 32] implies 100x20 input grids (after two 2x2 poolings)
    Conv2D(16, kernel_size=3, padding="same", activation="relu", input_shape=[25, 5, 32]),
    UpSampling2D((2, 2)),
    Conv2D(1, kernel_size=3, padding="same", activation="sigmoid"),
    UpSampling2D((2, 2)),
    Reshape([X_train.shape[1], X_train.shape[2]])])

conv_ae = Sequential([conv_encoder, conv_decoder])
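As a usage sketch (the epoch count, batch size and validation split are arbitrary assumptions, and X_train is assumed to be scaled to [0, 1] to match the sigmoid output), the chained model is trained like any other Keras model:

conv_ae.compile(optimizer='adam', loss='mse')
conv_ae.fit(X_train, X_train, epochs=10, batch_size=64, validation_split=0.1)
reconstructions = conv_ae.predict(X_train)   # same shape as X_train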
Network anomaly detection with autoencoders
Autoencoders are a type of artificial neural network commonly used for anomaly detection in several domains, including network security
Autoencoders leverage unsupervised learning to identify deviations from normal patterns in the network traffic data
Training phase:
- The autoencoder is trained on a dataset containing only benign network traffic
- The network learns to map the input data to itself, effectively learning a compressed representation that captures the essential features of benign flows
- The loss function measures the difference between the input and the reconstructed output
- MSE is commonly used as the loss function
Testing phase
- During the testing phase, the network is given new data, including normal and potentially anomalous network flows
- The reconstruction error is calculated by comparing the input data with its reconstructed output
- Higher reconstruction errors indicate that the input deviates from the learned patterns of benign flows
- Threshold Setting:
- A threshold is set on the reconstruction error
- Flows with reconstruction errors above this threshold are considered anomalies
- The threshold can be determined based on statistical methods or domain knowledge
- It’s a trade-off between false positives and false negatives
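A hedged sketch of the reconstruction-error threshold logic (it assumes an autoencoder already trained on benign flows, plus NumPy arrays X_benign and X_test; the 99th-percentile rule is just one of the statistical options mentioned above):

import numpy as np

# Per-flow reconstruction error (MSE) on held-out benign traffic
benign_recon = autoencoder.predict(X_benign)
benign_errors = np.mean(np.square(X_benign - benign_recon), axis=1)

# Threshold, e.g. the 99th percentile of benign errors (trade-off between FP and FN)
threshold = np.percentile(benign_errors, 99)

# Testing phase: flag flows whose reconstruction error exceeds the threshold
test_recon = autoencoder.predict(X_test)
test_errors = np.mean(np.square(X_test - test_recon), axis=1)
is_anomaly = test_errors > threshold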
Challenges and considerations
- Choosing Architecture: The architecture of the autoencoder, including the number of layers and nodes, must be carefully chosen based on the characteristics of the data
- Balancing False Positives and False Negatives: Setting an appropriate threshold is crucial to balance between detecting anomalies and avoiding false alarms
- Concept drift: In the presence of concept drift, the patterns of normal behaviour may change over time
- If the autoencoder is trained on historical data with outdated normal patterns, it may struggle to accurately reconstruct and detect anomalies in the new data
- Regularly retraining the autoencoder on updated data is one strategy