Activation functions
Activation functions in neural networks are fundamental components that introduce non-linearity, allowing the network to learn complex patterns and make predictions
- Non-Linearity: Activation functions introduce non-linearity into the network, enabling it to model and learn complex, nonlinear relationships in data
- Binary Classification: In binary classification tasks, activation functions like sigmoid are used in the output layer to squash the network’s output between 0 and 1, representing probabilities for class membership
- Multi-Class Classification: For problems with multiple classes, the softmax activation function is applied to the output layer. It converts raw scores into probabilities, helping identify the most likely class
- Regression: In regression tasks, activation functions are not used in the output layer because the network needs to predict continuous values. The output is typically the raw weighted sum of inputs
The step function
The step function is one of the simplest activation functions used in machine learning and binary classification tasks
It’s a threshold-based activation function that outputs 1 if the input is greater than or equal to zero and 0 otherwise. Mathematically, the step function can be defined as follows:
$$\text{step}(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$$
The step function was historically used in early neural networks and perceptrons for binary classification tasks
- However, it has limitations, especially in the context of gradient-based optimization techniques like backpropagation
- The step function is not differentiable at x = 0 because it has a sharp corner at that point, making it unsuitable for gradient-based learning algorithms
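A minimal sketch of the definition above (NumPy is an assumption here; the notes do not prescribe an implementation):

```python
import numpy as np

def step(x):
    """Step activation: outputs 1 if the input is >= 0, and 0 otherwise."""
    return np.where(x >= 0, 1.0, 0.0)

# Example inputs around the threshold
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(z))  # [0. 0. 1. 1. 1.]
```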
The Sigmoid function
The sigmoid function, also known as the logistic function, squashes the output to a range between 0 and 1
This function is often used to model the probability that a given input belongs to one of two classes, e.g.:
$$\text{prediction} = \begin{cases} \text{Normal} & \text{if } \sigma(z) < 0.5 \\ \text{Anomalous} & \text{if } \sigma(z) \ge 0.5 \end{cases}$$
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
- The sigmoid function is smooth and differentiable everywhere, making it suitable for gradient-based optimisation techniques such as gradient descent.
- Used in both logistic regression and neural networks, to model the probability that a given input sample belongs to a particular class (usually denoted as class 1).
- The sigmoid function was historically used in the hidden layers of neural networks. However, it has been replaced by other activation functions like ReLU (Rectified Linear Unit) and its variants due to the vanishing gradient problem.
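A minimal NumPy sketch of the sigmoid and the thresholding rule above (NumPy and the example scores are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Interpret sigma(z) as the probability of the positive ("Anomalous") class,
# mirroring the decision rule above
z = np.array([-3.0, -0.2, 0.0, 1.5, 4.0])
p = sigmoid(z)
labels = np.where(p >= 0.5, "Anomalous", "Normal")
for zi, pi, li in zip(z, p, labels):
    print(f"z = {zi:+.1f}   sigma(z) = {pi:.3f}   -> {li}")
```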
Step function vs Sigmoid for binary classification tasks
- Gradient-based optimisation
- Sigmoid function: The sigmoid function is smooth and differentiable everywhere, including at z = 0, where the output crosses the 0.5 decision threshold. This property is crucial for using gradient-based optimization techniques, like backpropagation, to train neural networks
- Step function: The step function’s sharp corner at 0 prevents the calculation of a gradient at that point, and its gradient is zero everywhere else, so it provides no useful learning signal
- Probability interpretation
- Sigmoid function: the output of the sigmoid function can be interpreted as the probability of belonging to a particular class
- Step function: The step function, while binary in nature, doesn’t provide a probabilistic interpretation. It only tells you whether the input is above or below the threshold (usually 0) without providing a confidence score.
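The difference can be made concrete with a small sketch comparing the gradient signal each function offers an optimiser (NumPy assumed; the step “gradient” shown is the conventional zero-everywhere derivative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigma(z) = sigma(z) * (1 - sigma(z)): smooth and non-zero everywhere
    s = sigmoid(z)
    return s * (1.0 - s)

def step_grad(z):
    # The step function's derivative is 0 for z != 0 and undefined at z = 0,
    # so it carries no useful learning signal
    return np.where(z == 0, np.nan, 0.0)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("sigmoid grad:", sigmoid_grad(z))
print("step grad:   ", step_grad(z))
```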
The vanishing gradient problem
The problem of vanishing gradients is a common issue in neural networks, particularly when using activation functions like the sigmoid.
- Gradients and Backpropagation: In neural networks, training involves adjusting the weights of the connections between neurons to minimize a loss function.
- This adjustment is made using optimization algorithms like gradient descent, which relies on gradients (derivatives) of the loss function with respect to the network’s parameters (weights). Gradients indicate the direction and magnitude of the steepest ascent of the loss function.
When using activation functions like the sigmoid, the derivative of the function becomes very small as the absolute value of the input becomes large.
$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$$
When x is very positive or very negative, σ(x) approaches 1 or 0, respectively. Consequently, σ′(x) approaches 0.
- This vanishing gradient means that during backpropagation, the gradients of the loss with respect to the weights in the early layers of the network become extremely small
- As a result, the weights in those layers receive negligible updates, effectively hindering their learning (remember that the parameters are updated as follows):
$$\theta_j \leftarrow \theta_j - \eta \frac{\partial J(\theta)}{\partial \theta_j}$$
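An illustrative toy calculation of the effect (not a full backpropagation derivation; NumPy and the chosen depth are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# sigma'(z) peaks at 0.25 (at z = 0) and decays rapidly as |z| grows
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigma'(z) = {sigmoid_grad(z):.6f}")

# Toy illustration: the gradient reaching an early layer is (roughly) a product of
# per-layer factors; with sigmoid each factor is at most 0.25, so the product vanishes
depth = 20
best_case_factor = sigmoid_grad(0.0)      # 0.25
print("gradient scale after", depth, "layers:", best_case_factor ** depth)
```

With a gradient scale this small, the update θj ← θj − η ∂J/∂θj barely changes the early-layer weights.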
ReLU (Rectified Linear Unit)
ReLU is a simple yet effective non-linear function that introduces non-linearity to the network, allowing it to learn complex patterns in the data
$$\text{ReLU}(x) = \max(0, x)$$
- If the input x is positive, ReLU returns the input value directly; if the input is negative, it returns 0. The function essentially “rectifies” negative values to zero while leaving positive values unchanged
Key Characteristics
- Simplicity: ReLU is computationally efficient and easy to implement, making it a popular choice in neural network architectures.
- Non-Linearity: Although ReLU is a linear function for positive inputs, its non-linearity (by setting negative inputs to zero) enables neural networks to approximate complex, nonlinear relationships in the data.
- Avoids Vanishing Gradients: ReLU helps mitigate the vanishing gradients problem encountered with activation functions like sigmoid. ReLU’s derivative is 1 for positive inputs, which prevents gradients from becoming too small during backpropagation, allowing for more efficient learning, especially in deep networks
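A minimal sketch of ReLU and the derivative used in practice (NumPy assumed; assigning 0 at the kink x = 0 is the usual convention, not something the notes specify):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: passes positive inputs through, zeroes out negatives."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative used in practice: 1 for x > 0, else 0 (the kink at x = 0 is given 0)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("ReLU(x): ", relu(x))       # [0.  0.  0.  0.5 3. ]
print("ReLU'(x):", relu_grad(x))  # [0. 0. 0. 1. 1.] -- the gradient does not shrink for positive inputs
```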
Softmax
Softmax takes as input a vector of real numbers and transforms it into a vector of probabilities using the following formula:
$$\text{Softmax}(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad k \in [1, K]$$
- $K$ is the number of classes
- $z_k$ represents the raw score for class $k$
- Softmax exponentiates the input scores $[z_1, \dots, z_K]$, turning them into positive values, and then normalises them by dividing each exponentiated score by the sum of all exponentiated scores
- The normalisation ensures that the resulting values $[\hat{p}_1, \dots, \hat{p}_K]$ lie between 0 and 1 and that they add up to 1, forming a valid probability distribution
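A minimal NumPy sketch of the formula above (the max-subtraction step is a standard numerical-stability trick added here as an implementation detail, not something the notes require):

```python
import numpy as np

def softmax(z):
    """Turn raw scores z_1..z_K into probabilities that are positive and sum to 1."""
    # Subtracting max(z) before exponentiating avoids overflow and does not change the result
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # raw scores for K = 3 classes
probs = softmax(scores)
print(probs)              # approx. [0.659 0.242 0.099]
print(probs.sum())        # 1.0
print(np.argmax(probs))   # 0 -- index of the most likely class
```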
Key Characteristics of Softmax
- Probabilistic Interpretation: The softmax function converts raw scores into probabilities, allowing the model to express confidence scores for each class
- The class with the highest probability is considered the model’s prediction
- Multi-Class Classification: Softmax is commonly used in multi-class classification problems where an input can belong to one of several classes
- Differentiability: Softmax is differentiable everywhere, making it suitable for gradient-based optimization techniques like backpropagation
The Open-world challenge
Softmax-based intrusion detection systems (IDS) often operate under the assumption that they know all possible attack types (closed-world assumption).
Real-world environments are open-world, where new and evolving threats constantly emerge!