Optimisation Algorithms
Gradient Descent
Definition: Gradient Descent is an optimisation algorithm used in ML and DL to minimise a function by iteratively moving in the direction of the steepest decrease in the function’s value.
- It’s commonly employed to optimise the parameters of a model to minimise a cost (or loss) function, which measures the difference between the predicted values and the actual values in a dataset
- Basic idea: Imagine you are standing on a mountain, and your goal is to find the lowest point. You can’t see the entire landscape, but you can feel the slope beneath your feet
- Gradient Descent works in a similar way. It helps you find the minimum of a function by iteratively moving in the direction of the steepest decrease in the function’s value.
Key concepts
Gradient: In the context of Gradient Descent, the gradient represents the slope of the function at a specific point. For a function with multiple variables, the gradient is a vector that points in the direction of the steepest increase in the function’s value.
Learning rate: The learning rate (also known as step size) determines how big of a step you take during each iteration.
Learning rate
Choosing an appropriate learning rate is crucial. Too small, and the algorithm may take a long time to converge; too large, and you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
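To see both failure modes concretely, here is a tiny sketch using the made-up quadratic $J(\theta) = \theta^2$, whose gradient is $2\theta$; the function, learning rates, and step count are illustrative choices, not from the text.

```python
# Gradient descent on J(theta) = theta**2, whose gradient is 2*theta.
def gd(theta, eta, steps=10):
    for _ in range(steps):
        theta = theta - eta * 2 * theta  # theta := theta - eta * dJ/dtheta
    return theta

print(gd(1.0, eta=0.1))   # small step: converges towards the minimum at 0
print(gd(1.0, eta=1.1))   # too large: |theta| grows every step, so the algorithm diverges
```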
Training Linear Regression with Gradient Descent: the cost function
Gradient descent is an algorithm that iteratively tweaks the parameters in order to minimise the cost function
Steps of Gradient Descent
- Initialise Parameters: Start with initial random values for the parameters you want to optimise (random initialisation of $\theta$).
- Compute the Gradient: Calculate the gradient of the loss function with respect to the parameters. This gradient indicates the direction of the steepest increase in the loss.
- Update Parameters: Adjust the parameters in the opposite direction of the gradient to minimise the loss function:
- New parameter = Old parameter − (Learning Rate × Gradient)
- Iterate: Repeat steps 2 and 3 until a stopping criterion is met, such as a maximum number of iterations or the change in the loss becoming negligible.
$$\theta_j = \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta)$$
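These four steps map directly onto code. Below is a minimal sketch; `grad` is a placeholder for a user-supplied gradient function, and the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def gradient_descent(grad, n_params, eta=0.1, max_iters=1000, tol=1e-8):
    theta = np.random.randn(n_params)      # 1. random initialisation of theta
    for _ in range(max_iters):             # 4. iterate until a stopping criterion is met
        g = grad(theta)                    # 2. compute the gradient
        theta = theta - eta * g            # 3. theta_j := theta_j - eta * dJ/dtheta_j
        if np.linalg.norm(eta * g) < tol:  #    stop once the update becomes negligible
            break
    return theta

# Example: minimise J(theta) = sum(theta**2), whose gradient is 2*theta
theta_min = gradient_descent(lambda t: 2 * t, n_params=2)
```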
Training Linear Regression with Gradient Descent
Batch Gradient Descent
Partial derivatives tell you how much the cost function will change if you change $\theta_j$ just a little bit:
$$MSE(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta^\top \mathbf{x}^{(i)} - y^{(i)})^2 \qquad \frac{\partial}{\partial \theta_j} MSE(\theta) = \frac{2}{m} \sum_{i=1}^{m} (\theta^\top \mathbf{x}^{(i)} - y^{(i)}) x_j^{(i)}$$
where $m$ is the number of samples in the training set.
Batch Gradient Descent computes the gradient using the entire dataset
- Notice that Batch Gradient Descent involves calculations over the full training set X, at each Gradient Descent step!
- This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step
- As a result, it is terribly slow on very large training sets
- It can be slow for large datasets but is guaranteed to converge to the global minimum for convex functions (such as the MSE of Linear Regression)
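To make this concrete, here is a minimal NumPy sketch of Batch Gradient Descent for Linear Regression using the MSE gradient above; the toy data, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

# Toy data: 100 samples with y = 4 + 3x + noise; prepend a bias column of ones
m = 100
X = np.c_[np.ones((m, 1)), np.random.rand(m, 1)]
y = 4 + 3 * X[:, 1:] + 0.1 * np.random.randn(m, 1)

eta = 0.1
theta = np.random.randn(2, 1)  # random initialisation

for _ in range(1000):
    # gradient of the MSE over the FULL training set (hence "batch")
    gradients = (2 / m) * X.T @ (X @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4], [3]]
```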
Mini-Batch Gradient Descent
- When dealing with large datasets, computing the gradient of the entire dataset can be computationally expensive
- Mini-batch gradient descent (MBGD) addresses this issue by dividing the dataset into small batches
- Instead of updating the parameters based on the gradient of the entire dataset (batch gradient descent), MBGD updates the parameters based on the average gradient of a mini-batch of data points
Steps of MBGD
- Divide the Dataset: Divide the entire dataset into small batches of size $m'$. Typical batch sizes can range from a few samples to hundreds or even thousands of samples, depending on the size of the dataset and available computational resources.
- Iterative Optimisation: For each iteration (or epoch*) of the algorithm, randomly shuffle the mini-batches and iterate through them. For each mini-batch, compute the gradient of the cost function with respect to the parameters using the data points in the mini-batch.
- Parameter Update: Update the model parameters based on the average gradient of the current mini-batch. The update rule is similar to batch gradient descent but uses the average gradient of the mini-batch.
*An epoch refers to one complete pass through the entire training dataset by the learning algorithm
Convergence of MBGD
$$\frac{\partial}{\partial \theta_j} MSE(\theta) = \frac{2}{m'} \sum_{i=1}^{m'} (\theta^\top \mathbf{x}^{(i)} - y^{(i)}) x_j^{(i)}$$
where $m'$ is the batch size, i.e., the number of samples in a batch.
MBGD often converges faster than batch gradient descent because it updates the parameters more frequently. For each epoch, MBGD performs $m/m'$ gradient descent steps, where $m$ is the number of samples in the training set.
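A minimal sketch of MBGD on the same kind of toy linear-regression problem; here `m_batch` plays the role of $m'$, and all sizes are illustrative.

```python
import numpy as np

m, m_batch, eta = 100, 16, 0.1
X = np.c_[np.ones((m, 1)), np.random.rand(m, 1)]
y = 4 + 3 * X[:, 1:] + 0.1 * np.random.randn(m, 1)
theta = np.random.randn(2, 1)

for epoch in range(50):                # one epoch = one full pass through the data
    idx = np.random.permutation(m)     # shuffle before forming mini-batches
    for start in range(0, m, m_batch):
        batch = idx[start:start + m_batch]
        Xb, yb = X[batch], y[batch]
        # gradient averaged over the mini-batch only (m' samples, not m)
        gradients = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)
        theta = theta - eta * gradients
```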
Gradient Descent with Momentum
Gradient Descent with Momentum is an optimisation algorithm commonly used for training machine learning models, especially deep neural networks
It helps to accelerate the convergence towards the minimum of the loss function:
- speeding up the movement through flat regions (plateaus)
- escaping from shallow local minima by amplifying small but consistent gradients
Training with Gradient Descent with Momentum
Momentum in gradient descent helps accelerate convergence by maintaining a velocity term $v_j$, which incorporates past gradients:
- $v_j$ is the velocity for the $j$-th parameter (initialised to $0$)
- $\beta \in [0,1)$ is the momentum coefficient (usually set to 0.9)
- $\eta$ is the learning rate
- $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function $J$ with respect to the $j$-th parameter $\theta_j$
$$v_j = \beta v_j + \eta \frac{\partial}{\partial \theta_j} J(\theta) \qquad\qquad \theta_j = \theta_j - v_j$$
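As a sketch of how these two equations slot into a training loop (`grad` is a placeholder gradient function; the hyperparameters are illustrative):

```python
import numpy as np

def momentum_gd(grad, n_params, eta=0.01, beta=0.9, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)                # velocity, initialised to 0
    for _ in range(max_iters):
        v = beta * v + eta * grad(theta)  # v_j = beta * v_j + eta * dJ/dtheta_j
        theta = theta - v                 # theta_j = theta_j - v_j
    return theta
```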
Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) improves Gradient Descent with Momentum by calculating the gradient at the lookahead position $\theta + \beta v$, rather than at the current position $\theta$
$$v_j = \beta v_j + \eta \frac{\partial}{\partial \theta_j} J(\theta + \beta v)$$
$$\theta_j = \theta_j - v_j$$
Explanation of NAG
Lookahead Gradient: Instead of using the gradient at the current position $\theta_j$, NAG anticipates where the parameters will be by evaluating the gradient at the lookahead position $\theta + \beta v$, allowing for more informed updates
Faster Convergence: NAG accelerates convergence by moving more aggressively towards the minimum, as the lookahead gradient takes into account the future direction of movement
- $\nabla_2$ denotes the gradient evaluated at the lookahead point $\theta + \beta v$
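In code, the only change from plain momentum is where the gradient is evaluated. A minimal sketch following the update rule above (`grad` is again a placeholder):

```python
import numpy as np

def nag(grad, n_params, eta=0.01, beta=0.9, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)
    for _ in range(max_iters):
        g = grad(theta + beta * v)  # gradient at the lookahead position theta + beta*v
        v = beta * v + eta * g
        theta = theta - v
    return theta
```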
Gradient Descent with Keras
Gradient Descent
from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.001)
Gradient Descent with momentum
optimizer = SGD(learning_rate=0.001, momentum=0.9)
Gradient Descent with momentum and NAG
optimizer = SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
RMSProp (Root Mean Square Propagation)
- RMSprop is an adaptive learning rate method that adjusts the learning rate for each parameter during training
- This adaptive nature helps RMSprop to handle different scales of gradients and avoid getting stuck in local minima
- Adjusting Learning Rate: RMSProp accumulates the squared gradients to adjust the learning rate for each parameter
- The learning rate is scaled inversely proportional to the square root of the accumulated gradients
- This helps to reduce the learning rate for parameters with large gradients, preventing oscillations and speeding up convergence
RMSProp formulation
For each parameter $\theta_j$, the exponential moving average of the squared gradients is updated as:
$$s_j = \beta s_j + (1 - \beta) \left( \frac{\partial}{\partial \theta_j} J(\theta) \right)^2$$
The update rule for the $j$-th parameter $\theta_j$ is given by:

$$\theta_j = \theta_j - \frac{\eta}{\sqrt{s_j + \epsilon}} \frac{\partial}{\partial \theta_j} J(\theta)$$

where $\epsilon$ is a small constant that prevents division by zero.
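A minimal per-parameter sketch of these two equations (`grad` is a placeholder gradient function; the hyperparameters are illustrative):

```python
import numpy as np

def rmsprop(grad, n_params, eta=0.001, beta=0.9, eps=1e-8, max_iters=1000):
    theta = np.random.randn(n_params)
    s = np.zeros(n_params)                # moving average of squared gradients
    for _ in range(max_iters):
        g = grad(theta)
        s = beta * s + (1 - beta) * g**2  # s_j = beta*s_j + (1-beta)*(dJ/dtheta_j)^2
        theta = theta - eta * g / np.sqrt(s + eps)
    return theta
```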
Adam (Adaptive Moment Estimation)
- Adam combines the ideas of momentum optimisation and RMSProp
- The algorithm calculates an exponential moving average of the gradient and the squared gradient
Two parameters $\beta_1$ and $\beta_2$ control the decay rates of these moving averages
- usually set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$
Adam formulation
For each parameter $\theta_j$, the exponential moving averages of the gradient and of the squared gradient are updated as:
$$v_j = \beta_1 v_j + (1 - \beta_1) \frac{\partial}{\partial \theta_j} J(\theta) \qquad s_j = \beta_2 s_j + (1 - \beta_2) \left(\frac{\partial}{\partial \theta_j} J(\theta)\right)^2$$
The update rule for the $j$-th parameter $\theta_j$ is given by:
$$\theta_j = \theta_j - \frac{\eta \, \hat{v}_j}{\sqrt{\hat{s}_j + \epsilon}} \quad \text{where } \hat{v}_j = \frac{v_j}{1 - \beta_1^t} \text{ and } \hat{s}_j = \frac{s_j}{1 - \beta_2^t}$$
Both the first moment $v_j$ and the second moment $s_j$ are initialised to zero
- As a result, in the early iterations, they will be biased towards zero → this can slow down convergence
- $\hat{v}_j$ and $\hat{s}_j$ are used at the beginning of training (initial steps $t$) to reduce this initial bias
- $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ increase as the number of iterations $t$ grows, gradually reducing the influence of the initial bias
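A sketch combining the moment updates with the bias correction (`grad` is a placeholder; `t` starts at 1 so that the corrections $1 - \beta^t$ are well defined):

```python
import numpy as np

def adam(grad, n_params, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)  # first moment: moving average of gradients
    s = np.zeros(n_params)  # second moment: moving average of squared gradients
    for t in range(1, max_iters + 1):
        g = grad(theta)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g**2
        v_hat = v / (1 - beta1**t)  # bias correction, strongest in early iterations
        s_hat = s / (1 - beta2**t)
        theta = theta - eta * v_hat / np.sqrt(s_hat + eps)
    return theta
```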
RMSProp and Adam with Keras
RMSProp
from tensorflow.keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.001, rho=0.9)

Note that the `rho` argument corresponds to the decay rate $\beta$.
Adam
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)