Optimisation Algorithms
Gradient Descent
Definition: Gradient Descent is an optimisation algorithm used in ML and DL to minimise a function by iteratively moving in the direction of the steepest decrease in the function’s value.
- It’s commonly employed to optimise the parameters of a model to minimise a cost (or loss) function, which measures the difference between the predicted values and the actual values in a dataset
- Basic idea: Imagine you are standing on a mountain, and your goal is to find the lowest point. You can’t see the entire landscape, but you can feel the slope beneath your feet
- Gradient Descent works in a similar way. It helps you find the minimum of a function by iteratively moving in the direction of the steepest decrease in the function’s value.
Key concepts
Gradient: In the context of Gradient Descent, the gradient represents the slope of the function at a specific point. For a function with multiple variables, the gradient is a vector that points in the direction of the steepest increase in the function’s value.
Learning rate: The learning rate (also known as step size) determines how big of a step you take during each iteration.
Learning rate
Choosing an appropriate learning rate is crucial. Too small, and the algorithm may take a long time to converge; too large, and you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.
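To see both failure modes concretely, here is a tiny sketch using the made-up quadratic $J(\theta) = \theta^2$, whose gradient is $2\theta$; the function, learning rates, and step count are illustrative choices, not from the text.

```python
# Gradient descent on J(theta) = theta**2, whose gradient is 2*theta.
def gd(theta, eta, steps=10):
    for _ in range(steps):
        theta = theta - eta * 2 * theta  # theta := theta - eta * dJ/dtheta
    return theta

print(gd(1.0, eta=0.1))   # small step: converges towards the minimum at 0
print(gd(1.0, eta=1.1))   # too large: |theta| grows every step, so the algorithm diverges
```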
Training Linear Regression with Gradient Descent: the cost function
Gradient descent is an algorithm that iteratively tweaks the parameters in order to minimise the cost function
Steps of Gradient Descent
- Initialise Parameters: Start with initial random values for the parameters you want to optimise (random initialisation of $\theta$).
- Compute the Gradient: Calculate the gradient of the loss function with respect to the parameters. This gradient indicates the direction of the steepest increase in the loss.
- Update Parameters: Adjust the parameters in the opposite direction of the gradient to minimise the loss function:
- New parameter = Old parameter − (Learning Rate × Gradient)
- Iterate: Repeat steps 2 and 3 until a stopping criterion is met, such as a maximum number of iterations or the change in the loss becoming negligible.
$$\theta_j = \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta)$$
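These four steps map directly onto code. Below is a minimal sketch; `grad` is a placeholder for a user-supplied gradient function, and the tolerance and iteration cap are illustrative choices.

```python
import numpy as np

def gradient_descent(grad, n_params, eta=0.1, max_iters=1000, tol=1e-8):
    theta = np.random.randn(n_params)      # 1. random initialisation of theta
    for _ in range(max_iters):             # 4. iterate until a stopping criterion is met
        g = grad(theta)                    # 2. compute the gradient
        theta = theta - eta * g            # 3. theta_j := theta_j - eta * dJ/dtheta_j
        if np.linalg.norm(eta * g) < tol:  #    stop once the update becomes negligible
            break
    return theta

# Example: minimise J(theta) = sum(theta**2), whose gradient is 2*theta
theta_min = gradient_descent(lambda t: 2 * t, n_params=2)
```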
Training Linear Regression with Gradient Descent
Batch Gradient Descent
Partial derivatives tell you how much the cost function will change if you change $\theta_j$ just a little bit:
$$MSE(\theta) = \frac{1}{m} \sum_{i=1}^{m} (\theta^\top \mathbf{x}^{(i)} - y^{(i)})^2 \qquad \frac{\partial}{\partial \theta_j} MSE(\theta) = \frac{2}{m} \sum_{i=1}^{m} (\theta^\top \mathbf{x}^{(i)} - y^{(i)}) x_j^{(i)}$$
where $m$ is the number of samples in the training set.
Batch Gradient Descent computes the gradient using the entire dataset
- Notice that Batch Gradient Descent involves calculations over the full training set X, at each Gradient Descent step!
- This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step
- As a result, it is terribly slow on very large training sets
- It can be slow for large datasets but is guaranteed to converge to the global minimum for convex functions (such as the MSE of Linear Regression)
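To make this concrete, here is a minimal NumPy sketch of Batch Gradient Descent for Linear Regression using the MSE gradient above; the toy data, learning rate, and iteration count are made up for illustration.

```python
import numpy as np

# Toy data: 100 samples with y = 4 + 3x + noise; prepend a bias column of ones
m = 100
X = np.c_[np.ones((m, 1)), np.random.rand(m, 1)]
y = 4 + 3 * X[:, 1:] + 0.1 * np.random.randn(m, 1)

eta = 0.1
theta = np.random.randn(2, 1)  # random initialisation

for _ in range(1000):
    # gradient of the MSE over the FULL training set (hence "batch")
    gradients = (2 / m) * X.T @ (X @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4], [3]]
```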
Mini-Batch Gradient Descent
- When dealing with large datasets, computing the gradient of the entire dataset can be computationally expensive
- Mini-batch gradient descent (MBGD) addresses this issue by dividing the dataset into small batches
- Instead of updating the parameters based on the gradient of the entire dataset (batch gradient descent), MBGD updates the parameters based on the average gradient of a mini-batch of data points
Steps of MBGD
- Divide the Dataset: Divide the entire dataset into small batches of size $m'$. Typical batch sizes can range from a few samples to hundreds or even thousands of samples, depending on the size of the dataset and available computational resources.
- Iterative Optimisation: For each iteration (or epoch*) of the algorithm, randomly shuffle the mini-batches and iterate through them. For each mini-batch, compute the gradient of the cost function with respect to the parameters using the data points in the mini-batch.
- Parameter Update: Update the model parameters based on the average gradient of the current mini-batch. The update rule is similar to batch gradient descent but uses the average gradient of the mini-batch.
*An epoch refers to one complete pass through the entire training dataset by the learning algorithm
Convergence of MBGD
$$\frac{\partial}{\partial \theta_j} MSE(\theta) = \frac{2}{m'} \sum_{i=1}^{m'} (\theta^\top \mathbf{x}^{(i)} - y^{(i)}) x_j^{(i)}$$
where $m'$ is the batch size, i.e., the number of samples in a batch.
MBGD often converges faster than batch gradient descent because it updates the parameters more frequently. For each epoch, MBGD performs $m/m'$ gradient descent steps, where $m$ is the number of samples in the training set.
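A minimal sketch of MBGD on the same kind of toy linear-regression problem; here `m_batch` plays the role of $m'$, and all sizes are illustrative.

```python
import numpy as np

m, m_batch, eta = 100, 16, 0.1
X = np.c_[np.ones((m, 1)), np.random.rand(m, 1)]
y = 4 + 3 * X[:, 1:] + 0.1 * np.random.randn(m, 1)
theta = np.random.randn(2, 1)

for epoch in range(50):                # one epoch = one full pass through the data
    idx = np.random.permutation(m)     # shuffle before forming mini-batches
    for start in range(0, m, m_batch):
        batch = idx[start:start + m_batch]
        Xb, yb = X[batch], y[batch]
        # gradient averaged over the mini-batch only (m' samples, not m)
        gradients = (2 / len(batch)) * Xb.T @ (Xb @ theta - yb)
        theta = theta - eta * gradients
```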
Gradient Descent with Momentum
Gradient Descent with Momentum is an optimisation algorithm commonly used for training machine learning models, especially deep neural networks
It helps to accelerate the convergence towards the minimum of the loss function:
- speeding up the movement through flat regions (plateaus)
- escaping from shallow local minima by amplifying small but consistent gradients
Training with Gradient Descent with Momentum
Momentum in gradient descent helps accelerate convergence by maintaining a velocity term $v_j$, which incorporates past gradients:
- $v_j$ is the velocity for the $j$-th parameter (initialised to $0$)
- $\beta \in [0,1)$ is the momentum coefficient (usually set to 0.9)
- $\eta$ is the learning rate
- $\frac{\partial}{\partial \theta_j} J(\theta)$ is the partial derivative of the cost function $J$ with respect to the $j$-th parameter $\theta_j$
$$v_j = \beta v_j + \eta \frac{\partial}{\partial \theta_j} J(\theta) \qquad\qquad \theta_j = \theta_j - v_j$$
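As a sketch of how these two equations slot into a training loop (`grad` is a placeholder gradient function; the hyperparameters are illustrative):

```python
import numpy as np

def momentum_gd(grad, n_params, eta=0.01, beta=0.9, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)                # velocity, initialised to 0
    for _ in range(max_iters):
        v = beta * v + eta * grad(theta)  # v_j = beta * v_j + eta * dJ/dtheta_j
        theta = theta - v                 # theta_j = theta_j - v_j
    return theta
```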
Nesterov Accelerated Gradient
Nesterov Accelerated Gradient (NAG) improves Gradient Descent with Momentum by calculating the gradient at the lookahead position $\theta + \beta v$, rather than at the current position $\theta$
$$v_j = \beta v_j + \eta \frac{\partial}{\partial \theta_j} J(\theta + \beta v)$$
$$\theta_j = \theta_j - v_j$$
Explanation of NAG
Lookahead Gradient: Instead of using the gradient at the current position $\theta_j$, NAG anticipates where the parameters will be by evaluating the gradient at the lookahead position $\theta + \beta v$, allowing for more informed updates
Faster Convergence: NAG accelerates convergence by moving more aggressively towards the minimum, as the lookahead gradient takes into account the future direction of movement
- $\nabla_2$ denotes the gradient evaluated at the lookahead point $\theta + \beta v$
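In code, the only change from plain momentum is where the gradient is evaluated. A minimal sketch following the update rule above (`grad` is again a placeholder):

```python
import numpy as np

def nag(grad, n_params, eta=0.01, beta=0.9, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)
    for _ in range(max_iters):
        g = grad(theta + beta * v)  # gradient at the lookahead position theta + beta*v
        v = beta * v + eta * g
        theta = theta - v
    return theta
```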
Gradient Descent with Keras
Gradient Descent
from tensorflow.keras.optimizers import SGD

optimizer = SGD(learning_rate=0.001)
Gradient Descent with momentum
optimizer = SGD(learning_rate=0.001, momentum=0.9)
Gradient Descent with momentum and NAG
optimizer = SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
RMSProp (Root Mean Square Propagation)
- RMSprop is an adaptive learning rate method that adjusts the learning rate for each parameter during training
- This adaptive nature helps RMSprop to handle different scales of gradients and avoid getting stuck in local minima
- Adjusting Learning Rate: RMSProp accumulates the squared gradients to adjust the learning rate for each parameter
- The learning rate is scaled inversely proportional to the square root of the accumulated gradients
- This helps to reduce the learning rate for parameters with large gradients, preventing oscillations and speeding up convergence
RMSProp formulation
For each parameter $\theta_j$, the exponential moving average of the squared gradients is updated as:
$$s_j = \beta s_j + (1 - \beta) \left( \frac{\partial}{\partial \theta_j} J(\theta) \right)^2$$
The update rule for the $j$-th parameter $\theta_j$ is given by:

$$\theta_j = \theta_j - \frac{\eta}{\sqrt{s_j + \epsilon}} \frac{\partial}{\partial \theta_j} J(\theta)$$

where $\epsilon$ is a small constant that prevents division by zero.
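A minimal per-parameter sketch of these two equations (`grad` is a placeholder gradient function; the hyperparameters are illustrative):

```python
import numpy as np

def rmsprop(grad, n_params, eta=0.001, beta=0.9, eps=1e-8, max_iters=1000):
    theta = np.random.randn(n_params)
    s = np.zeros(n_params)                # moving average of squared gradients
    for _ in range(max_iters):
        g = grad(theta)
        s = beta * s + (1 - beta) * g**2  # s_j = beta*s_j + (1-beta)*(dJ/dtheta_j)^2
        theta = theta - eta * g / np.sqrt(s + eps)
    return theta
```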
Adam (Adaptive Moment Estimation)
- Adam combines the ideas of momentum optimisation and RMSProp
- The algorithm calculates an exponential moving average of the gradient and the squared gradient
Two parameters $\beta_1$ and $\beta_2$ control the decay rates of these moving averages
- usually set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$
Adam formulation
For each parameter $\theta_j$, the exponential moving averages of the gradient and of the squared gradient are updated as:
$$v_j = \beta_1 v_j + (1 - \beta_1) \frac{\partial}{\partial \theta_j} J(\theta) \qquad s_j = \beta_2 s_j + (1 - \beta_2) \left(\frac{\partial}{\partial \theta_j} J(\theta)\right)^2$$
The update rule for the $j$-th parameter $\theta_j$ is given by:
$$\theta_j = \theta_j - \frac{\eta \, \hat{v}_j}{\sqrt{\hat{s}_j + \epsilon}} \quad \text{where } \hat{v}_j = \frac{v_j}{1 - \beta_1^t} \text{ and } \hat{s}_j = \frac{s_j}{1 - \beta_2^t}$$
Both the first moment $v_j$ and the second moment $s_j$ are initialised to zero
- As a result, in the early iterations, they will be biased towards zero → this can slow down convergence
- $\hat{v}_j$ and $\hat{s}_j$ are used at the beginning of training (initial steps $t$) to reduce this initial bias
- $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ increase as the number of iterations $t$ grows, gradually reducing the influence of the initial bias
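A sketch combining the moment updates with the bias correction (`grad` is a placeholder; `t` starts at 1 so that the corrections $1 - \beta^t$ are well defined):

```python
import numpy as np

def adam(grad, n_params, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8, max_iters=1000):
    theta = np.random.randn(n_params)
    v = np.zeros(n_params)  # first moment: moving average of gradients
    s = np.zeros(n_params)  # second moment: moving average of squared gradients
    for t in range(1, max_iters + 1):
        g = grad(theta)
        v = beta1 * v + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g**2
        v_hat = v / (1 - beta1**t)  # bias correction, strongest in early iterations
        s_hat = s / (1 - beta2**t)
        theta = theta - eta * v_hat / np.sqrt(s_hat + eps)
    return theta
```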
RMSProp and Adam with Keras
RMSProp
from tensorflow.keras.optimizers import RMSprop

optimizer = RMSprop(learning_rate=0.001, rho=0.9)

Note that the `rho` argument corresponds to the decay rate $\beta$.
Adam
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)