Partial Derivatives
Partial derivatives measure how a function changes as one of its variables changes, while keeping the others constant.
For a function $f(x, y)$, which depends on two variables $x$ and $y$, the partial derivative with respect to $x$, denoted $\frac{\partial f}{\partial x}$, shows how $f$ changes as $x$ changes while $y$ remains constant.
When finding the partial derivative with respect to $x$, we treat $y$ as a constant and differentiate $f(x, y)$ with respect to $x$ only.
Partial derivatives of a cost function
Let’s take a sample cost function $J(w_0, w_1) = w_0^2 + w_1^2$ and compute its partial derivatives to explain the concept:
$$\frac{\partial J}{\partial w_0} = \frac{\partial}{\partial w_0} (w_0^2 + w_1^2)$$
Since $w_1^2$ is a constant (with respect to $w_0$), its derivative is 0, and we just differentiate $w_0^2$:
$$\frac{\partial J}{\partial w_0} = 2w_0$$
This means that as $w_0$ changes (with $w_1$ held constant), the rate of change of $J(w_0, w_1)$ is $2w_0$, i.e. proportional to $w_0$.
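As a quick sanity check, both partial derivatives can be verified symbolically; the sketch below uses SymPy, which is not part of the original notes:

```python
import sympy as sp

# Symbolic parameters of the sample cost function
w0, w1 = sp.symbols('w0 w1')

# J(w0, w1) = w0^2 + w1^2
J = w0**2 + w1**2

# When differentiating with respect to w0, w1 is treated as a constant
print(sp.diff(J, w0))  # 2*w0
print(sp.diff(J, w1))  # 2*w1
```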
3D plot of the function
The following figure shows the surface plot of $J(w_0, w_1) = w_0^2 + w_1^2$, together with two cross-section curves:
- $J(w_0, 2) = w_0^2 + 4$ (red curve)
- $J(0, w_1) = w_1^2$ (blue curve)
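A figure along these lines can be reproduced with a short Matplotlib sketch; the grid range and styling below are my own assumptions, not taken from the original plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over (w0, w1) for the surface J(w0, w1) = w0^2 + w1^2
w0 = np.linspace(-3, 3, 100)
w1 = np.linspace(-3, 3, 100)
W0, W1 = np.meshgrid(w0, w1)
J = W0**2 + W1**2

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(W0, W1, J, alpha=0.5, cmap='viridis')

# Cross-section curves: J(w0, 2) = w0^2 + 4 and J(0, w1) = w1^2
ax.plot(w0, np.full_like(w0, 2.0), w0**2 + 4, color='red')
ax.plot(np.zeros_like(w1), w1, w1**2, color='blue')

ax.set_xlabel('$w_0$'); ax.set_ylabel('$w_1$'); ax.set_zlabel('$J(w_0, w_1)$')
plt.show()
```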
Gradient vector
The gradient vector at the point $(w_0, w_1) = (0, 2)$ is obtained from the gradient of the function $J(w_0, w_1) = w_0^2 + w_1^2$, which is given by:
$$\nabla J(w_0, w_1) = \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}\right) = (2w_0, 2w_1)$$
At the point $(w_0, w_1) = (0, 2)$, the gradient vector is $\nabla J(0, 2) = (0, 4)$.
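A minimal sketch of this evaluation (the helper name grad_J is illustrative, not from the notes):

```python
def grad_J(w0, w1):
    # Gradient of J(w0, w1) = w0^2 + w1^2
    return (2 * w0, 2 * w1)

# Gradient vector at the point (w0, w1) = (0, 2)
print(grad_J(0, 2))  # (0, 4)
```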
Update of parameters
With gradient descent, the parameters of the model are updated with the following formula:
$$w_j^t = w_j^{t-1} - \eta \frac{\partial}{\partial w_j} J(\mathbf{w})$$
where $\eta$ is the learning rate.
In this example, at step $t = 1$, with learning rate $\eta = 0.2$ and the gradient computed at the point $(w_0^0, w_1^0) = (0, 2)$:
$$w_0^1 = w_0^0 - 0.2 \cdot 0 = 0$$
$$w_1^1 = w_1^0 - 0.2 \cdot 4 = 1.2$$
In the next step, the gradient will be computed at the point $(w_0^1, w_1^1) = (0, 1.2)$.
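The same update can be written as a short gradient-descent loop, assuming the values of the example ($\eta = 0.2$, starting point $(0, 2)$); this is a sketch, not code from the notes:

```python
def grad_J(w0, w1):
    # Gradient of J(w0, w1) = w0^2 + w1^2
    return (2 * w0, 2 * w1)

eta = 0.2             # learning rate
w0, w1 = 0.0, 2.0     # starting point (w0^0, w1^0) = (0, 2)

for t in range(1, 4):
    g0, g1 = grad_J(w0, w1)                  # gradient at the previous point
    w0, w1 = w0 - eta * g0, w1 - eta * g1    # parameter update
    print(f"t={t}: (w0, w1) = ({w0}, {w1})")
# t=1: (w0, w1) = (0.0, 1.2)
# later steps keep shrinking w1 toward the minimum at (0, 0)
```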
The chain rule
The chain rule is a fundamental theorem in calculus that helps compute the derivative of a composite function.
It relates the derivative of the outer function to the derivatives of the inner functions.
Single-Variable Chain Rule: $y = f(g(x))$
$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Multi-Variable Chain Rule: $y = f(g(x_1, x_2, \dots, x_n))$
$$\frac{\partial y}{\partial x_i} = \frac{df}{dg} \cdot \frac{\partial g}{\partial x_i}$$
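As an illustration of the single-variable rule, the following SymPy sketch differentiates $y = \sin(x^2)$ both directly and via the chain rule; the choice of $f = \sin$ and $g(x) = x^2$ is mine:

```python
import sympy as sp

x = sp.symbols('x')
g = x**2           # inner function g(x) = x^2
y = sp.sin(g)      # composite y = f(g(x)) with f = sin

# Direct differentiation and the chain rule give the same result
print(sp.diff(y, x))                                      # 2*x*cos(x**2)
print(sp.diff(sp.sin(x), x).subs(x, g) * sp.diff(g, x))   # df/dg * dg/dx
```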
Example: logistic regression
In logistic regression, the hypothesis function is defined as:
$$\hat{y} = h_W(\mathbf{x}) = \sigma(x_0 \cdot w_0 + x_1 \cdot w_1) = \sigma(x_1 \cdot w_1 + b) = \frac{1}{1 + e^{-(x_1 \cdot w_1 + b)}}$$
where $x_0 = 1$, so that the bias term is $b = w_0$.
The binary cross-entropy cost function for a single training example $(x_0, x_1)$, with label $y \in \{0, 1\}$, is given by:
$$J(y, \hat{y}) \doteq -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$$
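A minimal NumPy sketch of the hypothesis and the cost for a single example, assuming a scalar feature $x_1$ and parameters $w_1$, $b$ (the function names and numeric values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x1, w1, b):
    # Hypothesis: y_hat = sigma(x1 * w1 + b)
    return sigmoid(x1 * w1 + b)

def bce(y, y_hat):
    # Binary cross-entropy for a single training example
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

y_hat = predict(x1=1.5, w1=0.8, b=-0.5)   # illustrative values
print(y_hat, bce(y=1, y_hat=y_hat))
```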
Partial derivative of J with respect to w1
Using the chain rule, we differentiate the cost function with respect to $w_1$.
Since $J$ depends on $\hat{y} = h_W(\mathbf{x}) = \sigma(z)$, which in turn depends on $w_1$ and $b$, we have:
$$\frac{\partial J(y, \hat{y})}{\partial z} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right) \cdot \hat{y} \cdot (1 - \hat{y}) = \hat{y} - y$$
where $z = x_1 \cdot w_1 + b$. Hence:
$$\frac{\partial J(y, \hat{y})}{\partial w_1} = \left(\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\right) \cdot \frac{\partial z}{\partial w_1} = (\hat{y} - y) \cdot x_1$$
Here $(\hat{y} - y)$ is the difference between the prediction and the label, and the factor $x_1$ points to the importance of feature re-scaling: the gradient scales directly with the feature value.
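The result $(\hat{y} - y) \cdot x_1$ can be cross-checked against a finite-difference approximation of $\partial J / \partial w_1$; the values below are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w1, b, x1, y):
    # Binary cross-entropy for a single example with prediction sigma(x1*w1 + b)
    y_hat = sigmoid(x1 * w1 + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

x1, y, w1, b = 1.5, 1.0, 0.8, -0.5   # illustrative values

# Analytic gradient from the chain rule: (y_hat - y) * x1
y_hat = sigmoid(x1 * w1 + b)
analytic = (y_hat - y) * x1

# Central finite-difference approximation of dJ/dw1
eps = 1e-6
numeric = (cost(w1 + eps, b, x1, y) - cost(w1 - eps, b, x1, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree closely
```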