Partial Derivatives
Partial derivatives measure how a function changes as one of its variables changes, while keeping the others constant.
For a function $f(x, y)$, which depends on two variables $x$ and $y$, the partial derivative with respect to $x$, denoted $\frac{\partial f}{\partial x}$, shows how $f$ changes as $x$ changes while $y$ remains constant.
When finding the partial derivative with respect to $x$, we treat $y$ as a constant and differentiate $f(x, y)$ with respect to $x$ only.
Partial derivatives of a cost function
Let’s take a sample cost function $J(w_0, w_1) = w_0^2 + w_1^2$ and compute its partial derivatives to explain the concept:
$$\frac{\partial J}{\partial w_0} = \frac{\partial}{\partial w_0} (w_0^2 + w_1^2)$$
Since $w_1^2$ is a constant (with respect to $w_0$), its derivative is 0, and we just differentiate $w_0^2$:
$$\frac{\partial J}{\partial w_0} = 2w_0$$
This means that as $w_0$ changes (with $w_1$ held constant), the rate of change of $J(w_0, w_1)$ is $2w_0$, i.e. proportional to $w_0$.
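As a quick sanity check, both partial derivatives can be verified symbolically; the sketch below uses SymPy, which is not part of the original notes:

```python
import sympy as sp

# Symbolic parameters of the sample cost function
w0, w1 = sp.symbols('w0 w1')

# J(w0, w1) = w0^2 + w1^2
J = w0**2 + w1**2

# When differentiating with respect to w0, w1 is treated as a constant
print(sp.diff(J, w0))  # 2*w0
print(sp.diff(J, w1))  # 2*w1
```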
3D plot of the function
The following figure shows the surface plot of $J(w_0, w_1) = w_0^2 + w_1^2$, together with two cross-section curves:
- $J(w_0, 2) = w_0^2 + 4$ (red curve)
- $J(0, w_1) = w_1^2$ (blue curve)
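A figure along these lines can be reproduced with a short Matplotlib sketch; the grid range and styling below are my own assumptions, not taken from the original plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over (w0, w1) for the surface J(w0, w1) = w0^2 + w1^2
w0 = np.linspace(-3, 3, 100)
w1 = np.linspace(-3, 3, 100)
W0, W1 = np.meshgrid(w0, w1)
J = W0**2 + W1**2

fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(W0, W1, J, alpha=0.5, cmap='viridis')

# Cross-section curves: J(w0, 2) = w0^2 + 4 and J(0, w1) = w1^2
ax.plot(w0, np.full_like(w0, 2.0), w0**2 + 4, color='red')
ax.plot(np.zeros_like(w1), w1, w1**2, color='blue')

ax.set_xlabel('$w_0$'); ax.set_ylabel('$w_1$'); ax.set_zlabel('$J(w_0, w_1)$')
plt.show()
```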
Gradient vector
The gradient vector at the point $(w_0, w_1) = (0, 2)$ is obtained from the gradient of the function $J(w_0, w_1) = w_0^2 + w_1^2$, which is given by:
$$\nabla J(w_0, w_1) = \left(\frac{\partial J}{\partial w_0}, \frac{\partial J}{\partial w_1}\right) = (2w_0, 2w_1)$$
At the point $(w_0, w_1) = (0, 2)$, the gradient vector is $\nabla J(0, 2) = (0, 4)$.
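A minimal sketch of this evaluation (the helper name grad_J is illustrative, not from the notes):

```python
def grad_J(w0, w1):
    # Gradient of J(w0, w1) = w0^2 + w1^2
    return (2 * w0, 2 * w1)

# Gradient vector at the point (w0, w1) = (0, 2)
print(grad_J(0, 2))  # (0, 4)
```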
Update of parameters
With gradient descent, the parameters of the model are updated with the following formula:
$$w_j^t = w_j^{t-1} - \eta \frac{\partial}{\partial w_j} J(\mathbf{w})$$
where $\eta$ is the learning rate.
In this example, at step $t = 1$, with learning rate $\eta = 0.2$ and the gradient computed at the point $(w_0^0, w_1^0) = (0, 2)$:
$$w_0^1 = w_0^0 - 0.2 \cdot 0 = 0$$
$$w_1^1 = w_1^0 - 0.2 \cdot 4 = 1.2$$
In the next step, the gradient will be computed at the point $(w_0^1, w_1^1) = (0, 1.2)$.
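The same update can be written as a short gradient-descent loop, assuming the values of the example ($\eta = 0.2$, starting point $(0, 2)$); this is a sketch, not code from the notes:

```python
def grad_J(w0, w1):
    # Gradient of J(w0, w1) = w0^2 + w1^2
    return (2 * w0, 2 * w1)

eta = 0.2             # learning rate
w0, w1 = 0.0, 2.0     # starting point (w0^0, w1^0) = (0, 2)

for t in range(1, 4):
    g0, g1 = grad_J(w0, w1)                  # gradient at the previous point
    w0, w1 = w0 - eta * g0, w1 - eta * g1    # parameter update
    print(f"t={t}: (w0, w1) = ({w0}, {w1})")
# t=1: (w0, w1) = (0.0, 1.2)
# later steps keep shrinking w1 toward the minimum at (0, 0)
```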
The chain rule
The chain rule is a fundamental theorem in calculus that helps compute the derivative of a composite function.
It relates the derivative of the outer function to the derivatives of the inner functions.
Single-Variable Chain Rule: $y = f(g(x))$
$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$
Multi-Variable Chain Rule: $y = f(g(x_1, x_2, \dots, x_n))$
$$\frac{\partial y}{\partial x_i} = \frac{df}{dg} \cdot \frac{\partial g}{\partial x_i}$$
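As an illustration of the single-variable rule, the following SymPy sketch differentiates $y = \sin(x^2)$ both directly and via the chain rule; the choice of $f = \sin$ and $g(x) = x^2$ is mine:

```python
import sympy as sp

x = sp.symbols('x')
g = x**2           # inner function g(x) = x^2
y = sp.sin(g)      # composite y = f(g(x)) with f = sin

# Direct differentiation and the chain rule give the same result
print(sp.diff(y, x))                                      # 2*x*cos(x**2)
print(sp.diff(sp.sin(x), x).subs(x, g) * sp.diff(g, x))   # df/dg * dg/dx
```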
Example: logistic regression
In logistic regression, the hypothesis function is defined as:
$$\hat{y} = h_W(\mathbf{x}) = \sigma(x_0 \cdot w_0 + x_1 \cdot w_1) = \sigma(x_1 \cdot w_1 + b) = \frac{1}{1 + e^{-(x_1 \cdot w_1 + b)}}$$
where $x_0 = 1$, so that the bias term is $b = w_0$.
The binary cross-entropy cost function for a single training example $(x_0, x_1)$, with label $y \in \{0, 1\}$, is given by:
$$J(y, \hat{y}) \doteq -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})$$
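A minimal NumPy sketch of the hypothesis and the cost for a single example, assuming a scalar feature $x_1$ and parameters $w_1$, $b$ (the function names and numeric values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x1, w1, b):
    # Hypothesis: y_hat = sigma(x1 * w1 + b)
    return sigmoid(x1 * w1 + b)

def bce(y, y_hat):
    # Binary cross-entropy for a single training example
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

y_hat = predict(x1=1.5, w1=0.8, b=-0.5)   # illustrative values
print(y_hat, bce(y=1, y_hat=y_hat))
```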
Partial derivative of J with respect to w1
Using the chain rule, we differentiate the cost function with respect to $w_1$.
Since $J$ depends on $\hat{y} = h_W(\mathbf{x}) = \sigma(z)$, which in turn depends on $w_1$ and $b$, we have:
$$\frac{\partial J(y, \hat{y})}{\partial z} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}\right) \cdot \hat{y} \cdot (1 - \hat{y}) = \hat{y} - y$$
where $z = x_1 \cdot w_1 + b$. Hence:
$$\frac{\partial J(y, \hat{y})}{\partial w_1} = \left(\frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\right) \cdot \frac{\partial z}{\partial w_1} = (\hat{y} - y) \cdot x_1$$
Here $(\hat{y} - y)$ is the difference between the prediction and the label, and the factor $x_1$ points to the importance of feature re-scaling: the gradient scales directly with the feature value.
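The result $(\hat{y} - y) \cdot x_1$ can be cross-checked against a finite-difference approximation of $\partial J / \partial w_1$; the values below are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w1, b, x1, y):
    # Binary cross-entropy for a single example with prediction sigma(x1*w1 + b)
    y_hat = sigmoid(x1 * w1 + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

x1, y, w1, b = 1.5, 1.0, 0.8, -0.5   # illustrative values

# Analytic gradient from the chain rule: (y_hat - y) * x1
y_hat = sigmoid(x1 * w1 + b)
analytic = (y_hat - y) * x1

# Central finite-difference approximation of dJ/dw1
eps = 1e-6
numeric = (cost(w1 + eps, b, x1, y) - cost(w1 - eps, b, x1, y)) / (2 * eps)

print(analytic, numeric)  # the two values agree closely
```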