Softmax Regression
Logistic regression can be generalised to support multiple classes: this is called Softmax regression or Multinomial Logistic regression.
Softmax regression first computes the score $(\theta^{(k)})^\top \mathbf{x}$ for each class $k \in [1, K]$, then it estimates the probability of each class by applying the softmax function $\sigma$:
$$\sigma \left( \begin{bmatrix} \theta_0^{(1)} & \theta_0^{(2)} & \cdots & \theta_0^{(K)} \\ \theta_1^{(1)} & \theta_1^{(2)} & \cdots & \theta_1^{(K)} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_n^{(1)} & \theta_n^{(2)} & \cdots & \theta_n^{(K)} \end{bmatrix}^\top \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} \right)$$
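As a minimal NumPy sketch of this step (the names `Theta` and `x` and the shapes are illustrative, not taken from any particular library), the scores for all classes can be computed with a single matrix product:

```python
import numpy as np

# Illustrative shapes: n + 1 = 3 inputs (including the bias term x_0 = 1) and K = 4 classes.
rng = np.random.default_rng(0)
Theta = rng.normal(size=(3, 4))   # parameter matrix, one column theta^(k) per class
x = np.array([1.0, 0.5, -1.2])    # a single sample, with x_0 = 1 for the bias term

z = Theta.T @ x                   # scores z_k = (theta^(k))^T x for all classes, shape (4,)
```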
The softmax function
The softmax function is used in softmax regression to convert the raw scores $(\theta^{(k)})^\top \mathbf{x}$ into probabilities that sum up to 1. For a given set of scores $z_1 = (\theta^{(1)})^\top \mathbf{x}, \dots, z_K = (\theta^{(K)})^\top \mathbf{x}$, the softmax function computes the probabilities
$$\hat{p}_k = \sigma\left((\theta^{(k)})^\top \mathbf{x}\right) = \sigma(z_k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} \quad k \in [1, K]$$
$$h_{\theta}(\mathbf{x}) = \sigma \left( \begin{bmatrix} \theta_0^{(1)} & \theta_0^{(2)} & \cdots & \theta_0^{(K)} \\ \theta_1^{(1)} & \theta_1^{(2)} & \cdots & \theta_1^{(K)} \\ \vdots & \vdots & \ddots & \vdots \\ \theta_n^{(1)} & \theta_n^{(2)} & \cdots & \theta_n^{(K)} \end{bmatrix}^\top \begin{bmatrix} \mathbf{x}_0 \\ \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} \right) = \sigma \left( \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_K \end{bmatrix} \right) = \begin{bmatrix} \hat{p}_1 \\ \hat{p}_2 \\ \vdots \\ \hat{p}_K \end{bmatrix}$$
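A small sketch of the softmax step itself; the `softmax` helper and the example scores are illustrative, and shifting the scores by their maximum is a common numerical-stability trick rather than part of the formula above:

```python
import numpy as np

def softmax(z):
    """Convert a vector of raw class scores into probabilities that sum to 1."""
    exp_z = np.exp(z - z.max())   # shifting by the max avoids overflow and does not change the result
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])     # example scores for K = 3 classes
p_hat = softmax(z)
print(p_hat, p_hat.sum())         # the estimated probabilities sum to 1
```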
Classification with softmax regression
Softmax regression predicts the class with the highest estimated probability, as summarised in the following equation:
$$\hat{\mathbf{y}} = \operatorname*{argmax}_{k \in [1, K]} (\hat{p}_k)$$
The argmax operator returns the value of $k$ that maximises the estimated probability $\hat{p}_k$.
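Continuing the sketch, the prediction is simply the index of the largest estimated probability (the values below are illustrative):

```python
import numpy as np

p_hat = np.array([0.66, 0.24, 0.10])   # estimated probabilities from the softmax step
y_hat = int(np.argmax(p_hat))          # predicted class: the one with the highest probability
print(y_hat)                           # -> 0
```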
Categorical cross-entropy
Categorical cross-entropy is the cost function used to train softmax regression models. It penalises the model when it estimates a low probability for the target class.
$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \mathbf{y}_k^{(i)} \log(\hat{p}_k^{(i)})$$
where:
- $\Theta$ is the parameter matrix that stores all the $\theta^{(k)}$ with $k \in [1, K]$
- $\mathbf{y}_k^{(i)}$ is the target probability that sample $i$ belongs to class $k$; it is usually either 0 or 1
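A minimal sketch of the cost computation, assuming one-hot targets in `Y` and estimated probabilities in `P_hat` (both of shape m × K, with illustrative values); the small epsilon only guards against `log(0)` and is not part of the formula:

```python
import numpy as np

Y = np.array([[1, 0, 0],                # one-hot targets y_k^(i), shape (m, K)
              [0, 0, 1]])
P_hat = np.array([[0.7, 0.2, 0.1],      # estimated probabilities p_hat_k^(i), shape (m, K)
                  [0.3, 0.3, 0.4]])

m = len(Y)
eps = 1e-12                             # guards against log(0)
J = -np.sum(Y * np.log(P_hat + eps)) / m
print(J)
```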
Cost minimisation
The partial derivative of the cost function with respect to the $j$-th model parameter $\theta_j^{(k)}$ for class $k$ is given in the following equation:
$$\frac{\partial}{\partial \theta_j^{(k)}} J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} (\hat{p}_k^{(i)} - \mathbf{y}_k^{(i)}) \mathbf{x}_j^{(i)}$$
where $\hat{p}_k^{(i)} = \sigma\left((\theta^{(k)})^\top \mathbf{x}^{(i)}\right)$ is the probability that the $i$-th sample belongs to class $k$.
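Putting the pieces together, here is a hedged sketch of a single batch gradient descent step using this gradient; the data, the learning rate `eta`, and the variable names are illustrative:

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax of a score matrix Z of shape (m, K)."""
    exp_Z = np.exp(Z - Z.max(axis=1, keepdims=True))   # numerical-stability shift
    return exp_Z / exp_Z.sum(axis=1, keepdims=True)

# Illustrative data: m samples, n + 1 inputs (bias included), K classes.
m, n_plus_1, K = 100, 3, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(m, n_plus_1))      # design matrix; a real setup would set the first column to 1
Y = np.eye(K)[rng.integers(0, K, m)]    # one-hot targets, shape (m, K)

Theta = np.zeros((n_plus_1, K))         # parameter matrix, one column per class
eta = 0.1                               # learning rate

P_hat = softmax(X @ Theta)              # estimated probabilities, shape (m, K)
grad = X.T @ (P_hat - Y) / m            # gradient of J(Theta), matching the equation above
Theta -= eta * grad                     # one batch gradient descent step
```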
Note that for $K = 2$ the cost function is equivalent to the log loss of Logistic Regression, since $\mathbf{y}_2^{(i)} = 1 - \mathbf{y}_1^{(i)}$ and $\hat{p}_2^{(i)} = 1 - \hat{p}_1^{(i)}$:
$$\begin{aligned} J(\Theta) &= -\frac{1}{m} \sum_{i=1}^{m} \left( \mathbf{y}_1^{(i)} \log(\hat{p}_1^{(i)}) + \mathbf{y}_2^{(i)} \log(\hat{p}_2^{(i)}) \right) \\ &= -\frac{1}{m} \sum_{i=1}^{m} \left( \mathbf{y}_1^{(i)} \log(\hat{p}_1^{(i)}) + (1 - \mathbf{y}_1^{(i)}) \log(1 - \hat{p}_1^{(i)}) \right) \end{aligned}$$
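As a quick numerical sanity check of this equivalence (with illustrative values):

```python
import numpy as np

y1 = np.array([1.0, 0.0, 1.0])           # targets for class 1; class 2 targets are 1 - y1
p1 = np.array([0.8, 0.3, 0.6])           # estimated probabilities for class 1; p2 = 1 - p1
y2, p2 = 1 - y1, 1 - p1

cross_entropy = -np.mean(y1 * np.log(p1) + y2 * np.log(p2))            # K = 2 categorical cross-entropy
log_loss      = -np.mean(y1 * np.log(p1) + (1 - y1) * np.log(1 - p1))  # logistic regression log loss
print(np.isclose(cross_entropy, log_loss))                             # -> True
```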