Logistic Regression


Logistic Regression is a machine learning technique designed for binary classification problems, where the goal is to predict one of two possible outcomes, often denoted as 0 and 1.

> [!CAUTION]
> Despite its name, logistic regression is used for classification, not regression.

Logistic Function (Sigmoid): The core of logistic regression is the logistic (or sigmoid) function. It’s an S-shaped curve that maps any real-valued number to a value between 0 and 1.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$ ![[logistic_regression.png]] This function is used to model the probability that a given input belongs to one of the two classes.
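
A minimal NumPy sketch of this function (the names here are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real-valued input to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# S-shaped: large negative inputs approach 0, large positive inputs approach 1.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ~[0.000045, 0.269, 0.5, 0.731, 0.999955]
```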

Estimation of probabilities

Just like linear regression, a Logistic regression model computes the weighted sum of the input features (plus the bias term).

Logistic regression returns the estimated probability computed using the sigmoid function: $$h_{\theta}(x) = \sigma(\theta_0x_0 + \theta_1x_1 + … + \theta_nx_n) = \sigma(\theta^{T}x) \in [0, 1]$$

```mermaid
graph LR;
    classDef circle fill:#bbb,stroke:#000,stroke-width:1px,rx:50,ry:50;

    x0(["x₀"]) -->|θ₀| sigma(("σ(θᵀx)"));
    x1(["x₁"]) -->|θ₁| sigma;
    x2(["x₂"]) -->|θ₂| sigma;
    x3(["x₃"]) -->|θ₃| sigma;
    xn(["xₙ"]) -->|θₙ| sigma;

    subgraph "Input features"
        x1
        x2
        x3
        xn
    end

    class x0,x1,x2,x3,xn,sigma circle;
```
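
As a sketch, the weighted sum of the inputs (including the bias term θ₀x₀ with x₀ = 1) is passed through the sigmoid; the parameter and feature values below are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """Estimated probability h_theta(x) = sigma(theta^T x).

    theta : parameters of shape (n + 1,), theta[0] being the bias theta_0
    x     : features of shape (n + 1,), with x[0] = 1 for the bias term
    """
    return sigmoid(theta @ x)

theta = np.array([-1.0, 2.0, 0.5])   # illustrative parameters (theta_0, theta_1, theta_2)
x = np.array([1.0, 0.3, 1.2])        # x_0 = 1, plus two input feature values
print(predict_proba(theta, x))       # ~0.55
```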

Classification with Logistic Regression

Once the Logistic regression model has estimated the probability that an instance belongs to the positive class (the class labelled with 1), it can make its prediction as follows:

$$\hat{y} = \begin{cases} 0 & \text{if } h_{\theta}(\mathbf{x}) < 0.5 \\ 1 & \text{if } h_{\theta}(\mathbf{x}) \geq 0.5 \end{cases}$$
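
A sketch of this decision rule (the example parameter and feature values are illustrative):

```python
import numpy as np

def predict(theta, X, threshold=0.5):
    """Predict class labels: 1 if h_theta(x) >= threshold, else 0.

    X : matrix of shape (m, n + 1), one instance per row, first column = 1 (bias).
    """
    probabilities = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return (probabilities >= threshold).astype(int)

theta = np.array([-1.0, 2.0, 0.5])     # illustrative parameters
X = np.array([[1.0, 0.3, 1.2],         # instance with h_theta(x) ~ 0.55 -> 1
              [1.0, -0.5, 0.1]])       # instance with h_theta(x) ~ 0.12 -> 0
print(predict(theta, X))               # [1 0]
```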

Properties of Logistic regression

  • Linear Decision Boundary: Logistic regression assumes a linear relationship between the input features and the log-odds of the positive class. The decision boundary, which separates the two classes, is a straight line (in two dimensions) or a hyperplane (in higher dimensions)
  • Training: The goal is to find the set of coefficients (weights) that minimises the error in predicting the class probabilities. Logistic regression is trained on a labelled dataset, where both the input features and their corresponding class labels are known
  • Interpretability: One of the advantages of logistic regression is its interpretability. You can easily interpret the coefficients of the model to understand how each feature influences the probability of belonging to a particular class, as illustrated in the sketch below
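
A quick sketch of inspecting the learned coefficients with scikit-learn; the dataset and settings below are synthetic and purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, purely for illustration.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

model = LogisticRegression()
model.fit(X, y)

# Each coefficient tells how the corresponding feature shifts the log-odds of
# belonging to class 1; the sign gives the direction of the influence.
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)
print("probabilities for the first instance:", model.predict_proba(X[:1]))
```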

Cost function

For a training set with m samples, the overall cost function is the average of the individual losses: $$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(\mathbf{y}^{(i)}, h_{\theta}(\mathbf{x}^{(i)}))$$ where:

  • J(θ) is the overall cost function
  • m is the number of samples in the training set
  • $\mathbf{y}^{(i)}$ is the actual class label of the i-th sample
  • $h_{\theta}(\mathbf{x}^{(i)})$ is the predicted probability for the i-th sample $\mathbf{x}^{(i)}$

The goal during the training of logistic regression is to find the parameters θ that minimise this cost function by using optimisation algorithms like batch/mini-batch gradient descent.
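
A small sketch of this averaging, assuming a per-sample loss function is supplied (the binary cross-entropy introduced in the next section):

```python
import numpy as np

def cost(theta, X, y, per_sample_loss):
    """J(theta): the average of the per-sample losses over the m training samples.

    X : feature matrix of shape (m, n + 1), first column = 1 for the bias
    y : class labels of shape (m,)
    per_sample_loss : L(y_i, h_i), e.g. the binary cross-entropy defined below
    """
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x^(i)) for every sample
    m = len(y)
    return sum(per_sample_loss(y_i, h_i) for y_i, h_i in zip(y, h)) / m
```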

Training Logistic Regression

The cost function (or loss function) used in logistic regression is the binary cross-entropy loss, also known as log loss. For a single data point, it is calculated as follows:

$$L(\mathbf{y}, h_{\theta}(\mathbf{x})) = - \left( \mathbf{y} \cdot \log(h_{\theta}(\mathbf{x})) + (1 - \mathbf{y}) \cdot \log(1 - h_{\theta}(\mathbf{x})) \right)$$

where:

  • y is the actual class label (0 or 1)
  • $h_{\theta}(\mathbf{x})$ is the predicted probability that the instance belongs to class 1 (the output of the logistic sigmoid function)
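
A sketch of the log loss for a single instance (the probability values below are illustrative):

```python
import numpy as np

def log_loss_single(y, h):
    """Binary cross-entropy (log loss) for a single instance.

    y : actual class label, 0 or 1
    h : predicted probability h_theta(x) that the instance belongs to class 1
    """
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

# The further the predicted probability is from the actual label, the larger the loss.
print(log_loss_single(1, 0.9))   # ~0.105  (confident and correct)
print(log_loss_single(1, 0.1))   # ~2.303  (confident and wrong)
```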

Binary cross-entropy

The binary cross-entropy loss function penalises the model more when it predicts a probability far from the actual label. $$L(\mathbf{y}, h_{\theta}(\mathbf{x})) = - \left( \mathbf{y} \cdot \log(h_{\theta}(\mathbf{x})) + (1 - \mathbf{y}) \cdot \log(1 - h_{\theta}(\mathbf{x})) \right) \in (0, +\infty)$$

  • When the actual label (y) is 1, it penalises more if the predicted probability ($h_{\theta}(\mathbf{x})$) is closer to 0
  • When the actual label is 0, it penalises more if the predicted probability is closer to 1
  • The negative sign makes the loss positive (the logarithm of a probability is non-positive), so that it can be minimised during the optimisation process

Cost minimisation

The Logistic regression cost function is convex, therefore gradient descent is guaranteed to find the global minimum (if the learning rate is not too large and we wait long enough).

The partial derivative of the cost function with respect to the j-th model parameter $\theta_j$ is given by the following equation:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma(\theta^\top \mathbf{x}^{(i)}) - \mathbf{y}^{(i)} \right) \mathbf{x}^{(i)}_j$$
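
A vectorised sketch of this gradient computation, assuming the matrix shapes described in the comments:

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta): the vector of partial derivatives, one per theta_j.

    X : feature matrix of shape (m, n + 1), first column = 1 for the bias
    y : class labels of shape (m,)
    Returns an array of shape (n + 1,).
    """
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # sigma(theta^T x^(i)) for every sample
    return (X.T @ (h - y)) / m
```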

Gradient descent

With the gradient vector containing all the partial derivatives, you can use the Batch Gradient Descent algorithm: $$\theta_j = \theta_j - \eta \frac{\partial}{\partial \theta_j} J(\theta) \quad j \in [0, n]$$

where η is the learning rate.

NOTE: with mini-batch Gradient descent, m is the number of samples in a mini-batch.
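
Putting the pieces together, a sketch of training with Batch Gradient Descent; the learning rate, iteration count, and the tiny dataset are illustrative, not tuned values:

```python
import numpy as np

def fit_logistic_regression(X, y, learning_rate=0.1, n_iterations=1000):
    """Train logistic regression with batch gradient descent.

    X : feature matrix of shape (m, n + 1), first column = 1 for the bias
    y : class labels of shape (m,)
    """
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(n_iterations):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
        grad = (X.T @ (h - y)) / m               # gradient of J(theta)
        theta = theta - learning_rate * grad     # update every theta_j simultaneously
    return theta

# Tiny hand-made dataset, purely for illustration: one feature plus the bias column.
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0, 0, 1, 1])
print(fit_logistic_regression(X, y))   # learned (theta_0, theta_1)
```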

Polynomial Logistic Regression

Logistic regression can be extended to handle polynomial features by incorporating polynomial terms into the feature space. This technique is known as Polynomial Logistic Regression.

In polynomial logistic regression, the hypothesis function is an extension of the logistic regression hypothesis that includes polynomial terms: $$h_{\theta}(\mathbf{x}) = \sigma(\theta_0 + \theta_1\mathbf{x}_1 + \dots + \theta_n\mathbf{x}_n + \theta_{11}\mathbf{x}_1^2 + \theta_{12}\mathbf{x}_1\mathbf{x}_2 + \dots + \theta_{dd}\mathbf{x}_n^d)$$ where d is the degree of the polynomial.

The cost function (or loss function) for polynomial logistic regression is the same as that for regular logistic regression: it is the binary cross-entropy loss.
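
One possible sketch with scikit-learn is to generate the polynomial terms with PolynomialFeatures and then fit a standard logistic regression on the expanded features; the dataset and degree below are illustrative:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Data with a circular decision boundary: not linearly separable in the
# original features, but separable once degree-2 polynomial terms are added.
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```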