Evaluation metrics

Evaluation metrics in binary problems

Positive samples, also known as “anomalies” or “malicious samples,” represent instances or data points that are associated with known security threats or malicious activities. In network security, positive samples are malicious traffic flows. Negative samples, also known as “benign samples” or “normal samples,” represent instances or data points that are considered non-malicious or part of normal, legitimate activities. In network security, negative samples are benign (or legitimate) traffic flows.

Definitions:

  • TP (true positive): number of correctly classified malicious flows
  • FN (false negative): number of misclassified malicious flows
  • TN (true negative): number of correctly classified benign flows
  • FP (false positive): number of misclassified benign flows
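
As an illustration, the four counts can be derived by comparing a vector of ground-truth labels with a vector of predictions. The sketch below uses hypothetical toy labels, with 1 marking a malicious flow and 0 a benign one:

```python
# Hypothetical toy labels: 1 = malicious flow, 0 = benign flow
y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # model predictions

# Count the four outcomes by comparing each prediction with its ground truth
TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # malicious, flagged
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # malicious, missed
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # benign, passed
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # benign, flagged

print(TP, FN, TN, FP)  # 2 1 4 1
```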

Accuracy

Accuracy is the percentage of data points (e.g., network traffic flows) that are correctly classified by the model. It is calculated by dividing the number of correct predictions by the total number of predictions. $$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad ACC \in [0, 1]$$
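
A quick check of the formula on the same hypothetical toy labels, using scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# (TP + TN) / (TP + TN + FP + FN) = (2 + 4) / 8
print(accuracy_score(y_true, y_pred))  # 0.75
```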

Recall or TPR

Recall is the fraction of actual positives that are correctly predicted as positive. It is also known as the true positive rate (TPR). $$TPR = \frac{TP}{TP + FN} \qquad TPR \in [0, 1]$$
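
The same check for recall, again on the hypothetical toy labels:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# TP / (TP + FN) = 2 / (2 + 1)
print(recall_score(y_true, y_pred))  # 0.666...
```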

Precision or PPV

Precision is the fraction of positive predictions that are actually positive. It is also known as the positive predictive value (PPV). $$PPV = \frac{TP}{TP + FP} \qquad PPV \in [0, 1]$$
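
And the corresponding check for precision:

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# TP / (TP + FP) = 2 / (2 + 1)
print(precision_score(y_true, y_pred))  # 0.666...
```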

F1 Score

The F1 score is a measure of the accuracy of a model’s predictions in binary classification problems. It is calculated as the harmonic mean of precision and recall. $$F1 = 2 \cdot \frac{PPV \cdot TPR}{PPV + TPR} \qquad F1 \in [0, 1]$$
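
And for the F1 score (on these toy labels precision and recall happen to coincide, so the harmonic mean equals both):

```python
from sklearn.metrics import f1_score

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# 2 * (PPV * TPR) / (PPV + TPR) = 2 * (2/3 * 2/3) / (2/3 + 2/3)
print(f1_score(y_true, y_pred))  # 0.666...
```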

Accuracy vs F1 score

Scenario: your dataset is composed of 99 benign samples and 1 malicious sample, and your model is the constant function f(x) = 0 (everything is classified as negative/benign).

ACC = $(0+99)/(0+99+0+1)=0.99$, a very high score!

F1 = $2 \cdot (0 \cdot 0)/(0+0)$ is undefined (a 0/0 form, usually reported as 0): the model never detects the malicious class.

The problem lies with unbalanced datasets, which are very common in real-world applications, where benign samples make up the majority of the training data.
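
The scenario can be reproduced directly; the sketch below builds the hypothetical 99/1 dataset and the constant classifier f(x) = 0:

```python
from sklearn.metrics import accuracy_score, f1_score

# 99 benign samples (0) and 1 malicious sample (1)
y_true = [0] * 99 + [1]
# f(x) = 0: the model labels everything as benign
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.99 -- looks excellent
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0  -- the malicious class is never detected
```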

False Positive Rate

Definition: the false positive rate (FPR) is the proportion of negative instances that are incorrectly classified as positive. It is also known as the false alarm rate. $$FPR = \frac{FP}{FP + TN} \qquad FPR \in [0, 1]$$
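
scikit-learn does not expose an FPR metric directly (outside of its ROC utilities), so the sketch below derives it from the binary confusion matrix, again on the hypothetical toy labels:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# For labels {0, 1} the matrix unravels as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp / (fp + tn))  # 1 / (1 + 4) = 0.2
```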

False Negative Rate

Definition: the false negative rate (FNR) is the proportion of positive instances that are incorrectly classified as negative. It is also known as the miss rate. $$FNR = \frac{FN}{FN + TP} \qquad FNR \in [0, 1]$$
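
The miss rate can be derived the same way:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fn / (fn + tp))  # 1 / (1 + 2) = 0.333..., equivalently 1 - TPR
```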

Confusion Matrix for binary problems

A confusion matrix is a table that is used to visualise the performance of a classification algorithm: each row represents the instances of an actual class and each column represents the instances of a predicted class (or vice versa, depending on the convention).
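
For the binary case the matrix can be built directly from the label vectors; the sketch below uses the same hypothetical toy labels as above (with labels {0, 1}, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # hypothetical predictions

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))
# [[4 1]
#  [1 2]]
```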

Evaluation metrics in multi-class problems

Calculating evaluation metrics for a classification problem with more than two classes involves extending the concepts from binary classification to account for the multiple classes.

Confusion matrix

For each class:

  • TP: the actual value and the predicted value are the same (the diagonal cell for that class)
  • FN: the sum of the values of the corresponding row, except for the TP value
  • FP: the sum of the values of the corresponding column, except for the TP value
  • TN: the sum of the values of all rows and columns, excluding the row and column of the class we are calculating the values for

             Predicted A    Predicted B    Predicted C
Actual A     TP_A           FN_A / FP_B    FN_A / FP_C
Actual B     FN_B / FP_A    TP_B           FN_B / FP_C
Actual C     FN_C / FP_A    FN_C / FP_B    TP_C
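
These rules can be applied programmatically. The sketch below reads the four counts for each class off a 3x3 confusion matrix, using the same hypothetical counts as the worked example further down:

```python
import numpy as np

# Hypothetical 3x3 confusion matrix (rows = actual A, B, C; columns = predicted A, B, C)
cm = np.array([[16, 2, 1],
               [0, 17, 1],
               [0, 0, 11]])

for i, label in enumerate("ABC"):
    tp = cm[i, i]                 # diagonal cell of the class
    fn = cm[i, :].sum() - tp      # rest of the row
    fp = cm[:, i].sum() - tp      # rest of the column
    tn = cm.sum() - tp - fn - fp  # everything else
    print(label, tp, fn, fp, tn)
```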

Precision, Recall and F1 Score

First, we compute the metrics for each class (a scikit-learn sketch follows the list):

  • Accuracy = sum(TP_i) / (sum(TP_i) + sum(FN_i)), i.e. the fraction of correctly classified samples over all samples (a single global value rather than a per-class one)
  • Precision_i = TP_i / (TP_i + FP_i)
  • Recall_i = TP_i / (TP_i + FN_i)
  • F1_i = 2 * (Precision_i * Recall_i) / (Precision_i + Recall_i)
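
A minimal scikit-learn sketch of the per-class computation, using hypothetical labels for three classes (0, 1, 2):

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical multi-class labels
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 0, 1, 1, 2, 0, 2]

# average=None returns one value per class, i.e. Precision_i, Recall_i, F1_i
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)
print(precision, recall, f1, support)
```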

Example

Let’s consider an example with three classes, summarised in the confusion matrix below. We want to compute the TP, FN, FP and TN values for class A.

  • Accuracy = (16+17+11)/(16+2+1+0+17+1+0+0+11) = 0.92
  • Precision_A = TP_A/(TP_A+FP_A)=16/(16+0)=1
  • Recall_A = TP_A/(TP_A+FN_A)=16/(16+2+1)=0.84
  • F1_A = 0.91

             Predicted A    Predicted B    Predicted C
Actual A     TP_A = 16      FN_A = 2       FN_A = 1
Actual B     FP_A = 0       TN_A = 17      TN_A = 1
Actual C     FP_A = 0       TN_A = 0       TN_A = 11
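
The numbers above can be reproduced by rebuilding label vectors that yield this confusion matrix (only the counts matter, not the ordering of the samples); a sketch:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Rebuild label vectors matching the matrix above (A = 0, B = 1, C = 2)
y_true = [0] * 19 + [1] * 18 + [2] * 11
y_pred = ([0] * 16 + [1] * 2 + [2] * 1   # actual A: 16 -> A, 2 -> B, 1 -> C
          + [1] * 17 + [2] * 1           # actual B: 17 -> B, 1 -> C
          + [2] * 11)                    # actual C: 11 -> C

print(confusion_matrix(y_true, y_pred))
print(accuracy_score(y_true, y_pred))                    # ~0.92
print(precision_score(y_true, y_pred, average=None)[0])  # Precision_A = 1.0
print(recall_score(y_true, y_pred, average=None)[0])     # Recall_A ~ 0.84
print(f1_score(y_true, y_pred, average=None)[0])         # F1_A ~ 0.91
```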

Aggregation methods

Macro averaging calculates the mean of the metric over all classes:

Macro_precision = mean(Precision_i)

Macro_recall = mean(Recall_i)

Macro_F1 = mean(F1_i)

Weighted averaging calculates the mean of the metric weighted by the number of samples in each class:

Weighted_precision = sum(n_i * Precision_i) / sum(n_i)

Weighted_recall = sum(n_i * Recall_i) / sum(n_i)

Weighted_F1 = sum(n_i * F1_i) / sum(n_i)

where n_i is the number of samples in class i.
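
Both strategies are available in scikit-learn through the average parameter; a sketch using the label vectors rebuilt from the worked example above:

```python
from sklearn.metrics import f1_score

# Label vectors from the worked example (A = 0, B = 1, C = 2)
y_true = [0] * 19 + [1] * 18 + [2] * 11
y_pred = [0] * 16 + [1] * 2 + [2] * 1 + [1] * 17 + [2] * 1 + [2] * 11

# Macro: unweighted mean of the per-class F1 scores
print(f1_score(y_true, y_pred, average="macro"))
# Weighted: mean of the per-class F1 scores, weighted by the class support n_i
print(f1_score(y_true, y_pred, average="weighted"))
```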

Which averaging strategy to use depends on the specific problem and the desired outcome:

  • If it is important to give equal weight to all classes, then macro averaging may be a good choice
  • If it is important to give more weight to larger classes, then weighted averaging may be a better choice.