Evaluation metrics
Evaluation metrics in binary problems
Positive samples, also known as “anomalies” or “malicious samples,” represent instances or data points that are associated with known security threats or malicious activities. In network security, positive samples are malicious traffic flows. Negative samples, also known as “benign samples” or “normal samples,” represent instances or data points that are considered non-malicious or part of normal, legitimate activities. In network security, negative samples are benign (or legitimate) traffic flows.
Definitions:
- TP (true positive): number of correctly classified malicious flows
- FN (false negative): number of misclassified malicious flows
- TN (true negative): number of correctly classified benign flows
- FP (false positive): number of misclassified benign flows
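As a quick illustration, here is a minimal Python sketch that counts these four quantities from two label vectors. The lists y_true and y_pred and the 1 = malicious / 0 = benign encoding are illustrative assumptions, not taken from a real dataset.

```python
# Illustrative labels (assumed encoding: 1 = malicious flow, 0 = benign flow).
y_true = [1, 1, 0, 0, 0, 1, 0, 0]   # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]   # model predictions

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # malicious, flagged
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # malicious, missed
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # benign, passed
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # benign, flagged

print(TP, FN, TN, FP)  # 2 1 4 1
```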
Accuracy
Accuracy is the percentage of data points (e.g., network traffic flows) that are correctly classified by the model. It is calculated by dividing the number of correct predictions by the total number of predictions: $$ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad ACC \in [0, 1]$$
Recall or TPR
Recall is the fraction of actual positives that are correctly predicted as positive. It is also known as the true positive rate (TPR) $$TPR = \frac{TP}{TP + FN} \qquad TPR \in [0, 1]$$
Precision or PPV
Precision is the fraction of positive predictions that are actually positive. It is also known as the positive predictive value (PPV). $$PPV = \frac{TP}{TP + FP} \qquad PPV \in [0, 1]$$
F1 Score
The F1 score summarises the quality of a model’s predictions in binary classification problems with a single number. It is calculated as the harmonic mean of precision and recall. $$F1 = 2 \cdot \frac{PPV \cdot TPR}{PPV + TPR} \qquad F1 \in [0, 1]$$
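All four metrics can be computed directly from the counts. A minimal sketch, reusing the hypothetical counts from the sketch above (TP = 2, FN = 1, TN = 4, FP = 1):

```python
# Hypothetical counts carried over from the previous sketch.
TP, FN, TN, FP = 2, 1, 4, 1

ACC = (TP + TN) / (TP + TN + FP + FN)   # accuracy
TPR = TP / (TP + FN)                    # recall / true positive rate
PPV = TP / (TP + FP)                    # precision / positive predictive value
F1 = 2 * PPV * TPR / (PPV + TPR)        # harmonic mean of precision and recall

print(f"ACC={ACC:.2f} TPR={TPR:.2f} PPV={PPV:.2f} F1={F1:.2f}")
# ACC=0.75 TPR=0.67 PPV=0.67 F1=0.67
```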
Accuracy vs F1 score
Scenario: your dataset is composed of 99 benign samples and 1 malicious sample, and your model is the constant function f(x) = 0 (everything is classified as negative/benign).
ACC = $(0+99)/(0+99+0+1)=0.99$, a very high score!
F1 = $2 \cdot (0 \cdot 0)/(0+0)$ is undefined: the model never predicts the positive class, so precision and recall are both 0.
The problem lies with unbalanced datasets, which are very common in real-world applications, where benign samples make up the majority of the training data.
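A minimal sketch of this exact scenario; the zero-division handling is an assumption, following the common convention (e.g. scikit-learn's zero_division=0) of reporting 0 when precision and recall are both 0:

```python
# 99 benign samples, 1 malicious sample, and a model f(x) = 0 that always predicts benign.
y_true = [0] * 99 + [1]
y_pred = [0] * 100

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 0
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # 99
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 0

ACC = (TP + TN) / (TP + TN + FP + FN)              # 0.99: looks excellent
PPV = TP / (TP + FP) if TP + FP else 0.0           # 0.0: no positive predictions at all
TPR = TP / (TP + FN) if TP + FN else 0.0           # 0.0: the one attack is missed
F1 = 2 * PPV * TPR / (PPV + TPR) if PPV + TPR else 0.0

print(ACC, F1)  # 0.99 0.0
```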
False Positive Rate
Definition: the false positive rate (FPR) is the proportion of negative instances that are incorrectly classified as positive. It is also known as the false alarm rate $$FPR = \frac{FP}{FP + TN} \qquad FPR \in [0, 1]$$
False Negative Rate
Definition: The false negative rate (FNR) is the proportion of positive instances that are incorrectly classified as negative. It is also known as the miss rate $$FNR = \frac{FN}{FN + TP} \qquad FNR \in [0, 1]$$
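FPR and FNR follow the same pattern. A short sketch, again using the hypothetical counts from the earlier binary example (TP = 2, FN = 1, TN = 4, FP = 1):

```python
# Hypothetical counts carried over from the earlier sketch.
TP, FN, TN, FP = 2, 1, 4, 1

FPR = FP / (FP + TN)   # fraction of benign flows raised as false alarms
FNR = FN / (FN + TP)   # fraction of malicious flows that were missed

print(f"FPR={FPR:.2f} FNR={FNR:.2f}")  # FPR=0.20 FNR=0.33
```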
Confusion Matrix for binary problems
A confusion matrix is a table used to visualise the performance of a classification algorithm: each row corresponds to an actual class and each column to a predicted class.
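As an illustration, the sketch below builds the binary confusion matrix for the hypothetical label vectors used earlier; it assumes scikit-learn is installed. With labels=[0, 1], the first row holds [TN, FP] and the second [FN, TP].

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels from the earlier sketch (1 = malicious, 0 = benign).
y_true = [1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
# [[4 1]
#  [1 2]]
```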
Evaluation metrics in multi-class problems
Calculating evaluation metrics for a classification problem with more than two classes involves extending the concepts from binary classification to account for the multiple classes.
Confusion matrix
- TP: the actual value and the predicted value are the same (the diagonal cell for that class)
- FN: the sum of the values in the corresponding row, excluding the TP value
- FP: the sum of the values in the corresponding column, excluding the TP value
- TN: the sum of the values of all the other rows and columns, excluding those of the class for which we are calculating the metrics
| | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | TP_A | FP_B, FN_A | FP_C, FN_A |
| Actual B | FP_A, FN_B | TP_B | FP_C, FN_B |
| Actual C | FP_A, FN_C | FP_B, FN_C | TP_C |
Precision, Recall and F1 Score
First, we compute accuracy over all classes, and precision, recall and F1 for each class i (a short sketch follows the list):
- Accuracy = sum(TP_i)/(sum(TP_i)+sum(FN_i))
- Precision_i = TP_i / (TP_i + FP_i)
- Recall_i = TP_i / (TP_i + FN_i)
- F1_i = 2 * (Precision_i * Recall_i) / (Precision_i + Recall_i)
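A minimal sketch of these per-class formulas. per_class_metrics is a hypothetical helper (not a standard library function) that takes a confusion matrix as a list of rows, where cm[i][j] is the number of samples of actual class i predicted as class j:

```python
def per_class_metrics(cm):
    """Return overall accuracy and per-class (precision, recall, F1) for a confusion matrix."""
    k = len(cm)
    accuracy = sum(cm[i][i] for i in range(k)) / sum(sum(row) for row in cm)
    per_class = []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i][j] for j in range(k) if j != i)   # rest of row i
        fp = sum(cm[j][i] for j in range(k) if j != i)   # rest of column i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append((precision, recall, f1))
    return accuracy, per_class
```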
Example
Let’s consider an example with three classes. We want to compute the TP, FN, FP and TN values for class A, so every cell of the confusion matrix below is labelled from class A’s point of view:

| | Predicted A | Predicted B | Predicted C |
|---|---|---|---|
| Actual A | TP_A = 16 | FN_A = 2 | FN_A = 1 |
| Actual B | FP_A = 0 | TN_A = 17 | TN_A = 1 |
| Actual C | FP_A = 0 | TN_A = 0 | TN_A = 11 |

- Accuracy = (16+17+11)/(16+2+1+0+17+1+0+0+11) = 0.92
- Precision_A = TP_A/(TP_A+FP_A) = 16/(16+0) = 1
- Recall_A = TP_A/(TP_A+FN_A) = 16/(16+2+1) = 0.84
- F1_A = 2*(1*0.84)/(1+0.84) = 0.91
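For reference, a short self-contained Python check of the class-A numbers; representing the confusion matrix as a list of rows is an illustrative choice:

```python
# Confusion matrix from the example: rows = actual class, columns = predicted class (order A, B, C).
cm = [
    [16, 2, 1],   # actual A
    [0, 17, 1],   # actual B
    [0, 0, 11],   # actual C
]

tp_A = cm[0][0]                 # 16
fn_A = cm[0][1] + cm[0][2]      # 2 + 1: rest of row A
fp_A = cm[1][0] + cm[2][0]      # 0 + 0: rest of column A
accuracy = (cm[0][0] + cm[1][1] + cm[2][2]) / sum(sum(row) for row in cm)

precision_A = tp_A / (tp_A + fp_A)
recall_A = tp_A / (tp_A + fn_A)
f1_A = 2 * precision_A * recall_A / (precision_A + recall_A)

print(round(accuracy, 2), precision_A, round(recall_A, 2), round(f1_A, 2))
# 0.92 1.0 0.84 0.91
```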
Aggregation methods
Macro averaging calculates the mean of the metric over all classes:
Macro_precision = mean(Precision_i)
Macro_recall = mean(Recall_i)
Macro_F1 = mean(F1_i)
Weighted averaging calculates the mean of the metric weighted by the number of samples in each class:
Weighted_precision = sum(n_i * Precision_i) / sum(n_i)
Weighted_recall = sum(n_i * Recall_i) / sum(n_i)
Weighted_F1 = sum(n_i * F1_i) / sum(n_i)
where n_i is the number of samples in class i.
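A minimal self-contained sketch of both averaging strategies applied to the F1 score, reusing the three-class confusion matrix from the example above:

```python
# Rows = actual class, columns = predicted class (order A, B, C).
cm = [
    [16, 2, 1],   # actual A
    [0, 17, 1],   # actual B
    [0, 0, 11],   # actual C
]
k = len(cm)

f1_scores, supports = [], []
for i in range(k):
    tp = cm[i][i]
    fn = sum(cm[i][j] for j in range(k) if j != i)
    fp = sum(cm[j][i] for j in range(k) if j != i)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1_scores.append(2 * precision * recall / (precision + recall) if precision + recall else 0.0)
    supports.append(sum(cm[i]))   # n_i: number of samples in class i

macro_f1 = sum(f1_scores) / k
weighted_f1 = sum(n_i * f1_i for n_i, f1_i in zip(supports, f1_scores)) / sum(supports)

# Both land close to 0.92 here, because the three classes have similar sizes.
print(round(macro_f1, 2), round(weighted_f1, 2))
```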
Which averaging strategy to use depends on the specific problem and the desired outcome:
- If it is important to give equal weight to all classes, then macro averaging may be a good choice
- If it is important to give more weight to larger classes, then weighted averaging may be a better choice.
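For comparison, a sketch of how scikit-learn (assumed available) exposes the same choice through the average parameter; the label vectors are rebuilt from the example confusion matrix purely for illustration:

```python
from sklearn.metrics import f1_score

# Rebuild label vectors from the example confusion matrix (rows = actual, columns = predicted).
cm = [
    [16, 2, 1],   # actual A
    [0, 17, 1],   # actual B
    [0, 0, 11],   # actual C
]
classes = ["A", "B", "C"]

y_true, y_pred = [], []
for i, row in enumerate(cm):
    for j, count in enumerate(row):
        y_true += [classes[i]] * count
        y_pred += [classes[j]] * count

print(f1_score(y_true, y_pred, average="macro"))     # equal weight for every class
print(f1_score(y_true, y_pred, average="weighted"))  # weighted by class support n_i
```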