Feature selection methods
- Recursive Feature Elimination (RFE): recursively removes features, rebuilding the model at each step, to identify which subset of features gives the best performance
- Feature reset: zeroing one feature at a time in the validation set and measuring the resulting change in accuracy
- Permutation: measures the decrease in model performance when the values of a specific feature are randomly shuffled (a sketch of both the reset and permutation approaches follows this list)
- If shuffling a feature decreases performance significantly, that feature is deemed important
- Partial Dependence Plots (PDP): show the relationship between a feature and the predicted outcome by averaging predictions across the dataset
- It helps visualize how a feature impacts the model output
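A minimal sketch of the feature-reset and permutation approaches, assuming a generic scikit-learn classifier and a held-out validation set; the dataset, model, and split below are placeholders, not part of the original material:

```python
# Feature reset (zeroing) and permutation importance on a validation set.
# Dataset and model are placeholder assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

baseline = accuracy_score(y_val, model.predict(X_val))

# Feature reset: zero one feature at a time and measure the accuracy drop.
for j in range(X_val.shape[1]):
    X_zeroed = X_val.copy()
    X_zeroed[:, j] = 0.0
    drop = baseline - accuracy_score(y_val, model.predict(X_zeroed))
    print(f"feature {j}: reset importance = {drop:.4f}")

# Permutation: shuffle one feature at a time, repeated for stability.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                scoring="accuracy", random_state=0)
for j, imp in enumerate(result.importances_mean):
    print(f"feature {j}: permutation importance = {imp:.4f}")
```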
Recursive Feature Elimination (RFE)
- Train the Model: Start with all features and train a model (e.g., a linear regression, decision tree, or random forest)
- Rank Features: Rank the features based on their importance, such as the magnitude of the coefficients for linear models or Gini importance in tree-based models
- Remove the Least Important Feature: Remove the feature with the lowest importance score
- Re-train the Model: Train the model on the reduced set of features
- Repeat: Continue until the desired number of features is reached
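A minimal sketch of this loop using scikit-learn's RFE; the random-forest estimator, synthetic dataset, and target number of features are placeholder assumptions:

```python
# RFE sketch: scikit-learn repeats the train / rank / remove loop
# internally until `n_features_to_select` features remain.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)

selector = RFE(estimator=RandomForestClassifier(random_state=0),
               n_features_to_select=4,  # stop when 4 features remain
               step=1)                  # drop one feature per iteration
selector.fit(X, y)

print("selected features:", selector.support_)    # boolean mask of kept features
print("elimination ranking:", selector.ranking_)  # 1 = kept, higher = dropped earlier
```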
Partial Dependence Plots (PDP)
- Select the feature of interest: Choose a feature, say F, whose impact on the model predictions you want to analyze (e.g., packet_length: the average size of packets in a network flow)
- Sample data points: From the dataset, select a subset of data points X_PDP = {x_1, x_2, …, x_n}
- This could be the entire dataset or a sample of it, depending on computational resources and the size of the dataset
- Vary the selected feature: Let F̂ = {f_j}, j ∈ [1, n], be the set of values of F observed in X_PDP (e.g., 50, 64, 77, 200 bytes)
- For each data point x_i ∈ X_PDP, vary the value of the chosen feature F across its values in F̂, keeping all other feature values fixed
- This produces a new set of data points X̂_PDP = {x_ij}, with i, j ∈ [1, n]
- Make predictions: For each new data point x_ij, compute the model’s prediction ŷ_ij
- This produces a set of predicted values {ŷ_ij}, with i, j ∈ [1, n]
- Average the predictions: For each value f_j ∈ F̂, compute the average prediction over all data points. This can be represented as:
$$\hat{y}(f_j) = \frac{1}{n} \sum_{i=1}^n \hat{y}_{ij}, \quad j \in [1, n]$$
- Plot the results: Plot the average predicted values ŷ(f_j) for j ∈ [1, n]. This plot shows the partial dependence of the model’s predictions on F
This process gives insight into how the feature F influences the predictions of the model, while accounting for the average effect of the other features
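A minimal sketch of this procedure, mirroring the steps above; the regressor, synthetic data, sample size, and `feature_idx` are placeholder assumptions:

```python
# PDP sketch: vary one feature across the values observed in a sample,
# keep the other features fixed, and average the model's predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X_pdp, feature_idx):
    f_values = X_pdp[:, feature_idx]              # F̂: observed values of F
    averaged = []
    for f_j in f_values:
        X_varied = X_pdp.copy()
        X_varied[:, feature_idx] = f_j            # set F = f_j for every point
        averaged.append(model.predict(X_varied).mean())  # ŷ(f_j)
    return f_values, np.array(averaged)

X, y = make_regression(n_samples=500, n_features=5, n_informative=3,
                       random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

X_pdp = X[:100]                                   # subset of data points
f_vals, pdp_vals = partial_dependence_1d(model, X_pdp, feature_idx=0)
order = np.argsort(f_vals)
print(list(zip(f_vals[order][:5], pdp_vals[order][:5])))  # values ready to plot
```

scikit-learn's `sklearn.inspection.PartialDependenceDisplay.from_estimator` performs the same computation over a grid of feature values and plots the curve directly.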
Interpretation
- Positive or Negative Slope: If the PDP curve has a positive slope, the feature has a positive relationship with the target variable; if it has a negative slope, the relationship is negative
- Flat PDP: If the PDP is relatively flat, it means that the feature does not significantly impact the model’s predictions, implying low importance
- Non-linear Patterns: PDPs can capture non-linear relationships, showing how the effect of a feature on the prediction can vary across its range
Feature reset and permutation: negative values
These methods can return negative values for one or more features:
- Irrelevant or Noisy Features: If a feature doesn’t contain relevant information or if it’s noisy (contains random or irrelevant data), zeroing or shuffling it can actually improve the model’s performance, which shows up as a negative importance
- Multicollinearity: When two or more features are highly correlated, they can provide redundant information to the model. This can lead to instability
- Removing one of these features can improve the performance
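As an illustration, the sketch below appends a pure-noise column to a placeholder dataset; its permutation importance typically comes out near zero and often slightly negative:

```python
# A pure-noise feature often gets a near-zero or slightly negative permutation
# importance: shuffling it does not hurt (and may by chance improve) accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4,
                           n_redundant=0, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])   # add a noise feature

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_val, y_val, n_repeats=20,
                                random_state=0)
print("noise feature importance:", result.importances_mean[-1])
```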
Considerations on feature importance
Random Forests can be useful to get a first feature ranking
- Excluding irrelevant features reduces model training and tuning time
- The ranking can differ from that obtained with model-agnostic approaches
- Model-agnostic approaches can be used to further tune the model
- Removing irrelevant features improves inference time
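A quick way to obtain such a first ranking from a Random Forest's impurity-based (Gini) importances; the synthetic data is a placeholder:

```python
# First-pass feature ranking from a Random Forest's impurity-based importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: importance = {model.feature_importances_[idx]:.4f}")
```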