Feature selection methods
- Recursive Feature Elimination (RFE): recursively removes features, rebuilding the model at each step, to identify which subset of features gives the best performance
- Feature reset: zeroing one feature at a time in the validation set and measuring the resulting change in accuracy
- Permutation: measures the decrease in model performance when the values of a specific feature are randomly shuffled (a sketch of both the reset and permutation approaches follows this list)
- If shuffling a feature decreases performance significantly, that feature is deemed important
- Partial Dependence Plots (PDP): show the relationship between a feature and the predicted outcome by averaging predictions across the dataset
- It helps visualize how a feature impacts the model output
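A minimal sketch of the feature-reset and permutation approaches, assuming a generic scikit-learn classifier and a held-out validation set; the dataset, model, and split below are placeholders, not part of the original material:

```python
# Feature reset (zeroing) and permutation importance on a validation set.
# Dataset and model are placeholder assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

baseline = accuracy_score(y_val, model.predict(X_val))

# Feature reset: zero one feature at a time and measure the accuracy drop.
for j in range(X_val.shape[1]):
    X_zeroed = X_val.copy()
    X_zeroed[:, j] = 0.0
    drop = baseline - accuracy_score(y_val, model.predict(X_zeroed))
    print(f"feature {j}: reset importance = {drop:.4f}")

# Permutation: shuffle one feature at a time, repeated for stability.
result = permutation_importance(model, X_val, y_val, n_repeats=10,
                                scoring="accuracy", random_state=0)
for j, imp in enumerate(result.importances_mean):
    print(f"feature {j}: permutation importance = {imp:.4f}")
```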
Recursive Feature Elimination (RFE)
- Train the Model: Start with all features and train a model (e.g., a linear regression, decision tree, or random forest)
- Rank Features: Rank the features based on their importance, such as the magnitude of the coefficients for linear models or Gini importance in tree-based models
- Remove the Least Important Feature: Remove the feature with the lowest importance score
- Re-train the Model: Train the model on the reduced set of features
- Repeat: Continue until the desired number of features is reached
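A minimal sketch of this loop using scikit-learn's RFE; the random-forest estimator, synthetic dataset, and target number of features are placeholder assumptions:

```python
# RFE sketch: scikit-learn repeats the train / rank / remove loop
# internally until `n_features_to_select` features remain.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=0)

selector = RFE(estimator=RandomForestClassifier(random_state=0),
               n_features_to_select=4,  # stop when 4 features remain
               step=1)                  # drop one feature per iteration
selector.fit(X, y)

print("selected features:", selector.support_)    # boolean mask of kept features
print("elimination ranking:", selector.ranking_)  # 1 = kept, higher = dropped earlier
```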
Partial Dependence Plots (PDP)
- Select the feature of interest: Choose a feature, say F, whose impact on the model predictions you want to analyze (e.g., packet_length: the average size of packets in a network flow)
- Sample data points: From the dataset, select a subset of data points X_PDP = {x_1, x_2, …, x_n}
- This could be the entire dataset or a sample of it, depending on computational resources and the size of the dataset
- Vary the selected feature: Let F̂ = {f_j}, j ∈ [1, n], be the set of values of F observed in X_PDP (e.g., 50, 64, 77, 200 bytes)
- For each data point x_i ∈ X_PDP, vary the value of the chosen feature F across its values in F̂, keeping all other feature values fixed
- This produces a new set of data points X̂_PDP = {x_ij}, with i, j ∈ [1, n]
- Make predictions: For each new data point x_ij, compute the model’s prediction ŷ_ij
- This produces a set of predicted values {ŷ_ij}, with i, j ∈ [1, n]
- Average the predictions: For each value f_j ∈ F̂, compute the average prediction over all data points. This can be represented as:
$$\hat{y}(f_j) = \frac{1}{n} \sum_{i=1}^n \hat{y}_{ij}, \quad j \in [1, n]$$
- Plot the results: Plot the average predicted values ŷ(f_j) for j ∈ [1, n]. This plot shows the partial dependence of the model’s predictions on F
This process gives insight into how the feature F influences the predictions of the model, while accounting for the average effect of the other features
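A minimal sketch of this procedure, mirroring the steps above; the regressor, synthetic data, sample size, and `feature_idx` are placeholder assumptions:

```python
# PDP sketch: vary one feature across the values observed in a sample,
# keep the other features fixed, and average the model's predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X_pdp, feature_idx):
    f_values = X_pdp[:, feature_idx]              # F̂: observed values of F
    averaged = []
    for f_j in f_values:
        X_varied = X_pdp.copy()
        X_varied[:, feature_idx] = f_j            # set F = f_j for every point
        averaged.append(model.predict(X_varied).mean())  # ŷ(f_j)
    return f_values, np.array(averaged)

X, y = make_regression(n_samples=500, n_features=5, n_informative=3,
                       random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

X_pdp = X[:100]                                   # subset of data points
f_vals, pdp_vals = partial_dependence_1d(model, X_pdp, feature_idx=0)
order = np.argsort(f_vals)
print(list(zip(f_vals[order][:5], pdp_vals[order][:5])))  # values ready to plot
```

scikit-learn's `sklearn.inspection.PartialDependenceDisplay.from_estimator` performs the same computation over a grid of feature values and plots the curve directly.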
Interpretation
- Positive or Negative Slope: If the PDP curve has a positive slope, the feature has a positive relationship with the target variable; if it has a negative slope, the relationship is negative
- Flat PDP: If the PDP is relatively flat, it means that the feature does not significantly impact the model’s predictions, implying low importance
- Non-linear Patterns: PDPs can capture non-linear relationships, showing how the effect of a feature on the prediction can vary across its range
Feature reset and permutation: negative values
These methods can return negative values for one or more features:
- Irrelevant or Noisy Features: If a feature doesn’t contain relevant information or if it’s noisy (contains random or irrelevant data), zeroing or shuffling it can actually improve the model’s performance, which shows up as a negative importance
- Multicollinearity: When two or more features are highly correlated, they can provide redundant information to the model. This can lead to instability
- Removing one of these features can improve the performance
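As an illustration, the sketch below appends a pure-noise column to a placeholder dataset; its permutation importance typically comes out near zero and often slightly negative:

```python
# A pure-noise feature often gets a near-zero or slightly negative permutation
# importance: shuffling it does not hurt (and may by chance improve) accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=5, n_informative=4,
                           n_redundant=0, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])   # add a noise feature

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_val, y_val, n_repeats=20,
                                random_state=0)
print("noise feature importance:", result.importances_mean[-1])
```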
Considerations on feature importance
Random Forests can be useful to get a first feature ranking
- Excluding irrelevant features reduces model training and tuning time
- The ranking can differ from that obtained with model-agnostic approaches
- Model-agnostic approaches can be used to further tune the model
- Removing irrelevant features improves inference time
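A quick way to obtain such a first ranking from a Random Forest's impurity-based (Gini) importances; the synthetic data is a placeholder:

```python
# First-pass feature ranking from a Random Forest's impurity-based importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=8, n_informative=3,
                           random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

ranking = np.argsort(model.feature_importances_)[::-1]  # most important first
for idx in ranking:
    print(f"feature {idx}: importance = {model.feature_importances_[idx]:.4f}")
```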