Feature selection methods

  • Recursive Feature Elimination (RFE): recursively removes features and builds the model to identify which combination of features gives the best performance
  • Feature reset: zeroing one feature at a time in the validation set and measuring the resulting change in accuracy (a code sketch of this and the permutation method follows this list)
  • Permutation: measures the decrease in model performance when the values of a specific feature are randomly shuffled
    • If shuffling a feature decreases performance significantly, that feature is deemed important
  • Partial Dependence Plots (PDP): show the relationship between a feature and the predicted outcome by averaging predictions across the dataset
    • They help visualize how a feature impacts the model output
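
A minimal sketch of the feature-reset and permutation approaches, assuming a scikit-learn classifier held out against a validation split; the dataset, estimator, and split below are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Illustrative data and model; any fitted estimator works the same way
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline = model.score(X_val, y_val)

# Feature reset: zero one feature at a time and measure the accuracy drop
for j in range(X_val.shape[1]):
    X_zeroed = X_val.copy()
    X_zeroed[:, j] = 0.0
    print(f"feature {j}: reset importance = {baseline - model.score(X_zeroed, y_val):.4f}")

# Permutation: shuffle one feature at a time and measure the accuracy drop
perm = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for j, (mean, std) in enumerate(zip(perm.importances_mean, perm.importances_std)):
    print(f"feature {j}: permutation importance = {mean:.4f} +/- {std:.4f}")
```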

Recursive Feature Elimination (RFE)

    1. Train the Model: Start with all features and train a model (e.g., a linear regression, decision tree, or random forest)
    2. Rank Features: Rank the features based on their importance, such as the magnitude of the coefficients for linear models or the Gini importance in tree-based models
    3. Remove the Least Important Feature: Remove the feature with the lowest importance score
    4. Re-train the Model: Train the model on the reduced set of features
    5. Repeat: Continue until the desired number of features is reached (see the sketch below)
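
A minimal sketch of this loop using scikit-learn's RFE wrapper; the estimator and the target number of features are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Repeatedly fit the estimator, rank features (here by Gini importance),
# and drop the least important one until 5 features remain
selector = RFE(RandomForestClassifier(random_state=0), n_features_to_select=5, step=1)
selector.fit(X, y)

print("selected features:", selector.support_)   # boolean mask of kept features
print("ranking:", selector.ranking_)             # 1 = kept, higher = eliminated earlier
```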

Partial Dependence Plots (PDP)

    1. Select the feature of interest: Choose a feature, say F, whose impact on the model predictions you want to analyze (e.g., packet_length: the average size of packets in a network flow)
    2. Sample data points: From the dataset, select a subset of data points XPDP = {x1, x2, …, xn}
    • This could be the entire dataset or a sample of it, depending on computational resources and the size of the dataset
    3. Vary the selected feature: Let F̂ = {fj}, j ∈ [1,n], be the set of values of F observed in XPDP (e.g., 50, 64, 77, 200 bytes)
    • For each data point xi in XPDP, vary the value of the chosen feature F across its values in F̂, keeping all other feature values fixed
    • This produces a new set of data points X̂PDP = {xij}, i, j ∈ [1,n]
    4. Make predictions: For each new data point xij, compute the model’s prediction ŷij
    • This produces a set of predicted values {ŷij}, i, j ∈ [1,n]
    5. Average the predictions: For each value fj in F̂, compute the average prediction over all data points. This can be represented as:

$$\hat{y}(f_j) = \frac{1}{n} \sum_{i=1}^n \hat{y}_{ij}, \quad j \in [1, n]$$

    6. Plot the results: Plot the average predicted values ŷ(fj) for j ∈ [1,n]. This plot shows the partial dependence of the model’s predictions on F

This process gives insight into how the feature F influences the predictions of the model, while accounting for the average effect of other features
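
A minimal sketch of the procedure above, assuming a fitted model with a predict method and a 2-D NumPy feature matrix; the grid is taken from the observed values of the feature, as in step 3:

```python
import numpy as np

def partial_dependence_1d(model, X_pdp, feature_idx):
    """Average prediction as a function of one feature, all others held fixed."""
    grid = np.unique(X_pdp[:, feature_idx])           # observed values f_j of feature F
    averages = []
    for value in grid:
        X_mod = X_pdp.copy()
        X_mod[:, feature_idx] = value                 # set feature F of every x_i to f_j
        averages.append(model.predict(X_mod).mean())  # average of y_hat_ij over i
    return grid, np.array(averages)
```

scikit-learn's sklearn.inspection.PartialDependenceDisplay.from_estimator produces the same kind of plot, using its own grid over the feature's range rather than the observed values.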

Interpretation

  • Positive or Negative Slope: If the PDP curve has a positive slope, the feature has a positive relationship with the target variable; if it has a negative slope, the relationship is negative
  • Flat PDP: If the PDP is relatively flat, it means that the feature does not significantly impact the model’s predictions, implying low importance
  • Non-linear Patterns: PDPs can capture non-linear relationships, showing how the effect of a feature on the prediction can vary across its range

Feature reset and permutation: negative values

These methods can return negative importance values for one or more features (an illustration follows this list):

  • Irrelevant or Noisy Features: If a feature doesn’t contain relevant information or if it’s noisy (contains random or irrelevant data), removing it can improve the model’s performance
  • Multicollinearity: When two or more features are highly correlated, they can provide redundant information to the model. This can make the importance estimates unstable
    • Removing one of these features can improve the performance
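
A small illustration of the first point, reusing the earlier permutation sketch: appending a pure-noise column typically yields a near-zero, and often negative, permutation importance for it (the exact sign depends on the random seed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # last column is pure noise

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

perm = permutation_importance(model, X_val, y_val, n_repeats=30, random_state=0)
print("noise feature importance:", perm.importances_mean[-1])  # ~0, can be negative
```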

Considerations on feature importance

Random Forests can be useful to get a first ranking of the features (see the sketch after this list)

  • Excluding irrelevant features reduces model training and tuning time
  • The ranking can be different from that obtained with method-agnostic approaches
  • Method-agnostic approaches can be used to further tune the model
    • Removing irrelevant features improves inference time
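
A minimal sketch of obtaining that first ranking from a Random Forest's built-in impurity-based (Gini) importances; the dataset is illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Sort features by Gini importance, highest first
ranking = np.argsort(model.feature_importances_)[::-1]
for j in ranking:
    print(f"feature {j}: importance = {model.feature_importances_[j]:.4f}")
```

This ranking can then be cross-checked or refined with the method-agnostic approaches above before retraining on the reduced feature set.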