Encodings
Categorical encoding
Definition: Categorical feature encoding is the process of converting categorical variables (variables that take one of a limited, fixed set of values) into numerical representations that machine learning algorithms can use.
Several encoding methods exist, including label encoding, one-hot encoding, binary encoding, ordinal encoding, count encoding, and feature hashing.
Encoding methods
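Label encoding: Assigns a unique integer to each category. The integers are arbitrary identifiers, but they still imply an order, so this works best where that implied order is harmless (for example, class labels); for categories with a meaningful ranking, see ordinal encoding below. Example: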
['Benign', 'SynFlood', 'WebDDoS'] -> [0, 1, 2]
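A minimal sketch of label encoding in plain Python may help here. The function name `label_encode` and the choice to number categories in order of first appearance are assumptions made for illustration, not part of the original notes.

```python
def label_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # next unused integer
    return [mapping[v] for v in values], mapping


labels = ['Benign', 'SynFlood', 'WebDDoS', 'SynFlood']
codes, mapping = label_encode(labels)
print(codes)    # [0, 1, 2, 1]
print(mapping)  # {'Benign': 0, 'SynFlood': 1, 'WebDDoS': 2}
```

Keeping the mapping around makes it possible to encode new data consistently and to decode predictions back into category names.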
One-Hot Encoding: Creates one binary column per category, so n categories produce n binary columns. Each column indicates the presence or absence of the corresponding category. Example:
['Benign', 'SynFlood', 'WebDDoS'] ->
Benign: [1, 0, 0]
SynFlood: [0, 1, 0]
WebDDoS: [0, 0, 1]
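The one-hot scheme above can be sketched in a few lines of plain Python. The helper name `one_hot_encode` and the explicit `categories` argument (which fixes the column order) are illustrative assumptions.

```python
def one_hot_encode(values, categories):
    """Return one row per value, with a single 1 in the column of its category."""
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column for this category
        rows.append(row)
    return rows


categories = ['Benign', 'SynFlood', 'WebDDoS']
encoded = one_hot_encode(['SynFlood', 'Benign', 'WebDDoS'], categories)
print(encoded)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
```

In this sketch a value that is not listed in `categories` raises a `KeyError`; how to treat unseen categories is a design choice left open here.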
Binary Encoding: Combines the advantages of label encoding and one-hot encoding. It first assigns a unique integer to each category and then writes that integer in binary, one column per bit, so n categories need only about log2(n) columns instead of n. Example:
['Benign', 'SynFlood', 'WebDDoS'] ->
Benign: 0 -> 00
SynFlood: 1 -> 01
WebDDoS: 2 -> 10
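As a rough sketch, binary encoding can be written as label encoding followed by a fixed-width binary expansion. The name `binary_encode` and the rule used to pick the bit width (just enough bits for the largest index) are assumptions for illustration.

```python
def binary_encode(values, categories):
    """Encode each value as the bits of its category index, one column per bit."""
    # Number of bits needed to represent the largest index (at least one bit).
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    return [[int(b) for b in format(index[v], f'0{width}b')] for v in values]


categories = ['Benign', 'SynFlood', 'WebDDoS']
bits = binary_encode(categories, categories)
print(bits)  # [[0, 0], [0, 1], [1, 0]]
```

With three categories two columns suffice, whereas one-hot encoding would need three.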
Ordinal Encoding: Assigns integers to categories based on the specified order. Useful for ordinal data where categories have a meaningful ranking. Example:
['Low', 'Medium', 'High'] -> [0, 1, 2]
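A small sketch of ordinal encoding, assuming the ranking is supplied explicitly by the caller; the function name `ordinal_encode` and the `order` parameter are hypothetical.

```python
def ordinal_encode(values, order):
    """Replace each value with its rank in the caller-supplied ordering."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]


levels = ordinal_encode(['High', 'Low', 'Medium', 'Low'],
                        order=['Low', 'Medium', 'High'])
print(levels)  # [2, 0, 1, 0]
```

The only difference from label encoding is that the integers come from a stated ranking rather than from an arbitrary assignment.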
Count Encoding: Replaces each category with the number of times it occurs in the dataset. Useful when how often a category occurs is itself informative. For instance, counting how many packets in a flow have a given TCP flag set:
['SYN', 'ACK', 'SYN', 'FIN', 'SYN'] -> [3, 1, 3, 1, 3]
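Count encoding is straightforward to sketch with the standard library's `collections.Counter`. The function name `count_encode` is an illustrative assumption.

```python
from collections import Counter


def count_encode(values):
    """Replace each value with the number of times it occurs in the input."""
    counts = Counter(values)
    return [counts[v] for v in values]


flags = ['SYN', 'ACK', 'SYN', 'FIN', 'SYN']
print(count_encode(flags))  # [3, 1, 3, 1, 3]
```

Note that two different categories with the same count (here ACK and FIN) become indistinguishable after encoding.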
Feature Hashing: Converts categories into numerical values using a hash function. Useful when the number of unique categories is large.
For instance, tshark reports the highest-layer protocol seen in each packet. Because the number of possible protocol names is large, one way to turn these strings into numbers is to compute their hash:
['HTTP', 'DHCP', 'TLS'] -> [491, 191, 602] (using a hash function)
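One possible way to hash protocol names into a bounded integer range is sketched below. The choice of MD5, the bucket count of 1024, and the name `hash_encode` are all assumptions made for illustration; the notes do not say which hash function produced the numbers in the example, so this sketch is not expected to reproduce them.

```python
import hashlib


def hash_encode(value, n_buckets=1024):
    """Map a string to an integer in [0, n_buckets) with a stable hash."""
    # Python's built-in hash() is salted per process for strings,
    # so a digest from hashlib is used to get reproducible values.
    digest = hashlib.md5(value.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % n_buckets


protocols = ['HTTP', 'DHCP', 'TLS']
codes = [hash_encode(p) for p in protocols]
print(codes)
```

Because the output range is fixed, two different protocol names can collide on the same value; a larger `n_buckets` makes this less likely at the cost of a wider range.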