Datasets
Definition: In the context of cybersecurity applications, a network traffic dataset refers to a structured and organised collection of digital communications and interactions that occur within a computer network.
Dataset challenges
The availability of good-quality datasets is a critical factor in the development of effective ML-based cybersecurity applications. However, there are a number of challenges associated with collecting and curating these datasets.
These challenges include:
- The volume of data: The amount of data generated by cyberattacks is constantly growing. This makes it difficult to collect and store enough data to train effective ML models.
- The diversity of data: Cyberattacks are constantly evolving, so the data used to train ML models must be updated regularly. This is a challenge because it requires continuous access to up-to-date data.
- The labelling of data: Many ML-based cybersecurity applications require labelled data, which means that each data point needs to be manually classified as malicious or benign. This can be a time-consuming and expensive process.
Why do we need datasets?
- Training Models: Machine learning algorithms, especially supervised learning, require large and diverse datasets to train models effectively. These models learn from patterns in the data and use this knowledge to make predictions or classifications.
- In cybersecurity, ML models can be trained to recognise normal network behaviour and identify anomalies or potential threats.
- Lack of realistic training network data: Obtaining high-quality network traffic datasets for cybersecurity applications can be challenging due to various reasons, including privacy concerns and proprietary considerations.
Lack of realistic training network data
- Privacy Concerns: Network traffic contains sensitive information, and sharing it can potentially violate user privacy. Personally identifiable information (PII), confidential business data, and other sensitive content can be inadvertently included in network traffic. This makes organisations and Internet Providers cautious of sharing their data, especially when it involves external parties
- Proprietary Information: Industries often consider their network traffic patterns and configurations to be proprietary and a source of competitive advantage. Sharing this information might lead to a loss of intellectual property or of their edge over competitors
- Data Volume and Diversity: Even when organisations are willing to share data, the sheer volume of network traffic generated can be overwhelming. It might be challenging to capture, store, and share representative datasets that cover a wide range of scenarios and threats
- Data Bias and Representativeness: Datasets need to accurately represent the real-world network traffic environment to ensure the effectiveness of ML models. Biased or incomplete datasets might lead to models that perform poorly
- Data Preprocessing: Raw network traffic data requires significant preprocessing before it’s suitable for machine learning. Cleaning, anonymisation, and structuring the data while maintaining its integrity can be complex
Common solutions
Data anonymisation: The objective is to remove or alter any information that could identify the source, destination, or content of specific network communications. This is vital for sharing network traffic data for research, analysis, and threat detection while mitigating privacy risks. Data anonymisation techniques include IP address anonymisation, timestamp noise, port and protocol aggregation, and payload encryption or truncation.
Public Datasets: Some organisations and research institutions release network traffic datasets to the public for research purposes. However, these datasets might not fully represent real-world scenarios, as they are usually generated in controlled environments (testbeds).
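The anonymisation techniques listed above can be illustrated with a minimal Python sketch. It is not a production anonymiser: the function names, the secret key and the port classes are assumptions made for illustration, and the keyed hash used here is only one of several possible ways to pseudonymise IP addresses (it does not preserve subnet prefixes).

```python
import hashlib
import hmac
import ipaddress
import random

# Hypothetical secret, kept private by the data owner.
SECRET_KEY = b"replace-with-a-random-secret"


def anonymise_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Map an IPv4 address to a pseudonymous IPv4 address with a keyed hash."""
    digest = hmac.new(key, ipaddress.IPv4Address(ip).packed, hashlib.sha256).digest()
    # The first 4 bytes of the digest become the replacement address.
    return str(ipaddress.IPv4Address(digest[:4]))


def add_timestamp_noise(ts: float, max_jitter: float, rng: random.Random) -> float:
    """Shift a timestamp by a uniform offset in [-max_jitter, +max_jitter] seconds."""
    return ts + rng.uniform(-max_jitter, max_jitter)


def aggregate_port(port: int) -> str:
    """Replace an exact port number with its IANA range class."""
    if port < 1024:
        return "well-known"
    if port < 49152:
        return "registered"
    return "dynamic"


def truncate_payload(payload: bytes, keep: int = 0) -> bytes:
    """Keep only the first `keep` bytes of the payload (none by default)."""
    return payload[:keep]
```

The same input address always maps to the same pseudonym under a given key, so flows can still be grouped by endpoint after anonymisation. Note that per-packet timestamp noise can change the order of packets, which may matter for features based on inter-arrival times.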
Public datasets (CIC)
The Canadian Institute for Cybersecurity (CIC) (https://www.unb.ca/cic/index.html) has released several datasets for research purposes, primarily focusing on intrusion detection and network security. The datasets are publicly available in the form of prerecorded traffic traces, including full packet payloads, plus supplementary text files containing labels and statistical details for each traffic flow.
- CICIDS2017: The CICIDS2017 dataset is designed for research in network intrusion detection systems (IDS). It is a comprehensive dataset that includes a diverse range of network traffic scenarios, both benign and malicious. The dataset contains various types of attacks, including DoS (Denial of Service), DDoS (Distributed Denial of Service), and brute-force attacks. Link: https://www.unb.ca/cic/datasets/ids-2017.html
- CICDDoS2019 consists of several days of network activity, and includes both benign traffic and 13 different types of DDoS attacks. Link: https://www.unb.ca/cic/datasets/ddos-2019.html
Other popular datasets (UNSW)
- The UNSW-NB15 dataset contains network traffic data that includes both normal (benign) traffic and various types of network attacks. It covers 9 network attack categories, including probing, DoS, worms, etc. Along with the pcap files, the dataset includes csv files with 49 features and the class label of each traffic flow. Link: https://research.unsw.edu.au/projects/unsw-nb15-dataset
- The BoT-IoT dataset was created to address the growing concern of security in IoT ecosystems. It contains network traffic data from IoT devices, which are susceptible to various types of attacks. The dataset includes both legitimate IoT device traffic and malicious network activities such as probing, DoS and DDoS attacks. Link: https://research.unsw.edu.au/projects/bot-iot-dataset
Other datasets
- The MAWILab dataset is a collection of packet-level network traffic traces captured from various parts of the internet. Link: http://www.fukuda-lab.org/mawilab/index.html
- The Kitsune dataset contains nine different network attacks on a commercial IP-based surveillance system and an IoT network. The dataset includes reconnaissance, MitM, DoS, and botnet attacks. Link: https://archive.ics.uci.edu/dataset/516/kitsune+network+attack+dataset
Industrial Control Systems
- The SWaT (Secure Water Treatment) [1] dataset was created to support research in the security of water treatment plants and contains sensor readings and actuator states. It includes cyber attacks that manipulate the sensor readings in order to cause physical damage (e.g., tank overflow).
- The EPIC (Electrical Power and Intelligent Control) [2] dataset contains data collected from a power-grid testbed (called EPIC). The testbed includes energy sources such as motor-driven generators and photovoltaic panels, and batteries. The dataset includes various types of attacks, such as power supply interruptions, cyber-attacks and physical damage.
[1] Lamshöft, Kevin, et al. “Information hiding in cyber physical systems: Challenges for embedding, retrieval and detection using sensor data of the SWAT dataset.” Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security. 2021.
[2] Ahmed, Chuadhry Mujeeb, and Nandha Kumar Kandasamy. “A comprehensive dataset from a smart grid testbed for machine learning based cps security research.” Cyber-Physical Security for Critical Infrastructures Protection: First International Workshop, CPS4CIP 2020, Guildford, UK, September 18, 2020, Revised Selected Papers 1. Springer International Publishing, 2021.
Useful details of the CIC datasets
- The benign traffic of the dataset has been generated using distribution models for web (HTTP/S), remote shell (SSH), file transfer (FTP) and email (SMTP) applications
- Malicious data include network intrusions (DDoS, Brute Force attacks, port scans, etc.). The attacks have been generated by using publicly available tools (e.g., HOIC, LOIC, Hydra, Ares botnet, etc.) or with custom Python scripts.
Statistical traffic features
- Statistics of the network traffic are available in .csv files that can be downloaded with the pcap files.
- Each line of a csv file contains 80+ statistics of a bidirectional network traffic flow (either benign or malicious), plus the label assigned to that flow (e.g., Benign, DDoS, PortScan, etc.).
- Each line starts with the 5-tuple identifier of the traffic flow, followed by statistical features related to the flow such as Flow Duration, Flow IAT Mean, Fwd Packet Length Mean, etc.
Flow ID | Source IP | Source Port | Destination IP | Destination Port | Protocol | Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Label |
---|---|---|---|---|---|---|---|---|---|---|---|---|
192.168.10.5-104.16.207.165-54865-443-6 | 104.16.207.165 | 443 | 192.168.10.5 | 54865 | | | | | 12 | | | BENIGN |
172.16.0.1-192.168.10.50-49650-80-6 | 172.16.0.1 | 49650 | 192.168.10.50 | 80 | 6 | 1293792 | | | 26 | 11607 | 20 | DDoS |
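To make the meaning of some of these columns concrete, the following minimal sketch computes a handful of flow-level statistics from the packets of a single flow. It is only an illustration and not the tool used to produce the CIC csv files: the input format (timestamp in seconds, packet length in bytes, direction), the function name and the sample packets are all assumptions, and the units used in the dataset files may differ.

```python
from statistics import mean


def flow_features(packets):
    """Compute a few flow-level statistics.

    `packets` is a non-empty list of (timestamp, length, direction) tuples,
    sorted by timestamp, where direction is "fwd" or "bwd".
    """
    times = [t for t, _, _ in packets]
    fwd = [n for _, n, d in packets if d == "fwd"]
    bwd = [n for _, n, d in packets if d == "bwd"]
    # Inter-arrival times (IAT) between consecutive packets of the flow.
    iats = [b - a for a, b in zip(times, times[1:])]
    return {
        "Flow Duration": times[-1] - times[0],
        "Flow IAT Mean": mean(iats) if iats else 0.0,
        "Total Fwd Packets": len(fwd),
        "Total Backward Packets": len(bwd),
        "Total Length of Fwd Packets": sum(fwd),
        "Total Length of Bwd Packets": sum(bwd),
        "Fwd Packet Length Max": max(fwd, default=0),
        "Fwd Packet Length Mean": mean(fwd) if fwd else 0.0,
    }


# Hypothetical flow with two forward and two backward packets.
packets = [(0.0, 60, "fwd"), (0.5, 60, "bwd"), (1.0, 100, "fwd"), (2.0, 1500, "bwd")]
features = flow_features(packets)
```

Each statistic summarises the whole flow in a single number, which is why the order and the individual attributes of the packets are no longer visible in this representation.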
Packet-level features
Packet-level features refer to packet attributes that can be extracted from both packet headers and payloads and used as input to ML models for network anomaly detection
Examples of header features:
- TCP flags (SYN, ACK, FIN, RST, etc.)
- ICMP message types
- IP Flags (Don’t fragment (DF), More Fragments (MF))
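As a small illustration of how such header fields can become model inputs, the sketch below decodes the TCP flags byte and the 3-bit IPv4 flags field into named flags and into a multi-hot vector. The function names are hypothetical; the bit positions follow the standard TCP and IPv4 header layouts.

```python
# Bit masks of the TCP flags byte, from least to most significant bit.
TCP_FLAG_BITS = [
    (0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"), (0x08, "PSH"),
    (0x10, "ACK"), (0x20, "URG"), (0x40, "ECE"), (0x80, "CWR"),
]


def decode_tcp_flags(flags: int) -> list[str]:
    """Return the names of the TCP flags that are set."""
    return [name for bit, name in TCP_FLAG_BITS if flags & bit]


def tcp_flags_vector(flags: int) -> list[int]:
    """Multi-hot encoding of the TCP flags, in the order of TCP_FLAG_BITS."""
    return [1 if flags & bit else 0 for bit, _ in TCP_FLAG_BITS]


def decode_ip_flags(flags: int) -> dict[str, bool]:
    """Decode the 3-bit IPv4 flags field (reserved, DF, MF)."""
    return {"DF": bool(flags & 0b010), "MF": bool(flags & 0b001)}
```

For example, `decode_tcp_flags(0x12)` yields the SYN and ACK flags, the combination sent by a server in the second step of the TCP handshake.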
Sometimes, some characteristics of the payload are also converted into features for the ML algorithm. For instance:
- Payload size
- Payload content (e.g., frequency of specific characters or patterns in the payload, payload entropy)
- Protocol-specific Payload Analysis: For example, HTTP request methods, URLs, HTTP response codes, DNS queries, etc.
- Payload Compression or Encryption Indicators: Indications of payload compression or encryption
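Payload entropy, mentioned above, can be computed as the Shannon entropy of the byte distribution. The sketch below is a minimal example with hypothetical function names. Values close to 8 bits per byte suggest (but do not prove) that the payload is compressed or encrypted, while plain text usually scores lower.

```python
import math
from collections import Counter


def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the payload in bits per byte (0.0 for an empty payload)."""
    if not payload:
        return 0.0
    n = len(payload)
    counts = Counter(payload)  # occurrences of each byte value
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def payload_features(payload: bytes) -> dict:
    """Two simple payload features: size in bytes and byte entropy."""
    return {"size": len(payload), "entropy": payload_entropy(payload)}
```

A payload made of a single repeated byte has entropy 0, and a payload in which all 256 byte values occur equally often reaches the maximum of 8.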
Examples of packet-level features
Packet-level features can be extracted from the traffic traces with tools such as tcpdump and tshark.
In this example, an HTTP DDoS flow is represented as a list of packets (lines)
- Each packet is represented as a list of packet-level features: (0) IAT, (1) Pkt_Len, (2) Highest_Protocol, (3) IP_Flags, (4) Protocols, (5) TCP_Len, (6) TCP_Ack, (7) TCP_Flags, (8) TCP_Win, (9) UDP_Len, (10) ICMP_Type
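The representation above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the extraction code behind the example: packets are given as hypothetical dictionaries with already-numeric values, missing fields default to 0, and the flow is truncated or zero-padded to a fixed number of packets so that all flows have the same shape (a common choice for ML inputs, not something stated in the text).

```python
FEATURES = ["IAT", "Pkt_Len", "Highest_Protocol", "IP_Flags", "Protocols",
            "TCP_Len", "TCP_Ack", "TCP_Flags", "TCP_Win", "UDP_Len", "ICMP_Type"]


def packet_vector(pkt: dict, prev_time) -> list:
    """Turn one packet into a list of features in the order of FEATURES."""
    # Inter-arrival time: 0 for the first packet of the flow.
    iat = 0.0 if prev_time is None else pkt["time"] - prev_time
    return [iat] + [pkt.get(name, 0) for name in FEATURES[1:]]


def flow_matrix(packets: list, max_packets: int = 10) -> list:
    """Represent a flow as max_packets rows of len(FEATURES) values."""
    rows, prev_time = [], None
    for pkt in packets[:max_packets]:
        rows.append(packet_vector(pkt, prev_time))
        prev_time = pkt["time"]
    # Zero-pad short flows to the fixed size.
    rows += [[0] * len(FEATURES) for _ in range(max_packets - len(rows))]
    return rows


# Hypothetical first two packets of a TCP flow (SYN, then ACK).
packets = [
    {"time": 0.0, "Pkt_Len": 74, "TCP_Flags": 0x02, "TCP_Win": 29200},
    {"time": 0.25, "Pkt_Len": 66, "TCP_Ack": 1, "TCP_Flags": 0x10, "TCP_Win": 229},
]
matrix = flow_matrix(packets, max_packets=4)
```

Each row keeps the attributes of one packet, so the order of packets within the flow and categorical fields such as the TCP flags remain available to the model.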
Remarks on packet-level features
- Payloads are typically subject to end-to-end encryption, so some information (such as application protocols) can only be extracted if the feature extraction is executed at one of the end-points of the communication.
- Compared to statistical flow-level features, a packet-level representation of network traffic preserves the semantics of flows (i.e., the behaviour of packets within each flow) and the categorical features of packets (e.g., TCP flags, ICMP type, etc.)
Practical example with Wireshark
An easy way to inspect and analyse the structure of network packets is to use the well-known open-source application Wireshark (https://www.wireshark.org/)
Wireshark output
Packet header analysis
The SYN Flood use-case
Another example: the SWaT dataset
- Sensor: Level Transmitter
- Actuator: Pump
- Sensor: Flow meter
Timestamp | FIT101 | LIT101 | MV101 | P101 | P102 | AIT201 | AIT202 | AIT203 | FIT201 | MV201 | P201 | P202 | Normal/Attack | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28/12/2015 10:29:08 AM |
2.459085 | 815.7115 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Normal | |
28/12/2015 10:29:09 AM |
2.444031 | 815.4761 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Normal | |
28/12/2015 10:29:10 AM |
2.428979 | 815.9471 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Normal | |
28/12/2015 10:29:11 AM |
2.424174 | 816.1041 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Normal | |
28/12/2015 10:29:12 AM |
2.424174 | 816.3788 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Normal | |
28/12/2015 10:29:13 AM |
2.447234 | 816.8499 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | … | Normal |
28/12/2015 10:29:14 AM |
2.493675 | 817.6742 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Attack | |
28/12/2015 10:29:15 AM |
2.535951 | 817.949 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Attack | |
28/12/2015 10:29:16 AM |
2.535951 | 817.949 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Attack | |
28/12/2015 10:29:17 AM |
2.5699 | 818.4592 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Attack | |
28/12/2015 10:29:18 AM |
2.610575 | 818.8911 | 2 | 1 | 1 | 262.625 | 8.46533 | 319.7385 | 0 | 1 | 1 | 1 | Attack |
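Records like the ones in the table above can be turned into ML-ready samples with a few lines of Python. The sketch below is a minimal example under stated assumptions: the data is available as comma-separated text with the column names shown above, only four of the columns are used, and the function name and the binary label encoding (1 for Attack, 0 for Normal) are choices made for illustration.

```python
import csv
import io
from datetime import datetime

# Two records taken from the table above, restricted to four columns (assumed CSV layout).
SAMPLE = """Timestamp,FIT101,LIT101,MV101,P101,Normal/Attack
28/12/2015 10:29:13 AM,2.447234,816.8499,2,1,Normal
28/12/2015 10:29:14 AM,2.493675,817.6742,2,1,Attack
"""

FEATURE_COLUMNS = ("FIT101", "LIT101", "MV101", "P101")


def load_swat(text: str) -> list:
    """Parse SWaT-style records into timestamp, feature vector and binary label."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            # Day/month/year with a 12-hour clock, as in the table.
            "time": datetime.strptime(rec["Timestamp"].strip(), "%d/%m/%Y %I:%M:%S %p"),
            "features": [float(rec[name]) for name in FEATURE_COLUMNS],
            "label": 1 if rec["Normal/Attack"].strip() == "Attack" else 0,
        })
    return rows


rows = load_swat(SAMPLE)
```

Unlike the network datasets, each sample here is a snapshot of the physical process (sensor readings and actuator states) at one point in time rather than a packet or a flow.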