NIAD+ML

Context

Datasets

Definition: In the context of cybersecurity applications, a network traffic dataset refers to a structured and organised collection of digital communications and interactions that occur within a computer network.

Challenges with datasets

The availability of good-quality datasets is a critical factor in the development of effective ML-based cybersecurity applications. However, there are a number of challenges associated with collecting and curating these datasets.

These challenges include:

The volume of data: The amount of data generated by cyberattacks is constantly growing, which makes it difficult to collect and store enough data to train effective ML models.
The diversity of data: Cyberattacks are constantly evolving, so the data used to train ML models needs to be updated regularly. This is a challenge because it requires access to up-to-date data.
The labelling of data: Many ML-based cybersecurity applications require labelled data, which means that each data point needs to be manually classified as malicious or benign. This can be a time-consuming and expensive process.

Why do we need datasets?

Training Models: Machine learning algorithms, especially supervised learning, require large and diverse datasets to train models effectively. These models learn from patterns in the data and use this knowledge to make predictions or classifications.
- In cybersecurity, ML models can be trained to recognise normal network behaviour and identify anomalies or potential threats.
Lack of realistic training network data: Obtaining high-quality network traffic datasets for cybersecurity applications can be challenging for various reasons, including privacy concerns and proprietary considerations.

Lack of realistic training network data

Privacy Concerns: Network traffic contains sensitive information, and sharing it can violate user privacy. Personally identifiable information (PII), confidential business data, and other sensitive content can be inadvertently included in network traffic. This makes organisations and Internet service providers cautious about sharing their data, especially with external parties.
Proprietary Information: Industries often consider their network traffic patterns and configurations proprietary and a source of competitive advantage. Sharing this information might lead to a loss of intellectual property or of an edge over competitors.
Data Volume and Diversity: Even when organisations are willing to share data, the sheer volume of network traffic generated can be overwhelming. It might be challenging to capture, store, and share representative datasets that cover a wide range of scenarios and threats.
Data Bias and Representativeness: Datasets need to accurately represent the real-world network traffic environment to ensure the effectiveness of ML models. Biased or incomplete datasets might lead to models that perform poorly.
Data Preprocessing: Raw network traffic data requires significant preprocessing before it is suitable for machine learning. Cleaning, anonymising, and structuring the data while maintaining its integrity can be complex.

Common solutions

Data anonymisation: The objective is to remove or alter any information that could identify the source, destination, or content of specific network communications. This is vital for sharing network traffic data for research, analysis, and threat detection while mitigating privacy risks. Data anonymisation techniques include IP address anonymisation, timestamp noise, port and protocol aggregation, payload encryption or truncation, etc.
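Three of the techniques listed above can be sketched in a few lines. This is a minimal illustration, not a vetted anonymisation scheme: the key, the function names, and the choice of a keyed hash (HMAC-SHA256) for IP addresses are assumptions made for the example.

```python
import hashlib
import hmac
import ipaddress
import random

# Hypothetical secret; in practice it would be random and kept by the data owner.
SECRET_KEY = b"replace-with-a-random-secret"

def anonymise_ip(ip: str, key: bytes = SECRET_KEY) -> str:
    """Replace an IPv4 address with a pseudonym derived from a keyed hash.

    The same input always maps to the same pseudonym, so flows stay linkable,
    but subnet prefixes are not preserved and rare collisions are possible.
    """
    digest = hmac.new(key, ipaddress.IPv4Address(ip).packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))

def add_timestamp_noise(ts: float, max_shift: float, rng: random.Random) -> float:
    """Shift a timestamp by a random offset in [-max_shift, +max_shift] seconds."""
    return ts + rng.uniform(-max_shift, max_shift)

def truncate_payload(payload: bytes, keep: int = 0) -> bytes:
    """Keep only the first `keep` bytes of a payload (0 drops it entirely)."""
    return payload[:keep]

rng = random.Random(0)
print(anonymise_ip("192.168.10.5"))
print(add_timestamp_noise(1000.0, 0.5, rng))
print(truncate_payload(b"GET / HTTP/1.1", keep=3))
```

Timestamp noise applied independently per packet can reorder packets and distort inter-arrival times, so the shift is usually kept small relative to the time scales the ML model relies on.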

Public Datasets: Some organisations or research institutions release network traffic datasets to the public for research purposes. However, these might not fully represent real-world scenarios, as they are usually generated in controlled environments (testbeds).

Public datasets (CIC)

The Canadian Institute for Cybersecurity (CIC, https://www.unb.ca/cic/index.html) has released several datasets for research purposes, primarily focusing on intrusion detection and network security. The datasets are publicly available in the form of prerecorded traffic traces, including full packet payloads, plus supplementary text files containing labels and statistical details for each traffic flow.

CICIDS2017: The CICIDS2017 dataset is designed for research in network intrusion detection systems (IDS). It is a comprehensive dataset that includes a diverse range of network traffic scenarios, both benign and malicious. The dataset contains various types of attacks, including DoS (Denial of Service), DDoS (Distributed Denial of Service), and brute-force attacks. Link: https://www.unb.ca/cic/datasets/ids-2017.html
CICDDoS2019: The CICDDoS2019 dataset consists of several days of network activity, and includes both benign traffic and 13 different types of DDoS attacks. Link: https://www.unb.ca/cic/datasets/ddos-2019.html

Other popular datasets (UNSW)

The UNSW-NB15 dataset contains network traffic data that includes both normal (benign) traffic and various types of network attacks. It covers 9 network attack categories, including probing, DoS, worms, etc. Along with the pcap files, the dataset includes CSV files with 49 features and the class label of each traffic flow. Link: https://research.unsw.edu.au/projects/unsw-nb15-dataset
The BoT-IoT dataset was created to address the growing concern of security in IoT ecosystems. It contains network traffic data from IoT devices, which are susceptible to various types of attacks. The dataset includes both legitimate IoT device traffic and malicious network activities such as probing, DoS and DDoS attacks. Link: https://research.unsw.edu.au/projects/bot-iot-dataset

Other datasets

The MAWILab dataset is a collection of packet-level network traffic traces captured from various parts of the internet. Link: http://www.fukuda-lab.org/mawilab/index.html
The Kitsune dataset contains nine different network attacks on a commercial IP-based surveillance system and an IoT network. The dataset includes reconnaissance, MitM, DoS, and botnet attacks. Link: https://archive.ics.uci.edu/dataset/516/kitsune+network+attack+dataset

Industrial Control Systems

The SWaT (Secure Water Treatment) [1] dataset was created to support research in the security of water treatment plants and contains sensor readings and actuator states. It includes cyber attacks that manipulate sensor readings to cause physical damage (e.g., tank overflow).
The EPIC (Electrical Power and Intelligent Control) [2] dataset contains data collected from a power-grid testbed (called EPIC). The testbed includes energy sources such as motor-driven generators and photovoltaic panels, as well as batteries. The dataset includes various types of attacks, such as power supply interruptions, cyber-attacks and physical damage.

[1] Lamshöft, Kevin, et al. “Information hiding in cyber physical systems: Challenges for embedding, retrieval and detection using sensor data of the SWAT dataset.” Proceedings of the 2021 ACM Workshop on Information Hiding and Multimedia Security. 2021.
[2] Ahmed, Chuadhry Mujeeb, and Nandha Kumar Kandasamy. “A comprehensive dataset from a smart grid testbed for machine learning based cps security research.” Cyber-Physical Security for Critical Infrastructures Protection: First International Workshop, CPS4CIP 2020, Guildford, UK, September 18, 2020, Revised Selected Papers 1. Springer International Publishing, 2021.

Useful details of the CIC datasets

The benign traffic of the datasets has been generated using distribution models for web (HTTP/S), remote shell (SSH), file transfer (FTP) and email (SMTP) applications.
Malicious data include network intrusions (DDoS, Brute Force attacks, port scans, etc.). The attacks have been generated by using publicly available tools (e.g., HOIC, LOIC, Hydra, Ares botnet, etc.) or with custom Python scripts.

Statistical traffic features

Statistics of the network traffic are available in .csv files that can be downloaded with the pcap files.
Each line of a CSV file contains 80+ statistics of a bidirectional network traffic flow (either benign or malicious), plus the label assigned to that flow (e.g., Benign, DDoS, PortScan, etc.).
Each line starts with the 5-tuple identifier of the traffic flow, followed by statistical features related to the flow such as Flow Duration, Flow IAT Mean, Fwd Packet Length Mean, etc.

Flow ID	Source IP	Source Port	Destination IP	Destination Port	Protocol	Flow Duration	Total Fwd Packets	Total Backward Packets	Total Length of Fwd Packets	Total Length of Bwd Packets	Fwd Packet Length Max	Label
192.168.10.5-104.16.207.165-54865-443-6	104.16.207.165	443	192.168.10.5	54865	…	…	…	…	12	…	…	BENIGN
172.16.0.1-192.168.10.50-49650-80-6	172.16.0.1	49650	192.168.10.50	80	6	1293792	…	…	26	11607	20	DDoS
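Loading such a file for training typically means separating the flow identifiers, the numeric statistics, and the label. The sketch below assumes a comma-separated file and uses only a small subset of the columns, with the values of the two rows shown above; the function name and the binary label encoding are illustrative choices.

```python
import csv
import io

# Inline sample with a subset of the columns; a real file has 80+ statistics.
SAMPLE_CSV = """Flow ID,Source IP,Source Port,Destination IP,Destination Port,Total Length of Fwd Packets,Label
192.168.10.5-104.16.207.165-54865-443-6,104.16.207.165,443,192.168.10.5,54865,12,BENIGN
172.16.0.1-192.168.10.50-49650-80-6,172.16.0.1,49650,192.168.10.50,80,26,DDoS
"""

# Identifier columns are dropped here so that a model cannot simply
# memorise the addresses and ports used in the testbed (a design choice).
ID_COLUMNS = ["Flow ID", "Source IP", "Source Port", "Destination IP", "Destination Port"]

def load_flows(text):
    """Return (features, labels): one dict of numeric statistics per flow,
    and one label per flow (0 = benign, 1 = any attack class)."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        label = row.pop("Label").strip()
        for column in ID_COLUMNS:
            row.pop(column, None)
        features.append({name: float(value) for name, value in row.items()})
        labels.append(0 if label.upper() == "BENIGN" else 1)
    return features, labels

X, y = load_flows(SAMPLE_CSV)
print(X)
print(y)
```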

Packet-level features

Packet-level features are packet attributes that can be extracted from both packet headers and payloads and used as input to ML models for network anomaly detection.

Examples of header features:

TCP flags (SYN, ACK, FIN, RST, etc.)
ICMP message types
IP Flags (Don’t fragment (DF), More Fragments (MF))
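Header flags are categorical, so one common way to feed them to an ML model is as one binary feature per flag. A minimal sketch, assuming the raw TCP flags byte and the 3-bit IP flags field have already been read from the headers; the function name is illustrative.

```python
# Bit masks of the TCP flags byte.
TCP_FLAG_BITS = {
    "FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08,
    "ACK": 0x10, "URG": 0x20, "ECE": 0x40, "CWR": 0x80,
}
# Bit masks within the 3-bit IP flags field (0x4 is reserved).
IP_FLAG_BITS = {"DF": 0x2, "MF": 0x1}

def flags_to_features(value, bit_table):
    """Return one 0/1 feature per named flag."""
    return {name: int(bool(value & bit)) for name, bit in bit_table.items()}

syn_ack = flags_to_features(0x12, TCP_FLAG_BITS)      # SYN and ACK are set
dont_fragment = flags_to_features(0x2, IP_FLAG_BITS)  # only DF is set
print(syn_ack)
print(dont_fragment)
```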

Sometimes, characteristics of the payload are also converted into features for the ML algorithm. For instance:

Payload size
Payload content (e.g., frequency of specific characters or patterns in the payload, payload entropy)
Protocol-specific payload analysis: for example, HTTP request methods, URLs, HTTP response codes, DNS queries, etc.
Payload compression or encryption indicators
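Payload size, character frequency, and entropy can all be computed directly from the payload bytes. A minimal sketch, assuming the payload is available as a Python bytes object; the feature names and the printable-ASCII ratio are illustrative choices.

```python
import math
from collections import Counter

def payload_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def payload_features(payload: bytes) -> dict:
    """Size, entropy, and fraction of printable ASCII bytes of a payload."""
    printable = sum(32 <= byte < 127 for byte in payload)
    return {
        "size": len(payload),
        "entropy": payload_entropy(payload),
        "printable_ratio": printable / len(payload) if payload else 0.0,
    }

print(payload_features(b"GET / HTTP/1.1\r\n"))
print(payload_features(bytes(range(256))))
```

Entropy close to 8 bits per byte is often treated as a hint that a payload is compressed or encrypted, but it is only an indicator: short payloads and some binary formats can produce similar values.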

Examples of packet-level features

Packet-level features can be extracted from the traffic traces with tools such as tcpdump and tshark.

In this example, an HTTP DDoS flow is represented as a list of packets (lines).

Each packet is represented as a list of packet-level features: (0) IAT, (1) Pkt_Len, (2) Highest_Protocol, (3) IP_Flags, (4) Protocols, (5) TCP_Len, (6) TCP_Ack, (7) TCP_Flags, (8) TCP_Win, (9) UDP_Len, (10) ICMP_Type
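One possible way to obtain such a representation is to export fields with tshark and parse each output line. The sketch below is an assumption about how this could be done, not the procedure used to build the example above: the mapping of the eleven features to tshark field names, the derivation of Highest_Protocol from the last entry of frame.protocols, and the sample line are all illustrative.

```python
# tshark fields assumed to correspond to the features listed above.
TSHARK_FIELDS = [
    "frame.time_delta", "frame.len", "ip.flags", "frame.protocols", "tcp.len",
    "tcp.ack", "tcp.flags", "tcp.window_size_value", "udp.length", "icmp.type",
]

def tshark_command(pcap_path):
    """Build a tshark command that prints one tab-separated line per packet."""
    cmd = ["tshark", "-r", pcap_path, "-T", "fields", "-E", "occurrence=f"]
    for field in TSHARK_FIELDS:
        cmd += ["-e", field]
    return cmd

def _to_int(text, base=10):
    # Fields that do not apply to a packet (e.g. UDP_Len on TCP) are empty.
    return int(text, base) if text else 0

def parse_packet(line):
    """Turn one tshark output line into the 11-element feature list."""
    cols = line.rstrip("\r\n").split("\t")
    cols += [""] * (len(TSHARK_FIELDS) - len(cols))
    (iat, pkt_len, ip_flags, protocols, tcp_len,
     tcp_ack, tcp_flags, tcp_win, udp_len, icmp_type) = cols[:len(TSHARK_FIELDS)]
    highest = protocols.split(":")[-1] if protocols else ""
    return [
        float(iat) if iat else 0.0,   # (0) IAT; equals the in-flow IAT only if
                                      #     the capture contains a single flow
        _to_int(pkt_len),             # (1) Pkt_Len
        highest,                      # (2) Highest_Protocol
        _to_int(ip_flags, 16),        # (3) IP_Flags
        protocols,                    # (4) Protocols
        _to_int(tcp_len),             # (5) TCP_Len
        _to_int(tcp_ack),             # (6) TCP_Ack
        _to_int(tcp_flags, 16),       # (7) TCP_Flags
        _to_int(tcp_win),             # (8) TCP_Win
        _to_int(udp_len),             # (9) UDP_Len
        _to_int(icmp_type),           # (10) ICMP_Type
    ]

# Hypothetical output line for a TCP SYN packet.
SAMPLE_LINE = "0.000123\t74\t0x02\teth:ethertype:ip:tcp\t0\t0\t0x0002\t64240\t\t"
print(parse_packet(SAMPLE_LINE))
```

The command list returned by tshark_command can be passed to subprocess.run on a machine where tshark is installed; the parser itself does not depend on tshark.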

Remarks on packet-level features

Payloads are typically subject to end-to-end encryption; therefore, some information (such as application protocols) can only be extracted if the feature extraction is executed at one endpoint of the communication.
Compared to statistical flow-level features, a packet-level representation of network traffic preserves the semantics of flows (i.e., the behaviour of packets within each flow) and the categorical features of packets (e.g., TCP flags, ICMP type, etc.).
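The difference between the two representations can be illustrated with a toy case. The two flows below are hypothetical: they differ only in their TCP flags, so a flow-level summary built from packet lengths alone cannot tell them apart, while the packet-level sequence can. Real flow-level feature sets contain many more statistics than the three used here.

```python
from statistics import mean

# Hypothetical flows: each packet is (packet length in bytes, TCP flags).
handshake = [(60, "SYN"), (60, "SYN-ACK"), (52, "ACK")]
syn_burst = [(60, "SYN"), (60, "SYN"), (52, "SYN")]

def flow_level(packets):
    """Summarise a flow with a few length statistics (order is lost)."""
    lengths = [length for length, _ in packets]
    return {"packets": len(lengths), "mean_len": mean(lengths), "max_len": max(lengths)}

def packet_level(packets):
    """Keep the per-packet sequence of categorical TCP flags."""
    return [flags for _, flags in packets]

print(flow_level(handshake) == flow_level(syn_burst))      # same summary
print(packet_level(handshake) == packet_level(syn_burst))  # different sequences
```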

Practical example with Wireshark

An easy way to inspect and analyse the structure of network packets is to use the well-known open-source application Wireshark (https://www.wireshark.org/).

Wireshark output

Packet header analysis

The SYN flood use case

Another example: the SWaT dataset

Sensor: Level Transmitter

Actuator: Pump

Sensor: Flow meter

Timestamp	FIT101	LIT101	MV101	P101	P102	AIT201	AIT202	AIT203	MV201	P201	P202		Normal/Attack
28/12/2015 10:29:08 AM	2.459085	815.7115	2	1	1	262.625	8.46533	319.7385	1	1	1		Normal
28/12/2015 10:29:09 AM	2.444031	815.4761	2	1	1	262.625	8.46533	319.7385	1	1	1		Normal
28/12/2015 10:29:10 AM	2.428979	815.9471	2	1	1	262.625	8.46533	319.7385	1	1	1		Normal
28/12/2015 10:29:11 AM	2.424174	816.1041	2	1	1	262.625	8.46533	319.7385	1	1	1		Normal
28/12/2015 10:29:12 AM	2.424174	816.3788	2	1	1	262.625	8.46533	319.7385	1	1	1		Normal
28/12/2015 10:29:13 AM	2.447234	816.8499	2	1	1	262.625	8.46533	319.7385	1	1	1	…	Normal
28/12/2015 10:29:14 AM	2.493675	817.6742	2	1	1	262.625	8.46533	319.7385	1	1	1		Attack
28/12/2015 10:29:15 AM	2.535951	817.949	2	1	1	262.625	8.46533	319.7385	1	1	1		Attack
28/12/2015 10:29:16 AM	2.535951	817.949	2	1	1	262.625	8.46533	319.7385	1	1	1		Attack
28/12/2015 10:29:17 AM	2.5699	818.4592	2	1	1	262.625	8.46533	319.7385	1	1	1		Attack
28/12/2015 10:29:18 AM	2.610575	818.8911	2	1	1	262.625	8.46533	319.7385	1	1	1		Attack
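A table like this can be turned into labelled samples with a few lines of code. A minimal sketch, assuming the data has been exported as tab-separated text and using only three of the columns and four of the rows shown above; the function names are illustrative.

```python
import csv
import io
from datetime import datetime

# Subset of the table above: two sensor columns and the label.
SAMPLE = (
    "Timestamp\tFIT101\tLIT101\tNormal/Attack\n"
    "28/12/2015 10:29:12 AM\t2.424174\t816.3788\tNormal\n"
    "28/12/2015 10:29:13 AM\t2.447234\t816.8499\tNormal\n"
    "28/12/2015 10:29:14 AM\t2.493675\t817.6742\tAttack\n"
    "28/12/2015 10:29:15 AM\t2.535951\t817.949\tAttack\n"
)

def load_records(text):
    """Parse timestamps, sensor readings, and the Normal/Attack label."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        records.append({
            "time": datetime.strptime(row["Timestamp"].strip(), "%d/%m/%Y %I:%M:%S %p"),
            "FIT101": float(row["FIT101"]),
            "LIT101": float(row["LIT101"]),
            "attack": row["Normal/Attack"].strip() == "Attack",
        })
    return records

def first_attack_time(records):
    """Return the timestamp of the first sample labelled as Attack, if any."""
    return next((r["time"] for r in records if r["attack"]), None)

records = load_records(SAMPLE)
print(first_attack_time(records))
```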

Basics of Networking Cyber Attacks