Detection of DDoS Attacks using Machine Learning

Introduction

What is a Distributed Denial-of-Service (DDoS) Attack?

Under normal circumstances, a server in a network processes requests from legitimate clients as they arrive. A Denial-of-Service (DoS) attack hampers this normal functioning of the network by overloading the server with requests/traffic from a malicious device.

When multiple devices, rather than a single one, simultaneously overload a target host with unnecessary internet traffic, the attack is called a Distributed Denial-of-Service (DDoS) attack.

Literature Review

The authors of the proposed work [2] were able to identify the important features corresponding to a DDoS attack. We can conclude from their findings that features like IP addresses, bit rate, and packet size help immensely in distinguishing between legitimate and malicious users.

In reference [3], various ML models were implemented and compared against each other in terms of F scores. All of the models achieved an F score above 0.95, with the fuzzy c-means algorithm performing best at an F score of 0.987.

This published work [4] discusses how developing technologies like cloud computing, IoT, and artificial intelligence have led to an increase in the number of DDoS attacks, making them harder to predict. Based on recent progress in DDoS detection, the authors concluded that Naïve Bayes and Random Forest techniques work efficiently in detecting the attacks.

Data Set

We used a DDoS attack dataset that combines multiple intrusion detection datasets, created by extracting their DDoS data [1].

This dataset is balanced and has over 12 million data points with 48 features.

Some of the features that we will be interested in are:

Problem Definition

While an attack from a single malicious device can be identified and mitigated easily by the network, it becomes a challenge when multiple devices simultaneously attack the target host.

Why is DDoS Harmful?

Methodology

We first analyzed the data and performed dimensionality reduction. Our aim in this project was to use multiple machine learning models to identify the devices that are part of the attacking network before it overwhelms the system, and to compare their performance.

Data Preprocessing

Models used

Metrics:

Preprocessing

Dataset Cleaning and Visualization
The following steps were taken to clean the dataset:
  1. We first checked whether the dataset is truly balanced by examining the distribution of the labels. Out of the 12,794,627 total data points, 6,321,980 had the "Benign" label and 6,472,647 had the "DDOS" label, so the dataset is balanced.
  2. We observed that "bwd_pkt_len_max" and "bwd_pkt_len_min" (backward packet length max/min) contained inf values, so we converted the inf values to NaN and then dropped those data points.
  3. The next step was to delete all data points with a NaN value in at least one feature. We found that 29,713 data points with the "Benign" label and 30 with the "DDOS" label contained NaN values. On further examination, all of the NaN values came from the feature "Flow bytes/s". Since a significant number of the NaN values came from "Benign" data points, we deleted the feature itself, which removed all NaN values while keeping the dataset balanced.
  4. The information in the feature "FlowID" is already contained in the features "src ip", "dst ip", "src port", "dst port", and "protocol", so we dropped "FlowID" to avoid duplication.
  5. The "Timestamp" feature is not in 24-hour format and lacks AM/PM values, so the timestamps are incomplete. Only 28% of the entries had complete timestamp values, and the dataset had only about 85K unique timestamp values, so we dropped the column.
  6. We encoded "src_ip" and "dst_ip" as integers using the ipaddress library.
  7. We divided the 80 features remaining after cleaning into subsets of 20 features each and examined correlation heatmaps for each subset individually, an example of which is shown in Fig 1.
  8. From these heatmaps, we picked the 6 features that we believed to be the most relevant in the dataset, as seen in Fig 2.
  9. Fig 3 shows the correlation between the features believed to be important according to existing literature.
  10. A Random Forest classifier trained on the features we selected and one trained on the features selected in prior work give similar accuracy.
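The cleaning steps above can be sketched with pandas and the standard-library ipaddress module. The tiny frame and column names below are illustrative stand-ins, not the actual dataset schema:

```python
import ipaddress

import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (column names are illustrative).
df = pd.DataFrame({
    "bwd_pkt_len_max": [120.0, np.inf, 80.0],
    "flow_byts_s": [1500.0, 900.0, np.nan],
    "src_ip": ["192.168.0.1", "10.0.0.5", "172.16.0.9"],
    "label": ["Benign", "DDOS", "Benign"],
})

# Step 2: replace inf with NaN so the affected rows can be dropped uniformly.
df = df.replace([np.inf, -np.inf], np.nan)

# Step 3: since almost all NaNs came from one column, drop that column
# instead of its rows, keeping the class balance intact.
df = df.drop(columns=["flow_byts_s"])
df = df.dropna()  # drop any rows that still contain NaN (the former inf rows)

# Step 6: encode IP addresses as integers with the stdlib ipaddress module.
df["src_ip"] = df["src_ip"].map(lambda ip: int(ipaddress.ip_address(ip)))
```

Encoding IPs as integers preserves their numeric ordering, which keeps subnets of addresses close together in feature space.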
Dataset Preprocessing
  1. We used the StandardScaler class from sklearn.preprocessing to standardize all features to zero mean and unit variance.
  2. Below are the results returned by PCA. We decided to proceed with 24 components, which retain 95% of the variance.
  3. Explained Variance   Components Returned
     0.99                 35
     0.97                 29
     0.95                 24
     0.90                 20
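A minimal sketch of this scaling and PCA step, using synthetic data in place of the cleaned features; the 0.95 variance threshold is the one chosen above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))  # synthetic stand-in for the cleaned features

# Standardize each feature to zero mean / unit variance.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components tells PCA to keep the smallest number of
# components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
```

On the real dataset this selection yielded 24 components, per the table above.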

    Selection Of Parameters
    Metrics used for Evaluation

    Metrics for Clustering
    • Adjusted Mutual Information: \[AMI(Y_{act}, Y_{pred}) = {MI(Y_{act}, Y_{pred}) - E[MI(Y_{act}, Y_{pred})] \over \mathrm{avg}(H(Y_{act}), H(Y_{pred})) - E[MI(Y_{act}, Y_{pred})]}\]
    • Adjusted Rand Score:

      \[\text{rand index} = {\text{true positives} + \text{true negatives} \over \binom{n}{2}}\] where n is the total number of data points in our dataset

      \[\text{adjusted rand index} = { \text{rand index} - E[\text{rand index}] \over max(\text{rand index}) - E[\text{rand index}]}\]

    • Fowlkes Mallows Score: \[\text{fowlkes mallows score} = {\text{true positives} \over \sqrt{(\text{true positives} + \text{false positives})(\text{true positives} + \text{false negatives})}}\]
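All three clustering metrics above are available directly in scikit-learn; a small sketch with toy labels:

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    fowlkes_mallows_score,
)

y_true = [0, 0, 0, 1, 1, 1]  # toy ground-truth labels
y_pred = [0, 0, 1, 1, 1, 1]  # toy cluster assignments

ami = adjusted_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
fms = fowlkes_mallows_score(y_true, y_pred)
```

The "adjusted" scores correct for chance agreement, so random cluster assignments score near 0 rather than some positive baseline.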

    Metrics for Classification
    • Test Accuracy: \[accuracy = { \text{true positives} + \text{true negatives}\over \text{total predictions}}.\]
    • Precision: \[precision = { \text{true positives} \over \text{true positives} + \text{false positives}}.\]
    • Recall: \[recall = { \text{true positives} \over \text{true positives} + \text{false negatives}}.\]
    • F1 Score: \[F1 = { 2* \text{precision} * \text{recall} \over \text{precision} + \text{recall}}.\]
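These four classification metrics can be computed directly with scikit-learn. The toy labels below produce one false positive and one false negative:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]  # one false negative, one false positive

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 4/6
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```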

    Models trained

    Random Forest

    Random Forest has proven to be an efficient learning algorithm for large datasets compared to other training algorithms.

    Test Accuracy   Precision   Recall   F1 Score
    99.99%          1           1        1
    Fig: Confusion Matrix for Random Forest Model
    Fig: Feature importance from the random forest model
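A hedged sketch of the random forest training and the feature-importance readout shown in the figure, using make_classification as a synthetic stand-in for the DDoS features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary problem standing in for the DDoS dataset.
X, y = make_classification(n_samples=2000, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

# Impurity-based importances let us rank features, as in the figure above.
top = clf.feature_importances_.argsort()[::-1][:6]
```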

Support Vector Machine classification (SVM)

SVM is more effective in high-dimensional spaces and is relatively memory efficient.

Test Accuracy   Precision   Recall   F1 Score
99.99%          0.99        0.99     0.99
Fig: Confusion Matrix for SVM
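A comparable sketch for the SVM classifier, again on synthetic data; since SVMs are sensitive to feature scale, the scaler is part of the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=24, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Pipeline ensures the scaler is fit only on training data, avoiding leakage.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
```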
Logistic Regression

It is one of the simplest machine learning algorithms to train.

The trained weights indicate the importance of each feature, as well as the direction of the association (positive or negative).

It outputs well-calibrated probabilities.

Test Accuracy   Precision   Recall   F1 Score
85%             0.99        0.76     0.86
Fig: Confusion Matrix for Logistic Regression
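A sketch of how the trained weights and predicted probabilities mentioned above can be read off a fitted model (synthetic data; the feature count is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Sign of each coefficient gives the direction of association; magnitude
# (on standardized inputs) suggests relative importance.
coefs = clf.coef_[0]
proba = clf.predict_proba(X[:1])[0]  # per-class probabilities, summing to 1
```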
K-Means

It is relatively simple to implement and suitable for large datasets.

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.1854                       0.0542                0.6376
Fig: Confusion Matrix for K-Means

This model produces zero false positives but is not good at identifying DDOS attacks, so we decided to move to a more advanced clustering algorithm.
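A minimal K-Means sketch on well-separated synthetic blobs; because cluster IDs are arbitrary, the comparison against ground truth uses a permutation-invariant score rather than raw accuracy:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two synthetic clusters standing in for the Benign/DDOS classes.
X, y_true = make_blobs(n_samples=600, centers=2, random_state=3)
km = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X)

# Adjusted Rand score is invariant to which cluster gets which ID.
ari = adjusted_rand_score(y_true, km.labels_)
```

On the actual DDoS data the analogous score was only 0.0542, which is what motivated moving on to GMM.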

Gaussian Mixture Model (GMM)

GMM classifies data based on probability distributions and is robust to outliers.
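A minimal GMM sketch on synthetic blobs (the real experiment used the PCA components), scored with the same Fowlkes Mallows metric reported here:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import fowlkes_mallows_score
from sklearn.mixture import GaussianMixture

# Two synthetic clusters standing in for the Benign/DDOS classes.
X, y_true = make_blobs(n_samples=600, centers=2, random_state=4)
gmm = GaussianMixture(n_components=2, random_state=4).fit(X)
labels = gmm.predict(X)

fms = fowlkes_mallows_score(y_true, labels)
```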

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.2237                       0.2016                0.6309

Looking at the Fowlkes Mallows score, we can evaluate how closely our predictions matched the ground truth. This score stayed at approximately 0.6; from this low value, we can conclude that the GMM clusterings were not able to closely match the distribution of the dataset.

Fig: Confusion Matrix for GMM

This also has zero false positives, which is not promising.

We then looked into the underlying reasons for this behavior in both of the clustering algorithms we implemented.

To do this, we plotted components 3 and 4 for both the DDOS (label 1) and Benign (label 0) data. We observed that the two data clusters overlap; more specifically, the DDOS data lies entirely inside a Benign data cluster. We believe this is the reason for the large misclassification, and for the models completely skipping the DDOS data.

Fig: Component 3 and 4 scatter plot for label 1
Fig: Component 3 and 4 scatter plot for label 0

We also plotted principal components 1 and 2 for the DDOS and Benign data to see whether their larger explained variance could differentiate between the clusters better.

However, these also failed to provide a satisfactory boundary or split between the two classes.

Fig: Component 1 and 2 scatter plot for label 1
Fig: Component 1 and 2 scatter plot for label 0

We then trained on all 24 PCA components to see if that improved the model; the confusion matrix is shown below. While this increased the number of true positive classifications somewhat, the improvement was still not significant enough to be satisfactory.

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.6488                       0.6381                0.7994
Fig: Confusion Matrix for GMM

We can thus say that unsupervised learning does not give good results on this dataset.

Conclusion

Team Members Contribution

Task Title Task Owner
Introduction & Background Mansi, Sweta, Shrestha
Problem Definition All
Methods All
Potential Results and Discussion All
PPT and Video Recording Mansi
Github Page Shrestha
Data Cleaning and Preprocessing Sweta, Mansi, Shrestha
Feature selection and visualization Sweta, Mansi, Shrestha
Data Visualization Sweta, Vastav and Pranav
Model Training Sweta, Mansi
Github Page (Midterm report) Sweta, Mansi, Shrestha
Forward Feature Selection Pranav, Vastav
Feature Selection from Random Forest Sweta
Models training (on important features / components) Sweta, Mansi
Unsupervised models failure analysis Sweta, Mansi
Powerpoint Sweta, Mansi
Video Mansi
Report Sweta, Mansi, Shrestha

References

  1. "DDoS Dataset", Kaggle, www.kaggle.com/datasets/devendra416/ddos-datasets, Accessed: 3 Oct 2022
  2. R. R. Rejimol Robinson and C. Thomas, "Ranking of machine learning algorithms based on the performance in classifying DDoS attacks," 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2015, pp. 185-190, doi: 10.1109/RAICS.2015.7488411.
  3. Suresh, Manjula, and R. Anitha. "Evaluating machine learning algorithms for detecting DDoS attacks." International Conference on Network Security and Applications. Springer, Berlin, Heidelberg, 2011.
  4. Zhang, Boyang, Tao Zhang, and Zhijian Yu. "DDoS detection and prevention based on artificial intelligence techniques." 2017 3rd IEEE International Conference on Computer and Communications (ICCC). IEEE, 2017.