Detection of DDoS Attacks using Machine Learning

Introduction

What is a Distributed Denial-of-Service (DDoS) Attack?

Under normal circumstances, a server in a network processes requests from legitimate clients as they arrive. A Denial-of-Service (DoS) attack hampers this normal functioning of the network by overloading the server with requests/traffic from a malicious device.

When multiple devices, rather than a single one, simultaneously overload a target host with unnecessary internet traffic, the attack is called a Distributed Denial-of-Service (DDoS) attack.

Literature Review

The authors of the proposed work [2] were able to identify the important features corresponding to a DDoS attack. We can conclude from their findings that features like IP addresses, bit rate, and packet size help immensely in distinguishing between legitimate and malicious users.

In reference [3], various ML models were implemented and compared against each other in terms of F scores. All of the models achieved an F score above 0.95, with the fuzzy c-means algorithm performing best at an F score of 0.987.

This published work [4] discusses how developing technologies like cloud computing, IoT, and artificial intelligence have led to an increase in the number of DDoS attacks, making them harder to predict. Based on recent progress in DDoS detection, the authors concluded that Naïve Bayes and Random Forest techniques work efficiently in detecting the attacks.

Data Set

We used a DDoS attack dataset that combines multiple intrusion detection datasets, created by extracting their DDoS data [1].

This dataset is balanced and has over 12 million data points with 48 features.

Some of the features that we will be interested in are:

Problem Definition

While an attack from a single malicious device can be identified and mitigated easily by the network, it becomes a challenge when multiple devices simultaneously attack the target host.

Why is DDoS Harmful?

Methodology

We first analyzed the data and performed dimensionality reduction. Our aim in this project was to use multiple machine learning models to identify the devices that are part of the attacking network before it overwhelms the system, and to compare their performance.

Data Preprocessing

Models used

Metrics:

Preprocessing

Dataset Cleaning and Visualization
The following steps were taken to clean the dataset:
  1. We first checked whether the dataset is truly balanced by examining the distribution of the labels. Out of the 12,794,627 total data points, 6,321,980 had the "Benign" label and 6,472,647 had the "DDOS" label, so the dataset is balanced.
  2. We observed that "bwd_pkt_len_max" and "bwd_pkt_len_min" (backward packet length max/min) contained inf values, so we converted the inf values to NaN and then dropped those data points.
  3. The next step was to delete all data points with a NaN value in at least one feature. We found that 29,713 data points with the "Benign" label and 30 with the "DDOS" label contained NaN values. On further examination, all of the NaN values came from the feature "Flow bytes/s". Since a significant number of the NaN values came from "Benign" data points, we deleted the feature itself, which removed all NaN values while keeping the dataset balanced.
  4. The information in the feature "FlowID" is already contained in the features "src ip", "dst ip", "src port", "dst port", and "protocol", so we dropped "FlowID" to avoid duplication.
  5. The "Timestamp" feature is not in 24-hour format and lacks AM/PM values, so the timestamps are incomplete. Only 28% of the entries had complete timestamp values, and the dataset had only about 85K unique timestamp values, so we dropped the column.
  6. We encoded "src_ip" and "dst_ip" as integers using the ipaddress library.
  7. We divided the 80 features remaining after cleaning into subsets of 20 features each and examined correlation heatmaps for each subset individually, an example of which is shown in Fig 1.
  8. From these heatmaps, we picked the 6 features that we believed to be the most relevant in the dataset, as seen in Fig 2.
  9. Fig 3 shows the correlation between the features believed to be important according to existing literature.
  10. A Random Forest classifier trained on the features we selected and one trained on the features selected in prior work give similar accuracy.
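The cleaning steps above can be sketched with pandas and the standard-library ipaddress module. The tiny frame and column names below are illustrative stand-ins, not the actual dataset schema:

```python
import ipaddress

import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (column names are illustrative).
df = pd.DataFrame({
    "bwd_pkt_len_max": [120.0, np.inf, 80.0],
    "flow_byts_s": [1500.0, 900.0, np.nan],
    "src_ip": ["192.168.0.1", "10.0.0.5", "172.16.0.9"],
    "label": ["Benign", "DDOS", "Benign"],
})

# Step 2: replace inf with NaN so the affected rows can be dropped uniformly.
df = df.replace([np.inf, -np.inf], np.nan)

# Step 3: since almost all NaNs came from one column, drop that column
# instead of its rows, keeping the class balance intact.
df = df.drop(columns=["flow_byts_s"])
df = df.dropna()  # drop any rows that still contain NaN (the former inf rows)

# Step 6: encode IP addresses as integers with the stdlib ipaddress module.
df["src_ip"] = df["src_ip"].map(lambda ip: int(ipaddress.ip_address(ip)))
```

Encoding IPs as integers preserves their numeric ordering, which keeps subnets of addresses close together in feature space.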
Dataset Preprocessing
  1. We used the StandardScaler class from sklearn.preprocessing to standardize all features to zero mean and unit variance.
  2. Below are the results returned by PCA. We decided to proceed with 24 components, which retain 95% of the variance.
  3. Explained Variance   Components Returned
     0.99                 35
     0.97                 29
     0.95                 24
     0.90                 20
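A minimal sketch of this scaling and PCA step, using synthetic data in place of the cleaned features; the 0.95 variance threshold is the one chosen above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))  # synthetic stand-in for the cleaned features

# Standardize each feature to zero mean / unit variance.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components tells PCA to keep the smallest number of
# components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
```

On the real dataset this selection yielded 24 components, per the table above.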

    Selection Of Parameters
    Metrics used for Evaluation

    Metrics for Clustering
    • Adjusted Mutual Information: \[AMI(Y_{act}, Y_{pred}) = {MI(Y_{act}, Y_{pred}) - E[MI(Y_{act}, Y_{pred})] \over \mathrm{avg}(H(Y_{act}), H(Y_{pred})) - E[MI(Y_{act}, Y_{pred})]}\]
    • Adjusted Rand Score:

      \[\text{rand index} = {\text{true positives} + \text{true negatives} \over \binom{n}{2}}\] where n is the total number of data points in our dataset

      \[\text{adjusted rand index} = { \text{rand index} - E[\text{rand index}] \over max(\text{rand index}) - E[\text{rand index}]}\]

    • Fowlkes Mallows Score: \[\text{fowlkes mallows score} = {\text{true positives} \over \sqrt{(\text{true positives} + \text{false positives})(\text{true positives} + \text{false negatives})}}\]
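All three clustering metrics above are available directly in scikit-learn; a small sketch with toy labels:

```python
from sklearn.metrics import (
    adjusted_mutual_info_score,
    adjusted_rand_score,
    fowlkes_mallows_score,
)

y_true = [0, 0, 0, 1, 1, 1]  # toy ground-truth labels
y_pred = [0, 0, 1, 1, 1, 1]  # toy cluster assignments

ami = adjusted_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)
fms = fowlkes_mallows_score(y_true, y_pred)
```

The "adjusted" scores correct for chance agreement, so random cluster assignments score near 0 rather than some positive baseline.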

    Metrics for Classification
    • Test Accuracy: \[accuracy = { \text{true positives} + \text{true negatives}\over \text{total predictions}}.\]
    • Precision: \[precision = { \text{true positives} \over \text{true positives} + \text{false positives}}.\]
    • Recall: \[recall = { \text{true positives} \over \text{true positives} + \text{false negatives}}.\]
    • F1 Score: \[F1 = { 2* \text{precision} * \text{recall} \over \text{precision} + \text{recall}}.\]
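These four classification metrics can be computed directly with scikit-learn. The toy labels below produce one false positive and one false negative:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]  # one false negative, one false positive

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 4/6
prec = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
rec = recall_score(y_true, y_pred)      # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```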

    Models trained

    Random Forest

    Random Forest has proven to be an efficient learning algorithm for large datasets compared to other training algorithms.

    Test Accuracy   Precision   Recall   F1 Score
    99.99%          1           1        1
    Fig: Confusion Matrix for Random Forest Model
    Fig: Feature importance from the random forest model
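A hedged sketch of the random forest training and the feature-importance readout shown in the figure, using make_classification as a synthetic stand-in for the DDoS features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic balanced binary problem standing in for the DDoS dataset.
X, y = make_classification(n_samples=2000, n_features=24, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))

# Impurity-based importances let us rank features, as in the figure above.
top = clf.feature_importances_.argsort()[::-1][:6]
```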

Support Vector Machine classification (SVM)

SVM is more effective in high-dimensional spaces and is relatively memory efficient.

Test Accuracy   Precision   Recall   F1 Score
99.99%          0.99        0.99     0.99
Fig: Confusion Matrix for SVM
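A comparable sketch for the SVM classifier, again on synthetic data; since SVMs are sensitive to feature scale, the scaler is part of the pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=24, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Pipeline ensures the scaler is fit only on training data, avoiding leakage.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)
acc = svm.score(X_te, y_te)
```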
Logistic Regression

It is one of the simplest machine learning algorithms to train.

The trained weights indicate the importance of each feature, as well as the direction of the association (positive or negative).

It outputs well-calibrated probabilities.

Test Accuracy   Precision   Recall   F1 Score
85%             0.99        0.76     0.86
Fig: Confusion Matrix for Logistic Regression
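A sketch of how the trained weights and predicted probabilities mentioned above can be read off a fitted model (synthetic data; the feature count is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Sign of each coefficient gives the direction of association; magnitude
# (on standardized inputs) suggests relative importance.
coefs = clf.coef_[0]
proba = clf.predict_proba(X[:1])[0]  # per-class probabilities, summing to 1
```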
K-Means

It is relatively simple to implement and suitable for large datasets.

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.1854                       0.0542                0.6376
Fig: Confusion Matrix for K-Means

This model produces zero false positives but is not good at identifying DDOS attacks, so we decided to move to a more advanced clustering algorithm.
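A minimal K-Means sketch on well-separated synthetic blobs; because cluster IDs are arbitrary, the comparison against ground truth uses a permutation-invariant score rather than raw accuracy:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two synthetic clusters standing in for the Benign/DDOS classes.
X, y_true = make_blobs(n_samples=600, centers=2, random_state=3)
km = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X)

# Adjusted Rand score is invariant to which cluster gets which ID.
ari = adjusted_rand_score(y_true, km.labels_)
```

On the actual DDoS data the analogous score was only 0.0542, which is what motivated moving on to GMM.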

Gaussian Mixture Model (GMM)

GMM classifies data based on probability distributions and is robust to outliers.
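A minimal GMM sketch on synthetic blobs (the real experiment used the PCA components), scored with the same Fowlkes Mallows metric reported here:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import fowlkes_mallows_score
from sklearn.mixture import GaussianMixture

# Two synthetic clusters standing in for the Benign/DDOS classes.
X, y_true = make_blobs(n_samples=600, centers=2, random_state=4)
gmm = GaussianMixture(n_components=2, random_state=4).fit(X)
labels = gmm.predict(X)

fms = fowlkes_mallows_score(y_true, labels)
```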

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.2237                       0.2016                0.6309

Looking at the Fowlkes Mallows score, we can evaluate how closely our predictions matched the ground truth. This score stayed at approximately 0.6; from this low value, we can conclude that the GMM clusterings were not able to closely match the distribution of the dataset.

Fig: Confusion Matrix for GMM

This also has zero false positives, which is not promising.

We then looked into the underlying reasons for this behavior in both of the clustering algorithms we implemented.

To do this, we plotted components 3 and 4 for both the DDOS (label 1) and Benign (label 0) data. We observed that the two data clusters overlap; more specifically, the DDOS data lies entirely inside a Benign data cluster. We believe this is the reason for the large misclassification, and for the models completely skipping the DDOS data.

Fig: Component 3 and 4 scatter plot for label 1
Fig: Component 3 and 4 scatter plot for label 0

We also plotted principal components 1 and 2 for the DDOS and Benign data to see whether their larger explained variance could differentiate between the clusters better.

However, these also failed to provide a satisfactory boundary or split between the two classes.

Fig: Component 1 and 2 scatter plot for label 1
Fig: Component 1 and 2 scatter plot for label 0

We then trained on all 24 PCA components to see if that improved the model; the confusion matrix is shown below. While this increased the number of true positive classifications somewhat, the improvement was still not significant enough to be satisfactory.

Adjusted Mutual Info Score   Adjusted Rand Score   Fowlkes Mallows Score
0.6488                       0.6381                0.7994
Fig: Confusion Matrix for GMM

We can thus say that unsupervised learning does not give good results on this dataset.

Conclusion

Team Members Contribution

Task Title Task Owner
Introduction & Background Mansi, Sweta, Shrestha
Problem Definition All
Methods All
Potential Results and Discussion All
PPT and Video Recording Mansi
Github Page Shrestha
Data Cleaning and Preprocessing Sweta, Mansi, Shrestha
Feature selection and visualization Sweta, Mansi, Shrestha
Data Visualization Sweta, Vastav and Pranav
Model Training Sweta, Mansi
Github Page (Midterm report) Sweta, Mansi, Shrestha
Forward Feature Selection Pranav, Vastav
Feature Selection from Random Forest Sweta
Models training (on important features / components) Sweta, Mansi
Unsupervised models failure analysis Sweta, Mansi
Powerpoint Sweta, Mansi
Video Mansi
Report Sweta, Mansi, Shrestha

References

  1. "DDoS Dataset", Kaggle, www.kaggle.com/datasets/devendra416/ddos-datasets, Accessed: 3 Oct 2022
  2. R. R. Rejimol Robinson and C. Thomas, "Ranking of machine learning algorithms based on the performance in classifying DDoS attacks," 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 2015, pp. 185-190, doi: 10.1109/RAICS.2015.7488411.
  3. Suresh, Manjula, and R. Anitha. "Evaluating machine learning algorithms for detecting DDoS attacks." International Conference on Network Security and Applications. Springer, Berlin, Heidelberg, 2011.
  4. Zhang, Boyang, Tao Zhang, and Zhijian Yu. "DDoS detection and prevention based on artificial intelligence techniques." 2017 3rd IEEE International Conference on Computer and Communications (ICCC). IEEE, 2017.