Under normal circumstances, a server in a network processes requests from legitimate clients as they arrive. A Denial-of-Service (DoS) attack disrupts this normal functioning by overloading the server with requests or traffic from a malicious device.
When multiple devices, rather than a single one, simultaneously flood a target host with unnecessary internet traffic, the attack is called a Distributed Denial-of-Service (DDoS) attack.
The authors of the proposed work [2] were able to identify the important features corresponding to a DDoS attack. From their findings, we can conclude that features such as IP addresses, bit rate, and packet size help immensely in distinguishing between legitimate and malicious users.
In reference [3], various ML models were implemented and compared against each other in terms of F scores. All the models achieve an F score above 0.95, with the fuzzy c-means algorithm performing best at an F score of 0.987.
This published work [4] discusses how developing technologies like cloud computing, IoT, and artificial intelligence have led to an increase in the number of DDoS attacks, making them harder to predict. Based on the latest progress in DDoS detection, the authors conclude that Naïve Bayes and Random Forest techniques work efficiently in detecting the attacks.
We have used a DDoS attack dataset, which is a combination of multiple intrusion detection datasets created by extracting their DDoS data [1].
This dataset is balanced and has over 12 million datapoints with 48 features.
Some of the features that we will be interested in are:
While an attack from a single malicious device can be identified and mitigated by the network relatively easily, it becomes a challenge when multiple devices attack the target host simultaneously.
We first analyzed the data and performed dimensionality reduction. Our aim in this project was to use multiple machine learning models to identify the devices that are part of the attacking network before they overwhelm the system, and to compare the models' performance.
Data Preprocessing
Models used
Metrics:
| Variance Retained | PCA Components Returned |
|---|---|
| 0.99 | 35 |
| 0.97 | 29 |
| 0.95 | 24 |
| 0.90 | 20 |
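A selection like the one in the table above can be reproduced with scikit-learn's `PCA`, which accepts a target explained-variance fraction directly. This is a minimal sketch on synthetic stand-in data (the real pipeline uses our 48-feature dataset, so the component counts printed here will differ from the table):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real 48-feature dataset:
# a random linear mix produces correlated columns, as real traffic features would be
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 48)) @ rng.normal(size=(48, 48))

# Standardize first so no single feature dominates the variance
X_std = StandardScaler().fit_transform(X)

# Passing a float < 1 to n_components keeps the smallest number of
# components whose cumulative explained variance reaches that fraction
for target in (0.99, 0.97, 0.95, 0.90):
    pca = PCA(n_components=target).fit(X_std)
    print(target, pca.n_components_)
```

Fitting once per threshold is shown for clarity; in practice one full fit followed by a cumulative sum over `explained_variance_ratio_` is cheaper.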
| Test Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|
| 99.99% | 1 | 1 | 1 |
SVM is more effective in high-dimensional spaces and is relatively memory efficient.
| Test Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|
| 99.99% | 0.99 | 0.99 | 0.99 |
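An SVM classifier for this task can be sketched as below. The data here is a synthetic stand-in, and we use `LinearSVC` for the sketch since kernel SVMs scale poorly to millions of rows (the exact SVM variant and hyperparameters used in our experiments are not restated here):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Toy separable data standing in for the PCA-reduced features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 24)), rng.normal(4, 1, (500, 24))])
y = np.array([0] * 500 + [1] * 500)  # 0 = benign, 1 = DDoS

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Standardize, then fit a linear-kernel SVM
clf = make_pipeline(StandardScaler(), LinearSVC()).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```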
It is one of the simplest machine learning algorithms to train
The predicted parameters (trained weights) give inference about the importance of each feature. The direction of association, i.e. positive or negative, is also given.
Outputs well-calibrated probabilities
| Test Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|
| 85% | 0.99 | 0.76 | 0.86 |
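The interpretability point above (trained weights revealing each feature's direction of association) can be illustrated with a small sketch on synthetic data, where the true effect directions are known by construction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: feature 0 pushes toward the attack class,
# feature 1 pushes away from it (known by construction)
rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 2))
logit = 3 * X[:, 0] - 2 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression().fit(X, y)

# The signs of the trained weights recover the direction of association,
# and predict_proba returns the class probabilities
print(clf.coef_)
print(clf.predict_proba(X[:3]))
```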
It is relatively simple to implement and is suitable for large datasets.
| Adjusted Mutual Info Score | Adjusted Rand Score | Fowlkes-Mallows Score |
|---|---|---|
| 0.1854 | 0.0542 | 0.6376 |
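The three clustering scores above can be computed with scikit-learn's metrics. The sketch below assumes K-means with two clusters on synthetic stand-in data (the actual features and fitted model from our experiments are not restated here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             fowlkes_mallows_score)

# Toy two-cluster data standing in for the PCA-reduced features
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 4)), rng.normal(3, 1, (500, 4))])
labels = np.array([0] * 500 + [1] * 500)  # ground truth

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# All three scores are invariant to cluster-id permutation, so the
# cluster ids do not need to match the ground-truth label ids
print(adjusted_mutual_info_score(labels, pred))
print(adjusted_rand_score(labels, pred))
print(fowlkes_mallows_score(labels, pred))
```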
This clustering produced zero false positives, but only because it failed to flag the attack traffic at all, so it is not good at identifying DDoS attacks. We decided to move to a more advanced clustering algorithm.
Gaussian Mixture Model (GMM)
GMM is used to classify data based on probability distributions. It is robust to outliers.
| Adjusted Mutual Info Score | Adjusted Rand Score | Fowlkes-Mallows Score |
|---|---|---|
| 0.2237 | 0.2016 | 0.6309 |
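A GMM clustering like the one scored above can be sketched as follows. The synthetic blobs here deliberately overlap, loosely mimicking the overlap we observed between the DDoS and benign clusters (the real data and fitted parameters are not restated):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import fowlkes_mallows_score

# Two overlapping Gaussian blobs, loosely mimicking the overlap
# between the DDoS and benign clusters in our PCA projection
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), rng.normal(1.5, 1, (500, 2))])
labels = np.array([0] * 500 + [1] * 500)

# GMM assigns each point by its posterior probability under each
# fitted Gaussian component, unlike K-means' hard nearest-centroid rule
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
pred = gmm.predict(X)
print(fowlkes_mallows_score(labels, pred))
```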
Looking at the Fowlkes-Mallows score, we can evaluate how closely our predictions matched the ground truth. This score consistently stayed at approximately 0.6. From this low value, it can be concluded that the GMM clusterings were not able to closely match the distribution of the dataset.
This also produced zero false positives, which is not promising.
We then investigated the underlying reason for this behavior in both of the clustering algorithms implemented.
To do this, we plotted components 3 and 4 for both the DDoS (label 1) and benign (label 0) data. We observed that the two data clusters were overlapping; more specifically, the DDoS data all lay inside a benign data cluster. We believe this is the reason for the large misclassification and for the models completely skipping the DDoS data.
We also tried plotting principal components 1 and 2 for the DDoS and benign data, to see whether the larger explained variance in them could differentiate between these clusters better.
However, these also failed to provide a satisfactory boundary or split between the two classes.
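The overlap check can be reproduced with a simple scatter of pairs of principal components, colored by label. This sketch uses random stand-in data; `X_pca` and `y` represent our projected features and ground-truth labels:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Stand-ins for the PCA-projected data and the ground-truth labels
rng = np.random.default_rng(0)
X_pca = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.5).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (i, j) in zip(axes, [(0, 1), (2, 3)]):
    for label, name in [(0, "Benign"), (1, "DDoS")]:
        m = y == label
        ax.scatter(X_pca[m, i], X_pca[m, j], s=4, alpha=0.4, label=name)
    ax.set_xlabel(f"PC{i + 1}")
    ax.set_ylabel(f"PC{j + 1}")
    ax.legend()
fig.savefig("pca_overlap.png")
```

If the DDoS points sit entirely inside the benign cloud in both panels, a centroid- or density-based clusterer has no boundary to find, which matches what we observed.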
We then trained the model on all 24 PCA components to see if that improved our results. Below is the confusion matrix for the same. While it increased the true positive classifications somewhat, the improvement still wasn't significant enough to be satisfactory.
| Adjusted Mutual Info Score | Adjusted Rand Score | Fowlkes-Mallows Score |
|---|---|---|
| 0.6488 | 0.6381 | 0.7994 |
We can thus say that unsupervised learning does not give us good results for this dataset.
| Task Title | Task Owner |
|---|---|
| Introduction & Background | Mansi, Sweta, Shrestha |
| Problem Definition | All |
| Methods | All |
| Potential Results and Discussion | All |
| PPT and Video Recording | Mansi |
| Github Page | Shrestha |
| Data Cleaning and Preprocessing | Sweta, Mansi, Shrestha |
| Feature selection and visualization | Sweta, Mansi, Shrestha |
| Data Visualization | Sweta, Vastav and Pranav |
| Model Training | Sweta, Mansi |
| Github Page (Midterm report) | Sweta, Mansi, Shrestha |
| Forward Feature Selection | Pranav, Vastav |
| Feature Selection from Random Forest | Sweta |
| Models training (on important features / components) | Sweta, Mansi |
| Unsupervised models failure analysis | Sweta, Mansi |
| Powerpoint | Sweta, Mansi |
| Video | Mansi |
| Report | Sweta, Mansi, Shrestha |