
Malware Classification Project

Skills Used

  • Python
  • Machine Learning
  • Data Processing/Scripting

Description

Cyber attack attribution is the process of tracing a piece of code or malware back to the perpetrator of a cyberattack. As cyberattacks become more prevalent, attribution becomes more valuable. Attribution can be supported by reverse engineering: from the metadata of a malware executable we can gather information such as the date of creation, the variable names used, and the library calls that are imported. This information can serve as features for attribution analysis. The goal of this project is to extract features from malware samples that are useful for attribution and to analyse them with a machine learning technique in order to attribute attacks to APT groups.

Data Classification, Extraction, and Use

Since we are doing static analysis, we first needed to determine whether the malware samples were packed. Packing is a form of code obfuscation that modifies the format of the malware by compressing or encrypting the data, so we would not be able to extract the artifacts we need without first running the sample through some type of unpacking software. We therefore used a tool called PEiD, which can detect whether a malware sample is packed, and classified all of our samples as either packed or unpacked. The number of packed and unpacked samples per APT group is summarized in the table below. Only unpacked malware samples were used for attribution to an APT group.

[Table: number of packed and unpacked malware samples per APT group]

We extracted the following attributes from the VirusTotal report for each malware sample:

  • pe-resource-langs
  • imports
  • pe-entry-points

We wrote a Python script that extracts the above attributes from each JSON malware report and consolidates them into a single CSV file. The resulting CSV contains the three attributes above, the resource (i.e., the hash of the malware sample), and the APT group, which we fetched from the GitHub repository. Below is a snapshot of the unprocessed dataset.

[Snapshot of the unprocessed dataset]
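
A minimal sketch of what such an extraction script might look like, assuming the VirusTotal JSON reports sit in a reports/ directory and that a hash-to-group mapping (apt_groups.csv) was exported from the GitHub repository; the exact key names and nesting inside the reports may differ, so the fields below are illustrative:

```python
import glob
import json
import os

import pandas as pd

rows = []
for path in glob.glob("reports/*.json"):        # one VirusTotal report per sample
    with open(path) as f:
        report = json.load(f)
    info = report.get("additional_info", {})    # key name is illustrative
    rows.append({
        "resource": report.get("resource", os.path.splitext(os.path.basename(path))[0]),
        "pe-resource-langs": " ".join(info.get("pe-resource-langs", {})),
        "imports": " ".join(info.get("imports", {})),
        "pe-entry-point": info.get("pe-entry-point"),
    })

df = pd.DataFrame(rows)
# apt_groups.csv (illustrative name) maps each resource hash to its APT group.
df = df.merge(pd.read_csv("apt_groups.csv"), on="resource", how="left")
df.to_csv("malware_dataset.csv", index=False)
```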

Data Preprocessing

Next, the data had to be cleaned before it could be fed into the machine learning model. Brackets and commas in the language and library columns were removed, and all values were converted to uppercase so that the model would not treat differently cased copies of the same value as distinct features. The library and language columns were then one-hot encoded, creating a dummy variable for each unique library and language value. The APT groups were mapped to integers so they could be used as class labels by the classifier. Lastly, we removed 792 rows with null values in multiple columns. After preprocessing, we were left with 148 features and 2862 rows. Below is a snapshot of the preprocessed data.

[Snapshot of the preprocessed dataset]
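
A rough sketch of these preprocessing steps with pandas, assuming the consolidated dataset was saved as malware_dataset.csv with the column names used above:

```python
import pandas as pd

# Drop rows with null values (792 such rows in this project's dataset).
df = pd.read_csv("malware_dataset.csv").dropna()

# Strip brackets/commas and upper-case every value so that differently cased
# copies of the same library or language are not treated as distinct features.
for col in ("pe-resource-langs", "imports"):
    cleaned = (df[col].astype(str)
                      .str.replace(r"[\[\]',]", " ", regex=True)
                      .str.upper()
                      .str.split()
                      .str.join(" "))
    # One dummy column per unique value in the (space-separated) lists.
    dummies = cleaned.str.get_dummies(sep=" ").add_prefix(f"{col}_")
    df = df.drop(columns=col).join(dummies)

# Map APT group names to integer class labels.
df["apt_group"] = df["apt_group"].astype("category").cat.codes
```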

Machine Learning Model

For our machine learning analysis, we chose Random Forest: it combines multiple decision trees, which helps prevent overfitting; it makes feature importances easy to compute; and, because each tree is trained on a random subset of the data and features, it reduces the correlation of errors between trees. We used a 70/30 train/test split and the scikit-learn RandomForestClassifier. The initial hyperparameter values were:

min_samples_leaf=50, n_estimators=150, bootstrap=True, oob_score=True, n_jobs=-1, random_state=seed, max_features='auto'

The model initially achieved 58% accuracy. To improve this, we performed some hyperparameter tuning: increasing the number of estimators to 300 and lowering the minimum number of samples per leaf to 3 raised the accuracy to 83%. Further tuning did not yield better results, so we settled on 83%.
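
A sketch of the tuned model, continuing from the preprocessed DataFrame above (the seed value is arbitrary; note that recent scikit-learn releases removed max_features='auto', for which 'sqrt' is the equivalent setting for classifiers):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

seed = 42  # arbitrary fixed seed for reproducibility

X = df.drop(columns="apt_group")
y = df["apt_group"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=seed)

# Tuned settings: 300 trees and min_samples_leaf=3
# (the initial model used n_estimators=150 and min_samples_leaf=50).
clf = RandomForestClassifier(n_estimators=300, min_samples_leaf=3,
                             bootstrap=True, oob_score=True, n_jobs=-1,
                             random_state=seed, max_features="sqrt")
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```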

To check that the model was neither overfitting nor underfitting, we performed cross-validation. With 20-fold random cross-validation we achieved an accuracy of 86%, which suggests the model is not underfitting or overfitting.
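
One way to reproduce this check, reading "20-fold random cross-validation" as 20 shuffled folds:

```python
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=20, shuffle=True, random_state=seed)
scores = cross_val_score(clf, X, y, cv=cv, n_jobs=-1)
print(f"mean CV accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```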

We also computed the features that contributed most to the classification. The entry point is the most important attribute, followed by NEUTRAL, a value of the pe-resource-langs attribute.

[Chart: feature importances]
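
The ranking can be read directly from the fitted classifier's impurity-based importances, for example:

```python
import pandas as pd

importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```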

Conclusion

The model performed well with only three main features and 2862 malware samples. Accuracy could likely be improved by using more features, such as entropy or the number of sections, or by using more malware samples. We extracted fewer features due to time constraints and the scarcity of labelled APT malware datasets.
