- Python
- Machine Learning
- Data Processing/Scripting
Cyber attack attribution is the process of tracing a piece of code or malware back to the perpetrator of a cyberattack. As cyberattacks become more prevalent, attribution becomes increasingly valuable. Attribution can be performed through reverse engineering: from the metadata of a malware executable file, we can gather data such as the date of creation, the variable names used, and which library calls are imported. This information can serve as features for attribution analysis. Our task is to extract features from malware that can be used for attribution and analyse them with a suitable technique to attribute the attacks.
Since we are doing static analysis, we needed to consider whether the malware samples were packed. Packing is a form of code obfuscation that modifies the format of the malware by compressing or encrypting its data, so we would not be able to extract the artifacts we need without first running the sample through unpacking software. We therefore used a tool called PEiD, which can detect whether a sample is packed, and classified all of our malware into packed and unpacked categories. The number of packed and unpacked samples per APT group is summarized in the table below. Only unpacked malware samples were used for attribution to an APT group.
We extracted the following attributes from the VirusTotal report for each malware sample:
- pe-resource-langs
- imports
- pe-entry-points
We created a Python script that extracts the above attributes from each JSON malware report file and consolidates them into a single CSV file. The CSV contains the three attributes above, plus a resource column (the hash of the malware) and an APT group column, both fetched from the GitHub repository. Below is a snapshot of the unprocessed dataset.
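The extraction step can be sketched roughly as below. The exact JSON field names and layout of the VirusTotal reports are assumptions here (real reports nest attributes under keys such as `additional_info`), as is the `apt_labels` hash-to-group mapping built from the GitHub repository.

```python
import csv
import json
import os

def extract_reports(report_dir, out_csv, apt_labels):
    """Consolidate selected attributes from JSON malware reports into one CSV.

    apt_labels: dict mapping a sample's hash to its APT group name,
    assumed to have been built from the GitHub repository.
    """
    rows = []
    for name in os.listdir(report_dir):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(report_dir, name)) as f:
            report = json.load(f)
        info = report.get("additional_info", {})  # assumed location of PE attributes
        sha256 = report.get("sha256", "")
        rows.append({
            "resource": sha256,
            "pe-resource-langs": ";".join(info.get("pe-resource-langs", {})),
            "imports": ";".join(info.get("imports", [])),
            "pe-entry-point": info.get("pe-entry-point", ""),
            "apt_group": apt_labels.get(sha256, ""),
        })
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
```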
Next, the data needed to be cleaned before it could be fed into the machine learning model. Brackets and commas in the language and library columns were removed, and all values were converted to uppercase so the model would not treat differently cased copies of the same value as distinct and produce inaccurate results. The library and language columns were then one-hot encoded, creating a dummy variable for each unique library and language value. The APT groups were mapped to integers so they could be used as class labels by the classifier. Lastly, we removed 792 rows with null values in multiple columns. After preprocessing, we were left with 148 features and 2,862 rows. Below is a snapshot of the preprocessed data.
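A minimal sketch of this cleaning pipeline using pandas is shown below. The column names (`pe-resource-langs`, `imports`, `apt_group`) are assumptions based on the extracted attributes, not necessarily the exact names in our CSV.

```python
import pandas as pd

def preprocess(df):
    """Clean, one-hot encode, and label-encode the extracted dataset."""
    # Strip brackets, commas, and quotes, then uppercase so that
    # identically named values are not treated as distinct.
    for col in ["pe-resource-langs", "imports"]:
        df[col] = (df[col].astype(str)
                   .str.replace(r"[\[\],']", "", regex=True)
                   .str.upper())
    # One-hot encode the categorical language and library columns.
    df = pd.get_dummies(df, columns=["pe-resource-langs", "imports"])
    # Map APT group names to integer class labels.
    df["apt_group"] = df["apt_group"].astype("category").cat.codes
    # Drop rows that still contain null values.
    return df.dropna()
```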
For our machine learning analysis, we chose Random Forest because it combines multiple decision trees, which helps prevent overfitting; it makes feature importances easy to compute; and it reduces the correlation of errors between its prediction trees. We used a 70-30 train-test split and the scikit-learn RandomForestClassifier. The initial hyperparameter values were:
min_samples_leaf=50, n_estimators=150, bootstrap=True, oob_score=True, n_jobs=-1, random_state=seed, max_features='auto'
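The training step can be sketched as follows. `X` and `y` stand for the one-hot feature matrix and the integer APT labels from the preprocessed data; the seed value is arbitrary. Note that `max_features='auto'` from the listing above was the older scikit-learn alias for `'sqrt'` on classifiers, which is what current versions expect.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def train_initial_model(X, y, seed=42):
    """Fit a Random Forest with the initial hyperparameters on a 70-30 split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)  # 70% train, 30% test
    clf = RandomForestClassifier(
        min_samples_leaf=50, n_estimators=150, bootstrap=True,
        oob_score=True, n_jobs=-1, random_state=seed,
        max_features="sqrt")  # 'sqrt' == the old 'auto' for classifiers
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```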
The model initially achieved 58% accuracy. To improve it, we tuned the hyperparameters: increasing the number of estimators to 300 and lowering the minimum number of samples per leaf to 3 raised accuracy to 83%. Further tuning did not yield better results, so we stopped at 83%.
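One way to search over these two hyperparameters is a small grid search; the grid values below are illustrative, not the exact grid we used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_forest(X, y, seed=42):
    """Grid-search the two hyperparameters adjusted during tuning."""
    grid = {"n_estimators": [150, 300], "min_samples_leaf": [3, 10, 50]}
    search = GridSearchCV(
        RandomForestClassifier(n_jobs=-1, random_state=seed),
        grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    return search.best_params_, search.best_score_
```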
To check that the model was neither overfitting nor underfitting, we performed cross-validation. With 20-fold cross-validation we achieved an accuracy of 86%, which indicates the model is not underfitting or overfitting.
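The cross-validation check can be sketched as below, using the tuned hyperparameters; `X`, `y`, and the seed are assumptions as before.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def cv_accuracy(X, y, seed=42):
    """Mean accuracy of the tuned model under 20-fold cross-validation."""
    clf = RandomForestClassifier(
        n_estimators=300, min_samples_leaf=3, n_jobs=-1, random_state=seed)
    scores = cross_val_score(clf, X, y, cv=20)  # 20 stratified folds
    return scores.mean()
```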
We also calculated the features that contributed most to the classification. The entry-point attribute is the most important, followed by NEUTRAL, which is a value of the pe-resource-langs attribute.
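Ranking features this way uses the impurity-based importances that a fitted Random Forest exposes; a small sketch, assuming `clf` is the fitted classifier and `feature_names` lists the one-hot column names:

```python
import pandas as pd

def top_features(clf, feature_names, k=5):
    """Return the k features with the highest impurity-based importance."""
    imp = pd.Series(clf.feature_importances_, index=feature_names)
    return imp.sort_values(ascending=False).head(k)
```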
The model performed well with only three main attributes and 2,862 malware samples. Accuracy might improve with more features, such as entropy or the number of PE sections, or with more malware samples. We extracted fewer features due to time constraints and the scarcity of labelled APT malware datasets.