premsakore/Microsoft-Malware-Classification-Challenge
# Malware Classification

This project attempts to classify the entries in the Microsoft Malware Classification Challenge dataset using random forests. The following scripts are used for this project:

  • check_accuracy.py - checks the accuracy of the output from pipeline.py against the true labels (only works for X_small)
  • most_common_ngrams.py - given a vector of word/n-gram counts, selects the top 1000 words/n-grams associated with each class
  • pipeline.py - executes data preprocessing and random forests on the large dataset
  • preprocessing_asm.py - generates features from the asm files using sc.wholeTextFiles
  • preprocessing_asm_download.py - generates features from the asm files by downloading the text file to drive
  • preprocessing_bytes.py - generates features from the bytes files using sc.wholeTextFiles
  • preprocessing_bytes_download.py - generates features from the bytes files by downloading the text file to drive
  • preprocessing_filesize.py - generates metadata features from the asm and bytes files

## Problem Description

The dataset for the Microsoft Malware Classification Challenge is composed of known malware files representing a mix of 9 different families. The uncompressed dataset is approximately 500GB. Files are organized in pairs consisting of a .bytes file and a .asm file. Each .bytes file is the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). Each .asm file is the metadata manifest, which is a log containing various metadata information extracted from the binary, such as function calls, strings, etc.

Each pair of files is associated with one of the following malware families:

  • Ramnit
  • Lollipop
  • Kelihos_ver3
  • Vundo
  • Simda
  • Tracur
  • Kelihos_ver1
  • Obfuscator.ACY
  • Gatak

The goal of the challenge is to correctly classify the malware family associated with each pair of files based on the content within the files.

## Feature Extraction and Transformation

Each entry consists of a pair of files: a bytes file and an asm file. From the bytes file, we extracted the bigram counts of the hexadecimal 'words' in the file. Each bigram consists of two consecutive hex byte tokens (four hex digits), e.g., '53 8F'. This gave us a feature vector of length 65536 for each file, because there are 16^4 = 65536 possible hexadecimal bigrams.

To reduce the size of this feature vector, we selected the top 1000 bigrams (by count) for each class and filtered our bigram counts to include only the union of these top bigrams. This reduced our feature vector from 65536 to ~5800.
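The two steps above can be sketched as follows. This is a minimal illustration, assuming each line of a .bytes file is an address token followed by hex byte pairs (e.g. `00401000 53 8F 48 ...`); the function names are ours, not the repository's.

```python
from collections import Counter

HEX_DIGITS = set("0123456789ABCDEF")

def bytes_bigram_counts(text):
    """Count bigrams of hex byte tokens in the body of a .bytes file.

    The leading address token on each line is dropped, and only valid
    two-digit hex tokens are kept (unknown bytes like '??' are skipped).
    """
    counts = Counter()
    for line in text.splitlines():
        tokens = line.split()[1:]  # drop the leading address token
        hex_bytes = [t for t in tokens if len(t) == 2 and set(t) <= HEX_DIGITS]
        for a, b in zip(hex_bytes, hex_bytes[1:]):
            counts[a + " " + b] += 1
    return counts

def filter_top_bigrams(counts, top_bigrams):
    """Keep only the bigrams found in the union of per-class top-1000 lists."""
    return {bg: c for bg, c in counts.items() if bg in top_bigrams}
```

In the actual pipeline, the top-bigram set would come from most_common_ngrams.py, applied per class and then unioned.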

From the asm file, we extracted the following features:

  • count of lines associated with each prefix (e.g. HEADER, idata, rdata)
  • bigram counts of common opcode commands (e.g., 'push pop' or 'mov jmp')
  • counts of .dlls affected
  • counts of __stdcall, FUNCTION, and call commands
  • counts of other special commands and datatypes, such as dwords and references to db

With minimal filtering, the asm feature vector generated from the small train dataset was over 100,000 entries long. We set a minimum document frequency threshold to drop features that appeared in fewer than a given number of training documents. With the minimum document frequency set to 30, the asm feature vector from the small train dataset shrank to ~5800 entries.
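The minimum document frequency filter can be sketched like this, assuming the per-document features are held as {feature: count} mappings (a simplification of the Spark representation used in the real scripts):

```python
from collections import Counter

def filter_by_min_df(feature_dicts, min_df=30):
    """Drop features that appear in fewer than `min_df` documents.

    `feature_dicts` is a list of per-document {feature: count} mappings.
    Document frequency counts each document at most once per feature,
    regardless of how many times the feature occurs within it.
    """
    df = Counter()
    for doc in feature_dicts:
        df.update(set(doc))  # each document contributes once per feature
    keep = {f for f, n in df.items() if n >= min_df}
    return [{f: c for f, c in doc.items() if f in keep} for doc in feature_dicts]
```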

We also extracted metadata information from the bytes and asm files. From each pair of files, we calculated the following features:

  • ratio of size of bytes file to asm file
  • ratio of size of bytes file to zipped bytes file
  • ratio of size of asm file to zipped asm file
  • ratio of zipped bytes file to zipped asm file

Different combinations of all features mentioned above were used in the classification methods listed below.
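The four metadata ratios can be computed along these lines. This sketch uses zlib at its default level as a stand-in compressor; the actual scripts may zip the files differently.

```python
import zlib

def filesize_features(bytes_data, asm_data):
    """Compute the four size-ratio metadata features for one file pair.

    Takes the raw contents of the .bytes and .asm files as bytes objects;
    'zipped' sizes are approximated with zlib at default compression.
    """
    zipped_bytes = len(zlib.compress(bytes_data))
    zipped_asm = len(zlib.compress(asm_data))
    return {
        "bytes_to_asm": len(bytes_data) / len(asm_data),
        "bytes_to_zipped_bytes": len(bytes_data) / zipped_bytes,
        "asm_to_zipped_asm": len(asm_data) / zipped_asm,
        "zipped_bytes_to_zipped_asm": zipped_bytes / zipped_asm,
    }
```

The compression ratios are a cheap proxy for content entropy: packed or encrypted malware sections compress poorly, which can vary by family.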

## Random Forests and Model Tuning

We chose random forests as our classifier because they gave the highest performance on our small test dataset; the other classifiers we tested were Naive Bayes and gradient-boosted trees. In addition, random forests train very quickly compared to gradient-based classifiers such as gradient-boosted trees and multi-layer perceptrons. Furthermore, the random forest classifier in the Spark ML package handles multi-class classification natively, whereas other classifiers such as gradient-boosted trees can only do binary classification out of the box.

Initially, we deployed a random forest model on all 65536 bigram counts from the .bytes files. We attained an average accuracy of 93% on the small test set using 100 trees with a max depth of 5, and an average accuracy of 94%-95% on the small test set using 1000 trees with a max depth of 5.

After filtering down the bigrams to ~5800 top bigrams, we attained an average accuracy of 95% using random forests with 200 trees with a max depth of 5. By increasing the max depth to 8, we managed to increase the average accuracy to 96%-97% on the small test set and attained a max accuracy of 98% under certain seeds.

Next, we tested random forest performance on the asm features (without bytes features). After filtering the asm features down to ~5800 features (by raising the minimum document frequency to 30), we attained an average accuracy of 97%-98% on the small test set using random forests with 200 trees and a max depth of 8.

We then combined the asm features and bytes features in a single vector of ~11500 features. Surprisingly, our average accuracy dropped to 97% using a random forest of 200 trees with a max depth of 8. We determined that we had too many features, and the random forest was splitting on features that may not be providing much information. After using a chi-square feature selector to reduce our combined feature vector from ~11500 to 300, we attained an average accuracy of 98% on the small test set and max accuracy of 98.6% using a random forest of 100 trees with a max depth of 8.
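The pipeline performs this step with Spark ML's chi-square feature selector; as a minimal illustration of the underlying statistic, here is a pure-Python chi-square score for a single binarized feature against the class labels (function and variable names are ours):

```python
from collections import Counter

def chi_square_score(feature_present, labels):
    """Chi-square statistic between a boolean feature and class labels.

    Builds the implicit 2 x k contingency table (feature present/absent
    vs. class) and sums (observed - expected)^2 / expected over all cells.
    Higher scores indicate a stronger feature/class association.
    """
    n = len(labels)
    class_totals = Counter(labels)
    present_by_class = Counter(l for f, l in zip(feature_present, labels) if f)
    n_present = sum(present_by_class.values())
    score = 0.0
    for cls, class_total in class_totals.items():
        for present, row_total in ((True, n_present), (False, n - n_present)):
            if present:
                observed = present_by_class[cls]
            else:
                observed = class_total - present_by_class[cls]
            expected = row_total * class_total / n
            if expected:
                score += (observed - expected) ** 2 / expected
    return score
```

Ranking features by this score and keeping the top k is exactly the reduction from ~11500 to 300 features described above, just computed at Spark scale.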

Finally, we added the metadata features (filesize features) to our 300-length vector of our best features. We were disappointed to discover that random forest accuracy actually dropped slightly (by about 0.5% to 97.5%) on the small test set when these new metadata features were included.

Our final model uses a random forest classifier of 100 trees with max depth 12 on the top 150 features from bytes (selected using chi-square feature selection) and the top 150 features from asm (selected using chi-square feature selection).

## Additional Things We Tried that Didn't Work

We tried the following features/techniques, but none improved the performance of our model:

  • PCA dimensionality reduction
  • TF-IDF of bytes features and asm features
  • Opcode 4-grams from asm file
  • Voting using multiple random forests

## Runtime Notes

Due to the large size of the dataset (500GB), we had issues fitting the data into memory during runtime. Therefore, we developed an alternative script that, for every bytes or asm file, first downloads the file to the local disk from S3, then processes it for features, and then deletes it to conserve hard disk space. This method runs far more slowly than using sc.wholeTextFiles due to the time it takes to download each file, but has very low memory and disk requirements.

These alternative preprocessors are included in the repository as preprocessing_asm_download.py and preprocessing_bytes_download.py.
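The download-process-delete pattern those scripts use can be sketched as follows. The `download` and `extract_features` callables here are stand-ins for the S3 fetch and feature-extraction code in the real preprocessors:

```python
import os
import tempfile

def process_files_low_memory(keys, download, extract_features):
    """Process a list of remote files while keeping only one on disk at a time.

    `download(key, local_path)` fetches one remote file to `local_path`;
    `extract_features(local_path)` returns that file's feature record.
    Each temp file is deleted before the next download starts.
    """
    results = {}
    for key in keys:
        fd, local_path = tempfile.mkstemp()
        os.close(fd)
        try:
            download(key, local_path)
            results[key] = extract_features(local_path)
        finally:
            os.remove(local_path)  # free disk space before the next file
    return results
```

Trading network round-trips for memory this way is what makes the `_download` variants slower but able to run on modest hardware.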
