Malware Classification using Deep Learning

Classifying malware with deep learning, based on malware behavior data.

For more information, see the FAQs section.

Running this project

This project contains four directories, each with its own purpose:

BehaviorDownload - Contains a Python script that downloads the sample malware behavior dataset used by this project.
DataPreProcess - Contains a script that transforms the behavior data into vectors of size 10,000.
MachineLearning - Contains the code for the deep denoising autoencoder and the deep neural network, the two primary machine learning algorithms used in this project.
WebSystem - Contains the web system, built with Node.js (front-end) and Python Flask (back-end).

Training Flow

Step-by-step of training the data

These steps are aimed at users of an Ubuntu distribution.

It is theoretically possible to run this project under Windows, but I do not advise it, so you are on your own.

I also assume that you have an NVIDIA GPU, as it makes training roughly ten times faster than running it on the CPU.

1. Install dependencies

The code in this project uses Python 3 syntax, so you need to install a Python 3 interpreter.

sudo apt install python3 python3-venv git

It is advisable to create a virtual environment for running the project. The step below is OPTIONAL but HIGHLY RECOMMENDED.

mkdir -p ~/python-venv/malware-DL-env
python3 -m venv ~/python-venv/malware-DL-env

Activate the newly created environment (you need to do this every time you want to run the project)

source ~/python-venv/malware-DL-env/bin/activate

Clone this project

git clone https://github.com/shahril96/Malware-Classification-using-Deep-Learning
cd Malware-Classification-using-Deep-Learning

Install the external Python libraries

pip3 install --upgrade pip
pip3 install -r requirements.txt

Install the CUDA 10 packages

# Get Ubuntu version
VERSION=$(lsb_release -r | awk '{print $2}' | sed 's/\.//g')

# Install CUDA Toolkit 10
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${VERSION}/x86_64/cuda-repo-ubuntu${VERSION}_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu${VERSION}/x86_64/7fa2af80.pub && sudo apt update
sudo dpkg -i cuda-repo-ubuntu${VERSION}_10.0.130-1_amd64.deb

sudo apt update
sudo apt install -y cuda

# Install CuDNN 7 and NCCL 2
wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu${VERSION}/x86_64/nvidia-machine-learning-repo-ubuntu${VERSION}_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu${VERSION}_1.0.0-1_amd64.deb

sudo apt update
sudo apt install -y libcudnn7 libcudnn7-dev libnccl2 libc-ares-dev

Install PyTorch (the deep learning framework) by following the link below. Make sure that you install PyTorch for the correct Python version (you can check it with python3 --version).

https://pytorch.org/get-started/locally

If you followed the instructions above and hit no installation problems, you are good to go.
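As a quick sanity check after installation, a short script along these lines can report whether PyTorch is importable and whether it can see a CUDA GPU. This is only a sketch: the helper name describe_torch_device is made up for illustration and is not part of the project.

```python
def describe_torch_device():
    """Return 'cuda', 'cpu', or 'missing' depending on the local PyTorch setup."""
    try:
        import torch  # imported lazily so the check itself never crashes
    except ImportError:
        return "missing"
    # torch.cuda.is_available() is True only when a usable CUDA GPU is found
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("PyTorch device:", describe_torch_device())
```

If this prints cpu on a machine with an NVIDIA GPU, the CUDA or driver installation is the first thing to re-check.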

2. Getting Malware Behaviors Data

I have prepared the hashes of all the malware in my training data inside BehaviorDownload/malware_hash.txt. To obtain the malware behavior data, you can use the VirusTotal /file/report API with the allinfo parameter, which returns summarized behavior data.

To automate this fetching task, I have created the fetcher script vt_behavior_hash_fromfile.py inside BehaviorDownload.

cd BehaviorDownload
python3 vt_behavior_hash_fromfile.py malware_hash.txt VT-Data <api-key>
cd ..

Please note that calling /file/report with the allinfo parameter requires a Private API key. If you do not know anyone who has one, you can contact VirusTotal and request Academic API access.
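For readers curious what such a request looks like, here is a minimal sketch using only the Python standard library. It is not the code of vt_behavior_hash_fromfile.py, which may work differently. The endpoint URL and the apikey, resource, and allinfo parameter names follow the VirusTotal API v2 as I understand it; treat them as assumptions and check the official documentation. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# VirusTotal API v2 endpoint for file reports (assumed, see note above)
VT_REPORT_URL = "https://www.virustotal.com/vtapi/v2/file/report"


def build_report_url(file_hash, api_key):
    """Build the /file/report URL asking for the extended (allinfo) report."""
    query = urllib.parse.urlencode(
        {"apikey": api_key, "resource": file_hash, "allinfo": 1}
    )
    return f"{VT_REPORT_URL}?{query}"


def fetch_report(file_hash, api_key, timeout=30):
    """Download and decode one report (needs network access and a Private API key)."""
    url = build_report_url(file_hash, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A real fetcher would also need to respect the API rate limit and handle hashes that VirusTotal does not know about.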

3. Pre-Process Behaviors Data

As the behavior data downloaded from VirusTotal is in JSON format, it must be converted into a fixed-size format that a neural network can consume.

To accomplish this, I follow the technique from the David et al. paper, which uses 1-grams (unigrams) for the feature representation.

Move all the downloaded JSON files to the DataPreProcess/behaviors-data folder.

mkdir -p DataPreProcess/behaviors-data
mv -v BehaviorDownload/VT-Data/*.json DataPreProcess/behaviors-data

Then run the DataPreProcess/pre_process.py script.

cd DataPreProcess
python3 pre_process.py behaviors-data
cd ..

It will generate two files:

dataset.csv.xz - contains the transformed 1-gram feature data, compressed
top_unigrams.txt - contains the mapping of the top 1-grams (unigrams)

Optional - bitstring_visualize.py contains code to visualize the transformed data. This can give you an idea of how the behaviors of two malware samples differ at the bit-feature level.
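The unigram transformation in this step can be pictured with the sketch below. It is an illustration only, not the logic of pre_process.py: splitting JSON values on whitespace to get tokens is my assumption, all function names are hypothetical, and the two tiny reports are made-up data. The idea is to keep the unigrams that appear in the most reports, then encode each report as one 0/1 flag per kept unigram.

```python
from collections import Counter


def extract_unigrams(report):
    """Flatten a decoded JSON report into a set of string tokens (unigrams)."""
    tokens = set()
    stack = [report]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            stack.extend(node.values())
        elif isinstance(node, list):
            stack.extend(node)
        elif node is not None:
            tokens.update(str(node).split())
    return tokens


def top_unigrams(reports, size):
    """Pick the `size` unigrams that occur in the most reports."""
    counts = Counter()
    for report in reports:
        counts.update(extract_unigrams(report))
    return [token for token, _ in counts.most_common(size)]


def to_bit_vector(report, vocabulary):
    """Encode a report as 0/1 flags, one per vocabulary entry."""
    present = extract_unigrams(report)
    return [1 if token in present else 0 for token in vocabulary]


# Made-up reports, for illustration only
reports = [
    {"files": ["a.dll", "b.exe"], "mutex": "m1"},
    {"files": ["a.dll"], "registry": ["Run"]},
]
vocabulary = top_unigrams(reports, size=3)  # the project keeps 10,000
vectors = [to_bit_vector(report, vocabulary) for report in reports]
```

With a vocabulary of 10,000 unigrams, every sample becomes a vector of the same length regardless of how large its original report was.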

4. Feature Compress Behaviors Data

Now we need to reduce the data from vectors of size 10,000 to vectors of size 20 using a deep denoising autoencoder. The reasons for choosing this method are explained in the FAQs section below.

Copy the generated DataPreProcess/dataset.csv.xz into the MachineLearning folder, and run DAE.py to start training the deep denoising autoencoder.

cp -v DataPreProcess/dataset.csv.xz MachineLearning/
cd MachineLearning/
python3 DAE.py

Training takes some time (about half a day). By default, the network is trained for 1,000 epochs.

The code can also resume training if the process is interrupted half-way, by saving checkpoint data to checkpoint-DAE.pt. Just run the script again and it will resume from the checkpoint.

If you do not want to wait for 1,000 epochs, you can edit DAE.py and change the num_epochs hyper-parameter.
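The resume behavior described above follows a common pattern: load the checkpoint if one exists, continue from the saved epoch, and save again after every epoch. The sketch below shows that pattern in isolation. It is not the code in DAE.py: it uses pickle as a stand-in serializer (the .pt extension suggests PyTorch's own serialization, which this sketch does not use), and train_with_checkpoint and count_up are hypothetical names.

```python
import os
import pickle
import tempfile


def train_with_checkpoint(path, num_epochs, train_one_epoch):
    """Run train_one_epoch(state) up to num_epochs, resuming from path if it exists."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            checkpoint = pickle.load(handle)  # resume from the saved epoch
    else:
        checkpoint = {"epoch": 0, "state": None}

    for epoch in range(checkpoint["epoch"], num_epochs):
        checkpoint["state"] = train_one_epoch(checkpoint["state"])
        checkpoint["epoch"] = epoch + 1
        # Write to a temporary file first so an interruption cannot corrupt the checkpoint
        temporary_path = path + ".tmp"
        with open(temporary_path, "wb") as handle:
            pickle.dump(checkpoint, handle)
        os.replace(temporary_path, path)
    return checkpoint


def count_up(state):
    """Stand-in for one epoch of training: just counts the epochs seen."""
    return (state or 0) + 1


with tempfile.TemporaryDirectory() as workdir:
    checkpoint_path = os.path.join(workdir, "checkpoint-demo.pkl")
    first_run = train_with_checkpoint(checkpoint_path, 3, count_up)  # stops after 3 epochs
    resumed = train_with_checkpoint(checkpoint_path, 5, count_up)    # continues at epoch 4
```

In a real training loop the saved state would hold the model parameters and the optimizer state rather than a counter.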

Once training has completed, several files will have been generated:

DAE-Trained-Model.pt - contains parameters (weights and biases) of the trained network
encoded-form-DAE.csv - contains the behavior data compressed into vectors of size 20
checkpoint-DAE.pt - contains network training checkpoint (this can be safely deleted after training has completed)

Optional - umap_visualize.py contains code to visualize the compressed vectors of size 20 as a two-dimensional graph. Highly recommended if you want to see for yourself how the data is compressed while still retaining its spatial information.

5. Training Compressed Behaviors Data for Classification Task

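Copy encoded-form-DAE.csv into the MachineLearning/supervised_training folder. The command below assumes you are still inside the MachineLearning folder from the previous step.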

cp -v encoded-form-DAE.csv supervised_training/

As the next phase trains the network for the classification task, we need to split the dataset into two files: a training dataset (70%) and a test dataset (30%). This can be done by running the split_dataset.py script.

cd supervised_training
python3 split_dataset.py encoded-form-DAE.csv

Two files will be generated: train_dataset.csv and test_dataset.csv.
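A 70/30 split of this kind can be sketched as follows. This is an illustration, not the code of split_dataset.py: whether that script shuffles, uses a fixed seed, or stratifies by malware type is not stated here, so the shuffle and the seed below are assumptions, and split_rows is a hypothetical name.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Made-up rows of the form [sample_id, class_id], for illustration only
rows = [[i, i % 5] for i in range(10)]
train, test = split_rows(rows)
```

Keeping the test rows completely out of training is what makes the accuracy reported later meaningful.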

Go to the supervised_training/MLP folder and run MLP.py to start training the deep neural network. It takes about half a day to finish the 1,000 epochs.

cd MLP
python3 MLP.py

This script can also resume training if it is interrupted half-way, using the checkpoint-MLP.pt checkpoint file.

Optional - If you are curious about the other folders inside supervised_training: they are leftovers from my earlier experiments to see which machine learning algorithms classify this malware behavior dataset most accurately. I have left them there in case someone wants to experiment for themselves. :)

Once training has completed, this script will generate three files:

MLP-Trained-Model.pt - contains parameters (weights and biases) of the trained network
MLP_CM.png - confusion matrix image
MLP_CR.png - classification report image
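For readers unfamiliar with confusion matrices, the sketch below shows how one is computed from true and predicted labels. It is an illustration with made-up labels, not the code that produces MLP_CM.png, and the order of the class names is an assumption.

```python
FAMILIES = ["Cerber", "CryptoWall", "GandCrab", "Petya", "Sality"]  # assumed order


def confusion_matrix(true_labels, predicted_labels, classes=FAMILIES):
    """Rows are the true class, columns the predicted class."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Made-up labels, for illustration only
truth = ["Cerber", "Cerber", "Petya", "Sality"]
guess = ["Cerber", "Petya", "Petya", "Sality"]
matrix = confusion_matrix(truth, guess)
```

Off-diagonal cells show which malware types the network confuses with each other, which a single accuracy number hides.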

6. Web System for Malware Classification/Prediction Task

Our networks can now predict which of the following types an unknown malware sample belongs to: Cerber, CryptoWall, GandCrab, Petya, or Sality. The WebSystem folder contains a web prediction system that performs this prediction task.
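The last step of such a prediction, turning the five output scores of a classifier into a malware type, can be sketched like this. It is a generic softmax-and-argmax illustration, not the back-end's actual code; the class order and the example scores are assumptions.

```python
import math

FAMILIES = ["Cerber", "CryptoWall", "GandCrab", "Petya", "Sality"]  # assumed order


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exponentials = [math.exp(score - peak) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def predict_family(scores, families=FAMILIES):
    """Return (family, probability) for the highest-scoring class."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return families[best], probabilities[best]


# Made-up output scores, for illustration only
family, confidence = predict_family([0.1, 2.5, -1.0, 0.3, 0.0])
```

Note that a classifier of this kind always picks one of the five types, even for a sample that belongs to none of them.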

This web system has two components, a front-end and a back-end.

front-end - uses Node.js + Vue.js, and serves as the interface for the prediction system
back-end - uses Python Flask, serves the API to the front-end component, and acts as the network inference system

We need to copy the trained networks and some other files for it to run. Run the commands below from the repository root.

cp -v DataPreProcess/{dataset.csv.xz,top_unigrams.txt} WebSystem/back-end/resources/
cp -v MachineLearning/DAE-Trained-Model.pt WebSystem/back-end/resources/model_data/
cp -v MachineLearning/supervised_training/MLP/MLP-Trained-Model.pt WebSystem/back-end/resources/model_data/

Both the front-end and the back-end need to run at the same time for the web system to function. Open two new terminals and run each component in its own terminal.

Terminal 1

source ~/python-venv/malware-DL-env/bin/activate  # activate our environment
cd WebSystem/front-end/
npm install  # install Node.js dependencies
npm run dev  # run the front-end component

Terminal 2

source ~/python-venv/malware-DL-env/bin/activate  # activate our environment
cd WebSystem/back-end/
python3 run

If all goes well, your web browser should open a new page at http://localhost:8080 showing the web system interface.

You have done all the required steps. Well done! :)

Screenshot

A screenshot for some eye candy.

FAQs

What is it?

This project explores the possibility of training a deep neural net that can classify a small subset of chosen malware types (Cerber, CryptoWall, GandCrab, Petya, Sality).

This project was done as my FYP (final-year project), a requirement of my university (Universiti Teknologi Mara), and the technique used was largely inspired by the David et al. paper.

What technique does this project use?

To achieve this, the project relies on the behaviors a malware executable produces when it runs. Malware behavior data is obtained by running the malware inside a sandbox that can capture an executable's runtime behavior. In this project, I depended on Cuckoo Sandbox output reports for the dataset.

However, running Cuckoo Sandbox myself is expensive, as it takes a considerable amount of time even for a single malware sample. To reduce the burden of collecting this behavior data, I relied on the allinfo report of the VirusTotal API to obtain summarized behavior data.

This behavior data is then pre-processed further to obtain a vector of size 10,000 for every malware sample. This step is important because a neural network requires a fixed-size input vector.

However, a vector of size 10,000 is still considerably large to train on. To solve this problem, a deep denoising autoencoder is trained on all the behavior data. A deep denoising autoencoder can perform dimensionality reduction: it compresses data with a large number of features into data with a small number of features, while still retaining its spatial information. In this case, the vectors of size 10,000 are reduced to vectors of size 20.

With the input vectors ready, a deep neural network is trained on them for the classification task. The end result is a model capable of classifying malware and outputting the correct malware type.

What is a Deep Denoising Autoencoder? Why did you choose it?

To compress vectors of size 10,000 into vectors of size 20, we need an algorithm that can perform dimensionality reduction on the data while still maintaining its spatial information. This reduces the complexity of training on the behavior data later with the deep neural network.

Several algorithms can perform dimensionality reduction, such as PCA and autoencoders. From my online reading, autoencoders are more capable, as they can have non-linear encoders and decoders, whereas PCA is restricted to a linear map.

I chose a deep denoising autoencoder, an idea I got from reading the David et al. paper. The difference between my technique and theirs is that they trained the network layer by layer (pretraining the network) using the Deep Belief Network (DBN) idea of Hinton et al. While this idea was phenomenal in 2006 (it was the first method to successfully train deep neural networks), modern techniques (such as the ReLU activation function) make it largely unnecessary today.

My idea is basically to construct a normal deep autoencoder without any pretraining tricks. The layer sizes that I chose are (10000, 3000, 500, 100, 20, 100, 500, 3000, 10000). The layer in the middle of the network is much smaller than the others. This creates a bottleneck, and the network is forced to learn how to represent the data correctly in a lower dimension. A denoising technique is also applied so that the network generalizes, preventing overfitting.
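To make the layer sizes and the denoising idea concrete, here is a small framework-free sketch. It rests on two assumptions of mine: that every layer is fully connected with a bias term, and that the corruption is masking noise (randomly zeroing input features). The actual noise type and layer details in DAE.py may differ, and the function names are hypothetical.

```python
import random

LAYER_SIZES = (10000, 3000, 500, 100, 20, 100, 500, 3000, 10000)


def layer_pairs(sizes):
    """(inputs, outputs) of each consecutive layer."""
    return list(zip(sizes[:-1], sizes[1:]))


def parameter_count(sizes):
    """Weights plus biases, assuming every layer is fully connected."""
    return sum(n_in * n_out + n_out for n_in, n_out in layer_pairs(sizes))


def mask_noise(bits, drop_probability, rng):
    """Denoising corruption: randomly zero out input features."""
    return [0 if rng.random() < drop_probability else bit for bit in bits]


bottleneck = min(LAYER_SIZES)                   # width of the compressed code
total_parameters = parameter_count(LAYER_SIZES)

# Made-up input bits, for illustration only
clean = [1, 0, 1, 1, 0, 1, 1, 1]
noisy = mask_noise(clean, drop_probability=0.25, rng=random.Random(0))
```

Under the fully connected assumption this comes to about 63 million parameters, almost all of them in the two 10000-by-3000 layers. During training the network receives the noisy vector but is scored against the clean one, which is what pushes it to learn robust features instead of copying its input.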

Why are you not releasing your sample behavior data?

As my sample input data was obtained using the VirusTotal service, I must honor their TOS, which clearly says that I am not allowed to publicly release any data obtained from it.

Can you release trained models?

You can get them from here. Just follow step 6 of the instructions above and you are good to go.

What software did you use for training on your behavior data?

This project is written in Python. The PyTorch deep learning (DL) framework was chosen because it offers me enough freedom and is easier to debug than TensorFlow. Furthermore, I heard it was much loved in academia, so I gave it a try. Currently, I love it a lot. :)

I also tested different kinds of machine learning algorithms (Scikit-learn, XGBoost), which can be found inside the MachineLearning/supervised_training folder.

On what hardware did you train your dataset?

I own a laptop with an NVIDIA GTX 980M, which is still pretty fast for me (about half a day to train the deep autoencoder).

If you do not have the hardware, I highly suggest trying Google Colaboratory, which offers a free service (up to 12 hours per instance) with an NVIDIA Tesla K80.

License

This project is MIT-style licensed, as found in the LICENSE file.

Contact

If you have any enquiries or questions, you can open a GitHub issue or contact me personally at mohd_shahril_96@yahoo.com.
