sanjass/Malware-Classification-Seq2Seq

Seq2SeqCode

Final project for an MIT class in Advanced NLP.

This codebase covers the PowerShell setting only.

See final paper HERE

Distributed representations of words, sentences, and documents have been crucial to applying neural networks to NLP tasks (word2vec, doc2vec, seq2seq). The same representations also serve well in the related domain of programming languages, since programs are themselves structured text. As the programming-languages and NLP communities recognize their common interests, NLP-derived neural models are being built for program data as well (code2vec). In this project, we investigate whether these neural models, which use distributed representations of lines of code or abstract syntax trees, are robust to various types of obfuscation and adversarial inputs. We find that perturbing Java programs by variable substitution or dead-code insertion makes little difference to code2vec's classifications, but that obfuscated PowerShell programs cause an otherwise well-performing malware classifier to perform close to chance.

Autoencoder Seq2Seq model adapted from here

Dependencies:

  • Miniconda3/Anaconda3
  • Python3
  • PyTorch

Getting started

  1. Clone this repo: git clone git@github.mit.edu:sanjas/Seq2SeqCode.git
  2. Create Conda environment: conda env create -f environment.yml
  3. From the repo's root, run . ./start.sh to activate the environment and set the PYTHONPATH.
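
If a script later fails with an import error, a quick sanity check of the setup can help. The sketch below is not part of the repo: check_environment is a hypothetical helper, and it assumes that start.sh places the repo root on PYTHONPATH, which the steps above do not spell out.

```python
import importlib.util
import os


def check_environment(repo_root):
    """Return a list of setup problems; an empty list means the setup looks fine.

    Assumption: start.sh is expected to put the repo root on PYTHONPATH.
    """
    problems = []
    entries = [
        os.path.abspath(entry)
        for entry in os.environ.get("PYTHONPATH", "").split(os.pathsep)
        if entry
    ]
    if os.path.abspath(repo_root) not in entries:
        problems.append(
            "repo root is not on PYTHONPATH; run `. ./start.sh` from the repo root"
        )
    # find_spec returns None when the package cannot be located.
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not importable; is the Conda environment active?")
    return problems
```

Calling check_environment(".") from the repo's root after sourcing start.sh should return an empty list if both assumptions hold.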

Data

Make sure you set the dataset path correctly in tools/data_load.py. On the first run, processing the whole dataset takes a long time (exactly how long depends on the number of CPUs, since the processing is parallelized). On subsequent runs, loading takes less than 30 seconds.
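
The slow-first-run, fast-later-runs behavior described above is typical of parallel preprocessing followed by caching. The sketch below illustrates that general pattern only; it is not the code in tools/data_load.py. The names tokenize_script, read_and_tokenize, load_dataset, and cache_path are hypothetical, the whitespace tokenizer is a placeholder, and the pickle cache is an assumption about how the speed-up might be achieved.

```python
import os
import pickle
from concurrent.futures import ProcessPoolExecutor


def tokenize_script(text):
    """Placeholder preprocessing: lowercase and split on whitespace."""
    return text.lower().split()


def read_and_tokenize(path):
    """Read one script file and tokenize it."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tokenize_script(handle.read())


def load_dataset(paths, cache_path, executor=None):
    """Return tokenized scripts, reusing the cache file when it exists."""
    if os.path.exists(cache_path):
        # Fast path: a previous run already processed the dataset.
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)

    # Slow path: process files in parallel. ProcessPoolExecutor uses one
    # worker per available CPU by default.
    owns_executor = executor is None
    if owns_executor:
        executor = ProcessPoolExecutor()
    try:
        samples = list(executor.map(read_and_tokenize, paths))
    finally:
        if owns_executor:
            executor.shutdown()

    with open(cache_path, "wb") as handle:
        pickle.dump(samples, handle)
    return samples
```

When calling load_dataset from a script with the default process pool, place the call under an if __name__ == "__main__": guard so that worker processes can import the module safely. Deleting the cache file forces the dataset to be reprocessed.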

Seq2Seq Training

Run python tools/train.py. Some hyperparameters are set in model/hyperparams.py.

Seq2Seq Testing

To see the performance on the test data, run python tools/test.py.

Seq2Seq Inference

Run python tools/inference.py

Malware classifier (with and without obfuscations)

The code is located in the classifier subdirectory.