Seq2SeqCode

Final project for an MIT class in Advanced NLP.

This codebase is for the PowerShell setting only.

See final paper HERE

Distributed representations of words, sentences, and documents have been crucial to applying neural networks to NLP tasks (word2vec, doc2vec, seq2seq). The same representations also serve well in the related domain of programming languages, since programs are themselves structured text. As the programming languages and NLP communities recognize their common interests, NLP-derived neural models are being built to work on programs as well (code2vec). In this project, we investigate whether these neural models, which use distributed representations of lines of code or abstract syntax trees, are robust to various types of obfuscation and adversarial inputs. We find that perturbing Java programs by variable substitution or dead-code insertion causes little difference in classification by code2vec, but that obfuscated PowerShell programs cause an otherwise well-performing malware classifier to perform close to chance.
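As a rough illustration of the two Java perturbations named above, the sketch below applies variable substitution and dead-code insertion directly to source text. It is not the project's implementation: the function names rename_variable and insert_dead_code are hypothetical, and a realistic tool would most likely operate on the abstract syntax tree rather than on raw text, since a regex cannot tell identifiers apart from string literals, comments, or shadowed names.

```python
import re


def rename_variable(source, old, new):
    """Replace whole-word occurrences of the identifier `old` with `new`.

    Text-level only: it does not understand scopes, strings, or comments.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


def insert_dead_code(source, statement="int unused = 0;"):
    """Insert a statement with no effect right after the first opening brace."""
    head, brace, tail = source.partition("{")
    if not brace:
        # No block to insert into; return the program unchanged.
        return source
    return f"{head}{{ {statement}{tail}"


java_method = "int add(int a, int b) { return a + b; }"
renamed = rename_variable(java_method, "a", "x")
padded = insert_dead_code(java_method)
```

Both transformations keep the program's behavior the same while changing its surface form, which is what makes them useful probes of classifier robustness.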

Autoencoder Seq2Seq model adapted from here

Dependencies:

Miniconda3/Anaconda3
Python3
PyTorch

Getting started

Clone this repo: git clone git@github.mit.edu:sanjas/Seq2SeqCode.git
Create Conda environment: conda env create -f environment.yml
From the repo's root, run . ./start.sh to activate the environment and set the PYTHONPATH.

Data

Make sure you set the dataset path correctly in tools/data_load.py. On the first run, processing the whole dataset takes a long time (how long depends on the number of CPUs, since the processing is parallelized). On subsequent runs, loading takes less than 30 seconds.
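The slow first run and fast later runs are what you would expect from a process-once, cache-to-disk pattern. The sketch below shows one way such a loader could work; it is an assumption for illustration, not the contents of tools/data_load.py. The names load_dataset and tokenize_script, the pickle cache format, and the whitespace tokenizer are all hypothetical.

```python
import pickle
from multiprocessing import Pool
from pathlib import Path


def tokenize_script(text):
    """Placeholder per-sample preprocessing (assumed): split on whitespace."""
    return text.split()


def load_dataset(texts, cache_path, workers=None):
    """Return processed samples, reusing an on-disk cache when one exists.

    workers=None lets the pool use every available CPU; workers=1 skips
    the pool and processes the samples serially.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: an earlier run already processed the dataset.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    if workers == 1:
        processed = [tokenize_script(t) for t in texts]
    else:
        # Slow path: spread the per-sample work across worker processes.
        with Pool(processes=workers) as pool:
            processed = pool.map(tokenize_script, texts)

    with cache_path.open("wb") as f:
        pickle.dump(processed, f)
    return processed
```

A cache like this is not invalidated when the underlying data changes, so in a setup of this kind you would delete the cache file after modifying the dataset. On platforms that start worker processes with spawn, the parallel path should be called from under an if __name__ == "__main__": guard.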

Seq2Seq Training

Run python tools/train.py. Some hyperparameters are set in model/hyperparams.py.
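For readers unfamiliar with keeping hyperparameters in a dedicated module, here is a minimal sketch of what such a file might look like. Every field name and value below is a made-up placeholder, not a setting taken from model/hyperparams.py; check that file for the real ones.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class HyperParams:
    # All names and defaults here are illustrative assumptions.
    embedding_dim: int = 128
    hidden_dim: int = 256
    batch_size: int = 32
    learning_rate: float = 1e-3
    num_epochs: int = 10


DEFAULTS = HyperParams()


def with_overrides(**overrides):
    """Return a copy of DEFAULTS with the given fields replaced."""
    return replace(DEFAULTS, **overrides)
```

Freezing the dataclass keeps a training run from mutating the shared defaults by accident; an experiment that needs a different value would call, for example, with_overrides(batch_size=64).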

Seq2Seq Testing

To see the performance on the test data, run python tools/test.py.

Seq2Seq Inference

Run python tools/inference.py

Malware classifier (with and without obfuscations)

The code is located in the classifier subdirectory.
