Based on https://github.com/spcl/ncc
ncc
(Neural Code Comprehension) is a general Machine Learning technique to learn semantics from raw code in virtually any programming language. It relies on inst2vec
, an embedding space and graph representation of LLVM IR statements and their context.
- GNU / Linux or Mac OS
- Python (3.6.5)
- Packages from requirements.txt
- wine
- ida7.2 (installed in wine and configured using manual from scripts/ida_setup.txt)
- mcsema
- llvm-dis
By default, inst2vec
will be trained on publicly available code. Some additional datasets are available on demand and you may add them manually to the training data. For more information on how to do this as well as on the datasets in general, see datasets.
$ python train_inst2vec.py --helpfull # to see the full list of options
$ python train_inst2vec.py \
> # --context_width ... (default: 2)
> # --data ... (default: data/, automatically generated one. You may provide your own)
Alternatively, you may skip this step and use pre-trained embeddings.
$ python train_inst2vec.py \
> --embeddings_file ... (path to the embeddings p-file to evaluate)
> --vocabulary_folder ... (path to the associated vocabulary folder)
Algorithm classification
Task: Classify applications into 104 classes given their raw code.
Code and classes provided by https://sites.google.com/site/treebasedcnn/ (see Convolutional neural networks over tree structures for programming language processing)
Train:
$ python train_task_classifyapp.py --helpfull # to see the full list of options
$ python train_task_classifyapp.py --input_data task/classifyapp --out task/classifyapp \
--maxlen 0 --model_name NCC_classifyapp --inference=False
Works only on linux(x64), don't move this script to directory where exist not empty subdirectory with name tmp. Before run replace path below in file script.sh: ida_path - path to ida64.exe llvm-dis - path to llvm-dis mcsema-lift - path to mcsema-lift app - path to folder with train_task_classifyapp.py
Run:
$ ./script.sh path-to-elf
Works only on linux(x64). Download dataset from https://sites.google.com/site/treebasedcnn/, unpack it.
Replace parameters values and run:
$ python scripts/compile_data.py --ir_per_file count_of_files --num_samples count_of_samples \
--num_threads 8 --dataset_path path_to_dataset --ida_path path_to_ida \
--llvm_dis path_to_llvm_dis --mcsema_lift path_to_mcsema_lift
Where count_of_files - Number of .ll files generated per input with random choice of optimization level count_of_samples - Number samples choicen from source per class
Splitted dataset appears in current directory. If you want change sizes of partitions, replace parts var initialisation in scripts/split_data.py by your values.
Run:
$ python scripts/split_data.py --path path_to_lifted_dataset
Replace parameters values and run:
$ python scripts/get_function_from_cpp.py --lib_path path_to_clang_library \
--path_in path_to_dataset --path_out path_to_place_names
Replace parameters values and run:
$ python scripts/del_functions.py --path_funcs path_from_previous_step --num_threads 8 \
--path_llvm path_to_lifted_and_splited_dataset
Firstly complete previous step. Put obtained dataset (folders ir_train, ir_val, ir_test) to task/task_classifyapp_lifted.
Train:
$ python train_task_classifyapp.py --input_data task/classifyapp_lifted \
--out task/classifyapp_lifted --maxlen 1100 \
--model_name NCC_classifyapp_lifted --mode train
Where maxlen - max legth of preproccesed sequence in dataset. More then 90% of sequences have length less or equal then 1100. If 0 specified then it computed dynamically, but its high memory and computationally consuming.
Test obtained model:
$ python train_task_classifyapp.py --input_data task/classifyapp_lifted \
--out task/classifyapp_lifted --maxlen 1100 \
--model_name NCC_classifyapp_lifted --mode test
To use obtained model to predict labels for .ll files, put them to inference/ir_test folder and run:
$ python train_task_classifyapp.py --input_data task/classifyapp_lifted \
--out task/classifyapp_lifted --maxlen 1100 \
--model_name NCC_classifyapp_lifted --mode predict --json_file name
Result is file in json fromat will be located in folder inference with specified name.
NCC is published under the New BSD license, see LICENSE.