
galvanic/adversarialML


Adversarial Machine Learning

MSc thesis project code: a framework and notebooks for Adversarial Machine Learning experiments.

Getting Started

  1. Clone the repo and cd into the folder.
  2. Create a YAML configuration file for the experiment setup. Each entry has three fields:
  • name: the name of the index column in the DataFrame; a shorter value makes it easier to manipulate
  • key: the path to the configuration key along which the experiment varies
  • values: the list of values that key takes across experiments

For example, you can have one experiment using the adaline classifier and another using the logistic regression classifier. You would express that as:

- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression

Here [adaline, logistic regression] is the list of values that the classifier type can take.

A more general example:

- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression
    - naive bayes

- name: attack
  key: [attack, type]
  values:
    - dictionary
    - empty
    - ham
    - focussed

- name: '% poisoned'
  key: [attack, parameters, percentage_samples_poisoned]
  values:
    - .0
    - .1
    - .2
    - .5

The order of the (name, key, values) groups matters: it determines the order of the columns in the results DataFrame (though the order can be changed afterwards).
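As a sketch of how such a spec expands into individual experiments (a hypothetical helper, not code from this repo): the Cartesian product of the values lists yields one configuration per combination, with columns appearing in group order.

```python
from itertools import product

# Hypothetical, already-parsed version of the YAML spec above
# (each group: a column name, a key path, and the values it takes).
SPEC = [
    {"name": "classifier", "key": ["classifier", "type"],
     "values": ["adaline", "logistic regression"]},
    {"name": "attack", "key": ["attack", "type"],
     "values": ["dictionary", "empty"]},
]

def expand(spec):
    """Yield one flat {name: value} dict per experiment combination.

    Group order is preserved, which is what fixes the column order
    of the results DataFrame.
    """
    names = [group["name"] for group in spec]
    for combo in product(*(group["values"] for group in spec)):
        yield dict(zip(names, combo))

experiments = list(expand(SPEC))  # 2 classifiers x 2 attacks = 4 runs
```

With the spec above this yields four configurations, starting with (adaline, dictionary) and ending with (logistic regression, empty).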

  3. Check the values in default_spec.yaml, especially the dataset_filename key (although this can itself be a varied key).

    Here is a full example:

    dataset_filename: trec2007-1607252257
    label_type:
      ham_label: -1
      spam_label: 1
    
    classifier:
      type: none
      training_parameters: {}
      testing_parameters: {}
    
    attack:
      type: none
      parameters: {}
  4. Decide how many threads to run your code on. Given the size of the dataset in memory, allocate at least double that amount of RAM. For example, if you run on 8 cores, make sure you have 16 GB of RAM; otherwise you will get a MemoryError.

  5. Run the pipeline, for example with 4 threads:

    python3 main.py ~/path/to/experiment/config.yaml ~/folder/where/dataset/is/ ~/folder/to/save/results/to/ 4

Details

This repo includes code for:

  • feature extraction from spam datasets:

    the features for an email are the (binary) presence or absence of a token (a word)

  • poisoning attacks on the training data:

    • dictionary attack: dictionary.py

      all the features of the poisoned emails are set to one

    • empty attack: empty.py

      all the features of the poisoned emails are set to zero

    • ham attack: ham.py

      contaminating emails contain features indicative of the ham class

    • focussed attack: focussed.py

  • training and testing of binary classification models
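A minimal sketch of two of the ideas above, binary token-presence features and the dictionary/empty attacks. This is illustrative code only, not the repo's actual extraction code or dictionary.py/empty.py; the tokenisation and vocabulary construction are deliberately naive.

```python
import random
import re

def extract_features(emails, vocabulary=None):
    """Binary bag-of-words: feature j is 1 iff token j occurs in the email."""
    token_sets = [set(re.findall(r"[a-z]+", email.lower())) for email in emails]
    if vocabulary is None:
        vocabulary = sorted(set().union(*token_sets))
    X = [[1 if tok in toks else 0 for tok in vocabulary] for toks in token_sets]
    return X, vocabulary

def poison(X, y, attack, fraction, spam_label=1, rng=random):
    """Turn a fraction of training rows into poisoned spam samples.

    'dictionary' sets every feature of a poisoned email to one,
    'empty' sets every feature to zero (cf. dictionary.py / empty.py).
    """
    X, y = [row[:] for row in X], y[:]  # copy rather than mutate inputs
    fill = {"dictionary": 1, "empty": 0}[attack]
    for i in rng.sample(range(len(X)), int(round(fraction * len(X)))):
        X[i] = [fill] * len(X[i])
        y[i] = spam_label
    return X, y

X, vocab = extract_features(["buy cheap pills", "meeting at noon"])
Xp, yp = poison(X, [1, -1], "dictionary", 0.5)
```

With percentage_samples_poisoned set to 0.5, one of the two rows above is replaced by an all-ones feature vector labelled as spam, mirroring the dictionary attack described earlier.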

A few IPython notebooks showcase the results along with brief initial observations and interpretations.

TODO

  • implement attacker knowledge

  • prepare batch test specs to find good learning rates depending on classifier and dataset

  • implement different attacks in adaptive experiment pipeline

  • write extract functions for:

    • enron
    • MNIST

  • test experiments on MNIST

  • brainstorm attacks for adaptive convex combination experiment

  • implement regret measure

Software engineering (i.e. not directly important for this project):

  • implement how to store experiment files, prob grouped in batches
  • assert all matrix shapes and types
  • implement data loading from different filetypes (automatically detect npy, dat, csv, etc.)
  • add tests

optimisations:

  • ? optimise the pipeline for experiments where the same dataset, attacks, etc. are reused, or is that not worth the time? -> look into a Makefile to manage dependencies between files
  • profile code
  • re-implement logging of intermediate results, but perhaps log only the first few characters, or statistics/info about the array (e.g. whether it contains NaN); would need to see what is actually useful
  • ? bit arrays
  • ? explicitly free memory
  • ? make ipython notebook on MI and feature selection for ham
