
galvanic/adversarialML


Adversarial Machine Learning

MSc thesis project code: a framework and notebooks for Adversarial Machine Learning experiments.

Getting Started

  1. Clone the repo and cd into the folder.
  2. Create a YAML configuration file for the experiment setup. Each entry has three fields:
  • name: the name of the index column in the DataFrame; a shorter value makes it easier to manipulate
  • key: the path to the configuration key along which the experiment varies
  • values: the list of values that key takes across experiments

For example, you can have one experiment using the adaline classifier and another using the logistic regression classifier. You would express that as:

- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression

Here [adaline, logistic regression] is the list of values that the classifier type can take.

A more general example:

- name: classifier
  key: [classifier, type]
  values:
    - adaline
    - logistic regression
    - naive bayes

- name: attack
  key: [attack, type]
  values:
    - dictionary
    - empty
    - ham
    - focussed

- name: '% poisoned'
  key: [attack, parameters, percentage_samples_poisoned]
  values:
    - .0
    - .1
    - .2
    - .5

The order of the (name, key, values) groups matters: it determines the order of the columns in the results DataFrame (though the order can be changed afterwards).
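As a sketch of how such a spec expands into individual experiments (a hypothetical helper, not code from this repo): the Cartesian product of the values lists yields one configuration per combination, with columns appearing in group order.

```python
from itertools import product

# Hypothetical, already-parsed version of the YAML spec above
# (each group: a column name, a key path, and the values it takes).
SPEC = [
    {"name": "classifier", "key": ["classifier", "type"],
     "values": ["adaline", "logistic regression"]},
    {"name": "attack", "key": ["attack", "type"],
     "values": ["dictionary", "empty"]},
]

def expand(spec):
    """Yield one flat {name: value} dict per experiment combination.

    Group order is preserved, which is what fixes the column order
    of the results DataFrame.
    """
    names = [group["name"] for group in spec]
    for combo in product(*(group["values"] for group in spec)):
        yield dict(zip(names, combo))

experiments = list(expand(SPEC))  # 2 classifiers x 2 attacks = 4 runs
```

With the spec above this yields four configurations, starting with (adaline, dictionary) and ending with (logistic regression, empty).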

  3. Check the values in default_spec.yaml, especially the dataset_filename key (although this can itself be a varied key).

    Here is a full example:

    dataset_filename: trec2007-1607252257
    label_type:
      ham_label: -1
      spam_label: 1
    
    classifier:
      type: none
      training_parameters: {}
      testing_parameters: {}
    
    attack:
      type: none
      parameters: {}
  4. Decide how many threads to run your code on. Given the size of the dataset in memory, allocate at least double that amount of RAM. For example, if you run on 8 cores, make sure you have 16 GB of RAM; otherwise you will get a MemoryError.

  5. Run the pipeline, for example with 4 threads:

    python3 main.py ~/path/to/experiment/config.yaml ~/folder/where/dataset/is/ ~/folder/to/save/results/to/ 4

Details

This repo includes code for:

  • feature extraction from spam datasets:

    the features for an email are the (binary) presence or absence of a token (a word)

  • poisoning attacks on the training data:

    • dictionary attack: dictionary.py

      all the features of the poisoned emails are set to one

    • empty attack: empty.py

      all the features of the poisoned emails are set to zero

    • ham attack: ham.py

      contaminating emails contain features indicative of the ham class

    • focussed attack: focussed.py

  • training and testing of binary classification models
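A minimal sketch of two of the ideas above, binary token-presence features and the dictionary/empty attacks. This is illustrative code only, not the repo's actual extraction code or dictionary.py/empty.py; the tokenisation and vocabulary construction are deliberately naive.

```python
import random
import re

def extract_features(emails, vocabulary=None):
    """Binary bag-of-words: feature j is 1 iff token j occurs in the email."""
    token_sets = [set(re.findall(r"[a-z]+", email.lower())) for email in emails]
    if vocabulary is None:
        vocabulary = sorted(set().union(*token_sets))
    X = [[1 if tok in toks else 0 for tok in vocabulary] for toks in token_sets]
    return X, vocabulary

def poison(X, y, attack, fraction, spam_label=1, rng=random):
    """Turn a fraction of training rows into poisoned spam samples.

    'dictionary' sets every feature of a poisoned email to one,
    'empty' sets every feature to zero (cf. dictionary.py / empty.py).
    """
    X, y = [row[:] for row in X], y[:]  # copy rather than mutate inputs
    fill = {"dictionary": 1, "empty": 0}[attack]
    for i in rng.sample(range(len(X)), int(round(fraction * len(X)))):
        X[i] = [fill] * len(X[i])
        y[i] = spam_label
    return X, y

X, vocab = extract_features(["buy cheap pills", "meeting at noon"])
Xp, yp = poison(X, [1, -1], "dictionary", 0.5)
```

With percentage_samples_poisoned set to 0.5, one of the two rows above is replaced by an all-ones feature vector labelled as spam, mirroring the dictionary attack described earlier.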

A few IPython notebooks showcase the results along with brief initial observations and interpretations.

TODO

  • implement attacker knowledge

  • prepare batch test specs to find good learning rates depending on classifier and dataset

  • implement different attacks in adaptive experiment pipeline

  • write extract functions for:

    • enron
    • MNIST

  • test experiments on MNIST

  • brainstorm attacks for adaptive convex combination experiment

  • implement regret measure

Software engineering (i.e. not directly important for this project):

  • implement how to store experiment files, prob grouped in batches
  • assert all matrix shapes and types
  • implement data loading from different filetypes (automatically detect npy, dat, csv, etc.)
  • add tests

optimisations:

  • ? optimise the pipeline for experiments where the same dataset, attacks, etc. are reused, or is that not worth the time? -> look into a Makefile to manage dependencies between files
  • profile code
  • re-implement logging of intermediate results, but perhaps log only the first few characters, or statistics/info about the array (e.g. whether it contains NaN); would need to see what is actually useful
  • ? bit arrays
  • ? explicitly free memory
  • ? make ipython notebook on MI and feature selection for ham
