European Court of Human Rights OpenData ML Classification Hyperparameter Tuning

The paper describing the full experiment is available as Project_paper.docx.

It is recommended to use an Anaconda environment with Python 3.5 and the usual data science packages preinstalled (pandas, numpy, scikit-learn, etc.).
Dependencies:

conda install tensorflow keras

For hyperparameter tuning, HpBandSter was used.

pip install hpbandster
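HpBandSter implements BOHB, which evaluates configurations on a geometrically growing ladder of budgets (successive halving). A minimal plain-Python sketch of that budget schedule, to illustrate the idea; the eta, min_budget, and max_budget values here are illustrative, not the ones used in this experiment:

```python
def hyperband_budgets(min_budget, max_budget, eta=3):
    """Budget ladder used by Hyperband/BOHB: each budget is eta times
    the previous one, never exceeding max_budget."""
    budgets = []
    budget = min_budget
    while budget <= max_budget:
        budgets.append(budget)
        budget *= eta
    return budgets

print(hyperband_budgets(1, 27, eta=3))  # [1, 3, 9, 27]
```

Cheap, low-budget evaluations weed out poor configurations early, so only promising ones are trained with the full budget.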

To load the datasets, ECHR-OD_loader was used; it reads the data location from an environment variable named ECHR_OD_PATH.
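Resolving that variable in Python is straightforward. A minimal sketch, assuming only that ECHR_OD_PATH holds a directory path (the loader's actual API is not shown here, and the fallback default is illustrative):

```python
import os

# ECHR_OD_PATH points at the directory holding the ECHR-OD datasets;
# fall back to ~/data if the variable is not set.
data_path = os.environ.get("ECHR_OD_PATH", os.path.expanduser("~/data"))
print(data_path)
```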

Directory structure:

├── different_models
│   ├── results
│   │   ├── bow
│   │   │	├── dnn
│   │   │	├── extratrees
│   │   │	├── random_forest
│   │   │	├── svm
│   │   │	├── xgboost
│   │   ├── desc
│   │   │	├── ...
│   │   ├── desc_bow
│   │   │	├── ...
│   ├── all_results
│   │   ├── 1_gram/1000_tokens/1_article...
│   │   ├── 2_gram/1000_tokens/1_article...
│   │   ├── ...
│   ├── best_results.csv
│   ├── customizable_optimizer.py
│   ├── model_workers.py
│   ├── onedrive_api.py
│   ├── run_optimization.py
│   ├── store_all_results.py
│   ├── store_best.py
├── DNN
├── ExtraTrees
├── ...
├── ...

Docker image

A full working environment is available as a Docker image:

sudo docker run -ti amasend/echr:version2

The Docker environment includes only the OneDrive version of the experiment. Local storage support will be included in the next release.

Functionality

How to run from the repository?

  • First, create the directories that will contain the generated datasets and the data.
mkdir ~/dataset_storage
mkdir ~/data
  • Export an environment variable pointing the ECHR data loader at the data directory.
export ECHR_OD_PATH=~/data
  • Run the optimization process. You can specify which models to evaluate, and on which flavors, n_grams, and tokens.
python run_optimization.py --flavors --models --n_grams --tokens --onedrive --dataset_storage
  • flavors - ['desc', 'bow', 'desc+bow']
  • models - ['extratrees', 'random_forest', 'dnn', 'xgboost', 'svm']
  • n_grams - [1, 2, 3, 4, 5, 6]
  • tokens - [1000, 5000, 7000, 10000, 30000, 60000, 80000, 100000]
  • onedrive - True/False (if True, use OneDrive storage; if False, use local storage)
  • dataset_storage - ~/dataset_storage (path to dataset storage)
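The flags above take lists of values. A hypothetical argparse sketch of how such a command line could be parsed; the real run_optimization.py may define its flags differently, and the defaults below simply mirror the option lists above:

```python
import argparse

# Hypothetical parser mirroring the documented flags.
parser = argparse.ArgumentParser(description="ECHR hyperparameter tuning")
parser.add_argument("--flavors", nargs="+", default=["desc", "bow", "desc+bow"])
parser.add_argument("--models", nargs="+",
                    default=["extratrees", "random_forest", "dnn", "xgboost", "svm"])
parser.add_argument("--n_grams", nargs="+", type=int, default=[1, 2, 3, 4, 5, 6])
parser.add_argument("--tokens", nargs="+", type=int,
                    default=[1000, 5000, 7000, 10000, 30000, 60000, 80000, 100000])
parser.add_argument("--onedrive", type=lambda s: s == "True", default=False)
parser.add_argument("--dataset_storage", default="~/dataset_storage")

# Example: restrict the run to SVM on 1- and 2-grams.
args = parser.parse_args(["--models", "svm", "--n_grams", "1", "2"])
print(args.models, args.n_grams)
```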

Please keep in mind that the computation time for all combinations is very long. When running this experiment with the default parameters, be aware that each n_gram and token setting is computed separately (one after another); only the models and flavors are multiprocessed.
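The scale of the default search follows directly from the option lists above: every combination of flavor, model, n-gram, and token count is an optimizer run. A short sketch counting them:

```python
from itertools import product

flavors = ["desc", "bow", "desc+bow"]
models = ["extratrees", "random_forest", "dnn", "xgboost", "svm"]
n_grams = [1, 2, 3, 4, 5, 6]
tokens = [1000, 5000, 7000, 10000, 30000, 60000, 80000, 100000]

# Each (n_gram, tokens) dataset is processed sequentially; the models
# and flavors for a given dataset run in parallel.
combos = list(product(n_grams, tokens, flavors, models))
print(len(combos))  # 6 * 8 * 3 * 5 = 720 optimizer runs
```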

Generating results plots

All components are under the 'different_models' directory.
After generating all (or partial) results, we can plot them for better understanding.

python show_results.py --results_path "path_to_results_folder" --model "model_name" --n_grams "number_of_grams" --tokens "number_of_tokens" --flavor "flavor_name" --metric "metric_name_to_display" --option "which_plot_to_display"

Several plotting methods are implemented; based on them, the following plots are available:

  • best new results compared per model and n-gram (after closing one plot, the next n-gram's plot will appear)
python show_results.py --results_path "all_results" --model "random_forest" --metric "test_mean_c_MCC" --option "new_best_results"

  • old best results vs new results (old best among all models, new best only from one model), per n-gram
python show_results.py --results_path "all_results" --model "random_forest" --metric "test_mean_c_MCC" --option "old_vs_new" --n_grams 5

  • normalized confusion matrices (only for new results)
python show_results.py --results_path "all_results" --model "random_forest" --metric "test_mean_c_MCC" --option "confusion_matrix" --n_grams 5 --tokens 5000

  • computing times (run times) per flavor and per optimizer run
python show_results.py --results_path "all_results" --model "random_forest" --metric "test_mean_c_MCC" --option "computing_times" --n_grams 5 --tokens 5000
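The normalized confusion matrix option divides each row by its total, so cells show per-class rates rather than raw counts. A minimal pure-Python sketch of that normalization (the example labels are illustrative; show_results.py may compute it differently):

```python
from collections import Counter

def normalized_confusion(y_true, y_pred, labels):
    """Row-normalized confusion matrix: matrix[i][j] is the fraction of
    class labels[i] samples that were predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    matrix = []
    for true_label in labels:
        row = [counts[(true_label, pred)] for pred in labels]
        total = sum(row) or 1  # avoid division by zero for absent classes
        matrix.append([c / total for c in row])
    return matrix

m = normalized_confusion([0, 0, 1, 1], [0, 1, 1, 1], labels=[0, 1])
print(m)  # [[0.5, 0.5], [0.0, 1.0]]
```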
