debias-ml

A practical, explainable and effective approach to reducing bias in machine learning algorithms.

1. Overview

1.1. Problem Statement

Machine learning is increasingly used to automate decisions that were previously made by humans. Decisions on job applications, college admissions and sentencing guidelines (to name a few) can have life-changing consequences for people, so it's important that they are fair. But the argument for reducing bias is not just an ethical one; it's a financial one too. Multiple studies have found that teams that are more diverse in race and gender significantly outperform teams that aren't. In reducing bias in our algorithms we hope not just to replicate past performance but to exceed it.

1.2. Solution

In DebiasML, I have developed a practical and explainable solution through a novel application of oversampling. Though popular for data imbalance problems, oversampling has not been adopted to address bias. When tested on the Adult UCI dataset, DebiasML outperforms the state of the art (GANs) on many dimensions: it achieves a significantly higher F1 score (as much as +17%) whilst being equally accurate, training is ten times faster, and it is model agnostic, transparent and by construction improves diversity in its predictions.
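To make the idea concrete, here is a minimal sketch of oversampling for bias reduction. It is not the repository's Oversampler class: the function name `oversample_positives`, the `factor` parameter and the column names are assumptions made for illustration. The sketch duplicates positive-target rows within each combination of sensitive features until every group reaches the positive rate of the most favoured group.

```python
import numpy as np
import pandas as pd


def oversample_positives(df, sensitive_cols, target_col, factor=1.0, random_state=0):
    """Duplicate positive-target rows within each sensitive group (illustrative only).

    Assumes a binary 0/1 target and a unique index. With factor=1.0 each group
    that has at least one positive row is brought up to the positive rate of
    the most favoured group; factor=0.0 returns the data unchanged.
    """
    rng = np.random.default_rng(random_state)
    best_rate = df.groupby(sensitive_cols)[target_col].mean().max()
    extra = []
    for _, group in df.groupby(sensitive_cols):
        positives = group[group[target_col] == 1]
        n_pos, n = len(positives), len(group)
        if n_pos == 0 or best_rate >= 1.0:
            continue  # nothing to duplicate, or the target rate is unreachable
        # Solve (n_pos + k) / (n + k) = best_rate for k extra positive rows.
        k = (best_rate * n - n_pos) / (1.0 - best_rate)
        k = int(round(factor * max(k, 0.0)))
        if k > 0:
            # Sample existing positive rows with replacement.
            chosen = rng.choice(positives.index.to_numpy(), size=k, replace=True)
            extra.append(df.loc[chosen])
    return pd.concat([df, *extra], ignore_index=True)
```

Because the resampling happens before training, any classifier can be fitted on the returned frame, which is what makes this kind of approach model agnostic. Only the training split should be oversampled; the test split is left untouched so that the metrics stay comparable.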

The graphic below shows the distribution of predictions on the test set as it changes with the oversampling factor along with performance and bias metrics:

2. Resource list

Presentation slides explaining the problem, solution approach and results in 5 mins are available here
Presentation recording will be linked to here when available
Blog post explaining the problem, solution approach and results will be linked to here when available
Streamlit reports:
- Data Analysis Report showing only the data analysis
- Model Analysis Report showing only model analysis
- Oversampling Analysis Report showing only oversampling analysis
- Full report showing data exploration and oversampling analysis

3. Running the code on your machine

3.1. Requisites

anaconda
Python 3.6 (Keras and TensorFlow don't work with Python 3.7)
Streamlit

This repo uses conda's virtual environment for Python 3.

Install (mini)conda if not yet installed:

For MacOS:

$ wget http://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
$ chmod +x miniconda.sh
$ ./miniconda.sh -b

Create the virtual environment:

cd into the directory and create the debias-ml conda virtual environment from environment.yml:

$ conda env create -f environment.yml

Activate the virtual environment:

$ source activate debias-ml

3.2. Running the code

As described above, there are four scripts which can be run to produce Streamlit reports.

analysis_data.py
analysis_model.py
analysis_oversampling.py
analysis.py

These can all be run from the command line. To do this, cd into the source directory and call:

$ python analysis_data.py

The scripts are listed in order of running time.

4. Data

Testing of this methodology was performed using census income data (UCI Adult dataset):

32561 data points
Target feature: ann_salary > $50K
14 features (in addition to the target feature) including race and gender
76% of the population earns less than $50K
67% of the population is male
85% of the population is white
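The figures above can be checked against a loaded copy of the data with a short helper. This is a sketch under assumptions: the column names are passed in because the names used in the CSV are not stated here, the target is assumed to be encoded as 0/1, and the labels "Male" and "White" are assumed to appear without surrounding whitespace.

```python
import pandas as pd


def summarise(df, target_col, sex_col, race_col):
    """Return the row count and the three shares as fractions between 0 and 1."""
    return {
        "n_rows": len(df),
        # Share of rows whose target is 0, i.e. not above the income threshold.
        "share_low_income": float((df[target_col] == 0).mean()),
        "share_male": float((df[sex_col] == "Male").mean()),
        "share_white": float((df[race_col] == "White").mean()),
    }
```

Calling `summarise` on a frame read with `pd.read_csv` and the dataset's own column names should return values close to the percentages listed, provided the encoding assumptions hold.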

4.1. File structure / data flow in the code

The raw data files are saved in data/raw
The raw data is converted to csv format and saved as data/preprocessed/adult-data.csv
The input parameters are set manually in config/params.ini
After processing, the code saves a new csv file containing the processed data in data/processed/adult-data.csv
Parameters which are calculated in data processing and required for later calculations are written to the config file config/new_params.ini
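For readers who want to inspect the parameter files from their own scripts, the sketch below reads an INI file with Python's standard configparser module. The helper names `parse_params` and `load_params` are hypothetical, and no assumption is made about which sections or keys config/params.ini actually contains; the repository's own loading code may differ.

```python
import configparser
from pathlib import Path


def parse_params(text):
    """Parse INI-formatted text into {section: {key: value}}.

    Values are returned as strings, and configparser lower-cases keys by default.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def load_params(path="config/params.ini"):
    """Read an INI file from disk and parse it with parse_params."""
    return parse_params(Path(path).read_text())
```

The same helper can be pointed at config/new_params.ini to see which values were written during data processing.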

4.2. Running the code on a new data set

Save the csv file in the folder data/preprocessed/
Edit the parameter values in the config file, config/params.ini
Don't worry about overwriting the parameters for adult-data.csv; a copy of the config file is saved as adult-data_params.ini
Follow the instructions above for running the code

Notes:

While efforts have been made to generalise, this code has not been tested on other datasets
The Oversampler class is designed to remove bias from two sensitive features simultaneously

5. To Dos

Increase test coverage
Make it work for removing bias against a single sensitive feature
Add a method to the Oversampler to output weights (rather than the oversampled data points)
Check for and remove hard-coded plot labels
Write code to find features with bias and rank them
Write bias_metrics class (follow sklearn)
Design and implement infrastructure for stratified sampling on multiple dimensions

zhangzq921/debias-ml

Folders and files

Latest commit

History

Repository files navigation