This is the GitHub page for DM_lab Project 1, maintained by Rui Liang.
We use two datasets for this experiment, the same as in the paper: Census (Adult) and German Credit.
We first performed exploratory analysis on the datasets,
e.g. an overview of the Adult dataset:
the proportion of protected instances labeled positively:
As can be seen, 'sex' is not the only sensitive attribute in the Adult dataset; 'race' can also be considered a sensitive attribute.
The complete analysis for each dataset is stored as an .ipynb notebook in the 'dataset exploratory analysis' folder.
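The per-group positive-label proportions above can be computed with a short pandas sketch. The DataFrame and column names (`sex`, `race`, `income`) here are illustrative stand-ins for the Adult data, not the repository's actual code:

```python
import pandas as pd

# Toy stand-in for the Adult dataset (real code loads the CSV instead)
df = pd.DataFrame({
    "sex":    ["Male", "Male", "Female", "Female", "Male"],
    "race":   ["White", "Black", "White", "Black", "White"],
    "income": [">50K", "<=50K", "<=50K", "<=50K", ">50K"],
})

# Positive label = income '>50K'
df["positive"] = (df["income"] == ">50K").astype(int)

# Proportion of positively labeled instances per group,
# for each candidate sensitive attribute
for attr in ["sex", "race"]:
    rates = df.groupby(attr)["positive"].mean()
    print(attr, rates.to_dict())
```

A large gap between groups on an attribute (as seen for both 'sex' and 'race' in Adult) is what flags it as sensitive.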
First, we handle missing values: in the Adult dataset, missing values occur only in categorical features and are few enough not to affect prediction, so we discard all rows containing them.
Convert categorical features to numerical features.
Train-test split: 2/3 train, 1/3 test.
Reindex the protected (sensitive) feature, e.g. in the Adult dataset, move 'sex' to 2nd place.
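The preprocessing steps above can be sketched as follows. This is a simplified illustration on a toy DataFrame with assumed column names, not the repository's scripts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Adult data; real code reads the CSV instead
df = pd.DataFrame({
    "age":       [39, 50, 38, 53, 28, 37],
    "workclass": ["Private", "Private", None, "Private", "Gov", "Private"],
    "sex":       ["Male", "Male", "Female", "Female", "Male", "Female"],
    "income":    ["<=50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K"],
})

# Step 1: discard rows with missing values
df = df.dropna()

# Step 2: categorical -> numerical (integer codes; one-hot is another option)
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].astype("category").cat.codes

# Step 3: reindex so the protected attribute ('sex') sits in 2nd place
cols = [c for c in df.columns if c != "sex"]
cols.insert(1, "sex")
df = df[cols]

# Step 4: 2/3 train, 1/3 test split
X, y = df.drop(columns="income"), df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
```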
Use the confidence score based on the boosting hypothesis to find the optimal decision-boundary shift for the protected group that achieves statistical parity. The logic is: the lower the confidence, the more likely the data point is misclassified. We therefore find the data points with the smallest confidence and flip their labels to achieve statistical parity.
For example, this figure shows the confidence scores of the protected group and the others under AdaBoost:
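The flipping step can be sketched as follows. This is a simplified one-directional illustration under our own naming (`shift_boundary` is a hypothetical helper, not a function from the repository): take the protected instances predicted negative, order them by ascending confidence, and flip them to positive until the two groups' positive rates match:

```python
import numpy as np

def shift_boundary(y_pred, confidence, protected):
    """Flip the lowest-confidence negative predictions in the protected
    group to positive until statistical parity is reached (sketch)."""
    y = y_pred.copy()
    prot = protected.astype(bool)
    # Candidates: protected instances currently predicted negative,
    # sorted by ascending confidence (most likely misclassified first)
    cand = np.where(prot & (y == 0))[0]
    cand = cand[np.argsort(confidence[cand])]
    for i in cand:
        if y[prot].mean() >= y[~prot].mean():   # parity reached
            break
        y[i] = 1                                # flip label to positive
    return y

# Toy example: protected group starts at a 0.25 positive rate vs 0.75
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])
conf   = np.array([.1, .9, .2, .8, .7, .6, .3, .5])
prot   = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_new  = shift_boundary(y_pred, conf, prot)
```

After the call, the two lowest-confidence protected negatives (indices 0 and 2) are flipped and both groups reach a 0.75 positive rate.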
The SDB method also generalizes to Logistic Regression and SVM:
For logistic regression, define the confidence simply as the value the classifier produces before rounding.
For SVM, define the confidence as the distance of a point from the separating hyperplane.
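These two confidence definitions can be sketched with scikit-learn (an illustration on synthetic data, not the repository's code). For logistic regression the pre-rounding value is the predicted probability; for a linear SVM, `decision_function` returns the signed margin, which divided by the weight norm gives the geometric distance from the hyperplane:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic two-class data: 20 points per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Logistic regression: confidence = value before rounding to 0/1,
# i.e. the predicted probability of the positive class
lr = LogisticRegression().fit(X, y)
conf_lr = lr.predict_proba(X)[:, 1]

# Linear SVM: confidence = distance from the separating hyperplane;
# decision_function gives w.x + b, divide by ||w|| for the distance
svm = SVC(kernel="linear").fit(X, y)
conf_svm = svm.decision_function(X) / np.linalg.norm(svm.coef_)
```

In both cases, values near the decision threshold (probability near 0.5, distance near 0) mark the low-confidence points that SDB targets for flipping.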
We reproduce the method proposed in the paper using the same setup (run the method on each dataset 10 times and average the results; the whole process takes about 12 hours). In addition, for the Adult dataset we select 'race' as the protected attribute and apply the method to it. The experiments are implemented in method/experiment-SDB.py and method/experiment-SDB-race.py; the resulting figures are in method/plots.