Machine Learning Fairness Matrix

DSC 180B | B10

Intro


Today, machine learning tools are increasingly common and are used for decision-making in important contexts. However, the algorithms behind them are often susceptible to bias, and it can be hard to determine whether they are fair in the intuitive sense. In response, there have been efforts to create mathematical definitions of fairness that can be used to test whether an algorithm is behaving fairly.

This is difficult, however, because we know very little about how different fairness definitions relate to different models.

Our capstone project centers on furthering research on fairness in machine learning and how it relates to specific data and models. We have produced a 3-dimensional matrix displaying the performance of different combinations of models and datasets, evaluated on different fairness metrics.
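As a rough illustration of what a 3-dimensional matrix of this kind could look like in code, here is a minimal sketch that stores one value per (dataset, model, metric) combination. The class name `FairnessMatrix`, its methods, and the dataset labels are hypothetical and are not taken from our project code; the cells are left empty rather than filled with results.

```python
class FairnessMatrix:
    """Hypothetical container: one cell per (dataset, model, metric) triple."""

    def __init__(self, datasets, models, metrics):
        # The three axes of the matrix, stored as ordered label lists.
        self.axes = (list(datasets), list(models), list(metrics))
        # Nested lists: cells[dataset][model][metric], all empty to start.
        self.cells = [[[None] * len(self.axes[2]) for _ in self.axes[1]]
                      for _ in self.axes[0]]

    def shape(self):
        return tuple(len(axis) for axis in self.axes)

    def _index(self, dataset, model, metric):
        # Translate labels into integer positions along each axis.
        labels = (dataset, model, metric)
        return tuple(axis.index(label) for axis, label in zip(self.axes, labels))

    def set(self, dataset, model, metric, value):
        i, j, k = self._index(dataset, model, metric)
        self.cells[i][j][k] = value

    def get(self, dataset, model, metric):
        i, j, k = self._index(dataset, model, metric)
        return self.cells[i][j][k]


# Placeholder labels for illustration only.
matrix = FairnessMatrix(
    datasets=["dataset_a", "dataset_b"],
    models=["Logistic Regression", "Decision Tree"],
    metrics=["Overall Accuracy", "Demographic Parity (ratio)"],
)
print(matrix.shape())
```

Looking up a cell by its three labels keeps the axes readable; a NumPy array with separate label lists would work equally well.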

Fairness Metrics

Common fairness metrics that exist in ML literature



Overall Accuracy - Fraction of correct predictions on the test set. Higher is better, 1 is ideal.

Overall Brier Score - Measures quality of calibration given predicted probabilities and true classes. Lower is better, 0 is ideal.

Demographic Parity (ratio) - Measures whether different groups are predicted positive at the same rate, regardless of their true labels. Higher is better, 1 is ideal.
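The three overall metrics above can be sketched in plain Python as follows. This is a minimal illustration, not our project code: the function names are hypothetical, labels and predictions are assumed to be 0/1 integers, and the demographic parity ratio is assumed to be the lowest group selection rate divided by the highest.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def selection_rates(y_pred, groups):
    """Share of positive predictions within each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for p, g in zip(y_pred, groups):
        counts[g][0] += p
        counts[g][1] += 1
    return {g: pos / total for g, (pos, total) in counts.items()}


def demographic_parity_ratio(y_pred, groups):
    """Lowest selection rate divided by the highest (assumed definition)."""
    rates = selection_rates(y_pred, groups)
    highest = max(rates.values())
    # Assumption: if no group receives any positive prediction,
    # the rates are all equal, so we report perfect parity.
    return min(rates.values()) / highest if highest > 0 else 1.0


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]
groups = ["a", "a", "a", "b", "b", "b", "b", "b"]

print(accuracy(y_true, y_pred))
print(brier_score(y_true, y_prob))
print(demographic_parity_ratio(y_pred, groups))
```

Note that the demographic parity ratio never looks at `y_true`: it compares only how often each group is predicted positive.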

For the metrics below, we measure the maximum difference between the metric values across all groups in the protected class. Lower is better, with 0 being ideal.

F1 Score (range) - Measure of classification performance that combines precision and recall (their harmonic mean).

Equalized Odds (range) - Measures whether all groups have the same probability of being predicted positive, among both actual positives and actual negatives.

False Positive Rate (range) - Measures the proportion of actual negatives that are incorrectly predicted as positive.

Recall (range) - Measures the proportion of actual positives that are correctly predicted as positive.

Brier Score (range) - Measures calibration quality for a single group.

Accuracy (range) - Measures accuracy of predictions for a single group.
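To make the "range" idea concrete, here is a minimal sketch that computes a metric separately for each group and then takes the largest value minus the smallest. The helper names are hypothetical, labels and predictions are assumed to be 0/1 integers, and the equalized odds range is assumed to be the larger of the recall gap and the false positive rate gap, which is one common convention and not necessarily the one our code uses.

```python
def group_values(metric, y_true, y_pred, groups):
    """Evaluate `metric(y_true, y_pred)` separately within each group."""
    by_group = {}
    for t, p, g in zip(y_true, y_pred, groups):
        trues, preds = by_group.setdefault(g, ([], []))
        trues.append(t)
        preds.append(p)
    return {g: metric(trues, preds) for g, (trues, preds) in by_group.items()}


def group_range(metric, y_true, y_pred, groups):
    """Maximum difference in the metric between any two groups."""
    values = group_values(metric, y_true, y_pred, groups).values()
    return max(values) - min(values)


def recall(y_true, y_pred):
    """Share of actual positives that are predicted positive."""
    preds = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(preds) / len(preds) if preds else 0.0


def false_positive_rate(y_true, y_pred):
    """Share of actual negatives that are predicted positive."""
    preds = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(preds) / len(preds) if preds else 0.0


def equalized_odds_range(y_true, y_pred, groups):
    """Assumed definition: the worse of the recall gap and the FPR gap."""
    return max(
        group_range(recall, y_true, y_pred, groups),
        group_range(false_positive_rate, y_true, y_pred, groups),
    )


# Toy data for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

print(group_values(recall, y_true, y_pred, groups))
print(group_range(false_positive_rate, y_true, y_pred, groups))
print(equalized_odds_range(y_true, y_pred, groups))
```

Any per-group metric with the signature `metric(y_true, y_pred)`, such as accuracy or F1 score, can be passed to `group_range` in the same way.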

Models

Popular machine learning architectures



Logistic Regression - Regression model where log odds are a linear combination of predictor variables.

Naïve Bayes - Simple Bayesian network with low computational time complexity, but which assumes features are conditionally independent given the class.

k-Nearest Neighbors - Classifies points by assigning the most common class among the k nearest points in the feature space; scikit-learn's default is k = 5.

Decision Tree - Learns a branched tree of conditions about each data point, with each leaf assigning the point to the most likely class.

Random Forest - Ensemble method that uses bootstrapping and random subsets of features to create multiple decision trees, choosing the most common prediction as the class label.

Support Vector Machine - Aims to separate the classes with a hyperplane in the feature space that maximizes the margin, i.e. the distance to the nearest points of each class.

Multilayer Perceptron - Basic feedforward neural network for classification, trained with stochastic gradient descent.
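For readers who want to try the same families of models, the sketch below shows how the seven classifiers listed above could be instantiated with scikit-learn. The function `build_models` is hypothetical, and several choices are assumptions rather than our actual settings: `GaussianNB` as the Naïve Bayes variant, `probability=True` on the SVM so that it can output probabilities for the Brier score, `solver="sgd"` for the perceptron, and the `max_iter` values.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_models(random_state=0):
    """Return one unfitted classifier per model family, keyed by name."""
    return {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        # Assumption: Gaussian variant; other variants suit count or binary data.
        "Naive Bayes": GaussianNB(),
        # n_neighbors is left at the scikit-learn default of 5.
        "k-Nearest Neighbors": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=random_state),
        "Random Forest": RandomForestClassifier(random_state=random_state),
        # probability=True enables predict_proba, needed for Brier scores.
        "Support Vector Machine": SVC(probability=True, random_state=random_state),
        # solver="sgd" selects plain stochastic gradient descent.
        "Multilayer Perceptron": MLPClassifier(
            solver="sgd", max_iter=500, random_state=random_state
        ),
    }


models = build_models()
for name in models:
    print(name)
```

Each entry follows the usual scikit-learn interface, so calling `fit(X_train, y_train)` and then `predict` or `predict_proba` works the same way for all seven.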



Matrix Result


Demo

To try out our code with your own CSV, try our demo, which displays a table of selected models and metrics given a CSV and configuration details.