Machine learning course for VIVE

Exercise Set 7: Fairness

In this exercise set, we will be looking at fairness.

We first look at fairness with models, focusing on how to examine a model in relation to fairness criteria (assessment) and then how to post-process a model to make it fairer according to a specific criterion (mitigation).

After this, we look at admissions to Berkeley, a classic example demonstrating a fairness analysis with only historical data and no model.

Fairness with models

In this exercise we will utilize the same dataset as in session 4, Supervised Learning, but this time we will be using the full dataset. The dataset is the Census Income Data Set from the UCI Machine Learning Repository, where the task is to classify high-income people.

A large part of these exercises will be dedicated to getting familiar with fairlearn, a package that makes working with fair machine learning easier. In addition to documenting their code, they also provide a large number of examples and texts on fairness in machine learning in their User Guide.

Exercise 1.1

Based on the code below, what is \(Y, X, A\) using the terminology from the lecture? Is \(A\) part of \(X\), which we will build our model on? If yes, is this required?

# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from fairlearn.datasets import fetch_adult

# Load data
data = fetch_adult(as_frame=True)
X = pd.get_dummies(data.data)
y_true = (data.target == '>50K') * 1
sex = data.data['sex']

# Print data description
print(data.DESCR)
**Author**: Ronny Kohavi and Barry Becker  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Adult) - 1996  
**Please cite**: Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996  

Prediction task is to determine whether a person makes over 50K a year. Extraction was done by Barry Becker from the 1994 Census database. A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1)&& (HRSWK>0))

This is the original version from the UCI repository, with training and test sets merged.

### Variable description

Variables are all self-explanatory except __fnlwgt__. This is a proxy for the demographic background of the people: "People with similar demographic characteristics should have similar weights". This similarity-statement is not transferable across the 51 different states.

Description from the donor of the database: 

The weights on the CPS files are controlled to independent estimates of the civilian noninstitutional population of the US.  These are prepared monthly for us by Population Division here at the Census Bureau. We use 3 sets of controls. These are:
1.  A single cell estimate of the population 16+ for each state.
2.  Controls for Hispanic Origin by age and sex.
3.  Controls by Race, age and sex.

We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used. The term estimate refers to population totals derived from CPS by creating "weighted tallies" of any specified socio-economic characteristics of the population. People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.


### Relevant papers  

Ronny Kohavi and Barry Becker. Data Mining and Visualization, Silicon Graphics.  
e-mail: ronnyk '@' live.com for questions.

Downloaded from openml.org.
data.data.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country'],
      dtype='object')
print('Target value counts')
print(y_true.value_counts())
print()
print('Sex value counts')
print(sex.value_counts())
Target value counts
0    37155
1    11687
Name: class, dtype: int64

Sex value counts
Male      32650
Female    16192
Name: sex, dtype: int64

Exercise 1.2

We will first create a model to analyze. Create a Decision Tree Classifier with max_depth=5 and min_samples_split=50 and fit it on the data. On the basis of this model, predict the outcomes for the same data.

# YOUR CODE
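A minimal sketch of one possible solution, using X and y_true from the loading code above (the random_state is an assumption added for reproducibility, not part of the exercise):

# Fit a shallow decision tree on the full data
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=5, min_samples_split=50, random_state=0)
clf.fit(X, y_true)

# Predict the outcomes for the same data we trained on
y_pred = clf.predict(X)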

A large part of fairness assessment is looking at metrics across sensitive features. To aid in this, fairlearn has a concept called a MetricFrame.

We will start with perhaps the most basic fairness assessment: looking at differences in a performance metric across groups.

Exercise 1.3

Create a MetricFrame with the metric accuracy_score you know from sklearn.metrics

Hints:

The API Docs are available online and support searching!

Creating the MetricFrame itself won't return any output

# YOUR CODE
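A sketch of one possibility, assuming the y_pred from the sketch in exercise 1.2:

from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame

# Group the accuracy metric by the sensitive feature; this produces no output by itself
mf = MetricFrame(metrics=accuracy_score,
                 y_true=y_true,
                 y_pred=y_pred,
                 sensitive_features=sex)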

Assessment

It’s nice to work with the MetricFrame because it lets us look at performance across the sensitive attribute quickly.

Exercise 1.4

Report the overall accuracy and accuracy by group. Does the model perform equally well across sex?

Hints:

The API Docs are available online and support searching!

The MetricFrame has some useful methods

# YOUR CODE
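A sketch, assuming the MetricFrame mf from the sketch in exercise 1.3:

# Overall accuracy and accuracy disaggregated by sex
print('Overall accuracy:', mf.overall)
print(mf.by_group)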

Having now looked at accuracy across sex, we want to move closer to the fairness criteria independence and separation.

Exercise 1.5

In addition to the accuracy score, you should now also include the selection rate, false positive rate and false negative rate in your MetricFrame. Report them overall and across sex.

Does the model satisfy independence, separation or both? Was this expected?

Hints:

The API Docs are available online and support searching!

You can input a dictionary where the keys are names and the values are metrics to the MetricFrame

All three metrics are available in fairlearn.metrics

# YOUR CODE
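A sketch, again reusing y_true, y_pred and sex from earlier; the keys in the dictionary are just display names:

from fairlearn.metrics import selection_rate, false_positive_rate, false_negative_rate

# Keys are names, values are metric functions
metrics = {'accuracy': accuracy_score,
           'selection rate': selection_rate,
           'false positive rate': false_positive_rate,
           'false negative rate': false_negative_rate}

mf = MetricFrame(metrics=metrics,
                 y_true=y_true,
                 y_pred=y_pred,
                 sensitive_features=sex)

print(mf.overall)
print(mf.by_group)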

When reporting multiple metrics, it can sometimes become hard to maintain an overview of the differences and ratios across the sensitive attribute, but luckily MetricFrame can help us with this.

Exercise 1.6

Report the absolute differences and ratios of the metrics across the sensitive feature (sex).

Hints:

The API Docs are available online and support searching!

The MetricFrame has some useful methods

# YOUR CODE
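A sketch, assuming the MetricFrame mf with multiple metrics from the previous exercise:

# Absolute difference and ratio between the groups for each metric
print(mf.difference())
print(mf.ratio())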

However, many people find figures nicer to look at than numbers. Let’s create one!

Exercise 1.7

Create a bar plot of the different metrics across sex.

Hints:

The API Docs are available online and support searching!

MetricFrame.by_group is a DataFrame, which means it supports bar plots through plot.bar.

To achieve a nicer bar plot, you can try changing the following keywords: subplots, layout, legend, figsize and title

# YOUR CODE
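A sketch of one possible plot, assuming mf from the previous sketches; the layout, figure size and title are arbitrary choices:

# One subplot per metric, arranged in a 2x2 grid
mf.by_group.plot.bar(subplots=True,
                     layout=(2, 2),
                     legend=False,
                     figsize=(12, 8),
                     title='Metrics across sex')
plt.show()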

Mitigation

To mitigate fairness concerns, we will utilize the ThresholdOptimizer in fairlearn.postprocessing, which is based on Hardt, M., Price, E., & Srebro, N. (2016). Equality of opportunity in supervised learning. Advances in Neural Information Processing Systems, 29. The idea is the one we discussed during the lecture: using a combination of different thresholds and randomized classifiers, different fairness criteria can be satisfied; for more information, read the paper!

Exercise 1.8

Post-process the classifier we have created to satisfy demographic parity whilst optimizing accuracy.

Hints:

The API Docs are available online and support searching!

fairlearn follows the sklearn syntax with .fit and .predict

# YOUR CODE
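A sketch, assuming the fitted clf from the sketch in exercise 1.2; prefit=True and predict_method='predict_proba' are choices made here, not requirements:

from fairlearn.postprocessing import ThresholdOptimizer

# Post-process the fitted tree to satisfy demographic parity while optimizing accuracy
postprocess_dp = ThresholdOptimizer(estimator=clf,
                                    constraints='demographic_parity',
                                    objective='accuracy_score',
                                    prefit=True,
                                    predict_method='predict_proba')

# Both fit and predict require the sensitive feature
postprocess_dp.fit(X, y_true, sensitive_features=sex)
y_pred_dp = postprocess_dp.predict(X, sensitive_features=sex, random_state=0)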

Exercise 1.9

Examine the metrics of the post-processed classifier. Does it satisfy independence, and at what selection rate? Has it influenced any of the other metrics? Why is this?

Hints:

The API Docs are available online and support searching!

Use any method of visualising metrics across groups that you like

# YOUR CODE
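A sketch, reusing the metrics dictionary from exercise 1.5 and the post-processed predictions y_pred_dp from the sketch above:

mf_dp = MetricFrame(metrics=metrics,
                    y_true=y_true,
                    y_pred=y_pred_dp,
                    sensitive_features=sex)

print(mf_dp.overall)
print(mf_dp.by_group)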

Exercise 1.10

Repeat the two previous exercises, but now satisfying separation.

Hints:

The API Docs are available online and support searching!

separation is also sometimes known as equalized odds.

# YOUR CODE
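A sketch mirroring the demographic parity code above, only swapping the constraint:

# Equalized odds corresponds to the separation criterion
postprocess_eo = ThresholdOptimizer(estimator=clf,
                                    constraints='equalized_odds',
                                    objective='accuracy_score',
                                    prefit=True,
                                    predict_method='predict_proba')

postprocess_eo.fit(X, y_true, sensitive_features=sex)
y_pred_eo = postprocess_eo.predict(X, sensitive_features=sex, random_state=0)

mf_eo = MetricFrame(metrics=metrics,
                    y_true=y_true,
                    y_pred=y_pred_eo,
                    sensitive_features=sex)
print(mf_eo.overall)
print(mf_eo.by_group)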

This ends our voyage with fairlearn, but note that fairlearn also has implementations of reduction algorithms (which support more fairness constraints and fairness for regression), as well as support for intersectionality and control features.

Fairness without models

Sometimes you do not have access to the full modelling process and merely observe outcomes. One such example is the Berkeley_Admissions_Data.csv dataset, a three-way table presenting admissions data from the University of California, Berkeley in 1973 according to the variables department (A, B, C, D, E, F), gender (male, female), and outcome (admitted, denied), encoded as Yes and No. In this case, we can still assess fairness in some ways.

The exercise is inspired by a similar exercise by my colleague Roberta Sinatra, who also supplied the data, and is based on the original paper Bickel, P. J., Hammel, E. A., & O'Connell, J. W. (1975). Sex Bias in Graduate Admissions: Data from Berkeley: Measuring bias is harder than is usually assumed, and the evidence is sometimes contrary to expectation. Science, 187(4175), 398-404.

Exercise 2.1

Load and look at the dataset

Hints:

pandas supports reading many file types.

df = pd.read_csv("Berkeley_Admissions_Data.csv")
df
  Dept  Male Yes  Male No  Female Yes  Female No
0    A       512      313          89         19
1    B       313      207          17          8
2    C       120      205         202        391
3    D       138      279         131        244
4    E        53      138          94        299
5    F        22      351          24        317
6  All      1158     1493         557       1278

Exercise 2.2

Focusing on Berkeley overall, did their admissions suffer from gender bias?

Hints:

Subset the row pertaining to all departments

What fairness metrics can you calculate with the given information?

The code snippet df["new_col"] = df.apply(lambda x: x["col1"] + x["col2"], axis = 1) creates a new column called new_col which adds together x["col1"] and x["col2"]; this can also be used in combination with other common operators such as /, - and *. There is an example of this, along with code for a plot, in exercise 2.3.

# YOUR CODE
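A sketch computing admission (selection) rates for the university as a whole from the "All" row; the column name "Dept" is assumed from the printed data above:

# Overall admission rates by gender
overall = df[df["Dept"] == "All"].iloc[0]
male_rate = overall["Male Yes"] / (overall["Male Yes"] + overall["Male No"])
female_rate = overall["Female Yes"] / (overall["Female Yes"] + overall["Female No"])
print("Male admission rate:  ", round(male_rate, 3))
print("Female admission rate:", round(female_rate, 3))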

Exercise 2.3

Now perform a similar analysis within each department. Do you maintain your conclusion from exercise 2.2?

Hints:

Subset the rows pertaining to each of the departments

The code beneath creates a plot of the number of applicants per department. Perhaps this code could be amended to create other plots, or aid in a discussion.

Do you find any evidence of Simpson's paradox?

# Calculate the number of male and female applicants per department
df["M_num"] = df.apply(lambda x: x["Male Yes"] + x["Male No"], axis = 1)
df["F_num"] = df.apply(lambda x: x["Female Yes"] + x["Female No"], axis = 1)

# Bar plot for the six departments (first six rows, .iloc[0:6])
fig, ax = plt.subplots(figsize=(10, 6))
df[["M_num", "F_num"]].iloc[0:6].plot.bar(title="Number of applicants per department",
                                          rot=0, ylabel="Number of applicants",
                                          xlabel="Departments", ax=ax)
ax.set_xticklabels(["A", "B", "C", "D", "E", "F"])
ax.legend(["Male", "Female"])
plt.show()

# YOUR CODE
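A sketch building on the plotting code above, assuming the columns M_num and F_num have already been created:

# Admission rate per department for each gender
df["M_rate"] = df["Male Yes"] / df["M_num"]
df["F_rate"] = df["Female Yes"] / df["F_num"]

fig, ax = plt.subplots(figsize=(10, 6))
df[["M_rate", "F_rate"]].iloc[0:6].plot.bar(title="Admission rate per department",
                                            rot=0, ylabel="Admission rate",
                                            xlabel="Departments", ax=ax)
ax.set_xticklabels(["A", "B", "C", "D", "E", "F"])
ax.legend(["Male", "Female"])
plt.show()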