
ProPublica's COMPAS Data Revisited. (arXiv:1906.04711v1 [econ.GN])

Tue, 11 Jun 2019 23:03:33 GMT

In this paper I re-examine the COMPAS recidivism score and criminal history
data collected by ProPublica in 2016, which has fueled intense debate and
research in the nascent field of 'algorithmic fairness' or 'fair machine
learning' over the past three years. ProPublica's COMPAS data is used in an
ever-increasing number of studies to test various definitions and methodologies
of algorithmic fairness. This paper takes a closer look at the actual datasets
put together by ProPublica. In doing so, I find that ProPublica made an
important data processing mistake when it created some of the key datasets most
often used by other researchers, in particular the datasets built to study the
likelihood of recidivism within two years of the original COMPAS screening
date. As I show in this paper, ProPublica incorrectly implemented the two-year
sample cutoff rule for recidivists in these datasets, whereas it implemented an
appropriate two-year sample cutoff rule for non-recidivists. As a result,
ProPublica incorrectly kept a disproportionate share of recidivists.
This data processing mistake leads to biased two-year recidivism datasets, with
artificially high recidivism rates. The mistake also affects the positive and
negative predictive values. On the other hand, this data processing mistake does not
impact some of the key statistical measures highlighted by ProPublica and other
researchers, such as the false positive and false negative rates, nor the
overall accuracy.
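
The mechanism described in the abstract can be illustrated with a small sketch. It is not ProPublica's code or the paper's analysis. The record layout, the day-count dates, the 730-day window, and the helper names are all hypothetical, and the "asymmetric" rule is only one plausible reading of the abstract: non-recidivists are kept only if they were observed for a full two years, while recidivists are kept whenever they reoffended within two years. The last two helpers apply Bayes' rule to show why a higher recidivism rate moves the predictive values even when the true and false positive rates are held fixed, which is an assumption made here to mirror the abstract's statement that the error rates are not impacted.

```python
from dataclasses import dataclass
from typing import List, Optional

WINDOW_DAYS = 730  # assumed length of the two-year follow-up window


@dataclass
class Person:
    screening_day: int        # day of COMPAS screening (days since an arbitrary origin)
    recid_day: Optional[int]  # day of first new offense, or None if none was observed


def recidivated_within_window(p: Person) -> bool:
    """True if a new offense was recorded within the follow-up window."""
    return p.recid_day is not None and p.recid_day - p.screening_day <= WINDOW_DAYS


def observed_full_window(p: Person, data_end_day: int) -> bool:
    """True if the person was screened early enough to be followed for the full window."""
    return p.screening_day + WINDOW_DAYS <= data_end_day


def consistent_sample(people: List[Person], data_end_day: int) -> List[Person]:
    """Apply the same screening-date cutoff to everyone."""
    return [p for p in people if observed_full_window(p, data_end_day)]


def asymmetric_sample(people: List[Person], data_end_day: int) -> List[Person]:
    """Apply the cutoff to non-recidivists only (the assumed form of the mistake)."""
    return [
        p for p in people
        if recidivated_within_window(p) or observed_full_window(p, data_end_day)
    ]


def recidivism_rate(sample: List[Person]) -> float:
    return sum(recidivated_within_window(p) for p in sample) / len(sample)


def ppv(prevalence: float, tpr: float, fpr: float) -> float:
    """Positive predictive value from Bayes' rule."""
    return tpr * prevalence / (tpr * prevalence + fpr * (1 - prevalence))


def npv(prevalence: float, tpr: float, fpr: float) -> float:
    """Negative predictive value from Bayes' rule."""
    tnr, fnr = 1 - fpr, 1 - tpr
    return tnr * (1 - prevalence) / (tnr * (1 - prevalence) + fnr * prevalence)


# Toy data (made up): data collection ends on day 1000.
people = [
    Person(0, 100), Person(100, None), Person(200, None), Person(270, 900),
    Person(600, 700), Person(700, None), Person(800, 950),
]
rate_ok = recidivism_rate(consistent_sample(people, 1000))
rate_biased = recidivism_rate(asymmetric_sample(people, 1000))
print(f"consistent cutoff: {rate_ok:.3f}")      # 0.500
print(f"asymmetric cutoff: {rate_biased:.3f}")  # 0.667

# Hypothetical error rates, held fixed while only the base rate changes.
print(f"PPV: {ppv(rate_ok, 0.6, 0.3):.3f} -> {ppv(rate_biased, 0.6, 0.3):.3f}")
print(f"NPV: {npv(rate_ok, 0.6, 0.3):.3f} -> {npv(rate_biased, 0.6, 0.3):.3f}")
```

In this toy sample, the two late-screened recidivists survive the asymmetric rule while the late-screened non-recidivist is dropped, so the measured recidivism rate rises. With the error rates fixed, the higher base rate raises the positive predictive value and lowers the negative predictive value.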