The Surprisingly Nuanced Morality around Privacy Technologies
In which we look under the hood of state-of-the-art privacy technologies and discover moral nuances that are often too subtle to survive into the public discourse.
The US Census Bureau has a difficult job: it is tasked by the US Constitution with providing accurate statistics to the government (both federal and state) and tasked by federal law with protecting the privacy of every individual who participates in the census.
For the longest time, the obvious solution to this conflict was to provide aggregate statistics to those who needed them while keeping the individual contributions secret.
Unfortunately, clever researchers figured out methods to cross-correlate several of those aggregate statistics to “re-identify” the individual contributions.
This forced the US Census Bureau to look for more defensible privacy-protecting release methods for their aggregate statistics. The Bureau consulted a variety of experts and decided to use differential privacy (DP) for the release of the 2020 Census data, as part of a larger effort it calls the "Disclosure Avoidance System" (DAS).
I won’t go into the mathematical details of DP, which can be daunting[1], but the concept is rather straightforward: DP measures how much a single individual’s contribution to a dataset can change what gets released. It gives a rigorous way to evaluate the worst-case scenario of how much exposure to privacy violation a single data contributor assumes for a given data release method. Its main claim to fame is future-proofing: because it is a worst-case bound, no matter what additional data or clever hack someone finds in the future, that exposure will remain the same. A provably hack-proof privacy protection technology!
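For readers who want to see the formal shape of that claim (this is the standard textbook definition, not anything specific to the Census Bureau's system): a randomized algorithm M satisfies ε-differential privacy if, for any two datasets D and D′ that differ in one person's data, and for any set S of possible outputs,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]$$

In words: whether or not your data is included, the probability of any particular release changes by at most a factor of e^ε. That factor is the worst-case exposure mentioned above.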
Here we encounter our first important point of nuance: DP is not a method, it’s a measure of the protection offered by a method. Technically, it makes no sense to say “I’m using differential privacy to protect your data”, just like it would make no sense to say “I’m using differential space to travel to you”. The correct way would be to say “I’m using $algorithm to protect your data and I can guarantee you $amount of differential privacy”, just like we’d say “I’m using $transport to cover $miles to travel to you”. Both $algorithm and $amount are parameters.
When the US Census Bureau says “we’ll use DP” what they really mean is “we’ll use $algorithm and guarantee each data contributor $amount of differential privacy”.
Unfortunately, they found themselves in a pickle: there was lots of appetite around picking $algorithm and very little around picking $amount.
Hunting for Epsilon
In the DP research community, $amount is called epsilon, or ε; it is sometimes also called a “privacy budget”. Once they had settled on which $algorithm to use, the US Census Bureau needed to figure out what ε to use for the 2020 data release.
We encounter the second point of nuance here: epsilon governs a trade-off between privacy and precision. The more “differential privacy” you offer to your data contributors, the less precise the aggregates will be. This is far from obvious, but it happens because DP algorithms fundamentally work by adding some noise to the original data.
You’re probably thinking: that’s crazy! What’s the point of adding noise to the original data?! Wouldn’t it defeat the purpose of the whole data collection?!
Turns out, by carefully tuning how the noise is added ($algorithm) and how much of it is added ($amount), you can show that the results stay within a certain “error range” of the unprotected statistics. Even if you’re not familiar with statistical inference, it’s intuitive that an imprecise measurement is often enough to make good decisions. For example, you don’t need to know the size of your car down to fractions of an inch to find out whether it will fit in your garage; feet will likely be enough.
One clever yet simple $algorithm that offers differential privacy is the randomized response: flip a coin, then tell the truth on tails or say “yes” on heads. Conceptually, we could ask a person the most private, uncomfortable and potentially damaging question (say, “are you an illegal immigrant?”), but nobody would know whether the “yes” answer they submitted was due to their immigration status or due to the (secret) state of their coin. This offers “plausible deniability” to the individual respondent and yet, in aggregate, we know in advance that, on average, half of all answers will be “yes” purely because of the coin. We can subtract that share and double what remains to recover the ratio of yes/no in the underlying population, which is what we wanted. The coin toss effectively destroyed half of our signal, but if the population is large enough the result will be very close, and yet not a single person could be deported based on this information.
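Here is a minimal sketch of that coin-flip scheme in Python, just to make the debiasing step concrete. The 10% “true yes” rate and the population size are made-up numbers for illustration; also note that in this simplified single-coin variant only “yes” answers are deniable (a “no” is always truthful), which is one reason real deployments use slightly more elaborate versions.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Single-coin randomized response: heads -> always say "yes", tails -> tell the truth."""
    heads = random.random() < 0.5
    return True if heads else true_answer

def estimate_yes_rate(responses: list[bool]) -> float:
    """Debias: observed = 0.5 + 0.5 * true_rate, so true_rate = 2 * observed - 1."""
    observed = sum(responses) / len(responses)
    return max(0.0, 2 * observed - 1)

# Hypothetical population where 10% would truthfully answer "yes" (made-up numbers).
population = [random.random() < 0.10 for _ in range(100_000)]
responses = [randomized_response(answer) for answer in population]

print(f"observed 'yes' rate : {sum(responses) / len(responses):.3f}")  # ~0.55
print(f"estimated true rate : {estimate_yes_rate(responses):.3f}")     # ~0.10
```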
In practice, algorithms that offer differential privacy are more sophisticated, but the concept is similar: add some noise to the data in proportion to how much a single person can influence the whole dataset, and release that modified data instead. Even if a super-powerful attacker figured out everyone’s original data but yours, that noise would still protect your privacy by giving you some plausible deniability about what your answer was.
How much deniability? Well, that’s where epsilon comes into play.
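To make that concrete, here is a minimal sketch of the textbook Laplace mechanism. This is not the Census Bureau's production algorithm (theirs is far more sophisticated), and the block count below is made up, but it shows the core relationship: the noise scale is the sensitivity divided by epsilon, so a smaller epsilon means more noise, more deniability, and less precision.

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A count has sensitivity 1: adding or removing one person changes it by at most 1.
    Smaller epsilon -> larger noise scale -> stronger protection, lower precision.
    """
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

true_count = 120  # hypothetical census-block population, for illustration only
for eps in (0.1, 1.0, 19.61):
    print(f"epsilon = {eps:>5}: released count = {laplace_count(true_count, eps):8.1f}")
```

Run it a few times: the ε = 0.1 release jumps around by tens of people, while the ε = 19.61 release barely moves. That variability is the deniability; its absence is the precision.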
Epsilon as a rival good
The US Census Bureau basically needs to find out how much noise to add, which translates into how much imprecision in the aggregate results can be tolerated.
In 2018, the chief scientist of the US Census Bureau co-authored a paper titled “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices”. The abstract alone tells the story in a nutshell (emphasis is mine):
Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.
The solution they proposed can be summarized by a plot of the marginal cost and marginal benefit of privacy protection, which basically says “the optimal privacy protection regime is the one where the benefit of additional privacy protection matches the cost of providing it”. Which, ok, solves one problem but introduces another: what is the social cost of privacy protection?
The paper offers one illuminating example of how the value of epsilon can effectively be tied to the cost of a meal for a child in a public school. By introducing noise, a school now receives more or less funding than it would actually need to serve its population of children. Some schools will get more and some will get less; those that get less will have to offer fewer or smaller meals (or find additional income).
It suddenly gets very real when the vague and exotic epsilon gets directly linked to the risk of a child’s hunger.
Based on their analysis, they show how using a value of ε = 2.52 could cause the misallocation of $0.70 of federal funding for each student, while ε = 4.74 would bring it down to $0.38 per student[2].
The New York Times Stirs the Pot
While debates about data being the new oil and users being the product raged furiously around Big Tech, the debate about the social cost of privacy protection was happening only in small academic circles or in privacy policy groups inside big information technology companies.
This changed overnight when the New York Times dropped a PR nuke on the whole thing. In February 2020, an opinion piece titled “Changes to the Census Could Make Small Towns Disappear” appeared in the NYT Sunday Review. It paints a gloomy picture of how adding noise to the census might lead to small communities being ravaged by the funding discrepancies triggered by the injection of noise to protect privacy.
Suddenly, a debate that had interested only concerned privacy technology practitioners like myself got propelled to the front pages of mainstream media. The article doesn’t mince words:
The more the algorithm muddles the results, the more difficult it will be, for example, for a data scientist to combine a set of addresses and credit scores with census results to learn the age and race of people living on a certain block.
While the algorithm helps protect respondents’ confidentiality, a test run on the last census shows it may produce wildly inaccurate numbers for rural areas and minority populations.
According to the official 2010 census, 90 people lived in Kalawao County, on the northern coast of the Hawaiian island of Molokai. At the time, Kalawao was America’s second-smallest county. Results using the privacy algorithm, however, showed 716 people living there in 2010 — almost an eightfold increase.
In Toksook Bay, the population dropped from 590 people to 540 in the test run. Mr. Pitka said that a decrease in the count due to the privacy algorithm would be “disappointing and hurtful.”
…
Getting a slice of tax revenue may be the primary concern for many local governments. The decennial census will shape the allocation of $1.5 trillion in government spending.
DP effectively guarantees that roughly as many communities will be over-counted as under-counted… but there is a well-known, widespread human cognitive bias called loss aversion: nearly all human beings prefer avoiding losses over acquiring equivalent gains, because the perceived pain of a loss is much steeper than the perceived pleasure of an equivalent gain.
All of a sudden, the use of DP for the 2020 US Census went from being the poster child of privacy protection to feeding a growing perception that losing money in order to protect privacy was a bad bargain.
States sue the Federal Government
It didn’t take long for things to heat up.
In March 2021, the State of Alabama sued the US Department of Commerce over the planned use of differential privacy by the US Census Bureau in the 2020 US Census. The following month, 16 other states joined the suit. The original complaint has even starker words than that NYT article (again, emphasis is mine):
Congress has ordered the Secretary of Commerce to work with the States to learn what they need for redistricting and then report to each State accurate "tabulations of population" for subparts of each State for purposes of "legislative apportionment or districting of such State." 13 U.S.C. § 141(c). But the Secretary, through the Census Bureau, has announced that she will instead provide the States purposefully flawed population tabulations. The Bureau intends to use a novel statistical method called differential privacy to intentionally skew the population tabulations provided to States to use for redistricting. Thus, while the Bureau touts its mission "to count everyone once, only once, and in the right place," it will force Alabama to redistrict using results that purposefully count people in the wrong place.
Rivalry means Politics
The US Census tabulations are the basis for two very politically charged dynamics: federal funding allocation and voting redistricting.
Loss aversion sure plays a big role in the negative emotional reaction to “intentional skew” (to quote the Alabama state representatives), but I feel the emotional effect on voting might be even larger because block-level differentially private aggregates might contain enough variance (due to the noising done to protect individual privacy) to weaken gerrymandering efforts.
Think about it: would you spend a lot of time and effort trying to redraw district lines to influence election outcomes if you weren’t confident that the data you’re using for these calculations was an accurate representation of the voting population?
The value of epsilon got real the moment we started talking about children’s hunger, but it got radioactive the moment the fight for privacy protection touched politicians’ ability to influence elections.
Harvard enters the fray
State representatives were not the only ones worried about the impact of the Disclosure Avoidance System. In a June 2021 paper titled “The Impact of the U.S. Census Disclosure Avoidance System on Redistricting and Voting Rights Analysis”, several authors from Harvard University suggested that:
The injected noise makes it impossible for states to accurately comply with the One Person, One Vote principle. Our analysis finds that the DAS-protected data are biased against certain areas, depending on voter turnout and partisan and racial composition, and that these biases lead to large and unpredictable errors in the analysis of partisan and racial gerrymanders.
Their final recommendation: give up, DP does more harm than good.
Not everybody was happy with this paper. Various differential privacy experts weighed in:
To understand the flaw in the paper’s argument, consider the role of smoking in determining cancer risk. Statistical study of medical data has taught us that smoking causes cancer. Armed with this knowledge, if we are told that 40 year old Mr. S is a smoker, we can conclude that he has an elevated cancer risk. The statistical inference of elevated cancer risk—made before Mr. S was born—did not violate Mr. S’s privacy. To conclude otherwise is to define science to be a privacy attack. This is the mistake made in the paper.
This is basically what Kenny et al. found.
Aside from the entertainment value of the confrontation, what I find interesting here is that while the Kenny et al. paper was pointing at the fact that DP might effectively have harmful societal consequences, the DP practitioners focused exclusively on the definition of privacy and whether statistical inference should or should not be considered a “privacy” violation. The part of the paper that questioned societal harm did not enter the picture.
The US Census Bureau Sets Epsilon
Only 8 days after the Kenny et al. paper was made available (so it’s unlikely there is any causal link between the two), the US Census Bureau made their recommendation on the topic:
they are sticking to offering differential privacy guarantees.
the global privacy-loss budget will be ε = 19.61 (17.14 for the persons file + 2.47 for the housing unit data).
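As an aside, the two budgets in parentheses add up to the global one because privacy-loss budgets compose: under DP's basic composition theorem, if one release satisfies ε1-DP and another satisfies ε2-DP, publishing both satisfies (ε1 + ε2)-DP. A trivial sketch:

```python
# Basic (sequential) composition of pure-DP budgets: the epsilons simply add up.
persons_file_eps = 17.14
housing_units_eps = 2.47
global_eps = persons_file_eps + housing_units_eps
print(f"global privacy-loss budget: {global_eps:.2f}")  # 19.61
```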
You’re probably thinking: “all right, cool, what’s the big deal?”. But I have one final thing to show you.
One thing that I didn’t mention before is that epsilon lives in an exponent: the DP guarantee bounds the change in the probability of any given release by a factor of e^ε, so epsilon effectively behaves like a logarithmic scale, similar to the Richter scale of earthquake magnitude or decibels.
So, the ratio between ε = 2.52 (the epsilon suggested in the 2018 paper, and within the range generally considered safe by the DP community) and ε = 19.61 (the current recommendation) is e^19.61 / e^2.52 = e^17.09 ≈ 26,400,000.
The privacy protection afforded to participants in the US census by the use of differential privacy had to be weakened by an eye-popping factor of roughly 26 million (more than 2.6 billion percent) relative to privacy expert recommendations in order to satisfy collective needs!
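If you want to check that arithmetic yourself, here is a quick (and entirely optional) sanity check:

```python
import math

# Ratio of the worst-case probability-change bounds e^epsilon for the two budgets.
ratio = math.exp(19.61 - 2.52)   # same as exp(19.61) / exp(2.52)
print(f"{ratio:,.0f}")           # ≈ 26,400,000
print(f"{ratio * 100:,.0f} %")   # ≈ 2.6 billion percent
```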
Considerations
This story tells us several things that I think are important to highlight, in no particular order:
There is often a rivalrous tension between what benefits individuals and what benefits groups, and this is often conveniently left out of debates around privacy and consumer protection. This tension has existed in political discourse for as long as humans have been able to talk about it. I’m not surprised that math and computer science can’t solve it, but I am surprised by how absent such considerations are in the discourse around privacy protection technologies.
It seems to be really hard for most people to navigate the morality of such a nuanced discourse. It feels like a superimposed duality that needs to collapse onto some certainty to make sense (DP is good, you don’t get it! DP is bad, give it up!). In practice, DP is just a framing methodology for thinking rigorously about information leakage; it says nothing about risks or costs, and even less about the balanced debates that need to happen to bridge mathematical methods and policy.
It isn’t often that you see a government push the envelope in terms of technological innovation, especially under intense and disjoint legislative constraints, in a volatile political environment, and during a pandemic! There are a number of civil servants who should be commended for this work and their contribution to the field of responsible information technology. In my opinion, you are unsung heroes.
The government had to dilute privacy protections to achieve its goals of societal value. It’s frankly unclear how much protection is left in practice with these now very lax bounds. But it’s still nice to see statistical rigor reflected in the picture, at least in some capacity. Looking forward to seeing what happens in this space after the 2020 census release is made public.
It’s proper and healthy to worry about the risk of abuse and misuse of data by large and powerful information technology companies. But it’s unproductive to believe there is a silver-bullet solution that provides both ironclad individual privacy protections and the ability to deliver collective benefits via the use of such tools. Things like Apple’s “what happens on your phone stays on your phone” taunt campaigns are not just misleading (how could you text somebody if that text only stayed on your phone?!) but set a tone for a public perception that clean-cut solutions exist and that, when they are not used, it is simply because they are not wanted. The difficulties the US Census Bureau endured to bring more rigorous privacy protection to data contributors are refreshing because they show these are really hard problems to solve even when it’s clear all the appropriate incentives are there.
[1] If you’re curious about this, head over to Damien’s blog and read his excellently explained series of posts on the topic. He also has a fantastic piece that explains how reconstruction attacks work.
[2] In practice, protection is inversely correlated with epsilon: more epsilon means “more precision”, not “more protection”.