Differential Privacy in the US Census

I talk with David van Riper, Director of Spatial Analysis at the Minnesota Population Center, about the implementation of differential privacy techniques in the 2020 US Census. We discuss some of the new issues demographers and researchers dependent on the census will face this decade and the intricate tradeoffs inherent in designing a census which is both accurate and confidential.

Interview Transcript

Intro: The United States decennial census undertakes the grand and complicated mission of counting and surveying the nation’s now over 300 million inhabitants. Conducted every ten years, the census hopes to capture the demographic makeup of this vast nation, tabulating the age, sex, race, and ethnicity of every individual from coast to coast, leaving, hopefully, no person uncounted or forgotten.

While the census’ original purpose was to apportion the US House of Representatives, it has evolved over two centuries into a governmental and socioeconomic behemoth, dictating all manner of functions, such as the allocation of federal tax dollars, state redistricting, and national infrastructure and economic planning.

Census data has long been part and parcel of social science research and commercial strategy. Communities and associations at all geographic strata, ranging from states to tribal lands, have come to depend regularly on the census for their various needs. These constituents are now referred to as the census stakeholders, for much is at stake for them if the census fails to deliver on its mission.

Over the years, the decennial census has grown in size and scope with the infrastructure of this country. The census was vital to wartime mobilization in every major American conflict of the past century. However, the census’ history has not been without its controversies and political shenanigans. The census was embroiled in some of the major legislative debates of this country’s history, ranging from slave representation during the antebellum period to the immigration quotas of the 1920s.

The year 2020 opens a new chapter in the census history book. The Census Bureau now faces a different kind of challenge, a never-before-seen technological danger that threatens to change how stakeholders use the census, and possibly even the very spirit of the census itself, for all future generations.

Recent advances in computing power and the widespread availability of commercial data about millions in our populace now allow someone with just a laptop to unearth a frightening amount of information by analyzing the publicly released census tables. Such an adversary can expose confidential information about individual census participants, thereby reidentifying them. In 2016, experts at the Census Bureau, fearing just this, staged a simulated attack on the Bureau's own published data to see just how far one could go in reidentifying the American populace. The results were startling: 50 percent of people were successfully reidentified (that is, their sex, age, race, ethnicity, and location exposed) and, if you allowed for one mistake, such as getting a person’s age wrong by one or two years, the figure went up to a whopping 90 percent of the populace.
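
To make the idea of such a reconstruction attack concrete, here is a minimal, purely hypothetical sketch in Python: a tiny census block publishes only a few summary statistics, and a brute-force search recovers every set of individual records consistent with them. The statistics and numbers are invented for illustration; the Bureau's experiment worked at vastly larger scale and with real published tables.

```python
from itertools import combinations_with_replacement

# Hypothetical published table for a 3-person block: only aggregates.
published = {"count": 3, "sum_age": 100, "min_age": 20, "max_age": 45}

# Brute-force every possible combination of ages (0..115) and keep
# the ones consistent with all the published statistics.
candidates = [
    ages
    for ages in combinations_with_replacement(range(116), published["count"])
    if sum(ages) == published["sum_age"]
    and min(ages) == published["min_age"]
    and max(ages) == published["max_age"]
]
print(candidates)  # [(20, 35, 45)] -- the block is reconstructed exactly
```

With more published tables (age by sex, race by age, and so on), the candidate set typically collapses to a single solution, and linking those reconstructed records to a commercial database is what does the reidentifying.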

Prompted by these findings, the bureau turned to differential privacy, a cutting-edge theoretical concept that could mathematically guarantee privacy against reidentification attacks. In December of 2018, the Bureau announced that the 2020 Census would, for the first time, implement differential privacy as a safeguard.

Getting into the nitty gritty for a bit, differential privacy calls for adding a certain amount of random noise to a data set before it’s released to the public. The million-dollar question here is “how much noise?”. One might imagine controlling the amount of noise injection with a knob. Dialing the knob one way injects more noise, granting confidentiality to census participants but to the dismay of stakeholders, who might find the census has been overly contaminated for their purposes. Dialing the knob the other way, we get a greater risk of privacy breach, but an assurance of accuracy and utility.

In other words, there’s a tradeoff between usability and risk, or between the accuracy of the dataset and the privacy of it, determined by our random noise knob. The Bureau sets this knob using “epsilon” which, despite the fancy and math-y sounding name, is just a privacy-loss budget: a cap on how much privacy can be spent across the released tables. The smaller the epsilon, the more noise must be injected.
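
As a rough sketch of how this knob works, here is the textbook Laplace mechanism in Python. This is the basic building block behind differentially private counts, not the Bureau's actual TopDown algorithm, and the function name and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def noisy_count(true_count, epsilon, sensitivity=1.0):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1 (the
    sensitivity), so Laplace noise with scale sensitivity/epsilon is
    enough to mask any single individual's presence.
    """
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Turning the knob: the same block of 40 people under two budgets.
print(noisy_count(40, epsilon=0.1))   # strong privacy, heavy noise
print(noisy_count(40, epsilon=10.0))  # weak privacy, light noise
```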

Now, in terms of the law, Title 13 of the U.S. Code mandates that the bureau must keep each census participant’s information private, and this is the creed which the bureau has upheld for over fifty years. That is, disclosure avoidance, or protecting the confidentiality of the people, is not merely a 2020 issue, but one the Bureau has wrestled with for several decades.

However, now in the digital age, where we see an ever-increasing number of data thefts, hacks, and leaks, the Bureau must once again adapt, as it has in the past, this time with a new weapon in hand: differential privacy.

But, as it turns out, this weapon is something of a double-edged sword, and some say the Bureau has turned the knob in the wrong direction, overly favoring privacy over accuracy.

Today, I speak with David van Riper. David is the Director of Spatial Analysis at the Minnesota Population Center and is also associated with IPUMS, which stands for Integrated Public Use Microdata Series. They basically handle the selective dispersal of census and survey data to researchers. David and his boss, Professor Steven Ruggles at the University of Minnesota, have been at the fore of many of the criticisms raised against the Bureau over its use of differential privacy.

In my discussion with David, we go over some of the specific harms this privacy approach can cause and the real-life consequences of the census losing some of its accuracy, especially at local levels. We also discuss what communication between demographers and the Bureau has been like during the last few years, the technical challenges of understanding the effect of noise injection on census data, and whether the privacy-accuracy knob has indeed been set correctly by the Bureau.

Without further ado, here’s David.

Joe: It’s great to have you on. So you’ve been one of the most vocal critics of the use of differential privacy techniques in the 2020 census. So just to get the ball rolling here, what do you see as the main issue with the use of this new privacy method?

David: I think the decennial census is really the cornerstone, or what I would say is the cornerstone, of American democracy, right? It really forms the basis for congressional apportionment, although that’s not impacted by differential privacy, but also legislative redistricting at the local level, the state level, and the federal level. Along with that, you know, the decennial census forms the basis for the population estimates program and the American Community Survey. And all of these are used to disburse millions of dollars in federal aid to states and localities. Additionally, the decennial census is used in lots of applications by local urban planners and by public health experts who are looking at, you know, age-adjusting rates of COVID, or trying to determine locations for vaccine siting. And differential privacy as a tool, I find it to be an incredibly blunt instrument, in that essentially every statistic, with very few exceptions, can have noise injected into it. And then the data have to be post-processed in order to make them consistent by variable type, so that males and females add up to the totals, and by geography, so that counties add up to the states. And all of this is leading to the data that we’ve seen thus far being incredibly inaccurate, or incredibly different compared to, in this case, the 2010 data that we have the ability to compare it to. And what I’ve seen thus far is that the current level of inaccuracy is just too great for the data to be really useful for this broad sweep of applications. So I think that you’ve got this incredibly broad user base who wants to use the statistics for an incredibly large, broad swath of applications, and yet you’ve got this kind of method that essentially injects noise into everything, making it difficult for all of these users to continue to use the data. And really, I think, it puts the Census Bureau at risk of harming their relationships with, you know, all these groups that use their data.

Joe: So what are some of the practical real life ramifications of having inaccurate data be presented in the census?

David: Yeah. So one of the things that I’ve been working on lately has been some public health related applications. So me and some colleagues at the Centers for Disease Control have been looking at age-adjusted rates of asthma visits to emergency departments. And we have the data for all towns in Massachusetts. And we are finding that if you compute the age-adjusted rate using the differentially private data versus the published 2010 data, in particular for small towns, like towns with fewer than 1,500 people, you can see those age-adjusted rates differing by 10 to 20 percentage points. So all of a sudden, if you’re using the differentially private data, you may think there’s a big spike in asthma ED rates, and then you would inefficiently allocate resources to those locations, or, vice versa, it may look like there’s a decline in asthma ED rates, and you’re not targeting those communities effectively with the resources they need to address that issue. And so I think you really run the risk of seeing changes in the statistics coming out of the data where the change is not due to actual changes in the numerator, right? We’re just seeing the denominator change. And so you’re seeing rates bounce up and down, but it’s all a function of the noise in the differentially private data, not actual changes in asthma ED visit rates. And we’ve seen similar results for syphilis at the county level. We’ve seen similar results for asthma ED visits at the county level throughout the US. And so I think the public health infrastructure could really be in tough shape in terms of allocating the scarce resources we have efficiently.
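
To illustrate why the smallest towns suffer most, here is a toy simulation assuming the simplest possible noise model: an exact numerator (the ED visits) divided by a Laplace-noised population count. The Bureau's actual pipeline is far more complex, and every number below is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def rate_spread(events, population, epsilon, trials=10_000):
    """Standard deviation of a per-1,000 rate when only the census
    denominator carries Laplace noise of scale 1/epsilon."""
    noisy_pop = population + rng.laplace(0.0, 1.0 / epsilon, size=trials)
    rates = events / np.clip(noisy_pop, 1.0, None) * 1_000
    return rates.std()

# Identical 20-per-1,000 true rate, identical absolute noise...
for pop in (1_000, 100_000):
    print(pop, round(rate_spread(events=pop // 50, population=pop,
                                 epsilon=0.05), 3))
# ...but the small town's rate swings about 100x more in relative
# terms, which is the instability described above.
```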

Joe: So the idea behind differential privacy in theory, though, is that the noise injected in a particular part of the data should, in the grand scheme of things, sort of wash out when you look at the data in its totality. But you’re saying that the issue is arising at a local level, when you look at these small populations or these small datasets, where somehow the noise becomes more pronounced?

David: It’s not just that the noise is there; it’s that the full spread of the noise is there, right? If you want to intervene at a local level, you have to intervene at that local level alone. You can’t just average it out: oh, across all of Massachusetts, at the town level, the data are okay. I need to go into this particular city and say, hey, I see something’s going on here. And because of that noise injection, you don’t really know how that rate might be changing. And so yeah, in totality, the noise does kind of average out over all places. But if you want to intervene, you have to intervene for a particular geographic unit in a particular locality. And those two things really seem to be at odds with, you know, this particular methodology.

Joe: So, going back to your earlier description of the process itself, is the issue here really coming from the noise injection itself? Or is it somehow this post-processing of the data that you mentioned, done to satisfy the various invariants, such as the total state or county population, and to make the numbers sensible? Where, in practice, do you find the source of this inaccuracy?

David: So I’ll say I actually don’t totally know the source of the inaccuracies. The Census Bureau will argue that the inaccuracies are mostly due to the post-processing, and not due to the noise injection, and truthfully, I can’t say whether that’s true. I think that might well be the case. And I think the drawback is the post-processing makes it very, very challenging for the Bureau to, say, produce confidence bounds or confidence intervals around those statistics. If they were just injecting noise and publishing that, we could, as data users, say, okay, we think that’s a count plus or minus something, based on the distribution they draw from. But because they have to do the post-processing, there’s no easy way for them to create those error bounds. And you see a statistic, and you don’t really have any way of telling how accurate it really is, except for the fact that you know how much epsilon was allocated to a particular statistic at a particular geographic level, so you can kind of back-of-the-envelope tell what’s going on. So, you know, I’d say both things are definitely causing it. I think the smaller the magnitude of the noise that gets injected, the less impact post-processing has, I’m hoping, because you’re going to have fewer negative counts, and you’re going to have the ability to do more consistency checks in post-processing without having to move counts around so much. And so if you make the magnitude smaller, hopefully it takes less post-processing, which kind of lessens the impact that has.
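
A toy version of the post-processing step makes the problem visible. The sketch below is not the Bureau's optimization-based TopDown algorithm, just the simplest stand-in (clip negatives, rescale to an invariant total) to show how post-processing biases small counts and defeats simple plus-or-minus error bounds.

```python
import numpy as np

def naive_postprocess(noisy_counts, invariant_total):
    """Clip negative counts to zero, then rescale so the block counts
    sum exactly to a fixed (invariant) total."""
    clipped = np.clip(noisy_counts, 0.0, None)
    if clipped.sum() == 0:
        return clipped
    return clipped * (invariant_total / clipped.sum())

rng = np.random.default_rng(seed=2)
true = np.array([2.0, 3.0, 5.0, 990.0])       # three tiny blocks, one big one
noisy = true + rng.laplace(0.0, 4.0, size=4)  # raw noise, before processing
print(naive_postprocess(noisy, invariant_total=true.sum()).round(1))
# Clipping means tiny blocks can only err upward, and the rescaling
# quietly drains that extra mass from the big block, so the error in
# any one count is no longer a clean, symmetric Laplace draw.
```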

Joe: So it seems like the inaccuracy is coming a little bit from both: partly, I guess, from the theoretical setup of what you define as privacy, and partly from the particular implementation details of what the Bureau is doing for the census. But is there, I guess, a genuine appreciation among demographers and end users of the census that differential privacy is, as a concept, a valid notion of privacy here, or is there contention in that as well?

David: I think that there’s contention. I think there are many data users, demographers and others, who definitely understand where the Census Bureau is coming from, right? The Title 13 requirement that’s put on the Census Bureau, we definitely take that seriously. I totally get that that’s an issue. But I think some demographers think that the Census Bureau is reading too much into their Title 13 requirements. Some demographers read Title 13 as meaning you can’t release the name or the address of an individual person, because those would make them identifiable, while their age or their sex, the characteristics of a person, don’t fall under Title 13. So I think people are reading that law differently, and that’s where a lot of the contention, I think, comes in. The Bureau, reading the law the way they do, has gone to differential privacy as the method. Demographers, reading it as covering just the identifiers and not the characteristics, think that the traditional ways of doing disclosure control, like swapping, would suffice and still meet the letter of the law.

Joe: Many of these characteristics you mentioned, such as your race, ethnicity, or age, aren’t really private information, in the sense that they’re publicly visible to anyone who might know me or a particular individual, and they’re more or less obtainable from publicly available sources of information. So I guess the real danger that many people have been mentioning is that this data, however innocuous it may appear, could be cross-referenced with other sources of data that do contain more sensitive information, such as, you know, medical records or credit information. So do you feel somehow that the burden of privacy has unfairly been shifted onto the census here, when really the breach may be in other realms?

David: Right, yeah. I think there are two things going on here. One, the decennial census is a complete enumeration of the population, right? So of all the datasets that the federal government collects, it’s probably the one that’s the most susceptible to the reconstruction and re-identification attack, because it technically is all of us. And so of all the products, it’s definitely the one most at risk of private data being leaked. Two: for some datasets, the Census Bureau or other data producers require you to sign agreements saying that I will not, you know, link these data with other products, under penalty of law. And I’ve signed many of those in my life to use health data or other data sets. But the Census Bureau is essentially saying that because of this Title 13 requirement, we have to do everything we can to make sure the data will never be used nefariously, when that burden could be placed on the user. And maybe that’s a way we should be moving in the US: placing more of the burden on the user to follow the Title 13 laws.

Joe: And thinking about the history of these privacy methods themselves. In previous iterations of the census, the Bureau relied on other disclosure avoidance techniques, such as swapping the values of two households in a particular region, or hiding particularly sensitive values that might identify someone. They also imputed, or filled in, missing, incomplete, or nonsensical form answers using their own correction schemes, many of which were never publicly revealed in full. In short, the end product was never really the “population ground truth,” and demographers have always had to accommodate this, for, I guess, about as long as the census has existed. So how do you see these accommodations changing for the differentially private data? And by accommodations here, I mean, like, imparting some uncertainty due to the noise injection, or maybe taking into account some kind of potential bias.

David: So I think, typically, the decennial census being a census of the population, demographers effectively treated it as essentially ground truth, even though there was swapping, even though there was, you know, imputation, because there was no other data set to work from. We didn’t have a comparison data set that would let us say, oh, the decennial census is inaccurate here. And the Bureau didn’t give us the swap rates, or the statistics that would let you try to back that out. So while demographers would admit that, yes, we know there’s uncertainty in the data, there was no way for us to accommodate it, because we didn’t know how to do it, even though we knew it’s not perfect, right? We know that’s true. With respect to moving forward, I think demographers are certainly willing to build those uncertainty estimates into their modeling, as long as we get guidance from the Bureau about how to do that, right? As of today, the Bureau does not plan to release any error bounds or confidence intervals for any of their statistics. And I guess there are some simulation methods that have been developed that you can use to kind of bootstrap or simulate the error bounds, but if the Bureau wants people to build the uncertainty into their modeling, they are going to have to help users out by providing the tools and methods to do that. But I think we would be willing to build those into our models, as long as we get help, you know, in doing that.
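
The simulation idea David mentions can be sketched as follows, under a strong assumption stated up front: that the release is pure Laplace noise with a known epsilon. The Bureau's post-processing violates exactly this assumption, which is why bootstrapped bounds like these are only a rough heuristic.

```python
import numpy as np

def simulated_bounds(published_counts, epsilon, runs=5_000, seed=3):
    """Treat the published counts as if they were the truth, re-apply
    the assumed Laplace noise model many times, and read off empirical
    95% intervals for each count."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(published_counts, dtype=float)
    draws = counts + rng.laplace(0.0, 1.0 / epsilon,
                                 size=(runs, counts.size))
    lo, hi = np.percentile(draws, [2.5, 97.5], axis=0)
    return np.column_stack([lo, hi])

print(simulated_bounds([120, 45, 7], epsilon=0.25).round(1))
# Each row is a rough interval for one published count.
```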

Joe: So communication between the Bureau and the end users is crucial here.

David: Yes, yeah, absolutely. Yep. I mean, in this process of moving to differential privacy, and here I would critique the Census Bureau, that has been a weak spot. We have not had a lot of communication, at first, about what this was going to do. And we as end users did a lot of analysis to figure out what this was going to mean. So we were always kind of chasing the Bureau, trying to get up to speed on what this means. And then the Bureau says, oh, you can just use the Laplace or the double geometric distribution to back out the noise. And then you ask them, and they say, no, you can’t do that, because of post-processing, and there’s no easy way. And we’re like, we’re going to need more assurances that this is possible. And we just haven’t gotten that yet.

Joe: But many of these previous methods, you know, such as swapping data between households, relied on the so-called maxim of privacy through obscurity, where details of the implementation would be hidden from those outside the Bureau, for fear that such knowledge might prove too useful to an attacker. So is differential privacy any better here in being more transparent about the procedure? And does that outweigh some of your perceived negatives? Or has the novelty and learning curve of differential privacy made this transparency less apparent to users and demographers? And could that change over time?

David: Yeah. Well, okay, so I think there are two different things, right? There’s transparency in terms of the basic method, right? We know they do some assignment of the privacy-loss budget, and they draw random values from statistical distributions. The post-processing is a little bit of a black box; we certainly don’t have a full sense of how that works. But the transparency has certainly helped raise the level of knowledge about, kind of, disclosure avoidance in the decennial census. I think we’re still worried about the transparency related to how the privacy-loss budget gets allocated, right? That’s a policy decision that’s going to be made by a committee at the Census Bureau. And unless they publish all of their notes about how they came to their final decision, there’s still some level of opaqueness, even in differential privacy, because that decision could still be held by a subset of people in an agency. And if they’re like, that’s private, that seems to really go against the benefits of differential privacy. I think that if the Census Bureau were planning differential privacy for the 2030 census, and they had started talking to us three years ago and started putting this out, users would be, you know, totally willing to do everything we’ve done, thinking about 2030. But bringing this up three years before the 2020 decennial census, when no one in the user community had heard about it, that has really kind of poisoned the well. And I think if we had just had more lead time to get up to speed, that would have helped smooth out the rollout process.

Joe: So a lot of these parameters or knobs in question, for example, the privacy budget epsilon, how that budget is allocated across the different geographies, what invariants are being set: they all have a pretty complicated interplay and are, I guess, in tandem contributing to these issues you’re mentioning. At the same time, I’m also getting the feeling that some of these parameters, while probably chosen using, you know, evidence-based heuristics and practical considerations, were ultimately chosen somewhat ad hoc. Do you feel like more time spent on these choices, as you alluded to, will help us understand how best to set these parameters? And will that sort of get rid of this effective opaqueness?

David: Yeah, I think I want to give the Census Bureau a lot of credit for putting out so many demonstration data sets that have kind of tweaked some of these knobs and have helped the user community see what happens when you, you know, change the allocation to different queries: you get improved accuracy in some cases and reduced accuracy in others. I think that has helped a lot, because it has opened up space for the user community to really think and talk about what the critical statistics are that we really want to use. I think if the Bureau had started collecting all the data they’ve spent the last year collecting from user groups, having all these sit-downs, if they had started that in 2015, they would have had five years to have that conversation. And I think we just need more of that time to really, you know, smooth out the relationship and reduce the opaqueness, right? So if we see, okay, changing this knob really has very little impact on the data set, we don’t have to worry about that parameter anymore; but this parameter over here, that’s the critical one we need to talk about. If we could kind of coalesce around that, that would help the user community feel like we’re being heard. And the Bureau would have time then to modify their methods to take that into account.

Joe: And some have called for the release of this so-called pre-post-processing data, where, you know, the negative and fractional values, warts and all, would be made available. Would having this data set in hand alleviate some of the concerns here?

David: Purely from a researcher perspective, with my hat on as someone who wants to use it, I would love to see that data. I think that data would be incredibly powerful, because it would help us get a sense of what a realistic version of the data with just noise injected into it looks like, right? Like, what are the ranges of the values for particular statistics that we would be working with? That would be very helpful for users in starting to develop the methods you can use to analyze these more uncertain datasets. And so I think that’s where you could really see a cool triangulation among statisticians, data users, and computer scientists to really get at, oh, we can develop methods for doing a, b, and c if we have the raw noise-injected data, which we can’t really do with the post-processed data set. So I’ve been an advocate for that. And I think that would be really helpful for users in understanding how it operates.

Joe: Right, but the fear, I guess, on the Bureau’s end has been, in part, that people might just be overly intimidated, or might be overly dismissive of the census altogether, if they start seeing these, you know, negative or fractional values. So in some sense, is some level of maturity here required in the research community to sort of handle this data set?

David: I actually don’t think so. I think the research community is ready to handle it. I think it’s the local governments, the agencies, that don’t have the sophistication to analyze that. And that’s where, if you put two statistics out (say I’m looking at, you know, children under the age of five because I need to open a new school, and I’ve got a count of 100 in my published data set, and the raw one is 150), if you don’t know that there’s uncertainty around that, you need to be more sophisticated. And that’s just going to take time, right? We need to build that training into college classes, into higher education. And as of now, as far as I know, Gary King at Harvard is one of the few statisticians trying to even develop inferential methods for using these data. Otherwise, most of the work has been on how do I make privatized data, not how do I analyze it. And I think the fact that we don’t have those analytical methods yet is really scary. But I think the user community is willing to work with that. The user community is not the one who is going to develop those, though; it has to be done in conversation with statisticians and computer scientists.

Joe: I guess it sort of reminds me of about 100 years ago, when sampling was first introduced to take away some of the long-form questions from the census and put them in surveys, or to check against the accuracy of the census. Sampling, when it was first introduced, was met with a lot of disbelief by legislators and policymakers, but eventually it was accepted over time. So perhaps this might…

David: Yeah. Statisticians were developing the methods for analyzing sample data. And it feels like we’re at this phase where the computer scientists are developing the methods for doing differential privacy, the statisticians are still trying to figure out what to do with that, and the users are still way back here, waiting for all of this to get sorted out. Because, I mean, while the Bureau says that differential privacy is not new, it is only 15 years old. It’s not like it has 100 years of sampling theory behind it. It still is pretty new compared to a lot of things we deal with.

Joe: Right. One salient case where some of the issues you mentioned arise is in counting the tribal populations of indigenous or Native American tribes. And one of the issues raised is that there’s sort of a random lottery going on with the various blocks, where there would be negative or positive biases, undercounts or overcounts, and it would sort of be the luck of the draw across different blocks. And this appears to me like an effective swapping is happening between groups of people being traded between larger, more populous areas, such as big cities, and smaller municipalities. Do you find this a useful way of looking at it, like there’s a newer version of swapping happening here?

David: Right. That’s kind of how I see it, right? The small places can’t go below zero, so they kind of get an upward bias, right? The big places can move down; they can kind of keep moving down. And so you’re kind of moving people from the large-population areas to the small-population areas. I think the tricky thing about the Native American populations is that on reservations, for example, the Native population is the big group, and so their count is kind of getting pushed down, and non-Native American counts are kind of moving up, because those are quite small, relatively speaking. And so you can see that in places where a group that’s a minority in terms of the whole US is the local majority, they’re seeing their population kind of diluted or diminished. That’s very nerve-wracking for these groups, and they have every right to be, you know, worried about that, because their political representation and, you know, federal funds, especially to the tribes, are very dependent on the counts in those areas. And if you’re seeing fewer AIAN-identifying (American Indian and Alaska Native) people there, that means you’re getting less money. You know, how fair is that to those communities? I argue it’s not very fair.

Joe: Alright, so there’s a lot at stake here for the local user. So one suggestion that’s been brought up is to change the geographical precision of the census, or to do away with publishing precise block-level data, which was actually the norm for, I guess, some of the census’ history. But I understand there are many end users who ultimately depend on the census for their local needs, both by tradition and also by law. So do you think this is a good long-term idea? Or is this just sweeping the problem under the rug?

David: In terms of not publishing block data? I think that publishing data at the block level should still be done, but I think you can question how many statistics we need to publish at the block level. The argument I would go for is: okay, how are the block-level data required for various legal purposes, and what statistics do those purposes use? Let’s publish just those statistics at the block level, and then maybe we can aggregate all other stats up to the block group or tract or some other level. What’s the minimum required at the block level? I think that’s the way to go. Now that we have redistricting requirements that kind of require equal populations, and then all these special districts that are defined by blocks, I don’t think there’s any way to totally go away from census blocks as a unit of analysis. Maybe you could create a new geography between blocks and block groups that’s slightly coarser, that would allow you to do legal analysis but not quite be at that fine-grained block level.

Joe: But it would probably be too much of a paradigm shift for most of the practical considerations.

David: That would be quite different. And I think it would take time, right? You’d have to start that conversation now for 2030 to get users ready for that.

Joe: Right. Right. The timetable for releasing information about the 2020 census clearly wouldn’t be appropriate. Right?

David: Right. Yeah, that’s not great; that’s not gonna happen. But that’s a worthwhile conversation to have. We should have that conversation over the next nine years, thinking about 2030 and getting to that. You know, if this was 2011, and we were starting this conversation for 2020, that’d be great. I think that would have been a great way to have nine years to get ready. And so if you go to the user community, we can come up with lots of ideas and ways to maybe make this better. It’s more a function of, like, the Bureau will come back and say, oh, we didn’t have that in a Federal Register notice, so we can’t do that. And I’m like, well, you didn’t ask us whether we thought that was useful to even be in play. It’s putting the cart before the horse. And I attribute some of that to the lack of communication around this transition, and how users did feel a bit blindsided by this.

Joe: I see, and that’s related to the issue of communication. So another, somewhat less technical, line of criticism levied at the 2020 census, made, for example, by some at the National Conference of State Legislatures, or the NCSL, is that any deviation from the Constitution’s one person, one vote principle shouldn’t even be on the table: a hostility at the mere prospect of the Bureau publishing falsified numbers at all, despite the fact that it, in some sense, has for most of its history. So the 2020 census does call for greater statistical education for the census users, legislators, and also the general public, who, I guess, play the most crucial role as the census participants themselves. Differential privacy and the 2020 census disclosure avoidance methods are right now considered a very high-tech concept, but they might one day be accepted by the general public for what they really are, that is, you know, very strong privacy measures. So could this reassure people who might otherwise not have historically participated in the census that it’s safe and confidential to do so, and maybe, hopefully, resolve some of the historical undercount issues lingering in the census from way back when? Or is this sort of a pipe dream?

David: I think, as of today, it’s a pipe dream. Trying to explain to someone that, oh, we’re only going to leak so much of your privacy: people just don’t understand what that means. I mean, I am as educated on this as any data user is, and I still don’t fully understand what it means to have my privacy leaked when a statistic is published. And I don’t understand how to even quantify that for somebody, right? Like, e to the epsilon; that would need to be translated. So I think that, while it might at some point in time affect people’s willingness, I don’t think it’s at the forefront for people who don’t want to participate in the census. I don’t think it would reassure them to do so, only because you can still get re-identified, right? Even with differentially private data, there’s still a risk of re-identification; it’s not like it’s totally gone. So you still can’t even say that. There’s still some uncertainty in there, and do you want to bring that up with users? Because then nobody will participate if that’s still out there. And so I don’t really think that the general user is ever going to understand what this means for them. I also think that, as much as the Bureau pushes back on this, the statistics captured in the decennial census are just not that sensitive or private, right? Maybe your household structure is sensitive or private, but you could buy a credit bureau dataset that has my income in it, which I’ve never reported to the census, and you’ll learn a lot more about me from that credit report, including my address, than you’ll ever get out of the decennial census.
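
For reference, the “e to the epsilon” David mentions comes from the formal definition of differential privacy. A randomized mechanism $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D$ and $D'$ differing in one person’s record, and every set of possible outputs $S$,

$$\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].$$

So with $\varepsilon = 1$, anything an observer concludes from the release can become at most $e^{1} \approx 2.72$ times more likely because your record was included, which is precisely the kind of statement that is hard to translate for a census respondent.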

Joe: Right. But I guess this is in part the spirit of it: a public institution shouldn’t encourage or endorse… shouldn’t make that easy, right? That’s sort of what’s going on here.

David: Yeah. But I just don’t think that the general public would understand the privacy protections of differential privacy any more than they’d understand the limitations of swapping, right? The other thing I wanted to go back to is the NCSL. The total population count has always been invariant at the block level, up until today. So maybe the characteristics of the people living in that block were different, but the headcount was the headcount, and this is the first time that’s not really the case. I know they do imputation of whole households, but, you know, it’s relatively rare. This is really the first time that we’re getting perturbation of the counts, and I think that’s where the NCSL and other groups are really worried. This is the first time we’re seeing kind of a proactive infusion of noise into the data that we’ve just never had to deal with in the past.

Joe: And you touched on the guarantees of differential privacy and whether they actually hold up practically. And this is where the notion of relative privacy butts heads with absolute privacy. Now, as the tools and technology at the adversary’s disposal grow with time, and they will grow rapidly, I suppose, as well as gray-market sources of confidential data, such as, as you mentioned, credit bureau data, it is possible that some of this discussion may even be moot. And perhaps differential privacy in the census, despite the guarantees it has, will just be made obsolete by other readily available sources of unsecured data. Now, thinking back to the privacy-accuracy tradeoff, and imagining this worst-case scenario, do you think it would be better for the Bureau to just forego differential privacy in this situation and at the very least deliver a perfectly accurate product? Or do you think this goes against some of the spirit of what they’re doing?

David: I mean, I don’t think they should just release the perfectly accurate enumeration with all characteristics. I think they need to do something to protect privacy. In terms of absolute privacy, right, they’re not going to go publish names or addresses, anything like that.

Joe: Even if such information were already easily accessible elsewhere.

David: Right, right. Like, maybe they can still do some other types of disclosure control, more traditional kinds of disclosure control, to provide that. But I would say, right, if you can go to the gray market or the black market and get these data, why do you need the census, right? If I’m an adversary, and I have to, you know, solve the reconstruction problem, but I can just go spend 10 grand on a data set, I’ll probably just spend the 10 grand on the data set and get it there, and not try to jump through all the hoops that I have to for the census data.

Joe: Right. And here we’re sort of thinking of the adversary as a private entity, or maybe a foreign government or something. But in reality, many of the abuses of census data in our country’s history were perpetrated by the state itself, such as, you know, when wartime powers were used to identify draft dodgers during World War I, or when census data was used to help intern Japanese Americans during World War II. Now, laws, including the Constitution itself, and people in office, and the wider political sensibilities can and do change over time, and there’s no guarantee here that the relevant confidentiality laws will stay in place forever. Especially in this country, at least, we’re seeing that the state is getting more and more, sort of, in cahoots with some of these tech companies, who are compiling larger and larger sources of data ripe for deanonymization, perhaps even better than the census. So what might stop a future tyrannical government from just overturning all the safeguards we have in place today, making differential privacy obsolete in a whole other way? Is this something you think about with regards to the census, or that people concerned with privacy think about in terms of the long-term future?

David: I think people concerned with privacy certainly think about that as a really bad situation, right? Like if you had a tyrannical government come in and just be like, we’re gonna use these data for our own nefarious purposes to target people. I think privacy advocates think a lot about that as a possible bad outcome. But I would say, not to dismiss what’s happened in the past, we actually have relatively few examples of the state, you know, weaponizing these data against its people, and we do have a lot of safeguards. I think there are a lot of safeguards in place, and civil servants, you know, have stood up against the misuse of government data. Just because a tyrannical government wants to do something doesn’t necessarily mean the state will be able to do it. I would also say that, right, differential privacy is for the public data. A tyrannical government comes in, they can just get the private data and do whatever they want with it. Differential privacy doesn’t protect us from a tyrannical government misusing or weaponizing the data against us, because they can just say, give it to us; we want the raw data, and we’ll use it for bad purposes.

Joe: But a real manifestation sort of approaching this scenario was the recent fiasco where the Trump administration tried to pose a citizenship question on the 2020 census. His Commerce Department tried to include this question on citizenship or immigration status, only to eventually be struck down by the Supreme Court. So do you feel like this is an example where we’re approaching this scenario, or was that more of a demonstration that the laws and political system we have in place are strong enough to overcome it?

David: Yeah, I think that’s more of an example of our institutions, right? The adversarial structure of government, right: you’ve got the courts, you’ve got people who were standing up to push back on that as a misuse of power, essentially. And I think that’s a great example of how the US government’s form can stop some of these actions from happening. And I think that’s a good thing; that’s a positive outcome in this case.

Joe: And I guess even this whole made-up scenario is sort of a faraway and abstract boogeyman when we’re talking about privacy, whereas the concerns of census stakeholders are a very real and urgent matter. So is that where you feel the trade-off is coming about here, in terms of whose practical needs…

David: Yeah, right. I feel as if the practical needs of the broad sweep of people who use the data are kind of being trampled in favor of this kind of amorphous attacker, people who want to get you, this big, bad boogeyman out there that we need to worry about. And the people who sit down at their computer every day and use the data to try to help their communities, we can’t give you as good data, because we’re worried about this amorphous thing over here. I think that has shifted too far in that direction and away from the needs of the local users.

Joe: Right. And so if things do sway in favor of, I guess, the privacy advocates, do you think demographers and other users of the census will eventually have to turn to alternative products for their purposes, such as the ACS, or the American Community Survey, which is the other major Bureau product, collecting long-form information on income, employment, and ancestry, or even commercial products, potentially?

David: I think that’s definitely possible. I think you will see well-resourced agencies, those that can afford it, either purchasing data or developing their own estimates or projections. Under-resourced entities that just don’t have the, you know, time or money to do that will be hamstrung; they won’t be able to. And so we’re building up this inequality within the country based on what you have access to. The other thing that’s going to be tricky is, right, what do you benchmark all those data against? Typically, you’d use the decennial census as your reference data, and if that is uncertain, there’ll be a lot of uncertainty in all your data, because you don’t really know what’s true. And so I think the ACS is another option, but the Bureau plans to use differential privacy for the ACS too, and we don’t really know what that looks like yet. Because it’s a survey, though, I think there’s going to be less noise injection required to protect it, and so I can see people really moving to that as a source. But a lot of entities will probably try to do their own thing, and there’s going to be so much heterogeneity, right? New York State is going to throw a ton of money at creating really good data for their state; Wyoming is not going to bother with it. And so you’re really going to get this inequality in the data infrastructure in the US.

Joe: So the data infrastructure might in a sense, become less centralized…

David: Yeah, yeah, absolutely. Right. And you can see these states, right, going to the utility agencies and to the county parcel and property tax data, and starting to build up their own censuses of their areas. But in states where we don’t have that built up, we’re just not going to be able to do that. And so it’s going to be maybe less fair for the people who live in those places to get the services they need.

Joe: Now, in terms of the law: recently the state of Alabama has filed a lawsuit contesting, among other things, this very privacy protection system used in the census. This case could even potentially be fast-tracked to the Supreme Court. Now, where do you see the ensuing legal conflict over the 2020 census heading in, you know, future months or years? And how do you think the courts will ultimately rule on this?

David: I have no idea on the courts. I think the Alabama case is the first step of probably many cases around this particular topic. I think we’re moving into that question of, you know, what is privacy? How do you interpret it? I think that the court case almost dissociates from the technicalities of differential privacy versus swapping, moving from the technical bits to what it means to interpret Title 13 at the Census Bureau. And who knows where the courts will come down, right? Like, do they find that people have a strict right to privacy in the census, or do they not? I don’t think that a technical argument is necessarily going to win the day in the courtroom.

Joe: Right, but ultimately it will be a matter of the wider consciousness gaining more education and statistical literacy when it comes to this.

David: Yeah. Yeah.

Joe: Yeah. So the last thing I wanted to talk about is something I found in the literature. The issue you’re describing with the census privacy methods is, in my view, somehow one of fairness or equitability when it comes to the allocation of, say, federal funding for these smaller subpopulations. Now, incidentally, fairness in AI is a very hot topic in the machine learning community right now. And interestingly, there have been a couple of very recent works studying algorithms which can both enforce differential privacy and also treat minorities, in, say, race or gender, fairly in any of the decisions they make. The standard example given here is that you don’t want a computer algorithm to predict vastly different rates of recidivism in former convicts for different races. Some of these papers show, in fact, that in some very simple settings it’s actually theoretically impossible to ensure both privacy and fairness. And I feel like that might be related to what’s going on here with the census and the issues you’re raising, in that there’s a fairness-versus-privacy trade-off that’s also apparent, in addition to the accuracy-versus-privacy trade-off. Do you think this should somehow be incorporated into the census design? And maybe the Bureau isn’t considering the right knobs, and there are hidden factors at play?

David: Yeah, yeah, I hadn’t thought of it that way, but I do like that framing. One of the things that I’ve talked about in the past is that I really feel like differential privacy is not fair to really small subpopulations, where the decennial census is the only reliable source of statistics we have for those groups. And I understand that those are also the groups most at risk of being re-identified, because they’re small. But from the standpoint of equitability, of those communities getting what they deserve based on their size and their counts: boy, if we’re losing on accuracy, and we’re losing on equitability, is differential privacy really the right thing to do, if you’re losing on two of the three things? And I think the fairness piece comes up in that we have to have all these conversations, right, about what we privilege (as loaded as that word is) in the data, versus what’s less important. And those conversations are more philosophical than anything; they’re not technical. But all the arguments we’ve had are focused on the technical side, right? How do you turn knob A to get a good outcome versus turning knob B to get a better outcome? That’s what we’ve focused on. What we haven’t focused on is: is it fair? Is it equitable? And I like the idea; I think the AI example is really good. I haven’t been paying tons of attention to it, but I’ve been seeing a lot of cases where the training data are biased, and so the algorithms are biased because the training data are biased. And if we’re creating differentially private census data that is, maybe biased isn’t the right word, but saying different things about populations, does that then get reinforced in algorithms down the road?

Joe: Right, right. So much of this field is building algorithms which can enforce equitability across sensitive variables, say race or gender, so that the decisions they make do not deviate greatly. It’s actually very similar to the definition of differential privacy itself…

David: Interesting.

Joe: Yeah. A definition very similar to differential privacy can actually be rephrased as a notion of fairness. So it is interesting.
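
For readers curious about the parallel: the “fairness through awareness” formulation of Dwork, Hardt, Pitassi, Reingold, and Zemel (2012) asks that a randomized classifier $M$ map similar individuals to similar output distributions,

$$D\big(M(x),\, M(y)\big) \;\le\; d(x, y),$$

where $d$ measures how alike two individuals are and $D$ is a distance between distributions over outcomes. Differential privacy is essentially the special case where the “individuals” are whole datasets, $d$ counts the records on which they differ, and $D$ is the multiplicative $e^{\varepsilon}$ bound given earlier.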

David: Yeah, that is interesting. I had not quite put it in those terms. I think that’s a really great way to reframe the question, and it’s just a slightly different way to talk about it, and it probably would bring out a new set of arguments for and against.

Joe: Right. Now, the victim at the core of all these various issues, or these various constraints of privacy, fairness, and accuracy, is ultimately always the minority subpopulation. And of course, different minorities across different variables will prefer different ideals. But in practice, speaking to end users who deal with these communities, do you think there’s a preference for fairness, for proper allocation of, say, federal funding and resources, over, say, fears of being re-identified?

David: Yeah, absolutely. I absolutely think that, you know, a lot of these groups (MALDEF and the Asian American Justice Coalition, for example) think the federal funding and the resources that these groups deserve are really at risk with differential privacy, and they view that as the crucial limitation, while the re-identification part still feels very ambiguous and amorphous. Until you have, like, an example of a person who was targeted because they were re-identified in the census, it’s hard to make that a concrete fear.

Joe: Right. Regardless of which ideal we end up preferring, in the end these minority groups will always be at the heart of these debates.

David: Yeah. Yep.

Joe: So yeah, I mean, that was a really great conversation. Thanks for coming on. Lastly, do you want to thank or shout out anyone, or let people know where to find you online?

David: Yeah, yeah. So my name is Dave Van Riper. I’m the Director of Spatial Analysis at the Minnesota Population Center. My boss, Steve Ruggles, has been one of the other leading critics of differential privacy, and I’m going to give him credit for giving me the space to spend the last almost two years really taking a deep dive and writing and talking about this. You can find me at @dcvanriper on Twitter. And if you go to nhgis.org, we create tabulations from the demonstration data that the Census Bureau puts out, and you can download those and compare the differentially private counts against the published counts from 2010 to see, you know, how different the statistics are. And so we’re working with the Census Bureau to construct those, to help the end user.

Joe: Okay, great. Thank you.