Any data you collect will probably leak. Any data you retain will definitely leak, given enough time.
Both of these statements were once controversial, but today they’re common sense. If Equifax, the CIA, the NSA, the Office of Personnel Management, Facebook and dating sites can’t keep our secrets secret, then neither can your business.
In truth, industry’s old, misplaced confidence in the security of data was always an example of motivated reasoning. Collecting data was so cheap, storing it so easy, and there were so many analysts, investors and hustling grifters exclaiming that “data is the new oil” that it seemed fiscally irresponsible not to collect everything you could and retain it forever.
Who knew how that data could be put to profitable use in the future? It was raining soup, so it was time to fill your boots – even if you couldn’t find a market for soup-in-a-boot today, there was no doubt that such a market would appear in the foreseeable future.
Given such a value proposition, it’s not surprising that the people doing the collecting and the retaining of data talked themselves into the idea that both activities could be undertaken safely.
But of course, they were wrong, and as history has caught up with them – as breach after breach has hit in ever-increasing waves – the rationale has changed. Now, rather than arguing that breaches can be prevented, the story goes that breaches aren’t a big deal: every time there’s a data breach, company spokespeople recite the catechism: “We take our customers’ privacy very seriously. None of the data that leaked was compromising.”
Some of that is “privacy nihilism” – it was all going to leak eventually, so what’s the difference? But there’s a more insidious version of this, which argues that breach data isn’t a problem because bad people can’t do much with it. This isn’t just nihilism; it’s denialism.
Breach apologists argue that the data that leaks isn’t compromising because it’s anonymized, or because key identifiers have been removed from it. This profoundly misunderstands how data is used – and abused.
Re-identification of anonymized data-sets is a hot research topic in computer science today, with researchers building automated tools that piece together disparate data-sets to identify the people in them. For example, you can merge a health authority’s database of anonymized prescribing events (doctor, medicine, date and time) with a breached database of taxi journeys, matching trips to hospitals against prescribing times to infer who is taking antipsychotics, antiretrovirals or cancer therapeutics.
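To make that concrete, here’s a minimal sketch of such a linkage attack, assuming Python and pandas; the data-sets, names, timestamps and the matching window are all invented for illustration:

```python
# Sketch of the linkage attack described above, using two invented
# data-sets: an "anonymized" prescribing log and a breached ride log.
# Every name, column and timestamp here is hypothetical.
import pandas as pd

prescriptions = pd.DataFrame({
    "hospital": ["St. Mary's", "St. Mary's", "City General"],
    "written_at": pd.to_datetime([
        "2019-03-01 10:05", "2019-03-01 14:40", "2019-03-02 09:15"]),
    "medicine": ["antipsychotic", "antiretroviral", "cancer therapeutic"],
})

rides = pd.DataFrame({
    "passenger": ["Alice", "Bob", "Carol"],
    "dropoff": ["St. Mary's", "St. Mary's", "City General"],
    "dropped_at": pd.to_datetime([
        "2019-03-01 09:58", "2019-03-01 14:31", "2019-03-02 09:02"]),
})

# Join on the hospital, then keep rides that arrived shortly before
# each prescription was written: the overlap re-identifies patients.
merged = prescriptions.merge(rides, left_on="hospital", right_on="dropoff")
gap = merged["written_at"] - merged["dropped_at"]
hits = merged[gap.between(pd.Timedelta(0), pd.Timedelta(minutes=20))]
print(hits[["passenger", "medicine"]])
```

Neither data-set names a patient on its own; the join does. That is the whole trick, and it gets more powerful with every additional data-set in circulation.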
Many data-protection vendors have promised that they can inject noise into data-sets to prevent re-identification, but those promises rarely survive contact with security researchers who evaluate their claims.
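To see what’s being promised, here’s a toy sketch of the standard noise-injection idea, in the style of differential privacy; the count, the privacy budget and the sensitivity are all assumptions, and the final comment shows one way the promise breaks down:

```python
# Toy sketch of the noise-injection pitch: Laplace noise added to a
# count query, in the style of differential privacy. The data,
# epsilon and sensitivity are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
true_count = 42      # e.g. patients prescribed a given medicine
epsilon = 0.5        # privacy budget: smaller means noisier answers
sensitivity = 1      # one person changes the count by at most 1

noisy = true_count + rng.laplace(0, sensitivity / epsilon)
print(round(noisy))  # plausibly wrong, which is the point

# The catch: averaging many releases of the same statistic washes
# the noise back out, one reason such promises rarely survive scrutiny.
repeats = true_count + rng.laplace(0, sensitivity / epsilon, size=1_000)
print(round(repeats.mean(), 2))  # converges back toward 42
```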
It’s been years since the first significant theoretical work on re-identification was published, and things keep getting worse for those who insist that anonymization is possible.
Re-identification methods also tell us a lot about how digital criminals operate: with incredible frugality and resourcefulness.
Like our ancestors haunted by the Depression of the 1930s, identity thieves never throw anything away, and they find ways to use every leftover scrap to make something new.
Usernames and passwords can be recycled in credential-stuffing attacks that let thieves break into Ring and Nest security cameras, order takeout, or track and immobilize entire fleets of corporate vehicles. Breached identities can be used to flood regulatory proceedings with plausible-seeming fake comments, or to create armies of fake Twitter accounts.
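As a defensive counterpart, it takes only a few lines to check whether a password already circulates in breach corpora, using the Pwned Passwords range API; a minimal sketch follows (the example password is illustrative, and checks like this are how sensible signup forms reject already-breached credentials):

```python
# Minimal sketch: check whether a password already circulates in
# breach corpora via the Pwned Passwords range API. Only the first
# five hex characters of the SHA-1 hash leave your machine
# (k-anonymity); the example password is illustrative.
import hashlib
import urllib.request

def times_breached(password: str) -> int:
    digest = hashlib.sha1(password.encode()).hexdigest().upper()
    prefix, suffix = digest[:5], digest[5:]
    url = f"https://api.pwnedpasswords.com/range/{prefix}"
    with urllib.request.urlopen(url) as resp:
        # Each response line is "SUFFIX:COUNT" for hashes sharing the prefix.
        for line in resp.read().decode().splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
    return 0

print(times_breached("hunter2"))  # a famously breached password
```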
Criminals operate by combining and recombining data-sets, using one company’s breach in combination with a public data source, and a third company’s anonymous data release to wreak incredible havoc. They might even get enough data fragments to fraudulently obtain a duplicate deed for your house and sell it to someone else while you’re on holiday.
Never mind that no one can point to a specific piece of data you’re liable to lose control over someday and say, “That, that’s the data-point that will cost someone their house, or let their stalker find them or expose their retirement savings to thieves.”
It’s similarly true that no one can point to a specific droplet of dioxin in a factory’s illegal effluent pipe and say, “That, that is the carcinogen that will kill a young mother of three, some five miles downstream of the pipe.” That doesn’t stop us from making companies that poison the water or the air pay the price.
The harms from breaches are stochastic (i.e., randomly determined), not deterministic.
We can’t know for sure which data will do which harm, but we know that harm is inevitable and it gets worse the bigger the breach is.
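A back-of-the-envelope calculation shows why. If each leaked record independently carried some tiny chance p of enabling a serious harm – a pure assumption, for illustration – the probability that a breach harms at least one person would be 1 - (1 - p)^n, which climbs toward certainty as n grows:

```python
# Purely illustrative arithmetic: if each leaked record independently
# carries a tiny probability p of enabling a serious harm, the chance
# that a breach harms at least one person is 1 - (1 - p)**n. The
# value of p is an assumption, not a measurement.
p = 1e-6
for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} records -> P(at least one harm) = {1 - (1 - p) ** n:.3f}")
# Prints roughly 0.010, 0.632 and 1.000: small breaches gamble with
# a few lives, big ones guarantee victims.
```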
So far, remedies for those who have been injured by breaches have been severely limited, but they’re getting stiffer. Home Depot’s 2014 breach cost it just $0.34 per customer in direct compensation. But that was then: breached Yahoo! customers may be compensated $100 each, Facebook has just been hit with a $5B fine, and the party’s only getting started.
The harms from breaches are cumulative: like toxic waste in nature, breaches build up in the information environment, and they are effectively immortal in their potential for damage. As the public – and the law – come to grips with this, we’re likely to see greater and greater remedies for those whose data has been released into the wild (forever).
Remember, breaches affect everyone alike – all political persuasions, rich and poor, including the governing classes and lawmakers themselves.
Inevitably, we will see the framework for breach remedies transformed to look more like the remedies for other probabilistic harms, such as environmental harms.
When that happens, it might be too late for you: the data you’re warehousing today might already have been exfiltrated from your network without you even knowing it – until one of your customers finds out the hard way that you’ve compromised them, and seeks legal remedies.
Your insurer isn’t going to write policies for you – or errors and omissions policies for your board – if you’re warehousing all this digital toxic waste in leaky digital barrels, not once the penalties for losing control over it start to turn into real money.
Maybe you could still justify all that risk if the profits from all that data were commensurate with it. But as researchers keep discovering, the benefits of data are wildly oversold: the efficacy of ad targeting based on users’ behavior is almost identical to that of targeting based on the content of the pages where the ads appear, which requires no user data at all.
But if you’re an ad-tech company or a Big Tech platform like Facebook or Google, the mystique around data’s power to convert customers lets you sell your product at a massive premium, while intimidating potential competitors who assume they can never get started because they’ll never amass as much data as the companies already in the space.
The people who claim data is the new oil are people who are selling the data, and the claims they make about the ways that this data lets you do amazing things are sales literature, not peer-reviewed studies.
Data was never the new oil. It was always the new toxic waste: pluripotent, immortal – and impossible to contain. You don’t want to be making more of it, and you definitely should be getting rid of the supply you’ve so unwisely stockpiled so far.
Data minimization isn’t just good practice; it’s good business. Collect as little data as you can, and keep it as briefly as you can. If your privacy policy fits on the back of a napkin – because you’re collecting almost nothing and processing it only for specific purposes, and then deleting it forever – you’re on the right track!
This article reflects the opinions of the author.