Wednesday, April 27, 2011

... And the Pursuit of Ha-PII-ness

Criminal law does not define in print every type of gun or object that might kill a person. It talks about the act itself, and evaluates intent and the surrounding circumstances to decide whether something is murder, homicide, manslaughter, self-defense, etc., whether so-called "malice aforethought" was involved, and so on. Can you imagine someone arguing that running over another person with a "Killdozer" is not a crime since it's not on the predefined weapons list?
Privacy law should be treated in the same manner. Currently, there are lists of explicitly defined Personally Identifiable Information, or PII. PII refers to information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual. 
One of my criticisms of existing privacy legislation concerns cases where information types are explicitly defined (e.g. a person’s name, telephone number, etc.). Since some of these types of information are very dynamic and fleeting as technology advances, it becomes easy to bypass the explicitly listed data types by using ones that are not listed.

Here are four categories of workarounds, or types of information not covered by explicit rules:
  • Temporary Identifier Workarounds: Like the saying "The 'temporary' becomes the 'permanent'," it is often possible to use "temporary" identifiers to uncover a permanent one, or to link them together into a persistent chain of temporary identifiers. There is the famous case of Internet IP addresses not being on the PII list. Often, some second temporary identifier stays constant across IP address changes and makes it possible to link the old and new IP addresses together - like a website cookie or a site's internal username identifier.
  • New and changing identifier types: Is your Facebook ID PII? More and more so. And you should be very wary of apps and referrers who leak it. By the time legislators add it to the PII list, Facebook might not even be that relevant (depending on who you ask...). And what about your Android ID (a unique identifier based on the specific handset and your Google account) or the iPhone's unique device ID (UDID)? It's hard to keep up with new services and their proprietary IDs.
  • "Joined Identifiers" (for lack of a better name): This is when pieces of information that seem innocuous when considered independently generate a not-so-innocuous identifier when combined. For example, the set of properties that your browser advertises could be used to "fingerprint" your computer, like this, or this (seriously, try it). There's also work like Sweeney's on how simple demographics like gender, date of birth, and zip code could be used to uniquely identify a large percentage of the US population. This data-combo essentially becomes PII, and should be treated with the same respect.
  • "Inferred Identifiers": Big Data also means big data-mining and inference. Some of it can be very revealing about a person's identity, and can even be used for de-anonymizing users. Golle and Partridge show how people could be uniquely identified if the approximate locations of an individual's home and workplace can both be deduced from a location trace, combined with public US Census data. Narayanan and Shmatikov show how the supposedly anonymized Netflix Prize dataset could be de-anonymized. These are just a few examples.
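To make the "joined identifiers" point concrete, here is a minimal sketch that measures how many records in a dataset become unique once several innocuous-looking fields are combined. The function name `unique_fraction` and the toy records are my own invention for illustration; this is not Sweeney's method or data, just the same idea in miniature.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Return the fraction of records whose combined `fields` values appear exactly once."""
    if not records:
        return 0.0
    # Build one combined key per record from the chosen fields.
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)

# Made-up toy records, for illustration only.
people = [
    {"gender": "F", "dob": "1980-01-01", "zip": "10001"},
    {"gender": "F", "dob": "1980-01-01", "zip": "10002"},
    {"gender": "M", "dob": "1975-06-15", "zip": "10001"},
    {"gender": "M", "dob": "1975-06-15", "zip": "10001"},
    {"gender": "F", "dob": "1990-03-09", "zip": "10001"},
]

for fields in (["gender"], ["gender", "dob"], ["gender", "dob", "zip"]):
    print(fields, unique_fraction(people, fields))
```

In this toy set nobody is unique by gender alone, but each added field singles out more people. A policy test could look at the combination rather than at each field on its own.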
We need laws and policies that are able to advance as fast as new technology, or even faster - we need them flexible enough to deal with developments before they happen. I believe we need to take a step back from specific, explicit identifiers, define our policies in more robust terms, and focus more on use and intent than on the identifier itself.
That might be where updated definitions for the Fair Information Practice Principles (FIPPs) come into play. The US Department of Commerce's "Green Paper" on privacy from December 2010 talks about a dynamic privacy framework, but not exactly in the sense that I am trying to get at.
I wish we had a way to define our laws, or at least the test of adherence to them, like a modular computer program. There are inputs, and an evaluation "function" that takes the inputs and spits out a result: whether or not something is PII under a predefined context. A parameter that is considered PII in one context might not be considered PII in a different context. We (or "THEY") should be able to swap out the inputs and evaluation functions as new ones come into play.
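Here is a rough sketch of what I mean, with swappable evaluation functions. Everything in it is hypothetical: the class `PIIPolicy`, the evaluator names, and especially the rules themselves, which are placeholders and not a statement of what any law says.

```python
from typing import Callable, Dict, Iterable

# An evaluator takes the data fields and a context label and says "PII or not".
Evaluator = Callable[[Dict[str, object], str], bool]

def has_direct_identifier(data: Dict[str, object], context: str) -> bool:
    """Classic explicit list: a name or phone number is PII in any context."""
    return any(field in data for field in ("name", "phone"))

def has_demographic_combo(data: Dict[str, object], context: str) -> bool:
    """Joined identifier: gender + date of birth + zip code taken together."""
    return {"gender", "dob", "zip"} <= data.keys()

def device_id_in_ad_context(data: Dict[str, object], context: str) -> bool:
    """Context-dependent placeholder rule: a device ID counts only for advertising."""
    return context == "advertising" and "device_id" in data

class PIIPolicy:
    """Holds a swappable list of evaluators; data is PII if any evaluator says so."""

    def __init__(self, evaluators: Iterable[Evaluator]) -> None:
        self.evaluators = list(evaluators)

    def is_pii(self, data: Dict[str, object], context: str) -> bool:
        return any(evaluate(data, context) for evaluate in self.evaluators)

policy = PIIPolicy([has_direct_identifier, device_id_in_ad_context])
print(policy.is_pii({"device_id": "abc123"}, "advertising"))   # True
print(policy.is_pii({"device_id": "abc123"}, "crash_report"))  # False

# A new kind of identifier comes along: plug in a new evaluator, leave the rest alone.
policy.evaluators.append(has_demographic_combo)
print(policy.is_pii({"gender": "F", "dob": "1980-01-01", "zip": "10001"}, "survey"))  # True
```

The point of the design is that the test ("is this PII here?") stays fixed while the list of evaluators can grow or change as technology does.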
Who knows, perhaps PII could/should be defined not in a discrete way, but in a statistical manner. For example, even with relatively accurate GPS coordinates of a person’s home in Manhattan, it would still be very hard to extract the person's identity, due to the population density in the area. However, GPS coordinates of a person living in a rural agricultural area might indicate very accurately who the person is, as they might point to the only house in the area. In that case, the accurate GPS coordinates should not be used, but they could be reduced in resolution to cover a much larger radius of confidence, one that includes a predefined minimum population, so that the information is no longer considered PII. We would of course want to generalize this to more than just GPS input.
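A minimal sketch of the resolution-reduction idea follows, under heavy simplifying assumptions: the function `coarsen` and the resident coordinates are made up, the "area" is a grid cell obtained by rounding decimal places rather than a true radius, and the population count comes from a toy list instead of real census data.

```python
def coarsen(lat, lon, residents, k, max_decimals=4):
    """Round (lat, lon) to fewer and fewer decimals until at least k residents share the cell.

    Returns (cell, decimals), or None if even whole-degree cells hold fewer than k residents.
    """
    for decimals in range(max_decimals, -1, -1):
        cell = (round(lat, decimals), round(lon, decimals))
        # Count the residents whose rounded coordinates land in the same cell.
        count = sum(
            1
            for r_lat, r_lon in residents
            if (round(r_lat, decimals), round(r_lon, decimals)) == cell
        )
        if count >= k:
            return cell, decimals
    return None

# Made-up coordinates: three close neighbors in a dense area, three scattered rural homes.
residents = [
    (40.75011, -73.99011), (40.75012, -73.99012), (40.75013, -73.99013),
    (44.1234, -100.5678), (44.14, -100.58), (44.3, -100.9),
]

print(coarsen(40.75011, -73.99011, residents, k=3))  # dense: fine resolution survives
print(coarsen(44.1234, -100.5678, residents, k=3))   # rural: must be coarsened a lot
```

With these toy numbers the dense location can keep four decimal places, while the rural one has to be rounded all the way to whole degrees before three residents share its cell.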
Anyways, before I wade even deeper into meta-legislation and meta-nerdiness territory, here's the bottom line: The concept of Personally Identifiable Information is still very important, but should be radically redefined in order to remain relevant, together with an update of the principles of Fair Information Practices. It's good for the ecosystem going forward, it's good for end-users, and it's going to be good for businesses who play fair.

Further reading: 
- Bill proposal: Personal Data Privacy and Security Act of 2009 (I especially like the list of those who oppose)

