Wednesday, April 27, 2011

... And the Pursuit of Ha-PII-ness


Criminal law does not define in print every type of gun or object that might kill a person. It talks about the act itself, and evaluates intent and the surrounding circumstances to decide if something is a murder, homicide, manslaughter, self-defense, etc., whether there was so-called "malice aforethought" involved, and so on. Can you imagine someone arguing that running over another person with a "Killdozer" is not a crime since it's not on the predefined weapons list?
Privacy law should be treated in the same manner. Currently, there are lists of explicitly defined Personally Identifiable Information, or PII. PII refers to information that can be used to uniquely identify, contact, or locate a single person or can be used with other sources to uniquely identify a single individual. 
One of my criticisms with regard to existing privacy legislation is related to cases where information types were explicitly defined (e.g. a person’s name, telephone number, etc.). Since some of these types of information are very dynamic and fleeting as technology advances, it becomes easy to bypass the explicitly stated data types by using ones that are not explicitly stated.

Here are four categories of workarounds, or types of information not covered by explicit rules:
  • Temporary Identifier Workarounds: Like the saying "The 'temporary' becomes the 'permanent'," many times it is possible to use "temporary" identifiers to uncover a permanent one, or link them together to create a persistent chain of temporary identifiers. There is the famous case of Internet IP addresses not being on the PII list. Many times, there is some second temporary identifier that remains constant between IP address changes and makes it possible to link the old and new IP addresses together, like a website cookie or a site's internal username identifier.
  • New and changing identifier types: Is your Facebook ID PII? More and more so. And you should be very wary of apps and referrers who leak it. By the time legislators add it to the PII list, Facebook might not even be that relevant (depending on who you ask...). And what about your Android ID (unique identifier based on the specific handset and your Google account) or the iPhone's unique device ID (UDID)? It's hard to keep up with new services and their proprietary IDs.
  • "Joined Identifiers" (for lack of a better name): This is when pieces of information that seem innocuous when considered independently generate a not-so-innocuous identifier when combined. For example, a set of properties that your browser advertises and that could be used to "fingerprint" your computer, like this, or this (seriously, try it). There's also work like Sweeney's on how simple demographics like gender, date of birth, and zip code could be used to uniquely identify a large percentage of the US population. This data-combo essentially becomes PII, and should be treated with the same respect.
  • "Inferred Identifiers": Big Data also means big data-mining and inference. Some of it can be very revealing about a person's identity, and can even be used for de-anonymizing users. Golle and Partridge show how people could be uniquely identified if the approximate locations of an individual's home and workplace can both be deduced from a location trace, combined with public US Census data. Narayanan and Shmatikov show how the supposedly anonymized Netflix Prize dataset could be de-anonymized. These are just a few examples.
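To make the "Joined Identifiers" point a bit more concrete, here is a minimal sketch of how one might measure it: count how many records become unique once a few innocuous-looking fields are combined. The function name, the field names, and the toy records are all made up for illustration; this is not Sweeney's method or data, just the basic counting idea.

```python
from collections import Counter

def unique_fraction(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    if not records:
        return 0.0
    # Build one combined key per record, e.g. ("F", "1980-01-02", "02139").
    keys = [tuple(record[f] for f in fields) for record in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1) / len(records)

# Made-up toy records, purely illustrative (not real people or real data).
people = [
    {"gender": "F", "dob": "1980-01-02", "zip": "02139"},
    {"gender": "F", "dob": "1980-01-02", "zip": "02140"},
    {"gender": "M", "dob": "1975-06-30", "zip": "02139"},
    {"gender": "M", "dob": "1975-06-30", "zip": "02139"},
]

# Gender alone singles out nobody in this toy set...
print(unique_fraction(people, ["gender"]))
# ...but the three fields together single out half of it.
print(unique_fraction(people, ["gender", "dob", "zip"]))
```

No single field is an "identifier" here; the combination is what does the identifying, which is exactly why a fixed list of field types misses it.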
We need laws and policies that are able to advance as fast as new technology, or even faster - we need them flexible enough to deal with developments before they happen.  I believe we need to take a step back from specific and explicit identifiers, and define our policies in more robust terms, and increase focus on use and intent rather than the identifier itself. 
That might be where updated definitions for Fair Information Practices Principles (FIPPs) come into play. The US Department of Commerce's "Green Paper" on privacy from December 2010 talks about a dynamic privacy framework, but not exactly in the same sense that I am trying to get at.
*** READER ADVISORY WARNING *** NERDY RANT APPROACHING ***READER ADVISORY WARNING *** NERDY RANT APPROACHING *** READER ADVISORY WARNING *** NERDY RANT APPROACHING *** READER
I wish we had a way to define our laws like a modular computer program, or particularly the test of adherence to laws. There are inputs, and an evaluation "function" that takes the inputs and spits out a result: whether or not something is PII in a predefined context. A parameter that is considered PII in one context might not be considered PII in a different context. We (or "THEY") should be able to swap out the inputs and evaluation functions as new ones come into play, to evaluate PII.
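For the fellow nerds, a rough sketch of what I mean, under heavy assumptions: the class name, the context names, and the two example rules below are all invented for illustration and do not reflect any actual law or policy. The only point is the shape: inputs go in, a swappable per-context function decides.

```python
from typing import Any, Callable, Dict, Mapping

# An evaluator takes the available inputs and says whether they amount to PII.
Evaluator = Callable[[Mapping[str, Any]], bool]

class PiiPolicy:
    """Hypothetical registry mapping a context name to its evaluation function."""

    def __init__(self) -> None:
        self._evaluators: Dict[str, Evaluator] = {}

    def register(self, context: str, evaluator: Evaluator) -> None:
        # Registering again for the same context swaps out the old function.
        self._evaluators[context] = evaluator

    def is_pii(self, context: str, inputs: Mapping[str, Any]) -> bool:
        # Raises KeyError for a context nobody has defined a rule for.
        return self._evaluators[context](inputs)

# Illustrative rule: an IP address is treated as PII once a cookie can link it.
def ip_plus_cookie(inputs: Mapping[str, Any]) -> bool:
    return "ip_address" in inputs and "cookie_id" in inputs

# Illustrative rule: aggregate data is PII only when the group is tiny.
def small_group(inputs: Mapping[str, Any]) -> bool:
    return int(inputs.get("group_size", 0)) < 10

policy = PiiPolicy()
policy.register("ad_tracking", ip_plus_cookie)
policy.register("aggregate_report", small_group)

print(policy.is_pii("ad_tracking", {"ip_address": "203.0.113.7"}))
print(policy.is_pii("ad_tracking", {"ip_address": "203.0.113.7", "cookie_id": "abc"}))
print(policy.is_pii("aggregate_report", {"group_size": 5000}))
```

The same input (an IP address) gets a different answer depending on what else is available and on the context, and a new rule can replace an old one without touching the rest.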
Who knows, perhaps PII could/should be defined not in a discrete way, but in a statistical manner. For example, even relatively accurate GPS coordinates of a person’s home in Manhattan would still make it very hard to extract the identity of the person, due to the population density in the area. However, GPS coordinates of a person living in a rural agricultural area might indicate very accurately who the person is, as they might point to the only house in the area. In that case, the accurate GPS coordinates should not be used, but they could be reduced in resolution to cover a much larger radius of confidence, one that includes a predefined minimum number of people, so that the information is no longer considered PII. We would of course want to generalize this to more than just GPS input.
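Here is a small sketch of that GPS idea, assuming we somehow have a table of nearby homes and how many people live in each. The function names, the doubling strategy, the thresholds, and the toy coordinates are my own assumptions for illustration; a real system would need real census-style data and a lot more care.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (mean Earth radius 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def min_radius_km(lat, lon, homes, min_people, start_km=0.1, max_km=100.0):
    """Double the radius until the circle around (lat, lon) covers `min_people`.

    `homes` is a list of (lat, lon, residents) tuples. Returns the radius in km
    at which the location could be reported, or None if even `max_km` is too small.
    """
    radius = start_km
    while radius <= max_km:
        covered = sum(
            residents
            for home_lat, home_lon, residents in homes
            if haversine_km(lat, lon, home_lat, home_lon) <= radius
        )
        if covered >= min_people:
            return radius
        radius *= 2
    return None

# Made-up toy data: one dense block, and one lone farmhouse with a village ~11 km away.
dense_area = [(10.0, 10.0, 500)]
rural_area = [(0.0, 0.0, 4), (0.0, 0.1, 200)]

print(min_radius_km(10.0, 10.0, dense_area, min_people=100))  # stays at the starting radius
print(min_radius_km(0.0, 0.0, rural_area, min_people=100))    # has to grow past the village
```

The dense location can be reported almost as-is, while the farmhouse has to be blurred out to a radius that reaches the neighboring village before it stops pointing at one household.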
*** END READER ADVISORY WARNING *** END READER ADVISORY WARNING *** END READER ADVISORY 
Anyways, before I wade even deeper into meta-legislation and meta-nerdiness territory, here's the bottom line: The concept of Personally Identifiable Information is still very important, but should be radically redefined in order to remain relevant, together with an update of the principles of Fair Information Practices. It's good for the ecosystem going forward, it's good for end-users, and it's going to be good for businesses that play fair.

Further reading: 
- Bill proposal: Personal Data Privacy and Security Act of 2009 (I especially like the list of those who oppose)

Wednesday, April 20, 2011

The World is Flast

Yesterday was the day of the Boston Marathon, the world's oldest annual marathon.
Last Friday, as I was returning home from Logan airport, the shuttle busses and subway were already filled with marathon visitors, comparing stats, qualification times, and some other strange terms that I wasn't really familiar with, like "running", whatever that means...


Apparently, the marathon itself is not the only race in this story. What I wanted to focus on is not the marathon day, but the events surrounding this year's marathon registration back in the fall. The 2011 marathon sold out in a record 8 hours and 3 minutes, compared to the 2010 marathon, which sold out in 65 days, also a record at the time. Many long-time Boston Marathon runners, and many others who prepared for this marathon for months, even years, and met the required qualifications in preliminary competitions, were not able to register. Some complained about the fact that thousands of non-qualifying runners who signed up to promote various charities and fundraising efforts were taking spots that should have been reserved for the competitive runners. The organizers of the marathon even raised the bar for the qualifying scores of next year's marathon. However, I am not sure that the issue is with the Boston Marathon's organization. This is part of a much bigger trend.

Google's developer conference, Google I/O, sold out in 59 minutes. In 2009 it took 90 days to sell out, and in 2010 it took 50 days. I am happy to be one of the lucky ones going this year, but I also feel bad for so many developers who wanted to go. Here too, some of the disappointed ones pointed out that many registrants are doing it just because of the cool gifts and gadgets Google has been known to give out, while pushing serious developers aside. But again, I think it's more than that.

Apple's WWDC 2011 sold out in 10 hours, nearly 20 times faster than in 2010 (which took 8 days to sell out), about 70 times faster than 2009 (one month), and nearly 150 times faster than the 2008 conference (two months). Burning Man Festival's first three tiers of tickets were also sold out in record time, and there's even talk of reaching the cap on the total number of available tickets. Conan O'Brien's 42-show tour sold out in several hours. Charlie Sheen's tour sold out in 18 minutes.

The world is not just getting flat. It's getting fast, super fast. Like a good restaurant in New York City, once the word is out on a lucrative event, everybody gets in line. Information that used to spread in closed circles of the hard-core fans and advocates is now instantly trending on Twitter and the other social sites, and the race is on. Scalpers and secondary-market sellers are part of the game too. A Google I/O Academia ticket, originally priced at $450, sold for over $2000 on eBay, and many others for over $1000.

This whole thing is starting to remind me of high-speed trading on Wall Street. Firms placing their computers closer to the stock exchange to gain a few extra microseconds on a transaction. People modifying the TCP/IP stack to make their algorithm act fractions of a second faster. Already the tech-savvy (and persistent) have an advantage: when Google I/O's servers crashed mid-transaction, some of them were able to rescue their unique registration key from the browser session that crashed, and re-enter it in the URL field to continue their registration even after the official site had already announced it was sold out. Are we going to start seeing software like eBay's automatic bidding apps (e.g. EZsniper) pop up for concerts and conferences? Will people start setting up their servers close to Ticketmaster headquarters?


The world is not just flat. The world is flast. And getting flaster.



From the desk of the Procrastinating Perfectionist

For years I have been accumulating heaps of notes and thoughts about everything ranging from my work and research topics, opinions on various issues, short prose, and even some ad campaigns and slogans for businesses that should never see the light of day (on that note, if anyone ever wants to open a strip club close to MIT campus, I got a full branding packet for "M.I.Tits". Call me.) These notes are in various forms of existence - from one-line idea sketches to fully baked essays, and for a long time I've been thinking of letting them loose somewhere. 

However, there had always been a long chain of prerequisites that I tied myself up with. I can't open a blog before I have the perfect theme for it. And I need to research what's the best blogging platform, and I probably want to set it up myself on my own server so I have the most flexibility with it, and I need to update my homepage first (it's currently about 3 years out of date, and the new homepage I've been working on is far from complete and already more than one year out of date...), and I need to buy a domain name, and so on. Setting up my online presence has always been something I do in my spare time. If you actually know me in person you would probably ask - "what spare time?", to which I answer - Exactly.

I've decided enough is enough. Time to focus on what actually matters, and the rest will follow. 
So here goes. In this blog I will try to put up some new reflections as well as some older content that hasn't yet seen the light of day, and I'll generally try to use this blog as a place for arranging my different thoughts and ideas.

This is me.
This is my blog.
Thisinformed.

Now let's see how long I can keep this up.