Tuesday May 31st 2016

AOL, Netflix, and the end of open access to research data


Correction: The authors of the Netflix de-anonymization study contacted me to point out that they originally published a draft of their results a mere two weeks after Netflix released its dataset. Netflix has known about their study for over a year.

Over the past year, there have been a number of high-profile incidents in which sensitive user data was accidentally revealed to the Internet at large. As a result, I believe that high-tech companies will never again share anonymized data on their users with academic researchers, at least not without requiring contracts and nondisclosure agreements. For the users and privacy advocates, this is probably a good thing. However, for researchers, the scientific community, and Internet users who want cool new technologies, this is almost certainly a change for the worse.

In 2006, Netflix released over 100 million movie ratings made by 500,000 subscribers to its online DVD rental service. The company then offered $1 million to anyone who could improve its DVD recommendation system. In order to protect its customers’ privacy, Netflix anonymized the data set by removing any personal details.

Researchers announced this week that they were able to de-anonymize the data by comparing the Netflix ratings against publicly available ratings on the Internet Movie Database (IMDb). Whoops.

For Internet privacy geeks, this Netflix incident is just another version of an all-too-familiar tale: A well-meaning company releases a large set of user data, which it has scrubbed to remove any identifying information. Armed with this data set, researchers are able to work backward and match names to the profiles and their online behavior.
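To make the idea concrete, here is a toy sketch of how this kind of matching can work. It is not the researchers' actual algorithm, which is considerably more sophisticated. It only assumes that an "anonymous" profile and a public, named profile can be compared by counting the movies they rate in common with similar scores. Every name, title, rating, and threshold below is made up for illustration.

```python
# Toy linkage attack. All IDs, names, titles, and ratings are invented,
# and the scoring rule is a deliberate simplification, not the method
# used in the Netflix study.

ANONYMIZED = {
    "user_1001": {"Movie A": 5, "Movie B": 2, "Movie C": 4, "Movie D": 1},
    "user_1002": {"Movie A": 3, "Movie E": 4},
}

PUBLIC = {
    "alice": {"Movie A": 5, "Movie B": 2, "Movie C": 5},
    "bob": {"Movie B": 5, "Movie F": 3},
}


def overlap_score(anon_ratings, public_ratings, tolerance=1):
    """Count movies rated in both profiles whose ratings differ by at most `tolerance`."""
    shared = anon_ratings.keys() & public_ratings.keys()
    return sum(
        1 for movie in shared
        if abs(anon_ratings[movie] - public_ratings[movie]) <= tolerance
    )


def link_profiles(anonymized, public, min_score=3):
    """Map each anonymous ID to a public name when the best match is strong and unique."""
    links = {}
    for anon_id, anon_ratings in anonymized.items():
        scores = sorted(
            ((overlap_score(anon_ratings, pub), name) for name, pub in public.items()),
            reverse=True,
        )
        if not scores:
            continue
        best_score, best_name = scores[0]
        runner_up = scores[1][0] if len(scores) > 1 else 0
        # Only claim a match if it clears the threshold and beats every other candidate.
        if best_score >= min_score and best_score > runner_up:
            links[anon_id] = best_name
    return links


if __name__ == "__main__":
    print(link_profiles(ANONYMIZED, PUBLIC))
```

Even in this crude form, a few overlapping ratings are enough to single out one profile, while a profile with little overlap stays unmatched. Removing names does not help when the behavior itself is the fingerprint.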

The same thing happened back in 2006 when AOL released the search records of roughly 650,000 of its users. Within days of the database’s release, journalists from the New York Times had revealed the identity of user number 4417749 to be Thelma Arnold, a 62-year-old widow from Lilburn, Ga. Over 300 of the woman’s searches were traced back to her, ranging from “60 single men” to “dog that urinates on everything.”

The fallout from the AOL incident was devastating, both for the company and the industry as a whole. The company’s CTO resigned, and the researcher who released the data and his supervisor were fired. In addition to pulling the data set, AOL took the entire Web presence of its research division offline. More than a year later, the AOL Research group still does not have a working homepage.

The shockwaves spread to the entire search engine industry. Google’s CEO Eric Schmidt spoke to journalists shortly after AOL posted the data. After calling the data release “a terrible thing,” he assured the public that “this kind of thing could not happen at Google.”

The end result was that no search engine would ever again release anonymized log data to the research community.

