2010 Prize Winner: Paul Ohm

Dieter Meurer Prize Lecture

Paul Ohm

16 September 2010

When Dr. Herberger informed me earlier this year that I had been awarded the Dieter Meurer Prize, I reacted with a fair amount of surprise and healthy suspicion. As a scholar toiling away in relative obscurity at a small public law school in the foothills of the Rocky Mountains in the United States, I could not fathom that my work had been noticed in Germany, much less deemed worthy of an award.

But as I learned more about the Association for Computing in the Judiciary and about juris, my sense of suspicion faded. This is because the goals of these worthy and venerable institutions overlap so much and so well with my research agenda. Like all of you, I am trying to build bridges between law and computer science; together we are refusing to be satisfied with ossified and outmoded legal technologies and approaches; together we are pointing the way to a new, better approach to law and judging, one which embraces advances in technology.

The more I learned about the Association and, in particular, the work of Professor Herberger and the late Professor Meurer, the more my initial sense of doubt turned into immense gratitude. I am also familiar with the amazing work of many of the past recipients of this award, and I am honored to be listed with them. So before I begin my substantive remarks, I wanted to thank Professor Herberger, the Association for Computing in the Judiciary, and juris for this honor. I am very grateful. I also wanted to specifically say thank you to Sabine Micka, who helped organize my travel.

When I am not writing articles that are also working computer programs, I specialize in information privacy law. Recently, I published an article in an American law journal entitled “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization.”

The central contribution of this Article is to incorporate a new and exciting trend in computer science research into law and policy for the first time. In this sense, the Article fits very well with the goals of this conference.

The Article disrupts the central organizing principle of information privacy law, and because not every person in the room is an information privacy expert, I ask the experts to indulge me as I explain. Information privacy law relies heavily on the idea of anonymization, the term used to describe techniques that protect the privacy of people described in databases through deletion and filtration. For example, we delete names, government ID numbers, birth dates, and home addresses from databases, motivated by the idea that we can analyze the data left behind without compromising the identities of the people in the data. Anonymization provides the best of both worlds, promising both privacy and utility with one simple and inexpensive database operation.
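Because every example in this talk turns on what deletion does and does not remove, here is a minimal sketch of anonymization by deletion, assuming a pandas DataFrame whose column names (all hypothetical, chosen only for illustration) mark the direct identifiers:

```python
# Minimal sketch of anonymization by deletion: drop the columns commonly
# treated as direct identifiers and release whatever remains for analysis.
# The column names below are hypothetical, for illustration only.
import pandas as pd

DIRECT_IDENTIFIERS = ["name", "government_id", "birth_date", "home_address"]

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of the data with the direct-identifier columns removed."""
    return df.drop(columns=DIRECT_IDENTIFIERS, errors="ignore")
```

The appeal is plain: one cheap operation, and everything left over is presumed safe to share.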

Anonymization has been trusted not only by database owners but also by governments; legislators and regulators worldwide have created laws and rules which reward the use of anonymization. In fact, I claim that every single privacy law ever written rewards anonymization in some way, large or small, express or implied. In many cases, anonymization provides a perfect safe harbor: anonymize your data and the law no longer applies at all.

This brings us to the central tenet of information privacy law, the one currently under attack.

In the United States, we refer to this as the concept of “Personally Identifiable Information,” or PII.

In the European Data Protection Directive, we encounter it through definitions of the term “personal data.”

The idea, no matter what its name, is that we can protect privacy by categorizing our data. We inspect our databases to separate information that identifies a person from information that does not. We approach this task almost like a biologist trying to classify different species of bird, or worm, or mushroom. Just as a mushroom scientist tries to divide the set of all mushrooms into poisonous and non-poisonous, so too does his information privacy counterpart try to divide the set of data into dangerous and benign.

This was a successful state of affairs for many decades. Thanks to the power of anonymization, our policymakers could rely on categorization, PII, and personal data to strike a balance, one which guaranteed privacy while leaving room for businesses and researchers to use data to better the world and grow the economy.

[Pause]

Unfortunately, the central premise upon which all of this rests — the power of anonymization — has been attacked in the past decade.

Computer scientists have repeatedly demonstrated the surprising power of reidentification. By analyzing the data that anonymization leaves behind, these researchers have shown that with only a bit of outside information and a bit of computational power, we can restore identity in anonymized data.

Let me give you only two examples:

First, in 1995, a graduate student named Latanya Sweeney analyzed a specific trio of information — a person's birth date, U.S. postal code, and sex. She chose these three because many anonymized databases contained them, which was understandable since most people thought they could be left behind without compromising identity. The intuitions of many — including most experts — suggested we shared our birth date, postal code, and sex in common with many other people; we thought we could hide in a cloud of the many people who shared this information.

Dr. Sweeney proved these intuitions wrong. By analyzing U.S. census data, she determined that 87% of Americans were uniquely identified by these three pieces of information. What once was recognized as anonymized was suddenly rejected as identifying. Today, many American laws reflect Dr. Sweeney's work by requiring the deletion of these three categories of information.
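Sweeney's measurement is straightforward to restate as code. Here is a sketch, assuming a pandas DataFrame of census-style records with hypothetical column names, that computes the fraction of people who are unique on the combination of birth date, postal code, and sex:

```python
import pandas as pd

# Sweeney-style uniqueness measurement: group records by the quasi-identifier
# trio and count how many people fall in a group of exactly one.
QUASI_IDENTIFIERS = ["birth_date", "postal_code", "sex"]  # hypothetical names

def fraction_unique(df: pd.DataFrame) -> float:
    """Fraction of rows uniquely identified by the quasi-identifier trio."""
    group_sizes = df.groupby(QUASI_IDENTIFIERS).size()
    return (group_sizes == 1).sum() / len(df)
```

A value near 1.0 means the supposedly harmless leftover columns act, in combination, as a near-perfect identifier; on the census data Sweeney studied, the answer was 0.87.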

But if regulators thought that Dr. Sweeney had proved that there was something unusual or special about those three pieces of information, they have learned quite recently how wrong they were. There are many other types of data that share this ability to identify. In fact, some have suggested that every piece of useful information about a person can be used to identify, if it is connected with the right piece of outside information.

As only one example, consider an American company, Netflix, which rents movies on DVD delivered through the postal mail. On the Netflix website, users rate the movies they have seen, to help Netflix suggest other movies they might enjoy. In an experiment in Internet collaboration — one, I should add, that has been celebrated for its many non-privacy-related contributions — Netflix released one hundred million records revealing how half a million customers had rated movies, but only after first anonymizing the data to protect identity.

A mere two weeks after the data release, two researchers named Arvind Narayanan and Vitaly Shmatikov announced a surprising result: the movies we watch act like fingerprints. If you know only a little about a Netflix customer's movie-watching habits, you have a good chance of discovering his or her identity.

For example, the researchers discovered that if you know six somewhat obscure movies a person watched, and nothing else, you can identify 84% of the people in the database. If you know six movies they watched and the approximate dates on which they rated them, you can identify 99% of the people.
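As a toy illustration of this fingerprint idea, and emphatically not the researchers' actual algorithm, the sketch below scans a Netflix-style release for subscribers whose histories are consistent with every known clue; the record layout and the two-week date tolerance are assumptions:

```python
from collections import defaultdict
from datetime import timedelta

def match_candidates(records, clues, date_slack=timedelta(days=14)):
    """records: iterable of (subscriber_id, movie, rating_date) tuples from
    the released data. clues: known (movie, approximate_date) pairs about
    the target person. Returns the subscriber ids consistent with every clue."""
    histories = defaultdict(dict)
    for subscriber_id, movie, rated_on in records:
        histories[subscriber_id][movie] = rated_on  # remember each rating date

    def consistent(history):
        # A candidate must have rated every clued movie near the clued date.
        return all(
            movie in history and abs(history[movie] - approx) <= date_slack
            for movie, approx in clues
        )

    return [s for s, history in histories.items() if consistent(history)]
```

Per the numbers above, six obscure clues with approximate dates shrink the candidate list to a single subscriber 99% of the time.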

The lesson? When Netflix customers are asked at a dinner party to list their six favorite obscure movies, they cannot answer unless they want every person at the table to be able to look up every movie they have ever rated with Netflix.

But more seriously, what is the broader lesson?

If we continue to embrace the old PII/personal data approach to protecting privacy, we will end up with worthless laws, because these categories will continue to expand with each new advance in reidentification. For example, the American health privacy law, HIPAA, lists eighteen categories of information a health provider can delete to fall outside the law. Should American regulators expand this list to contain movie ratings? Of course not; this would miss the point entirely.

Unfortunately, in 15 minutes, the best I can do is share the depressing and bleak part of the story. Time doesn't allow me to share my solutions in detail, except to say one thing: nothing we do to replace our old laws will share the power and ease of solutions based on anonymization. Preserving privacy will become even more difficult than it is today. Beyond this note, I refer you to the paper to see the entire story.

This concludes my substantive remarks. Please let me reiterate: I am most grateful to have received this award. I look forward to meeting many of you and learning from many of you throughout the day. Have a wonderful conference, and once again, thank you.