“Anonymized” data really isn’t—and here’s why not – Ars Technica

You have zero privacy anyway. Get over it.

So spoke Scott McNealy more than a decade ago. At the time he made this statement, he received a fair amount of criticism. Turns out, he might very well have had a point, though perhaps for reasons he might not have foreseen.

A recent paper highlights the issue of the “reidentification” or “deanonymization” of anonymized personal information. However, the issue goes beyond anonymized information to the very heart of how one should define personal information that is or should be protected under privacy legislation.

“Anonymized” data really isn’t—and here’s why not – Ars Technica.

Canadian privacy legislation simply defines personal information as “information about an identifiable individual” (excluding certain information about someone in their capacity as an employee). However, what does “about an identifiable individual” mean? Does it mean that the person collecting the particular nugget of information can associate it with a person’s identity? Or, perhaps more disconcertingly, does it include data that could not be linked with a particular individual on its own (or even in conjunction with all the other information collected by a given organization), but that has the potential to be associated with someone once it is analyzed together with information available from other sources?
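To make that second scenario a bit more concrete, here is a minimal sketch of a “linkage” style reidentification. Everything in it is made up for illustration: the records, the names, and the choice of postal prefix, birth year and sex as the linking fields are my assumptions, not anything drawn from the paper. The idea is simply that a record with no name attached can still point to one person if its combination of attributes is unique in some other, public list.

```python
# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"postal": "M5V", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postal": "M5V", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"postal": "K1A", "birth_year": 1975, "sex": "F", "diagnosis": "migraine"},
]

# Hypothetical public list (say, a directory) that does include names.
public = [
    {"name": "Alice Example", "postal": "M5V", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "postal": "M5V", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "postal": "K1A", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "postal": "K1A", "birth_year": 1975, "sex": "F"},
]

# Assumed linking fields; none of them identifies anyone on its own.
LINK_FIELDS = ("postal", "birth_year", "sex")


def link_key(record):
    """Build the combination of attributes used to join the two lists."""
    return tuple(record[field] for field in LINK_FIELDS)


def reidentify(anonymized_records, public_records):
    """Return {name: diagnosis} for records that match exactly one person."""
    names_by_key = {}
    for person in public_records:
        names_by_key.setdefault(link_key(person), []).append(person["name"])

    matches = {}
    for record in anonymized_records:
        candidates = names_by_key.get(link_key(record), [])
        if len(candidates) == 1:  # unique combination: identity recovered
            matches[candidates[0]] = record["diagnosis"]
    return matches


result = reidentify(anonymized, public)
print(result)
```

In this toy data the first two records each match exactly one name, while the third matches two people and so stays ambiguous. Whether that third record is still “about an identifiable individual” is precisely the sort of question the definition leaves open.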

google legal?

Recently came across the news (now somewhat dated) that Google has now incorporated some full-text legal decisions from the US into Google Scholar.

From the Official Google Blog:

Starting today, we’re enabling people everywhere to find and read full text legal opinions from U.S. federal and state district, appellate and supreme courts using Google Scholar. You can find these opinions by searching for cases (like Planned Parenthood v. Casey), or by topics (like desegregation) or other queries that you are interested in. For example, go to Google Scholar, click on the “Legal opinions and journals” radio button, and try the query separate but equal. Your search results will include links to cases familiar to many of us in the U.S. such as Plessy v. Ferguson and Brown v. Board of Education, which explore the acceptability of “separate but equal” facilities for citizens at two different points in the history of the U.S. But your results will also include opinions from cases that you might be less familiar with, but which have played an important role.

Perhaps not surprisingly, the announcement seems to target the general public, rather than lawyers, as the primary audience. In fact, in a recent ABA Journal article, Google’s representative even suggested that Google Scholar wouldn’t be of much value to lawyers:

Google, meanwhile, is not trying to compete with the likes of West, LexisNexis, Bloomberg, Fastcase or any other commercial legal research company, says lawyer Rick Klau, a project manager at Google who helped build the Scholar database.

“There is no attempt to slay anyone here,” Klau says. “Google’s mission is to organize the world’s information and make it useful. This was a collection of content that was not accessible and well-organized.” He says Google Scholar was designed to make the information accessible for ordinary citizens. The company has no current plans to do more with the information than what is already available.

Google’s database allows users to search its content against any words, concepts or citations and will pull up opinions related to the searcher’s query. The results are ranked by relevance. Citations in the opinions are hyperlinked to other opinions. The results also provide links to other Google databases, such as books and law reviews, to help searchers get context.

But Google Scholar does not provide any sort of system to check the validity of the case, nor does it offer any type of taxonomy of the case.

Klau goes so far as to question the value of Google Scholar to practicing lawyers: “The two primary for-pay services provide tremendous value to their users and help you better understand and consume information, like whether an opinion is still valid. Those are things that practitioners rely on and will continue to rely on.”

Despite Klau’s protestations, others in the legal information sector are watching Google. “You are always very conscious of what Google is doing because the company has immense resources available,” says Warwick of Thomson Reuters.

That same article also describes how LexisNexis and Westlaw, the two Microsofts of the legal information industry, will be implementing sweeping changes in their services. I imagine those changes were prompted less by Google’s foray into the legal information industry and more by the entrance of Bloomberg into the market, and the desire to capture a greater share of what seems to be a shrinking market.

In any event, Google isn’t really reinventing anything here but rather making it a bit more convenient to access and use – apparently all of this material had previously been available on various court and other web sites. Google’s value add was to consolidate it all and make it easier to search and use.

Too bad. It would have been interesting to see Google shake things up a bit in the legal information industry (or for that matter the information industry more generally). Then again, you never know…

the (not so) long arm of the tax authorities

The recent case involving the Canada Revenue Agency and eBay took an interesting (and perhaps somewhat ironic) twist on access to information. Without getting into too much detail, the essence of the issue was this: CRA wanted eBay Canada to cough up information on folks known as “Power Sellers” – those that sell a lot of stuff on eBay. Presumably so that CRA could helpfully remind those folks of their tax obligations in the unfortunate event they somehow forgot to report all the income they made in Canada by selling on eBay.

eBay Canada’s response was that the legal entity in Canada did not in fact own that information and it was also not stored in Canada. Rather, the information was owned by some of its affiliates and stored in the US, outside of Canadian jurisdiction. So they couldn’t provide the information, they asserted.

Unfortunately (for eBay) it came out that eBay Canada was able to access the information even though it didn’t own the data. In fact, it had to be able to access that information in order to run its business. So the court ruled in favour of the CRA, with this rather cogent analysis:

The issue as to the reach of section 231.2 when information, though stored electronically outside Canada, is available to and used by those in Canada, must be approached from the point of view of the realities of today’s world. Such information cannot truly be said to “reside” only in one place or be “owned” by only one person. The reality is that the information is readily and instantaneously available to those within the group of eBay entities in a variety of places. It is irrelevant where the electronically-stored information is located or who as among those entities, if any, by agreement or otherwise asserts “ownership” of the information. It is “both here and there” to use the words of Justice Binnie in Society of Composers, Authors and Music Publishers of Canada v. Canadian Ass’n of Internet Providers, [2004] 2 S.C.R. 427 at paragraph 59. It is instructive to review his reasons, for the Court, at paragraphs 57 to 63 in dealing with whether jurisdiction may be exercised in Canada respecting certain Internet communications, including an important reference to Libman v. The Queen, [1985] 2 SCR 178 and the concept of a “real and substantial link”.

The implications in this case are relatively clear. In other cases, they may become less so. For example, what happens with this concept when someone who once stored their docs on their local hard drive starts using Google Docs, only to find out that the authorities in whatever far-flung jurisdiction have ordered an affiliate of Google to disclose that information? Or in the near future, when things like Prism get to a point where users aren’t even sure whether their data is here, there, or elsewhere? Interesting times, indeed.

silly lawsuit of the week

OK. Short version of the story in InformationWeek: Woman puts up a website. She puts a “webwrap” agreement at the bottom – i.e. basically a contract that says if you use the site then you agree to the contract. Still some question as to whether such a mechanism is binding, but anyway…

So the Internet Archive of course comes along and indexes her site. Which apparently is a violation of the webwrap. So she sues, representing herself, I believe. The court throws out everything on a preliminary motion by IA except for the breach of contract.

InformationWeek observes that “Her suit asserts that the Internet Archive’s programmatic visitation of her site constitutes acceptance of her terms, despite the obvious inability of a Web crawler to understand those terms and the absence of a robots.txt file to warn crawlers away.” (my emphasis). They then conclude with this statement:

If a notice such as Shell’s is ultimately construed to represent just such a “meaningful opportunity” to an illiterate computer, the opt-out era on the Net may have to change. Sites that rely on automated content gathering like the Internet Archive, not to mention Google, will have to convince publishers to opt in before indexing or otherwise capturing their content. Either that or they’ll have to teach their Web spiders how to read contracts.

(my emphasis).

They already have – sort of. It’s called robots.txt – the thing referred to above. For those of you who haven’t heard of this, it’s a little file that you put on the top level of your site and which is the equivalent of a “no solicitation” sign on your door. It’s been around for at least a decade (probably longer) and most (if not all) search engines

From the Internet Archive’s FAQ:

How can I remove my site’s pages from the Wayback Machine?

The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.

Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our exclusion policy.

You can find exclusion directions at exclude.php. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.
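For those curious how a Web spider “reads” that sign on the door, here is a rough sketch using Python’s standard urllib.robotparser module. The robots.txt contents and the example.com URL are hypothetical, and I am assuming “ia_archiver” is the name the Internet Archive’s crawler answers to, so treat the details as illustrative rather than as how any particular crawler is actually built.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might place at the top level
# of the site. "ia_archiver" is assumed to be the Internet Archive's
# crawler name.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
# Parse the rules from text directly, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

page = "http://example.com/contract.html"  # hypothetical page

# A well-behaved crawler asks this question before fetching anything.
archive_allowed = parser.can_fetch("ia_archiver", page)
other_allowed = parser.can_fetch("SomeOtherBot", page)

print("ia_archiver may fetch:", archive_allowed)
print("SomeOtherBot may fetch:", other_allowed)
```

With these two lines in place, the named crawler is told to stay away from the whole site, while crawlers that are not named remain free to visit. No contract reading required, though of course it only works if the crawler chooses to ask.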

standardized methods of communications – privacy policies, etc. – more. Question is, will people be required to use it, or simply disregard and act dumb?

from the “another security headache” department

Yes, postings have been sparse lately – things getting busy so alas. Anyway, very short (but rather alarming) note from Wired about copiers. Though I knew most copiers now used digital technology of some sort, I had no idea they actually contained full-blown hard drives that store your copies. The exact reason why they need hard drives to copy documents, and why the data needs to remain on the drives, is a bit of a mystery to me, and something the article doesn’t go into. I’d always just assumed that the image information was stored somewhere temporarily and disappeared when you finished copying. Apparently not. Anyway, here’s a brief excerpt:

most digital copiers manufactured in the past five years have disk drives – the same kind of data-storage mechanism found in computers – to reproduce documents. As a result, the seemingly innocuous machines that are commonly used to spit out copies of tax returns for millions of Americans can retain the data being scanned.

If the data on the copier’s disk aren’t protected with encryption or an overwrite mechanism, and if someone with malicious motives gets access to the machine, industry experts say sensitive information from original documents could get into the wrong hands.

I guess someone, somewhere, will be selling add-on kits for copiers relatively shortly…

XBRL Is Cool

Just a very short one during my “lunch”. Ever heard of XBRL? It’s short for eXtensible Business Reporting Language – basically an application of XML (which is itself a subset of SGML) for tagging business and financial data. I like to follow developments on it because I think the potential for XBRL to impact a variety of industries (primarily the financial sector) is huge.

To give you an idea, here’s a (rather old) excerpt from a speech that the CIO of the SEC gave at the last XBRL International Conference last May:

I think the agency can be proud of its use of electronic filing and information distribution. But we can aim higher. Today, the vast majority of EDGAR documents are filed in ASCII text, and another large fraction in HTML. That’s fine for reading about a company’s strategy and general issues, but if you want to do financial analysis or compare accounting policies between companies, you then have to do a lot of printing, searching, data entry, text parsing, and other mechanical work. Or, you can go to a third-party data provider, who can provide you with a database of financial information — but the data provider will have made a number of assumptions to simplify and standardize the financial information, and it may no longer be consistent with how the company intended to present its financials. And you won’t get any of the valuable information from the footnotes.

Since you’re at this conference, I know you can all envision the attractive alternative posed by XBRL and interactive data, so I won’t belabor the point. The potential benefits are persuasive enough — greater transparency of financial information, reduced costs for investors and analysts, potentially even deeper coverage of midcap companies by analysts, and ultimately more efficient markets.

Let me paint what I think is an interesting scenario. Wall Street types have been talking for a couple of years about algorithmic trading — basically, using computers to process real-time streams of market data and making fast, automated trading decisions. Today, that market data is mostly about stock prices and volumes, since that’s what’s available in real time. But at some point in the not-distant future, I envision a hedge fund starting to algorithmically trade with XBRL-based balance sheet and P&L data in real-time as it’s disclosed by companies. At that point, we will all know that interactive data has won the day.

Imagine that. And that’s just the tip of the iceberg. The number of tools that one can create to digest, compile, report and analyze numbers is limited only by one’s imagination. I can also imagine the potential impact that this could have on data vendors who charge quite a bit to provide archived financial information – often in rather archaic forms.
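To give a flavour of how little code such a tool might need, here is a minimal sketch that pulls two tagged numbers out of a tiny XBRL-style document and computes a ratio. The company, the figures and the “ex” taxonomy namespace are all invented for illustration, and the sample is heavily simplified and not validated against the XBRL specification. Real filings use published taxonomies and are far larger.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy namespace; real filings use published taxonomies.
TAXONOMY = "http://example.com/taxonomy"

# A tiny, simplified XBRL-style instance with made-up figures.
INSTANCE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
    xmlns:ex="http://example.com/taxonomy">
  <xbrli:context id="FY2006">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">ACME</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2006-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="USD"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <ex:Assets contextRef="FY2006" unitRef="USD" decimals="0">1000</ex:Assets>
  <ex:Liabilities contextRef="FY2006" unitRef="USD" decimals="0">400</ex:Liabilities>
</xbrli:xbrl>"""


def read_facts(xml_text, context_id):
    """Collect numeric facts from the taxonomy namespace for one context."""
    root = ET.fromstring(xml_text)
    prefix = "{" + TAXONOMY + "}"
    facts = {}
    for element in root:
        in_taxonomy = element.tag.startswith(prefix)
        if in_taxonomy and element.get("contextRef") == context_id:
            name = element.tag[len(prefix):]  # strip the namespace part
            facts[name] = float(element.text)
    return facts


facts = read_facts(INSTANCE, "FY2006")
ratio = facts["Liabilities"] / facts["Assets"]
print(facts, ratio)
```

The point is that the numbers arrive already labelled, so there is no printing, re-keying or text parsing of the kind the speech complains about. The ratio here is just one arbitrary example of what could be computed once the facts are in hand.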

Surprisingly, I’ve not heard of many companies or startups that are working on new products (particularly on the software front) either to help in generating XBRL, translating information into XBRL, or crunching XBRL reports (though admittedly, I haven’t been following it that closely).

Anyway, if you’re in this space, and you haven’t yet looked into XBRL, you should certainly consider doing so.

Wikiality – Part III

Bit of an elaboration on a previous post on the use of Wikipedia in judgements. I cited part of a New York Times article, which had in turn quoted from a letter to the editor from Professor Kenneth Ryesky. The portion cited by the NYT article suggested that Ryesky was quite opposed to the idea, which wasn’t really the case. He was kind enough to exchange some thoughts via e-mail:

In his New York Times article of 29 January 2007, Noam Cohen quoted a sentence (the last sentence) from my Letter to the Editor published in the New York Law Journal on 18 January 2007. You obviously read Mr. Cohen’s article, but it is not clear whether you read the original Letter to the Editor from which the sentence was quoted.

Which exemplifies the point that Wikipedia, for all of its usefulness, is not a primary source of information, and therefore should be used with great care in the judicial process, just as Mr. Cohen’s article was not a primary source of information.

Contrary to the impression you may have gotten from Mr. Cohen’s New York Times article of 29 January, I am not per se against the use of Wikipedia. For the record, I myself have occasion to make use of it in my research (though I almost always go and find the primary sources to which Wikipedia directs me), and find it to be a valuable tool. But in research, as in any other activity, one must use the appropriate tool for the job; using a sledge hammer to tighten a little screw on the motherboard of my computer just won’t work.

Wikipedia and its equivalents present challenges to the legal system. I am quite confident that, after some trial and error, the legal system will acclimate itself to Wikipedia, just as it has to other text and information media innovations over the past quarter-century.

Needless to say, quite a different tone than the excerpt in the NYT article. Thanks for the clarification, Professor Ryesky.

Virtual Diplomacy

Short one as it’s getting late. Interesting piece on how Sweden is setting up an embassy in Second Life. As most of you know, Second Life is an MMORPG – a virtual world of sorts where people can control computer-generated images of people.

That being said, somewhat less exciting than at first blush, as the new virtual Swedish embassy will only provide information on visas, immigration, etc. Perhaps not surprising – I mean, it’s not like you should be able to get a real-world passport through the use of your virtual character. Nor, God forbid, do I hope they’re introducing the bureaucracy of passports to travel through virtual countries….

Wikiality – Part II

There was some traffic on the ULC E-Comm Listserv (on which I surreptitiously lurk – and if you don’t know what it is and are interested in e-commerce law, highly recommended) about courts citing Wikipedia with a couple of links to some other stuff, including an article on Slaw as well as an article in the New York Times about the concerns raised by some regarding court decisions citing Wikipedia. Some excerpts and notes to expand on my previous post:

From the con side:

In a recent letter to The New York Law Journal, Kenneth H. Ryesky, a tax lawyer who teaches at Queens College and Yeshiva University, took exception to the practice, writing that “citation of an inherently unstable source such as Wikipedia can undermine the foundation not only of the judicial opinion in which Wikipedia is cited, but of the future briefs and judicial opinions which in turn use that judicial opinion as authority.”

This raises a good point that I didn’t mention in my previous post. I certainly think Wikipedia is fine to note certain things, but I really, definitely, positively, do not think that it should be cited as judicial authority. In my previous article I thought this was so self-evident I didn’t bother mentioning, but the quote above illustrates that it might not be all that clear. Court decisions, as most of you know, are written by judges who take into account the facts and apply the law to the facts of the case, along with other facts and information that may have a bearing on the case. The source of the law includes statutes and of course previously decided cases, which enunciate rules or principles that the court either applies, distinguishes based on the facts as being inapplicable, or, in some cases, overturns (for any number of reasons). Court decisions are not, of course, published on Wikipedia and are not subject to the collective editing process of Wikipedia, nor should they be. Rather, references to Wikipedia in court cases are to provide additional or ancillary context or facts to a case. They do not and should not derogate from principles of law that are set forth in court decisions. But, contrary to what Mr. Ryesky, Esq., indicates above, I don’t think referring to Wikipedia for context or facts will suddenly undermine the foundations of law, since the legal reasoning itself still will and must be based on sources of law, not facts and not context.

Hence the following end to the NYT article:

Stephen Gillers, a professor at New York University Law School, saw this as crucial: “The most critical fact is public acceptance, including the litigants,” he said. “A judge should not use Wikipedia when the public is not prepared to accept it as authority.”

For now, Professor Gillers said, Wikipedia is best used for “soft facts” that are not central to the reasoning of a decision. All of which leads to the question, if a fact isn’t central to a judge’s ruling, why include it?

“Because you want your opinion to be readable,” said Professor Gillers. “You want to apply context. Judges will try to set the stage. There are background facts. You don’t have to include them. They are not determinative. But they help the reader appreciate the context.”

He added, “The higher the court the more you want to do it. Why do judges cite Shakespeare or Kafka?”


Pretexting, Canadian Style

From one of my very smart colleagues at the firm – a recent Canadian case involving “pretexting”-like activity à la HP.

The short story: A company hires an investigator to see what some former employees are up to, since they’ve started a competing business. Based on what they find out, they sue the employees. In discovery (in rough terms, the process through which each party gets to look at the information that the other side has supporting their case), the employees find out that the investigator has obtained their phone records and also has recorded them on video at their business premises, in both cases without their consent and without a court order.

Sound somewhat familiar?

So the employees countersue the company and the investigator. It turns out that the company wasn’t aware of the methods used by the investigator and so is let off the hook, but the action against the investigator is given the green light.

Whether or not the claim of the employees will succeed remains to be seen. In the meantime, folks thinking of using investigators, for whatever purpose, would be wise to give serious consideration to the nature of information that they want to collect.