so much for the paperless revolution

Lexology had an interesting story that serves as a really good reminder that sometimes, despite all the great things about modern technology, plain old paper may sometimes be the best way to go.

What happened? Well, to make a long story short, the US Federal Trade Commission inadvertently disclosed a large amount of information that was filed with the FTC that should have remained confidential. To wit:

The mistake made by the FTC was basic. In preparing its brief for filing, FTC staff wrongly assumed that the metadata in its word processing file would not migrate upon direct conversion from native format to portable document format (.pdf). In particular, they wrongly assumed that using Microsoft’s “Highlight” (or “Borders and Shading”) tool to black out text actually removed the text from the file’s contents. It does not. It “covers up” the text, but the text itself remains in the file, fully searchable and available for copying. The resulting .pdf appears at first glance to contain only black boxes in place of the redacted content. That content, however, is present in the .pdf file and can be easily revealed either by copying and pasting the blacked-out text into a word-processing file or an e-mail message or by viewing the .pdf file in a reader such as Preview or Xpdf.

It's one of those stories that makes you want to laugh and cry at the same time. The laughing because it's easy enough to think "What kind of idiot would do that?" – the error was (at least for most readers of this blog) rather obvious. The crying because, if you give it some thought, this could very well happen to even the most technically sophisticated of you – not just with PDFs, but with any number of other forms of digital documents, communications and storage – and in any number of ways. The bottom line is that once things are put into digital form, they are often much harder to get rid of. It's something well worth keeping in mind.

silly lawsuit of the week

OK. Short version of the story in InformationWeek: Woman puts up a website. She puts a “webwrap” agreement at the bottom – i.e. basically a contract that says if you use the site then you agree to the contract. Still some question as to whether such a mechanism is binding, but anyway…

So the Internet Archive of course comes along and indexes her site. Which apparently is a violation of the webwrap. So she sues, representing herself, I believe. The court throws out everything on a preliminary motion by IA except for the breach of contract.

InformationWeek observes that "Her suit asserts that the Internet Archive's programmatic visitation of her site constitutes acceptance of her terms, despite the obvious inability of a Web crawler to understand those terms and the absence of a robots.txt file to warn crawlers away." (my emphasis). They then conclude with this statement:

If a notice such as Shell’s is ultimately construed to represent just such a “meaningful opportunity” to an illiterate computer, the opt-out era on the Net may have to change. Sites that rely on automated content gathering like the Internet Archive, not to mention Google, will have to convince publishers to opt in before indexing or otherwise capturing their content. Either that or they’ll have to teach their Web spiders how to read contracts.

(my emphasis).

They already have – sort of. It's called robots.txt – the thing referred to above. For those of you who haven't heard of it, it's a little file that you put at the top level of your site, and it's the equivalent of a "no solicitation" sign on your door. It's been around for at least a decade (probably longer), and most (if not all) search engine crawlers respect it.
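By way of illustration, a robots.txt that tells crawlers to stay away might look something like this (a sketch; "ia_archiver" is the user-agent the Internet Archive's crawler has historically identified itself as):

```
# Keep every well-behaved crawler out of the entire site
User-agent: *
Disallow: /

# Or single out just the Internet Archive's crawler
User-agent: ia_archiver
Disallow: /
```

It's purely an honor system – the file doesn't enforce anything – but the major crawlers have long treated it as the standard way for a site to opt out.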

From the Internet Archive’s FAQ:

How can I remove my site’s pages from the Wayback Machine?

The Internet Archive is not interested in preserving or offering access to Web sites or other Internet documents of persons who do not want their materials in the collection. By placing a simple robots.txt file on your Web server, you can exclude your site from being crawled as well as exclude any historical pages from the Wayback Machine.

Internet Archive uses the exclusion policy intended for use by both academic and non-academic digital repositories and archivists. See our exclusion policy.

You can find exclusion directions at exclude.php. If you cannot place the robots.txt file, opt not to, or have further questions, email us at info at archive dot org.

I suspect we'll be seeing standardized, machine-readable methods of communications – privacy policies, terms of use, etc. – more and more. The question is, will people be required to use them, or can they simply disregard them and act dumb?