what i thought big data was…

I read with interest an article in RWW entitled Big Data: What Do You Think It Is?

I suppose the term “big data” is akin to “cloud computing” – you hear the term bandied about quite a bit, but often without a clear definition. Unless, that is, you’re talking about legal presentations, in which case you’ll see a zillion definitions of cloud computing trotted out, including, yes, once again, the NIST definition of cloud computing (PDF), amongst various others.

In any event, it was nice to see an article asking the question and (hopefully) providing a clear answer. Perhaps surprisingly (or not), when folks were surveyed on what they thought big data meant, there was no clear consensus:

Harris asked 154 C-level executives from U.S.-based multi-national companies last April a series of questions, one of them being to simply pick the definition of “Big Data” that most closely resembled their own strategies. The results were all over the map. While 28% of respondents agreed with “Massive growth of transaction data” (the notion that data is getting bigger) as most like their own concepts, 24% agreed with “New technologies designed to address the volume, variety, and velocity challenges of big data” (the notion that database systems are getting more complex).  Some 19% agreed with the “requirement to store and archive data for regulatory and compliance,” 18% agreed with the “explosion of new data sources,” while 11% stuck with “Other.”

The author then goes on to attempt to create a generally applicable definition:

Essentially, Big Data tools address the way large quantities of data are stored, accessed and presented for manipulation or analysis.

Perhaps rightly or wrongly, that hasn’t quite been the impression I’ve had when reading articles about big data. I’m not at all suggesting that the proposed definition is incorrect. In fact, perhaps the opposite. That being said, when I have in the past seen the term “big data”, it was almost always used to describe not the technologies used to store, access or present data, but rather primarily (or almost exclusively) the analysis of very large datasets in order to develop new knowledge, ideas or products. Or starting to collect data that had previously not been collected (at least not in easily manipulated digital form) for the purposes of such analysis. For example, to figure out, based on purchasing patterns, that someone is pregnant in order to market baby supplies to them. Or using cell-phone records to detect disease outbreaks, analyzing listening data to figure out how a recording artist becomes a star, analyzing information collected from smart meters to figure out ways to reduce energy consumption, using algorithms to analyze server and device logs to manage IT infrastructure, etc. etc. – see a nice collection of stories in GigaOm.

Of course, all of that necessarily presumes that the technology exists to record and access such large datasets. So that may well be properly considered part of big data, I suppose. Though perhaps not quite as interesting as what you can do with it. At least to me.

“Anonymized” data really isn’t—and here’s why not – Ars Technica

You have zero privacy anyway. Get over it.

So spoke Scott McNealy more than a decade ago. At the time he made this statement, he received a fair amount of criticism. Turns out, he might very well have had a point, though perhaps for reasons he might not have foreseen.

A recent paper highlights the issue of the “reidentification” or “deanonymization” of anonymized personal information. However, the issue goes beyond anonymized information to the very heart of how one should define the personal information that is or should be protected under privacy legislation.

“Anonymized” data really isn’t—and here’s why not – Ars Technica.

Canadian privacy legislation simply defines personal information as “information about an identifiable individual” (excluding certain information about someone in their capacity as an employee). However, what does “about an identifiable individual” mean? Does it mean that the person collecting the particular nugget of information can associate it with a person’s identity? Or, perhaps more disconcertingly, does it include data that could be linked to someone by combining it with information available from other sources, even where that particular bit of information, alone (or even in conjunction with all the other information collected by a given organization), could not be tied to any particular individual?
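To make that linkage concern concrete, here is a minimal sketch of the kind of join that the reidentification research discussed in the Ars Technica piece relies on. Everything here is hypothetical – made-up records, made-up column names – but the mechanics are just an ordinary database join on quasi-identifiers:

```python
import pandas as pd

# Hypothetical "anonymized" dataset: direct identifiers removed, but
# quasi-identifiers (postal code, birth date, sex) left intact.
anonymized = pd.DataFrame({
    "postal_code": ["K1A0B1", "M5V2T6", "K1A0B1"],
    "birth_date":  ["1975-03-02", "1980-11-19", "1962-07-30"],
    "sex":         ["F", "M", "F"],
    "diagnosis":   ["diabetes", "asthma", "hypertension"],
})

# Hypothetical auxiliary dataset with names attached -- e.g. a public
# voter roll or any other list sharing the same quasi-identifiers.
voter_roll = pd.DataFrame({
    "name":        ["A. Smith", "B. Jones"],
    "postal_code": ["K1A0B1", "M5V2T6"],
    "birth_date":  ["1975-03-02", "1980-11-19"],
    "sex":         ["F", "M"],
})

# A plain inner join on the quasi-identifiers re-attaches names to
# records that were never linkable from the "anonymized" data alone.
reidentified = anonymized.merge(
    voter_roll, on=["postal_code", "birth_date", "sex"]
)
print(reidentified[["name", "diagnosis"]])
```

The organization holding only the first dataset may honestly believe it holds nothing “about an identifiable individual” – the identifiability arises only once someone else’s dataset enters the picture.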

when not to use technology

I came across a link to a story about a South African company using homing pigeons to transport data because it was faster than their broadband connection:

Workers will attach a memory card containing the data to the bird’s leg and let nature take its course.

Experts believe the specially-trained 11-month-old pigeon will complete the flight in just 45 minutes – and at a fraction of the cost.

To send four gigabytes of encrypted information takes around six hours on a good day. If we get bad weather and the service goes down then it can take up to two days to get through.

If you’re curious, doing the math on that works out to roughly 1.5 Mbps for the broadband connection and, if a 4GB card is used with the pigeon, just under 12 Mbps for the pigeon.
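In code form, the back-of-the-envelope version (the only assumption being a decimal 4 GB card and the figures quoted in the story):

```python
# Back-of-the-envelope throughput comparison, using figures from the story.
card_bits = 4 * 8 * 10**9      # 4 GB payload, in bits (decimal gigabytes)

broadband_secs = 6 * 3600      # "around six hours on a good day"
pigeon_secs = 45 * 60          # estimated 45-minute flight

print(f"broadband: {card_bits / broadband_secs / 1e6:.2f} Mbps")  # ~1.48
print(f"pigeon:    {card_bits / pigeon_secs / 1e6:.2f} Mbps")     # ~11.85
```

And the pigeon’s number only improves with bigger memory cards, since the flight time stays fixed.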

Of course, such a solution isn’t without risk:

‘With modern computer hacking, we’re confident well-encrypted data attached to a pigeon is as secure as information sent down a phone line anyway.

‘There are other problems, of course. Winston [the pigeon] is vulnerable to the weather and predators such as hawks. Obviously he will have to take his chances, but we’re confident this system can work for us.’

Though the story is amusing, the point it reinforces is, I think, a helpful one – namely, that a particular technology might not necessarily be the best solution to a business problem. It may just be due to the area I work in, but I have seen instances where organizations are so focused on the use of technology (or in some cases a particular type of technology) that they don’t consider alternatives that may achieve their goals better, cheaper or faster.

I’m certainly not advocating the widespread use of PigeonNets, but the story is an amusing example of someone overcoming the law of the golden hammer.

asp issues

Will keep this short – I was reading an article (whose authors will go unnamed) describing some recent trends in software licensing and issues arising from those trends. One trend that was highlighted was the change from licensing of software to be installed and operated by a licensee (with maintenance and support from the licensor) to a vendor-hosted model (or “application service provider” or “ASP” for short), where the vendor instead sets up the software on its own machine and the vendor’s customers then make use of the software remotely – often through a browser, but sometimes through other “thin” clients.

What was the primary issue they identified? To make sure you get acceptance testing. Hmmm. Well, I hate to disagree, but I would think there are a few others that are at least as (if not more) important. So, without further ado, some thoughts on what to keep an eye out for if you are thinking of signing up to an ASP service, in no particular order:

Your Data – Will your ASP be storing your data? Will it be the primary repository of your data? Is your data important? Does your data contain sensitive, confidential or personal information? If so, then you should make sure that your ASP is handling your data appropriately, including giving adequate assurances that it is only used for providing the service (and not anything else) and that appropriate security measures are taken to protect it, such as encrypted communications when sending/receiving as well as encrypted storage (see the sketch after this list). We’ve all read the recent horror stories about certain large corporations that have misplaced, lost, or inadvertently disclosed sensitive data, such as credit card numbers. Make sure it isn’t your company making the headlines.

Service Levels and/or Easy Outs – Addresses the same issue as acceptance testing but in a different way. Typically one big advantage of ASPs is that there is no big upfront licensing fee and therefore no big upfront capital to invest, or risk regarding that capital investment in the event the software doesn’t do what it was expected to do. Thus, the concept of acceptance testing was invented to address this big upfront risk, with the thinking that you get to kick the tires extensively before you hand over the truckload of cash. And if the testing doesn’t pan out, you don’t pay. OTOH, ASPs usually involve a periodic (typically monthly) payment which is much smaller. In effect, the monthly service fee can be thought of as a replacement for: (1) the amortized cost of the initial license fee; (2) maintenance and support; (3) investment in hardware and infrastructure; and (4) additional people costs on the vendor side, to keep (3) up and running. Very often this is a win-win situation, since vendors can often achieve economies of scale by running a large number of instances centrally at one dedicated data centre (ironically, to some extent, harkening back to the days of mainframes + terminals – but I digress) and offer very attractive savings over what it would otherwise cost a customer to maintain the application in-house (a toy comparison appears at the end of this post).

Anyway, the point being that there is less upfront risk with an ASP solution, provided of course, you’re: (a) not locked into a 50-year contract; or (b) you have really good assurances that the software will be up and running as needed when you need it. It’s good to have both, but at the same time it can also be thought of, to some extent, as an either-or proposition – if you can arrange for a month-to-month contract, then if the ASP stinks, just terminate and go elsewhere. Alternatively, if you get ironclad service levels (including significant credits and termination rights) then you might be willing to commit longer. Of course, you’ll also need to ensure that you have the ability, in the case of a month-to-month agreement or termination rights, to move to another service easily, and to get your data back, etc. But I’ll leave that for another time.
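On the data-handling point above, the sketch I promised: one concrete safeguard is to encrypt data on your side before it ever reaches the ASP, so the provider only ever holds ciphertext. A minimal illustration using the Python cryptography package – the upload() call is hypothetical, standing in for whatever API the actual service exposes:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# The key stays with the customer; the ASP never sees it.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b"client list, premiums, renewal dates"
ciphertext = fernet.encrypt(record)

# upload(ciphertext)  -- hypothetical call to the ASP's service;
# only ciphertext leaves your network (ideally over TLS as well).

# On retrieval, decrypt locally with the same key.
assert fernet.decrypt(ciphertext) == record
```

The design point is key custody: if the key never leaves your organization, a breach on the provider’s end exposes ciphertext only, which considerably lowers the stakes (though of course it also limits what processing the ASP can do on your behalf – something to weigh against the service you’re buying).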

Anyway, not necessarily saying that acceptance testing isn’t important (and in fact if you need to spend a ton of money to have the vendor customize a solution for you it may still be very important) but just a couple of other issues to keep in mind.
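And the toy comparison promised earlier, putting the fee-replacement arithmetic in entirely made-up numbers:

```python
# Toy in-house vs. ASP cost comparison over a three-year term.
# All figures are hypothetical, purely to illustrate the trade-off.
years = 3

license_fee = 300_000             # upfront perpetual license
maintenance = 0.20 * license_fee  # 20%/year maintenance & support
hardware    = 60_000              # servers/infrastructure for the term
staff       = 40_000              # per-year share of IT staff time

in_house = license_fee + years * (maintenance + staff) + hardware

asp_monthly = 12_000              # all-in monthly service fee
asp = asp_monthly * 12 * years

print(f"in-house: ${in_house:,.0f}")  # $660,000
print(f"ASP:      ${asp:,.0f}")       # $432,000
```

Real numbers will vary wildly, of course; the point is only that once you enumerate the cost components the monthly fee is replacing, the comparison is a simple sum.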

conversion of data (and not the conversion you’re probably thinking of)

Very interesting piece from Duane Morris on a case in New York. My ultra short summary of the summary: Insurance company leases computer to agent. Agent puts all his business and personal data on it. Insurance company terminates agency, takes back the computer and all data on it, and refuses to give the agent access to any of it. Agent sues, loses, but then wins on appeal.

The interesting part is the basis on which he won, which was a claim under “conversion”. Not necessarily groundbreaking, as other cases have previously dealt with conversion as applied to intangibles, but, as the folks at Duane note:

Under the merger doctrine, a conversion claim will apply to intangible property, such as shares of stock, that are merged or converted into a document, such as a stock certificate. Accordingly, conversion of the certificate may be treated as conversion of the shares of stock represented by the certificate. More recently, the court ruled that a plaintiff could maintain a cause of action for conversion where the defendant infringed the plaintiff’s intangible property right to a musical performance by misappropriating a master recording, a tangible item of property capable of being physically taken.

Thyroff was the Court’s first opportunity to consider whether the common law should permit conversion for intangible property that did not strictly satisfy the merger test. Recognizing that it “is the strength of the common law to respond, albeit cautiously and intelligently, to the demands of common sense justice in an evolving society,” the Court decided that the time had arrived to depart from the strict common-law limitation of conversion.

Interestingly, in their analysis of the decision, they conclude that:

This decision provides a powerful remedy for New York employers to bring a cause of action against employees who steal company information or [intangible] property. Unlike claims for breach of fiduciary duty or misappropriation of trade secrets, conversion may be easier to plead than other claims because it does not require that the employer establish willfulness or wrongful conduct.

Hmmm. Not quite sure I’d agree – after all, the “conversion” itself would still need to be established. Also, I’m not sure that a rogue employee taking a copy of his or her employer’s confidential information, while leaving the original with the employer, would give rise to a cause of action under conversion, which, if I understand the case correctly, has more to do with depriving someone of property that is rightfully theirs. Absconding with confidential information does not deprive the owner of the data itself, but rather of the value the owner can realize by virtue of the fact that it can only be used by that owner. That situation seems somewhat different than the one in Thyroff – the analogy there would be if the insurance company had not denied the agent access to his information, but rather had taken a copy of it and used it in a way it wasn’t supposed to. It would be interesting to see whether the court would extend a claim of conversion to deprivation not of the intangible information itself, but rather of the value of the rights to exploit it exclusively. Alternatively, it may be that the ruling could be read broadly enough to already take that into account.

I also wonder what sort of effect this might have on those who might have otherwise leapt at the opportunity to become an agent for the insurance company…

Belgian Court Slaps Google News

The short story: a Belgian court has ruled that Google must remove headlines and links posted on its news site for which it did not obtain permission to post, based on copyright law.

Rather unfortunate, I think. Sure, there are cases where some links and even partial reproduction should be prohibited, but in the context of what Google was doing it’s difficult to see the harm. In fact, I’m a bit surprised that the content owner would have pursued the claim. Google’s take:

“We believe that Google News is entirely legal,” the company said in a statement. “We only ever show the headlines and a few snippets of text and small thumbnail images. If people want to read the entire story they have to click through to the newspaper’s Web site.”

Google said its service actually does newspapers a favor by driving traffic to their sites.

But the court said Google’s innovations don’t get exemptions from Belgian data storage law.

“We confirm that the activities of Google News, the reproduction and publication of headlines as well as short extracts, and the use of Google’s cache, the publicly available data storage of articles and documents, violate the law on authors’ rights,” the ruling said.

If Google News violates authors’ rights, there will be a lot more that does as well. Tons. It will be interesting to see what happens on appeal as it could have rather far-reaching implications – at least in Belgium.