Info

Anyone familiar with the work of Claude Shannon, his contemporaries and successors will have a good sense of the difference between data and information, but in today’s hectic world of real-time system telemetry, Web statistics and service metrics there is a palpable blurring of the two. I marvel at how some people believe that increasing the volume of measurements or the frequency of sampling is going to provide them with the information they crave. In reality, they are getting more and more data, but not a whole lot of information. We say “here are the figures you requested” and they ask “but what do they tell us?” And there is the essence of the problem: we get more data but fail to listen to the information.

Lost? OK, let me take a step back. In the world in which I operate there are many large and complex systems that generate enormous amounts of raw data. Most of this data relates to the way that millions of people interact with hundreds of services. There’s nothing in this vast stream that easily identifies individual people, but collectively it tells us something about how the people we serve react to the service we offer. It is that broad reaction that we are trying to understand. So we are going from raw data (the stream of events generated by people) to information (the broad behaviour, trends and general reactions). Getting the data is easy. Information is another challenge entirely.

Still lost? Let me put it this way: if you had all the click-stream data for a major Web site would that tell you if your site was successful, failing, had potential for growth and so on? Well, yes, it probably would, but not directly. You’d have to slice and dice that data in many different ways before you could reach such conclusions. That’s information extraction, and it’s far from simple.

Information is usually easy to recognise. It tells you something that you did not already know, something you were unlikely, or even unable, to predict. In the field of information theory (one of Shannon’s greatest contributions) the amount of information derived from the data depends on probability. To illustrate, I’ll use the good old-fashioned coin-toss example:

Suppose a man has a coin that is black on one side and white on the other. He peers into his hand to observe which side is facing him, then walks away without telling the woman beside him what he saw. She’s not going to be too happy, because he has told her nothing. No information has passed between them, none whatsoever.

Now suppose that it was she who placed the coin in his hand, so that she already knew which side he would see. In this case it makes no difference whether he tells her or not: if he speaks, he conveys no information, because she already knows the outcome and there is nothing he can add.

However, if the man rattles the coin around in his hand first, and then tells her what he sees, information has been conveyed. To be precise, one “bit” of information, assuming each side of the coin had equal probability (the maximum entropy of 1 bit for a two-outcome event). Of course, if she knew that the coin was weighted with a 99.9% bias towards black, then telling her that the coin was black side up would be no surprise (i.e. not much information there), but if he remarks that the coin is white then this is indeed a surprise, and real information. In the real world of service telemetry, the data is almost impossible to predict, so it is ripe for information extraction.
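
For anyone who wants to see the arithmetic rather than take it on trust, here is a minimal Python sketch of the coin example (the probabilities are just those quoted above, nothing from real telemetry): the surprise of a single outcome is -log2 of its probability, and entropy is the average surprise across all outcomes.

    import math

    def surprisal(p):
        # Information conveyed by observing an outcome of probability p, in bits.
        return -math.log2(p)

    def entropy(probabilities):
        # Expected information per observation, in bits.
        return sum(p * surprisal(p) for p in probabilities if p > 0)

    # A fair coin: each side has probability 0.5.
    print(entropy([0.5, 0.5]))        # 1.0 bit -- the maximum for two outcomes

    # A coin weighted 99.9% towards black.
    print(entropy([0.999, 0.001]))    # ~0.011 bits -- almost nothing, on average
    print(surprisal(0.001))           # ~9.97 bits -- but "white" is a genuine surprise

The biased coin averages barely a hundredth of a bit per toss, yet the rare “white” outcome alone carries nearly ten bits: exactly the asymmetry described above.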

Maths aside, you can recognise good-quality information when it tells you something you didn’t expect, could not have predicted, and certainly could not have known in advance.

So what about all the click-stream data from a Web site? That’s just data. Lots of it. Putting it all into bar charts, flow diagrams and so on is just a way of compressing it into a picture. But is it showing you information? No, it’s not. Does showing you the average time between requests in a session give you information? No. It’s just another piece of data. Given the input stream, that particular average is entirely predictable. Look closer. Maybe you’ll find that the average of the first few steps is much smaller than the average of the rest of the steps in a session. Now we are approaching something we did not see before. What does it mean? What would it mean if we found the reverse to be true? (Early steps take longer than those near the end of a session.) Now we are seeing some interesting behaviour, something worth investigating further.
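
To make that concrete, here is a rough sketch of that kind of slicing. The event tuples, field layout and the three-step cut-off are illustrative assumptions, not the shape of any real click-stream; the point is simply to compare the gaps early in a session with the gaps later on.

    from collections import defaultdict
    from statistics import mean

    # Hypothetical click-stream events: (session_id, timestamp in seconds).
    events = [
        ("s1", 0), ("s1", 2), ("s1", 5), ("s1", 30), ("s1", 70),
        ("s2", 10), ("s2", 11), ("s2", 14), ("s2", 50), ("s2", 95), ("s2", 140),
    ]

    # Group timestamps by session.
    sessions = defaultdict(list)
    for session_id, ts in events:
        sessions[session_id].append(ts)

    EARLY_STEPS = 3  # arbitrary cut-off for what counts as the "first few" steps
    early_gaps, later_gaps = [], []
    for timestamps in sessions.values():
        timestamps.sort()
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        early_gaps.extend(gaps[:EARLY_STEPS])
        later_gaps.extend(gaps[EARLY_STEPS:])

    print("average early gap:", mean(early_gaps))   # ~11.7s with the sample data
    print("average later gap:", mean(later_gaps))   # ~43.3s with the sample data

The overall average gap tells you nothing you could not have computed blindfolded; the split between early and later gaps is where a surprise, and therefore information, might hide.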

This kind of stuff is information. This is the gem for which we are digging. The nature of the information is limited only by your imagination, and your understanding of the target domain. Forget asking for more charts of averages, more pie charts by time of day, by version of browser or any other way of slicing it. Look deeper. There are better questions to ask. Better information to find.

And when you finally get the information, your next question is: what will I do with this?
