Archive for the ‘representation’ Category

The power of representation: Adding powers of two

Wednesday, February 13th, 2008

On the left is an addition problem. If you know the answer without thinking, you’re probably a geek.

Suppose you had to solve a large number of problems of this type: adding consecutive powers of 2 starting from 1. If you did enough of them you might guess that 1 + 2 + 4 + … + 2^(n–1) is always equal to 2^n – 1. In the example on the left, we’re summing from 2^0 to 2^10 and the answer is 2^11 – 1 = 2047.

And if you cast your mind back to high-school mathematics you might even be able to prove this using induction.

But that’s a lot of work, even supposing you see the pattern and are able to do a proof by induction.

Let’s instead think about the problem in binary (i.e., base 2). In binary, the sum looks like the image on the right.

There’s really no work to be done here. If you think in binary, you already know the answer to this “problem”. It would be a waste of time to even write the problem down. It’s like asking a regular base-10 human to add up 3 + 30 + 300 + 3000 + 30000, for example. You already know the answer. In a sense there is no problem because your representation is so nicely aligned with the task that the problem seems to vanish.
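
For readers who want to see the binary view spelled out, here is a minimal Python sketch. The function name sum_of_powers is made up for illustration. It adds the powers of two the slow way and then prints each term in binary, so you can see that every term contributes exactly one 1 bit and the total is just a row of ones.

```python
def sum_of_powers(n):
    """Add 2**0 + 2**1 + ... + 2**(n-1) term by term (the 'decimal' way)."""
    return sum(2**k for k in range(n))


n = 11  # the example in the post: 2**0 through 2**10
total = sum_of_powers(n)

# Each power of two is a single 1 bit in its own column...
for k in range(n):
    print(format(2**k, "011b"))

# ...so the sum is simply n ones in a row, which is 2**n - 1.
print("-" * n)
print(format(total, "011b"), "=", total)
```

The last line printed is eleven ones followed by 2047. No carrying ever happens, which is why the "problem" disappears in base 2.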

Why am I telling you all this?

Because, as I’ve emphasized in three other postings, if you choose a good representation, what looks like a problem can simply disappear.

I claim (without proof) that lots of the issues we’re coming up against today as we move to a programmable web, integrated social networks, and as we struggle with data portability, ownership, and control will similarly vanish if we simply start representing information in a different way.

I’m trying to provide some simple examples of how this sort of magic can happen. There’s nothing deep here. In the non-computer world we wouldn’t talk about representation, we’d just say that you need to look at the problem from the right point of view. Once you do that, you see that it’s actually trivial.

Talking at TTI/Vanguard Smart(er) Data conference

Monday, February 11th, 2008

I’ve been invited to speak at a TTI/Vanguard conference in Atlanta on SMART(ER) DATA on Feb 20/21.

Here’s the abstract.

Representation, Representation, Representation

In this talk I will argue for the importance of information representation. Choice of representation is critical to the design of computational systems. A good choice can simplify or even eliminate problems by reducing or obviating the need for clever algorithms. By making better choices about low-level information representation we can broadly increase the power and flexibility of higher-level applications, and also make them easier to build. In other words, we can more easily make future applications smarter if we are smarter about how we represent the data they manipulate. Despite all this, representation choice is often ignored or taken for granted.

Key trends in our experience with online data are signalled by terms such as mashups, data web, programmable web, read/write web, and collective databases and also by the increasing focus on APIs, inter-operability, transparency, standards, openness, data portability, and data ownership.

To fully realize our hopes for future applications built along these lines, and our interactions with the information they will present to us, we must rethink three important aspects of how we represent information. These are our model of information ownership and control, the distinction between data and metadata, and how we organize information. I will illustrate why these issues are so fundamental and demonstrate how we are addressing them at Fluidinfo.

Tagging in the year 3000 (BC)

Friday, January 4th, 2008

Jimmy Guterman recently called Marcel Proust an Alpha Geek and asked for thoughts on “what from 100 years ago might be the hot new technology of 2008?”

Here’s something about 5000 years older. As a bonus there’s a deep connection with what Fluidinfo is doing.

Alex Wright recently wrote GLUT: Mastering Information Through the Ages. The book is good. It’s a little dry in places, but in others it’s really excellent. I especially enjoyed the last 2 chapters, “The Web that Wasn’t” and “Memories of the Future”. GLUT has a non-trivial overlap with the even more excellent Everything is Miscellaneous by David Weinberger.

In chapter 4 of GLUT, “The Age of Alphabets”, Wright describes the rise of writing systems around 3000 BC as a means of recording commercial transactions. The details of the transactions were written onto a wet clay tablet, signed by the various parties, and then baked. Wright (p50) continues:

Once the tablet was baked, the scribe would then deposit it on a shelf or put it in a basket, with labels affixed to the outside to facilitate future search and retrieval.

There are two comments I want to make about this. One is a throwaway answer to Jimmy Guterman’s request, but the other deserves consideration.

Firstly, this is tagging. Note that the tags are attached after the data is put onto the clay tablet and it is baked. This temporal distinction is important – it’s not like other mentions of metadata or tagging given by Wright (e.g., see p51 and p76). Tags could presumably have different shapes or colors, and be removed, added to, etc. Tags can be attached to objects you don’t own – like using a database to put tags on a physically distant web page you don’t own. No-one has to anticipate all the tag types, or the uses they might be put to. If a Sumerian scribe decided to tag the best agrarian deals of 3000 BC or all deals involving goats, he/she could have done it just as naturally as we’d do it today.

Secondly, I find it very interesting to consider the location of information here and in other systems. The tags that scribes were putting on tablets in 3000 BC were stored with the tablets. They were physically attached to them. I think that’s right-headed. To my mind, the tag information belongs with the object that’s being tagged. In contrast, today’s online tagging systems put our tags in a physically separate location. They’re forced to do that because of the data architecture of the web. The tagging system itself, and the many people who may be tagging a remote web page, don’t own that page. They have no permission to alter it.

Let’s follow this thinking about the location of information a little further…

Later in GLUT, Wright touches on how the card catalog of libraries became separated from the main library content, the actual books. Libraries became so big and accumulated so many volumes that it was no longer feasible to store the metadata for each volume with the volume. So that information was collected and stored elsewhere.

This is important because the computational world we all inhabit has similarly been shaped by resource constraints. In our case the original constraints are long gone, but we continue to live in their shadow.

I’ll explain.

We all use file systems. These were designed many decades ago for a computing environment that no longer exists. Machines were slow. Core and disk memory were tiny. Fast indexing and retrieval algorithms had yet to be invented. Today, file content and file metadata are firmly separated. File data is in one place while file name, permissions, and other metadata are stored elsewhere. That division causes serious problems. The two systems need different access mechanisms. They need different search mechanisms.

Now would be a good time to ask yourself why it has traditionally been almost impossible to find a file based simultaneously on its name and its content.

Our file systems are like our libraries. They have a huge card catalog just inside the front door (at the start of the disk), and that’s where you go to look things up. If you want the actual content you go fetch it from the stacks. Wandering the stacks without consulting the catalog is a little like reading raw disk blocks at random (that can be fun btw).

But libraries and books are physical objects. They’re big and slow and heavy. They have ladders and elevators and are traversed by short-limbed humans with bad eyesight. Computers do not have these characteristics. By human standards, they are almost infinitely fast and their storage is cheap and effectively infinite. There’s no longer any reason for computers to separate data from metadata. In fact there’s no need for a distinction between the two. As David Weinberger put it, in the real world “everything is metadata”. So it should be in the computer world as well.

In other words, I think it is time to return to a more natural system of information storage. A little like the tagging we were doing in 3000 BC.

Several things will have to change if we’re to pull this off. And that, gentle reader, is what Fluidinfo is all about.

Stay tuned.

Multiplying with Roman numerals

Saturday, November 10th, 2007

I like thinking about the power of representation, particularly inside computers. I wrote about it earlier in the year and gave a couple of examples. Here’s another.

Think about how you might have done multiplication with Roman numerals. Why is it so difficult?

It’s not because multiplication is inherently so hard. Roman numerals were just a terribly awkward way to represent numbers. However, if you introduce the concept of a zero and use a positional representation, things become much easier.

Note that the problem hasn’t changed, only the representation did. A new representation can make things that look like problems go away.
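
To make the point concrete, here is a rough Python sketch of what "multiplying Roman numerals" tends to look like in practice: convert to a positional representation, multiply there, and convert back. The helper names (roman_to_int, int_to_roman, roman_multiply) are hypothetical, and the sketch assumes well-formed numerals with a product below 4000.

```python
# Value/symbol pairs for standard Roman numerals, largest first.
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def roman_to_int(numeral):
    """Read a well-formed Roman numeral left to right (no validation)."""
    total, pos = 0, 0
    for value, symbol in ROMAN:
        while numeral.startswith(symbol, pos):
            total += value
            pos += len(symbol)
    return total


def int_to_roman(number):
    """Write a positive integer below 4000 as a Roman numeral."""
    parts = []
    for value, symbol in ROMAN:
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def roman_multiply(a, b):
    """All the real work happens in the positional representation."""
    return int_to_roman(roman_to_int(a) * roman_to_int(b))


print(roman_multiply("XIV", "XII"))  # 14 * 12 = 168, i.e. CLXVIII
```

Notice that the multiplication itself is a single `*`. Everything else in the sketch is translation between representations.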

I claim that we are still using Roman numerals to manage information online (and on the desktop for that matter). Until we do something about it, we’ll probably continue butting our heads against the same problems and they’ll probably continue to appear intractable.

At Fluidinfo, everything we do is based on a new way to represent information.

why data (information representation) is the key to the coming semantic web

Monday, March 19th, 2007

In my last posting I argued that we should drop all talk about Artificial Intelligence when discussing the semantic web, web 3.0, etc., and acknowledge that in fact it’s all about data. There are two points in that statement. I was scratching an itch and so I only argued one of them. So what about my other claim?

While I’m not ready to describe what my company is doing, there’s a lot I can say about why I claim that data is the important thing.

Suppose something crops up in the non-computational “real-world” and you decide to use a computer to help address the situation. An inevitable task is to take the real-world situation and somehow get it into the computational system so the computer can act on it. Thus one of the very first tasks we face when deciding to use a computer is one of representation. Given information in the real world, we must choose how to represent it as data in a computer. (And it always is a choice.)

So when I say that data is important, I’m mainly referring to information representation. In my opinion, representation is the unacknowledged cornerstone of problem solving and algorithms. It’s fundamentally important and yet it’s widely ignored.

When computer scientists and others talk about problem solving and algorithms, they usually ignore representation. Even in the genetic algorithms community, in which representation is obviously needed and is a required explicit choice, the subject receives little attention. But if you think about it, in choosing a representation you have already begun to solve the problem. In other words, representation choice is a part of problem solving. But it’s never talked about as being part of a problem-solving algorithm. In fact though, if you choose your representation carefully the rest of the problem may disappear or become so trivial that it can be solved quickly by exhaustive search. Representation can be everything.

To illustrate why, here are a couple of examples.

Example 1. Suppose I ask you to use a computer to find two positive integers that have a sum of 15 and a product of 56. First, let’s pick some representation of a positive integer. How about a 512-bit binary string for each integer? That should cover it, I guess. We’ll have two of them, so that will be 1,024 bits in our representation. And here’s an algorithm, more or less: repeatedly set the 1,024 bits at random and add the corresponding integer values to see if they sum to 15. If so, multiply them and check the product too.

But wait, wait, wait… even my 7-year-old could tell you that’s not a sensible approach. It will work, eventually. The state search space has 2^1024 candidate solutions. Even if we test a billion billion billion of them per second, it’s going to take much longer than a billion years.
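
A back-of-the-envelope check of that claim, as a small Python sketch. It assumes a 365-day year and takes "a billion billion billion" to mean 10^27 tests per second. The variable names are only for illustration.

```python
candidates = 2**1024            # size of the 1,024-bit search space
tests_per_second = 10**27       # a billion billion billion
seconds_per_year = 365 * 24 * 60 * 60

# Python integers are arbitrary precision, so this division is exact.
years = candidates // (tests_per_second * seconds_per_year)

print("digits in the number of years:", len(str(years)))
print("longer than a billion years?", years > 10**9)
```

The number of years has hundreds of digits, so "much longer than a billion years" is, if anything, an understatement.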

Instead, we could think a little about representation before considering what would classically be called the algorithm. Aha! It turns out we could actually represent each integer using just 4 bits, without risk of missing the solution. Then we can use our random (or an exhaustive) search algorithm, and have the answer in about a billionth of a second. Wow.

Of course this is a deliberately extreme example. But think about what just happened. The problem and the algorithm are the same in both of the above approaches. The only thing that changed was the representation. We coupled the stupidest possible algorithm with a good representation and the problem became trivial.
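
Here is what the second approach might look like as a minimal Python sketch. The function name find_pair and its parameters are invented for this illustration. The algorithm is still dumb exhaustive search. Only the representation (4 bits per integer) has changed.

```python
from itertools import product


def find_pair(target_sum=15, target_product=56, bits=4):
    """Try every pair of positive integers that fits in `bits` bits."""
    limit = 2**bits  # 4 bits covers 0..15, enough when the sum is 15
    for a, b in product(range(1, limit), repeat=2):
        if a + b == target_sum and a * b == target_product:
            return a, b
    return None


print(find_pair())  # (7, 8)
```

With 4 bits per integer there are at most 15 × 15 = 225 pairs to try, so even the stupidest possible search finishes instantly.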

Example 2. Consider the famous Eight Queens problem (8QP). That’s considerably harder than the above problem. Or is it?

Let’s represent a chess board in the computer using a 64-bit string, and make sure that exactly 8 bits are set to one to indicate the presence of a queen. We’ll devise a clever algorithm for coming up with candidate 64-bit solutions, and write code to check them for correctness. But the search space is 2^64, and that’s not a small number. It could easily take a year to run through that space, so the algorithm had better be pretty good!

But wait. If you put a queen in row R and column C, no other queen can be in row R or column C. Following that line of thinking, you can see that all possibly valid solutions can be represented by a permutation of the numbers 1 through 8. The first number in the permutation gives the column of the queen in the first row, and so on. There are only 8! = 40,320 possible arrangements that need to be checked. That’s a tiny number. We could program it up, use exhaustive search as our algorithm, and have a solution in well under a second!
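
The permutation representation can be sketched in a few lines of Python. This is an illustration under the assumptions stated in the comments, not a tuned solver, and the names is_valid and solve are made up. Because a permutation already guarantees one queen per row and per column, the only thing left to check is the diagonals.

```python
from itertools import permutations


def is_valid(cols):
    """cols[r] is the column of the queen in row r (rows counted from 0).

    Two queens share a diagonal exactly when row + col or row - col
    is equal, so all of those values must be distinct.
    """
    n = len(cols)
    sums = {r + c for r, c in enumerate(cols)}
    diffs = {r - c for r, c in enumerate(cols)}
    return len(sums) == n and len(diffs) == n


def solve(n=8):
    """Exhaustively check all n! permutations of the columns 1..n."""
    return [p for p in permutations(range(1, n + 1)) if is_valid(p)]


solutions = solve()
print(len(solutions))  # 92
print(solutions[0])    # one valid placement, columns listed row by row
```

The search visits all 40,320 permutations and keeps the 92 that pass the diagonal test.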

Once again, a change of representation has a radical impact on what people would normally think of as the problem. But the problem isn’t changing at all. What’s happening is that when you choose a representation you have actually already begun to solve the problem. In fact, as the examples show, if you get the representation right enough the “problem” pretty much vanishes.

These are just two simple examples. There are many others. You may not be ready to generalize from them, but I am.

I think fundamental advances based almost solely on improved representation lie just ahead of us.

I think that if we adopt a better representation of information, things that currently look impossible may even cease to look like problems.

There are other people who seem to believe this too, though perhaps implicitly. Web 3.0, whatever that is, can bring major advances without anyone needing to come up with new algorithms. Given a better representation we could even use dumb algorithms (though perhaps not pessimal algorithms) and yet do things that we can’t do with “smart” ones. I think this is the realization, justifiably exciting, that underlies the often vague talk of “web 3.0”, the “read/write web”, the “data web”, “data browsing”, the infinite possible futures of mashups, etc.

This is why, to pick the most obvious target, I am certain that Google is not the last word in search. It’s probably not a smart idea to try to be smarter than Google. But if you build a computational system with a better underlying representation of information you may not need to be particularly intelligent at all. Things that some might think are related to “intelligence”, including the emergence of a sexy new “semantic” web, may not need much more than improved representation.

Give a 4-year-old a book with a 90%-obscured picture of a tiger in the jungle. Ask them what they see. Almost instantly they see the tiger. It seems incredible. Is the child solving a problem? Does the brain or the visual system use some fantastic algorithm that we’ve not yet discovered? Above I’ve given examples of how better representation can turn things that a priori seemed to require problem solving and algorithms into things that are actually trivial. We can extend the argument to intelligence. I suspect it’s easy to mistake someone with a good representation and a dumb algorithm as being somehow intelligent.

I bet that evolution has produced a representation of information in the brain that makes some problems (like visual pattern matching) non-existent. I.e., not problems at all. I bet that there’s basically no problem solving going on at all in some things people are tempted to think of as needing intelligence. The “algorithm”, and I hesitate to use that word, might be as simple as a form of (chemical) hill climbing, or something even more mundane. Perhaps everything we delight in romantically ascribing to native “intelligence” is really just a matter of representation.

That’s why I believe data (aka information representation) is so extremely important. That’s where we’re heading. It’s why I’m doing what I’m doing.

the semantic web is the new AI

Sunday, March 18th, 2007

I’m not a huge fan of rationality. But if you are going to try to think and act rationally, especially on quantitative or technical subjects, you may as well do a decent job of it.

I have a strong dislike of trendy terms that give otherwise intelligent people a catchy new phrase that can be tossed around to get grants, get funded, and get laid. I spent years trying to debunk what I thought was an appalling lack of thought about Fitness Landscapes. At the Santa Fe Institute in the early 90s, this was a term that (with very few exceptions, most notably Peter Stadler) was tossed about with utter carelessness. I wrote a Ph.D. dissertation on Evolutionary Algorithms, Fitness Landscapes and Search, parts of which were thinly-veiled criticism of some of the unnecessarily colorful biological language used to describe “evolutionary” algorithms. I get extremely impatient when I sense a herd mentality in the adoption of a catchy new term for talking about something that in fact is far more mundane. I get even more impatient when widespread use of the term means that people stop thinking.

That’s why I’m fed up with the current breathless reporting on the semantic web. The semantic web is the new artificial intelligence. We’re on the verge of wonders, but everyone agrees these will take a few more years to realize. Instead of having intelligent robots to do our bidding, we’ll have intelligent software agents that can reason about stuff they find online, and do what we mean without even needing to be told. They’ll do so many things, coordinating our schedules, finding us hotels and booking us in, anticipating our wishes and intelligently combining disparate information from all over the place to…. well, you get the picture.

There are two things going on in all this talk about the semantic web. One is recycled rubbish and one is real. The recycled rubbish is the Artificial Intelligence nonsense, the visionary technologist’s wet dream that will not die. Sorry folks – it ain’t gonna happen. It wasn’t going to happen last century, and it’s not going to happen now. Can we please just forget about Artificial Intelligence?

It was once thought that it would take intelligence for a computer to play chess. Computers can now play grandmaster-level chess. But they’re not one whit closer to being intelligent as a result, and we know it. Instead of admitting we were wrong, or admitting that since it obviously doesn’t take intelligence to play chess that maybe Artificial Intelligence as a field was chasing something that was not actually intelligence at all, we move the goalposts and continue the elusive search. Obviously the development of computers that can play better than human-level chess (is it good chess? I don’t think we can say it is), and other advances, have had a major impact. But they’ve nothing to do with intelligence, besides our own ingenuity at building faster, smaller, and cooler machines with better algorithms (and, in the case of chess, bigger lookup tables) making their way into hardware.

And so it is with the semantic web. All talk of intelligence should be dropped. It’s worse than useless.

But there has been real progress in the web in recent years. Web 2.0, whatever that means exactly, is real. Microsoft were right to be worried that the browser could make the underlying OS and its applications irrelevant. They were just 10 years too early in trying to kill it, and then, beautiful irony, they had a big hand in making it happen with their influence in getting asynchronous browser/server traffic (i.e., XMLHttpRequest and its Microsoft equivalent, the foundation of AJAX) into the mainstream.

Similarly, there is real substance to what people talk about as Web 3.0 and the semantic web.

It’s all about data.

It’s just that. One little and very un-sexy word. There’s no need to get all hot and bothered about intelligence, meaning, reasoning, etc. It’s all about data. It’s about making data more uniform, more accessible, easier to create, to share, to find, and to organize.

If you read around on the web, there are dozens of articles about the upcoming web. Some are quite clear that it’s all about the data. But many give in to the temptation to jump on the intelligence bandwagon, and rabbit on about the heady wonders of the upcoming semantic web (small-s, capital-S, I don’t mind). Intelligent agents will read our minds, do the washing, pick up the kids from school, etc.

Some articles mix in a bit of both. I just got done reading a great example, A Smarter Web: New technologies will make online search more intelligent–and may even lead to a “Web 3.0.”

As you read it, try to keep a clean separation in mind between the AI side of the romantic semantic web and simple data. Every time intelligence is mentioned, it’s vague and with an acknowledgment that this kind of progress may be a little way off (sound familiar?). Every time real progress and solid results are mentioned, it’s because someone had the common sense to take a bunch of data and put it into a better format, like RDF, and then take some other routine action (usually search) on it.

I fully agree with those who claim that important qualitative advances are on their way. Yes, that’s a truism. I mean that we are soon going to see faster-than-usual advances in how we work with information on the web. But the advances will be driven by mundane improvements in data architecture, and, just like computers “learning” to “play” chess, they will have nothing at all to do with intelligence.

I should probably disclose that I’m not financially neutral on this subject. I have a small company that some would say is trying to build the semantic web. To me, it’s all about data architecture.