Archive for March, 2007

why data (information representation) is the key to the coming semantic web

Monday, March 19th, 2007

In my last posting I argued that we should drop all talk about Artificial Intelligence when discussing the semantic web, web 3.0, etc., and acknowledge that in fact it’s all about data. There are two points in that statement. I was scratching an itch and so I only argued one of them. So what about my other claim?

While I’m not ready to describe what my company is doing, there’s a lot I can say about why I claim that data is the important thing.

Suppose something crops up in the non-computational “real-world” and you decide to use a computer to help address the situation. An inevitable task is to take the real-world situation and somehow get it into the computational system so the computer can act on it. Thus one of the very first tasks we face when deciding to use a computer is one of representation. Given information in the real world, we must choose how to represent it as data in a computer. (And it always is a choice.)

So when I say that data is important, I’m mainly referring to information representation. In my opinion, representation is the unacknowledged cornerstone of problem solving and algorithms. It’s fundamentally important and yet it’s widely ignored.

When computer scientists and others talk about problem solving and algorithms, they usually ignore representation. Even in the genetic algorithms community, in which representation is obviously needed and is a required explicit choice, the subject receives little attention. But if you think about it, in choosing a representation you have already begun to solve the problem. In other words, representation choice is a part of problem solving. But it’s never talked about as being part of a problem-solving algorithm. In fact though, if you choose your representation carefully the rest of the problem may disappear or become so trivial that it can be solved quickly by exhaustive search. Representation can be everything.

To illustrate why, here are a couple of examples.

Example 1. Suppose I ask you to use a computer to find two positive integers that have a sum of 15 and a product of 56. First, let’s pick some representation of a positive integer. How about a 512-bit binary string for each integer? That should cover it, I guess. We’ll have two of them, so that will be 1,024 bits in our representation. And here’s an algorithm, more or less: repeatedly set the 1,024 bits at random, interpret them as two integers, and check whether they sum to 15. If so, multiply them and check the product too.

But wait, wait, wait… even my 7-year-old could tell you that’s not a sensible approach. It will work, eventually. The search space has 2^1024 candidate solutions. Even if we test a billion billion billion of them per second, it’s going to take much longer than a billion years.

Instead, we could think a little about representation before considering what would classically be called the algorithm. Aha! It turns out we could actually represent each integer using just 4 bits, without risk of missing the solution: both integers are positive and sum to 15, so neither can exceed 14. The search space shrinks from 2^1024 to 2^8 = 256 candidate pairs. Then we can use our random (or an exhaustive) search algorithm and have the answer in well under a microsecond. Wow.
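To make the contrast concrete, here’s a minimal Python sketch of the second approach. The function name and defaults are mine, invented purely for illustration; there is nothing clever going on.

    # Exhaustive search over the 4-bit representation: at most 256 pairs.
    def find_pair(target_sum=15, target_product=56, bits=4):
        """Try every pair of positive integers that fits in `bits` bits."""
        limit = 2 ** bits                  # 4 bits: values 0..15
        for a in range(1, limit):
            for b in range(1, limit):
                if a + b == target_sum and a * b == target_product:
                    return a, b
        return None

    print(find_pair())                     # prints (7, 8)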

Of course this is a deliberately extreme example. But think about what just happened. The problem and the algorithm are the same in both of the above approaches. The only thing that changed was the representation. We coupled the stupidest possible algorithm with a good representation and the problem became trivial.

Example 2. Consider the famous Eight Queens problem (8QP). That’s considerably harder than the above problem. Or is it?

Let’s represent a chess board in the computer using a 64-bit string, with a set bit indicating the presence of a queen. We’ll devise a clever algorithm for coming up with candidate 64-bit solutions, and write code to check them for correctness (starting with whether exactly 8 bits are set). But the raw search space is 2^64, and that’s not a small number. At a billion candidates per second it would take centuries to run through, so the algorithm had better be pretty good!

But wait. If you put a queen in row R and column C, no other queen can be in row R or column C. Following that line of thinking, you can see that all possibly valid solutions can be represented by a permutation of the numbers 1 through 8. The first number in the permutation gives the column of the queen in the first row, and so on. There are only 8! = 40,320 possible arrangements that need to be checked. That’s a tiny number. We could program it up, use exhaustive search as our algorithm, and have a solution in well under a second!
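Here’s a rough Python sketch of that idea: the permutation representation plus the dumbest possible exhaustive check. Only the standard library is used; the function itself is mine, just to show how little is needed.

    from itertools import permutations

    def eight_queens():
        # cols[r] is the column of the queen in row r. Because cols is a
        # permutation, row and column clashes are impossible by construction;
        # only the diagonals still need checking.
        for cols in permutations(range(8)):            # 8! = 40,320 candidates
            if all(abs(cols[r1] - cols[r2]) != r1 - r2
                   for r1 in range(8) for r2 in range(r1)):
                return cols                            # first solution found
        return None

    print(eight_queens())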

Once again, a change of representation has a radical impact on what people would normally think of as the problem. But the problem isn’t changing at all. What’s happening is that when you choose a representation you have actually already begun to solve the problem. In fact, as the examples show, if you get the representation right enough the “problem” pretty much vanishes.

These are just two simple examples. There are many others. You may not be ready to generalize from them, but I am.

I think fundamental advances based almost solely on improved representation lie just ahead of us.

I think that if we adopt a better representation of information, things that currently look impossible may even cease to look like problems.

There are other people who seem to believe this too, though perhaps implicitly. Web 3.0, whatever that is, can bring major advances without anyone needing to come up with new algorithms. Given a better representation we could even use dumb algorithms (though perhaps not pessimal algorithms) and yet do things that we can’t do with “smart” ones. I think this is the realization, justifiably exciting, that underlies the often vague talk of “web 3.0”, the “read/write web”, the “data web”, “data browsing”, the infinite possible futures of mashups, etc.

This is why, to pick the most obvious target, I am certain that Google is not the last word in search. It’s probably not a smart idea to try to be smarter than Google. But if you build a computational system with a better underlying representation of information you may not need to be particularly intelligent at all. Things that some might think are related to “intelligence”, including the emergence of a sexy new “semantic” web, may not need much more than improved representation.

Give a 4-year-old a book with a 90%-obscured picture of a tiger in the jungle. Ask them what they see. Almost instantly they see the tiger. It seems incredible. Is the child solving a problem? Does the brain or the visual system use some fantastic algorithm that we’ve not yet discovered? Above I’ve given examples of how better representation can turn things that a priori seemed to require problem solving and algorithms into things that are actually trivial. We can extend the argument to intelligence. I suspect it’s easy to mistake someone with a good representation and a dumb algorithm for being somehow intelligent.

I bet that evolution has produced a representation of information in the brain that makes some problems (like visual pattern matching) non-existent. I.e., not problems at all. I bet that there’s basically no problem solving going on at all in some things people are tempted to think of as needing intelligence. The “algorithm”, and I hesitate to use that word, might be as simple as a form of (chemical) hill climbing, or something even more mundane. Perhaps everything we delight in romantically ascribing to native “intelligence” is really just a matter of representation.

That’s why I believe data (aka information representation) is so extremely important. That’s where we’re heading. It’s why I’m doing what I’m doing.

the semantic web is the new AI

Sunday, March 18th, 2007

I’m not a huge fan of rationality. But if you are going to try to think and act rationally, especially on quantitative or technical subjects, you may as well do a decent job of it.

I have a strong dislike of trendy terms that give otherwise intelligent people a catchy new phrase that can be tossed around to get grants, get funded, and get laid. I spent years trying to debunk what I thought was an appalling lack of thought about Fitness Landscapes. At the Santa Fe Institute in the early 90s, this was a term that (with very few exceptions, most notably Peter Stadler) was tossed about with utter carelessness. I wrote a Ph.D. dissertation on Evolutionary Algorithms, Fitness Landscapes and Search, parts of which were thinly veiled criticism of some of the unnecessarily colorful biological language used to describe “evolutionary” algorithms. I get extremely impatient when I sense a herd mentality in the adoption of a catchy new term for talking about something that is in fact far more mundane. I get even more impatient when widespread use of the term means that people stop thinking.

That’s why I’m fed up with the current breathless reporting on the semantic web. The semantic web is the new artificial intelligence. We’re on the verge of wonders, but everyone agrees these will take a few more years to realize. Instead of having intelligent robots to do our bidding, we’ll have intelligent software agents that can reason about stuff they find online, and do what we mean without even needing to be told. They’ll do so many things, coordinating our schedules, finding us hotels and booking us in, anticipating our wishes and intelligently combining disparate information from all over the place to… well, you get the picture.

There are two things going on in all this talk about the semantic web. One is recycled rubbish and one is real. The recycled rubbish is the Artificial Intelligence nonsense, the visionary technologist’s wet dream that will not die. Sorry folks – it ain’t gonna happen. It wasn’t going to happen last century, and it’s not going to happen now. Can we please just forget about Artificial Intelligence?

It was once thought that it would take intelligence for a computer to play chess. Computers can now play grandmaster-level chess. But they’re not one whit closer to being intelligent as a result, and we know it. Instead of admitting we were wrong, or admitting that since it obviously doesn’t take intelligence to play chess, maybe Artificial Intelligence as a field was chasing something that was not actually intelligence at all, we move the goalposts and continue the elusive search. Obviously the development of computers that can play better-than-human-level chess (is it good chess? I don’t think we can say it is), and other advances, have had a major impact. But they’ve nothing to do with intelligence, besides our own ingenuity at building faster, smaller, and cooler machines, with better algorithms (and, in the case of chess, bigger lookup tables) making their way into hardware.

And so it is with the semantic web. All talk of intelligence should be dropped. It’s worse than useless.

But, there has been real progress in the web in recent years. Web 2.0, whatever that means exactly, is real. Microsoft were right to be worried that the browser could make the underlying OS and its applications irrelevant. They were just 10 years too early in trying to kill it, and then, beautiful irony, they had a big hand in making it happen with their influence in getting asynchronous browser/server traffic (i.e., XmlHttpRequest and its Microsoft equivalent, the foundation of AJAX) into the mainstream.

Similarly, there is real substance to what people talk about as Web 3.0 and the semantic web.

It’s all about data.

It’s just that. One little and very un-sexy word. There’s no need to get all hot and bothered about intelligence, meaning, reasoning, etc. It’s all about data. It’s about making data more uniform, more accessible, easier to create, to share, to find, and to organize.

If you read around on the web, there are dozens of articles about the upcoming web. Some are quite clear that it’s all about the data. But many give in to the temptation to jump on the intelligence bandwagon, and rabbit on about the heady wonders of the upcoming semantic web (small-s, capital-S, I don’t mind). Intelligent agents will read our minds, do the washing, pick up the kids from school, etc.

Some articles mix in a bit of both. I just got done reading a great example, “A Smarter Web: New technologies will make online search more intelligent – and may even lead to a Web 3.0.”

As you read it, try to keep a clean separation in mind between the AI side of the romantic semantic web and simple data. Every time intelligence is mentioned, it’s vague, and it comes with an acknowledgment that this kind of progress may be a little way off (sound familiar?). Every time real progress and solid results are mentioned, it’s because someone had the common sense to take a bunch of data and put it into a better format, like RDF, and then take some other routine action (usually search) on it.
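To show how mundane that really is, here’s a toy Python sketch, not real RDF tooling, and every name and fact in it is invented for the example. The point is simply that once information is held uniformly as subject/predicate/object triples, “search” reduces to routine pattern matching.

    # Toy triple store: a set of (subject, predicate, object) tuples.
    triples = {
        ("air_comet", "flies_route", "jfk-madrid"),
        ("air_comet", "one_way_fare_eur", "83"),
        ("jfk-madrid", "flight_time_hours", "6.25"),
    }

    def match(subject=None, predicate=None, obj=None):
        # Return every triple consistent with the (possibly None) pattern.
        return [(s, p, o) for (s, p, o) in triples
                if subject in (None, s)
                and predicate in (None, p)
                and obj in (None, o)]

    print(match(subject="air_comet"))      # everything we know about air_comet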

I fully agree with those who claim that important qualitative advances are on their way. Yes, that’s a truism. I mean that we are soon going to see faster-than-usual advances in how we work with information on the web. But the advances will be driven by mundane improvements in data architecture, and, just like computers “learning” to “play” chess, they will have nothing at all to do with intelligence.

I should probably disclose that I’m not financially neutral on this subject. I have a small company that some would say is trying to build the semantic web. To me, it’s all about data architecture.

more on flight costs

Tuesday, March 13th, 2007

Continuing from the last post, let’s suppose fuel costs are constant across all airlines.

On my 6-hour Air Comet flight, I will be paying about $8/hr for fuel and $10/hr for everything else (not bad, seeing as I get to watch a couple of movies and eat a meal).

Going to Cheaptickets I see the next cheapest option is Aer Lingus (not a direct route), who will charge me $450. So you’d be tempted to conclude that Aer Lingus is about 4 times as expensive as Air Comet. But… the price of the fuel is constant. That means the Aer Lingus ticket carries $400 of non-fuel cost, or roughly $65/hr if the flights were the same length (they’re not), against Air Comet’s $60, or about $10/hr. So Aer Lingus is actually more like 6.5 times as expensive as Air Comet. The cheapest US carrier (Delta in this case) will charge me $1,127, which works out to more like $180/hr in non-fuel costs, or 18 times as expensive as Air Comet (were the trips the same length, which they’re not).
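For what it’s worth, here’s the same back-of-the-envelope arithmetic as a small Python sketch. The fares and the flat ~$50 fuel figure are the ones quoted in these two posts; the nominal 6-hour trip length is the simplification used above.

    # Napkin arithmetic only: fares in USD, flat fuel cost per passenger,
    # nominal 6-hour trip. All figures are from these posts.
    FUEL_PER_PASSENGER = 50
    HOURS = 6

    fares = {"Air Comet": 110, "Aer Lingus": 450, "Delta": 1127}

    baseline = fares["Air Comet"] - FUEL_PER_PASSENGER   # Air Comet's non-fuel cost
    for airline, fare in fares.items():
        non_fuel = fare - FUEL_PER_PASSENGER
        print(f"{airline}: ~${non_fuel / HOURS:.0f}/hr non-fuel, "
              f"{non_fuel / baseline:.1f}x Air Comet")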

All very non-scientific.

jfk to madrid for 83 euros

Tuesday, March 13th, 2007

I’m about to book a cheap flight from JFK to Madrid with Air Comet. There are some alarming and amusing comments about the airline online. See this page for example – search for “hot red smocks and slit skirts”. I flew that route with them about a month ago and everything went smoothly.

And the low, low, price? Just 83 euros one way!

According to this site a 12-hour flight needs 110 tons of fuel. Mine’s a 6:15 flight, so call it 55 tons of fuel. A ton is 2,000 pounds according to Google, so that’s 55 × 2,000 = 110,000 pounds of fuel. Jet fuel is essentially kerosene and weighs about 6 pounds per gallon. So that’s 110,000 / 6 ≈ 18,333 gallons of fuel for the trip. FedEx charges a fuel surcharge when the price of jet fuel rises above $0.98 per gallon, so let’s assume Air Comet is paying $0.80 per gallon.

Thus the price of fuel alone for the trip is roughly 18,333 x $0.80 = $14,666.

The plane is an Airbus 313, which has a capacity of 295. If we assume the flight is full, Air Comet needs to charge each passenger just under $50 for fuel alone. 83 euros is about $110. So Air Comet can cover the cost of fuel. Good.

Continuing, that leaves $60 of my ticket price times 295 passengers, or roughly $17K to pay for everything else.
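Pulling the napkin arithmetic together as a Python sketch (every number is one of the rough assumptions above, nothing more authoritative than that):

    # All figures are the rough assumptions from this post, not airline data.
    tons_of_fuel      = 55       # half the 110 tons quoted for a 12-hour flight
    pounds_per_ton    = 2000
    pounds_per_gallon = 6        # jet fuel is roughly kerosene
    price_per_gallon  = 0.80     # guessed from FedEx's $0.98 surcharge threshold
    seats             = 295
    fare_usd          = 110      # 83 euros, roughly

    gallons   = tons_of_fuel * pounds_per_ton / pounds_per_gallon   # ~18,333
    fuel_cost = gallons * price_per_gallon                          # ~$14,666
    per_seat  = fuel_cost / seats                                   # just under $50

    print(f"fuel per passenger: ${per_seat:.2f}")
    print(f"left for everything else: ${(fare_usd - per_seat) * seats:,.0f}")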

This all assumes that everyone is paying the same low price, which of course they are not.

While googling for the above numbers, I found an article about the first model plane that crossed the Atlantic. It weighed 11 pounds (5 kilos) and got about 3,000 miles per gallon of fuel. The crossing is roughly 1,900 miles, so that’s well under a gallon, i.e., less than $1 of fuel for the whole trip.

orwell on dickens

Tuesday, March 6th, 2007

I’ve not read a single word of Dickens. I don’t know the plot of a single book, apart from superficial knowledge of Oliver Twist. For a long time this has seemed like a major hole in my reading. I’ve occasionally considered doing something about it.

But I have just finished a 50-page essay on Dickens by Orwell. I’ve read the obvious Orwell but never knew anything about the man. I like Orwell. The Dickens essay is good. After reading it I have even less interest in reading Dickens. Of course I should probably make up my mind about Dickens from reading him first hand. But life is short. Orwell strongly confirmed my suspicions. And so I’ve decided to skip Dickens completely. Forever.

It’s nice to have the hole, and to now know that it’s permanent. It has strategic value. Plus I have the good fortune that my hole happens to be Dickens. He wasn’t worth reading anyway.