Archive for January, 2008

S3 numbers revisited: six orders of magnitude does matter

Tuesday, January 29th, 2008

OK…. I should have realized in my original posting that the Oct 2007 10,000,000,000,000 objects figure was the source of the problem. I knew S3 could not be doubling every week, and that Amazon could not be making $11B a month, but didn’t see the now-obvious error in the input.

So what sort of money are they actually making?

Don MacAskill pointed me to this article at Forbes which says the number of objects at the end of 2007 was up to 14B from 10B in October. So let’s suppose the number now stands at 15B (1.5e10) and that Amazon are currently adding about 1B objects a month.

I’ll leave the other assumptions alone, for now.

Amazon’s S3 pricing for storage is $0.15 per GB per month. Assume all this data is stored on their cheaper US servers and that objects take on average 1K bytes. So that’s roughly 1.5e10 * 1e3 / 1e9 = 1.5e4 gigabytes in storage, for which Amazon charges $0.15 per month, or $2250.

Next, let’s do incoming data transfer cost, at $0.10 per GB. That’s simply 2/3rds of the data storage charge, so we add another 2/3 * $2250, or $1500.

Then the PUT requests that transmit the new objects: 1B new objects were added in the last month. Each of those takes a PUT, and these are charged at $0.01 per thousand, so that’s 1e9 / 1e3 * $0.01, or $10,000.

Lastly, some of the stored data is being retrieved. Some will just be backups, and never touched, and some will simply not be looked at in a given month. Let’s assume that just 1% of all (i.e., not just the new) objects and data are retrieved in any given month.

That’s 1.5e10 * 1e3 * 0.01 / 1e9 = 150 GB of outgoing data, or 0.15 TB. That’s much less than 10TB, so all this goes out at the highest rate, $0.18 per GB, giving another $27 in revenue.

And if 1% of objects are being pulled back, that’s 1.5e10 * 0.01 = 1.5e8 GET operations, which are charged at $0.01 per 10K. So that’s 1.5e8 / 1e4 * $0.01 = $150 for the GETs.

This gives a total of $2250 + $1500 + $10,000 + $27 + $150 = $13,927 in the last month.
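For anyone who wants to check the arithmetic, here is a minimal sketch of the same back-of-envelope model in Python. The function and parameter names are mine, the defaults are the guesses used above, and the prices are the ones quoted in this post, with all outgoing data billed at the top $0.18 rate. It models my assumptions, not Amazon’s actual billing.

```python
def s3_monthly_revenue(total_objects=1.5e10, new_objects=1e9,
                       avg_object_bytes=1e3, fraction_retrieved=0.01):
    """Rough monthly S3 revenue in dollars, broken down by charge type.

    All defaults are the guesses from this post, not measured values.
    """
    gb_stored = total_objects * avg_object_bytes / 1e9
    charges = {
        # $0.15 per GB per month for storage
        "storage": gb_stored * 0.15,
        # $0.10 per GB incoming, applied to all stored data as in the text
        "transfer_in": gb_stored * 0.10,
        # $0.01 per thousand PUTs, one PUT per new object
        "puts": new_objects / 1e3 * 0.01,
        # $0.18 per GB outgoing (top rate assumed for everything)
        "transfer_out": gb_stored * fraction_retrieved * 0.18,
        # $0.01 per 10K GETs, one GET per retrieved object
        "gets": total_objects * fraction_retrieved / 1e4 * 0.01,
    }
    charges["total"] = sum(charges.values())
    return charges


for name, dollars in s3_monthly_revenue().items():
    print(f"{name:>12}: ${dollars:,.0f}")
```

Changing avg_object_bytes or fraction_retrieved is the quickest way to see how sensitive the total is to the two numbers we know least about.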

And that doesn’t look at all like $11B!

Where did all that revenue go? Mainly it’s not there because Amazon only added 1e9 objects in the last month, not 1e15. That’s six orders of magnitude. So instead of $11B in PUT charges, they make a mere $10K. That’s about enough to pay one programmer.

I created a simple Amazon S3 Model spreadsheet where you can play with the numbers. The cells with the orange background are the variables you can change in the model. The variables we don’t have a good grip on are the average size of objects and the percentage of objects retrieved each month. If you increase average object size to 1MB, revenue jumps to $3.7M.

BTW, the spreadsheet has a simplification: regarding all data as being owned by one user, and using that to calculate download cost. In reality there are many users, and most of them will be paying for all their download data at the top rate. Also note that my % of objects retrieved is a simplification. Better would be to estimate how many objects are retrieved (i.e., including objects being retrieved multiple times) as well as estimating the download data amount. I roll these both into one number.

Google maps gets SFO location waaaay wrong

Monday, January 28th, 2008

Before leaving Barcelona yesterday morning, I checked Google maps to get driving directions from San Francisco International airport (SFO) to a friend’s place in Oakland.

Google got it way wrong. Imagine trying to follow these instructions if you didn’t know they were so wrong. Click on the image to see the full sized map. Google maps is working again now.

Amazon S3 to rival the Big Bang?

Monday, January 28th, 2008

Note: this posting is based on an incorrect number from an Amazon slide. I’ve now re-done the revenue numbers.

We’ve been playing around with Amazon’s Simple Storage Service (S3).

Adam Selipsky, Amazon VP of Web Services, has put some S3 usage numbers online (see slides 7 and 8). Here are some numbers on those numbers.

There were 5,000,000,000 (5e9) objects inside S3 in April 2007 and 10,000,000,000,000 (1e13) in October 2007. That means that in October 2007, S3 contained 2,000 times more objects than it did in April 2007. That’s a 26 week period, or 182 days. 2,000 is roughly 2^11. That means that S3 is doubling its number of objects roughly once every 182/11 = 16.5 days. (That’s supposing that the growth is merely exponential – i.e., that the logarithm of the number of objects is increasing linearly. It could actually be super-exponential, but let’s just pretend it’s only exponential.)

First of all, that’s simply amazing.

It’s now 119 days since the beginning of October 2007, so we might imagine that S3 now has 2^(119/16.5) or about 150 times as many objects in it. That’s 1,500,000,000,000,000 (1.5e15) objects. BTW, I assume by object they mean a key/value pair in a bucket (these are put into and retrieved from S3 using HTTP PUT and GET requests).
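The doubling estimate and the projection are easy to reproduce. Here’s a small sketch, assuming pure exponential growth between the two data points; the function names are just for illustration.

```python
import math


def doubling_time_days(start_count, end_count, elapsed_days):
    """Days per doubling, assuming steady exponential growth."""
    doublings = math.log2(end_count / start_count)
    return elapsed_days / doublings


def growth_factor(days, doubling_days):
    """How many times larger the count is after `days` of such growth."""
    return 2 ** (days / doubling_days)


# April 2007 -> October 2007, using the (suspect) slide numbers.
print(f"doubling every {doubling_time_days(5e9, 1e13, 182):.1f} days")

# 119 days of growth at one doubling per 16.5 days.
print(f"growth since October: about {growth_factor(119, 16.5):.0f}x")
```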

Amazon’s S3 pricing for storage is $0.15 per GB per month. Assume all this data is stored on their cheaper US servers and that objects take on average 1K bytes. These seem reasonable assumptions. (A year ago at ETech, SmugMug CEO Don MacAskill said they had 200TB of image data in S3, and images obviously occupy far more than 1K each. So do backups.) So that’s roughly 1.5e15 * 1K / 1G = 1.5e9 gigabytes in storage, for which Amazon charges $0.15 per month, or $225M.

That’s $225M in revenue per month just for storage. And growing rapidly – S3 is doubling its number of objects every 2 weeks, so the increase in storage might be similar.

Next, let’s do incoming data transfer cost, at $0.10 per GB. That’s simply 2/3rds of the data storage charge, so we add another 2/3 * $225M, or $150M.

What about the PUT requests, that transmit the new objects?

If you’re doubling every 2 weeks, then in the last month you’ve doubled twice. So that means that a month ago S3 would have had 1.5e15 / 4 = 3.75e14 objects. That means 1.125e15 new objects were added in the last month! Each of those takes an HTTP PUT request. PUTs are charged at one penny per thousand, so that’s 1.125e15 / 1000 * $0.01.

Correct me if I’m wrong, but that looks like $11,250,000,000.

To paraphrase a scene I loved in Blazing Saddles (I was only 11, so give me a break), that’s a shitload of pennies.

Lastly, some of that stored data is being retrieved. Some will just be backups, and never touched, and some will simply not be looked at in a given month. Let’s assume that just 1% of all (i.e., not just the new) objects and data are retrieved in any given month.

That’s 1.5e15 * 1K * 1% / 1e9 = 15M GB of outgoing data, or 15K TB. Let’s assume this all goes out at the lowest rate, $0.13 per GB, giving another $2M in revenue.

And if 1% of objects are being pulled back, that’s 1.5e15 * 1% = 1.5e13 GET operations, which are charged at $0.01 per 10K. So that’s 1.5e13 / 10K * $0.01 = $15M for the GETs.

This gives a total of $225M + $150M + $11,250M + $2M + $15M = $11,642M in the last month. That’s $11.6 billion. Not a bad month.

Can this simple analysis possibly be right?

It’s pretty clear that Amazon are not making $11B per month from S3. So what gives?

One hint that they’re not making that much money comes from slide 8 of the Selipsky presentation. That tells us that in October 2007, S3 was making 27,601 transactions per second. That’s about 7e10 per month. If Amazon was already doubling every two weeks by that stage, then 3/4s of their 1e13 S3 objects would have been new that month. That’s 7.5e12, which is 100 times more transactions just for the incoming PUTs (no outgoing) than are represented by the 27,601 number. (It’s not clear what they mean by transaction – I mean what goes on in a single transaction.)
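Here’s that sanity check as a few lines of Python, assuming a 30-day month and the two-doublings-a-month growth from above. The variable names are mine.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600        # assuming a 30-day month

transactions_per_second = 27_601          # from slide 8
transactions_per_month = transactions_per_second * SECONDS_PER_MONTH

objects_in_october = 1e13                 # the (suspect) slide figure
# Two doublings in a month means 3/4 of the objects are new that month.
new_objects = 0.75 * objects_in_october

ratio = new_objects / transactions_per_month
print(f"{transactions_per_month:.2e} transactions per month")
print(f"{new_objects:.2e} PUTs implied, about {ratio:.0f}x the reported transactions")
```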

So something definitely doesn’t add up there. It may be more accurate to divide the revenue due to PUTs by 100, bringing it down to a measly $110M.

An unmentioned assumption above is that Amazon is actually charging everyone, including themselves, for the use of S3. They might have special deals with other companies, or they might be using S3 themselves to store tons of tiny objects. I.e., we don’t know that the reported number is of paid objects.

There’s something of a give away the razors and charge for the blades feel to this. When you first see Amazon’s pricing, it looks extremely cheap. You can buy external disk space for, e.g., $100 for 500GB, or $0.20 per GB. Amazon charges you just $0.18 per GB for replicated storage. But that’s per month. A disk might last you two years, so we could conclude that Amazon is e.g., 8 or 12 times more expensive, depending on the degree of replication. But you don’t need a data center or to grow (or shrink) a data center, cooling, employees, replacement disks—all of which have been noted many times—so the cost perhaps isn’t that high.

But…. look at those PUT requests! If an object is 1K (as above), it takes 500M of them to fill a 500GB disk. Amazon charges you $0.01 per 1000, so that’s 500K * $0.01 or $5000. That’s $10 per GB just to access your disk (i.e., before you even think about transfer costs and latency), which is about 50 times the cost of disk space above.
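The same sum in Python, in case you want to try other object sizes. The disk price and the 1K object size are the assumptions from above, and the variable names are only for illustration.

```python
disk_gb = 500                 # size of the external disk
disk_price = 100.0            # dollars, as assumed above
object_bytes = 1e3            # 1K per object, as assumed above
put_price_per_1000 = 0.01     # dollars per thousand PUTs

# Number of 1K objects needed to fill the disk.
objects_to_fill = disk_gb * 1e9 / object_bytes

# What those PUTs cost in total, and per GB stored.
put_cost = objects_to_fill / 1000 * put_price_per_1000
put_cost_per_gb = put_cost / disk_gb

# Compare with simply buying the disk.
disk_cost_per_gb = disk_price / disk_gb

print(f"{objects_to_fill:.0e} PUTs cost ${put_cost:,.0f}")
print(f"${put_cost_per_gb:.2f} per GB in PUTs vs ${disk_cost_per_gb:.2f} per GB of disk")
```

Bump object_bytes up and the PUT cost per GB falls in proportion, which is exactly the incentive to pack more data into each object.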

In paying by the PUT and GET, S3 users are in effect paying Amazon for the compute resources needed to store and retrieve their objects. If we estimate it taking 10ms for Amazon to process a PUT, then 1000 takes 10 seconds of compute time, for which Amazon charges $0.01. A 30-day month has about 2.6M seconds, so that’s nearly $2.6K per month being paid for a machine to do PUT storage, which is about 37 times more expensive than what Amazon would charge you to run a small EC2 instance for a month. Such a machine probably costs Amazon around $1500 to bring into service. So there’s no doubt they’re raking it in on the PUT charges. That makes the 5% margins of their retailing operation look quaint. Wall Street might soon be urging Bezos to get out of the retailing business.

Given that PUTs are so expensive, you can expect to see people encoding lots of data into single S3 objects, transmitting them all at once (one PUT), and decoding when they get the object back. That pushes programmers towards using more complex formats for their data. That’s a bad side-effect. A storage system shouldn’t encourage that sort of thing in programmers.

Nothing can double every two weeks for very long, so that kind of growth simply cannot continue. It may have leveled out in October 2007, which would make my numbers off by roughly 2^(119/16.5) or about 150, as above.

When we were kids they told us that the universe has about 2^80 particles in it. 1.5e15 is already about 2^50, so only 30 more doublings are needed, which would take Amazon just over a year. At that point, even if all their storage were in 1TB drives and objects were somehow stored in just 1 byte each, they’d still need about 2^40 disk drives. The earth has a surface area of 510,065,600 km² so that would mean over 2000 Amazon disk drives in each square kilometer on earth. That’s clearly not going to happen.
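The particle arithmetic can be checked the same way. This sketch takes the 2^80 figure, the 16.5-day doubling time and a 1TB drive counted as 2^40 bytes as given; all of those are the rough assumptions from this post, not measurements.

```python
import math

PARTICLES = 2 ** 80            # the figure quoted above, taken at face value
current_objects = 1.5e15       # the projected object count
doubling_days = 16.5           # one doubling every 16.5 days

# How many more doublings until there is one object per particle?
doublings_left = math.log2(PARTICLES / current_objects)
days_left = doublings_left * doubling_days

# One byte per object on 1TB drives (1TB counted as 2**40 bytes).
drive_bytes = 2 ** 40
drives = PARTICLES // drive_bytes

earth_km2 = 510_065_600
drives_per_km2 = drives / earth_km2

print(f"{doublings_left:.1f} doublings left, about {days_left:.0f} days")
print(f"{drives:,} drives, about {drives_per_km2:.0f} per square km")
```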

It’s also worth bearing in mind that Amazon claims data stored into S3 is replicated. Even if the replication factor is only 2, that’s another doubling of the storage requirement.

At what point does this growth stop?

Amazon has its Q4 2007 earnings call this Wednesday. That should be revealing. If I had any money I’d consider buying stock ASAP.

The Black Swan

Saturday, January 26th, 2008

I got a copy of The Black Swan: The Impact of the Highly Improbable for xmas.

In London a couple of weeks ago I pointed it out to Russell as we wandered through a Waterstones. He picked it up, flipped it open, and immediately began to make deadly and merciless fun of it.

For me this is the kind of book I know I’ll want to read if it’s any good, and which I know I’ll (try to) read in any case because these days I’m meeting the kind of people who like to refer to this sort of book. Not wanting to look like I’m not up to speed on the latest popular science, I’ll read for as long as I can bear it.

There are lots of books in this category. E.g., The Tipping Point, which I enjoyed, The Wisdom of Crowds, which I found so annoying and bad that I had to stop reading it, and A Short History of Nearly Everything which was semi-amusing and which I made myself finish despite having much better things to read. There’s also Everything is Miscellaneous, which I enjoyed a lot, and Predictably Irrational: The Hidden Forces That Shape Our Decisions, which I’ve yet to get hold of. You know the type.

I went to bed early (3am) the other night so I could read a bit of the Black Swan before I went to sleep.

I got about 2 pages in and found it so bad that I almost had to put it down. The prologue is a dozen pages long. I forced myself to read the whole thing.

It’s dreadful, it’s pretentious, it’s vague, it’s silly, it’s obvious, it’s parenthesized and qualified beyond belief, it’s full of the author’s made-up names for things (Black Swan, antiknowledge, empty suits, GIF, Platonicity, Platonic fold, nerdified, antilibrary, extremistan, mediocristan), it’s self-indulgent, it’s trite. It’s a painfully horrible introduction to what I’d hoped would be a good book.

It was so bad that I couldn’t believe it could go on, so I decided to keep reading. This is published by Random House, who you might hope would know better. But I guess they know a smash hit popular theme and title when they see it, and they’ll publish it, even if they know the style is appalling and for whatever reason they don’t have the leverage to force changes.

Fortunately though, the book improves.

The guy is obviously very smart and has been thinking about some of this for a long time, he has an unconventional take on many things, and he does offer insights. I am still finding the style annoying, but I have a feeling I will finish it and I know for sure I’ll take some lessons away. I’m up to page 56, with about 250 to go. I suppose I’ll blog about it again if it seems worthwhile.

If you’re contemplating reading it, I suggest jumping in at Chapter 3.

I’m off to read a bit more now.

Worst of the web award: MIT/Stanford Venture Lab

Friday, January 25th, 2008

I’ve just awarded one of my coveted Worst of the Web awards to the MIT/Stanford Venture Lab.

Here’s why. They are hosting a video I’d like to watch. You can see it on their home page right now, Web 3.0: New Opportunities on the Semantic Web.

If you click on that link, wonderful things happen.

You get taken to a page with a Watch Online link. Clicking on it tells you that this is a “Restricted Article!” and that you need to register to see the video. Another click and you’re faced with a page that gives you four registration options: Volunteers, Board Members, Standard Members, or Sponsors. Below each of them it says “Rates: Membership price: $0.00”.

Ok, so we’re going to pay $0.00 to sign up for a free video. That takes me to a page with 15 fields, including “Billing Address”. If you leave everything blank and try clicking through, it tells you “A user account with the same email you entered already exists in the system.” But I left the email field empty.

When you fill in email and your name, you get to confirm your purchase: Review your order details. If all appears ok, click “Submit Transaction ->” to finalize the transaction. There’s a summary of the charges, with Price and Total columns, Sub-totals, Tax, Shipping, Grand Total – all set to $0.00. There’s a button labeled “Submit Transaction” and a warning: “Important: CLICK ONCE ONLY to avoid being charged twice.”

You then wind up on a profile page with no fewer than 54 fields! Scroll to the bottom, take yourself off the mailing list, then “Update profile”.

OK, so you’re registered. The top left of the screen has your user name, and the top right has a link labeled “Sign Out”. So you’re apparently logged in too.

Now you go back to the home page, and click on the link for the video. Then click on the Watch Online link. And it tells you this is a “Restricted Article!” and that if you’re already a member you can log in. But I thought I was logged in?

OK…. click to log in. There’s a field for email address and password. What password? Hmmm. I can click to have it reset, so I do that. A password and log-in link arrives in email.

I follow the link and log in. I go back to the home page. I click on the link to the video I want. I click on Watch Online.

Now I get a screen with a flash player in it. It says Please wait. Apparently forever. I wait ten minutes and begin to blog about my wonderful experience at the MIT/Stanford Venture Lab.

The video never loads.

I actually went through this process twice to verify the steps. The first time was a bit more complex, believe it or not, and involved a Captcha. Also, the two welcome mails I got from signing up were totally different! One looked like

Dear Terry,

Welcome to You are now ready to enjoy the many benefits our site offers its registered users.

Please login using:
Password: lksjljls

For your convenience, you can change your password to something more easily remembered once you sign in.

and the other also greeted me and finally, as a footnote, at the very end of the mail after the goodbye:

IMPORTANT: Your account is now active. To log in, go to and use “i2nosjf3p” as your temporary password.

So weird.

And then, to top off the whole thing, I get a friendly email greeting which includes the following:

Dear Terry,

Thank you, and welcome to our community.

Your purchase of Standard Members for the amount of $0 entitles you to enjoy more of our activities, gain greater access to site functionality, and enhance your overall experience with us.

Your Standard Members is now valid and will expire on January 16th, 2038.

You couldn’t make this stuff up. It’s 2008. We’re trying to look at a free online video. Hosted by MIT/Stanford of all people. We’re prepared to jump through hoops! We’ll even risk being billed $0.00 multiple times! But no cigar.

Final straws for Mac OS X

Thursday, January 24th, 2008

I’ve had it with Mac OS X.

I’m going to install Linux on my MacBook Pro laptop in March once I’m back from ETech.

I’ve been thinking about this for months. There are just so many things I don’t like about Mac OS X.

Yes, it’s beautiful, and there are certainly things I do like (e.g., iCal). But I don’t like:

  • Waiting forever when I do an rm on a big tree
  • Sitting wondering what’s going on when I go back to a Terminal window and it’s unresponsive for 15 seconds
  • Weird stuff like this
  • Case-insensitive file names (see above problem)
  • Having applications often freeze and crash. E.g. emacs, which basically never crashes under Linux

I could go on. I will go on.

I don’t like it when the machine freezes, and that happens too often with Mac OS X. I used Linux for years and almost never had a machine lock up on me. With Mac OS X I find myself doing a hard reset about once a month. That’s way too flaky for my liking.

Plus, I do not agree to trade a snappy OS experience for eye candy. I’ll take both if I can have them, but if it’s a choice then I’ll go back to X windows and Linux desktops and fonts and printer problems and so on – all of which are probably even better than they already were a few years back.

This machine froze on me 2 days ago and I thought “Right. That’s it.” When I rebooted, it was in a weird magnifying glass mode, in which the desktop was slightly magnified and moved around disconcertingly whenever I moved the mouse. Rebooting didn’t help. Estéve correctly suggested that I somehow had magnification on. But, how? WTF is going on?

And, I am not a fan of Apple.

In just the last two days, we have news that 1. Apple crippled its DTrace port so you can’t trace iTunes, and 2. Apple QuickTime DRM Disables Video Editing Apps so that Adobe’s After Effects video editing software no longer works after a QuickTime update.

It’s one thing to use UNIX, which I have loved for over 25 years, but it’s another thing completely to be in the hands of a vendor who (regularly) does things like this while “upgrading” other components of your system.

Who wants to put up with that shit?

And don’t even get me started on the iPhone, which is a lovely and groundbreaking device, but one that I would never ever buy due to Apple’s actions.

I’m out of here.

Understanding high-dimensional spaces

Wednesday, January 23rd, 2008

I’ve spent lots of time thinking about high-dimensional spaces, usually in the context of optimization problems. Many difficult problems that we face today can be phrased as problems of navigating in high-dimensional spaces.

One problem with high-dimensional spaces is that they can be highly non-intuitive. I did a lot of work on fitness landscapes, which are a form of high dimensional space, and ran into lots of cases in which problems were exceedingly difficult because it’s not clear how to navigate efficiently in such a space. If you’re trying to find high points (e.g., good solutions), which way is up? We’re all so used to thinking in 3 dimensions. It’s very easy to do the natural thing and let our simplistic lifelong physical and visual 3D experience influence our thinking about solving problems in high-dimensional spaces.

Another problem with high-dimensional spaces is that we can’t visualize them unless they are very simple. You could argue that an airline pilot in a cockpit monitoring dozens of dials (each dial gives a reading on one dimension) does a pretty good job of navigating a high-dimensional space. I don’t mean the 3D space in which the plane is flying, I mean the virtual high-dimensional space whose points are determined by the readings on all the instruments.

I think that’s true, but the landscape is so smooth that we know how to move around on it pretty well. Not too many planes fall out of the sky.

Things get vastly more difficult when the landscape is not smooth. In fact they get positively weird. Even with trivial examples, like a hypercube, things get weird fast. For example, if you’re at a vertex on a hypercube, exactly one half of the space is reachable in a single step. That’s completely non-intuitive, and we haven’t even put fitness numbers on the nodes. When I say fitness, I mean goodness, or badness, or energy level, or heuristic, or whatever it is you’re dealing with.

We can visually understand and work with many 3D spaces (though 3D mazes can of course be hard). We can hold them in our hands, turn them around, and use our visual system to help us. If you had to find the high-point looking out over a collection of sand dunes, you could move to a good vantage point (using your visual system and understanding of 3D spaces) and then just look. There’s no need to run an optimization algorithm to find high points, avoiding getting trapped in local maxima, etc.

But that’s not the case in a high-dimensional space. We can’t just look at them and solve problems visually. So we write awkward algorithms that often do exponentially increasing amounts of work.

If we can’t visually understand a high-dimensional space, is there some other kind of understanding that we could get?

If so, how could we prove that we understood the space?

I think the answer might be that there are difficult high-dimensional spaces that we could understand, and demonstrate that we understand them.

One way to demonstrate that you understand a 3D space is to solve puzzles in it, like finding high points, or navigating over or through it without crashing.

We can apply the same test to a high-dimensional space: build problems and see if they can be solved on the fly by the system that claims to understand the space.

One way to do that is the following.

Have a team of people who will each sit in front of a monitor showing them a 3D scene. They’ll each have a joystick that they can use to “fly” through the scene that they see. You take your data and give 3 dimensions to each of the people. You do this with some degree of dimensional overlap. Then you let the people try to solve a puzzle in the space, like finding a high point. Their collective navigation gives you a way to move through the high-dimensional space.

You’d have to allocate dimensions to people carefully, and you’d have to do something about incompatible decisions. But if you built something like this (e.g., with 2 people navigating through a 4D space), you’d have a distributed understanding of the high-dimensional space. No one person would have a visual understanding of the whole space, but collectively they would.
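To make the allocation step a bit more concrete, here is a rough sketch of one way it could go. Everything in it is an assumption of mine: each person gets a sliding window of 3 dimensions, neighbouring windows share some dimensions, and incompatible decisions are settled by simply averaging the moves proposed for a shared dimension. There are surely better ways to resolve conflicts.

```python
def allocate_dimensions(n_dims, per_person=3, overlap=1):
    """Give each person a window of `per_person` dimensions.

    Consecutive windows share `overlap` dimensions; the last window is
    shifted back so it ends on the final dimension.
    """
    if n_dims < per_person:
        raise ValueError("need at least per_person dimensions")
    if not 0 <= overlap < per_person:
        raise ValueError("overlap must be in [0, per_person)")
    step = per_person - overlap
    windows = []
    start = 0
    while True:
        end = start + per_person
        if end >= n_dims:
            windows.append(list(range(n_dims - per_person, n_dims)))
            break
        windows.append(list(range(start, end)))
        start += step
    return windows


def combine_moves(windows, moves, n_dims):
    """Average each person's proposed move over the dimensions they see."""
    totals = [0.0] * n_dims
    counts = [0] * n_dims
    for dims, move in zip(windows, moves):
        for dim, delta in zip(dims, move):
            totals[dim] += delta
            counts[dim] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Two people sharing a 4D space, as in the example above.
windows = allocate_dimensions(4, per_person=3, overlap=2)
print(windows)
print(combine_moves(windows, [[1, 0, 0], [1, 0, -1]], 4))
```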

In a way it sounds expensive and like overkill. But I think it’s pretty easy to build and there’s enormous value to be had from doing better optimization in high-dimensional spaces.

All we need is a web server hooked up to a bunch of people working on Mechanical Turk. Customers upload their high-dimensional data, specify what they’re looking for, the data is split by dimension, and the humans do their 3D visual thing. If the humans are distributed and don’t know each other they also can’t collude to steal or take advantage of the data – because they each only see a small slice.

There’s a legitimate response that we already build systems like this. Consider the hundreds of people monitoring the space shuttle in a huge room, each in front of a monitor. Or even a pilot and co-pilot in a plane, jointly monitoring instruments (does a co-pilot do that? I don’t even know). Those are teams collectively understanding high-dimensional spaces. But they’re, in the majority of cases, not doing overlapping dimensional monitoring, and the spaces they’re working in are probably relatively smooth. It’s not a conscious effort to collectively monitor or understand a high-dimensional space. But the principle is the same, and you could argue that it’s a proof the idea would work – for sufficiently non-rugged spaces.

Apologies for errors in the above – I just dashed this off ahead of going to play real football in 3D. That’s a hard enough optimization problem for me.

Giselle is served an apple martini, but she doesn’t drink it

Sunday, January 20th, 2008

Well that’s a relief.

One email a day

Saturday, January 19th, 2008

I’ve got my email inbox locked down so tightly that only one email made it through today. That’s down from several hundred a day just a few weeks ago.

All the email that doesn’t make it immediately into my inbox gets filed elsewhere. I deal with it all quickly – either deleting stuff (mailing lists), saving, or replying and then saving.

I’m spending way less time looking at my inbox wondering what I didn’t reply to in a list of a few thousand emails. That’s good. I’m spending less time blogging. I haven’t been on Twitter for ages.

In the productivity corner, I somehow managed (with help) to get a 3 meter whiteboard up here and onto the wall. It’s fantastic. I spend 2+ hours every morning talking with Estéve, drawing circles, lines, trees, and random scrawly notes. Today I sat talking to him in my chair while using my laser pointer (thanks Derek!) to point to things on the whiteboard. Ah, the luxury.

Free wifi at Stansted

Friday, January 11th, 2008

I’m at Stansted heading back to Barcelona. There’s free wifi here (on the merula network in the waiting area for gates 1-19), for the first time I’ve seen it. At first I didn’t understand their web page, then I read the login box which clearly says to enter merula as user name and password. It works.

Wifi on a bus

Friday, January 11th, 2008

I’m on the X90 National Express bus from Oxford to London. At the bus stop before we left I pulled out my laptop to do some work on a presentation. I noticed there was an open wifi signal and thought I’d connect quickly to pick up my mail.

It turns out the wifi network is on the bus.

I’m now speeding down the motorway, it’s gray and raining outside, and I’m sitting here warm and online. I suppose all National Express buses have wifi. One day it will be a rarity not to have network access. Today is the first time I’ve had access from a bus. Nice.

Tagging in the year 3000 (BC)

Friday, January 4th, 2008

Jimmy Guterman recently called Marcel Proust an Alpha Geek and asked for thoughts on “what from 100 years ago might be the hot new technology of 2008?”

Here’s something about 5000 years older. As a bonus there’s a deep connection with what Fluidinfo is doing.

Alex Wright recently wrote GLUT: Mastering Information Through the Ages. The book is good. It’s a little dry in places, but in others it’s really excellent. I especially enjoyed the last 2 chapters, “The Web that Wasn’t” and “Memories of the Future”. GLUT has a non-trivial overlap with the even more excellent Everything is Miscellaneous by David Weinberger.

In chapter 4 of GLUT, “The Age of Alphabets”, Wright describes the rise of writing systems around 3000 BC as a means of recording commercial transactions. The details of the transactions were written onto a wet clay tablet, signed by the various parties, and then baked. Wright (p50) continues:

Once the tablet was baked, the scribe would then deposit it on a shelf or put it in a basket, with labels affixed to the outside to facilitate future search and retrieval.

There are two comments I want to make about this. One is a throwaway answer to Jimmy Guterman’s request, but the other deserves consideration.

Firstly, this is tagging. Note that the tags are attached after the data is put onto the clay tablet and it is baked. This temporal distinction is important – it’s not like other mentions of metadata or tagging given by Wright (e.g., see p51 and p76). Tags could presumably have different shapes or colors, and be removed, added to, etc. Tags can be attached to objects you don’t own – like using a database to put tags on a physically distant web page you don’t own. No-one has to anticipate all the tag types, or the uses they might be put to. If a Sumerian scribe decided to tag the best agrarian deals of 3000 BC or all deals involving goats, he/she could have done it just as naturally as we’d do it today.

Secondly, I find it very interesting to consider the location of information here and in other systems. The tags that scribes were putting on tablets in 3000 BC were stored with the tablets. They were physically attached to them. I think that’s right-headed. To my mind, the tag information belongs with the object that’s being tagged. In contrast, today’s online tagging systems put our tags in a physically separate location. They’re forced to do that because of the data architecture of the web. The tagging system itself, and the many people who may be tagging a remote web page, don’t own that page. They have no permission to alter it.

Let’s follow this thinking about the location of information a little further…

Later in GLUT, Wright touches on how the card catalog of libraries became separated from the main library content, the actual books. Libraries became so big and accumulated so many volumes that it was no longer feasible to store the metadata for each volume with the volume. So that information was collected and stored elsewhere.

This is important because the computational world we all inhabit has similarly been shaped by resource constraints. In our case the original constraints are long gone, but we continue to live in their shadow.

I’ll explain.

We all use file systems. These were designed many decades ago for a computing environment that no longer exists. Machines were slow. Core and disk memory were tiny. Fast indexing and retrieval algorithms had yet to be invented. Today, file content and file metadata are firmly separated. File data is in one place while file name, permissions, and other metadata are stored elsewhere. That division causes serious problems. The two systems need different access mechanisms. They need different search mechanisms.

Now would be a good time to ask yourself why it has traditionally been almost impossible to find a file based simultaneously on its name and its content.
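Here’s a small Python sketch of what such a combined search looks like with ordinary tools. The function name and arguments are hypothetical. Note the two separate steps: names are matched against directory entries, while content can only be checked by opening and reading every candidate file.

```python
from pathlib import Path


def find_by_name_and_content(root, name_pattern, needle):
    """Return files under `root` whose name matches and whose bytes contain `needle`."""
    matches = []
    # Step 1: the name lookup walks directory entries (the "card catalog").
    for path in Path(root).rglob(name_pattern):
        if not path.is_file():
            continue
        # Step 2: the content check has to go and read the file itself.
        try:
            data = path.read_bytes()
        except OSError:
            continue  # unreadable file, skip it
        if needle in data:
            matches.append(path)
    return matches
```

For example, find_by_name_and_content(".", "*.txt", b"goats") would list the text files under the current directory that mention goats, at the cost of reading every .txt file it finds.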

Our file systems are like our libraries. They have a huge card catalog just inside the front door (at the start of the disk), and that’s where you go to look things up. If you want the actual content you go fetch it from the stacks. Wandering the stacks without consulting the catalog is a little like reading raw disk blocks at random (that can be fun btw).
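To make the catalog-versus-stacks split concrete, here is a minimal Python sketch of a search on name and content at the same time. The function name find_by_name_and_content and the sample files are made up for illustration. Note how the two halves of the query go through different mechanisms: names come from walking directory metadata, while content needs a separate read of each file's data.

```python
import fnmatch
import os
import tempfile


def find_by_name_and_content(root, name_pattern, needle):
    """Return sorted paths under root whose file name matches name_pattern
    and whose content contains needle."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Step 1, the "card catalog": names come from directory metadata.
        for filename in fnmatch.filter(filenames, name_pattern):
            path = os.path.join(dirpath, filename)
            # Step 2, the "stacks": content needs a separate read of the data.
            try:
                with open(path, "r", errors="ignore") as handle:
                    if needle in handle.read():
                        matches.append(path)
            except OSError:
                continue  # unreadable file: skip it
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical sample data, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        samples = [("goats.txt", "deal: 3 goats"),
                   ("grain.txt", "deal: 10 sacks"),
                   ("goats.log", "deal: 5 goats")]
        for name, text in samples:
            with open(os.path.join(root, name), "w") as handle:
                handle.write(text)
        found = find_by_name_and_content(root, "*.txt", "goats")
        print([os.path.basename(p) for p in found])  # ['goats.txt']
```

This works, but only by brute force: every candidate file has to be opened and read, because nothing indexes names and content together.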

But libraries and books are physical objects. They’re big and slow and heavy. They have ladders and elevators and are traversed by short-limbed humans with bad eyesight. Computers do not have these characteristics. By human standards, they are almost infinitely fast and their storage is cheap and effectively infinite. There’s no longer any reason for computers to separate data from metadata. In fact there’s no need for a distinction between the two. As David Weinberger put it, in the real world “everything is metadata”. So it should be in the computer world as well.

In other words, I think it is time to return to a more natural system of information storage. A little like the tagging we were doing in 3000 BC.

Several things will have to change if we’re to pull this off. And that, gentle reader, is what Fluidinfo is all about.

Stay tuned.

Both my kids beat me at Connect 4

Friday, January 4th, 2008

Image: katypearce

My 2 older kids got Connect 4 for xmas.

I’ve liked Connect 4 for a long time. The first TCP/IP socket programming I ever did was in 1987 and it was code to let two people on the net play Connect 4 against each other, with graphics done using curses code written with Andrew Hensel. Later I wrote a machine opponent that used some form of Alpha-beta pruning and which was popular among a few CS grad students at the University of Waterloo. Amazingly, you can still find traces of my youthful code (and function names!) online. I like/d to think I am/was a pretty good player.

So you can imagine my confidence as I walked into the kids’ room and asked them who wanted to be beaten at Connect 4 by the champion of the world. My friend Russell has a take-no-prisoners attitude towards playing games with his kids. He wouldn’t dream of deliberately letting them win at anything. I let mine win very often, and find it hard to imagine how you could teach a small kid to play (say) chess if you don’t give them a chance. Anyway, tonight I decided I was going to show no mercy and whip them repeatedly at Connect 4.

I was so wrong.

At xmas just a couple of weeks ago I remember explaining the game to Sofia (8), and thinking what a vast gap existed between her understanding of the game and mine. Of course she quickly got the idea, but she had no idea at all of strategy. Lucas (6) came up during the explanation and of course had to be included, which meant an even more painstaking explanation from the champion of the world to his tabula rasa midgets.

Yesterday Ana told me that the kids, Sofia especially, were getting quite good. I smiled a knowing smile, and inside I scoffed.

Tonight I played Sofia in the first game and won fairly quickly. I told them we were going to play winner stays on, and so I then faced Lucas.

And the little bugger beat me. Fair and square he got me good, knew exactly what he was doing, and celebrated like a wild animal as he dropped the winning piece, while I sat there in shock with a huge smile on my face.

When I finally got back into the game I was up against Sofia. She proceeded to beat me too.

Amazing. Great. Funny. Alarming. How is this possible?

It reminds me of when I was about 12. My father was trying to figure out how to connect something with some cables. I took a look and told him what to do. I’ll never forget it. He knew I was right and he looked straight at me and said “how come you’re smarter than I am?” I guess I shrugged, but inside I was thinking “yep”.

Pride before a fall. Multiple falls. And you wouldn’t want it any other way, of course.

Still, they might have waited a few more years before mowing me down.

More email customization

Friday, January 4th, 2008

My recent email changes are working out well. Yesterday morning I woke up and didn’t read email. That’s because I didn’t have any email!

Well, I did, but procmail had filed it all into mail/incoming/IN-20080103.spool because none of it needed immediate attention. I have set VM up so that it knows to look for an x.spool file if I ask it to visit a file called x. That’s one line of elisp in VM: (setq vm-spool-file-suffixes (list ".spool")).

I like this setup because 1) it keeps my main inbox almost empty, 2) it keeps non-essential emails out of my face, and 3) it puts pressure on me to quickly deal with stuff that collects in the daily file, because I know that if I don’t it’s going to be forgotten.

And how to get to the daily file when I do decide to go look? Yes, another little piece of code:

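  (define-key vm-mode-map "i"
    '(lambda ()
       (interactive)  ; a key binding needs an interactive command
       ;; vm-visit-folder is assumed to be the intended call here;
       ;; the .spool suffix is found via vm-spool-file-suffixes.
       (vm-visit-folder
         (concat "~/mail/incoming/IN-"
                 (format-time-string "%Y%m%d")))))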

which simultaneously defines a function to take me (in VM, in emacs) to today’s file and puts that function on the “i” key in VM. So I just hit a single key and I’m automatically looking at the non-time-critical mail file for the day. I’ll probably write a little function to take me to yesterday’s too.
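The file-naming scheme is simple enough to sketch outside emacs too. Here is a rough Python version (not the elisp I actually use) that builds the name of the daily file for any date, including yesterday's. The function names daily_folder and yesterdays_folder are invented for this example; the IN-YYYYMMDD pattern and the .spool suffix are the ones described above.

```python
from datetime import date, timedelta


def daily_folder(day, base="~/mail/incoming", suffix=""):
    """Name of the daily mail file for `day`, e.g. ~/mail/incoming/IN-20080103."""
    return base + "/IN-" + day.strftime("%Y%m%d") + suffix


def yesterdays_folder(today, base="~/mail/incoming", suffix=""):
    """Name of the daily mail file for the day before `today`."""
    return daily_folder(today - timedelta(days=1), base=base, suffix=suffix)


if __name__ == "__main__":
    print(daily_folder(date(2008, 1, 3), suffix=".spool"))
    # ~/mail/incoming/IN-20080103.spool
    print(yesterdays_folder(date(2008, 1, 1)))
    # ~/mail/incoming/IN-20071231
```

Subtracting a timedelta rather than decrementing the day number is what makes the month and year boundaries come out right.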

And yes, I guess this is all highly personalized, but these are things that I do many times a day every day of my life. So I’m happy to streamline them. And all the code is trivial. That’s the most interesting thing. With a tiny bit of code you can do so much and without it you can only do what other programmers thought you might want or need to be able to do.

I just deactivated my Facebook account

Thursday, January 3rd, 2008

I just deactivated my Facebook account. This has nothing to do with Robert Scoble’s account being disabled earlier today, I’m just sick of Facebook. It does nothing whatsoever for me, except send messages that can and would otherwise have been sent in email. I don’t want to use a tool that encourages people to send me messages on a website that I then have to go log in to. I don’t want some website to hold my messages. I like them to be searchable with things like grep. I like to organize them my way. I like email. Apart from receiving messages in a totally unattractive way, Facebook is useless for me – just a steady stream of invitations to things I don’t want to attend from people I don’t know, plus a smattering of cream pies, flying sheep, etc. So I’m outta there. I wonder if I’ll manage to survive.

Amazon just billed me 14 cents

Wednesday, January 2nd, 2008

I’ve been messing around with Esteve setting up an Amazon EC2 machine.

We set up a machine the other day, ssh’d into it, took a look around, and then shut it down a little later. Amazon just sent me a bill:

Greetings from Amazon Web Services,

This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:

Total: $0.14

Please see the Account Activity area of the AWS web site for detailed account information.

Isn’t that cool?

It would certainly cost more than 14 cents to get your hands on your own (virtual) Linux box any other way.

My email setup

Wednesday, January 2nd, 2008

I like customizing my environment. I’ve spent lots and lots of time doing that over the decades.

Some examples: My emacs environment has about 6000 lines of elisp that I’ve written to help me edit. I have over 500 shell scripts in my bin directory (30K lines of code), and certainly hundreds of other scripts around the place to help with other specific tasks. My bash setup is about 2000 lines of shell script.

That’s about 40K lines of code all written just to help me edit and work in the shell.

As a computer user, I’m damned happy I’m a programmer. I don’t think I can imagine what it would be like to be a computer user and not a programmer.

As a non-programmer you’re at the mercy of others. When you run into a problem you don’t have a solution for, you’re either out of luck, you have to spend often huge amounts of time solving it in some contorted semi- or fully-manual way, you have to find someone else’s (likely partial) solution and maybe pay for it, or you ask or pay someone to solve your problem, or you wait for the thing you need to appear in some product, etc. And all the while you’ve got a perfectly good high-speed general-purpose machine sitting right in front of you, likely with all the programming tools you’d need already installed for you…. but you don’t know how to use it!!

How weird is that?

As a programmer when you run into a problem you don’t have a solution for, you can just write your own.

One thing that always surprises me is how little time most other programmers tend to spend customizing their environments. Given 1) that programmers probably spend a large percentage of each day in their editor, in email, and in the shell, 2) that those things can all be programmed (assuming you use emacs :-)), and 3) that programmers usually don’t like repeating themselves, doing unnecessary work or being inefficient, you’d think that programmers would all be spending vast amounts of time getting things set up just so.

FWIW, here’s a description of the email setup I’ve built up over the years.

But first some stats.

I’ve been saving all my incoming and outgoing emails since Sept 19, 1989. I don’t know why I didn’t start earlier – I wish I had. My first 7 years of emailing are lost, almost certainly forever. I’ve sent 125K emails in that time and received 425K. I’ve got all my incoming email split into files by sender, with some overlap, in 6700 files. The total disk usage of all mails is just under 4G. I have 1.1G (compressed) of saved spam. I have 1250 mail aliases in my .mailrc file.

  1. I write mail in emacs, of course. Seeing as email is text, why would you use anything but your text editor to compose it? Not being able to use emacs to edit text is a show-stopper for me when it comes to using software products. Don’t try to make me use an inferior editor. Don’t ask me to edit text in my browser.
  2. All my outgoing mails get dumped into a single file. I occasionally move these files when they get too big. I keep things this way as it’s then really fast to look at stuff I’ve sent, which I do frequently. I have shell commands called o, oo, ooo etc., to show me the last (second last, etc) of these files (starting at bottom) instantly.
  3. I read mail in emacs (using VM). I could do that differently, but email is (usually) text and I want to copy it, paste it, edit it, reply to it, etc. I also use the emacs supercite package, smart paragraph filling, automatic alias expansion, etc. All that has been standard in emacs for at least 10 years, but it’s still not available in tons of “modern” email readers.
  4. VM recognizes the 37 email addresses I’ve used over the years as indicating a mail is from me (and so doesn’t put that address in any followup line).
  5. I do all my MIME decoding manually. VM knows how to handle most things, I just don’t let it do it until I want it done. That’s mainly a security thing – several years ago I predicted that PDF files would one day be used to trigger buffer overflows, as just happened. I don’t open any attachment of any form from anyone I don’t know (and don’t open them from some people I do know who like to pass along random crap from others).
  6. I have VM figure out exactly where each mail should be saved, based on sending email address. So I never have to make a decision about where to save anything.
  7. I have 154 virtual folders defined in VM. These let me dynamically make a mail folder based on fairly flexible rules (subject, sender, etc). They’re not folders on disk, but are composed from these on the fly. It’s a great feature of VM, highly useful. E.g., I have friends with multiple email addresses – my friend Emily has used 21 email addresses in the last 15 years and I can see all her incoming mail in one virtual folder no matter where she sends it from. Virtual folders can be used for much more than that though.
  8. I have an emacs function that detects if the person sending me mail also uses VM and, if so, lets me know if their version of VM is newer than mine. That way I don’t have to think about upgrading VM – when a friend does it, emacs tells me automatically.
  9. I have VM keys set up to send messages to SpamBayes to teach it that things are spam or ham.
  10. I have an emacs hook function that looks at the mail I’m currently looking at in VM and sets my email address accordingly. So if I’m reading mail from Cambridge it sets my address to be my Cambridge one, and similarly for Fluidinfo, for my domain and a couple of others. That means I pretty much never reply to an email using an address I didn’t want to use on that email. That’s all totally automatic and I never have to think about email identity, except when mailing someone for the first time.
  11. VM also does a bunch of other things for me, like add attachments, encrypt and decrypt mail, etc. But that’s all fairly standard now.
  12. I use a script I wrote to repeatedly use fetchmail to pull my incoming mails from half a dozen mailboxes.
  13. I use grepmail to search for emails. It’s open source, so I was able to speed it up, fix some problems I ran into, and add some enhancements I wanted in versions 4.72 and 4.80.
  14. In front of grepmail I run my own mail-to program which knows where I store my outgoing mail, parses command line from and to dates to figure out the relevant files to pass to grepmail, etc.
  15. I use cron and some scripts to maintain a list of email addresses I’ve ever received/sent mail from/to (78500 of these) or just received from (40K of these). Cron updates these files nightly, using another program that knows how to pull things that look like emails out of mail files.
  16. I have a shell script which looks in the received mail address file to find email addresses. So if I am wondering about what someone’s address from, e.g., Siemens might be, I can run emails-of siemens and see 140 Siemens email addresses. Yes, I used to send a lot of mail to Siemens.
  17. I use procmail to filter my incoming mail. With procmail I do a bunch of things:
  18. Procmail logs basic info on all my incoming mail to a file.
  19. It looks for a special file in my home directory, and if it’s there it forwards mail to my mobile phone.
  20. It also looks for mail from me with a special subject and, when found, either creates or removes the above file. This allows me to turn forwarding to my mobile phone on and off when I’m away from my machine.
  21. It dumps some known spam addresses for me.
  22. With procmail I run incoming mail through a script I wrote that looks at the above file of all known (received) mail addresses. This adds a header to the mail to tell me it’s from a known former sender. Those mails then get favorable treatment as they’re very likely not spam.
  23. With procmail I run incoming mail through another program I wrote that looks at the From line and marks the mail as being something I want delivered immediately. If not it gets put aside for later viewing.
  24. With procmail I run incoming mail through another program I wrote that looks at the overall MIME structure of the mail and flags it if it looks like image spam (hint: don’t send me a GIF image attachment).
  25. Finally, I also use procmail to run incoming mail through both SpamBayes and Spam Assassin.
  26. I used to use procmail to auto-reply to anything considered spam (and then auto-drop the many bounces to this). But I turned that off as it was making too many mistakes replying to forged mails from mailing lists.
  27. I have a program that cron runs every night which goes through the day’s spam and summarizes the most interesting messages. It typically pulls out 15-20% of my spam into a summary mail which it sends me. The summary is sorted based on the mail address in the To line (my old mail addresses get scored very low). It also identifies common subjects (so I can kill them), and does some checks like tossing emails whose subjects are not composed of at least some recognizable words. This program is pretty severe – all these mails have already been classed as spam by one of the above programs, so this is just a safety check that I haven’t tossed anything I should keep. It generates a piece of emacs lisp for each message it pulls out so I can jump straight to the correct spam folder and message number in case I want to look at something. It also keeps a list of things to watch for that are definitely not spam. With this program in place I never go looking in my spam folders. I can also run this from the shell at any time.
  28. I have a program that summarizes the mail I’ve put aside (not for immediate delivery). Cron runs that nightly and mails me the result. I can also run this from the shell at any time.
  29. I have a simple program I use to grep for mail aliases in my .mailrc.
  30. I have a script which lists my received email files in reverse order of last update. I can pipe the output of that program into xargs grep to quickly search all incoming mail, in new-to-old order (for speed), mentioning any term.
  31. I have a script to send unrecorded mail (from the shell). That’s mail that doesn’t have my usual FCC line in it, in case I’m mailing out something large and don’t want a copy of it in my outgoing mail file.
  32. I have an emacs function to visit my current outgoing mail folder with backups disabled (the folders are often large and I rarely want to edit them).
  33. And I can’t resist pointing out that I wrote the Spamometer in 1997 to do probabilistic spam detection, and set up a Library of Spam (which attracted a hell of a lot of spam). This was 5 years before Paul Graham wrote his famous A Plan for Spam article about doing Bayesian filtering to detect spam. The Spam Assassin is very similar in approach and design.
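Items 15, 16 and 22 above all hang off one small idea: pull anything that looks like an email address out of mail text, keep the set, and consult it later. Here is a minimal Python sketch of that idea. It is not the actual scripts, which aren't shown here. The regular expression is deliberately loose, and the function names and the X-Known-Sender header name are hypothetical.

```python
import re

# Deliberately loose pattern for "things that look like emails" (an assumption,
# not a full RFC 5322 parser).
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_addresses(text):
    """Return the set of lower-cased address-like strings found in text."""
    return {match.lower() for match in ADDRESS_RE.findall(text)}


def emails_of(known, term):
    """Sorted known addresses containing term, ignoring case (cf. item 16)."""
    term = term.lower()
    return sorted(addr for addr in known if term in addr)


def tag_known_sender(headers, known, header_name="X-Known-Sender"):
    """Return a copy of headers, marked if From holds a known address (cf. item 22).

    header_name is a made-up header used only for this illustration.
    """
    tagged = dict(headers)
    if extract_addresses(headers.get("From", "")) & known:
        tagged[header_name] = "yes"
    return tagged


if __name__ == "__main__":
    # Hypothetical mail text standing in for a real mail file.
    known = extract_addresses(
        "From: Hans <hans@siemens.de>\nCc: eva@Siemens.com; bob@example.org.")
    print(emails_of(known, "siemens"))
    # ['eva@siemens.com', 'hans@siemens.de']
    print(tag_known_sender({"From": "Hans <hans@siemens.de>"}, known))
    # {'From': 'Hans <hans@siemens.de>', 'X-Known-Sender': 'yes'}
```

Lower-casing everything on the way in keeps the lookups cheap: a known-sender check is then a plain set intersection.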

I resolve to waste less time online

Tuesday, January 1st, 2008

It’s new year’s day. I never make new year’s resolutions. But today I’ve finally taken a step I’ve been meaning to take for a while, and it happens to be Jan 1st, so there you go.

Over the last 2 months I’ve spent lots of time running around talking to people and not producing any code (or much of anything else).

I’ve also found it increasingly hard to get anything useful done (by useful I almost always mean “code”).

I’m going to try cutting myself off a little more. But I need to be online – to read docs, to receive/send some mail, to test code, etc.

I’ve just made some changes to my email setup. Now all my mail, with about 15 exceptions, will go into a separate file that I’m only going to look at once a day (more likely I’ll write a little program to send me a summary). If you’re one of the lucky 15 your mail will still go straight into my inbox and I’ll see it pretty quickly.

I get about 550-700 mails a day. 300-500 of them are spam and are caught as spam by my filters. But that still leaves hundreds of mails a day that pop up in my mailbox all the time.

Quite a lot of those are from mailing lists and some spam that slips through. Of the rest, from actual people, hardly any need to be read or replied to straight away. So I’m going to file them out of sight and read them once in a while. If I remember.
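The triage rule itself is tiny. As a rough Python sketch (the real filtering is done elsewhere, and nothing below is that code): mail from a short list of priority senders goes to the inbox, everything else goes to a file set aside for later, and a once-a-day summary counts who the deferred mail came from. The function names, the "inbox" label, the dated file name and the sample addresses are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def route(sender, priority_senders, day):
    """Return 'inbox' for a priority sender, else the day's deferred file."""
    if sender.strip().lower() in priority_senders:
        return "inbox"
    # Hypothetical naming scheme: one deferred file per day.
    return "incoming/IN-" + day.strftime("%Y%m%d") + ".spool"


def summarize(deferred_senders):
    """One line per sender, most frequent first, for a once-a-day summary."""
    counts = Counter(s.strip().lower() for s in deferred_senders)
    return ["%3d  %s" % (n, sender) for sender, n in counts.most_common()]


if __name__ == "__main__":
    priority = {"ana@example.org"}  # stands in for the lucky 15
    today = date(2008, 1, 1)
    print(route("Ana@example.org", priority, today))   # inbox
    print(route("list@example.org", priority, today))  # incoming/IN-20080101.spool
    for line in summarize(["list@example.org", "bob@example.org",
                           "list@example.org"]):
        print(line)
```

Keeping the priority list as an explicit set means the default for any new sender is "later", which is the whole point.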

I’m planning another blog post on my email setup. It’s heavily customized. That’s good and it’s also why I figure I’ll probably never switch to another email setup – the bar is just too high and too personalized.

I’m also planning to blog less, to twitter less, to stop reading RSS streams, etc. If I don’t I’m going to turn into one of those trendy knowledgeable tech people who generate a lot of hot air and not much else. I don’t want to be like that.