Facebook release Tornado and it’s not based on Twisted?

14:25 September 12th, 2009 by terry. Posted under FluidDB, python, tech, twisted. 17 Comments »

Image: Jay Smith

To their great credit, Facebook have just open-sourced more of their core software. This time it’s Tornado, an asynchronous web server written in Python.

Surely that can only mean one thing: Tornado is based on Twisted. Right?

Incredibly, no. Words fail me on this one. I’ve spent some hours today trying to put my thoughts into order so I could put together a reasonably coherent blog post on the subject. But I’ve failed. So here are some unstructured thoughts.

First of all, I’m not meaning to bash Facebook on this. At Fluidinfo we use their Thrift code. We’ll almost certainly use Scribe for logging at some point, and we’re seriously considering using Cassandra. Esteve Fernandez has put a ton of work into txAMQP to glue together Thrift, Twisted, and AMQP, and in the process became a Thrift committer.

Second, you can understand—or make an educated guess at—what happened: the Facebook programmers, like programmers everywhere, were strongly tempted to just write their own code instead of dealing with someone else’s. It’s not just about learning curves and fixing deficiencies, there are also issues of speed of changes and of control. At Fluidinfo we suffered through six months of maintaining our own set of related patches to Thrift before the Twisted support Esteve wrote was finally merged to trunk. That was painful and the temptation to scratch our own itch, fork, and forget about the official Thrift project was high.

Plus, Twisted suffers from the fact that the documentation is not as good as it could be. I commented at length on this over three years ago. Please read the follow-up posts in that thread for an illustration (one of many) of the maturity of the people running Twisted. Also note that the documentation has improved greatly since then. Nevertheless, Twisted is a huge project, it has tons of parts, and it’s difficult to document and to wrap your head around no matter what.

So you can understand why Facebook might have decided not to use Twisted. In their words:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

I’m betting it’s that last part that’s the key to the decision not to use Twisted.

But seriously…… WTF?

Twisted is an amazing piece of work, written by some truly brilliant coders, with huge experience doing exactly what Facebook set out to reinvent.

This is where I’m at a loss for words. I think: “what an historic missed opportunity” and “reinventing the wheel, badly” and “no, no, no, this cannot be” and “this is just so short-sighted” and “a crying shame” and many things besides.

Sure, Twisted is difficult to grok. But that’s no excuse. It’s a seriously cool and powerful framework, it’s vastly more sophisticated and useful and extensible than what Facebook have cobbled together. Facebook could have worked to improve twisted.web (which everyone agrees has some shortcomings) which could have benefitted greatly from even a small fraction of the resources Facebook must have put into Tornado. The end result would have been much better. Or Facebook could have just ignored twisted.web and built directly on top of the Twisted core. That would have been great too.

Or Facebook could have had a team of people who knew how to do it better, and produced something better than Twisted. I guess that’s the real frustration here – they’ve put a ton of effort into building something much more limited in scope and vision, and even the piece that they did build looks like a total hack built to scratch their short term needs.

What’s the biggest change in software engineering over the last decade? Arguably it’s the rise of test-driven development. I’m not the only one who thinks so. Yet here we are in late 2009 and Facebook have released a major piece of code with no test suite. Amazing. OK, you could argue this is a minor thing, that it’s not core to Tornado. That argument has some weight, but it’s hard to think that this thing is not a hack.

If you decide to use an asynchronous web framework, do you expect to have to manually set your sockets to be non-blocking? Do you feel like catching EWOULDBLOCK and EAGAIN yourself? Those sorts of things, and their non-portability (even within the world of Linux) are precisely the kinds of things that lead people away from writing C and towards doing rapid development using something robust and portable that looks after the details. They’re precisely the things Twisted takes care of for you, and which (at least in Twisted) work across platforms, including Windows.
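
To make that concrete, here's a rough sketch (not Tornado's code, just an illustration of the general pattern) of the kind of bookkeeping you take on when you drive non-blocking sockets by hand:

    import errno, socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(0)  # non-blocking: recv/send can fail at any moment

    def tryRead(sock, nbytes=4096):
        # Read whatever is available, or return None if the read would block.
        try:
            return sock.recv(nbytes)
        except socket.error, e:
            if e.args[0] in (errno.EAGAIN, errno.EWOULDBLOCK):
                return None  # nothing there yet; wait for select/epoll to say so
            raise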

It looks like Tornado is using a global reactor, which the Twisted folks have learned the hard way is not the best solution.

Those are just some of the complaints I’ve heard and seen in the Tornado code. I confess I’ve looked only superficially at their code – but more than enough to feel such a sense of lost opportunity. They built a small subsection of Twisted, they’ve done it with much less experience and elegance and hiding of detail than the Twisted code, and the thing doesn’t even come with a test suite. Who knows if it actually works, or when, or where, etc.?

And…. Twisted is so much more. HTTP is just one of many protocols Twisted speaks, including (from their home page): “TCP, UDP, SSL/TLS, multicast, Unix sockets, a large number of protocols including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others”.

Want to build a sophisticated, powerful, and flexible asynchronous internet service in Python? Use Twisted.

A beautiful thing about Twisted is that it expands your mind. Its abstractions (particularly the clean separation and generality of transports, protocols, factories, services, and Deferreds—see here and here and here) makes you a better programmer. As I wrote to some friends in April 2006: “Reading the Twisted docs makes me feel like my brain is growing new muscles.”
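
To give a taste of those abstractions, here's the canonical minimal echo server, just a sketch built on the Twisted core: the protocol holds your logic, the transport hides the socket, and the factory stamps out a protocol instance per connection.

    from twisted.internet import protocol, reactor

    class Echo(protocol.Protocol):
        def dataReceived(self, data):
            # The transport hides the socket, buffering, EWOULDBLOCK, etc.
            self.transport.write(data)

    factory = protocol.ServerFactory()
    factory.protocol = Echo
    reactor.listenTCP(8000, factory)  # 8000 is an arbitrary port choice
    reactor.run()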

Twisted’s deferreds are extraordinarily elegant and powerful; I’ve blogged and posted to the Twisted mailing list about them on multiple occasions. Learning to think in the Twisted way has been a programming joy to me over the last 3 years, so you can perhaps imagine my dismay that a company with the resources of Facebook couldn’t be bothered to get behind it and had to go reinvent the wheel, and do it so poorly. What a pity.

In my case, I threw away an entire year of C code in order to use Twisted in FluidDB. That was a decision I didn’t take lightly. I’d written my own libraries to do lots of low level network communications and RPC – including auto-generating server and client side glue libraries to serialize and unserialize RPC calls and args (a bit like Thrift), plus a server and tons of other stuff. I chucked it because it was too brittle. It was too much of a hack. It wasn’t portable enough. It was too hard to get the details right. It wasn’t extensible.

In other words….. it was too much like Tornado! So I threw it all away in favor of Twisted. As I happily tell people, FluidDB is written in Python so it can use Twisted. It was a question of an amazingly great asynchronous networking framework determining the choice of programming language. And this was done in spite of the fact that I thought the Twisted docs sucked badly. The people behind Twisted were clearly brilliant and the community was great. There was an opportunity to make a bet on something and to contribute. I wish Facebook had made the same decision. It’s everyone’s loss that they did not. What a great pity.

bzr – not your grandfather’s VCS

20:43 March 26th, 2009 by terry. Posted under programming, tech. Comments Off on bzr – not your grandfather’s VCS

bzr viz

A kinder and more consistent defer.inlineCallbacks

19:20 November 21st, 2008 by terry. Posted under deferreds, python, tech, twisted. Comments Off on A kinder and more consistent defer.inlineCallbacks

Here’s a suggestion for making Twisted‘s inlineCallbacks function decorator more consistent and less confusing. Let’s suppose you’re writing something like this:

    @inlineCallbacks
    def func():
        # Do something.

    result = func()

There are 2 things that could be better, IMO:

1. func may not yield. In that case, you get an AttributeError when inlineCallbacks tries to send() to something that’s not a generator. Or worse, the call to send might actually work, and do who knows what. I.e., func() could return an object with a send method but which is not a generator. For some fun, run some code that calls the following decorated function (see if you can figure out what will happen before you do):

    @defer.inlineCallbacks
    def f():
        class yes():
            def send(x, y):
                print 'yes'
                # accidentally_destroy_the_universe_too()
        return yes()

2. func might raise before it gets to its first yield. In that case you’ll get an exception thrown when the inlineCallbacks decorator tries to create the wrapper function:

    File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 813, in unwindGenerator
      return _inlineCallbacks(None, f(*args, **kwargs), Deferred())

There’s a simple and consistent way to handle both of these. Just have inlineCallbacks do some initial work based on what it has been passed:

    import types

    from twisted.internet import defer
    from twisted.python.util import mergeFunctionMetadata

    def altInlineCallbacks(f):
        def unwindGenerator(*args, **kwargs):
            deferred = defer.Deferred()
            try:
                result = f(*args, **kwargs)
            except Exception, e:
                deferred.errback(e)
                return deferred
            if isinstance(result, types.GeneratorType):
                return defer._inlineCallbacks(None, result, deferred)
            deferred.callback(result)
            return deferred

        return mergeFunctionMetadata(f, unwindGenerator)

This has the advantage that (barring e.g., a KeyboardInterrupt in the middle of things) you’ll *always* get a deferred back when you call an inlineCallbacks decorated function. That deferred might have already called or erred back (corresponding to cases 1 and 2 above).
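
For instance (a quick sketch using the decorator above), both of these calls hand you back a Deferred you can attach callbacks to, even though neither function is a generator:

    @altInlineCallbacks
    def neverYields():
        return 42  # not a generator: the Deferred has already fired with 42

    @altInlineCallbacks
    def raisesEarly():
        raise ValueError('oops')  # the Deferred has already errbacked

    def show(result):
        print 'got', result

    neverYields().addCallback(show)
    raisesEarly().addErrback(lambda failure: failure.trap(ValueError))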

I’m going to use this version of inlineCallbacks in my code. There’s a case for it making it into Twisted itself: inlineCallbacks is already cryptic enough that anything we can do to make its operation more uniform and less surprising is worth doing.

You might think that case 1 rarely comes up. But I’ve hit it a few times, usually when commenting out sections of code for testing. If you accidentally comment out the last yield in func, it no longer returns a generator and that causes a different error.
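
For example (setUpSomething and fetchSomething are made-up names):

    @defer.inlineCallbacks
    def func():
        setUpSomething()
        # yield fetchSomething()  # commented out while testing; func is now
        #                         # an ordinary function, not a generator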

And case 2 happens to me too. Having inlineCallbacks try/except the call to func is nicer because it means I don’t have to be quite so defensive in coding. So instead of having to write

    try:
        d = func()
    except Exception:
        # Do something.

and try to figure out what happened if an exception fired, I can just write d = func() and add errbacks as I please (they then have to figure out what happened). The (slight?) disadvantage to my suggestion is that with the above try/except fragment you can tell whether the call to func() raised before ever yielding. With my approach you can still detect that, if you need to, by looking at d.called immediately after calling func.
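
So, with the alternative approach, something like this still lets you notice the pre-yield failure case (handleFailure is just a stand-in for whatever errback you’d normally add):

    d = func()
    if d.called:
        # func either raised before its first yield or returned without
        # yielding; the result or failure is already sitting in d.
        pass
    d.addErrback(handleFailure)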

The alternate approach also helps if you’re a novice, or simply being lazy/careless/forgetful, and writing:

    d = func()
    d.addCallback(ok)
    d.addErrback(not_ok)

thinking you have your ass covered, but you actually don’t (due to case 2).

There’s some test code here that illustrates all this.

bzr viz is so nice

00:07 November 20th, 2008 by terry. Posted under python, tech. 21 Comments »

A year ago we switched from SVN to Bazaar for source code control. I started using source code control in 1989 with RCS after comparing it with SCCS. Then I duly moved to CVS and on to SVN. In retrospect, they all sucked pretty badly but each in turn was a big improvement and seemed great at the time.

The topic of source code control is a very complex one. There’s tons of debate online about the advantages of various packages. I don’t want to get into details, plus there are details that I don’t fully appreciate anyway. It really is complex – at least if you want to do anything even a little bit sophisticated, e.g., with multiple users working on multiple branches.

Anyway, we wanted to move away from SVN, which is cumbersome, too manual and heavyweight (at least in its handling of branches), and requires you to talk to a centralized server all the time. Plus it has no handling of directories or symbolic links, and you lose history in merging. There are other problems and annoyances too.

A distributed version control system seemed like the way to go.

We looked closely at Bazaar and Mercurial. I was prejudiced towards Mercurial. I liked its name, I liked the coolness of the Qwerty-symmetric hg command, and above all I liked how lightweight and simple it is. We took a quick look at Git, but it looked like a bit of a hodge-podge and we’re Python fanboys, so we fairly quickly decided against it.

From what I’ve read, all of Bazaar, Mercurial and Git are excellent. It’s clear that they leave SVN for dead. When I run across open source projects, especially new projects, that are still using SVN I silently raise an inner eyebrow.

But like I said, I don’t want to get into details. What I do want to do is say that I really like a plugin for Bazaar called viz (aka vizualize). It’s in bzr-gtk in case you use apt-get.

You just type bzr viz and it pops a glorious window with a visualization of your branching and merging history. The image above is just a fragment of the full window. The most recent activity is at the top, so as you look down the page you’re looking at older and older branches and merges. On the left you see the branch numbers. The vertical lines are the branches, the left-most being the trunk (in this case). You can see that the 2 right-most branches have no activity in the fragment shown.

If you want to take a look at more of the window, showing a different part of the tree, click on the following image.

Not only does Bazaar make branching really lightweight, it takes all the uncertainty out of the process (ever try merging branches in SVN, reading the log file to make sure you’ve got the right revision numbers before entering the extremely long command?). Plus you get full history when merging (and this is nicely displayed in the output of bzr log) and with a tool like bzr viz you can just see the history. Our tree has some much more complex sections, including one where Esteve had 25 branches going at the same time! And yes, they all got merged to trunk. Bazaar makes branching and merging so simple you just start to do it all the time, and it becomes very natural. Then you just merge whatever you like into whatever you like and gradually merge your way back into the trunk (after merging the trunk into your branch first to have a look at things). It’s great.

That’s it. No time for blogging. I’m waiting for someone to upload a patch so I can continue working. Meanwhile, lightweight distributed version control has really changed how we work. It’s much much better. If you’re still using SVN and haven’t checked out Bazaar or Mercurial (and there are several others), you really should.

A Python metaclass for Twisted allowing __init__ to return a Deferred

01:33 November 3rd, 2008 by terry. Posted under deferreds, python, tech, twisted. 4 Comments »

OK, I admit, this is geeky.

But we’ve all run into the situation in which you’re using Python and Twisted, and you’re writing a new class and you want to call something from the __init__ method that returns a Deferred. This is a problem. The __init__ method is not allowed to return a value, let alone a Deferred. While you could just call the Deferred-returning function from inside your __init__, there’s no guarantee of when that Deferred will fire. Seeing as you’re in your __init__ method, it’s a good bet that you need that function to have done its thing before you let anyone get their hands on an instance of your class.

For example, consider a class that provides access to a database table. You want the __init__ method to create the table in the db if it doesn’t already exist. But if you’re using Twisted’s twisted.enterprise.adbapi module, the runInteraction method returns a Deferred. You can call it to create the tables, but you don’t want the instance of your class back in the hands of whoever’s creating it until the table is created. Otherwise they might call a method on the instance that expects the table to be there.
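
Here’s a sketch of the dilemma (the class and table are made up, but the adbapi calls are the real ones):

    from twisted.enterprise import adbapi

    class UserTable(object):
        def __init__(self):
            self.pool = adbapi.ConnectionPool('sqlite3', 'users.db')
            # runInteraction hands back a Deferred, but __init__ can't wait
            # for it and can't return it either.
            self.pool.runInteraction(self._createTable)

        def _createTable(self, txn):
            txn.execute('CREATE TABLE IF NOT EXISTS users '
                        '(id INTEGER PRIMARY KEY, name TEXT)')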

A cumbersome solution would be to add a callback to the Deferred you get back from runInteraction and have that callback add an attribute to self to indicate that it is safe to proceed. Then all your class methods that access the db table would have to check to see if the attribute was on self, and take some alternate action if not. That’s going to get ugly very fast, plus your caller has to deal with you potentially not being ready.

I ran into this problem a couple of days ago and after scratching my head for a while I came up with an idea for how to solve this pretty cleanly via a Python metaclass. Here’s the metaclass code:

from twisted.internet import defer

class TxDeferredInitMeta(type):
    def __new__(mcl, classname, bases, classdict):
        hidden = '__hidden__'
        instantiate = '__instantiate__'
        for name in hidden, instantiate:
            if name in classdict:
                raise Exception(
                    'Class %s contains an illegally-named %s method' %
                    (classname, name))
        try:
            origInit = classdict['__init__']
        except KeyError:
            origInit = lambda self: None
        def newInit(self, *args, **kw):
            hiddenDict = dict(args=args, kw=kw, __init__=origInit)
            setattr(self, hidden, hiddenDict)
        def _instantiate(self):
            def addSelf(result):
                return (self, result)
            hiddenDict = getattr(self, hidden)
            d = defer.maybeDeferred(hiddenDict['__init__'], self,
                                    *hiddenDict['args'], **hiddenDict['kw'])
            return d.addCallback(addSelf)
        classdict['__init__'] = newInit
        classdict[instantiate] = _instantiate
        return super(TxDeferredInitMeta, mcl).__new__(
            mcl, classname, bases, classdict)

I’m not going to explain what it does here. If it’s not clear and you want to know, send me mail or post a comment. But I’ll show you how you use it in practice. It’s kind of weird, but it makes sense once you get used to it.

First, we make a class whose metaclass is TxDeferredInitMeta and whose __init__ method returns a deferred:

class MyClass(object):
    __metaclass__ = TxDeferredInitMeta
    def __init__(self):
        d = aFuncReturningADeferred()
        return d

Having __init__ return anything other than None is illegal in normal Python classes. But this is not a normal Python class, as you will now see.

Given our class, we use it like this:

def cb((instance, result)):
    # instance is an instance of MyClass
    # result is from the callback chain of aFuncReturningADeferred
    pass

d = MyClass()
d.__instantiate__()
d.addCallback(cb)

That may look pretty funky, but if you’re used to Twisted it won’t seem too bizarre. What’s happening is that when you ask to make an instance of MyClass, you get back an instance of a regular Python class. It has a method called __instantiate__ that returns a Deferred. You add a callback to that Deferred and that callback is eventually passed two things. The first is an instance of MyClass, as you requested. The second is the result that came down the callback chain from the Deferred that was returned by the __init__ method you wrote in MyClass.

The net result is that you have the value of the Deferred and you have your instance of MyClass. It’s safe to go ahead and use the instance because you know the Deferred has been called. It will probably seem a bit odd to get your instance later as a result of a Deferred firing, but that’s perfectly in keeping with the Twisted way.

That’s it for now. You can grab the code and a trial test suite to put it through its paces at http://foss.fluidinfo.com/txDeferredInitMeta.zip. The code could be cleaned up somewhat, and made more general. There is a caveat to using it – your class can’t have __hidden__ or __instantiate__ methods. That could be improved. But I’m not going to bother for now, unless someone cares.

Pond scum

15:53 September 5th, 2008 by terry. Posted under other, tech. 17 Comments »

I had breakfast this morning at a bar in the Santa Caterina market in Barcelona with Jono Bennett. He’s a writer. We were reflecting on similarities in our struggles to do our own thing. An email about a potential Fluidinfo investor that I’d recently sent to a friend came to mind. I wrote:

I had a really good call with AAA. He told me he’s interested and wants to talk to BBB and CCC. I then got mail the next day from DDD (of the NYT) who told me he’d just had dinner with AAA and BBB and that they’d talked about my stuff. So something may happen there (i.e., I’ll never hear from them again).

The last comment, that I’d probably never hear from them again, was entirely tongue-in-cheek. I wrote it knowing it was a possibility, but not really thinking it would happen.

But it did.

Things like that seem to be part & parcel of the startup world as you attempt to get funded. I have often asked myself how can it be possible for things to be this way? How you can have people so excited, telling you and others you’re going to change the world, be worth billions, and then you never hear from them again? (Yes, of course you have to follow up, and I did. But that’s not the point: If you didn’t follow up you’d never hear from them.)

How can that be? In what sort of world is such a thing possible?

I came up with a highly flawed analogy. Despite its limited accuracy I find it amusing and can’t resist blogging it even if people will label me bitter (I’m not).

First: startup founders are pond scum. Second: potential investors are a troupe of young kids wandering through the park with sticks.

The kids poke into the ponds, stirring up the scum. They’re looking for cool things, signs of life, perhaps even something to take home. They’re genuinely interested. They’re fascinated. The pond scum listen to their excited conversation and think the kids will surely be back tomorrow. But it’s summer, and the world is so very very big.

The pond scum are working on little projects like photosynthesis, enhancements to the Krebs cycle, or the creation of life itself. All the while they’re pondering how to make themselves irresistible, believing that someday the kids with the sticks will be back, that they’ll eventually be scooped up.

As Paul Graham recently wrote, fundraising is brutal. His #1 recommendation is to keep expectations low.

Yep, you’re pond scum.

Get used to it.

Embrace it.

Minor mischief: create redirect loops from predictable short URLs

16:14 July 1st, 2008 by terry. Posted under other, python, tech. 2 Comments »

I was checking out the new bit.ly URL shortening service from Betaworks.

I started wondering how random the URLs from these URL-shortening services could be. I wrote a tiny script the other day to turn URLs given on the command line into short URLs via is.gd:

import urllib, sys
for arg in sys.argv[1:]:
    print urllib.urlopen(
        'http://is.gd/api.php?longurl=' + arg).read()

I ran it a couple of times to see what URLs it generated. Note that you have to use a new URL each time, as it’s smart enough not to give out a new short URL for one it has seen before. I got the sequence http://is.gd/JzB, http://is.gd/JzC, http://is.gd/JzD, http://is.gd/JzE,…

That’s an invitation to some minor mischief, because you can guess the next URL in the is.gd sequence before it’s actually assigned to redirect somewhere.

We can ask bit.ly for a short URL that redirects to our predicted next is.gd URL. Then we ask is.gd for a short URL that redirects to the URL that bit.ly gives us. If we do this fast enough, is.gd will not yet have assigned the predicted next URL and we’ll get it. So the bit.ly URL will end up redirecting to the is.gd URL and vice versa. In ugly Python (and with a bug/shortcoming in the nextIsgd function):

import urllib, random

def bitly(url):
    return urllib.urlopen(
        'http://bit.ly/api?url=' + url).read()

def isgd(url):
    return urllib.urlopen(
        'http://is.gd/api.php?longurl=' + url).read()

def nextIsgd(url):
    last = url[-1]
    if last == 'z':
        next = 'A'
    else:
        next = chr(ord(last) + 1)
    return url[:-1] + next

def randomURI():
    return 'http://www.a%s.com' % \
           ''.join(map(str, random.sample(xrange(100000), 3)))

isgdURL = isgd(randomURI())
print 'Last is.gd URL:', isgdURL

nextIsgdURL = nextIsgd(isgdURL)
print 'Next is.gd URL will be:', nextIsgdURL

# Ask bit.ly for a URL that redirects to nextIsgdURL
bitlyURL = bitly(nextIsgdURL)
print 'Step 1: bit.ly now redirects %s to %s' % (
    bitlyURL, nextIsgdURL)

# Ask is.gd for a URL that redirects to that bit.ly url
isgdURL2 = isgd(bitlyURL)
print 'Step 2: is.gd now redirects %s to %s' % (
    isgdURL2, bitlyURL)

if nextIsgdURL == isgdURL2:
    print 'Success'
else:
    print 'Epic FAIL'

This worked first time, giving:

Step 1: bit.ly now redirects http://bit.ly/fkuL8 to http://is.gd/JA9
Step 2: is.gd now redirects http://is.gd/JA9 to http://bit.ly/fkuL8

In general it’s not a good idea to use predictable numbers like this, which hardly needs saying, as just about every responsible programmer knows that already.

is.gd won’t shorten a tinyurl.com link, as tinyurl is on their blacklist. So they obviously know what they’re doing. The bit.ly service is brand new and presumably not on the is.gd radar yet.

And finally, what happens when you visit one of the deadly looping redirect URLs in your browser? You’d hope that after all these years the browser would detect the redirect loop and break it at some point. And that’s what happened with Firefox 3, producing the image above.

If you want to give it a try, http://bit.ly/fkuL8 and http://is.gd/JA9 point to each other. Do I need to add that I’m not responsible if your browser explodes in your face?

Embracing Encapsulation

16:09 June 18th, 2008 by terry. Posted under me, python, tech. 43 Comments »

[This is a bit rambling / repetitive, sorry. I don’t have time to make it shorter, etc.]

Last year at FOWA I had a discussion with Paul Graham about programming and programmers in which we disagreed over the importance of knowing the fundamentals.

By this I mean the importance of knowing things down to the nuts and bolts level, to really understand what’s going on at the lower levels when you’re writing code. I used to think that sort of thing mattered a lot, but now I think it rarely does.

I well remember learning to program in AWK and being acutely aware of how resource intensive “associative arrays” (as we quaintly called them in those days) were, and knowing full well what was going on behind the scenes. I wrote a full Pascal compiler (no lex, no yacc) in the mid-80’s with Keith Rowe. If you haven’t done that, you really can’t appreciate the amount of computation that goes on when you compile a program to an executable. It’s astonishing. I did lots of assembly language programming, starting from age 15 or so, and spent years squeezing code into embedded environments, where a client might call to ask if you couldn’t come up with a way to reduce your executable code by 2 bytes so it would fit in their device.

But you know what? None of those skills really matter any more. Or they matter only very rarely.

The reason is that best practices have been worked out and incorporated into low-level libraries, and for the most part you don’t need to have any awareness at all of how those levels work. In fact it can be detrimental to you to spend years learning all those details if you could instead be learning how to build great things using the low-level libraries as black-box tools.

That’s the way the world moves in general. Successive generations get the accumulated wisdom of earlier generations packaged up for them. We used log tables, slide rules, and our heads, while our kids use calculators with hundreds of built-in functions. We learned to read analog 12-hour clocks, our kids learn to read digital clocks (so much easier!) and may not be able to read an analog clock until later. And it doesn’t matter. We buy a CD player (remember them?) or an iPod, and when it breaks you don’t even consider getting it “fixed” (remember that?). You just go out and buy another one. That’s because it’s cheaper and much faster and easier to just get a new one that has been put together by a machine than it is to have an actual human try to open the thing and figure out how to repair it. You can’t even (easily) open an iPod. And so the people who know how to do these things dwindle in number until there are none left. Like watch makers or the specialist knife sharpeners we have in Barcelona who ride around on motorcycles with their distinctive whistles, calling to people to bring down their blunt knives. And it doesn’t matter, at least from a technical point of view. Their brilliance and knowledge and hard-won experience has been encapsulated and put into machines and higher-level tools, or simply baked into society in smaller, more accurate and easier to digest forms. In computers it goes down into libraries and compilers and hardware. There’s simply no need for anyone to know how, learn how, or to bother, to do those sorts of things any more.

Note that I’m not saying it’s not nice to have your watch repaired by someone with a jeweler’s eyepiece or your knife or scissors sharpened in the street. I’m just noting the general progression by which knowledge inevitably becomes encapsulated.

In my discussion with Paul Graham, he argued that it was still important for tech founders to be great programmers at a low level. I argued that that’s not right. Sure, people like that are good to have around, but I don’t think you need to be that way and as I said I think it can even be detrimental because all that knowledge comes at a price (other knowledge, other experience).

I work with a young guy called Esteve (Hi Esteve!). He’s great at many levels, including the lower ones. He’s also a product of a new generation of programmers. They’re people who grew up only knowing object-oriented programming, only really writing in very high-level languages (not you Esteve! I mean that just in general), who think in those terms, and who instead of spending many years working with nuts and bolts spent the years working with newer high-level tools.

I think people like Esteve have a triple advantage over us dinosaurs. 1) They tend to use more powerful tools; 2) Because they use better tools, they are more comfortable and think more naturally in the terms of the higher-level abstractions their tools present them; and 3) they also have more experience putting those tools and methods to good use.

The experience gap widens at double speed, just as when a single voter changes sides: the gap between the two parties increases by two votes. Even when the dinosaur modernizes itself and learns a few new tricks, you’re still way behind because the 25-year-old you’re working with (again, excluding Esteve) has never had to work at the nuts and bolts level. They think with the new paradigms and can put more general and more powerful tools directly into action. They don’t have to think about protocols or timeouts or dynamically resizing buffers or partial reads or memory management or data structures or error propagation. They simply think “Computer, fetch me the contents of that web page!” And most of the time it all just works. When it doesn’t, you can call in a gray-haired repair person or, more likely, just throw the busted tool away and buy another (or just get it free, in the case of Open Source software).

That’s real progress, and to insist that we should make the young suffer through all the stuff we had to learn in order to build all the libraries and compilers etc., that are now available to us all is just wrong. It’s wrong because it goes against the flow of history, because it’s counter-productive, and because it smacks of “I had to suffer through this stuff, walk barefoot to school in the snow, and therefore you must too.”

Some of the above will probably sound a bit abstract, but to me it’s not. I think it’s important to realize and accept. The fact that your kid can’t tie their shoelaces because they have velcro and have never owned a shoe with a lace is probably a good thing. You don’t know how to hunt your own food or start a fire, and it just doesn’t matter. The same goes for programming. The collective brilliance of generations of programmers is now built into languages like Java, Python and Ruby, and into operating systems, graphics libraries, etc. etc., and it really doesn’t matter a damn if young people who are using those tools don’t have a clue what’s going on at the lower levels (as I said above, that’s probably a good thing). One day very few people will. The knowledge won’t be lost. It’s just encapsulated into more modern environments and tools.

I’m writing all this down because I’ve been thinking about it on and off since FOWA, but also because of what I’m working on right now. I’m trying to modify 12K lines of synchronous Python code to use Twisted (an extraordinarily good set of asynchronous networking libraries written by a set of extraordinarily young and gifted programmers). The work is a bit awkward and three times I’ve not known how best to proceed in terms of design. Each time, Esteve has taken a look at the problem and quickly suggested a fairly clean way to tackle it. Desperate to cook up a way to think that he might not be that much smarter than I am, I’m forced into a corner in which I conclude that he has spent more time working with new tools (patterns, OO, a nice language like Python). So he looks at the world in a different way and naturally says “oh, you just do that”. Then I go do the routine work of making his ideas work – which is great by me, I get to learn in the best way, by doing. How nice to hire people who are better than you are.

That’s it. Encapsulation is inevitable. So you either have to embrace it or become a hand-wringing dinosaur moaning about the kids of today and how they no longer know the fundamentals. It’s not as though any of us could survive if we suddenly had to do everything from first principles (hunt, rub sticks together to make fire, etc). So relax. Enjoy it. The young are much better than we are because they grow up with better tools and they spend more time using them. It’s not enough to learn them when you’re older, even if you can do that really fast. You’ll never catch up on the experience front.

But it sure is fun to try.

Random thoughts on Twitter

02:48 June 9th, 2008 by terry. Posted under companies, tech, twitter. 23 Comments »

I’ve spent a lot of time thinking about Twitter this year. Here are a few thoughts at random.

Obviously Twitter have tapped into something quite fundamental, which at a high level we might simply call human sociability. We humans are primates, though there’s a remarkably strong tendency to forget or ignore this. We know a lot about the intensely social lives of our fellow primate species. It shouldn’t come as a surprise that we like to Twitter amongst ourselves too.

Here are a couple of interesting (to me) reasons for the popularity of Twitter.

One is that many people are in some sense atomized by the fact that many of us now work in an isolated way. Technical people who can do their work and communicate over the internet probably see less of their peers than others do. That’s just a general point, it’s not specific to Twitter or to 2008. It would have seemed unfathomably odd to humans 50 years ago to hear that many of us would be doing a large percentage of our work and social communication via machines, interacting with people who we don’t otherwise know, and who we rarely or never meet face to face. The rise of internet-based communication is obviously(?) helping to fill a gap created by this generational change.

The second point is specific to Twitter. Through brilliance or accident, the form of communication on Twitter is really special. Building a social network on nothing-implied asymmetric follower relationships is not something I would have predicted as leading to success. Maybe it worked, or could have all gone wrong, just due to random chance. But I’m inclined to believe that there’s more to it than that. Perhaps we’re all secretly voyeurs, or stickybeaks (nosy-parkers). Perhaps we like to see one half of conversations and be able to follow along if we like. Perhaps there’s a small secret thrill to promiscuously following someone and seeing if they follow you back. I don’t know the answer, but as I said above I do think Twitter have tapped into something interesting and strong here. There’s a property of us, we simple primates, that the Twitter model has managed to latch onto.

I think Twitter should change the dynamics for new users by initially assigning them ten random followers. New users can easily follow others, but if no-one is following them….. why bother? New user uptake would be much higher if they didn’t have the (correct) feeling that they were for some reason expected to want to Twitter in a vacuum. You announce a new program, called e.g., Twitter Guides and ask for people to volunteer to be guides (i.e., followers) of newbees. Lend a hand, make new friends, maybe get some followers yourself, etc. Lots of people would click to be a Guide. I bet this would change Twitter’s adoption dynamics. If you study things like random graph theory and dynamic systems, you know that making small changes to (especially initial) probabilities can have a dramatic effect on overall structure. If Twitter is eventually to reach a mass audience (whatever that means), it should be an uncontestable assertion that anything which significantly reduces the difficulty for new users to get into using it is very important.

Twitter should probably fix their reliability issues sometime soon.

I say “probably” because reliability and scaling are obviously not the most important things. Twitter has great value. It must have, or it would have lost its users long ago.

There’s a positive side to Twitter’s unreliability. People are amazed that the site goes down so often. Twitter gets snarled up in ways that give rise to a wide variety of symptoms. The result seems to be more attention, to make the service somehow more charming. It’s like a bad movie that you remember long afterwards because it wasn’t good. We don’t take Twitter for granted and move on to the next service to pop up – we’re all busy standing around making snide remarks, playing armchair engineer, knowing that we too might face some of these issues, and talking, talking, talking. Twitter is a fascinating sight. Great harm is done by its unreliability, but the fact that their success so completely flies in the face of conventional wisdom is fascinating – and the fact that we find it so interesting and compelling a spectacle is fantastic for Twitter. They can fix the scaling issues, I hope. They should prove temporary. But the human side of Twitter, its character as a site, the site we stuck with and rooted for when times were so tough, the amazing little site that dropped to the canvas umpteen times but always got back to its feet, etc…. All that is permanent. If Twitter make it, they’re going to be more than just a web service. The public outages are like a rock musician or movie star doing something outrageous or threatening suicide – capturing attention. We’re drawn to the spectacle and the drama. We can’t help ourselves: it is our selves. We love it, we hate it, it brings us together to gnash our teeth when it’s down. But do we leave? Change the channel? No way.

Twitter is both the temperamental child rock star we love and, often, the medium by which we discuss it – an enviable position!

I’m reminded of a trick I learned during tens of thousands of miles of hitch-hiking. A great place to try for a lift is on a fairly high-speed curve on the on-ramp to the freeway / motorway / autopista / autoroute etc. Stand somewhere where a speeding car can only just manage a stop and only just manage to pull in away from the following traffic. Conventional wisdom tells you that you’ll never get a ride. But the opposite is true – you’ll get a ride extremely quickly. Invariably, the first thing the driver says when you get in is “Why on earth were you standing there? You’re very lucky I managed to stop. No-one would have ever picked you up standing there!” I’ve done this dozens of times. Twitter—being incredibly, unbelievably, frustratingly, unreliable and running contrary to all received wisdom—is a powerful spectacle. Human psyche is a funny thing. That’s a part of why it’s probably impossible to foretell success when mass adoption is required.

If I were running Twitter, apart from working to get the service to be more reliable, I’d be telling the engineering team to log everything. There’s a ton of value in the data flowing into Twitter.

Just as Google took internet search to a new level by link analysis, there’s another level of value in Twitter that I don’t think has really begun to be tapped yet.

PageRank, at least as I understand its early operation, ran a kind of iterative relaxation algorithm assigning and passing on credit via linked pages. A similar thing is clearly possible with Twitter, and some people have commented on this or tried to build little things that assign some form of score to users. But I think there’s a lot more that can be done. Because the Twitter API isn’t that powerful (mainly because you’re largely limited to querying as a single authorized user) and certainly because it’s rate-limited to just 70 API calls an hour, this sort of analysis will need to be done by Twitter themselves. I’m sure they’re well aware of that. Rate limiting probably helps them stay up, but it also means that the truly interesting and valuable stuff can’t be done by outsiders. I have no beef with that – I just wish Twitter would hurry up and do some of it.
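
Just to make the idea concrete, here’s a toy, entirely unofficial sketch of a PageRank-style iteration over the follow graph. It assumes you already have a dict mapping each user to the set of users who follow them, with all followers also appearing as keys:

    def twitterRank(followers, iterations=20, damping=0.85):
        # followers: dict mapping each user to the set of users who follow them.
        users = list(followers)
        n = float(len(users))
        rank = dict((u, 1.0 / n) for u in users)
        # How many people does each user follow?
        nFollowed = dict((u, 0) for u in users)
        for u in users:
            for f in followers[u]:
                nFollowed[f] += 1
        for _ in range(iterations):
            newRank = {}
            for u in users:
                # Each follower passes on a share of its own rank.
                credit = sum(rank[f] / nFollowed[f] for f in followers[u])
                newRank[u] = (1 - damping) / n + damping * credit
            rank = newRank
        return rank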

Some examples in no order:

  • The followers to following ratio of a Twitter user is obviously a high-level measure of that user’s “importance” (in some Twitter sense of importance). But there’s more to it than that. Who are the followers? Who do they follow, who follows them? Etc. This leads immediately back to Google PageRank.
  • If a user gets followed by many people and doesn’t follow those people back, what does it say about the people involved? If X follows Y and Y then goes to look at a few pages of X’s history but does not then follow X, what do we know?
  • If X has 5K followers and re-tweets a twit of Y, how many of X’s followers go check out and perhaps follow Y? What kind of people are these? (How do you advertise to them, versus others?)
  • Along the lines of co-citation analysis, Twitter could build up a map showing you who you might follow. I.e., you can get pairwise distances between users X and Y by considering how many people they follow in common and how many they don’t. That would lead to a “people you should be following but aren’t” kind of suggestion (see the sketch after this list).
  • Even without co-citation analysis (or similar), Twitter should be able to tell me about people that many of the people I follow are following but whom I am not following. I’d find that very useful.
  • Twitter could tell me why someone chooses to follow me. What were they looking at (if anything) before they decided to follow me? I.e., were they browsing the following list of someone else? Did they see my user name mentioned in a Tweet? Did they come in from an outside link? Would a premium Twitter user pay to have that information?
  • Twitter has tons of links. They know the news as it happens. They could easily create a news site like Digg.
  • In some sense the long tail of Twitter is where the value is. For instance, it doesn’t mean much if a user following 10K others follows someone. But if someone is following just 10 people, it’s much more significant. There’s more information there (probably). The Twitter mega users are in some way uninteresting – the more people they have following them and the more they follow, the less you really know (or care) about them. Yes, you could probably figure out more if you really wanted to, but if someone has 10K followers all you really know is that they’re probably famous in some way. If they add another 100 followers it’s no big deal. (I say all this a bit lightly and generally – the details might of course be fascinating and revealing – e.g., if you notice Jason Calacanis and Dave Winer have suddenly started @ messaging each other again it’s like IRC coming back from a network split :-))
  • Similarly if someone with a very high followers to following ratio follows a Twitter user who has just a couple of followers, it’s a safe bet that those two are somehow friends with a pre-existing relationship.
  • I bet you could do a pretty good job of putting Twitter users into boxes just based on their overall behavior, something like the 16 Myers-Briggs categories. Do you follow people back when they follow you? Do you @ answer people who @ address you (and Twitter knows when you’ve seen the original message)? Do you send @ messages to people (and how influential are those people)? Do those people @ you back (and how influential those people are says something about how interesting / provocative you are)? Do you follow tons and tons of people? Do you follow people and then un-follow them if they don’t follow you back? Do you follow random links in other people’s Twitters, and are those links accompanied by descriptive text or tinyurl links? Do you @ message people after you follow their links? Do your Twitter times follow a strict pattern, or are you on at all hours, or suddenly spending days without Twittering? Do you visit and just read much more than you tweet? How much old stuff do you read? Do you tend to talk in public or via DM? Are your tweets public? All that without even considering the content of your Twitters.
  • Could Twitter become a search engine? That’s not a 100% serious question, but it’s worth considering. I don’t mean just making the content of all tweet searchable, I mean it with some sort of ranking algorithm, again perhaps akin to PageRank. If you somehow rank results by the importance or closeness of the user whose tweets match the search terms, you might have something interesting.
  • Twitter also presumably know who’s talking about whom in the DM backchat. They can’t use that information in obvious way, but it’s of high value.
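
As a toy illustration of the co-citation idea above (this assumes you’ve somehow already fetched each user’s follow set; it’s not Twitter API code):

    def similarity(followsX, followsY):
        # Fraction of the combined follow sets that X and Y have in common.
        common = len(followsX & followsY)
        total = len(followsX | followsY)
        return float(common) / total if total else 0.0

    # Suggest to X the people followed by X's most similar users that X
    # doesn't already follow.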

I could go on for hours, but that’s more than enough for now. I don’t feel like any of the above list is particularly compelling, but I do think the list of nice things they could be doing is extremely long and that Twitter have only just begun (at least publicly) to tap into the value they’re sitting on.

I think Google should buy Twitter. They have what Twitter needs: 1) engineering and scale, 2) link analysis and algorithm brilliance, and 3) they’re in a position to monetize the value illustrated above (via their search engine, that already has ads) without pissing off the Twitter community by e.g., running ads on Twitter. What percentage of Twitter users also use Google? I bet it’s very high.

Python: looks great, stays wet longer

00:02 June 8th, 2008 by terry. Posted under python, tech. 12 Comments »

I should be coding, not blogging. But a friend noticed I hadn’t blogged in a month, so in lieu of emailing people, here are a couple of comments on programming in Python. There are many things that could be said, but I just want to make two points that I think aren’t so obvious.

1. Python looks great

In Python, indentation is used to delimit code blocks. I like that a lot – you would indent your code anyway, right? It reduces clutter. But apart from that, Python is very minimalistic in its syntax. There are rather few punctuation symbols used, and they’re used pretty consistently. As a result, Python code looks great on the page. It’s not painful to edit, and I mean that figuratively and literally. This is worth noting because when you write complex code it’s nice if the language you’re doing it in is very clean. That’s important because code can become hard to understand and unpleasant to work with. If you have pieces of code that you dread touching, that may be in part because the code is really ugly and complex on the page. Perl is a case in point – there’s tons of punctuation symbols, and in some cases the same thing (e.g., curly braces) is used in multiple (about 5!) different ways to mean different things. If the language is pleasant to look at for longer, you are more willing to work on code that might be more forbidding when expressed in other languages. Esthetics is important. Actively enjoying looking at code simply because the language is so clean is a great advantage—for you, and for the language.
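
A throwaway illustration, nothing more: the blocks below are delimited by nothing but indentation, and there’s hardly a punctuation character in sight.

    def total(items):
        result = 0
        for item in items:
            # The loop and if bodies are just indentation, no braces, no 'end'.
            if item > 0:
                result += item
        return result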

This might not seem like a big point, but it’s important to me, it’s something I’ve never encountered before, and it’s a nice property of Python. BTW, people always make fun of Lisp for its parentheses. But Lisp is the cleanest language I know of in terms of simplicity on the page. The parens and using prefix operators in S-expressions removes the need for almost all other punctuation (and makes programmatically generating code an absolute breeze).

2. Python stays wet longer

I don’t like to do too much formal planning of code. I much prefer to sit down and try writing something to see how it fits. That means I’ll often go through several iterations of code design before I reach the point where I’m happy. Sometimes this is an inefficient way to do things, particularly when you’re working on something very complex that you don’t really have your head around when you start. But I still choose to do things this way because it’s fun.

Sometimes I think of it like pottery. You grab a lump of wet clay and slap it down on the wheel. Then you try out various ideas to shape whatever it is you’re trying to create. If it doesn’t work, you re-shape it—perhaps from scratch. This isn’t a very accurate analogy, but I do think it’s valid to say that preferring to work with real code in an attempt to understand how best to shape your ideas is a much more physical process than trying to spec everything out sans code. I find I can’t know if code to implement an idea or solve a problem is going to feel right unless I physically play with it in different forms.

For me, Python stays wet longer. I can re-shape my code really easily in Python. In other languages I’ve often found myself in a position where a re-design of some aspect involves lots of work. In Python the opposite has been true, and that’s a real pleasure. When you realize you should be doing things differently and it’s just a bit of quick editing to re-organize things, you notice. I might gradually be becoming a better programmer, but I mainly feel that in using Python I simply have better quality clay.

Everything you think you know is wrong

01:05 April 11th, 2008 by terry. Posted under other, tech. 2 Comments »

I’m often surprised at how confident people are about their knowledge of the world. Looking at the history of thought and of science, you quickly see that it’s strewn with discredited and totally incorrect theories about almost everything. So I don’t understand why it’s not more commonplace to look at history and to arrive immediately at the most likely conclusion: that we too have almost everything wrong.

I don’t mean that literally everything we think is completely wrong. Some things are certainly partly right, or even mainly or fully right. But to have a high degree of confidence, or to assume we’re right just because we know so much more about the world than our ancestors did, or simply because we think we’re right, is just inviting ridicule. Considering our record, and our continual attendant misguided arrogance and confidence along the way, you’d be nuts to think that we know much today or that our confidence adds any weight at all. Many thousands of years of history argue strongly against that conclusion.

Thinking that almost everything is probably wrong in some important fundamental way is a useful default. That attitude stands you in good stead for digging into things, for reconsidering them, for asking questions at a low level. In mathematics when you know for sure that something is wrong (or right) it helps enormously in proving it. It’s a psychological thing. In my dissertation I proved a statistical result that I knew must be true from running simulations. It took me a week or two to nail the proof, and I would never have gotten there if I hadn’t known in advance that the equality I was trying to prove analytically was certainly true (pp 201-207 here in case you’re interested).

As an example of something that I think will be overturned, I think we’ll come to regard our decades of designing computational systems according to the Von Neumann Architecture as extremely primitive. Maybe that will involve some form of analog or quantum computation. I think we’ll take more and more from nature, for instance in solving optimization problems.

On a less grandiose note but still important, I think we’ll look back on our current information architecture and also see it as being extremely primitive. Or, as I’ve said before, we’re living in the shadow of information architecture decisions that were made decades ago. I think that’s all hopelessly wrong. In the real world, information processing simply doesn’t look much like a hierarchical file system.

Hence Fluidinfo.

And so ends another semi-cryptic and ultimately unsatisfying post. I do, as always, plan to eventually say more. And I will.

Twitter dynamics: unfollowing guykawasaki, Scobleizer and cameronreilly

16:16 March 22nd, 2008 by terry. Posted under tech, twitter. 16 Comments »

I’ve only got so much time a day to read blogs, Twitters, etc.

With blogs I find that I tend to try to keep up with those that post at a frequency at or below what I can handle, irrespective of quality of content. There are lots of blogs that I really enjoy, but which post new material so often that I end up never going to their sites. E.g., BoingBoing or ReadWriteWeb. I tend to always go to new content at blogs I like that have about one new article a day. I have dozens of examples in both these categories.

With blogs it’s no problem if some of the sites you’re subscribed to have tons of content. If you never click through on the indicator that there are 500 unread postings, you never see them.

On Twitter though the dynamic is very different. I follow about 140 people. From time to time during the day – normally when I’m drinking a coffee like I am now, or eating food – I’ll go have a look at Twitter to see what’s up in the wider world.

Unlike with blogs, if someone posts hundreds of Twitter updates you’re going to see them all. You’re perhaps going to see something like the image above (click for larger version). That’s not what I want to see at all. I’m hoping to see a whole bunch of people posting a few things, not screen after screen of one person talking to many people I don’t know or follow. It’s worse than being in a room with someone talking loudly on a mobile phone, hearing just one side of the conversation – this is like being in a room with that same person, but they’re talking to multiple people at once.

So with some reluctance I have recently un-followed Scobleizer, guykawasaki and cameronreilly. I actually like much of their content, but they have much too much of an unbalancing effect on my overall Twitter experience.

Move along.


iPod vending machine

12:25 March 9th, 2008 by terry. Posted under tech, travel. 3 Comments »

Here’s an iPod vending machine I just passed on Concourse A in the Atlanta airport. It also offers a variety of other gadgets: headphones from Harman Kardon and Bose, laptop chargers, digital cameras (including 2 models more advanced than the one I just bought), etc. I didn’t check the prices, which are only shown on the LCD screen you can see the couple using.


ETech Antigenic Cartography presentation online

19:01 March 7th, 2008 by terry. Posted under tech. Comments Off on ETech Antigenic Cartography presentation online

Image: ETech logo
I gave my ETech talk on Wednesday afternoon. The Keynote presentation and a PDF of the slides are online.

This was my second presentation made with Keynote. It took me quite a few days to put it together. Keynote has a few nits that make it slightly awkward to use, but overall it’s really, really good. I learned a lot.

With Powerpoint you need to put in a lot of work to make things look good. In Keynote it would take work to make them look bad. The presentation themes are beautiful out of the box. And it’s extremely easy to work with.

I’m even thinking of buying a new laptop to run Linux on so I don’t have to dump Keynote. I could use Parallels, but I don’t want to spend all my time working in a virtual machine.


Keynote is good

19:15 February 15th, 2008 by terry. Posted under tech. 2 Comments »

Image: Roman numerals in Keynote

I’ve been playing with Keynote to make a presentation. There are a lot of things I don’t really like about using a Mac, but Keynote is not one of them.

It makes really attractive presentations. It’s easy to use. The help actually helps. You can export to multiple formats (Quicktime, Powerpoint, PDF, images, Flash, HTML, iPod).

And, it’s fun to use. I’m going to miss it when I head back to Linux.


Worst of the web award: Cheaptickets

16:22 February 14th, 2008 by terry. Posted under companies, me, tech. 10 Comments »

Here’s a great example of terrible (for me at least) UI design.

I was just trying to change a ticket booking at Cheaptickets. Here’s the interface for selecting what you want to change (click to see the full image).

Image: the Cheaptickets change-booking form

As you can see, I indicated a date/time change on my return flight. When I clicked on the continue button, I got an error message:

An error has occurred while processing this page. Please see detail below. (Message 1500)

Please select flight attributes to change.

I thought there was some problem with Firefox not sending the information that I’d checked. So I tried again. Then I tried clicking a couple of the boxes. Then I tried with Opera. Then I changed machines and tried with IE on a windows box. All of these got me the exact same error.

I looked at the page several times to see if I’d missed something – like a check box to indicate which of the flights to change. I figured Cheaptickets must have a server-side error. Then I thought: come on, you must be doing something wrong.

Then I figured it out. Can you?


The power of representation: Adding powers of two

17:42 February 13th, 2008 by terry. Posted under representation, tech. 5 Comments »

Image: the sum in decimal

On the left is an addition problem. If you know the answer without thinking, you’re probably a geek.

Suppose you had to solve a large number of problems of this type: adding consecutive powers of 2 starting from 1. If you did enough of them you might guess that 1 + 2 + 4 + … + 2^(n−1) is always equal to 2^n − 1. In the example on the left, we’re summing from 2^0 to 2^10 and the answer is 2^11 − 1 = 2047.

And if you cast your mind back to high-school mathematics you might even be able to prove this using induction.

But that’s a lot of work, even supposing you see the pattern and are able to do a proof by induction.

Image: the sum in binary

Let’s instead think about the problem in binary (i.e., base 2). In binary, the sum looks like the image on the right.

There’s really no work to be done here. If you think in binary, you already know the answer to this “problem”. It would be a waste of time to even write the problem down. It’s like asking a regular base-10 human to add up 3 + 30 + 300 + 3000 + 30000, for example. You already know the answer. In a sense there is no problem because your representation is so nicely aligned with the task that the problem seems to vanish.
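
Here’s the same observation as a few lines of Python – just a throwaway sketch – in case you’d rather see it in code:

```python
# Sum the powers of two from 2^0 to 2^10 and look at the result both ways.
n = 11
total = sum(2 ** i for i in range(n))

print(total)       # 2047: in decimal, nothing jumps out at you
print(bin(total))  # 0b11111111111: in binary, it's obviously eleven 1s

# The general pattern: 1 + 2 + 4 + ... + 2^(n-1) == 2^n - 1
assert total == 2 ** n - 1
```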

Why am I telling you all this?

Because, as I’ve emphasized in three other postings, if you choose a good representation, what looks like a problem can simply disappear.

I claim (without proof) that many of the issues we’re running up against today – as we move to a programmable web and integrated social networks, and as we struggle with data portability, ownership, and control – will similarly vanish if we simply start representing information in a different way.

I’m trying to provide some simple examples of how this sort of magic can happen. There’s nothing deep here. In the non-computer world we wouldn’t talk about representation, we’d just say that you need to look at the problem from the right point of view. Once you do that, you see that it’s actually trivial.


Talking about Antigenic Cartography at ETech

13:34 February 12th, 2008 by terry. Posted under me, tech. Comments Off on Talking about Antigenic Cartography at ETech

ETech 2008Blogs are all about self-promotion, right? Right.

I’m talking at ETech in the first week of March in San Diego. The talk is at 2pm on Wednesday March 5, and is titled Antigenic Cartography: Visualizing Viral Evolution for Influenza Vaccine Design.

You can find out more about Antigenic Cartography here and here.

Here’s my abstract:

Mankind has been fighting influenza for thousands of years. The 1918 pandemic killed 50-100 million people. Today, influenza kills roughly half a million people each year. Because the virus evolves, it is necessary for vaccines to track its evolution closely in order to remain effective.

Antigenic Cartography is a new computational method that allows a unique visualization of viral evolution. First published in 2004, the technique is now used to aid the WHO in recommending the composition of human influenza vaccines. It is also being applied to the design of pandemic influenza vaccines and to the study of a variety of other infectious diseases.

The rise of Antigenic Cartography is a remarkable story of how recent immunological theory, mathematics, and computer science have combined with decades of virological and medical research and diligent data collection to produce an entirely new tool with immediate practical impact.

This talk will give you food for thought regarding influenza, and move on to explain what Antigenic Cartography is, how it works, and exactly how it is used to aid vaccine strain selection—all in layman’s terms, with no need for a biological or mathematical background.

In case you’re wondering, no, I didn’t go so far as to make the “I’m speaking” image above. I chose it from the conference speaker resources. Self-promotion has its limits.


Google maps gets SFO location waaaay wrong

22:16 January 28th, 2008 by terry. Posted under companies, tech. 1 Comment »

Image: Google maps directions from SFO

Before leaving Barcelona yesterday morning, I checked Google maps to get driving directions from San Francisco International airport (SFO) to a friend’s place in Oakland.

Google got it way wrong. Imagine trying to follow these instructions if you didn’t know they were so wrong. Click on the image to see the full sized map. Google maps is working again now.


Amazon S3 to rival the Big Bang?

00:40 January 28th, 2008 by terry. Posted under companies, tech. 4 Comments »

Note: this posting is based on an incorrect number from an Amazon slide. I’ve now re-done the revenue numbers.

We’ve been playing around with Amazon’s Simple Storage Service (S3).

Adam Selipsky, Amazon VP of Web Services, has put some S3 usage numbers online (see slides 7 and 8). Here are some numbers on those numbers.

There were 5,000,000,000 (5e9) objects inside S3 in April 2007 and 10,000,000,000,000 (1e13) in October 2007. That means that in October 2007, S3 contained 2,000 times more objects than it did in April 2007. That’s a 26 week period, or 182 days. 2,000 is roughly 2^11. That means that S3 is doubling its number of objects roughly once every 182/11 = 16.5 days. (That’s supposing that the growth is merely exponential – i.e., that the logarithm of the number of objects is increasing linearly. It could actually be super-exponential, but let’s just pretend it’s only exponential.)
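
For anyone who wants to check that arithmetic, here it is in a couple of lines of Python (same figures as above):

```python
import math

# Doubling-time arithmetic from the slide numbers above.
objects_apr_2007 = 5e9
objects_oct_2007 = 1e13
days_between = 26 * 7  # 182 days

doublings = math.log2(objects_oct_2007 / objects_apr_2007)  # log2(2000), about 11
print(days_between / doublings)  # roughly 16.6 days per doubling
```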

First of all, that’s simply amazing.

It’s now 119 days since the beginning of October 2007, so we might imagine that S3 now has 2^(119/16.5) or about 150 times as many objects in it. That’s 1,500,000,000,000,000 (1.5e15) objects. BTW, I assume by object they mean a key/value pair in a bucket (these are put into and retrieved from S3 using HTTP PUT and GET requests).

Amazon’s S3 pricing for storage is $0.15 per GB per month. Assume all this data is stored on their cheaper US servers and that objects take on average 1K bytes. These seem reasonable assumptions. (A year ago at ETech, SmugMug CEO Don MacAskill said they had 200TB of image data in S3, and images obviously occupy far more than 1K each. So do backups.) So that’s roughly 1.5e15 * 1K / 1G = 1.5e9 gigabytes in storage, for which Amazon charges $0.15 per month, or $225M.

That’s $225M in revenue per month just for storage. And growing rapidly – S3 is doubling its number of objects every 2 weeks, so the increase in storage might be similar.

Next, let’s do incoming data transfer cost, at $0.10 per GB. That’s simply 2/3rds of the data storage charge, so we add another 2/3 * $225M, or $150M.

What about the PUT requests, that transmit the new objects?

If you’re doubling every 2 weeks, then in the last month you’ve doubled twice. So that means that a month ago S3 would have had 1.5e15 / 4 = 3.75e14 objects. That means 1.125e15 new objects were added in the last month! Each of those takes an HTTP PUT request. PUTs are charged at one penny per thousand, so that’s 1.125e15 / 1000 * $0.01.

Correct me if I’m wrong, but that looks like $11,250,000,000.

To paraphrase a scene I loved in Blazing Saddles (I was only 11, so give me a break), that’s a shitload of pennies.

Lastly, some of that stored data is being retrieved. Some will just be backups, and never touched, and some will simply not be looked at in a given month. Let’s assume that just 1% of all (i.e., not just the new) objects and data are retrieved in any given month.

That’s 1.5e15 * 1K * 1% / 1e9 = 15M GB of outgoing data, or 15K TB. Let’s assume this all goes out at the lowest rate, $0.13 per GB, giving another $2M in revenue.

And if 1% of objects are being pulled back, that’s 1.5e15 * 1% = 1.5e13 GET operations, which are charged at $0.01 per 10K. So that’s 1.5e13 / 10K * $0.01 = $15M for the GETs.

This gives a total of $225M + $150M + $11,250M + $2M + $15M = $11,642M in the last month. That’s $11.6 billion. Not a bad month.
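
Here’s the whole estimate as a small Python script – all of it resting on the assumptions above – in case you want to fiddle with the numbers yourself:

```python
# Back-of-the-envelope S3 revenue for the last month, using the assumptions above:
# 1.5e15 objects of 1K bytes each, doubling every two weeks, 1% of objects retrieved.
objects = 1.5e15
stored_gb = objects * 1e3 / 1e9              # ~1.5e9 GB stored

storage  = stored_gb * 0.15                  # $0.15/GB/month        -> ~$225M
data_in  = stored_gb * 0.10                  # $0.10/GB incoming     -> ~$150M
new_objs = objects - objects / 4             # doubled twice -> 1.125e15 new objects
puts     = new_objs / 1000 * 0.01            # $0.01 per 1,000 PUTs  -> ~$11,250M
data_out = stored_gb * 0.01 * 0.13           # 1% out at $0.13/GB    -> ~$2M
gets     = objects * 0.01 / 10000 * 0.01     # $0.01 per 10,000 GETs -> ~$15M

total = storage + data_in + puts + data_out + gets
print(f"${total / 1e6:,.0f}M for the month") # ~$11,642M
```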

Can this simple analysis possibly be right?

It’s pretty clear that Amazon are not making $11B per month from S3. So what gives?

One hint that they’re not making that much money comes from slide 8 of the Selipsky presentation. That tells us that in October 2007, S3 was handling 27,601 transactions per second. That’s about 7e10 per month. If Amazon was already doubling every two weeks by that stage, then 3/4 of their 1e13 S3 objects would have been new that month. That’s 7.5e12, which is 100 times more transactions just for the incoming PUTs (no outgoing) than are represented by the 27,601 number. (It’s not clear what they mean by a transaction – i.e., what exactly goes on in a single one.)
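
The arithmetic behind that discrepancy, in the same back-of-the-envelope style:

```python
# Sanity check: reported transaction rate vs. the PUTs implied by the growth numbers.
tx_per_month = 27601 * 86400 * 30     # ~7e10 transactions in a 30-day month
new_objects  = 1e13 * 3 / 4           # 3/4 of 1e13 objects new that month if doubling twice
print(new_objects / tx_per_month)     # ~100: far more PUTs than reported transactions
```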

So something definitely doesn’t add up there. It may be more accurate to divide the revenue due to PUTs by 100, bringing it down to a measly $110M.

An unmentioned assumption above is that Amazon is actually charging everyone, including themselves, for the use of S3. They might have special deals with other companies, or they might be using S3 themselves to store tons of tiny objects. I.e., we don’t know that the reported number is of paid objects.

There’s something of a “give away the razors and charge for the blades” feel to this. When you first see Amazon’s pricing, it looks extremely cheap. You can buy external disk space for, e.g., $100 for 500GB, or $0.20 per GB. Amazon charges you just $0.18 per GB for replicated storage. But that’s per month. A disk might last you two years, so we could conclude that Amazon is, e.g., 8 or 12 times more expensive, depending on the degree of replication. But you don’t need to build, grow, or shrink a data center, or pay for cooling, employees, and replacement disks – all of which has been noted many times – so perhaps the premium isn’t that high.

But…. look at those PUT requests! If an object is 1K (as above), it takes 500M of them to fill a 500GB disk. Amazon charges you $0.01 per 1000, so that’s 500K * $0.01 or $5000. That’s $10 per GB just to access your disk (i.e., before you even think about transfer costs and latency), which is about 50 times the cost of disk space above.

In paying by the PUT and GET, S3 users are in effect paying Amazon for the compute resources needed to store and retrieve their objects. If we estimate that it takes Amazon 10ms to process a PUT, then 1000 of them take 10 seconds of compute time, for which Amazon charges $0.01. Over a month that works out to roughly $2,600 per machine just for PUT processing, which is around 36 times what Amazon would charge you to run a small EC2 instance for the same month. Such a machine probably costs Amazon around $1500 to bring into service. So there’s no doubt they’re raking it in on the PUT charges. That makes the 5% margins of their retailing operation look quaint. Wall Street might soon be urging Bezos to get out of the retailing business.
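
The same sums in Python, covering both the disk-filling cost above and the per-machine PUT revenue (10ms per PUT and a $0.10/hour small EC2 instance are my assumptions):

```python
# PUT charges for filling a 500GB disk with 1K objects.
objects_per_disk = 500e9 / 1e3              # 500 million objects
fill_cost = objects_per_disk / 1000 * 0.01  # $0.01 per 1,000 PUTs -> $5,000
print(fill_cost, fill_cost / 500)           # $5,000 in total, i.e. $10 per GB

# PUT revenue per machine, assuming 10ms of compute per PUT.
seconds_per_month = 30 * 86400
puts_per_month = seconds_per_month / 0.010  # ~2.6e8 PUTs per machine per month
revenue = puts_per_month / 1000 * 0.01      # ~$2,600 per machine per month
print(revenue, revenue / 72)                # vs ~$72/month for a small EC2 instance: ~36x
```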

Given that PUTs are so expensive, you can expect to see people encoding lots of data into single S3 objects, transmitting them all at once (one PUT), and decoding when they get the object back. That pushes programmers towards using more complex formats for their data. That’s a bad side-effect. A storage system shouldn’t encourage that sort of thing in programmers.

Nothing can double every two weeks for very long, so that kind of growth simply cannot continue. It may have leveled out in October 2007, which would make my numbers off by roughly 2^(119/16.5), or about 150, as above.

When we were kids they told us that the universe has about 2^80 particles in it. 1.5e15 is already about 2^50, so only 30 more doublings are needed, which would take Amazon just over a year. At that point, even if all their storage were in 1TB drives and objects were somehow stored in just 1 byte each, they’d still need about 2^40 disk drives. The earth has a surface area of 510,065,600 km², so that would mean over 2,000 Amazon disk drives in each square kilometer on earth. That’s clearly not going to happen.
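
The arithmetic, for the skeptical:

```python
import math

# How many doublings until S3 holds as many objects as the universe has particles?
objects_now = 1.5e15
particles   = 2 ** 80
doublings_left = math.log2(particles / objects_now)  # ~30
print(doublings_left * 16.5 / 365)   # a bit over a year at one doubling per 16.5 days

# And the drives needed for 2^80 one-byte objects on 1TB drives.
drives = 2 ** 80 / 2 ** 40           # 2^40 drives
print(drives / 510065600)            # ~2,150 drives per square kilometer of the earth
```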

It’s also worth bearing in mind that Amazon claims data stored into S3 is replicated. Even if the replication factor is only 2, that’s another doubling of the storage requirement.

At what point does this growth stop?

Amazon has its Q4 2007 earnings call this Wednesday. That should be revealing. If I had any money I’d consider buying stock ASAP.
