Add to Technorati Favorites

jsongrep.py – Python for extracting pieces of JSON objects

02:10 November 25th, 2010 by terry. Posted under FluidDB, python. 18 Comments »

Lots of APIs these days return JSON objects. I love JSON, but reading a raw JSON dump can be awkward. You can print the whole thing using pprint in Python (or your favorite language), but what if you want to grep out and print only parts of the JSON? I was thinking tonight that it would be easy and useful to write a recursive script to do that. It took about half an hour to arrive at this solution:

#!/usr/bin/env python

import sys
import re
import json
from pprint import pprint

def jsongrep(d, patterns):
    try:
        pattern = patterns.pop(0)
    except IndexError:
        pprint(d)
    else:
        if isinstance(d, dict):
            keys = filter(pattern.match, d.keys())
        elif isinstance(d, list):
            keys = map(int,
                       filter(pattern.match,
                              ['%d' % i for i in range(len(d))]))
        else:
            if pattern.match(str(d)):
                pprint(d)
            return
        for item in (d[key] for key in keys):
            jsongrep(item, patterns[:])

if __name__ == '__main__':
    try:
        j = json.loads(sys.stdin.read())
    except ValueError, e:
        print >>sys.stderr, 'Could not load JSON object from stdin.'
        sys.exit(1)

    jsongrep(j, map(re.compile, sys.argv[1:]))

Usage is really simple. Let’s look at a couple of easy examples from the command line:

$ echo '{"hey" : "you"}' | jsongrep.py hey
u'you'

jsongrep.py has matched the “hey” key in the JSON object and printed its value. Let’s do the same thing with a 2-level JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey 
{u'there': u'you'}

Again, we see the entire object corresponding to the “hey” key. We can add another argument to drill down into the object

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey there
u'you'

As you might hope, you can use a regular expression for an argument:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py 'h.*' '.*'
u'you'

which in this case could have been given more concisely as

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py h .
u'you'

So you can drill down into nested dictionaries quite easily. When jsongrep.py runs out of patterns it just prints whatever’s left. A special case of this is if you give no patterns at all, you get the whole JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py
{u'hey': {u'there': u'you'}}

The regex patterns you pass on the command line are being matched against the keys of JSON objects (Python dicts). If jsongrep.py runs into a list, it will instead match against the list indices like so:

$ echo '{"hey" : { "joe" : ["blow", "xxx" ]}}' | jsongrep.py hey joe 1
u'xxx'

You can see we’ve pulled out just the first list element after matching “hey” and “joe”. So jsongrep.py regex args can be used to navigate your way through both JSON objects and lists.

Now let’s do something more interesting.

Twitter’s API can give you JSON, and the JSON is pretty chunky. For example, if I get my first 100 followers with this command:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones'

there’s 164Kb of output (try it and see). What if I just want the Twitter user names of the people who follow me? Looking at the JSON, I can see it starts with:

[{"profile_background_color":"131516","description":null

Hmm… looks like it’s a list of dictionaries. Let’s print just the first dictionary in the list:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py 0

which starts out:

{u'contributors_enabled': False,
 u'created_at': u'Wed Jul 19 00:29:58 +0000 2006',
 u'description': None,
 u'favourites_count': 0,
 u'follow_request_sent': False,
 u'followers_count': 178,
 u'following': False,
 u'friends_count': 67,
 u'geo_enabled': False,
 u'id': 2471,
 u'id_str': u'2471',
 u'lang': u'en',
 u'listed_count': 3,
 u'location': None,
 u'name': u'Roy',
 u'notifications': False,
 u'profile_background_color': u'131516',
 u'profile_background_image_url': u'http://s.twimg.com/a/1288470193/images/themes/theme14/bg.gif',
 u'profile_background_tile': True,
 u'profile_image_url': u'http://a3.twimg.com/profile_images/194788727/roy_with_phone_normal.jpg',
 u'profile_link_color': u'009999',
 u'profile_sidebar_border_color': u'eeeeee',
 u'profile_sidebar_fill_color': u'efefef',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'wasroykosuge',

and you can see a “screen_name” key in there which looks like what we want. Let’s see the first few:

$ curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py . screen_name | head
u'wasroykosuge'
u'Piiiu_piiiu'
u'350'
u'KardioFit'
u'jrecursive'
u'doodlesockingad'
u'revinprogress'
u'cloudjobs'
u'PointGcomics'
u'lucasbuchala'

Finally, here’s an example using FluidDB‘s new /values HTTP call. I’ll ask FluidDB for all objects matching the query has unionsquareventures.com/portfolio and from those matching objects I’ll pull back the value of the FluidDB tag named fluiddb/about. The result is JSON that starts out like this:

{"results": { "id" : {"93989942-b519-49b4-87de-ac834e6a6161": {"fluiddb/about": {"value": "http://www.outside.in"}}

You can see there’s a 5-level deep nesting of JSON objects. I just want the “value” key on all matching objects. Easy:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value | sort
u'http://amee.cc'
u'http://getglue.com'
u'http://stackoverflow.com'
u'http://tumblr.com'
u'http://www.10gen.com'
u'http://www.boxee.tv'
u'http://www.buglabs.net'
u'http://www.clickable.com'
u'http://www.cv.im'
u'http://www.disqus.com'
u'http://www.etsy.com'
u'http://www.flurry.com'
u'http://www.foursquare.com'
u'http://www.heyzap.com'
u'http://www.indeed.com'
u'http://www.meetup.com'
u'http://www.oddcast.com'
u'http://www.outside.in'
u'http://www.returnpath.net'
u'http://www.shapeways.com'
u'http://www.simulmedia.com'
u'http://www.targetspot.com'
u'http://www.tastylabs.com'
u'http://www.tracked.com'
u'http://www.twilio.com'
u'http://www.twitter.com'
u'http://www.workmarket.com'
u'http://www.zemanta.com'
u'http://zynga.com'

And there you have it, a sorted list of all Union Square Ventures portfolio companies, from the command line, as listed here.

jsongrep.py will also try to match on things that are not objects or lists if it runs into them, so we can refine this list a little. E.g.,

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*ee' | sort
u'http://www.meetup.com'
u'http://amee.cc'
u'http://www.indeed.com'
u'http://www.boxee.tv'

being the USV companies with “ee” somewhere in their URL. Or, for some advanced regex fun, the USV companies whose URLs don’t end in “.com”:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*(?

AddThis Social Bookmark Button

Twisted’s Deferred class chainDeferred method is too simplistic?

02:25 August 14th, 2010 by terry. Posted under deferreds, python, twisted. 1 Comment »

I think it’s worth spending some time thinking about whether chainDeferred in twisted.internet.defer.Deferred is too simplistic. I’ve thought for a while that it could be more helpful in preventing people from doing unintended things and/or cause fewer surprises (e.g., see point #1 in this post).

Those are smaller things that people can obviously live with. But a more important one concerns deferred cancellation. See uncalled.py below for why.

Here are five examples of chainDeferred behavior that I think could be better. In an attempt to fix these, I made a few changes to a copy of defer.py (immodestly called tdefer.py, sorry) which you can grab, with runnable examples, here. The tdefer.py code is meant as a suggested approach. I doubt that it’s bulletproof.

Here are the examples.

boom1.py:

# Normal deferreds: this raises defer.AlreadyCalledError because
# the callback of d1 causes the callback of d2 to be called, but d2 has
# already been cancelled (and hence called).

# With tdefer.py: there is no error because d1.callback will not call
# d2 as it has already been cancelled.

def printCancel(fail):
    fail.trap(defer.CancelledError)
    print 'cancelled'

def canceller(d):
    print 'cancelling'

d1 = defer.Deferred()
d2 = defer.Deferred(canceller)
d2.addErrback(printCancel)
d1.chainDeferred(d2)
d2.cancel()
d1.callback('hey')

boom2.py:

# Normally: raises defer.AlreadyCalledError because calling d1.callback
# will call d2, which has already been called.

# With tdefer.py: Raises AssertionError: "Can't callback an already
# chained deferred" because calling callback on a deferred that's
# already been chained is asking for trouble (as above).

d1 = defer.Deferred()
d2 = defer.Deferred()
d1.chainDeferred(d2)
d2.callback('hey')
d1.callback('jude')

uncalled.py:

# Normally: although d2 has been chained to d1, when d1 is cancelled,
# d2's cancel method is never called. Even calling d2.cancel ourselves
# after the call to d1.cancel has no effect, as d2 has already been
# called.

# With tdefer: both cancel1 and cancel2 are called when d1.cancel is
# called. The additional final call to d2.cancel correctly has no
# effect as d2 has been called (via d1.cancel).

def cancel1(d):
    print 'cancel one'

def cancel2(d):
    print 'cancel two'

def reportCancel(fail, which):
    fail.trap(defer.CancelledError)
    print 'cancelled', which

d1 = defer.Deferred(cancel1)
d1.addErrback(reportCancel, 'one')
d2 = defer.Deferred(cancel2)
d2.addErrback(reportCancel, 'two')
d1.chainDeferred(d2)
d1.cancel()
d2.cancel()

unexpected1.py:

# Normally: prints "called: None", instead of the probably expected
# "called: hey"

# tdefer.py: prints "called: hey"

def called(result):
    print 'called:', result

d1 = defer.Deferred()
d2 = defer.Deferred()
d1.chainDeferred(d2)
d1.addCallback(called)
d1.callback('hey')

unexpected2.py:

# Normally: prints
#   called 2: hey
#   called 3: None

# tdefer.py: prints
#   called 2: hey
#   called 3: hey

def report2(result):
    print 'called 2:', result

def report3(result):
    print 'called 3:', result

d1 = defer.Deferred()
d2 = defer.Deferred().addCallback(report2)
d3 = defer.Deferred().addCallback(report3)
d1.chainDeferred(d2)
d1.chainDeferred(d3)
d1.callback('hey')

I wont go into detail here as this post is already long enough. Those are 3 classes of behavior arising from chainDeferred being very simplistic. Comments welcome, of course. Once again, the runnable code is here.

AddThis Social Bookmark Button

Asynchronous data structures with Twisted Deferreds

23:55 July 23rd, 2010 by terry. Posted under deferreds, me, python, twisted. 8 Comments »

Earlier this week I gave a talk titled Deferred Gratification (slides) at EuroPython in Birmingham. The title was supposed to hint at how much I love Twisted‘s Deferred class, and that it took me some time to really appreciate it.

While thinking about Deferreds in the days before the talk, I think I put my finger on one of the reasons I find Deferreds so attractive and elegant. The thought can be neatly summarized: Deferreds let you replace synchronous data structures with elegant asynchronous ones.

There are various ways that people get exposed to thinking about programming in an asynchronous style. The great majority do so via building user interfaces, usually with widget libraries wherein one specifies widgets and layouts, sets up event handlers, and then surrenders control to a main loop that thereafter routes events to handlers. Exposure to asynchronous programming via building UIs is becoming much more common as Javascript programmers build client-side web apps that operate asynchronously with web servers (and the “A” in AJAX of course stands for asynchronous). Others become aware of asynchronous programming via writing network code (perhaps using Twisted). Relatively few become aware of asynchronous programming via doing async filesystem I/O.

Because Twisted’s Deferreds don’t have anything to do with UIs or networking or filesystems, you can use them to implement other asynchronous things, like an asynchronous data structure. To show you what I mean, here’s a slightly simplified version of Twisted’s DeferredQueue class, taken from twisted/internet/defer.py:

class DeferredQueue(object):

    def __init__(self):
        # It would be better if these both used collections.deque (see comments section below).
        self.waiting = [] # Deferreds that are expecting an object
        self.pending = [] # Objects queued to go out via get.

    def put(self, obj):
        if self.waiting:
            self.waiting.pop(0).callback(obj)
        else:
            self.pending.append(obj)

    def get(self):
        if self.pending:
            return succeed(self.pending.pop(0))
        else:
            d = Deferred()
            self.waiting.append(d)
            return d

I find this extremely elegant, and I’m going to explain why.

But first, think about code for a regular synchronous queue. What happens if you call get on a regular queue that’s empty? Almost certainly one of two things: you’ll get some kind of QueueEmpty error, or else your code will block until some other code puts something into the queue. I.e., you either get a synchronous error or you get a synchronous non-empty response.

If you look at the get method in the code above, you’ll see that if the queue is empty (i.e., the pending list is empty), a new Deferred is made, added to self.waiting, and is immediately returned to the caller. So code calling get on an empty queue doesn’t get an error and doesn’t block, it always gets a result back essentially immediately. How can you get a result from an empty queue? Easy: the result is a Deferred. And because we’re in the asynchronous world, you just attach callbacks (like event handlers in the UI world) to the Deferred, and go on your merry way.

If you can follow that thinking, the rest of the code in the class above should be easy to grok. In put, if there are any outstanding Deferreds (i.e., earlier callers who hit an empty queue and got a Deferred back), the incoming object is given to the first of these by passing it to the callback function of the Deferred (and popping it out of the waiting list). If there are no outstanding Deferreds expecting a result, the incoming object is simply appended to self.pending. On the get side, if the queue (i.e., self.pending) is non-empty, the code creates a Deferred that has already been fired (using the succeed helper function) and returns it.

By now, this kind of code seems routine and straightforward to me. But it certainly wasn’t always that way. So if the above seems cryptic or abstract, I encourage you to think it over, maybe write some code, ask questions below, etc. To my mind, these kinds of constructs – simple, elegant, robust, practical, obviously(?) bug-free, single-threaded, etc. – are extremely instructive. You too can solve potentially gnarly (at least in the threaded world) networking coding challenges in very simple ways just by building code out of simple Twisted Deferreds (there’s nothing special about the DeferredQueue class, it’s just built using primitive Deferreds).

For extra points, think about the concept of asynchronous data structures. I find that concept very attractive, and putting my finger on it helped me to see part of the reason why I think Deferreds are so great. (In fact, as you can see from the use of succeed, Deferreds are not specifically asynchronous or synchronous: they’re so simple they don’t even care, and as a consumer of Deferreds, your code doesn’t have to either). That’s all remarkably elegant.

I sometimes hope O’Reilly will publish a second volume of Beautiful Code and ask me to write a chapter. I know exactly what I’d write about :-)

AddThis Social Bookmark Button

bzr viz is so pretty

21:32 January 18th, 2010 by terry. Posted under programming, tech. Comments Off on bzr viz is so pretty

A visual summary of my coding work in the last week, creating branches, working on them, merging them back into the FluidDB trunk.

bzr viz

AddThis Social Bookmark Button

Twisted code for retrying function calls

10:53 November 12th, 2009 by terry. Posted under deferreds, python, twisted. 4 Comments »

These days I often find myself writing code to talk to services that are periodically briefly unavailable. An error of some kind occurs and the correct (and documented) action to take is just to retry the original call a little later. Examples include using Amazon’s S3 service and the Twitter API. In both of these services, transient failures happen fairly frequently.

So I wrote the Twisted class below to retry calls, and tried to make it fairly general. I’d be happy to hear comments on it, because it’s pretty simple and if it can be made bullet proof I imagine others will use it too.

In case you’re not familiar with Twisted and it’s not clear, the call retrying in the below is scheduled by the Twisted reactor. This all asynchronous event-based code that will not block (assuming the function you pass in also does not).

First off, here’s the class that handles the calling:

from twisted.internet import reactor, defer, task
from twisted.python import log, failure

class RetryingCall(object):
    """Calls a function repeatedly, passing it args and kw args. Failures
    are passed to a user-supplied failure testing function. If the failure
    is ignored, the function is called again after a delay whose duration
    is obtained from a user-supplied iterator. The start method (below)
    returns a deferred that fires with the eventual non-error result of
    calling the supplied function, or fires its errback if no successful
    result can be obtained before the delay backoff iterator raises
    StopIteration.
    """
    def __init__(self, f, *args, **kw):
        self._f = f
        self._args = args
        self._kw = kw
        
    def _err(self, fail):
        if self.failure is None:
            self.failure = fail
        try:
            fail = self._failureTester(fail)
        except:
            self._deferred.errback()
        else:
            if isinstance(fail, failure.Failure):
                self._deferred.errback(fail)
            else:
                log.msg('RetryingCall: Ignoring %r' % (fail,))
                self._call()

    def _call(self):
        try:
            delay = self._backoffIterator.next()
        except StopIteration:
            log.msg('StopIteration in RetryingCall: ran out of attempts.')
            self._deferred.errback(self.failure)
        else:
            d = task.deferLater(reactor, delay,
                                self._f, *self._args, **self._kw)
            d.addCallbacks(self._deferred.callback, self._err)

    def start(self, backoffIterator=None, failureTester=None):
        self._backoffIterator = iter(backoffIterator or simpleBackoffIterator())
        self._failureTester = failureTester or (lambda _: None)
        self._deferred = defer.Deferred()
        self.failure = None
        self._call()
        return self._deferred

You call the constructor with your function and the args it should be called with. Then you call start() to get back a deferred that will eventually fire with the result of the call, or an error. BTW, I called it “start” to mirror twisted.internet.task.LoopingCall.

There’s a helper function for producing successive inter-call delays:

from operator import mul
from functools import partial

def simpleBackoffIterator(maxResults=10, maxDelay=120.0, now=True,
                          initDelay=0.01, incFunc=None):
    assert maxResults > 0
    remaining = maxResults
    delay = initDelay
    incFunc = incFunc or partial(mul, 2.0)
    
    if now:
        yield 0.0
        remaining -= 1
        
    while remaining > 0:
        yield (delay if delay < maxDelay else maxDelay)
        delay = incFunc(delay)
        remaining -= 1

By default this will generate the sequence of inter-call delays 0.0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56 and it should be easy to see how you could write your own. Or you can just supply a list, etc. When the backoff iterator finishes, the RetryingCall class gives up on trying to get a non-error result from the function. In that case errback is called on the deferred that start() returns, with the failure from the first call.

You get to specify a function for testing failures. If it ever raises or returns a failure, the start() deferred's errback is called. The failure tester can just ignore whatever failures should be considered transient.

So, for example, if you were calling S3 and wanted to ignore 504 errors, you could supply a failureTester arg like this:

    from twisted.web import error, http

    def test(self, failure):
        failure.trap(error.Error)
        if int(failure.value.status) != http.GATEWAY_TIMEOUT:
            return failure

As another example, while using the Twitter API you might want to allow a range of HTTP errors and also exactly one 404 error, seeing as a 404 might be an error on the part of Twitter (I don't mean to suggest that actually happens). It's probably definitive - but, why not try it once again just to be more sure? So, pass RetryingCall a failureTester that's an instance of a class like this:

class TwitterFailureTester(object):
    okErrs = (http.INTERNAL_SERVER_ERROR,
              http.BAD_GATEWAY,
              http.SERVICE_UNAVAILABLE)

    def __init__(self):
        self.seen404 = False

    def __call__(self, failure):
        failure.trap(error.Error)
        status = int(failure.value.status)
        if status == http.NOT_FOUND:
            if self.seen404:
                return failure
            else:
                self.seen404 = True
        elif status not in self.okErrs:
            return failure

Changing existing code to use RetryingCall is pretty trivial. Take something like this

from twisted.web import client

def getUserByScreenname(screenname):
    d = client.getPage(
        'http://twitter.com/users/show.json?screen_name=glyf')
    return d

and change it to look like this:

def getUserByScreenname(screenname):
    r = RetryingCall(client.getPage,
        'http://twitter.com/users/show.json?screen_name=glyf')
    d = r.start(failureTester=TwitterFailureTester())
    return d

I wrote this about 10 days ago and posted it to the Twisted mailing list. No-one replied to say how horrible the code is or that it shoud be done another way, which is a pretty good sign. The above includes an improvement suggested by Tim Allen, and is slightly more useful than the code I posted originally (see the thread on the Twisted list for details).

All code above is available to you under CC0 1.0 Universal - Public Domain Dedication.

AddThis Social Bookmark Button

Fault-tolerant Python Twisted classes for getting all Twitter friends or followers

00:56 October 22nd, 2009 by terry. Posted under python, twisted, twitter. Comments Off on Fault-tolerant Python Twisted classes for getting all Twitter friends or followers

It’s been forever since I blogged here. I just wrote a little Python to grab all of a user’s friends or followers (or just their user ids). It uses Twisted, of course. There were two main reasons for doing this: 1) I want all friends/followers, not just the first bunch returned by the Twitter API, and 2) I wanted code that is fairly robust in the face of various 50x HTTP errors (I regularly experience INTERNAL_SERVER_ERROR, BAD_GATEWAY, and SERVICE_UNAVAILABLE).

If you want to use the code below and you’re not familiar with the Twitter API, consider whether you can use the FriendsIdFetcher and FollowersIdFetcher classes as they’ll do far fewer requests (you get 5000 results per API call, instead of 100). If you can live with user ids and do the occasional fetch of a full user, you’ll probably do far fewer API calls.

For the FriendsFetcher and FollowersFetcher classes, you get back a list of dictionaries, one per user. For FriendsIdFetcher and FollowersIdFetcher you get a list of Twitter user ids.

Of course there’s no documentation. Feel free to ask questions in the comments. Download the source.

import sys

from twisted.internet import defer
from twisted.web import client, error, http
    
if sys.hexversion >= 0x20600f0:
    import json
else:
    import simplejson as json

class _Fetcher(object):
    baseURL = 'http://twitter.com/'
    URITemplate = None # Override in subclass.
    dataKey = None # Override in subclass.
    maxErrs = 10
    okErrs = (http.INTERNAL_SERVER_ERROR,
              http.BAD_GATEWAY,
              http.SERVICE_UNAVAILABLE)
    
    def __init__(self, name):
        assert self.baseURL.endswith('/')
        self.results = []
        self.errCount = 0
        self.nextCursor = -1
        self.deferred = defer.Deferred()
        self.URL = self.baseURL + (self.URITemplate % { 'name' : name })

    def _fail(self, failure):
        failure.trap(error.Error)
        self.errCount += 1
        if (self.errCount < self.maxErrs and
            int(failure.value.status) in self.okErrs):
            self.fetch()
        else:
            self.deferred.errback(failure)
        
    def _parse(self, result):
        try:
            data = json.loads(result)
            self.nextCursor = data.get('next_cursor')
            self.results.extend(data[self.dataKey])
        except Exception:
            self.deferred.errback()
        else:
            self.fetch()
            
    def _deDup(self):
        raise NotImplementedError('Override _deDup in subclasses.')

    def fetch(self):
        if self.nextCursor:
            d = client.getPage(self.URL + '?cursor=%s' % self.nextCursor)
            d.addCallback(self._parse)
            d.addErrback(self._fail)
        else:
            self.deferred.callback(self._deDup())
        return self.deferred

class _FriendsOrFollowersFetcher(_Fetcher):
    dataKey = u'users'
    
    def _deDup(self):
        seen = set()
        result = []
        for userdict in self.results:
            uid = userdict['id']
            if uid not in seen:
                result.append(userdict)
                seen.add(uid)
        return result

class _IdFetcher(_Fetcher):
    dataKey = u'ids'
    
    def _deDup(self):
        # Keep the ids in the order we received them.
        seen = set()
        result = []
        for uid in self.results:
            if uid not in seen:
                result.append(uid)
                seen.add(uid)
        return result

class FriendsFetcher(_FriendsOrFollowersFetcher):
    URITemplate = 'statuses/friends/%(name)s.json'

class FollowersFetcher(_FriendsOrFollowersFetcher):
    URITemplate = 'statuses/followers/%(name)s.json'

class FriendsIdFetcher(_IdFetcher):
    URITemplate = 'friends/ids/%(name)s.json'

class FollowersIdFetcher(_IdFetcher):
    URITemplate = 'followers/ids/%(name)s.json'

Usage is dead simple:

fetcher = FriendsFetcher('terrycojones')
d = fetcher.fetch()
d.addCallback(....) # etc.

Enjoy.

AddThis Social Bookmark Button

Facebook release Tornado and it’s not based on Twisted?

14:25 September 12th, 2009 by terry. Posted under FluidDB, python, tech, twisted. 17 Comments »

Image: Jay Smith

Image: Jay Smith

To their great credit, Facebook have just open-sourced more of their core software. This time it’s Tornado, an asynchronous web server written in Python.

Surely that can only mean one thing: Tornado is based on Twisted. Right?

Incredibly, no. Words fail me on this one. I’ve spent some hours today trying to put my thoughts into order so I could put together a reasonably coherent blog post on the subject. But I’ve failed. So here are some unstructured thoughts.

First of all, I’m not meaning to bash Facebook on this. At Fluidinfo we use their Thrift code. We’ll almost certainly use Scribe for logging at some point, and we’re seriously considering using Cassandra. Esteve Fernandez has put a ton of work into txAMQP to glue together Thrift, Twisted, and AMQP, and in the process became a Thrift committer.

Second, you can understand—or make an educated guess at—what happened: the Facebook programmers, like programmers everywhere, were strongly tempted to just write their own code instead of dealing with someone else’s. It’s not just about learning curves and fixing deficiencies, there are also issues of speed of changes and of control. At Fluidinfo we suffered through six months of maintaining our own set of related patches to Thrift before the Twisted support Esteve wrote was finally merged to trunk. That was painful and the temptation to scratch our own itch, fork, and forget about the official Thrift project was high.

Plus, Twisted suffers from the fact that the documentation is not as good as it could be. I commented at length on this over three years ago. Please read the follow-up posts in that thread for an illustration (one of many) of the maturity of the people running Twisted. Also note that the documentation has improved greatly since then. Nevertheless, Twisted is a huge project, it has tons of parts, and it’s difficult to document and to wrap your head around no matter what.

So you can understand why Facebook might have decided not to use Twisted. In their words:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

I’m betting it’s that last part that’s the key to the decision not to use Twisted.

But seriously…… WTF?

Twisted is an amazing piece of work, written by some truly brilliant coders, with huge experience doing exactly what Facebook set out to reinvent.

This is where I’m at a loss for words. I think: “what an historic missed opportunity” and “reinventing the wheel, badly” and “no, no, no, this cannot be” and “this is just so short-sighted” and “a crying shame” and many things besides.

Sure, Twisted is difficult to grok. But that’s no excuse. It’s a seriously cool and powerful framework, it’s vastly more sophisticated and useful and extensible than what Facebook have cobbled together. Facebook could have worked to improve twisted.web (which everyone agrees has some shortcomings) which could have benefitted greatly from even a small fraction of the resources Facebook must have put into Tornado. The end result would have been much better. Or Facebook could have just ignored twisted.web and built directly on top of the Twisted core. That would have been great too.

Or Facebook could have had a team of people who knew how to do it better, and produced something better than Twisted. I guess that’s the real frustration here – they’ve put a ton of effort into building something much more limited in scope and vision, and even the piece that they did build looks like a total hack built to scratch their short term needs.

What’s the biggest change in software engineering over the last decade? Arguably it’s the rise of test-driven development. I’m not the only one who thinks so. Yet here we are in late 2009 and Facebook have released a major piece of code with no test suite. Amazing. OK, you could argue this is a minor thing, that it’s not core to Tornado. That argument has some weight, but it’s hard to think that this thing is not a hack.

If you decide to use an asynchronous web framework, do you expect to have to manually set your sockets to be non-blocking? Do you feel like catching EWOULDBLOCK and EAGAIN yourself? Those sorts of things, and their non-portability (even within the world of Linux) are precisely the kinds of things that lead people away from writing C and towards doing rapid development using something robust and portable that looks after the details. They’re precisely the things Twisted takes care of for you, and which (at least in Twisted) work across platforms, including Windows.

It looks like Tornado are using a global reactor, which the Twisted folks have learned the hard way is not the best solution.

Those are just some of the complaints I’ve heard and seen in the Tornado code. I confess I’ve looked only superficially at their code – but more than enough to feel such a sense of lost opportunity. They built a small subsection of Twisted, they’ve done it with much less experience and elegance and hiding of detail than the Twisted code, and the thing doesn’t even come with a test suite. Who knows if it actually works, or when, or where, etc.?

And…. Twisted is so much more. HTTP is just one of many protocols Twisted speaks, including (from their home page): “TCP, UDP, SSL/TLS, multicast, Unix sockets, a large number of protocols including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others”.

Want to build a sophisticated, powerful, and flexible asynchronous internet service in Python? Use Twisted.

A beautiful thing about Twisted is that it expands your mind. Its abstractions (particularly the clean separation and generality of transports, protocols, factories, services, and Deferreds—see here and here and here) makes you a better programmer. As I wrote to some friends in April 2006: “Reading the Twisted docs makes me feel like my brain is growing new muscles.”

Twisted’s deferreds are extraordinarily elegant and powerful, I’ve blogged and posted to the Twisted mailing list about them on multiple occasions. Learning to think in the Twisted way has been a programming joy to me over the last 3 years, so you can perhaps imagine my dismay that a company with the resources of Facebook couldn’t be bothered to get behind it and had to go reinvent the wheel, and do it so poorly. What a pity.

In my case, I threw away an entire year of C code in order to use Twisted in FluidDB. That was a decision I didn’t take lightly. I’d written my own libraries to do lots of low level network communications and RPC – including auto-generating server and client side glue libraries to serialize and unserialize RPC calls and args (a bit like Thrift), plus a server and tons of other stuff. I chucked it because it was too brittle. It was too much of a hack. It wasn’t portable enough. It was too get the details right. It wasn’t extensible.

In other words….. it was too much like Tornado! So I threw it all away in favor of Twisted. As I happily tell people, FluidDB is written in Python so it can use Twisted. It was a question of an amazingly great asynchronous networking framework determining the choice of programming language. And this was done in spite of the fact that I thought the Twisted docs sucked badly. The people behind Twisted were clearly brilliant and the community was great. There was an opportunity to make a bet on something and to contribute. I wish Facebook had made the same decision. It’s everyone’s loss that they did not. What a great pity.

AddThis Social Bookmark Button

Python code for retrieving all your tweets

21:44 June 24th, 2009 by terry. Posted under python, twitter. 13 Comments »

Here’s a little Python code to pull back all a user’s Twitter tweets. Make sure you read the notes at bottom in case you want to use it.

import sys, twitter, operator
from dateutil.parser import parse

twitterURL = 'http://twitter.com'

def fetch(user):
    data = {}
    api = twitter.Api()
    max_id = None
    total = 0
    while True:
        statuses = api.GetUserTimeline(user, count=200, max_id=max_id)
        newCount = ignCount = 0
        for s in statuses:
            if s.id in data:
                ignCount += 1
            else:
                data[s.id] = s
                newCount += 1
        total += newCount
        print >>sys.stderr, "Fetched %d/%d/%d new/old/total." % (
            newCount, ignCount, total)
        if newCount == 0:
            break
        max_id = min([s.id for s in statuses]) - 1
    return data.values()

def htmlPrint(user, tweets):
    for t in tweets:
        t.pdate = parse(t.created_at)
    key = operator.attrgetter('pdate')
    tweets = sorted(tweets, key=key)
    f = open('%s.html' % user, 'wb')
    print >>f, """Tweets for %s
    
    """ % user
    for i, t in enumerate(tweets):
        print >>f, '%d. %s %s
' % ( i, t.pdate.strftime('%Y-%m-%d %H:%M'), twitterURL, user, t.id, t.text.encode('utf8')) print >>f, '
' f.close() if __name__ == '__main__': user = 'terrycojones' if len(sys.argv) < 2 else sys.argv[1] data = fetch(user) htmlPrint(user, data)

Notes:

Fetch all of a user's tweets and write them to a file username.html (where username is given on the command line).

Output is to a file instead of to stdout as tweet texts are unicode and sys.stdout.encoding is ascii on my machine, which prevents printing non-ASCII chars.

This code uses the Python-Twitter library. You need to get (via SVN) the very latest version, and then you need to fix a tiny bug, described here. Or wait a while and the SVN trunk will be patched.

This worked flawlessly for my 2,300 tweets, but only retrieved about half the tweets of someone who had over 7,000. I'm not sure what happened there.

There are tons of things that could be done to make the output more attractive and useful. And yes, for nitpickers, the code has a couple of slight inefficiencies :-)

AddThis Social Bookmark Button

A mixin class allowing Python __init__ methods to work with Twisted deferreds

17:54 May 11th, 2009 by terry. Posted under deferreds, python, twisted. Comments Off on A mixin class allowing Python __init__ methods to work with Twisted deferreds

I posted to the Python Twisted list back in Nov 2008 with subject: A Python metaclass for Twisted allowing __init__ to return a Deferred

Briefly, I was trying to find a nice way to allow the __init__ method of a class to work with deferreds in such a way that methods of the class could use work done by __init__ safe in the knowledge that the deferreds had completed. E.g., if you have

class X(object):
    def __init__(self, host, port):
        def final(connection):
            self.db = connection
        d = makeDBConnection(host, port)
        d.addCallback(final)

    def query(self, q):
        return self.db.runQuery(q)

Then when you make an X and call query on it, there’s a chance the deferred wont have fired, and you’ll get an error. This is just a very simple illustrative example. There are many more, and this is a general problem of the synchronous world (in which __init__ is supposed to prepare a fully-fledged class instance and cannot return a deferred) meeting the asynchronous world in which, as Twisted programmers, we would like to (and must) use deferreds.

The earlier thread, with lots of useful followups can be read here. Although I learned a lot in that thread, I wasn’t completely happy with any of the solutions. Some of the things that still bugged me are in posts towards the end of the thread (here and here).

The various approaches we took back then all boiled down to waiting for a deferred to fire before the class instance was fully ready to use. When that happened, you had your instance and could call its methods.

I had also thought about an alternate approach: having __init__ add a callback to the deferreds it dealt with to set a flag in self and then have all dependent methods check that flag to see if the class instance was ready for use. But that 1) is ugly (too much extra code); 2) means the caller has to be prepared to deal with errors due to the class instance not being ready, and 3) adds a check to every method call. It would look something like this:

class X(object):
    def __init__(self, host, port):
        self.ready = False
        def final(connection):
            self.db = connection
            self.ready = True
        d = makeDBConnection(host, port)
        d.addCallback(final)

    def query(self, q):
        if not self.ready:
            raise IAmNotReadyException()
        return self.db.runQuery(q)

That was too ugly for my taste, for all of the above reasons, most especially for forcing the unfortunate caller of my code to handle IAmNotReadyException.

Anyway…. fast forward 6 months and I’ve hit the same problem again. It’s with existing code, in which I would like an __init__ to call something that (now, due to changes elsewhere) returns a deferred. So I started thinking again, and came up with a much cleaner way to do the alternate approach via a class mixin:

from twisted.internet import defer

class deferredInitMixin(object):
    def wrap(self, d, *wrappedMethods):
        self.waiting = []
        self.stored = {}

        def restore(_):
            for method in self.stored:
                setattr(self, method, self.stored[method])
            for d in self.waiting:
                d.callback(None)

        def makeWrapper(method):
            def wrapper(*args, **kw):
                d = defer.Deferred()
                d.addCallback(lambda _: self.stored[method](*args, **kw))
                self.waiting.append(d)
                return d
            return wrapper

        for method in wrappedMethods:
            self.stored[method] = getattr(self, method)
            setattr(self, method, makeWrapper(method))

        d.addCallback(restore)

You use it as in the class Test below:

from twisted.internet import defer, reactor

def fire(d, value):
    print "I finally fired, with value", value
    d.callback(value)

def late(value):
    d = defer.Deferred()
    reactor.callLater(1, fire, d, value)
    return d

def called(result, what):
    print 'final callback of %s, result = %s' % (what, result)

def stop(_):
    reactor.stop()


class Test(deferredInitMixin):
    def __init__(self):
        d = late('Test')
        deferredInitMixin.wrap(self, d, 'f1', 'f2')

    def f1(self, arg):
        print "f1 called with", arg
        return late(arg)

    def f2(self, arg):
        print "f2 called with", arg
        return late(arg)


if __name__ == '__main__':
    t = Test()
    d1 = t.f1(44)
    d1.addCallback(called, 'f1')
    d2 = t.f1(33)
    d2.addCallback(called, 'f1')
    d3 = t.f2(11)
    d3.addCallback(called, 'f2')
    d = defer.DeferredList([d1, d2, d3])
    d.addBoth(stop)
    reactor.run()

Effectively, the __init__ of my Test class asks deferredInitMixin to temporarily wrap some of its methods. deferredInitMixin stores the original methods away and replaces each of them with a function that immediately returns a deferred. So after __init__ finishes, code that calls the now-wrapped methods of the class instance before the deferred has fired will get a deferred back as usual (but see * below). As far as they know, everything is normal. Behind the scenes, deferredInitMixin has arranged for these deferreds to fire only after the deferred passed from __init__ has fired. Once that happens, deferredInitMixin also restores the original functions to the instance. As a result there is no overhead later to check a flag to see if the instance is ready to use. If the deferred from __init__ happens to fire before any of the instance’s methods are called, it will simply restore the original methods. Finally (obviously?) you only pass the method names to deferredInitMixin that depend on the deferred in __init__ being done.

BTW, calling the methods passed to deferredInitMixin “wrapped” isn’t really accurate. They’re just temporarily replaced.

I quite like this approach. It’s a second example of something I posted about here, in which a pool of deferreds is accumulated and all fired when another deferred fires. It’s nice because you don’t reply with an error and there’s no need for locking or other form of coordination – the work you need done is already in progress, so you get back a fresh deferred and everything goes swimmingly.

* Minor note: the methods you wrap should probably be ones that already return deferreds. That way you always get a deferred back from them, whether they’re temporarily wrapped or not. The above mixin works just fine if you ask it to wrap non-deferred-returning functions, but you have to deal with the possibility that they will return a deferred (i.e., if you call them while they’re wrapped).

AddThis Social Bookmark Button

Coding

07:04 April 7th, 2009 by terry. Posted under me, programming. Comments Off on Coding

Coding is what you do when you tire of dealing with other humans. You get to interact with a literal-minded idiot who slavishly follows your every wish. Unfortunately you have to specify *everything* down to the very last detail in order to get anything done. But at least it does what it’s told, which is comforting in a world where everyone selfishly insists on doing what *they* want. You should try it sometime, it’s very relaxing, and nice to be totally in control.

AddThis Social Bookmark Button

bzr – not your grandfather’s VCS

20:43 March 26th, 2009 by terry. Posted under programming, tech. Comments Off on bzr – not your grandfather’s VCS

bzr viz

AddThis Social Bookmark Button

Who signed up for Twitter immediately before/after you?

03:52 January 14th, 2009 by terry. Posted under companies, python, twitter. 2 Comments »

This is just a quick hack, done in about 20 minutes in 32 lines of Python. The following script will print out the Twitter screen names of the people who signed up immediately before and after a given user.

import sys
from twitter import Api
from operator import add
from functools import partial

inc = partial(add, 1)
dec = partial(add, -1)
api = Api()

def getUser(u):
    try:
        return api.GetUser(u)
    except Exception:
        return None

def do(name):
    user = getUser(name)
    if user:
        for f, what in (dec, 'Before:'), (inc, 'After:'):
            i = user.id
            while True:
                i = f(i)
                u = getUser(i)
                if u:
                    print what, u.screen_name
                    break
    else:
        print 'Could not find user %r' % name

if __name__ == '__main__':
    for name in sys.argv[1:]: 
        do(name)

I’m happy to have reached the point in my Python development where I can pretty much just type something like this in without really having to think, including the use of operator.add and functools.partial.

BTW, the users who signed up immediately before and after I did were skywalker and kitu012.

The above is just a hack. Notes:

  1. If it can’t retrieve a user for any reason, it just assumes there is no such user.
  2. Twitter periodically deletes accounts of abusers, so the answer will skip those.
  3. Twitter had lots of early hiccups, so there may be no guarantee that user ids were actually assigned sequentially.
  4. This script may run forever.
  5. I’m using the Python Twitter library written by DeWitt Clinton. It’s been a while since it was updated, and it doesn’t give you back the time a user was created in Twitter. It would be fun to print that too.

As you were.

AddThis Social Bookmark Button

A kinder and more consistent defer.inlineCallbacks

19:20 November 21st, 2008 by terry. Posted under deferreds, python, tech, twisted. Comments Off on A kinder and more consistent defer.inlineCallbacks

Here’s a suggestion for making Twisted‘s inlineCallbacks function decorator more consistent and less confusing. Let’s suppose you’re writing something like this:

    @inlineCallbacks
    def func():
        # Do something.

    result = func()

There are 2 things that could be better, IMO:

1. func may not yield. In that case, you get an AttributeError when inlineCallbacks tries to send() to something that’s not a generator. Or worse, the call to send might actually work, and do who knows what. I.e., func() could return an object with a send method but which is not a generator. For some fun, run some code that calls the following decorated function (see if you can figure out what will happen before you do):

    @defer.inlineCallbacks
    def f():
        class yes():
            def send(x, y):
                print 'yes'
                # accidentally_destroy_the_universe_too()
        return yes()

2. func might raise before it get to its first yield. In that case you’ll get an exception thrown when the inlineCallbacks decorator tries to create the wrapper function:

    File "/usr/lib/python2.5/site-packages/twisted/internet/defer.py", line 813, in unwindGenerator
      return _inlineCallbacks(None, f(*args, **kwargs), Deferred())

There’s a simple and consistent way to handle both of these. Just have inlineCallbacks do some initial work based on what it has been passed:

    def altInlineCallbacks(f):
        def unwindGenerator(*args, **kwargs):
            deferred = defer.Deferred()
            try:
                result = f(*args, **kwargs)
            except Exception, e:
                deferred.errback(e)
                return deferred
            if isinstance(result, types.GeneratorType):
                return defer._inlineCallbacks(None, result, deferred)
            deferred.callback(result)
            return deferred

        return mergeFunctionMetadata(f, unwindGenerator)

This has the advantage that (barring e.g., a KeyboardInterrupt in the middle of things) you’ll *always* get a deferred back when you call an inlineCallbacks decorated function. That deferred might have already called or erred back (corresponding to cases 1 and 2 above).

I’m going to use this version of inlineCallbacks in my code. There’s a case for it making it into Twisted itself: inlinecallbacks is already cryptic enough in its operation that anything we can do to make its operation more uniform and less surprising, the better.

You might think that case 1 rarely comes up. But I’ve hit it a few times, usually when commenting out sections of code for testing. If you accidentally comment out the last yield in func, it no longer returns a generator and that causes a different error.

And case 2 happens to me too. Having inlinecallbacks try/except the call to func is nicer because it means I don’t have to be quite so defensive in coding. So instead of me having to write

    try:
        d = func()
    except Exception:
        # Do something.

and try to figure out what happened if an exception fired, I can just write d = func() and add errbacks as I please (they then have to figure out what happened). The (slight?) disadvantage to my suggestion is that with the above try/except fragment you can tell if the call to func() raised before ever yielding. You can detect that, if you need to, with my approach if you’re not offended by looking at d.called immediately after calling func.

The alternate approach also helps if you’re a novice, or simply being lazy/careless/forgetful, and writing:

    d = func()
    d.addCallback(ok)
    d.addErrback(not_ok)

thinking you have your ass covered, but you actually don’t (due to case 2).

There’s some test code here that illustrates all this.

AddThis Social Bookmark Button

bzr viz is so nice

00:07 November 20th, 2008 by terry. Posted under python, tech. 21 Comments »

A year ago we switched from SVN to Bazaar for source code control. I started using source code control in 1989 with RCS after comparing it with SCCS. Then I duly moved to CVS and on to SVN. In retrospect, they all sucked pretty badly but each in turn was a big improvement and seemed great at the time.

The topic of source code control is a very complex one. There’s tons of debate online about the advantages of various packages. I don’t want to get into details, plus there are details that I don’t fully appreciate anyway. It really is complex – at least if you want to do anything even a little bit sophisticated, e.g., with multiple users working on multiple branches.

Anyway, we wanted to move away from SVN, which is cumbersome, too manual and heavyweight (at least in its handling of branches), and requires you to talk to a centralized server all the time. Plus it has no handling of directories or symbolic links, and you lose history in merging. There are other problems and annoyances too.

A distributed version control system seemed like the way to go.

We looked closely at Bazaar and Mercurial. I was prejudiced towards Mercurial. I liked its name, I liked the coolness of the Qwerty-symmetric hg command, and above all I liked how lightweight and simple it is. We took a quick look at Git, but it looked like a bit of a hodge-podge and we’re Python fanboys, so we fairly quickly decided against it.

From what I’ve read, all of Bazaar, Mercurial and Git are excellent. It’s clear that they leave SVN for dead. When I run across open source projects, especially new projects, that are still using SVN I silently raise an inner eyebrow.

But like I said, I don’t want to get into details. What I do want to do is say that I really like a plugin for Bazaar called viz (aka vizualize). It’s in bzr-gtk in case you use apt-get.

You just type bzr viz and it pops a glorious window with a visualization of your branching and merging history. The image above is just a fragment of the full window. The most recent activity is at the top, so as you look down the page you’re looking at older and older branches and merges. On the left you see the branch numbers. The vertical lines are the branches, the left-most being the trunk (in this case). You can see that the 2 right-most branches have no activity in the fragment shown.

If you want to take a look at more of the window, showing a different part of the tree, click on the following image.

Not only does Bazaar make branching really lightweight, it takes all the uncertainty out of the process (ever try merging branches in SVN, reading the log file to make sure you’ve got the right revision numbers before entering the extremely long command?). Plus you get full history when merging (and this is nicely displayed in the output of bzr log) and with a tool like bzr viz you can just see the history. Our tree has some much more complex sections, including one where Esteve had 25 branches going at the same time! And yes, they all got merged to trunk. Bazaar makes branching and merging so simple you just start to do it all the time, and it becomes very natural. Then you just merge whatever you like into whatever you like and gradually merge your way back into the trunk (after merging the trunk into your branch first to have a look at things). It’s great.

That’s it. No time for blogging. I’m waiting for someone to upload a patch so I can continue working. Meanwhile, lightweight distributed version control has really changed how we work. It’s much much better. If you’re still using SVN and haven’t checked out Bazaar or Mercurial (and there are several others), you really should.

AddThis Social Bookmark Button

A Python metaclass for Twisted allowing __init__ to return a Deferred

01:33 November 3rd, 2008 by terry. Posted under deferreds, python, tech, twisted. 4 Comments »

OK, I admit, this is geeky.

But we’ve all run into the situation in which you’re using Python and Twisted, and you’re writing a new class and you want to call something from the __init method that returns a Deferred. This is a problem. The __init method is not allowed to return a value, let alone a Deferred. While you could just call the Deferred-returning function from inside your __init, there’s no guarantee of when that Deferred will fire. Seeing as you’re in your __init method, it’s a good bet that you need that function to have done its thing before you let anyone get their hands on an instance of your class.

For example, consider a class that provides access to a database table. You want the __init__ method to create the table in the db if it doesn’t already exist. But if you’re using Twisted’s twisted.enterprise.adbapi class, the runInteraction method returns a Deferred. You can call it to create the tables, but you don’t want the instance of your class back in the hands of whoever’s creating it until the table is created. Otherwise they might call a method on the instance that expects the table to be there.

A cumbersome solution would be to add a callback to the Deferred you get back from runInteraction and have that callback add an attribute to self to indicate that it is safe to proceed. Then all your class methods that access the db table would have to check to see if the attribute was on self, and take some alternate action if not. That’s going to get ugly very fast plus, your caller has to deal with you potentially not being ready.

I ran into this problem a couple of days ago and after scratching my head for a while I came up with an idea for how to solve this pretty cleanly via a Python metaclass. Here’s the metaclass code:

from twisted.internet import defer

class TxDeferredInitMeta(type):
    def __new__(mcl, classname, bases, classdict):
        hidden = '__hidden__'
        instantiate = '__instantiate__'
        for name in hidden, instantiate:
            if name in classdict:
                raise Exception(
                    'Class %s contains an illegally-named %s method' %
                    (classname, name))
        try:
            origInit = classdict['__init__']
        except KeyError:
            origInit = lambda self: None
        def newInit(self, *args, **kw):
            hiddenDict = dict(args=args, kw=kw, __init__=origInit)
            setattr(self, hidden, hiddenDict)
        def _instantiate(self):
            def addSelf(result):
                return (self, result)
            hiddenDict = getattr(self, hidden)
            d = defer.maybeDeferred(hiddenDict['__init__'], self,
                                    *hiddenDict['args'], **hiddenDict['kw'])
            return d.addCallback(addSelf)
        classdict['__init__'] = newInit
        classdict[instantiate] = _instantiate
        return super(TxDeferredInitMeta, mcl).__new__(
            mcl, classname, bases, classdict)

I’m not going to explain what it does here. If it’s not clear and you want to know, send me mail or post a comment. But I’ll show you how you use it in practice. It’s kind of weird, but it makes sense once you get used to it.

First, we make a class whose metaclass is TxDeferredInitMeta and whose __init__ method returns a deferred:

class MyClass(object):
    __metaclass__ = TxDeferredInitMeta
    def __init__(self):
        d = aFuncReturningADeferred()
        return d

Having __init__ return anything other than None is illegal in normal Python classes. But this is not a normal Python class, as you will now see.

Given our class, we use it like this:

def cb((instance, result)):
    # instance is an instance of MyClass
    # result is from the callback chain of aFuncReturningADeferred
    pass

d = MyClass()
d.__instantiate__()
d.addCallback(cb)

That may look pretty funky, but if you’re used to Twisted it wont seem too bizarre. What’s happening is that when you ask to make an instance of MyClass, you get back an instance of a regular Python class. It has a method called __instantiate__ that returns a Deferred. You add a callback to that Deferred and that callback is eventually passed two things. The first is an instance of MyClass, as you requested. The second is the result that came down the callback chain from the Deferred that was returned by the __init__ method you wrote in MyClass.

The net result is that you have the value of the Deferred and you have your instance of MyClass. It’s safe to go ahead and use the instance because you know the Deferred has been called. It will probably seem a bit odd to get your instance later as a result of a Deferred firing, but that’s perfectly in keeping with the Twisted way.

That’s it for now. You can grab the code and a trial test suite to put it through its paces at http://foss.fluidinfo.com/txDeferredInitMeta.zip. The code could be cleaned up somewhat, and made more general. There is a caveat to using it – your class can’t have __hidden__ or __instantiate__ methods. That could be improved. But I’m not going to bother for now, unless someone cares.

AddThis Social Bookmark Button

Digging into Twitter following

18:56 October 13th, 2008 by terry. Posted under companies, python, twitter. 10 Comments »

TwitterThis is just a quick post. I have a ton of things I could say about this, but they’ll have to wait – I need to do some real work.

Last night and today I wrote some Python code to dig into the follower and following sets of Twitter users.

I also think I understand better why Twitter is so compelling, but that’s going to have to wait for now too.

You give my program some Twitter user names and it builds you a table showing numbers of followers, following etc. for each user. It distinguishes between people you follow and who don’t follow you, and people who follow you but whom you don’t follow back.

But the really interesting thing is to look at the intersection of some of these sets between users.

For example, if I follow X and they don’t follow me back, we can assume I have some interest in X. So if am later followed by Y and it turns out that X follows Y, I might be interested to know that. I might want to follow Y back just because I know it might bring me to the attention of X, who may then follow me. If I follow Y, I might want to publicly @ message him/her, hoping that he/she might @ message me back, and that X may see it and follow me.

Stuff like that. If you think that sort of thing isn’t important, or is too detailed or introspective, I’ll warrant you don’t know much about primate social studies. But more on that in another posting too.

As another example use, I plan to forward the mails Twitter sends me telling me someone new is following me into a variant of my program. It can examine the sets of interest and weight them. That can give me an automated recommendation of whether I should follow that person back – or just do the following for me.

There are lots of directions you could push this in, like considering who the person had @ talked to (and whether those people were followers or not) and the content of their Tweets (e.g., do they talk about things I’m interested or not interested in?).

Lots.

For now, here are links to a few sample runs. Apologies to the Twitter users I’ve picked on – you guys were on my screen or on my mind (following FOWA).

I’d love to turn these into nice Euler Diagrams but I didn’t find any decent open source package to produce them.

I’m also hoping someone else (or other people) will pick this up and run with it. I’ve got no time for it! I’m happy to send the source code to anyone who wants it. Just follow me on Twitter and ask for it.

Example 1: littleidea compared to sarawinge.
Example 2: swardley compared to voidspace.
Example 3: aweissman compared to johnborthwick.

And finally here’s the result for deWitt, on whose Twitter Python library I based my own code. This is the output you get from the program when you only give it one user to examine.

More soon, I guess.

AddThis Social Bookmark Button

How many users does Twitter have?

00:55 October 13th, 2008 by terry. Posted under companies, python, twitter. 1 Comment »

Inclusion/Exclusion

Here’s a short summary of a failed experiment using the Principle of Inclusion/Exclusion to estimate how many users Twitter has. I.e., there’s no answer below, just the outline of some quick coding.

I was wondering about this over cereal this morning. I know some folks at Twitter, and I know some folks who have access to the full tweet database, so I could perhaps get that answer just by asking. But that wouldn’t be any fun, and I probably couldn’t blog about it.

I was at FOWA last week and it seemed that absolutely everyone was on Twitter. Plus, they were active users, not people who’d created an account and didn’t use it. If Twitter’s usage pattern looks anything like a Power Law as we might expect, there will be many, many inactive or dormant accounts for every one that’s moderately active.

BTW, I’m terrycojones on Twitter. Follow me please, I’m trying to catch Jason Calacanis.

You could have a crack at answering the question by looking at Twitter user id numbers via the API and trying to estimate how many users there are. I did play with that at one point at least with tweet ids, but although they increase there are large holes in the tweet id space. And approaches like that have to go through the Twitter API, which limits you to a mere 70 requests per hour – not enough for any serious (and quick) probing.

In any case, I was looking at the Twitter Find People page. Go to the Search tab and you can search for users.

I searched for the single letter A, and got around 109K hits. That lead me to think that I could get a bound on Twitter’s size using the Principle of Inclusion/Exclusion (PIE). (If you don’t know what that is, don’t be intimidated by the math – it’s actually very simple, just consider the cases of counting the size of the union of 2 and 3 sets). The PIE is a beautiful and extremely useful tool in combinatorics and probability theory (some nice examples can be found in Chapter 3 of the introductory text Applied Combinatorics With Problem Solving). The image above comes from the Wikipedia page.

To get an idea of how many Twitter users there are, we can add the number of people with an A in their name to the number with a B in their name, …., to the number with a Z in their name.

That will give us an over-estimate though, as names typically have many letters in them. So we’ll be counting users multiple times in this simplistic sum. That’s where the PIE comes in. The basic idea is that you add the size of a bunch of sets, and then you subtract off the sizes of all the pairwise intersections. Then you add on the sizes of all the triple set intersections, and so on. If you keep going, you get the answer exactly. If you stop along the way you’ll have an upper or lower bound.

So I figured I could add the size of all the single-letter searches and then adjust that downwards using some simple estimates of letter co-occurrence.

That would definitely work.

But then the theory ran full into the reality of Twitter.

To begin with, Twitter gives zero results if you search for S or T. I have no idea why. It gives a result for all other (English) letters. My only theory was that Twitter had anticipated my effort and the missing S and T results were their way of saying Stop That!

Anyway, I put the values for the 24 letters that do work into a Python program and summed them:

count = dict(a = 108938,
             b =  12636,
             c =  13165,
             d =  21516,
             e =  14070,
             f =   5294,
             g =   8425,
             h =   7108,
             i = 160592,
             j =   9226,
             k =  12524,
             l =   8112,
             m =  51721,
             n =  11019,
             o =   9840,
             p =   8139,
             q =   1938,
             r =  10993,
             s =      0,
             t =      0,
             u =   8997,
             v =   4342,
             w =   6834,
             x =   8829,
             y =   8428,
             z =   3245)

upperBoundOnUsers = sum(count.values())
print 'Upper bound on number of users:', upperBoundOnUsers

The total was 515,931.

Remember that that’s a big over-estimate due to duplicate counting.

And unless I really do live in a tech bubble, I think that number is way too small – even without adjusting it using the PIE.

(If we were going to adjust it, we could try to estimate how often pairs of letters co-occur in Twitter user names. That would be difficult as user names are not like normal words. But we could try.)

Looking at the letter frequencies, I found them really strange. I wrote a tiny bit more code, using the English letter frequencies as given on Wikipedia to estimate how many hits I’d have gotten back on a normal set of words. If we assume Twitter user names have an average length of 7, we can print the expected numbers versus the actual numbers like this:

# From http://en.wikipedia.org/wiki/Letter_frequencies
freq = dict(a = 0.08167,
            b = 0.01492,
            c = 0.02782,
            d = 0.04253,
            e = 0.12702,
            f = 0.02228,
            g = 0.02015,
            h = 0.06094,
            i = 0.06966,
            j = 0.00153,
            k = 0.00772,
            l = 0.04025,
            m = 0.02406,
            n = 0.06749,
            o = 0.07507,
            p = 0.01929,
            q = 0.00095,
            r = 0.05987,
            s = 0.06327,
            t = 0.09056,
            u = 0.02758,
            v = 0.00978,
            w = 0.02360,
            x = 0.00150,
            y = 0.01974,
            z = 0.00074)

estimatedUserNameLen = 7

for L in sorted(count.keys()):
    probNotLetter = 1.0 - freq[L]
    probOneOrMore = 1.0 - probNotLetter ** estimatedUserNameLen
    expected = int(upperBoundOnUsers * probOneOrMore)
    print "%s: expected %6d, saw %6d." % (L, expected, count[L])

Which results in:

a: expected 231757, saw 108938.
b: expected  51531, saw  12636.
c: expected  92465, saw  13165.
d: expected 135331, saw  21516.
e: expected 316578, saw  14070.
f: expected  75281, saw   5294.
g: expected  68517, saw   8425.
h: expected 183696, saw   7108.
i: expected 204699, saw 160592.
j: expected   5500, saw   9226.
k: expected  27243, saw  12524.
l: expected 128942, saw   8112.
m: expected  80866, saw  51721.
n: expected 199582, saw  11019.
o: expected 217149, saw   9840.
p: expected  65761, saw   8139.
q: expected   3421, saw   1938.
r: expected 181037, saw  10993.
s: expected 189423, saw      0.
t: expected 250464, saw      0.
u: expected  91732, saw   8997.
v: expected  34301, saw   4342.
w: expected  79429, saw   6834.
x: expected   5392, saw   8829.
y: expected  67205, saw   8428.
z: expected   2666, saw   3245.

You can see there are wild differences here.

While it’s clearly not right to be multiplying the probability of one or more of each letter appearing in a name by the 515,931 figure (because that’s a major over-estimate), you might hope that the results would be more consistent and tell you how much of an over-estimate it was. But the results are all over the place.

I briefly considered writing some code to scrape the search results and calculate the co-occurrence frequencies (and the actual set of letters in user names). Then I noticed that the results don’t always add up. E.g., search for C and you’re told there are 13,190 results. But the results come 19 at a time and there are 660 pages of results (and 19 * 660 = 12,540, which is not 13,190).

At that point I decided not to trust Twitter’s results and to call it quits.

A promising direction (and blog post) had fizzled out. I was reminded of trying to use AltaVista to compute co-citation distances between web pages back in 1996. AltaVista was highly variable in its search results, which made it hard to do mathematics.

I’m blogging this as a way to stop thinking about this question and to see if someone else wants to push on it, or email me the answer. Doing the above only took about 10-15 mins. Blogging it took at least a couple of hours :-(

Finally, in case it’s not clear there are lots of assumptions in what I did. Some of them:

  • We’re not considering non-English letters (or things like underscores, which are common) in user names.
  • The mean length of Twitter user names is probably not 7.
  • Twitter search returns user names that don’t contain the searched-for letter (instead, the letter appears in the user’s name, not the username).
AddThis Social Bookmark Button

Minor mischief: create redirect loops from predictable short URLs

16:14 July 1st, 2008 by terry. Posted under other, python, tech. 2 Comments »

redirect loopI was checking out the new bit.ly URL shortening service from Betaworks.

I started wondering how random the URLs from these URL-shortening services could be. I wrote a tiny script the other day to turn URLs given on the command line into short URLs via is.gd:

import urllib, sys
for arg in sys.argv[1:]:
    print urllib.urlopen(
        'http://is.gd/api.php?longurl=' + arg).read()

I ran it a couple of times to see what URLs it generated. Note that you have to use a new URL each time, as it’s smart enough not to give out a new short URL for one it has seen before. I got the sequence http://is.gd/JzB, http://is.gd/JzC, http://is.gd/JzD, http://is.gd/JzE,…

That’s an invitation to some minor mischief, because you can guess the next URL in the is.gd sequence before it’s actually assigned to redirect somewhere.

We can ask bit.ly for a short URL that redirects to our predicted next is.gd URL. Then we ask is.gd for a short URL that redirects to the URL that bit.ly gives us. If we do this fast enough, is.gd will not yet have assigned the predicted next URL and we’ll get it. So the bit.ly URL will end up redirecting to the is.gd URL and vice versa. In ugly Python (and with a bug/shortcoming in the nextIsgd function):

import urllib, random

def bitly(url):
    return urllib.urlopen(
        'http://bit.ly/api?url=' + url).read()

def isgd(url):
    return urllib.urlopen(
        'http://is.gd/api.php?longurl=' + url).read()

def nextIsgd(url):
    last = url[-1]
    if last == 'z':
        next = 'A'
    else:
        next = chr(ord(last) + 1)
    return url[:-1] + next

def randomURI():
    return 'http://www.a%s.com' % \
           ''.join(map(str, random.sample(xrange(100000), 3)))

isgdURL = isgd(randomURI())
print 'Last is.gd URL:', isgdURL

nextIsgdURL = nextIsgd(isgdURL)
print 'Next is.gd URL will be:', nextIsgdURL

# Ask bit.ly for a URL that redirects to nextIsgdURL
bitlyURL = bitly(nextIsgdURL)
print 'Step 1: bit.ly now redirects %s to %s' % (
    bitlyURL, nextIsgdURL)

# Ask is.gd for a URL that redirects to that bit.ly url
isgdURL2 = isgd(bitlyURL)
print 'Step 2: is.gd now redirects %s to %s' % (
    isgdURL2, bitlyURL)

if nextIsgdURL == isgdURL2:
    print 'Success'
else:
    print 'Epic FAIL'

This worked first time, giving:

Step 1: bit.ly now redirects http://bit.ly/fkuL8 to http://is.gd/JA9
Step 2: is.gd now redirects http://is.gd/JA9 to http://bit.ly/fkuL8

In general it’s not a good idea to use predictable numbers like this, which hardly bears saying as just about every responsible programmer knows that already.

is.gd wont shorten a tinyurl.com link, as tinyurl is on their blacklist. So they obviously know what they’re doing. The bit.ly service is brand new and presumably not on the is.gd radar yet.

And finally, what happens when you visit one of the deadly looping redirect URLs in your browser? You’d hope that after all these years the browser would detect the redirect loop and break it at some point. And that’s what happened with Firefox 3, producing the image above.

If you want to give it a try, http://bit.ly/fkuL8 and http://is.gd/JA9 point to each other. Do I need to add that I’m not responsible if your browser explodes in your face?

AddThis Social Bookmark Button

Embracing Encapsulation

16:09 June 18th, 2008 by terry. Posted under me, python, tech. 43 Comments »

Encapsulated[This is a bit rambling / repetitive, sorry. I don’t have time to make it shorter, etc.]

Last year at FOWA I had a discussion with Paul Graham about programming and programmers in which we disagreed over the importance of knowing the fundamentals.

By this I mean the importance of knowing things down to the nuts and bolts level, to really understand what’s going on at the lower levels when you’re writing code. I used to think that sort of thing mattered a lot, but now I think it rarely does.

I well remember learning to program in AWK and being acutely aware of how resource intensive “associative arrays” (as we quaintly called them in those days) were, and knowing full well what was going on behind the scenes. I wrote a full Pascal compiler (no lex, no yacc) in the mid-80’s with Keith Rowe. If you haven’t done that, you really can’t appreciate the amount of computation that goes on when you compile a program to an executable. It’s astonishing. I did lots of assembly language programming, starting from age 15 or so, and spent years squeezing code into embedded environments, where a client might call to ask if you couldn’t come up with a way to reduce your executable code by 2 bytes so it would fit in their device.

But you know what? None of those skills really matter any more. Or they matter only very rarely.

The reason is that best practices have been worked out and incorporated into low-level libraries, and for the most part you don’t need to have any awareness at all of how those levels work. In fact it can be detrimental to you to spend years learning all those details if you could instead be learning how to build great things using the low-level libraries as black-box tools.

That’s the way the world moves in general. Successive generations get the accumulated wisdom of earlier generations packaged up for them. We used log tables, slide rules, and our heads, while our kids use calculators with hundreds of built-in functions. We learned to read analog 12-hour clocks, our kids learn to read digital clocks (so much easier!) and may not be able to read an analog clock until later. And it doesn’t matter. We buy a CD player (remember them?) or an iPod, and when it breaks you don’t even consider getting it “fixed” (remember that?). You just go out and buy another one. That’s because it’s cheaper and much faster and easier to just get a new one that has been put together by a machine than it is to have an actual human try to open the thing and figure out how to repair it. You can’t even (easily) open an iPod. And so the people who know how to do these things dwindle in number until there are none left. Like watch makers or the specialist knife sharpeners we have in Barcelona who ride around on motorcycles with their distinctive whistles, calling to people to bring down their blunt knives. And it doesn’t matter, at least from a technical point of view. Their brilliance and knowledge and hard-won experience has been encapsulated and put into machines and higher-level tools, or simply baked into society in smaller, more accurate and easier to digest forms. In computers it goes down into libraries and compilers and hardware. There’s simply no need for anyone to know how, learn how, or to bother, to do those sorts of things any more.

Note that I’m not saying it’s not nice to have your watch repaired by someone with a jeweler’s eyepiece or your knife or scissors sharpened in the street. I’m just noting the general progression by which knowledge inevitably becomes encapsulated.

In my discussion with Paul Graham, he argued that it was still important for tech founders to be great programmers at a low level. I argued that that’s not right. Sure, people like that are good to have around, but I don’t think you need to be that way and as I said I think it can even be detrimental because all that knowledge comes at a price (other knowledge, other experience).

I work with a young guy called Esteve (Hi Esteve!). He’s great at many levels, including the lower ones. He’s also a product of a new generation of programmers. They’re people who grew up only knowing object-oriented programming, only really writing in very high-level languages (not you Esteve! I mean that just in general), who think in those terms, and who instead of spending many years working with nuts and bolts spent the years working with newer high-level tools.

I think people like Esteve have a triple advantage over us dinosaurs. 1) They tend to use more powerful tools; 2) Because they use better tools, they are more comfortable and think more naturally in the terms of the higher-level abstractions their tools present them; and 3) they also have more experience putting those tools and methods to good use.

The experience gap widens at double speed, just as when a single voter changes side; the gap between the two parties increases by two votes. Even when the dinosaur modernizes itself and learns a few new tricks, you’re still way behind because the 25 year-old you’re working with (again, excluding Esteve) has never had to work at the nuts and bolts level. They think with the new paradigms and can put more general and more powerful tools directly into action. They don’t have to think about protocols or timeouts or dynamically resizing buffers or partial reads or memory management or data structures or error propogation. They simply think “Computer, fetch me the contents of that web page!” And most of the time it all just works. When it doesn’t, you can call in a gray-haired repair person or, more likely, just throw the busted tool away and buy another (or just get it free, in the case of Open Source software).

That’s real progress, and to insist that we should make the young suffer through all the stuff we had to learn in order to build all the libraries and compilers etc., that are now available to us all is just wrong. It’s wrong because it goes against the flow of history, because it’s counter-productive, and because it smacks of “I had to suffer through this stuff, walk barefoot to school in the snow, and therefore you must too.”

Some of the above will probably sound a bit abstract, but to me it’s not. I think it’s important to realize and accept. The fact that your kid can’t tie their shoelaces because they have velcro and have never owned a shoe with a lace is probably a good thing. You don’t know how to hunt your own food or start a fire, and it just doesn’t matter. The same goes for programming. The collective brilliance of generations of programmers is now built in to languages like Java, Python and Ruby, and into operating systems, graphics libraries, etc. etc., and it really doesn’t matter a damn if young people who are using those tools don’t have a clue what’s going on at the lower levels (as I said above, that’s probably a good thing). One day very few people will. The knowledge wont be lost. It’s just encapsulated into more modern environments and tools.

I’m writing all this down because I’ve been thinking about it on and off since FOWA, but also because of what I’m working on right now. I’m trying to modify 12K lines of synchronous Python code to use Twisted (an extraordinarily good set of asynchronous networking libraries written by a set of extraordinarily young and gifted programmers). The work is a bit awkward and three times I’ve not known how best to proceed in terms of design. Each time, Esteve has taken a look at the problem and quickly suggested a fairly clean way to tackle it. Desperate to cook up a way to think that he might not be that much smarter than I am, I’m forced into a corner in which I conclude that he has spent more time working with new tools (patterns, OO, a nice language like Python). So he looks at the world in a different way and naturally says “oh, you just do that”. Then I go do the routine work of making his ideas work – which is great by me, I get to learn in the best way, by doing. How nice to hire people who are better than you are.

That’s it. Encapsulation is inevitable. So you either have to embrace it or become a hand-wringing dinosaur moaning about the kids of today and how they no longer know the fundamentals. It’s not as though any of us could survive if we suddenly had to do everything from first principles (hunt, rub sticks together to make fire, etc). So relax. Enjoy it. The young are much better than we are because they grow up with better tools and they spend more time using them. It’s not enough to learn them when you’re older, even if you can do that really fast. You’ll never catch up on the experience front.

But it sure is fun to try.

AddThis Social Bookmark Button

Python: looks great, stays wet longer

00:02 June 8th, 2008 by terry. Posted under python, tech. 12 Comments »

Wet clayI should be coding, not blogging. But a friend noticed I hadn’t blogged in a month, so in lieu of emailing people, here are a couple of comments on programming in Python. There are many things that could be said, but I just want to make two points that I think aren’t so obvious.

1. Python looks great

In Python, indentation is used to delimit code blocks. I like that a lot – you would indent your code anyway, right? It reduces clutter. But apart from that, Python is very minimalistic in its syntax. There are rather few punctuation symbols used, and they’re used pretty consistently. As a result, Python code looks great on the page. It’s not painful to edit, and I mean that figuratively and literally. This is worth noting because when you write complex code it’s nice if the language you’re doing it in is very clean. That’s important because code can become hard to understand and unpleasant to work with. If you have pieces of code that you dread touching, that may be in part because the code is really ugly and complex on the page. Perl is a case in point – there’s tons of punctuation symbols, and in some cases the same thing (e.g., curly braces) is used in multiple (about 5!) different ways to mean different things. If the language is pleasant to look at for longer, you are more willing to work on code that might be more forbidding when expressed in other languages. Esthetics is important. Actively enjoying looking at code simply because the language is so clean is a great advantage—for you, and for the language.

This might not seem like a big point, but it’s important to me, it’s something I’ve never encountered before, and it’s a nice property of Python. BTW, people always make fun of Lisp for its parentheses. But Lisp is the cleanest language I know of in terms of simplicity on the page. The parens and using prefix operators in S-expressions removes the need for almost all other punctuation (and makes programmatically generating code an absolute breeze).

2. Python stays wet longer

I don’t like to do too much formal planning of code. I much prefer to sit down and try writing something to see how it fits. That means I’ll often go through several iterations of code design before I reach the point where I’m happy. Sometimes this is an inefficient way to do things, particularly when you’re working on something very complex that you don’t really have your head around when you start. But I still choose to do things this way because it’s fun.

Sometimes I think of it like pottery. You grab a lump of wet clay and slap it down on the wheel. Then you try out various ideas to shape whatever it is you’re trying to create. If it doesn’t work, you re-shape it—perhaps from scratch. This isn’t a very accurate analogy, but I do think it’s valid to say that preferring to work with real code in an attempt to understand how best to shape your ideas is a much more physical process than trying to spec everything out sans code. I find I can’t know if code to implement an idea or solve a problem is going to feel right unless I physically play with it in different forms.

For me, Python stays wet longer. I can re-shape my code really easily in Python. In other languages I’ve often found myself in a position where a re-design of some aspect involves lots of work. In Python the opposite has been true, and that’s a real pleasure. When you realize you should be doing things differently and it’s just a bit of quick editing to re-organize things, you notice. I might gradually be becoming a better programmer, but I mainly feel that in using Python I simply have better quality clay.

AddThis Social Bookmark Button