Archive for the ‘python’ Category

Learning jQuery Deferreds published

Tuesday, January 7th, 2014

learning-jQuery-deferredsNicholas Tollervey and I have written a book, Learning jQuery Deferreds, published by O’Reilly.

If you’ve been a reader of this blog over the years, you may have noticed that I’m very fond of deferreds (also known as promises or futures). I’ve mainly been using them in Python with Twisted, and a couple of years ago was happy to notice that jQuery also has a (somewhat different) version of deferreds. Asking around, it soon became clear that although there are tens of millions of programmers who’ve used jQuery, very few of them have ever used deferreds. E.g., at the jQuery conference in San Francisco in 2012, only about 25% of the audience in a talk on deferreds. There are about 19,000 results for “deferred” on StackOverflow.

This seemed like a perfect storm: a fantastically cool and empowering technology that I love thinking and writing about, built in to a ubiquitous web library, in use by millions of programmers, and yet somehow not widely known or used.

The book tries to really teach deferreds. There are 18 real-life examples, along with 75 challenges (and solutions). We’ve tried to get readers into deferreds the way we are, to be provocative, to try and get you thinking and scratching your head. To get you to see how cool and elegant working with deferreds can be. In short, to make you one of us.

Although the book focusses on jQuery’s flavor of deferreds, we wrote it with a much broader aim: to be just as useful to people working with any flavor of deferreds in any language. The concepts behind deferreds are few and simple. Even if you’re not a jQuery user or a JavaScript programmer, we hope you’ll enjoy and benefit from the book.

If you’re curious, the animal on the cover is a musky rat-kangaroo. O’Reilly chose it for us. When I first saw it, I mailed them and Nicholas to say it looked “overfed, passive, and thoughtful” to which Nicholas replied that it resembled me. The O’Reilly toolchain is modern and fun, employing AsciiDoc and a shared Git repo. We wrote 30,817 words and 2,301 lines of Javascript. There’s a source code repo for all the book examples, at We spent six months on the book, during which I usually spent 1-2 days a week on it. It was a blast to write.

If you’d like to buy a copy, use AUTHD as a discount code on the O’Reilly site and you’ll receive 40% off the print or 50% off the e-book. Please let me know what you think, and/or leave a review on the O’Reilly (or Amazon) site. Have fun, and thanks!

txdlo – a Twisted deferred list observer

Monday, December 30th, 2013

Last night I wrote txdlo, a Python package that provides a class called DeferredListObserver. As you might guess, it lets you observe callback and errback events from a list of Twisted deferreds. You can add observers that will be passed information about deferreds firing. You can add deferreds to the observed list at any time, which is very useful if you’re dynamically creating deferreds that you want to monitor.

The class can be used to easily build functions or classes that provide deferreds that fire when arbitrary combinations of events from the observed deferreds have occurred.

For example you can write functions or classes that support deferreds that:

  • Implement Twisted’s DeferredList or simple variants of it, or that let you separate the various behaviors of DeferredList into simpler functions.
  • Provide a deferred that fires when N of the observed deferreds have fired.
  • Provide a deferred that ignores errors until one of the observed deferred succeeds, only firing with an error if all the observed deferreds fail.
  • Or (a more involved example), suppose you have 3 methods that can return you a user’s avatar: a fast local cache, a filesystem, and a slow network call to Gravatar. You want to write a deferred-returning function that launches all three lookups at once and fires its deferred with the first answer. But if the cache and/or filesystems fails first, you don’t want to trigger an error, you instead want to take the result from Gravatar and add it to the cache and/or filesystem, as well firing the returned deferred with the result (wherever it comes from). Only if all three lookups fail do you want to errback the deferred you returned.

Here’s a sample example usage, which shows how to use DeferredListObserver to build a simplified version of Twisted’s DeferredList class as a function:

from twisted.internet.defer import Deferred, succeed
from txdlo import DeferredListObserver

def deferredList(deferreds):
    Return a deferred that fires with a list of (success, result) tuples,
    'success' being a boolean.

    @param deferreds: a C{list} of deferreds.
    @return: a L{twisted.internet.defer.Deferred} that fires as above.

    if len(deferreds) == 0:
        return succeed([])

    dlo = DeferredListObserver(maintainHistory=True)
    map(dlo.append, deferreds)
    deferred = Deferred()

    def observer(*ignore):
        if dlo.pendingCount == 0:
            # Everything in the list has fired.
            resultList = [None] * len(deferreds)
            for index, success, value in dlo.history:
                resultList[index] = (success, value)


    return deferred

You can grab the code, read more about usage, and see several other examples at

Promises are first-class objects for function calls

Thursday, September 12th, 2013

Have you ever programmed in a language without functions as first-class objects? You can’t return a function from a function, can’t pass a function as an argument, and you certainly can’t make anonymous functions on the fly. Remember how liberating, empowering, and flexible it felt when you moved to a language with functions as first-class objects?

Yesterday, after 7 years of working with deferreds/promises/futures (call them what you will), thinking carefully about them thousands of times, mailing and blogging about them, giving talks about them, and now writing a book about them, I realized there is a very simple way to look at them:

Promises are first-class objects for function calls.

In other words, a promise gives you a way to make a function call and to pass around that function call. Return it from a function, pass it as an argument, put it in a data structure, etc. Given a promise, you can arrange to process the result of the function call or its error.

To be a bit more abstract: A promise is a time-independent reification of a function call.

By “time-independent” I mean that if you get a promise from somewhere, you don’t have to worry whether the underlying function call has already completed, is currently in progress, or has yet to run. Depending on the promise implementation there may be no way to find out. The point is that you don’t have to care and you shouldn’t care.

That’s all I’ll say for now, except for a final comment on naming.

I think “promise” is a better name than “future” or “deferred” because the latter two feel more like they’re referring to an event yet to happen. A promise feels more neutral: you could easily be making a promise about something you’ve already done. Many explanations of promises/deferreds/futures stress that they are something that will hold the result of a computation that’s not yet completed. That’s certainly a common usage, but it’s only part of the picture. Describing promises as a reification of a function call takes the time factor completely out of the picture.

Here’s a small Javascript function to memoize a (single-argument) function to illustrate the point:

var memoize = function(fn) {
    var promises = {};

    return function(arg) {
        var promise = promises[arg];
        if (!promise) {
            promise = promises[arg] = $.when(fn(arg)).promise();
        return promise;

The memoization cache is full of promises, not underlying function results. Some of the function calls may have completed, some may be underway. I’m using the jQuery $.when function as a convenience (that’s an irrelevant detail, don’t let it distract you). The promises stay in the cache indefinitely (no eviction, for simplicity), holding promises for function calls from the past.

Time is not an issue here.

Specifically, and in contrast, think about what happens with non-promise memoization. What happens if a first call comes in with an argument X, but before fn(X) has finished computing there is another call with argument X? Then another and another and another? You wind up calling fn(X) many times. I.e., doing exactly the thing you were trying to avoid! And there’s nothing you can do about it.

If you’re interested to review the book I’m writing with Nicholas Tollervey on jQuery deferreds, please email me or leave a comment below. Most of the book is not really specific to jQuery’s flavor of deferreds.

Yet another cancelable Twisted deferred class

Thursday, June 20th, 2013

I’m posting this (completely untested!) to illustrate how one could write a class to provide a Twisted deferred-like class, identical to the twisted.internet.defer.Deferred class, but which lets you call callback, errback, or cancel on the instance yourself. Hopefully that will make some sense. If not, let me know in the comments and I’ll try to be clearer. There’s also discussion of this issue (again) in the Twisted mailing list right now. For context, the discussion thread from January 2010 where I first wrote a class like the below is here.

from twisted.internet.defer import CancelledError, Deferred
from twisted.python.failure import Failure

class ControllableDeferred2013(object):

    """A Deferred-like class that takes a regular Twisted Deferred and
    provides a deferred that can be fired at will. If you have a regular
    Twisted Deferred, you can produce a deferred you have more control over
    by using your Deferred instance to make an instance of this class.

    Any time you need to fire a ControllableDeferred2013 instance for any
    reason, call its callback, errback or cancel method. It will fire
    immediately with the value you provide and the original Deferred will
    be cancelled."""

    def __init__(self, originalDeferred):
        self._fired = False
        self._originalDeferred = originalDeferred
        self._newDeferred = Deferred()
        for method in ('addBoth', 'addCallback', 'addCallbacks', 'addErrback',
            setattr(self, method, getattr(self._newDeferred, method))

    def _originalFired(self, result):
        if not self._fired:
            self._fired = True

    def cancel(self, value=None):
        if not self._fired:
            self._fired = True

    def callback(self, result=None):
        if not self._fired:
            self._fired = True

    def errback(self, fail=None):
        if not self._fired:
            self._fired = True

    def pause(self):

    def unpause(self):

Secure per-site passwords with no encrypted blob

Sunday, February 3rd, 2013

Last night I read a Guardian article, Online passwords: keep it complicated. It’s a surprisingly good summary, given that it’s aimed at the general public. The author concludes by telling us he decided to adopt LastPass, and also mentions 1Password. The comments section on the page gives similar solutions, like KeePass. The security-conscious people I know have arrived at the same conclusion. There are plenty of articles on the web that summarize various similar products, e.g., Which Password Manager Is The Most Secure? at LifeHacker. A single encrypted blob offers good security and works well in practice. It also allows the storage of information like name, address, credit cards, etc., that can be used to auto-fill web forms.

But… I’ve never liked the idea of a single encrypted file with all my passwords in it. What if the storage is lost or corrupted? Could the file someday be decrypted by someone else? If my encrypted blob is online, what happens when I am offline? If the blob is to be stored locally, I need to think about where to put it, how to back it up, etc. If a company holds it for me, what happens if they go out of business or get hacked? What if they use proprietary encryption, a closed-source access app, or a proprietary underlying data format? Not all the above solutions have all these issues, but they all have some of them.

The crucial thing they all have in common is that they use a master password to encrypt all your passwords into a single blob, and the blob has to then be reliably stored and accessible forever.

An approach that requires no storage

I realized there’s a solution that doesn’t require any storage. It’s not perfect, but it has some attractive properties. I’ve already started using it for some sites.

[Edit: it has been pointed out in the comments that the following solution has been thought of before. See SuperGenPass.]

Here’s a simple first version of Python code to print my password for a given service:

import base64, getpass, hashlib, sys

service = sys.argv[1]
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())[:32]

print 'Your password on %s is %s' % (service, password)

The service name is given on the command line. The code prints a 32-character password for use on that service. Here’s some sample output:

        $ facebook
        Enter your master password: 
        Your password on facebook is Wza2l5Tqy0omgWP+5DDsXjQLO/Mc07N8

        $ twitter
        Enter your master password: 
        Your password on twitter is eVhhpjJrmtSa8XnNMu6vLSDhPeO5nFOT

This has some nice advantages. It also places some small requirements on the user. Unfortunately, however, it is not generally applicable – at least not today. These are discussed below.


The obvious advantage is that there is no external storage. Your passwords are not stored anywhere. There’s no blob to store, protect, access, backup, worry about, etc. The algorithm used to generate your password is dead-simple, it’s open and available in dozens of languages, and it’s secure.

You’re free to use more than one master password if you like. You can invent your own services names (more on which below).

Requirements / User burden

As with all one-key-to-unlock-them-all approaches, the user obviously needs to remember their master password.

With this approach though, the user also has to remember the name they used for the service they’re trying to access. If you create a password for a service called “gmail” you’ll need to use that exact service name in the future. For me that’s not much of a burden, but I guess it would be for others.

There’s no reason why the list of services you have passwords for couldn’t be stored locally. If the password generator were in a browser extension, it could possibly suggest a service name, like “”, based on the domain of the page you were on.

With this approach, it’s even more important that one’s master password be hard to guess. Unlike the single-encrypted-blob approach, anyone who can guess your master password (and the names you use for services) can immediately obtain your passwords. They don’t also need access to the blob – it doesn’t exist.

Additional security can be easily had by, for example, using a convention of adding a constant character to your service names. So, e.g., you could use “facebook*” and “twitter*” as service names, and not tell anyone how you form service names.

General applicability

Unfortunately, there is a major problem with this approach. That’s because different sites have different requirements on passwords. Some of the difficulties can be avoided quite easily, but there’s an additional problem, caused when services change their password policy.

The above code generates a Base64 password string. So, to give some examples, if the service you want a password for doesn’t allow a plus sign in your password, the above code might make something unacceptable to the service. Same thing if they insist that passwords must be at most 12 characters long.

Ironically, these services are insisting on policies that prevent the use of truly secure passwords. They’re usually in place to ensure that short passwords are chosen from a bigger space. It would be better, though more work, to impose restrictions only on short passwords.

In a perfect world, all sites could immediately switch to allowing Base64 passwords of length ≥ 16 (say). Then the above approach would work everywhere and we’d be done.

Varying password length

A general approach to adjusting the generated password is to take some of the Base64 information produced and use it to modify the password. For example, you might not comfortable with all your passwords being the same length, so we can compute a length like this:

import base64, getpass, hashlib, string, sys

b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(password[-5]) % 16
print 'Your password on %s is %s' % (service, password[0:16 + lenAdjust])

This generates passwords that are between 16 and 31 characters in length:

        $ ./ facebook
        Enter your master password: 
        Your password on facebook is 1nTlVGPhuWZf0l9Sk27
        $ ./ twitter
        Enter your master password: 
        Your password on twitter is WE1DVZHAFBx2c3g63tR+Oi3Jxs4xMV

Satisfying site requirements

A possible approach to dealing with per-site password requirements is to have the code look up known services and adjust the initial password it generates to be acceptable. This can easily be done in a secure, random, repeatable way. For example:

  • If a site doesn’t allow upper case, lowercase the password.
  • If a site doesn’t allow digits, replace them with random letters.
  • If a site requires punctuation, you can replace some initial letters in the password with randomly chosen punctuation and then randomly permute the result using the Knuth Shuffle.

Some of these transformations use random numbers. These are easy to obtain: take an unused part of the Base64 string and use it to seed a RNG. For each transformation, you would need to call the RNG a fixed number of times, i.e., independent of the number of random numbers actually used to perform the transformation. That’s necessary in order to keep the RNG in a known state for subsequent transformations (if any).

For example, the following replaces digits 0-9 with a letter from A-J whose case is chosen randomly:

import base64, getpass, hashlib, random, string, sys

def getSeed(chars):
    seed = ord(chars[0])
    for letter in chars[1:]:
        seed = (seed << 8) & ord(letter)
    return seed

service = sys.argv[1]
b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
digest = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(digest[-5]) % 16

passwordWithDigits = digest[0:16 + lenAdjust]
password = ''

randoms = [random.randint(0, 1) for _ in passwordWithDigits]

for index, letter in enumerate(passwordWithDigits):
    if letter in '0123456789':
        base = ord('a' if randoms[index] else 'A')
        replacement = chr(base + ord(letter) - ord('0'))
        password += replacement
        password += letter

print 'Your password on %s is %s' % (service, password)

Here’s the output for the above two services (using the same master password):

        $ ./ facebook
        Enter your master password: 
        Your password on facebook is bnTlVGPhuWZfAljSkch
        $ ./ twitter
        Enter your master password: 
        Your password on twitter is WEBDVZHAFBxccdgGdtR+OidJxsExMV

As you can see, the digits are replaced with letters (in a biased way, that we can ignore in an informal blog post). The RNG is in a known state because the number of times it has been called is independent of the number of digits in the pre-transformation text.

This approach can be used to transform the initial random sequence into one that satisfies a service’s password restrictions.

It is difficult to reliably associate service names with site policies. To do so might require keeping a file mapping the name a user used for a service to the policy of the site. Although this doesn’t defeat the purpose of this approach (since that file would not need to be stored securely), it is an additional and unwanted pain for the user. Part of the point was to try to entirely avoid additional storage, even if it doesn’t have to be encrypted.

The major problem with per-site requirements

The major problem however is that sites may change their password policy. Even if our program knew the rules for all sites, it would have a real problem if a site changed its policy. The code would need to be updated to generate passwords according to the new site policy. Existing users, supposing they upgraded, would then be shown incorrect passwords and would need to do password resets, which is obviously inconvenient.


I like the above approach a lot, but don’t see a way to solve the issue with changing site policies. I wouldn’t mind building in some rules for known popular sites, but any step in that direction has its problems – at least as far as I can see.

For now, I’m going to start using the above approach on sites that allow a long random password with characters from the Base64 set. That covers the majority of sites I use. Importantly, that includes Google, so if I ever need password resets I can have them sent there, knowing that I can always log in to recover them.

Omit needless parens

Tuesday, December 18th, 2012

The famous 17th commandment in The Elements of Style is “Omit needless words”.

There should be an equivalent in programming, but for parentheses. Every time I see needless parens in a program I want to rip them out (unless they’re obviously there for formatting/readability reasons).

Community service message: Omit needless parens. When in doubt whether parens are needed, look up the precedence rules for the operators involved and only use parens if the default isn’t what you want.

Here’s why you shouldn’t use needless parens:

  • The #1 reason is that you’re making your code more difficult to read for people who know the language better than you do. A more experienced programmer will see a red flag and look at your code more carefully than necessary because they will be trying to figure out why you used the extra parens and if there’s something non-obvious going on. When I come across code like that, I usually conclude that whoever wrote the code doesn’t know the language that well. My opinion of the code goes down. My reading speed goes down too because the needless parens, in my estimation, indicate an increased likelihood that programmer has done other (worse) things elsewhere.
  • You don’t want to appear incompetent or lazy, or to slow down or put off people reading your code, right? (See above.)
  • Putting in needless parens is heading down a slippery slope. How many levels of extra parens should you stop at? The only clear cut rule that makes sense is to stop at zero.
  • If you pause to look up the precedence rules, you’ll make yourself a better programmer in the language in question. You’ll be able to read other people’s needless-parenthesis-free code with no problem. You can pen lofty holier-than-thou blog posts like this one.

Back in about 1985 I wrote a tiny shell script to print out operator precedence and associativity rules for C. When I started programming in Perl, I wrote one for it. Then one for Python and later one for Javascript. For your convenience, and as a reward for reading, I just stuck the 4 scripts up on Github.

A simple way to calculate the day of the week for any day of a given year

Sunday, November 11th, 2012

March 29th

Image: Jeremy Church

The other day I read a tweet about how someone was impressed that a friend had been able to tell them the day of the week given an arbitrary date.

There are a bunch of general methods to do this listed on the Wikipedia page for Determination of the day of the week. Typically, there are several steps involved, and you need to memorize some small tables of numbers.

I used to practice that mental calculation (and many others) when I was about 16. Although all the steps are basic arithmetic, it’s not easy to do the calculation in your head in a couple of seconds. Given that most of these questions that you’re likely to face in day-to-day life will be about the current year, it seemed like it might be a poor tradeoff to learn the complicated method to calculate the day of the week for any date if there was a simpler way to do it for a specific year.

The method I came up with after that observation is really simple. It just requires you to memorize a single 12-digit sequence for the current year. The 12 digits correspond to the first Monday for each month.

For example, the sequence for 2012 is 265-274-263-153. Suppose you’ve memorized the sequence and you need to know the day of the week for March 29th. You can trivially calculate that it’s a Thursday. You take the 3rd digit of the sequence (because March is the 3rd month), which is 5. That tells you that the 5th of March was a Monday. Then you just go backward or forward as many weeks and days as you need. The 5th was a Monday, so the 12th, 19th, and 26th were too, which means the 29th was a Thursday.

It’s nice because the amount you need to memorize is small, and you can memorize less digits if you only want to cover a shorter period. The calculation is very simple and always the same in every case, and you never have to think about leap years. At the start of each year you just memorize a single sequence, which is quickly reinforced once you use it a few times.

Here’s Python code to print the sequence for any year.

#!/usr/bin/env python

import datetime, sys

    year = int(sys.argv[1])
except IndexError:
    year =

firstDayToFirstMonday = ['1st', '7th', '6th', '5th', '4th', '3rd', '2nd']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
summary = ''

for month in range(12):
    firstOfMonth = datetime.datetime(year, month + 1, 1).weekday()
    firstMonday = firstDayToFirstMonday[firstOfMonth]
    print months[month], firstMonday
    summary += firstMonday[0]

print 'Summary:', '-'.join(summary[x:x+3] for x in range(0, 12, 3))

The output for 2012 looks like

Jan 2nd
Feb 6th
Mar 5th
Apr 2nd
May 7th
Jun 4th
Jul 2nd
Aug 6th
Sep 3rd
Oct 1st
Nov 5th
Dec 3rd
Summary: 265-274-263-153

The memory task is made simpler by the fact that there are only 14 different possible sequences. Or, if you consider just the last 10 digits of the sequences (i.e., starting from March), there are only 7 possible sequences. There are only 14 different sequences, so if you use this method in the long term, you’ll find the effort of remembering a sequence will pay off when it re-appears. E.g., 2013 and 2019 both have sequence 744-163-152-742. There are other nice things you can learn that can also make the memorization and switching between years easier (see the Corresponding months section on the above Wikipedia page).

Here are the sequences through 2032:

2012 265-274-263-153
2013 744-163-152-742
2014 633-752-741-631
2015 522-641-637-527
2016 417-426-415-375
2017 266-315-374-264
2018 155-274-263-153
2019 744-163-152-742
2020 632-641-637-527
2021 411-537-526-416
2022 377-426-415-375
2023 266-315-374-264
2024 154-163-152-742
2025 633-752-741-631
2026 522-641-637-527
2027 411-537-526-416
2028 376-315-374-264
2029 155-274-263-153
2030 744-163-152-742
2031 633-752-741-631
2032 521-537-526-416

Simpler Twisted deferred code via decorated callbacks

Sunday, October 14th, 2012

This morning I was thinking about Twisted deferreds and how people find them difficult to grasp, but how they’re conceptually simple once you get it. I guess most of us tell people a deferred is something to hold a result that hasn’t arrived yet. Sometimes, though, deferreds do have a result in them immediately (e.g., using succeed or fail to get an already-fired deferred).

I wondered if it might work to tell people to think of a deferred as really being the result. If that were literally true, then instead of writing:

d = getPage(...)
d.addErrback(errcheck, args)
d.addCallback(cleanup, args)
d.addCallback(reformat, args)
return d

We might write something like:

result1 = getPage(...)
result2 = errcheck(result1, args)
result3 = cleanup(result2, args)
return reformat(result3, args)

And if you could write that, you could obviously instead write:

return reformat(cleanup(errcheck(getPage(...), args), args), args)

If we could write Twisted code that way, I think using deferreds would be simpler for people unfamiliar with them. We could show them Twisted code and not even have to mention deferreds (see below).

In the style we’re all used to, the programmer manually adds callbacks and errbacks. That’s basically boilerplate. It gets worse when you then need to also use DeferredList, etc. It’s a little confusing to read deferred code at first, because you need to know that the deferred result/failure is automatically passed as the first arg to callbacks/errbacks. It seems to take a year or more for people to finally realize how the callback & errback chains actually interact :-) Also, I wonder how comfortable programmers are with code ordered innermost function first, as in the normal d.addCallback(inner).addCallback(outer) Twisted style, versus outer(inner()), as in the line above.

Anyway… I realized we CAN let people use the succinct style above, by putting boilerplate into decorators. I wrote two decorators, called (surprise!) callback and errback. You can do this:

def errcheck(failure, arg):

def cleanup(page, arg):

def reformat(page, arg):

reformat(cleanup(errcheck(getPage(...), arg1), arg2), arg3)

The deferred callback and errback chains are hooked up automatically. You still get a regular deferred back as the return value.

And… the “deferred” aspect of the code (or at least the need to talk about or explain it) has conveniently vanished.

You can also do things like

func1(getDeferred1(), errcheck(func2(getDeferred2(), getDeferred3())))

This gets the result of deferreds 2 & 3 and (if neither fails) passes the result of calling func2 on both results through to func1, which is called along with the result of deferred 1. You don’t need to use DeferredLists, as the decorator makes them for you. The errcheck function wont be called at all unless there’s an error.

That’s nice compared to the verbose equivalent:

d1 = DeferredList([getDeferred2(), getDeferred3()])
d2 = DeferredList([getDeferred1(), d1])

Or the more compact but awkward:

DeferredList([getDeferred(), DeferredList([getDeferred(), getDeferred()]).addCallback(

There’s lots more that could be said about this, but that’s enough for now. The code (surely not bulletproof) and some tests are on Github. I’ll add a README sometime soon. This is still pretty much proof of concept, and some it could be done slightly differently. I’m happy to discuss in more detail if people are interested.

describejson – a Python script for summarizing JSON structure

Thursday, August 9th, 2012

Yesterday I was sent a 24M JSON file and needed to look through it to give someone an opinion on its contents. I did what I normally do to look at JSON, piping it into python -m json.tool. The result looked pretty good, I scrolled through some screens with a single long list and jumped to the bottom to see what looked like the end of the list. What I didn’t know at the time was that the output was 495,647 lines long! And there was some important stuff in the middle of the output that I didn’t see at all.

So I decided to write a quick program to recursively summarize JSON. You can grab it from Github at

Usage is simple, just send it JSON on stdin. Here’s an example input:

  "cats": 3,
  "dogs": 6,
  "parrots": 1

Which gets you this output:

$ python < input.json
1 dict of length 3. Values:
  3 ints

The output is a little cryptic, but you’ll get used to it (and I may improve it). In words, this is telling you that (after loading the JSON) the input contained 1 Python dictionary with 3 elements. The values of the 3 elements are all integers. The indentation is meaningful, of course. You can see that the script is summarizing the fact that the 3 values in the dict were all of the same type.

Here’s another sample input:

  ["fluffy", "kitty", "ginger"],
  ["fido", "spot", "rover"],

Which gets you:

$ python < input.json
1 list of length 3. Values:
  2 lists of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

In words, the input was a list of length 3. Its contents were 2 lists of length 3 that both contained 3 unicode strings, and a final list that contains just a single unicode string.

Specifying equality strictness

The script currently takes just one option, --strictness (or just -s) to indicate how strict it should be in deciding whether things are “the same” in order to summarize them. In the above output, the default strictness length is used, so the script considers the first two inner lists to be collapsible in the summary, and prints a separate line for the last list since it’s of a different length. Here’s the output from running with --strictness type:

$ python –strictness type < input.json
1 list of length 3. Values:
  3 lists of length 3. Values:
    3 unicodes

The lists are all considered equal here. The output is a little misleading, since it tells us there are 3 lists of length 3, each containing 3 unicodes. I may fix that.

We can also be more strict. Here’s the output from --strictness keys:

$ python –strictness keys < input.json
1 list of length 3. Values:
  1 list of length 3. Values:
    3 unicodes
  1 list of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

The 3 inner lists are each printed separately because their contents differ. The keys argument is also a bit confusing for lists, it just means the list values. It’s clearer when you have dictionaries in the input.

This input:

    "a": 1,
    "b": 2,
    "c": 3
    "d": 4,
    "e": 5,
    "f": 6


$ python < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

I.e., one list, containing 2 dictionaries, each containing 3 int values. Note that this is using the default of --strictness length so the two dicts are considered the same. If we run that input with strictness of keys, we’ll instead get this:

$ python –strictness keys < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

The dicts are considered different because their keys differ. If we change the input to make the keys the same:

    "a": 1,
    "b": 2,
    "c": 3
    "a": 4,
    "b": 5,
    "c": 6

and run again with --strictness keys, the dicts are considered the same:

$ python –strictness keys < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

but if we use --strictness equal, the dicts will be considered different:

$ python –strictness equal < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

Finally, making the dicts the same:

    "a": 1,
    "b": 2,
    "c": 3
    "a": 1,
    "b": 2,
    "c": 3

and running with --strictness equal will collapse the summary as you’d expect:

$ python –strictness equal < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

Hopefully it’s clear that by being less strict on matching you’ll get more concise output in which things are casually considered “the same” and if you’re more strict you’ll get more verbose output, all the way to using strict equality for both lists and dicts.

Here’s the full set of --strictness options:

  • type: compare things by type only.
  • length: compare lists and objects by length.
  • keys: compare lists by equality, dicts by keys.
  • equal: compare lists and dicts by equality.


The naming of the --strictness options could be improved. The keys option should probably be called values (but that is confusing, since dictionaries have values and it’s a comparison based on their keys!). A values option should probably also compare the value of primitive things, like integers and strings.

There are quite a few other things I might do to this script, if I ever have time. It would be helpful to print out some keys and values when these are short and unchanging. It would be good to show an example representative value of something that repeats (modulo strictness) many times. It might be good to be able to limit the depth to go into a JSON structure.

Overall though, I already find the script useful and I’m not in a rush to “improve” it by adding features. You can though :-)

You might also find it helpful to take what you learn about a JSON object via describe JSON and use that to grep out specific pieces of the structure using

If you’re curious, here’s the 24-line output summary of the 24M JSON I received. Much more concise than the nearly 1/2 a million lines from python -m json.tool:

1 dict of length 3. Values:
  1 int
  1 dict of length 4. Values:
    1 list of length 17993. Values:
      17993 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
    1 list of length 0.
    1 list of length 11907. Values:
      11907 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 1. Values:
          1 unicode
        2 unicodes
    1 list of length 28068. Values:
      28068 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
  1 unicode

Autovivification in Python: nested defaultdicts with a specific final type

Saturday, May 26th, 2012

I quite often miss the flexibility of autovivification in Python. That Wikipedia link shows a cute way to get what Perl has:

from collections import defaultdict

def tree():
    return defaultdict(tree)

lupin = tree()
lupin["express"][3] = "stand and deliver"

It’s interesting to think about what’s going on in the above code and why it works. I really like defaultdict.

What I more often want, though, is not infinitely deep nested dictionaries like the above, but a (known) finite number of nested defaultdicts with a specific type at the final level. Here’s a tiny function I wrote to get just that:

from collections import defaultdict

def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels – 1, final)))

So let’s say you were counting the number of occurrences of words said by people by year, month, and day in a chat room. You’d write:

words = autovivify(5, int)

words["sam"][2012][5][25]["hello"] += 1
words["sue"][2012][5][24]["today"] += 1

Etc. It’s pretty trivial, but it was fun to think about how to do it with a function and to play with some alternatives. You could do it manually with nested lambdas:

words = defaultdict(lambda: defaultdict(int))
words["sam"]["hello"] += 1

But that gets messy quickly and isn’t nearly as much fun.

Emacs buffer mode histogram

Thursday, November 10th, 2011

Tonight I noticed that I had over 200 buffers open in emacs. I’ve been programming a lot in Python recently, so many of them are in Python mode. I wondered how many Python files I had open, and I counted them by hand. About 90. I then wondered how many were in Javascript mode, in RST mode, etc. I wondered what a histogram would look like, for me and for others, at times when I’m programming versus working on documentation, etc.

Because it’s emacs, it wasn’t hard to write a function to display a buffer mode histogram. Here’s mine:

235 buffers open, in 23 distinct modes

91               python +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
47          fundamental +++++++++++++++++++++++++++++++++++++++++++++++
24                  js2 ++++++++++++++++++++++++
21                dired +++++++++++++++++++++
16                 html ++++++++++++++++
 7                 text +++++++
 4                 help ++++
 4           emacs-lisp ++++
 3                   sh +++
 3       makefile-gmake +++
 2          compilation ++
 2                  css ++
 1          Buffer-menu +
 1                 mail +
 1                 grep +
 1      completion-list +
 1                   vm +
 1                  org +
 1               comint +
 1              apropos +
 1                 Info +
 1           vm-summary +
 1      vm-presentation +

Tempting as it is, I’m not going to go on about the heady delights of having a fully programmable editor. You either already know, or you can just drool in slack-jawed wonder.

Unfortunately I’m a terrible emacs lisp programmer. I can barely remember a thing each time I use it. But the interpreter is of course just emacs itself and the elisp documentation is in emacs, so it’s a really fun environment to develop in. And because emacs lisp has a ton of support for doing things to itself, code that acts on emacs and your own editing session or buffers is often very succinct. See for example the save-excursion and with-output-to-temp-buffer functions below.

(defun buffer-mode-histogram ()
  "Display a histogram of emacs buffer modes."
  (let* ((totals ‘())
         (buffers (buffer-list()))
         (total-buffers (length buffers))
         (ht (make-hash-table :testequal)))
      (dolist (buffer buffers)
        (set-buffer buffer)
            ((mode-name (symbol-name major-mode)))
          (puthash mode-name (1+ (gethash mode-name ht 0)) ht))))
    (maphash (lambda (key value)
               (setq totals (cons (list key value) totals)))
    (setq totals (sort totals (lambda (x y) (> (cadr x) (cadr y)))))
    (with-output-to-temp-buffer "Buffer mode histogram"
      (princ (format "%d buffers open, in %d distinct modes\n\n"
                      total-buffers (length totals)))
      (dolist (item totals)
            ((key (car item))
             (count (cadr item)))
          (if (equal (substring key -5) "-mode")
              (setq key (substring key 0 -5)))
          (princ (format "%2d %20s %s\n" count key
                         (make-string count ?+))))))))

Various things about the formatting could be improved. E.g., not use fixed-width fields for the count and the mode names, and make the + signs indicate more than one buffer mode when there are many.

txdpce: a Twisted class for deferred parallel command execution

Tuesday, July 12th, 2011

I just uploaded a simple Twisted Python class, txdpce, to Launchpad. It’s designed for situations where you have multiple ways of obtaining a function result and you want to try them all in parallel and return the result from the fastest function. A typical case is that you can either compute a result via a network call or try to get it out of a cache (perhaps also via a network call, to memcached). You might also be able to compute it locally, etc.

Things can be more complicated than provided for here, of course. E.g., you might like to cache the result of a local call (if it finishes first or if the cache lookup fails). The txdpce class is supposed to be a simple demonstration. I wrote it for a bit of fun this morning and also because it’s yet another nice example of how you can click together the building blocks of Twisted to form increasingly sophisticated classes.

Here’s the class. You’ll find a test suite at the Launchpad site. You can download the code using bzr via bzr branch lp:txdpce.

from twisted.internet import defer
from twisted.python import failure

class ParallelCommandException(Exception):

class DeferredParallelCommandExecutor(object):
    Provides a mechanism for the execution of a command by multiple methods,
    returning the result from the one that finishes first.

    def __init__(self):
        self._functions = []

    def registerFunction(self, func):
        Add func to the list of functions that will be run when execute is

        @param func: A callable that should be run when execute is called.

    def execute(self, *args, **kwargs):
        Run all the functions in self._functions on the given arguments and
        keyword arguments.

        @param args: Arguments to pass to the registered functions.

        @param kwargs: Keyword arguments to pass to the registered functions.

        @raise RuntimeError: if no execution functions have been registered.

        @return: A C{Deferred} that fires when the first of the functions
        if not self._functions:
            raise RuntimeError('No execution functions have been registered.')

        deferreds = [defer.maybeDeferred(func, *args, **kwargs)
                     for func in self._functions]
        d = defer.DeferredList(deferreds, fireOnOneCallback=1, consumeErrors=1)
        d.addCallback(self._done, deferreds)
        return d

    def _done(self, deferredListResult, deferreds):
        Process the result of the C{DeferredList} execution of all the
        functions in self._functions. If result is a tuple, it's the result
        of a successful function, in which case we cancel all the other
        deferreds (none of which has finished yet) and give back the
        result.  Otherwise, all the function calls must have failed (since
        we passed fireOnOneCallback=True to the C{DeferredList} and we
        return a L{ParallelCommandException} containing the failures.

        @param deferredListResult: The result of a C{DeferredList} returned
        by self.execute firing.

        @param deferreds: A list of deferreds for other functions that were
        trying to compute the result.

        @return: Either the result of the first function to successfully
        compute the result or a C{failure.Failure} containing a
        L{ParallelCommandException} with a list of the failures from all
        functions that tried to get the result.
        if isinstance(deferredListResult, tuple):
            # A tuple result means the DeferredList fired with a successful
            # result.  Cancel all other deferreds and return the result.
            result, index = deferredListResult
            for i in range(len(self._functions)):
                if i != index:
            return result
            # All the deferreds failed. Return a list of all the failures.
            failures = [fail for (result, fail) in deferredListResult]
            return failure.Failure(ParallelCommandException(failures))

A resizable dispatch queue for Twisted

Monday, June 27th, 2011

In December 2009 I posted to the Twisted mailing list about what I called a Resizable Dispatch Queue. I’ve just spent some time making a new version that’s much improved. You can pick up the new version via pip install txrdq, from PyPI, or from the txRDQ project page on Launchpad. Here’s the example use case, taken from my original posting:

You want to write a server with a web interface that allows people to enter their phone number so you can send them an SMS. You anticipate lots of people will use the service. But sending SMS messages is quite slow, and the company that you ship those jobs off to is concerned that you’ll overrun their service (or maybe they have an API limit, etc). So you need to queue up jobs locally and send them off at a certain rate. You’d like to be able to adjust that rate up or down. You also want to be able to shut your service down cleanly (i.e., not in the middle of a task), and when you restart it you want to be able to re-queue the jobs that were queued last time but which hadn’t gone out.

To make the example more concrete, suppose your function that sends the SMS is called sendSMS and that it takes a (number, message) tuple argument. Here are some of the kinds of things you can do:

from txrdq.rdq import ResizableDispatchQueue

# Create a queue that will allow 5 simultaneous underway jobs.
rdq = ResizableDispatchQueue(sendSMS, 5)

# Later... send off some SMS messages.
d1 = rdq.put((2127399921, 'Hello...'), priority=5)

d2 = rdq.put((5052929919, 'Test...'), priority=5)

# Cancel the second job

# Widen the outgoing pipeline to 10 simultaneous jobs.
rdq.width = 10

# We're dispatching jobs too fast, turn it down a little.
rdq.width = 7

# Get a copy of the list of pending jobs.
jobs = rdq.pending()

# Cancel one of the pending jobs from the jobs list.

# Reprioritize one of the pending jobs from the jobs list.
rdq.reprioritize(job, -1)

# Arrange to increase the number of jobs in one hour.
reactor.callLater(3600, rdq.setWidth, 20)

# Pause processing.

# Resume processing, with a new width of 8.

# Shutdown. Wait for any underway jobs to complete, and save
# the list of jobs not yet processed.

def saveJobs(jobs):
    pickle.dump(jobs, ...)

d = rdq.stop()

I’ve come up with many uses for this class over the last 18 months, and have quite often recommended it to people on the #twisted IRC channel. Other examples include fetching a large number of URLs in a controlled way, making many calls to the Twitter API, etc.

Usage of the class is very simple. You create the dispatch queue, giving it a function that will be called to process all jobs. Then you just put job arguments as fast as you like. Each call to put gets you a Twisted Deferred instance. If your function runs successfully on the argument, the deferred will call back with an instance of txrdq.job.Job. The job contains information about when it was queued, when it was launched, when it stopped, and of course the result of the function. If your function hits an error or the job is canceled (by calling cancel on the deferred), the deferred will errback and the failure will again contain a job instance with the details.

It’s also useful to have an admin interface to the queue, so calls such as pending and underway are provided. These return lists of job instances. You can call cancel on a job instance too. You can reprioritize jobs. And you can pause processing or shut the queue down cleanly.

The code also contains an independently useful Twisted classes called DeferredPriorityQueue (which I plan to write about), and DeferredPool (which I described earlier).

How to asynchronously exchange a dictionary using Twisted deferreds

Friday, June 10th, 2011

Here’s a fun class that I can’t think of a good use for :-) But I like its simplicity and it’s another illustration of what I like to call asynchronous data structures.

Suppose you have a producer who’s building a dictionary and a consumer who wants to look things up in it. The producer is working at their own pace, making new dictionary entries in whatever work order that suits them, and the consumer is requesting dictionary items in whatever order they need them. The two orders are obviously extremely unlikely to be the same if the dictionary is of non-trivial size. How do you write an asynchronous server (or data structure) that sits between the producer and consumer?

Yes, it’s far fetched, perhaps, but here’s a simple asynchronous dictionary class that lets you do it gracefully:

from collections import defaultdict
from twisted.internet import defer

class DeferredDict(dict):
    def __init__(self, *args, **kwargs):
        self._deferreds = defaultdict(set)
        dict.__init__(self, *args, **kwargs)

    def __getitem__(self, item):
            return defer.succeed(dict.__getitem__(self, item))
        except KeyError:
            d = defer.Deferred()
            return d

    def __setitem__(self, item, value):
        if item in self._deferreds:
            for d in self._deferreds[item]:
            del self._deferreds[item]
        dict.__setitem__(self, item, value)

When a consumer tries to get an element from the dictionary, they always get a deferred. The deferred will fire with the value from the dictionary when (if) it becomes available. Of course if the value is already known, they get it in a deferred that has already fired (via succeed). When the producer puts an element into the dictionary, any consumer deferreds that were waiting on that element’s value are given the value.

Note that a __delitem__ isn’t needed, we just inherit that from dict. If a non-existent item is deleted, you get the normal dictionary behavior (a KeyError). If the item does exist, that means the list of waiting deferreds on that item is empty (the fact the item exists means any waiting deferreds for the item have all been fired and that that item in the self._deferreds dictionary was deleted), so we can just let the dictionary class delete the item, as usual.

Graceful shutdown of a Twisted service with outstanding deferreds

Friday, June 10th, 2011

I’ve been spending a bit of time thinking again about queues and services. I wrote a Twisted class in 2009 to maintain a resizable dispatch queue (code in Launchpad, description on the Twisted mailing list). For this post I’ve pulled out (and simplified slightly) one of its helper classes, a DeferredPool.

This simple class maintains a set of deferreds and gives you a mechanism to get a deferred that will fire when (if!) the size of the set ever drops to zero. This is useful because it can be used to gracefully shut down a service that has a bunch of outstanding requests in flight. For each incoming request (that’s handled via a deferred), you add the deferred to the pool. When a signal arrives to tell the service to stop, you stop taking new requests and ask the pool for a deferred that will fire when all the outstanding deferreds are done, then you exit. This can all be done elegantly in Twisted, the last part by having the stopService method return the deferred you get back from the pool (perhaps after you add more cleanup callbacks to it).

Here’s the code:

from twisted.internet import defer

class DeferredPool(object):
    """Maintains a pool of not-yet-fired deferreds and gives a mechanism to
    request a deferred that fires when the pool size goes to zero."

    def __init__(self):
        self._pool = set()
        self._waiting = []

    def _fired(self, result, d):
        """Callback/errback each pooled deferred runs when it fires. The
        deferred first removes itself from the pool. If the pool is then
        empty, fire all the waiting deferreds (which were returned by
        if not self._pool:
            waiting, self._waiting = self._waiting, []
            for waiter in waiting:
        return result

    def add(self, d):
        """Add a deferred to the pool."""
        d.addBoth(self._fired, d)

    def notifyWhenEmpty(self, testImmediately=True):
        """Return a deferred that fires (with None) when the pool empties.
        If testImmediately is True and the pool is empty, return an already
        fired deferred (via succeed)."
        if testImmediately and not self._pool:
            return defer.succeed(None)
            d = defer.Deferred()
            return d

As usual I’m posting this example because I find Twisted’s deferreds so elegant. Here are a few comments on the above that might help you understand deferreds better.

A frequent pattern when creating and managing deferreds is that you can add callbacks and errbacks to them yourself to transparently do some housekeeping when they fire. In this case, for each deferred passed to add, I’m adding a callback and an errback that will run self._fired when the deferred fires. The first thing that method does is take the deferred out of the pool of outstanding deferreds. So the deferred itself cleans up the pool. It does that transparently, by which I mean that the call/errback function (self._fired) always returns whatever result it was passed. It’s on both the callback and errback chains of the deferred and has no effect on the result. The deferred may already have call/errbacks on it when it is passed to add, and it may have them added to it after add is done. Whoever created and is otherwise using the deferred will be none the wiser and is in no way affected.

When a deferred in the pool fires, it also checks to see if the pool size is zero and if there are any deferreds waiting to be informed of that. If so, it fires all the waiting deferreds and empties the list of waiting deferreds. This doesn’t mean the action is necessarily over. More deferreds can be added, more waiters can be added, etc. The pool size can go to zero again and if there are no waiters are waiting, no big deal, etc.

It’s easy to add functionality to e.g., record what time deferreds were added, provide stats, allow outstanding deferreds to be cancelled, add notifications when high/low water marks are reached, etc. But that’s enough for now. Feel free to ask questions below.

The eighty six non-trivial powers ≤ 2^20

Wednesday, March 30th, 2011

Tonight Jamu Kakar mentioned in IRC that a program of his had unexpectedly crashed after processing 1,048,376 items. I think it’s a useful debugging skill to have to be able to recognize numbers like that (it’s very close to 2^20). I’ve often wanted to write a tiny program to print out all the non-trivial powers, and since I have far more important and pressing things to be doing, I immediately went to write the code. At a minimum it seems prudent to recognize all powers up to 1000, and the powers of 2 to much higher. Below you have all 86 non-trivial powers up to 2^20. I don’t know them all, but I wish I did.

  4 = 2^2                  729 = 3^6, 9^3                32768 = 2^15, 8^5
  8 = 2^3                 1000 = 10^3                    38416 = 14^4
  9 = 3^2                 1024 = 2^10, 4^5               46656 = 6^6
 16 = 2^4, 4^2            1296 = 6^4                     50625 = 15^4
 25 = 5^2                 1331 = 11^3                    59049 = 3^10, 9^5
 27 = 3^3                 1728 = 12^3                    65536 = 2^16, 4^8, 16^4
 32 = 2^5                 2048 = 2^11                    78125 = 5^7
 36 = 6^2                 2187 = 3^7                     83521 = 17^4
 49 = 7^2                 2197 = 13^3                   100000 = 10^5
 64 = 2^6, 4^3, 8^2       2401 = 7^4                    104976 = 18^4
 81 = 3^4, 9^2            2744 = 14^3                   117649 = 7^6
100 = 10^2                3125 = 5^5                    130321 = 19^4
121 = 11^2                3375 = 15^3                   131072 = 2^17
125 = 5^3                 4096 = 2^12, 4^6, 8^4, 16^3   160000 = 20^4
128 = 2^7                 4913 = 17^3                   161051 = 11^5
144 = 12^2                5832 = 18^3                   177147 = 3^11
169 = 13^2                6561 = 3^8, 9^4               248832 = 12^5
196 = 14^2                6859 = 19^3                   262144 = 2^18, 4^9, 8^6
216 = 6^3                 7776 = 6^5                    279936 = 6^7
225 = 15^2                8000 = 20^3                   371293 = 13^5
243 = 3^5                 8192 = 2^13                   390625 = 5^8
256 = 2^8, 4^4, 16^2     10000 = 10^4                   524288 = 2^19
289 = 17^2               14641 = 11^4                   531441 = 3^12, 9^6
324 = 18^2               15625 = 5^6                    537824 = 14^5
343 = 7^3                16384 = 2^14, 4^7              759375 = 15^5
361 = 19^2               16807 = 7^5                    823543 = 7^7
400 = 20^2               19683 = 3^9                   1000000 = 10^6
512 = 2^9, 8^3           20736 = 12^4                  1048576 = 2^20, 4^10, 16^5
625 = 5^4                28561 = 13^4

I produced the above with this quick hack:

from collections import defaultdict

powers = defaultdict(list)
lim = 20

for a in range(2, lim + 1):
    for b in range(2, lim + 1):
        n = a ** b
        if n > 2 ** lim:
        powers[n].append((a, b))

for n in sorted(powers.keys()):
    print '%7d = %s' % (n,
                        ', '.join('%d^%d' % (a, b)
                                  for (a, b) in powers[n]))


Friday, January 28th, 2011

I can never understand when I meet programmers who don’t use emacs. As a programmer, you spend inordinate amounts of time in your editor. You call yourself a programmer. You like to automate things. You get frustrated when you can’t take matters into your own hands. You like hacking on things. Right? And so…….. why wouldn’t you be deeply attracted to an editor that is fully programmable? Sure, (emacs) lisp may not be to everyone’s liking, but being able to program your editor is hugely powerful, especially when the programming language comes with an extremely strong library of tools just for editing text inside the editor.

Jamu Kakar, another emacs fan was just over at my place. He didn’t know about narrowing the buffer – to only show one section of it so that you can edit it to your heart’s content (e.g., global search & replace) and then widen it again when you’re done. We were looking at some Python code and I did a search. The class we were looking at was long, and I didn’t know if my search had finished in code that was still part of the same class. I said to Jamu “Emacs Python mode should have a function that narrows the buffer to the current class”.

After he was gone, I was thinking about that, and realized it would be easy to write. It’s all of 8 lines.

(defun py-narrow-to-class nil
    (py-beginning-of-def-or-class t)
        ((start (point)))
      (py-end-of-def-or-class t)
      (narrow-to-region (point) start))))

Not too shabby, and it took less than 5 minutes to write it. In words: here’s a function called py-narrow-to-class that takes no arguments and that I want to call interactively (via M-x py-narrow-to-class). It’s going to go to the start of the current Python class, set a local variable called start to remember that location, then move to the end of the class, and narrow the buffer. That’s wrapped in a (save-excursion) so that when all that moving around and narrowing is done, the cursor will be in the exact spot it was when I started. If I want, I can now assign this function to a single keystroke when I’m in Python mode.

If you don’t think that’s pretty neat, you’re probably not a programmer. Can you do that in your editor? – Python for extracting pieces of JSON objects

Thursday, November 25th, 2010

Lots of APIs these days return JSON objects. I love JSON, but reading a raw JSON dump can be awkward. You can print the whole thing using pprint in Python (or your favorite language), but what if you want to grep out and print only parts of the JSON? I was thinking tonight that it would be easy and useful to write a recursive script to do that. It took about half an hour to arrive at this solution:

#!/usr/bin/env python

import sys
import re
import json
from pprint import pprint

def jsongrep(d, patterns):
        pattern = patterns.pop(0)
    except IndexError:
        if isinstance(d, dict):
            keys = filter(pattern.match, d.keys())
        elif isinstance(d, list):
            keys = map(int,
                              ['%d' % i for i in range(len(d))]))
            if pattern.match(str(d)):
        for item in (d[key] for key in keys):
            jsongrep(item, patterns[:])

if __name__ == '__main__':
        j = json.loads(
    except ValueError, e:
        print >>sys.stderr, 'Could not load JSON object from stdin.'

    jsongrep(j, map(re.compile, sys.argv[1:]))

Usage is really simple. Let’s look at a couple of easy examples from the command line:

$ echo '{"hey" : "you"}' | hey
u'you' has matched the “hey” key in the JSON object and printed its value. Let’s do the same thing with a 2-level JSON object:

$ echo '{"hey" : { "there" : "you"}}' | hey
{u'there': u'you'}

Again, we see the entire object corresponding to the “hey” key. We can add another argument to drill down into the object

$ echo '{"hey" : { "there" : "you"}}' | hey there

As you might hope, you can use a regular expression for an argument:

$ echo '{"hey" : { "there" : "you"}}' | 'h.*' '.*'

which in this case could have been given more concisely as

$ echo '{"hey" : { "there" : "you"}}' | h .

So you can drill down into nested dictionaries quite easily. When runs out of patterns it just prints whatever’s left. A special case of this is if you give no patterns at all, you get the whole JSON object:

$ echo '{"hey" : { "there" : "you"}}' |
{u'hey': {u'there': u'you'}}

The regex patterns you pass on the command line are being matched against the keys of JSON objects (Python dicts). If runs into a list, it will instead match against the list indices like so:

$ echo '{"hey" : { "joe" : ["blow", "xxx" ]}}' | hey joe 1

You can see we’ve pulled out just the first list element after matching “hey” and “joe”. So regex args can be used to navigate your way through both JSON objects and lists.

Now let’s do something more interesting.

Twitter’s API can give you JSON, and the JSON is pretty chunky. For example, if I get my first 100 followers with this command:

curl ''

there’s 164Kb of output (try it and see). What if I just want the Twitter user names of the people who follow me? Looking at the JSON, I can see it starts with:


Hmm… looks like it’s a list of dictionaries. Let’s print just the first dictionary in the list:

curl '' | 0

which starts out:

{u'contributors_enabled': False,
 u'created_at': u'Wed Jul 19 00:29:58 +0000 2006',
 u'description': None,
 u'favourites_count': 0,
 u'follow_request_sent': False,
 u'followers_count': 178,
 u'following': False,
 u'friends_count': 67,
 u'geo_enabled': False,
 u'id': 2471,
 u'id_str': u'2471',
 u'lang': u'en',
 u'listed_count': 3,
 u'location': None,
 u'name': u'Roy',
 u'notifications': False,
 u'profile_background_color': u'131516',
 u'profile_background_image_url': u'',
 u'profile_background_tile': True,
 u'profile_image_url': u'',
 u'profile_link_color': u'009999',
 u'profile_sidebar_border_color': u'eeeeee',
 u'profile_sidebar_fill_color': u'efefef',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'wasroykosuge',

and you can see a “screen_name” key in there which looks like what we want. Let’s see the first few:

$ curl '' | . screen_name | head

Finally, here’s an example using FluidDB‘s new /values HTTP call. I’ll ask FluidDB for all objects matching the query has and from those matching objects I’ll pull back the value of the FluidDB tag named fluiddb/about. The result is JSON that starts out like this:

{"results": { "id" : {"93989942-b519-49b4-87de-ac834e6a6161": {"fluiddb/about": {"value": ""}}

You can see there’s a 5-level deep nesting of JSON objects. I just want the “value” key on all matching objects. Easy:

curl '' | results . . fluiddb/about value | sort

And there you have it, a sorted list of all Union Square Ventures portfolio companies, from the command line, as listed here. will also try to match on things that are not objects or lists if it runs into them, so we can refine this list a little. E.g.,

curl '' | results . . fluiddb/about value '.*ee' | sort

being the USV companies with “ee” somewhere in their URL. Or, for some advanced regex fun, the USV companies whose URLs don’t end in “.com”:

curl '' | results . . fluiddb/about value '.*(?<!\.com)$'

Twisted’s Deferred class chainDeferred method is too simplistic?

Saturday, August 14th, 2010

I think it’s worth spending some time thinking about whether chainDeferred in twisted.internet.defer.Deferred is too simplistic. I’ve thought for a while that it could be more helpful in preventing people from doing unintended things and/or cause fewer surprises (e.g., see point #1 in this post).

Those are smaller things that people can obviously live with. But a more important one concerns deferred cancellation. See below for why.

Here are five examples of chainDeferred behavior that I think could be better. In an attempt to fix these, I made a few changes to a copy of (immodestly called, sorry) which you can grab, with runnable examples, here. The code is meant as a suggested approach. I doubt that it’s bulletproof.

Here are the examples.

# Normal deferreds: this raises defer.AlreadyCalledError because
# the callback of d1 causes the callback of d2 to be called, but d2 has
# already been cancelled (and hence called).

# With there is no error because d1.callback will not call
# d2 as it has already been cancelled.

def printCancel(fail):
    print ‘cancelled’

def canceller(d):
    print ‘cancelling’

d1 = defer.Deferred()
d2 = defer.Deferred(canceller)

# Normally: raises defer.AlreadyCalledError because calling d1.callback
# will call d2, which has already been called.

# With Raises AssertionError: "Can’t callback an already
# chained deferred" because calling callback on a deferred that’s
# already been chained is asking for trouble (as above).

d1 = defer.Deferred()
d2 = defer.Deferred()

# Normally: although d2 has been chained to d1, when d1 is cancelled,
# d2’s cancel method is never called. Even calling d2.cancel ourselves
# after the call to d1.cancel has no effect, as d2 has already been
# called.

# With tdefer: both cancel1 and cancel2 are called when d1.cancel is
# called. The additional final call to d2.cancel correctly has no
# effect as d2 has been called (via d1.cancel).

def cancel1(d):
    print ‘cancel one’

def cancel2(d):
    print ‘cancel two’

def reportCancel(fail, which):
    print ‘cancelled’, which

d1 = defer.Deferred(cancel1)
d1.addErrback(reportCancel, ‘one’)
d2 = defer.Deferred(cancel2)
d2.addErrback(reportCancel, ‘two’)

# Normally: prints "called: None", instead of the probably expected
# "called: hey"

# prints "called: hey"

def called(result):
    print ‘called:’, result

d1 = defer.Deferred()
d2 = defer.Deferred()

# Normally: prints
#   called 2: hey
#   called 3: None

# prints
#   called 2: hey
#   called 3: hey

def report2(result):
    print ‘called 2:’, result

def report3(result):
    print ‘called 3:’, result

d1 = defer.Deferred()
d2 = defer.Deferred().addCallback(report2)
d3 = defer.Deferred().addCallback(report3)

I wont go into detail here as this post is already long enough. Those are 3 classes of behavior arising from chainDeferred being very simplistic. Comments welcome, of course. Once again, the runnable code is here.

Asynchronous data structures with Twisted Deferreds

Friday, July 23rd, 2010

Earlier this week I gave a talk titled Deferred Gratification (slides) at EuroPython in Birmingham. The title was supposed to hint at how much I love Twisted‘s Deferred class, and that it took me some time to really appreciate it.

While thinking about Deferreds in the days before the talk, I think I put my finger on one of the reasons I find Deferreds so attractive and elegant. The thought can be neatly summarized: Deferreds let you replace synchronous data structures with elegant asynchronous ones.

There are various ways that people get exposed to thinking about programming in an asynchronous style. The great majority do so via building user interfaces, usually with widget libraries wherein one specifies widgets and layouts, sets up event handlers, and then surrenders control to a main loop that thereafter routes events to handlers. Exposure to asynchronous programming via building UIs is becoming much more common as Javascript programmers build client-side web apps that operate asynchronously with web servers (and the “A” in AJAX of course stands for asynchronous). Others become aware of asynchronous programming via writing network code (perhaps using Twisted). Relatively few become aware of asynchronous programming via doing async filesystem I/O.

Because Twisted’s Deferreds don’t have anything to do with UIs or networking or filesystems, you can use them to implement other asynchronous things, like an asynchronous data structure. To show you what I mean, here’s a slightly simplified version of Twisted’s DeferredQueue class, taken from twisted/internet/

class DeferredQueue(object):

    def __init__(self):
        # It would be better if these both used collections.deque (see comments section below).
        self.waiting = [] # Deferreds that are expecting an object
        self.pending = [] # Objects queued to go out via get.

    def put(self, obj):
        if self.waiting:

    def get(self):
        if self.pending:
            return succeed(self.pending.pop(0))
            d = Deferred()
            return d

I find this extremely elegant, and I’m going to explain why.

But first, think about code for a regular synchronous queue. What happens if you call get on a regular queue that’s empty? Almost certainly one of two things: you’ll get some kind of QueueEmpty error, or else your code will block until some other code puts something into the queue. I.e., you either get a synchronous error or you get a synchronous non-empty response.

If you look at the get method in the code above, you’ll see that if the queue is empty (i.e., the pending list is empty), a new Deferred is made, added to self.waiting, and is immediately returned to the caller. So code calling get on an empty queue doesn’t get an error and doesn’t block, it always gets a result back essentially immediately. How can you get a result from an empty queue? Easy: the result is a Deferred. And because we’re in the asynchronous world, you just attach callbacks (like event handlers in the UI world) to the Deferred, and go on your merry way.

If you can follow that thinking, the rest of the code in the class above should be easy to grok. In put, if there are any outstanding Deferreds (i.e., earlier callers who hit an empty queue and got a Deferred back), the incoming object is given to the first of these by passing it to the callback function of the Deferred (and popping it out of the waiting list). If there are no outstanding Deferreds expecting a result, the incoming object is simply appended to self.pending. On the get side, if the queue (i.e., self.pending) is non-empty, the code creates a Deferred that has already been fired (using the succeed helper function) and returns it.

By now, this kind of code seems routine and straightforward to me. But it certainly wasn’t always that way. So if the above seems cryptic or abstract, I encourage you to think it over, maybe write some code, ask questions below, etc. To my mind, these kinds of constructs – simple, elegant, robust, practical, obviously(?) bug-free, single-threaded, etc. – are extremely instructive. You too can solve potentially gnarly (at least in the threaded world) networking coding challenges in very simple ways just by building code out of simple Twisted Deferreds (there’s nothing special about the DeferredQueue class, it’s just built using primitive Deferreds).

For extra points, think about the concept of asynchronous data structures. I find that concept very attractive, and putting my finger on it helped me to see part of the reason why I think Deferreds are so great. (In fact, as you can see from the use of succeed, Deferreds are not specifically asynchronous or synchronous: they’re so simple they don’t even care, and as a consumer of Deferreds, your code doesn’t have to either). That’s all remarkably elegant.

I sometimes hope O’Reilly will publish a second volume of Beautiful Code and ask me to write a chapter. I know exactly what I’d write about :-)