Archive for the ‘tech’ Category

I go shopping for a compass, then my Sonos decides it needs one too

Wednesday, December 10th, 2014

Screen Shot 2014-12-10 at 3.42.33 PMLast night I spent some time online looking to buy a compass. I looked at many of the Suunto models. Also yesterday I installed Little Snitch after noticing that an unknown gamed process wanted to establish a TCP/IP connection.

Anyway… a few minutes ago, 10 or 11 hours after I eventually bought a compass, a message pops up from Little Snitch telling me that the Mac OS X desktop Sonos app was trying to open a connection to ns.suunto.com. See image on right (click for the full-sized version).

WTF?!?!

Can someone explain how that works? Either Little Snitch is mightily confused or…… Or what? Has the Sonos app has been digging around in my Chrome browser history or cache or cookie jar? Is Chrome somehow complicit locally? Or something with cookies (but that would require the Sonos to be accessing and sending cookies stored by Chrome). Or…. what?

And, why ns.suunto.com? There’s an HTTP server there, but its / resource is not very informative:

$ curl -S -v http://ns.suunto.com
* Rebuilt URL to: http://ns.suunto.com/
* Hostname was NOT found in DNS cache
*   Trying 23.63.99.202...
* Connected to ns.suunto.com (23.63.99.202) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.37.1
> Host: ns.suunto.com
> Accept: */*
>
< HTTP/1.1 404 Not Found
* Server Apache is not blacklisted
< Server: Apache
< Content-Type: text/html; charset=iso-8859-1
< Date: Wed, 10 Dec 2014 16:02:47 GMT
< Content-Length: 16
< Connection: keep-alive
<
* Connection #0 to host ns.suunto.com left intact
File not found."⏎

Unfortunately, Little Snitch doesn't tell me the full URL that the Sonos app was trying to access.

Anyone care to speculate what's going on here?

jQuery-when2 – a more flexible way to work with jQuery deferreds

Saturday, July 27th, 2013

I’ve often been frustrated by the limited functionality of jQuery.when. You give it some deferred objects and it fires when all the deferreds are finished. The trouble is, if any of the deferreds is rejected the deferred returned by jQuery.when fires immediately. So you can’t use it to collect the results of all deferreds including any errors. You also can’t use it to fire when the first of the passed deferreds fires.

So last night I wrote a new version of when, called jQuery-when2 that offers the three behaviors that I commonly want:

  1. resolve on the first success,
  2. fail on the first error (the jQuery.when behavior), or
  3. resolve when all results (successes or errors) have been collected.

The API differences from jQuery.when:

  • when2 must be called with a list as its first argument.
  • An options Javascript object may be passed as a second argument.
  • If options.resolveOnFirstSuccess is true, the deferred returned by when2 will resolve as soon as any of the passed deferreds resolves. In this case, .done callbacks will be passed index and value args, where index is the index of the deferred that fired. If non-deferreds are in the arguments to when2, the first one seen will trigger the resolve (not very useful, but consistent).
  • If options.failOnFirstError is false, the deferred returned by when2 will never fail. It will collect the results from all the deferreds and pass them all back. The caller gets to figure out which values (if any) are errors.
  • If no options are given (or options.failOnFirstError is true), fail will be called on the deferred returned by when2 on the first error (this is the behavior of jQuery.when). The args passed to fail will be index and value, indicating which deferred was rejected.
  • Any .progress callbacks added to the returned deferred are also called with index and value arguments so you can tell which deferred made progress.

You can grab the source code, see examples, etc., on Github.

BTW, I’m writing a short (about 75pp) O’Reilly book, Learning jQuery Deferreds, with Nicholas Tollervey. If you’re interested in reviewing early Rough Cuts drafts, please let me know! The book will be out late this year.

Secure per-site passwords with no encrypted blob

Sunday, February 3rd, 2013

Last night I read a Guardian article, Online passwords: keep it complicated. It’s a surprisingly good summary, given that it’s aimed at the general public. The author concludes by telling us he decided to adopt LastPass, and also mentions 1Password. The comments section on the page gives similar solutions, like KeePass. The security-conscious people I know have arrived at the same conclusion. There are plenty of articles on the web that summarize various similar products, e.g., Which Password Manager Is The Most Secure? at LifeHacker. A single encrypted blob offers good security and works well in practice. It also allows the storage of information like name, address, credit cards, etc., that can be used to auto-fill web forms.

But… I’ve never liked the idea of a single encrypted file with all my passwords in it. What if the storage is lost or corrupted? Could the file someday be decrypted by someone else? If my encrypted blob is online, what happens when I am offline? If the blob is to be stored locally, I need to think about where to put it, how to back it up, etc. If a company holds it for me, what happens if they go out of business or get hacked? What if they use proprietary encryption, a closed-source access app, or a proprietary underlying data format? Not all the above solutions have all these issues, but they all have some of them.

The crucial thing they all have in common is that they use a master password to encrypt all your passwords into a single blob, and the blob has to then be reliably stored and accessible forever.

An approach that requires no storage

I realized there’s a solution that doesn’t require any storage. It’s not perfect, but it has some attractive properties. I’ve already started using it for some sites.

[Edit: it has been pointed out in the comments that the following solution has been thought of before. See SuperGenPass.]

Here’s a simple first version of Python code to print my password for a given service:

import base64, getpass, hashlib, sys

service = sys.argv[1]
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())[:32]

print 'Your password on %s is %s' % (service, password)

The service name is given on the command line. The code prints a 32-character password for use on that service. Here’s some sample output:

        $ mypass.py facebook
        Enter your master password: 
        Your password on facebook is Wza2l5Tqy0omgWP+5DDsXjQLO/Mc07N8

        $ mypass.py twitter
        Enter your master password: 
        Your password on twitter is eVhhpjJrmtSa8XnNMu6vLSDhPeO5nFOT

This has some nice advantages. It also places some small requirements on the user. Unfortunately, however, it is not generally applicable – at least not today. These are discussed below.

Advantages

The obvious advantage is that there is no external storage. Your passwords are not stored anywhere. There’s no blob to store, protect, access, backup, worry about, etc. The algorithm used to generate your password is dead-simple, it’s open and available in dozens of languages, and it’s secure.

You’re free to use more than one master password if you like. You can invent your own services names (more on which below).

Requirements / User burden

As with all one-key-to-unlock-them-all approaches, the user obviously needs to remember their master password.

With this approach though, the user also has to remember the name they used for the service they’re trying to access. If you create a password for a service called “gmail” you’ll need to use that exact service name in the future. For me that’s not much of a burden, but I guess it would be for others.

There’s no reason why the list of services you have passwords for couldn’t be stored locally. If the password generator were in a browser extension, it could possibly suggest a service name, like “facebook.com”, based on the domain of the page you were on.

With this approach, it’s even more important that one’s master password be hard to guess. Unlike the single-encrypted-blob approach, anyone who can guess your master password (and the names you use for services) can immediately obtain your passwords. They don’t also need access to the blob – it doesn’t exist.

Additional security can be easily had by, for example, using a convention of adding a constant character to your service names. So, e.g., you could use “facebook*” and “twitter*” as service names, and not tell anyone how you form service names.

General applicability

Unfortunately, there is a major problem with this approach. That’s because different sites have different requirements on passwords. Some of the difficulties can be avoided quite easily, but there’s an additional problem, caused when services change their password policy.

The above code generates a Base64 password string. So, to give some examples, if the service you want a password for doesn’t allow a plus sign in your password, the above code might make something unacceptable to the service. Same thing if they insist that passwords must be at most 12 characters long.

Ironically, these services are insisting on policies that prevent the use of truly secure passwords. They’re usually in place to ensure that short passwords are chosen from a bigger space. It would be better, though more work, to impose restrictions only on short passwords.

In a perfect world, all sites could immediately switch to allowing Base64 passwords of length ≥ 16 (say). Then the above approach would work everywhere and we’d be done.

Varying password length

A general approach to adjusting the generated password is to take some of the Base64 information produced and use it to modify the password. For example, you might not comfortable with all your passwords being the same length, so we can compute a length like this:

import base64, getpass, hashlib, string, sys

b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(password[-5]) % 16
print 'Your password on %s is %s' % (service, password[0:16 + lenAdjust])

This generates passwords that are between 16 and 31 characters in length:

        $ ./length-varying.py facebook
        Enter your master password: 
        Your password on facebook is 1nTlVGPhuWZf0l9Sk27
        $ ./length-varying.py twitter
        Enter your master password: 
        Your password on twitter is WE1DVZHAFBx2c3g63tR+Oi3Jxs4xMV

Satisfying site requirements

A possible approach to dealing with per-site password requirements is to have the code look up known services and adjust the initial password it generates to be acceptable. This can easily be done in a secure, random, repeatable way. For example:

  • If a site doesn’t allow upper case, lowercase the password.
  • If a site doesn’t allow digits, replace them with random letters.
  • If a site requires punctuation, you can replace some initial letters in the password with randomly chosen punctuation and then randomly permute the result using the Knuth Shuffle.

Some of these transformations use random numbers. These are easy to obtain: take an unused part of the Base64 string and use it to seed a RNG. For each transformation, you would need to call the RNG a fixed number of times, i.e., independent of the number of random numbers actually used to perform the transformation. That’s necessary in order to keep the RNG in a known state for subsequent transformations (if any).

For example, the following replaces digits 0-9 with a letter from A-J whose case is chosen randomly:

import base64, getpass, hashlib, random, string, sys

def getSeed(chars):
    seed = ord(chars[0])
    for letter in chars[1:]:
        seed = (seed << 8) & ord(letter)
    return seed

service = sys.argv[1]
b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
digest = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(digest[-5]) % 16

passwordWithDigits = digest[0:16 + lenAdjust]
password = ''

random.seed(getSeed(digest[32:36]))
randoms = [random.randint(0, 1) for _ in passwordWithDigits]

for index, letter in enumerate(passwordWithDigits):
    if letter in '0123456789':
        base = ord('a' if randoms[index] else 'A')
        replacement = chr(base + ord(letter) - ord('0'))
        password += replacement
    else:
        password += letter

print 'Your password on %s is %s' % (service, password)

Here’s the output for the above two services (using the same master password):

        $ ./no-digits.py facebook
        Enter your master password: 
        Your password on facebook is bnTlVGPhuWZfAljSkch
        $ ./no-digits.py twitter
        Enter your master password: 
        Your password on twitter is WEBDVZHAFBxccdgGdtR+OidJxsExMV

As you can see, the digits are replaced with letters (in a biased way, that we can ignore in an informal blog post). The RNG is in a known state because the number of times it has been called is independent of the number of digits in the pre-transformation text.

This approach can be used to transform the initial random sequence into one that satisfies a service’s password restrictions.

It is difficult to reliably associate service names with site policies. To do so might require keeping a file mapping the name a user used for a service to the policy of the site. Although this doesn’t defeat the purpose of this approach (since that file would not need to be stored securely), it is an additional and unwanted pain for the user. Part of the point was to try to entirely avoid additional storage, even if it doesn’t have to be encrypted.

The major problem with per-site requirements

The major problem however is that sites may change their password policy. Even if our program knew the rules for all sites, it would have a real problem if a site changed its policy. The code would need to be updated to generate passwords according to the new site policy. Existing users, supposing they upgraded, would then be shown incorrect passwords and would need to do password resets, which is obviously inconvenient.

Conclusion

I like the above approach a lot, but don’t see a way to solve the issue with changing site policies. I wouldn’t mind building in some rules for known popular sites, but any step in that direction has its problems – at least as far as I can see.

For now, I’m going to start using the above approach on sites that allow a long random password with characters from the Base64 set. That covers the majority of sites I use. Importantly, that includes Google, so if I ever need password resets I can have them sent there, knowing that I can always log in to recover them.

CEL, a Chrome Event Logger

Sunday, January 27th, 2013

logo-128Last night I wrote CEL, a Chrome Event Logger, a Google Chrome extension that logs all known chrome.* API events to the Javascript console. Example use cases are:

  • You wonder if there is a Chrome API event that’s triggered for some action you take in the browser. Rather than guessing what the event might be and trying to find it in the API docs, you can enable CEL, perform the action in Chrome, and see what CEL logs.
  • You’re writing an extension and are unsure about whether an event is being triggered or with what arguments. Instead of adding an event listener in your own code and reloading your extension, you can just look in the CEL log.

Click here to install.

Viewing the logging

Once installed, you can examine the CEL logging by visiting chrome://extensions, clicking to enable Developer mode, and then clicking the link next to the CEL icon where it says Inspect views: _generated_background_page.html

Manually adjusting logging

In the JS console for the extension’s background page, there are several commands you can run to adjust what is logged:

// Return a list of the chrome.* API events being logged.
CEL.enabled()

// Return a list of the chrome.* API events being ignored.
CEL.disabled()

// Enable logging of some calls (see below).
CEL.enable(name1, name2, …)

// Disable logging of some calls (see below).
CEL.disable(name1, name2, …)
 

The names you pass to CEL.enable and CEL.disable can be individual API calls (without the leading “chrome.”), or can be higher-level categories. Here are some examples:

// Enable chrome.tabs.onCreated and all chrome.webRequest.* events:
CEL.enable(‘tabs.onCreated’, ‘webRequest’)

// Disable all chrome.tabs.* events and chrome.webNavigation.onCommitted
CEL.disable(‘tabs’, ‘webNavigation.onCommitted’)
 

Note that CEL.enable will enable all necessary higher level logging. So, for example, if you call CEL.enable('omnibox.onInputEntered') all chrome.omnibox.* events (that have not been explicitly disabled) will be logged. If you don’t want to enable and disable groups of calls in this way, always pass explicit API calls.

CEL.disabled will show you the names of individual calls that are disabled, as well as any disabled higher levels.

Global enable / disable

The extension provides a context menu item that lets you globally enable or disable logging.

Installation from the Chrome web store

Go to http://bit.ly/chrome-event-logger which will redirect you to the CEL page in the CWS.

Tracking the development version

If you want the development version, you can install the extension by visting https://fluiddb.fluidinfo.com/about/chrome-event-logger/fluidinfo.com/chrome.crx. Chrome will warn you that extensions cannot be installed from non-Chrome Web Store URLs but will download the .crx file in any case. Open chrome://extensions and drag the .crx file you just downloaded onto that page.

Installation from source

The source to the extension is available on Github. Here’s how to clone and install it.

  • Download the repo: git clone http://github.com/terrycojones/chrome-event-logger
  • In Chrome, go to chrome://extensions
  • Click Developer mode
  • Click Load Unpacked Extension…
  • Navigate to the directory where you cloned the repo and click Open

Destructive, invasive, and dangerous behavior by UK ISP TalkTalk (aka StalkStalk)

Wednesday, December 5th, 2012

Today I spent several hours trying to figure out what was going wrong with a web service I’ve been building. The service uses websockets to let browsers and the server send messages to each other over a connection that is held open.

I built and tested the service locally and it worked fine. But when I deployed it to a remote server, the websocket connections were mysteriously dropped 30-35 seconds after they were established. The error messages on the server were cryptic, as were those in the browser. Google came to the rescue, and led me to the eye-opening explanation

It turns out TalkTalk, my ISP, are also fetching the URLs my browser fetches, after a delay of 30-35 seconds! I guess they’re not doing it with all URLs, probably because they figure the sites are “safe” and not full of content that might be deemed objectionable. All the accesses come from a single IP address (62.24.252.133), and if you block that address or otherwise deny its connection attempts, the websocket problem goes away immediately.

Dear TalkTalk – this is a very bad idea. Here are 3 strong reasons why:

  1. First of all, you’re breaking the web for your own customers, as seen above. When your customers try to use a new start-up service based on websockets, their experience will be severely degraded, perhaps to the point where the service is unusable. Time and money will be spent trying to figure out what’s going on, and people will not be happy to learn that their ISP is to blame.
  2. Second, there’s a real privacy issue here. I don’t really want to go into it, but I don’t trust my ISP (any ISP) to securely look after data associated with my account, let alone all the web content I look at. I have Google web history disabled. I don’t want my ISP building up a profile of what the people in this house look at online. There’s a big difference between recording the URLs I go to and actually retrieving their content.
  3. Third, it’s downright dangerous. What if I were controlling a medical device via a web interface and TalkTalk were interfering by killing my connections or by replaying my requests? What if there was a security system or some other sensitive controller on the other end? There’s no way on earth TalkTalk should be making requests with unknown effects to an unknown service that they have not been authorized to use. The TalkTalk legal team should consider this an emergency. Something is going to break, perhaps with fatal consequences, and they are going to get sued.

If you want to read more opinions on this issue, try Googling TalkTalk 62.24.252.133. Lots of people have run into this problem and are upset for various reasons.

See also: Phorm.

Max Tabs

Tuesday, December 4th, 2012

Here’s the second of several Chrome and Chromium browser extensions I’ve recently written. Earlier, I posted some of the background motivation in Alternate browsing realities.

After installing Max Tabs, your browser will not allow you to have more than 15 tabs open at once. Any time you try to open more tabs, it will automatically close existing tabs to keep you at the limit. If you later close some tabs and go below the limit, tabs will be reopened to show URLs that were previously automatically closed. I.e., the URLs in tabs that are closed are not forgotten, they’re stored until you’re down to a reasonable number of tabs. (The URLs are not stored across browser sessions, though they easily could be.)

The main idea here is to limit the amount of memory Chrome consumes in keeping tabs open for you. I regularly have about 50 tabs open, sometimes for weeks or even months, on pages I’m planning to read. My laptop gets bogged down as Chrome consumes more and more memory. I’ve long wanted something to limit my tab habit. I didn’t really like any of the options I found in the Chrome Store, so I wrote my own. In case you’re wondering, the extension closes your rightmost open tabs.

Max Tabs installs a context menu item that shows you the number of URLs it has closed. If you click the context menu item, you can disable the extension, at which point it will immediately open tabs for all URLs it automatically closed.

Note that the extension starts out disabled. I set it up that way so it would be less alarming on installation (if you have over 15 tabs open when you install it, it will immediately close as many tabs as needed). Enable it via the context menu.

The extension is not in the Chrome Web Store yet. It’s still very easy to install: just click here to download the extension, then follow these instructions.

If you’re a programmer, or just curious about how to build Chrome extensions, the source code is available on Github. For info on what URLs the extension has closed tabs for, you can look in the console of its background page, accessible from chrome://extensions.

Open It Later

Tuesday, December 4th, 2012

Here’s the first of several Chrome and Chromium browser extensions I’ve recently written. Earlier, I posted some of the background motivation in Alternate browsing realities.

After installing Open It Later, your browser will randomly delay following links you click on. That is, instead of following the link in your existing tab, it immediately closes the tab! :-) If you open a new tab and try to go to a URL, that tab will immediately close too. The URL you were trying to reach will be opened in a new tab at a random future time, between 15 seconds and 5 minutes later.

This is pretty silly, of course. It deliberately goes directly against the idea that there should be an immediate (useful) reaction from your browser when you click a link. Think of it as something that slows you down, that makes your browsing more considered, that gives you a pause during which you might forget about something that you didn’t really need to read anyway.

Open It Later installs a context menu item that shows you the number of URLs that are pending opening. Click the context menu item to disable the extension. Not only will it ungrudgingly disable itself without pause, it will also immediately open all URLs that were scheduled to be opened in the future.

The extension is not in the Chrome Web Store yet. It’s still very easy to install: just click here to download the extension, then follow these instructions.

If you’re a programmer, or just curious about how to build Chrome extensions, the source code is available on Github. For info on when the extension plans to open your URLs, you can look in the console of its background page, accessible from chrome://extensions.

Special thanks to Hugh McLeod, who (unknowingly) provided Open It Later‘s Snake Oil icon:

Image: Hugh McLeod

Alternate browsing realities

Tuesday, December 4th, 2012

I find it interesting to look for things we take for granted, and to ask what the world would look like if the basic assumptions were outlandishly violated.

Recently I’ve been thinking about browsing. What do we all take for granted when browsing? A biggie is that when we click on a link (or open a new tab) the browser should take us to what we asked for, and do it immediately.

Below are some fanciful ways in which that could be different. You can probably think of others. Imagine if:

  • When you click a link, the new page is shown as usual, but only at some random point in the future. Clicking a link or opening a new tab on a URL, causes your current tab to immediately close!
  • Your browser restricts you to having (say) at most 10 tabs open. If you try to open more, the browser automatically picks another open tab and closes it. When you drop back under 10 tabs, a tab that was automatically closed pops into existence again.
  • When you click to follow a link or you open a new tab, the page appears in someone else’s browser, not in yours.
  • You and a group of friends are limited in the number of total tabs you can collectively have open. If you open a tab that takes you over the limit, a random tab is closed in the browser of a random group member. When the group drops under the limit, the tab is re-opened in the browser of a random group member.
  • You and a group of friends are limited so that only one of you can be looking at any given URL. I.e., if you go to a URL that one of your group already has open, their browser automatically closes their tab.
  • When you click on a link, your browser shows you the page and the page also appears in the browsers of a group of friends. If a friend then clicks on a link on that page, your tab follows along.
  • When reading an interesting page, with one click you can send the URL to a group of friends, whose browsers all load the page.

The nice thing about this kind of blue-sky thinking is that it starts out as frivolous or even ridiculous, but can quickly lead to interesting new approaches. For example, the idea of opening tabs in the future comes directly from questioning the immediacy of link following. Hot on the heels of the ridiculous idea of never following links at all, we land right next to the idea of a Read-Later button that millions of people already find very useful.

Anyway….. I decided to have some fun and implement several of the above for the Chrome and Chromium. I’ll write them up separately very soon.

Simpler Twisted deferred code via decorated callbacks

Sunday, October 14th, 2012

This morning I was thinking about Twisted deferreds and how people find them difficult to grasp, but how they’re conceptually simple once you get it. I guess most of us tell people a deferred is something to hold a result that hasn’t arrived yet. Sometimes, though, deferreds do have a result in them immediately (e.g., using succeed or fail to get an already-fired deferred).

I wondered if it might work to tell people to think of a deferred as really being the result. If that were literally true, then instead of writing:

d = getPage(...)
d.addErrback(errcheck, args)
d.addCallback(cleanup, args)
d.addCallback(reformat, args)
return d

We might write something like:

result1 = getPage(...)
result2 = errcheck(result1, args)
result3 = cleanup(result2, args)
return reformat(result3, args)

And if you could write that, you could obviously instead write:

return reformat(cleanup(errcheck(getPage(...), args), args), args)

If we could write Twisted code that way, I think using deferreds would be simpler for people unfamiliar with them. We could show them Twisted code and not even have to mention deferreds (see below).

In the style we’re all used to, the programmer manually adds callbacks and errbacks. That’s basically boilerplate. It gets worse when you then need to also use DeferredList, etc. It’s a little confusing to read deferred code at first, because you need to know that the deferred result/failure is automatically passed as the first arg to callbacks/errbacks. It seems to take a year or more for people to finally realize how the callback & errback chains actually interact :-) Also, I wonder how comfortable programmers are with code ordered innermost function first, as in the normal d.addCallback(inner).addCallback(outer) Twisted style, versus outer(inner()), as in the line above.

Anyway… I realized we CAN let people use the succinct style above, by putting boilerplate into decorators. I wrote two decorators, called (surprise!) callback and errback. You can do this:

@errback
def errcheck(failure, arg):
    ...

@callback
def cleanup(page, arg):
    ...

@callback
def reformat(page, arg):
    ...

reformat(cleanup(errcheck(getPage(...), arg1), arg2), arg3)

The deferred callback and errback chains are hooked up automatically. You still get a regular deferred back as the return value.

And… the “deferred” aspect of the code (or at least the need to talk about or explain it) has conveniently vanished.

You can also do things like

func1(getDeferred1(), errcheck(func2(getDeferred2(), getDeferred3())))

This gets the result of deferreds 2 & 3 and (if neither fails) passes the result of calling func2 on both results through to func1, which is called along with the result of deferred 1. You don’t need to use DeferredLists, as the decorator makes them for you. The errcheck function wont be called at all unless there’s an error.

That’s nice compared to the verbose equivalent:

d1 = DeferredList([getDeferred2(), getDeferred3()])
d1.addCallback(func2)
d1.addErrback(errcheck)
d2 = DeferredList([getDeferred1(), d1])
d2.addCallback(func1)

Or the more compact but awkward:

DeferredList([getDeferred(), DeferredList([getDeferred(), getDeferred()]).addCallback(
    func2).addErrback(errcheck)]).addCallback(func1)
 

There’s lots more that could be said about this, but that’s enough for now. The code (surely not bulletproof) and some tests are on Github. I’ll add a README sometime soon. This is still pretty much proof of concept, and some it could be done slightly differently. I’m happy to discuss in more detail if people are interested.

describejson – a Python script for summarizing JSON structure

Thursday, August 9th, 2012

Yesterday I was sent a 24M JSON file and needed to look through it to give someone an opinion on its contents. I did what I normally do to look at JSON, piping it into python -m json.tool. The result looked pretty good, I scrolled through some screens with a single long list and jumped to the bottom to see what looked like the end of the list. What I didn’t know at the time was that the output was 495,647 lines long! And there was some important stuff in the middle of the output that I didn’t see at all.

So I decided to write a quick program to recursively summarize JSON. You can grab it from Github at https://github.com/terrycojones/describejson.

Usage is simple, just send it JSON on stdin. Here’s an example input:

{
  "cats": 3,
  "dogs": 6,
  "parrots": 1
}

Which gets you this output:

$ python describejson.py < input.json
1 dict of length 3. Values:
  3 ints

The output is a little cryptic, but you’ll get used to it (and I may improve it). In words, this is telling you that (after loading the JSON) the input contained 1 Python dictionary with 3 elements. The values of the 3 elements are all integers. The indentation is meaningful, of course. You can see that the script is summarizing the fact that the 3 values in the dict were all of the same type.

Here’s another sample input:

[
  ["fluffy", "kitty", "ginger"],
  ["fido", "spot", "rover"],
  ["squawk"]
]

Which gets you:

$ python describejson.py < input.json
1 list of length 3. Values:
  2 lists of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

In words, the input was a list of length 3. Its contents were 2 lists of length 3 that both contained 3 unicode strings, and a final list that contains just a single unicode string.

Specifying equality strictness

The script currently takes just one option, --strictness (or just -s) to indicate how strict it should be in deciding whether things are “the same” in order to summarize them. In the above output, the default strictness length is used, so the script considers the first two inner lists to be collapsible in the summary, and prints a separate line for the last list since it’s of a different length. Here’s the output from running with --strictness type:

$ python describejson.py –strictness type < input.json
1 list of length 3. Values:
  3 lists of length 3. Values:
    3 unicodes

The lists are all considered equal here. The output is a little misleading, since it tells us there are 3 lists of length 3, each containing 3 unicodes. I may fix that.

We can also be more strict. Here’s the output from --strictness keys:

$ python describejson.py –strictness keys < input.json
1 list of length 3. Values:
  1 list of length 3. Values:
    3 unicodes
  1 list of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

The 3 inner lists are each printed separately because their contents differ. The keys argument is also a bit confusing for lists, it just means the list values. It’s clearer when you have dictionaries in the input.

This input:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "d": 4,
    "e": 5,
    "f": 6
  }
]

produces

$ python describejson.py < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

I.e., one list, containing 2 dictionaries, each containing 3 int values. Note that this is using the default of --strictness length so the two dicts are considered the same. If we run that input with strictness of keys, we’ll instead get this:

$ python describejson.py –strictness keys < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

The dicts are considered different because their keys differ. If we change the input to make the keys the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

and run again with --strictness keys, the dicts are considered the same:

$ python describejson.py –strictness keys < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

but if we use --strictness equal, the dicts will be considered different:

$ python describejson.py –strictness equal < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

Finally, making the dicts the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 1,
    "b": 2,
    "c": 3
  }
]

and running with --strictness equal will collapse the summary as you’d expect:

$ python describejson.py –strictness equal < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

Hopefully it’s clear that by being less strict on matching you’ll get more concise output in which things are casually considered “the same” and if you’re more strict you’ll get more verbose output, all the way to using strict equality for both lists and dicts.

Here’s the full set of --strictness options:

  • type: compare things by type only.
  • length: compare lists and objects by length.
  • keys: compare lists by equality, dicts by keys.
  • equal: compare lists and dicts by equality.

Improvements

The naming of the --strictness options could be improved. The keys option should probably be called values (but that is confusing, since dictionaries have values and it’s a comparison based on their keys!). A values option should probably also compare the value of primitive things, like integers and strings.

There are quite a few other things I might do to this script, if I ever have time. It would be helpful to print out some keys and values when these are short and unchanging. It would be good to show an example representative value of something that repeats (modulo strictness) many times. It might be good to be able to limit the depth to go into a JSON structure.

Overall though, I already find the script useful and I’m not in a rush to “improve” it by adding features. You can though :-)

You might also find it helpful to take what you learn about a JSON object via describe JSON and use that to grep out specific pieces of the structure using jsongrep.py.

If you’re curious, here’s the 24-line output summary of the 24M JSON I received. Much more concise than the nearly 1/2 a million lines from python -m json.tool:

1 dict of length 3. Values:
  1 int
  1 dict of length 4. Values:
    1 list of length 17993. Values:
      17993 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
    1 list of length 0.
    1 list of length 11907. Values:
      11907 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 1. Values:
          1 unicode
        2 unicodes
    1 list of length 28068. Values:
      28068 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
  1 unicode

Autovivification in Python: nested defaultdicts with a specific final type

Saturday, May 26th, 2012

I quite often miss the flexibility of autovivification in Python. That Wikipedia link shows a cute way to get what Perl has:

from collections import defaultdict

def tree():
    return defaultdict(tree)

lupin = tree()
lupin["express"][3] = "stand and deliver"

It’s interesting to think about what’s going on in the above code and why it works. I really like defaultdict.

What I more often want, though, is not infinitely deep nested dictionaries like the above, but a (known) finite number of nested defaultdicts with a specific type at the final level. Here’s a tiny function I wrote to get just that:

from collections import defaultdict

def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels – 1, final)))

So let’s say you were counting the number of occurrences of words said by people by year, month, and day in a chat room. You’d write:

words = autovivify(5, int)

words["sam"][2012][5][25]["hello"] += 1
words["sue"][2012][5][24]["today"] += 1

Etc. It’s pretty trivial, but it was fun to think about how to do it with a function and to play with some alternatives. You could do it manually with nested lambdas:

words = defaultdict(lambda: defaultdict(int))
words["sam"]["hello"] += 1

But that gets messy quickly and isn’t nearly as much fun.

Women’s guide to HTTP status codes for dealing with unwanted geek advances

Saturday, January 21st, 2012

Here’s a women’s guide to the most useful HTTP status codes for dealing with unwanted geek advances. Someone should make and sell a deck of these.


301 MOVED PERMANENTLY
303 SEE OTHER
305 USE PROXY
306 RESERVED
307 TEMPORARY REDIRECT
400 BAD REQUEST
401 UNAUTHORIZED
402 PAYMENT REQUIRED
403 FORBIDDEN
404 NOT FOUND
405 METHOD NOT ALLOWED
406 NOT ACCEPTABLE
407 PROXY AUTHENTICATION REQUIRED
408 REQUEST TIMEOUT
409 CONFLICT
410 GONE
411 LENGTH REQUIRED
412 PRECONDITION FAILED
413 REQUEST ENTITY TOO LARGE
414 REQUEST-URI TOO LONG
415 UNSUPPORTED MEDIA TYPE
416 REQUESTED RANGE NOT SATISFIABLE
417 EXPECTATION FAILED
422 UNPROCESSABLE ENTITY
423 LOCKED
424 FAILED DEPENDENCY
426 UPGRADE REQUIRED
500 INTERNAL SERVER ERROR
501 NOT IMPLEMENTED
502 BAD GATEWAY
503 SERVICE UNAVAILABLE
504 GATEWAY TIMEOUT
507 INSUFFICIENT STORAGE
508 LOOP DETECTED
510 NOT EXTENDED

Source: Hypertext Transfer Protocol (HTTP) Status Code Registry

Emacs buffer mode histogram

Thursday, November 10th, 2011

Tonight I noticed that I had over 200 buffers open in emacs. I’ve been programming a lot in Python recently, so many of them are in Python mode. I wondered how many Python files I had open, and I counted them by hand. About 90. I then wondered how many were in Javascript mode, in RST mode, etc. I wondered what a histogram would look like, for me and for others, at times when I’m programming versus working on documentation, etc.

Because it’s emacs, it wasn’t hard to write a function to display a buffer mode histogram. Here’s mine:

235 buffers open, in 23 distinct modes

91               python +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
47          fundamental +++++++++++++++++++++++++++++++++++++++++++++++
24                  js2 ++++++++++++++++++++++++
21                dired +++++++++++++++++++++
16                 html ++++++++++++++++
 7                 text +++++++
 4                 help ++++
 4           emacs-lisp ++++
 3                   sh +++
 3       makefile-gmake +++
 2          compilation ++
 2                  css ++
 1          Buffer-menu +
 1                 mail +
 1                 grep +
 1      completion-list +
 1                   vm +
 1                  org +
 1               comint +
 1              apropos +
 1                 Info +
 1           vm-summary +
 1      vm-presentation +

Tempting as it is, I’m not going to go on about the heady delights of having a fully programmable editor. You either already know, or you can just drool in slack-jawed wonder.

Unfortunately I’m a terrible emacs lisp programmer. I can barely remember a thing each time I use it. But the interpreter is of course just emacs itself and the elisp documentation is in emacs, so it’s a really fun environment to develop in. And because emacs lisp has a ton of support for doing things to itself, code that acts on emacs and your own editing session or buffers is often very succinct. See for example the save-excursion and with-output-to-temp-buffer functions below.

(defun buffer-mode-histogram ()
  "Display a histogram of emacs buffer modes."
  (interactive)
  (let* ((totals ‘())
         (buffers (buffer-list()))
         (total-buffers (length buffers))
         (ht (make-hash-table :testequal)))
    (save-excursion
      (dolist (buffer buffers)
        (set-buffer buffer)
        (let
            ((mode-name (symbol-name major-mode)))
          (puthash mode-name (1+ (gethash mode-name ht 0)) ht))))
    (maphash (lambda (key value)
               (setq totals (cons (list key value) totals)))
             ht)
    (setq totals (sort totals (lambda (x y) (> (cadr x) (cadr y)))))
    (with-output-to-temp-buffer "Buffer mode histogram"
      (princ (format "%d buffers open, in %d distinct modes\n\n"
                      total-buffers (length totals)))
      (dolist (item totals)
        (let
            ((key (car item))
             (count (cadr item)))
          (if (equal (substring key -5) "-mode")
              (setq key (substring key 0 -5)))
          (princ (format "%2d %20s %s\n" count key
                         (make-string count ?+))))))))
 

Various things about the formatting could be improved. E.g., not use fixed-width fields for the count and the mode names, and make the + signs indicate more than one buffer mode when there are many.

A resizable dispatch queue for Twisted

Monday, June 27th, 2011

In December 2009 I posted to the Twisted mailing list about what I called a Resizable Dispatch Queue. I’ve just spent some time making a new version that’s much improved. You can pick up the new version via pip install txrdq, from PyPI, or from the txRDQ project page on Launchpad. Here’s the example use case, taken from my original posting:

You want to write a server with a web interface that allows people to enter their phone number so you can send them an SMS. You anticipate lots of people will use the service. But sending SMS messages is quite slow, and the company that you ship those jobs off to is concerned that you’ll overrun their service (or maybe they have an API limit, etc). So you need to queue up jobs locally and send them off at a certain rate. You’d like to be able to adjust that rate up or down. You also want to be able to shut your service down cleanly (i.e., not in the middle of a task), and when you restart it you want to be able to re-queue the jobs that were queued last time but which hadn’t gone out.

To make the example more concrete, suppose your function that sends the SMS is called sendSMS and that it takes a (number, message) tuple argument. Here are some of the kinds of things you can do:

from txrdq.rdq import ResizableDispatchQueue

# Create a queue that will allow 5 simultaneous underway jobs.
rdq = ResizableDispatchQueue(sendSMS, 5)

# Later... send off some SMS messages.
d1 = rdq.put((2127399921, 'Hello...'), priority=5)
d1.addCallback(...)

d2 = rdq.put((5052929919, 'Test...'), priority=5)
d2.addCallback(...)

# Cancel the second job
d2.cancel()

# Widen the outgoing pipeline to 10 simultaneous jobs.
rdq.width = 10

# We're dispatching jobs too fast, turn it down a little.
rdq.width = 7

# Get a copy of the list of pending jobs.
jobs = rdq.pending()

# Cancel one of the pending jobs from the jobs list.
job.cancel()

# Reprioritize one of the pending jobs from the jobs list.
rdq.reprioritize(job, -1)

# Arrange to increase the number of jobs in one hour.
reactor.callLater(3600, rdq.setWidth, 20)

# Pause processing.
rdq.pause()

# Resume processing, with a new width of 8.
rdq.resume(8)

# Shutdown. Wait for any underway jobs to complete, and save
# the list of jobs not yet processed.

def saveJobs(jobs):
    pickle.dump(jobs, ...)

d = rdq.stop()
d.addCallback(saveJobs)
 

I’ve come up with many uses for this class over the last 18 months, and have quite often recommended it to people on the #twisted IRC channel. Other examples include fetching a large number of URLs in a controlled way, making many calls to the Twitter API, etc.

Usage of the class is very simple. You create the dispatch queue, giving it a function that will be called to process all jobs. Then you just put job arguments as fast as you like. Each call to put gets you a Twisted Deferred instance. If your function runs successfully on the argument, the deferred will call back with an instance of txrdq.job.Job. The job contains information about when it was queued, when it was launched, when it stopped, and of course the result of the function. If your function hits an error or the job is canceled (by calling cancel on the deferred), the deferred will errback and the failure will again contain a job instance with the details.

It’s also useful to have an admin interface to the queue, so calls such as pending and underway are provided. These return lists of job instances. You can call cancel on a job instance too. You can reprioritize jobs. And you can pause processing or shut the queue down cleanly.

The code also contains an independently useful Twisted classes called DeferredPriorityQueue (which I plan to write about), and DeferredPool (which I described earlier).

Back of the envelope calculations with The Rule of 72

Monday, June 20th, 2011

Image: internetworldstats.com

The Rule of 72 deserves to be better known among technical people. It’s a widely-known financial rule of thumb used for understanding and calculating interest rates. But others, including computer scientist and start-up founders, are often concerned with growth rates. Knowing and applying the rule of 72 can help in developing numerical literacy (numeracy) around growth.

For example, consider Moore’s Law, which describes how "the number of transistors that can be placed inexpensively on an integrated circuit doubles approximately every two years." If something doubles every two years, at what rate does it increase per month, on average? If you know the rule of 72, you’ll instantly know that the monthly growth rate is about 3%. You get the answer by dividing 72 by 24 (the number of months).

Computer scientists are usually very familiar with powers of two. It’s often convenient to take advantage of the fact that 2^10 is about 1,000. That means that when something increases by a factor of 1,000, it has doubled about 10 times. By extension, and with a little more error, an increase of a million corresponds to 20 doublings, and a billion is 30 doublings (log base two of a billion is actually 29.897, so the error isn’t too wild). You can use this to ballpark the number of doublings in a process really easily, and go directly from that to a growth rate using the rule of 72.

For example, the bottom of this page tells us that there were about 16,000 internet domains on July 1st 1992, and 1.3M of them on July 1st 1997. Let’s think in thousands: that’s a jump from 16 to just over 1,000 in 5 years. To get from 1 to 16 is four doublings, so from 16 to 1,000 is six doublings (because 1,000 is ten doublings from 1). So the number of domains doubled 6 times in 5 years, or 6 times in 60 months, or once every 10 months (on average). If you want something to double in 10 months, the rule of 72 tells us we need a growth rate of 7.2% per month. To check: 16,000 * (1.072 ^ 60) = 1,037,067. That’s a damned good estimate (remember that we were shooting for 1M, not 1.3M) for five seconds of mental arithmetic! Note that the number of domains was growing much faster than Moore’s law (3% per month).

You can quickly get very good at doing these sorts of calculations. Here’s another easy example. This page shows the number of internet users growing from 16M in December 1995 to 2,072M in March of 2011. That’s just like the above example, but it’s 7 doublings in 15.25 years, or 183 months. That’s pretty close to a doubling every 24 months, which we know from above corresponds to 3% growth per month.

You can use facility with growth rates to have a good sense for interest rates in general. You can use it when building simple (exponential) models of product growth. E.g., suppose you’re launching a product and you reckon you’ll have 300K users in a year’s time. You want to map this out in a spreadsheet using a simple exponential model. What should the growth rate be? 300K is obviously not much more than 256 * 1,024, which is 18 doublings in 365 days, or a doubling roughly every 20 days. The rule of 72 gives 72/20 = ~3.5, so you need to grow 3.5% every day to hit your target. Is that reasonable? If it is, it means that when you hit 300K users, you’ll be signing up about 3.5% of that number, or 10,500 users per day. As you can see, familiarity with powers of two (i.e., estimating number of doublings) and with the rule of 72 can give you ballpark figures really easily. You can even use your new math powers to avoid looking stupid in front of VCs.

The math behind the rule of 72 is easy to extend to triplings (rule of 110), quadrupling (rule of 140), quintupling (rule of 160), etc.

Finally, you can use these rules of thumb to do super geeky party tricks. E.g., what’s the tenth root of two? Put another way, what interest rate do you need for something to double after ten periods? The rule of 72 tells you it’s 72/10 = 7.2%, so the tenth root of two will be about 1.072 (in fact 1.072 ^ 10 = 2.004). What’s the 20th root of 5? The rule of 160 tells you you need 160/20 = 8% growth each period, so 1.08 should be about right (the correct answer is ~1.0838).

As with all rules of thumb, it’s good to have a sense of when it’s most applicable. See the wikipedia page or this page for more detailed information. It’s also of course good to understand that it may not be suitable to model growth as an exponential at all.

Apple channeling Microsoft?

Tuesday, February 1st, 2011

Image by Lara64

Apple’s behavior, as described today in the New York Times and in Ars Technica reminds me of Microsoft building MSIE into Windows. When that happened, other browser manufacturers cried foul. They argued that this was bundling, that few people would want to use a non-native browser, and that Microsoft was using its platform monopoly to tilt the browser playing field.

Here again we have a vendor (Apple), with an operating system platform (iOS), with a piece of extremely valuable functionality (the App Store) built in by the vendor, who are now strong-arming others writing applications for the platform into always offering access through their functionality. That all reminds me of Microsoft.

While it might now be difficult to think of the iPhone without the App Store, the iPhone existed for 18 months before the App Store came along: the iPhone was released in January 2007, the App Store in July 2008. Windows and MSIE also started life as independent entities; it was about 2 years before they were fused and optimistically declared inseparable.

The two cases are obviously not the same in detail, but I find the similarities striking and thought-provoking.

Just for fun, imagine a court case aimed at forcing Apple to make their App Store separable from their operating system platform. To allow others to build their own app stores. To give the user the choice to install/uninstall whatever app stores they liked. Imagine Apple claiming that such a separation is technically impossible and that the App Store is fundamental to the iPhone experience.

Couldn’t possibly happen, right?

py-narrow-to-class

Friday, January 28th, 2011

I can never understand when I meet programmers who don’t use emacs. As a programmer, you spend inordinate amounts of time in your editor. You call yourself a programmer. You like to automate things. You get frustrated when you can’t take matters into your own hands. You like hacking on things. Right? And so…….. why wouldn’t you be deeply attracted to an editor that is fully programmable? Sure, (emacs) lisp may not be to everyone’s liking, but being able to program your editor is hugely powerful, especially when the programming language comes with an extremely strong library of tools just for editing text inside the editor.

Jamu Kakar, another emacs fan was just over at my place. He didn’t know about narrowing the buffer – to only show one section of it so that you can edit it to your heart’s content (e.g., global search & replace) and then widen it again when you’re done. We were looking at some Python code and I did a search. The class we were looking at was long, and I didn’t know if my search had finished in code that was still part of the same class. I said to Jamu “Emacs Python mode should have a function that narrows the buffer to the current class”.

After he was gone, I was thinking about that, and realized it would be easy to write. It’s all of 8 lines.

(defun py-narrow-to-class nil
  (interactive)
  (save-excursion
    (py-beginning-of-def-or-class t)
    (let
        ((start (point)))
      (py-end-of-def-or-class t)
      (narrow-to-region (point) start))))
 

Not too shabby, and it took less than 5 minutes to write it. In words: here’s a function called py-narrow-to-class that takes no arguments and that I want to call interactively (via M-x py-narrow-to-class). It’s going to go to the start of the current Python class, set a local variable called start to remember that location, then move to the end of the class, and narrow the buffer. That’s wrapped in a (save-excursion) so that when all that moving around and narrowing is done, the cursor will be in the exact spot it was when I started. If I want, I can now assign this function to a single keystroke when I’m in Python mode.

If you don’t think that’s pretty neat, you’re probably not a programmer. Can you do that in your editor?

bzr viz is so pretty

Monday, January 18th, 2010

A visual summary of my coding work in the last week, creating branches, working on them, merging them back into the FluidDB trunk.

bzr viz

At what point does an Amazon EC2 reserved instance become worth it?

Friday, January 8th, 2010

If you purchase an Amazon EC2 reserved instance, you’ll pay a certain amount up front (pricing). If you don’t use the instance much, it will be more expensive per hour than a regular on-demand instance. E.g., if you paid $227.50 to reserve a small instance for a year but then only used it for a single day, you’d be paying almost $10/hr and it would obviously be much cheaper to just get an on-demand instance and pay just 8.5 cents per hour.

OTOH, if you ran a small instance for a year at the on-demand price, you’d pay $745 and it would obviously be cheaper to pay the up-front reservation price ($227.50) plus a year of the low per-hour pricing (365 * 24 * $0.03), or $490.

So for how long do you have to run an instance in order for it to be cheaper to pay for a reserved instance? (Note that I’m ignoring the time value of money, what you might do with the up-front money in the meantime if you didn’t give it to Amazon in advance, etc.)

The answer is pretty simple: for a one-year reservation you need to run the instance for about 6 months to make it worthwhile. For a three-year reservation you need to run the instance for at least 3 months per year, on average.

Here’s a fragment from a simple spreadsheet I made, based on the US N. Virginia prices:

ec2

Facebook release Tornado and it’s not based on Twisted?

Saturday, September 12th, 2009

Image: Jay Smith

Image: Jay Smith

To their great credit, Facebook have just open-sourced more of their core software. This time it’s Tornado, an asynchronous web server written in Python.

Surely that can only mean one thing: Tornado is based on Twisted. Right?

Incredibly, no. Words fail me on this one. I’ve spent some hours today trying to put my thoughts into order so I could put together a reasonably coherent blog post on the subject. But I’ve failed. So here are some unstructured thoughts.

First of all, I’m not meaning to bash Facebook on this. At Fluidinfo we use their Thrift code. We’ll almost certainly use Scribe for logging at some point, and we’re seriously considering using Cassandra. Esteve Fernandez has put a ton of work into txAMQP to glue together Thrift, Twisted, and AMQP, and in the process became a Thrift committer.

Second, you can understand—or make an educated guess at—what happened: the Facebook programmers, like programmers everywhere, were strongly tempted to just write their own code instead of dealing with someone else’s. It’s not just about learning curves and fixing deficiencies, there are also issues of speed of changes and of control. At Fluidinfo we suffered through six months of maintaining our own set of related patches to Thrift before the Twisted support Esteve wrote was finally merged to trunk. That was painful and the temptation to scratch our own itch, fork, and forget about the official Thrift project was high.

Plus, Twisted suffers from the fact that the documentation is not as good as it could be. I commented at length on this over three years ago. Please read the follow-up posts in that thread for an illustration (one of many) of the maturity of the people running Twisted. Also note that the documentation has improved greatly since then. Nevertheless, Twisted is a huge project, it has tons of parts, and it’s difficult to document and to wrap your head around no matter what.

So you can understand why Facebook might have decided not to use Twisted. In their words:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

I’m betting it’s that last part that’s the key to the decision not to use Twisted.

But seriously…… WTF?

Twisted is an amazing piece of work, written by some truly brilliant coders, with huge experience doing exactly what Facebook set out to reinvent.

This is where I’m at a loss for words. I think: “what an historic missed opportunity” and “reinventing the wheel, badly” and “no, no, no, this cannot be” and “this is just so short-sighted” and “a crying shame” and many things besides.

Sure, Twisted is difficult to grok. But that’s no excuse. It’s a seriously cool and powerful framework, it’s vastly more sophisticated and useful and extensible than what Facebook have cobbled together. Facebook could have worked to improve twisted.web (which everyone agrees has some shortcomings) which could have benefitted greatly from even a small fraction of the resources Facebook must have put into Tornado. The end result would have been much better. Or Facebook could have just ignored twisted.web and built directly on top of the Twisted core. That would have been great too.

Or Facebook could have had a team of people who knew how to do it better, and produced something better than Twisted. I guess that’s the real frustration here – they’ve put a ton of effort into building something much more limited in scope and vision, and even the piece that they did build looks like a total hack built to scratch their short term needs.

What’s the biggest change in software engineering over the last decade? Arguably it’s the rise of test-driven development. I’m not the only one who thinks so. Yet here we are in late 2009 and Facebook have released a major piece of code with no test suite. Amazing. OK, you could argue this is a minor thing, that it’s not core to Tornado. That argument has some weight, but it’s hard to think that this thing is not a hack.

If you decide to use an asynchronous web framework, do you expect to have to manually set your sockets to be non-blocking? Do you feel like catching EWOULDBLOCK and EAGAIN yourself? Those sorts of things, and their non-portability (even within the world of Linux) are precisely the kinds of things that lead people away from writing C and towards doing rapid development using something robust and portable that looks after the details. They’re precisely the things Twisted takes care of for you, and which (at least in Twisted) work across platforms, including Windows.

It looks like Tornado are using a global reactor, which the Twisted folks have learned the hard way is not the best solution.

Those are just some of the complaints I’ve heard and seen in the Tornado code. I confess I’ve looked only superficially at their code – but more than enough to feel such a sense of lost opportunity. They built a small subsection of Twisted, they’ve done it with much less experience and elegance and hiding of detail than the Twisted code, and the thing doesn’t even come with a test suite. Who knows if it actually works, or when, or where, etc.?

And…. Twisted is so much more. HTTP is just one of many protocols Twisted speaks, including (from their home page): “TCP, UDP, SSL/TLS, multicast, Unix sockets, a large number of protocols including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others”.

Want to build a sophisticated, powerful, and flexible asynchronous internet service in Python? Use Twisted.

A beautiful thing about Twisted is that it expands your mind. Its abstractions (particularly the clean separation and generality of transports, protocols, factories, services, and Deferreds—see here and here and here) makes you a better programmer. As I wrote to some friends in April 2006: “Reading the Twisted docs makes me feel like my brain is growing new muscles.”

Twisted’s deferreds are extraordinarily elegant and powerful, I’ve blogged and posted to the Twisted mailing list about them on multiple occasions. Learning to think in the Twisted way has been a programming joy to me over the last 3 years, so you can perhaps imagine my dismay that a company with the resources of Facebook couldn’t be bothered to get behind it and had to go reinvent the wheel, and do it so poorly. What a pity.

In my case, I threw away an entire year of C code in order to use Twisted in FluidDB. That was a decision I didn’t take lightly. I’d written my own libraries to do lots of low level network communications and RPC – including auto-generating server and client side glue libraries to serialize and unserialize RPC calls and args (a bit like Thrift), plus a server and tons of other stuff. I chucked it because it was too brittle. It was too much of a hack. It wasn’t portable enough. It was too get the details right. It wasn’t extensible.

In other words….. it was too much like Tornado! So I threw it all away in favor of Twisted. As I happily tell people, FluidDB is written in Python so it can use Twisted. It was a question of an amazingly great asynchronous networking framework determining the choice of programming language. And this was done in spite of the fact that I thought the Twisted docs sucked badly. The people behind Twisted were clearly brilliant and the community was great. There was an opportunity to make a bet on something and to contribute. I wish Facebook had made the same decision. It’s everyone’s loss that they did not. What a great pity.