Archive for the ‘python’ Category

Secure per-site passwords with no encrypted blob

Sunday, February 3rd, 2013

Last night I read a Guardian article, Online passwords: keep it complicated. It’s a surprisingly good summary, given that it’s aimed at the general public. The author concludes by telling us he decided to adopt LastPass, and also mentions 1Password. The comments section on the page gives similar solutions, like KeePass. The security-conscious people I know have arrived at the same conclusion. There are plenty of articles on the web that summarize various similar products, e.g., Which Password Manager Is The Most Secure? at LifeHacker. A single encrypted blob offers good security and works well in practice. It also allows the storage of information like name, address, credit cards, etc., that can be used to auto-fill web forms.

But… I’ve never liked the idea of a single encrypted file with all my passwords in it. What if the storage is lost or corrupted? Could the file someday be decrypted by someone else? If my encrypted blob is online, what happens when I am offline? If the blob is to be stored locally, I need to think about where to put it, how to back it up, etc. If a company holds it for me, what happens if they go out of business or get hacked? What if they use proprietary encryption, a closed-source access app, or a proprietary underlying data format? Not all the above solutions have all these issues, but they all have some of them.

The crucial thing they all have in common is that they use a master password to encrypt all your passwords into a single blob, and the blob has to then be reliably stored and accessible forever.

An approach that requires no storage

I realized there’s a solution that doesn’t require any storage. It’s not perfect, but it has some attractive properties. I’ve already started using it for some sites.

[Edit: it has been pointed out in the comments that the following solution has been thought of before. See SuperGenPass.]

Here’s a simple first version of Python code to print my password for a given service:

import base64, getpass, hashlib, sys

service = sys.argv[1]
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())[:32]

print 'Your password on %s is %s' % (service, password)

The service name is given on the command line. The code prints a 32-character password for use on that service. Here’s some sample output:

        $ mypass.py facebook
        Enter your master password: 
        Your password on facebook is Wza2l5Tqy0omgWP+5DDsXjQLO/Mc07N8

        $ mypass.py twitter
        Enter your master password: 
        Your password on twitter is eVhhpjJrmtSa8XnNMu6vLSDhPeO5nFOT

This has some nice advantages. It also places some small requirements on the user. Unfortunately, however, it is not generally applicable – at least not today. These are discussed below.

Advantages

The obvious advantage is that there is no external storage. Your passwords are not stored anywhere. There’s no blob to store, protect, access, backup, worry about, etc. The algorithm used to generate your password is dead-simple, it’s open and available in dozens of languages, and it’s secure.

You’re free to use more than one master password if you like. You can invent your own services names (more on which below).

Requirements / User burden

As with all one-key-to-unlock-them-all approaches, the user obviously needs to remember their master password.

With this approach though, the user also has to remember the name they used for the service they’re trying to access. If you create a password for a service called “gmail” you’ll need to use that exact service name in the future. For me that’s not much of a burden, but I guess it would be for others.

There’s no reason why the list of services you have passwords for couldn’t be stored locally. If the password generator were in a browser extension, it could possibly suggest a service name, like “facebook.com”, based on the domain of the page you were on.

With this approach, it’s even more important that one’s master password be hard to guess. Unlike the single-encrypted-blob approach, anyone who can guess your master password (and the names you use for services) can immediately obtain your passwords. They don’t also need access to the blob – it doesn’t exist.

Additional security can be easily had by, for example, using a convention of adding a constant character to your service names. So, e.g., you could use “facebook*” and “twitter*” as service names, and not tell anyone how you form service names.

General applicability

Unfortunately, there is a major problem with this approach. That’s because different sites have different requirements on passwords. Some of the difficulties can be avoided quite easily, but there’s an additional problem, caused when services change their password policy.

The above code generates a Base64 password string. So, to give some examples, if the service you want a password for doesn’t allow a plus sign in your password, the above code might make something unacceptable to the service. Same thing if they insist that passwords must be at most 12 characters long.

Ironically, these services are insisting on policies that prevent the use of truly secure passwords. They’re usually in place to ensure that short passwords are chosen from a bigger space. It would be better, though more work, to impose restrictions only on short passwords.

In a perfect world, all sites could immediately switch to allowing Base64 passwords of length ≥ 16 (say). Then the above approach would work everywhere and we’d be done.

Varying password length

A general approach to adjusting the generated password is to take some of the Base64 information produced and use it to modify the password. For example, you might not comfortable with all your passwords being the same length, so we can compute a length like this:

import base64, getpass, hashlib, string, sys

b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
password = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(password[-5]) % 16
print 'Your password on %s is %s' % (service, password[0:16 + lenAdjust])

This generates passwords that are between 16 and 31 characters in length:

        $ ./length-varying.py facebook
        Enter your master password: 
        Your password on facebook is 1nTlVGPhuWZf0l9Sk27
        $ ./length-varying.py twitter
        Enter your master password: 
        Your password on twitter is WE1DVZHAFBx2c3g63tR+Oi3Jxs4xMV

Satisfying site requirements

A possible approach to dealing with per-site password requirements is to have the code look up known services and adjust the initial password it generates to be acceptable. This can easily be done in a secure, random, repeatable way. For example:

  • If a site doesn’t allow upper case, lowercase the password.
  • If a site doesn’t allow digits, replace them with random letters.
  • If a site requires punctuation, you can replace some initial letters in the password with randomly chosen punctuation and then randomly permute the result using the Knuth Shuffle.

Some of these transformations use random numbers. These are easy to obtain: take an unused part of the Base64 string and use it to seed a RNG. For each transformation, you would need to call the RNG a fixed number of times, i.e., independent of the number of random numbers actually used to perform the transformation. That’s necessary in order to keep the RNG in a known state for subsequent transformations (if any).

For example, the following replaces digits 0-9 with a letter from A-J whose case is chosen randomly:

import base64, getpass, hashlib, random, string, sys

def getSeed(chars):
    seed = ord(chars[0])
    for letter in chars[1:]:
        seed = (seed << 8) & ord(letter)
    return seed

service = sys.argv[1]
b64letters = string.ascii_letters + '0123456789+/'
secret = getpass.getpass('Enter your master password: ')
digest = base64.b64encode(hashlib.sha512(secret + service).digest())
lenAdjust = b64letters.find(digest[-5]) % 16

passwordWithDigits = digest[0:16 + lenAdjust]
password = ''

random.seed(getSeed(digest[32:36]))
randoms = [random.randint(0, 1) for _ in passwordWithDigits]

for index, letter in enumerate(passwordWithDigits):
    if letter in '0123456789':
        base = ord('a' if randoms[index] else 'A')
        replacement = chr(base + ord(letter) - ord('0'))
        password += replacement
    else:
        password += letter

print 'Your password on %s is %s' % (service, password)

Here’s the output for the above two services (using the same master password):

        $ ./no-digits.py facebook
        Enter your master password: 
        Your password on facebook is bnTlVGPhuWZfAljSkch
        $ ./no-digits.py twitter
        Enter your master password: 
        Your password on twitter is WEBDVZHAFBxccdgGdtR+OidJxsExMV

As you can see, the digits are replaced with letters (in a biased way, that we can ignore in an informal blog post). The RNG is in a known state because the number of times it has been called is independent of the number of digits in the pre-transformation text.

This approach can be used to transform the initial random sequence into one that satisfies a service’s password restrictions.

It is difficult to reliably associate service names with site policies. To do so might require keeping a file mapping the name a user used for a service to the policy of the site. Although this doesn’t defeat the purpose of this approach (since that file would not need to be stored securely), it is an additional and unwanted pain for the user. Part of the point was to try to entirely avoid additional storage, even if it doesn’t have to be encrypted.

The major problem with per-site requirements

The major problem however is that sites may change their password policy. Even if our program knew the rules for all sites, it would have a real problem if a site changed its policy. The code would need to be updated to generate passwords according to the new site policy. Existing users, supposing they upgraded, would then be shown incorrect passwords and would need to do password resets, which is obviously inconvenient.

Conclusion

I like the above approach a lot, but don’t see a way to solve the issue with changing site policies. I wouldn’t mind building in some rules for known popular sites, but any step in that direction has its problems – at least as far as I can see.

For now, I’m going to start using the above approach on sites that allow a long random password with characters from the Base64 set. That covers the majority of sites I use. Importantly, that includes Google, so if I ever need password resets I can have them sent there, knowing that I can always log in to recover them.

Omit needless parens

Tuesday, December 18th, 2012

The famous 17th commandment in The Elements of Style is “Omit needless words”.

There should be an equivalent in programming, but for parentheses. Every time I see needless parens in a program I want to rip them out (unless they’re obviously there for formatting/readability reasons).

Community service message: Omit needless parens. When in doubt whether parens are needed, look up the precedence rules for the operators involved and only use parens if the default isn’t what you want.

Here’s why you shouldn’t use needless parens:

  • The #1 reason is that you’re making your code more difficult to read for people who know the language better than you do. A more experienced programmer will see a red flag and look at your code more carefully than necessary because they will be trying to figure out why you used the extra parens and if there’s something non-obvious going on. When I come across code like that, I usually conclude that whoever wrote the code doesn’t know the language that well. My opinion of the code goes down. My reading speed goes down too because the needless parens, in my estimation, indicate an increased likelihood that programmer has done other (worse) things elsewhere.
  • You don’t want to appear incompetent or lazy, or to slow down or put off people reading your code, right? (See above.)
  • Putting in needless parens is heading down a slippery slope. How many levels of extra parens should you stop at? The only clear cut rule that makes sense is to stop at zero.
  • If you pause to look up the precedence rules, you’ll make yourself a better programmer in the language in question. You’ll be able to read other people’s needless-parenthesis-free code with no problem. You can pen lofty holier-than-thou blog posts like this one.

Back in about 1985 I wrote a tiny shell script to print out operator precedence and associativity rules for C. When I started programming in Perl, I wrote one for it. Then one for Python and later one for Javascript. For your convenience, and as a reward for reading, I just stuck the 4 scripts up on Github.

A simple way to calculate the day of the week for any day of a given year

Sunday, November 11th, 2012

March 29th

Image: Jeremy Church

The other day I read a tweet about how someone was impressed that a friend had been able to tell them the day of the week given an arbitrary date.

There are a bunch of general methods to do this listed on the Wikipedia page for Determination of the day of the week. Typically, there are several steps involved, and you need to memorize some small tables of numbers.

I used to practice that mental calculation (and many others) when I was about 16. Although all the steps are basic arithmetic, it’s not easy to do the calculation in your head in a couple of seconds. Given that most of these questions that you’re likely to face in day-to-day life will be about the current year, it seemed like it might be a poor tradeoff to learn the complicated method to calculate the day of the week for any date if there was a simpler way to do it for a specific year.

The method I came up with after that observation is really simple. It just requires you to memorize a single 12-digit sequence for the current year. The 12 digits correspond to the first Monday for each month.

For example, the sequence for 2012 is 265-274-263-153. Suppose you’ve memorized the sequence and you need to know the day of the week for March 29th. You can trivially calculate that it’s a Thursday. You take the 3rd digit of the sequence (because March is the 3rd month), which is 5. That tells you that the 5th of March was a Monday. Then you just go backward or forward as many weeks and days as you need. The 5th was a Monday, so the 12th, 19th, and 26th were too, which means the 29th was a Thursday.

It’s nice because the amount you need to memorize is small, and you can memorize less digits if you only want to cover a shorter period. The calculation is very simple and always the same in every case, and you never have to think about leap years. At the start of each year you just memorize a single sequence, which is quickly reinforced once you use it a few times.

Here’s Python code to print the sequence for any year.

#!/usr/bin/env python

import datetime, sys

try:
    year = int(sys.argv[1])
except IndexError:
    year = datetime.datetime.today().year

firstDayToFirstMonday = ['1st', '7th', '6th', '5th', '4th', '3rd', '2nd']
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
summary = ''

for month in range(12):
    firstOfMonth = datetime.datetime(year, month + 1, 1).weekday()
    firstMonday = firstDayToFirstMonday[firstOfMonth]
    print months[month], firstMonday
    summary += firstMonday[0]

print 'Summary:', '-'.join(summary[x:x+3] for x in range(0, 12, 3))

The output for 2012 looks like

Jan 2nd
Feb 6th
Mar 5th
Apr 2nd
May 7th
Jun 4th
Jul 2nd
Aug 6th
Sep 3rd
Oct 1st
Nov 5th
Dec 3rd
Summary: 265-274-263-153

The memory task is made simpler by the fact that there are only 14 different possible sequences. Or, if you consider just the last 10 digits of the sequences (i.e., starting from March), there are only 7 possible sequences. There are only 14 different sequences, so if you use this method in the long term, you’ll find the effort of remembering a sequence will pay off when it re-appears. E.g., 2013 and 2019 both have sequence 744-163-152-742. There are other nice things you can learn that can also make the memorization and switching between years easier (see the Corresponding months section on the above Wikipedia page).

Here are the sequences through 2032:

2012 265-274-263-153
2013 744-163-152-742
2014 633-752-741-631
2015 522-641-637-527
2016 417-426-415-375
2017 266-315-374-264
2018 155-274-263-153
2019 744-163-152-742
2020 632-641-637-527
2021 411-537-526-416
2022 377-426-415-375
2023 266-315-374-264
2024 154-163-152-742
2025 633-752-741-631
2026 522-641-637-527
2027 411-537-526-416
2028 376-315-374-264
2029 155-274-263-153
2030 744-163-152-742
2031 633-752-741-631
2032 521-537-526-416

Simpler Twisted deferred code via decorated callbacks

Sunday, October 14th, 2012

This morning I was thinking about Twisted deferreds and how people find them difficult to grasp, but how they’re conceptually simple once you get it. I guess most of us tell people a deferred is something to hold a result that hasn’t arrived yet. Sometimes, though, deferreds do have a result in them immediately (e.g., using succeed or fail to get an already-fired deferred).

I wondered if it might work to tell people to think of a deferred as really being the result. If that were literally true, then instead of writing:

d = getPage(...)
d.addErrback(errcheck, args)
d.addCallback(cleanup, args)
d.addCallback(reformat, args)
return d

We might write something like:

result1 = getPage(...)
result2 = errcheck(result1, args)
result3 = cleanup(result2, args)
return reformat(result3, args)

And if you could write that, you could obviously instead write:

return reformat(cleanup(errcheck(getPage(...), args), args), args)

If we could write Twisted code that way, I think using deferreds would be simpler for people unfamiliar with them. We could show them Twisted code and not even have to mention deferreds (see below).

In the style we’re all used to, the programmer manually adds callbacks and errbacks. That’s basically boilerplate. It gets worse when you then need to also use DeferredList, etc. It’s a little confusing to read deferred code at first, because you need to know that the deferred result/failure is automatically passed as the first arg to callbacks/errbacks. It seems to take a year or more for people to finally realize how the callback & errback chains actually interact :-) Also, I wonder how comfortable programmers are with code ordered innermost function first, as in the normal d.addCallback(inner).addCallback(outer) Twisted style, versus outer(inner()), as in the line above.

Anyway… I realized we CAN let people use the succinct style above, by putting boilerplate into decorators. I wrote two decorators, called (surprise!) callback and errback. You can do this:

@errback
def errcheck(failure, arg):
    ...

@callback
def cleanup(page, arg):
    ...

@callback
def reformat(page, arg):
    ...

reformat(cleanup(errcheck(getPage(...), arg1), arg2), arg3)

The deferred callback and errback chains are hooked up automatically. You still get a regular deferred back as the return value.

And… the “deferred” aspect of the code (or at least the need to talk about or explain it) has conveniently vanished.

You can also do things like

func1(getDeferred1(), errcheck(func2(getDeferred2(), getDeferred3())))

This gets the result of deferreds 2 & 3 and (if neither fails) passes the result of calling func2 on both results through to func1, which is called along with the result of deferred 1. You don’t need to use DeferredLists, as the decorator makes them for you. The errcheck function wont be called at all unless there’s an error.

That’s nice compared to the verbose equivalent:

d1 = DeferredList([getDeferred2(), getDeferred3()])
d1.addCallback(func2)
d1.addErrback(errcheck)
d2 = DeferredList([getDeferred1(), d1])
d2.addCallback(func1)

Or the more compact but awkward:

DeferredList([getDeferred(), DeferredList([getDeferred(), getDeferred()]).addCallback(
    func2).addErrback(errcheck)]).addCallback(func1)
 

There’s lots more that could be said about this, but that’s enough for now. The code (surely not bulletproof) and some tests are on Github. I’ll add a README sometime soon. This is still pretty much proof of concept, and some it could be done slightly differently. I’m happy to discuss in more detail if people are interested.

describejson – a Python script for summarizing JSON structure

Thursday, August 9th, 2012

Yesterday I was sent a 24M JSON file and needed to look through it to give someone an opinion on its contents. I did what I normally do to look at JSON, piping it into python -m json.tool. The result looked pretty good, I scrolled through some screens with a single long list and jumped to the bottom to see what looked like the end of the list. What I didn’t know at the time was that the output was 495,647 lines long! And there was some important stuff in the middle of the output that I didn’t see at all.

So I decided to write a quick program to recursively summarize JSON. You can grab it from Github at https://github.com/terrycojones/describejson.

Usage is simple, just send it JSON on stdin. Here’s an example input:

{
  "cats": 3,
  "dogs": 6,
  "parrots": 1
}

Which gets you this output:

$ python describejson.py < input.json
1 dict of length 3. Values:
  3 ints

The output is a little cryptic, but you’ll get used to it (and I may improve it). In words, this is telling you that (after loading the JSON) the input contained 1 Python dictionary with 3 elements. The values of the 3 elements are all integers. The indentation is meaningful, of course. You can see that the script is summarizing the fact that the 3 values in the dict were all of the same type.

Here’s another sample input:

[
  ["fluffy", "kitty", "ginger"],
  ["fido", "spot", "rover"],
  ["squawk"]
]

Which gets you:

$ python describejson.py < input.json
1 list of length 3. Values:
  2 lists of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

In words, the input was a list of length 3. Its contents were 2 lists of length 3 that both contained 3 unicode strings, and a final list that contains just a single unicode string.

Specifying equality strictness

The script currently takes just one option, --strictness (or just -s) to indicate how strict it should be in deciding whether things are “the same” in order to summarize them. In the above output, the default strictness length is used, so the script considers the first two inner lists to be collapsible in the summary, and prints a separate line for the last list since it’s of a different length. Here’s the output from running with --strictness type:

$ python describejson.py –strictness type < input.json
1 list of length 3. Values:
  3 lists of length 3. Values:
    3 unicodes

The lists are all considered equal here. The output is a little misleading, since it tells us there are 3 lists of length 3, each containing 3 unicodes. I may fix that.

We can also be more strict. Here’s the output from --strictness keys:

$ python describejson.py –strictness keys < input.json
1 list of length 3. Values:
  1 list of length 3. Values:
    3 unicodes
  1 list of length 3. Values:
    3 unicodes
  1 list of length 1. Values:
    1 unicode

The 3 inner lists are each printed separately because their contents differ. The keys argument is also a bit confusing for lists, it just means the list values. It’s clearer when you have dictionaries in the input.

This input:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "d": 4,
    "e": 5,
    "f": 6
  }
]

produces

$ python describejson.py < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

I.e., one list, containing 2 dictionaries, each containing 3 int values. Note that this is using the default of --strictness length so the two dicts are considered the same. If we run that input with strictness of keys, we’ll instead get this:

$ python describejson.py –strictness keys < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

The dicts are considered different because their keys differ. If we change the input to make the keys the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 4,
    "b": 5,
    "c": 6
  }
]

and run again with --strictness keys, the dicts are considered the same:

$ python describejson.py –strictness keys < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

but if we use --strictness equal, the dicts will be considered different:

$ python describejson.py –strictness equal < input.json
1 list of length 2. Values:
  1 dict of length 3. Values:
    3 ints
  1 dict of length 3. Values:
    3 ints

Finally, making the dicts the same:

[
  {
    "a": 1,
    "b": 2,
    "c": 3
  },
  {
    "a": 1,
    "b": 2,
    "c": 3
  }
]

and running with --strictness equal will collapse the summary as you’d expect:

$ python describejson.py –strictness equal < input.json
1 list of length 2. Values:
  2 dicts of length 3. Values:
    3 ints

Hopefully it’s clear that by being less strict on matching you’ll get more concise output in which things are casually considered “the same” and if you’re more strict you’ll get more verbose output, all the way to using strict equality for both lists and dicts.

Here’s the full set of --strictness options:

  • type: compare things by type only.
  • length: compare lists and objects by length.
  • keys: compare lists by equality, dicts by keys.
  • equal: compare lists and dicts by equality.

Improvements

The naming of the --strictness options could be improved. The keys option should probably be called values (but that is confusing, since dictionaries have values and it’s a comparison based on their keys!). A values option should probably also compare the value of primitive things, like integers and strings.

There are quite a few other things I might do to this script, if I ever have time. It would be helpful to print out some keys and values when these are short and unchanging. It would be good to show an example representative value of something that repeats (modulo strictness) many times. It might be good to be able to limit the depth to go into a JSON structure.

Overall though, I already find the script useful and I’m not in a rush to “improve” it by adding features. You can though :-)

You might also find it helpful to take what you learn about a JSON object via describe JSON and use that to grep out specific pieces of the structure using jsongrep.py.

If you’re curious, here’s the 24-line output summary of the 24M JSON I received. Much more concise than the nearly 1/2 a million lines from python -m json.tool:

1 dict of length 3. Values:
  1 int
  1 dict of length 4. Values:
    1 list of length 17993. Values:
      17993 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
    1 list of length 0.
    1 list of length 11907. Values:
      11907 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 1. Values:
          1 unicode
        2 unicodes
    1 list of length 28068. Values:
      28068 dicts of length 5. Values:
        1 unicode
        1 int
        1 list of length 0.
        2 unicodes
  1 unicode

Autovivification in Python: nested defaultdicts with a specific final type

Saturday, May 26th, 2012

I quite often miss the flexibility of autovivification in Python. That Wikipedia link shows a cute way to get what Perl has:

from collections import defaultdict

def tree():
    return defaultdict(tree)

lupin = tree()
lupin["express"][3] = "stand and deliver"

It’s interesting to think about what’s going on in the above code and why it works. I really like defaultdict.

What I more often want, though, is not infinitely deep nested dictionaries like the above, but a (known) finite number of nested defaultdicts with a specific type at the final level. Here’s a tiny function I wrote to get just that:

from collections import defaultdict

def autovivify(levels=1, final=dict):
    return (defaultdict(final) if levels < 2 else
            defaultdict(lambda: autovivify(levels – 1, final)))

So let’s say you were counting the number of occurrences of words said by people by year, month, and day in a chat room. You’d write:

words = autovivify(5, int)

words["sam"][2012][5][25]["hello"] += 1
words["sue"][2012][5][24]["today"] += 1

Etc. It’s pretty trivial, but it was fun to think about how to do it with a function and to play with some alternatives. You could do it manually with nested lambdas:

words = defaultdict(lambda: defaultdict(int))
words["sam"]["hello"] += 1

But that gets messy quickly and isn’t nearly as much fun.

Emacs buffer mode histogram

Thursday, November 10th, 2011

Tonight I noticed that I had over 200 buffers open in emacs. I’ve been programming a lot in Python recently, so many of them are in Python mode. I wondered how many Python files I had open, and I counted them by hand. About 90. I then wondered how many were in Javascript mode, in RST mode, etc. I wondered what a histogram would look like, for me and for others, at times when I’m programming versus working on documentation, etc.

Because it’s emacs, it wasn’t hard to write a function to display a buffer mode histogram. Here’s mine:

235 buffers open, in 23 distinct modes

91               python +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
47          fundamental +++++++++++++++++++++++++++++++++++++++++++++++
24                  js2 ++++++++++++++++++++++++
21                dired +++++++++++++++++++++
16                 html ++++++++++++++++
 7                 text +++++++
 4                 help ++++
 4           emacs-lisp ++++
 3                   sh +++
 3       makefile-gmake +++
 2          compilation ++
 2                  css ++
 1          Buffer-menu +
 1                 mail +
 1                 grep +
 1      completion-list +
 1                   vm +
 1                  org +
 1               comint +
 1              apropos +
 1                 Info +
 1           vm-summary +
 1      vm-presentation +

Tempting as it is, I’m not going to go on about the heady delights of having a fully programmable editor. You either already know, or you can just drool in slack-jawed wonder.

Unfortunately I’m a terrible emacs lisp programmer. I can barely remember a thing each time I use it. But the interpreter is of course just emacs itself and the elisp documentation is in emacs, so it’s a really fun environment to develop in. And because emacs lisp has a ton of support for doing things to itself, code that acts on emacs and your own editing session or buffers is often very succinct. See for example the save-excursion and with-output-to-temp-buffer functions below.

(defun buffer-mode-histogram ()
  "Display a histogram of emacs buffer modes."
  (interactive)
  (let* ((totals ‘())
         (buffers (buffer-list()))
         (total-buffers (length buffers))
         (ht (make-hash-table :testequal)))
    (save-excursion
      (dolist (buffer buffers)
        (set-buffer buffer)
        (let
            ((mode-name (symbol-name major-mode)))
          (puthash mode-name (1+ (gethash mode-name ht 0)) ht))))
    (maphash (lambda (key value)
               (setq totals (cons (list key value) totals)))
             ht)
    (setq totals (sort totals (lambda (x y) (> (cadr x) (cadr y)))))
    (with-output-to-temp-buffer "Buffer mode histogram"
      (princ (format "%d buffers open, in %d distinct modes\n\n"
                      total-buffers (length totals)))
      (dolist (item totals)
        (let
            ((key (car item))
             (count (cadr item)))
          (if (equal (substring key -5) "-mode")
              (setq key (substring key 0 -5)))
          (princ (format "%2d %20s %s\n" count key
                         (make-string count ?+))))))))
 

Various things about the formatting could be improved. E.g., not use fixed-width fields for the count and the mode names, and make the + signs indicate more than one buffer mode when there are many.

txdpce: a Twisted class for deferred parallel command execution

Tuesday, July 12th, 2011

I just uploaded a simple Twisted Python class, txdpce, to Launchpad. It’s designed for situations where you have multiple ways of obtaining a function result and you want to try them all in parallel and return the result from the fastest function. A typical case is that you can either compute a result via a network call or try to get it out of a cache (perhaps also via a network call, to memcached). You might also be able to compute it locally, etc.

Things can be more complicated than provided for here, of course. E.g., you might like to cache the result of a local call (if it finishes first or if the cache lookup fails). The txdpce class is supposed to be a simple demonstration. I wrote it for a bit of fun this morning and also because it’s yet another nice example of how you can click together the building blocks of Twisted to form increasingly sophisticated classes.

Here’s the class. You’ll find a test suite at the Launchpad site. You can download the code using bzr via bzr branch lp:txdpce.

from twisted.internet import defer
from twisted.python import failure

class ParallelCommandException(Exception):
    pass

class DeferredParallelCommandExecutor(object):
    """
    Provides a mechanism for the execution of a command by multiple methods,
    returning the result from the one that finishes first.
    "
""

    def __init__(self):
        self._functions = []

    def registerFunction(self, func):
        """
        Add func to the list of functions that will be run when execute is
        called.

        @param func: A callable that should be run when execute is called.
        """
        self._functions.append(func)

    def execute(self, *args, **kwargs):
        """
        Run all the functions in self._functions on the given arguments and
        keyword arguments.

        @param args: Arguments to pass to the registered functions.

        @param kwargs: Keyword arguments to pass to the registered functions.

        @raise RuntimeError: if no execution functions have been registered.

        @return: A C{Deferred} that fires when the first of the functions
        finishes.
        """
        if not self._functions:
            raise RuntimeError('No execution functions have been registered.')

        deferreds = [defer.maybeDeferred(func, *args, **kwargs)
                     for func in self._functions]
        d = defer.DeferredList(deferreds, fireOnOneCallback=1, consumeErrors=1)
        d.addCallback(self._done, deferreds)
        return d

    def _done(self, deferredListResult, deferreds):
        """
        Process the result of the C{DeferredList} execution of all the
        functions in self._functions. If result is a tuple, it's the result
        of a successful function, in which case we cancel all the other
        deferreds (none of which has finished yet) and give back the
        result.  Otherwise, all the function calls must have failed (since
        we passed fireOnOneCallback=True to the C{DeferredList} and we
        return a L{ParallelCommandException} containing the failures.

        @param deferredListResult: The result of a C{DeferredList} returned
        by self.execute firing.

        @param deferreds: A list of deferreds for other functions that were
        trying to compute the result.

        @return: Either the result of the first function to successfully
        compute the result or a C{failure.Failure} containing a
        L{ParallelCommandException} with a list of the failures from all
        functions that tried to get the result.
        """
        if isinstance(deferredListResult, tuple):
            # A tuple result means the DeferredList fired with a successful
            # result.  Cancel all other deferreds and return the result.
            result, index = deferredListResult
            for i in range(len(self._functions)):
                if i != index:
                    deferreds[i].cancel()
            return result
        else:
            # All the deferreds failed. Return a list of all the failures.
            failures = [fail for (result, fail) in deferredListResult]
            return failure.Failure(ParallelCommandException(failures))
 

A resizable dispatch queue for Twisted

Monday, June 27th, 2011

In December 2009 I posted to the Twisted mailing list about what I called a Resizable Dispatch Queue. I’ve just spent some time making a new version that’s much improved. You can pick up the new version via pip install txrdq, from PyPI, or from the txRDQ project page on Launchpad. Here’s the example use case, taken from my original posting:

You want to write a server with a web interface that allows people to enter their phone number so you can send them an SMS. You anticipate lots of people will use the service. But sending SMS messages is quite slow, and the company that you ship those jobs off to is concerned that you’ll overrun their service (or maybe they have an API limit, etc). So you need to queue up jobs locally and send them off at a certain rate. You’d like to be able to adjust that rate up or down. You also want to be able to shut your service down cleanly (i.e., not in the middle of a task), and when you restart it you want to be able to re-queue the jobs that were queued last time but which hadn’t gone out.

To make the example more concrete, suppose your function that sends the SMS is called sendSMS and that it takes a (number, message) tuple argument. Here are some of the kinds of things you can do:

from txrdq.rdq import ResizableDispatchQueue

# Create a queue that will allow 5 simultaneous underway jobs.
rdq = ResizableDispatchQueue(sendSMS, 5)

# Later... send off some SMS messages.
d1 = rdq.put((2127399921, 'Hello...'), priority=5)
d1.addCallback(...)

d2 = rdq.put((5052929919, 'Test...'), priority=5)
d2.addCallback(...)

# Cancel the second job
d2.cancel()

# Widen the outgoing pipeline to 10 simultaneous jobs.
rdq.width = 10

# We're dispatching jobs too fast, turn it down a little.
rdq.width = 7

# Get a copy of the list of pending jobs.
jobs = rdq.pending()

# Cancel one of the pending jobs from the jobs list.
job.cancel()

# Reprioritize one of the pending jobs from the jobs list.
rdq.reprioritize(job, -1)

# Arrange to increase the number of jobs in one hour.
reactor.callLater(3600, rdq.setWidth, 20)

# Pause processing.
rdq.pause()

# Resume processing, with a new width of 8.
rdq.resume(8)

# Shutdown. Wait for any underway jobs to complete, and save
# the list of jobs not yet processed.

def saveJobs(jobs):
    pickle.dump(jobs, ...)

d = rdq.stop()
d.addCallback(saveJobs)
 

I’ve come up with many uses for this class over the last 18 months, and have quite often recommended it to people on the #twisted IRC channel. Other examples include fetching a large number of URLs in a controlled way, making many calls to the Twitter API, etc.

Usage of the class is very simple. You create the dispatch queue, giving it a function that will be called to process all jobs. Then you just put job arguments as fast as you like. Each call to put gets you a Twisted Deferred instance. If your function runs successfully on the argument, the deferred will call back with an instance of txrdq.job.Job. The job contains information about when it was queued, when it was launched, when it stopped, and of course the result of the function. If your function hits an error or the job is canceled (by calling cancel on the deferred), the deferred will errback and the failure will again contain a job instance with the details.

It’s also useful to have an admin interface to the queue, so calls such as pending and underway are provided. These return lists of job instances. You can call cancel on a job instance too. You can reprioritize jobs. And you can pause processing or shut the queue down cleanly.

The code also contains an independently useful Twisted classes called DeferredPriorityQueue (which I plan to write about), and DeferredPool (which I described earlier).

How to asynchronously exchange a dictionary using Twisted deferreds

Friday, June 10th, 2011

Here’s a fun class that I can’t think of a good use for :-) But I like its simplicity and it’s another illustration of what I like to call asynchronous data structures.

Suppose you have a producer who’s building a dictionary and a consumer who wants to look things up in it. The producer is working at their own pace, making new dictionary entries in whatever work order that suits them, and the consumer is requesting dictionary items in whatever order they need them. The two orders are obviously extremely unlikely to be the same if the dictionary is of non-trivial size. How do you write an asynchronous server (or data structure) that sits between the producer and consumer?

Yes, it’s far fetched, perhaps, but here’s a simple asynchronous dictionary class that lets you do it gracefully:

from collections import defaultdict
from twisted.internet import defer

class DeferredDict(dict):
    def __init__(self, *args, **kwargs):
        self._deferreds = defaultdict(set)
        dict.__init__(self, *args, **kwargs)

    def __getitem__(self, item):
        try:
            return defer.succeed(dict.__getitem__(self, item))
        except KeyError:
            d = defer.Deferred()
            self._deferreds[item].add(d)
            return d

    def __setitem__(self, item, value):
        if item in self._deferreds:
            for d in self._deferreds[item]:
                d.callback(value)
            del self._deferreds[item]
        dict.__setitem__(self, item, value)
 

When a consumer tries to get an element from the dictionary, they always get a deferred. The deferred will fire with the value from the dictionary when (if) it becomes available. Of course if the value is already known, they get it in a deferred that has already fired (via succeed). When the producer puts an element into the dictionary, any consumer deferreds that were waiting on that element’s value are given the value.

Note that a __delitem__ isn’t needed, we just inherit that from dict. If a non-existent item is deleted, you get the normal dictionary behavior (a KeyError). If the item does exist, that means the list of waiting deferreds on that item is empty (the fact the item exists means any waiting deferreds for the item have all been fired and that that item in the self._deferreds dictionary was deleted), so we can just let the dictionary class delete the item, as usual.

Graceful shutdown of a Twisted service with outstanding deferreds

Friday, June 10th, 2011

I’ve been spending a bit of time thinking again about queues and services. I wrote a Twisted class in 2009 to maintain a resizable dispatch queue (code in Launchpad, description on the Twisted mailing list). For this post I’ve pulled out (and simplified slightly) one of its helper classes, a DeferredPool.

This simple class maintains a set of deferreds and gives you a mechanism to get a deferred that will fire when (if!) the size of the set ever drops to zero. This is useful because it can be used to gracefully shut down a service that has a bunch of outstanding requests in flight. For each incoming request (that’s handled via a deferred), you add the deferred to the pool. When a signal arrives to tell the service to stop, you stop taking new requests and ask the pool for a deferred that will fire when all the outstanding deferreds are done, then you exit. This can all be done elegantly in Twisted, the last part by having the stopService method return the deferred you get back from the pool (perhaps after you add more cleanup callbacks to it).

Here’s the code:

from twisted.internet import defer

class DeferredPool(object):
    """Maintains a pool of not-yet-fired deferreds and gives a mechanism to
    request a deferred that fires when the pool size goes to zero."
""

    def __init__(self):
        self._pool = set()
        self._waiting = []

    def _fired(self, result, d):
        """Callback/errback each pooled deferred runs when it fires. The
        deferred first removes itself from the pool. If the pool is then
        empty, fire all the waiting deferreds (which were returned by
        notifyWhenEmpty)."
""
        self._pool.remove(d)
        if not self._pool:
            waiting, self._waiting = self._waiting, []
            for waiter in waiting:
                waiter.callback(None)
        return result

    def add(self, d):
        """Add a deferred to the pool."""
        d.addBoth(self._fired, d)
        self._pool.add(d)

    def notifyWhenEmpty(self, testImmediately=True):
        """Return a deferred that fires (with None) when the pool empties.
        If testImmediately is True and the pool is empty, return an already
        fired deferred (via succeed)."
""
        if testImmediately and not self._pool:
            return defer.succeed(None)
        else:
            d = defer.Deferred()
            self._waiting.append(d)
            return d
 

As usual I’m posting this example because I find Twisted’s deferreds so elegant. Here are a few comments on the above that might help you understand deferreds better.

A frequent pattern when creating and managing deferreds is that you can add callbacks and errbacks to them yourself to transparently do some housekeeping when they fire. In this case, for each deferred passed to add, I’m adding a callback and an errback that will run self._fired when the deferred fires. The first thing that method does is take the deferred out of the pool of outstanding deferreds. So the deferred itself cleans up the pool. It does that transparently, by which I mean that the call/errback function (self._fired) always returns whatever result it was passed. It’s on both the callback and errback chains of the deferred and has no effect on the result. The deferred may already have call/errbacks on it when it is passed to add, and it may have them added to it after add is done. Whoever created and is otherwise using the deferred will be none the wiser and is in no way affected.

When a deferred in the pool fires, it also checks to see if the pool size is zero and if there are any deferreds waiting to be informed of that. If so, it fires all the waiting deferreds and empties the list of waiting deferreds. This doesn’t mean the action is necessarily over. More deferreds can be added, more waiters can be added, etc. The pool size can go to zero again and if there are no waiters are waiting, no big deal, etc.

It’s easy to add functionality to e.g., record what time deferreds were added, provide stats, allow outstanding deferreds to be cancelled, add notifications when high/low water marks are reached, etc. But that’s enough for now. Feel free to ask questions below.

The eighty six non-trivial powers ≤ 2^20

Wednesday, March 30th, 2011

Tonight Jamu Kakar mentioned in IRC that a program of his had unexpectedly crashed after processing 1,048,376 items. I think it’s a useful debugging skill to have to be able to recognize numbers like that (it’s very close to 2^20). I’ve often wanted to write a tiny program to print out all the non-trivial powers, and since I have far more important and pressing things to be doing, I immediately went to write the code. At a minimum it seems prudent to recognize all powers up to 1000, and the powers of 2 to much higher. Below you have all 86 non-trivial powers up to 2^20. I don’t know them all, but I wish I did.

  4 = 2^2                  729 = 3^6, 9^3                32768 = 2^15, 8^5
  8 = 2^3                 1000 = 10^3                    38416 = 14^4
  9 = 3^2                 1024 = 2^10, 4^5               46656 = 6^6
 16 = 2^4, 4^2            1296 = 6^4                     50625 = 15^4
 25 = 5^2                 1331 = 11^3                    59049 = 3^10, 9^5
 27 = 3^3                 1728 = 12^3                    65536 = 2^16, 4^8, 16^4
 32 = 2^5                 2048 = 2^11                    78125 = 5^7
 36 = 6^2                 2187 = 3^7                     83521 = 17^4
 49 = 7^2                 2197 = 13^3                   100000 = 10^5
 64 = 2^6, 4^3, 8^2       2401 = 7^4                    104976 = 18^4
 81 = 3^4, 9^2            2744 = 14^3                   117649 = 7^6
100 = 10^2                3125 = 5^5                    130321 = 19^4
121 = 11^2                3375 = 15^3                   131072 = 2^17
125 = 5^3                 4096 = 2^12, 4^6, 8^4, 16^3   160000 = 20^4
128 = 2^7                 4913 = 17^3                   161051 = 11^5
144 = 12^2                5832 = 18^3                   177147 = 3^11
169 = 13^2                6561 = 3^8, 9^4               248832 = 12^5
196 = 14^2                6859 = 19^3                   262144 = 2^18, 4^9, 8^6
216 = 6^3                 7776 = 6^5                    279936 = 6^7
225 = 15^2                8000 = 20^3                   371293 = 13^5
243 = 3^5                 8192 = 2^13                   390625 = 5^8
256 = 2^8, 4^4, 16^2     10000 = 10^4                   524288 = 2^19
289 = 17^2               14641 = 11^4                   531441 = 3^12, 9^6
324 = 18^2               15625 = 5^6                    537824 = 14^5
343 = 7^3                16384 = 2^14, 4^7              759375 = 15^5
361 = 19^2               16807 = 7^5                    823543 = 7^7
400 = 20^2               19683 = 3^9                   1000000 = 10^6
512 = 2^9, 8^3           20736 = 12^4                  1048576 = 2^20, 4^10, 16^5
625 = 5^4                28561 = 13^4

I produced the above with this quick hack:

from collections import defaultdict

powers = defaultdict(list)
lim = 20

for a in range(2, lim + 1):
    for b in range(2, lim + 1):
        n = a ** b
        if n > 2 ** lim:
            break
        powers[n].append((a, b))

for n in sorted(powers.keys()):
    print '%7d = %s' % (n,
                        ', '.join('%d^%d' % (a, b)
                                  for (a, b) in powers[n]))
 

py-narrow-to-class

Friday, January 28th, 2011

I can never understand when I meet programmers who don’t use emacs. As a programmer, you spend inordinate amounts of time in your editor. You call yourself a programmer. You like to automate things. You get frustrated when you can’t take matters into your own hands. You like hacking on things. Right? And so…….. why wouldn’t you be deeply attracted to an editor that is fully programmable? Sure, (emacs) lisp may not be to everyone’s liking, but being able to program your editor is hugely powerful, especially when the programming language comes with an extremely strong library of tools just for editing text inside the editor.

Jamu Kakar, another emacs fan was just over at my place. He didn’t know about narrowing the buffer – to only show one section of it so that you can edit it to your heart’s content (e.g., global search & replace) and then widen it again when you’re done. We were looking at some Python code and I did a search. The class we were looking at was long, and I didn’t know if my search had finished in code that was still part of the same class. I said to Jamu “Emacs Python mode should have a function that narrows the buffer to the current class”.

After he was gone, I was thinking about that, and realized it would be easy to write. It’s all of 8 lines.

(defun py-narrow-to-class nil
  (interactive)
  (save-excursion
    (py-beginning-of-def-or-class t)
    (let
        ((start (point)))
      (py-end-of-def-or-class t)
      (narrow-to-region (point) start))))
 

Not too shabby, and it took less than 5 minutes to write it. In words: here’s a function called py-narrow-to-class that takes no arguments and that I want to call interactively (via M-x py-narrow-to-class). It’s going to go to the start of the current Python class, set a local variable called start to remember that location, then move to the end of the class, and narrow the buffer. That’s wrapped in a (save-excursion) so that when all that moving around and narrowing is done, the cursor will be in the exact spot it was when I started. If I want, I can now assign this function to a single keystroke when I’m in Python mode.

If you don’t think that’s pretty neat, you’re probably not a programmer. Can you do that in your editor?

jsongrep.py – Python for extracting pieces of JSON objects

Thursday, November 25th, 2010

Lots of APIs these days return JSON objects. I love JSON, but reading a raw JSON dump can be awkward. You can print the whole thing using pprint in Python (or your favorite language), but what if you want to grep out and print only parts of the JSON? I was thinking tonight that it would be easy and useful to write a recursive script to do that. It took about half an hour to arrive at this solution:

#!/usr/bin/env python

import sys
import re
import json
from pprint import pprint

def jsongrep(d, patterns):
    try:
        pattern = patterns.pop(0)
    except IndexError:
        pprint(d)
    else:
        if isinstance(d, dict):
            keys = filter(pattern.match, d.keys())
        elif isinstance(d, list):
            keys = map(int,
                       filter(pattern.match,
                              ['%d' % i for i in range(len(d))]))
        else:
            if pattern.match(str(d)):
                pprint(d)
            return
        for item in (d[key] for key in keys):
            jsongrep(item, patterns[:])

if __name__ == '__main__':
    try:
        j = json.loads(sys.stdin.read())
    except ValueError, e:
        print >>sys.stderr, 'Could not load JSON object from stdin.'
        sys.exit(1)

    jsongrep(j, map(re.compile, sys.argv[1:]))

Usage is really simple. Let’s look at a couple of easy examples from the command line:

$ echo '{"hey" : "you"}' | jsongrep.py hey
u'you'

jsongrep.py has matched the “hey” key in the JSON object and printed its value. Let’s do the same thing with a 2-level JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey
{u'there': u'you'}

Again, we see the entire object corresponding to the “hey” key. We can add another argument to drill down into the object

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py hey there
u'you'

As you might hope, you can use a regular expression for an argument:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py 'h.*' '.*'
u'you'

which in this case could have been given more concisely as

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py h .
u'you'

So you can drill down into nested dictionaries quite easily. When jsongrep.py runs out of patterns it just prints whatever’s left. A special case of this is if you give no patterns at all, you get the whole JSON object:

$ echo '{"hey" : { "there" : "you"}}' | jsongrep.py
{u'hey': {u'there': u'you'}}

The regex patterns you pass on the command line are being matched against the keys of JSON objects (Python dicts). If jsongrep.py runs into a list, it will instead match against the list indices like so:

$ echo '{"hey" : { "joe" : ["blow", "xxx" ]}}' | jsongrep.py hey joe 1
u'xxx'

You can see we’ve pulled out just the first list element after matching “hey” and “joe”. So jsongrep.py regex args can be used to navigate your way through both JSON objects and lists.

Now let’s do something more interesting.

Twitter’s API can give you JSON, and the JSON is pretty chunky. For example, if I get my first 100 followers with this command:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones'

there’s 164Kb of output (try it and see). What if I just want the Twitter user names of the people who follow me? Looking at the JSON, I can see it starts with:

[{"profile_background_color":"131516","description":null

Hmm… looks like it’s a list of dictionaries. Let’s print just the first dictionary in the list:

curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py 0

which starts out:

{u'contributors_enabled': False,
 u'created_at': u'Wed Jul 19 00:29:58 +0000 2006',
 u'description': None,
 u'favourites_count': 0,
 u'follow_request_sent': False,
 u'followers_count': 178,
 u'following': False,
 u'friends_count': 67,
 u'geo_enabled': False,
 u'id': 2471,
 u'id_str': u'2471',
 u'lang': u'en',
 u'listed_count': 3,
 u'location': None,
 u'name': u'Roy',
 u'notifications': False,
 u'profile_background_color': u'131516',
 u'profile_background_image_url': u'http://s.twimg.com/a/1288470193/images/themes/theme14/bg.gif',
 u'profile_background_tile': True,
 u'profile_image_url': u'http://a3.twimg.com/profile_images/194788727/roy_with_phone_normal.jpg',
 u'profile_link_color': u'009999',
 u'profile_sidebar_border_color': u'eeeeee',
 u'profile_sidebar_fill_color': u'efefef',
 u'profile_text_color': u'333333',
 u'profile_use_background_image': True,
 u'protected': False,
 u'screen_name': u'wasroykosuge',

and you can see a “screen_name” key in there which looks like what we want. Let’s see the first few:

$ curl 'http://api.twitter.com/1/statuses/followers.json?screen_name=terrycojones' |
jsongrep.py . screen_name | head
u'wasroykosuge'
u'Piiiu_piiiu'
u'350'
u'KardioFit'
u'jrecursive'
u'doodlesockingad'
u'revinprogress'
u'cloudjobs'
u'PointGcomics'
u'lucasbuchala'

Finally, here’s an example using FluidDB‘s new /values HTTP call. I’ll ask FluidDB for all objects matching the query has unionsquareventures.com/portfolio and from those matching objects I’ll pull back the value of the FluidDB tag named fluiddb/about. The result is JSON that starts out like this:

{"results": { "id" : {"93989942-b519-49b4-87de-ac834e6a6161": {"fluiddb/about": {"value": "http://www.outside.in"}}

You can see there’s a 5-level deep nesting of JSON objects. I just want the “value” key on all matching objects. Easy:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value | sort
u'http://amee.cc'
u'http://getglue.com'
u'http://stackoverflow.com'
u'http://tumblr.com'
u'http://www.10gen.com'
u'http://www.boxee.tv'
u'http://www.buglabs.net'
u'http://www.clickable.com'
u'http://www.cv.im'
u'http://www.disqus.com'
u'http://www.etsy.com'
u'http://www.flurry.com'
u'http://www.foursquare.com'
u'http://www.heyzap.com'
u'http://www.indeed.com'
u'http://www.meetup.com'
u'http://www.oddcast.com'
u'http://www.outside.in'
u'http://www.returnpath.net'
u'http://www.shapeways.com'
u'http://www.simulmedia.com'
u'http://www.targetspot.com'
u'http://www.tastylabs.com'
u'http://www.tracked.com'
u'http://www.twilio.com'
u'http://www.twitter.com'
u'http://www.workmarket.com'
u'http://www.zemanta.com'
u'http://zynga.com'

And there you have it, a sorted list of all Union Square Ventures portfolio companies, from the command line, as listed here.

jsongrep.py will also try to match on things that are not objects or lists if it runs into them, so we can refine this list a little. E.g.,

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*ee' | sort
u'http://www.meetup.com'
u'http://amee.cc'
u'http://www.indeed.com'
u'http://www.boxee.tv'

being the USV companies with “ee” somewhere in their URL. Or, for some advanced regex fun, the USV companies whose URLs don’t end in “.com”:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*(?<!\.com)$'
u'http://www.outside.in'
u'http://amee.cc'
u'http://www.cv.im'
u'http://www.buglabs.net'
u'http://www.returnpath.net'
u'http://www.boxee.tv'

Twisted’s Deferred class chainDeferred method is too simplistic?

Saturday, August 14th, 2010

I think it’s worth spending some time thinking about whether chainDeferred in twisted.internet.defer.Deferred is too simplistic. I’ve thought for a while that it could be more helpful in preventing people from doing unintended things and/or cause fewer surprises (e.g., see point #1 in this post).

Those are smaller things that people can obviously live with. But a more important one concerns deferred cancellation. See uncalled.py below for why.

Here are five examples of chainDeferred behavior that I think could be better. In an attempt to fix these, I made a few changes to a copy of defer.py (immodestly called tdefer.py, sorry) which you can grab, with runnable examples, here. The tdefer.py code is meant as a suggested approach. I doubt that it’s bulletproof.

Here are the examples.

boom1.py:

# Normal deferreds: this raises defer.AlreadyCalledError because
# the callback of d1 causes the callback of d2 to be called, but d2 has
# already been cancelled (and hence called).

# With tdefer.py: there is no error because d1.callback will not call
# d2 as it has already been cancelled.

def printCancel(fail):
    fail.trap(defer.CancelledError)
    print ‘cancelled’

def canceller(d):
    print ‘cancelling’

d1 = defer.Deferred()
d2 = defer.Deferred(canceller)
d2.addErrback(printCancel)
d1.chainDeferred(d2)
d2.cancel()
d1.callback(‘hey’)
 

boom2.py:

# Normally: raises defer.AlreadyCalledError because calling d1.callback
# will call d2, which has already been called.

# With tdefer.py: Raises AssertionError: "Can’t callback an already
# chained deferred" because calling callback on a deferred that’s
# already been chained is asking for trouble (as above).

d1 = defer.Deferred()
d2 = defer.Deferred()
d1.chainDeferred(d2)
d2.callback(‘hey’)
d1.callback(‘jude’)
 

uncalled.py:

# Normally: although d2 has been chained to d1, when d1 is cancelled,
# d2′s cancel method is never called. Even calling d2.cancel ourselves
# after the call to d1.cancel has no effect, as d2 has already been
# called.

# With tdefer: both cancel1 and cancel2 are called when d1.cancel is
# called. The additional final call to d2.cancel correctly has no
# effect as d2 has been called (via d1.cancel).

def cancel1(d):
    print ‘cancel one’

def cancel2(d):
    print ‘cancel two’

def reportCancel(fail, which):
    fail.trap(defer.CancelledError)
    print ‘cancelled’, which

d1 = defer.Deferred(cancel1)
d1.addErrback(reportCancel, ‘one’)
d2 = defer.Deferred(cancel2)
d2.addErrback(reportCancel, ‘two’)
d1.chainDeferred(d2)
d1.cancel()
d2.cancel()
 

unexpected1.py:

# Normally: prints "called: None", instead of the probably expected
# "called: hey"

# tdefer.py: prints "called: hey"

def called(result):
    print ‘called:’, result

d1 = defer.Deferred()
d2 = defer.Deferred()
d1.chainDeferred(d2)
d1.addCallback(called)
d1.callback(‘hey’)
 

unexpected2.py:

# Normally: prints
#   called 2: hey
#   called 3: None

# tdefer.py: prints
#   called 2: hey
#   called 3: hey

def report2(result):
    print ‘called 2:’, result

def report3(result):
    print ‘called 3:’, result

d1 = defer.Deferred()
d2 = defer.Deferred().addCallback(report2)
d3 = defer.Deferred().addCallback(report3)
d1.chainDeferred(d2)
d1.chainDeferred(d3)
d1.callback(‘hey’)
 

I wont go into detail here as this post is already long enough. Those are 3 classes of behavior arising from chainDeferred being very simplistic. Comments welcome, of course. Once again, the runnable code is here.

Asynchronous data structures with Twisted Deferreds

Friday, July 23rd, 2010

Earlier this week I gave a talk titled Deferred Gratification (slides) at EuroPython in Birmingham. The title was supposed to hint at how much I love Twisted‘s Deferred class, and that it took me some time to really appreciate it.

While thinking about Deferreds in the days before the talk, I think I put my finger on one of the reasons I find Deferreds so attractive and elegant. The thought can be neatly summarized: Deferreds let you replace synchronous data structures with elegant asynchronous ones.

There are various ways that people get exposed to thinking about programming in an asynchronous style. The great majority do so via building user interfaces, usually with widget libraries wherein one specifies widgets and layouts, sets up event handlers, and then surrenders control to a main loop that thereafter routes events to handlers. Exposure to asynchronous programming via building UIs is becoming much more common as Javascript programmers build client-side web apps that operate asynchronously with web servers (and the “A” in AJAX of course stands for asynchronous). Others become aware of asynchronous programming via writing network code (perhaps using Twisted). Relatively few become aware of asynchronous programming via doing async filesystem I/O.

Because Twisted’s Deferreds don’t have anything to do with UIs or networking or filesystems, you can use them to implement other asynchronous things, like an asynchronous data structure. To show you what I mean, here’s a slightly simplified version of Twisted’s DeferredQueue class, taken from twisted/internet/defer.py:

class DeferredQueue(object):

    def __init__(self):
        # It would be better if these both used collections.deque (see comments section below).
        self.waiting = [] # Deferreds that are expecting an object
        self.pending = [] # Objects queued to go out via get.

    def put(self, obj):
        if self.waiting:
            self.waiting.pop(0).callback(obj)
        else:
            self.pending.append(obj)

    def get(self):
        if self.pending:
            return succeed(self.pending.pop(0))
        else:
            d = Deferred()
            self.waiting.append(d)
            return d
 

I find this extremely elegant, and I’m going to explain why.

But first, think about code for a regular synchronous queue. What happens if you call get on a regular queue that’s empty? Almost certainly one of two things: you’ll get some kind of QueueEmpty error, or else your code will block until some other code puts something into the queue. I.e., you either get a synchronous error or you get a synchronous non-empty response.

If you look at the get method in the code above, you’ll see that if the queue is empty (i.e., the pending list is empty), a new Deferred is made, added to self.waiting, and is immediately returned to the caller. So code calling get on an empty queue doesn’t get an error and doesn’t block, it always gets a result back essentially immediately. How can you get a result from an empty queue? Easy: the result is a Deferred. And because we’re in the asynchronous world, you just attach callbacks (like event handlers in the UI world) to the Deferred, and go on your merry way.

If you can follow that thinking, the rest of the code in the class above should be easy to grok. In put, if there are any outstanding Deferreds (i.e., earlier callers who hit an empty queue and got a Deferred back), the incoming object is given to the first of these by passing it to the callback function of the Deferred (and popping it out of the waiting list). If there are no outstanding Deferreds expecting a result, the incoming object is simply appended to self.pending. On the get side, if the queue (i.e., self.pending) is non-empty, the code creates a Deferred that has already been fired (using the succeed helper function) and returns it.

By now, this kind of code seems routine and straightforward to me. But it certainly wasn’t always that way. So if the above seems cryptic or abstract, I encourage you to think it over, maybe write some code, ask questions below, etc. To my mind, these kinds of constructs – simple, elegant, robust, practical, obviously(?) bug-free, single-threaded, etc. – are extremely instructive. You too can solve potentially gnarly (at least in the threaded world) networking coding challenges in very simple ways just by building code out of simple Twisted Deferreds (there’s nothing special about the DeferredQueue class, it’s just built using primitive Deferreds).

For extra points, think about the concept of asynchronous data structures. I find that concept very attractive, and putting my finger on it helped me to see part of the reason why I think Deferreds are so great. (In fact, as you can see from the use of succeed, Deferreds are not specifically asynchronous or synchronous: they’re so simple they don’t even care, and as a consumer of Deferreds, your code doesn’t have to either). That’s all remarkably elegant.

I sometimes hope O’Reilly will publish a second volume of Beautiful Code and ask me to write a chapter. I know exactly what I’d write about :-)

Twisted code for retrying function calls

Thursday, November 12th, 2009

These days I often find myself writing code to talk to services that are periodically briefly unavailable. An error of some kind occurs and the correct (and documented) action to take is just to retry the original call a little later. Examples include using Amazon’s S3 service and the Twitter API. In both of these services, transient failures happen fairly frequently.

So I wrote the Twisted class below to retry calls, and tried to make it fairly general. I’d be happy to hear comments on it, because it’s pretty simple and if it can be made bullet proof I imagine others will use it too.

In case you’re not familiar with Twisted and it’s not clear, the call retrying in the below is scheduled by the Twisted reactor. This all asynchronous event-based code that will not block (assuming the function you pass in also does not).

First off, here’s the class that handles the calling:

from twisted.internet import reactor, defer, task
from twisted.python import log, failure

class RetryingCall(object):
    """Calls a function repeatedly, passing it args and kw args. Failures
    are passed to a user-supplied failure testing function. If the failure
    is ignored, the function is called again after a delay whose duration
    is obtained from a user-supplied iterator. The start method (below)
    returns a deferred that fires with the eventual non-error result of
    calling the supplied function, or fires its errback if no successful
    result can be obtained before the delay backoff iterator raises
    StopIteration.
    "
""
    def __init__(self, f, *args, **kw):
        self._f = f
        self._args = args
        self._kw = kw
       
    def _err(self, fail):
        if self.failure is None:
            self.failure = fail
        try:
            fail = self._failureTester(fail)
        except:
            self._deferred.errback()
        else:
            if isinstance(fail, failure.Failure):
                self._deferred.errback(fail)
            else:
                log.msg(‘RetryingCall: Ignoring %r’ % (fail,))
                self._call()

    def _call(self):
        try:
            delay = self._backoffIterator.next()
        except StopIteration:
            log.msg(‘StopIteration in RetryingCall: ran out of attempts.’)
            self._deferred.errback(self.failure)
        else:
            d = task.deferLater(reactor, delay,
                                self._f, *self._args, **self._kw)
            d.addCallbacks(self._deferred.callback, self._err)

    def start(self, backoffIterator=None, failureTester=None):
        self._backoffIterator = iter(backoffIterator or simpleBackoffIterator())
        self._failureTester = failureTester or (lambda _: None)
        self._deferred = defer.Deferred()
        self.failure = None
        self._call()
        return self._deferred
 

You call the constructor with your function and the args it should be called with. Then you call start() to get back a deferred that will eventually fire with the result of the call, or an error. BTW, I called it “start” to mirror twisted.internet.task.LoopingCall.

There’s a helper function for producing successive inter-call delays:

from operator import mul
from functools import partial

def simpleBackoffIterator(maxResults=10, maxDelay=120.0, now=True,
                          initDelay=0.01, incFunc=None):
    assert maxResults > 0
    remaining = maxResults
    delay = initDelay
    incFunc = incFunc or partial(mul, 2.0)
   
    if now:
        yield 0.0
        remaining -= 1
       
    while remaining > 0:
        yield (delay if delay < maxDelay else maxDelay)
        delay = incFunc(delay)
        remaining -= 1
 

By default this will generate the sequence of inter-call delays 0.0, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56 and it should be easy to see how you could write your own. Or you can just supply a list, etc. When the backoff iterator finishes, the RetryingCall class gives up on trying to get a non-error result from the function. In that case errback is called on the deferred that start() returns, with the failure from the first call.

You get to specify a function for testing failures. If it ever raises or returns a failure, the start() deferred’s errback is called. The failure tester can just ignore whatever failures should be considered transient.

So, for example, if you were calling S3 and wanted to ignore 504 errors, you could supply a failureTester arg like this:

    from twisted.web import error, http

    def test(self, failure):
        failure.trap(error.Error)
        if int(failure.value.status) != http.GATEWAY_TIMEOUT:
            return failure
 

As another example, while using the Twitter API you might want to allow a range of HTTP errors and also exactly one 404 error, seeing as a 404 might be an error on the part of Twitter (I don’t mean to suggest that actually happens). It’s probably definitive – but, why not try it once again just to be more sure? So, pass RetryingCall a failureTester that’s an instance of a class like this:

class TwitterFailureTester(object):
    okErrs = (http.INTERNAL_SERVER_ERROR,
              http.BAD_GATEWAY,
              http.SERVICE_UNAVAILABLE)

    def __init__(self):
        self.seen404 = False

    def __call__(self, failure):
        failure.trap(error.Error)
        status = int(failure.value.status)
        if status == http.NOT_FOUND:
            if self.seen404:
                return failure
            else:
                self.seen404 = True
        elif status not in self.okErrs:
            return failure
 

Changing existing code to use RetryingCall is pretty trivial. Take something like this

from twisted.web import client

def getUserByScreenname(screenname):
    d = client.getPage(
        ‘http://twitter.com/users/show.json?screen_name=glyf’)
    return d
 

and change it to look like this:

def getUserByScreenname(screenname):
    r = RetryingCall(client.getPage,
        ‘http://twitter.com/users/show.json?screen_name=glyf’)
    d = r.start(failureTester=TwitterFailureTester())
    return d
 

I wrote this about 10 days ago and posted it to the Twisted mailing list. No-one replied to say how horrible the code is or that it shoud be done another way, which is a pretty good sign. The above includes an improvement suggested by Tim Allen, and is slightly more useful than the code I posted originally (see the thread on the Twisted list for details).

Fault-tolerant Python Twisted classes for getting all Twitter friends or followers

Thursday, October 22nd, 2009

It’s been forever since I blogged here. I just wrote a little Python to grab all of a user’s friends or followers (or just their user ids). It uses Twisted, of course. There were two main reasons for doing this: 1) I want all friends/followers, not just the first bunch returned by the Twitter API, and 2) I wanted code that is fairly robust in the face of various 50x HTTP errors (I regularly experience INTERNAL_SERVER_ERROR, BAD_GATEWAY, and SERVICE_UNAVAILABLE).

If you want to use the code below and you’re not familiar with the Twitter API, consider whether you can use the FriendsIdFetcher and FollowersIdFetcher classes as they’ll do far fewer requests (you get 5000 results per API call, instead of 100). If you can live with user ids and do the occasional fetch of a full user, you’ll probably do far fewer API calls.

For the FriendsFetcher and FollowersFetcher classes, you get back a list of dictionaries, one per user. For FriendsIdFetcher and FollowersIdFetcher you get a list of Twitter user ids.

Of course there’s no documentation. Feel free to ask questions in the comments. Download the source.

import sys

from twisted.internet import defer
from twisted.web import client, error, http
   
if sys.hexversion >= 0x20600f0:
    import json
else:
    import simplejson as json

class _Fetcher(object):
    baseURL = ‘http://twitter.com/’
    URITemplate = None # Override in subclass.
    dataKey = None # Override in subclass.
    maxErrs = 10
    okErrs = (http.INTERNAL_SERVER_ERROR,
              http.BAD_GATEWAY,
              http.SERVICE_UNAVAILABLE)
   
    def __init__(self, name):
        assert self.baseURL.endswith(‘/’)
        self.results = []
        self.errCount = 0
        self.nextCursor = -1
        self.deferred = defer.Deferred()
        self.URL = self.baseURL + (self.URITemplate % { ‘name’ : name })

    def _fail(self, failure):
        failure.trap(error.Error)
        self.errCount += 1
        if (self.errCount < self.maxErrs and
            int(failure.value.status) in self.okErrs):
            self.fetch()
        else:
            self.deferred.errback(failure)
       
    def _parse(self, result):
        try:
            data = json.loads(result)
            self.nextCursor = data.get(‘next_cursor’)
            self.results.extend(data[self.dataKey])
        except Exception:
            self.deferred.errback()
        else:
            self.fetch()
           
    def _deDup(self):
        raise NotImplementedError(‘Override _deDup in subclasses.’)

    def fetch(self):
        if self.nextCursor:
            d = client.getPage(self.URL + ‘?cursor=%s’ % self.nextCursor)
            d.addCallback(self._parse)
            d.addErrback(self._fail)
        else:
            self.deferred.callback(self._deDup())
        return self.deferred

class _FriendsOrFollowersFetcher(_Fetcher):
    dataKey = u‘users’
   
    def _deDup(self):
        seen = set()
        result = []
        for userdict in self.results:
            uid = userdict[‘id’]
            if uid not in seen:
                result.append(userdict)
                seen.add(uid)
        return result

class _IdFetcher(_Fetcher):
    dataKey = u‘ids’
   
    def _deDup(self):
        # Keep the ids in the order we received them.
        seen = set()
        result = []
        for uid in self.results:
            if uid not in seen:
                result.append(uid)
                seen.add(uid)
        return result

class FriendsFetcher(_FriendsOrFollowersFetcher):
    URITemplate = ‘statuses/friends/%(name)s.json’

class FollowersFetcher(_FriendsOrFollowersFetcher):
    URITemplate = ‘statuses/followers/%(name)s.json’

class FriendsIdFetcher(_IdFetcher):
    URITemplate = ‘friends/ids/%(name)s.json’

class FollowersIdFetcher(_IdFetcher):
    URITemplate = ‘followers/ids/%(name)s.json’
 

Usage is dead simple:

fetcher = FriendsFetcher(‘terrycojones’)
d = fetcher.fetch()
d.addCallback(….) # etc.
 

Enjoy.

Facebook release Tornado and it’s not based on Twisted?

Saturday, September 12th, 2009

Image: Jay Smith

Image: Jay Smith

To their great credit, Facebook have just open-sourced more of their core software. This time it’s Tornado, an asynchronous web server written in Python.

Surely that can only mean one thing: Tornado is based on Twisted. Right?

Incredibly, no. Words fail me on this one. I’ve spent some hours today trying to put my thoughts into order so I could put together a reasonably coherent blog post on the subject. But I’ve failed. So here are some unstructured thoughts.

First of all, I’m not meaning to bash Facebook on this. At Fluidinfo we use their Thrift code. We’ll almost certainly use Scribe for logging at some point, and we’re seriously considering using Cassandra. Esteve Fernandez has put a ton of work into txAMQP to glue together Thrift, Twisted, and AMQP, and in the process became a Thrift committer.

Second, you can understand—or make an educated guess at—what happened: the Facebook programmers, like programmers everywhere, were strongly tempted to just write their own code instead of dealing with someone else’s. It’s not just about learning curves and fixing deficiencies, there are also issues of speed of changes and of control. At Fluidinfo we suffered through six months of maintaining our own set of related patches to Thrift before the Twisted support Esteve wrote was finally merged to trunk. That was painful and the temptation to scratch our own itch, fork, and forget about the official Thrift project was high.

Plus, Twisted suffers from the fact that the documentation is not as good as it could be. I commented at length on this over three years ago. Please read the follow-up posts in that thread for an illustration (one of many) of the maturity of the people running Twisted. Also note that the documentation has improved greatly since then. Nevertheless, Twisted is a huge project, it has tons of parts, and it’s difficult to document and to wrap your head around no matter what.

So you can understand why Facebook might have decided not to use Twisted. In their words:

We ended up writing our own web server and framework after looking at existing servers and tools like Twisted because none matched both our performance requirements and our ease-of-use requirements.

I’m betting it’s that last part that’s the key to the decision not to use Twisted.

But seriously…… WTF?

Twisted is an amazing piece of work, written by some truly brilliant coders, with huge experience doing exactly what Facebook set out to reinvent.

This is where I’m at a loss for words. I think: “what an historic missed opportunity” and “reinventing the wheel, badly” and “no, no, no, this cannot be” and “this is just so short-sighted” and “a crying shame” and many things besides.

Sure, Twisted is difficult to grok. But that’s no excuse. It’s a seriously cool and powerful framework, it’s vastly more sophisticated and useful and extensible than what Facebook have cobbled together. Facebook could have worked to improve twisted.web (which everyone agrees has some shortcomings) which could have benefitted greatly from even a small fraction of the resources Facebook must have put into Tornado. The end result would have been much better. Or Facebook could have just ignored twisted.web and built directly on top of the Twisted core. That would have been great too.

Or Facebook could have had a team of people who knew how to do it better, and produced something better than Twisted. I guess that’s the real frustration here – they’ve put a ton of effort into building something much more limited in scope and vision, and even the piece that they did build looks like a total hack built to scratch their short term needs.

What’s the biggest change in software engineering over the last decade? Arguably it’s the rise of test-driven development. I’m not the only one who thinks so. Yet here we are in late 2009 and Facebook have released a major piece of code with no test suite. Amazing. OK, you could argue this is a minor thing, that it’s not core to Tornado. That argument has some weight, but it’s hard to think that this thing is not a hack.

If you decide to use an asynchronous web framework, do you expect to have to manually set your sockets to be non-blocking? Do you feel like catching EWOULDBLOCK and EAGAIN yourself? Those sorts of things, and their non-portability (even within the world of Linux) are precisely the kinds of things that lead people away from writing C and towards doing rapid development using something robust and portable that looks after the details. They’re precisely the things Twisted takes care of for you, and which (at least in Twisted) work across platforms, including Windows.

It looks like Tornado are using a global reactor, which the Twisted folks have learned the hard way is not the best solution.

Those are just some of the complaints I’ve heard and seen in the Tornado code. I confess I’ve looked only superficially at their code – but more than enough to feel such a sense of lost opportunity. They built a small subsection of Twisted, they’ve done it with much less experience and elegance and hiding of detail than the Twisted code, and the thing doesn’t even come with a test suite. Who knows if it actually works, or when, or where, etc.?

And…. Twisted is so much more. HTTP is just one of many protocols Twisted speaks, including (from their home page): “TCP, UDP, SSL/TLS, multicast, Unix sockets, a large number of protocols including HTTP, NNTP, IMAP, SSH, IRC, FTP, and others”.

Want to build a sophisticated, powerful, and flexible asynchronous internet service in Python? Use Twisted.

A beautiful thing about Twisted is that it expands your mind. Its abstractions (particularly the clean separation and generality of transports, protocols, factories, services, and Deferreds—see here and here and here) makes you a better programmer. As I wrote to some friends in April 2006: “Reading the Twisted docs makes me feel like my brain is growing new muscles.”

Twisted’s deferreds are extraordinarily elegant and powerful, I’ve blogged and posted to the Twisted mailing list about them on multiple occasions. Learning to think in the Twisted way has been a programming joy to me over the last 3 years, so you can perhaps imagine my dismay that a company with the resources of Facebook couldn’t be bothered to get behind it and had to go reinvent the wheel, and do it so poorly. What a pity.

In my case, I threw away an entire year of C code in order to use Twisted in FluidDB. That was a decision I didn’t take lightly. I’d written my own libraries to do lots of low level network communications and RPC – including auto-generating server and client side glue libraries to serialize and unserialize RPC calls and args (a bit like Thrift), plus a server and tons of other stuff. I chucked it because it was too brittle. It was too much of a hack. It wasn’t portable enough. It was too get the details right. It wasn’t extensible.

In other words….. it was too much like Tornado! So I threw it all away in favor of Twisted. As I happily tell people, FluidDB is written in Python so it can use Twisted. It was a question of an amazingly great asynchronous networking framework determining the choice of programming language. And this was done in spite of the fact that I thought the Twisted docs sucked badly. The people behind Twisted were clearly brilliant and the community was great. There was an opportunity to make a bet on something and to contribute. I wish Facebook had made the same decision. It’s everyone’s loss that they did not. What a great pity.

Python code for retrieving all your tweets

Wednesday, June 24th, 2009

Here’s a little Python code to pull back all a user’s Twitter tweets. Make sure you read the notes at bottom in case you want to use it.

import sys, twitter, operator
from dateutil.parser import parse

twitterURL = ‘http://twitter.com’

def fetch(user):
    data = {}
    api = twitter.Api()
    max_id = None
    total = 0
    while True:
        statuses = api.GetUserTimeline(user, count=200, max_id=max_id)
        newCount = ignCount = 0
        for s in statuses:
            if s.id in data:
                ignCount += 1
            else:
                data[s.id] = s
                newCount += 1
        total += newCount
        print >>sys.stderr, "Fetched %d/%d/%d new/old/total." % (
            newCount, ignCount, total)
        if newCount == 0:
            break
        max_id = min([s.id for s in statuses])1
    return data.values()

def htmlPrint(user, tweets):
    for t in tweets:
        t.pdate = parse(t.created_at)
    key = operator.attrgetter(‘pdate’)
    tweets = sorted(tweets, key=key)
    f = open(‘%s.html’ % user, ‘wb’)
    print >>f, """<html><title>Tweets for %s</title>
    <meta http-equiv="
Content-Type" content="text/html;charset=utf-8">
    <body><small>"
"" % user
    for i, t in enumerate(tweets):
        print >>f, ‘%d. %s <a href="%s/%s/status/%d">%s</a><br/>’ % (
            i, t.pdate.strftime(‘%Y-%m-%d %H:%M’), twitterURL,
            user, t.id, t.text.encode(‘utf8′))
    print >>f, ‘</small></body></html>’
    f.close()
   
if __name__ == ‘__main__’:
    user = ‘terrycojones’ if len(sys.argv) < 2 else sys.argv[1]
    data = fetch(user)
    htmlPrint(user, data)
 

Notes:

Fetch all of a user’s tweets and write them to a file username.html (where username is given on the command line).

Output is to a file instead of to stdout as tweet texts are unicode and sys.stdout.encoding is ascii on my machine, which prevents printing non-ASCII chars.

This code uses the Python-Twitter library. You need to get (via SVN) the very latest version, and then you need to fix a tiny bug, described here. Or wait a while and the SVN trunk will be patched.

This worked flawlessly for my 2,300 tweets, but only retrieved about half the tweets of someone who had over 7,000. I’m not sure what happened there.

There are tons of things that could be done to make the output more attractive and useful. And yes, for nitpickers, the code has a couple of slight inefficiencies :-)