Add to Technorati Favorites

Sort uniq sort revisited, in modern Python

Just after I started messing around with Python, my friend Nelson posted about writing some simple Python to speed up the UNIX sort | uniq -c | sort -nr idiom.

I played with it a bit trying to speed it up, and wrote several versions in Python and Perl. This was actually just my second Python program.

The other night I was re-reading some newer Python (2.5) docs and decided to try applying the latest and greatest Python tools to the problem. I came up with this:

from sys import stdin
from operator import itemgetter
from collections import defaultdict

total = 0
data = defaultdict(int)
freqCache = {}

for line in stdin:
    data[line] += 1
    total += 1

for line, count in sorted(data.iteritems(), key=itemgetter(1), reverse=True):
    frac = freqCache.setdefault(count, float(count) / total)
    print "%7d %f %s" % (count, frac, line),

In trying out various options, I found that defaultdict(int) is hard to beat, though using defaultdict with an inline lambda: 0 or a simple def x(): return 0 are competitive.

In the solution I sent to Nelson, I simply made a list of the data keys and sorted it, passing lambda a, b: -cmp(data[a], data[b]) as a sort comparator. Nelson pointed out that this was a newbie error, as it stops Python from taking full advantage of its blazingly fast internal sort algorithm. But…. overall the code was quite a bit faster than Nelson’s approach which sorted a list of tuples.

So this time round I was pretty sure I’d see a good improvement. The code above just sorts on the counts, and it lets sort use its own internal comparator. Plus it just runs through the data dictionary once to sort and pull out all results – no need to fish into data each time around the print loop. So it seemed like the best of both worlds.

But, this code turns out to be about 10% slower (on my small set of inputs, each of 200-300K lines) than the naive version which extracts data.keys, sorts it using the above lambda, and then digs back into data when printing the results.

It looks nice though.


You can follow any responses to this entry through the RSS 2.0 feed. Both comments and pings are currently closed.

Comments are closed.