Add to Technorati Favorites

Sort uniq sort revisited, in modern Python

00:16 June 17th, 2007 by terry. Posted under python, tech. Comments Off on Sort uniq sort revisited, in modern Python

Just after I started messing around with Python, my friend Nelson posted about writing some simple Python to speed up the UNIX sort | uniq -c | sort -nr idiom.

I played with it a bit trying to speed it up, and wrote several versions in Python and Perl. This was actually just my second Python program.

The other night I was re-reading some newer Python (2.5) docs and decided to try applying the latest and greatest Python tools to the problem. I came up with this:

from sys import stdin
from operator import itemgetter
from collections import defaultdict

total = 0
data = defaultdict(int)
freqCache = {}

for line in stdin:
    data[line] += 1
    total += 1

for line, count in sorted(data.iteritems(), key=itemgetter(1), reverse=True):
    frac = freqCache.setdefault(count, float(count) / total)
    print "%7d %f %s" % (count, frac, line),

In trying out various options, I found that defaultdict(int) is hard to beat, though using defaultdict with an inline lambda: 0 or a simple def x(): return 0 are competitive.

In the solution I sent to Nelson, I simply made a list of the data keys and sorted it, passing lambda a, b: -cmp(data[a], data[b]) as a sort comparator. Nelson pointed out that this was a newbie error, as it stops Python from taking full advantage of its blazingly fast internal sort algorithm. But…. overall the code was quite a bit faster than Nelson’s approach which sorted a list of tuples.

So this time round I was pretty sure I’d see a good improvement. The code above just sorts on the counts, and it lets sort use its own internal comparator. Plus it just runs through the data dictionary once to sort and pull out all results – no need to fish into data each time around the print loop. So it seemed like the best of both worlds.

But, this code turns out to be about 10% slower (on my small set of inputs, each of 200-300K lines) than the naive version which extracts data.keys, sorts it using the above lambda, and then digs back into data when printing the results.

It looks nice though.

AddThis Social Bookmark Button

resorting to regular expressions

22:53 June 13th, 2007 by terry. Posted under python, tech. 1 Comment »

I was going to write a much longer set of thoughts on moving to Python, but I don’t have time. Instead I’ll summarize by saying that I programmed for 28 years in various languages before switching to Python nearly 2 years ago.

I like Python. A lot. And there are multiple reasons, which I may go into another time.

One thing that has struck me as very interesting is my use of regular expressions. I came to Python after doing a lot of work in Perl (about 8 years). In Perl I used regular expressions all the time. And I mean every single day, many times a day. I like regular expressions. I understand pretty well how they work. I found multiple errors in the 2nd edition of Mastering Regular Expressions. I made a 20% speedup to version 4.72 of Grepmail with a trivial change to a regex. I put both GNU and Henry Spencer regex support into strsed. I use them in emacs lisp programming and in general day-to-day emacs usage, and in their limited form on the shell command line and in grep.

So given that regular expressions are so powerful, that I well know how to wield them, and that I did so perhaps ten thousand times during those 8 years of Perl, you might expect that I’d use them frequently in Python.

But that’s not the case.

In two years of writing Python almost every day, I think I’ve probably only used regular expressions about 10 times!

I’m not going to speculate now on why that might be the case. I’m writing this partly to see if others (in my huge circle of readers) have experienced something similar. I was prompted to write by an svn check in message of Daniel’s last night. He said:

You know things are bad when you find yourself resorting to regular expressions

And I knew exactly what he meant. When I find myself reaching for the Python pocket guide to refresh my memory on using Python regular expressions, it’s such an unusual event (especially given the contrast mentioned above) that I find myself wondering if maybe I’m doing something really inefficient and unPythonic.

AddThis Social Bookmark Button

iteranything

12:22 May 7th, 2007 by terry. Posted under python, tech. Comments Off on iteranything

Here’s a Python function to iterate over pretty much anything. In the extremely unlikely event that anyone uses this code, note that if you pass keyword arguments the order of the resulting iteration is not defined (as with iterating through any Python dictionary).

from itertools import chain
import types

def iteranything(*args, **kwargs):
    for arg in chain(args, kwargs.itervalues()):
        t = type(arg)
        if t == types.StringType:
            yield arg
        elif t == types.FunctionType:
            for i in arg():
                yield i
        else:
            try:
                i = iter(arg)
            except TypeError:
                yield arg
            else:
                while True:
                    try:
                        yield i.next()
                    except StopIteration:
                        break

if __name__ == '__main__':
    def gen1():
        yield 1
        yield 2

    def gen2():
        yield 3
        yield 4

    assert list(iteranything()) == []
    assert list(iteranything([])) == []
    assert list(iteranything([[]])) == [[]]
    assert list(iteranything([], [])) == []
    assert list(iteranything(3)) == [3]
    assert list(iteranything(3, 4)) == [3, 4]
    assert list(iteranything(3, 4, dog='fido')) == [3, 4, 'fido']
    assert list(iteranything(3, 4, func=gen1)) == [3, 4, 1, 2]
    assert list(iteranything(3, 4, func=gen1())) == [3, 4, 1, 2]
    assert list(iteranything(3, 4, func=iteranything)) == [3, 4]
    assert list(iteranything(3, 4, func=iteranything())) == [3, 4]
    assert list(iteranything(3, 4, func=iteranything('a', 'b', c='z'))) ==
        [3, 4, 'a', 'b', 'z']
    assert list(iteranything(3, 4, func=iteranything('a',
        iteranything(5, 6), c='z'))) == [3, 4, 'a', 5, 6, 'z']
    assert list(iteranything(None, 'xxx', True)) == [None, 'xxx', True]
    assert list(iteranything(3, 4, [5, 6])) == [3, 4, 5, 6]
    assert list(iteranything(3, 4, gen1, gen2)) == [3, 4, 1, 2, 3, 4]
    assert list(iteranything(3, 4, gen1(), gen2())) == [3, 4, 1, 2, 3, 4]
    assert list(iteranything(1, 2, iteranything(3, 4), None)) ==
        [1, 2, 3, 4, None]
    assert list(iteranything(1, 2, iteranything(3, iteranything(1, 2,
        iteranything(3, 4), None)))) == [1, 2, 3, 1, 2, 3, 4, None]
AddThis Social Bookmark Button