I have much better and more important things to do than hack on my ideas for measuring Twitter growth.
But a man’s gotta relax sometime.
So I spent a couple of hours at JFK, and then on the plane, hacking some Python to pull down tweets (is this what other people call Twitter posts?), extract each one's Twitter id and date, convert the dates to integers, write the pairs down a pipe to gnuplot, and put the results onto a graph. I've nothing much to show right now. I need more data.
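For anyone curious, the pipeline looked roughly like this. A minimal sketch, not my actual script: it assumes the statuses/show XML endpoint used below, that the status XML carries <id> and <created_at> elements in Twitter's usual date format, and that gnuplot is on your path.

import calendar
import subprocess
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

def fetch_tweet(tweet_id, user, password):
    # Fetch one status as XML; return (id, epoch seconds), or None on a 404.
    url = 'http://www.twitter.com/statuses/show/%d.xml' % tweet_id
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    try:
        root = ET.parse(opener.open(url)).getroot()
    except urllib.error.HTTPError:
        return None  # deleted, protected, or never assigned
    # Dates look like 'Wed Aug 29 17:12:58 +0000 2007'; convert to an integer.
    created = time.strptime(root.findtext('created_at'),
                            '%a %b %d %H:%M:%S +0000 %Y')
    return int(root.findtext('id')), calendar.timegm(created)

def plot(points):
    # Write (date, id) pairs down a pipe to gnuplot.
    gp = subprocess.Popen(['gnuplot', '-persist'],
                          stdin=subprocess.PIPE, text=True)
    gp.stdin.write("plot '-' using 1:2 with points title 'tweets'\n")
    for tweet_id, seconds in sorted(points, key=lambda p: p[1]):
        gp.stdin.write('%d %d\n' % (seconds, tweet_id))
    gp.stdin.write('e\n')
    gp.stdin.close()

The nice thing about gnuplot's plot '-' form is that it reads the data points inline, which is what makes writing them straight down a pipe so convenient.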
But the story with Twitter ids is apparently not that simple. While you can get tweets from very early on (like #20, which I pointed to earlier), and you can get things like #438484102, a recent one of mine, it's not clear how the intermediate range is populated. Just to get a feel for it, I tried several loops like the following at the shell:
# Probe the id space every 5000 ids, saving each status as $i.xml.
i=5000
while [ $i -lt 200000 ]
do
    wget --http-user terrycojones --http-passwd xxx \
        http://www.twitter.com/statuses/show/$i.xml
    i=`expr $i + 5000`
    sleep 1    # don't hammer the API
done
Most of these requests failed. I doubt that's because there's widespread deleting of tweets by users, so maybe Twitter are using ids that are not sequential.
Of course, if I weren't doing this for the simple joy of programming, I'd start by doing a decent search for the graph I'm trying to make. Failing that, I'd look for someone else online with a bundle of tweets.
I’ll probably let this drop. I should let it drop. But once I get started down the road of thinking about a neat little problem, I sometimes don’t let go. Experience has taught me that it is usually better to hack on it like crazy for 2 days and get it over with. It’s a bit like reading a novel that you don’t want to put down when you know you really should.
One nice sub-problem is deciding where to sample next in the Twitter id space. You can maintain something like a heap of areas, where the area for a pair of sampled tweets is that of the right triangle whose legs are the difference between their ids and the difference between their dates. That probably sounds a bit obscure, but I understand it :-) The gradient of the growth curve is interesting – you probably want more samples where the gradient is changing fastest, and adding the time between tweets to the gradient gives you a triangle whose area you can measure. There are simpler approaches too, like uniform sampling, or some form of binary splitting of interesting regions of id space. Along the way you need to account for requests that give you a 404 – that's a data point about the id space too.
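Here's a rough sketch of the heap idea, under the same assumptions as above: a fetch function like fetch_tweet that returns (id, epoch seconds), or None for a 404. Python's heapq is a min-heap, so areas are negated to pop the biggest triangle first.

import heapq

def area(a, b):
    # Area of the right triangle with legs delta-id and delta-time.
    (id1, t1), (id2, t2) = a, b
    return 0.5 * abs(id2 - id1) * abs(t2 - t1)

def sample(fetch, low, high, budget=100):
    # Adaptively sample ids between two known (id, time) points, spending
    # the request budget on the intervals with the biggest triangles.
    points = [low, high]
    missing = []  # 404s are data about the id space too
    heap = [(-area(low, high), low, high)]
    while heap and budget > 0:
        _, a, b = heapq.heappop(heap)
        mid = (a[0] + b[0]) // 2
        if mid in (a[0], b[0]):
            continue  # nothing left to split here
        # Probe the midpoint; on a 404, record it and try its neighbours.
        point = None
        for probe in (mid, mid + 1, mid - 1):
            if not a[0] < probe < b[0] or budget == 0:
                continue
            budget -= 1
            point = fetch(probe)
            if point is not None:
                break
            missing.append(probe)
        if point is None:
            continue  # give up on this interval
        points.append(point)
        heapq.heappush(heap, (-area(a, point), a, point))
        heapq.heappush(heap, (-area(point, b), point, b))
    return sorted(points), missing

Seeded with, say, tweet #20 at one end and a recent id like #438484102 at the other, it keeps splitting whichever interval is widest in both id and time – exactly where the shape of the curve is least pinned down.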