Posted Saturday, November 24th, 2007 at 9:41 pm under companies, me, python, twitter.

Hacking Twitter on JetBlue

I have much better and more important things to do than hack on my ideas for measuring Twitter growth.

But a man’s gotta relax sometime.

So I spent a couple of hours at JFK and then on the plane hacking some Python to pull down tweets (is this what other people call Twitter posts?), pull out their Twitter id and date, convert the dates to integers, write this down a pipe to gnuplot, and put the results onto a graph. I’ve nothing much to show right now. I need more data.
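A minimal sketch of the date-to-integer and gnuplot steps, for illustration only. The timestamp layout in CREATED_AT_FORMAT is an assumption about how the dates look, and the names date_to_int, gnuplot_script and plot are hypothetical, not taken from the actual script.

```python
import subprocess
from datetime import datetime

# Assumed timestamp layout, e.g. "Sat Nov 24 21:41:00 +0000 2007".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def date_to_int(created_at):
    """Convert a timestamp string to whole seconds since the Unix epoch."""
    return int(datetime.strptime(created_at, CREATED_AT_FORMAT).timestamp())


def gnuplot_script(points):
    """Build a gnuplot script that plots (seconds, tweet_id) pairs inline."""
    lines = [
        'set xlabel "seconds since epoch"',
        'set ylabel "tweet id"',
        "plot '-' using 1:2 with linespoints title 'tweets'",
    ]
    lines += [f"{seconds} {tweet_id}" for seconds, tweet_id in sorted(points)]
    lines.append("e")  # terminates gnuplot's inline data block
    return "\n".join(lines) + "\n"


def plot(points):
    """Write the script down a pipe to gnuplot (needs gnuplot on the PATH)."""
    subprocess.run(["gnuplot", "-persist"], input=gnuplot_script(points),
                   text=True, check=True)
```

Calling plot() with a list of (date_to_int(stamp), tweet_id) pairs would draw the growth curve. gnuplot_script() can be inspected on its own without gnuplot installed.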

But the story with Twitter ids is apparently not that simple. While you can get tweets from very early on (like #20 that I pointed to earlier), and you can get things like #438484102 which is a recent one of mine, it’s not clear how the intermediate range is populated. Just to get a feel for it, I tried several loops like the following at the shell:

i=5000

while [ $i -lt 200000 ]
do
  wget --http-user terrycojones --http-passwd xxx \
    http://www.twitter.com/statuses/show/$i.xml
  i=`expr $i + 5000`
  sleep 1
done

Most of these requests failed. I doubt that’s because there’s widespread deleting of tweets by users. So maybe Twitter is using ids that are not sequential.

Of course if I wasn’t doing this for the simple joy of programming I’d start by doing a decent search for the graph I’m trying to make. Failing that I’d look for someone else online with a bundle of tweets.

I’ll probably let this drop. I should let it drop. But once I get started down the road of thinking about a neat little problem, I sometimes don’t let go. Experience has taught me that it is usually better to hack on it like crazy for 2 days and get it over with. It’s a bit like reading a novel that you don’t want to put down when you know you really should.

One nice sub-problem is deciding where to sample next in the Twitter id space. You can maintain something like a heap of areas, where an area is the size of the triangle defined by two tweets: the gap between their ids is one side and the gap between their dates is the other. That probably sounds a bit obscure, but I understand it :-) The gradient of the growth curve is interesting, and you probably want more samples where the gradient is changing fastest. Combining the time between two tweets with the gradient between them gives you a triangle whose area you can measure. There are simpler approaches too, like uniform sampling, or some form of binary splitting of interesting regions of id space. Along the way you need to account for pages that give you a 404. That’s a data point about the id space too.
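One way the heap-of-areas idea might be sketched is below. This is one reading of it, not a tested sampler: the triangle is taken to be a right triangle with the id gap and the time gap as its legs, lookup is a hypothetical callable standing in for the HTTP fetch (it returns a time in seconds, or None for a 404), and retrying a few ids above a missing midpoint is an assumed policy.

```python
import heapq


def triangle_area(a, b):
    """Area of the right triangle spanned by two (tweet_id, seconds) samples."""
    return 0.5 * abs(b[0] - a[0]) * abs(b[1] - a[1])


def adaptive_sample(lookup, first, last, budget, retries=3):
    """Probe ids between two known samples, largest triangle first.

    lookup(tweet_id) returns the tweet's time in seconds, or None for a 404.
    Returns (samples, missing): a dict of found id -> seconds, and the set
    of ids that came back as 404.
    """
    samples = {first[0]: first[1], last[0]: last[1]}
    missing = set()
    # heapq is a min-heap, so store negated areas to pop the largest first.
    heap = [(-triangle_area(first, last), first, last)]
    while heap and budget > 0:
        _, lo, hi = heapq.heappop(heap)
        mid = (lo[0] + hi[0]) // 2
        # Try the midpoint, then a few ids just above it if it is a 404.
        for tweet_id in range(max(mid, lo[0] + 1), min(mid + retries, hi[0])):
            if budget == 0:
                break
            budget -= 1
            seconds = lookup(tweet_id)
            if seconds is None:
                missing.add(tweet_id)  # a 404 is a data point too
                continue
            found = (tweet_id, seconds)
            samples[tweet_id] = seconds
            for a, b in ((lo, found), (found, hi)):
                if b[0] - a[0] > 1:  # room left for another probe
                    heapq.heappush(heap, (-triangle_area(a, b), a, b))
            break
    return samples, missing
```

The budget caps the total number of lookups, so the politeness of the sampler (one request per second, say) stays under the caller's control. An interval whose probes all 404 is simply dropped, though the misses are still recorded.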

  • Pingback: fluidinfo » Blog Archive » How many users does Twitter have?

  • http://jon.es terry

    BTW Esteve, did you know that [ used to (only) be an executable in /bin?

    It's still there, probably for compatibility with old shells and scripts that explicitly use /bin/[ for some reason. But years ago [ got built in to the shell for speed reasons. It's the same as test, which it is typically a symlink to, except if you call it as [ you need to pass a final syntactic-sugar ] argument too. Wacky.

    Old-timer shell programmers grew up being taught to use case where possible because it was built into the shell and didn’t need an extra process to be forked.

  • http://jon.es terry

    You’re right, of course. I guess I’m just old :-)

    At least in example code I tend to avoid special things like seq (and there used to be another useful tool like seq called jot). There’s also arithmetic built into some shells, so in bash I could have used

    for ((i = 5000; i < 200000; i += 5000))
    

    or

    while [ $((i++)) -lt 200000 ]
    

    Anyway, thanks. I did in fact use expr. I always meant to make myself use arithmetic in bash, given that bash is pretty much ubiquitous now.

    BTW, I still sometimes use expr for regexp matching. That's even more ugly and prehistoric :-)

  • esteve

    i=5000

    while [ $i -lt 200000 ]
    do
    wget --http-user terrycojones --http-passwd xxx \
    http://www.twitter.com/statuses/show/$i.xml
    i=`expr $i + 5000`
    sleep 1
    done

    spawning a process inside a loop for just incrementing a variable is a big NO :-) It’s better to do something like this if “seq” is available on your machine:

    for i in `seq 5000 5000 195000`
    do
    wget --http-user terrycojones --http-passwd xxx \
    http://www.twitter.com/statuses/show/$i.xml
    sleep 1
    done
