I have much better and more important things to do than hack on my ideas for measuring Twitter growth.
But a man’s gotta relax sometime.
So I spent a couple of hours at JFK, and then on the plane, hacking some Python to pull down tweets (is this what other people call Twitter posts?), extract each one's Twitter id and date, convert the dates to integers, write the pairs down a pipe to gnuplot, and put the results onto a graph. I've nothing much to show right now. I need more data.
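For anyone curious, the pipeline looked roughly like this. A minimal sketch, not my actual script: it assumes the statuses/show XML endpoint used below, that the status XML carries <id> and <created_at> elements in Twitter's usual date format, and that gnuplot is on your path.

import calendar
import subprocess
import time
import urllib.error
import urllib.request
import xml.etree.ElementTree as ET

def fetch_tweet(tweet_id, user, password):
    # Fetch one status as XML; return (id, epoch seconds), or None on a 404.
    url = 'http://www.twitter.com/statuses/show/%d.xml' % tweet_id
    mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, url, user, password)
    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(mgr))
    try:
        root = ET.parse(opener.open(url)).getroot()
    except urllib.error.HTTPError:
        return None  # deleted, protected, or never assigned
    # Dates look like 'Wed Aug 29 17:12:58 +0000 2007'; convert to an integer.
    created = time.strptime(root.findtext('created_at'),
                            '%a %b %d %H:%M:%S +0000 %Y')
    return int(root.findtext('id')), calendar.timegm(created)

def plot(points):
    # Write (date, id) pairs down a pipe to gnuplot.
    gp = subprocess.Popen(['gnuplot', '-persist'],
                          stdin=subprocess.PIPE, text=True)
    gp.stdin.write("plot '-' using 1:2 with points title 'tweets'\n")
    for tweet_id, seconds in sorted(points, key=lambda p: p[1]):
        gp.stdin.write('%d %d\n' % (seconds, tweet_id))
    gp.stdin.write('e\n')
    gp.stdin.close()

The nice thing about gnuplot's plot '-' form is that it reads the data points inline, which is what makes writing them straight down a pipe so convenient.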
But the story with Twitter ids is apparently not that simple. While you can get tweets from very early on (like #20, which I pointed to earlier), and you can get things like #438484102, a recent one of mine, it's not clear how the intermediate range is populated. Just to get a feel for it, I tried several loops like the following at the shell:
# Probe the id space every 5000 ids, saving each status as $i.xml.
i=5000
while [ $i -lt 200000 ]
do
    wget --http-user terrycojones --http-passwd xxx \
        http://www.twitter.com/statuses/show/$i.xml
    i=`expr $i + 5000`
    sleep 1    # don't hammer the API
done
Most of these requests failed. I doubt that's because there's widespread deleting of tweets by users, so maybe Twitter are using ids that are not sequential.
Of course, if I weren't doing this for the simple joy of programming, I'd start by doing a decent search for the graph I'm trying to make. Failing that, I'd look for someone else online with a bundle of tweets.
I’ll probably let this drop. I should let it drop. But once I get started down the road of thinking about a neat little problem, I sometimes don’t let go. Experience has taught me that it is usually better to hack on it like crazy for 2 days and get it over with. It’s a bit like reading a novel that you don’t want to put down when you know you really should.
One nice sub-problem is deciding where to sample next in the Twitter id space. You can maintain something like a heap of areas, where the area for a pair of sampled tweets is that of the right triangle whose legs are the difference between their ids and the difference between their dates. That probably sounds a bit obscure, but I understand it :-) The gradient of the growth curve is interesting – you probably want more samples where the gradient is changing fastest, and adding the time between tweets to the gradient gives you a triangle whose area you can measure. There are simpler approaches too, like uniform sampling, or some form of binary splitting of interesting regions of id space. Along the way you need to account for requests that give you a 404 – that's a data point about the id space too.
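Here's a rough sketch of the heap idea, under the same assumptions as above: a fetch function like fetch_tweet that returns (id, epoch seconds), or None for a 404. Python's heapq is a min-heap, so areas are negated to pop the biggest triangle first.

import heapq

def area(a, b):
    # Area of the right triangle with legs delta-id and delta-time.
    (id1, t1), (id2, t2) = a, b
    return 0.5 * abs(id2 - id1) * abs(t2 - t1)

def sample(fetch, low, high, budget=100):
    # Adaptively sample ids between two known (id, time) points, spending
    # the request budget on the intervals with the biggest triangles.
    points = [low, high]
    missing = []  # 404s are data about the id space too
    heap = [(-area(low, high), low, high)]
    while heap and budget > 0:
        _, a, b = heapq.heappop(heap)
        mid = (a[0] + b[0]) // 2
        if mid in (a[0], b[0]):
            continue  # nothing left to split here
        # Probe the midpoint; on a 404, record it and try its neighbours.
        point = None
        for probe in (mid, mid + 1, mid - 1):
            if not a[0] < probe < b[0] or budget == 0:
                continue
            budget -= 1
            point = fetch(probe)
            if point is not None:
                break
            missing.append(probe)
        if point is None:
            continue  # give up on this interval
        points.append(point)
        heapq.heappush(heap, (-area(a, point), a, point))
        heapq.heappush(heap, (-area(point, b), point, b))
    return sorted(points), missing

Seeded with, say, tweet #20 at one end and a recent id like #438484102 at the other, it keeps splitting whichever interval is widest in both id and time – exactly where the shape of the curve is least pinned down.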