Archive for October, 2008

Twitter’s amazing stickiness (with a caveat)

Friday, October 31st, 2008

I just followed a link to a site that shows the date of the first tweet of 50 early Twitter users. I wondered how many of these early users were still active, and guessed that many would be.

Instead of going and fetching each user’s last tweet by hand, I wrote a little shell script to do all the work:

# Scrape the usernames of the earliest users from My First Tweet,
# then fetch the date of each one's most recent tweet via the Twitter API.
for name in \
  `curl -s http://myfirsttweet.com/oldest.php |
   perl -p -e 's,<a href="http://myfirsttweet.com/1st/(\w+)">,\nNAME:\t$1\n,g' |
   egrep '^NAME:' |
   cut -f2 |
   uniq`
do
    echo $name \
      `curl -s "http://twitter.com/statuses/user_timeline/$name.xml?count=1" |
       grep created_at |
       cut -f2 -d\> |
       cut -f1 -d\<`
done
 

Who wouldn’t want to be a (UNIX) programmer!?

And the output, massaged into an HTML table:

User Last tweeted on
jack Thu Oct 30 03:41:49 +0000 2008
biz Thu Oct 30 22:24:12 +0000 2008
Noah Tue Oct 28 22:56:15 +0000 2008
adam Thu Oct 30 21:34:56 +0000 2008
tonystubblebine Fri Oct 31 00:53:38 +0000 2008
dom Thu Oct 30 20:36:31 +0000 2008
rabble Fri Oct 31 00:56:28 +0000 2008
kellan Fri Oct 31 00:32:44 +0000 2008
sarahm Thu Oct 30 22:45:37 +0000 2008
dunstan Thu Oct 30 23:59:57 +0000 2008
stevej Fri Oct 31 00:12:03 +0000 2008
lemonodor Thu Oct 30 18:21:43 +0000 2008
blaine Wed Oct 29 23:52:06 +0000 2008
rael Fri Oct 31 01:02:58 +0000 2008
bob Fri Oct 31 00:39:18 +0000 2008
graysky Fri Oct 31 00:23:21 +0000 2008
veen Thu Oct 30 19:47:40 +0000 2008
dens Fri Oct 31 00:13:12 +0000 2008
heyitsnoah Thu Oct 30 20:09:35 +0000 2008
rodbegbie Thu Oct 30 23:42:39 +0000 2008
astroboy Thu Oct 30 22:07:50 +0000 2008
alba Thu Oct 30 16:06:29 +0000 2008
kareem Thu Oct 30 20:20:14 +0000 2008
gavin Thu Oct 30 17:48:45 +0000 2008
nick Fri Oct 31 01:17:29 +0000 2008
psi Thu Oct 30 20:40:53 +0000 2008
vertex Fri Oct 31 00:44:09 +0000 2008
mulegirl Fri Oct 31 00:31:05 +0000 2008
thedaniel Thu Oct 30 20:00:31 +0000 2008
myles Thu Oct 30 15:50:31 +0000 2008
mike ftw Fri Oct 31 00:28:00 +0000 2008
stumblepeach Thu Oct 30 23:20:06 +0000 2008
bunch Sat Oct 25 20:46:42 +0000 2008
adamgiles com Thu Apr 10 17:22:52 +0000 2008
naveen Thu Oct 30 23:24:23 +0000 2008
nph Fri Oct 31 01:53:13 +0000 2008
caterina Tue Oct 28 18:07:32 +0000 2008
rafer Thu Oct 30 19:23:50 +0000 2008
ML Thu Oct 30 15:31:47 +0000 2008
brianoberkirch Thu Oct 30 20:21:43 +0000 2008
joelaz Thu Oct 30 22:03:59 +0000 2008
arainert Fri Oct 31 01:18:43 +0000 2008
tony Sun Oct 26 18:16:02 +0000 2008
brianr Fri Oct 31 01:57:27 +0000 2008
prash Tue Oct 28 22:14:24 +0000 2008
danielmorrison Thu Oct 30 21:37:41 +0000 2008
slack Fri Oct 31 01:26:08 +0000 2008
mike9r Thu Oct 30 21:17:29 +0000 2008
monstro Thu Oct 30 22:28:46 +0000 2008
mat Fri Oct 31 00:26:22 +0000 2008

Wow… look at those dates. Only one of these people has failed to update in the last week!

Here’s the caveat. We don’t know how many early Twitter users are in the My First Tweet database. The data looks suspicious: only 50 Twitter users in a 7-month period? That can’t be right. So it’s possible the My First Tweet database is built by finding currently active tweeters and then looking back to their first posts. If so, my table doesn’t say much about stickiness.

But I find it fairly impressive in any case.

Digging into Twitter following

Monday, October 13th, 2008

This is just a quick post. I have a ton of things I could say about this, but they’ll have to wait – I need to do some real work.

Last night and today I wrote some Python code to dig into the follower and following sets of Twitter users.

I also think I understand better why Twitter is so compelling, but that’s going to have to wait for now too.

You give my program some Twitter user names and it builds a table showing the numbers of followers, number following, etc., for each user. It distinguishes between people you follow who don’t follow you back, and people who follow you whom you don’t follow back.

But the really interesting thing is to look at the intersection of some of these sets between users.

For example, if I follow X and they don’t follow me back, we can assume I have some interest in X. So if I am later followed by Y and it turns out that X follows Y, I might be interested to know that. I might want to follow Y back just because it might bring me to the attention of X, who may then follow me. If I follow Y, I might want to publicly @ message him/her, hoping that he/she might @ message me back, and that X may see it and follow me.
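To make that set arithmetic concrete, here’s a minimal sketch using hand-made example data. In practice the sets would come from the Twitter API; all the names here are invented.

# Invented example data: in reality these sets come from the API.
following = set(['X', 'kellan', 'rabble'])  # people I follow
followers = set(['Y', 'kellan'])            # people who follow me

print 'I follow, not followed back:', following - followers
print 'Follow me, not followed back:', followers - following
print 'Mutual follows:', following & followers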

Stuff like that. If you think that sort of thing isn’t important, or is too detailed or introspective, I’ll warrant you don’t know much about primate social studies. But more on that in another posting too.

As another example use, I plan to forward the emails Twitter sends me when someone new follows me into a variant of my program. It can examine the sets of interest, weight them, and give me an automated recommendation on whether I should follow that person back – or just do the following for me.
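As a sketch of what that weighting might look like (the rule, names, and threshold below are pure invention, just to illustrate the idea):

# Hypothetical rule: follow a newcomer back if enough of the people
# I follow (but who don't follow me back) also follow them.
def follow_back_score(unrequited, their_followers):
    return len(unrequited & their_followers)

unrequited = set(['X'])            # I follow X; X ignores me
their_followers = set(['X', 'Z'])  # people following the newcomer

if follow_back_score(unrequited, their_followers) >= 1:
    print 'Recommendation: follow back'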

There are lots of directions you could push this in, like considering who the person had @ talked to (and whether those people were followers or not) and the content of their tweets (e.g., do they talk about things I am, or am not, interested in?).

Lots.

For now, here are links to a few sample runs. Apologies to the Twitter users I’ve picked on – you guys were on my screen or on my mind (following FOWA).

I’d love to turn these into nice Euler diagrams, but I didn’t find a decent open source package to produce them.

I’m also hoping someone else (or other people) will pick this up and run with it. I’ve got no time for it! I’m happy to send the source code to anyone who wants it. Just follow me on Twitter and ask for it.

Example 1: littleidea compared to sarawinge.
Example 2: swardley compared to voidspace.
Example 3: aweissman compared to johnborthwick.

And finally here’s the result for deWitt, on whose Twitter Python library I based my own code. This is the output you get from the program when you only give it one user to examine.

More soon, I guess.

How many users does Twitter have?

Monday, October 13th, 2008

[Image: Inclusion/Exclusion diagram, from Wikipedia]

Here’s a short summary of a failed experiment using the Principle of Inclusion/Exclusion to estimate how many users Twitter has. I.e., there’s no answer below, just the outline of some quick coding.

I was wondering about this over cereal this morning. I know some folks at Twitter, and I know some folks who have access to the full tweet database, so I could perhaps get that answer just by asking. But that wouldn’t be any fun, and I probably couldn’t blog about it.

I was at FOWA last week and it seemed that absolutely everyone was on Twitter. Plus, they were active users, not people who’d created an account and didn’t use it. If Twitter’s usage pattern looks anything like a power law, as we might expect, there will be many, many inactive or dormant accounts for every one that’s moderately active.

BTW, I’m terrycojones on Twitter. Follow me please, I’m trying to catch Jason Calacanis.

You could have a crack at answering the question by looking at Twitter user ids via the API and trying to estimate how many users there are. I did play with that at one point, at least with tweet ids, but although they increase there are large holes in the id space. And approaches like that have to go through the Twitter API, which limits you to a mere 70 requests per hour – not enough for any serious (and quick) probing.
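For what it’s worth, if ids really were assigned densely and sequentially, and you could sample a few of them uniformly, the classic serial-number (“German tank”) estimator would give you the total. A toy sketch, with invented ids:

# Estimate N from k ids sampled uniformly from 1..N, using
# max(ids) * (k + 1) / k - 1. The ids below are made up.
ids = [1042, 7261, 55731, 198277, 230004]
k = len(ids)
print 'Estimated user count:', max(ids) * (k + 1) / k - 1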

In any case, I was looking at the Twitter Find People page. Go to the Search tab and you can search for users.

I searched for the single letter A, and got around 109K hits. That led me to think I could get a bound on Twitter’s size using the Principle of Inclusion/Exclusion (PIE). (If you don’t know what that is, don’t be intimidated by the math – it’s actually very simple; just consider how you’d count the size of the union of 2 or 3 sets.) The PIE is a beautiful and extremely useful tool in combinatorics and probability theory (some nice examples can be found in Chapter 3 of the introductory text Applied Combinatorics With Problem Solving). The image above comes from the Wikipedia page.

To get an idea of how many Twitter users there are, we can add the number of people with an A in their name to the number with a B in their name, …., to the number with a Z in their name.

That will give us an over-estimate though, as names typically have many letters in them. So we’ll be counting users multiple times in this simplistic sum. That’s where the PIE comes in. The basic idea is that you add the size of a bunch of sets, and then you subtract off the sizes of all the pairwise intersections. Then you add on the sizes of all the triple set intersections, and so on. If you keep going, you get the answer exactly. If you stop along the way you’ll have an upper or lower bound.
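Here’s a tiny sanity check of the principle on three arbitrary sets, using Python’s set type:

# |A u B u C| = |A| + |B| + |C|
#             - |A n B| - |A n C| - |B n C|
#             + |A n B n C|
A = set('twitter')
B = set('users')
C = set('names')

lhs = len(A | B | C)
rhs = (len(A) + len(B) + len(C)
       - len(A & B) - len(A & C) - len(B & C)
       + len(A & B & C))

print lhs, rhs  # both are 10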

So I figured I could add the size of all the single-letter searches and then adjust that downwards using some simple estimates of letter co-occurrence.

That would definitely work.

But then the theory ran full into the reality of Twitter.

To begin with, Twitter gives zero results if you search for S or T. I have no idea why. It gives a result for all other (English) letters. My only theory was that Twitter had anticipated my effort and the missing S and T results were their way of saying Stop That!

Anyway, I put the values for the 24 letters that do work into a Python program and summed them:

count = dict(a = 108938,
             b =  12636,
             c =  13165,
             d =  21516,
             e =  14070,
             f =   5294,
             g =   8425,
             h =   7108,
             i = 160592,
             j =   9226,
             k =  12524,
             l =   8112,
             m =  51721,
             n =  11019,
             o =   9840,
             p =   8139,
             q =   1938,
             r =  10993,
             s =      0,
             t =      0,
             u =   8997,
             v =   4342,
             w =   6834,
             x =   8829,
             y =   8428,
             z =   3245)

upperBoundOnUsers = sum(count.values())
print 'Upper bound on number of users:', upperBoundOnUsers

The total was 515,931.

Remember that that’s a big over-estimate due to duplicate counting.

And unless I really do live in a tech bubble, I think that number is way too small – even without adjusting it using the PIE.

(If we were going to adjust it, we could try to estimate how often pairs of letters co-occur in Twitter user names. That would be difficult as user names are not like normal words. But we could try.)

Looking at the letter frequencies, I found them really strange. I wrote a tiny bit more code, using the English letter frequencies as given on Wikipedia to estimate how many hits I’d have gotten back on a normal set of words. If we assume Twitter user names have an average length of 7, we can print the expected numbers versus the actual numbers like this:

# From http://en.wikipedia.org/wiki/Letter_frequencies
freq = dict(a = 0.08167,
            b = 0.01492,
            c = 0.02782,
            d = 0.04253,
            e = 0.12702,
            f = 0.02228,
            g = 0.02015,
            h = 0.06094,
            i = 0.06966,
            j = 0.00153,
            k = 0.00772,
            l = 0.04025,
            m = 0.02406,
            n = 0.06749,
            o = 0.07507,
            p = 0.01929,
            q = 0.00095,
            r = 0.05987,
            s = 0.06327,
            t = 0.09056,
            u = 0.02758,
            v = 0.00978,
            w = 0.02360,
            x = 0.00150,
            y = 0.01974,
            z = 0.00074)

estimatedUserNameLen = 7

for L in sorted(count.keys()):
    probNotLetter = 1.0 - freq[L]
    probOneOrMore = 1.0 - probNotLetter ** estimatedUserNameLen
    expected = int(upperBoundOnUsers * probOneOrMore)
    print "%s: expected %6d, saw %6d." % (L, expected, count[L])

Which results in:

a: expected 231757, saw 108938.
b: expected  51531, saw  12636.
c: expected  92465, saw  13165.
d: expected 135331, saw  21516.
e: expected 316578, saw  14070.
f: expected  75281, saw   5294.
g: expected  68517, saw   8425.
h: expected 183696, saw   7108.
i: expected 204699, saw 160592.
j: expected   5500, saw   9226.
k: expected  27243, saw  12524.
l: expected 128942, saw   8112.
m: expected  80866, saw  51721.
n: expected 199582, saw  11019.
o: expected 217149, saw   9840.
p: expected  65761, saw   8139.
q: expected   3421, saw   1938.
r: expected 181037, saw  10993.
s: expected 189423, saw      0.
t: expected 250464, saw      0.
u: expected  91732, saw   8997.
v: expected  34301, saw   4342.
w: expected  79429, saw   6834.
x: expected   5392, saw   8829.
y: expected  67205, saw   8428.
z: expected   2666, saw   3245.

You can see there are wild differences here.

While it’s clearly not right to be multiplying the probability of one or more of each letter appearing in a name by the 515,931 figure (because that’s a major over-estimate), you might hope that the results would be more consistent and tell you how much of an over-estimate it was. But the results are all over the place.

I briefly considered writing some code to scrape the search results and calculate the co-occurrence frequencies (and the actual set of letters in user names). Then I noticed that the results don’t always add up. E.g., search for C and you’re told there are 13,190 results. But the results come 19 at a time and there are 660 pages of results (and 19 * 660 = 12,540, which is not 13,190).

At that point I decided not to trust Twitter’s results and to call it quits.

A promising direction (and blog post) had fizzled out. I was reminded of trying to use AltaVista to compute co-citation distances between web pages back in 1996. AltaVista was highly variable in its search results, which made it hard to do mathematics.

I’m blogging this as a way to stop thinking about this question and to see if someone else wants to push on it, or email me the answer. Doing the above only took about 10-15 mins. Blogging it took at least a couple of hours :-(

Finally, in case it’s not clear, there are lots of assumptions in what I did. Some of them:

  • We’re not considering non-English letters (or things like underscores, which are common) in user names.
  • The mean length of Twitter user names is probably not 7.
  • Twitter search returns accounts whose usernames don’t contain the searched-for letter (the letter appears in the user’s real name instead of the username).