Top data blogs information now in Fluidinfo, with an API

February 12th, 2011 by Terry Jones. Filed under Data, Howto.

Image: Education Week

A bit over a month ago, Marshall Kirkpatrick of Read Write Web made lists of the Top 300 blogs about data and the Top 300 blogs about geo. As soon as I saw the lists, I added the data to Fluidinfo and emailed Marshall.

I added marshallk.com/top-blogs/data and marshallk.com/top-blogs/geo tags to the Fluidinto objects that correspond to the URLs in his lists (Fluidinfo has an object for everything; in each case I put the tags onto the logical object in Fluidinfo: the one object whose fluiddb/about value is the URL in question.)


You can then do things like this:

$ curl 'http://fluiddb.fluidinfo.com/values?query=marshallk.com/top-blogs/data%3c%3d10&tag=fluiddb/about&tag=marshallk.com/top-blogs/data' |
jsongrep.py results id '.*'
{u'fluiddb/about': {u'value': u'http://www.readwriteweb.com/cloud'},
 u'marshallk.com/top-blogs/data': {u'value': 3}}
{u'fluiddb/about': {u'value': u'http://cloud.gigaom.com'},
 u'marshallk.com/top-blogs/data': {u'value': 2}}
{u'fluiddb/about': {u'value': u'http://flowingdata.com'},
 u'marshallk.com/top-blogs/data': {u'value': 8}}
{u'fluiddb/about': {u'value': u'http://highscalability.com'},
 u'marshallk.com/top-blogs/data': {u'value': 7}}
{u'fluiddb/about': {u'value': u'http://www.calculatedriskblog.com'},
 u'marshallk.com/top-blogs/data': {u'value': 6}}
{u'fluiddb/about': {u'value': u'http://www.fivethirtyeight.com'},
 u'marshallk.com/top-blogs/data': {u'value': 5}}
{u'fluiddb/about': {u'value': u'http://www.guardian.co.uk/news/datablog'},
 u'marshallk.com/top-blogs/data': {u'value': 9}}
{u'fluiddb/about': {u'value': u'http://www.informationisbeautiful.net'},
 u'marshallk.com/top-blogs/data': {u'value': 10}}
{u'fluiddb/about': {u'value': u'http://www.zerohedge.com'},
 u'marshallk.com/top-blogs/data': {u'value': 1}}
{u'fluiddb/about': {u'value': u'http://freakonomics.com/blog'},
 u'marshallk.com/top-blogs/data': {u'value': 4}}
 

Those are the top 10 on Marshall’s data list (unsorted, obviously). I’ve cleaned up the output using my jsongrep.py program described and available here.

More interestingly, you can see if any sites are on both of Marshall’s lists:

$ curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20marshallk.com/top-blogs/geo&tag=fluiddb/about'
{"results": {"id": {"a2e56723-453a-44e5-bd91-5576d0615c8e": {"fluiddb/about": {"value": "http://blog.simplegeo.com"}}}}}

Just a single blog is in both lists: http://blog.simplegeo.com.

So far, so good.

About half an hour ago, I saw a tweet from Daniel Tunkelang (the mind behind TunkRank) saying that eCairn have just released some work based on Marshall’s data, producing a list of 500 top data blogs! Cool.

So I’ve just imported that data to Fluidinfo too, adding a ecairn.com/top-data-blogs tag to the object for each URL on their list. The value of each tag, as with Marshall’s data, is the ranking on the eCairn list.

Let’s see how many blogs are on both lists:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20ecairn.com/top-data-blogs&tag=fluiddb/about' | jsongrep.py results id '.*' | wc -l
117
 

Not as many as I expected. But there are some small differences in the URLs used, for example Marshall’s list had http://kaushik.net/avinash whereas the eCairn list has http://www.kaushik.net/avinash. This would be easy to clean up, and of course it’s also possible just to tag the object for both URLs in Fluidinfo.

You can do the Fluidinfo query has marshallk.com/top-blogs/data except has ecairn.com/top-data-blogs to see the sites that Marshall has in his list but which do not appear in the eCairn list, such as Marshall’s #12, http://blog.sqlauthority.com. eCairn’s calculation might have put them in the lower 500 of their list of 1000 (the eCairn article only gives their top 500). There are plenty of other interesting queries too, but this post is long enough already.

So there you go, a fun bit of playing with more data blog data with Fluidinfo. One of these days we’ll even make it into one of these lists 🙂

Here’s the tiny bit of Python code I just wrote to add the data. It uses the Python FOM library for Fluidinfo written by Ali Afshar:

import sys
from fom.session import Fluid
fdb = Fluid()
fdb.login('ecairn.com', 'password')

urls = [i[:-1] for i in sys.stdin.readlines()]

for rank, url in enumerate(urls):
    fdb.about[url]['ecairn.com/top-data-blogs'].put(rank + 1)