Archive for February, 2011

Fluidinfo voted Top Technology Company at LAUNCH in San Francisco

Friday, February 25th, 2011

Wow…. Fluidinfo was just voted “the clear winner” as Top Technology Company out of 100 start-ups in the LaunchPad at the LAUNCH conference in San Francisco!

Thanks to all the judges, especially to Robert Scoble, Marshall Kirkpatrick, Brian Alvey, Naval Ravikant and Mark Pesce. Mark said “Fluidinfo is totally crazy, but it’s the kind of crazy I love.” :-)

It’s weird, because 3 weeks ago I told Jason Calacanis, who suggested in email that we enter, that I didn’t think we should go up on stage at LAUNCH. We don’t (yet) do UI, and start-up events are heavily oriented towards sexy UI and ideas that can be explained in a couple of minutes. We were so busy already that throwing together a demo seemed like a recipe for something mediocre. Instead I asked if we could just hang out in the LaunchPad with the 99 other start-ups and talk to people passing by. Today the judges went through the LaunchPad and talked to all the start-ups. Several told me we should go up on stage to do a 3-minute presentation, so I thought “why not?”, threw together some Keynote slides, opened some browser tabs and went for it. The tabs showed the quick WeMet.At app, written by Nicholas Tollervey in a few hours, and the increasingly great Fluidinfo Explorer, written by Pier-Andre Parent, who we’ve never even met (that link points at the ReadWriteWeb top-level namespace in Fluidinfo).

The whole Fluidinfo team has been working really hard towards LAUNCH for the last 3 weeks. Everything at LAUNCH and recently announced on our blog has been built by all of us. It’s nice to win a prize, because after being funded we decided to be quiet, keep our heads down, build and train a team, and not even try to get people to use Fluidinfo until 2011. In January we began our first outward-facing efforts, building and releasing quite a few writable APIs. It’s still early days, and we have something very cool right around the corner.

Thanks to all the other great startups and the organizers at LAUNCH, especially Jason Calacanis, Tyler Crowley, and Jason Krute. Standing up for 20 hours and talking for about 30 was never so much fun!

Putting domain names onto data with Fluidinfo

Wednesday, February 23rd, 2011

Internet domain names can be thought of as a mechanism for attaching trust and reputation to digital information. We do this in two major ways: (1) by using domain names in the URLs of web pages, and (2) by putting them in the sender’s “From” address of email messages.

To give a concrete example, suppose you see some shoes for sale on a web page. If you look at the page URL and see the amazon.com or zappos.com domain name, trust and reputation knowledge springs instantly to mind. You know the quality is probably good, the price competitive, and that if the shoes are lost in shipping you’ll be sent another pair for no charge. On the other hand, if you see ebay.com in the URL, a different matrix of trust and reputation knowledge will spring to mind. A similar thing happens if you get email from someone you’ve never met. If you see stanford.edu or forbes.com in the email “From” line, reputation information springs to mind.

Looked at in this way, domain names are small tokens that we send alongside other pieces of content such as web pages and emails. The domain name carries vital trust and reputation information. Recognition and trust in domain names is globally distributed, spread variously through the brains of most of the people on the planet, with its integrity guaranteed by DNS. Domain names make the internet useful. Without them, digital information online would be almost useless as we could not confidently trust any 3rd party data.

Question: given that we can attach domain names to web pages and email messages, can we find a way to attach them to other things?

Domain names on data

We’re excited to announce that Fluidinfo now makes it possible to put domain names onto individual pieces of data.

To illustrate, the image on the right shows a fanciful example book object in Fluidinfo (large version). The tag names on the object are colored. You’ll see that some of them contain domain names: amazon.com/price, barnesandnoble.com/price and vintage.com/epub. Tags in Fluidinfo can have values, as illustrated by the amazon.com/price tag whose value for this book is $19.

The combination of a Fluidinfo tag name containing a domain and an associated tag value is exactly like a URL containing a domain name and an associated HTML value (i.e., a web page) or an email message with a domain name in its From line.

Because Fluidinfo objects don’t have owners (their tags do, though), any number of domain owners are free to put their information, branded with their domain name, onto any Fluidinfo object.
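To make the naming convention concrete, here is a minimal Python 3 sketch that picks out the domain-branded tags on an object. The helper names and the "contains a dot" heuristic are assumptions made for illustration; this is not how Fluidinfo itself verifies domains.

```python
def split_tag_path(path):
    """Split 'amazon.com/price' into (top-level namespace, rest of the path)."""
    owner, _, rest = path.partition("/")
    return owner, rest


def looks_like_domain(name):
    """Rough heuristic: two or more non-empty, dot-separated labels."""
    labels = name.split(".")
    return len(labels) >= 2 and all(labels)


# Tag names from the book example, plus one ordinary (non-domain) user tag.
tags = ["amazon.com/price", "barnesandnoble.com/price", "vintage.com/epub", "alice/funny"]
branded = [t for t in tags if looks_like_domain(split_tag_path(t)[0])]
print(branded)
```

The top-level namespace is the user name, so a tag whose first path element is a domain can only have been put there by whoever controls that Fluidinfo user.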

A killer combination: writable APIs with domain-branded data

Fluidinfo automatically provides a writable API for all its data. By allowing for domain names on data, domain holders who want to publish information about their products can now do so with an API that has three major advantages:

  • Your data is branded with your domain name.
  • Your data lives in a writable ecology of related data, collecting on the same Fluidinfo objects. This allows for search across data from different users and domains, put there by different applications. It allows for additional data of all kinds, for mashups, and for customization, personalization, and filtering.
  • Fluidinfo has a flexible permissions system at the level of its tags, so you maintain full control of your own data. You can make it public or private, or can allow or disallow access for specific others.

Because Fluidinfo objects are fine grained, composed simply of tags with values as in the image above, applications can fetch, search on, or combine specific pieces (or combinations) of data provided by different trusted sources with single requests. There is a general principle here: information becomes more useful and valuable when it is stored in context. This is illustrated vividly by Google, which collects web pages into one place to enable search, and by Wikipedia, which allows people to pool related information. Although these examples have very different models of trust and reputation, they both illustrate the underlying principle.

Getting your domain name in Fluidinfo

To start using your domain in Fluidinfo, first sign up, using your domain name as your user name. Our sign-up system will recognize that the username is a domain and will send you an email telling you how to prove that you control the domain. Once that’s done, you can begin using Fluidinfo to upload information branded with your domain and to provide an API for others (or for your own company) to find your products or otherwise use the information you make available.

In other words, all Fluidinfo usernames that correspond to actual internet domains are automatically reserved for their owners. Besides preventing a chaotic land grab, this is how we can guarantee to people seeing information in Fluidinfo that the value of a Fluidinfo tag whose name includes a domain name can be trusted exactly as it would be if that domain appeared in a web page URL or email From address.

So there you have it… domain names on data. We’re very excited to see where this will lead and we’re actively building out some writable APIs with domain-branded data. You can too. Claim your domain name in Fluidinfo right now.

How to make an API in Fluidinfo

Wednesday, February 23rd, 2011

It’s very simple really:

1. Register a domain/user on Fluidinfo

Start here. If you’re registering a domain name then we will require proof of ownership (the instructions explaining how to do this are very simple).

2. Create your namespaces and tags

Be careful, you’re choosing how your data will be structured in Fluidinfo. Some tips we’ve found useful:

  • Flat is good.
  • Use namespaces to differentiate between the different sorts of things you’ll be tagging (e.g. between books and authors).
  • Copy conventions (how do others organise their data?).
  • KISS! Keep it simple (stupid!).

3. Import your data into Fluidinfo

To help you, we have a Python-based script/library called FLIMP. There are also lots of freely available libraries that you may want to adapt yourself.

4. Announce your new API

Programmers will interact with your data via the general Fluidinfo API, which is simple and well documented. All you need to do is tell the world that your data is available, and what namespaces and tags you’re using to store it in Fluidinfo.

That’s it.

Please feel free to get in touch at any time if you have any questions or would like to explore the possibility of Fluidinfo Inc. helping you to add your data to Fluidinfo.

ReadWriteWeb ReadWriteAPI

Wednesday, February 23rd, 2011

Over the weekend I scraped the 11300 or so articles in the ReadWriteWeb archive. These are a great source of technology news and analysis covering stories from 2003 to the present day. Rather than keep this to myself (and rather unsurprisingly) I imported the metadata about each article into Fluidinfo. Hey presto, another instant API emerges!

Here’s how it works. For each article in the ReadWriteWeb archive there is an object in Fluidinfo. Each object has a unique “about” tag-value: the URL of the article. Furthermore, each object is annotated with information using tags found under the readwriteweb.com top level namespace. Tags include title, extract, date, categories and so on. In other words, you might visualize each object something like this:

I’ve also created and annotated objects about each of the authors of ReadWriteWeb articles and tagged objects representing each website ever mentioned by ReadWriteWeb.

So, it’s now possible to use the API like this:

>>> import fluidinfo
>>> returnTags = ['readwriteweb.com/title', 'readwriteweb.com/author-name', 'readwriteweb.com/extract', 'readwriteweb.com/date', ]
>>> query = "readwriteweb.com/year = 2010 and readwriteweb.com/month = 5 and readwriteweb.com/day = 5"
>>> head, result = fluidinfo.call('GET', '/values', tags=returnTags, query=query)
>>> head['status']
'200'
>>> result
{u'results':
    {u'id':
        {u'05936b9b-4c20-4887-9607-f63752e7f274':
            {u'readwriteweb.com/author-name': {u'value': u'Sarah Perez'},
              u'readwriteweb.com/date': {u'value': u'May  5, 2010  7:24 AM'},
              u'readwriteweb.com/extract': {u'value': u"Feel like hacking your phone today? If you've got about 10 minutes to spare, you can turn your iPhone into a Wi-Fi hotspot using a combination of the ..."},
              u'readwriteweb.com/title': {u'value': u'How To Turn Your iPhone into a Wi-Fi Hotspot'}},
        ... etc....
 

What’s just happened..? I used a client library (fluidinfo.py) to ask Fluidinfo to return the author name, publication date, title and an extract of all ReadWriteWeb articles published on the 5th May 2010.
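If you want the same query for a different day, the query string is easy to generate. Here is a minimal Python 3 sketch; `articles_on` is a hypothetical helper (not part of fluidinfo.py) and it only builds the query text in the form used above.

```python
from datetime import date


def articles_on(day, namespace="readwriteweb.com"):
    """Build a Fluidinfo query string matching articles published on `day`."""
    parts = [("year", day.year), ("month", day.month), ("day", day.day)]
    return " and ".join(f"{namespace}/{name} = {value}" for name, value in parts)


query = articles_on(date(2010, 5, 5))
print(query)
```

The resulting string can be passed as the `query` argument in the call shown above.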

Being able to search and extract data from an API is cool, especially since you get this by virtue of simply hosting your data in Fluidinfo. But this is ReadWriteWeb we’re talking about. Happily, Fluidinfo can accommodate.

>>> fluidinfo.login('ntoll', 'mysecretpassword') # change as appropriate
>>> headers, result = fluidinfo.call('PUT', ['about', 'http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php', 'ntoll', 'rating'], 10)
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-type': 'text/html',
 'date': 'Wed, 23 Feb 2011 15:07:29 GMT',
 'server': 'nginx/0.7.65',
 'status': '204'}
 

The example above shows how I sign in and annotate the object “about” the article http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php with a tag called ntoll/rating and an associated value of 10 (obviously I enjoyed this article). The HTTP 204 response status tells me the value was successfully tagged.
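The PUT call takes its path as a list of elements, one of which is a full URL. As a rough illustration, the Python 3 sketch below joins such a list into a request path. It assumes that each element is percent-encoded before being joined with slashes; the function is hypothetical and is not code taken from fluidinfo.py.

```python
from urllib.parse import quote


def path_from_parts(parts):
    """Percent-encode each path element, then join them with slashes."""
    return "/" + "/".join(quote(str(part), safe="") for part in parts)


article = ("http://www.readwriteweb.com/archives/"
           "android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php")
path = path_from_parts(["about", article, "ntoll", "rating"])
print(path)
```

Encoding every element separately keeps the slashes inside the article URL from being mistaken for path separators.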

Let’s just pause here for a moment and consider what I’ve just been able to do. Because Fluidinfo is openly writable I’m able to annotate the objects about ReadWriteWeb articles with my own data. Since objects in Fluidinfo don’t have owners or permissions attached to them I didn’t have to ask ReadWriteWeb for permission to augment the data about the article in question. Furthermore, if I only want my buddies to see what my ratings are I can set the tag to be only visible to a specific group of people. In this way Fluidinfo remains openly writable yet I still retain ownership and control over my data.

We’ve seen “read” and “write”, but what about “web”..?

Well it turns out I can stretch this analogy even further. Because everyone is tagging the same objects (identified by their “about” tag values) the data is being linked by virtue of the context of the object. We’re starting to get a web of linked data (yeah, I know, bear with me on this one…).

Since I can search and retrieve using any of the tags for which I have “read” permission I can start to create really cool mash-ups of data like this:

>>> header, result = fluidinfo.call('GET', '/values', tags=['fluiddb/about', 'boingboing.net/mentioned', 'readwriteweb.com/mentioned'], query="has boingboing.net/mentioned and has readwriteweb.com/mentioned and has unionsquareventures.com/portfolio")
>>> header
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '23528',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=has+boingboing.net%2Fmentioned+and+has+readwriteweb.com%2Fmentioned+and+has+unionsquareventures.com%2Fportfolio&tag=fluiddb%2Fabout&tag=boingboing.net%2Fmentioned&tag=readwriteweb.com%2Fmentioned',
 'content-type': 'application/json',
 'date': 'Wed, 23 Feb 2011 15:24:36 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> len(result['results']['id'])
4
>>> for r in result['results']['id'].values():
...     print r['fluiddb/about']['value']
...
http://www.twitter.com
http://www.etsy.com
http://www.boxee.tv
http://www.meetup.com
 

What..? I’ve just asked Fluidinfo for all the articles from BoingBoing and ReadWriteWeb about companies backed by Union Square Ventures that both BoingBoing and ReadWriteWeb have covered. It turns out there are four companies: Twitter, Etsy, Boxee and Meetup.
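The content-location header above shows the URL that was actually requested, so the same request can be put together without a client library. Here is a minimal Python 3 sketch using only the standard library; the host comes from that header and `values_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE = "https://fluiddb.fluidinfo.com"  # host as shown in the content-location header


def values_url(query, tags):
    """Build a /values URL with one `query` parameter and one `tag` parameter per tag."""
    params = [("query", query)] + [("tag", tag) for tag in tags]
    return BASE + "/values?" + urlencode(params)


url = values_url(
    "has boingboing.net/mentioned and has readwriteweb.com/mentioned "
    "and has unionsquareventures.com/portfolio",
    ["fluiddb/about", "boingboing.net/mentioned", "readwriteweb.com/mentioned"],
)
print(url)
```

Fetching that URL with any HTTP client should return the JSON payload that the client library decoded for us.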

What does one of these results look like..?

{u'boingboing.net/mentioned':
    {u'value': [u'http://boingboing.net/2009/11/06/vampireotherkinenerg.html',
                     u'http://boingboing.net/2010/01/11/ny-times-on-urban-ca.html',
                     u'http://boingboing.net/2010/10/26/ron-paul-supporter-w.html',
                     u'http://boingboing.net/2002/06/27/meetup-meatspace-cam.html',
                     u'http://boingboing.net/2004/03/17/wired-rave-awards.html',
                     u'http://boingboing.net/2006/01/05/net-pug-nabbed-by-cr.html']},
u'fluiddb/about':
    {u'value': u'http://www.meetup.com'},
u'readwriteweb.com/mentioned':
    {u'value':  [u'http://www.readwriteweb.com/archives/meetup_the_secret_campaign_weapon.php']}}
 

What was involved in making such a cool query possible..? Simply importing data into Fluidinfo.
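Once a result like the one above is in hand, ordinary Python takes over. The sketch below (Python 3, with the Meetup result copied in as a plain dictionary) counts how many articles each site has published mentioning the company; `mention_counts` is a hypothetical helper written for this example.

```python
meetup = {
    "fluiddb/about": {"value": "http://www.meetup.com"},
    "boingboing.net/mentioned": {"value": [
        "http://boingboing.net/2009/11/06/vampireotherkinenerg.html",
        "http://boingboing.net/2010/01/11/ny-times-on-urban-ca.html",
        "http://boingboing.net/2010/10/26/ron-paul-supporter-w.html",
        "http://boingboing.net/2002/06/27/meetup-meatspace-cam.html",
        "http://boingboing.net/2004/03/17/wired-rave-awards.html",
        "http://boingboing.net/2006/01/05/net-pug-nabbed-by-cr.html",
    ]},
    "readwriteweb.com/mentioned": {"value": [
        "http://www.readwriteweb.com/archives/meetup_the_secret_campaign_weapon.php",
    ]},
}


def mention_counts(obj):
    """Map each domain with a '<domain>/mentioned' tag to its number of articles."""
    return {
        tag.split("/")[0]: len(data["value"])
        for tag, data in obj.items()
        if tag.endswith("/mentioned")
    }


print(mention_counts(meetup))
```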

I’ll say no more and let you ponder the implications of what I’ve just demonstrated…

How I made a writable API for Union Square Ventures in an hour

Tuesday, February 15th, 2011

Image: Eric Archivell

I was mailing Fred Wilson and Albert Wenger of Union Square Ventures late last year, talking about Fred’s article Giving every person a voice. Fred said

I hadn’t really thought that we are all about shrinking the minimal viable publishing object, but that may well be true in hindsight.

I wanted to illustrate Fluidinfo as doing both: providing a minimal viable way to publish data (with an API), and also giving everyone a voice. So I decided to build Union Square Ventures a minimal API, and to then add my voice. In an hour.

A minimal viable API for USV

USV currently has 30 investments. If you want to get a list of the 30 company URLs, how would you do it? A non-programmer would have no choice but to go to the USV portfolio page, click on each company in turn, right-click on the link to each company’s home page, copy the link address, and add that URL to a list. That process is boring and error prone.

If you’re a programmer though, you’d find this ridiculously manual. You’d much rather do it in one command, for example if you’re collecting information on VC company portfolios, perhaps for research or to get funded. Or you might be building an application, perhaps to do what Jason Calacanis is doing in collecting who’s funding whom on Twitter and Facebook. Either way, you want your application to be able to fetch the list of USV company URLs in one simple call.

So I made a unionsquareventures.com user in Fluidinfo (sign up here), did the repetitive but one-time work of getting their portfolio companies’ URLs out of their HTML (so you wouldn’t have to), and added it to Fluidinfo. I put a unionsquareventures.com/portfolio tag onto the Fluidinfo object about each of those URLs. In other words, because Fluidinfo has an object for everything (including all URLs), I asked it to tag that object.

That was just 7 lines of code using the elegant and simple Python FOM library for Fluidinfo written by Ali Afshar:

import sys
from fom.session import Fluid

fdb = Fluid()
fdb.login('unionsquareventures.com', 'password')
urls = [line.strip() for line in sys.stdin if line.strip()]  # Read portfolio URLs from stdin, skipping blank lines

for url in urls:
    fdb.about[url]['unionsquareventures.com/portfolio'].put(True)
 

As a result, using the jsongrep script I wrote to get neater output from JSON, I can now use curl and the Fluidinfo /values method to get the list of USV portfolio companies in the blink of an eye:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value | sort
u'http://amee.cc'
u'http://getglue.com'
u'http://stackoverflow.com'
u'http://tumblr.com'
u'http://www.10gen.com'
u'http://www.boxee.tv'
u'http://www.buglabs.net'
u'http://www.clickable.com'
u'http://www.cv.im'
u'http://www.disqus.com'
u'http://www.edmodo.com'
u'http://www.etsy.com'
u'http://www.flurry.com'
u'http://www.foursquare.com'
u'http://www.hashable.com'
u'http://www.heyzap.com'
u'http://www.indeed.com'
u'http://www.meetup.com'
u'http://www.oddcast.com'
u'http://www.outside.in'
u'http://www.returnpath.net'
u'http://www.shapeways.com'
u'http://www.simulmedia.com'
u'http://www.soundcloud.com'
u'http://www.targetspot.com'
u'http://www.twilio.com'
u'http://www.twitter.com'
u'http://www.workmarket.com'
u'http://www.zemanta.com'
u'http://zynga.com'

There you have it, a sorted list of all Union Square Ventures portfolio companies’ URLs, from the command line. I can do it, you can do it, and any application can do it.

The jsongrep.py program can also be used to pull out selective pieces of the output. For example, which of the companies have “ee” in their URL?

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*ee' | sort
u'http://www.meetup.com'
u'http://amee.cc'
u'http://www.indeed.com'
u'http://www.boxee.tv'
So maybe, in order to be funded by USV, it helps to have “ee” in your URL? :-)

What about USV companies that don’t have “.com” URLs?

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio&tag=fluiddb/about' |
jsongrep.py results . . fluiddb/about value '.*(?<!\.com)$'
u'http://www.outside.in'
u'http://amee.cc'
u'http://www.cv.im'
u'http://www.buglabs.net'
u'http://www.returnpath.net'
u'http://www.boxee.tv'
OK, these things are geeky, but that’s part of the point of an API: to enable applications to do things. We’ve made the portfolio available programmatically, and you can immediately see how to do fun things with it that you couldn’t easily do before. In fact, it’s quite a bit more interesting than that. As a result of doing this work, I can tell you that there was a company listed a couple of months ago on the portfolio page that is no longer there. And there’s a company that’s been invested in that’s not yet listed. That’s a different subject, but it does illustrate the power of doing things programmatically.
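The two filters above can also be reproduced in plain Python once you have the list of URLs. This is a minimal Python 3 sketch; it assumes the pattern is applied from the start of each value, as `re.match` does, which may not be exactly what jsongrep.py does internally.

```python
import re

# The 30 portfolio URLs returned by the query above.
urls = """http://amee.cc http://getglue.com http://stackoverflow.com http://tumblr.com
http://www.10gen.com http://www.boxee.tv http://www.buglabs.net http://www.clickable.com
http://www.cv.im http://www.disqus.com http://www.edmodo.com http://www.etsy.com
http://www.flurry.com http://www.foursquare.com http://www.hashable.com http://www.heyzap.com
http://www.indeed.com http://www.meetup.com http://www.oddcast.com http://www.outside.in
http://www.returnpath.net http://www.shapeways.com http://www.simulmedia.com
http://www.soundcloud.com http://www.targetspot.com http://www.twilio.com
http://www.twitter.com http://www.workmarket.com http://www.zemanta.com http://zynga.com""".split()

# URLs containing "ee" somewhere.
with_ee = [u for u in urls if re.match(r".*ee", u)]

# URLs that do not end in ".com": the negative lookbehind rejects a ".com" just before the end.
not_dot_com = [u for u in urls if re.match(r".*(?<!\.com)$", u)]

print(sorted(with_ee))
print(sorted(not_dot_com))
```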

This is a minimal viable API for USV because there’s only one piece of information being made available (so far). But an API it is, and it’s already useful.

It’s also writable.

Giving everyone a voice

In a sense we’ve just seen that everyone has a voice. USV put a tag onto the Fluidinfo objects that correspond to the URLs of their portfolio companies and they didn’t have to ask permission to do so.

But what about me? I’m a person too. I’ve met the founders of some of those companies, so I’m going to put a terrycojones/met-a-founder-of tag onto the same objects. Fluidinfo lets me do that because its objects don’t have owners; its permission system is instead based at the level of the tags on the objects.

So I wrote another 7 line program, like the one above, and added those tags. I also added another USV tag, called unionsquareventures.com/company-name. Let’s pull back just the names of the companies whose founders I’ve met:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20unionsquareventures.com/portfolio%20and%20has%20terrycojones/met-a-founder-of&tag=unionsquareventures.com/company-name' |
jsongrep.py results . . . value | sort
u'Bug Labs'
u'Foursquare'
u'GetGlue'
u'Meetup'
u'Shapeways'
u'Stack Overflow'
u'Tumblr'
u'Twitter'
u'Zemanta'

Isn’t that cool? I do indeed have a voice!

You have one too. If you sign up for a Fluidinfo account you can add your own tags and values to anything in Fluidinfo. And you can use Fluidinfo, just as I’ve illustrated above, to make your own writable API. See also: our post from yesterday, What is a writable API?

Interview on writable book APIs & publishing at O’Reilly TOC

Tuesday, February 15th, 2011

Below is an interview I did yesterday with Mac Slocum at the O’Reilly TOC conference in New York. We discuss writable book APIs and why they matter, as well as talking about what that might mean for publishers, readers, and the publishing process in general. (You can also see the interview on YouTube.)


What is a writable API?

Monday, February 14th, 2011

When we released the Fluidinfo API for Boing Boing two weeks ago, Simon Willison noted on his blog:

“Fluidinfo really is a fascinating piece of software.” …. “Writable APIs are much less common than read-only APIs – Fluidinfo instantly provides both.”

If you search online to try to discover what people mean by a “writable API”, it’s hard to find anything that merits the name. So what did Simon mean? What is a writable API?

Both Simon and the team at Fluidinfo think “writable API” should be a kind of shorthand for an API that provides access to underlying data that is writable. This is not meant in the trivial already-possible sense wherein you pass data to an API method that stores them into a database you can’t otherwise access. We mean it in a more fundamental sense: that the underlying data is writable. That anyone or any application can directly access the data storage layer and add new information to it – without the knowledge of the people who stored the original data. That sounds pretty radical. But if you have a model of control in which objects are not owned but their pieces are, it’s not scary at all. In fact it’s liberating.

And, you guessed it, Fluidinfo has exactly that model of control. Any information stored into Fluidinfo instantly has a writable API in the sense just described. Let’s see a concrete example from the recent Boing Boing data imported into Fluidinfo.

Below is an illustration of an object in Fluidinfo, showing a subset of the tags that are on every Fluidinfo object representing a Boing Boing article. (The image was generated using Nick Radcliffe’s fun About Tag image generator for Fluidinfo objects. Click the image to see all its tags.)

An object

Simply by virtue of being stored in Fluidinfo, Boing Boing instantly got an API for all their articles. The API lets you find Boing Boing articles, as represented by objects in Fluidinfo, via querying on tags such as those shown on the object above. For example, you can use the API to find Boing Boing articles published in December 2008 that were written by Cory Doctorow. Or you can get a list of all the Boing Boing articles that contain a reference to the domain www.whitehouse.gov. (You can see details of these sorts of queries in our article on Mining the Boing Boing API.)

Those kinds of searches on Boing Boing data were not previously possible. We put the whole thing together in a single evening, which illustrates how simple it can be to make a Fluidinfo-fueled API for your own information. As cool as these examples are, though, they’re just reading & searching Boing Boing controlled data, as with a traditional API. What about writing?

Writing the Boing Boing data – without stopping to ask permission

The tags on the object above were put there by the Fluidinfo user named boingboing.net. That user controls those tags, and has given the rest of us read permission. But no-one owns the Fluidinfo object that the tags are on. As a result, anyone with a Fluidinfo account (sign up here) can add any information to the exact same object.

To give a very simple example, suppose someone wrote a simple browser extension (or extensions) that let Boing Boing readers mark stories as being funny or not suitable for work. Two users, Alice and Bob, might then put alice/funny and bob/nsfw tags onto the above object. Assuming I had read permission on those tags, I could then find Boing Boing articles by Cory Doctorow that Alice enjoyed and Bob found too risqué for work. Someone else could write a browser extension that popped up a warning about NSFW content based on Bob’s tag. In fact, if you take a proper look at the object above, you’ll see that I have added a terrycojones/nsfw tag to it (terrycojones is my username in Fluidinfo).

That’s customization and personalization – in our hands. It’s adding data to the exact same objects that Boing Boing created, combining their data and ours as we please, and all without stopping to ask permission or requiring that a database administrator or programmer anticipate our idiosyncratic needs. Boing Boing, and any applications they create, may not be aware of, care about, or even be able to detect the new data (depending on permissions).

In other words, we can say that Boing Boing has a writable API, because other people and other applications are always free to add information to the same objects that the Boing Boing API is providing access to. The same applies to any application or API that uses Fluidinfo. A writable API opens the door onto a very different world, allowing unlimited possibilities for mash-ups, new applications, extensions, widgets, etc. It allows arbitrary customization and personalization. Fluidinfo acts like a universal metadata engine, providing guaranteed write access to anything, with a permissions system at the level of the tag, not the object.

We’ll give another example of a simple but fun writable API tomorrow. Next week we’ll release a much more substantial one at the LAUNCH conference in San Francisco. We’re really excited about it, and have a series of not-to-be-missed upcoming blog posts on what we’ve been up to.

Stay tuned!

Top data blogs information now in Fluidinfo, with an API

Saturday, February 12th, 2011

Image: Education Week

A bit over a month ago, Marshall Kirkpatrick of Read Write Web made lists of the Top 300 blogs about data and the Top 300 blogs about geo. As soon as I saw the lists, I added the data to Fluidinfo and emailed Marshall.

I added marshallk.com/top-blogs/data and marshallk.com/top-blogs/geo tags to the Fluidinfo objects that correspond to the URLs in his lists. (Fluidinfo has an object for everything; in each case I put the tags onto the logical object in Fluidinfo: the one object whose fluiddb/about value is the URL in question.)


You can then do things like this:

$ curl 'http://fluiddb.fluidinfo.com/values?query=marshallk.com/top-blogs/data%3c%3d10&tag=fluiddb/about&tag=marshallk.com/top-blogs/data' |
jsongrep.py results id '.*'
{u'fluiddb/about': {u'value': u'http://www.readwriteweb.com/cloud'},
 u'marshallk.com/top-blogs/data': {u'value': 3}}
{u'fluiddb/about': {u'value': u'http://cloud.gigaom.com'},
 u'marshallk.com/top-blogs/data': {u'value': 2}}
{u'fluiddb/about': {u'value': u'http://flowingdata.com'},
 u'marshallk.com/top-blogs/data': {u'value': 8}}
{u'fluiddb/about': {u'value': u'http://highscalability.com'},
 u'marshallk.com/top-blogs/data': {u'value': 7}}
{u'fluiddb/about': {u'value': u'http://www.calculatedriskblog.com'},
 u'marshallk.com/top-blogs/data': {u'value': 6}}
{u'fluiddb/about': {u'value': u'http://www.fivethirtyeight.com'},
 u'marshallk.com/top-blogs/data': {u'value': 5}}
{u'fluiddb/about': {u'value': u'http://www.guardian.co.uk/news/datablog'},
 u'marshallk.com/top-blogs/data': {u'value': 9}}
{u'fluiddb/about': {u'value': u'http://www.informationisbeautiful.net'},
 u'marshallk.com/top-blogs/data': {u'value': 10}}
{u'fluiddb/about': {u'value': u'http://www.zerohedge.com'},
 u'marshallk.com/top-blogs/data': {u'value': 1}}
{u'fluiddb/about': {u'value': u'http://freakonomics.com/blog'},
 u'marshallk.com/top-blogs/data': {u'value': 4}}
 

Those are the top 10 on Marshall’s data list (unsorted, obviously). I’ve cleaned up the output using my jsongrep.py program described and available here.
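Since the results come back unsorted, putting them in rank order is a one-liner once they are in Python. A minimal Python 3 sketch, with the ten (URL, rank) pairs above copied in by hand:

```python
top_ten = [
    ("http://www.readwriteweb.com/cloud", 3),
    ("http://cloud.gigaom.com", 2),
    ("http://flowingdata.com", 8),
    ("http://highscalability.com", 7),
    ("http://www.calculatedriskblog.com", 6),
    ("http://www.fivethirtyeight.com", 5),
    ("http://www.guardian.co.uk/news/datablog", 9),
    ("http://www.informationisbeautiful.net", 10),
    ("http://www.zerohedge.com", 1),
    ("http://freakonomics.com/blog", 4),
]

# Sort by the value of the marshallk.com/top-blogs/data tag (the rank).
ranked = sorted(top_ten, key=lambda pair: pair[1])

for url, rank in ranked:
    print(rank, url)
```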

More interestingly, you can see if any sites are on both of Marshall’s lists:

$ curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20marshallk.com/top-blogs/geo&tag=fluiddb/about'
{"results": {"id": {"a2e56723-453a-44e5-bd91-5576d0615c8e": {"fluiddb/about": {"value": "http://blog.simplegeo.com"}}}}}

Just a single blog is in both lists: http://blog.simplegeo.com.
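If you would rather not use jsongrep.py, the raw JSON is easy to unpack with the standard library. Here is a minimal Python 3 sketch using the response shown above; `about_values` is a hypothetical helper name.

```python
import json

raw = ('{"results": {"id": {"a2e56723-453a-44e5-bd91-5576d0615c8e": '
       '{"fluiddb/about": {"value": "http://blog.simplegeo.com"}}}}}')


def about_values(payload):
    """Return the sorted fluiddb/about values from a /values JSON response."""
    objects = json.loads(payload)["results"]["id"]
    return sorted(tags["fluiddb/about"]["value"] for tags in objects.values())


print(about_values(raw))
```

The response is keyed by object ID, so the helper simply walks the values and pulls out the about tag from each.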

So far, so good.

About half an hour ago, I saw a tweet from Daniel Tunkelang (the mind behind TunkRank) saying that eCairn have just released some work based on Marshall’s data, producing a list of 500 top data blogs! Cool.

So I’ve just imported that data to Fluidinfo too, adding an ecairn.com/top-data-blogs tag to the object for each URL on their list. The value of each tag, as with Marshall’s data, is the ranking on the eCairn list.

Let’s see how many blogs are on both lists:

curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20ecairn.com/top-data-blogs&tag=fluiddb/about' | jsongrep.py results id '.*' | wc -l
117
 

Not as many as I expected. But there are some small differences in the URLs used, for example Marshall’s list had http://kaushik.net/avinash whereas the eCairn list has http://www.kaushik.net/avinash. This would be easy to clean up, and of course it’s also possible just to tag the object for both URLs in Fluidinfo.
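One possible clean-up is to normalise URLs before using them as about values. The Python 3 sketch below is only an illustration of the idea: it drops the scheme, a leading "www." and any trailing slash, which is a hypothetical rule of my choosing and not something Fluidinfo does for you.

```python
from urllib.parse import urlsplit


def normalise(url):
    """Reduce a URL to a lower-cased host (without 'www.') plus its path."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host + parts.path.rstrip("/")


print(normalise("http://kaushik.net/avinash"))
print(normalise("http://www.kaushik.net/avinash"))
```

With a rule like this applied to both lists before comparing them, the two spellings of the same blog would count as one.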

You can do the Fluidinfo query has marshallk.com/top-blogs/data except has ecairn.com/top-data-blogs to see the sites that Marshall has in his list but which do not appear in the eCairn list, such as Marshall’s #12, http://blog.sqlauthority.com. eCairn’s calculation might have put them in the lower 500 of their list of 1000 (the eCairn article only gives their top 500). There are plenty of other interesting queries too, but this post is long enough already.

So there you go, a fun bit of playing with more data blog data with Fluidinfo. One of these days we’ll even make it into one of these lists :-)

Here’s the tiny bit of Python code I just wrote to add the data. It uses the Python FOM library for Fluidinfo written by Ali Afshar:

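import sys
from fom.session import Fluid

fdb = Fluid()
fdb.login('ecairn.com', 'password')

# One URL per line on stdin, in rank order; ignore blank lines.
urls = [line.strip() for line in sys.stdin if line.strip()]

# Tag the object about each URL with its 1-based rank on the eCairn list.
for rank, url in enumerate(urls, 1):
    fdb.about[url]['ecairn.com/top-data-blogs'].put(rank)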
 

What the Post-It Note Can Teach Us About Apps and Data

Thursday, February 10th, 2011

On Feb 9th I gave a talk at the NYC Ignite event, titled “What the Post-It Note Can Teach Us About Apps and Data.” Below are my 20 slides (21 if you count image credits). These were advanced automatically every 15 seconds during the 5-minute talk. I’ll post a link to the video once it’s up.

While the topic may not seem to have anything to do with Fluidinfo, there is a very close connection. I’ll write about that another time.

Mining the BoingBoing API

Wednesday, February 9th, 2011

With all the BoingBoing data from the past ten years now in Fluidinfo the next question is “what can we do with it..?”. That’s what I’ll be answering in this technical how-to, so expect lots of code / examples!

I’ve organised the article into four parts:

  1. Basic Fluidinfo concepts
  2. How BoingBoing data is organised
  3. Minecraft (example data mining interactions with the API)
  4. Super-duper cool stuff (this is the best bit!)

Basic Fluidinfo Concepts

Understanding Fluidinfo involves four simple concepts:

  1. Objects represent things.
  2. Tags define objects’ attributes.
  3. Namespaces organise tags.
  4. Permissions apply to namespaces and tags.

How does this all fit together..? Objects are simply tagged with data. Put another way, tags associate a value with an object.

The other important concept to make clear is that nobody owns objects: there are no permissions associated with objects, and objects last forever. Although every object has a unique ID, objects are also usually identified by a globally unique and immutable “about” tag value. It’s used as you’d expect: to indicate what the object is supposed to be about. Finally, anyone can add data to any object (more on this later).

(er… that’s really all it is.)
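If it helps, here is a toy in-memory picture of those concepts in Python 3. It is only a mental model, not how Fluidinfo is implemented; permissions are left out entirely, and the alice/rating tag is a made-up example:

```python
# about value -> {tag path: value}; the dict itself is nobody's property.
objects = {}


def tag(about, path, value=None):
    """Attach the tag at `path` (namespace/.../name) to the object about `about`."""
    objects.setdefault(about, {})[path] = value


url = "http://boingboing.net/2000/01/21/street-tech-reviews-.html"
tag(url, "boingboing.net/title", "Street Tech Reviews and news")
tag(url, "alice/rating", 5)  # hypothetical second user tagging the same object

# Tags from different top-level namespaces sit side by side on one object.
namespaces = sorted({path.split("/")[0] for path in objects[url]})
print(namespaces)
```

The point of the sketch is that the object is just a meeting place: each tag path starts with a namespace, and that namespace is where ownership and permissions would apply.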

Of course, since Fluidinfo is a data-store it is possible to do searches, link objects and store all sorts of different types of data (from primitive types like numbers, booleans and text to more opaque values such as images, video, sound and other binary data).

Oh yeah, interaction with the data is via a simple yet powerful REST API. There are plenty of client libraries in many different languages which allow you to work without worrying about the dirty implementation details.

How the BoingBoing data is organised in Fluidinfo

Each of the 64,000 BoingBoing articles is represented by a corresponding Fluidinfo object whose about tag value is the URL of the original post on boingboing.net. In the original XML dump, each post looked something like this:

    <row>
        <permalink>http://boingboing.net/2000/01/21/street-tech-reviews-.html</permalink>
        <created_on>2000-01-21 14:07:38</created_on>
        <basename>street_tech_reviews_</basename>
        <author>Mark Frauenfelder</author>
        <title>Street Tech Reviews and news</title>
        <body><A HREF="http://www.streettech.com/">Street Tech</A> Reviews and news for gadget-lovers and propeller heads of all stripes.</body>
        <body_more>NULL</body_more>
        <comment_count>0</comment_count>
        <categories>NULL</categories>
    </row>

I’ve done the simplest thing possible: created a top-level boingboing.net namespace in Fluidinfo under which all tags used to annotate BoingBoing data are defined. I’ve added tags to this namespace that map to the original XML elements: permalink, created_on, basename, author, title, body, body_more, comment_count and categories. The Fluidinfo objects representing BoingBoing posts have data associated with them using these tags. For example, the object representing the post described in the XML example above has a boingboing.net/title tag with the associated value: “Street Tech Reviews and news”.

Since I was also cleaning the raw XML I decided to extract / re-structure some of the data. This resulted in some additional tags: year, month, day, timestamp, links and domains. The function of the date related tags should be clear. The links and domains tags are interesting because I scraped all the anchor tags in the body and body_more fields and processed the href values. Obviously the links tag references a list of all the URLs referenced in an article and the domains tag references a related list containing just the domain names.
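For the curious, the link and domain extraction can be approximated with the Python 3 standard library. This is a sketch of the idea rather than the code I actually ran, and whether a prefix such as "www." should be kept in the domain is a choice I’ve left open here:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every anchor tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <A HREF=...> matches.
        if tag == "a":
            self.links.extend(v for name, v in attrs if name == "href" and v)


# The body field from the XML example above.
body = ('<A HREF="http://www.streettech.com/">Street Tech</A> Reviews and news '
        'for gadget-lovers and propeller heads of all stripes.')

collector = LinkCollector()
collector.feed(body)
links = collector.links
domains = sorted({urlsplit(link).netloc for link in links})
print(links, domains)
```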

I did one final enhancement to the data dump. I extracted all the authors and categories and turned them into tags. When I imported the data I used these tags in the “delicious” way of tagging: simply by having such a tag (with no associated value) an object is associated with an author or category.

Here’s what an object representing a BoingBoing article looks like:

An object

Another interesting view on the data is to explore the BoingBoing tags and namespaces in the Fluidinfo Explorer (see the screen-shot on the right). In the Explorer, if you right-click on a tag and select “Open Object” you’ll see the object that represents the tag in the main area of the application. This object is itself tagged with useful information – such as a description (containing copyright information). Yeah, I know, it sounds odd but this makes meta-tagging possible.

In addition to creating Fluidinfo objects for all the BoingBoing articles I also created an object for every domain referenced by BoingBoing throughout the last ten years.

The about tag value for these domain objects is the domain name itself. For example, there is an object about the “bbc.co.uk” domain.

Each of these domain objects has been tagged with a list of all the BoingBoing articles that mention them. This is, I think, rather cool. To continue the example, the bbc.co.uk domain was referenced in 177 BoingBoing articles.
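Building that reverse mapping is a simple inversion. A minimal Python 3 sketch, assuming you already hold a dict of article URL to domain list; the articles_by_domain name and the single-entry sample are illustrative only:

```python
from collections import defaultdict


def articles_by_domain(domains_by_article):
    """Invert {article URL: [domains]} into {domain: [article URLs]}."""
    index = defaultdict(list)
    for article, domains in domains_by_article.items():
        for domain in domains:
            index[domain].append(article)
    return dict(index)


# One real article; the domain is what a naive parse of its body link would give.
sample = {
    "http://boingboing.net/2000/01/21/street-tech-reviews-.html": ["www.streettech.com"],
}
index = articles_by_domain(sample)
print(index["www.streettech.com"])
```

Each key of the resulting dict corresponds to a domain object, and each value is the list that gets stored as a tag value on it.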

Minecraft (example data mining interactions with the API)

So here comes the cool how-to stuff…

Should you need to, use the existing documentation to read about the Fluidinfo API in super-painfully-precise-techno-vision. However, I’m going to present a quick guided tour in the form of a Python session using the fluiddb.py module (remember my advice to use one of the client libs). The advantage of using fluiddb.py is that it’s a very thin layer on top of the HTTP API so you get a feel for how various things work. The other advantage is that reading Python is like reading pseudo-code and is thus a great teaching tool.

In the following example I simply import the fluiddb module and ask it for information about my user (ntoll). The basic pattern for calling Fluidinfo is: fluiddb.call('HTTP-VERB', 'PATH IN API', OTHER OPTIONAL ARGS)

>>> import fluiddb # loads the module into the session
>>> headers, body = fluiddb.call('GET', '/users/ntoll') # 'GET' is the HTTP verb & '/users/ntoll' is the API path
>>> headers # contains the HTTP headers returned from Fluidinfo
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '76',
 'content-location': 'https://fluiddb.fluidinfo.com/users/ntoll',
 'content-type': 'application/json',
 'date': 'Tue, 08 Feb 2011 19:42:10 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> body # contains the actual result, in this case basic information about the user ntoll (me)
{u'id': u'a694f2d0-428e-4aaf-85d1-58e903f56b30',
 u'name': u'Nicholas Tollervey'}

Notice how the “content-location” in the headers tells you what the full URL of the API call is (this is interesting since fluiddb.py creates this automagically for you). The body (result) is a Python dict object that basically mirrors the JSON dict object Fluidinfo served up.

The following example grabs information about a specific object. Notice that I pass in the path to the Fluidinfo resource I’m GETting as a list. This ensures that the BoingBoing URL gets correctly percent encoded.

>>> headers, body = fluiddb.call('GET', ['about', 'http://boingboing.net/2000/01/21/street-tech-reviews-.html']) # get basic information about the object about "http://boingboing.net/2000/01/21/street-tech-reviews-.html"
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '455',
 'content-location': 'https://fluiddb.fluidinfo.com/about/http%3A%2F%2Fboingboing.net%2F2000%2F01%2F21%2Fstreet-tech-reviews-.html',
 'content-type': 'application/json',
 'date': 'Tue, 08 Feb 2011 19:45:27 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> body
{u'id': u'469257cf-2c33-4628-a97e-47166bae24fa',
 u'tagPaths': [u'boingboing.net/timestamp',
               u'fluiddb/about',
               u'boingboing.net/day',
               u'boingboing.net/month',
               u'boingboing.net/year',
               u'boingboing.net/authors/markfrauenfelder',
               u'boingboing.net/comment_count',
               u'boingboing.net/author',
               u'boingboing.net/basename',
               u'boingboing.net/body',
               u'boingboing.net/domains',
               u'boingboing.net/created_on',
               u'boingboing.net/permalink',
               u'boingboing.net/title',
               u'boingboing.net/links']}
>>>

Hopefully, the result speaks for itself: it contains the unique ID of the Fluidinfo object that is about the BoingBoing URL, and a list of the tags on that object. Getting the value of a specific tag is simple:

>>> headers, body = fluiddb.call('GET', '/objects/469257cf-2c33-4628-a97e-47166bae24fa/boingboing.net/title')
>>> body
u'Street Tech Reviews and news'

I simply appended the path to the tag onto the object’s unique ID (this also works with the about tag, as used in the prior example).
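To see why passing the path as a list matters, here is a Python 3 sketch of the encoding step. The build_url helper is hypothetical and is not fluiddb.py’s actual implementation; it only reproduces the content-location shown earlier:

```python
from urllib.parse import quote

BASE = "https://fluiddb.fluidinfo.com"


def build_url(path):
    """Join path segments into an API URL, percent-encoding each segment."""
    if isinstance(path, str):
        return BASE + path  # a plain string is used as-is
    # safe="" makes quote() encode "/" and ":" inside a segment too.
    return BASE + "/" + "/".join(quote(segment, safe="") for segment in path)


about = "http://boingboing.net/2000/01/21/street-tech-reviews-.html"
object_url = build_url(["about", about])
title_url = build_url(["about", about, "boingboing.net", "title"])
print(object_url)
print(title_url)
```

Without the per-segment encoding, the slashes inside the BoingBoing URL would be read as extra path separators.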

Returning tag values for a set of results that match a query is also easy. The equivalent of the following SQL-esque query:

SELECT title, categories, created_on FROM boingboing.net WHERE authors="markfrauenfelder" AND year=2010;

… is:

>>> headers, body = fluiddb.call('GET', '/values', tags=['boingboing.net/title', 'boingboing.net/created_on', 'boingboing.net/categories'], query="has boingboing.net/authors/markfrauenfelder and boingboing.net/year=2010")

A call is made to the “/values” endpoint with a list of tags whose values we want returned and a query to generate the result set. The query is written in Fluidinfo’s super-simple query language. The headers of the response look like this:

>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '287328',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=has+boingboing.net%2Fauthors%2Fmarkfrauenfelder+and+boingboing.net%2Fyear%3D2010&tag=boingboing.net%2Ftitle&tag=boingboing.net%2Fcreated_on&tag=boingboing.net%2Fcategories',
 'content-type': 'application/json',
 'date': 'Wed, 09 Feb 2011 10:55:50 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
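The content-location above is just the query and the tag list form-encoded onto the /values path. If you’re not using a client library, a Python 3 sketch along these lines should produce the same URL (the variable names are mine):

```python
from urllib.parse import urlencode

query = "has boingboing.net/authors/markfrauenfelder and boingboing.net/year=2010"
tags = ["boingboing.net/title", "boingboing.net/created_on", "boingboing.net/categories"]

# One "query" parameter followed by a repeated "tag" parameter per tag.
params = [("query", query)] + [("tag", t) for t in tags]
url = "https://fluiddb.fluidinfo.com/values?" + urlencode(params)
print(url)
```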

The actual results are a JSON object (of which the following is only a fragment):

{
  "results": {
    "id": {
      "f2976562-eba6-47e4-94a1-b36ffe9a2ab1": {
        "boingboing.net/created_on": {
          "value": "2010-10-14 13:14:14"
        },
        "boingboing.net/categories": {
          "value": [
            "science",
            "technology",
            "art and design",
            "design"
          ]
        },
        "boingboing.net/title": {
          "value": "TED releases iPad app today"
        }
      },
      "627ebf2e-e38d-41da-a709-16294b4ab6f2": {
        "boingboing.net/created_on": {
          "value": "2010-02-19 11:29:36"
        },
        "boingboing.net/categories": {
          "value": [
            "culture"
          ]
        },
        "boingboing.net/title": {
          "value": "Miniboss T-shirt in the Boing Boing Bazaar"
        }
      } // etc... for lots of results
    }
  }
}

Happily, fluiddb.py has converted it into the Python equivalent so we can find out some useful information and look at individual results.

>>> len(body['results']['id']) # how many results do we have..?
1214
>>> body['results']['id'].keys()[0] # what's the id of the first result..?
u'f2976562-eba6-47e4-94a1-b36ffe9a2ab1'
>>> body['results']['id']['f2976562-eba6-47e4-94a1-b36ffe9a2ab1'] # show the record for the first result...
{u'boingboing.net/categories': {u'value': [u'science',
                                           u'technology',
                                           u'art and design',
                                           u'design']},
 u'boingboing.net/created_on': {u'value': u'2010-10-14 13:14:14'},
 u'boingboing.net/title': {u'value': u'TED releases iPad app today'}}
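Once the results are a plain dict, the mining part is ordinary Python. As a sketch (written for Python 3, and using only the two records from the fragment above rather than all 1214), here is how you might count categories and posts per month:

```python
from collections import Counter

# The two records shown in the JSON fragment above.
results = {
    "f2976562-eba6-47e4-94a1-b36ffe9a2ab1": {
        "boingboing.net/created_on": {"value": "2010-10-14 13:14:14"},
        "boingboing.net/categories": {
            "value": ["science", "technology", "art and design", "design"]
        },
        "boingboing.net/title": {"value": "TED releases iPad app today"},
    },
    "627ebf2e-e38d-41da-a709-16294b4ab6f2": {
        "boingboing.net/created_on": {"value": "2010-02-19 11:29:36"},
        "boingboing.net/categories": {"value": ["culture"]},
        "boingboing.net/title": {"value": "Miniboss T-shirt in the Boing Boing Bazaar"},
    },
}

# Not every post has categories, so fall back to an empty list.
category_counts = Counter(
    category
    for record in results.values()
    for category in record.get("boingboing.net/categories", {}).get("value", [])
)

# The first seven characters of created_on are the year and month.
posts_per_month = Counter(
    record["boingboing.net/created_on"]["value"][:7] for record in results.values()
)
print(category_counts.most_common(), dict(posts_per_month))
```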

Great! So you have all the tools you need to search and explore all the BoingBoing articles from the last ten years. That’s what a conventional data API provides.

However, Fluidinfo can do additional super-duper cool stuff..!

Super-duper cool stuff!

Fluidinfo is an openly writeable database where objects have value because they are annotated with data from different sources. That’s why anyone can attach data to any object. Since you decide who can use, read and change your namespaces and tags, you keep control of your data and, importantly, create a mechanism for trust.

You can trust values annotated with tags from the boingboing.net namespace because only BoingBoing is allowed to create and edit anything under this namespace. Since BoingBoing has annotated objects with information about articles, it’s safe to assume the objects are about BoingBoing articles.

Here’s the super-duper stuff: you can contribute data to these objects too.

How..?

I’m glad you asked… :-)

First of all you’ll need an account on Fluidinfo. Once you’ve signed up you’ll be the proud owner of a top-level namespace with the same name as your username. Before you can add data to objects you’ll need to create a tag in that namespace:

>>> fluiddb.login('ntoll', 'top-secret-password') # change as appropriate
>>> newTag = {'name': 'tuba', 'description': 'Related to Tubas in some way so it must be awesome!', 'indexed': False}
>>> headers, body = fluiddb.call('POST', '/tags/ntoll', newTag) # create new tag in /ntoll namespace
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '104',
 'content-type': 'application/json',
 'date': 'Wed, 09 Feb 2011 13:08:52 GMT',
 'location': 'https://sandbox.fluidinfo.com/tags/ntoll/tuba',
 'server': 'nginx/0.7.65',
 'status': '201'}
>>> headers, body = fluiddb.call('GET', '/tags/ntoll/tuba', returnDescription=True)
>>> body
{u'description': u'Related to Tubas in some way so it must be awesome!',
 u'id': u'b03f6937-cebf-481d-a0eb-5fd355a8a602',
 u'indexed': False}

The new tag is given a name (“tuba”), a description and a flag saying whether it should be indexed. The “201” status that Fluidinfo returned confirms that the new tag was successfully created under the “ntoll” namespace.

In case you hadn’t guessed, I like tubas! I’d like others to be able to find tuba-related objects in Fluidinfo, so I’ve decided to attach this newly created tag to anything tuba-related, including BoingBoing posts. As it happens Fluidinfo helps me get a bunch of these posts with a search like this:

>>> headers, body = fluiddb.call('GET', '/values', tags=['boingboing.net/title', 'fluiddb/about',], query = 'fluiddb/about matches "tuba"')
>>> body
{
  "results": {
    "id": {
      "e6c108f4-bd10-4cd3-b7d5-ad549b988c28": {
        "fluiddb/about": {
          "value": "http://boingboing.net/2006/06/22/flaming-tuba-guy-dav.html"
        },
        "boingboing.net/title": {
          "value": "Flaming Tuba guy David Silverman on NBC Tonight Show 6/23"
        }
      },
      "0c006f04-0663-48d6-9f11-4e082e75eb51": {
        "fluiddb/about": {
          "value": "http://boingboing.net/2010/11/22/tuba-skinny-old-time.html"
        },
        "boingboing.net/title": {
          "value": "Tuba Skinny: Old timey blues and jazz street act from New Orleans"
        }
      }
    }
  }
}

I’ve simply queried for matches for the word “tuba” in the fluiddb/about tag. Now that I’ve got a couple of results I can tag them like so:

>>> for tubaItem in body['results']['id']:
...     header, result = fluiddb.call('PUT', '/objects/%s/ntoll/tuba' % tubaItem, "Umpah-tastical, man!") # tag each matching object
...     print header['status']
...
204
204

Yay! I’ve added some information to a couple of objects about BoingBoing articles! Let’s just confirm this by asking Fluidinfo for all the objects tagged with ntoll/tuba:

>>> headers, body = fluiddb.call('GET', '/values', tags=['fluiddb/about', 'boingboing.net/title', 'ntoll/tuba', ], query="has ntoll/tuba")
>>> body
{
  "results": {
    "id": {
      "e6c108f4-bd10-4cd3-b7d5-ad549b988c28": {
        "ntoll/tuba": {
          "value": "Umpah-tastical, man!"
        },
        "fluiddb/about": {
          "value": "http://boingboing.net/2006/06/22/flaming-tuba-guy-dav.html"
        },
        "boingboing.net/title": {
          "value": "Flaming Tuba guy David Silverman on NBC Tonight Show 6/23"
        }
      },
      "0c006f04-0663-48d6-9f11-4e082e75eb51": {
        "ntoll/tuba": {
          "value": "Umpah-tastical, man!"
        },
        "fluiddb/about": {
          "value": "http://boingboing.net/2010/11/22/tuba-skinny-old-time.html"
        },
        "boingboing.net/title": {
          "value": "Tuba Skinny: Old timey blues and jazz street act from New Orleans"
        }
      },
      "024bf1b6-348d-4839-8700-cbb30d86fb97": {
        "ntoll/tuba": {
          "value-type": "image/jpg",
          "size": 467947
        },
        "fluiddb/about": {
          "value": "CrossCountryTuba"
        }
      },  
      "a694f2d0-428e-4aaf-85d1-58e903f56b30": {
        "ntoll/tuba": {
          "value": "I play tuba!"
        },
        "fluiddb/about": {
          "value": "Object for the user named ntoll"
        }
      }
    }
  }
}

Oops, I forgot I’d already tagged a couple of non-BoingBoing objects with the ntoll/tuba tag: one whose about tag value is “CrossCountryTuba” and the other being the object that represents me in Fluidinfo.

Notice how the value for the ntoll/tuba tag on the object about “CrossCountryTuba” contains only metadata: the type of data stored by that tag on that particular object (image/jpg) and the size of the data (467947 bytes). Looks like it’s an image of some sort. Let’s get it and see:

>>> headers, body = fluiddb.call('GET', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/tuba')
>>> image = open('tuba.jpg', 'wb') # binary mode, since this is image data
>>> image.write(body)
>>> image.close()

And what does tuba.jpg contain..?

CrossCountryTuba

Cool! Fluidinfo stores any type of data so long as you supply the appropriate MIME type when you upload the data.

How did I get the data into Fluidinfo..?

>>> tuba = open('Desktop/tuba.jpg', 'rb') # open the original image in binary mode
>>> headers, body = fluiddb.call('PUT', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/tuba', tuba.read(), mime='image/jpg') # notice how I specify the MIME type
>>> tuba.close()
>>> headers['status'] # check we got a 204 No Content response, as with the earlier PUTs
'204'
>>> header, body = fluiddb.call('PUT', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/attribution', 'Tuba photo source: http://www.flickr.com/photos/dust/3813581130 licensed under a CC-BY 2.0 license') # need to add attribution as per the license

Simple..!
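One thing worth knowing: the registered MIME type for JPEG files is image/jpeg, although I typed image/jpg in the session above. If you’d rather not type the MIME type by hand, you can sniff it from the first few bytes of the file. A minimal Python 3 sketch, covering only three common image formats (sniff_mime is a hypothetical helper, not part of fluiddb.py):

```python
def sniff_mime(data):
    """Guess a MIME type from the leading 'magic' bytes of a file."""
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    return "application/octet-stream"  # generic fallback for unknown data


# The first bytes of a typical JPEG file, padded out for illustration.
jpeg_start = b"\xff\xd8\xff\xe0" + b"\x00" * 16
print(sniff_mime(jpeg_start))
```

The returned string is what you would pass as the MIME type when uploading the bytes.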

Now we’ve covered a lot of ground, so let’s just consider where we’ve got to.

  • We have a consistent, simple and powerful API to play with.
  • We can retrieve values using a simple query language referencing data contributed from many different users.
  • We can contribute data ourselves in such a way that the data remains under our control.
  • We can put all our data in the right place. If I want to contribute something about a BoingBoing article I just tag it to the object representing the right BoingBoing article.
  • We can contribute all sorts of data, be it searchable primitive values like numbers, text and booleans or opaque data such as images, audio or anything else for which you can specify a MIME type.

You’re armed with enough basic knowledge to both mine BoingBoing data and contribute to it too. In fact, if you look carefully you’ll find all sorts of interesting objects in Fluidinfo. Remember, to find out more about the API check out our technical documentation.

Dive in, have fun and we’re more than happy to answer questions.

Image credits: BoingBoing’s logo and font butchered with permission (thanks @mustardhamsters!), diagram generated by abouttag written by Nick Radcliffe and the “Cross Country Tuba” image is © 2009 Amanda M Hatfield under a Creative Commons license.