Nicholas Tollervey « Fluidinfo

March 22, 2011

How we built the O’Reilly API using Fluidinfo

Filed under: Data,Howto,People,Progress — Nicholas Tollervey @ 8:34 am

In case you haven’t noticed, we’ve imported the O’Reilly catalogue into Fluidinfo thus giving them an instantly writable API for their data.

How did we do it..?

There were three basic steps:

Get the raw data.
Clean the raw data.
Import the cleaned data.

That’s it!

I’ll take each step in detail…

Get the raw data

Since we didn’t have an existing raw dump of the data nor access to O’Reilly’s database we had to think of some other way to get the catalogue. We found that O’Reilly had two different existing data services we could use: OPMI (O’Reilly Product Metadata Interface) and an affiliate’s API within Safari.

Unfortunately the RDF returned from OPMI is complicated. We’d either have to become experts in RDF or learn how to use a specialist library to get at the data we were interested in. We didn’t have time to pursue either of these avenues. The other alternative, the Safari service, just didn’t work as advertised. 🙁

Then we remembered learning about @frabcus and @amcguire62‘s ScraperWiki project.

Put simply, ScraperWiki allows you to write scripts that scrape (extract) information from websites and store the results for retrieval later. The “wiki” aspect of the ScraperWiki name comes from its collaborative development environment where users can share their scripts and the resulting raw data.

In any case, a couple of hours later I had the beginnings of a batched up script for scraping information from the O’Reilly catalogue on the oreilly.com website. After some tests and refactoring ScraperWiki started to do its stuff. The result was a data dump in the easy to understand and manipulate CSV or JSON formats. ScraperWiki saves the day!

Clean the raw data

This involved massaging the raw data into a meaningful structure that corresponded to the namespaces, tags and tag-values we were going to use in Fluidinfo. We also extracted some useful information from the raw data. For example, we made sure the publication date of each work was also stored in a machine-readable value. Finally, we checked that all the authors and books matched up.

Most of this work was done by a single Python script. It loaded the raw data (in JSON format), cleaned it and saved the cleaned data as another JSON file. This meant that we could re-clean the raw data any number of times when we got things wrong or needed to change anything. Since this was all done in-memory it was also very fast.

The file containing the cleaned data was simply a list of JSON objects that mapped to objects in Fluidinfo. The attributes of each JSON object corresponded to the tags and associated values to be imported.

Import the cleaned data

This stage took place in two parts:

Create the required namespaces and tags
Import the data by annotating objects

Given the cleaned data we were able to create the required namespaces and tags. You can see the resulting tree-like structure in the Fluidinfo explorer (on the left hand side).

Next, we simply iterated over the list of JSON objects and pushed them into Fluidinfo. (It’s important to note is that network latency means that importing data can seem to take a while. We’re well aware of this and will be blogging about best practices at a later date.)

That’s it!

We used Ali Afshar’s excellent FOM (Fluid Object Mapper) library for both creating the namespace and tags and importing the JSON objects into Fluidinfo and elements of flimp (the FLuid IMPorter) for pushing the JSON into FOM.

What have we learned..? The most time consuming part of the exercise was scraping the data. The next most time consuming aspect was agreeing how to organise it. The actual import of the data didn’t take long at all.

Given access to the raw data and a well thought out schema we could have done this in an afternoon.

Comments (5)

March 21, 2011

The structure of O’Reilly book and author data in Fluidinfo

Filed under: Data,Howto,Programming,Progress — Nicholas Tollervey @ 9:40 am

This short post explains how the O’Reilly catalog is represented in Fluidinfo.

Put simply, we annotate two types of object: those representing products (usually books) and those representing authors. We annotate them using namespaces and tags within the oreilly.com top level namespace so you can be sure that this is bona fide O’Reilly information.

Within the oreilly.com namespace we store a bunch of “top level” tags that describe a product in the O’Reilly catalogue (title, summary, URL and so on). The oreilly.com namespace has two child namespaces: “authors” and “media“. (If you want a visual representation of this structure head on over to the Fluidinfo explorer and explore, starting from the tree menu on the left hand side.)

The authors namespace contains tags that define information about an author (name, biography, homepage and so on) and also contains a child namespace called “expertise“. The expertise namespace contains a set of tags that map to the list of areas of expertise that O’Reilly uses to categorise their authors. So, for example, an object representing the O’Reilly author “Chris DiBona” looks like this:

Notice how Chris’s object has tags under the oreilly.com/authors namespace including several under the oreilly.com/authors/expertise namespace. Importantly, the object also has tags that were not provided by the O’Reilly data. Terry has added a tag terrycojones/met to indicate (rather obviously) that he’s met Chris and the fluiddb/about tag is used to indicate that the object is about the author called Chris diBona.

What about the objects that represent books..? What do they look like..? Well let’s consider a current favourite of mine: “XMPP: The Definitive Guide”. Here’s how Nick Radcliffe’s excellent abouttag utility displays the object representing this book:

Whoa! Lots more tags! Many of them are from the oreilly.com domain (although notice how there are 15 missing). Once again it’s possible to see who/what else has been tagging the object. I’ve added a review and rating (ntoll/review and ntoll/rating) and various other people have annotated useful information that wasn’t at first in the dataset provided by O’Reilly.

How are authors and books linked..?

Every author object has an oreilly/authors/works tag that contains a list of the 13 digit O’Reilly ID / ISBN for each work they were involved in. Every book object has a corresponding oreilly.com/id and oreilly.com/isbn tag.

Alternatively, every book object has an oreilly.com/authors-urls tag that contains a list of it’s author’s homepages on the O’Reilly website and every author object has an associated oreilly.com/url containing the same information.

Finally, for the sake of completeness here’s a list of all the book and author tags along with a description of what each one represents:

Book tags

publication-day: The day of the month upon which the item was published.
publication-month – The number of the month within which the item was published.
duration – The duration of this item in minutes.
subtitle – The subtitle associated with the item.
id – The unique ID used by O’Reilly to identify the item, usually the 13-digit ISBN number (as a string).
page-count-is-estimate – A flag to indicate that any associated page count value is only an estimate.
cover-medium – The URL for a medium size image of the cover at the oreilly.com domain.
toc – The table of contents as text/html.
homepage – A URL to the item’s homepage on the O’Reilly website.
description – A long description of the item as text/html.
cover-small – The URL for a small size image of the cover at the oreilly.com domain.
author-urns – A list of unique reference numbers used by O’Reilly to reference the authors of the item.
cover-large – The URL for a large size image of the cover at the oreilly.com domain.
isbn – The 13-digit ISBN number (as a string).
safari-url – A URL to the item’s page on O’Reilly’s Safari service.
author-urls – A list of URLs pointing to the author’s homepages on the O’Reilly website.
pages – The number of pages this item has.
publisher – The name of the publisher of the item.
price-us – The advertised US price in cents.
title – The title of the item.
author-names – A list of author names.
summary – A short summary of the item as text/html.
publication-date – The publication date as YYYY-MM-DD.
price-uk – The advertised UK price in pence.
media – A list of the type[s] of media in which the item is available. Can be one or more of: ‘up-to-date’, ‘rough cut’, ‘dvd’, ‘ebook’, ‘kit’, ‘video’, ‘print’, ‘early release ebook’, ‘safari books online’ or ‘merchandise'”

Author tags

name – The author’s full name.
url – A URL to the author’s homepage on the O’Reilly website.
photo – A path to an image file containing a photo of the author hosted at the oreilly.com domain.
twitter – The author’s Twitter username.
works – A list of the ids of items that the author has created.
expertise – A list of the expertise tags associated with the author.
biography – The author’s biography as text/html.

Comments (1)

Examples of Fluidinfo O’Reilly API queries

Filed under: APIs,Data,Howto,Programming — Nicholas Tollervey @ 9:40 am

This post is all about querying the O’Reilly book and author information recently imported into Fluidinfo. If you want the skinny on Fluidinfo’s query language in glorious in-depth techno-geek-speak then check out the documentation. If you’d rather see some real world examples, read on…

In Fluidinfo, objects represent things (and all objects have a unique id). Information is added to objects using tags. Tags can have values, and tag names are organized into namespaces that give them context. Permissions control who can see and use namespaces and tags.

Objects do not belong to anyone and don’t have permissions associated with them. They’re openly writable. Anyone can tag anything to any object. Many objects have a special globally unique “about” tag value that indicates what they are about. Interaction with Fluidinfo is via a REST API.

That’s Fluidinfo in a nutshell.

In another article published today I describe the Fluidinfo tags and namespaces used to annotate objects with O’Reilly data. The tags are attached to objects for O’Reilly books and authors. Both kinds of objects have about tags. So a trivial first kind of query is to go directly to an object that’s about a book. For example, to get information about the object representing the book “Open Government” visit the URL http://fluiddb.fluidinfo.com/about/book:open government (daniel lathrop; laurel ruma).

You’ll get back a JSON response containing a list of all the tags (that you have permission to read) attached to that object and the object’s globally unique id. Similarly, you can go directly to the object for an O’Reilly author http://fluiddb.fluidinfo.com/about/author:tim oreilly.

In case you’re wondering about the format of these book and author about tags, we used the abouttag library written by Nicholas Radcliffe to generate them. They’re designed to be readable, easy to generate programmatically, and unlikely to result in collisions. You don’t have to remember them though, as there are many other ways to get at objects, via querying, as we’re about to see.

Queries on tags and their values

Below are some examples of using Fluidinfo’s query language.

Presence

Return all the objects that have an O’Reilly title:


has oreilly.com/title

You can see the results at the following URL: http://fluiddb.fluidinfo.com/objects?query=has oreilly.com/title. Once again, the result is in JSON. It simply contains a list of the ids of matching objects (representing things that O’Reilly have tagged with a title).

That’s the equivalent of the following SQL statement:


SELECT id FROM oreilly.com WHERE title IS NOT NULL;

Caveat: There are no tables in Fluidinfo so it’s impossible to make a direct translation to SQL. This example and those that follow simply illustrate a conceptual equivalence to make it easier for those of you familiar with SQL to get your heads around the Fluidinfo query language.

Comparison

Return all the O’Reilly objects whose price is less than $40 (the price is stored in cents).


oreilly.com/price-us < 4000

Here it is as a URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/price-us < 4000

In SQL it would be:


SELECT id FROM oreilly.com WHERE price-us < 4000;

Text Matching

Return all the O'Reilly objects that have "Python" in the title.


oreilly.com/title matches "Python"

The resulting URL: https://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python"

In SQL:


SELECT id FROM oreilly.com WHERE title LIKE '%Python%';

Set Contents

Return all the O'Reilly objects representing authors who were involved in writing the work with ISBN "9781565923607" (which is the unique ID O'Reilly use in their catalog). The value of oreilly.com/authors/works tags is always a set of unique ISBN numbers like this: ["9781565923607", "9781565563728", "9781627397284"].


oreilly.com/authors/works contains "9781565923607"

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/authors/works contains "9781565923607"

In SQL:


SELECT id FROM oreilly.com/authors WHERE '9781565923607' in (SELECT works FROM oreilly.com/authors);

(Actually, the similar "IN" operation in SQL isn't a very good example since it results in verbose monstrosities like the above.)

Exclusion

Return all the O'Reilly books that were published in 2001 except those published in April.


oreilly.com/publication-year=2010 except oreilly.com/publication-month=4

The resulting URL: https://fluiddb.fluidinfo.com/objects?query=oreilly.com/publication-year=2010 except oreilly.com/publication-month=4

In SQL:


SELECT id FROM oreilly.com WHERE year=2010 and month<>4;

Logic

It's possible to use the and and or logical operations. For example, return all the O'Reilly books whose title matches "Python" and were published before 2005:


oreilly.com/title matches "Python" and oreilly.com/publication-year < 2005

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python" and oreilly.com/publication-year < 2005

In SQL:


SELECT id FROM oreilly.com WHERE title LIKE '%Python%' and year < 2005

Grouping

Return all the objects representing O'Reilly books mentioning "Python" in their title that were published in either 2008 or 2010.


oreilly.com/title matches "Python" and (oreilly.com/publication-year=2008 or oreilly.com/publication-year=2010)

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python" and (oreilly.com/publication-year=2008 or oreilly.com/publication-year=2010)

In SQL:


SELECT id FROM oreilly.com WHERE title LIKE '%Python%' AND (year = 2008 OR year = 2010);

Querying across different data sets

Fluidinfo can query seamlessly across tags from different sources that are stored on the same object. E.g., return the titles of all O'Reilly books that Terry Jones owns.


has oreilly.com/title and has terrycojones/owns

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=has oreilly.com/title and has terrycojones/owns

In SQL:

Well, it's actually not clear how you'd do this in SQL. Presumably there'd need to be some kind of table join, supposing that were possible!

Getting back tags on objects matching a query

It's also possible to indicate which tag values to return for each matching object. This is done by using the Fluidinfo /values HTTP endpoint and specifying the tag values to return as arguments in the URL path. For example, if I wanted the title, author names and publication year of all the O'Reilly books with the word "Python" in the title published before 2006 then I'd use the following query:


oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006

and append the wanted tags to the URL after the query (in any order):


&tag=oreilly.com/title&tag=oreilly.com/author-names&tag=oreilly.com/publication-year

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006&tag=oreilly.com/title&tag=oreilly.com/author-names&tag=oreilly.com/publication-year

This is similar to the following SQL:


SELECT title, authors, year FROM oreilly.com WHERE title LIKE '%Python%' AND year < 2006;

Fluidinfo returns a JSON object like this:


{u'results': {u'id': {u'1a91e021-7bce-4693-bfa5-0dc437fe1817': 
    {u'oreilly.com/author-names': {u'value': [u'Anna Ravenscroft', u'David Ascher', u'Alex Martelli']},
     u'oreilly.com/publication-year': {u'value': 2005},
     u'oreilly.com/title': {u'value': u'Python Cookbook, Second Edition'}},
u'1d25baae-b977-4ff4-bb77-01c52bd1d339': 
    {u'oreilly.com/author-names': {u'value': [u'Fredrik Lundh']},
     u'oreilly.com/publication-year': {u'value': 2001},
     u'oreilly.com/title': {u'value': u'Python Standard Library'}},
u'3360f05f-9bf4-4da5-abc0-0e3742809b98': 
    {u'oreilly.com/author-names': {u'value': [u'Fred L. Drake Jr', u'Christopher A. Jones']},
     u'oreilly.com/publication-year': {u'value': 2001},
     u'oreilly.com/title': {u'value': u'Python & XML'}},
u'9845b184-ef1b-46fb-8e7c-011da053dcb6': 
    {u'oreilly.com/author-names': {u'value': [u'Andy Robinson', u'Mark Hammond']},
     u'oreilly.com/publication-year': {u'value': 2000},
     u'oreilly.com/title': {u'value': u'Python Programming On Win32'}}}}}

It's also possible to update and delete tag values from matching objects. This process is explained in detail in the Fluidinfo documentation and this blog post.

Finally, rather than interacting with Fluidinfo directly using the raw HTTP API it's a good idea to use one of the client libraries listed here. For example, using the fluidinfo.py library the last example query can be executed as follows:


>>> import fluidinfo
>>> import pprint
>>> headers, result = fluidinfo.call('GET', '/values', tags=['oreilly.com/title', 'oreilly.com/author-names', 'oreilly.com/publication-year'], query='oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006')
>>> pprint.pprint(headers)
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '937',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=oreilly.com%2Ftitle+matches+%22Python%22+and+oreilly.com%2Fpublication-year+%3C+2006&tag=oreilly.com%2Ftitle&tag=oreilly.com%2Fauthor-names&tag=oreilly.com%2Fpublication-year',
 'content-type': 'application/json',
 'date': 'Thu, 10 Mar 2011 15:17:58 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> pprint.pprint(result)
{u'results': {u'id': {u'1a91e021-7bce-4693-bfa5-0dc437fe1817': {u'oreilly.com/author-names': {u'value': [u'Anna Ravenscroft',
... etc ...

Learn more

Hopefully, this has explained enough to get you started. If you don't have a Fluidinfo account, you can sign up here. If you have any questions, please don't hesitate to get involved with the Fluidinfo community, contact us directly or join us on IRC. We'll be more than happy to help!

Comments (1)

O’Reilly Fluidinfo Chrome extension

Filed under: Howto,People,Programming,Writable APIs — Nicholas Tollervey @ 9:40 am

To help people get going with the API competition announced today on the O’Reilly Radar site, Emanuel Carnevale has written a cool extension for Google’s Chrome browser. The extension shows some of the non-O’Reilly tags on the book objects and also lets you indicate which O’Reilly books you own. It does this by putting tags onto the objects representing O’Reilly books in Fluidinfo.

To install the extension onto your Chrome browser click on the following link (from within Chrome): https://fluiddb.fluidinfo.com/about/oreilly.com/fluidinfo/chrome-extension.crx. Your browser will guide you through what to do. It’s pretty obvious stuff. Once it’s installed you’ll see a new icon in the top right hand corner of the browser window between the address bar and the little spanner icon:

Click the icon and sign in with your Fluidinfo credentials. If you don’t yet have an account on Fluidinfo you can sign up here.

How do you use it..?

Simple. Go visit the O’Reilly catalog and click on one of the books you own. For example, I happen to be the proud owner of Natural Language Processing with Python. If you visit the page for the book you’ll notice a new small Fluidinfo icon in the book details:

Click the icon and you’ll see a pop-up like this:

You can click on the appropriate statement at bottom to indicate ownership or not, as the case may be.

The writable API gives us all a voice

The extension uses an “owns” tag in your top-level Fluidinfo namespace to indicate book ownership on the objects in Fluidinfo. For example, my tag is called “ntoll/owns”. The extension attaches this tag to the object representing the O’Reilly book whose page you are visiting.

Because the extension tags the exact same Fluidinfo objects that have the O’Reilly information, I can start to do some really cool searches. For example, I happen to know Terry has a particularly large O’Reilly “zoo” as do I (in fact, doesn’t every developer..?). We can see what books we both own about Python with the following query:


oreilly.com/title matches "Python" and has terrycojones/owns and has ntoll/owns

The following code snippet for running this query uses the fluidinfo.py client library from within the Python shell. Alternatively, you can see the result directly if you visit this URL.


>>> import fluidinfo
>>> import pprint
>>> headers, result = fluidinfo.call('GET', '/values', tags=['oreilly.com/title',], query='oreilly.com/title matches "Python" and has terrycojones/owns and has ntoll/owns')
>>> pprint.pprint(result)
{u'results': {u'id': {
                      u'01371c03-9097-4267-a137-ae88a23790ef': {u'oreilly.com/title': {u'value': u'Python Pocket Reference, Fourth Edition'}},
                      u'4e9c42b6-68cb-43f5-9b75-60af9c0bd5a7': {u'oreilly.com/title': {u'value': u'Programming Python, Fourth Edition'}},
                      u'cd0838db-96ae-42ae-98c9-248a1507e2bb': {u'oreilly.com/title': {u'value': u'Python in a Nutshell, Second Edition'}}}}}

This illustrates how anyone can add tags to the objects being used by O’Reilly, and can then search based on their own additions and those of others. That’s why we say that Fluidinfo provides writable APIs. Cool 🙂

Run with it!

There’s obviously a lot more that could be done with this extension. We kept it simple mainly because we wanted to give an example of how such an extension could be written. We hope it can provide a basis for your own efforts, especially if you’re entering the O’Reilly API competition. Emanuel has released the source code for the extension so you can grab it from Github and take it from there!

Comments (0)

February 23, 2011

How to make an API in Fluidinfo

Filed under: Data,Howto,Programming — Nicholas Tollervey @ 11:10 am

It’s very simple really:

1. Register a domain/user on Fluidinfo

Start here. If you’re registering a domain name then we will require proof of ownership (the instructions explaining how to do this are very simple).

2. Create your namespaces and tags

Be careful, you’re choosing how your data will be structured in Fluidinfo. Some tips we’ve found useful:

Flat is good.
Use namespaces to differentiate between the different sorts of things you’ll be tagging (e.g. between books and authors).
Copy conventions (how do others organise their data?).
KISS! Keep it simple (stupid!).

3. Import your data into Fluidinfo

To help you, we have a Python based script/library called FLIMP. Nevertheless, there are lots of freely available libraries that you may want to adapt yourself.

4. Announce your new API

Programmers will interact with your data via the general Fluidinfo API, which is simple and well documented. All you need to do is tell the world that your data is available, and what namespaces and tags you’re using to store it in Fluidinfo.

That’s it.

Please feel free to get in touch at any time if you have any questions or would like to explore the possibility of Fluidinfo Inc. helping you to add your data to Fluidinfo.

Comments (2)

ReadWriteWeb ReadWriteAPI

Filed under: APIs,Data,Writable APIs — Nicholas Tollervey @ 11:04 am

Over the weekend I scraped the 11300 or so articles in the ReadWriteWeb archive. These are a great source of technology news and analysis covering stories from 2003 to the present day. Rather than keep this to myself (and rather unsurprisingly) I imported the metadata about each article into Fluidinfo. Hey presto, another instant API emerges!

Here’s how it works. For each article in the ReadWriteWeb archive there is an object in Fluidinfo. Each object has a unique “about” tag-value: the URL of the article. Furthermore, each object is annotated with information using tags found under the readwriteweb.com top level namespace. Tags include title, extract, date, categories and so on. In other words, you might visualize each object something like this:

I’ve also created and annotated objects about each of the authors of ReadWriteWeb articles and tagged objects representing each website ever mentioned by ReadWriteWeb.

So, it’s now possible to use the API like this:


>>> import fluidinfo
>>> returnTags = ['readwriteweb.com/title', 'readwriteweb.com/author-name', 'readwriteweb.com/extract', 'readwriteweb.com/date', ]
>>> query = "readwriteweb.com/year = 2010 and readwriteweb.com/month = 5 and readwriteweb.com/day = 5"
>>> head, result = fluidinfo.call('GET', '/values', tags=returnTags, query=query)
>>> head['status']
'200'
>>> result
{u'results': 
    {u'id': 
        {u'05936b9b-4c20-4887-9607-f63752e7f274':
            {u'readwriteweb.com/author-name': {u'value': u'Sarah Perez'},
              u'readwriteweb.com/date': {u'value': u'May  5, 2010  7:24 AM'},
              u'readwriteweb.com/extract': {u'value': u"Feel like hacking your phone today? If you've got about 10 minutes to spare, you can turn your iPhone into a Wi-Fi hotspot using a combination of the ..."},
              u'readwriteweb.com/title': {u'value': u'How To Turn Your iPhone into a Wi-Fi Hotspot'}},
        ... etc....

What’s just happened..? I used a client library (fluidinfo.py) to ask Fluidinfo to return the author name, publication date, title and an extract of all ReadWriteWeb articles published on the 5th May 2010.

Being able to search and extract data from an API is cool, especially since you get this by virtue of simply hosting your data in Fluidinfo. But this is ReadWriteWeb we’re talking about. Happily, Fluidinfo can accommodate.


>>> fluidinfo.login('ntoll', 'mysecretpassword') # change as appropriate
>>> headers, result = fluidinfo.call('PUT', ['about', 'http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php', 'ntoll', 'rating'], 10)
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-type': 'text/html',
 'date': 'Wed, 23 Feb 2011 15:07:29 GMT',
 'server': 'nginx/0.7.65',
 'status': '204'}

The example above shows how I sign in and annotate the object “about” the article http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php with a tag called ntoll/rating and an associated value of 10 (obviously I enjoyed this article). The HTTP 204 response status tells me the value was successfully tagged.

Let’s just pause here for a moment and consider what I’ve just been able to do. Because Fluidinfo is openly writable I’m able to annotate the objects about ReadWriteWeb articles with my own data. Since objects in Fluidinfo don’t have owners or permissions attached to them I didn’t have to ask ReadWriteWeb for permission to augment the data about the article in question. Furthermore, if I only want my buddies to see what my ratings are I can set the tag to be only visible to a specific group of people. In this way Fluidinfo remains openly writable yet I still retain ownership and control over my data.

We’ve seen “read” and “write”, but what about “web”..?

Well it turns out I can stretch this analogy even further. Because everyone is tagging the same objects (identified by their “about” tag values) the data is being linked by virtue of the context of the object. We’re starting to get a web of linked data (yeah, I know, bear with me on this one…).

Since I can search and retrieve using any of the tags for which I have “read” permission I can start to create really cool mash-ups of data like this:


>>> header, result = fluidinfo.call('GET', '/values', tags=['fluiddb/about', 'boingboing.net/mentioned', 'readwriteweb.com/mentioned'], query="has boingboing.net/mentioned and has readwriteweb.com/mentioned and has unionsquareventures.com/portfolio")
>>> header
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '23528',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=has+boingboing.net%2Fmentioned+and+has+readwriteweb.com%2Fmentioned+and+has+unionsquareventures.com%2Fportfolio&tag=fluiddb%2Fabout&tag=boingboing.net%2Fmentioned&tag=readwriteweb.com%2Fmentioned',
 'content-type': 'application/json',
 'date': 'Wed, 23 Feb 2011 15:24:36 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> len(result['results']['id'])
4
>>> for r in result['results']['id'].values():
...     print r['fluiddb/about']['value']
... 
http://www.twitter.com
http://www.etsy.com
http://www.boxee.tv
http://www.meetup.com

What..? I’ve just asked Fluidinfo for all the articles from BoingBoing and ReadWriteWeb about companies backed by Union Square Ventures that both BoingBoing and ReadWriteWeb have covered. It turns out there are four companies: Twitter, Etsy, Boxee and Meetup.

What do one of these results look like..?


{u'boingboing.net/mentioned': 
    {u'value': [u'http://boingboing.net/2009/11/06/vampireotherkinenerg.html',
                     u'http://boingboing.net/2010/01/11/ny-times-on-urban-ca.html',
                     u'http://boingboing.net/2010/10/26/ron-paul-supporter-w.html',
                     u'http://boingboing.net/2002/06/27/meetup-meatspace-cam.html',
                     u'http://boingboing.net/2004/03/17/wired-rave-awards.html',
                     u'http://boingboing.net/2006/01/05/net-pug-nabbed-by-cr.html']},
u'fluiddb/about': 
    {u'value': u'http://www.meetup.com'},
u'readwriteweb.com/mentioned': 
    {u'value':  [u'http://www.readwriteweb.com/archives/meetup_the_secret_campaign_weapon.php']}}

What was involved in making such a cool query possible..? Simply importing data into Fluidinfo.

I’ll say no more and let you ponder the implications of what I’ve just demonstrated…

Comments (3)

February 9, 2011

Mining the BoingBoing API

Filed under: Awesomeness,Howto,Programming — Nicholas Tollervey @ 4:54 pm

With all the BoingBoing data from the past ten years now in Fluidinfo the next question is “what can we do with it..?”. That’s what I’ll be answering in this technical how-to, so expect lots of code / examples!

I’ve organised the article into four parts:

Basic Fluidinfo concepts
How BoingBoing data is organised
Minecraft (example data mining interactions with the API)
Super-duper cool stuff (this is the best bit!)

Basic Fluidinfo Concepts

Understanding Fluidinfo involves four simple concepts:

Objects represent things.
Tags define objects’ attributes.
Namespaces organise tags.
Permissions apply to namespaces and tags.

How does this all fit together..? Objects are simply tagged with data. Put another way, tags associate a value with an object.

The other important concept to make clear is that nobody owns objects, there are no permissions associated with objects and objects last for ever. Although every object has a unique ID they are also usually identified by a globally unique and immutable “about” tag value. It’s used as you’d expect: to indicate what the object is supposed to be about. Finally, anyone can add data to any object (more on this later).

(er… that’s really all it is.)

Of course, since Fluidinfo is a data-store it is possible to do searches, link objects and store all sorts of different types of data (from primitive types like numbers, booleans and text to more opaque values such as images, video, sound and other binary data).

Oh yeah, interaction with the data is via a simple yet powerful REST API. There are plenty of client libraries in many different languages which allow you to work without worrying about the dirty implementation details.

How the BoingBoing data is organised in Fluidinfo

Each of the 64,000 BoingBoing articles is represented by a corresponding Fluidinfo object whose about tag value is the URL of the original post on boingboing.net. In the original XML dump, each post looked something like this:


    <row>
        <permalink>http://boingboing.net/2000/01/21/street-tech-reviews-.html</permalink>
        <created_on>2000-01-21 14:07:38</created_on>
        <basename>street_tech_reviews_</basename>
        <author>Mark Frauenfelder</author>
        <title>Street Tech Reviews and news</title>
        <body><A HREF="http://www.streettech.com/">Street Tech</A> Reviews and news for gadget-lovers and propeller heads of all stripes.</body>
        <body_more>NULL</body_more>
        <comment_count>0</comment_count>
        <categories>NULL</categories>
    </row>

I’ve done the simplest thing possible: created a top-level boingboing.net namespace in Fluidinfo under which all tags used to annotate BoingBoing data are defined. I’ve added tags to this namespace that map to the original XML elements: permalink, created_on, basename, author, title, body, body_more, comment_count and categories. The Fluidinfo objects representing BoingBoing posts have data associated with them using these tags. For example, the object representing the post described in the XML example above has a boingboing.net/title tag with the associated value: “Street Tech Reviews and news”.

Since I was also cleaning the raw XML I decided to extract / re-structure some of the data. This resulted in some additional tags: year, month, day, timestamp, links and domains. The function of the date related tags should be clear. The links and domains tags are interesting because I scraped all the anchor tags in the body and body_more fields and processed the href values. Obviously the links tag references a list of all the URLs referenced in an article and the domains tag references a related list containing just the domain names.

I did one final enhancement to the data dump. I extracted all the authors and categories and turned them into tags. When I imported the data I used these tags in the “delicious” way of tagging: simply by having such a tag (with no associated value) an object is associated with an author or category.

Here’s what an object representing a BoingBoing article looks like:

Another interesting view on the data is to explore the BoingBoing tags and namespaces in the Fluidinfo Explorer (see the screen-shot on the right). In the Explorer, if you right-click on a tag and select “Open Object” you’ll see the object that represents the tag in the main area of the application. This object is itself tagged with useful information – such as a description (containing copyright information). Yeah, I know, it sounds odd but this makes meta-tagging possible.

In addition to creating Fluidinfo objects for all the BoingBoing articles I also created an object for every domain referenced by BoingBoing throughout the last ten years.

The about tag value for these domain objects is the domain name itself. For example, there is an object about the “bbc.co.uk” domain.

Each of these domain objects has been tagged with a list of all the BoingBoing articles that mention them. This is, I think, rather cool. To continue the example, the bbc.co.uk domain was referenced in 177 BoingBoing articles.

Minecraft (example data mining interactions with the API)

So here comes the cool how-to stuff…

Should you need to, use the existing documentation to read about the Fluidinfo API in super-painfully-precise-techno-vision. However, I’m going to present a quick guided tour in the form of a Python session using the fluiddb.py module (remember my advice to use one of the client libs). The advantage of using fluiddb.py is that it’s a very thin layer on top of the HTTP API so you get a feel for how various things work. The other advantage is that reading Python is like reading pseudo-code and is thus a great teaching tool.

In the following example I simply import the fluiddb module and ask it for information about my user (ntoll). The basic pattern for calling Fluidinfo is: fluiddb.call(“HTTP-VERB“, “PATH IN API“, OTHER OPTIONAL ARGS)


>>> import fluiddb # loads the module into the session
>>> headers, body = fluiddb.call('GET', '/users/ntoll') # 'GET' is the HTTP verb & '/users/ntoll' is the API path
>>> headers # contains the HTTP headers returned from Fluidinfo
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '76',
 'content-location': 'https://fluiddb.fluidinfo.com/users/ntoll',
 'content-type': 'application/json',
 'date': 'Tue, 08 Feb 2011 19:42:10 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> body # contains the actual result, in this case basic information about the user ntoll (me)
{u'id': u'a694f2d0-428e-4aaf-85d1-58e903f56b30',
 u'name': u'Nicholas Tollervey'}

Notice how the “content-location” in the headers tells you what the full URL of the API call is (this is interesting since fluiddb.py creates this automagically for you). The body (result) is a Python dict object that basically mirrors the JSON dict object Fluidinfo served up.

The following example grabs information about a specific object. Notice that I pass in the path to the Fluidinfo resource I’m GETting as a list. This ensures that the BoingBoing URL gets correctly percent encoded.


>>> headers, body = fluiddb.call('GET', ['about', 'http://boingboing.net/2000/01/21/street-tech-reviews-.html']) # get basic information about the object about "http://boingboing.net/2000/01/21/street-tech-reviews-.html"
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '455',
 'content-location': 'https://fluiddb.fluidinfo.com/about/http%3A%2F%2Fboingboing.net%2F2000%2F01%2F21%2Fstreet-tech-reviews-.html',
 'content-type': 'application/json',
 'date': 'Tue, 08 Feb 2011 19:45:27 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> body
{u'id': u'469257cf-2c33-4628-a97e-47166bae24fa',
 u'tagPaths': [u'boingboing.net/timestamp',
               u'fluiddb/about',
               u'boingboing.net/day',
               u'boingboing.net/month',
               u'boingboing.net/year',
               u'boingboing.net/authors/markfrauenfelder',
               u'boingboing.net/comment_count',
               u'boingboing.net/author',
               u'boingboing.net/basename',
               u'boingboing.net/body',
               u'boingboing.net/domains',
               u'boingboing.net/created_on',
               u'boingboing.net/permalink',
               u'boingboing.net/title',
               u'boingboing.net/links']}
>>>

Hopefully, the result speaks for itself: it contains the unique ID of the Fluidinfo object that is about the BoingBoing URL, and a list of the tags on that object. Getting the value of a specific tag is simple:


>>> headers, body = fluiddb.call('GET', '/objects/469257cf-2c33-4628-a97e-47166bae24fa/boingboing.net/title')
>>> body
u'Street Tech Reviews and news'

I simply appended the path to the tag onto the object’s unique ID (this also works with the about tag too as used in the prior example).

Returning tag values for a set of results that match a query is also easy. The equivalent of the following SQL-esque query:


SELECT title, categories, created_on FROM boingboing.net WHERE authors="markfrauenfelder" AND year=2010;

… is:


>>> headers, body = fluiddb.call('GET', '/values', tags=['boingboing.net/title', 'boingboing.net/created_on', 'boingboing.net/categories'], query="has boingboing.net/authors/markfrauenfelder and boingboing.net/year=2010")

A call is made to the “/values” endpoint with a list of tags whose values we want returned and a query to generate the result set. The query is written in Fluidinfo’s super-simple query language. The headers of the response look like this:


>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '287328',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=has+boingboing.net%2Fauthors%2Fmarkfrauenfelder+and+boingboing.net%2Fyear%3D2010&tag=boingboing.net%2Ftitle&tag=boingboing.net%2Fcreated_on&tag=boingboing.net%2Fcategories',
 'content-type': 'application/json',
 'date': 'Wed, 09 Feb 2011 10:55:50 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}

The actual results are a JSON object (of which the following is only a fragment):


{
  "results": {
    "id": {
      "f2976562-eba6-47e4-94a1-b36ffe9a2ab1": {
        "boingboing.net/created_on": {
          "value": "2010-10-14 13:14:14"
        }, 
        "boingboing.net/categories": {
          "value": [
            "science", 
            "technology", 
            "art and design", 
            "design"
          ]
        }, 
        "boingboing.net/title": {
          "value": "TED releases iPad app today"
        }
      }, 
      "627ebf2e-e38d-41da-a709-16294b4ab6f2": {
        "boingboing.net/created_on": {
          "value": "2010-02-19 11:29:36"
        }, 
        "boingboing.net/categories": {
          "value": [
            "culture"
          ]
        }, 
        "boingboing.net/title": {
          "value": "Miniboss T-shirt in the Boing Boing Bazaar"
        }
      } // etc... for lots of results
    }
  }
}

Happily, fluiddb.py has converted it into the Python equivalent so we can find out some useful information and look at individual results.


>>> len(body['results']['id']) # how many results do we have..?
1214
>>> body['results']['id'].keys()[0] # what's the id of the first result..?
u'f2976562-eba6-47e4-94a1-b36ffe9a2ab1'
>>> body['results']['id']['f2976562-eba6-47e4-94a1-b36ffe9a2ab1'] # show the record for the first result...
{u'boingboing.net/categories': {u'value': [u'science',
                                           u'technology',
                                           u'art and design',
                                           u'design']},
 u'boingboing.net/created_on': {u'value': u'2010-10-14 13:14:14'},
 u'boingboing.net/title': {u'value': u'TED releases iPad app today'}}

Great! So you have all the tools you need to search and explore all the BoingBoing articles from the last ten years. That’s what a conventional data API provides.

However, Fluidinfo can do additional super-duper cool stuff..!

Super-duper cool stuff!

Fluidinfo is an openly writeable database where objects have value because they are annotated with data from different sources. That’s why anyone can tag any data to any object. Since you control who can use, read and control your namespaces and tags, you still maintain control of data and importantly create a mechanism for trust.

You can trust values annotated with tags from the boingboing.net namespace because only BoingBoing is allowed to create and edit anything under this namespace. Since BoingBoing has annotated objects with information about articles then it’s safe to assume the objects are about a BoingBoing articles.

Here’s the super-duper stuff: you can contribute data to these objects too.

How..?

I’m glad you asked… 🙂

First of all you’ll need an account on Fluidinfo. Once you’ve signed up you’ll be the proud owner of a top-level namespace with the same name as your username. Before you can add data to objects you’ll need to create some tags to achieve this:


>>> fluiddb.login('ntoll', 'top-secret-password') # change as appropriate
>>> newTag = {'name': 'tuba', 'description': 'Related to Tubas in some way so it must be awesome!', 'indexed': False})
>>> headers, body = fluiddb.call('POST', '/tags/ntoll', newTag) # create new tag in /ntoll namespace
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '104',
 'content-type': 'application/json',
 'date': 'Wed, 09 Feb 2011 13:08:52 GMT',
 'location': 'https://sandbox.fluidinfo.com/tags/ntoll/tuba',
 'server': 'nginx/0.7.65',
 'status': '201'}
>>> headers, body = fluiddb.call('GET', '/tags/ntoll/tuba', returnDescription=True)
>>> body
{u'description': u'Related to Tubas in some way so it must be awesome!',
 u'id': u'b03f6937-cebf-481d-a0eb-5fd355a8a602',
 u'indexed': False}

The new tag is given a name (“tuba”), description and an indication if it should be indexed. The “201” status that Fluidinfo returned confirms that the new tag was successfully created under the “ntoll” namespace.

In case you hadn’t guessed I like tubas! I’d like others to find other tuba related objects in Fluidinfo so I’ve decided I’ll attach this newly created tag to anything tuba-related, including BoingBoing posts. As it happens Fluidinfo helps me get a bunch of these posts with a search like this:


>>> headers, body = fluiddb.call('GET', '/values', tags=['boingboing.net/title', 'fluiddb/about',], query = 'fluiddb/about matches "tuba"')
>>> body
{
  "results": {
    "id": {
      "e6c108f4-bd10-4cd3-b7d5-ad549b988c28": {
        "fluiddb/about": {
          "value": "http://boingboing.net/2006/06/22/flaming-tuba-guy-dav.html"
        }, 
        "boingboing.net/title": {
          "value": "Flaming Tuba guy David Silverman on NBC Tonight Show 6/23"
        }
      }, 
      "0c006f04-0663-48d6-9f11-4e082e75eb51": {
        "fluiddb/about": {
          "value": "http://boingboing.net/2010/11/22/tuba-skinny-old-time.html"
        }, 
        "boingboing.net/title": {
          "value": "Tuba Skinny: Old timey blues and jazz street act from New Orleans"
        }
      }
    }
  }
}

I’ve simply queried for matches for the word “tuba” in the fluiddb/about tag. Now that I’ve got a couple of results I can tag them like so:


>>> for tubaItem in body['results']['id']:
...     header, body = fluiddb.call('PUT', '/objects/%s/ntoll/tuba' % tubaItem, "Umpah-tastical, man!")
...     print header['status']
'204'
'204'

Yay! I’ve added some information to a couple of objects about BoingBoing articles! Let’s just confirm this by asking Fluidinfo for all the objects tagged with ntoll/tuba:


>>> headers, body = fluiddb.call('GET', '/values', tags=['fluiddb/about', 'boingboing.net/title', 'ntoll/tuba', ], query="has ntoll/tuba")
>>> body
{
  "results": {
    "id": {
      "e6c108f4-bd10-4cd3-b7d5-ad549b988c28": {
        "ntoll/tuba": {
          "value": "Umpah-tastical, man!"
        }, 
        "fluiddb/about": {
          "value": "http://boingboing.net/2006/06/22/flaming-tuba-guy-dav.html"
        }, 
        "boingboing.net/title": {
          "value": "Flaming Tuba guy David Silverman on NBC Tonight Show 6/23"
        }
      }, 
      "0c006f04-0663-48d6-9f11-4e082e75eb51": {
        "ntoll/tuba": {
          "value": "Umpah-tastical, man!"
        }, 
        "fluiddb/about": {
          "value": "http://boingboing.net/2010/11/22/tuba-skinny-old-time.html"
        }, 
        "boingboing.net/title": {
          "value": "Tuba Skinny: Old timey blues and jazz street act from New Orleans"
        }
      }, 
      "024bf1b6-348d-4839-8700-cbb30d86fb97": {
        "ntoll/tuba": {
          "value-type": "image/jpg", 
          "size": 467947
        }, 
        "fluiddb/about": {
          "value": "CrossCountryTuba"
        }
      },  
      "a694f2d0-428e-4aaf-85d1-58e903f56b30": {
        "ntoll/tuba": {
          "value": "I play tuba!"
        }, 
        "fluiddb/about": {
          "value": "Object for the user named ntoll"
        }
      }
    }
  }
}

Oops, I forgot I’d already tagged a couple of non-BoingBoing objects with the ntoll/tuba tag: one whose about tag value is “CrossCountryTuba” and the other being the object that represents me in Fluidinfo.

Notice how the value for the ntoll/tuba tag on the object about “CrossCountryTuba” contains only metadata: the type of data stored by that tag on that particular object (image/jpg) and the size of the data (467947 bytes). Looks like it’s an image of some sort. Let’s get it and see:


>>> headers, body = fluiddb.call('GET', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/tuba')
>>> image = open('tuba.jpg', 'w')
>>> image.write(body)
>>> image.close()

And what does tuba.jpg contain..?

Cool! Fluidinfo stores any type of data so long as you supply the appropriate MIME type when you upload the data.

How did I get the data into Fluidinfo..?


>>> tuba = open('Desktop/tuba.jpg', 'r') # open the original image
>>> header, body = fluiddb.call('PUT', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/tuba', tuba.read(), mime='image/jpg') # notice how I specify the MIME type
>>> tuba.close()
>>> headers['status'] # check we got a 200 OK response
'200'
>>> header, body = fluiddb.call('PUT', '/objects/024bf1b6-348d-4839-8700-cbb30d86fb97/ntoll/attribution', 'Tuba photo source: http://www.flickr.com/photos/dust/3813581130 licensed under a CC-BY 2.0 license') # need to add attribution as per the license

Simple..!

Now we’ve covered a lot of ground, so let’s just consider where we’ve got to.

We have a consistent, simple and powerful API to play with.
We can retrieve values using a simple query language referencing data contributed from many different users.
We can contribute data ourselves in such a way that the data remains under our control.
We can put all our data in the right place. If I want to contribute something about a BoingBoing article I just tag it to the object representing the right BoingBoing article.
We can contribute all sorts of data be it searchable primitive values like numbers, text and booleans or opaque data such as images, audio or anything else for which you can specify a MIME type.

You’re armed with enough basic knowledge to both mine BoingBoing data and contribute to it too. In fact, if you look carefully you’ll find all sorts of interesting objects in Fluidinfo. Remember, to find out more about the API check out our technical documentation.

Dive in, have fun and we’re more than happy to answer questions.

Image credits: BoingBoing’s logo and font butchered with permission (thanks @mustardhamsters!), diagram generated by abouttag written by Nick Radcliffe and the “Cross Country Tuba image” is © 2009 Amanda M Hatfield under a Creative Commons license.

Comments (0)

January 27, 2011

How we made an API for BoingBoing in an evening

Filed under: Awesomeness,Essence,Howto,Programming,Progress — Nicholas Tollervey @ 9:05 am

Yesterday the folks over at boingboing.net posted eleven year’s worth of posts as a zipped up XML file. XML is good, but having a searchable database of posts is better. So I (ntoll) am in the process of importing all the data into Fluidinfo. 🙂

When finished, every post and author in the boingboing data dump will be represented by an object in Fluidinfo and tagged with useful information. The diagram below shows a representation of what a typical object about a boingboing.net post looks like:

The object (the red blob with a unique ID written inside it) has several tags attached to it (named “boingboing.net/author” and “boingboing.net/comment_count” for example) with associated values (“Mark Frauenfelder” and “53” respectively).

Furthermore, while I was cleaning/preparing the data for upload I made sure to extract every domain name and URL referenced in each post and annotate the publication date as computer friendly values rather than just a human readable date.

An instant win is the ability to query data. For example, you’ll be able to search for all posts that link to techcrunch.com written in 2010 by Cory Doctorow. This is how to write the query in Fluidinfo’s super simple query language:

boingboing.net/domains contains "techcrunch.com" and boingboing.net/year = 2010 and boingboing.net/author = "Cory Doctorow"

The result will depend on how you make the query, but let’s assume you’re using a /values based call in Fluidinfo’s REST api and you’ve asked for each post’s title, publication date and a list of domains mentioned. You’ll get back some JSON encoded data that looks something like this:

[
  "results" : {
        "id" : {
            "05eee31e-fbd1-43cc-9500-0469707a9bc3" : {
                "boingboing.net/title" : {
                    "value" : "This is a made up title for illustrative purposes"
                },
                "boingboing.net/created_on" : {
                    "value" : "2010-08-19 13:23:41"
                },
                "boingboing.net/domains" : {
                    "value": [ 
                        "techcrunch.com", 
                        "microsoft.com"
                    ]
                }
            },
            "0521e31e-fbd1-43cc-9500-046974569bc3" : {
               ... more results ...
            }
        }
    }
  }
]

Wait a minute..!?!? This is just as if boingboing.net had an API.

Actually, by importing the flat XML file into Fluidinfo they do have an API – for free! Because of Fluidinfo’s open nature anyone can now make use of boingboing’s data via a few simple and easy to construct RESTful calls to Fluidinfo.

But that’s not all..!

Fluidinfo isn’t just openly readable – it’s openly writeable too.

Huh..?

Any user of Fluidinfo can tag data to any object. For example, I control a couple of tags called “ntoll/rating” and “ntoll/comment” which I could attach to any of the objects representing boingboing.net posts. By tagging an object with associated values I’m indicating what I thought about the post.

Importantly, I know which object I want to tag because it has a special unique tag called “about” whose value is the URL to the boingboing.net post in question. Other people who want to add information about this post will know to use the same object as me because the about tag-value tells them, er, what the object is about.

This brings me to the killer point: accessing data from boingboing.net is good, but the facility to annotate, discover and re-use everyone’s data about boingboing.net posts is better. That’s why we sometimes say we’re trying to do to databases what Wikipedia did to encyclopaedias.

Users of Fluidinfo don’t have to retrieve information about boingboing.net posts by building queries using just boingboing.net tags. It’s possible to search using other people’s tags. For example, here’s how to search for posts where I’ve given it a relatively high rating and added a comment:

ntoll/rating > 6 and has ntoll/comment and has boingboing.net/title

And users don’t have to just ask for boingboing.net related tag-values either. It’s possible to ask objects for all their tags that you have permission to see. For example, you could retrieve a matching post’s title, body, author and any comments I make about the post with the ntoll/comment tag.

I’m only scratching the surface here so I’ll follow up with another post soon with some example code and use cases. In the meantime, if you want to find out more feel free to get in touch with us. We’re more than happy to help.

If you’re a developer and want to play with the boingboing.net data you should take a read of my last post explaining how to explore Fluidinfo’s API with Python.

In case you were wondering, it really was only half an evening’s work to prepare the data and write the import script. 🙂

Note: The import is currently running but should be complete later this afternoon. Not all posts will be in Fluidinfo yet (so far we have everything up to the end of September 2008).

Comments (13)

January 14, 2011

Exploring FluidDB with fluiddb.py

Filed under: Howto,Programming — Nicholas Tollervey @ 10:27 am

FluidDB.py is a Python module based upon work by Seo Sanghyeon. The module has been extracted, extended and unit-tests were added by me (Nicholas Tollervey).

This post leads you from signing up to FluidDB to executing commands and queries using fluiddb.py. It assumes you’re already familiar with the concepts behind FluidDB and that you’re a developer looking to experiment with the API. If you’re not familiar or need a refresher take a look at the following slides:

We’ll be using Python but don’t assume any familiarity with that language. So lets get started…

When you sign up for FluidDB the following steps happen:

An object representing the new user is created and tagged with useful information. You can get the object’s id with the following call to the API: https://fluiddb.fluidinfo.com/users/terrycojones (replace “terrycojones” with the username you’re interested in). This will return a JSON object containing the user’s “real” name and the object id. You can see what’s tagged to the user’s object by making a call like this: https://fluiddb.fluidinfo.com/objects/05eee31e-fbd1-43cc-9500-0469707a9bc3 (replacing the id with that of the object representing the user you’re interested in).
A new top-level namespace is also created with the same name as the new user. The default permissions (which can be changed later) mean that only the new user is allowed to create new namespaces or tags in this location but everything is openly readable. You can find out what’s in a namespace with an API call like this: https://fluiddb.fluidinfo.com/namespaces/terrycojones?returnDescription=True&returnNamespaces=True&returnTags=True which returns a JSON dictionary containing lists of child tags, namespaces and the namespace’s description.

In order to access the API with your new credentials you need to use basic HTTP authentication over SSL (i.e. the URI starts with “https”). Obviously, using the raw API with a browser or tools such as curl or wget isn’t practical for most people. Hence the need for fluiddb.py as a wrapper (there are lots of API wrappers for FluidDB in many different languages, check out our list of libraries on our website for more information).

Installing fluiddb.py is as simple as typing:

$ pip install -U fluiddb.py

$ easy_install fluiddb.py

or, if you want to install from source:

$ git clone https://github.com/ntoll/fluiddb.py.git $ cd fluiddb.py $ python setup.py install

Simply import fluiddb to get started. The following Python terminal session demonstrates what I mean:

$ python Python 2.6.5 (r265:79063, Apr 16 2010, 13:09:56) [GCC 4.4.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import fluiddb

The fluiddb.instance variable indicates which instance of FluidDB the module is using (it defaults to the main instance – the sandbox can be used for the purposes of testing/experimentation). Make use of the fluiddb.MAIN and fluiddb.SANDBOX “constants” to change instance as shown below.

>>> fluiddb.SANDBOX 'https://sandbox.fluidinfo.com' >>> fluiddb.instance = fluiddb.SANDBOX >>> fluiddb.MAIN 'https://fluiddb.fluidinfo.com' >>> fluiddb.instance = fluiddb.MAIN

Use the login and logout functions to, er, login and logout (what did you expect..?):

>>> fluiddb.login('username', 'password') >>> fluiddb.logout()

The most important function provided by the fluiddb module is call(). You must supply at least the HTTP method and path as the first two arguments to a call to the REST API. For example, the following call gets information about Terry Jones:

>>> fluiddb.call('GET', '/users/terrycojones') ({'status': '200', 'content-length': '69', 'content-location': 'https://fluiddb.fluidinfo.com/users/terrycojones', 'server': 'nginx/0.7.65', 'connection': 'keep-alive', 'cache-control': 'no-cache', 'date': 'Fri, 14 Jan 2011 15:15:51 GMT', 'content-type': 'application/json'}, {u'name': u'Terry Jones', u'id': u'05eee31e-fbd1-43cc-9500-0469707a9bc3'})

Notice how call() returns a tuple containing two items:

The header dictionary
The content of the response (if there is any)

Often it is simply better to do the following:

>>> headers, content = fluiddb.call('GET', '/users/test')

So the response headers get put into the “headers” variable and the actual content of the response is found in the “content” variable.

It is also possible to send the path as a list of path elements:

>>> headers, content = fluiddb.call('GET', ['about','yes/no','test','foo'])

Which will ensure that each element is correctly percent encoded even if it includes problem characters like slash: ‘/’ (essential for using the “about” based API).

If the API involves sending json data to FluidDB simply send the appropriate Python dict object and fluiddb.py will “jsonify” it appropriately for you:

>>> headers, content = fluiddb.call('POST', '/objects', body={'about': 'an-example'})

If the body argument isn’t a Python dictionary then you must be HTTP PUTting a tag-value on an object. In which case, it’s possible to set the mime-type of the value passed in body:

>>> headers, content = fluiddb.call('PUT', '/about/an-example/test/foo', body='<html><body>Hello, World!</body></html>', mime='text/html')

If you’re PUTting a primitive value then the fluiddb.py will automatically provide the correct mime-type for you:

>>> headers, content = fluiddb.call('PUT', '/about/an-example/test/foo', 12345)

To send URI arguments simply append them as arguments to the call() method:

>>> headers, content = fluiddb.call('GET', '/permissions/namespaces/test', action='create')

The “action = ‘create'” argument will be turned into “?action=create” appended to the end of the URL sent to FluidDB.

Furthermore, if you want to send some custom headers to FluidDB (useful for testing purposes) then supply them as a dictionary via the custom_headers argument:

>>> headers, content = fluiddb.call('GET', '/users/test', custom_headers={'Origin': 'http://foo.com'})

Finally, should you be sending a query via the /values endpoint then you can supply the list of tags whose values you want returned via the tags argument. For example, the following call will return the about-tag value and the twitter screen name of those twitter users I have met in the real world:

>>> headers, content = fluiddb.call('GET', '/values', tags=['fluiddb/about', 'twitter.com/users/screen_name'], query='has ntoll/met')

If this walkthrough isn’t enough then check out the three screencasts below. I made them a few months ago so I’m demonstrating an older version of fluiddb.py but it’s pretty much unchanged (only the implementation details described in part two have changed a little).

Using FluidDB’s RESTful API with fluiddb.py (Part 1)

Using FluidDB’s RESTful API with fluiddb.py (Part 2)

Using FluidDB’s RESTful API with fluiddb.py (Part 3)

As always, if you have any question, encounter problems or simply want to give us feedback, get in touch!

Comments (1)

January 10, 2011

Introducing the Fluidinfo Explorer

Filed under: Awesomeness,Howto — Nicholas Tollervey @ 9:09 am

Normally users will use applications that use Fluidinfo and are unaware that the application is using Fluidinfo. Programmers will use Fluidinfo through its API. So, what if you’re a non-programmer and not using an application and you just want to have a look around inside Fluidinfo? Pier-Andre Parent has written the Fluidinfo Explorer – a web-based “explorer” GUI. If you’re not a developer, this is probably your best way of starting to interact with Fluidinfo without having to get into all the nitty-gritty details of the API. It’s like the file-system explorer you find on Windows, Mac or Linux.

We’ve found the Explorer so useful that we’ve made it available via the explorer.fluidinfo.com name. The URL you visit in your browser is very important. The pattern is http://explorer.fluidinfo.com/INSTANCE/NAMESPACE where INSTANCE is either “fluidinfo” or “sandbox” and NAMESPACE is name of the user, organisation or application you’re interested in. For example, the following link will display my (ntoll’s) top-level namespace in the explorer: http://explorer.fluidinfo.com/fluidinfo/ntoll

The result will look something like the following:

Notice how the namespace / tag structure is displayed in a collapsable tree control on the left hand side. The main body of the user interface contains a helpful welcome message and at the top right hand side is a search box for queries written in Fluidinfo’s query language and the login button.

Right click a namespace or tag to view and update its attributes, to create and delete namespaces and tags, and to set permissions on them. Clicking on a tag in on the left hand side fills the main area of the UI with all the objects that it has tagged:

So far, so simple…

But what about exploring the tags on a specific object? Click on an objects object id in the result set to display a list of all the tags attached to it. Click “Load all tag values” to display the associated values. Notice how the explorer differentiates between primitive (numbers, booleans, strings etc..) and opaque (images, audio, binary files etc…) values – primitive values are displayed whereas the cells for opaque values contain a description of the type of value stored in Fluidinfo:

Click the “open” link next to each of the opaque tag value to trigger a pop-up with the opaque value presented therein. In this example the value is an image:

Finally, if you follow the “View visual representation” link a rather nice graphical representation of the object is presented to you:

These diagrams are automatically generated by yet another third party application created by Nicholas Radcliffe and hosted on Google’s AppEngine. Given an appropriate URL a rather cool image is generated, e.g., http://abouttag.appspot.com/id/butterfly/8c2860e1-0d3f-47aa-9064-8a682cea6154.

The great thing about the explorer is that it provides an intuitive and visual representation of how Fluidinfo is structured. Have fun exploring!

Comments (3)

Older Posts »