Archive for the ‘Data’ Category

Personalized filtering of friend requests in social networks

Wednesday, June 1st, 2011

In an O’Reilly Radar article titled Getting closer to the web2.0 address book in October 2010, I described how a set of applications might coordinate via shared storage to solve Tim O’Reilly’s question:

Given that knowledge about who I know is in my phone, in the O’Reilly org chart, and in the set of authors of O’Reilly books, why do I still have to manually approve friend requests from Facebook, LinkedIn, etc.?

In that article I argued that this aspect of what Tim calls the web 2.0 address book should be solved not by an application, or even by applications making API calls to one another, but by a set of applications that communicate asynchronously via shared writable storage. Besides shared storage, this requires a permissions system, a query language, and conventions used by the cooperating applications. No API calls are needed between applications: their interaction takes place entirely via shared storage, and API calls are only used to add, retrieve, and query data in that storage.

At Gluecon in Denver last week, I gave a talk titled The Real Promise of Cloud Storage in which I argued that shared storage gives two big advantages. The second was that it allows the evolution of “asynchronous data protocols”. The final slides of the presentation showed the series of steps that would be taken to build what Tim was asking for. I wasn’t able to go into the detail of how this would work in the time available. So in this post I’ll give the detail of how this can be implemented using Fluidinfo. The post is long because I want to spell out the steps, the assumptions, the Fluidinfo tag permissions, the queries, the security aspects, and so on.

The business case

I’m going to pretend LinkedIn is the company that would like its users to have more flexibility in approving friend requests. Additional flexibility could include personalized friend resolvers (for example based on the O’Reilly org chart) and also applications that could assist in various ways, and if desired even automate some responses (e.g., people you have called more than twice in the last month, or people you have put on a particular list on Twitter).

LinkedIn is a good candidate because they have traditionally placed less emphasis on being a burgeoning social network, and more on link quality, by requiring a higher level of proof to create more meaningful reciprocal connections. That seems to be changing slowly now, though. You used to need to know another user’s email to connect to them, but that has been relaxed over the last couple of years. It’s obvious why LinkedIn would do this: knowing connections between their members is valuable. LinkedIn will now show me suggested people I might know, and let me reach out to connect with them with a single click. No need to go look up email addresses, etc.

I suspect many people have valid unanswered friend connection requests sitting in their LinkedIn inboxes. I do, and I almost never make time to deal with them. Those requests, if accepted, would add value to LinkedIn. The (valid) connection requests that sit unanswered for whatever reason represent unrealized value. So it’s very much in LinkedIn’s interest to help people connect. They should be interested in a solution to Tim’s dilemma, as should Facebook and any social network whose friending mechanism is reciprocal.

Using the shared writable storage of Fluidinfo to solve this problem

Here is how this could be achieved with Fluidinfo.

1. LinkedIn announces that it will support personalized friend resolution via Fluidinfo. Users will have control over what kinds of friend resolution they want to enable. LinkedIn will use the linkedin.com user name in Fluidinfo. (Fluidinfo makes it possible to put domain names on data. Every Fluidinfo user name that corresponds to a domain is reserved for its internet domain owner.)

2. Suppose Tim, a LinkedIn user, decides to try the new functionality. LinkedIn directs him to a page where he can choose a Fluidinfo user name (let’s say "tim") and create a Fluidinfo tag called tim/friend-request/linkedin. Tim sets the permissions so that the Fluidinfo linkedin.com user can both add and delete the tag on Fluidinfo objects. He then tells LinkedIn his Fluidinfo user name. This is necessary because LinkedIn will later create data in Fluidinfo on Tim’s behalf. Note that LinkedIn only knows Tim’s Fluidinfo user name, not his password.

3. Suppose a user Alice now tells LinkedIn that she is a friend of Tim’s. LinkedIn creates a new object in Fluidinfo and puts an instance of tim/friend-request/linkedin onto it with a value (perhaps in JSON format) holding the information Alice is willing to supply as evidence that she knows Tim. This could include things like her Twitter user name, her phone number, the name of a company she used to work at with Tim, etc. LinkedIn also adds a random request identifier (more on which below).

The object in Fluidinfo looks like this (the tag value is in the rectangle under the tag name):


Note that the object does not have an owner. That’s a fundamental feature of Fluidinfo: objects don’t have owners. Instead, tags on objects have permissions. That’s crucial in what follows, because it allows any application to add more information to the object representing the friend request.
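As a rough sketch of what step 3 produces, the request object can be pictured as a mapping from tag names to values. The field names inside the JSON value (from, twitter, phone, request_id) and Alice's details are illustrative assumptions; all that is specified above is that the value holds Alice's evidence plus a random request identifier.

```python
import json

# Hypothetical evidence supplied by Alice; the field names are assumptions.
request_value = json.dumps({
    "from": "alice",
    "twitter": "alice_example",
    "phone": "+1-555-0100",
    "request_id": "9871261721498793",  # random identifier added by LinkedIn
}, sort_keys=True)

# A Fluidinfo object has no owner: it is just whatever tags users with the
# right permissions have attached to it. Here there is only LinkedIn's tag.
friend_request_object = {
    "tim/friend-request/linkedin": request_value,
}

# Any application allowed to read the tag can decode the evidence.
evidence = json.loads(friend_request_object["tim/friend-request/linkedin"])
print(evidence["request_id"])
```

Because nothing owns the object itself, other applications can later attach their own tags to the same dictionary-like object without touching LinkedIn's tag.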

A friend resolver based on Twitter friends: TwitResolve

4. Let’s suppose a keen-eyed developer notices the LinkedIn announcement and decides to write a Twitter-based friend resolver called TwitResolve. The developer grabs the internet domain twitresolve.com and then gets the twitresolve.com user name on Fluidinfo (as the domain owner, only they can do this, as above). TwitResolve will use two tags in Fluidinfo, twitresolve.com/accepted and twitresolve.com/examined. It sets the permissions on the tag twitresolve.com/accepted so that only the Fluidinfo user called linkedin.com can read it.

5. Let’s suppose Tim hears of TwitResolve and decides to use it. The TwitResolve sign-up page asks Tim to change the permissions on the Fluidinfo tag tim/friend-request/linkedin so that the Fluidinfo user twitresolve.com can read the tag. (The setting of the permission could be done in various ways, which we’ll ignore for now.) By giving TwitResolve permission to read the tag that LinkedIn puts onto objects, Tim enables TwitResolve’s participation in helping to resolve friend requests for him. TwitResolve also asks Tim for his Twitter username (and gets Tim to prove that).

6. TwitResolve runs a process periodically that asks Fluidinfo for the outstanding requests for Tim that it hasn’t already examined. The relevant query in Fluidinfo’s query language is has tim/friend-request/linkedin except has twitresolve.com/examined.

7. In our case, this query finds the object shown above and TwitResolve examines the value of the tim/friend-request/linkedin tag. It talks to Twitter to discover whether Tim is following Alice (it can do this because Alice’s Twitter user name is in the tag value, and Tim provided his Twitter user name to TwitResolve). Suppose Tim is not following Alice. TwitResolve cannot conclude anything, so it simply puts a twitresolve.com/examined tag (with no value) onto the request object in order to avoid re-examining this request. At this point the Fluidinfo object looks as below and the friend request is still unfulfilled.
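Steps 6 and 7 can be modelled in a few lines. This is only an in-memory stand-in for Fluidinfo (the object ids and tag values are made up), meant to show that "has A except has B" behaves like a set difference over the objects carrying each tag.

```python
# In-memory stand-in for Fluidinfo: object id -> {tag name: value}.
objects = {
    "obj-1": {"tim/friend-request/linkedin": "request from alice"},
    "obj-2": {"tim/friend-request/linkedin": "request from bob",
              "twitresolve.com/examined": None},
    "obj-3": {"oreilly.com/title": "Open Government"},
}

def has(tag):
    """Ids of all objects carrying the given tag (the `has` operator)."""
    return {oid for oid, tags in objects.items() if tag in tags}

def outstanding(request_tag, examined_tag):
    """Model of the query: has <request_tag> except has <examined_tag>."""
    return has(request_tag) - has(examined_tag)

todo = outstanding("tim/friend-request/linkedin", "twitresolve.com/examined")
print(sorted(todo))

# After looking at a request without reaching a conclusion, the resolver
# attaches its valueless examined tag so that the next run skips the object.
for oid in todo:
    objects[oid]["twitresolve.com/examined"] = None
```

Running the same query a second time returns nothing, which is exactly what lets the resolver run periodically without repeating work.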


A friend resolver based on the O’Reilly org chart

8. News gets out quickly to the digerati and someone inside O’Reilly decides to write a resolver based on the O’Reilly org chart. They create two tags, similar to those created by TwitResolve, called oreilly.com/accepted and oreilly.com/examined, again setting the permissions on the former so that only the Fluidinfo user called linkedin.com can read that tag. Tim decides to use the new resolver, so he gives the oreilly.com user permission to read the tim/friend-request/linkedin tag.

9. The O’Reilly resolver might be started by cron each night. When it runs, it sends a Fluidinfo query, just like TwitResolve does. Suppose Alice does not work at O’Reilly. The details in the request cannot confirm her as a friend of Tim’s. Just as with TwitResolve, the O’Reilly program puts an oreilly.com/examined tag onto the object, and we arrive at the following:


A friend resolver based on Amazon books

10. To give a third example (to fulfill Tim’s requirements), another resolver could be written based on authors of books. Suppose it’s written by someone at Amazon. (The details of how this might successfully match Alice to Tim aren’t important here.) Suppose it is also unsuccessful in matching this friend request. It adds an amazon.com/examined tag to the object:


An iPhone friend resolver: iPhoneFriender

11. Next, someone decides to write a resolver that can match based on phone numbers, called iPhoneFriender. They get the iphonefriender.com domain name and user name in Fluidinfo. They will use the tags (iphonefriender.com/accepted and iphonefriender.com/examined), and set the permissions on the iphonefriender.com/accepted tag so that only the linkedin.com Fluidinfo user can read it.

When the iPhoneFriender application runs, it sends the Fluidinfo query has tim/friend-request/linkedin except has iphonefriender.com/examined and finds our object. It reads the request details. Suppose it finds Alice’s phone number in Tim’s phone address book, and can see whether Tim has called Alice or vice-versa, whether Tim has chosen not to accept her calls, etc. iPhoneFriender has preference settings to allow Tim to specify what kinds of requests it will ask him to confirm, and which to automatically accept. Suppose that one way or another the friend request is accepted.

12. The iPhoneFriender application needs to tell LinkedIn that Alice is recognized as a friend of Tim’s. So it puts an iphonefriender.com/accepted tag with value 9871261721498793 onto the original object. This is the unique request_id value from the original request. Remember that the iphonefriender.com/accepted tag has its permissions set so that only linkedin.com can read it. Hiding the identifier in this way prevents rogue applications from falsely claiming to have satisfied friend requests. Only an application to which Tim has given permission to read the tim/friend-request/linkedin tag could know the request identifier. Only LinkedIn can read the identifier value out of the iphonefriender.com/accepted tag attached by iPhoneFriender. iPhoneFriender also adds an iphonefriender.com/examined tag to the object to avoid repeating work.

The Fluidinfo object now looks as follows (tag values, when present, are shown in rectangles under the tag names):


Friendship, requited!

13. Later, LinkedIn sends Fluidinfo the query has tim/friend-request/linkedin. The search matches our object and LinkedIn then gets a list of its tags. For tags named */accepted (where * matches any user name) it tries to read the value of the tag. If it finds a tag whose value matches the identifier in the request tag on the object, LinkedIn adds the friendship link inside its own site. It also deletes the tim/friend-request/linkedin tag from the object, resulting in:


At this point we’re basically done. There are many possible variations on the above. For example, using timestamps to retry friend resolution, using timestamps to only examine recent requests, using separate tags to hold pieces of friend request information, using an extra linkedin.com tag to reduce the number of queries it must make to find outstanding requests, etc. Applications are not forced to use an examined tag to avoid repeating work; if they do they can name it whatever they please. It’s also easy to imagine more exotic resolvers, e.g., Tim giving people a secret random number and looking for that in the request (LinkedIn would have to allow Alice to add it to the request, obviously), etc. Participating applications could also clean up by finding objects with their examined tag but that no longer have a tim/friend-request/linkedin tag, and removing their examined and accepted tags (if present).
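The check LinkedIn performs in step 13 might look something like the following. It is a sketch over a plain dictionary rather than real Fluidinfo calls, and it assumes the request tag value has already been decoded into a dict with a "request_id" key; the rogue.example tag is invented to show why a guessed value is ignored.

```python
def matching_acceptances(request_object, request_tag):
    """Return the */accepted tag names whose value equals the request id."""
    request_id = request_object[request_tag]["request_id"]
    return sorted(
        tag for tag, value in request_object.items()
        if tag.endswith("/accepted") and value == request_id
    )

obj = {
    "tim/friend-request/linkedin": {"from": "alice",
                                    "request_id": "9871261721498793"},
    "twitresolve.com/examined": None,
    "iphonefriender.com/examined": None,
    "iphonefriender.com/accepted": "9871261721498793",
    # An application that cannot read the request tag can only guess the id.
    "rogue.example/accepted": "1234",
}

matches = matching_acceptances(obj, "tim/friend-request/linkedin")
if matches:
    # LinkedIn would record the friendship on its own site here, then
    # remove the request tag so the object no longer matches its query.
    del obj["tim/friend-request/linkedin"]
print(matches)
```

The examined and accepted tags left behind are the ones a participating application could later clean up, as mentioned above.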

Why this is nice

The above dance is nice for several reasons:

  • This is an open, convention-based, extensible, and validated application ecosystem. LinkedIn just writes data to Fluidinfo and periodically checks for resolution. In effect it is giving Tim the power to use any application he wants to resolve the request in any way he wants. Participating applications just follow the established tag naming convention. LinkedIn knows that if any application attaches a tag of the form */accepted with value 9871261721498793 to the object, it must represent a validated acceptance by Tim.
  • As a result, anyone can play. An idiosyncratic resolver such as that based on the O’Reilly org chart is as legitimate a contributor as any other. None of the resolvers needed to ask for permission (from LinkedIn) to participate, or needed to be anticipated by LinkedIn.
  • There are no API calls between the applications involved. This is significant because APIs have to be designed in advance and you need permission to use them. In our scenario, all communication between applications is done via a data protocol: adding to and retrieving from shared storage.
  • LinkedIn is free to ignore the */accepted tags from any application if it chooses.
  • Tim can withdraw permission for a resolver to work on his behalf, simply by taking away that application’s permission to read tim/friend-request/linkedin tags.
  • Tim can stop LinkedIn from creating friend resolution tags by removing the linkedin.com user’s permission to add the tim/friend-request/linkedin tag to objects. That would be somewhat extreme, seeing as LinkedIn is likely to offer its users a simpler way for people to turn off the feature, but it’s worth pointing out that Tim has control.
  • To those who don’t have permission to read tim/friend-request/linkedin tags, there is no way to see who has made the friend request. (The fact that Tim is the target could also be easily obscured, if wanted.)
  • All communication is convention-based and asynchronous. This resembles the way we (and other organisms) often communicate in natural systems. I suspect most communication of information between living organisms is asynchronous, though I have no way to quantify this. Asynchronous communication via conventions in shared storage (e.g., those seen in Twitter with hashtags and @addressing) is so powerful because it is open-ended and evolutionary. Fit conventions (in the biological sense) will flourish. Conventions can be extended by any player, without harm. I wrote more on this in Dancing out of time: Thoughts on asynchronous communication.

Note that the above is just an example of how applications can communicate indirectly and asynchronously through shared storage using evolving conventions instead of using direct, synchronous, predefined API calls between one another. We have seen a solution to a difficult address book problem that has not involved writing an address book application. Instead, the problem is solved by a set of lightweight and loosely coupled cooperating applications communicating through data. I have (very slowly!) come to realize that this form of inter-application communication is an important part of what Fluidinfo makes possible. This is all enabled by the simple move to shared writable storage, coupled with a flexible permissions model and a query language.

Thanks for reading. I really hope you’ll find this as interesting as I do. Thanks to Nicholas Radcliffe, Tim O’Reilly, and Bar Shirtcliff for comments that greatly improved the above.

How we built the O’Reilly API using Fluidinfo

Tuesday, March 22nd, 2011

In case you haven’t noticed, we’ve imported the O’Reilly catalogue into Fluidinfo thus giving them an instantly writable API for their data.

How did we do it..?

There were three basic steps:

  1. Get the raw data.
  2. Clean the raw data.
  3. Import the cleaned data.

That’s it!

I’ll take each step in detail…

Get the raw data

Since we didn’t have an existing raw dump of the data nor access to O’Reilly’s database we had to think of some other way to get the catalogue. We found that O’Reilly had two different existing data services we could use: OPMI (O’Reilly Product Metadata Interface) and an affiliate’s API within Safari.

Unfortunately the RDF returned from OPMI is complicated. We’d either have to become experts in RDF or learn how to use a specialist library to get at the data we were interested in. We didn’t have time to pursue either of these avenues. The other alternative, the Safari service, just didn’t work as advertised. :-(

Then we remembered learning about @frabcus and @amcguire62’s ScraperWiki project.

Put simply, ScraperWiki allows you to write scripts that scrape (extract) information from websites and store the results for retrieval later. The “wiki” aspect of the ScraperWiki name comes from its collaborative development environment where users can share their scripts and the resulting raw data.

In any case, a couple of hours later I had the beginnings of a batched up script for scraping information from the O’Reilly catalogue on the oreilly.com website. After some tests and refactoring ScraperWiki started to do its stuff. The result was a data dump in the easy to understand and manipulate CSV or JSON formats. ScraperWiki saves the day!

Clean the raw data

This involved massaging the raw data into a meaningful structure that corresponded to the namespaces, tags and tag-values we were going to use in Fluidinfo. We also extracted some useful information from the raw data. For example, we made sure the publication date of each work was also stored in a machine-readable value. Finally, we checked that all the authors and books matched up.

Most of this work was done by a single Python script. It loaded the raw data (in JSON format), cleaned it and saved the cleaned data as another JSON file. This meant that we could re-clean the raw data any number of times when we got things wrong or needed to change anything. Since this was all done in-memory it was also very fast.

The file containing the cleaned data was simply a list of JSON objects that mapped to objects in Fluidinfo. The attributes of each JSON object corresponded to the tags and associated values to be imported.
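To give a flavour of the cleaning step, here is a minimal sketch rather than the script we actually ran. The raw field names (title, isbn, published), the date format and the sample record are invented for illustration; the output keys follow the oreilly.com tag names used in Fluidinfo.

```python
import json
from datetime import datetime

def clean_record(raw):
    """Turn one scraped record into a dict of Fluidinfo tag paths and values."""
    # Assumes the scraped date looks like "April 1, 2010" (an assumption).
    published = datetime.strptime(raw["published"].strip(), "%B %d, %Y")
    return {
        "oreilly.com/title": raw["title"].strip(),
        "oreilly.com/isbn": raw["isbn"].replace("-", ""),
        "oreilly.com/publication-date": published.strftime("%Y-%m-%d"),
        "oreilly.com/publication-year": published.year,
        "oreilly.com/publication-month": published.month,
        "oreilly.com/publication-day": published.day,
    }

# A made-up raw record standing in for the scraped JSON dump.
raw_records = json.loads("""[
  {"title": "  Example Book ",
   "isbn": "978-0-00-000000-2",
   "published": "April 1, 2010"}
]""")

cleaned = [clean_record(record) for record in raw_records]
print(json.dumps(cleaned, indent=2))
```

Keeping the raw dump untouched and writing the cleaned list to a separate file is what makes it cheap to re-run the cleaning whenever the rules change.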

Import the cleaned data

This stage took place in two parts:

  1. Create the required namespaces and tags
  2. Import the data by annotating objects

Given the cleaned data we were able to create the required namespaces and tags. You can see the resulting tree-like structure in the Fluidinfo explorer (on the left hand side).

Next, we simply iterated over the list of JSON objects and pushed them into Fluidinfo. (It’s important to note that network latency means that importing data can seem to take a while. We’re well aware of this and will be blogging about best practices at a later date.)

That’s it!

We used Ali Afshar’s excellent FOM (Fluid Object Mapper) library both for creating the namespaces and tags and for importing the JSON objects into Fluidinfo, along with elements of flimp (the FLuid IMPorter) for pushing the JSON into FOM.
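The shape of the import loop can be sketched without FOM or flimp. In the sketch below, put_tag is a hypothetical caller-supplied function that performs the actual write (it is not part of either library), and the sample object is invented; the point is only that each tag on each cleaned JSON object becomes one write, which is why network latency adds up.

```python
def tag_operations(cleaned_objects, about_key="fluiddb/about"):
    """Yield (about, tag_path, value) triples, one per tag to be written."""
    for obj in cleaned_objects:
        about = obj[about_key]
        for tag_path, value in sorted(obj.items()):
            if tag_path != about_key:
                yield about, tag_path, value

def import_all(cleaned_objects, put_tag):
    """Push every tag value through put_tag and return the number of writes."""
    count = 0
    for about, tag_path, value in tag_operations(cleaned_objects):
        put_tag(about, tag_path, value)
        count += 1
    return count

# Dry run: record the writes in a list instead of talking to the network.
written = []
total = import_all(
    [{"fluiddb/about": "book:example book (a. n. author)",
      "oreilly.com/title": "Example Book",
      "oreilly.com/pages": 123}],
    lambda about, tag_path, value: written.append((about, tag_path, value)),
)
print(total, written)
```

Swapping the recording lambda for a function that calls a real client library would turn the dry run into an import.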

What have we learned..? The most time consuming part of the exercise was scraping the data. The next most time consuming aspect was agreeing how to organise it. The actual import of the data didn’t take long at all.

Given access to the raw data and a well thought out schema we could have done this in an afternoon.

The structure of O’Reilly book and author data in Fluidinfo

Monday, March 21st, 2011

This short post explains how the O’Reilly catalog is represented in Fluidinfo.

Put simply, we annotate two types of object: those representing products (usually books) and those representing authors. We annotate them using namespaces and tags within the oreilly.com top level namespace so you can be sure that this is bona fide O’Reilly information.

Within the oreilly.com namespace we store a bunch of “top level” tags that describe a product in the O’Reilly catalogue (title, summary, URL and so on). The oreilly.com namespace has two child namespaces: “authors” and “media”. (If you want a visual representation of this structure head on over to the Fluidinfo explorer and explore, starting from the tree menu on the left hand side.)

The authors namespace contains tags that define information about an author (name, biography, homepage and so on) and also contains a child namespace called “expertise”. The expertise namespace contains a set of tags that map to the list of areas of expertise that O’Reilly uses to categorise their authors. So, for example, an object representing the O’Reilly author “Chris DiBona” looks like this:

Notice how Chris’s object has tags under the oreilly.com/authors namespace including several under the oreilly.com/authors/expertise namespace. Importantly, the object also has tags that were not provided by the O’Reilly data. Terry has added a tag terrycojones/met to indicate (rather obviously) that he’s met Chris and the fluiddb/about tag is used to indicate that the object is about the author called Chris DiBona.

What about the objects that represent books..? What do they look like..? Well let’s consider a current favourite of mine: “XMPP: The Definitive Guide”. Here’s how Nick Radcliffe’s excellent abouttag utility displays the object representing this book:

Whoa! Lots more tags! Many of them are from the oreilly.com domain (although notice how there are 15 missing). Once again it’s possible to see who/what else has been tagging the object. I’ve added a review and rating (ntoll/review and ntoll/rating) and various other people have annotated useful information that wasn’t at first in the dataset provided by O’Reilly.

How are authors and books linked..?

Every author object has an oreilly.com/authors/works tag that contains a list of the 13 digit O’Reilly ID / ISBN for each work they were involved in. Every book object has a corresponding oreilly.com/id and oreilly.com/isbn tag.

Alternatively, every book object has an oreilly.com/author-urls tag that contains a list of its authors’ homepages on the O’Reilly website and every author object has an associated oreilly.com/url containing the same information.
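The ISBN-based link can be followed in either direction. The sketch below uses invented authors, titles and ISBNs held in plain Python dictionaries (no Fluidinfo calls), with keys named after the tags described above.

```python
authors = [
    {"oreilly.com/authors/name": "A. N. Author",
     "oreilly.com/authors/works": ["9780000000002", "9780000000019"]},
    {"oreilly.com/authors/name": "Second Writer",
     "oreilly.com/authors/works": ["9780000000019"]},
]
books = [
    {"oreilly.com/isbn": "9780000000002", "oreilly.com/title": "Example Book"},
    {"oreilly.com/isbn": "9780000000019", "oreilly.com/title": "Another Example"},
]

def authors_of(isbn):
    """Names of the authors whose works list contains the given ISBN."""
    return [a["oreilly.com/authors/name"] for a in authors
            if isbn in a["oreilly.com/authors/works"]]

def books_by(name):
    """Titles of the books whose ISBN appears in the named author's works."""
    works = {isbn
             for a in authors if a["oreilly.com/authors/name"] == name
             for isbn in a["oreilly.com/authors/works"]}
    return [b["oreilly.com/title"] for b in books
            if b["oreilly.com/isbn"] in works]

print(authors_of("9780000000019"))
print(books_by("Second Writer"))
```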

Finally, for the sake of completeness here’s a list of all the book and author tags along with a description of what each one represents:

Book tags

  • publication-day – The day of the month upon which the item was published.
  • publication-month – The number of the month within which the item was published.
  • duration – The duration of this item in minutes.
  • subtitle – The subtitle associated with the item.
  • id – The unique ID used by O’Reilly to identify the item, usually the 13-digit ISBN number (as a string).
  • page-count-is-estimate – A flag to indicate that any associated page count value is only an estimate.
  • cover-medium – The URL for a medium size image of the cover at the oreilly.com domain.
  • toc – The table of contents as text/html.
  • homepage – A URL to the item’s homepage on the O’Reilly website.
  • description – A long description of the item as text/html.
  • cover-small – The URL for a small size image of the cover at the oreilly.com domain.
  • author-urns – A list of unique reference numbers used by O’Reilly to reference the authors of the item.
  • cover-large – The URL for a large size image of the cover at the oreilly.com domain.
  • isbn – The 13-digit ISBN number (as a string).
  • safari-url – A URL to the item’s page on O’Reilly’s Safari service.
  • author-urls – A list of URLs pointing to the authors’ homepages on the O’Reilly website.
  • pages – The number of pages this item has.
  • publisher – The name of the publisher of the item.
  • price-us – The advertised US price in cents.
  • title – The title of the item.
  • author-names – A list of author names.
  • summary – A short summary of the item as text/html.
  • publication-date – The publication date as YYYY-MM-DD.
  • price-uk – The advertised UK price in pence.
  • media – A list of the type[s] of media in which the item is available. Can be one or more of: ‘up-to-date’, ‘rough cut’, ‘dvd’, ‘ebook’, ‘kit’, ‘video’, ‘print’, ‘early release ebook’, ‘safari books online’ or ‘merchandise’.

Author tags

  • name – The author’s full name.
  • url – A URL to the author’s homepage on the O’Reilly website.
  • photo – A path to an image file containing a photo of the author hosted at the oreilly.com domain.
  • twitter – The author’s Twitter username.
  • works – A list of the ids of items that the author has created.
  • expertise – A list of the expertise tags associated with the author.
  • biography – The author’s biography as text/html.

Examples of Fluidinfo O’Reilly API queries

Monday, March 21st, 2011

This post is all about querying the O’Reilly book and author information recently imported into Fluidinfo. If you want the skinny on Fluidinfo’s query language in glorious in-depth techno-geek-speak then check out the documentation. If you’d rather see some real world examples, read on…

In Fluidinfo, objects represent things (and all objects have a unique id). Information is added to objects using tags. Tags can have values, and tag names are organized into namespaces that give them context. Permissions control who can see and use namespaces and tags.

Objects do not belong to anyone and don’t have permissions associated with them. They’re openly writable. Anyone can tag anything to any object. Many objects have a special globally unique “about” tag value that indicates what they are about. Interaction with Fluidinfo is via a REST API.

That’s Fluidinfo in a nutshell.

In another article published today I describe the Fluidinfo tags and namespaces used to annotate objects with O’Reilly data. The tags are attached to objects for O’Reilly books and authors. Both kinds of objects have about tags. So a trivial first kind of query is to go directly to an object that’s about a book. For example, to get information about the object representing the book “Open Government” visit the URL http://fluiddb.fluidinfo.com/about/book:open government (daniel lathrop; laurel ruma).

You’ll get back a JSON response containing a list of all the tags (that you have permission to read) attached to that object and the object’s globally unique id. Similarly, you can go directly to the object for an O’Reilly author http://fluiddb.fluidinfo.com/about/author:tim oreilly.

In case you’re wondering about the format of these book and author about tags, we used the abouttag library written by Nicholas Radcliffe to generate them. They’re designed to be readable, easy to generate programmatically, and unlikely to result in collisions. You don’t have to remember them though, as there are many other ways to get at objects, via querying, as we’re about to see.

Queries on tags and their values

Below are some examples of using Fluidinfo’s query language.

Presence

Return all the objects that have an O’Reilly title:

has oreilly.com/title

You can see the results at the following URL: http://fluiddb.fluidinfo.com/objects?query=has oreilly.com/title. Once again, the result is in JSON. It simply contains a list of the ids of matching objects (representing things that O’Reilly have tagged with a title).

That’s the equivalent of the following SQL statement:

SELECT id FROM oreilly.com WHERE title IS NOT NULL;

Caveat: There are no tables in Fluidinfo so it’s impossible to make a direct translation to SQL. This example and those that follow simply illustrate a conceptual equivalence to make it easier for those of you familiar with SQL to get your heads around the Fluidinfo query language.

Comparison

Return all the O’Reilly objects whose price is less than $40 (the price is stored in cents).

oreilly.com/price-us < 4000

Here it is as a URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/price-us < 4000

In SQL it would be:

SELECT id FROM oreilly.com WHERE price-us < 4000;

Text Matching

Return all the O’Reilly objects that have “Python” in the title.

oreilly.com/title matches "Python"

The resulting URL: https://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python"

In SQL:

SELECT id FROM oreilly.com WHERE title LIKE '%Python%';

Set Contents

Return all the O’Reilly objects representing authors who were involved in writing the work with ISBN “9781565923607” (which is the unique ID O’Reilly use in their catalog). The value of oreilly.com/authors/works tags is always a set of unique ISBN numbers like this: ["9781565923607", "9781565563728", "9781627397284"].

oreilly.com/authors/works contains "9781565923607"

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/authors/works contains "9781565923607"

In SQL:

SELECT id FROM oreilly.com/authors WHERE '9781565923607' IN (SELECT works FROM oreilly.com/authors);

(Actually, the similar “IN” operation in SQL isn’t a very good example since it results in verbose monstrosities like the above.)

Exclusion

Return all the O’Reilly books that were published in 2010 except those published in April.

oreilly.com/publication-year=2010 except oreilly.com/publication-month=4

The resulting URL: https://fluiddb.fluidinfo.com/objects?query=oreilly.com/publication-year=2010 except oreilly.com/publication-month=4

In SQL:

SELECT id FROM oreilly.com WHERE year=2010 AND month<>4;

Logic

It’s possible to use the and and or logical operations. For example, return all the O’Reilly books whose title matches “Python” and were published before 2005:

oreilly.com/title matches "Python" and oreilly.com/publication-year < 2005

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python" and oreilly.com/publication-year < 2005

In SQL:

SELECT id FROM oreilly.com WHERE title LIKE '%Python%' AND year < 2005;

Grouping

Return all the objects representing O’Reilly books mentioning “Python” in their title that were published in either 2008 or 2010.

oreilly.com/title matches "Python" and (oreilly.com/publication-year=2008 or oreilly.com/publication-year=2010)

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=oreilly.com/title matches "Python" and (oreilly.com/publication-year=2008 or oreilly.com/publication-year=2010)

In SQL:

SELECT id FROM oreilly.com WHERE title LIKE '%Python%' AND (year = 2008 OR year = 2010);

Querying across different data sets

Fluidinfo can query seamlessly across tags from different sources that are stored on the same object. E.g., return the titles of all O’Reilly books that Terry Jones owns.

has oreilly.com/title and has terrycojones/owns

The resulting URL: http://fluiddb.fluidinfo.com/objects?query=has oreilly.com/title and has terrycojones/owns

In SQL:

Well, it’s actually not clear how you’d do this in SQL. Presumably there’d need to be some kind of table join, supposing that were possible!

Getting back tags on objects matching a query

It’s also possible to indicate which tag values to return for each matching object. This is done by using the Fluidinfo /values HTTP endpoint and naming each wanted tag in a tag argument in the URL’s query string. For example, if I wanted the title, author names and publication year of all the O’Reilly books with the word “Python” in the title published before 2006 then I’d use the following query:

oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006

and append the wanted tags to the URL after the query (in any order):

&tag=oreilly.com/title&tag=oreilly.com/author-names&tag=oreilly.com/publication-year

The resulting URL: http://fluiddb.fluidinfo.com/values?query=oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006&tag=oreilly.com/title&tag=oreilly.com/author-names&tag=oreilly.com/publication-year

This is similar to the following SQL:

SELECT title, authors, year FROM oreilly.com WHERE title LIKE '%Python%' AND year < 2006;

Fluidinfo returns a JSON object like this (shown here as it looks once parsed into a Python dictionary):

{u'results': {u'id': {u'1a91e021-7bce-4693-bfa5-0dc437fe1817':
    {u'oreilly.com/author-names': {u'value': [u'Anna Ravenscroft', u'David Ascher', u'Alex Martelli']},
     u'oreilly.com/publication-year': {u'value': 2005},
     u'oreilly.com/title': {u'value': u'Python Cookbook, Second Edition'}},
u'1d25baae-b977-4ff4-bb77-01c52bd1d339':
    {u'oreilly.com/author-names': {u'value': [u'Fredrik Lundh']},
     u'oreilly.com/publication-year': {u'value': 2001},
     u'oreilly.com/title': {u'value': u'Python Standard Library'}},
u'3360f05f-9bf4-4da5-abc0-0e3742809b98':
    {u'oreilly.com/author-names': {u'value': [u'Fred L. Drake Jr', u'Christopher A. Jones']},
     u'oreilly.com/publication-year': {u'value': 2001},
     u'oreilly.com/title': {u'value': u'Python & XML'}},
u'9845b184-ef1b-46fb-8e7c-011da053dcb6':
    {u'oreilly.com/author-names': {u'value': [u'Andy Robinson', u'Mark Hammond']},
     u'oreilly.com/publication-year': {u'value': 2000},
     u'oreilly.com/title': {u'value': u'Python Programming On Win32'}}}}}
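
Pulling fields out of that structure means walking from results to id, then to each object id, then to each tag path and its value. Here is a minimal sketch in Python 3 that uses a trimmed copy of the response above as literal data, so it needs no network access. The summarize helper is a hypothetical name, not part of any Fluidinfo library.

```python
# A trimmed copy of the response shown above (two of the four objects).
response = {
    "results": {
        "id": {
            "1a91e021-7bce-4693-bfa5-0dc437fe1817": {
                "oreilly.com/author-names": {
                    "value": ["Anna Ravenscroft", "David Ascher", "Alex Martelli"]
                },
                "oreilly.com/publication-year": {"value": 2005},
                "oreilly.com/title": {"value": "Python Cookbook, Second Edition"},
            },
            "1d25baae-b977-4ff4-bb77-01c52bd1d339": {
                "oreilly.com/author-names": {"value": ["Fredrik Lundh"]},
                "oreilly.com/publication-year": {"value": 2001},
                "oreilly.com/title": {"value": "Python Standard Library"},
            },
        }
    }
}


def summarize(response):
    """Return (year, title, authors) tuples, oldest book first."""
    rows = []
    for tags in response["results"]["id"].values():
        rows.append(
            (
                tags["oreilly.com/publication-year"]["value"],
                tags["oreilly.com/title"]["value"],
                ", ".join(tags["oreilly.com/author-names"]["value"]),
            )
        )
    return sorted(rows)


rows = summarize(response)
for year, title, authors in rows:
    print(year, title, "by", authors)
```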

It’s also possible to update and delete tag values from matching objects. This process is explained in detail in the Fluidinfo documentation and this blog post.

Finally, rather than interacting with Fluidinfo directly using the raw HTTP API it’s a good idea to use one of the client libraries listed here. For example, using the fluidinfo.py library the last example query can be executed as follows:

>>> import fluidinfo
>>> import pprint
>>> headers, result = fluidinfo.call('GET', '/values', tags=['oreilly.com/title', 'oreilly.com/author-names', 'oreilly.com/publication-year'], query='oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006')
>>> pprint.pprint(headers)
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '937',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=oreilly.com%2Ftitle+matches+%22Python%22+and+oreilly.com%2Fpublication-year+%3C+2006&tag=oreilly.com%2Ftitle&tag=oreilly.com%2Fauthor-names&tag=oreilly.com%2Fpublication-year',
 'content-type': 'application/json',
 'date': 'Thu, 10 Mar 2011 15:17:58 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> pprint.pprint(result)
{u'results': {u'id': {u'1a91e021-7bce-4693-bfa5-0dc437fe1817': {u'oreilly.com/author-names': {u'value': [u'Anna Ravenscroft',
... etc ...
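
The content-location header in the output above shows what the request URL looks like once the query and tag arguments are percent-encoded. As a sketch of how that encoding can be produced with only the Python 3 standard library (this is not necessarily how fluidinfo.py does it internally, and values_url is a hypothetical helper name):

```python
from urllib.parse import urlencode

# Base URL as it appears in the content-location header above.
BASE_URL = "https://fluiddb.fluidinfo.com"


def values_url(query, tags):
    """Build a /values URL with the query and each wanted tag percent-encoded."""
    params = [("query", query)] + [("tag", tag) for tag in tags]
    return BASE_URL + "/values?" + urlencode(params)


url = values_url(
    'oreilly.com/title matches "Python" and oreilly.com/publication-year < 2006',
    ["oreilly.com/title", "oreilly.com/author-names", "oreilly.com/publication-year"],
)
print(url)
```

Passing the parameters as a list of pairs, rather than a dict, is what allows the tag argument to be repeated once per wanted tag.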
 

Learn more

Hopefully, this has explained enough to get you started. If you don’t have a Fluidinfo account, you can sign up here. If you have any questions, please don’t hesitate to get involved with the Fluidinfo community, contact us directly or join us on IRC. We’ll be more than happy to help!

Putting domain names onto data with Fluidinfo

Wednesday, February 23rd, 2011

Internet domain names can be thought of as a mechanism for attaching trust and reputation to digital information. We do this in two major ways: (1) by using domain names in the URLs of web pages, and (2) by putting them in the sender’s “From” address of email messages.

To give a concrete example, suppose you see some shoes for sale on a web page. If you look at the page URL and see the amazon.com or zappos.com domain name, trust and reputation knowledge springs instantly to mind. You know the quality is probably good, the price competitive, and that if the shoes are lost in shipping you’ll be sent another pair for no charge. On the other hand, if you see ebay.com in the URL, a different matrix of trust and reputation knowledge will spring to mind. A similar thing happens if you get email from someone you’ve never met. If you see stanford.edu or forbes.com in the email “From” line, reputation information springs to mind.

Looked at in this way, domain names are small tokens that we send alongside other pieces of content such as web pages and emails. The domain name carries vital trust and reputation information. Recognition and trust in domain names is globally distributed, spread variously through the brains of most of the people on the planet, with its integrity guaranteed by DNS. Domain names make the internet useful. Without them, digital information online would be almost useless as we could not confidently trust any third-party data.

Question: given that we can attach domain names to web pages and email messages, can we find a way to attach them to other things?

Domain names on data

We’re excited to announce that Fluidinfo now makes it possible to put domain names onto individual pieces of data.

To illustrate, the image on the right shows a fanciful example book object in Fluidinfo. The tag names on the object are colored. You’ll see that some of them contain domain names: amazon.com/price, barnesandnoble.com/price and vintage.com/epub. Tags in Fluidinfo can have values, as illustrated by the amazon.com/price tag whose value for this book is $19.

The combination of a Fluidinfo tag name containing a domain and an associated tag value is exactly like a URL containing a domain name and an associated HTML value (i.e., a web page) or an email message with a domain name in its From line.

Because Fluidinfo objects don’t have owners (their tags do, though), any number of domain owners are free to put their information, branded with their domain name, onto any Fluidinfo object.
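
As a rough illustration of how an application might read the branding off such an object, the sketch below (Python 3) groups tag paths by their top-level namespace, which is the part before the first slash. Only the amazon.com price of $19 comes from the example above; the other values are placeholders, and tags_by_owner is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical tag values for the example book; only the amazon.com price
# of $19 is taken from the text above, the rest are placeholders.
book_tags = {
    "amazon.com/price": "$19",
    "barnesandnoble.com/price": "<placeholder price>",
    "vintage.com/epub": "<placeholder epub data>",
}


def tags_by_owner(tags):
    """Group tag names by top-level namespace (the part before the first '/')."""
    grouped = defaultdict(dict)
    for path, value in tags.items():
        owner, _, name = path.partition("/")
        grouped[owner][name] = value
    return dict(grouped)


grouped = tags_by_owner(book_tags)
print(sorted(grouped))
```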

A killer combination: writable APIs with domain-branded data

Fluidinfo automatically provides a writable API for all its data. By allowing for domain names on data, domain holders who want to publish information about their products can now do so with an API that has three major advantages:

  • Your data is branded with your domain name.
  • Your data lives in a writable ecology of related data, collecting on the same Fluidinfo objects. This allows for search across data from different users and domains, put there by different applications. It allows for additional data of all kinds, for mashups, and for customization, personalization, and filtering.
  • Fluidinfo has a flexible permissions system at the level of its tags, so you maintain full control of your own data. You can make it public or private, or can allow or disallow access for specific others.

Because Fluidinfo objects are fine grained, composed simply of tags with values as in the image above, applications can fetch, search on, or combine specific pieces (or combinations) of data provided by different trusted sources with single requests. There is a general principle here: information becomes more useful and valuable when it is stored in context. This is illustrated vividly by Google, which collects web pages into one place to enable search, and by Wikipedia, which allows people to pool related information. Although these examples have very different models of trust and reputation, they both illustrate the underlying principle.

Getting your domain name in Fluidinfo

To start using your domain in Fluidinfo, first sign up, using your domain name as your user name. Our sign-up system will recognize that the username is a domain and will send you an email telling you how to prove that you control the domain. Once that’s done, you can begin using Fluidinfo to upload information branded with your domain and to provide an API for others (or for your own company) to find your products or otherwise use the information you make available.

In other words, all Fluidinfo usernames that correspond to actual internet domains are automatically reserved for their owners. Besides preventing a chaotic land grab, this is how we can guarantee to people seeing information in Fluidinfo that the value of a Fluidinfo tag whose name includes a domain name can be trusted exactly as it would be if that domain appeared in a web page URL or email From address.

So there you have it… domain names on data. We’re very excited to see where this will lead and we’re actively building out some writable APIs with domain-branded data. You can too. Claim your domain name in Fluidinfo right now.

How to make an API in Fluidinfo

Wednesday, February 23rd, 2011

It’s very simple really:

1. Register a domain/user on Fluidinfo

Start here. If you’re registering a domain name then we will require proof of ownership (the instructions explaining how to do this are very simple).

2. Create your namespaces and tags

Be careful, you’re choosing how your data will be structured in Fluidinfo. Some tips we’ve found useful:

  • Flat is good.
  • Use namespaces to differentiate between the different sorts of things you’ll be tagging (e.g. between books and authors).
  • Copy conventions (how do others organise their data?).
  • KISS! Keep it simple (stupid!).

3. Import your data into Fluidinfo

To help you, we have a Python-based script/library called FLIMP. Alternatively, there are lots of freely available client libraries that you may want to adapt yourself.
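
Whatever tool you use, the import boils down to turning each record into a set of tag values on the object about that record. The sketch below (Python 3) shows that shape only; it does not use FLIMP, whose interface is not covered here. The example.com/books namespace, the column names, the sample rows and the rows_to_tag_values helper are all hypothetical.

```python
import csv
import io

# Hypothetical input: one row per book, with a column to use as the "about" value.
CSV_DATA = """isbn,title,year
isbn:0000000001,An Example Book,2009
isbn:0000000002,Another Example,2010
"""


def rows_to_tag_values(csv_text, namespace, about_column):
    """Yield (about, tag_path, value) triples, one per non-about cell."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        about = row[about_column]
        for column, value in row.items():
            if column != about_column:
                yield about, "%s/%s" % (namespace, column), value


triples = list(rows_to_tag_values(CSV_DATA, "example.com/books", "isbn"))
for triple in triples:
    print(triple)
```

Note that every value comes out of the CSV reader as a string. A real import would presumably convert fields such as the year to integers first, so that numeric comparisons in queries behave as expected.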

4. Announce your new API

Programmers will interact with your data via the general Fluidinfo API, which is simple and well documented. All you need to do is tell the world that your data is available, and what namespaces and tags you’re using to store it in Fluidinfo.

That’s it.

Please feel free to get in touch at any time if you have any questions or would like to explore the possibility of Fluidinfo Inc. helping you to add your data to Fluidinfo.

ReadWriteWeb ReadWriteAPI

Wednesday, February 23rd, 2011

Over the weekend I scraped the 11300 or so articles in the ReadWriteWeb archive. These are a great source of technology news and analysis covering stories from 2003 to the present day. Rather than keep this to myself (and rather unsurprisingly) I imported the metadata about each article into Fluidinfo. Hey presto, another instant API emerges!

Here’s how it works. For each article in the ReadWriteWeb archive there is an object in Fluidinfo. Each object has a unique “about” tag-value: the URL of the article. Furthermore, each object is annotated with information using tags found under the readwriteweb.com top level namespace. Tags include title, extract, date, categories and so on. In other words, you might visualize each object something like this:

I’ve also created and annotated objects about each of the authors of ReadWriteWeb articles and tagged objects representing each website ever mentioned by ReadWriteWeb.

So, it’s now possible to use the API like this:

>>> import fluidinfo
>>> returnTags = ['readwriteweb.com/title', 'readwriteweb.com/author-name', 'readwriteweb.com/extract', 'readwriteweb.com/date', ]
>>> query = "readwriteweb.com/year = 2010 and readwriteweb.com/month = 5 and readwriteweb.com/day = 5"
>>> head, result = fluidinfo.call('GET', '/values', tags=returnTags, query=query)
>>> head['status']
'200'
>>> result
{u'results':
    {u'id':
        {u'05936b9b-4c20-4887-9607-f63752e7f274':
            {u'readwriteweb.com/author-name': {u'value': u'Sarah Perez'},
              u'readwriteweb.com/date': {u'value': u'May  5, 2010  7:24 AM'},
              u'readwriteweb.com/extract': {u'value': u"Feel like hacking your phone today? If you've got about 10 minutes to spare, you can turn your iPhone into a Wi-Fi hotspot using a combination of the ..."},
              u'readwriteweb.com/title': {u'value': u'How To Turn Your iPhone into a Wi-Fi Hotspot'}},
        ... etc....
 

What’s just happened..? I used a client library (fluidinfo.py) to ask Fluidinfo to return the author name, publication date, title and an extract of all ReadWriteWeb articles published on 5 May 2010.
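
Because the date is stored as separate year, month and day tags, a query for any given day can be generated rather than typed by hand. A minimal sketch in Python 3 (query_for_date is a hypothetical helper; the tag names are the ones used in the query above):

```python
import datetime


def query_for_date(day, namespace="readwriteweb.com"):
    """Build a query matching articles tagged with the given year, month and day."""
    return "{ns}/year = {d.year} and {ns}/month = {d.month} and {ns}/day = {d.day}".format(
        ns=namespace, d=day
    )


query = query_for_date(datetime.date(2010, 5, 5))
print(query)
```

The resulting string can be passed as the query argument in the same way as the hand-written one.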

Being able to search and extract data from an API is cool, especially since you get this by virtue of simply hosting your data in Fluidinfo. But this is ReadWriteWeb we’re talking about, so we ought to be able to write as well as read. Happily, Fluidinfo can accommodate.

>>> fluidinfo.login('ntoll', 'mysecretpassword') # change as appropriate
>>> headers, result = fluidinfo.call('PUT', ['about', 'http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php', 'ntoll', 'rating'], 10)
>>> headers
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-type': 'text/html',
 'date': 'Wed, 23 Feb 2011 15:07:29 GMT',
 'server': 'nginx/0.7.65',
 'status': '204'}
 

The example above shows how I sign in and annotate the object “about” the article http://www.readwriteweb.com/archives/android_app_growth_on_the_rise_9000_new_apps_in_march_2010.php with a tag called ntoll/rating and an associated value of 10 (obviously I enjoyed this article). The HTTP 204 response status tells me the tag value was successfully stored.
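
The list passed to fluidinfo.call names the components of a URL path. One reason to pass them separately is that the about value is itself a URL containing slashes, so each component has to be percent-encoded on its own. The sketch below (Python 3) shows that idea; it is an assumption about what a client has to do, not a copy of the library’s internals, and both about_tag_path and the example.com URL are hypothetical.

```python
from urllib.parse import quote


def about_tag_path(about, tag_path):
    """Build an /about/<about value>/<tag path> URL path.

    Every component is percent-encoded on its own, so the slashes inside
    the about value cannot be mistaken for path separators.
    """
    components = ["about", about] + tag_path.split("/")
    return "/" + "/".join(quote(part, safe="") for part in components)


# Hypothetical about value, standing in for the article URL used above.
path = about_tag_path("http://example.com/article", "ntoll/rating")
print(path)
```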

Let’s just pause here for a moment and consider what I’ve just been able to do. Because Fluidinfo is openly writable I’m able to annotate the objects about ReadWriteWeb articles with my own data. Since objects in Fluidinfo don’t have owners or permissions attached to them I didn’t have to ask ReadWriteWeb for permission to augment the data about the article in question. Furthermore, if I only want my buddies to see what my ratings are I can set the tag to be only visible to a specific group of people. In this way Fluidinfo remains openly writable yet I still retain ownership and control over my data.

We’ve seen “read” and “write”, but what about “web”..?

Well it turns out I can stretch this analogy even further. Because everyone is tagging the same objects (identified by their “about” tag values) the data is being linked by virtue of the context of the object. We’re starting to get a web of linked data (yeah, I know, bear with me on this one…).

Since I can search and retrieve using any of the tags for which I have “read” permission I can start to create really cool mash-ups of data like this:

>>> header, result = fluidinfo.call('GET', '/values', tags=['fluiddb/about', 'boingboing.net/mentioned', 'readwriteweb.com/mentioned'], query="has boingboing.net/mentioned and has readwriteweb.com/mentioned and has unionsquareventures.com/portfolio")
>>> header
{'cache-control': 'no-cache',
 'connection': 'keep-alive',
 'content-length': '23528',
 'content-location': 'https://fluiddb.fluidinfo.com/values?query=has+boingboing.net%2Fmentioned+and+has+readwriteweb.com%2Fmentioned+and+has+unionsquareventures.com%2Fportfolio&tag=fluiddb%2Fabout&tag=boingboing.net%2Fmentioned&tag=readwriteweb.com%2Fmentioned',
 'content-type': 'application/json',
 'date': 'Wed, 23 Feb 2011 15:24:36 GMT',
 'server': 'nginx/0.7.65',
 'status': '200'}
>>> len(result['results']['id'])
4
>>> for r in result['results']['id'].values():
...     print r['fluiddb/about']['value']
...
http://www.twitter.com
http://www.etsy.com
http://www.boxee.tv
http://www.meetup.com
 

What..? I’ve just asked Fluidinfo for the BoingBoing and ReadWriteWeb articles that mention companies which are backed by Union Square Ventures and have been covered by both sites. It turns out there are four companies: Twitter, Etsy, Boxee and Meetup.

What does one of these results look like..?

{u'boingboing.net/mentioned':
    {u'value': [u'http://boingboing.net/2009/11/06/vampireotherkinenerg.html',
                     u'http://boingboing.net/2010/01/11/ny-times-on-urban-ca.html',
                     u'http://boingboing.net/2010/10/26/ron-paul-supporter-w.html',
                     u'http://boingboing.net/2002/06/27/meetup-meatspace-cam.html',
                     u'http://boingboing.net/2004/03/17/wired-rave-awards.html',
                     u'http://boingboing.net/2006/01/05/net-pug-nabbed-by-cr.html']},
u'fluiddb/about':
    {u'value': u'http://www.meetup.com'},
u'readwriteweb.com/mentioned':
    {u'value':  [u'http://www.readwriteweb.com/archives/meetup_the_secret_campaign_weapon.php']}}
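
Once a result like this is in hand, ordinary Python is enough to work with it. As a small sketch (Python 3), the following counts how many articles each site lists for the company. The data is the Meetup result above with the BoingBoing list shortened to two entries, and mention_counts is a hypothetical helper.

```python
# One result from the query above, with the BoingBoing list shortened.
result = {
    "fluiddb/about": {"value": "http://www.meetup.com"},
    "boingboing.net/mentioned": {
        "value": [
            "http://boingboing.net/2009/11/06/vampireotherkinenerg.html",
            "http://boingboing.net/2010/01/11/ny-times-on-urban-ca.html",
        ]
    },
    "readwriteweb.com/mentioned": {
        "value": [
            "http://www.readwriteweb.com/archives/meetup_the_secret_campaign_weapon.php"
        ]
    },
}


def mention_counts(result):
    """Map each '<domain>/mentioned' tag to the number of articles it lists."""
    return {
        path.split("/")[0]: len(tag["value"])
        for path, tag in result.items()
        if path.endswith("/mentioned")
    }


counts = mention_counts(result)
print(result["fluiddb/about"]["value"], counts)
```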
 

What was involved in making such a cool query possible..? Simply importing data into Fluidinfo.

I’ll say no more and let you ponder the implications of what I’ve just demonstrated…

Top data blogs information now in Fluidinfo, with an API

Saturday, February 12th, 2011

Image: Education Week

A bit over a month ago, Marshall Kirkpatrick of Read Write Web made lists of the Top 300 blogs about data and the Top 300 blogs about geo. As soon as I saw the lists, I added the data to Fluidinfo and emailed Marshall.

I added marshallk.com/top-blogs/data and marshallk.com/top-blogs/geo tags to the Fluidinfo objects that correspond to the URLs in his lists. (Fluidinfo has an object for everything; in each case I put the tags onto the logical object in Fluidinfo: the one object whose fluiddb/about value is the URL in question.)


You can then do things like this:

$ curl 'http://fluiddb.fluidinfo.com/values?query=marshallk.com/top-blogs/data%3c%3d10&tag=fluiddb/about&tag=marshallk.com/top-blogs/data' |
jsongrep.py results id '.*'
{u'fluiddb/about': {u'value': u'http://www.readwriteweb.com/cloud'},
 u'marshallk.com/top-blogs/data': {u'value': 3}}
{u'fluiddb/about': {u'value': u'http://cloud.gigaom.com'},
 u'marshallk.com/top-blogs/data': {u'value': 2}}
{u'fluiddb/about': {u'value': u'http://flowingdata.com'},
 u'marshallk.com/top-blogs/data': {u'value': 8}}
{u'fluiddb/about': {u'value': u'http://highscalability.com'},
 u'marshallk.com/top-blogs/data': {u'value': 7}}
{u'fluiddb/about': {u'value': u'http://www.calculatedriskblog.com'},
 u'marshallk.com/top-blogs/data': {u'value': 6}}
{u'fluiddb/about': {u'value': u'http://www.fivethirtyeight.com'},
 u'marshallk.com/top-blogs/data': {u'value': 5}}
{u'fluiddb/about': {u'value': u'http://www.guardian.co.uk/news/datablog'},
 u'marshallk.com/top-blogs/data': {u'value': 9}}
{u'fluiddb/about': {u'value': u'http://www.informationisbeautiful.net'},
 u'marshallk.com/top-blogs/data': {u'value': 10}}
{u'fluiddb/about': {u'value': u'http://www.zerohedge.com'},
 u'marshallk.com/top-blogs/data': {u'value': 1}}
{u'fluiddb/about': {u'value': u'http://freakonomics.com/blog'},
 u'marshallk.com/top-blogs/data': {u'value': 4}}
 

Those are the top 10 on Marshall’s data list (unsorted, obviously). I’ve cleaned up the output using my jsongrep.py program described and available here.
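
Since the results come back in no particular order, any ordering has to be done on the client. A minimal sketch in Python 3, using the URL and rank pairs copied from the output above as literal data:

```python
# (about URL, rank) pairs copied from the output above, in the order returned.
unsorted_top = [
    ("http://www.readwriteweb.com/cloud", 3),
    ("http://cloud.gigaom.com", 2),
    ("http://flowingdata.com", 8),
    ("http://highscalability.com", 7),
    ("http://www.calculatedriskblog.com", 6),
    ("http://www.fivethirtyeight.com", 5),
    ("http://www.guardian.co.uk/news/datablog", 9),
    ("http://www.informationisbeautiful.net", 10),
    ("http://www.zerohedge.com", 1),
    ("http://freakonomics.com/blog", 4),
]

# Sort on the rank, which is the value of marshallk.com/top-blogs/data.
ranked = sorted(unsorted_top, key=lambda pair: pair[1])
for url, rank in ranked:
    print("%2d  %s" % (rank, url))
```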

More interestingly, you can see if any sites are on both of Marshall’s lists:

$ curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20marshallk.com/top-blogs/geo&tag=fluiddb/about'
{"results": {"id": {"a2e56723-453a-44e5-bd91-5576d0615c8e": {"fluiddb/about": {"value": "http://blog.simplegeo.com"}}}}}

Just a single blog is in both lists: http://blog.simplegeo.com.

So far, so good.

About half an hour ago, I saw a tweet from Daniel Tunkelang (the mind behind TunkRank) saying that eCairn have just released some work based on Marshall’s data, producing a list of 500 top data blogs! Cool.

So I’ve just imported that data to Fluidinfo too, adding an ecairn.com/top-data-blogs tag to the object for each URL on their list. The value of each tag, as with Marshall’s data, is the ranking on the eCairn list.

Let’s see how many blogs are on both lists:

$ curl 'http://fluiddb.fluidinfo.com/values?query=has%20marshallk.com/top-blogs/data%20and%20has%20ecairn.com/top-data-blogs&tag=fluiddb/about' | jsongrep.py results id '.*' | wc -l
117
 

Not as many as I expected. But there are some small differences in the URLs used, for example Marshall’s list has http://kaushik.net/avinash whereas the eCairn list has http://www.kaushik.net/avinash. This would be easy to clean up, and of course it’s also possible just to tag the object for both URLs in Fluidinfo.
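
One way such a clean-up might look, as a sketch in Python 3: normalize each URL before using it as an about value, so that both spellings land on the same object. The normalize helper is hypothetical, and dropping a leading "www." is a heuristic that assumes both host names serve the same site, which is true for this example but not guaranteed in general.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lower-case the host, drop a leading 'www.' and any trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return urlunsplit((parts.scheme, host, parts.path.rstrip("/"), parts.query, ""))


a = normalize("http://kaushik.net/avinash")
b = normalize("http://www.kaushik.net/avinash")
print(a == b)
```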

You can do the Fluidinfo query has marshallk.com/top-blogs/data except has ecairn.com/top-data-blogs to see the sites that Marshall has in his list but which do not appear in the eCairn list, such as Marshall’s #12, http://blog.sqlauthority.com. eCairn’s calculation might have put them in the lower 500 of their list of 1000 (the eCairn article only gives their top 500). There are plenty of other interesting queries too, but this post is long enough already.

So there you go, a fun bit of playing with more data blog data with Fluidinfo. One of these days we’ll even make it into one of these lists :-)

Here’s the tiny bit of Python code I just wrote to add the data. It uses the Python FOM library for Fluidinfo written by Ali Afshar:

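import sys
from fom.session import Fluid

# Connect to Fluidinfo and authenticate as the user that owns the tag.
fdb = Fluid()
fdb.login('ecairn.com', 'password')

# Read one URL per line from stdin, in rank order, skipping blank lines.
urls = [line.strip() for line in sys.stdin if line.strip()]

# Tag the object about each URL with its 1-based rank on the eCairn list.
for rank, url in enumerate(urls, 1):
    fdb.about[url]['ecairn.com/top-data-blogs'].put(rank)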