How we built the O’Reilly API using Fluidinfo

March 22nd, 2011 by Nicholas Tollervey. Filed under Data, Howto, People, Progress.

In case you haven’t noticed, we’ve imported the O’Reilly catalogue into Fluidinfo, giving O’Reilly an instantly writable API for their data.

How did we do it?

There were three basic steps:

  1. Get the raw data.
  2. Clean the raw data.
  3. Import the cleaned data.

That’s it!

I’ll take each step in detail…

Get the raw data

Since we had neither a raw dump of the data nor access to O’Reilly’s database, we had to find some other way to get the catalogue. We found that O’Reilly had two existing data services we could use: OPMI (O’Reilly Product Metadata Interface) and an affiliate’s API within Safari.

Unfortunately the RDF returned from OPMI is complicated. We’d either have to become experts in RDF or learn how to use a specialist library to get at the data we were interested in. We didn’t have time to pursue either of these avenues. The other alternative, the Safari service, just didn’t work as advertised. :-(

Then we remembered learning about @frabcus and @amcguire62’s ScraperWiki project.

Put simply, ScraperWiki allows you to write scripts that scrape (extract) information from websites and store the results for retrieval later. The “wiki” aspect of the ScraperWiki name comes from its collaborative development environment where users can share their scripts and the resulting raw data.

In any case, a couple of hours later I had the beginnings of a batched up script for scraping information from the O’Reilly catalogue on the oreilly.com website. After some tests and refactoring ScraperWiki started to do its stuff. The result was a data dump in the easy to understand and manipulate CSV or JSON formats. ScraperWiki saves the day!
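
To give a flavour of what a scraper does, here is a minimal sketch rather than the script we actually ran. It uses only Python’s standard html.parser module instead of ScraperWiki’s own helpers, and the markup, class names and book details in SAMPLE_PAGE are invented placeholders, not the real oreilly.com pages.

```python
import json
from html.parser import HTMLParser

# Hypothetical catalogue markup with placeholder values (an assumption
# for illustration; the real pages were structured differently).
SAMPLE_PAGE = """
<div class="book"><span class="title">Example Book One</span>
<span class="author">A. N. Author</span></div>
<div class="book"><span class="title">Example Book Two</span>
<span class="author">Another Author</span></div>
"""


class CatalogueParser(HTMLParser):
    """Collect one dict per <div class="book"> element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # name of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "book":
            # A new book starts: open an empty record for it.
            self.records.append({})
        elif tag == "span" and css_class in ("title", "author") and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a span we care about.
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(html):
    """Return a list of {"title": ..., "author": ...} dicts found in html."""
    parser = CatalogueParser()
    parser.feed(html)
    parser.close()
    return parser.records


if __name__ == "__main__":
    # Dump the scraped records as JSON, ready for the cleaning step.
    print(json.dumps(scrape(SAMPLE_PAGE), indent=2))
```

A real scraper would fetch each catalogue page over HTTP and feed the response to the parser in batches; the parsing logic stays the same.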

Clean the raw data

This involved massaging the raw data into a meaningful structure that corresponded to the namespaces, tags and tag-values we were going to use in Fluidinfo. We also extracted some useful information from the raw data. For example, we made sure the publication date of each work was also stored in a machine-readable value. Finally, we checked that all the authors and books matched up.

Most of this work was done by a single Python script. It loaded the raw data (in JSON format), cleaned it and saved the cleaned data as another JSON file. This meant that we could re-clean the raw data any number of times when we got things wrong or needed to change anything. Since this was all done in-memory it was also very fast.
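
The shape of such a cleaning script can be sketched as follows. This is an illustration under assumptions, not our original code: the field names (title, authors, published), the accepted date formats and the file names in the usage note are all hypothetical.

```python
import json
from datetime import datetime

# Date layouts we assume might appear in the scraped text (hypothetical).
DATE_FORMATS = ("%B %Y", "%b %Y", "%Y-%m-%d")


def parse_publication_date(text):
    """Return an ISO 8601 date string for text like 'March 2011', or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


def clean_record(raw):
    """Normalise one scraped record without touching the raw input."""
    published = raw.get("published", "").strip()
    cleaned = {
        "title": raw["title"].strip(),
        # Drop empty author entries and stray whitespace.
        "authors": [a.strip() for a in raw.get("authors", []) if a.strip()],
        # Keep the human-readable date as scraped.
        "publication-date-text": published,
    }
    iso_date = parse_publication_date(published)
    if iso_date is not None:
        # Store the machine-readable value alongside the original text.
        cleaned["publication-date"] = iso_date
    return cleaned


def clean_file(raw_path, cleaned_path):
    """Load raw JSON, clean every record in memory, save a new JSON file."""
    with open(raw_path, encoding="utf-8") as source:
        raw_records = json.load(source)
    cleaned = [clean_record(record) for record in raw_records]
    with open(cleaned_path, "w", encoding="utf-8") as target:
        json.dump(cleaned, target, indent=2)
    return cleaned
```

Because the raw file is never modified, calling something like clean_file("raw.json", "cleaned.json") again after changing clean_record is all it takes to re-clean everything.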

The file containing the cleaned data was simply a list of JSON objects that mapped to objects in Fluidinfo. The attributes of each JSON object corresponded to the tags and associated values to be imported.

Import the cleaned data

This stage took place in two parts:

  1. Create the required namespaces and tags
  2. Import the data by annotating objects

Given the cleaned data we were able to create the required namespaces and tags. You can see the resulting tree-like structure in the Fluidinfo explorer (on the left hand side).
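
One way to work out what needs creating is to walk the keys of the cleaned objects. The sketch below assumes each key is a full tag path such as example-user/books/title (a made-up path, not the one we used) and derives the namespaces and tags from it; it is not the FOM code we ran.

```python
def required_paths(records):
    """Return (namespaces, tags) needed to hold the given records.

    Each record is assumed to be a dict keyed by full tag paths of the
    form "namespace/sub-namespace/tag".
    """
    namespaces, tags = set(), set()
    for record in records:
        for path in record:
            parts = path.split("/")
            tags.add(path)
            # Every proper prefix of a tag path is a namespace.
            for depth in range(1, len(parts)):
                namespaces.add("/".join(parts[:depth]))
    # Shallow paths first, so parents can be created before children.
    ordered_namespaces = sorted(namespaces, key=lambda p: (p.count("/"), p))
    return ordered_namespaces, sorted(tags)


if __name__ == "__main__":
    # Placeholder records using hypothetical tag paths.
    sample = [
        {"example-user/books/title": "Example Book One",
         "example-user/books/authors": ["A. N. Author"]},
        {"example-user/books/isbn": "0000000000"},
    ]
    namespaces, tags = required_paths(sample)
    print(namespaces)
    print(tags)
```

Creating the namespaces in the returned order, and then the tags, gives the tree-like structure described above.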

Next, we simply iterated over the list of JSON objects and pushed them into Fluidinfo. (It’s important to note that network latency means importing data can seem to take a while. We’re well aware of this and will be blogging about best practices at a later date.)
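
The loop itself is short. In this sketch the function that talks to Fluidinfo is passed in as tag_object, a hypothetical callable standing in for the FOM calls we actually used, and each record is assumed to carry an "about" key identifying the object to annotate. Collecting failures rather than stopping is a design choice for the sketch: with a slow network it lets you retry only what went wrong.

```python
def import_records(records, tag_object):
    """Push each cleaned record; return (about, error message) for failures.

    tag_object(about, values) is a hypothetical callable that annotates
    the object identified by about with the given tag paths and values.
    """
    failed = []
    for record in records:
        about = record["about"]
        # Everything except the about value is a tag path and its value.
        values = {path: value for path, value in record.items()
                  if path != "about"}
        try:
            tag_object(about, values)
        except Exception as error:  # broad on purpose: log and carry on
            failed.append((about, str(error)))
    return failed


if __name__ == "__main__":
    # In-memory stand-in for the remote service, for demonstration only.
    store = {}

    def fake_tag_object(about, values):
        store.setdefault(about, {}).update(values)

    sample = [
        {"about": "book:example book one",
         "example-user/books/title": "Example Book One"},
    ]
    print(import_records(sample, fake_tag_object))
    print(store)
```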

That’s it!

We used Ali Afshar’s excellent FOM (Fluid Object Mapper) library both to create the namespaces and tags and to import the JSON objects into Fluidinfo. Elements of flimp (the FLuid IMPorter) pushed the JSON into FOM.

What have we learned? The most time-consuming part of the exercise was scraping the data. The next most time-consuming aspect was agreeing how to organise it. The actual import of the data didn’t take long at all.

Given access to the raw data and a well thought out schema we could have done this in an afternoon.

  • http://www.flourish.org/ Francis Irving

    Great stuff! You can find all Nicholas’s scrapers, including these O’Reilly ones, on ScraperWiki here.

    http://scraperwiki.com/profiles/ntoll/

  • http://www.patenttrademarklitigation.com/index.html Trademark Litigation

    here is a similar story

    O’Reilly Media is joining with Fluidinfo, an online information storage and search platform that supports openly-writable metadata of any kind, to launch a contest to encourage software developers to write applications for the O’Reilly Fluidinfo “Writable API.” APIs, or Application Programmer Interfaces, provide third-party developers with a set of rules and permissions that gives them access to a content owner’s data and allows them to create apps and new information products quickly and easily.

  • Publius

    This is the first time I’ve come to Fluidinfo, and I must say I am perplexed by your reasoning. You wrote that you didn’t want to go through the trouble of parsing RDF, yet RDF is supposed to form the basis of a decentralized universal meta-language for generating ontologies in the Web Ontology Language (OWL). OWL resources can be reasoned-over and aligned algorithmically. How does Fluidinfo further this cause or make information more free? If the goal is to form a vast repository of metadata, then why not produce software and ideology that furthers the burgeoning yet commercially vulnerable Semantic Web? Hypothetically, if Fluidinfo were to collect all the metadata in the world, I don’t see how, as a centralized data store, it could discover or exploit the inherently loose-ended, implicit relationships found in natural language by virtue of its polysemy. In particular, supposing Fluidinfo had a single document “about” each item, Fluidinfo would end up with multiple fields with the same value keyed by synonyms (and by hyponyms, hypernyms, troponyms, etc). Consequently, you would have to merge these fields in order to make sense of them, which would result in either massive data duplication or an erasure of history by destructive updates. In contrast, one of the reasons for distributing and querying RDF/OWL ontologies is to preserve original differences in perspectives, contexts, and naming-conventions for intersecting scopes and domains. By keeping ontologies distinct, the problems of storage space and semantic ambiguity (like synonyms, etc) can be resolved by generating separate merged ontologies and views that represent a *statistical* alignment, not an absolute one, generated by historical erasure. publius[at]ufl.edu. Anyway, as I said, I’m not too familiar with Fluidinfo, so please forgive my misunderstanding. -Dan G.

  • http://twitter.com/ntoll Nicholas Tollervey

    Hi Dan (Publius),

    Thanks for taking the time to comment. :-)

    My point was simply that I didn’t have time to learn/remember all about RDF/OWL and other semweb technology or learn how to use a library that allowed me to get where I wanted to be ASAP.

    Put simply, it was simpler and easier to scrape the data than it was to use O’Reilly’s existing API.

    Obviously, we’re very sympathetic to the aims and objectives of the Semantic Web. However, one of our primary aims is simplicity and a relatively shallow learning curve. Unfortunately that’s not been the experience of my developers when encountering semweb technology (although I acknowledge this is a matter of opinion).

    I personally think there’s room for lots of solutions in this space and people will choose whichever one seems most appropriate.

    With regard to your point about “all the metadata in the world”: as I said, we’re just one service among many and people will use us if we meet their requirements. There’s no nefarious plot for global domination of data.

    You go on to discuss how Fluidinfo would have to do this or that given various situations. Fluidinfo *as a technology* doesn’t have any opinion on this (although *we* care very deeply about the data we are custodians of). Put simply, we’d rather users work it out themselves via emerging conventions and the evolution of the data structures they define with their namespaces/tags and the objects they end up tagging. Differences in perspective are handled via the namespaces and the permission system imposes some sort of ownership and therefore trust in Fluidinfo.

    I hope this makes sense.

    Please don’t hesitate to get in touch if you want to discuss these things further. I personally welcome dialogue about these issues since it’s something I find fascinating. I’m ntoll at the domain of this site ;-)

    Best wishes,

    Nicholas.
