How we built the O’Reilly API using Fluidinfo
How did we do it..?
There were three basic steps:
- Get the raw data.
- Clean the raw data.
- Import the cleaned data.
I’ll go through each step in detail…
Get the raw data
Since we had neither an existing raw dump of the data nor access to O’Reilly’s database, we had to find some other way to get the catalogue. We found that O’Reilly had two existing data services we could use: OPMI (the O’Reilly Product Metadata Interface) and an affiliate’s API within Safari.
Unfortunately the RDF returned from OPMI is complicated. We’d either have to become experts in RDF or learn how to use a specialist library to get at the data we were interested in. We didn’t have time to pursue either of these avenues. The other alternative, the Safari service, just didn’t work as advertised.
That left scraping the catalogue from the O’Reilly website itself, and for that we turned to ScraperWiki. Put simply, ScraperWiki allows you to write scripts that scrape (extract) information from websites and store the results for retrieval later. The “wiki” aspect of the ScraperWiki name comes from its collaborative development environment, where users can share their scripts and the resulting raw data.
In any case, a couple of hours later I had the beginnings of a script that scraped information from the O’Reilly catalogue on oreilly.com in batches. After some tests and refactoring, ScraperWiki started to do its stuff, and the result was a data dump in CSV or JSON, formats that are easy to understand and manipulate. ScraperWiki saves the day!
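To give a flavour of what that script did, here’s a minimal standalone sketch of the same idea. The catalogue URL, CSS selectors and field names are made up for the example; the real scraper ran inside the ScraperWiki environment and worked through the catalogue in batches.

```python
import json
import requests
from bs4 import BeautifulSoup

# Hypothetical catalogue page; the real scraper walked oreilly.com's
# catalogue pages in batches inside ScraperWiki.
CATALOGUE_URL = "http://oreilly.com/store/complete.html"

def scrape_catalogue(url):
    """Yield one raw record per book found on the catalogue page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for entry in soup.select("div.book"):  # illustrative selector only
        link = entry.select_one("h3 a")
        yield {
            "title": link.get_text(strip=True),
            "url": link["href"],
            "pub_date": entry.select_one("span.pub-date").get_text(strip=True),
        }

if __name__ == "__main__":
    # Dump the raw records to a JSON file for the cleaning step.
    with open("oreilly_raw.json", "w") as f:
        json.dump(list(scrape_catalogue(CATALOGUE_URL)), f, indent=2)
```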
Clean the raw data
This involved massaging the raw data into a meaningful structure that corresponded to the namespaces, tags and tag-values we were going to use in Fluidinfo. We also extracted some useful information from the raw data. For example, we made sure the publication date of each work was also stored as a machine-readable value. Finally, we checked that all the authors and books matched up.
Most of this work was done by a single Python script. It loaded the raw data (in JSON format), cleaned it and saved the cleaned data as another JSON file. This meant that we could re-clean the raw data any number of times when we got things wrong or needed to change anything. Since this was all done in-memory it was also very fast.
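Here’s a rough sketch of the kind of thing that script did, assuming made-up raw field names; the date handling shows how a human-readable publication date can also be stored in machine-readable form.

```python
import json
from datetime import datetime

def clean(raw_book):
    """Turn one raw scraped record into the tag/value structure we import.

    The field names here ('pub_date' and so on) are illustrative; the real
    raw data used whatever the scraper produced.
    """
    cleaned = {
        "title": raw_book["title"].strip(),
        "homepage": raw_book["url"],
        # Keep the human-readable date...
        "publication-date": raw_book["pub_date"],
    }
    try:
        # ...but also store it in machine-readable form.
        dt = datetime.strptime(raw_book["pub_date"], "%B %Y")
        cleaned["publication-year"] = dt.year
        cleaned["publication-month"] = dt.month
    except ValueError:
        pass  # leave the record without the machine-readable fields
    return cleaned

if __name__ == "__main__":
    with open("oreilly_raw.json") as f:
        raw = json.load(f)
    with open("oreilly_cleaned.json", "w") as f:
        json.dump([clean(book) for book in raw], f, indent=2)
```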
The file containing the cleaned data was simply a list of JSON objects that mapped to objects in Fluidinfo. The attributes of each JSON object corresponded to the tags and associated values to be imported.
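For example, a single entry in the cleaned file might have looked something like this (the attribute names and values are invented for illustration; the real ones followed the schema we’d agreed):

```json
{
  "title": "An Example Book",
  "homepage": "http://oreilly.com/catalog/example",
  "publication-date": "December 2010",
  "publication-year": 2010,
  "publication-month": 12
}
```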
Import the cleaned data
This stage took place in two parts:
- Create the required namespaces and tags
- Import the data by annotating objects
Given the cleaned data we were able to create the required namespaces and tags. You can see the resulting tree-like structure in the Fluidinfo explorer (on the left-hand side).
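We did this with FOM (more on that below), but the idea is easy to show with plain HTTP calls against Fluidinfo’s REST API. Treat the endpoints, payloads, credentials and the namespace/tag layout below as a sketch based on the API documentation of the time rather than the exact code we ran.

```python
import requests

FLUIDINFO = "https://fluiddb.fluidinfo.com"
AUTH = ("oreilly.com", "secret")  # hypothetical credentials

def create_namespace(parent, name, description):
    """Create a child namespace under `parent` (e.g. 'oreilly.com' -> 'books')."""
    return requests.post(
        "%s/namespaces/%s" % (FLUIDINFO, parent),
        json={"name": name, "description": description},
        auth=AUTH,
    )

def create_tag(namespace, name, description):
    """Create a tag inside `namespace` (e.g. 'oreilly.com/books' -> 'title')."""
    return requests.post(
        "%s/tags/%s" % (FLUIDINFO, namespace),
        json={"name": name, "description": description, "indexed": True},
        auth=AUTH,
    )

# The layout below is illustrative, not the exact O'Reilly schema.
create_namespace("oreilly.com", "books", "Data about O'Reilly books")
for tag in ("title", "homepage", "publication-date", "publication-year"):
    create_tag("oreilly.com/books", tag, "Part of the O'Reilly book data")
```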
Next, we simply iterated over the list of JSON objects and pushed them into Fluidinfo. (It’s important to note that network latency means importing data can seem to take a while. We’re well aware of this and will be blogging about best practices at a later date.)
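Again, the real import went through FOM and flimp, but the underlying operation is just a PUT of each tag value onto the right object. The “about” values, tag paths and the primitive-value content type below are best-recollection assumptions rather than the exact requests we made.

```python
import json
from urllib.parse import quote

import requests

FLUIDINFO = "https://fluiddb.fluidinfo.com"
AUTH = ("oreilly.com", "secret")  # hypothetical credentials

def annotate(about, tag_path, value):
    """Attach one tag value to the object whose fluiddb/about is `about`.

    The /about endpoint and the primitive-value content type are taken from
    the Fluidinfo API docs as we remember them; treat them as assumptions.
    """
    requests.put(
        "%s/about/%s/%s" % (FLUIDINFO, quote(about, safe=""), tag_path),
        data=json.dumps(value),
        headers={"Content-Type": "application/vnd.fluiddb.value+json"},
        auth=AUTH,
    )

with open("oreilly_cleaned.json") as f:
    books = json.load(f)

for book in books:
    # The 'about' value and tag paths are illustrative; the real import
    # was done with FOM and flimp rather than raw PUTs.
    about = "book:%s" % book["title"].lower()
    for name, value in book.items():
        annotate(about, "oreilly.com/books/%s" % name, value)
```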
We used Ali Afshar’s excellent FOM (Fluid Object Mapper) library both for creating the namespaces and tags and for importing the JSON objects into Fluidinfo, along with elements of flimp (the FLuid IMPorter) for pushing the JSON into FOM.
What have we learned..? The most time-consuming part of the exercise was scraping the data. The next most time-consuming aspect was agreeing how to organise it. The actual import of the data didn’t take long at all.
Given access to the raw data and a well thought out schema we could have done this in an afternoon.