Archive for November, 2010

Watering a Peace Lily with Fluidinfo

Tuesday, November 23rd, 2010

I (ntoll) belong to a nascent hackerspace called NortHACKton. It’s an opportunity to learn new skills and to collaborate with a great bunch of people who create cool stuff. I’m going to describe just such a collaboration with Stephen Bridges, one of the organisers of the hackerspace.

Our aim was to combine a simple hardware project with Fluidinfo and do it in such a way that others could repeat, extend and enhance what we’d been up to. We decided to connect an Arduino to a sensor and put the resulting reading into Fluidinfo at regular intervals. In the end we built something to make a moisture reading of the soil in Stephen’s plant pot and update a value in Fluidinfo every 10 minutes.

The Arduino has an Ethernet shield so the device can communicate autonomously with Fluidinfo via the HTTP API. The support circuitry is adapted from Botanicalls.com (Creative Commons) and Stephen created the sensor from a pen lid, sticky tape and a couple of wires. :-)

The source code can be found on GitHub and contains two parts:

  1. A generic and reusable layer that handles basic interaction with Fluidinfo
  2. The application logic that takes the reading and controls the Arduino.

From Fluidinfo’s point of view, there is an object that represents Stephen’s peace lily (its about tag value is “Stephen’s Peace Lily (houseplant)”) and the tag widget/ffm/reading attached to this object is updated with the appropriate value.

Interestingly, I’ve also added some tags to the object representing the peace lily which hold html, css and javascript values. This is a classic case of putting information in context since the peace lily’s web page is a tag-value attached to its object in Fluidinfo. So it’s possible to view the peace lily’s current status with your browser.

The whole thing was hacked together in an afternoon over a drink in a pub in Northampton. Unfortunately for Stephen my mobile phone takes video so I press-ganged him into the following explanation:

You can find Stephen’s write-up on the NortHACKton wiki. If you’re interested in doing something similar with Fluidinfo please don’t hesitate to drop in on our IRC channel (#fluidinfo on Freenode – connect via the web) and ask questions. Alternatively, drop by either the fluidinfo-users or fluidinfo-discuss mailing lists. We’d be more than happy to help.

Importing data into FluidDB with Flimp

Friday, November 19th, 2010

We’d like to introduce you to “Flimp” (the FLuiddb IMPorter) – a tool that makes it easy to import data into FluidDB.

It works in two ways:

  1. Given a source file containing a data dump (in either json, yaml or csv format), Flimp will create the necessary FluidDB namespaces and tags and then import the records. (We expect to provide more file formats soon.)
  2. Given a filesystem path, Flimp will create the necessary FluidDB namespaces (based on directories) and tags (based on file names) and then import file contents as values tagged on a single FluidDB object.

Flimp can be configured to do custom pre-processing (e.g. cleaning, normalizing or modifying) before data is imported into FluidDB. It’s important to note that Flimp is in active development and that we welcome comments, ideas, and bug reports. Flimp is built on fom (the Fluid Object Mapper) created by my colleague Ali Afshar.

As a test, we’ve imported all the metadata from data.gov and data.gov.uk using Flimp and made it publicly readable. The rest of this article explains exactly how we did it so you can also start importing data into FluidDB using Flimp.

Open Government Data

Open linked government data

source: http://www.flickr.com/photos/opensourceway/4371001268/

Governments are making their data openly available to citizens. This has resulted in a tidal wave of hitherto unavailable information flowing onto the Internet.

Unfortunately, it’s very easy to be swamped by both the sheer amount and diversity of what is available. Furthermore, despite progress in this area, it is still difficult to search and explore the data. Plus, governments publish data in many different ways making it difficult to link, annotate and search datasets.

Both the US and UK government data sites provide a dump of their metadata (data describing the data they have available). Finding this invaluable information is hard, so for the record here’s a link to the US dump and here’s a link to the UK dump. These are the sources Flimp imported into FluidDB. No doubt there are more from other governments and when found they’ll also mysteriously find their way into FluidDB.

Get Flimp

Flimp is written in the Python programming language. You’ll need to have this installed first along with setuptools. Once you have these requirements there are two ways to get Flimp:

  1. If you want the latest and greatest “bleeding edge” version then go visit the project’s website and follow the appropriate links/instructions.
  2. If you’d rather use the current packaged stable release then follow the instructions below. The rest of this article deals with Flimp version 0.6.1.

To install the latest stable release open a terminal and issue the following commands (Flimp depends on fom and PyYaml):

$ easy_install fom
$ easy_install PyYaml
$ easy_install flimp

Once installed you can check Flimp has installed correctly by using the “flimp” command like this:

$ flimp --version
flimp 0.6.1

That’s it! You have both the “flimp” command line tool installed and the associated libraries used for importing data into FluidDB.

Help is always available via the command line tool:

$ flimp --help
Usage: flimp [options]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -f FILE, --file=FILE  The FILE to process (valid filetypes: .json, .csv,
                        .yaml)
  -d DIRECTORY, --dir=DIRECTORY
                        The root directory for a filesystem import into
                        FluidDB
  -u UUID, --uuid=UUID  The uuid of the object to which the filesystem import
                        is to attach its tags
  -a ABOUT, --about=ABOUT
                        The about value of the object to which the filesystem
                        import is to attach its tags
  -p, --preview         Show a preview of what will happen, don't import
                        anything
  -i INSTANCE, --instance=INSTANCE
                        The URI for the instance of FluidDB to use
  -l LOG, --log=LOG     The log file to write to (defaults to flimp.log)
  -v, --verbose         Display status messages to console
  -c, --check           Validate the data file containing the data to import
                        into FluidDB - don't import anything

Importing from data.gov.uk

First, we registered the user “data.gov.uk”. Because we’ll be using tags only associated with the data.gov.uk user you can be sure that the source of the data is legitimate. (We’d love this user to be under the control of someone from data.gov.uk – contact us if this applies to you.)

Next, we downloaded a json dump of the UK’s metadata. A quick look at the raw file indicated that it was already in a remarkably good state but we wanted to make sure. Flimp helps out:


$ flimp --file=uk_data_dump.json --check
Working... (this might take some time, why not: tail -f the log?)
The following MISSING fields were found:

geographical_granularity
temporal_coverage-from
temporal_coverage_to
geographic_granularity
temporal_coverage_from
taxonomy_url
import_source
temporal_coverage-to

Full details in the missing.json file

Flimp uses the first item in the json dump as a template for the schema. The “–check” flag tells Flimp to make sure all the items match the schema. In this case we notice that some items don’t have all the fields. This isn’t a problem and if we were to open the “missing.json” file then we’d see which items these are. Importantly, Flimp also checks if any of the items have extra fields associated with them. This would be more of an issue but Flimp would help by giving details of the problem items allowing you to rectify the problem.

It is also possible to preview what Flimp would do when importing the data:

$ flimp --file=uk_data_dump.json --preview
FluidDB username: data.gov.uk
FluidDB password:
Absolute Namespace path (under which imported namespaces and tags will be created): data.gov.uk/meta
Name of dataset (defaults to filename) [uk_data_dump]: data.gov.uk:metadata
Key field for about tag value (if none given, will use anonymous objects): id
Description of the dataset: Metadata from data.gov.uk
Working... (this might take some time, why not: tail -f the log?)
Preview of processing 'uk_data_dump.json'

The following namespaces/tags will be generated.

data.gov.uk/meta/relationships
data.gov.uk/meta/ratings_average
data.gov.uk/meta/maintainer
data.gov.uk/meta/name
data.gov.uk/meta/license
data.gov.uk/meta/author
data.gov.uk/meta/url
data.gov.uk/meta/notes
data.gov.uk/meta/title
data.gov.uk/meta/maintainer_email
data.gov.uk/meta/author_email
data.gov.uk/meta/state
data.gov.uk/meta/version
data.gov.uk/meta/resources
data.gov.uk/meta/groups
data.gov.uk/meta/ratings_count
data.gov.uk/meta/license_id
data.gov.uk/meta/revision_id
data.gov.uk/meta/id
data.gov.uk/meta/tags
data.gov.uk/meta/extras/national_statistic
data.gov.uk/meta/extras/geographic_coverage
data.gov.uk/meta/extras/geographical_granularity
data.gov.uk/meta/extras/external_reference
data.gov.uk/meta/extras/temporal_coverage-from
data.gov.uk/meta/extras/temporal_granularity
data.gov.uk/meta/extras/date_updated
data.gov.uk/meta/extras/agency
data.gov.uk/meta/extras/precision
data.gov.uk/meta/extras/geographic_granularity
data.gov.uk/meta/extras/temporal_coverage_to
data.gov.uk/meta/extras/temporal_coverage_from
data.gov.uk/meta/extras/taxonomy_url
data.gov.uk/meta/extras/import_source
data.gov.uk/meta/extras/temporal_coverage-to
data.gov.uk/meta/extras/department
data.gov.uk/meta/extras/update_frequency
data.gov.uk/meta/extras/date_released
data.gov.uk/meta/extras/categories

4023 records will be imported into FluidDB

The “–preview” flag does exactly what you’d expect: it asks you the same questions as if you were importing the data for real but instead lists the new namespace/tag combinations that will be created and the number of new objects to be annotated.

It’s important to understand how Flimp generates the “about” tag value (unsurprisingly, the about tag value indicates what each object in FluidDB is about). It needs to be unique and descriptive of what the object represents. As a result Flimp asks you to identify a field in your data containing unique values and appends this to the end of the name of the dataset (in the example above, “id” was identified as the key field):


fluiddb/about = "data.gov.uk:1ea4bfa9-9ae1-4be0-ae73-e0c4a26caa6c"

If you don’t provide a field for unique values Flimp simply generates a new object without an associated “about” value.

Nicholas Radcliffe’s About Tag blog is a great source of further information about the emerging conventions surrounding the “about” tag.

Since Flimp has satisfied us that the json data was in a good state we simply issued the following command to start the actual import:

$ flimp --file=uk_data_dump.json
FluidDB username: data.gov.uk
FluidDB password:
Absolute Namespace path (under which imported namespaces and tags will be created): data.gov.uk/meta
Name of dataset (defaults to filename) [uk_data_dump]: data.gov.uk:metadata
Key field for about tag value (if none given, will use anonymous objects): id
Description of the dataset: Metadata from data.gov.uk
Working... (this might take some time, why not: tail -f the log?)

Notice how Flimp interrogates you for sensitive information so you don’t have to have username/password credentials stored in a configuration file.

After the import completed it left a record of exactly what it did in the “flimp.log” file located in the current directory.

Importing from data.gov

Just as with the UK data, we’ve used an appropriate FluidDB username for importing the US data: data.gov (and the same applies – the data.gov user should be under the control of someone from data.gov – please contact us if this applies to you).

We took a different approach to the US metadata. They provide either an rdf document or a csv file. Since Flimp understands csv we used this as the source.

We wanted to make sure that the headers in the csv file (which get transformed into the names of tags in FluidDB) were cleaned and normalized appropriately since they contained lots of whitespace and non-alphanumeric characters. The snippet of Python code below demonstrates how we re-used Flimp in our own import script to achieve this end.

from flimp.utils import process_data_list
from flimp.parser import parse_csv
from fom.session import Fluid

def clean_header(header):
    """
    A function that takes a column header and normalises / cleans it into
    something we’ll use as the name of a tag
    "
""
    # remove leading/trailing whitespace, replace inline whitespace with
    # underscore and any slashes with dashes.
    return header.strip().replace(‘ ‘, ‘_’).replace(‘/’, ‘-’)

csv_file = open("data_gov.csv", "r")
data = parse_csv.parse(csv_file, clean_header)

# data now contains the normalized input from the csv file

# Use fom to create a session with FluidDB – remember flimp uses fom for
# connecting to FluidDB
fdb = Fluid() # defines a session with FluidDB
fdb.login(‘data.gov’, ‘secretpassword123′) # replace these with something that works
fdb.bind()

root_path = ‘data.gov/meta’# Namespace where imported namespaces/tags are created
name = ‘data.gov:metadata’ # used when creating namespace/tag descriptions
desc = ‘Metadata from data.gov’ # a description of the dataset
about = ‘URL’ # field whose value to use for the about tag

# the following function call imports the data
result = process_data_list(data, root_path, name, desc, about)
print result
 

Conclusion

By importing the metadata into FluidDB we immediately gain the following:

  • FluidDB’s consistent, simple and elegant RESTful API as a view into the data.
  • The possibility of simple yet powerful queries across all the metadata.
  • The opportunity to annotate, link and augment the existing data with contributions from other sources.

Any application can now access the newly imported government data. In a future post I’ll demonstrate how to build a web-based interface for this data that is also hosted within FluidDB. I’ll also show how to query, annotate and link data yourself and re-use the contributions of others.

Coming soon to a FluidDB near you…

Monday, November 15th, 2010

Today (Monday 15th November) commencing from 10am GMT (11am Western Eurozone, 5am EST) the main instance of FluidDB will be offline for several hours while we roll out a major update.

We’re excited to announce the following new features and changes:

  • /about added to HTTP API – It will be possible to access FluidDB objects that have a fluiddb/about tag value with requests whose path starts with /about. For example, the object about “Barcelona” can be reached directly via /about/Barcelona. The behaviour of /about, when given an about value, is exactly like that of /objects when given an object id. More information will be available in the API docs at http://api.fluidinfo.com/. Many thanks to Holger Dürer (http://twitter.com/hd42) for suggesting this improvement.
  • /values added to HTTP API – It is now possible to manipulate multiple tag values in a single API request to /values via the PUT, GET and DELETE HTTP methods. From the user’s perspective, this will result in a significant improvement in performance. More information can be found in the API docs at http://api.fluidinfo.com/.
  • “SEE” permission replaced with “READ” – the permissions system has been simplified. FluidDB now uses only the READ permission on tags to decide whether API calls accessing the tag values should be allowed to proceed. Anything that used the SEE permission now uses READ. For example, when you do a GET on an object to retrieve the names of its tags, you will only receive those for which you have READ permission. Many thanks to Jamu Kakar (http://twitter.com/jkakar) for suggesting this simplification.
  • Deleting a tag instance now always returns an HTTP 204 (No content) code – DELETEing a tag value from an object that did not have that tag used to result in a “404 (Not found)” status. This will be changed to simply return the non-error “204 (No Content)”.
  • “Content-MD5″ header for checking payload content – It will be possible to send a checksum of a payload to FluidDB via the “Content-MD5″ header. FluidDB will attempt to validate the checksum with the payload and return a “412 (Precondition failed)” status in the case of a mismatch.
  • Cross Origin Resource Sharing (CORS) added to HTTP API – it will be possible to make cross origin requests as specified by http://www.w3.org/TR/cors/ rather than rely on JSONP. FluidDB will have an almost complete implementation of this emerging standard although we expect to make changes and improvements as the specification matures.
  • Text indexing of fluiddb/about tag values – text indexing is coming to FluidDB but is definitely a work in progress. This release is just the very first step: the fluiddb/about tag will be indexed from the update onwards (existing fluiddb/about tag values will be indexed over the coming days/weeks).

For those of you who have written or maintain a client library for FluidDB we’d like to refer you to the changes we’ve made to the Fluid Object Mapper (FOM) library as a reference for what you might want to do with your own library.

To encourage people to add the new FluidDB capabilities to libraries, we’re going to extend the FluidDB Weekend of Code offer to library authors. Let us know when you’re working on your library and where we can find it (Github, Bitbucket, Sourceforge etc) and we’ll order you a pizza and send you a book of your choice from Amazon.

Finally, we’re moving to a four-week development cycle so expect regular updates, pro-active bug squashing and lots of progress in the coming months. We’ve got lots of exciting stuff in the pipeline and we can’t wait to see how the FluidDB community reacts.