Archive for September, 2009

FluidDB Weekend of Code

Thursday, September 17th, 2009

Image: gui.tavares

Based extremely loosely on Google’s Summer of Code program, we’re pleased to announce the FluidDB Weekend of Code offer. Here’s the deal.

You have a go at writing a client-side library for the FluidDB HTTP API in a programming language for which no library currently exists (here’s the current list). We send you a new copy of the book of your choice for that language, plus a large pizza to keep you going. You release your code as open source, and we link to it & put your name up in lights on the libraries page.

So if you’d like to play around with a new programming language and want a fun project to tackle, why not have a go? There’s no formal commitment, and no strings attached. We’ll send you a book to help, and you get to keep it no matter what.

For example, there’s no Scala library yet. We’d love to have one, and would be delighted to send you a copy of the new Programming Scala book from O’Reilly. Or a copy of Erlang Programming, or maybe Real World Haskell takes your fancy. Or, write a library in JavaScript, or C, etc. If there’s a book that can help you (even to learn some entirely other language), we’ll ship it. We’ll also be happy to help you if you join us in the #fluiddb IRC channel or sign up for the FluidDB-users mailing list.

Sound like fun? Send mail to info at fluidinfo com. We’ll probably just send one book per language, so please understand if we’ve already got someone working on your first choice. And if you already wrote a library, well thanks :-) (seriously, feel free to ask for a book too; it’d be a pleasure).

Using FluidDB for storage in location-aware software applications

Tuesday, September 15th, 2009

I’m talking at the Association for Geographic Information (AGI) conference next week in Stratford-upon-Avon, birthplace of Shakespeare. The talk is at 14:30 on Sept 23rd in the geoweb stream, organized by Christopher Osborne. Christopher writes (and I am shameless enough to excerpt):

In the geoweb stream we have Terry Jones talking about the just released FluidDB, the database with the heart of a wiki, which looks like it might just be *the next big thing* on the internet. I was lucky enough to host him talk at #geomob in January, an attention grabbing speaker working on some amazing technology. Worth the ticket price alone.

I was also asked to join a panel on privacy and to submit a few paragraphs outlining a position for panel background. Below is what I submitted.

The area in which I may be able to contribute is in data storage. Each time you use an application (and of course sometimes when you’re not – e.g., just carrying your phone down the street in your pocket), you’re generating information. I’m always curious to know where that information winds up. Who owns it? Can I change it, delete it, share it, hide it from certain people, search on it, sell it? Most of those questions have not really been addressed in modern times – by which I mean since we started to use devices that are connected to networks, that use online storage etc. Gone (almost) are the days when you could have a truly private interaction with a piece of technology (a radio, a car, etc).

I don’t think there’s any clear conclusion here, yet. There are lots of tradeoffs involved – many people are willing – even happy – to give up huge amounts of information in exchange for some upside. But there’s a slowly growing awareness of what’s going on, and there are advances in tying information together server-side and these will enter the public consciousness over time. Physical location can tie a lot of things together, so I do think there’s something special about location in particular. Similarly, when using the web there’s a (URL) sense of location, and cookies are designed around that. Location ties things together on the web, with profound (at least in the economic sense) impact. We’re likely to see a similar thing with physical location, because everyone has a phone just like everyone has a web browser – only more so.

My particular expertise (to the extent that I have any) is in FluidDB, which we launched last month. FluidDB has an interesting model of information control. While its objects are not owned (you can think of them as concepts if you like), their pieces are. There’s a permissions model at the level of the tag (objects are made of tags). That means that a user can have their lat/long on an object in FluidDB and use the permissions model to control who can read it. Or you could have 2 sets of lat/long pairs, with different degrees of accuracy for different sets of friends or acquaintances.
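
To make the idea of tag-level permissions concrete, here is a minimal in-memory sketch. It is not the FluidDB API: the Tag class, the tag names, the user names and the coordinates are all invented for illustration. The point is only that the object itself has no owner, while each tag on it carries its own owner and reader list.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One tag value on an object, with its own owner and reader list (hypothetical model)."""
    owner: str
    value: object
    readers: set = field(default_factory=set)  # users allowed to read, besides the owner

    def read(self, user):
        if user == self.owner or user in self.readers:
            return self.value
        raise PermissionError(f"{user} may not read this tag")


# A hypothetical object: nobody owns it, but each tag on it has an owner.
# Two lat/long pairs of differing accuracy, readable by different people.
obj = {
    "terry/location/precise": Tag("terry", (52.1917, -1.7083), {"alice"}),
    "terry/location/rough": Tag("terry", (52.2, -1.7), {"alice", "bob"}),
}


def visible_tags(obj, user):
    """Return the tag values on obj that user is permitted to read."""
    result = {}
    for name, tag in obj.items():
        try:
            result[name] = tag.read(user)
        except PermissionError:
            pass  # this user simply does not see the tag
    return result
```

In this sketch alice sees both pairs, bob sees only the rough one, and anyone else sees nothing.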

FluidDB changes things for the application programmer. In the old world, the programmer was used to owning (or at least controlling) all data. When you write an application using FluidDB, you have a choice – you can let the user own some/all of the information the program is managing. That’s interesting because for the first time, I think, programmers are going to be faced with an explicit choice about who should own information. They have the ability to put the control into the hands of their users. (And the user can then be charged for the storage too.) If a user not only owns but controls their own data, they can do what they like with it. The answer to all my questions above (Can I change it, delete it, share it, hide it from certain people, search on it, sell it?) becomes Yes.

So, can we use FluidDB and storage architectures of a similar nature to move at least in part towards a world in which normal consumers have more control over their data? I think the answer is a qualified yes. Qualified because most users still don’t give a hoot, and because there are lots of tradeoffs, as mentioned above, the use of FluidDB being just one of them.

I hope that helps. I’ll try to stay away from generalities – feel free to stop me if I don’t. BTW, I’m very interested in the recent thinking of Jeff Jonas. See his latest blog post. I imagine you’ll find it highly relevant. He’ll be in London just after the AGI conf, on the 26th. Too bad, he’d have been an excellent panelist – apart from being a true expert he’s also highly entertaining!

If you’re going to the AGI conference, please say hi.

The myriad benefits of a simple query language

Thursday, September 10th, 2009

Fluidinfo has a simple query language. If you are familiar with any other query language, you can probably learn the entire Fluidinfo language in a couple of minutes. The image below shows a summary of the whole language. Without going into details, you can immediately tell there’s not much to it. Click on the image to read more. In contrast, SQL is massive. The SQL 2008 standard comes in 9 parts, the second of which is over 1300 pages.

Fluidinfo query language summary

The downside to having such a simple query language is that complicated data retrieval, processing and organization are not done server-side. Applications have to request data in a simpler fashion, process it locally, and make further network requests if they need additional related data.
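
A rough sketch of what that looks like from the application’s side follows. The STORE dictionary stands in for the remote database, and query_has and get_value stand in for network requests; none of these names come from the Fluidinfo API, and the tag names are made up. The application does the work of a join itself: one simple query, then follow-up lookups for the related data.

```python
# Toy in-memory stand-in for a remote store: object id -> {tag name: value}.
STORE = {
    1: {"book/title": "Moby Dick", "book/author-id": 10},
    2: {"book/title": "Typee", "book/author-id": 10},
    3: {"book/title": "Emma", "book/author-id": 11},
    10: {"author/name": "Herman Melville"},
    11: {"author/name": "Jane Austen"},
}


def query_has(tag):
    """Stand-in for one simple query: ids of objects carrying the tag."""
    return {oid for oid, tags in STORE.items() if tag in tags}


def get_value(oid, tag):
    """Stand-in for one follow-up request for a single tag value."""
    return STORE[oid][tag]


def titles_with_authors():
    """Do the 'join' client-side: one simple query, then follow-up lookups."""
    rows = []
    for oid in sorted(query_has("book/title")):
        author_id = get_value(oid, "book/author-id")
        rows.append((get_value(oid, "book/title"), get_value(author_id, "author/name")))
    return rows
```

Each call to get_value would be a network round trip in a real client, which is exactly the cost described above.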

The strong upside is that a deliberately simple query language permits architectural simplicity. Because query processing is the most complex part of Fluidinfo, the query language bounds the underlying complexity and directly influences the overall system implementation and architecture. Whereas a complex query language, such as SQL, makes a system difficult to scale, a simple one makes scaling simpler, at least in theory; you still have to build it, of course!

The trick is getting the balance right: design a query language that’s practical and useful for a wide variety of common tasks, but whose simplicity confers important architectural advantages.

Here are a few ways in which the Fluidinfo query language and the resultant architecture give us hope that we’re building something that can grow.

  • Complex queries are not possible. You can make a big query in Fluidinfo or a deep query or a query that returns many results, but you can’t make a complex query, by which I mean the kind of query that can bring an SQL server to its knees. Just for starters, the Fluidinfo query language has no JOIN operation. When a query language is complex, the database is at the mercy of its applications: applications can submit queries with JOINs that are so complex that the required data cannot reasonably be brought together (JOINed) in order for the selection to proceed.
  • All query resolution is simple. In the parse tree of any Fluidinfo query, all the leaves are simple. Each requires either a single lookup in a B-tree (or similar), or a single text match. The result of the processing at a leaf is always a set of object ids. The internal nodes of the query tree only require set operations (union, intersection, difference) on object ids. Below is a fragment of a query parse tree. There’s nothing else.

    A Fluidinfo query parse tree fragment

  • Parallelization is trivial. Because the values of Fluidinfo tags are stored separately, as in a column store, leaf queries are always sent in parallel to the independent servers that maintain the tags in question.
  • It scales horizontally. Because tag values are stored independently and internal query tree nodes are always simple set operations on object ids, the architecture is easy to scale horizontally. We built (and open-sourced) txAMQP to combine Thrift and AMQP with Twisted to give ourselves transparent messaging-mediated RPC. That means new servers can be deployed and run services that simply join or create the appropriate AMQP queues, and immediately begin receiving RPC calls. When more tag servers or set operation servers are needed, it is trivial to add them.
  • Unused tags can be taken offline. Because tags are stored independently, those that have not been used for some time can have their values serialized and stored in a cheaper medium for the interim. They need not occupy expensive and scarce RAM. When they’re next queried—if ever—they can rapidly be brought back online. This is an architectural advantage that’s mainly made possible by the system design, not the query language simplicity. I’ve included it nevertheless, because this kind of optimization might not be possible in a system with a query language that demanded a more complex underlying data organization.
  • It can scale down as well as up. Just as scaling up by adding servers is simple, servers can be taken down during quieter periods. Set operations servers can simply disappear. Tag servers can migrate management of their tags to other servers or just take tags offline – they will be re-animated by another tag server when next needed.
  • Adaptive affinity is straightforward. When tags are frequently being queried together, they can be migrated to the same tag server. Then an entire sub-query involving both can be sent to that server and the result, just a set of object ids, flows up through the query tree exactly as it would have had the leaves been processed on separate servers. And when things get too hot, i.e., tags being stored together have created a hotspot, they can be migrated to separate servers.
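
The query resolution described in the list above can be sketched in a few lines. This is a toy model under stated assumptions, not Fluidinfo’s implementation: the tuple-based tree format, the leaf kinds ("has", "gt"), the operator names and the TAGS dictionaries are all invented here, and a thread pool stands in for independent tag servers. Every leaf resolves to a set of object ids with a single lookup, the leaves are resolved in parallel, and the internal nodes do nothing but set operations.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy tag stores: tag name -> {object id: value}. In the design described
# above each tag could live on a different server; here they are just dicts.
TAGS = {
    "rating": {1: 9, 2: 4, 3: 7},
    "title": {1: "Moby Dick", 2: "Typee", 4: "Emma"},
    "seen": {2: True, 3: True},
}

# Internal nodes may only combine sets of object ids.
SET_OPS = {
    "and": set.intersection,
    "or": set.union,
    "except": set.difference,
}


def eval_leaf(leaf):
    """Resolve one leaf to a set of object ids with a single lookup."""
    kind, tag, *rest = leaf
    values = TAGS.get(tag, {})
    if kind == "has":
        return set(values)
    if kind == "gt":
        return {oid for oid, value in values.items() if value > rest[0]}
    raise ValueError(f"unknown leaf kind: {kind}")


def leaves(node):
    """Yield every leaf of the query tree."""
    if node[0] in SET_OPS:
        yield from leaves(node[1])
        yield from leaves(node[2])
    else:
        yield node


def combine(node, leaf_results):
    """Walk the tree bottom-up, applying set operations to the leaf results."""
    if node[0] in SET_OPS:
        left = combine(node[1], leaf_results)
        right = combine(node[2], leaf_results)
        return SET_OPS[node[0]](left, right)
    return leaf_results[node]


def run_query(tree):
    """Resolve all leaves in parallel, then combine them with set operations."""
    leaf_list = list(leaves(tree))
    with ThreadPoolExecutor() as pool:
        leaf_sets = list(pool.map(eval_leaf, leaf_list))
    return combine(tree, dict(zip(leaf_list, leaf_sets)))
```

For example, run_query(("and", ("gt", "rating", 5), ("or", ("has", "title"), ("has", "seen")))) resolves three leaves independently and then needs only a union and an intersection to finish.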

That’s enough for now. There are other, more detailed, advantages that I’ve omitted for brevity. I’m trying to keep each of these posts down to reasonable size.

Metadata vs Data: a wholly artificial distinction

Saturday, September 5th, 2009

Image: psd

Computer scientists are fond of talking about metadata. There often seems to be an assumption that drawing a distinction between metadata and data is useful and perhaps even necessary.

At an architectural level, I think that’s entirely wrong. Any storage architecture that maintains a distinction between metadata and data has real problems that will limit its flexibility and usefulness. Note that I’m not saying that an application shouldn’t maintain a distinction between metadata and data, or that applications shouldn’t present things to users in those terms, or that it’s not useful to think in terms of metadata and data. I’m also not claiming that every storage architecture needs to be flexible – there are obviously times where that appears unnecessary (though in many cases you may end up wanting more flexibility).

I’ll simply argue that if you aim to build a storage architecture with real flexibility, maintaining a distinction between data and metadata runs directly counter to your goal. Below I’ll outline some reasons why.

But first, consider the natural world. If you talk to a regular person — meaning someone who’s not a computer scientist, a librarian, an archivist etc. — and ask them if they know what metadata is, you’ll probably draw a blank. Why is that? It’s because the distinction between data and metadata is entirely artificial. It does not exist in the real world, and it’s clear that regular people can get by just fine without it. Fluidinfo draws its inspiration from the way we work with information in the natural world, and maintains no such distinction.

It’s interesting to speculate on the origins of the metadata vs data distinction. I’d love to know its full history. I suspect that it arose from early architectural constraints, from the relative design and programming ease of maintaining a set of constant-size chunks of information about files apart from the dynamic and variable-size memory required by the contents of files. I suspect it probably also has to do with architectural limitations and the slowness of early machines.

Here then are the main reasons why the distinction is harmful.

  • Two access methods: When metadata and data are stored separately, the way to get at those two different things is likely to be different. Consider inodes in a UNIX filesystem versus the disk blocks containing file data. They are stored differently and cannot be accessed in a uniform way. This causes internal complexity for the storage architecture.
  • Two permissions systems: There are likely to be two permissions systems governing changes to metadata and data. This is another source of internal complexity for the architecture.
  • Search across the two is complex or impossible: Why has it traditionally been so hard to find, for example, a file with “accounts” in its name and “automobiles” in the contents? Because this is a simultaneous search across file metadata and file content. The division between metadata (the name) and the data (the content) made such searches extremely difficult. Even with modern systems it’s awkward. Consider the UNIX find command which searches based on file metadata and the grep command which searches file contents. Combining the two is not easy. It’s at least possible in some systems these days, but that’s because those systems pull all the information together and build a separate index on it – i.e., they allow it by removing the division between metadata and data.
  • A central piece of content: Systems, especially document or file systems, usually maintain a distinction between the content and the metadata about the content. But the real world doesn’t work that way. You may possess information about something without having the thing. There may be no pieces of content, or there may be many.
  • Who decides?: If a system maintains a distinction between metadata and data, who decides which is which? Almost inevitably, it’s a programmer, a system architect, or a product manager who makes those decisions. There’s an implicit assertion that they know more about your information than you do. They decide what should be in the metadata. While there are systems that let users create metadata, they are usually limited in scope – someone has decided in advance how much metadata a regular user should be allowed to create, what kind of metadata it can be, how it will be used, how users will be allowed to search on it, etc. The intentions are good, but the whole thing smacks of parental control, of hand-holding, of “trust us, we know better than you do”.
  • Time dependency at creation: Systems maintaining the distinction also introduce an unnatural time dependency. Until the content (i.e., the data) is available, there’s nowhere to put the metadata. E.g., a file object has to be created before it can have metadata, a web page has to come into existence before you can tag it. But the real world doesn’t work that way. E.g., you can have an opinion about someone you’ve never met, or someone who’s dead or fictional. You can have a summary of a call agenda before the call happens, or notes about a meeting before the minutes of the meeting are prepared.
  • Time dependency at deletion: The awkward time dependency bites when the content is deleted too. The metadata necessarily vanishes because the architecture doesn’t allow it to persist: there’s literally nowhere to put it. Once again, the real world doesn’t work that way. E.g., you’re sent a large image file of someone’s pet cat – you take a look and, to show you care, make a mental note of its name and breed, but you delete the image because you don’t want to store it. Or suppose you give away or lose your copy of Moby Dick – you don’t therefore immediately forget the book’s title, its plot, the author, the name of the main character, an idea of how long it is, the book’s first line, etc. The “content” is gone, but the metadata remains. You may have never owned the book, you may think you have a copy but do not, you may have two copies – in the natural world it just doesn’t matter, and nor should it in a storage architecture. Interestingly, Amazon are currently being sued because they threw away someone’s metadata in the process of removing a copy of Orwell’s 1984 from a Kindle. You can bet the metadata was removed automatically when the content was removed.
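
To illustrate the “search across the two” point from the list above, here is a plain-Python sketch of the “accounts” in the name, “automobiles” in the contents search. It is not how find or grep are implemented, and the function name is invented. Notice that it needs two different mechanisms: a directory walk to test names, and opening and reading each file to test contents.

```python
import os


def find_files(root, name_part, content_part):
    """Yield paths under root whose name contains name_part and whose text contains content_part."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if name_part not in filename:  # metadata test (the job find does)
                continue
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    if content_part in handle.read():  # content test (the job grep does)
                        yield path
            except OSError:
                continue  # unreadable file: skip it
```

A call such as list(find_files(".", "accounts", "automobiles")) walks the current directory and reads every candidate file in full, which is fine for a sketch but is exactly the kind of work an index over both names and contents avoids.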

OK, enough examples for now.

Fluidinfo has none of the problems listed above. It has absolutely no distinction between metadata and data. It has a single permissions system that mediates access to all information. When a tag (perhaps used or presented as the “content” by an application) is removed from an object, all the other tags remain. There is no distinction between important system information and the information stored by any regular user or application – they’re all on an equal footing, and that includes future applications and users. No-one gets to set the rules about what’s more important and what’s not; there’s simply no distinction. You can search on anything, using a single query language – the system uses the query language to find things it needs, just like any other application. The single permissions system mediates who can do what – equally and uniformly.
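
A toy model may help show why the two time dependencies disappear once nothing is singled out as “the content”. The Store class, its method names, the "about" strings and the tag names below are assumptions made for this sketch, not Fluidinfo’s interface, and permissions are left out to keep it short.

```python
class Store:
    """Toy store: an object is just a bag of tags, looked up by an 'about' string."""

    def __init__(self):
        self.objects = {}  # about string -> {tag name: value}

    def tag(self, about, name, value):
        # The object springs into existence on first use; no content is needed first.
        self.objects.setdefault(about, {})[name] = value

    def untag(self, about, name):
        # Removing one tag never touches the others.
        self.objects.get(about, {}).pop(name, None)

    def query(self, predicate):
        """One search mechanism over every tag, whatever role an application gives it."""
        return {about for about, tags in self.objects.items() if predicate(tags)}


store = Store()
# An opinion can be stored before any "content" exists...
store.tag("book:moby-dick", "terry/opinion", "long but worth it")
# ...the "content" can arrive later as just another tag...
store.tag("book:moby-dick", "library/text", "Call me Ishmael.")
# ...and removing it leaves everything else in place.
store.untag("book:moby-dick", "library/text")
```

After the last line the object still carries terry/opinion, and a query on that tag still finds it.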

I used to argue that everything should just be considered data. But I think David Weinberger puts it better in Everything is Miscellaneous where he says it’s all metadata. Call it what you will, it’s clear (to me at least) that at a fundamental level there should be no distinction.

BTW, if you’re into self-reference, you might also be interested to know that Fluidinfo uses itself to implement its permissions system. Permissions are just more information, after all. Fluidinfo stores that information for tags, namespaces, and users onto the regular Fluidinfo objects that are about those things. There truly is no metadata / data distinction. It’s a little like Lisp: once you have the core system in place, you can (and should) use it to implement the wider system.