Metadata vs Data: a wholly artificial distinction

September 5, 2009

Filed under: Essence — Terry Jones @ 9:15 pm

Computer scientists are fond of talking about metadata. There often seems to be an assumption that drawing a distinction between metadata and data is useful and perhaps even necessary.

At an architectural level, I think that’s entirely wrong. Any storage architecture that maintains a distinction between metadata and data has real problems that will limit its flexibility and usefulness. Note that I’m not saying that an application shouldn’t maintain a distinction between metadata and data, or that applications shouldn’t present things to users in those terms, or that it’s not useful to think in terms of metadata and data. I’m also not claiming that every storage architecture needs to be flexible – there are obviously times where that appears unnecessary (though in many cases you may end up wanting more flexibility).

I’ll simply argue that if you aim to build a storage architecture with real flexibility, maintaining a distinction between data and metadata runs directly counter to your goal. Below I’ll outline some reasons why.

But first, consider the natural world. If you talk to a regular person — meaning someone who’s not a computer scientist, a librarian, an archivist etc. — and ask them if they know what metadata is, you’ll probably draw a blank. Why is that? It’s because the distinction between data and metadata is entirely artificial. It does not exist in the real world, and it’s clear that regular people can get by just fine without it. Fluidinfo draws its inspiration from the way we work with information in the natural world, and maintains no such distinction.

It’s interesting to speculate on the origins of the metadata vs data distinction. I’d love to know its full history. I suspect that it arose from early architectural constraints, from the relative design and programming ease of maintaining a set of constant-size chunks of information about files apart from the dynamic and variable-size memory required by the contents of files. I suspect it probably also has to do with architectural limitations and the slowness of early machines.

Here then are the main reasons why the distinction is harmful.

Two access methods: When metadata and data are stored separately, the way to get at those two different things is likely to be different. Consider inodes in a UNIX filesystem versus the disk blocks containing file data. They are stored differently and cannot be accessed in a uniform way. This causes internal complexity for the storage architecture.
Two permissions systems: There are likely to be two permissions systems governing changes to metadata and data. This is another source of internal complexity for the architecture.
Search across the two is complex or impossible: Why has it traditionally been so hard to find, for example, a file with “accounts” in its name and “automobiles” in the contents? Because this is a simultaneous search across file metadata and file content. The division between metadata (the name) and the data (the content) made such searches extremely difficult. Even with modern systems it’s awkward. Consider the UNIX find command which searches based on file metadata and the grep command which searches file contents. Combining the two is not easy. It’s at least possible in some systems these days, but that’s because those systems pull all the information together and build a separate index on it – i.e., they allow it by removing the division between metadata and data.
A central piece of content: Systems, especially document or file systems, usually maintain a distinction between the content and the metadata about the content. But the real world doesn’t work that way. You may possess information about something without having the thing. There may be no pieces of content, or there may be many.
Who decides?: If a system maintains a distinction between metadata and data, who decides which is which? Almost inevitably, it’s a programmer, a system architect, or a product manager who makes those decisions. There’s an implicit assertion that they know more about your information than you do. They decide what should be in the metadata. While there are systems that let users create metadata, they are usually limited in scope – someone has decided in advance how much metadata a regular user should be allowed to create, what kind of metadata it can be, how it will be used, how users will be allowed to search on it, etc. The intentions are good, but the whole thing smacks of parental control, of hand-holding, of “trust us, we know better than you do”.
Time dependency at creation: Systems maintaining the distinction also introduce an unnatural time dependency. Until the content (i.e., the data) is available, there’s nowhere to put the metadata. E.g., a file object has to be created before it can have metadata, a web page has to come into existence before you can tag it. But the real world doesn’t work that way. E.g., you can have an opinion about someone you’ve never met, or someone who’s dead or fictional. You can have a summary of a call agenda before the call happens, or notes about a meeting before the minutes of the meeting are prepared.
Time dependency at deletion: The awkward time dependency bites when the content is deleted too. The metadata necessarily vanishes because the architecture doesn’t allow it to persist: there’s literally nowhere to put it. Once again, the real world doesn’t work that way. E.g., you’re sent a large image file of someone’s pet cat – you take a look and, to show you care, make a mental note of its name and breed, but you delete the image because you don’t want to store it. Or suppose you give away or lose your copy of Moby Dick – you don’t therefore immediately forget the book’s title, its plot, the author, the name of the main character, an idea of how long it is, the book’s first line, etc. The “content” is gone, but the metadata remains. You may have never owned the book, you may think you have a copy but do not, you may have two copies – in the natural world it just doesn’t matter, and nor should it in a storage architecture. Interestingly, Amazon are currently being sued because they threw away someone’s metadata in the process of removing a copy of Orwell’s 1984 from a Kindle. You can bet the metadata was removed automatically when the content was removed.
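
To make the search example from the list above concrete, here is a minimal Python sketch of the "accounts in the name, automobiles in the contents" search. The function name find_by_name_and_content and its parameters are made up for illustration. The point to notice is that it has to consult two different sources in two different ways: the directory listing for the name, and the file's bytes for the content.

```python
import os
from pathlib import Path


def find_by_name_and_content(root, name_part, content_part):
    """Return paths under root whose file name contains name_part
    and whose text contains content_part (hypothetical helper)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Step 1: the "metadata" test, answered by the directory entry.
            if name_part not in filename:
                continue
            path = Path(dirpath) / filename
            # Step 2: the "data" test, answered by reading the file itself.
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip it
            if content_part in text:
                matches.append(path)
    return sorted(matches)
```

A call such as find_by_name_and_content(".", "accounts", "automobiles") walks the whole tree every time. Systems that make this fast typically do so by indexing names and contents together, which is the same as dropping the division for search purposes.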

OK, enough examples for now.

Fluidinfo has none of the problems listed above. It has absolutely no distinction between metadata and data. It has a single permissions system that mediates access to all information. When a tag (perhaps used or presented as the “content” by an application) is removed from an object, all the other tags remain. There is no distinction between important system information and the information stored by any regular user or application – they’re all on an equal footing, and that includes future applications and users. No-one gets to set the rules about what’s more important and what’s not; there’s simply no distinction. You can search on anything, using a single query language – the system uses the query language to find things it needs, just like any other application. The single permissions system mediates who can do what – equally and uniformly.
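
As a rough illustration of the idea, and emphatically not Fluidinfo's actual API, here is a toy in-memory store in Python. The class TagStore and all of its method names are assumptions made up for this sketch. Objects carry nothing but tags, one check guards every write, one query function searches any tag, and removing the tag an application treats as the "content" leaves the other tags in place.

```python
class TagStore:
    """Toy store (hypothetical): objects carry only tags; none is 'meta'."""

    def __init__(self):
        self._objects = {}  # object id -> {tag name: value}
        self._writers = {}  # tag name -> set of users allowed to write it

    def allow(self, user, tag):
        """Let user write the given tag on any object."""
        self._writers.setdefault(tag, set()).add(user)

    def _check(self, user, tag):
        # The same check applies to every tag, whatever an app calls it.
        if user not in self._writers.get(tag, set()):
            raise PermissionError(f"{user} may not write {tag!r}")

    def set_tag(self, user, obj_id, tag, value):
        self._check(user, tag)
        # The object springs into existence with its first tag.
        self._objects.setdefault(obj_id, {})[tag] = value

    def remove_tag(self, user, obj_id, tag):
        self._check(user, tag)
        self._objects.get(obj_id, {}).pop(tag, None)

    def tags(self, obj_id):
        """Return a copy of the tags currently on an object."""
        return dict(self._objects.get(obj_id, {}))

    def query(self, predicate):
        """Return ids of objects whose tag dict satisfies predicate."""
        return sorted(o for o, t in self._objects.items() if predicate(t))


store = TagStore()
for tag in ("title", "rating", "text"):
    store.allow("alice", tag)

# Tags can be attached before any "content" tag exists.
store.set_tag("alice", "book-1", "title", "Moby Dick")
store.set_tag("alice", "book-1", "rating", 5)
store.set_tag("alice", "book-1", "text", "Call me Ishmael.")

# Dropping the "content" leaves everything else on the object.
store.remove_tag("alice", "book-1", "text")
```

After the last line, store.tags("book-1") still holds the title and the rating, and a query on either of them still finds the object.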

I used to argue that everything should just be considered data. But I think David Weinberger puts it better in Everything is Miscellaneous where he says it’s all metadata. Call it what you will, it’s clear (to me at least) that at a fundamental level there should be no distinction.

BTW, if you’re into self-reference, you might also be interested to know that Fluidinfo uses itself to implement its permissions system. Permissions are just more information, after all. Fluidinfo stores that information for tags, namespaces, and users onto the regular Fluidinfo objects that are about those things. There truly is no metadata / data distinction. It’s a little like Lisp: once you have the core system in place, you can (and should) use it to implement the wider system.
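
Here is a small sketch of what "permissions are just more information" could look like. It is a toy in Python, not how Fluidinfo is actually implemented: the class SelfDescribingStore, the "tag:<name>" naming convention, and the "writers" tag are all assumptions invented for the example. The list of users allowed to write a tag is stored as an ordinary tag on the object that is about that tag, so granting a permission is itself a tag write that passes through the same check.

```python
class SelfDescribingStore:
    """Toy store (hypothetical) whose permissions live in its own tag space."""

    def __init__(self, admin):
        self.objects = {}  # object id -> {tag name: value}
        self.admin = admin

    def _about(self, tag):
        # Assumed convention: the object describing a tag is "tag:<name>".
        return self.objects.setdefault(f"tag:{tag}", {})

    def may_write(self, user, tag):
        # The permission check is an ordinary tag lookup.
        return user == self.admin or user in self._about(tag).get("writers", ())

    def set_tag(self, user, obj_id, tag, value):
        if not self.may_write(user, tag):
            raise PermissionError(f"{user} may not write {tag!r}")
        self.objects.setdefault(obj_id, {})[tag] = value

    def grant(self, granter, user, tag):
        # Granting access is just another tag write, so it is checked
        # like any other (here, against the "writers" tag).
        current = set(self._about(tag).get("writers", ()))
        self.set_tag(granter, f"tag:{tag}", "writers", current | {user})


store = SelfDescribingStore(admin="root")
store.grant("root", "alice", "rating")          # writes a tag on "tag:rating"
store.set_tag("alice", "book-1", "rating", 5)   # allowed by that tag
```

In this sketch alice can now write rating tags, but she cannot grant that right to anyone else, because nobody has put her on the "writers" tag of the object about "writers".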

31 Comments »

[…] this page was mentioned by Scoble Favorites (@scoblefaves), Scoble's Favorites (@scoblesfavorite), Steven Walling (@stevenwalling), Terry Jones (@terrycojones), Archives*Open (@archivesopen) and others. […]

Pingback by Tweets that mention FluidDB » Blog Archive » Metadata vs Data: a wholly artificial distinction -- Topsy.com — September 5, 2009 @ 11:29 pm
it’s easier to sing about metadata than it is to sing about data.

Comment by phil shapiro — September 6, 2009 @ 12:26 am
Hear, hear! And explained so nicely too. We’ve been having a similar debate on this blog post – feel free to be the voice of reason 🙂

http://www.cmswatch.com/Trends/1679-Future-CMS-Metadata

Comment by Jon Marks — September 6, 2009 @ 6:51 pm
Not that difficult:

find . -name '*accounts*' | xargs grep automobiles

Don’t disagree with your point though. 🙂

Comment by Lindsay Holmwood — September 6, 2009 @ 11:14 pm
Everything is metadata and everything is everything.

Comment by Igor Goldkind — September 7, 2009 @ 11:00 am
[…] path. FluidDB could be a “metadata for any URL” system (though they rightly say that metadata is data at an architecture level.) § […]

Pingback by FluidDB is an interesting “database wit… « Paul M. Watson — September 7, 2009 @ 4:12 pm
Your blog looks good both in content and design. Thank you.

Comment by Ivins — September 7, 2009 @ 5:20 pm
[…] Metadata vs Data: a wholly artificial distinction Computer scientists are fond of talking about metadata. There often seems to be an assumption that drawing a distinction between metadata and data is useful and perhaps even necessary… (tags: metadata fluiddb) […]

Pingback by links for 2009-09-07 « another corporate show — September 8, 2009 @ 12:03 am
In this entry the author mistakenly assumes the real world does not use metadata. In my perception he could not be more wrong. On another blog similar to this one I joked that we might as well get rid of street names or street numbers, since we all use navigation devices and can do without them … right?
A street name is metadata for a specific location where numerous houses form a logical unit, followed by the second tag: the house number, which is also metadata. Just because the real world does not know what we mean by “metadata” does not mean the real world can live without it.

Metadata is important for bundling and limiting collections. If I want a listing of incoming invoices, I don’t want correspondence about any invoice; I just want the invoices. The word “invoice” is the tag I will use to store and retrieve them. If I want to narrow my search results, I will add a second tag to my query, let’s say the name of the company that sent me the invoice in the first place. This is also a tag belonging to “metadata”.

Whenever I want to migrate my existing data to another hosting application, I definitely need to be able to supply metadata pointing to the right files; otherwise the end result, in the new application, will just be a scrapheap of information, useless to anybody since there is no way to tell what’s there.

Comment by Leon van Oosterom — September 9, 2009 @ 8:40 pm
Hi Leon

I guess I wasn’t clear enough in the second paragraph. I’m not saying that the world doesn’t use metadata (which I guess we can agree is just information about other information) – of course it does, otherwise the world would be a very different place and survival itself wouldn’t be possible (if you consider making summaries and drawing generalizations to be a form of metadata). I’m not saying that it’s not useful to think in those terms, etc. I agree with your points entirely.

What I am saying is that in programming a storage architecture, having a low-level and fundamental distinction between two types of data leads to many problems. It’s better (IMO) to have a completely uniform storage architecture. Applications can use it in any way they like, including to store what to them (and to their users) is metadata. In fact that’s one of the major initial goals of FluidDB – to be a metadata engine for everything. So that’s how important we think metadata is! The way to support metadata on anything is to have an underlying architecture that’s flexible enough to allow that to happen – without someone setting the thing up with an a priori determination of what’s meta- and what’s not. True support for metadata is too important for that – to do it properly you need the architecture to be neutral.

I hope that’s clearer & sorry for any confusion!

Terry

Comment by Terry — September 9, 2009 @ 10:53 pm
Terry,

You certainly made your point much clearer, thank you.

I see this happening in many, many discussions where IT people explain something to archivists. If both worlds agreed that they speak different languages, life would be much easier and more projects would turn into success stories.

Good luck with FluidDB

Comment by leonvanoosterom — September 10, 2009 @ 3:45 pm
Thanks! Sorry to have been unclear – I wrote all that on a plane and was probably tired, etc.

Comment by terrycojones — September 11, 2009 @ 4:23 am
[…] two interesting blog posts about how the distinction between data and metadata is artificial, and that it is merely […]

Pingback by Where pegs grow legs: hanging ideas on words » metadata vs data, an artificial but existential distinction — September 12, 2009 @ 4:50 pm
Terry,

nice & academic. Not realistic in the real world, however. Metadata is essential for practical reasons. I do think that a layer of abstraction such as a file system is nice. Sure, you could argue that the “truth is in the file” and why bother to add complexity. But really, are you traveling with your car mechanic or do you rely on the aggregated information the dashboard presents you?
Certainly this discussion has different facets – I look at metadata that lives outside of a database. Particularly, file-based metadata: it’s essential for the survival of modern ECM and DAM systems.

Comment by Hans Fremuth — September 14, 2009 @ 10:13 am
Hi Hans

Thanks for commenting. Sorry this appeared so academic – I can assure you it's not though (for me), as we've built FluidDB on the above principles and it's a released product.

But more importantly, what I meant to convey seems to have not been clear. I agree 100% that metadata is essential in practice. Do the comments I made just above help to make that any clearer? FluidDB is designed (among other things) to support arbitrary metadata – because metadata is so vital. It's just that at an architectural level to do that properly I think it's important to have no distinction. But at higher levels it's essential, as you say.

Does that make this any clearer?

Comment by terrycojones — September 14, 2009 @ 10:22 am
I think the whole metadata/data deal is about who gets to say what. Data is created by somebody, metadata is added by somebody else. Librarians add metadata to books. The OS adds data to files. But of course it's all just data; and since you can also tag metadata you will end up with an infinite level hierarchy that will bust the Universe such as we know it.

Comment by JJ — October 2, 2009 @ 11:37 am
[…] have to explain what I mean by that, especially seeing as some people got the impression from the earlier post on data vs metadata that we don’t think metadata is important, or that it doesn’t exist, or similar. I […]

Pingback by FluidDB » Blog Archive » FluidDB as a universal metadata engine — October 3, 2009 @ 4:07 am
[…] intuitively know the difference between metadata and data. But isn’t that cutting corners? This article sheds a different light on it. We know that you include both metadata and data in your data […]

Pingback by Metadata vs. data « De Kadenzer Courant — November 11, 2009 @ 11:53 pm
A street name is metadata for a specific location where numerous houses form a logical unit, followed by the second tag: the house number, which is also metadata. Just because the real world does not know what we mean by “metadata” does not mean the real world can live without it.

http://staffingpower.com/

Comment by staffing123 — December 1, 2009 @ 4:45 pm
I agree. It's all just information. I'm not saying we can live without it; that would (at least in my mind) be like saying that we can live without information. I'm just saying – as in the title – that the *distinction* between the two is artificial. Normal humans don't need or want or understand such a distinction. Computational systems are very often built on that distinction, though. I think that's a mistake.

Thanks a lot for taking the trouble to comment. I hope the above makes it a little clearer.

Comment by terrycojones — December 1, 2009 @ 8:10 pm
[…] argue that the concept of metadata is just not very intuitive, because it’s artificial, something we’re not used to "in real life." I doubt it. (You need to look no further than the cover of a book to […]

Pingback by The Cheap Computer Geek » Blog Archive » So what is metadata, anyway? — December 9, 2009 @ 7:55 pm
[…] and table rows and columns for metadata. We need to start building these systems so that there is no technical distinction between the content store and the metadata store. Having separate stores for content and metadata causes us to duplicate our efforts, causing us to […]

Pingback by When is an Antelope a Document? | vblog — December 17, 2009 @ 9:18 am
Everything is clearer now. Peace seemed to have enveloped my whole being after reading this post. Thanks.

Comment by college paper — February 2, 2010 @ 7:13 am
I gather that FluidDB would have four tables: thing, thing type, thing relationship, and thing relationship type.

Comment by BobM — July 23, 2010 @ 5:58 pm
[…] and table rows and columns for metadata. We need to start building these systems so that there is no technical distinction between the content store and the metadata store. Having separate stores for content and metadata causes us to duplicate our efforts, causing us to […]

Pingback by When is an Antelope a Document? | Den Of Ubiquity — September 2, 2010 @ 8:49 am
