On Microformats

posted 25 January 2009, updated 25 January 2009

Back in March last year I declared that the next phase of the web is the emergent web, an accidental explosion of functionality caused when a large number of simple APIs start interacting with each other. At the same time, I declared that semantically marked-up data is impractical. I also had harsh words for microformats. I called them "junk" and "ludicrously inefficient".

But the weird thing is that microformats are still sort of... popular. I mean, not really popular; they don't have mass adoption yet. But nerd-popular. Lots of clever people are talking about them and implementing them. There has to be some value in making the semantic nature of the data we publish on the web explicit, or else all these clever people wouldn't be fighting with the frankly inconvenient and ill-defined world of microformats as they currently stand.

So why do people like semantic data? Because semantic data is important. By definition, it's the meaning of the data, the magic that changes raw data into information. That has to be important. So I have to examine myself: if I like semantic data, why do I instinctively recoil from microformats?

The trouble with microformats

The main problem with microformats is that there are not a lot of tools available for interpreting semantic data right now. It's a chicken-and-egg problem: the lack of tools means nobody marks up their data, and the lack of data means nobody bothers to write any tools. And if we're being honest, the lack of practical ideas for what to do with microformatted data, even on microformats.org, has probably got something to do with it. Nobody is giving me a right-here, right-now good reason to build microformats into my website.

There are several secondary problems: since the microformat data is embedded within the body of HTML, a hypothetical microformat-reading tool would have to ingest the entire page and search it for instances of every single known microformat and validate each one. At a small-scale, browser-plugin level that might be practical, but it seriously limits the utility of the data. Each microformat is itself ad-hoc, but once defined they can't really be modified or extended.
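To make the cost concrete, here is a minimal sketch of what such a reader has to do. It uses only Python's standard library and knows just two root class names ("vcard" for hCard and "vevent" for hCalendar); the scanner class, the helper function and the sample page are all made up for illustration, and a real tool would also have to validate the properties inside each match.

```python
from html.parser import HTMLParser

# Root class names of two well-known microformats (hCard and hCalendar).
# A real reader would need the full list of every format it supports.
KNOWN_ROOTS = {"vcard": "hCard", "vevent": "hCalendar"}


class MicroformatScanner(HTMLParser):
    """Records every element whose class attribute names a known root."""

    def __init__(self):
        super().__init__()
        self.found = []          # (format name, tag) pairs, in document order
        self.elements_seen = 0   # how many start tags had to be inspected

    def handle_starttag(self, tag, attrs):
        self.elements_seen += 1
        # The class attribute may be missing or empty, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in KNOWN_ROOTS:
                self.found.append((KNOWN_ROOTS[name], tag))


def scan(html):
    """Parse the whole document and return the finished scanner."""
    scanner = MicroformatScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


# Hypothetical sample page: one hCard buried in ordinary markup.
page = """
<html><head><title>Contact</title></head>
<body>
  <div class="header"><h1>Contact us</h1></div>
  <div class="sidebar vcard"><span class="fn">Jane Doe</span></div>
  <p class="footer">Thanks for reading.</p>
</body></html>
"""

result = scan(page)
print(result.found)          # [('hCard', 'div')]
print(result.elements_seen)  # 9
```

Every one of the nine elements had to be parsed and checked to find a single hCard, and nothing in the document says in advance whether there is anything to find at all.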

Finally, and very importantly, the way microformats use class names is wrong. Not technically wrong: the HTML spec says class names, in addition to being used for CSS selectors, are "for general purpose processing by user agents", which basically means "do what you like". But wrong in a practical sense, because they would require us to change the way we use class names right now. In the practical world of web development, class names are the way you link your HTML elements to your CSS. You set them up arbitrarily, and then you build your CSS around them. If you need to change the look of your HTML, you can change the name of the class to suit the new styles you've created.

Microformats as designed break that: by defining meanings for specific class names in specific combinations, they impose a structure on your markup that needs to be known in advance, limiting -- no matter how lightly -- the flexibility with which you can mark up your HTML. Carving out namespaces in class names is also dangerous because they either have to be unique -- and hence not human readable -- or human readable, and hence prone to collision. As a dyed-in-the-wool web developer, having spent 12 years building web pages nearly every day, it just feels wrong to do it that way.
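As a rough illustration of the collision risk, here is a small sketch that compares the class names in a stylesheet against a handful of hCard property names. The stylesheet is invented, the property set is only a subset of hCard, and the regular expression is deliberately naive (it would also pick up things like file extensions inside url() values), so treat it as a sketch rather than a real checker.

```python
import re

# A subset of the property class names defined by hCard.
HCARD_PROPERTIES = {
    "fn", "org", "title", "url", "note", "photo", "adr", "tel", "email",
}


def class_names(css):
    """Return the set of class names that appear in a CSS string."""
    return set(re.findall(r"\.([A-Za-z_][\w-]*)", css))


# Hypothetical stylesheet from an existing site.
site_css = """
.post .title { font-size: 2em; }
.note { background: #ffc; }
.sidebar { float: right; }
"""

collisions = class_names(site_css) & HCARD_PROPERTIES
print(sorted(collisions))  # ['note', 'title']
```

A site like this one already uses "title" and "note" purely for styling, so wrapping any of that markup in an hCard would mean either renaming classes (and rewriting CSS) or accepting that those elements are read as hCard properties.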

The joy of microformats

The mistake I made in March was to decide that because microformats were wrong, semantic markup was wrong too. Just because microformats are getting it wrong doesn't mean that they aren't a good idea. And there's much to like about them, too: ad-hoc, community-generated and easily extensible; these are great qualities that are very "weblike".

Adding meaning to web pages is also a wonderful idea: you knew what the data meant when you typed it in, so why lose that once it becomes a web page? Tim Berners-Lee's ideal of the web as a primary information store has not come to pass, but that doesn't mean we have to hide all of our semantic relationships in our databases. Exposing them to the world is a good idea, once it can be done cheaply and easily -- something microformats manage -- and consumed equally cheaply and easily, where I believe they currently fail.

This is all by way of preamble. My next post is going to be about my ideas for fixing the problems in microformats.


Introducing Cascading Semantic Descriptions

posted 25 January 2009, updated 26 January 2009

Cascading Semantic Descriptions, or CSD, are my idea for a new way of expressing microformats. In my last post I talked about what was good about microformats and what was bad. Now I'm going to put forward my suggestions for how to fix them, and in the process make them a whole lot more flexible, useful, and powerful. Remember: the problem microformats are trying to solve is "how do we add semantic information to web pages?"

Semantic information is web metadata; it should act like it

Semantic information is a type of metadata: information about information. However, HTML has lots of other types of metadata already: in the HEAD of any HTML document you can have the META tag which can contain the information itself (e.g. keywords and descriptions) or you can have a LINK tag which relates the document to other documents, such as RSS feeds, or CSS. I think the most interesting example here is CSS, which is literally a document full of more metadata, specifically data about how the contents of the document should be rendered by the browser, visually or otherwise. One could argue that JavaScript is also a type of metadata, describing how the document should respond to user action.

There are a few important things to note about how existing metadata formats work on the web:

  1. They are separate from the data itself, either in the HEAD or another document entirely
  2. They are not HTML in nature. HTML is used to relate the document to its metadata, but the metadata is not itself HTML.
  3. They are progressive enhancements, layering additional complexity and functionality onto the core document without significantly altering the form.

Thus microformats are an "unweblike" type of metadata in that they are none of these things: they are embedded into the content, they are arguably part of the document's HTML, and they necessarily alter the form and structure of the document -- not necessarily visually, but you have to alter your code for it to become a microformat. This already suggests microformats need to be reformulated.

Why be weblike?

There are a bunch of good arguments for maintaining the principles of web metadata I mentioned above:

  1. Separate metadata is easily ignored, meaning it is more likely to be backward compatible. This is a key aspect of progressive enhancement in general.
  2. A domain-specific syntax means metadata can be efficiently expressed. XSLT is the ultimate example of expressing a good idea in an unsuitable syntax.
  3. There is less chance of technology conflict. If two technologies came along that both required rigid class name definitions as microformats do, it is quite possible they would conflict.
  4. Keeping machine-readable metadata links in the document head instead of the body means they are also easily discovered and efficiently indexable. This is a key feature that microformats currently lack.
  5. Technologies with a small in-document footprint are more easily retrofitted into existing systems. If you have a huge and costly CMS, the prospect of modifying all your markup and thus probably all your CSS to accommodate microformats is prohibitively costly. This needs to be overcome.
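Point 4 is easy to sketch. Assuming metadata is advertised with LINK tags in the HEAD, the way stylesheets and feeds already are, an indexer can stop reading as soon as the head closes. The collector class, the helper function and the sample chunks below are hypothetical, and the example uses only Python's standard library.

```python
from html.parser import HTMLParser


class HeadLinkCollector(HTMLParser):
    """Collects <link> elements and notes when the head has ended."""

    def __init__(self):
        super().__init__()
        self.links = []         # (rel, href) pairs found in the head
        self.head_done = False

    def handle_starttag(self, tag, attrs):
        if tag == "link" and not self.head_done:
            a = dict(attrs)
            self.links.append((a.get("rel"), a.get("href")))
        elif tag == "body":
            # Covers documents that omit the closing </head> tag.
            self.head_done = True

    def handle_endtag(self, tag):
        if tag == "head":
            self.head_done = True


def discover_links(chunks):
    """Feed chunks until the head is closed; return links and chunks read."""
    collector = HeadLinkCollector()
    chunks_read = 0
    for chunk in chunks:
        collector.feed(chunk)
        chunks_read += 1
        if collector.head_done:
            break
    return collector.links, chunks_read


# Hypothetical page arriving in five pieces, as it might over a network.
chunks = [
    '<html><head><title>Post</title>',
    '<link rel="stylesheet" href="/site.css">',
    '<link rel="alternate" href="/feed.rss"></head>',
    '<body><div class="post">Lots of content...</div>',
    '<div class="comments">Even more content...</div></body></html>',
]

links, chunks_read = discover_links(chunks)
print(links)        # [('stylesheet', '/site.css'), ('alternate', '/feed.rss')]
print(chunks_read)  # 3
```

The body, which is where nearly all the bytes of a real page live, never gets parsed. Metadata embedded in the body cannot offer that shortcut.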

Furthermore, there is an excellent counter-example for the current formulation of microformats: presentational markup. Back in the 90s, we added tags like FONT and attributes like BGCOLOR to HTML. This solved the immediate problem but as pages grew more complex it created more: bulky markup, laborious maintenance, and an unpleasant mixing of content and presentation which made specialized web jobs (editor vs. designer) difficult.

Microformats need scalability

Microformats currently have the same problems, for the same reason: their creators are thinking primarily in terms of one or two microformat implementations on a page of HTML, discovered and used client-side by browser plugins and the like. If you wanted -- as really should be the goal -- to mark up every single piece of content on your page in a semantically meaningful way, layering microformat pattern upon pattern, your code structure would become incredibly rigid, and the CSS required to arbitrarily display your content progressively more complex.

Two more points against: firstly, a search engine trying to index the entire web for semantic data would have to read your entire page, parse it, and then search it for all known combinations of all known microformats. On the scale of the modern web, that's a gigantic additional cost to the search engines that would hinder adoption. Secondly, in a reasonably large website, the people developing the software that generates markup are probably not going to be the people creating and defining the content of pages: to keep the jobs separate, you need a mechanism to separate semantics from structure, in the way that CSS separates presentation from structure.

Goals of CSD

We want to add semantic information to web pages. Our solution needs to be:

  • Lightweight
  • Simple
  • Easily adopted into existing markup
  • Elegantly expressed
  • Easily parsed
  • Efficiently indexed

It should also, as much as possible, build upon all the excellent work that has already been done in defining microformats themselves and formulating existing patterns.

So with that in mind, you should head over to Cascading Semantic Descriptions at Emergent Web to read the draft spec document and learn from the examples. There will be more as I get around to building it all.
