The Emergent Web

Nate Koechley, who is really smart, did a brief post a couple days ago on the way APIs are beginning to multiply and cross-pollinate in ever more interesting ways. He says he doesn't know what it means, but I think I do. And since I have a habit of thinking these things and not writing them down before they are obvious, I have decided to publish rather than be damned.

Tim Berners-Lee, my own personal nerd idol, who I couldn't possibly respect more unless he could fire lasers out of his eyes*, thinks the future of the web lies in the semantic web. He has thought this for a long time. And on the face of it, it seems like a great idea: if you turn the web into a giant collection of structured data, you can then build amazing robots to crawl all over it and draw inferences and glue things together in new and unexpected ways and wow, you have a whole new web and also some magical self-aware software, and incidentally you've built what it was that Sir Tim was trying to build in the first place: a linked information system.

TBL saw the original WWW as an alternative to data trees or databases or file systems: a way to represent a giant, worldwide, semi-structured database. The web was where you would store the information, and the browser was not just the way to view the information, but to edit it as well. But the problem with that dream is that the web is not what TBL intended. It is not even a poor approximation. Instead of building a new type of database system, Marc Andreessen and the rest of Netscape built something else: a medium, a way to display information that you stored somewhere else (usually, a traditional database). They made some half-hearted efforts to return to the view-and-edit idea with Netscape 3, which was an HTML editor too, but it never came to very much.

And then, just when we got used to the idea of the web as a medium, Flickr and some other very clever people released public web service APIs, and Google released Google Maps and Mail and pretty much invented rich web user interfaces with AJAX. And those two turning points changed the web again. Now the web is where you manipulate information -- not just edit what's already there, but join it up and twist it around to create new content.

So I'm going to go out on a limb here and make a prediction, because the web has been my entire career and most of my life for the last twelve years. The web and I grew up together, and I feel like I know it pretty well.

The next web is not the semantic web. The next web is the emergent web.

How is this different from the semantic web?

The semantic web makes the assumption that documents are the center of the experience: everyone is going to be marking up information into discrete little documents full of wonderful rich semantic information, and then the mind-blowingly advanced AI necessary to parse these documents and glue them together is left as an exercise for the reader. There are a number of problems with this:

  1. The web is long past being a collection of documents. The web is a collection of APIs. The data isn't lying around to be parsed in a pile of documents. Most people keep their data in a relational database, and spit it out again on request, in a very focussed way. And in the real world, that's a good thing: the person who owns the database knows the most about that data. They know what's important about it, how to structure it, and how to maintain it. Ideally, the API you use should be the one they use, or at least the one their paying customers use, because then they have an incentive to keep it up to date and useful and clean.
  2. There is absolutely no economic incentive to expose your entire database for anyone to copy and steal. APIs control access to information, but the semantic web doesn't have a mechanism to do that, or even consider it. The web is a success because it was hugely profitable. Yeah, the web is neat and all, but even someone who loves the purity of the web as much as I do has to recognize that it only got popular when .com turned up. The future of the web is dependent on finding a viable business model, just like the current one. Wikipedia is altruistic, but like museums and opera houses, our much-loved symbols of public altruism are always built after everyone has finished making their pile of money.
  3. Semantically marked-up data is impractical. Leaving piles of semantically marked-up data lying around is inconvenient. You can't use it in that format; you (and anyone else who wants to use it) have to read it into a database to be able to manipulate it and spit it out again. So either you'll be creating a bunch of work for yourself reading it into a database and then syncing up your changes, or, much more likely, your "pile of files" will actually be fake, and they'll really be the output of a web application reading from your database, i.e. an API. And RDF is not an API, at least not by itself.

So what is the emergent web?

Emergence is the way complex patterns arise from a large number of much simpler interactions. The emergent web is what happens when you realise APIs are the center of the experience, and let millions of APIs talk to each other. Machines to parse and join data aren't an afterthought, they're the hard part! And I should know, since I've spent the last decade building machines to join and parse data over and over.

The millions of simple interactions will come because a web site isn't a document, it's a machine. Or rather, at the moment, the web is hundreds of millions of machines, and despite all being connected by the same internet, they are mostly just sitting next to each other and not talking very much -- the people who do the travelling are us, hopping from one island of data to the next. But slowly and surely they are putting out feelers, finding each other, and communicating. Currently they communicate only in very basic ways, and intermittently, but we are approaching the tipping point when increased communication will begin to feed on itself, and the same explosion of new value that happened when people got onto the Internet and started communicating constantly is going to happen again when they teach their websites to do so.

The beginnings of the emergent web are already here:

  • APIs: not just the big, complicated ones like Google's data APIs or the upcoming OpenSocial initiative, but even in little things like RSS. These are the first green shoots of what is soon going to become a dense and thriving jungle of data services.
  • Meta-APIs: the pre-eminent one among these is Pipes. It's an API that doesn't have any of its own data, and derives all of its value from being able to parse other sources of data and combine them in convenient and useful ways. This is the shape of the idea, and when people start to apply it to richer APIs than the read-only, super-simple RSS, its value will be unlimited.

So instead of marking up your data, build that API you had on your to-do list anyway. APIs are semantic by nature: they have subjects (addressable resources, like a user), verbs (method calls, like "get") and objects (returned data) with properties and values that (especially if in XML, but often if in JSON as well) are inherently meaningful. You can even output RDF if you really want to.
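To make that concrete, here's a minimal sketch of the grammar I'm describing. Everything in it -- the endpoint, the field names, the data -- is invented for illustration, but it shows how an ordinary JSON API response already carries subject-verb-object semantics, and how mechanically you could emit RDF-style triples from it if you really wanted to:

```python
import json

# A hypothetical API exchange: the subject is an addressable resource,
# the verb is the method call, and the object is the returned data.
subject = "/users/42"          # addressable resource (invented endpoint)
verb = "GET"                   # method call
response_body = json.dumps({   # what the imaginary service would return
    "name": "Ada",
    "joined": "2007-10-01",
    "location": "London",
})

# Because the returned properties are inherently meaningful, converting
# the object into RDF-style (subject, predicate, value) triples is mechanical:
obj = json.loads(response_body)
triples = [(subject, prop, value) for prop, value in obj.items()]

for t in triples:
    print(t)
# e.g. ('/users/42', 'name', 'Ada')
```

The point isn't that you should do this conversion; it's that the semantics were already there in the API, no markup effort required.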

But more importantly, build meta-APIs. Build one API that can read a hundred similar APIs and combine them or query them or cache them or filter them. If there's already a meta-API for your field, make sure your API plays nice with it: the more APIs you pool up with, the more likely you are to be invited to the party: remember, RSS feeds are now considered an essential component of every web site. Nobody made that the law, it just happened.
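As a sketch of the "one API reading a hundred similar APIs" idea: over something as simple as RSS-like feeds, a meta-API might do nothing more than merge, de-duplicate, and re-sort the items it pulls from its sources. The feed structure and sample data below are invented for illustration:

```python
def merge_feeds(feeds):
    """Combine items from many simple feeds into one stream:
    de-duplicate by link, then sort newest-first by date."""
    seen = set()
    merged = []
    for feed in feeds:
        for item in feed:
            if item["link"] not in seen:   # first feed to carry a story wins
                seen.add(item["link"])
                merged.append(item)
    merged.sort(key=lambda item: item["date"], reverse=True)
    return merged

# Two hypothetical feeds that happen to share one story:
feed_a = [
    {"link": "http://example.com/a", "date": "2007-11-01", "title": "Story A"},
    {"link": "http://example.com/b", "date": "2007-11-03", "title": "Story B"},
]
feed_b = [
    {"link": "http://example.com/b", "date": "2007-11-03", "title": "Story B"},
    {"link": "http://example.com/c", "date": "2007-11-02", "title": "Story C"},
]

print([item["title"] for item in merge_feeds([feed_a, feed_b])])
# → ['Story B', 'Story C', 'Story A']
```

Trivial as it is, this is all a basic aggregator does -- and the aggregator's output is itself a new, more valuable feed that the next service can consume.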

So where does the emergent part come in?

Everyone is familiar with the joy of mashups. Craigslist built a site to sell things. Google built maps to find things. And then somebody else built HousingMaps, a beautiful little site that does very little extra work but produces something dramatically more useful than either of these sites separately. This is sort-of emergent behaviour.

A much better example of genuinely emergent behaviour is RSS aggregation. Each individual site has released an RSS feed for purely selfish, commercial reasons: they want users to be able to keep up with their site through notifications of new articles. But as a result of there being thousands upon thousands of these feeds, software like Technorati can look at what everyone is talking about, which is interesting, and Google Reader can look at what everyone is actually reading and determine who has the important news of the day.

Does that not sound very impressive? It's because the really impressive stuff is yet to come and impossible to predict. It's hard to overstate quite how powerful and unexpected emergent behaviour can be, so think of it this way: life itself, in all of its forms, from bacteria all the way up to our entire civilization, is the emergent behaviour you get when you shine sunlight on a ball of rock suspended in space and lightly coated with chemicals. You don't just get a hot rock. So when a half-billion web APIs are out there talking to each other, you're not just going to get another RSS reader.

People using the APIs and meta-APIs I just mentioned will grab data and grind it up and produce new services. Some of those will be useful, or even profitable. But more importantly, they will produce yet more APIs, and these APIs will be more valuable than the raw data originally was. This is a big idea. If every 100th mashup of existing web services produces 1 useful derivative web service, then the first 100 original web services produce 1 derivative web service, but the next 100 original APIs produce 2 derivatives -- because they had 201 inputs to choose from. In fact, by the time you've introduced 10,000 original web services you have 16,000 total services, and each additional 100 original services produces more than 160 derivative services. It is the very definition of exponential growth: by the 100,000th "original", you have more than 200 million services, with each additional 100 producing a staggering 2 million new derivative services. Even if the actual figure for derivative services is an order of magnitude lower, it becomes easy to imagine how bewilderingly huge the explosion of interactions will become, and how massive the value generated will be. Remember I mentioned something about huge profits?
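You can check the arithmetic with a quick simulation. My reading of the rule: each batch of 100 originals can be mashed up against everything already in the pool, and every 100th potential mashup yields a useful derivative (the function name and parameters are my own framing, not anything canonical):

```python
def simulate(batches, batch_size=100, mashups_per_derivative=100):
    """Grow a pool of services: each batch adds `batch_size` originals,
    and one useful derivative appears per `mashups_per_derivative` inputs."""
    total = 0
    derivatives_per_batch = []
    for _ in range(batches):
        inputs = total + batch_size               # everything a mashup could draw on
        derivatives = inputs // mashups_per_derivative  # every 100th mashup pays off
        total = inputs + derivatives
        derivatives_per_batch.append(derivatives)
    return total, derivatives_per_batch

small_total, _ = simulate(100)        # 10,000 originals introduced
print(small_total)                    # ≈17,000 services: the same ballpark as above
total, per_batch = simulate(1000)     # 100,000 originals introduced
print(total)                          # over 200 million services
print(per_batch[-1])                  # over 2 million derivatives in the final batch
```

The exact totals depend on how you count, but the compounding 1% per batch is what makes the curve exponential rather than linear.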

For there to be really large-scale technological change, there needs to be an economic driver. This is not some sad comment on the state of our society, it's an integral part of the system. Money is the sun that heats up our technological rock. So where's the money in an emergent web?

For many individual producers at the moment, the point of free secondary consumption via APIs is that it drives primary consumption via the main interface: the API is just an advertisement, a teaser for the real service (Facebook's inward-looking API being a prime example of this). This is a bad model. The API isn't an ad for the product, the API is the product.

Think of Twitter: a ton of people join because they see hundreds of cool mashups (like the beautiful TwitterVision 3D) being built on top of their API. And since Twitter can at least theoretically make money off of every text message they send (whether or not it has ads**) they don't mind if people aren't using their website, which doesn't have any ads anyway. The value to Twitter isn't their website, it's that they hope you'll find their text messaging service the most convenient mobile way to consume your feed. And even if you don't, then you're welcome to come along for the ride, because only a portion of their users need to use SMS for Twitter to be profitable.

Of course, this doesn't mean all you need to do to make a dumpster-load of cash is to make a rich, full-featured API to your crappy blog that nobody reads. The web made money because it added value: it took data and turned it into information, it disseminated information more efficiently, it produced new and unexpected information from divergent data sources. It produced new ways of communicating, and the people who really cashed in were the people who built the ways we do that: Amazon changed the way we buy durable goods, eBay changed the way we sell them, Google completely changed the way advertisers found their customers. These people aren't producing content, they're just moving around data that already existed (well, okay, Amazon is creating something...).

And so it will be for the emergent web. But instead of coder X saying "I can take service A and service B to produce awesome service C!", coder X will produce a service that can take any A-type service A1 - A1000, and any B-type service B1 - B1000, and the resulting 1,000,000 possible services will be produced by a million different coders. Lots of those services will be junk, but some of them will make money, and then they'll pay coder X, who won't even have a "product" as we understand it today -- no website, no great domain, no shiny buttons. He'll just have a service, and he'll have competition from X2 through X1000 who are trying to build a better one almost exactly like it.

Sound unrealistic? It's already happening. People pay Google $10,000 a year to do maps. It's not because they couldn't make their own mapping product -- it's just JavaScript, kids, and a ton of map data and geo-code -- it's because Google's service has the advantage of scale, so they can do it more cheaply, and better, than any one company that doesn't have maps as a primary product. If your service provides sufficient value, sufficiently conveniently that nobody feels the need to write their own, you'll soon get a couple hundred customers paying a couple hundred bucks a year, and you don't need a lot more than that to make a tidy living.

But what about the semantic web? What about microformats, aren't they the next big thing?

The semantic web, bless TBL's nerdy little heart, is a beautiful idea. It will be useful, someday, when we know how to build magical AI bots to read our data and auto-discover connections and make inferences. But it's not what's next. Computers will not know semantics in the emergent web. They don't need to. Drawing inferences is not what they are good at, it's what we're good at. Computers are good at gluing data together and grinding it down into information. And that's what we'll be using them to do. No magical AI bots required, no bursts of altruism from thousands of site owners marking up every proper name in their article with RDF metadata.

And microformats? Pfft. Microformats are junk. JUNK. Slipping tiny chunks of semantic labelling into HTML data that is 90% irrelevant is not a recipe for efficient information exchange. They are cute little toys, and having search engines able to understand those formats is going to be useful for about 10 seconds until people start gaming the system. There is no economic incentive to build them in, you can't use them as an API, and using them as a source for data mining is ludicrously inefficient. You want everybody who uses your website to scrape your whole site just to get the little scraps of microformatted data you leave lying around? If you want people to use your data, you expose it as an API. As you can tell, I'm really not a fan.

So how do we prepare for the emergent web?

No need to do anything; this ball is already rolling. Just sit around, and it'll engulf you. But if you want to keep on top of things, there are a few principles I think are already clear:

  • Build an API. Duh! Is your service a mashup of two other APIs? Then don't let the buck stop with you. If somebody wants to take your geographical property sales data and overlay it with, say, a map of crime levels in those areas, don't make them reinvent the wheel.
  • Give your API as many inputs as possible. If it makes even the slightest bit of sense to your application, then allow users to programmatically insert data into your service. That turns it from just a source into a machine, and useful machines are where the real value is.
  • Become the gatekeeper: build a meta-API. Are there a lot of services like yours? Don't compete with them, co-opt them by aggregating them into a meta-service where your own service is just first amongst equals. You'll drive traffic to everyone's services, but they'll all be going through you.
  • Be discoverable, or at least well-documented. My ideal API is one like COM, where there's one standard call that gives me a list of all the other calls I can make. Failing that, make sure you document your services up the wazoo. It's not the most powerful API that wins, it's the one that everybody uses. And most people are doing something simple, so make it easy to use.
  • Adhere to standards when they turn up. APIs are a bit of a wild west right now. RESTful APIs seem to be winning over SOAP and other RPCs, but there's a lot of argument about what really constitutes a RESTful interface, and even if such interfaces are powerful and flexible enough. But eventually some consensus will be reached, and it's in your interest to get on board. The more standards-compliant you are, the more meta-APIs will know how to read your service, and the more useful your service will therefore be.
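The "discoverable" principle above is easy to sketch: one well-known call that returns a machine-readable catalogue of every other call the service offers, COM-style. Everything here -- the method names, the registry, the discovery call -- is a toy I've invented to show the shape:

```python
import inspect

# Two ordinary service methods (placeholder names and data):
def get_user(user_id):
    return {"id": user_id, "name": "Ada"}

def list_users():
    return [get_user(42)]

# The service's registry of callable methods:
SERVICE = {"get_user": get_user, "list_users": list_users}

def describe():
    """The one standard discovery call: enumerate every method
    the service exposes, along with the parameters each one takes."""
    return {
        name: list(inspect.signature(fn).parameters)
        for name, fn in SERVICE.items()
    }

print(describe())
# → {'get_user': ['user_id'], 'list_users': []}
```

A client that knows only the discovery call can learn the rest of the API without reading a line of documentation -- which is exactly what a meta-API wants.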

And finally, please, for the love of all that is good, don't call it "Web 3.0".

* It's worth noting that there's no proof that he can't do this already.

** Mobile companies charge users to receive text messages as well as send them. At a certain level of volume, an SMS-heavy company like Twitter is generating so much SMS traffic from users that the mobile companies stop charging them to send SMS, and give them a revenue share instead. From my own experience in the mobile industry, I was pretty sure from the beginning that this was Twitter's business model.