11 August, 2011

In defence of SQL

If this title does not interest you, here are some alternative, linkbait titles:

Why ORM is the Dumbest Idea Ever
Why NoSQL is a Terrible Idea
OMADS: the future of data storage
Why SQL Will Eventually Conquer The World

A little history

SQL was invented in the 1970s at the same time that "large-scale" (read: millions of rows) data stores came into existence. It triumphed over other query languages not because it was particularly great (though it was easier to read), but because it was standard. Everybody building a data store could write to the SQL standard without having to re-train all their clients and customers. It reduced friction all round. It was a huge success.

SQL is awkward

There's no escaping that SQL, as we use it day to day, is not pretty.

Keep in mind that what SQL is really designed to express is relational algebra, a type of logic essentially invented by the ridiculously clever E.F. Codd (along with nearly all the other theoretical underpinnings of relational databases). If you're not familiar with it, I find it helps to think about relational algebra as Venn diagrams: it's about sets intersecting with, unioning with, subtracting from, joining with each other. Find all the fruits in set A, with prices in set B, farmed by the farmer in set C. That kind of thing.

What it's not really for is collating, aggregating, and most especially filtering of data sets. The reason count(*) is so awkward is because that's not really what the language was designed to do. GROUP BY and ORDER BY clauses look tacked-on because they are (HAVING is an even more grievous hack, UNIQUE is a disaster, and let's not get started on LIMIT). Of course, in regular use of a data set, you nearly always want to do these things, which is why SQL provides them. SQL, loyal workhorse that it is, is nothing if not willing. But it might not be terribly quick.

So you're right. SQL -- the kind you write every day -- is ugly and awkward. In fact, it looks like hell on legs. And it's often pretty slow. And that's all because you're asking it to do something it, the language, is not really designed to do (whether the engine is designed to do it is another question). But it works, and in forty years since its invention we have come up with very little in the way of improvements and nothing close to powerful enough to be a replacement.

What about ORM?

I want to be very, very clear about this: ORM is a stupid idea.

The birth of ORM lies in the fact that SQL is ugly and intimidating (because relational algebra is pretty hard, and very different to most other types of programming). Our programs already have an object-oriented model, and we already know one programming language -- why learn a second language, and a second model? Let's just throw an abstraction layer on top of this baby and forget there's even an RDBMS down there.

This is obviously silly. You've stored your data in a way that doesn't match your primary use-case, accessible via a language that you are not willing to learn. Your solution is to keep the store and the language and just wrap them in abstraction? Maybe you'd do that if your data were in a legacy system and you needed to write a new front-end, but people slap ORM on new projects. Why the hell would you do that?

ORM is slower than just using SQL, because abstraction layers always are. But unlike other abstraction layers, which make up for their performance hit with faster development, ORM layers add almost nothing. In fact, often, if you need to do anything more complicated then a SELECT, you end up writing fragments of SQL or pseudo-SQL languages in order to tell the underlying RDBMS what you're trying to really do.

OMADS: data stores that match the application

ORM is dumb, and people noticed. So clever programmers looked at this ridiculous edifice and realized the real problem: the data store and the use-case were mismatched. So they threw away ORM, SQL, and RDBMS, and wrote lovely new key-value stores, or object stores, or document stores, or searchable indexes, or any of a half-dozen other data structures that more closely matched what they were trying to do. And because these data stores all turned up at a time when nearly all data stores were SQL-interfaced RDBMS, they got the name "NoSQL", even though the actual problem was the Relational model, not SQL itself. And because "Obviously More Appropriate Data Stores", or OMADS, is not catchy enough I guess.*

So I love NoSQL stores. My startup would literally be unable to function without memcache. I think Cassandra is nifty even if Twitter found it not worth the trouble of switching from MySQL. I think Redis is cool if a little buggy. MongoDB is awesome, and I'm probably going to be building a production system based on it quite soon. HDFS I use in production every day, and it still blows my tiny little mind. Really, the only think I dislike about them is the label "NoSQL", which as many people have already pointed out doesn't really say anything about what they are, just what they are not. And also because it makes people unfamiliar with the details of the situation think there's something Wrong, Bad or Old Fashioned about SQL. And programmers hate using anything that is any of those things.

What is the relational data model good for anyway?

So if your data store should always match your application, what application is it that RDBMS are perfect for? The answer is: all, and none.

We take this for granted these days, but the relational model is pretty magical. Set up a model of your entities, pour data into it, and get answers. How many teachers at the university earn over $100k but teach less than 20 students? How many customers who bought our newest product had never bought anything before? What were sales like on Tuesdays over the last 30 months? You don't have to know in advance what your questions will be; you don't have to write any special code to examine all the rows, or work out the most efficient strategy for combining the results: you just need to know how the data relate each other, and then you can ask ad-hoc questions and the database knows the answer. I remember the first time I really grokked that concept, and it filled me with nerdy joy.

If you pick the wrong data structure for your store when you're first writing your application, you can end up -- as happened to a team at my last job -- running crazy, days-long depth-first searches across distributed document stores in order to perform elementary operations like getting a total count of objects. So if you don't know all the questions you might need to ask about your data, the safest thing to do is put them in an RDBMS. And when you first start a project, you almost never know all the questions you're going to need to ask. So my advice: always use an RDBMS. Just don't only use an RDBMS.

Optimize, but be prepared for ad-hoc queries

Is your data really just a giant hash lookup? Then a key-value store is what you want. Do you primarily access your related data via a single key? Then a document store is for you. Do you need full-text searching? Then, dear god, use a text-indexing engine, not an RDBMS. Do you need to answer questions about your data that you can't predict in advance? Then make sure your data also ends up in an RDBMS. Maybe not in real-time, maybe summarized rather than in raw form, but somehow. Then when your co-founder asks "how many Xs happened in Y?" your answer won't be "uh, let me spend half a day writing code to find that out". Just throw down some SQL, and it'll give you an answer -- it'll take 5 minutes to return a single number, but that's a lot faster than half a day.

Because that's what SQL is for.

Post-SQL

If you scroll back to the top you'll see the description of the circumstances that gave birth to SQL: a whole bunch of new data stores came into existence at once, and the lack of a common language created friction and fragmentation. The same thing is happening again with the NoSQL crowd. If you decide to write your app using Cassandra, you better be sure it's what you want, because if you change stores you have to change all your code. It's the ultimate lock-in, and it's not the plan of an evil monopolist corporation, it's just an unfortunate side-effect.

Pretty soon, the same sort of clever people who noticed that ORM was a ridiculous hack will notice an opening for an actually useful abstraction layer: a single common API that can access all the NoSQL stores. Maybe it will be Thrift or Avro, but I'm not sure. I'd say the chance is about 50-50 that it will be SQL again.

SQL triumphant

And why not? Awkward it may be, but SQL is a lot more succint and readable than multiple lines of API calls or crazy, math-like relational algebra languages. And there's nothing intrinsically slow about the language itself. If you could run "SELECT * FROM table WHERE ..." on Cassandra, it would be no slower than specifying the same conditions via API calls. In fact, when trying to explain how to use its API, the MongoDB documentation lists the equivalent SQL queries. That's a pretty clear vote for the usability of SQL.

Computer programmers really like new, cool things. So when something like SQL hangs around for nearly 40 years, it either means nobody really cares about it -- I think we're clear that's not the case -- or that there's really nothing else that can do the job quite as well.

So go forth, use your OMADS, keep an RDBMS in your back pocket, and stop being so mean to poor old SQL.

* On the off-chance that anybody starts calling these things OMADS, remember: you heard it here first.

Updated 2010-07-13 to fix link to E.F. Codd; thank you Sordina!