11 August, 2011

ORM is an anti-pattern

I tweeted about ORM last week, and since then several people have asked me to clarify what I meant. I have actually previously written about ORM, but it was in the context of a larger discussion about SQL and I shouldn't have confused the two issues. So here I'm going to focus on ORM itself. I'm also going to try to be very brief, since it became very apparent from my SQL article that people tend to stop reading at the first sentence that makes them angry (and then leave a comment about it, whether or not their point is addressed later on).

What's an anti-pattern?

I was pleased to discover that Wikipedia has a comprehensive list of anti-patterns, both from within the world of programming and outside of it. The reason I call ORM an anti-pattern is because it matches the two criteria the author of AntiPatterns used to distinguish anti-patterns from mere bad habits, specifically:

It initially appears to be beneficial, but in the long term has more bad consequences than good ones
An alternative solution exists that is proven and repeatable

It is the first characteristic that has led to ORM's maddening (to me) popularity: it seems like a good idea at first, and by the time the problems become apparent, it's too late to switch away.

What do you mean by ORM?

The chief offender that I'm talking about is ActiveRecord, made famous by Ruby on Rails and ported to half a dozen languages since then. However, the same criticisms largely apply to other ORM layers like Hibernate in Java and Doctrine in PHP.

The benefits of ORM

Simplicity: some ORM layers will tell you that they "eliminate the need for SQL". This is a promise I have yet to see delivered. Others will more realistically claim that they reduce the need to write SQL but allow you to use it when you need it. For simple models, and early in a project, this is definitely a benefit: you will get up and running faster with ORM, no doubt about it. However, you will be running in the wrong direction.
Code generation: eliminating user-level code from the model through ORM opens the way for code generation, the "scaffolding" pattern which can give you a functional interface to all your tables through a simple description of your schema. Even more magically, you can change your schema description and re-generate the code, eliminating CRUD. Again, this definitely works initially.
Efficiency is "good enough": none of the ORM layers I've seen claim efficiency gains. They are all fairly explicit that you are making a sacrifice of efficiency for code agility. If things get slow, you can always override your ORM methods with more efficient hand-coded SQL. Right?

The problems with ORM

Inadequate abstraction

The most obvious problem with ORM as an abstraction is that it does not adequately abstract away the implementation details. The documentation of all the major ORM libraries is rife with references to SQL concepts. Some introduce them without indicating their equivalents in SQL, while others treat the library as merely a set of procedural functions for generating SQL.

The whole point of an abstraction is that it is supposed to simplify. An abstraction of SQL that requires you to understand SQL anyway is doubling the amount you need to learn: first you need to learn what the SQL you're trying to run is, then you have to learn the API to get your ORM to write it for you. In Hibernate, to perform complicated SQL you actually have to learn a third language, HQL, which is maddeningly almost-but-not-quite SQL, which then gets translated to SQL for you.

A defender of ORM will say that this is not true of every project, that not everyone needs to do complicated joins, that ORM is an "80/20" solution, where 80% of users need only 20% of the features of SQL, and that ORM can handle those. All I can say is that in my fifteen years of developing database-backed web applications that has not been true for me. Only at the very beginning of a project can you get away with no joins or naive joins. After that, you need to tune and consolidate queries. Even if 80% of users need only 30% of the features of SQL, then 100% of users have to break your abstraction to get the job done.

Incorrect abstraction

If your project really does not need any relational data features, then ORM will work perfectly for you, but then you have a different problem: you're using the wrong datastore. The overhead of a relational datastore is enormous; this is a large part of why NoSQL data stores are so much faster. If your data is relational, however, that overhead is worth it: your database does not merely store your data, it represents your data and can answer questions about it on the basis of the relations captured, far more efficiently than you could in procedural code.

But if your data is not relational, then you are adding a huge and unnecessary overhead by using SQL in the first place and then compounding the problem by adding a further abstraction layer on top of that.

On the the other hand, if your data is relational, then your object mapping will eventually break down. SQL is about relational algebra: the output of SQL is not an object but an answer to a question. If your object "is" an instance of X and "has" a number of Y, and each of Y "belongs to" a Z, what is the correct representation in memory of your object? Is it merely the properties of X, or should it include all the Ys, and/or all the Zs? If you get only the properties of X, when do you run the query to fetch the Ys? And do you want one or all of them? In reality, it depends: that's what I mean when I say SQL is the answer to a question. The representation of your object in memory depends what you intend to do with it, and context-sensitive representation is not a feature of OO design. Relations are not objects; objects are not relations.

Death by a thousand queries

This leads naturally to another problem of ORM: inefficiency. When you fetch an object, which of its properties (columns in the table) do you need? ORM can't know, so it gets all of them (or it requires you to say, breaking the abstraction). Initially this is not a problem, but when you are fetching a thousand records at a time, fetching 30 columns when you only need 3 becomes a pernicious source of inefficiency. Many ORM layers are also notably bad at deducing joins, and will fall back to dozens of individual queries for related objects. As I mentioned earlier, many ORM layers explicitly state that efficiency is being sacrificed, and some provide a mechanism to tune troublesome queries. The problem, I have discovered with experience, is that there is seldom a single "magic bullet" query that needs to be optimized: the death of database-backed applications is not the efficiency of any one query, but the number of queries. ORM's lack of context-sensitivity means that it cannot consolidate queries, and must fall back on caching and other mechanisms to attempt to compensate.

What are the alternatives?

Hopefully by this point I've made some kind of case that ORM has fundamental design flaws. But to be an antipattern, there needs to be an alternative. In fact, there are two:

Use objects

If your data is objects, stop using a relational database. The programming world is currently awash with key-value stores that will allow you to hold elegant, self-contained data structures in huge quantities and access them at lightning speed. There's no law that says Step One of writing any web app is installing MySQL. The massive over-application of relational databases to every data representation problem is one of the reasons SQL has acquired a bad reputation in recent years, when in fact the problem is lazy design.

Use SQL in the Model

It's hugely dangerous to claim there is One True Wayâ„¢ to do anything in programming. But in my experience, the best way to represent relational data in object-oriented code is still through a model layer: encapsulation of your data representation into a single area of your code is fundamentally a good idea. However, remember that the job of your model layer is not to represent objects but to answer questions. Provide an API that answers the questions your application has, as simply and efficiently as possible. Sometimes these answers will be painfully specific, in a way that seems "wrong" to even a seasoned OO developer, but with experience you will get better at finding points of commonality that allow you to refactor multiple query methods into one.

Likewise, sometimes the output will be a single object X, which is easy to represent. But sometimes the output will be a grid of aggregate data, or a single integer count. Resist the temptation to wrap these in too many layers of abstraction, and deal with the data on its own terms. Above all resist the fallacy of OO, that it can represent anything and everything. OO is itself an abstraction, a beautiful and hugely flexible one, but relational data is one of its boundaries, and pretending objects can do something they can't is the fundamental, root problem in all ORM.

In summary (TL;DR)

ORM is initially simpler to understand and faster to write than SQL-based model code
Its efficiency in the early stages of any project is adequate
Unfortunately, these advantages disappear as the project increases in complexity: the abstraction breaks down, forcing the dev to use and understand SQL
Entirely anecdotally, I claim that the abstraction of ORM breaks down not for 20% of projects, but close to 100% of them.
Objects are not an adequate way of expressing the results of relational queries.
The inadequacy of the mapping of queries to objects leads to a fundamental inefficiency in ORM-backed applications that is pervasive, distributed, and therefore not easily fixed without abandoning ORM entirely.
Instead of using relational stores and ORM for everything, think more carefully about your design
If your data is object in nature, then use object stores ("NoSQL"). They'll be much faster than a relational database.
If your data is relational in nature, the overhead of a relational database is worth it.
Encapsulate your relational queries into a Model layer, but design your API to serve the specific data needs of your application; resist the temptation to generalize too far.
OO design cannot represent relational data in an efficient way; this is a fundamental limitation of OO design that ORM cannot fix.