There’s an awful lot of buzz around new-fangled "NoSQL" or "non-relational" databases at the moment. I’m hearing a lot of people asking why they’d choose one of these new-fangled storage models over the relational databases they know and (mostly) like. This post aims to shed some light on the choices available, and to maybe give some guidance if you’re trying to make a decision.
Let’s define some terms
I find it always helps to make sure you’re talking about the same thing if you think you might be arguing.
Relational database management system (RDBMS)
When I say "relational database", I’m referring to an application or service which stores information in tables. Tables are divided into columns, and columns have a type, such as integer, date, text etc. The data held in the tables is divided into rows, and each row contains a value (possibly including NULL) for each of the columns. The RDBMS can maintain and enforce "master-detail" relationships between rows in different tables using primary and foreign keys.
The big commercial relational database systems are Microsoft SQL Server and Oracle; the main open-source alternatives are MySQL, PostgreSQL and SQLite.
New-fangled data stores (NoSQL)
I’m going to use the term NoSQL to cover everything else, which is a massive generalisation and therefore inevitably inaccurate (several new-fangled data stores actually support a SQL-like syntax). However, in the interests of a bit of clarity, I’ll try to outline some of the distinct types here, although the distinctions can be a bit vague.
Key/Value Stores are the most basic NoSQL databases, and are essentially hashes with disk persistence. In some cases the value is a single binary or text object; in others it is a tuple with named properties, sometimes with searching supported against those properties. These solutions are schema-less: entities within a collection, domain or table may have completely different properties. Microsoft’s Azure Table Storage and Amazon’s SimpleDB are both cloud-based, pay-as-you-go Key/Value stores. Open-source implementations include BerkeleyDB, MemcacheDB, Redis and TokyoCabinet.
Wide Column, Tabular or Column Family Stores are scalable, distributed systems that support a defined schema for each "table", but not relationships between tables. Google’s BigTable database, which is the storage solution for their AppEngine cloud platform, is a Wide Column store. Open-source alternatives include Cassandra and Hypertable.
Document Stores work with data structured using Object Notation, usually JSON or BSON. Not to be confused with SharePoint, Domino etc. This format allows complex data structures, including arrays and hashes, to be represented within a single entity that can be retrieved in one read operation; they also often support complex update operations such as adding elements to an array within a document. Some popular Document Stores are CouchDB, MongoDB and Riak.
Object Databases are superficially similar to Document Stores, but have predefined schema and inheritance hierarchies. In an Object Database, graphs are persisted using pointers between objects rather than by embedding graphs within other graphs. The most popular Object Database is db4o.
So which is best?
Right, so. Why should I use any of these new-fangled "stores" instead of a relational database?
I said "mu". I keep hearing this question, and I submit that the answer is "mu", which means "your question cannot be answered because it depends on incorrect assumptions."
What incorrect assumption is that?
The assumption that the relational database is axiomatically the correct storage solution in all situations, and that the onus is on all other options to prove that they are better.
Is that not right then?
No, it isn’t. Have you heard of doing the simplest thing that could possibly work? Or YAGNI? It’s possible to add unnecessary features to a system without coding them yourself. The first thing you should rule out as a storage medium is the file-system.
What, text files? Isn’t that a bit 1980s?
Not at all. Let me ask you a question: what requirements or properties of your application preclude using one or more flat text files to persist your data?
I need structured data.
What’s wrong with tab-delimited, CSV or even fixed-length field files?
I need master-detail structured data.
In which case, you could use XML, YAML or even JSON. Let’s say we’re using XML, even though it’s quite the worst of those three options.
I need details with many masters.
You mean lookups? Why on earth would you want to use those?
To save space.
OK, I don’t need to save space. But I still need one copy of each piece of data, so I only have to update it in one place.
That’s fine, but does that need to be supported by your persistence layer? Are you going to re-read those lookup descriptions every time you pull a record? Add multiple joins to every SELECT statement just to read back those varchar(40) descriptions?
As opposed to…?
As opposed to reading those lookups into memory once the first time they’re needed, and maybe refreshing them in background every now and then if you’re in a multi-user smart-client system and they’re particularly dynamic, which they’re probably not. How often does somebody add a new country to the system? Or a new type of customer, or method of payment, or whatever?
OK. I accept that lookups are not a valid reason to use a relational database.
Good. So, next requirement or property?
I need many-to-many relationships. For example, in my library application, each customer can borrow many books, and each book (over time) can be borrowed by many customers.
Well, we’re still using XML, so you can have a list of child elements in your customer record detailing the books they’ve borrowed, and a list of child elements in the book record detailing the customers who have borrowed it.
So when somebody borrows a book, I have to update two records? That’s bad, isn’t it?
I don’t know; it’s your application. Do both records have to get updated in the same split second? Is it likely that another user will do a read that could get inaccurate information while the two updates are happening? Or is it enough that the customer and book records should be eventually consistent (for a given value of eventually)?
But then I’d have to read a bunch of files individually to get the details of those books, or the names of those customers.
Not at all.
Well how would you display sensible data on the screen? Are you going to cache the entire book catalogue and customer list in memory as well?
It’s an option. But no, I would probably just decide which details about the book I need in the customer views, and which details about the customer I need in the book views, and then I’d include those details in the child records.
You’d duplicate the data?
[Shocked gasps from enthralled onlookers]
Not a bit of it. Books don’t change their titles, or authors. Customers don’t change their names very often; certainly not so often that it’s beyond reason to bulk-update a bunch of book records when they do.
So you just duplicate the data?
Absolutely. Saves a whole bunch of extra disk reads, and those join tables you keep using in many-to-many relationships, well, let’s be honest: they’re a kludge, and they’re one of the most common causes of performance issues due to volume of data.
I remain unconvinced.
OK. Let’s take your library example. Let’s say your library is in a particularly literate part of the world and has 10,000 active customers, and 100,000 books. And let’s say that each customer borrows two books a week. After one year, your many-to-many join table has 1,000,000 rows in it. Meanwhile, each of my customers’ records have 100 book record keys, titles, author names, publishers, and loan start- and end-dates, and each of my book records have, on average, 10 customer record keys, names, maybe addresses, and loan start- and end-dates. In storage terms, let’s say you’ve used probably 32 bytes per row (two 32-bit keys and two 64-bit date/times, plus overhead), so that’s 32MB, plus the same again for indexing. Meanwhile, let’s just hazard a guess and say I’ve used a whole gigabyte with all my duplicated data.
Exactly. Look at all that storage you’re wasting.
My goodness, you’re right! That’s 0.1% of a £120 hard-disk I’ve just frittered away! Or nearly $2 a year for the cloud storage.
There’s no need to be sarcastic.
I wasn’t; I was being ironic. Anyway, my point is, pulling a customer record and displaying the complete list of books they’ve borrowed for me is a single read operation; for you it’s a three-table join with one of those tables containing a million records for each year your software’s been in use.
Actually, scratch that.
Yes, I was wrong.
I thought you were.
I forgot that you’ll have normalised out your book catalogue into a book table with a many-to-many join to an author table, and another many-to-many join to a publisher table, so you’ve actually got seven tables totalling a few million rows that you need to reference to get the same information that I can get in a single read operation. How’s your EXPLAIN output looking, by the way?
I’m not sure. What does DEPENDENT SUBQUERY mean?
No idea, but it sounds nasty. You can fix it later. You’re supposed to be giving me reasons you need a relational database.
So I am. OK, well, I need to search the data.
Aha! Very good. Much closer. That would be a reason for a database management system, but not necessarily a relational one. There are lots of NoSQL solutions that will let you run queries against your datasets, with varying degrees of performance. I’ve written queries in MongoDB that run against massive datasets so fast it’d make your eyes bleed.
[Suddenly, a heavenly light shines down upon the pupil. Or some scales fall from his eyes.]
I’ve been such a fool! I drank the relational database kool-aid and no mistake! Thank you for showing me the error of my ways. I’m going NoSQL all the way for my next project.
What? No, wait, come back.
Why? I’m convinced. You made your case.
No, I haven’t. I’ve just stopped you from being one kind of wrong and you’ve immediately started being another, completely different, but equally bad kind of wrong.
No need to be rude.
Sorry. But flat text files and NoSQL databases aren’t the solution to every persistence problem. There are perfectly good reasons to use a SQL database.
You might need the whole ACID thing: Atomicity, Consistency, Isolation and Durability. If your system deals with financial data, or medical records, or other sensitive information, you’re going to want a robust transaction system.
And if you’ve got a lot of concurrent users who might try to update the same bit of data at the same time, like an entry in a Stock table in an e-commerce system, then isolation and locking could be very handy.
And NoSQL systems can’t do ACID and locking and so on?
Oh, they can, one way or another. But a lot of the time you end up writing extra code in your data access layer to do stuff that a decent RDBMS will do for you. And if you’ve got a room-full of developers who are dead good with SQL Server or Entity Framework or whatever, why force them to learn a new technology?
I would have a hard time justifying the time and money required for our developers to learn the new skills.
Then again, in the cloud the NoSQL storage can be much, much cheaper than the RDBMS. For example, Windows Azure storage costs $0.15 per gigabyte per month, whereas a 1GB SQL Azure instance is $9.99 per month; 10GB is $99.99 per month.
So Windows Azure storage is cheaper than SQL Azure storage?
Not necessarily. With Windows Azure Storage you also pay for transactions, at the rate of $0.01 for every ten thousand. SQL Azure doesn’t charge for transactions. So if you’re storing less than a gigabyte but running more than ten million transactions per month, SQL Azure would be cheaper. Or if you’re storing less than ten gigabytes but running more than a hundred million transactions-
Stop! I’m all confused. Please, just tell me, what should I use?
You want a definite answer?
There really is one?
There really is one.
And are you going to tell me what it is?
Although I don’t think you’re going to like it*.
It doesn’t matter! I must know!
You’re really not going to like it.
The answer is…
The answer is: It All Depends.
I want the last 5 minutes of my life back.
5 minutes? I’ve spent hours writing this.
Anyway, the point is…
…every system is different. Which persistence paradigm is best depends on the project and its requirements and properties. All I want to say is that you shouldn’t start with RDBMS and then look for reasons not to use it; at the same time, don’t start with Project Voldemort or CouchDB just because they’re new and exciting. Approach every project with no preconceptions about what you’re going to use; or have "flat text files" as the default choice.
A non-exhaustive list of things to consider:
- If constant consistency is a key requirement then you should probably use an RDBMS.
- If horizontal scalability is important and eventual consistency is good enough, then a NoSQL solution might be better.
- If you’re deploying an application to Windows Azure or Amazon Web Services, then you should probably compare the costs of Windows Azure Storage Services or SimpleDB to SQL Azure or Elastic RDBMS for your particular application.
- If you’re writing an application for a mobile platform, you’re back to limited storage space so you might get back to RDBMS and normalising for efficient use of that space.
- If you’re writing a small application you want to knock out quickly using Entity Framework or NHibernate or ActiveRecord then use SQL Server Compact or SQLite, or skip that stuff altogether and serialise object graphs to files on the hard-disk.
In fact, if you really design your architecture with an open mind (or beginner’s mind), you’ll come up with a combination of technologies and paradigms that properly meet different parts of the requirements; maybe use SQL Server 2008 R2 for stock control, invoicing and geo-spatial search operations, but Azure Table Storage for auditing, text files for logging. And why not evaluate distributed Document Stores for the awesome customer review and discussion features that you’re adding in phase two?
Most important of all, before making any decisions, take the time to do some apples-to-apples comparisons of potential solutions, with use cases, data volumes and loads that are representative of your system. I recently chose MongoDB for a personal project, but I started out comparing SQL Server, MySQL and PostgreSQL and finding them all wanting, which is how I ended up looking at NoSQL databases in the first place.
(Just for the record, the most important query and most frequently-run query for the application took nearly 15 seconds on the fastest relational database (even after some violent denormalization); MongoDB can retrieve the same information in under half a second.)
A comprehensive – and very balanced – list of articles on NoSQL databases: http://nosql-database.org/links.html
*All apologies to Douglas Adams.