Production Postmortem: The case of the intransigent new database
A customer called to tell us that they had a problem with RavenDB. As part of their process for handling new customers, they would create a new database, set up indexes, and then direct all the queries for that customer to that database.
Unfortunately, this system, which had worked so well in development, died a horrible death in production. But, and this was strange, only for new customers, and only in the create new customer stage.
The problem was:
- The user would create a new database in RavenDB. This just creates a db record and its location on disk. It doesn’t actually initialize a database.
- On the first request, we initialize the db, creating it if needed. The first request will wait until this happens, then proceed.
- On their production systems, that first request (which they used to create the indexes they require) would time out with an error.
Somehow, the creation of a new database would take way too long.
Our first thought was that they were creating the database on the path of an already existing database, maybe a big one that had a long initialization period, or maybe one that required recovery. But the customer validated that they were creating the database in an empty folder.
We looked at the logs, and the logs just showed long stretches of time where there was no activity. In fact, we had a single method call to open the database that took over 15 seconds to run. Except that on a new database, this method just creates a bunch of files to start things out and should be ready really quickly.
That is the point that led us to suspect that the issue was environmental. Luckily, as the result of many such calls, RavenDB comes with a pretty basic I/O Test tool. I asked the customer to run this on their production system, and I got the following:
And now everything was clear. They were running on an I/O constrained system (a cloud machine), and they were running into an interesting problem. When RavenDB creates a database, it pre-allocates some files for its transactional journal.
Those files are 64MB in size, and the total write for a new Esent RavenDB database with default configuration is just over 65MB. If your write throughput is less than 1MB/sec sustained, that will be problematic.
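To put rough numbers on that, the time the first request spends waiting is at least the size of the initial write divided by the sustained write throughput. Here is a minimal sketch of that arithmetic; the function name and the sample throughput values are mine and purely illustrative, and only the 65MB figure comes from the numbers above.

```python
# Back-of-the-envelope estimate: how long does the initial write take
# at a given sustained write throughput? Names and sample rates are
# illustrative assumptions, not anything RavenDB exposes.

def seconds_to_write(total_mb: float, throughput_mb_per_sec: float) -> float:
    """Return the time in seconds needed to write total_mb at the given rate."""
    if throughput_mb_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return total_mb / throughput_mb_per_sec


# "Just over 65MB" for a new Esent database with default configuration.
INITIAL_WRITE_MB = 65

for rate in (0.5, 1, 5, 50):
    elapsed = seconds_to_write(INITIAL_WRITE_MB, rate)
    print(f"{rate:>5} MB/sec -> {elapsed:.1f} seconds")
```

At 1MB/sec the write alone takes about 65 seconds, before the database does anything else, which is plenty of time for a waiting request to give up.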
I let the customer know about the configuration option to take less space at startup (Esent RavenDB databases can go as low as 5MB, Voron RavenDB starts at 256KB), but I also gave them a hearty recommendation to make sure that their I/O rates improved, because this isn’t going to be the only case where slow I/O will kill them.
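If you want a quick sanity check of a machine before blaming the database, you can approximate sustained sequential write throughput yourself. The sketch below is not the RavenDB I/O Test tool and does not try to imitate how it measures; it is a rough stand-in with a hypothetical function name and parameters that writes a temporary file, forces it to disk, and reports MB/sec.

```python
# Rough sequential write throughput check. This is an illustrative
# stand-in, not RavenDB's I/O Test tool; the function name, sizes and
# chunking are assumptions chosen for the example.
import os
import tempfile
import time


def measure_write_throughput(directory: str, total_mb: int = 64, chunk_kb: int = 64) -> float:
    """Write total_mb of data to a temp file in directory and return MB/sec."""
    chunk = os.urandom(chunk_kb * 1024)
    chunk_count = (total_mb * 1024) // chunk_kb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(chunk_count):
                f.write(chunk)
            f.flush()
            # Force the data to disk so we do not just measure the OS cache.
            os.fsync(f.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    # Guard against a zero reading on very fast disks or coarse timers.
    return total_mb / max(elapsed, 1e-9)


# Point this at the volume that will hold the database files.
rate = measure_write_throughput(tempfile.gettempdir(), total_mb=8)
print(f"Sequential write throughput: {rate:.1f} MB/sec")
```

A single run like this is noisy, especially on cloud disks with burst credits, so treat the result as an order-of-magnitude indication and run it on the same volume the database will use.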
Reference: The case of the intransigent new database from our NCG partner Oren Eini at the Ayende @ Rahien blog.