Production postmortem: The case of the native memory leak
This one is a pretty recent one. A customer complained about high memory usage in RavenDB under moderate usage. That was a cause for concern, since we care a lot about our memory utilization.
So we started investigating that, and it turned out that we were wrong, the problem wasn’t with RavenDB, it was with the RavenDB Client Library. The customer had a scenario where 100% of the time, after issuing a small number of requests (less than ten), the client process would be using hundreds of MB, for really no purpose at all. The client already turned off caching, profiling and pretty much anything else that both they and us could think of.
We got a process dump from them and looked at that, and everything seemed to be fine. The size of the heap was good, and there didn’t appear to be any memory being leaked. Our assumption at that point was that there is some sort of native memory leak from their application.
To continue the investigation further, NDAs was required, but we managed to go through that and we finally had a small repro that we could look at ourselves. The fact that the customer was able to create such a thing is really appreciated, because very often we have to work with a lot of missing information. Of course, when we run this on our own system, everything was just fine & dandy. There was no issue. We got back to the customer and they told us that the problem would only reproduce in IIS.
And indeed, once we have tested inside IIS (vs. using IIS Express), that problem was immediately obvious. That was encouraging (at least from my perspective). RavenDB doesn’t contain anything that would differentiate between IIS and IIS Express. So if the problem was only in IIS, that is not likely to be something that we did.
Once I had a repro of the problem, I sat down to observe what was actually going on. The customer was reading 1,024 documents from the server, and then sending them to the browser. That meant that each of the requests we tested was roughly 5MB in size. Note that this means that we have to:
- Read the response from the server
- Decompress the data
- Materialize it to objects
- Serialize the data back to send it over the network
That is a lot of memory that is being used here. And we never run into any such issues before.
I had a hunch, and I created the following:
[RoutePrefix("api/gc")]
public class GcController : ApiController
{
[Route]
public HttpResponseMessage Get()
{
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect(2);
return Request.CreateResponse(HttpStatusCode.OK);
}
}
This allows me to manually invoke a rude GC, and indeed, when running this, memory utilization dropped quite nicely. Of course, that isn’t something that you want to run into your systems, but it is very important diagnostic tool.
Next, I tried to do the following:
[Route]
public HttpResponseMessage Get()
{
var file = File.ReadAllText(@"D:\5mb.json");
var deserializeObject = JsonConvert.DeserializeObject<Item>(file);
return this.Request.CreateResponse(HttpStatusCode.OK, new { deserializeObject.item, deserializeObject.children });
}
This involves no RavenDB code, but it has roughly the same memory pressure semantics as the customer code. Indeed, the problem reproduced quite nicely there as well.
So the issue is about a request that uses a lot of memory (include at least one big buffer), likely causing some fragmentation in the heap that would bring memory utilization high. When the system had had enough, it would reclaim all of that, but unless there is active memory pressure, I’m guessing that it would rather leave it like that until it has to pay that price.
Reference: | Production postmortem The case of the native memory leak from our NCG partner Oren Eini at the Ayende @ Rahien blog. |