Tuesday, March 31, 2009

Identifying ThreadLocal Memory Leaks in JavaEE Web Apps

A few weeks ago wikis.sun.com powered by Confluence "Enterprise" Wiki grew beyond yet another invisible line that triggered intermittent instabilities. Oh boy, how I love these moments. This time the issue was that Confluence just kept on running out of memory. Increasing the heap didn't help, even breaking the 32bit barrier and using a 64bit JVM was not good enough to keep the app running for more than 24 hours.

The Xmx size of the heap suggested that something was out of order. It was time to take a heap dump using jmap and check what was consuming so much memory. I tried jhat to analyze the heap dump, but 3.5GB dump was just too much for it. The next tool I used was IBM's Heap Analyzer - a decent tool, which was able to read the dump, but consumed a lot of memory in order to do so (~8GB), and was pretty hard to use once the dump was processed.

While looking for more heap analyzing tools, I found SAP Memory Analyzer, now known as Eclipse Memory Analyzer, a.k.a MAT. I thought "What the heck does SAP know about JVM?" and reluctantly gave it a try, only to find out how prejudiced I was. MAT is a really wonderful tool, which was able to process the heap really quickly, visualize the heap in a easy-to-navigate way, use special algorithms to find suspicious memory regions, and all of that while using only ~2GB of memory. An excellent preso that walks through MAT features and how heap and memory leaks work, can be found here.

Thanks to MAT I was able to create two bug reports for folks at Atlassian (CONF-14988, CONF-14989). The only feature I missed was some kind of PDF or HTML export, but I did quite well with using Skitch to take screenshots and annotate them.

One of the leaks was confirmed right away, while it wasn't clear what was causing the other one. All we knew was that significant amounts of memory were retained via ThreadLocal variables. More debugging was in order.

I got this idea to create a servlet filter, that would inspect the thread-local store for the thread currently processing the request and log any thread-local references that exist before the request is dispatched down the chain and also when it comes back. Such a servlet could be packaged as a Confluence Servlet Filter Plugin, so that it is convenient to develop and deploy it.

There was only one problem with this idea, the thread-local store is a private field of the Thread class and is in fact implemented as an inner class with a package default access - kinda hard to get your hands on to. Thankfully private stuff is not necessarily private in Java, if you get your hands dirty with reflection code:
Thread thread = Thread.currentThread();

Field threadLocalsField = Thread.class.getDeclaredField("threadLocals");
threadLocalsField.setAccessible(true);

Class threadLocalMapKlazz = Class.forName("java.lang.ThreadLocal$ThreadLocalMap");
Field tableField = threadLocalMapKlazz.getDeclaredField("table");
tableField.setAccessible(true);

Object table = tableField.get(threadLocalsField.get(thread));

int threadLocalCount = Array.getLength(table);
StringBuilder sb = new StringBuilder();
StringBuilder classSb = new StringBuilder();


int leakCount = 0;

for (int i=0; i < threadLocalCount; i++) {
    Object entry = Array.get(table, i);
    if (entry != null) {
        Field valueField = entry.getClass().getDeclaredField("value");
        valueField.setAccessible(true);
        Object value = valueField.get(entry);
        if (value != null) {
            classSb.append(value.getClass().getName()).append(", ");
        } else {
            classSb.append("null, ");
        }
        leakCount++;
    }
}

sb.append("possible ThreadLocal leaks: ")
        .append(leakCount)
        .append(" of ")
        .append(threadLocalCount)
        .append(" = [")
        .append(classSb.substring(0, classSb.length() - 2))
        .append("] ");

logger.warn(sb);

A simple plugin like this, was able to confirm that the leaked SAXParser instances are created and stored as thread-local variables somewhere within the code that exports content as PDF. That is good enough info to pinpoint the exact line of code that creates the thread-local instance by BTrace (or code review), but that's a story for a separate blog post.

The morale of the story: ThreadLocal variables are a very powerful feature, which as is common for powerful stuff can result in a lot of nasty things when not used properly. Hopefully all the info I provided to Atlassian will be enough to get a speedy fix for the issue and bring stability to wikis.sun.com - at least until we step over the next "invisible line".