Saturday, November 14, 2009

Using Mercurial Bisect to Find Bugs

Yesterday I tried to find a regression bug in Grizzly that was preventing grizzly-sendfile from using blocking IO. I knew that the bug was not present in grizzly 1.9.15, but somewhere between that release and the current head someone introduced a changeset that broke things for me. Here is how I found out who that person was.

Grizzly is unfortunately still stuck with subversion, so the only thing (besides complaining) that I can do to make my life easier, is to convert the grizzly svn repo to some sane SCM, such as mercurial. I used hgsvn to convert the svn repo.

Once I had a mercurial repo, I wrote BlockingIoAsyncTest - a JUnit test for the bug. And that was all I needed to run bisect:
$ echo '#!/bin/bash                                                                          
mvn clean test -Dtest=com.sun.grizzly.http.BlockingIoAsyncTest' > test.sh
$ chmod +x test.sh
$ hg bisect --reset #clean repo from any previous bisect run
$ hg bisect --good 375 #specify the last good revision
$ hg bisect --bad tip #specify a known bad revision
Testing changeset 604:82e43b848ae7 (458 changesets remaining, ~8 tests)
517 files updated, 0 files merged, 158 files removed, 0 files unresolved
$ hg bisect --command ./test.sh #run the automated bisect
...
(output from the test)
...
...
...

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESSFUL
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9 seconds
[INFO] Finished at: Sat Nov 14 11:41:07 PST 2009
[INFO] Final Memory: 25M/79M
[INFO] ------------------------------------------------------------------------
Changeset 500:983a3fc2debe: good
The first good revision is:
changeset:   501:b5239bf9427b
branch:      code
tag:         svn.3343
user:        jfarcand
date:        Wed Jun 17 17:20:32 2009 -0700
summary:     [svn r3343] Fix for https://grizzly.dev.java.net/issues/show_bug.cgi?id=672
In under two minutes I found out who and with which revision caused all the trouble! Bisect is very useful for large projects, developed by multiple users, where the amount of code and changes is not trivial. Finding regressions in this way can save a lot of time otherwise spent by debugging.

The only caveat I noticed is that you need to create a shell script that is passed in as an argument to the bisect command. It would be a lot easier if I could just specify the maven command directly without the intermediate shell script.

Thursday, May 28, 2009

grizzly-sendfile to Become an Official Grizzly Module

After a chat with JFA about grizzly-sendfile's future, I'm pleased to announce today that grizzly-sendfile 0.4 will be the first version of grizzly-sendfile released as an official module of grizzly. This is a huge news for grizzly-sendfile and I believe an equally important news for grizzly and its community.

What this "merger" means for grizzly-sendfile:

  • great opportunity to extend the reach
  • opportunity to become the default static file handler in Grizzly
  • aspiration to become the default static file handler in GlassFish v3
  • more testing and QA
  • easier and faster access to grizzly developers and contributors

What this "merger" means for grizzly:

  • contribution of 1 year of research, development and testing time in the area of static http downloads
  • several times better performance and scalability of http static file downloads
  • built-in X-Sendfile functionality
  • better JMX instrumentation for http downloads
  • and more

If you can't wait for 0.4, go and get recently released version 0.3.

This is a great day for both projects! :-)

Project site: http://grizzly-sendfile.kenai.com/

Thursday, May 14, 2009

grizzly-sendfile 0.3 is out!

After a few months of late night hacking, grizzly-sendfile 0.3 is finally ready for prime time!

New features include:

I also started using kenai's JIRA for issue tracking. So feel free to file bugs or RFE's there.

Benchmark & Enjoy!

Project Website The source code The binaries

Sunday, May 10, 2009

grizzly-sendfile and Comparison of Blocking and NonBlocking IO

From the very early beginnings of my work on grizzly-sendfile (intro) I was curious to compare blocking and non-blocking IO side to side. Since I didn't have any practical experience to understand which one would be more suitable when, I designed grizzly-sendfile to be flexible so that I could try different strategies and come to conclusions based on some real testing rather than theorizing or based on the words of others. In this post I'd like to compare blocking and nonblocking IO, benchmark them, and draw some conclusions as to which one is more suitable for specific situations.

grizzly-sendfile has a notion of algorithms that control the IO operations responsible for writing data to a SocketChannel (grizzly-sendfile is based on NIO and leverages lots of great work put into grizzly). Different algorithms can do this in different ways, and this is explained in depth on the project wiki. The point is that this allows me to create algorithms that use blocking or nonblocking IO in different ways, and easily swap them and compare their performance (in a very isolated manner).

Two algorithms I implemented right away were SimpleBlockingAlgorithm (SBA) and EqualNonBlockingAlgorithm (ENBA), and only recently followed by EqualBlockingAlgorithm (EBA). The first one employs the traditional approach of sending a file via the network (while not EOF write data to a SocketChannel using blocking writes), while ENBA uses non-blocking writes and Selector re-registration (in place of blocking) to achieve the same task. This means that a download is split into smaller parts, each sequentially streamed by an assigned worker thread to the client. EBA works very similarly to ENBA, but uses blocking writes.

I ran two variations of my faban benchmark[0] against these three algorithms. At first I made my simulated clients hit the server as often as possible and download files of different sizes as quickly as possible. Afterward I throttled the file download speed to 1MB/s per client (throttling was done by clients). While the first benchmark simulates traffic close to the one on the private network in a datacenter, the second benchmarks better represents client/server traffic on the Internet.

grizzly-sendfile delegates the execution of the selected algorithm to a pool of worker threads, so the maximum number of the treads in the pool, along with the selected algorithm, are one of the major factors that affects the performance[1] and scalability[2] of the server. In my tests I kept the pool size relatively small (50 threads), in order to easily simulate situations when there are more concurrent requests than the number of workers, which is common during traffic spikes.

Conc. ClientsDownload limitAlgorithmAvg init time[3] (sec)Avg speed[4] (MB/s)Avg total throughput (MB/s)
50noneSBA0.0194.36208.76
50noneENBA0.0214.15198.79
50noneEBA0.0184.23202.29
100noneSBA4.6664.32212.79
100noneENBA0.0481.84168.15
100noneEBA0.1401.96175.71
200noneSBA14.2884.31208.59
200noneENBA0.1080.87144.69
200noneEBA0.2640.97158.83


Conc. ClientsDownload limitAlgorithmAvg init time[3] (sec)Avg speed[4] (MB/s)Avg total throughput (MB/s)
501MB/sSBA0.0031.042.9
501MB/sENBA0.0020.9841.82
501MB/sEBA (81k)[5]0.0030.9841.91
501MB/sEBA (40k)[6]0.0020.9842.85
1001MB/sSBA20.51.040.4
1001MB/sENBA0.0030.9985.14
1001MB/sEBA (81k)0.0181.084.19
1001MB/sEBA (40k)0.0130.9984.34
2001MB/sSBA64.81.037.12
2001MB/sENBA0.1120.86141.8
2001MB/sEBA (81k)0.20.95156.59
2001MB/sEBA (40k)0.1590.96154.2
3001MB/sSBA113.91.034.2
3001MB/sENBA0.1850.58127.53
3001MB/sEBA (81k)0.310.61133.66
3001MB/sEBA (40k)0.2390.63132.75


The interpretation of the results is that with SBA the individual download speeds are slightly higher than when ENBA or EBA is used (this is mainly due to some extra scheduling related to the Selector reregistration). However, EBA and ENBA can utilize the available server bandwidth significantly better than SBA. This is especially true when there is a large number of slow clients downloading larger files. One big downside of SBA is that if the number of concurrent downloads is higher than the number of worker threads, the time to initiate download easily increases into extreme heights.

The conclusion is that SBA is well suited for controlled environments where there is a small number of fast clients (server-to-server communication on an internal network), while EBA shines in environments where there is very little control over the number and speed of clients (file hosting on the Internet). While EBA performs and scales better than ENBA, two advantages that ENBA has over EBA are smaller latency and higher resiliency to DoS or DDoS attacks when malicious clients open connections and block.

The results above do not represent well the performance and throughput of grizzly-sendfile (the benchmarks were run on my laptop!), but they certainly provide some evidence that can be used to determine characteristics of different algorithms. I'll do some more thorough testing later when I feel that grizzly-sendfile has enough features and it is time for some fine tuning (there is still a lot that can be done to make things more efficient).

I love explaining things visually, so I drew the following diagrams that describe the differences between the two algorithms much better than many words.

SimpleBlockingAlgorithm

The server can concurrently handle only the same number of requests as the number of worker threads in the pool. All the extra downloads have to be queued until one of some workers completes a download. The downloads take shorter to process, but the worker is blocked while a write is blocked. The slower the client the more blocking occurs and the worker utilization (and server throughput) goes down.

EqualNonBlockingAlgorithm

The number of downloads a server can concurrently handle is not limited by the number of workers. The downloads take slightly longer to process, but the workers are much better utilized. The increase in utilization is due to the fact that no blocking occurs and workers can multiplex - "pause" downloads for which clients are processing the data (stored in OS/network buffers). In the meantime workers can serve whichever client is ready to receive more data. This causes the download to be split into several parts, each possibly served by a different worker thread.

EqualBlockingAlgorithm

This algorithm is a combination of the previous two. It uses blocking IO and multiplexing. I think that thanks to multiplexing the client has more time to process data in OS/network buffers and thus deplete the buffer more than without multiplexing, which results in less blocking. The blocked writes are also more efficient because they decrease the number of parts a download is split into and thus decrease the amount of overhead associated with re-registration.

Based on these benchmarks, I'm going to call EqualBlockingAlgorithm the winner and make it the default grizzly-sendfile algorithm for now. It is quite easy to override this via the grizzly-sendfile configuration, so one will still be able to pick the algorithm that fits their deployment environment the best. To be honest the results of EBA benchmarks surprised me a bit because I expected ENBA to be the winner in throttled tests, so I'm really glad that I went into the trouble of creating and benchmarking it.

All this work will be part of the 0.3 release, which is due any day now. Follow the project on kenai and subscribe to the announce mailing list if you are interested to hear more news.



[0] a few notes on how I tested - both clients and the grizzly-sendfile-server were running on my mac laptop using Java 6 and server vm. I set the ramp up period to 1min so that the server can warm up and then I started counting downloads for 10min and calculated the result. Files used for the tests were 1KB, 200KB, 500KB, 1MB 20MB and 100MB large and equaly represented. All the tests passed with 0 errors. A download is successful when the length, md5 checksum and http headers match expected values.
[1] performance - ability to process a single download quickly
[2] scalability - ability to process large number of concurrent downloads at acceptable performance
[3] init time - duration from the time a request is made, until the first bytes of the http response body (not headers) are received
[4] avg speed - for 100MB files. The smaller the file the lower the effective download speed because of the overhead associated with a download execution.
[5] EBA (81k) - EqualBlockingAlgorithm with buffer size 81KB (the default socket buffer size on mac)
[6] EBA (40k) - EqualBlockingAlgorithm with buffer size 40KB

Tuesday, March 31, 2009

Identifying ThreadLocal Memory Leaks in JavaEE Web Apps

A few weeks ago wikis.sun.com powered by Confluence "Enterprise" Wiki grew beyond yet another invisible line that triggered intermittent instabilities. Oh boy, how I love these moments. This time the issue was that Confluence just kept on running out of memory. Increasing the heap didn't help, even breaking the 32bit barrier and using a 64bit JVM was not good enough to keep the app running for more than 24 hours.

The Xmx size of the heap suggested that something was out of order. It was time to take a heap dump using jmap and check what was consuming so much memory. I tried jhat to analyze the heap dump, but 3.5GB dump was just too much for it. The next tool I used was IBM's Heap Analyzer - a decent tool, which was able to read the dump, but consumed a lot of memory in order to do so (~8GB), and was pretty hard to use once the dump was processed.

While looking for more heap analyzing tools, I found SAP Memory Analyzer, now known as Eclipse Memory Analyzer, a.k.a MAT. I thought "What the heck does SAP know about JVM?" and reluctantly gave it a try, only to find out how prejudiced I was. MAT is a really wonderful tool, which was able to process the heap really quickly, visualize the heap in a easy-to-navigate way, use special algorithms to find suspicious memory regions, and all of that while using only ~2GB of memory. An excellent preso that walks through MAT features and how heap and memory leaks work, can be found here.

Thanks to MAT I was able to create two bug reports for folks at Atlassian (CONF-14988, CONF-14989). The only feature I missed was some kind of PDF or HTML export, but I did quite well with using Skitch to take screenshots and annotate them.

One of the leaks was confirmed right away, while it wasn't clear what was causing the other one. All we knew was that significant amounts of memory were retained via ThreadLocal variables. More debugging was in order.

I got this idea to create a servlet filter, that would inspect the thread-local store for the thread currently processing the request and log any thread-local references that exist before the request is dispatched down the chain and also when it comes back. Such a servlet could be packaged as a Confluence Servlet Filter Plugin, so that it is convenient to develop and deploy it.

There was only one problem with this idea, the thread-local store is a private field of the Thread class and is in fact implemented as an inner class with a package default access - kinda hard to get your hands on to. Thankfully private stuff is not necessarily private in Java, if you get your hands dirty with reflection code:
Thread thread = Thread.currentThread();

Field threadLocalsField = Thread.class.getDeclaredField("threadLocals");
threadLocalsField.setAccessible(true);

Class threadLocalMapKlazz = Class.forName("java.lang.ThreadLocal$ThreadLocalMap");
Field tableField = threadLocalMapKlazz.getDeclaredField("table");
tableField.setAccessible(true);

Object table = tableField.get(threadLocalsField.get(thread));

int threadLocalCount = Array.getLength(table);
StringBuilder sb = new StringBuilder();
StringBuilder classSb = new StringBuilder();


int leakCount = 0;

for (int i=0; i < threadLocalCount; i++) {
    Object entry = Array.get(table, i);
    if (entry != null) {
        Field valueField = entry.getClass().getDeclaredField("value");
        valueField.setAccessible(true);
        Object value = valueField.get(entry);
        if (value != null) {
            classSb.append(value.getClass().getName()).append(", ");
        } else {
            classSb.append("null, ");
        }
        leakCount++;
    }
}

sb.append("possible ThreadLocal leaks: ")
        .append(leakCount)
        .append(" of ")
        .append(threadLocalCount)
        .append(" = [")
        .append(classSb.substring(0, classSb.length() - 2))
        .append("] ");

logger.warn(sb);

A simple plugin like this, was able to confirm that the leaked SAXParser instances are created and stored as thread-local variables somewhere within the code that exports content as PDF. That is good enough info to pinpoint the exact line of code that creates the thread-local instance by BTrace (or code review), but that's a story for a separate blog post.

The morale of the story: ThreadLocal variables are a very powerful feature, which as is common for powerful stuff can result in a lot of nasty things when not used properly. Hopefully all the info I provided to Atlassian will be enough to get a speedy fix for the issue and bring stability to wikis.sun.com - at least until we step over the next "invisible line".

Thursday, February 05, 2009

Announcing grizzly-sendfile!

It's my pleasure to finally announce grizzly-sendfile v0.2 - the first stable version of a project that I started after I got one of those "Sudden Burst of Ideas" last summer.

For people who follow the grizzly development or mediacast.sun.com, this is not exactly hot news. grizzly-sendfile has been used by mediacast since last September and mentioned on the grizzly mailing list several times since then, but I haven't had time to promote it and explain what it does and how it works, so here I go.

If you don't care about my diary notes, skip down to "What is grizzly-sendfile".

A bit of background: the whole story goes back to the end of 2007 when a bunch of us where finishing up the rewrite of mediacast.sun.com in JRuby on Rails. At that time we realized that one of the most painful parts of the rewrite would be implementing the file streaming functionality. Back then Rails was single-threaded (not any more, yay!), so sending the data from rails was not an option. Fortunately, my then-colleague Peter, came up with an idea to use a servlet filter to intercept empty download responses from rails and stream the files from this filter. That did the trick for us, but it was a pretty ugly solution that was unreliable from time to time and was PITA to extend and maintain.

At around this time, I learned about X-Sendfile - a not well known http header - that some webservers (e.g. apache and ligttpd) support. This header could be used to offload file transfers from an application to the web server. Rails supports it natively via the :x_sendfile option of send_file method.

I started looking for the X-Sendfile support in GlassFish, which we have been using at mediacast, but it was missing. After some emails with glassfish and grizzly folks, mainly Jean-Francois, I learned that the core component of glassfish called grizzly could be extended via custom filters, which could implement this functionality.

The idea stuck in my head for a few weeks. I looked up some info on grizzly and NIO and then during one overnight drive to San Diego, I designed grizzly-sendfile in my head. It took many nights and a few weekends to get it into reasonable shape and test it under load with some custom faban benchmarks that I had to write, but in late August I had version 0.1 and was able to "sell" it to Rama as a replacement of the servlet filter madness that we were using at mediacast.

Except for a few initial bugs that showed up under some unusual circumstances, the 0.1 version was very stable. A few minor 0.1.x releases were followed by 0.2 version, which was installed on mediacast servers some time in November. Since then I've worked on docs and setting up the project at kenai.com.

What is grizzly-sendfile?

From the wiki: grizzly-sendfile is an extension for grizzly - a NIO framework that among other things powers GlassFish application server.

The goal of this extension is to facilitate an efficient file transfer functionality, which would allow applications to delegate file transfers to the application server, while retaining control over which file to send, access control or any other application specific logic.

How does it work?

By mixing some NIO "magic" and leveraging code of the hard working grizzly team, I was able to come up with an ARP (asynchronous request processing) filter for grizzly. This filter can be easily plugged in to grizzly (and glassfish v2) and will intercept all the responses that contain X-Sendfile header. The value of this header is the path of the file that the application that processed the request wants to send to the client.

All that an application needs to do is to set the header. In Java a simple example of such a code looks like this:
 response.setHeader("X-Sendfile", "/path/to/file.avi");
In Rails, it looks even nicer:
send_file '/path/to.png', :x_sendfile => true
That's it, grizzly-sendfile will take care of the rest.



Why should you care?


For me it was all about keeping my code clean and solving problems at layers where they made the most sense. Then it was also about performance and scalability - the kind of stuff that one can do with NIO, can't be now done in JavaEE because of its synchronous nature. And then of course it was about having full control over downloads (like successful download notification and other customizations that are possible via grizzly-sendfile plugins). Oh, and I must mention JMX monitoring:



What's next?

There is a lot of stuff on my roadmap. Two of the main missing features are partial downloads and glassfish v3 (grizzly 1.9.x) support. Then there is better monitoring and tons of performance and scalability tuning, which I haven't really focus on yet. A lot of the API still needs to be polished and cleaned-up. Also needed is a solid test suite that is more fine grained than the system/integration tests that I created with faban.

Can you use grizzly-sendfile?

Yeah, go for it. This is my pet project that I developed in my free time. The project is licensed under GPL2, so you can even grab the code if you want.

Can you help?

Sure. Code reviews, patches, suggestions and help with testing and documentation are more than welcome!

Saturday, January 31, 2009

Benchmarking JRuby on Rails

Last night, while working on a project I found a really neat use of Rails Components, but I also noticed that this part of Rails is deprecated, among other reasons because it's slow.

Well, how slow? During my quest to find out, I collected some interesting data, and even more importantly put JRuby and MRI Ruby face to face.

Disclaimer: the benchmarks were not done on a well isolated and specially configured test harness, but I did my best to gather data with informational value. All the components were used with OOB settings.

Setup

  • ruby 1.8.6 (2008-03-03 patchlevel 114) [universal-darwin9.0] + Mongrel Web Server 1.1.4
  • jruby 1.1.6 (ruby 1.8.6 patchlevel 114) (2008-12-17 rev 8388) [x86_64-java] + GlassFish gem version: 0.9.2
  • common backend: mysql5 5.0.75 Source distribution (InnoDB table engine, Rails pool set to 30)


Benchmarks

I used an excellent high quality benchmarking framework Faban for my tests. I was lazy, so I only used fhb (very similar to ab, but without its flaws) to invoke simple benchmarks:
  • simple request benchmark: bin/fhb -r 60/120/5 -c 10 http://localhost:3000/buckets/1
  • component request benchmark: bin/fhb -r 60/120/5 -c 10 http://localhost:3000/bucket1/object1
Both tests were run with JRuby as well as with RMI Ruby and in addition to that I ran the tests with Rails in single-threaded as well as multi-threaded modes. I didn't use mongler clusters or glassfish pooled instances - there was always only one Ruby instance serving all the requests.

Results

ruby 1.8.6 + mongrel
---------------------------------
simple action + single-threaded:
ops/sec: 210.900
% errors: 0.0
avg. time: 0.047
max time: 0.382
90th %: 0.095

simple action + multi-threaded:
ops/sec: 226.483
% errors: 0.0
avg. time: 0.044
max time: 0.180
90th %: 0.095

component action + single-threaded:
ops/sec: 132.950
% errors: 0.0
avg. time: 0.075
max time: 0.214
90th %: 0.130

component action + multi-threaded:
ops/sec: 131.775
% errors: 0.0
avg. time: 0.076
max time: 0.279
90th %: 0.125

jruby 1.2.6 + glassfish gem 0.9.2
----------------------------------
simple action + single-threaded:
ops/sec: 141.417
% errors: 0.0
avg. time: 0.070
max time: 0.259
90th %: 0.115

simple action + multi-threaded:
ops/sec: 247.333
% errors: 0.0
avg. time: 0.040
max time: 0.318
90th %: 0.065

component action + single-threaded:
ops/sec: 107.858
% errors: 0.0
avg. time: 0.092
max time: 0.595
90th %: 0.145

component action + multi-threaded:
ops/sec: 179.042
% errors: 0.0
avg. time: 0.055
max time: 0.357
90th %: 0.085


Platform/ActionSimple+/-Component+/-
Ruby ST 210ops0%132ops0%
Ruby MT226ops7.62%131ops-0.76%
JRuby ST141ops-32.86%107ops-18.94%
JRuby MT247ops17.62%179ops35.61%
(ST - single-threaded; MT - multi-threaded)

Conclusion

From my tests it appears that MRI is faster in single threaded mode, but JRuby makes up for the loss big time in the multi-threaded tests. It's also interesting to see that the multi-threaded mode gives MRI(green threads) a performance boost, but it's nowhere close to the boost that JRuby(native threads) can squeeze out from using multiple threads.

During the tests I noticed that rails was reporting more times spent in the db when using JRuby (2-80ms) compared to MRI (1-3ms). I don't know how reliable this data is but I wonder if this is the bottleneck that is holding JRuby back in the single threaded mode.