Sunday, May 10, 2009

grizzly-sendfile and Comparison of Blocking and NonBlocking IO

From the very early beginnings of my work on grizzly-sendfile (intro) I was curious to compare blocking and non-blocking IO side to side. Since I didn't have any practical experience to understand which one would be more suitable when, I designed grizzly-sendfile to be flexible so that I could try different strategies and come to conclusions based on some real testing rather than theorizing or based on the words of others. In this post I'd like to compare blocking and nonblocking IO, benchmark them, and draw some conclusions as to which one is more suitable for specific situations.

grizzly-sendfile has a notion of algorithms that control the IO operations responsible for writing data to a SocketChannel (grizzly-sendfile is based on NIO and leverages lots of great work put into grizzly). Different algorithms can do this in different ways, and this is explained in depth on the project wiki. The point is that this allows me to create algorithms that use blocking or nonblocking IO in different ways, and easily swap them and compare their performance (in a very isolated manner).

Two algorithms I implemented right away were SimpleBlockingAlgorithm (SBA) and EqualNonBlockingAlgorithm (ENBA), and only recently followed by EqualBlockingAlgorithm (EBA). The first one employs the traditional approach of sending a file via the network (while not EOF write data to a SocketChannel using blocking writes), while ENBA uses non-blocking writes and Selector re-registration (in place of blocking) to achieve the same task. This means that a download is split into smaller parts, each sequentially streamed by an assigned worker thread to the client. EBA works very similarly to ENBA, but uses blocking writes.

I ran two variations of my faban benchmark[0] against these three algorithms. At first I made my simulated clients hit the server as often as possible and download files of different sizes as quickly as possible. Afterward I throttled the file download speed to 1MB/s per client (throttling was done by clients). While the first benchmark simulates traffic close to the one on the private network in a datacenter, the second benchmarks better represents client/server traffic on the Internet.

grizzly-sendfile delegates the execution of the selected algorithm to a pool of worker threads, so the maximum number of the treads in the pool, along with the selected algorithm, are one of the major factors that affects the performance[1] and scalability[2] of the server. In my tests I kept the pool size relatively small (50 threads), in order to easily simulate situations when there are more concurrent requests than the number of workers, which is common during traffic spikes.

Conc. ClientsDownload limitAlgorithmAvg init time[3] (sec)Avg speed[4] (MB/s)Avg total throughput (MB/s)

Conc. ClientsDownload limitAlgorithmAvg init time[3] (sec)Avg speed[4] (MB/s)Avg total throughput (MB/s)
501MB/sEBA (81k)[5]0.0030.9841.91
501MB/sEBA (40k)[6]0.0020.9842.85
1001MB/sEBA (81k)0.0181.084.19
1001MB/sEBA (40k)0.0130.9984.34
2001MB/sEBA (81k)0.20.95156.59
2001MB/sEBA (40k)0.1590.96154.2
3001MB/sEBA (81k)0.310.61133.66
3001MB/sEBA (40k)0.2390.63132.75

The interpretation of the results is that with SBA the individual download speeds are slightly higher than when ENBA or EBA is used (this is mainly due to some extra scheduling related to the Selector reregistration). However, EBA and ENBA can utilize the available server bandwidth significantly better than SBA. This is especially true when there is a large number of slow clients downloading larger files. One big downside of SBA is that if the number of concurrent downloads is higher than the number of worker threads, the time to initiate download easily increases into extreme heights.

The conclusion is that SBA is well suited for controlled environments where there is a small number of fast clients (server-to-server communication on an internal network), while EBA shines in environments where there is very little control over the number and speed of clients (file hosting on the Internet). While EBA performs and scales better than ENBA, two advantages that ENBA has over EBA are smaller latency and higher resiliency to DoS or DDoS attacks when malicious clients open connections and block.

The results above do not represent well the performance and throughput of grizzly-sendfile (the benchmarks were run on my laptop!), but they certainly provide some evidence that can be used to determine characteristics of different algorithms. I'll do some more thorough testing later when I feel that grizzly-sendfile has enough features and it is time for some fine tuning (there is still a lot that can be done to make things more efficient).

I love explaining things visually, so I drew the following diagrams that describe the differences between the two algorithms much better than many words.


The server can concurrently handle only the same number of requests as the number of worker threads in the pool. All the extra downloads have to be queued until one of some workers completes a download. The downloads take shorter to process, but the worker is blocked while a write is blocked. The slower the client the more blocking occurs and the worker utilization (and server throughput) goes down.


The number of downloads a server can concurrently handle is not limited by the number of workers. The downloads take slightly longer to process, but the workers are much better utilized. The increase in utilization is due to the fact that no blocking occurs and workers can multiplex - "pause" downloads for which clients are processing the data (stored in OS/network buffers). In the meantime workers can serve whichever client is ready to receive more data. This causes the download to be split into several parts, each possibly served by a different worker thread.


This algorithm is a combination of the previous two. It uses blocking IO and multiplexing. I think that thanks to multiplexing the client has more time to process data in OS/network buffers and thus deplete the buffer more than without multiplexing, which results in less blocking. The blocked writes are also more efficient because they decrease the number of parts a download is split into and thus decrease the amount of overhead associated with re-registration.

Based on these benchmarks, I'm going to call EqualBlockingAlgorithm the winner and make it the default grizzly-sendfile algorithm for now. It is quite easy to override this via the grizzly-sendfile configuration, so one will still be able to pick the algorithm that fits their deployment environment the best. To be honest the results of EBA benchmarks surprised me a bit because I expected ENBA to be the winner in throttled tests, so I'm really glad that I went into the trouble of creating and benchmarking it.

All this work will be part of the 0.3 release, which is due any day now. Follow the project on kenai and subscribe to the announce mailing list if you are interested to hear more news.

[0] a few notes on how I tested - both clients and the grizzly-sendfile-server were running on my mac laptop using Java 6 and server vm. I set the ramp up period to 1min so that the server can warm up and then I started counting downloads for 10min and calculated the result. Files used for the tests were 1KB, 200KB, 500KB, 1MB 20MB and 100MB large and equaly represented. All the tests passed with 0 errors. A download is successful when the length, md5 checksum and http headers match expected values.
[1] performance - ability to process a single download quickly
[2] scalability - ability to process large number of concurrent downloads at acceptable performance
[3] init time - duration from the time a request is made, until the first bytes of the http response body (not headers) are received
[4] avg speed - for 100MB files. The smaller the file the lower the effective download speed because of the overhead associated with a download execution.
[5] EBA (81k) - EqualBlockingAlgorithm with buffer size 81KB (the default socket buffer size on mac)
[6] EBA (40k) - EqualBlockingAlgorithm with buffer size 40KB


John said...

I've observed: for the non-blocking case, a worker pool size equal to the number of cores performs better than a pool size greater than the number of cores.

I've been running load tests with grizzly for rtp/udp traffic (media streaming).

I'm wondering if you have the same observation.


Igor Minar said...

Hi John,

I have not tried that, but can run one or two tests. There might be something into that observation, I'll test and report back.

Igor Minar said...

@John first test suggest that the throughput goes down if I decrease the pool size to the number of cores. I'll try to test with a pool size equal to double of number of cores later.

Igor Minar said...

Tested both throttled and throttled ENBA setup with 8 threads (2x4cores) and 200 clients and got better results than with 50 threads, but worse than EBA w/ 50 threads.

Still, it's a valuable suggestion. Thanks