Sunday, July 25, 2010

DGC I: The Infrastructure

In the introductory post, I mentioned that a Confluence cluster is the way to go big. Let's go through some of the main things to consider when you start preparing your infrastructure.

Confluence cluster

To build a Confluence site, you need Confluence :-). Well, make it two... as in a two-node cluster license. I recommend this for any bigger site with relatively high uptime expectations, even if you know that your amount of traffic won't require load balancing between two nodes. I often find my self in a need of a restart (e.g. during a patch deployment) and with a cluster, you can restart one node at a time and your users won't even know about it.

Network

My team operates other big sites, and from all of them we expect some level of redundancy. Typically we split everything between "odd" (composed of hosts with hostnames ending with an odd number) and "even" strings, and this applies to Confluence nodes as well (that's why you need two-node license). Each string is composed of a border firewall, load balancer, switches and the actual servers (web/application/database/whathaveyou) and both strings can either share the load or work as primary&standby depending on your application needs and network configuration.

This kind of splitting, allows us to take half of our datacenter offline for maintenance when needed or allows us to absorb potential failure of any hardware or software within one string without any perceivable interruption of service.

Sure, you can make things even more redundant by adding a third or forth string, but none of our apps requires that level of redundancy and the cost and complexity of getting there is therefore hard to justify.

There are two important things that matter when it comes to setting up the network, and both can make or break you Confluence clustering.
  1. The latency between the two nodes should be minimal. Ideally they should be just one hop apart and on a fast network (1GBit). There will be a lot of communication going on between your Confluence nodes, and you want it to happen as quickly as possible, otherwise the cluster synchronization will drag down your overall cluster performance. Don't even think about putting the two nodes into different datacenters, let alone on different continents. Confluence clustering was not built for that type of scenario.
  2. Make absolutely sure that your network (mainly switches, OS, firewall) supports multicast.

The best way to check that the multicast works reliably is to use the multicast test tool that is bundled with Coherence (a library that is bundled with Confluence). To run it just run the following command on all nodes and check if all packes are being delivered and no duplicates are present:
java -cp $CONFLUENCE_PATH/WEB-INF/lib/coherence-x.y.jar:$CONFLUENCE_PATH/WEB-INF/lib/tangosol-x.y.jar \
com.tangosol.net.MulticastTest \
-group $YOUR_MULTICAST_IP:$YOUR_MULTICAST_PORT \
-ttl 1 \
-local $NODE_IP

In our environment, it took us months of waiting for the right patch from our network gear vendor and some OS patching to make things totally stable. Fortunately, our ops guys eventually found the magic combination of patches and settings, and then we were good to go.

Our site uses both http and https protocols for content delivery and since we already had an SSL accelerator available in our datacenter we utilized it for Confluence, but I don't think that with current hardware, hw acceleration is not very important these days.

Another noteworthy suggestion I have for your network is the load balancer configuration. We started off with a session-affinity-based load-balancing, but at one point people started to notice that sometimes they see different content than their colleagues. This was due to delay in propagation of changes throughout the cluster. Usually the delay is unnoticeable, but for some reasons it's not always the case. I haven't investigated this issue further and just switched to primary&slave load balancing, which has been working great for us since. This of course will work only if each of your nodes can handle all the traffic on its own, but you can trust me that it solves all the issues with users that don't believe in eventual consistency :-).

Hopefully your load balancer will perform healthchecks against your nodes. The /errors.jsp path is the ideal target for these healthchecks, because it returns HTTP 200 only if everything is ok with the node.

When it comes to firewall rules (you have a firewall right?), you shouldn't allow incoming connection from public networks directly to your servers, all the public traffic should go through loadbalancer only. As for outbound connections, you should allow your servers to connect to any public server on ports 80 (HTTP) and 443 (HTTPS); these connections are needed for feed retrieval, open social gadgets and plugin installation.

Hardware (cpu, memory, disk)

Update: I came across this HW requirements document from Atlassian, which is helpful especially for smaller instances.

When you are making your hardware choices, I suggest you stick with a server that is relatively recent and has decent single-threaded performance, yet offers multicore parallelism. Confluence does relatively a lot of number crunching per http request, so both single-threaded and multi-threaded horse-power are needed to get good results. Additionally Confluence's boot process is not the best one, so with poor single-threaded throughput you'll end up waiting minutes for the app to start (at one point I did!).

Confluence loves memory! So don't be stingy. RAM is cheap these days, so get a few gigs that will be dedicated just to Confluence. My instance uses 6 GB JVM heap and with additional non-heap memory consumption, OS overhead and an extra buffer. I allocated 10GB of RAM for each Confluence node. You will likely start with much lower memory requirements, but as your instance grows, so will the memory requirements - keep that in mind.

When it comes to disk and disk space, you have to realize three things.
  1. Confluence stores all of its persistent data in a (hopefully remote) database.
  2. Confluence relies on fetching data from its Lucene index, stored on the local file system (each node has its own copy). This index is built from the db contents and can be rebuilt at any time.
  3. Attachments, which can represent a huge chunk of your persistent data, will be stored in the database. Confluence won't let you use e.g. shared filesystem when you are running a cluster.

All of this means that you will need a few (dozen) gigabytes of local disk space that can be accessed reasonably quickly. SSD will likely not buy you much, use it for your DB! Server grade hard drives configured in redundant software or hardware RAID should be sufficient for your web/application server (you can skip the RAID if you can rebuild the server really quickly after a disk failure).

OS & Filesystem

The choice of OS is often a religious one. But I think that it's more important that you are comfortable administering your OS than anything else. We use Solaris 10, or more recently OpenSolaris everywhere. Especially OpenSolaris is superior to most (all?) of the OSes out there (heh, now I'm being religious), but it will be worth cat's pee to you if you have no clue about how to work with it and don’t have time or willingness to learn a lot of cool stuff about the OS. In general, I'd say that any 64bit *nix OS should be suitable as long as you know how to use it. You'll want 64bit OS so that you can load it with loads of RAM and create big JVM heap once you need it.

One nice thing that comes with Solaris and OpenSolaris (and BSD) is ZFS file system. If you don't know much about it, I suggest that you read a bit about it. ZFS can make your backup strategy a lot simpler and allows you to revert from a failed upgrade in a matter of seconds. I'm not exaggerating, it happened to me several times. Hopefully Btrfs will soon be production ready for Linux distros and will offer comparable conveniences. If you can't use either of these, you'll have to suck it up and deal with it. I don't envy you...

Virtualization

During the last 3 years, we tried several combinations of deployment configurations for our Confluence site. These include Solaris 10 servers shared by several apps, Solaris 10 Zones (one zone per app) and OpenSolaris with Xen virtualization. Xen and OpenSolaris is what we currently use. It works well, but if I were to make a decision today, I would probably go with OpenSolaris and Zones. This combination gives you the best stability, performance, resource virtualization and application isolation.

In any case, many people ask what is the performance penalty for going virtualized. My answer is that it depends on your application, but for a webapp, more likely than not, it isn't going be the main reason of your performance problems. Decent hardware is going to make the virtualization penalty almost invisible and at the same time will give you flexibility when allocating resources in your data center. Just to give you a rough idea, the overhead for Xen is 10-30%; for Solaris Zones it's a lot less.

Web Container

Atlassian recommends using Tomcat as the web container for Confluence. We could again spend a lot of time fighting a religious battle here, but I'm going to avoid that. If Tomcat works for you and you don't find it lacking features that make enterprise deployments and operation easier, then good for you. You will most likely want it fronted it with Apache webserver or something similar though.

I've been using Sun Web Server 7 in my production environment and was quite happy with it. Another excellent choice is GlassFish v2.1 or v3, which I've been using for Confluence on my Mac. Unfortunately, Confluence doesn't adhere to the Servlet spec in some places, so you'll have to patch it to get it to run with GF v3. Glassfish v2.1 is not affected, but suffers from Xalan classes clashing, so to fix that you need to put Confluence's xalan-x.y.z.jar into $GLASSFISH_HOME/domains/$YOURDOMAN/lib/. Otherwise everything works as expected.

For a bigger site, you'll likely need to increase the worker thread count in your servlet container. Check your container's documentation to see what's the default and how to increase it. You should also know what your peak concurrent request rate is (monitor it!) and in combination with your infrastructure capabilities (load test it!) choose the right value for you. Ours is 256 which is higher than our usual peak traffic, but lower than what we could handle if we had to.

Logs & Monitoring

Paraphrasing my friends daughter: "More data, more better!". Log as much as you can and archive logs. You'll never know when you'll need to search for an exception log and confirm that it started to appear 7 months ago, right after that particular confluence upgrade.

What helped me on several occasions, was having detailed access logs. I use Apache combined log format with an extra attribute - request duration in microseconds. This format will not only give you a good idea about your app's performance, but will also help you track various issues by logging http referal[sic] and user-agent headers. This can often be invaluable info!

Here is a list of different types of logs you should be gathering: confluence log, web container log, jvm gc log and http access log.

In order to get an in-depth information about your visitors, usage patterns and content, I suggest that you integrate your Confluence with web analytics services like Google Analytics or Omniture. From their reports you can learn more about how, when and from where your users use your site.

When it comes to monitoring, your strategy should most certainly include JVM and JMX monitoring. Confluence as well as JVM exposes quite some interesting metrics via JMX. You should know what these values look like throughout the day or week. Only then you'll be able to efficiently troubleshoot issues when they occur (and they will occur!). Bare minimum include: heap space usage, cpu usage, requests per 10 second, errors per 10 seconds, avg request duration

We have a custom monitoring app that allow us to gather, archive and analyze these JVM/JMX metrics, but there are also some open source tools available of various quality (e.g. Munin looks promising).

The second part of our monitoring strategy is implemented as a local agent (we use Satan), that closely monitors the JVM process and the app itself by checking if it's not running out of heap space, as well as by performing http health checks. In the case that multiple failures are registered, the agent restarts the app and emails out an alert with the description of the failure. This allows us to sleep through the night without worrying that a pesky memory leak is going to take down our site at night. Fortunately, we haven’t seen any stability issues for a while now, but things were different in the past.

The last part of our monitoring strategy is implemented as remote http agents. These periodically perform http health checks from various locations on the Internet and send out alerts when an issue is detected. This gives us a good visibility into potential networking issues that wouldn't be caught by a local agent. There are several third party solution that you could use, or you can build your own (and host it cross the globe on EC2).

DB

The choice is up to you. Pick something supported by Atlassian or else you'll likely regret it. We use MySQL5 and for the most part we've been quite happy with it. Our db currently takes ~26GB, so be sure to account for gigabytes of db files and several times that for db backups. The biggest space sucker are attachments. Since a Confluence cluster can currently store attachments only in the database, you have to limit the attachment size, or else you'll likely end up with performance problems due to overloaded db.

We limit attachment size to 5MB. There are several users that are not happy about that, but on the other hand, it helps people to realize that often a simple wiki page is a much better distribution medium than an OpenOffice document attached to a blank wiki page. I'd bet that our users would stick huge ISO images into our db if we allowed them to. My suggestion is to start with a low limit and increase it if there is a business justification for it. Maybe one day Confluence will support S3 or Google storage as the backend for attachments, until then, keep the size limit low.

The db should be hosted on a dedicated server with lots of RAM. I'm fortunate enough to have DBAs that take care of running the DB for me, so I don't have to worry about that part. A good DBA , MANY FAST disks (possibly SSD) and lots of RAM are the key ingredients to well performing db. Of course, make sure the latency between both Confluence nodes and the db server is minimal. You shouldn't think of doing anything worse than 1GBit network and locate the db within the same datacenter.

I mentioned ZFS before and I'll mention it again. If you put the db files that contain your Confluence database on a dedicated ZFS dataset (think volume), you'll be able to take snapshots of your db during upgrades or on the fly (you'll have to momentarily lock the db to do that) and then revert from these snapshots instantly when you need it. This is just awesome. :-)

If you are using MySQL5, your minimal my.cnf should look like this:
[mysqld]
default-storage-engine=innodb
default-table-type=innodb
default-character-set=utf8
default-collation=utf8_general_ci
max_allowed_packet=32M
The last setting will allow you to upload up to 32MB large files (attachments, plugins, etc) into the db.

Backups

Your users will hate you if you lose any of their precious data, so don’t do it! The best way to avoid any data loss is to have a backup strategy in place. Ours is composed of several parts.

Config files are stored in our version control system, which is, surprise surprise, being backed up.

Confluence home directory on our Confluence nodes is being backed up only just before the upgrade via a ZFS snapshot. All the files in there (except for the config files) can be rebuilt from the database, so I don’t worry about them.

The database is being backed up nightly via a SQL dump, which is then backed up on a tape. Additionally, just before an upgrade, we take a ZFS snapshot of the filesystem the db files reside on. This allows us to do instant rollbacks in case the upgrade fails. I experienced a situation where it took us hours to roll back from a SQL dump. It’s slooow. Since then we switched to ZFS snapshots.

The database is really the master storage of all the Confluence data, so in addition to all the backups, we also run a redundant (remember “odd” and “even”?) db server, that the master database is being replicated to on the fly via MySQL master/slave replication. During an upgrade we now also stop the replication, so that we can use the slave right away if something happened to the master during an upgrade and we couldn't use ZFS to rollback.

As if that was not enough, there is one more layer that allows users to recover from user errors in a fine-grained manner. It’s Confluence wiki page versioning and wiki space trash. The combination of these two features, enables users to undo most of the editing mistakes on their own, without bothering site administrators (I’ll talk more about delegation in chapter IV of the guide).

There is also a Confluence built-in backup mechanism, but it works well only for small instances. This backup process is resource intensive, generates lots of data and if I remember correctly breaks ones you reach certain size. Don't use it. You'll have to explicitly disable it via the Confluence Admin UI.

Prod, Test, Dev Environments

The ability to experiment in the production environment will decrease with the increase of users using the site. For this reason, you'll need to build a Test environment that closely matches your production environment. Here you can practice your Confluence upgrade, or run automated tests just before a release. If you are doing Confluence core or plugin development, you'll also need a dev environment. This one can be a simplified and scaled down version of production (e.g. you can forgo clustering) and should be conveniently located on your dev machine or server.

Conclusion

If you follow my advice, you should now have an infrastructure that is will help you run your Confluence site in a performant, scalable and reliably way. If you found something important missing, feel free to post your suggestions as comments.

In the next chapter of this guide we'll look at the JVM tuning.

No comments: