Friday, July 30, 2010

DGC IV: Confluence Upgrades

This blog post is part of the DevOps Guide to Confluence series. In this chapter of the guide, we’ll have a look at Confluence upgrades.

Confluence Release History and Track Record

I started using Confluence at around version 2.4.4 (released March 2007). A lot has changed since then, mostly for the better. In my early days, Atlassian was spitting out one release after another (typically 3 weeks or less apart), followed by a major release every 3 months. You can check out the full release history on their wiki.

This changed later on, and recently there have been fewer minor releases and bigger major releases delivered every 3.5-4 months. Depending on your point of view, this is good or bad. It now takes longer to get awaited features and fixes, but on the other hand the releases are more solid and better tested.

For major releases, Atlassian now usually offers an Early Access Program, which gives you access to milestone builds so that you can see and mold the new stuff before it ships.

Contrary to the past, the minor versions have been very stable lately and have contained only bugfixes, so it is generally safe to upgrade without a lot of hesitation.

The same can't be said about major releases. Even though the stability of x.y.0 releases has been dramatically improving lately, I still consider it risky for a big site to upgrade soon after a major release is announced. Wait for the first bugfix release (x.y.1), monitor the bug tracker, knowledge base and forums, and then consider the upgrade.
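
The rule of thumb above can be written down as a small check. This is only a sketch of my reading of it: the function names and the assumption that versions always look like x.y or x.y.z are mine, not anything Confluence provides.

```python
def parse_version(version):
    """Split a version string like '3.3.1' into a tuple of three ints.

    A two-part version such as '3.3' is treated as '3.3.0'.
    """
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def is_safe_to_upgrade(current, candidate):
    """Bugfix releases on the same x.y line are fine; for a new x.y line,
    wait for the first bugfix release (x.y.1 or later)."""
    cur = parse_version(current)
    new = parse_version(candidate)
    if new <= cur:
        return False          # not an upgrade at all
    if cur[:2] == new[:2]:
        return True           # bugfix release within the same major line
    return new[2] >= 1        # new major line: skip the x.y.0 release
```

A check like this is no substitute for reading the release notes and watching the bug tracker; it only encodes the "skip x.y.0" habit.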

Having gone through many upgrades myself, I think that it is good practice to stay up to date with your Confluence site. We have usually been at most one major version behind and frequently on the latest version but, as I mentioned, avoiding the x.y.0 releases. This has been working well for us.

Staying in Touch and Getting Support

In order to know what's going on with Confluence releases, it is a good idea to subscribe to the Confluence Announcements mailing list. This is a very low traffic mailing list used for release and security announcements only.

Atlassian's tech writers usually do a good job at creating informative release notes, upgrade notes and security advisories, so be sure to read those for each release (even if you are skipping some).

There are several other channels through which people working on Confluence (plugin) development can communicate and support each other.

Despite Atlassian's claims about their legendary support, I found the official support channel rarely useful. Being a DIY guy and having a reasonable knowledge of Confluence internals, I usually found myself in need of more qualified support than what the support channel was created for. For this reason my occasional support tickets usually ended up being escalated to the development team, instead of being handled by the support team.

On the other hand, the public issue tracker has been an invaluable source of information and a great communication tool. I wish that more of my bug reports had been addressed, but for the most part I have been receiving a reasonable amount of attention, even though sometimes I had to request escalation to have someone look at and fix issues that were critical for us.

The biggest hurdle I've been experiencing with bug fixes and support is that sites of our size are not the main focus for Atlassian, and they are not hesitant to be open about it. I often shake my head when I see features of little value (for us, that is, because they target small deployments and have little to do with core wiki functionality) being implemented and promoted, while major architectural issues, bugs and highly anticipated features go without attention for years. Just browse the issue tracker and you'll get the idea.

Confluence Upgrades

The core of the upgrade procedure will depend on the build distribution type you use (standalone, war, building from source), but fundamentally in all cases, you need to shut down your Confluence, replace your app (standalone or war) with the new version and then start it again. An automated upgrade process will take care of updating the database schema, rebuilding the search index and other tasks required for a successful upgrade.

That was the good news. The bad news is that there is a lot more work to be done in order to successfully upgrade a site with as little downtime as possible.

Dev and Test Deployments and Testing

Before you upgrade the real thing, you should first get familiar with the release by upgrading your dev and test environments.

It's often handy to invite your users to do a brief UAT (user acceptance testing) on your test instance as they might catch something that you or your automated tests haven't.

Picking the Outage Window

Based on your users' usage patterns (as easily identified by web analytics solutions like Google Analytics), you should pick a time when the usage is low. For our global site this has been early mornings at around 4:30 or 5am PT.
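
If you can export hourly request counts from your analytics tool, finding the quietest slot is a few lines of code. This is a minimal sketch; the input format (a list of 24 counts, one per hour of the day in your reporting timezone) and the function name are assumptions for illustration.

```python
def quietest_window(hits_per_hour, window_hours=1):
    """Return the starting hour (0-23) of the window with the fewest hits.

    hits_per_hour: list of 24 request counts, index = hour of day.
    The window is allowed to wrap around midnight.
    """
    if len(hits_per_hour) != 24:
        raise ValueError("expected 24 hourly counts")
    best_start, best_total = 0, None
    for start in range(24):
        total = sum(hits_per_hour[(start + i) % 24] for i in range(window_hours))
        if best_total is None or total < best_total:
            best_start, best_total = start, total
    return best_start
```

For a global site it is worth averaging the counts over several weeks first, so that a single unusual day does not decide the window.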

When it comes to picking a day, we usually stuck with Tuesdays, Wednesdays or Thursdays. Nobody wants to be dealing with an issue during a weekend when internal (infrastructure) or external (Atlassian) support is harder to get hold of.

You also want to communicate the planned outage to your users in advance, so that they are not caught by surprise by an outage on a day when they are releasing important documents on the wiki.

As far as outage duration goes, we usually plan for a 30-minute outage within a 1-hour window, and most of the time we have been able to bring the site back online in 30 minutes or less.

Ready, Set, Go!

The actual deployment consists of several steps, which in our case are:

  • disabling load balancing for both nodes (which automatically triggers redirection of all requests to a maintenance page hosted elsewhere)
  • shutting down both nodes
  • disabling MySQL replication between the master and slave db
  • taking a ZFS snapshot of the Confluence Home directory
  • taking a ZFS snapshot of the MySQL db filesystem on the master
  • deploying the new war file
  • starting one node (while the loadbalancer still ignores it)
  • watching container and Confluence logs for any signs of problems
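
The steps above, up to starting the first node, can be sketched as an ordered plan with a dry-run mode. This is an illustration rather than our actual tooling: the `./lb-disable.sh`, `./confluence-stop.sh`, `./deploy-war.sh` and `./confluence-start.sh` scripts, the node names and the dataset names are hypothetical placeholders. Only `zfs snapshot` and the MySQL `STOP SLAVE` statement are real commands, and the sketch ignores the fact that they have to run on different hosts.

```python
import subprocess


def build_upgrade_plan(snapshot_label, home_dataset, db_dataset):
    """Return the upgrade steps as (description, command) pairs.

    All ./*.sh scripts are hypothetical stand-ins for site-specific tooling.
    """
    return [
        ("disable load balancing for both nodes",
         ["./lb-disable.sh", "node1", "node2"]),
        ("shut down both nodes",
         ["./confluence-stop.sh", "node1", "node2"]),
        ("stop MySQL replication (run on the slave)",
         ["mysql", "-e", "STOP SLAVE;"]),
        ("snapshot the Confluence Home directory",
         ["zfs", "snapshot", f"{home_dataset}@{snapshot_label}"]),
        ("snapshot the MySQL filesystem on the master",
         ["zfs", "snapshot", f"{db_dataset}@{snapshot_label}"]),
        ("deploy the new war file",
         ["./deploy-war.sh", "node1"]),
        ("start one node (load balancer still ignores it)",
         ["./confluence-start.sh", "node1"]),
    ]


def run_plan(plan, dry_run=True):
    """Print each step; execute it only when dry_run is False.

    check=True aborts the run at the first failing command.
    """
    done = []
    for description, command in plan:
        print(f"==> {description}: {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, check=True)
        done.append(description)
    return done
```

Running the plan in dry-run mode against the test environment first is a cheap way to catch a wrong dataset name before the real outage window.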

At this point, we have one of our nodes up and running (hopefully :-)). We can log in with an admin account and check if everything works as expected. The next tasks include:

  • upgrading installed plugins
  • upgrading custom theme (if there is one)
  • running a bunch of automated or manual tests, just to verify that everything is ok
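
The automated part of that verification can start as a simple smoke test that requests a handful of pages directly from the upgraded node and reports anything that does not return HTTP 200. This is a minimal sketch using only the Python standard library; the function names are mine, and the paths you check are an assumption that should be replaced with pages that matter on your site.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code            # server answered with an error status
    except (urllib.error.URLError, OSError):
        return None                # connection refused, timeout, DNS failure...


def smoke_test(base_url, paths, fetch=http_status):
    """Return a list of (path, status) for every path that did not return 200."""
    failures = []
    for path in paths:
        status = fetch(base_url + path)
        if status != 200:
            failures.append((path, status))
    return failures
```

The `fetch` parameter exists so the check can be exercised without a live server; in real use you would call something like `smoke_test("http://node1.example.com:8080", ["/login.action", "/dashboard.action"])`, where the host name and paths are placeholders.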

If things are looking good, we can allow the load balancer to start sending requests to our upgraded node. Continue watching logs and eventually deploy the war on the second node and re-enable the MySQL replication.
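
Watching logs by eye works, but a small filter helps when the startup output scrolls by quickly. The sketch below flags lines that look like trouble. The pattern is an assumption about what your container and Confluence logs contain (log4j-style ERROR/FATAL levels, java.util.logging's SEVERE, and Java exception names), so tune it to your own log format.

```python
import re

# Assumed markers of trouble; adjust to your own logging configuration.
PROBLEM_PATTERN = re.compile(r"\b(ERROR|FATAL|SEVERE)\b|Exception")


def find_problems(lines):
    """Return (line_number, text) for every line matching PROBLEM_PATTERN."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if PROBLEM_PATTERN.search(line)
    ]
```

Since an open file is an iterable of lines, `find_problems(open(path))` works on a log file directly, with `path` being wherever your container writes its logs.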

If any issues occur during the deployment, we can simply:

  • shut down the upgraded node
  • revert to the latest Confluence Home snapshot
  • revert to the latest MySQL db snapshot
  • redeploy the older version of war file
  • either retry the deployment or re-enable the load balancer and work on resolving the issues outside of the production environment
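
The rollback is worth scripting too, since it is the procedure you will be running under the most pressure. A minimal sketch of the first four steps follows. The `./confluence-stop.sh` and `./deploy-war.sh` scripts, the `--previous` flag, the node name and the dataset names are hypothetical; `zfs rollback` is the real command, and without extra options it reverts a dataset to its most recent snapshot only.

```python
def build_rollback_plan(snapshot_label, home_dataset, db_dataset):
    """Return the rollback steps as (description, command) pairs.

    Assumes MySQL is stopped before its filesystem is rolled back, and
    that the named snapshot is the latest one on each dataset.
    All ./*.sh scripts are hypothetical stand-ins for site-specific tooling.
    """
    return [
        ("shut down the upgraded node",
         ["./confluence-stop.sh", "node1"]),
        ("revert Confluence Home to the snapshot",
         ["zfs", "rollback", f"{home_dataset}@{snapshot_label}"]),
        ("revert the MySQL filesystem to the snapshot",
         ["zfs", "rollback", f"{db_dataset}@{snapshot_label}"]),
        ("redeploy the older version of the war file",
         ["./deploy-war.sh", "node1", "--previous"]),
    ]
```

Rehearsing this plan on the test environment at least once is what makes the "simply" in the list above true.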

In my experience from all the dev, test and prod deployments, we've had to roll back and redo an upgrade from scratch only once or twice. It's very unlikely that you'll have to do it, but it's better to be ready than sorry.

If you are building Confluence from patched sources and deploy your own builds frequently, then you might want to consider automating your deployments with tools like Capistrano. This will save you a lot of time and make the deployments more reliable and consistent.

Conclusion

If you do your homework, Confluence is quite easy to upgrade. It's unfortunate that the entire cluster must be shut down for an upgrade even between minor releases, but if you plan your deployment well, you will be able to minimize the downtime to just a few minutes outside of peak hours.

In the next chapter of this guide, we'll take a look at customizing and patching Confluence.

2 comments:

Sherif Mansour said...

Igor,
Thanks for writing this up. I'm sure a lot of other customers will appreciate the information you have shared about your experiences in administering Confluence.

I've also noted some of the feedback you have provided about Confluence clustered. This feedback is very helpful as we do our roadmap planning for the product.

Thanks, once again.

Sherif Mansour
Confluence Product Manager

Sarah Maddox said...

Hallo Igor

Nice post! I've added a link to your post from the Atlassian Confluence documentation:
http://confluence.atlassian.com/display/DOC/Tips+of+the+Trade

Cheers, Sarah