Sunday, July 25, 2010

DevOps Guide to Confluence (DGC)

After working with Atlassian Confluence for 3 years, running one of the bigger public Confluence installations, I realized that there is a major lack of information about how to run Confluence on a larger scale and outside of the intranet firewalls. I'm hoping that I can improve this situation with a blog series that will describe some of the (best?) practices that I implemented while running, tweaking, patching and supporting our Confluence-based site.

Just to throw out some numbers to give context of what I mean by "relatively large":
  • # registered users: 180k+
  • # contributing users: 7k+
  • # wiki spaces: 1.5k+
  • # wiki pages: 65k+
  • # page revisions: 570k+
  • # comments: 10k+
  • # visits per month: ~300k
  • # page views per month: ~800k
  • # http requests per day: ~1m+ (includes crawlers and users with disable javascript)
So I'm not talking about a huge site like amazon, twitter, etc, but still bigger than most of the public facing confluence instances out there.

Some of the practices described in this guide might be an overkill for smaller deployments, so I’ll leave it up to you to pick the right ones for you and your environment.

There are many aspects that need careful consideration if you want to go relatively big, and there are even more of them when you run your site on the Internet as opposed to doing it internally within an organization. In my blog series I'm going to focus on these areas that I consider important:

I'm not going to go into details about why to pick Confluence or why not to pick it. I really just want to focus on how to make it run smoothly and reliably while serving a relatively large audience of users (and robots).

Given that we want to run a site on the Internet, we are lucky to have well defined maintenance windows, that we can work with. Meaning that any downtime will be perceived by at least a portion of your users as your failure, and the only way how you can avoid looking like an idiot is to keep the downtime to the absolute minimum.

You are now probably thinking that a Confluence cluster will solve all your problems with scalability and reliability.

Right, that's what the marketing people tell you. Anyone who knows a thing or two about software engineering, knows that there is no such a thing as "unlimited scalability" and ironically a Confluence cluster can hit several bottlenecks quite quickly in certain situations. That said, a Confluence cluster with all its pros and cons is really the way to go big with Confluence, but you should have realistic expectations about its scalability and reliability.

The fact that makes things even more difficult is that if you do things right, your wiki is going to take off. More users, more content, more traffic, more spam, more crawlers, more users unhappy about any kind of downtime... Growth is what you need to take into account from day one. I'm not saying that you have to start big, you just shouldn't paint your self into a corner and I'm going to mention some tips on how to avoid just that.

I was inspired to write up this guide after watching George Barnett’s presentation from this year’s Atlassian Summit. George made some really good points and I encourage you to watch his talk. My guide will not focus just on performance and scalability, but also on reliability, smooth day-to-day operation and more.

Continue reading: DGC I: The Infrastructure

No comments: