One gripe I have with SMF is that its process monitoring capabilities are rather simple. A process associated with a contract (service) must die before SMF notices that something is wrong and restarts the service. In practice, more often than not a process gets into a weird state that prevents it from working properly, yet it doesn't die. Failures might include excessive CPU or memory usage, or application-level failures that can be detected only by interacting with the application (e.g. an HTTP health check). SMF in its current implementation is incapable of detecting these failures. And this is where Satan comes into play.
Satan is a small Ruby script that monitors a process and, following the Crash-only Software philosophy, kills it when a problem is detected. It then relies on SMF to detect the process death(s) and restart the given service. I fell in love with the simplicity of Satan (which was inspired by God) and started exploring the feasibility of using it to improve the reliability of SMF on our production servers.
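To make the idea of an application-level check concrete, here is a minimal sketch of an HTTP health check with a timeout, written with Ruby's standard Net::HTTP. It is an illustration only, not code taken from Satan: the method name http_healthy? and the default timeout are assumptions.

```ruby
require "net/http"
require "uri"

# Hypothetical helper (not part of Satan): returns true only if the URL
# answers with a 2xx status within `timeout` seconds.
def http_healthy?(url, timeout: 5)
  uri = URI.parse(url)
  response = Net::HTTP.start(uri.hostname, uri.port,
                             use_ssl: uri.scheme == "https",
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess)
rescue StandardError
  # Refused connections, timeouts and malformed URLs all count as unhealthy.
  false
end
```

A monitor would call something like http_healthy?("http://localhost:8080/health", timeout: 2) on each pass (the URL here is made up) and treat a false result as a failed check. The process can be alive and still fail this test, which is exactly the case SMF cannot see on its own.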
Upon a code review of the script, I noticed several things that I wished were implemented differently. Here are some:
- Satan watches processes rather than services as defined via SMF
- One Satan instance is designed to watch many different processes for different services, which adds unnecessary complexity and lacks isolation
- Satan is merciless (what a surprise! :-) ) and uses kill -9 without a warning
- Satan has no test suite!!! :-( (i.e. I must presume that it doesn't work)
Thankfully the source code was out there on GitHub and licensed under the BSD license, so it was just a matter of a few keystrokes to fork it (open source FTW!). By the time I was done with my changes, there wasn't much of the original source code left, but oh well :-)
I'm happy to present to you http://github.com/IgorMinar/satan for review and comments. The main changes I made are the following:
- One Satan instance watches a single SMF service and its one or more processes
- The single-service design allows monitoring to be suspended automatically via SMF dependencies while the monitored service is being started, restarted or disabled
- Several bugfixes around how rule failures and recoveries are counted before a service is deemed unhealthy
- Satan first invokes svcadm restart, and only if the restart doesn't happen within a specified grace period does it use kill -9 to kill all processes for the given contract (service)
- Satan now has a decent RSpec test suite (more on that in my previous post)
- Improved HTTP condition with a timeout setting
- New JVM free heap space condition to monitor those pesky JVM memory leaks
- Extensible design now allows for new monitoring conditions (rules) to be defined outside of the main Satan source code
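The restart-then-kill escalation described in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the fork: the class name RestartEscalator, the 30-second default grace period and the once-per-second polling are made up, and the sketch assumes that a changed contract ID (as reported by svcs -o ctid) means SMF has restarted the service. The commands are injected so the logic can be exercised without a Solaris box.

```ruby
# Hypothetical sketch, not the fork's actual code: ask SMF for a restart,
# then fall back to kill -9 on the service's contract.
class RestartEscalator
  # run: callable that executes a command and returns its stdout
  # nap: callable that sleeps; both are injectable so the logic can be tested
  def initialize(fmri, grace_period: 30,
                 run: ->(*cmd) { IO.popen(cmd, &:read) },
                 nap: ->(seconds) { sleep(seconds) })
    @fmri = fmri
    @grace_period = grace_period
    @run = run
    @nap = nap
  end

  # Contract ID currently associated with the service.
  def contract_id
    @run.call("svcs", "-H", "-o", "ctid", @fmri).strip
  end

  def restart!
    old_ctid = contract_id
    @run.call("svcadm", "restart", @fmri)

    # Poll once per second; a new contract ID is taken as proof of a restart
    # (an assumption of this sketch).
    @grace_period.times do
      @nap.call(1)
      return :restarted if contract_id != old_ctid
    end

    # Grace period expired: kill every process in the old contract.
    @run.call("pkill", "-9", "-c", old_ctid)
    :killed
  end
end
```

Calling RestartEscalator.new(fmri, grace_period: 10).restart! with the FMRI of the monitored service would return :restarted if SMF handled the restart in time and :killed if the fallback was needed. Killing by contract rather than by PID is what lets a single call take out all of the service's processes.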
4 comments:
Nice! OpenSolaris FTW!
You should add a user method credential to the Satan smf service, so you can limit the damage he can do :)
E.g.
But then you need to grant the daemon user privilege to restart the smf service also.
Yeah, I was thinking about that. Originally Satan could be run as webservd:webservd because it used only kill -9 on webservd processes; with svcadm restart you need to be root or have a special role.
+1 on Martin's comment.
Also it would be nicer if Satan tried a kill, sleep, check if still running, kill -9 if still alive type thing. Forced kills are often unnecessary and may leave behind cruft that would prevent a clean restart (lock files for example.)
I think a log of restart events might be a good idea too. Saving it locally would be the easiest/most universal.
@rama, my fork does svcadm restart first, then sleep, then if no restart happened uses kill -9 to force the restart. I prefer svcadm restart over regular kill.
Also satan's activity is being logged in the SMF log for satan's service. Run "svcs -x satan" to see the log path.