Monday, May 31, 2010

Testing matters, even with shell scripts

A few months ago, we migrated our production environment at work from Solaris 10 to OpenSolaris. We loved the change because it allowed us to take advantage of the latest inventions in Solaris land. All was fine and dandy until one day one of our servers ran out of disk space and died. WTH? We have monitoring scripts that alert us long before we get even close to running out of space, yet no alert was issued this time. While investigating the cause of this incident, we found out that our monitoring scripts, which work well on Solaris 10, didn't monitor the disk space correctly on OpenSolaris. When I asked our sysadmins whether they had any tests for their scripts that could validate their functionality, they laughed at me.

Fast forward a few months. A few days ago I started looking at Satan, to augment the self-healing capabilities of Solaris SMF (think init.d or launchd on steroids). At first sight I loved the simplicity of the solution, but one thing that startled me during the code review was that there were no tests for the code, except for some helper scripts that made manual testing a bit less painful. At the same time, I spotted several bugs that would have resulted in unwanted behavior.

Satan relies on invoking Solaris commands from Ruby, parsing their output, and acting upon it. Thanks to its no-BS nature, Ruby makes for an excellent choice when it comes to writing programs that interact with the OS by executing commands. There are several ways to do this, but the most popular looks like this:
ps_output = `ps -o pid,pcpu,rss,args -p #{pid}`

All you need to do is stick the command into backticks and optionally use #{variable} for variable expansion. To get hold of the output, just assign the return value to a variable.
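As a minimal sketch of how this works (using echo as a stand-in for a real Solaris command): backticks return the command's standard output as a String, interpolation behaves exactly as in double-quoted strings, and Ruby records the exit status of the last spawned command in the special $? variable.

```ruby
# Capture a command's stdout into a variable.
output = `echo hello`

# $? holds a Process::Status for the last command run via backticks.
status_ok = $?.success?

# #{...} interpolation works inside backticks, just like in "..." strings.
word = "world"
greeting = `echo #{word}`
```

Note that backticks capture only stdout; stderr still goes to the terminal, and a failing command just yields an empty (or partial) string, which is why checking $? matters when you act on the output.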

Now if you stick a piece of code like this in the middle of a Ruby script, you get something next to untestable:
module PsParser
  def ps(pid)
    out_raw = `ps -o pid,pcpu,rss,args -p #{pid}`
    out = out_raw.split(/\n/)[1].split(/ /).delete_if { |arg| arg == "" or arg.nil? }
    { :pid => out[0].to_i,
      :cpu => out[1].to_i,
      :rss => out[2].to_i * 1024,
      :command => out[3..-1].join(' ') }
  end
end

With the code structured (or rather unstructured) like this, you'll never be able to test whether it parses the output correctly. However, if you extract the command execution into a separate method:
module PsParser
  def ps(pid)
    out = ps_for_pid(pid).split(/\n/)[1].split(/ /).delete_if { |arg| arg == "" or arg.nil? }
    { :pid => out[0].to_i,
      :cpu => out[1].to_i,
      :rss => out[2].to_i * 1024,
      :command => out[3..-1].join(' ') }
  end

  private

  def ps_for_pid(pid)
    `ps -o pid,pcpu,rss,args -p #{pid}`
  end
end

You can now reopen the module and redefine ps_for_pid in your tests like this:
require 'ps_parser'

PS_OUT = {
 1 => "  PID %CPU    RSS ARGS
12790   2.7 707020 java",
 2 => "  PID %CPU    RSS ARGS
12791  92.7 107020 httpd"
}

module PsParser
  def ps_for_pid(pid)
    PS_OUT[pid]
  end
end

And now you can simply call the ps method and check whether the fake output stored in PS_OUT is parsed correctly. The concept is the same as when mocking web services or other complex classes, but applied to running system commands and programs.
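Putting the pieces together, here is a self-contained sketch of such a test (the module and fake output are inlined here for brevity; in the real suite you'd require 'ps_parser' and only redefine ps_for_pid — the FakePs class is a hypothetical test harness):

```ruby
# Canned ps(1) output, keyed by pid, standing in for the real command.
PS_OUT = {
  1 => "  PID %CPU    RSS ARGS\n12790   2.7 707020 java"
}

module PsParser
  def ps(pid)
    out = ps_for_pid(pid).split(/\n/)[1].split(/ /).delete_if { |arg| arg == "" or arg.nil? }
    { :pid => out[0].to_i,
      :cpu => out[1].to_i,
      :rss => out[2].to_i * 1024,
      :command => out[3..-1].join(' ') }
  end

  # Test double: returns canned output instead of shelling out to ps(1).
  def ps_for_pid(pid)
    PS_OUT[pid]
  end
end

# Hypothetical harness class so we can call the mixin's methods.
class FakePs
  include PsParser
end

result = FakePs.new.ps(1)
```

The parser can now be exercised deterministically: result[:pid] is 12790, result[:command] is "java", and so on, regardless of what processes happen to be running on the machine executing the tests.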

To conclude: what makes you more confident about software you want to rely on? An empty test folder, or all-green results from a test/spec suite?

2 comments:

Martin said...

1st: I think the reason for the forced laughter was that a) SAs in general do not write test code, and b) have you tried writing test code for someone else's 1000-LOC sh + sed + awk scripts? Those languages were just not designed to be easily testable :)

2nd: you can also use a mock library to return a fixed string.

Igor Minar said...

Just thinking about 1000 loc long sh + sed + awk scripts makes me shiver.