The TestOps Files, part 1: You can't mock out the operating system

DevOps is a software development method that, among other things, involves automating the provisioning and configuration of IT infrastructure. A key part of that definition is software development—DevOps is about writing software. Like all software, DevOps software needs to be tested. Like all software, DevOps software is best tested in an automated fashion as part of a Continuous Integration (CI) development process.

Acquia Cloud is a PaaS for developing and running web sites. It is a software product, and a massive DevOps project. Not surprisingly, we follow a CI process for developing it, including extensive automated tests. Over the years, we have learned a lot about what it means to test "infrastructure as software", and I've given several public presentations on the topic. I've come to realize that testing infrastructure as software has some unique aspects and challenges compared to testing normal application software, making it worthy of being its own topic of discussion.

I therefore hereby declare TestOps as the branch of DevOps devoted to automated testing of software for automated provisioning and configuration of IT infrastructure.

Unit vs system tests

Here's the most fundamental thing we've learned about TestOps over the years: Unit tests are great for application software, but they are insufficient for infrastructure software. There are lots of reasons for this, many of which can be summarized as "You can't mock out the real world."

The goal of unit tests is to test each individual component of a system in isolation. Since most components interact with other components, unit tests "mock out" the other components by providing very simple stub versions (mocks) that behave in a predetermined way. This allows testing one component under the assumption that the other components behave correctly. By unit testing each component in turn, you end up getting a lot of confidence in your software.
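To make the idea of a mock concrete, here is a minimal sketch in Python using the standard library's unittest.mock. The function disk_usage_report and its run_quota_tool collaborator are hypothetical names invented for this illustration; they are not part of our actual code.

```python
from unittest import mock


def disk_usage_report(run_quota_tool, username):
    """Build a one-line report from whatever the quota tool returns.

    run_quota_tool is the collaborating component; in a unit test
    we replace it with a mock. (Hypothetical example code.)
    """
    used, limit = run_quota_tool(username)
    return f"{username}: {used} of {limit}"


# The mock stands in for the real tool and behaves in a predetermined way.
fake_tool = mock.Mock(return_value=("18.7M", "500M"))

report = disk_usage_report(fake_tool, "alice")
assert report == "alice: 18.7M of 500M"

# We can also verify how the component under test used its collaborator.
fake_tool.assert_called_once_with("alice")
```

Note what this test does and does not prove: the report logic is correct as long as the real tool behaves the way the mock does.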

Unit testing breaks down when you cannot accurately mock another interacting component. Or rather, unit testing is still useful, but it will miss very important problems caused by unexpected behavior from other components. Under this condition, the solution is system (or integration) tests: testing real versions of all the components interacting together, as they will in production.

What kinds of components cannot be accurately modeled by a mock? Lots of them:

  • operating systems
  • the network
  • complex services like relational database replication
  • third-party APIs
  • the list goes on and on...

In other words, in the world of DevOps, almost all the components we deal with cannot be mocked out. The conclusion is simple: TestOps must be based on system testing to be useful.

When the operating system mocks you

I've been meaning to start this blog series for a long time. I was motivated today by the zillionth entertaining example of how important system tests are.

On our newly launched Acquia Cloud Free, we impose disk quotas on an XFS filesystem. We use the xfs_quota tool to set quotas and monitor usage. We retrieve a user's current usage like this:

# xfs_quota -x -c 'quota -Nhu [username]' /path
/dev/xvdm 18.7M 500M 500M 00 [------] /path

We split the output on whitespace and use the 3rd and 4th fields. To make sure an xfs_quota error never leaves us treating a blank line as zero usage, we verify that the line contains at least four fields, and fail otherwise.
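The parsing step described above can be sketched roughly as follows. This is an illustrative Python version, not our production code; the function name parse_quota_line and the choice of ValueError are assumptions made for the example.

```python
def parse_quota_line(output):
    """Return the 3rd and 4th whitespace-separated fields of the output.

    Raises ValueError when there are fewer than four fields, so that a
    blank or truncated line is never mistaken for valid data.
    (Hypothetical helper written for this post.)
    """
    fields = output.split()
    if len(fields) < 4:
        raise ValueError(f"unexpected xfs_quota output: {output!r}")
    return fields[2], fields[3]


sample = "/dev/xvdm 18.7M 500M 500M 00 [------] /path"
print(parse_quota_line(sample))
```

Failing loudly on short output is a deliberate choice: a silent default of zero would hide exactly the kind of surprise that system tests exist to catch.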

This code has been working fine for a long time. Today, I was testing a change which, as an unintended side effect, caused us to measure a user's quota usage before the user owns any files on the filesystem. We discovered something new: the xfs_quota 'quota' command only outputs data for a user if that user consumes at least one byte on the filesystem; otherwise, it outputs nothing. As a result, our code saw fewer than four fields in the output, raised an exception and did not complete its job, and thus our tests failed.

If we had mocked out xfs_quota, perhaps with a shell script that output known values, we would not have found this until we rolled out the change to production and our product failed. Instead, we caught it during initial development before any harm was done.
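Here is a rough Python sketch of why the mock would have hidden the problem. Every name in it (fake_xfs_quota, observed_xfs_quota, read_quota_fields) is hypothetical and exists only for this illustration. The first stand-in encodes what we assumed the tool does; the second mimics what the real tool did for a user who owns no files.

```python
CANNED_LINE = "/dev/xvdm 18.7M 500M 500M 00 [------] /path"


def fake_xfs_quota(username):
    """The mock we might have written: it always reports a usage line."""
    return CANNED_LINE


def observed_xfs_quota(username):
    """Mimics the real behavior we saw for a user with no files: no output."""
    return ""


def read_quota_fields(run_tool, username):
    """Split the tool's output and insist on at least four fields."""
    fields = run_tool(username).split()
    if len(fields) < 4:
        raise RuntimeError(f"unexpected xfs_quota output for {username!r}")
    return fields[2], fields[3]


# Against the mock, the new-user case passes and tells us nothing.
assert read_quota_fields(fake_xfs_quota, "newuser") == ("500M", "500M")

# Against the observed behavior, the same call fails, as our system test did.
try:
    read_quota_fields(observed_xfs_quota, "newuser")
except RuntimeError as exc:
    print("caught:", exc)
```

The mock can only be as accurate as our understanding of the tool, and in this case our understanding was incomplete.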

DevOps is real work. TestOps is hard. Automated system testing is essential.

More to come!