Vault, Continuous Integration, and CruiseControl.NET
For both of you who are still here after that scintillating gnat update, a reward is in order. The number of people interested in both the gnat count in my office and continuous integration with Vault may very well be the empty set, but I’ll take a chance. (I was just hours away from Vail! Cut me some slack!)
If you don’t know what continuous integration or CruiseControl.NET are, you can get started here before you crawl back under that rock.
If you just want Vault to play nicely with CruiseControl.NET, you’ll need Vault 3.1.8 (on both client and server) and CruiseControl.NET build 188.8.131.529 or later, and a look through the Vault Source Control Block documentation. If you want to find out more about what we changed, read on.
Last year, Eric subscribed to the CruiseControl.NET (CC.NET) users mailing list. The community of Vault users there made clear that Vault had several shortcomings when paired with CC.NET. One late summer afternoon Eric came into my office and channelled Shoeless Joe Jackson. Ease their pain, he said.
Due to some schedule changes we just happened to have a seriously fast server sitting unused, and nobody was particularly fond of our existing Vault build system or its 40-minute build time. So the first step was to set up CC.NET to build Vault on the new machine and make it as fast as possible. The build’s master of ceremonies was already NAnt, so tweaking that to be CC.NET-friendly wasn’t too difficult. Between dog-fooding the setup ourselves and listening to all the valid criticism on the mailing list, we found lots of room for improvement. It boiled down to three primary pain points:
- Lack of true “build traceability”, meaning the ability to identify (and typically retrieve) the code that makes up a particular build.
- Intermittent failures when CC.NET polls Vault for changes. This was especially annoying because CC.NET reports these as build failures. This appears to be a complaint when using CC.NET with other source control systems as well, but after a little investigation Vault was clearly misbehaving.
- Lack of backup sanity in a continuous integration environment. In some configurations, there was a perpetually-growing backup directory that was a pain to turn off, and the only workaround was pretty lame.
We had a couple of options to fix these:
- Implement a new CC.NET plug-in that talks directly to Vault via the client API.
- Fix the existing Vault source control block, which talks to Vault via the command-line client.
Primarily because we wanted CC.NET and Vault to play nicely out of the box, and because there are no other source control plug-ins that are part of the core CC.NET distribution, we chose option 2. (Option 1 gives us all kinds of interesting new options for deeper integration, and that ship has not sailed, but that’s a subject for another post.)
On the surface, Vault looks and feels a lot like SourceSafe because it was explicitly designed to be a painless transition for SourceSafe users. Unsurprisingly, the existing Vault integration with CC.NET looked just like SourceSafe’s:
- Poll for changes via the item history command
- Label the folder
- Get the code identified by that label
- If the build fails, remove the label
This method was fraught with peril:
- The SourceSafe-like “item history” command being used was far from ideal for change-polling purposes. CC.NET polls for changes every minute or two all day. It should be as fast as possible. The item history command is a nice, exhaustive change list, because it recursively returns all changes to all items under the queried folder, but that makes it a relatively slow option for change-polling.
- CC.NET has an option to disable “exception notifications.” It turns out that when you turn this off, you disable the whole block of code that runs after a failed build, and this is where the label was being removed. When the label isn’t removed, the next build typically fails when it attempts to apply what is now a duplicate label (unless you’re incrementing build numbers on failed builds). And it was possible to accidentally remove a label that somebody else had applied. There were an awful lot of possibilities and code-paths to consider here for this to be particularly robust.
- Get By Label is the slowest way to do a “get” with Vault. We’ve identified a number of ways to make it faster that will probably ship with Vault 4.0, but we wanted to avoid using this get for continuous integration if at all possible.
One of the things Vault does that’s not readily apparent from the SourceSafe-like GUI is track versions of folders. So we can track when you move a file from one version-controlled folder to another, for example. And you can say, “show me version 4 of this folder and everything in it, recursively.” Taking advantage of that, we can pretty easily simplify and speed things up:
- Poll for changes via the folder history command. This is a much less expensive operation on the server: it just returns a list of folder versions, meaning one record for every transaction that affected anything under this folder. Even better, I could essentially tell Vault, “give me any new folder versions after 47, which was the latest when I did the last build.”
- Get by folder version. If Vault indicated that the latest version of this folder is now 48, I can order up that particular version of the folder and all the source therein.
- If the build succeeds, label by version. If 15 transactions were committed while the build ran, that’s okay. The label will be applied to version 48 of the folder, which is what we built with. There’s now a label that positively identifies the code making up this build.
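The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code that shipped in CC.NET or Vault: FakeRepo, integrate, and every method name below are hypothetical stand-ins, and the "repository" is just a counter in memory.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRepo:
    """In-memory stand-in for a repository that versions folders (hypothetical)."""
    version: int = 47
    labels: dict = field(default_factory=dict)

    def commit(self) -> int:
        # Every transaction under the folder creates one new folder version.
        self.version += 1
        return self.version

    def versions_after(self, last_built: int) -> list:
        # Folder-history style query: only versions newer than the last build.
        return list(range(last_built + 1, self.version + 1))

    def get_version(self, version: int) -> str:
        # Stand-in for fetching the folder contents at one exact version.
        return f"source@{version}"

    def label_version(self, name: str, version: int) -> None:
        # The label is pinned to a specific folder version, not to "latest".
        self.labels[name] = version


def integrate(repo, last_built, label, build):
    """Run one polling cycle and return the folder version that was attempted."""
    new_versions = repo.versions_after(last_built)
    if not new_versions:
        return last_built              # nothing changed, so no build
    target = new_versions[-1]          # build the newest version seen at poll time
    source = repo.get_version(target)  # get by folder version
    if build(source):
        # Label only on success, and only the version that was actually built,
        # even if more commits landed while the build was running.
        repo.label_version(label, target)
    return target
```

Because the label is applied only after a successful build, there is no label to clean up when a build fails, which is what made the old sequence so fragile.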
Perfect. Just one minor problem: Vault’s command-line client didn’t support most of it. A plug-in with access to the full API could do this no problem, but we’d already decided not to do that. So I had some changes to include in the next release of Vault. And I had to figure out how to smoothly support both the new way and the old way, and the old way still had to basically work. I had to dive into the many possible code-paths and fix the bugs in the old way, as best as possible, and add version checking code to use the new way when possible. I wrote what seemed like 100 unit tests to ensure all the permutations of options worked. (In reality it was closer to 40 unit tests. And despite my attempt to enumerate all the possibilities, I initially missed at least one.)
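For the curious, the version check amounts to something like the following Python sketch. The function names and the strategy strings are made up for illustration, and the real check lives in the CC.NET source control block rather than in Python. The only fact taken from this post is that the new way needs Vault 3.1.7 or later.

```python
# Assumption: versions are dotted integers such as "3.1.8".
MIN_NEW_WAY_VERSION = (3, 1, 7)


def parse_version(text: str) -> tuple:
    """Turn "3.1.8" into (3, 1, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_strategy(vault_version: str) -> str:
    """Pick the folder-version workflow when the installed Vault supports it."""
    if parse_version(vault_version) >= MIN_NEW_WAY_VERSION:
        return "folder-version"   # the new way: poll, get, and label by folder version
    return "label"                # the old way: label, get by label, remove on failure
```

Comparing tuples of integers instead of raw strings matters here, since a plain string comparison would sort "3.1.10" before "3.1.7".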
It was right around this time that Thoughtworks released CC.NET 1.0 Final, without any of my work. Doh! Not that I blame them: they’ve got well over a dozen source control systems to support, and the 1.0 release had been years in the making. And thanks to continuous integration there would be a tested build including my stuff as soon as I could get it submitted.
Intermittent change-polling failures
I listed this as number 2 because I tackled it second, but it was actually the biggest complaint. You had to be pretty savvy to hunt down the root of the error, so people were never sure if it was an intermittent network problem, CC.NET, or Vault. The good detectives seemed to be fingering Vault, and some preliminary testing confirmed that something was amiss in Vault-land. I was fresh off several months of server performance testing and tuning, so setting up an environment simulating a very busy Vault server took only a few minutes. Adding 20 or so CC.NET projects took a few minutes more. With this kind of simulated load on the Vault server, it took less than 5 minutes to start seeing the problems people were reporting. There were a handful of deadlocks occurring in Vault’s MS SQL Server database.
The worst problem occurred when the various CC.NET projects logged into Vault. This was surprising, but it turns out CC.NET is typically deployed in a configuration that would be very unusual outside a continuously integrated environment: lots of simultaneous logins of the same user. A long time ago, (in a galaxy far, far… no wait. Right here in Champaign, actually.) somebody working on the Vault server actually took explicit steps to gradually slow down the login process when this occurs. The idea at the time was that this probably represented some kind of brute force password crack attempt. Whether or not this was a good decision is perhaps debatable, but at the end of the day it caused real problems for CC.NET users and didn’t provide enough tangible benefit to offset that. So I removed it. I also synchronized the order in which database tables are accessed during a login. Those two minor changes fixed about 95% of the deadlocks that caused false build failure alarms.
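If the "synchronized the order" part sounds abstract, here is the general principle in miniature. This is an analogy using Python thread locks, not Vault's actual SQL, and the table names are invented: when every caller acquires shared resources in the same global order, two callers can never each hold the one the other is waiting for.

```python
import threading

# Hypothetical table names; the post does not describe Vault's real schema.
TABLE_LOCKS = {"sessions": threading.Lock(), "users": threading.Lock()}


def with_tables(names, action):
    """Lock the named tables in one global (alphabetical) order, then run action."""
    ordered = sorted(names)            # the single agreed-upon order
    for name in ordered:
        TABLE_LOCKS[name].acquire()
    try:
        return action()
    finally:
        for name in reversed(ordered):
            TABLE_LOCKS[name].release()


def simulate_logins(count: int) -> int:
    """Run many simultaneous 'logins' and return how many completed."""
    completed = []

    def login(i):
        # Callers ask for the tables in different orders, which would risk a
        # deadlock if each one locked them in the order it asked for them.
        names = ["users", "sessions"] if i % 2 else ["sessions", "users"]
        with_tables(names, lambda: completed.append(i))

    threads = [threading.Thread(target=login, args=(i,)) for i in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(completed)
```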
The remaining 5% were related to getting a version of a folder while new code is being checked in. After lots of tweaking and testing, we determined that these couldn’t be completely eradicated without some major changes on the server. This isn’t the kind of work we like to do for a “third dot” release, what ended up being Vault 3.1.7, so an alternative was in order. I minimized the deadlocking with judicious use of row-locking hints, and added retry code in CC.NET. In the unusual event that one of these deadlocks occurs, CC.NET will automatically retry up to a configurable number of times before failing the build. This is one of those decisions that pains you as an engineer (or as a craftsman), but ultimately eases people’s pain much sooner than The Right Fix would, so you have to live with it and move on. And the pragmatic fix worked.
Under a very high simulated load (over 100 normal Vault users, 20 CC.NET projects polling every 20 seconds), 24 hours a day for two weeks, CC.NET reported no false build failures.
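The retry logic described above is simple enough to sketch. What follows is a minimal Python illustration of the pattern, not the code that went into CC.NET: DeadlockError, poll_with_retry, and the parameter names are all hypothetical, and the real block has its own configuration settings.

```python
import time


class DeadlockError(Exception):
    """Hypothetical marker for a poll that failed because of a server deadlock."""


def poll_with_retry(poll, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call poll(); retry on DeadlockError up to max_attempts before giving up."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return poll()
        except DeadlockError:
            if attempt == max_attempts:
                raise                  # out of retries: now it is a real failure
            sleep(delay_seconds)       # brief pause before trying again
```

Only the deadlock error is retried. Any other exception propagates immediately, so genuine problems still surface as failures on the first attempt.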
By default, Vault will save a backup copy of files it overwrites when getting new code from the repository. Under typical source control usage people rarely even notice this is happening. But when you retrieve every new version over the course of months or years, those backups start to use more disk space than you’d like. Under typical source control usage this is easy to turn off, but the setting is user and machine-specific, and accessible to mortals only via the GUI client. So to turn it off on a CC.NET build machine, you either need to install the GUI client or hack the registry. People felt this was too much work for what should be the default on a build machine, and they were right. Since we were already including changes in a new Vault release, adding a fix for this scenario was no big deal. Vault’s command-line client, as of 3.1.7, allows you to specify a backup option that overrides the user’s preference. When getting source for a build, CC.NET provides that option to prevent the creation of backups.
In addition to these three primary complaints, we fixed several others. For example, you can clean out a source directory before each build, and it’s easier to build from a working directory or not, depending on your requirements. If you want to see the exhaustive list, the CC.NET bug-tracker has it all.
The new code was first included in a build in February this year, and a few minor fixes and tweaks have been released since then. The “new way” mentioned above works with Vault 3.1.7, but there were a couple of other fixes made in 3.1.8, so that’s what I recommend. We’re running smoothly with a pre-release Vault 3.5 build and CC.NET build 184.108.40.2069. Finally, the number of people reporting trouble on the mailing list seems to have tapered off nicely, so I think we’ve gone a long way toward relieving those users’ pain.