Quality of Service

I clearly take my job a little too seriously.

I’ve been filling up some free hours these past couple of weeks putting some polish on ZingLists. Most of the changes involve tweaking a few things here and there, but the big one is that I’m adding a mobile interface for the most essential features, and that necessitated moving from Rails 1.2 to Rails 2.0.

Everything went smoothly during development, but late Friday, when I pushed the code to production, anything that touched the new code died with a scary-looking illegal instruction signal. My production server runs FreeBSD, and I had upgraded a few other packages to stay current with security issues and bug fixes, so this left me with the cold feeling of a completely dead site late on a Friday afternoon.

As it turns out, while I haven’t solved the problem yet, I got lucky thanks to some over-engineering on my part when I first wrote my Capistrano deployment recipes. I do deployments in two stages. Stage 1: push the code, run migrations and background tasks (e.g., generating a Google sitemap). Stage 2: move configuration files into place, switch the “current” symlink and restart processes. Since I hadn’t run stage 2 yet, the live site was still running the old code and everything was still working fine.
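For anyone who wants the shape of that two-stage idea, here is a rough plain-Ruby sketch. It is not the actual Capistrano recipe: the `TwoStageDeploy` class, its method names, and the `REVISION` marker file are all made up for illustration, and the real migration and restart steps are only comments.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical stand-in for a two-stage deployment (not the real recipes).
class TwoStageDeploy
  attr_reader :root

  def initialize(root)
    @root = root
    FileUtils.mkdir_p(File.join(root, "releases"))
  end

  def current_link
    File.join(root, "current")
  end

  # Stage 1: put the new release on disk and do the prep work,
  # but leave the "current" symlink alone so the live site is untouched.
  def stage_one(release_name)
    dir = File.join(root, "releases", release_name)
    FileUtils.mkdir_p(dir)
    File.write(File.join(dir, "REVISION"), release_name)
    # Migrations and background tasks (e.g., sitemap generation) would run here.
    dir
  end

  # Stage 2: point "current" at the staged release.
  def stage_two(release_name)
    dir = File.join(root, "releases", release_name)
    raise ArgumentError, "release #{release_name} has not been staged" unless File.directory?(dir)

    tmp = "#{current_link}.tmp"
    FileUtils.rm_f(tmp)
    File.symlink(dir, tmp)
    File.rename(tmp, current_link) # replaces the old symlink in one step
    # Configuration files and process restarts would be handled here.
    dir
  end

  # Which release is the live site actually serving?
  def live_revision
    File.read(File.join(current_link, "REVISION"))
  end
end
```

With this layout, calling `stage_one("r2")` while `r1` is live leaves `live_revision` at `"r1"` until `stage_two("r2")` runs, which is the gap that kept the site up.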

Twitter also pushed some new code on Friday, but they’re having problems big enough that the site may as well be down.

ZingLists is microscopic compared to Twitter in just about every way imaginable. I’m one guy with no VC; Twitter is a dozen-ish people with at least a few million in VC. Neither of us is making any money off our web sites. Shouldn’t I be the one who doesn’t care about uptime?

I’m optimistic that I will solve my “illegal instruction” problem with Rails 2.0, so in the interest of priming this post for anyone else who runs into the same thing, here’s a summary of the issue. (I expect most of you will stop reading here.)

The “illegal instruction” seems to be a case of the Ruby interpreter not dealing well with a stack overflow. This could be from deep recursion, but I doubt it in my case since the same code works fine in development. Even simple things like List.find(:all) from script/console cause the problem, and that’s a simple ActiveRecord query.

On FreeBSD 6.2, the Ruby port is always compiled with pthreads. Because of the way that this version of FreeBSD is written, if a dynamic shared object loaded into a running program needs a shared library (like pthreads), that library must be linked to by the executing program. In this case, that’s the Ruby interpreter. The default pthreads stack size on FreeBSD is relatively small compared to other operating systems, though I couldn’t track down an exact number. It’s in the ballpark of 64 KB or 128 KB. Linux usually uses something closer to 1 or 2 MB.

My theory at the moment is that Rails 2.0 nests method calls deeper than 1.2, just enough to exhaust the stack and trigger the illegal instruction signal. I’m guessing the signal shows up because the stack overflow results in some random bit of code being executed. I’m on the AMD64 platform, and I believe it marks non-code pages with the NX (no execute) bit, so a jump off to a random address is highly likely to be caught by the kernel.
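One way to get a feel for how much call depth a given stack allows is to recurse until the interpreter gives up and count the frames. The sketch below assumes a reasonably current Ruby, which normally reports stack exhaustion as a `SystemStackError` exception rather than dying on a signal; `dive` and `max_recursion_depth` are names invented for this example, and the numbers it prints will vary by platform and Ruby build.

```ruby
# Recurse forever, bumping a shared counter on every call.
def dive(counter)
  counter[0] += 1
  dive(counter)
end

# Return roughly how many nested calls fit before the stack runs out.
def max_recursion_depth
  counter = [0]
  begin
    dive(counter)
  rescue SystemStackError
    # Expected: the stack is exhausted, and the counter holds the depth reached.
  end
  counter[0]
end

# Compare the main thread with a freshly created thread, since thread
# stacks are often sized differently from the main stack.
main_depth   = max_recursion_depth
thread_depth = Thread.new { max_recursion_depth }.value

puts "main thread depth: #{main_depth}"
puts "new thread depth:  #{thread_depth}"
```

Running the same script on two machines gives a crude comparison of their effective stack limits, which is the kind of difference the theory above depends on.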

I don’t really want to recompile or hack the Ruby interpreter to make pthreads stacks larger, since I’d have to make the same fix every time Ruby is updated. I’m going to try to replicate the problem in a 6.2 virtual machine, then upgrade it to 6.3 and possibly 7.0 to see if either of those solves it. Barring a response from the Ruby port maintainer, I may even look at switching to Ubuntu Linux, though I hate to do that since I’m otherwise very satisfied with FreeBSD.

Updated April 24: As usual, admitting in public that I have a problem with some code virtually guarantees that the problem is in my code. I had an infinite recursion bug caused by an <tt>:include</tt> on an association. FreeBSD simply died really fast, probably due to its small thread stack, but I was also missing an adequate test case to catch the problem.
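The actual bug isn’t shown here, so the following is only a hypothetical illustration of the kind of test that would catch it. It models two associations that include each other with plain hashes (`INCLUDES`, `ASSOCIATION_TARGET`, `expand_includes`, and `overflows_stack?` are all invented names, not ActiveRecord APIs) and checks whether expanding them blows the stack.

```ruby
# Hypothetical stand-ins for two models whose associations include each other.
INCLUDES = {
  :list => [:items],
  :item => [:list] # the cycle: list -> items -> list -> ...
}
ASSOCIATION_TARGET = { :items => :item, :list => :list }

# Naive expansion of nested includes with no cycle guard.
# On a cyclic definition this recurses until the stack is exhausted.
def expand_includes(model, includes = INCLUDES)
  includes.fetch(model, []).map do |assoc|
    { assoc => expand_includes(ASSOCIATION_TARGET.fetch(assoc), includes) }
  end
end

# Test helper: true if the block exhausts the stack, false if it finishes.
def overflows_stack?
  yield
  false
rescue SystemStackError
  true
end
```

A regression test would then assert that `overflows_stack? { ... }` is false for whatever call loads the association, so a runaway include fails in development instead of on the production box.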