The Details of Parallel Processing Are the Hardest Part

Mike Ash has an absolutely excellent post up today discussing the importance of the details when designing an application for parallel processing.

His post focuses on using Grand Central Dispatch (part of Mac OS X 10.6 “Snow Leopard”), but the core point is valid for any approach to multiprocessing: it’s not just a simple matter of queueing jobs. You have to consider the impact of those jobs on the system as a whole.

Running Sweepers from a Model

Oh, the pain. Over the last 24 hours I have fought an exhausting battle with Rails and the testing environment to do a couple of seemingly simple things:

  1. Expire cached pages and fragments from outside the context of a normal HTTP request
  2. Test it.

Testing is particularly difficult because Rails does not offer any built-in mechanism to test an application’s caching, and I’ve had problems in the past with the only plug-in that does it (cache_test) on Rails 2.x. I’ve released a new plug-in, called Banker, that provides assertions and the necessary support to test caching, including Shoulda macros.

The first item has come up for me more than once. Complex applications often have scheduled jobs that make changes to the database. If the application does any caching, there is a good chance that these jobs will affect content that is cached. The problem is that expiring caches from outside the context of a controller + request is a pain. Here is the solution:

Create your sweepers as you normally would. Then, either within test code or your script/runner code:

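# Registers the given sweepers as observers and hands each one a stand-in
# controller, so they can expire caches outside of an HTTP request.
def setup_cache_sweepers(*sweepers)
  sweepers = sweepers.flatten

  ActiveRecord::Base.observers = sweepers
  ActiveRecord::Base.instantiate_observers

  # Build a controller with a fake request; the host is needed to generate URLs.
  returning ActionController::Base.new do |controller|
    controller.request = ActionController::TestRequest.new
    controller.request.host = URL_HOST
    controller.instance_eval do
      @url = ActionController::UrlRewriter.new(request, {})
    end

    # Give every sweeper the controller it will use for expiration calls.
    sweepers.each do |sweeper|
      sweeper.instance.controller = controller
    end
  end
end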

URL_HOST is a constant, defined in each environment file, with the host:port part of a URL. It is needed in order to generate URLs, and more importantly, fragment cache keys, outside the context of an HTTP request.
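
As a minimal sketch of why the host matters, assume URL_HOST is set to a development value and that a fragment cache key is built from the full URL with the protocol dropped and a "views/" prefix added, which is how Rails 2.x appears to do it. The host value and the sketch_fragment_key helper are hypothetical, for illustration only:

```ruby
# In config/environments/development.rb (the value is an assumption):
URL_HOST = 'localhost:3000' unless defined?(URL_HOST)

# Hypothetical helper, not part of Rails: imitates deriving a fragment
# cache key from a full URL by dropping the protocol.
def sketch_fragment_key(path, host = URL_HOST)
  url = "http://#{host}#{path}"
  "views/#{url.split('://').last}"
end

puts sketch_fragment_key('/users')
```

If the host used by a scheduled job differs from the one used when the fragment was written, the keys will not match and the expiration will miss.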

The method returns a controller. For unit test code, hang on to it, because it gives you access to named routes:

@controller.send(:users_path)

For code run from script/runner, nothing else is needed. You’ve already instantiated your sweepers and given them the controller instance, so anything you do to a model instance that generates a callback to the sweeper will use that controller, including calling cache expiration methods.
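
To see why handing the controller to each sweeper matters, here is a toy model in plain Ruby. ToyController and ToySweeper are hypothetical stand-ins, not Rails classes. The sketch assumes that a sweeper forwards expiration calls to its controller and quietly does nothing when it has none, which is how Rails 2.x sweepers seem to behave:

```ruby
# Toy model only: these classes are hypothetical stand-ins, not Rails code.
class ToyController
  attr_reader :expired

  def initialize
    @expired = []
  end

  # Stand-in for a controller's cache expiration method.
  def expire_fragment(key)
    @expired << key
  end
end

class ToySweeper
  attr_accessor :controller

  # Model callback: expire_fragment is not defined here, so the call
  # falls through to method_missing below.
  def after_save(record)
    expire_fragment("users/#{record}")
  end

  private

  # Forward unknown calls to the controller; do nothing without one.
  def method_missing(name, *args, &block)
    return if controller.nil?
    controller.send(name, *args, &block)
  end

  def respond_to_missing?(name, include_private = false)
    (controller && controller.respond_to?(name, true)) || super
  end
end
```

In this model, a save made before the controller is assigned expires nothing and raises no error, which is the kind of silent failure that makes the problem hard to spot.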

Manage vendor/rails with Git on a Subversion Project

Here is something I’ve been experimenting with over the last week or so, and it’s working out very nicely so far:

Use Git to manage vendor/rails when your project is using Subversion.

Here’s how:

First time freezing? Easy:

  1. cd vendor
  2. git clone git://github.com/rails/rails.git
  3. svn add -N rails; svn ps svn:ignore .git rails
  4. cd rails; git checkout v2.3.3.1 (or whatever version you want)
  5. svn add *

Switching to a different version of Rails is now as simple as:

  1. cd vendor/rails
  2. git checkout master
  3. git pull origin master
  4. git checkout whatever
  5. svn add `svn st | grep ^\? | cut -f7 -d" "`
  6. svn rm `svn st | grep ^\! | cut -f7 -d" "`
  7. svn commit
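
Steps 5 and 6 rely on cut to pull paths out of svn st, which depends on the exact column layout and breaks on paths containing spaces. A rough alternative sketch in Ruby follows, assuming that on ? and ! lines everything after the first column is padding followed by the path. The method names and the sample paths are made up for illustration:

```ruby
# Return the paths on `svn status` lines whose first column equals `flag`
# ('?' = unversioned, '!' = missing).
def svn_paths_with_status(status_output, flag)
  status_output.each_line
               .map(&:chomp)
               .select { |line| line.start_with?(flag) }
               .map { |line| line[1..-1].lstrip }
end

# Illustrative input only; these paths are invented.
SAMPLE_STATUS = <<~STATUS
  ?       railties/lib/new file.rb
  !       activesupport/lib/old.rb
  M       actionpack/CHANGELOG
STATUS

# Run from vendor/rails: add new files and remove deleted ones.
def sync_svn_with_working_copy
  status = `svn status`
  added = svn_paths_with_status(status, '?')
  removed = svn_paths_with_status(status, '!')
  system('svn', 'add', *added) unless added.empty?
  system('svn', 'rm', *removed) unless removed.empty?
end
```

Passing the paths to system as separate arguments avoids shell word splitting, so a file name with a space survives intact.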

If you already have a vendor/rails and you want to try this technique out, you should first clone the Git repository to a temporary directory, git checkout the version that matches what is already in your vendor/rails, then copy (or move) the .git directory into vendor/rails. Add the svn:ignore property as above and you’re all set.

Here’s why I’m doing this: when a new Rails release comes out, some files are changed, some added and others deleted. Because Subversion litters a checked-out project with its .svn directories, you can’t just delete the entire thing and re-freeze without losing local patches (yes, I’ve had to do this) and history (which isn’t terribly important, but can be nice). Even ignoring those two reasons, completely deleting and adding vendor/rails will cause your Subversion repository to grow more than necessary (Rails 2.3.3 checks in at 35 MB).

Git, by putting everything in a top-level .git directory, makes itself easy for Subversion to ignore. Checking out a tag to switch to a different release is simple, and Git deletes files, unlike applying a patch, which truncates deleted files to 0 bytes, requiring a find to actually remove them.

ssh-agent on Mac OS X 10.5

For as long as I can remember, I’ve been using a tool called SSHKeychain on Mac OS X to manage ssh-agent and my identities, to make logging into remote servers secure, yet password-free.

Lately, however, something has changed and SSHKeychain isn’t able to keep track of my keys. The result is that instead of rarely typing my passphrases, I’m doing it constantly. I think it started around the time I updated to 10.5.8.

Turns out that Leopard has much better support for ssh-agent built-in and SSHKeychain isn’t necessary. Dave Dribin’s blog lays it all out: ssh-agent on Mac OS X 10.5 and, for the security conscious, Securing ssh-agent on Mac OS X 10.5.

A couple of things to watch out for:

  • If you are switching from SSHKeychain, remove the environment override for SSH_AUTH_SOCK from ~/.MacOSX/environment.plist.
  • To get the GUI passphrase dialog and the option to save the passphrase in your keychain, you must use the system ssh, not one from Fink or MacPorts.

CakePHP Quick Start Guide for Experts

I’ve started a new project and, to my disappointment, the hosting environment won’t support Ruby. It is the client’s own facility, so using another provider is not an option.

I looked at the web frameworks page at Wikipedia and picked out six for a deeper look: Akelos, CakePHP, CodeIgniter, Kohana, Symfony and Zend. My major requirements were:

  • PHP (<= 5.1)
  • MVC
  • ORM
  • Test framework
  • Some support for database migration
  • Caching
  • Maturity: stable code and a healthy community

In a nutshell, something approximating Rails. I settled on CakePHP. Much of what makes Rails great can be traced directly back to Ruby, and you can’t approach that level of expressiveness in PHP, so an approximation is the best we can do.

Coming up to speed on a new framework is always a stop-and-go affair, but Matt Curry’s Super Awesome Advanced CakePHP Tips (free PDF e-book) is terrific. Highly recommended if you already know what you’re doing and just need answers to “how do I… in CakePHP?”

Updating RubyGems to Recent 1.3.x

The RubyGems update process can be temperamental. If you fall more than a release or two behind, you might find yourself in a dependency cycle that stops an update cold. Recently, I tried to update a 1.2.0 install to the current 1.3.5.

# gem update --system
Updating RubyGems
Updating rubygems-update
Successfully installed rubygems-update-1.3.5
ERROR:  While executing gem ... (NameError)
    undefined local variable or method `remote_gemspecs' for
#<Gem::Commands::UpdateCommand:0xb7e26640>

This is a known issue. Running the command again results in “nothing to update,” also a known issue. The fix is to separately install rubygems-update and run update_rubygems.

# gem install rubygems-update
Successfully installed rubygems-update-1.3.5
1 gem installed
Installing ri documentation for rubygems-update-1.3.5...
Installing RDoc documentation for rubygems-update-1.3.5...
Could not find main page README
Could not find main page README
Could not find main page README
Could not find main page README
# update_rubygems
/usr/lib/ruby/site_ruby/1.8/rubygems.rb:578:in `report_activate_error':
Could not find RubyGem builder (>= 0) (Gem::LoadError)
	from /usr/lib/ruby/site_ruby/1.8/rubygems.rb:134:in `activate'
	from /usr/lib/ruby/site_ruby/1.8/rubygems.rb:158:in `activate'
	from /usr/lib/ruby/site_ruby/1.8/rubygems.rb:157:in `each'
	from /usr/lib/ruby/site_ruby/1.8/rubygems.rb:157:in `activate'
	from /usr/lib/ruby/site_ruby/1.8/rubygems.rb:49:in `gem'
	from /usr/bin/update_rubygems:18

To fix this “report_activate_error”, I installed builder. I tried the update again and hit the same error for session and hoe-seattlerb. This is where I found myself in a dependency cycle.

rubygems-update 1.3.5 requires (among other things) hoe-seattlerb
hoe-seattlerb requires hoe >= 2.3.0
hoe >= 2.3.0 requires rubygems >= 1.3.1

I might have the specific dependency chain a bit off, but it’s enough to say that something required by rubygems-update itself requires a semi-recent RubyGems.
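
The last link in that chain can be checked with the version classes that ship with current RubyGems. This is a small sketch; the satisfies? helper is a name made up here, and the requirement string is the one quoted above:

```ruby
require 'rubygems'

# True when the version string `installed` satisfies `requirement`.
def satisfies?(installed, requirement)
  Gem::Requirement.new(requirement).satisfied_by?(Gem::Version.new(installed))
end

puts satisfies?('1.2.0', '>= 1.3.1') # => false
puts satisfies?('1.3.1', '>= 1.3.1') # => true
```

A 1.2.0 install fails the requirement, while 1.3.1 meets it.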

The solution is to update in stages. First update to 1.3.0, then update the rest of the way:

# gem install rubygems-update -v 1.3.0
Successfully installed rubygems-update-1.3.0
1 gem installed
# update_rubygems
Installing RubyGems 1.3.1
...
# gem update --system
Updating RubyGems
Updating rubygems-update
Successfully installed rubygems-update-1.3.5
:0:Warning: Gem::SourceIndex#search support for String patterns is deprecated
Updating RubyGems to 1.3.5
Installing RubyGems 1.3.5
RubyGems 1.3.5 installed

Capistrano Deployments from GitHub

Late last week I spent a chunk of time helping a client troubleshoot a deployment problem with Capistrano. For several weeks, I’ve been working on the site, deploying without any trouble, but when he tried to do it himself, it failed:

$ cap deploy
  * executing `deploy'
  * executing `deploy:update'
 ** transaction: start
  * executing `deploy:update_code'
    updating the cached checkout on all servers
  * executing (...long git command snipped...)
    servers: ["example.com"]
    [example.com] executing command
 ** [example.com :: err] Permission denied (publickey).
 ** [example.com :: err] fatal: The remote end hung up unexpectedly
    command finished
*** [deploy:update_code] rolling back

The problem was that I’d created a new user on his server to separate the production site from the staging site, and the staging user did not have an SSH key pair to access GitHub. It worked for me because I did have my SSH key on GitHub and an SSH agent in place that allowed authentication to be done through me instead of the server.

Capistrano’s error message in this case is not very helpful. If you think you’re having a similar problem, try cap shell and run a command, such as uptime. If that works, the problem is not permissions on the connection to your server; it is probably the connection from your server to GitHub. Either set up an SSH agent or create a new key pair specifically for deployments.

Choose Your Words Wisely

One of my recent projects is taking over development of a Rails site to prepare it for a private beta and ultimately a public launch. Aside from the usual learning curve of understanding a large existing code base, I’ve been struggling more than I should with some of the terms chosen by the original developers. It reminds me yet again that naming is hard, but also that names are important.

Names are important because they instantly put your brain in a certain context. If you choose a poor name, while you may not have any trouble with it, new developers or your customers might find themselves nearly unable to understand your intention. Simply using a different word can avoid this problem entirely.

Consider the environment in which names exist, too. One of the core models in this project is Action, which turns out to be a very unfortunate name in a Rails project. It is natural to want to use a local variable called action in code that uses this model, but that gets messy because a method that processes a request in a Controller is also called an action. This code “worked around” the problem by using actions instead, but being plural, it makes my brain want to treat it as an array, when it usually isn’t.

Another term used in a lot of the partials is node. Again, considering the context, this is an unfortunate choice because when you talk about the DOM as a tree, it is made up of nodes. There is now an extra moment of thought when I see this term: is this talking about the application’s node or a DOM node?

Names are meant to ease comprehension of a system by putting it in familiar terms. As complexity increases, so does the importance of good naming. Spend the extra time settling on good names. It’s much harder to change them later, when they have spread throughout your code and your team’s vocabulary.

Testing HTTP Digest Authentication in Rails

Rails 2.3 introduced HTTP Digest authentication to go along with HTTP Basic as simple ways to authenticate access to your application. HTTP Digest is more secure than Basic for several reasons. First, no passwords are transmitted in cleartext (Basic only Base64 encodes them — there is no encryption). Second, the content of the HTTP_AUTHORIZATION header is tied to a request method and URI. Third, there is a limited window of time an attacker could re-use the header in a replay attack even against the same request method and URI (five minutes in 2.3.2).
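
To make the first two points concrete, here is a sketch of the response hash a client sends, following the qop=auth formula in RFC 2617. The digest_response method and its parameter names are made up for illustration; this is not Rails’ internal code:

```ruby
require 'digest/md5'

# Compute the Digest "response" value for qop=auth, per RFC 2617.
# The password only ever appears inside the first MD5 hash.
def digest_response(username:, realm:, password:, http_method:, uri:,
                    nonce:, nc:, cnonce:, qop: 'auth')
  ha1 = Digest::MD5.hexdigest("#{username}:#{realm}:#{password}")
  # The request method and URI are hashed in, so the header cannot be
  # reused for a different request.
  ha2 = Digest::MD5.hexdigest("#{http_method}:#{uri}")
  Digest::MD5.hexdigest([ha1, nonce, nc, cnonce, qop, ha2].join(':'))
end
```

Changing only the method from GET to POST, or the URI, yields a different response value, and the server-issued nonce is what limits how long a captured header stays useful.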

These are compelling reasons to switch to Digest from Basic, but there is one problem. How do you test it? For good test coverage, clearly you want to verify that actions that should be protected are, and that only valid username/password combinations permit access.

Basic is easy (in test_helper.rb):

def authenticate_with_http_basic(user = 'one', password = 'one')
  @request.env['HTTP_AUTHORIZATION'] = "Basic #{Base64.encode64("#{user}:#{password}")}"
end

Try this for Digest (also in test_helper.rb, outside the ActiveSupport::TestCase class):

require 'digest/md5'
 
class ActionController::TestCase
  def authenticate_with_http_digest(user = 'admin', password = 'admin', realm = 'Application')
    unless ActionController::Base < ActionController::ProcessWithTest
      ActionController::Base.class_eval { include ActionController::ProcessWithTest }
    end
 
    @controller.instance_eval %Q(
      alias real_process_with_test process_with_test
 
      def process_with_test(request, response)
        credentials = {
          :uri => request.env['REQUEST_URI'],
          :realm => "#{realm}",
          :username => "#{user}",
          :nonce => ActionController::HttpAuthentication::Digest.nonce,
          :opaque => ActionController::HttpAuthentication::Digest.opaque,
        }
        request.env['HTTP_AUTHORIZATION'] = ActionController::HttpAuthentication::Digest.encode_credentials(
          request.request_method, credentials, "#{password}", false
        )
        real_process_with_test(request, response)
      end
    )
  end
end

Now, precede a call to get, post, put or delete with a call to authenticate_with_http_digest. The code above monkeypatches process_with_test to set up the proper HTTP_AUTHORIZATION header, but only for the current @controller, then runs the request normally.

Natural Language Date & Time Parsing for ActiveRecord (Rails 2.1+)

If you read my previous post on natural language date & time parsing for ActiveRecord and tried it on Rails 2.1 or later, you may have noticed it works fine for dates, but not dates with times.

As part of the time zones feature introduced in Rails 2.1, changes were made to how ActiveRecord parses date/time columns during attribute assignment. The overridden method I showed in my previous post is never called, and Chronic doesn’t get a chance to parse the string.

For Rails 2.1 and later, you instead need to wrap ActiveSupport::TimeZone#parse with a new version:

class ActiveSupport::TimeZone
  # Try the stock parser first; fall back to Chronic for natural language.
  def parse_with_chronic(string)
    time = parse_without_chronic(string)
    if time.nil?
      time = Chronic.parse(string, :now => self.now)
      # Chronic returns nil when it cannot parse the string either.
      time = self.local(time.year, time.month, time.day, time.hour, time.min, time.sec) if time
    end

    time
  end

  alias_method_chain :parse, :chronic
end
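
For readers unfamiliar with alias_method_chain, the same wrapping can be sketched in plain Ruby. ToyParser and its fallback are hypothetical stand-ins (the fallback plays the role of Chronic), and the two alias_method lines are what alias_method_chain :parse, :fallback would expand to:

```ruby
class ToyParser
  # Strict parser: understands only YYYY-MM-DD, returns nil otherwise.
  def parse(string)
    match = /\A(\d{4})-(\d{2})-(\d{2})\z/.match(string)
    match && Time.utc(*match.captures.map(&:to_i))
  end

  # Wrapper: try the original parser first, then the fallback.
  def parse_with_fallback(string)
    parse_without_fallback(string) || fallback_parse(string)
  end

  # Keep the original under a new name, then point `parse` at the wrapper.
  alias_method :parse_without_fallback, :parse
  alias_method :parse, :parse_with_fallback

  private

  # Stand-in for a natural-language parser such as Chronic.
  def fallback_parse(string)
    string == 'the epoch' ? Time.utc(1970, 1, 1) : nil
  end
end
```

Callers keep calling parse and never need to know that a second parser sits behind it.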