Debugging Rails processes under Phusion Passenger with nginx.

14 Jun 2010

If you are involved in web operations you'll no doubt be familiar with this problem: Everything is running along smoothly and suddenly an anomalous load spike occurs which does not correlate to a traffic spike or any problem on which you have any data. Last week this very thing occurred in a mature Rails application which runs under Phusion Passenger on nginx. The initial symptom was a rapid CPU spike (Rails processes in user space) and then Rails processes becoming stuck on a single request. This rapidly brought on a typical web server death spiral: Increasing CPU usage, increasing memory usage, eventually swapping occurs and/or the application server runs out of available processes in the pool. Frustrated users hit refresh and the issue compounds.

In my experience the order of likely causes of a problem goes like this: Buggy code recently deployed, large and expensive reports being run, or network (or firewall) problems causing processes to hang on IO operations. How to troubleshoot this problem on Passenger?

Checking the output of passenger-status we were seeing Rails processes not completing requests, the "Processed" column for certain Rails processes were not incrementing, indicating these processes were somehow stuck, thereby decreasing the pool of available processes for subsequent requests. Attaching to these stuck processes with strace revealed no activity in the process at all. Similarly, using gdb.rb revealed nothing immediately useful. Strangely we were seeing some Rails processes continuing to run even after they were no longer being actively managed by Passenger, at least according to the output of passenger-status, in which these processes no longer appeared.

First I upgraded Passenger (using Chef, of course) to the latest version on one server, to see if there was a Passenger bug causing this problem. The issue persisted. Making things a bit more complicated we had recently moved a memcached instance to a different part of the network, and all the stuck Rails processes had open network connections to this memcached instance, leading me to wonder if it was this network IO which was somehow causing the Rails process to hang. I could find no problems between the application servers and the memcached port and no logs of any network issues there. So what are the stuck Rails processes stuck on?

The way to troubleshoot this problem on Passenger is to  issue a kill -SIGABRT to the stuck Rails processes, and find their backtrace in the Rails log. Looking at the backtraces we were able to track the issue down to a bug in a Ruby gem which was being used to generate certain types of reports. Disabling this reporting prevented new occurrences of the problem until the underlying issue can be resolved. An interesting combination of two usual suspects: Reporting and code library bugs.