Kicking ass with Sidekiq
Not that they need my endorsement, but Sidekiq kicks some serious ass. Doing a ton of high-latency communication in the form of emails and text messages lends itself far better to the threaded model than to the process model. We saved gobs of time and probably a factor of 50 in cost after switching to Sidekiq.
Delayed Job was the only thing I had ever used to run background processes and, although it works well when coupled with foreman, it isn't the friendliest solution when you don't have an infinitely parallel slack pool at your disposal like those running on Heroku. With Delayed Job, each process must complete a single job before moving on to the next, so each unit of concurrency requires its own process. I estimate something like 25ms of Ruby processing time for every 975ms spent waiting for Mailgun or Twilio to return from the API call.
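To make the mismatch concrete, here's the back-of-envelope math using that 25ms/975ms estimate (the numbers are rough, but the ratio is what matters):

```ruby
# Rough throughput math for a job that needs 25 ms of Ruby time
# and spends 975 ms waiting on Mailgun/Twilio.
cpu_ms  = 25.0
wait_ms = 975.0

# Process model (Delayed Job): one job at a time per process,
# so each process delivers about one message per second...
per_process_rate = 1000.0 / (cpu_ms + wait_ms)  # ~1 msg/sec

# ...while using only 2.5% of a core's CPU time. The other 97.5%
# is spent asleep on a network socket.
cpu_utilization = cpu_ms / (cpu_ms + wait_ms)   # 0.025
```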
StatusPage does most things in feast or famine fashion - either we're hosting lots of traffic and sending lots of notifications, or things are cricket silent. Not only that, notifications are time sensitive and need to go out ASAP.
Starting out with Delayed Job was tough: each unit of concurrency required enough RAM for, and its own boot of, the Rails environment. On a basic m1.medium instance we could run something like 20 concurrent workers, but getting all of the processes launched took about 7 minutes! Using foreman to publish to upstart, we did some pretty nasty hacks to have all "processes" start at once, but with each sleeping a certain amount first so that we could slowly ramp up the processes one by one (remember, they all have to boot the full Rails environment).
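The hack looked something like this in the Procfile (worker names and sleep intervals here are illustrative, not our exact values; the real file had one entry per worker):

```
worker_0: bundle exec rake jobs:work
worker_1: sleep 20 && bundle exec rake jobs:work
worker_2: sleep 40 && bundle exec rake jobs:work
worker_3: sleep 60 && bundle exec rake jobs:work
```

Exported to upstart via foreman, every "process" starts immediately, but each only actually begins booting Rails once its sleep expires.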
The Long Wait
Even once we got all that figured out, there was a grave mismatch between the amount of Ruby processing and the amount of waiting for the network to return. You can imagine even a single-core m1.medium machine, with all those Rails processes, still not utilizing a full core. The RAM was tapped out while most of the processor time was spent asleep, waiting on network sockets.
After all of the work of coaxing the elephant up on the stand, we took a good hard look at Sidekiq and realized it was exactly what we would need to solve the CPU/RAM mismatch that came with the process model for background jobs.
Moving to Sidekiq
Sidekiq is a single Rails process that delegates work to worker threads to perform. Because MRI Ruby can only ever execute one thread at a time (thanks to the global interpreter lock), it's not recommended you do any computationally intensive tasks (like image processing) using Sidekiq - Delayed Job would be much more appropriate for that. Conversely, typical SaaS background tasks almost always involve offloading some network-intensive call from the synchronous web request and, for this, Sidekiq is a darling. Our main use case is around sending notifications via email and SMS, though we do a bit of the computationally intensive kind as well.
Each of our background workers requires just a single boot of Rails (since it's just 1 process), and with our recent move to Ruby 2.0.0 it's faster than ever. Once the process is up, it launches 50 worker threads (configurable) to start consuming jobs off of the Redis queue. Going back to our previous simplification of 25ms of Ruby processing for every 975ms of network latency, we can max out a single AWS core at a throughput of 40+ emails or SMS messages per second outbound from the StatusPage.io app...with only 1 physical process! While one thread is waiting for the network to return, another thread can start some meaningful ruby processing generating an email or a text message.
The process stays occupied as many threads wait, each getting some meaningful work done before shipping it out over the wire. Using our 25ms simplification, we should see a theoretical throughput of about 40 messages / second (40 * 25ms = 1000ms), but in practice we've found that a concurrency of 50 works best for us.
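That overlap is easy to demonstrate with nothing but stdlib threads and a queue. This is a scaled-down sketch, not Sidekiq's actual internals: 40 fake jobs that each "wait on the network" for about a tenth of a second.

```ruby
require "thread" # stdlib; provides Queue

jobs = Queue.new
40.times { |i| jobs << i }
40.times { jobs << :done } # one shutdown marker per worker thread

start = Time.now
workers = Array.new(40) do
  Thread.new do
    until (job = jobs.pop) == :done
      sleep 0.1 # stand-in for waiting on Mailgun/Twilio
    end
  end
end
workers.each(&:join)
elapsed = Time.now - start

# Run serially, these waits would take ~4 seconds; overlapped across
# 40 threads, the whole queue drains in roughly one wait period.
```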
Spinning up more concurrency comes in the form of 1 process and 50 threads. Even if we have to take on a huge customer with 10,000 subscribers, we can clear the whole messaging queue in under a minute with only 4 worker processes (200 threads).
Remember kids, practice safe threads
Most Ruby and Rails libraries seem to assume only a single thread, and strange things begin to surface when you're sharing clients used for communication with external services. For us, this surfaced with the twilio-ruby and redis-semaphore gems.

To fix the twilio-ruby issue, we needed to move from an application-wide client singleton over to initializing a new client for each communication with Twilio.
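The shape of that fix, sketched with a stand-in client class rather than the real Twilio::REST::Client so the pattern is visible without the gem:

```ruby
# Stand-in for an API client that carries per-request state; the
# real offender for us was a shared twilio-ruby client.
class ApiClient
  def send_message(to, body)
    "sent '#{body}' to #{to}"
  end
end

# Before (not thread-safe): every worker thread mutates one shared
# application-wide client.
#   SHARED_CLIENT = ApiClient.new

# After: each delivery builds its own short-lived client, so threads
# never share mutable client state.
def deliver_sms(to, body)
  ApiClient.new.send_message(to, body)
end

results = Array.new(10) { |i|
  Thread.new { deliver_sms("+1555000#{i}", "hello") }
}.map(&:value)
```

A fresh client per job costs a little allocation, but it trades a hard-to-reproduce concurrency bug for an easy-to-reason-about pattern.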
To fix the redis-semaphore issue, we needed to pass the Sidekiq Redis connection around to avoid deadlocking in the synchronization stuff (technical term), ensuring only a single Redis client was communicating with the server. Sidekiq doesn't appear to accept an existing Redis connection (it only accepts a URL), so unfortunately it must become the authoritative connection for all Redis communication application-wide.
Before, with an application-wide client:

```ruby
Redis::Semaphore.new(:name, :redis => Rails.application.redis_client).lock do
  # do protected work
end
```

After, reusing Sidekiq's connection:

```ruby
Sidekiq.redis do |redis|
  Redis::Semaphore.new(:name, :redis => redis).lock do
    # do protected work
  end
end
```
Admin Back End
All hail mperham for creating a fantastic monitoring back end! The wiki has great documentation for mounting sidekiq/web somewhere in your app, and you immediately gain visibility into realtime statistics, manual retry/delete of failed tasks, a historical view of jobs completed, etc.
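Per the wiki, the mount is just a couple of lines in config/routes.rb (the /sidekiq path is your choice, and you'll want to guard it with authentication in production):

```ruby
# config/routes.rb
require "sidekiq/web"

Rails.application.routes.draw do
  mount Sidekiq::Web => "/sidekiq"
end
```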