Recently, in one of my projects, I had to parse a CSV file uploaded by the end user. For performance reasons, I chose to create a new Sidekiq worker for each row in the CSV. All these rows (workers) ran in parallel (standard Sidekiq behaviour). I wanted to execute some logic ONLY after all rows (workers) had finished processing. Sidekiq Pro lets you do this by grouping the workers into a batch: its Batch API provides success and failure callbacks for exactly such scenarios.
Everything worked well in the development and staging environments, until the day a customer reported that he was not receiving the success/failure report for his jobs. I tried to reproduce the scenario described by the user but couldn’t, not even with the same customer account. Then I managed to reproduce it once, and on the next attempt I couldn’t again.
Confused? The problem was that the issue was not consistent, i.e. it occurred randomly. So I started researching the source of the problem.
Back then, I was completely clueless about what went wrong and, of course, very tense. But now, after putting in a lot of effort, I have finally figured out what went wrong, why, and what can be done to prevent it. That’s exactly what this blog post is all about.
Ultimately, what I found was that a combination of three factors was causing it.
First of all, we were using an older version of Sidekiq Pro, 2.0.8, which supported only one fetch strategy, reliable_fetch (deprecated as of Sidekiq Pro 3.4.1).
Still, why does that matter? To understand that, we must first understand how Sidekiq Pro works in general and how it uses the reliable_fetch algorithm.
Sidekiq Pro uses a strategy called reliable_fetch, which internally uses the Redis rpoplpush command. rpoplpush atomically removes the last element of one list, pushes it onto the head of another list, and returns it.
In simple words, when Sidekiq picks up a job, it removes it from the Redis queue and stores it in a private queue for that process while the job executes. If any jobs are still unfinished when the process shuts down, they are pushed back to Redis.
99% of the time, that’s sufficient. But there are limits. Since jobs are stored in-process while executing, if the process crashes or network connectivity goes down, the job can be lost.
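To make the flow above concrete, here is a minimal sketch in plain Ruby. It is a toy model only: arrays stand in for Redis lists, and ToyFetcher with its method names is a hypothetical helper of mine, not part of Sidekiq or Sidekiq Pro. It follows the description in this post rather than the actual Sidekiq Pro source.

```ruby
# Toy in-memory model of the fetch flow described above. Plain Ruby arrays
# stand in for Redis lists; nothing here talks to Redis or Sidekiq.
class ToyFetcher
  attr_reader :public_queue, :private_queue

  def initialize(jobs)
    @public_queue = jobs.dup   # shared queue that every process reads from
    @private_queue = []        # per-process queue of in-flight jobs
  end

  # Mimics rpoplpush: pop the tail of the source list and push it onto
  # the head of the destination list in one step.
  def fetch
    job = @public_queue.pop
    @private_queue.unshift(job) unless job.nil?
    job
  end

  # A finished job is dropped from the private queue.
  def acknowledge(job)
    @private_queue.delete(job)
  end

  # Clean shutdown: unfinished jobs go back to the public queue.
  def requeue_unfinished
    @public_queue.concat(@private_queue)
    @private_queue.clear
  end
end

# Happy path: one job finishes, one is in flight during a clean shutdown.
fetcher = ToyFetcher.new(%w[row-1 row-2 row-3])
done = fetcher.fetch
fetcher.acknowledge(done)
fetcher.fetch
fetcher.requeue_unfinished   # the in-flight job returns to the public queue

# Hard kill: requeue_unfinished never runs, so the in-flight job
# never makes it back to the public queue.
crashed = ToyFetcher.new(%w[row-4])
crashed.fetch
```

The second scenario is the interesting one: nothing is wrong with the fetch itself, but the step that returns unfinished work is exactly the step a hard kill skips.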
Now, add the second factor to the mix: Docker. In our production environment (but not in our development environment ;)), we were running our code inside a Docker container. But that’s still not all.
The third factor: an AWS Auto Scaling group.
That’s where the suspense ends! In our case, the AWS auto scaling policy forced one of the Docker containers to shut down while it was in the middle of executing a job. This took down the whole container (not just the Sidekiq process) and, as you may have guessed by now, it resulted in the loss of the in-memory private queue that Sidekiq had stored, and hence the missing jobs.
Sidekiq Pro’s Batch API fires the success/error callback ONLY when all the child jobs have completed (failed or succeeded). In this case, that never happened, and hence the callbacks never fired. This explains why the client’s issue could be reproduced only intermittently.
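The effect of a single lost job on a batch can be sketched in a few lines of plain Ruby. ToyBatch below is a hypothetical stand-in that I made up for illustration; it is not the Sidekiq Pro Batch API, and it simplifies the callback rules to the behaviour described in this post: nothing fires until every child job has reported back.

```ruby
# Toy model of batch completion: the callback fires only once every
# child job has reported back, whether it succeeded or failed.
class ToyBatch
  attr_reader :fired

  def initialize(job_ids)
    @pending = job_ids.dup   # jobs that have not reported back yet
    @failed = []
    @fired = nil             # stays nil until the batch completes
  end

  # Called when a job finishes; unknown or repeated ids are ignored.
  def report(job_id, ok:)
    return unless @pending.delete(job_id)

    @failed << job_id unless ok
    fire_callback if @pending.empty?
  end

  def complete?
    @pending.empty?
  end

  private

  def fire_callback
    @fired = @failed.empty? ? :success : :failure
  end
end

# Stuck batch: row-3 was lost with its container and never reports back.
stuck = ToyBatch.new(%w[row-1 row-2 row-3])
stuck.report("row-1", ok: true)
stuck.report("row-2", ok: false)

# Healthy batch: every job reports, so the callback fires.
healthy = ToyBatch.new(%w[row-1 row-2])
healthy.report("row-1", ok: true)
healthy.report("row-2", ok: true)
```

In the stuck case the batch is neither a success nor a failure; it simply waits forever for row-3, which is why the customer received no report at all.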
That’s all about the problem. Now it’s time for the solution.
Luckily, Sidekiq Pro 3.4 introduced a new strategy called super_fetch, which is also the default strategy starting with Sidekiq Pro 4.0. It handles such a situation by maintaining two separate queues: public and private. When a process crashes, the unfinished jobs are pushed from its private queue back to the public queue, so they can be picked up again by some other worker. You can compare the available strategies over here.
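The recovery idea can be illustrated with another small plain-Ruby sketch. Again, this is only a toy model under my own assumptions: ToyQueues, recover_orphans and the live_pids argument are hypothetical names, and how the real super_fetch decides that a process is dead is outside the scope of this post.

```ruby
# Toy model of the recovery idea: jobs stranded in a dead process's
# private queue are moved back to the public queue.
class ToyQueues
  attr_reader :public_queue, :private_queues

  def initialize(jobs)
    @public_queue = jobs.dup
    # One private queue per process id, created on first use.
    @private_queues = Hash.new { |hash, pid| hash[pid] = [] }
  end

  # Move the tail of the public queue into the caller's private queue.
  def fetch(pid)
    job = @public_queue.pop
    @private_queues[pid].unshift(job) unless job.nil?
    job
  end

  def acknowledge(pid, job)
    @private_queues[pid].delete(job)
  end

  # Drain every private queue whose owner is no longer alive.
  # live_pids is an assumed input; liveness detection is not modelled.
  def recover_orphans(live_pids)
    (@private_queues.keys - live_pids).each do |dead_pid|
      @public_queue.concat(@private_queues.delete(dead_pid))
    end
  end
end

queues = ToyQueues.new(%w[row-1 row-2])
queues.fetch("proc-a")                # proc-a takes a job, then dies mid-job
first = queues.fetch("proc-b")
queues.acknowledge("proc-b", first)
queues.recover_orphans(["proc-b"])    # only proc-b is still alive
retried = queues.fetch("proc-b")      # the orphaned job gets a second chance
```

The key difference from the earlier failure is that the stranded job survives the death of its process and is visible to whoever does the clean-up, so the batch can eventually complete.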
At the end of the day, all’s well that ends well. I hope someone benefits from my experience through this blog post.
Pro Tip: Sidekiq is available in three variants: Standard (open source edition), Pro, and Enterprise. I would highly recommend going for the Pro version. A detailed comparison is available on the Sidekiq website.