We found a bug. A bug that affected a lot of historical records that we now have the pleasure of reprocessing. Fortunately we already had an async job infrastructure in place with the delayed job gem. Unfortunately, this gem is intended for fairly small batches of records and isn’t tuned to handle 1M+ records in the delayed_jobs table.
After reading some best practices we decided on a queue based approach. To keep our day to day async jobs running we would use a “default” queue. And to reprocess our old records we used a new queue. Starting the workers up with a “–queue” flag did the trick. We had hardware dedicated for day-to-day operations, and new hardware dedicated to our new queue operations. Now it was simply a matter of filling up the queue with the jobs to reprocess the records.
Our initial approach maxed out the CPU on our database server. This was largely due to us not tuning our SQL in our async jobs. Because the volume we processed was always low, this was never really a noticeable problem. But when we threw lots of new jobs into the queues, it became very noticeable. The workers would start up, then mysteriously die. After some digging in /var/log/kern.log we discovered the workers were being killed due to an out of memory manager. Attaching a small swap partition helped, but once you’ve hit swap, things become horribly slow. Whats the point in keeping the worker alive if its running a thousand times slower? Clearly we needed to profile and tune. So we did. (The specifics are out of scope for this article, but it involved consolidating N+1 queries and limiting columns returned by the SELECT).
With our newly tuned SQL our spirits were high as we cranked up the workers again. Only to reach the next process bottleneck. And this is where databases gets frustrating. Delayed job workers run a query each time they are looking for a new job to find out which job to pick up. It puts a mutex lock on the record by setting locked_at and locked_by. The query looks like this:
UPDATE delayed_jobs SET `delayed_jobs`.`locked_at` = '2016-06-05 11:48:28', `delayed_jobs`.`locked_by` = 'delayed_job.2 host:ip-10-203-174-216 pid:3226' WHERE ((run_at <= '2016-06-05 11:48:28' AND (locked_at IS NULL OR locked_at < '2016-06-05 07:48:28') OR locked_by = 'delayed_job.2 host:ip-10-203-174-216 pid:3226') AND failed_at IS NULL) ORDER BY priority ASC, run_at ASC LIMIT 1;
The UPDATE does an ORDER which results in a filesort. Filesorts are typically something an index can resolve. So I optimistically added the following:
CREATE INDEX delayed_job_priority ON delayed_jobs(priority,run_at);
Sadly, this index was completely ignored when I ran an EXPLAIN on my UPDATE. And the reason is that MySQL doesn’t execute an UPDATE query the same way as if you did a SELECT with the same conditions. The index probably made things worse, because now with each record update, we now also have an index update as well. I could probably fork the code and probably use some type of isolation level in a transaction to get the best of both worlds with an index based SELECT, and a quick UPDATE on a single record by id. But there are easier solutions to try first.
My UPDATE statements were pushing 40 seconds in some cases according to MySQL. Eventually the lock wait timeout is exceeded and you see an error in the delayed_jobs.log:
Error while reserving job: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction
Jobs were moving very slowly, and throwing more workers at it didn’t make an improvement. This is because each time a worker picks up a job, it was waiting 40+ seconds. The UPDATE was doing a filesort, and any index was being ignored. (And MySQL doesn’t support UPDATE hints). It was pretty clear that all of the jobs from the reprocessing queue needed to find a new home that didn’t negatively impact my filesort. I settled on the following solution:
CREATE TABLE delayed_jobs_backup LIKE delayed_jobs; INSERT INTO delayed_jobs_backup SELECT * FROM delayed_jobs WHERE queue='new_queue'; DELETE FROM delayed_jobs WHERE queue='new_queue';
This creates a new database table with the structure of the existing delayed_jobs table. The table is then populated with the jobs that needed to find a new home (All 1M+ of them). And finally, deleted from the original delayed_jobs table. Be careful doing this, and do some SELECT/EXPLAIN queries in between to ensure you are doing what you think you are doing. (Deleting 1M+ records from a production database makes me sit up in my chair a bit).
Looking at MySQL’s process list I no longer have System locks on my UPDATE statements (presumably because the table size is small enough that the filesort is mostly painless):
mysql> SHOW FULL PROCESSLIST; # Id, User, Host, db, Command, Time, State, Info 1, user, 0.0.0.0:34346, localhost, Query, 0, System lock, UPDATE ...
The important columns here are the Time (in seconds), State, and Info. This proves that my job locking was happening quickly. I was seeing Time values of 40+ seconds before. I kept referring back to this process list to verify that the UPDATES were remaining snappy while I modified the number of workers running, and the number of jobs in the queue. I had a goal of keeping the UPDATE system lock times under 2 seconds. Adding more workers pushed the times up. Adding more jobs to the queue pushed the times up. Its a balance that probably depends very much on what you are processing, how much your database can handle, and what your memory constraints are on your worker servers.
To conclude – my job over the next few days will be to run the following command to put some jobs into the delayed_jobs table 10,000 at a time:
INSERT INTO delayed_jobs SELECT * FROM delayed_jobs_backup LIMIT 10000; DELETE FROM delayed_jobs_backup LIMIT 10000;
You can of course automate this. But my objective was met. The jobs can reprocess old records without impacting day to day jobs in the default queue. When the delayed_jobs table is almost empty, I move over another batch of jobs from the delayed_jobs_backup table. Rinse and repeat until there are no more jobs left to process. Its a bit more painful, but day to day operations continue to function, and I can cross of the reprocessing task from my list of things to do. All without any code changes!
I’ve been reading up on transaction isolation levels thinking something like a SELECT FOR UPDATE lock might be worthy contribution to the delayed_job codebase: http://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html