A “job” in the coding world is basically what it sounds like—a task that needs to be completed. These jobs are typically automated, and can occur on a recurring basis. An example of a job would be one that we at ToneDen use to check on the status of a running ad campaign. We call this a “check ad campaign status” job; one runs every 5 minutes for every campaign that is currently live on our platform.
Some of the responsibilities of this “check ad campaign status” job include, but are not limited to: reporting when the ads in a campaign are approved or rejected by Facebook, alerting the user when a campaign generates its first sale, and notifying the user if the campaign needs adjustments to improve its performance.
This means that when you launch a campaign in ToneDen, a “check ad campaign status” job will run for that campaign every 5 minutes for the duration of the campaign. As our platform (and the number of campaigns live at any given moment) grows, we have more and more of these jobs to run, which means they take up an ever-increasing amount of computational power. Because of this, it has become important to ensure that our jobs are running as efficiently as possible.
Jobs Gone Wild
Part of the responsibility of a “check ad campaign status” job is to create the next “check ad campaign status” job for the campaign and set it to run 5 minutes after the current job completes. Until recently, we used a library called Kue to accomplish this for us. When we create a new job in Kue and schedule it to be run in 5 minutes, we naturally expect that this job will run once—5 minutes from the time it was created.
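The self-rescheduling pattern described above can be sketched like this (the queue and status-check functions here are stand-ins for illustration, not ToneDen's actual code):

```javascript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// Stand-in for the real job queue: records what would be scheduled.
const scheduled = [];
function scheduleJob(type, data, opts) {
    scheduled.push({ type, data, runAt: Date.now() + opts.delay });
}

// Stand-in for the real status check (ad approvals, first sale, etc.).
function checkCampaignStatus(campaignId) {}

// Each run does its work, then queues the next run 5 minutes out.
function handleCheckAdCampaignStatus(job) {
    checkCampaignStatus(job.data.campaignId);
    scheduleJob('checkAdCampaignStatus', job.data, { delay: FIVE_MINUTES_MS });
}

handleCheckAdCampaignStatus({ data: { campaignId: 1234 } });
```

Because every run schedules exactly one successor, a run that executes twice schedules two successors, and the chain of jobs silently forks.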
So when I woke up one morning to alerts that our site had started running slowly, I was surprised to see our logs full of reports that the “check ad campaign status” job was running several times a minute for each campaign. Since this job should only run once every 5 minutes, this was indicative of a serious problem in our job scheduling system!
The sharp increase in the frequency of “check ad campaign status” jobs was putting so much load on our servers that they'd started to respond slowly to network requests, which, in turn, meant our site was loading slowly, too. I immediately set out to find the cause of the duplicate “check ad campaign status” jobs. Fortunately, we log a message any time a job is scheduled or run, so I was able to identify what had caused the problem. When looking through the “check ad campaign status” jobs for a specific campaign, I noticed that the job was always scheduled to run once, but on very rare occasions (maybe 1% of the time), 5 minutes later the job would run twice!
Okay, so occasionally one job runs twice. On its own, that’s not enough to bog down our servers. The problem is that every time a job runs, it creates another job to run 5 minutes down the line. So when a job runs twice, two jobs are scheduled to run 5 minutes from the time that it ends. Now all of a sudden we have two sets of “check ad campaign status” jobs running for that campaign every 5 minutes. If we get unlucky again, and one of those jobs is run twice at some point in the future, we’ll have three sets of jobs running for that campaign. You start out with one job running every 5 minutes for a campaign, but as time goes on this job multiplies and you could end up with hundreds!
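A back-of-the-envelope model shows how quickly this compounds. If each run has an independent 1% chance of executing twice (the rate observed in the logs; independence is a simplifying assumption), the expected number of parallel job chains for one campaign grows geometrically:

```javascript
// Expected number of parallel "check ad campaign status" chains for one
// campaign, if each 5-minute run has probability p of running (and thus
// rescheduling) twice. In expectation, the count multiplies by (1 + p)
// each cycle; treating runs as independent is a simplifying assumption.
function expectedChains(p, cycles) {
    return Math.pow(1 + p, cycles);
}

const cyclesPerDay = (24 * 60) / 5; // 288 five-minute cycles per day

console.log(expectedChains(0.01, cyclesPerDay));     // ~17.6 after one day
console.log(expectedChains(0.01, cyclesPerDay * 2)); // ~308 after two days
```

Even a 1% duplication rate turns a single chain into hundreds within a couple of days.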
When you have thousands of campaigns all running simultaneously, the sheer quantity of jobs can easily become overwhelming.
Once I saw that a single job could be processed twice, I figured that either there was a bug in Kue or that we were using it incorrectly. I dug through the issues that other Kue users had reported at https://github.com/Automattic/kue/issues, and I found that several people had encountered similar problems when using Kue to process a high volume of jobs (https://github.com/Automattic/kue/issues/449 and https://github.com/Automattic/kue/issues/375 in particular).
After narrowing down the problem to Kue, I saw two options: we could stick with Kue and attempt to find a way to eliminate this problem, or we could jump ship to a different job queuing library. I did some research and found two other job queuing libraries that came highly recommended: Bee Queue and Bull.
Bull had a feature that could potentially resolve this issue immediately: it allows only one job to exist with a given identifier. So if we made the identifier of a “check ad campaign status” job unique to the campaign it was running for, there could only ever be one “check ad campaign status” job for a given campaign at a time. As a bonus, Bull was far more popular and more frequently updated than either Kue or Bee Queue.
A Better Job
Switching to Bull proved simple. The concepts were the same, and most of the code needed only minor changes. For example, the code to create and schedule a “check ad campaign status” job in Kue looked like this:
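A minimal sketch of that Kue call, using Kue's `create`/`delay`/`save` chain (the job payload shape and the `nextJobDate` calculation are illustrative assumptions, not ToneDen's exact code):

```javascript
const kue = require('kue');

const queue = kue.createQueue();

// Schedule the next check to run 5 minutes out. The payload and
// nextJobDate here are illustrative assumptions.
const nextJobDate = new Date(Date.now() + 5 * 60 * 1000);

queue
    .create('checkAdCampaignStatus', { campaignId: 1234 })
    .delay(nextJobDate) // Kue's delay() accepts a Date or a millisecond offset
    .save((err) => {
        if (err) {
            console.error('failed to schedule job', err);
        }
    });
```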
In Bull, the exact same task looked like this:

```javascript
// Reconstructed around the surviving delay line; the queue, campaign,
// and nextJobDate variables are assumed from the surrounding context.
queue.add(
    { campaignId: campaign.id },
    {
        delay: nextJobDate.getTime() - new Date().getTime(),
    }
);
```
While migrating to Bull appeared to immediately fix our duplicate job issue, I still wanted to add unique identifiers to each of the jobs so I could be absolutely sure that there wouldn’t be any more duplicates. I decided to set the identifier of the “check ad campaign status” job for a campaign to “<campaign ID>-checkAdCampaignStatus”. So for a campaign with ID 1234, the “check ad campaign status” job would have the identifier “1234-checkAdCampaignStatus,” and only one job with this identifier would be allowed to exist at a time. This was easily accomplished by modifying the above code:

```javascript
// Same call as before, with a jobId unique to the campaign. Bull will
// refuse to add a job whose jobId already exists in the queue.
queue.add(
    { campaignId: campaign.id },
    {
        jobId: `${campaign.id}-checkAdCampaignStatus`,
        delay: nextJobDate.getTime() - new Date().getTime(),
    }
);
```
Since switching to Bull, we haven’t had any more slowdown issues. It’s been a great relief, and we even got the bonus of having a cool dashboard where we can monitor the health of our job queue!