Closed Bug 1173469 Opened 9 years ago Closed 7 years ago

Dangerous to start two remote crontabber instances concurrently

Categories

(Socorro :: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Unassigned)

Details

At the moment we run one crontabber instance on one node under one shell. That shell is used to lock out concurrent crontabbers executing on the same jobs config. If we have two remote crontabber instances (i.e. AWS nodes), we run the risk of firing the same stored procs at the same time from two different places.

https://github.com/mozilla/crontabber/issues/68 deals with the problem of starting >1 jobs (for the first time) simultaneously. https://github.com/mozilla/crontabber/issues/67 deals with the problem of starting >1 *established* jobs simultaneously. Both issues will be solved by https://github.com/mozilla/crontabber/pull/69, which is yet to be resolved (it's difficult to write tests for these new fixes).

The above-mentioned issues deal with simultaneous execution on the single-digit millisecond scale. We still have the problem of two jobs starting >10 milliseconds apart: we currently ignore the `ongoing` state of a job set by a remote, different instance. No crontabber issues have been filed for this yet.
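For illustration, here is a minimal sketch of one way to guard against two instances starting the same job at the same time, using a PostgreSQL advisory lock keyed on the job name. This is an assumption for the sake of example, not how crontabber or PR 69 actually implements it; the function name, DSN handling, and key derivation are all hypothetical:

```python
# Hypothetical sketch (NOT crontabber's actual mechanism): use a PostgreSQL
# advisory lock so that only one instance, across any number of nodes sharing
# the same database, can run a given job at a time.
import zlib

import psycopg2


def try_run_exclusively(dsn, job_name, run_job):
    """Run run_job() only if no other session holds the lock for job_name."""
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            # Derive a 64-bit-compatible integer key from the job name;
            # pg_try_advisory_lock returns immediately instead of blocking.
            key = zlib.crc32(job_name.encode("utf-8"))
            cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
            (got_lock,) = cur.fetchone()
            if not got_lock:
                # Another instance (possibly on another node) holds the lock.
                return False
            try:
                run_job()
            finally:
                cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
        return True
    finally:
        conn.close()
```

Because the lock lives in the shared PostgreSQL server rather than in a local shell, it would cover both the millisecond-scale races and the ">10 milliseconds apart" case described above.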
The bug 1118288 blockage is "vague": we don't need these issues worked out to be able to switch to AWS, but it does mean we need to deploy crontabber carefully in AWS so automation doesn't accidentally slip and run multiple instances at the same time.
Blocks: 1118288
Let's do bug 1173465 for now to unblock AWS.
No longer blocks: 1118288
Peter worked on this in crontabber PR 69: https://github.com/mozilla/crontabber/pull/69

After that, I did an audit of crontabber jobs where I removed a bunch, pinned a bunch to a specific time of day, and read through the code for the rest to see whether those jobs could be interrupted safely. I also changed the job lock from 12 hours to 2 hours.

I'm pretty sure we can kill off a running crontabber node during a deploy, start a new one, and we'll be ok. I think all of that work means we're probably ok now. Marking as FIXED.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The way I remember it, that change to crontabber meant you can have two (EC2) instances on the same config and the same PostgreSQL server and NOT have to worry about two (undesired) concurrent executions of the same job. Thank you PostgreSQL for being a rock!

However, that didn't mean it's safe to kill a crontabber in the midst of running, especially if it's in the midst of running a backfill-based stored procedure that is not atomic. If you kill it while it's running, it gets no chance to finalize the crontabber state in PostgreSQL, and it'll look like the job is ongoing forever. (That's why the "ongoing" state is ignored if it's older than 24 hours.) Also, if you kill some stored procedure that nobody active on the team understands, it might leave the data in a weird, rotten state that is hard to debug.

The solution we outlined was:

1) "Poison crontabber" - do something like `touch /tmp/hey-crontabber-quit-when-youre-done-with-what-youre-doing-right-now` (sketched below). This can be done in the crontab shell script that wraps calling python crontabber.

2) Start a new EC2 crontabber (aka. admin node), but don't terminate the existing one for at least, say, 1 hour.

Don't we do something similar with processors? The existing nodes don't get terminated the second we start new ones. There's a "linger period".
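A minimal sketch of the "poison crontabber" idea from item 1 above, assuming a hypothetical job loop; crontabber's real entry point differs, and the sentinel path is just the joke name from the comment:

```python
# Hypothetical sketch of the "poison" shutdown: between jobs, check for a
# sentinel file and exit cleanly if it exists, so no job is killed mid-run.
import os
import sys

POISON_FILE = (
    "/tmp/hey-crontabber-quit-when-youre-done-with-what-youre-doing-right-now"
)


def run_jobs(jobs):
    for job in jobs:
        if os.path.exists(POISON_FILE):
            # A deploy asked us to stop. We exit between jobs, never in the
            # middle of one, so the crontabber state in PostgreSQL stays
            # consistent and no non-atomic stored proc gets interrupted.
            sys.exit(0)
        job()
```

The key property is that shutdown only happens at job boundaries, which is what makes it safe to then start the replacement EC2 node (item 2) while the old one drains.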
Peter: I did an audit as part of bug #1429563 related to how we should do deploys with the new infra. I think you're bringing up a conversation that's better had there.