Closed Bug 592244 (opened 14 years ago, closed 13 years ago)

race condition in schedulerdb

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Pike, Assigned: catlee)

References

Details

(Whiteboard: [buildbot])

Attachments

(3 files, 2 obsolete files)

I'm seeing two nightlies for the same locale on the same builder, but different masters.

http://production-master01.build.mozilla.org:8010/builders/Firefox%20mozilla-central%20macosx64%20l10n%20nightly/builds/1707
http://production-master03.build.mozilla.org:8010/builders/Firefox%20mozilla-central%20macosx64%20l10n%20nightly/builds/1670

Both with the same start time, Tue Aug 31 04:41:31 2010. Is there a race of masters grabbing build jobs?
Looks possible... There's only one build request for these, which has two build rows. There are 4 builds that started at exactly that time:

select buildrequests.id, buildername
from buildrequests, builds
where buildrequests.id = builds.brid and start_time = 1283254891;

+--------+-----------------------------------------------+
| id     | buildername                                   |
+--------+-----------------------------------------------+
| 487448 | Firefox mozilla-central macosx64 l10n nightly |
| 487449 | Firefox mozilla-central macosx64 l10n nightly |
| 487448 | Firefox mozilla-central macosx64 l10n nightly |
| 487524 | Firefox mozilla-central linux64 l10n nightly  |
+--------+-----------------------------------------------+

The two that share the same request id are the problematic ones.
Assignee: nobody → catlee
Attached file Log from pm01 from 04:41:31 (deleted) —
Attached file Log from pm03 from 04:41:31 (deleted) —
Nothing obvious in the logs....
So it looks like both masters are seeing the same unclaimed build request, and both are able to update the same row without the database complaining. P3 for now. If serious issues come up because of this, please let me know.
Priority: -- → P3
Checking the effects of the code against an InnoDB MySQL table, concurrent updates are handled just fine - the last transaction to commit takes precedence. To the extent it handles this at all, the db code expects to get an error about conflicts from the DB server. SQLite gives such an error. MySQL and Postgres do not, whether with MyISAM or InnoDB. Incidentally, catlee reports we're using MyISAM tables on the backend. So again, I'm not sure how this has *ever* worked.
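For illustration only, here is a small stand-alone sketch of the SQLite half of that claim, using Python's sqlite3 module. The table and claim column are simplified assumptions, not schedulerdb's real schema: a second writer gets "database is locked", whereas MySQL (MyISAM or InnoDB) and Postgres would block briefly and then let the later UPDATE win, silently overwriting the first claim.

    import os
    import sqlite3
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "claims.db")

    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE buildrequests (id INTEGER PRIMARY KEY, claimed_by_name TEXT)")
    setup.execute("INSERT INTO buildrequests VALUES (487448, NULL)")
    setup.commit()
    setup.close()

    # Two "masters", each with its own connection. timeout=0 makes the lock
    # conflict surface immediately instead of after the default 5 second wait.
    pm01 = sqlite3.connect(path, timeout=0)
    pm03 = sqlite3.connect(path, timeout=0)

    # pm01 claims the request but has not committed yet, so it holds the
    # database write lock.
    pm01.execute("UPDATE buildrequests SET claimed_by_name = 'pm01' WHERE id = 487448")

    # pm03 tries to claim the same row; SQLite refuses with "database is
    # locked". MySQL and Postgres would instead wait and then apply this
    # second UPDATE on top of the first one, so both masters think they won.
    try:
        pm03.execute("UPDATE buildrequests SET claimed_by_name = 'pm03' WHERE id = 487448")
    except sqlite3.OperationalError as e:
        print("pm03 got: %s" % e)

    pm01.commit()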
(In reply to comment #6)
> Checking the effects of the code against an InnoDB MySQL table, concurrent
> updates are handled just fine - the last transaction to commit takes
> precedence. To the extent it handles this at all, the db code expects to get
> an error about conflicts from the DB server. SQLite gives such an error.
> MySQL and Postgres do not, whether with MyISAM or InnoDB.
>
> Incidentally, catlee reports we're using MyISAM tables on the backend.
>
> So again, I'm not sure how this has *ever* worked.

Until recently we had at most 2 masters configured to run the same builders. Now we've got 4 or 6 in some cases. To mitigate this, we could consolidate masters again... or split them up so that they're not running all the same builders. Or come up with a scheme to ensure that masters aren't all checking for builds at the same time.
Wouldn't that kinda defeat the purpose of schedulerdb, though?
Yes, but it's a short-term solution while we work out the schema changes required to fix it. Catlee, do you want to get things squared away short-term, and then hand this over to me?
(In reply to comment #9)
> Yes, but it's a short-term solution while we work out the schema changes
> required to fix it.
>
> Catlee, do you want to get things squared away short-term, and then hand this
> over to me?

Sure. I'll start by writing a tool to figure out how often this is happening. If it's fairly rare, I'm ok to leave it be for now.
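For illustration, a rough sketch of what such a check could look like (this is not the actual tool; it assumes the buildrequests/builds tables shown in comment 1, uses MySQLdb, and the connection parameters are placeholders):

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="buildbot",
                           passwd="secret", db="schedulerdb")
    cursor = conn.cursor()

    # A build request that shows up in more than one builds row was started
    # more than once, i.e. it was claimed by more than one master. (Requests
    # that were legitimately retried would also show up here, so the output
    # still needs a sanity check.)
    cursor.execute("""
        SELECT buildrequests.id, buildrequests.buildername, COUNT(*) AS runs
        FROM buildrequests
        JOIN builds ON builds.brid = buildrequests.id
        GROUP BY buildrequests.id, buildrequests.buildername
        HAVING COUNT(*) > 1
        ORDER BY runs DESC
    """)

    for brid, buildername, runs in cursor.fetchall():
        print("request %s (%s) ran %s times" % (brid, buildername, runs))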
Summary: race condition in nightly schedulers? → race condition in schedulerdb
There have been 58 occurrences of duplicate builds since I started counting on Nov 3rd at 1pm PDT. The vast majority of these are l10n repack jobs, the rest being a mix of various test jobs. I haven't seen any regular builds get duplicated so far. Axel, what's the impact of duplicate l10n repacks? Have we seen any failures as a result?
I guess that you're in a better position to tell, as you can work on first-class data. Pretty hard to tell from here. http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-05&buildbot=true looks more dupe-dense than usual, and no obvious failures this time. I quite often see failed and passing builds next to each other, though. See ga-IE fx36 yesterday, http://l10n.mozilla.org/~axel/nightlies/?date=2010-11-04&buildbot=true. Given that we're still seeing upload errors much more frequently than anything else, it's hard to figure out what fails when and why from here.
We could fix this temporarily by restricting the l10n repacks to happen on only one master.
This needs to be fixed upstream.
Assignee: catlee → nobody
Whiteboard: [buildbot][waiting-for-upstream]
We're doing a *ton* of this right now (way more than I thought just by looking at tbpl, since for the most part the double runs start at the same time, and thus Tinderbox and tbpl wind up only knowing about one of them), which seems unlikely to do good things for our shortage of slaves and consequent wait times. From a quick look through the tip TraceMonkey push (on buildapi, where they're visible), looks like it had 28 jobs that were double-run, and 1 that was triple-run.
It's definitely gotten worse lately, not sure why. Maybe the change in bug 649734 made the builder loop more efficient, so masters are racing each other more often?
Actually, this might explain why wait times are bad if we're duplicating a lot of test jobs.
Blocks: 649734
This seems to be intermittently breaking Android updates (bug 651925).
Android updates again: bug 652667
Assignee: nobody → catlee
Priority: P3 → P2
Attached patch Try to handle race conditions (obsolete) (deleted) — Splinter Review
The approach here is similar to what's used in 0.8.4, I think. The UPDATE query to claim the buildrequest rows is qualified by conditions that should match only unclaimed requests. If the master claims fewer rows than are in the batch, the entire set of buildrequests is then unclaimed again. I've seen no duplicated builds with this code. I think there's a small risk that if re-claiming a buildrequest fails, it could subsequently be claimed by another master. I haven't actually seen this happen, and I think the master would have to fail to re-claim the request 6 times for this to occur (reclaims happen every 10 minutes, and buildrequests aren't stolen by other masters until an hour has gone by).
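For illustration only, the qualified-claim idea looks roughly like this. It is a sketch, not the actual patch: it assumes a DB-API cursor, the 0.8.x buildrequests columns, and that 'batch' is a list of request ids.

    def claim_batch(cursor, batch, now, master_name, master_incarnation):
        placeholders = ",".join(["%s"] * len(batch))

        # The WHERE clause matches only rows that are still unclaimed, so when
        # two masters race on the same batch, at most one of them can win any
        # given row; the loser sees a smaller rowcount instead of a silent
        # double-claim.
        q = ("UPDATE buildrequests"
             " SET claimed_at=%s, claimed_by_name=%s, claimed_by_incarnation=%s"
             " WHERE id IN (" + placeholders + ")"
             " AND (claimed_at IS NULL OR claimed_at=0)")
        cursor.execute(q, [now, master_name, master_incarnation] + list(batch))

        if cursor.rowcount != len(batch):
            # Another master got part of the batch first: give back everything
            # we did claim rather than run a partially claimed set of requests.
            unclaim_q = ("UPDATE buildrequests"
                         " SET claimed_at=0, claimed_by_name=NULL,"
                         " claimed_by_incarnation=NULL"
                         " WHERE id IN (" + placeholders + ")"
                         " AND claimed_by_name=%s")
            cursor.execute(unclaim_q, list(batch) + [master_name])
            return False
        return True

The real query also needs the stale-claim cutoff ('old' in the later review comment) so that requests abandoned by a dead master can be taken over after an hour; the sketch leaves that out for brevity.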
Attachment #528306 - Flags: feedback?(dustin)
Blocks: 652785
Comment on attachment 528306 [details] [diff] [review]
Try to handle race conditions

+ missed = t.rowcount

This seems an odd variable name - it's the number of build requests that were *hit*, not missed, right?

If a reclaim fails, this will unclaim the builds that should have been reclaimed. But I suppose that's a very narrow failure case.

So, barring moving to a "claims" table, I think this looks good.
Attached patch Try to handle race conditions, v2 (obsolete) (deleted) — Splinter Review
This builds on the previous patch by adding the 'am_reclaiming' flag to the claim_buildrequests method. If we fail to reclaim the buildrequests, we log a warning and exit instead of trying to unclaim them.
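Again purely as an illustration of the idea (names, helpers and return values here are assumptions, not the patch itself), the flag changes the failure handling along these lines:

    from twisted.python import log  # buildbot masters already log via twisted


    def claim_buildrequests(cursor, batch, now, master_name, master_incarnation,
                            am_reclaiming=False):
        # qualified_claim() stands in for the UPDATE from the earlier sketch
        # and returns how many rows it actually hit; it is a hypothetical
        # helper, as is unclaim() below.
        matched = qualified_claim(cursor, batch, now, master_name,
                                  master_incarnation, am_reclaiming)
        if matched == len(batch):
            return list(batch)
        if am_reclaiming:
            # We thought we already owned these requests. If some could not be
            # re-claimed, another master has taken over the stale claim, so
            # unclaiming them now would yank them away from it: warn and bail.
            log.msg("WARNING: failed to re-claim %d of %d requests; leaving them"
                    % (len(batch) - matched, len(batch)))
            return []
        # First-time claim: release the whole batch, as in the previous patch.
        unclaim(cursor, batch, master_name)
        return []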
Attachment #528306 - Attachment is obsolete: true
Attachment #529712 - Flags: feedback?(dustin)
Attachment #528306 - Flags: feedback?(dustin)
Comment on attachment 529712 [details] [diff] [review]
Try to handle race conditions, v2

Review of attachment 529712 [details] [diff] [review]:

lgtm

::: master/buildbot/db/connector.py
@@ +872,5 @@
+ qargs = [now, master_name, master_incarnation] + list(batch) + [old, master_name]
+ t.execute(q, qargs)
+ matched = t.rowcount
+ if matched != len(batch):
+     log.msg("Failed to claim everything; unclaiming (got %s, expected %s)" % (matched, len(batch)))

You should probably change this log message, since you won't necessarily claim here.
Attachment #529712 - Flags: feedback?(dustin) → feedback+
3rd time's the charm? Basically the same as before; adjusted the log message you pointed out, and fixed claim_buildrequests's return value in the case where we're not reclaiming anything.
Attachment #529712 - Attachment is obsolete: true
Attachment #531158 - Flags: review?(dustin)
Attachment #531158 - Flags: review?(dustin) → review+
Attachment #531158 - Flags: checked-in+
Optimistically calling this fixed. It's been deployed to all of the production build masters, and I haven't seen any duplicated builds on masters where it's running. The last duplicate builds I found were at 4am, just prior to this fix being deployed on pm03.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [buildbot][waiting-for-upstream] → [buildbot]
Confirming. The latest of AMO's performance measurement runs created the expected 404 results, not 430 or 450 as it used to be (see bug 650744).
Product: mozilla.org → Release Engineering