Closed Bug 636101 Opened 14 years ago Closed 14 years ago

Slow slaves get used for builders that ask for fast ones

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: catlee)

References

Details

(Whiteboard: [release][automation])

Attachments

(1 file)

I hit this for both Firefox and XULRunner source builders: ... 62712 changesets found command timed out: 1200 seconds without output, killing pid 1535 The default timeout of 20 minutes may be too short as m-c continues to grow. I hadn't set any reserved slaves on pm01, so feel free if that's the problem here. However, I think the xulrunner source is not set to use a fast slave so may always hit it there.
Morphing this to the root cause, which is that we're setting 'nextSlave': _nextFastReservedSlave on the builder but get allocated a VM. Saw other problems with 4.0b12 with getting a VM for l10n repack and Armen got a VM for final verify for 3.5.17. catlee points about that http://hg.mozilla.org/build/buildbotcustom/file/tip/misc.py#l311 has a bug at line 324 (elif not onlyFast and slow). There are several functions that would need a check too.
Assignee: nobody → catlee
Summary: Release source times out in create_bundle when run on VMs → Slow slaves get used for builders that ask for fast ones
Note that that priority only applies on a particular master, too - if a builder on a master with only slow slaves happens to claim the job, then it will run on the slow slave..
Attachment #516702 - Flags: review?(nrthomas)
Attachment #516702 - Flags: review?(dustin)
Comment on attachment 516702 [details] [diff] [review] Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests! This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being more careful to return a list from _recentSort and a slave or None from _nextFastSlave. I'm not sure how the addition of 'not fast and only fast' helps: that would have gotten to the return [] anyway, right? if fast: # not taken return sorted(fast, _recentSort(builder))[-1] elif slow and not only_fast: # not taken return sorted(slow, _recentSort(builder))[-1] else: return [] so just changing the return [] to return None would be equivalent? I don't see how this fixes the original issue, but the refactor looks fine. # Choose the fast slave that was most recently on this builder If this comment is true, then idle slaves shouldn't be that surprising - masters get an affinity for slaves they've used recently, so unused slaves never get contacted. Right?
Attachment #516702 - Flags: review?(dustin) → review+
(In reply to comment #4) > Comment on attachment 516702 [details] [diff] [review] > Fixes for _nextFastReservedSlave, and checking the reserved file. Now with > tests! > > This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being > more careful to return a list from _recentSort and a slave or None from > _nextFastSlave. I'm not sure how the addition of 'not fast and only fast' > helps: that would have gotten to the return [] anyway, right? > > if fast: # not taken > return sorted(fast, _recentSort(builder))[-1] > elif slow and not only_fast: # not taken > return sorted(slow, _recentSort(builder))[-1] > else: > return [] > > so just changing the return [] to return None would be equivalent? The old _nextFastReservedSlave looked like this: if fast: # not taken return sorted(fast, _recentSort(builder))[-1] elif not onlyFast and slow: # not taken return sorted(slow, _recentSort(builder))[-1] else: # oops! log.msg("No fast or slow slaves found for builder '%s', choosing randomly instead" % builder.name) return random.choice(available_slaves) This the motivation for the refactor - it was too easy for the logic in the two very similar functions to get out of sync. The original version in _nextFast should be equivalent to what the patch is doing now, a little less verbosely though. Returning None or [] should be equivalent, we use None elsewhere, and buildbot expects the function to return a slave object so None makes more sense than [] for "no slave". > I don't see how this fixes the original issue, but the refactor looks fine. > > # Choose the fast slave that was most recently on this builder > > If this comment is true, then idle slaves shouldn't be that surprising - > masters get an affinity for slaves they've used recently, so unused slaves > never get contacted. Right? Correct. This is done by looking through the builds in the builder's build cache (so we don't load pickles from disk), and ordering slaves by time-since-last-build on that builder.
Cool - my r+ stands then, and thanks for the explanation. My mental diff with _nextFastReservedSlave didn't catch that!
Attachment #516702 - Flags: review?(nrthomas) → review+
Comment on attachment 516702 [details] [diff] [review] Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests! http://hg.mozilla.org/build/buildbotcustom/rev/204086db1d2c
Attachment #516702 - Flags: checked-in+
Tests are failing: Traceback (most recent call last): File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py", line 382, in loadTestsFromName addr.filename, addr.module) File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 39, in importFromPath return self.importFromDir(dir_path, fqname) File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 86, in importFromDir mod = load_module(part_fqname, fh, filename, desc) File "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py", line 132 with mock.patch.object(time, 'time') as time_method: ^ SyntaxError: invalid syntax
(In reply to comment #8) > Tests are failing: > > Traceback (most recent call last): > File > "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py", > line 382, in loadTestsFromName > addr.filename, addr.module) > File > "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", > line 39, in importFromPath > return self.importFromDir(dir_path, fqname) > File > "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", > line 86, in importFromDir > mod = load_module(part_fqname, fh, filename, desc) > File > "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py", > line 132 > with mock.patch.object(time, 'time') as time_method: > ^ > SyntaxError: invalid syntax Will fix in bug 643408
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: