Closed
Bug 636101
Opened 14 years ago
Closed 14 years ago
Slow slaves get used for builders that ask for fast ones
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: catlee)
References
Details
(Whiteboard: [release][automation])
Attachments
(1 file)
(deleted),
patch
|
dustin
:
review+
nthomas
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
I hit this for both Firefox and XULRunner source builders:
...
62712 changesets found
command timed out: 1200 seconds without output, killing pid 1535
The default timeout of 20 minutes may be too short as m-c continues to grow.
I hadn't set any reserved slaves on pm01, so feel free if that's the problem here. However, I think the xulrunner source is not set to use a fast slave so may always hit it there.
Reporter | ||
Comment 1•14 years ago
|
||
Morphing this to the root cause, which is that we're setting
'nextSlave': _nextFastReservedSlave
on the builder but get allocated a VM. Saw other problems with 4.0b12 with getting a VM for l10n repack and Armen got a VM for final verify for 3.5.17.
catlee points about that
http://hg.mozilla.org/build/buildbotcustom/file/tip/misc.py#l311
has a bug at line 324 (elif not onlyFast and slow). There are several functions that would need a check too.
Assignee: nobody → catlee
Summary: Release source times out in create_bundle when run on VMs → Slow slaves get used for builders that ask for fast ones
Comment 2•14 years ago
|
||
Note that that priority only applies on a particular master, too - if a builder on a master with only slow slaves happens to claim the job, then it will run on the slow slave..
Assignee | ||
Comment 3•14 years ago
|
||
Attachment #516702 -
Flags: review?(nrthomas)
Attachment #516702 -
Flags: review?(dustin)
Comment 4•14 years ago
|
||
Comment on attachment 516702 [details] [diff] [review]
Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests!
This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being more careful to return a list from _recentSort and a slave or None from _nextFastSlave. I'm not sure how the addition of 'not fast and only fast' helps: that would have gotten to the return [] anyway, right?
if fast: # not taken
return sorted(fast, _recentSort(builder))[-1]
elif slow and not only_fast: # not taken
return sorted(slow, _recentSort(builder))[-1]
else:
return []
so just changing the return [] to return None would be equivalent?
I don't see how this fixes the original issue, but the refactor looks fine.
# Choose the fast slave that was most recently on this builder
If this comment is true, then idle slaves shouldn't be that surprising - masters get an affinity for slaves they've used recently, so unused slaves never get contacted. Right?
Attachment #516702 -
Flags: review?(dustin) → review+
Assignee | ||
Comment 5•14 years ago
|
||
(In reply to comment #4)
> Comment on attachment 516702 [details] [diff] [review]
> Fixes for _nextFastReservedSlave, and checking the reserved file. Now with
> tests!
>
> This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being
> more careful to return a list from _recentSort and a slave or None from
> _nextFastSlave. I'm not sure how the addition of 'not fast and only fast'
> helps: that would have gotten to the return [] anyway, right?
>
> if fast: # not taken
> return sorted(fast, _recentSort(builder))[-1]
> elif slow and not only_fast: # not taken
> return sorted(slow, _recentSort(builder))[-1]
> else:
> return []
>
> so just changing the return [] to return None would be equivalent?
The old _nextFastReservedSlave looked like this:
if fast: # not taken
return sorted(fast, _recentSort(builder))[-1]
elif not onlyFast and slow: # not taken
return sorted(slow, _recentSort(builder))[-1]
else: # oops!
log.msg("No fast or slow slaves found for builder '%s', choosing randomly instead" % builder.name)
return random.choice(available_slaves)
This the motivation for the refactor - it was too easy for the logic in the two
very similar functions to get out of sync. The original version in _nextFast
should be equivalent to what the patch is doing now, a little less verbosely
though.
Returning None or [] should be equivalent, we use None elsewhere, and buildbot
expects the function to return a slave object so None makes more sense than []
for "no slave".
> I don't see how this fixes the original issue, but the refactor looks fine.
>
> # Choose the fast slave that was most recently on this builder
>
> If this comment is true, then idle slaves shouldn't be that surprising -
> masters get an affinity for slaves they've used recently, so unused slaves
> never get contacted. Right?
Correct. This is done by looking through the builds in the builder's build
cache (so we don't load pickles from disk), and ordering slaves by
time-since-last-build on that builder.
Comment 6•14 years ago
|
||
Cool - my r+ stands then, and thanks for the explanation. My mental diff with _nextFastReservedSlave didn't catch that!
Reporter | ||
Updated•14 years ago
|
Attachment #516702 -
Flags: review?(nrthomas) → review+
Assignee | ||
Comment 7•14 years ago
|
||
Comment on attachment 516702 [details] [diff] [review]
Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests!
http://hg.mozilla.org/build/buildbotcustom/rev/204086db1d2c
Attachment #516702 -
Flags: checked-in+
Comment 8•14 years ago
|
||
Tests are failing:
Traceback (most recent call last):
File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py", line 382, in loadTestsFromName
addr.filename, addr.module)
File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 39, in importFromPath
return self.importFromDir(dir_path, fqname)
File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 86, in importFromDir
mod = load_module(part_fqname, fh, filename, desc)
File "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py", line 132
with mock.patch.object(time, 'time') as time_method:
^
SyntaxError: invalid syntax
Assignee | ||
Comment 9•14 years ago
|
||
(In reply to comment #8)
> Tests are failing:
>
> Traceback (most recent call last):
> File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py",
> line 382, in loadTestsFromName
> addr.filename, addr.module)
> File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py",
> line 39, in importFromPath
> return self.importFromDir(dir_path, fqname)
> File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py",
> line 86, in importFromDir
> mod = load_module(part_fqname, fh, filename, desc)
> File
> "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py",
> line 132
> with mock.patch.object(time, 'time') as time_method:
> ^
> SyntaxError: invalid syntax
Will fix in bug 643408
Assignee | ||
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•