Closed
Bug 712244
Opened 13 years ago
Closed 12 years ago
increase or work around hitting MAX_BROKER_REFS on test masters (too many builders per slave problem)
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: mozilla)
References
Details
(Whiteboard: [buildmasters][capacity])
Attachments
(5 files)
(deleted), patch | dustin: review+ | mozilla: checked-in+
(deleted), patch | catlee: review+ | mozilla: checked-in+
(deleted), patch | dustin: review+ | mozilla: checked-in+
(deleted), patch | catlee: review+ | mozilla: checked-in+
(deleted), patch | coop: review+ | mozilla: checked-in+
Seen on talos-r3-xp-037 trying to talk to buildbot-master16:
2011-12-19 11:19:51-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2011-12-19 11:19:51-0800 [Broker,client] While trying to connect:
Traceback from remote host -- Traceback (most recent call last):
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
self._runCallbacks()
...
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 664, in registerReference
raise Error("Maximum PB reference count exceeded. "
twisted.spread.pb.Error: Maximum PB reference count exceeded. Goodbye.
It connected fine after a reboot, so maybe it tried to hit the master during a reconfig?
Reporter
Comment 1•13 years ago
Also talos-r3-xp-039 at
2011-12-19 11:18:52-0800 [Broker,client] While trying to connect:
Reporter
Comment 2•13 years ago
And talos-r3-xp-057 slightly older, full stack:
2011-12-16 18:22:01-0800 [Broker,client] ReconnectingPBClientFactory.failedToGetPerspective
2011-12-16 18:22:01-0800 [Broker,client] While trying to connect:
Traceback from remote host -- Traceback (most recent call last):
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
self._runCallbacks()
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 397, in _continue
self.unpause()
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 363, in unpause
self._runCallbacks()
--- <exception caught here> ---
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 763, in serialize
return jelly(object, self.security, None, self)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/jelly.py", line 1122, in jelly
return _Jellier(taster, persistentStore, invoker).jelly(object)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/jelly.py", line 475, in jelly
return obj.jellyFor(self)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/flavors.py", line 127, in jellyFor
return "remote", jellier.invoker.registerReference(self)
File "/builds/buildbot/tests1-windows/lib/python2.6/site-packages/twisted/spread/pb.py", line 664, in registerReference
raise Error("Maximum PB reference count exceeded. "
twisted.spread.pb.Error: Maximum PB reference count exceeded. Goodbye.
Comment 3•13 years ago
jhford has a bug to add a couple more masters but it is blocked on IT to open firewalls for DB access (bug 708804).
Comment 4•13 years ago
jhford mentioned those are for tegras. I am asking dustin for more masters.
Also, CPU I/O wait (wio) is happening on that master, which is not surprising.
Comment 5•13 years ago
found in triage:
(In reply to Armen Zambrano G. [:armenzg] - (vactions from Dec. 24th & back on Jan. 9th) from comment #3)
> jhford has a bug to add a couple more masters but it is blocked on IT to
> open firewalls for DB access (bug 708804).
bug#708804 now done.
...and so too is bug#712398.
Comment 6•13 years ago
Once buildbot-master21 is created, we should use our scripts to set it up as a Windows test master, enable it in slavealloc, and let it start grabbing Windows slaves from buildbot-master16.
OS: Mac OS X → Linux
Priority: -- → P3
Summary: Too many builders on buildbot-master16 ? → Master setup for buildbot-master21
Whiteboard: [buildmasters][capacity][buildduty]
Comment 8•13 years ago
lsblakk, catlee mentioned that you're setting up some masters.
Do you have instructions on how to? I would like to try one myself.
Thanks!
Comment 9•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #8)
> lsblakk, catlee mentioned that you're setting up some masters.
> Do you have instructions on how to? I would like to try one myself.
> Thanks!
Armen, I just set up buildbot-master{22..27} following https://wiki.mozilla.org/ReleaseEngineering/Master_Setup and have added a few notes where I found it necessary. I have finished everything on my end and am just trying to figure out the best practice for filing the required IT bugs. I will update that wiki page with whatever I discover.
Reporter
Comment 10•13 years ago
If buildbot-master16 is already Windows-only (win32+win64) and there really are too many builders, is there a plan to split win32 vs. win64? Simply adding another master doesn't seem like it will help.
Comment 11•13 years ago
If we're hitting the PB limit, it's a problem with number-of-builders-per-slave, not number-of-slaves-per-master. So we need to reduce number of builders and/or bump the PB limit.
Comment 12•13 years ago
Should we split the win64 slaves into a separate master?
or increase the PB limit?
Comment 13•13 years ago
I think that the limit to the reference count in Twisted is actually reasonable, so it may be time to look for ways to have fewer builders defined. But you can monkey-patch it. Details in the Twisted bug (#2045)
Comment 14•13 years ago
sorry, *buildbot* bug - http://trac.buildbot.net/ticket/2045
Comment 15•13 years ago
Does this happen when we do a reconfig that adds more builders, and then have to back out?
I just don't know when this problem happens and how often.
Comment 16•13 years ago
I will leave this bug open for someone else to determine the way forward.
Assignee: armenzg → nobody
Priority: P2 → --
Summary: Master setup for buildbot-master21 → Too many builders on buildbot-master16 ?
Comment 17•13 years ago
(In reply to Chris AtLee [:catlee] from comment #11)
> If we're hitting the PB limit, it's a problem with
> number-of-builders-per-slave, not number-of-slaves-per-master. So we need to
> reduce number of builders and/or bump the PB limit.
* Do we actually know how many builders we have per slave?
* Does the new PB ref limit need to be a power-of-2?
* Do we actually have a way to reduce the # of builders per slave? I assume this is a direct result of having (many project branches) x (many tests split into smaller parts)
Priority: -- → P3
Comment 18•13 years ago
It doesn't need to be a power of two.
Comment 19•13 years ago
(In reply to Chris Cooper [:coop] from comment #17)
> (In reply to Chris AtLee [:catlee] from comment #11)
> > If we're hitting the PB limit, it's a problem with
> > number-of-builders-per-slave, not number-of-slaves-per-master. So we need to
> > reduce number of builders and/or bump the PB limit.
>
> * Do we actually know how many builders we have per slave?
You can count it locally by editing master.cfg for a test master and running checkconfig.
> * Does the new PB ref limit need to be a power-of-2?
> * Do we actually have a way to reduce the # of builders per slave? I assume
> this is a direct result of having (many project branches) x (many tests
> split into smaller parts)
We'd have to consolidate builders, e.g. run all the different mochitest suites in the same builder, relying on runtime information to pick the right suite. This also means tbpl would need updating, since it looks at the builder name to determine what type of job something is.
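For illustration only, the per-slave count discussed above could be tallied along these lines. This is a minimal sketch, assuming the loaded config exposes builders as dicts carrying either a `slavenames` list or a single `slavename` (as dict-style builders in Buildbot 0.8-era configs appear to); the function name `count_builders_per_slave` and the sample data are hypothetical and do not come from the real master.cfg.

```python
from collections import Counter


def count_builders_per_slave(builders):
    """Return a Counter mapping slave name -> number of builders listing it.

    `builders` is assumed to be a list of dicts shaped like the entries of
    c['builders'] in a master.cfg; this shape is an assumption.
    """
    counts = Counter()
    for builder in builders:
        # Dict-style builders carry either 'slavenames' (list) or 'slavename'.
        names = builder.get("slavenames")
        if names is None:
            single = builder.get("slavename")
            names = [single] if single is not None else []
        for name in names:
            counts[name] += 1
    return counts


# Hypothetical data; a real master.cfg would supply these entries.
sample_builders = [
    {"name": "branch-a mochitest-1", "slavenames": ["slave-001", "slave-002"]},
    {"name": "branch-a mochitest-2", "slavenames": ["slave-001"]},
    {"name": "branch-b talos", "slavename": "slave-001"},
]

per_slave = count_builders_per_slave(sample_builders)
busiest, busiest_count = per_slave.most_common(1)[0]
print("%s has %d builders" % (busiest, busiest_count))
```

Comparing the largest count against the PB reference limit gives a quick read on how close the busiest slave is to the ceiling.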
Comment 20•13 years ago
Updating summary.
Summary: Too many builders on buildbot-master16 ? → increase or work around hitting MAX_BROKER_REFS on test masters (too many builders per slave problem)
Updated•13 years ago
Component: Release Engineering → Release Engineering: Automation
Priority: P3 → --
QA Contact: release → catlee
Updated•13 years ago
Priority: -- → P2
Comment 21•13 years ago
Not sure why this is marked as [buildduty]. This seems like more work than buildduty could hope to tackle in a given week in addition to everything else.
Priority: P2 → P3
Whiteboard: [buildmasters][capacity][buildduty] → [buildmasters][capacity]
Comment 22•13 years ago
Unblocking bug 698843: with Thunderbird builders, the highest usage is on talos-r3-fed-076, which has 981 builders against a limit of 1012 (96 percent of max). Still, we have very little wiggle room for new builders.
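The headroom arithmetic behind those figures can be sketched as follows. The helper name `builder_headroom` is hypothetical, and the percentage is truncated rather than rounded, which is an assumption about how the 96 percent figure was derived.

```python
def builder_headroom(builder_count, limit):
    """Return (percent_used, remaining) for a slave's builder count.

    percent_used is truncated to a whole number, not rounded.
    """
    percent_used = builder_count * 100 // limit
    remaining = limit - builder_count
    return percent_used, remaining


# Figures quoted in the comment above for talos-r3-fed-076.
percent_used, remaining = builder_headroom(981, 1012)
print("%d%% used, room for %d more builders" % (percent_used, remaining))
```

With 981 builders against a limit of 1012, only 31 more builders fit before that slave hits the ceiling.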
No longer blocks: 698843
Assignee
Comment 24•13 years ago
Dustin:
I'm guessing here, but I think this will make slavealloc give all slavealloc-enabled slaves a MAX_BROKER_REFS of 2048.
Does it look right to you?
Attachment #623878 -
Flags: review?(dustin)
Assignee
Comment 25•13 years ago
This patch hacks the production-0.8 copy of hg.m.o/build/buildbot to have a [larger] hardcoded MAX_BROKER_REFS.
Aiui, this doesn't affect any existing masters, but will affect any new masters.
Attachment #623884 -
Flags: review?(catlee)
Comment 26•13 years ago
Comment on attachment 623878 [details] [diff] [review]
update slavealloc's buildbot.tac template
looks right to me, untested
Attachment #623878 -
Flags: review?(dustin) → review+
Updated•13 years ago
Attachment #623884 -
Flags: review?(catlee) → review+
Assignee
Comment 27•13 years ago
Did a quick'n'dirty test:
1. Started up buildmaster with
import twisted.spread.pb
twisted.spread.pb.MAX_BROKER_REFS = 2048
in the master's buildbot.tac. Connected a linux64 slave with no MAX_BROKER_REFS update. Forced a build; the slave picked it up and started building. Killed the job.
2. Stopped the master + slave. Commented out the MAX_BROKER_REFS lines on the master; added those lines to the slave. Restarted both, forced a build. The slave picked it up and started building.
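The two lines tested above amount to a module-level monkey-patch. A slightly more defensive variant might look like the sketch below; the helper `raise_broker_ref_limit` is hypothetical, and the stand-in module (with an assumed stock default of 1024) exists only so the sketch runs where Twisted is not installed.

```python
import types


def raise_broker_ref_limit(pb_module, new_limit):
    """Raise pb_module.MAX_BROKER_REFS to new_limit; never lower it.

    Returns the value in effect afterwards.
    """
    current = getattr(pb_module, "MAX_BROKER_REFS", 0)
    if new_limit > current:
        pb_module.MAX_BROKER_REFS = new_limit
    return pb_module.MAX_BROKER_REFS


try:
    # In a real buildbot.tac this is the module to patch.
    import twisted.spread.pb as pb
except ImportError:
    # Stand-in so the sketch runs without Twisted (assumed default: 1024).
    pb = types.ModuleType("fake_pb")
    pb.MAX_BROKER_REFS = 1024

effective = raise_broker_ref_limit(pb, 2048)
print("MAX_BROKER_REFS is now %d" % effective)
```

Refusing to lower the value means the helper is safe to call more than once, for example from both a tac template and a later manual fix.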
Assignee
Comment 28•13 years ago
Comment on attachment 623884 [details] [diff] [review]
hack buildbot/master/buildbot/scripts/runner.py
http://hg.mozilla.org/build/buildbot/rev/082cd6ddcb18
Landed on the production-0.8 branch.
Any new masters created should have this fix.
We still need to land+deploy to slavealloc, and manually update the existing masters.
Catlee: is there anything else I need to do re: the buildbot repo?
Attachment #623884 -
Flags: checked-in+
Assignee
Comment 29•13 years ago
I set up a new test master on dev-master01 via
make -f Makefile.setup \
MASTER_NAME=preproduction-tests_master \
BASEDIR=/builds/buildbot/aki/test-master \
PYTHON=/usr/bin/python26 \
VIRTUALENV="/usr/bin/python26 /tools/misc-python/virtualenv.py" \
BUILDBOTCUSTOM_BRANCH=default \
BUILDBOTCONFIGS_BRANCH=default \
virtualenv deps install-buildbot master master-makefile
The buildbot.tac had the MAX_BROKER_REFS fix.
Assignee
Comment 30•13 years ago
Comment on attachment 623878 [details] [diff] [review]
update slavealloc's buildbot.tac template
http://hg.mozilla.org/build/tools/rev/2afbe1c9176d
This needs to be deployed, per https://wiki.mozilla.org/ReleaseEngineering/Applications/Slavealloc#Deployment .
Attachment #623878 -
Flags: checked-in+
Assignee
Comment 31•13 years ago
When pointing at http://staging-puppet.build.mozilla.org/staging/python-packages/ , it dies when trying to find sqlalchemy. Pointing it at repos/python/packages works.
Attachment #624766 -
Flags: review?(dustin)
Assignee
Comment 32•13 years ago
Updated slavealloc with the above script; http://slavealloc.build.mozilla.org/gettac/linux-ix-slave03 gives me a tac with the MAX_BROKER_REFS lines.
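A check like the one described above could be automated roughly as follows. This is a sketch under stated assumptions: the function `broker_ref_limit_in_tac` is hypothetical, it works on tac text already in hand rather than fetching the gettac URL, and the sample excerpt is made up to mirror the two lines quoted earlier in this bug.

```python
import re

# Matches an assignment such as: twisted.spread.pb.MAX_BROKER_REFS = 2048
_LIMIT_RE = re.compile(
    r"^\s*twisted\.spread\.pb\.MAX_BROKER_REFS\s*=\s*(\d+)\s*$",
    re.MULTILINE,
)


def broker_ref_limit_in_tac(tac_text):
    """Return the MAX_BROKER_REFS value assigned in tac_text, or None."""
    match = _LIMIT_RE.search(tac_text)
    return int(match.group(1)) if match else None


# Hypothetical tac excerpt; a real one would come from the gettac URL.
sample_tac = """\
import twisted.spread.pb
twisted.spread.pb.MAX_BROKER_REFS = 2048
"""

print(broker_ref_limit_in_tac(sample_tac))
```

Returning the parsed value rather than a boolean makes it easy to flag slaves whose tac carries an older, smaller limit.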
Comment 33•13 years ago
Comment on attachment 624766 [details] [diff] [review]
fix slavealloc script
This will only work in scl1 and mtv1 right now, but since slavealloc is in scl1, this looks good.
Attachment #624766 -
Flags: review?(dustin) → review+
Assignee
Comment 34•13 years ago
Comment on attachment 624766 [details] [diff] [review]
fix slavealloc script
http://hg.mozilla.org/build/tools/rev/acbaea8e273b
Attachment #624766 -
Flags: checked-in+
Assignee
Comment 35•13 years ago
Added the MAX_BROKER_REFS lines to existing enabled masters in production-masters.json.
Assignee
Comment 36•13 years ago
This shouldn't land until we restart all our test masters, or fix via manhole.
Attachment #624777 -
Flags: review?(catlee)
Updated•13 years ago
Attachment #624777 -
Flags: review?(catlee) → review+
Assignee
Comment 37•12 years ago
Attachment #627374 -
Flags: review?(coop)
Assignee
Comment 38•12 years ago
Comment on attachment 624777 [details] [diff] [review]
increase builder limit to 2048 in master.cfg
http://hg.mozilla.org/build/buildbot-configs/rev/6da2ccee77a0
Attachment #624777 -
Flags: checked-in+
Updated•12 years ago
Attachment #627374 -
Flags: review?(coop) → review+
Assignee
Comment 39•12 years ago
Comment on attachment 627374 [details] [diff] [review]
up builder limit for build masters
http://hg.mozilla.org/build/buildbot-configs/rev/45bff744e472
Attachment #627374 -
Flags: checked-in+
Assignee
Updated•12 years ago
Assignee: nobody → aki
Comment 40•12 years ago
All the enabled masters (including bm34) have had this change deployed via the manhole.
Assignee
Comment 41•12 years ago
Thanks Coop!
-> RESO FIXED
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Component: General Automation → General