Closed
Bug 970738
Opened 11 years ago
Closed 10 years ago
Build some jacuzzis
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Unassigned)
References
Details
Attachments
(2 files, 2 obsolete files)
(deleted),
patch
|
bhearsum
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
bhearsum
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
We can speed up build times & reduce disk space by reducing the set of machines that certain build types are runnable on.
(small pools of machines == jacuzzis)
Comment 1•11 years ago
|
||
Catlee, Rail and I talked about this this morning. We want to start trying different allocation methods as soon as possible. Because of this, we're going to use static files on a server immediately. These will be replaced by a proper service later.
The server will have a few different endpoints:
* /$builder. A GET to this will return a the list of slaves that should be used for this builder.
* /fallback/$pool. A GET to this will return a list of slaves that can be used for any builder. It is intended to be used as a fallback mechanism.
I will work on generating static content for some builders for these endpoints. We will adjust allocations to try out different builder/slave combinations throughout this week. To start with, we'll try 1 on demand and 2 spot instances for some builders on Cedar.
Each Buildbot master will talk to the server as part of nextSlave. It will make a GET request to /$builder. If the response is 2xx it will only consider the slaves returned in the body for the build. If the response is anything else, it will make a GET request to /fallback/$pool to obtain the list of fallback slaves. Responses from all unique endpoints will be cached for some period of time (5-10min?) If neither /$builder nor /fallback/$pool return a useful response, the master will consider all slaves associated with that builder. Catlee is going to work on this part.
watch_pending.py will also be updated to talk to this service to figure out which slaves should be started for a pending job. It will use the same logic as the Buildbot master for fallback. Rail will be working on this part.
Reporter | ||
Comment 2•11 years ago
|
||
WIP for buildbot nextSlave: https://github.com/catlee/buildbotcustom/compare/master...jacuzzi
Reporter | ||
Comment 3•11 years ago
|
||
So the jacuzzi client-side code is going to interact in fun ways with this:
http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla/master_common.py#l188
I was seeing this situation in staging:
- builder A had a relatively new request, compared to builder B.
- builder A had slave S1 allocated to it in the jacuzzi service
- builders A,B have same sets of slaves configured
- prioritizeBuilders therefore dropped A from the list of builders. Because it has dropped builders, it triggers the builder loop again, which happens after the processing below is done
- builder B tries to find slaves, can't find any since S1 is the only one connected and it's allocated to builder A
- prioritizeBuilders runs again. repeat ad nauseam, and generate lots of log data in the meanwhile.
Reporter | ||
Comment 4•11 years ago
|
||
Attachment #8378637 -
Flags: feedback?(nthomas)
Attachment #8378637 -
Flags: feedback?(jhopkins)
Attachment #8378637 -
Flags: feedback?(bhearsum)
Reporter | ||
Comment 5•11 years ago
|
||
Attachment #8378638 -
Flags: feedback?(nthomas)
Attachment #8378638 -
Flags: feedback?(jhopkins)
Attachment #8378638 -
Flags: feedback?(bhearsum)
Comment 6•11 years ago
|
||
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave
Looks good. As discussed, a couple of possible improvements:
- fetch a single 'cache control' file to indicate that all caches need to be updated (fetched).
- store all data in a single file (compressed) and poll that for changes
Either of these changes would eliminate the need to poll many files.
Attachment #8378637 -
Flags: feedback?(jhopkins) → feedback+
Comment 7•11 years ago
|
||
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders
Looks good. I might suggest adding a couple of comments indicating the intention of this code.
Attachment #8378638 -
Flags: feedback?(jhopkins) → feedback+
Comment 8•11 years ago
|
||
(In reply to John Hopkins (:jhopkins) from comment #6)
> - store all data in a single file (compressed) and poll that for changes
This might be something we can add in addition to the existing endpoints, but I think it's important to keep at least the /builders/* and /machines/* endpoints around so that other tools can make use of them. I suspect this is something that we'll tackle when we write the real server.
Comment 9•11 years ago
|
||
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave
Review of attachment 8378637 [details] [diff] [review]:
-----------------------------------------------------------------
::: misc.py
@@ +331,5 @@
> + if exc_info:
> + log.err()
> + log.msg("Jacuzzi %i: %s" % (id(self), msg))
> + else:
> + log.msg("Jacuzzi %i: %s" % (id(self), msg))
These messages might be a little confusing as is. If I saw "Jacuzzi 123456: blah blah" in a log I would think that 123456 is the id of a combination of builders/slaves. AFAICT, the same id will be used for a bunch of different jacuzzis. Maybe s/Jacuzzi/JacuzziAllocator/?
@@ +343,5 @@
> + if self.allocated_cache:
> + cache_time, slaves = self.allocated_cache
> + if cache_time > time.time():
> + # TODO: This could get spammy
> + self.log("fresh cache: %s" % slaves)
I'm OK with spammy to start with. It's better than being too quiet.
@@ +373,5 @@
> + # Check the cache for this builder
> + self.log("checking cache for builder %s" % str(buildername))
> + c = self.cache.get(buildername)
> + if c:
> + self.log("cache hit")
Is it still a cache it if the cache has expired?
@@ +407,5 @@
> + self.cache[buildername] = (time.time() + self.CACHE_MAXAGE, slaves)
> + # Filter the list of available slaves by the set the service
> + # returned to us
> + return [s for s in available_slaves if s.slave.slavename in slaves]
> + except urllib2.HTTPError, e:
You might be able to make this block simpler by passing available_slaves to get_allocated_slaves, and letting it filter. If you do that, I think you can get rid of the fake 404 stuff and just call it in the "if missing_cache" block above, and in here.
@@ +414,5 @@
> + # of our list of available slaves
> + if e.code == 404:
> + try:
> + if not fake_404:
> + self.log("got 404; falling back to looking for allocated slaves")
Unless I'm misreading, this isn't true. The only difference between a fake 404 and a real 404 is whether or not you cache anything in self.missing_cache.
@@ +628,5 @@
>
> +_nextAWSSlave_wait_sort = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True)))
> +_nextAWSSlave_wait_sort_skip_spot = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True,
> + skip_spot=True)))
> +_nextAWSSlave_nowait = safeNextSlave(_nextAWSSlave())
Wow, such wraps.
Attachment #8378637 -
Flags: feedback?(bhearsum) → feedback+
Updated•11 years ago
|
Attachment #8378638 -
Flags: feedback?(bhearsum) → feedback+
Comment 10•11 years ago
|
||
Comment on attachment 8378637 [details] [diff] [review]
jacuzzi client for nextSlave
Review of attachment 8378637 [details] [diff] [review]:
-----------------------------------------------------------------
+1 to the comments Ben made leading to simplification. Helper functions for common code blocks may help too.
::: misc.py
@@ +342,5 @@
> + self.log("checking cache allocated slaves")
> + if self.allocated_cache:
> + cache_time, slaves = self.allocated_cache
> + if cache_time > time.time():
> + # TODO: This could get spammy
cache_time implies creation time to me, so maybe cache_expiry_time.
Attachment #8378637 -
Flags: feedback?(nthomas) → feedback+
Comment 11•11 years ago
|
||
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders
Review of attachment 8378638 [details] [diff] [review]:
-----------------------------------------------------------------
::: mozilla/master_common.py
@@ +187,5 @@
> + try:
> + # Filter the available slaves through the jacuzzi bubbles..
> + slaves = J.get_slaves(b[1].name, slaves)
> + except Exception:
> + twlog.err("handled exception talking to jacuzzi; trying to carry on")
Deliberately twlog instead of log here ?
Attachment #8378638 -
Flags: feedback?(nthomas) → feedback+
Reporter | ||
Comment 12•11 years ago
|
||
Comment on attachment 8378638 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders
Review of attachment 8378638 [details] [diff] [review]:
-----------------------------------------------------------------
::: mozilla/master_common.py
@@ +187,5 @@
> + try:
> + # Filter the available slaves through the jacuzzi bubbles..
> + slaves = J.get_slaves(b[1].name, slaves)
> + except Exception:
> + twlog.err("handled exception talking to jacuzzi; trying to carry on")
Yes. The log() method here doesn't handle exceptions.
Reporter | ||
Comment 13•11 years ago
|
||
Attachment #8378637 -
Attachment is obsolete: true
Attachment #8380643 -
Flags: review?(bhearsum)
Reporter | ||
Comment 14•11 years ago
|
||
Attachment #8378638 -
Attachment is obsolete: true
Attachment #8380645 -
Flags: review?(bhearsum)
Comment 15•11 years ago
|
||
Comment on attachment 8380645 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders
Review of attachment 8380645 [details] [diff] [review]:
-----------------------------------------------------------------
Looks fine with your pastebin fix.
Attachment #8380645 -
Flags: review?(bhearsum) → review+
Comment 16•11 years ago
|
||
Comment on attachment 8380643 [details] [diff] [review]
jacuzzi client for nextSlave
Review of attachment 8380643 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good to me. Turn on the bubbles!
Attachment #8380643 -
Flags: review?(bhearsum) → review+
Reporter | ||
Comment 17•11 years ago
|
||
Comment on attachment 8380643 [details] [diff] [review]
jacuzzi client for nextSlave
https://hg.mozilla.org/build/buildbotcustom/rev/93c609f031a4
Attachment #8380643 -
Flags: checked-in+
Reporter | ||
Comment 18•11 years ago
|
||
Comment on attachment 8380645 [details] [diff] [review]
jacuzzi integration for prioritizeBuilders
https://hg.mozilla.org/build/buildbot-configs/rev/743f092b924b
Attachment #8380645 -
Flags: checked-in+
Comment 19•11 years ago
|
||
in production.
Comment 20•11 years ago
|
||
URL-encoding of builder names went into production.
Comment 21•11 years ago
|
||
Catlee mentioned yesterday that we should put b2g_b2g-inbound_hamachi_eng_dep in a jacuzzi and remove the non-eng one. I did that this morning.
Reporter | ||
Comment 22•11 years ago
|
||
I think we're all done here.
Reporter | ||
Updated•10 years ago
|
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•