Closed Bug 970738 Opened 11 years ago Closed 10 years ago

Build some jacuzzis

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

Attachments

(2 files, 2 obsolete files)

We can speed up build times & reduce disk space by reducing the set of machines that certain build types are runnable on. (small pools of machines == jacuzzis)
Catlee, Rail and I talked about this this morning. We want to start trying different allocation methods as soon as possible. Because of this, we're going to use static files on a server immediately. These will be replaced by a proper service later. The server will have a few different endpoints: * /$builder. A GET to this will return a the list of slaves that should be used for this builder. * /fallback/$pool. A GET to this will return a list of slaves that can be used for any builder. It is intended to be used as a fallback mechanism. I will work on generating static content for some builders for these endpoints. We will adjust allocations to try out different builder/slave combinations throughout this week. To start with, we'll try 1 on demand and 2 spot instances for some builders on Cedar. Each Buildbot master will talk to the server as part of nextSlave. It will make a GET request to /$builder. If the response is 2xx it will only consider the slaves returned in the body for the build. If the response is anything else, it will make a GET request to /fallback/$pool to obtain the list of fallback slaves. Responses from all unique endpoints will be cached for some period of time (5-10min?) If neither /$builder nor /fallback/$pool return a useful response, the master will consider all slaves associated with that builder. Catlee is going to work on this part. watch_pending.py will also be updated to talk to this service to figure out which slaves should be started for a pending job. It will use the same logic as the Buildbot master for fallback. Rail will be working on this part.
Depends on: 971463
So the jacuzzi client-side code is going to interact in fun ways with this: http://hg.mozilla.org/build/buildbot-configs/file/default/mozilla/master_common.py#l188 I was seeing this situation in staging: - builder A had a relatively new request, compared to builder B. - builder A had slave S1 allocated to it in the jacuzzi service - builders A,B have same sets of slaves configured - prioritizeBuilders therefore dropped A from the list of builders. Because it has dropped builders, it triggers the builder loop again, which happens after the processing below is done - builder B tries to find slaves, can't find any since S1 is the only one connected and it's allocated to builder A - prioritizeBuilders runs again. repeat ad nauseam, and generate lots of log data in the meanwhile.
Attached patch jacuzzi client for nextSlave (obsolete) (deleted) — Splinter Review
Attachment #8378637 - Flags: feedback?(nthomas)
Attachment #8378637 - Flags: feedback?(jhopkins)
Attachment #8378637 - Flags: feedback?(bhearsum)
Attached patch jacuzzi integration for prioritizeBuilders (obsolete) (deleted) — Splinter Review
Attachment #8378638 - Flags: feedback?(nthomas)
Attachment #8378638 - Flags: feedback?(jhopkins)
Attachment #8378638 - Flags: feedback?(bhearsum)
Comment on attachment 8378637 [details] [diff] [review] jacuzzi client for nextSlave Looks good. As discussed, a couple of possible improvements: - fetch a single 'cache control' file to indicate that all caches need to be updated (fetched). - store all data in a single file (compressed) and poll that for changes Either of these changes would eliminate the need to poll many files.
Attachment #8378637 - Flags: feedback?(jhopkins) → feedback+
Comment on attachment 8378638 [details] [diff] [review] jacuzzi integration for prioritizeBuilders Looks good. I might suggest adding a couple of comments indicating the intention of this code.
Attachment #8378638 - Flags: feedback?(jhopkins) → feedback+
(In reply to John Hopkins (:jhopkins) from comment #6) > - store all data in a single file (compressed) and poll that for changes This might be something we can add in addition to the existing endpoints, but I think it's important to keep at least the /builders/* and /machines/* endpoints around so that other tools can make use of them. I suspect this is something that we'll tackle when we write the real server.
Comment on attachment 8378637 [details] [diff] [review] jacuzzi client for nextSlave Review of attachment 8378637 [details] [diff] [review]: ----------------------------------------------------------------- ::: misc.py @@ +331,5 @@ > + if exc_info: > + log.err() > + log.msg("Jacuzzi %i: %s" % (id(self), msg)) > + else: > + log.msg("Jacuzzi %i: %s" % (id(self), msg)) These messages might be a little confusing as is. If I saw "Jacuzzi 123456: blah blah" in a log I would think that 123456 is the id of a combination of builders/slaves. AFAICT, the same id will be used for a bunch of different jacuzzis. Maybe s/Jacuzzi/JacuzziAllocator/? @@ +343,5 @@ > + if self.allocated_cache: > + cache_time, slaves = self.allocated_cache > + if cache_time > time.time(): > + # TODO: This could get spammy > + self.log("fresh cache: %s" % slaves) I'm OK with spammy to start with. It's better than being too quiet. @@ +373,5 @@ > + # Check the cache for this builder > + self.log("checking cache for builder %s" % str(buildername)) > + c = self.cache.get(buildername) > + if c: > + self.log("cache hit") Is it still a cache it if the cache has expired? @@ +407,5 @@ > + self.cache[buildername] = (time.time() + self.CACHE_MAXAGE, slaves) > + # Filter the list of available slaves by the set the service > + # returned to us > + return [s for s in available_slaves if s.slave.slavename in slaves] > + except urllib2.HTTPError, e: You might be able to make this block simpler by passing available_slaves to get_allocated_slaves, and letting it filter. If you do that, I think you can get rid of the fake 404 stuff and just call it in the "if missing_cache" block above, and in here. @@ +414,5 @@ > + # of our list of available slaves > + if e.code == 404: > + try: > + if not fake_404: > + self.log("got 404; falling back to looking for allocated slaves") Unless I'm misreading, this isn't true. The only difference between a fake 404 and a real 404 is whether or not you cache anything in self.missing_cache. @@ +628,5 @@ > > +_nextAWSSlave_wait_sort = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True))) > +_nextAWSSlave_wait_sort_skip_spot = safeNextSlave(J(_nextAWSSlave(aws_wait=60, recentSort=True, > + skip_spot=True))) > +_nextAWSSlave_nowait = safeNextSlave(_nextAWSSlave()) Wow, such wraps.
Attachment #8378637 - Flags: feedback?(bhearsum) → feedback+
Attachment #8378638 - Flags: feedback?(bhearsum) → feedback+
Comment on attachment 8378637 [details] [diff] [review] jacuzzi client for nextSlave Review of attachment 8378637 [details] [diff] [review]: ----------------------------------------------------------------- +1 to the comments Ben made leading to simplification. Helper functions for common code blocks may help too. ::: misc.py @@ +342,5 @@ > + self.log("checking cache allocated slaves") > + if self.allocated_cache: > + cache_time, slaves = self.allocated_cache > + if cache_time > time.time(): > + # TODO: This could get spammy cache_time implies creation time to me, so maybe cache_expiry_time.
Attachment #8378637 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8378638 [details] [diff] [review] jacuzzi integration for prioritizeBuilders Review of attachment 8378638 [details] [diff] [review]: ----------------------------------------------------------------- ::: mozilla/master_common.py @@ +187,5 @@ > + try: > + # Filter the available slaves through the jacuzzi bubbles.. > + slaves = J.get_slaves(b[1].name, slaves) > + except Exception: > + twlog.err("handled exception talking to jacuzzi; trying to carry on") Deliberately twlog instead of log here ?
Attachment #8378638 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8378638 [details] [diff] [review] jacuzzi integration for prioritizeBuilders Review of attachment 8378638 [details] [diff] [review]: ----------------------------------------------------------------- ::: mozilla/master_common.py @@ +187,5 @@ > + try: > + # Filter the available slaves through the jacuzzi bubbles.. > + slaves = J.get_slaves(b[1].name, slaves) > + except Exception: > + twlog.err("handled exception talking to jacuzzi; trying to carry on") Yes. The log() method here doesn't handle exceptions.
Depends on: 976055
Attached patch jacuzzi client for nextSlave (deleted) — Splinter Review
Attachment #8378637 - Attachment is obsolete: true
Attachment #8380643 - Flags: review?(bhearsum)
Attachment #8378638 - Attachment is obsolete: true
Attachment #8380645 - Flags: review?(bhearsum)
Comment on attachment 8380645 [details] [diff] [review] jacuzzi integration for prioritizeBuilders Review of attachment 8380645 [details] [diff] [review]: ----------------------------------------------------------------- Looks fine with your pastebin fix.
Attachment #8380645 - Flags: review?(bhearsum) → review+
Comment on attachment 8380643 [details] [diff] [review] jacuzzi client for nextSlave Review of attachment 8380643 [details] [diff] [review]: ----------------------------------------------------------------- Looks good to me. Turn on the bubbles!
Attachment #8380643 - Flags: review?(bhearsum) → review+
Attachment #8380645 - Flags: checked-in+
in production.
URL-encoding of builder names went into production.
Blocks: 977611
Catlee mentioned yesterday that we should put b2g_b2g-inbound_hamachi_eng_dep in a jacuzzi and remove the non-eng one. I did that this morning.
Depends on: 980470
Depends on: 984870
Depends on: 1013568
I think we're all done here.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: