Closed Bug 932392 Opened 11 years ago Closed 11 years ago

slaveapi should be able to stop buildbot slaves

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(3 files)

We need this to safely replace kittenherder with something SlaveAPI-based. I also think it another thing that sheriffs could make use of to deal with problematic slaves.
Blocks: 932396
if we do this, my suggestion is to enable "shutdown flag file" monitoring in our buildbot.tac's via slavealloc and then use our ssh magic to add/touch a shutdown.flag like we currently already do on mobile jobs on the foopies. This would eliminate the need for us to connect to buildbot masters via http and potentially cause performance degredation that way as the master loads the scores of pickle files, especially if the slave hasn't had a job in a while but slavealloc thinks its connected to a master.
(In reply to Justin Wood (:Callek) from comment #1) > This would eliminate the need for us to connect to buildbot masters via http > and potentially cause performance degredation that way as the master loads > the scores of pickle files, especially if the slave hasn't had a job in a > while but slavealloc thinks its connected to a master. This isn't an issue. You don't need to load a /buildslaves page to initiate a shutdown, you just need to POST to /buildslaves/:slave/shutdown.
(In reply to Ben Hearsum [:bhearsum] from comment #2) > (In reply to Justin Wood (:Callek) from comment #1) > > This would eliminate the need for us to connect to buildbot masters via http > > and potentially cause performance degredation that way as the master loads > > the scores of pickle files, especially if the slave hasn't had a job in a > > while but slavealloc thinks its connected to a master. > > This isn't an issue. You don't need to load a /buildslaves page to initiate > a shutdown, you just need to POST to /buildslaves/:slave/shutdown. Per IRC, the problem here is actually that the response to this 302s to /buildslaves/:slave. We have code to workaround this in self serve, but it looks like the easier thing to do is use slave-side shutdown as you suggest.
Depends on: 939118
Attached patch graceful_shutdown-slaveapi.diff (deleted) — Splinter Review
(In reply to Ben Hearsum [:bhearsum] from comment #3) > Per IRC, the problem here is actually that the response to this 302s to > /buildslaves/:slave. We have code to workaround this in self serve, but it > looks like the easier thing to do is use slave-side shutdown as you suggest. And then I changed my mind. It's much easier just to use the buildbot master, because it works regardless of the underlying slave type. It's easy to ignore the redirect with requests, too. Most of the stuff going on in this patch is explained in docstrings/comments but I want to call out the dependency on bug 939118 explicitly. I filed that because if slaves reboot while we're waiting for a graceful shutdown it's nearly impossible to verify that it shut down successfully. I think it makes sense to do, but I've asked for other opinions in that bug as well. The other notable thing here is the abstraction of all of the code from the Reboot endpoint into a base class. I think you asked for this during the original review, and now it's done! Any new simple endpoints can be added by subclassing, setting self.action, and hooking it up to a route.
Attachment #832947 - Flags: review?(jhopkins)
Comment on attachment 832947 [details] [diff] [review] graceful_shutdown-slaveapi.diff + Returns: + The status of the request specified or the status of all previous + actions of this type. See :py:func:`slaveapi.actions.results.ActionResult.to_dict` + for details on what status looks like. + """ ... + res = results[slave][self.action.__name__].get(requestid, None) + if res: + return jsonify(res.to_dict()) + else: + action_results = {} + for id_, res in results[slave][self.action.__name__].iteritems(): + action_results[id_] = res.to_dict() + return jsonify({self.action.__name__: action_results}) This code appears to return all actions also if the requested id does not exist. If that's intentional (and true), should update the Returns: comment. Should we be making an attempt to avoid overlapped reboot or shutdown requests?
Attachment #832947 - Flags: review?(jhopkins) → review+
(In reply to John Hopkins (:jhopkins) from comment #5) > This code appears to return all actions also if the requested id does not > exist. > If that's intentional (and true), should update the Returns: comment. I don't think that's true: > + res = results[slave][self.action.__name__].get(requestid, None) This line grabs the results from _just_ the specified requestid. > + if res: > + return jsonify(res.to_dict()) And we only return that if it was found. > + else: > + action_results = {} > + for id_, res in > results[slave][self.action.__name__].iteritems(): > + action_results[id_] = res.to_dict() > + return jsonify({self.action.__name__: action_results}) Otherwise, we'll return the entire thing. This code didn't change with this patch though (it just moved), and you can see this on the current dev instance: http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/b-linux64-hp-006/actions/reboot http://slaveapi-dev1.srv.releng.scl3.mozilla.com:8080/slaves/b-linux64-hp-006/actions/reboot?requestid=52090960 > > Should we be making an attempt to avoid overlapped reboot or shutdown > requests? Yes, but I'd like to deal with that in a separate bug. I think there's a larger picture rate-limiting/locking of actions to look at.
(In reply to Ben Hearsum [:bhearsum] from comment #6) > (In reply to John Hopkins (:jhopkins) from comment #5) > > > > Should we be making an attempt to avoid overlapped reboot or shutdown > > requests? > > Yes, but I'd like to deal with that in a separate bug. I think there's a > larger picture rate-limiting/locking of actions to look at. I filed bug 939886 for this.
Summary: slaveapi should be able to stop buildbot slaves (both gracefully and forcefully) → slaveapi should be able to stop buildbot slaves
Comment on attachment 832947 [details] [diff] [review] graceful_shutdown-slaveapi.diff Landed this w/ a version bump to SlaveAPI. I've put the new 1.0.3 package on releng-puppet2 in scl3. Patch incoming to bump the installed version on dev/prod.
Attachment #832947 - Flags: checked-in+
Attached patch upgrade slaveapi version (deleted) — Splinter Review
Attachment #8338546 - Flags: review?(dustin)
Attachment #8338546 - Flags: review?(dustin) → review+
Attachment #8338546 - Flags: checked-in+
Attached patch couple of docstring fixes (deleted) — Splinter Review
Just noticed these errors in the docstring syntax, whoops!
Attachment #8338576 - Flags: review?(jhopkins)
Attachment #8338576 - Flags: review?(jhopkins) → review+
Attachment #8338576 - Flags: checked-in+
This is working, but I noticed bug 943508 while testing it.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: