Bug 1045114 (Closed) - Opened 10 years ago, Closed 10 years ago

implement 'terminate' action in slaveapi

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264] )

Attachments

(2 files)

No description provided.
Blocks: 1044788
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264]
coop, feel free to bounce the review if you are not familiar with slaveapi. I had time today to finish this off and test it.

This solution is not pretty; in the future we can adjust cloud-tools/aws_manage_instances.py to output a return code so we don't need to parse its output. However, it gets the job done for now.

Some examples of this:

* when an instance exists but there is a problem terminating the slave (I used dry_run to simulate this):

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate
{
  "aws_terminate_instance": {
    "33147664": {
      "finish_timestamp": 1419996261.226675,
      "request_timestamp": 1419996223.114596,
      "start_timestamp": 1419996223.115705,
      "state": 3,
      "text": "Instance `tst-linux64-ec2-jlund` could not be terminated. Output received: '2014-12-30 19:24:18,741 - INFO - Found tst-linux64-ec2-jlund (i-aa538d47)...\n2014-12-30 19:24:18,741 - INFO - Terminating tst-linux64-ec2-jlund...\n2014-12-30 19:24:18,741 - INFO - Dry run mode, skipping...\n'"
    }
  }
}

* when an instance doesn't exist:

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/foo-bar/actions/terminate
{
  "aws_terminate_instance": {
    "45016336": {
      "finish_timestamp": 1419997048.643463,
      "request_timestamp": 1419997034.281548,
      "start_timestamp": 1419997034.282599,
      "state": 3,
      "text": "Instance 'foo-bar' could not be determined. Does it exist?"
    }
  }
}

* finally, when it exists and the instance actually is terminated:

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate
{
  "aws_terminate_instance": {
    "44702480": {
      "finish_timestamp": 1419996772.016453,
      "request_timestamp": 1419996741.390275,
      "start_timestamp": 1419996741.391327,
      "state": 2,
      "text": "Instance 'tst-linux64-ec2-jlund' has been terminated"
    }
  }
}
Attachment #8542804 - Flags: review?(coop)
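[Editor's note] For anyone reading along without the patch open, here is a rough Python sketch of the shape of the change, pieced together from the JSON results above and the hunks quoted in the review that follows. The config lookup, the helper signature, and the exact output-parsing conditions are assumptions made for illustration, not the patch's literal code; the return convention (a result constant plus a message) simply mirrors the state/text fields shown in the JSON above.

# Illustrative sketch only; see the hedges above.
import subprocess

from ..actions.results import SUCCESS, FAILURE
from ..global_state import config

INSTANCE_NOT_FOUND_MSG = "Instance '%s' could not be determined. Does it exist?"

# Path to cloud-tools/aws_manage_instances.py. How the real patch resolves this
# is not shown in the thread, so this key name and default are placeholders.
query_script = config.get("aws_manage_instances_script",
                          "/builds/aws_manage_instances.py")


def _manage_instance(name, action, options=None):
    # Shell out to aws_manage_instances.py, capturing stdout (its print output)
    # and stderr (its logging output) separately, as in the quoted hunks.
    p = subprocess.Popen(['python', query_script] + (options or []) + [action, name],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_output, logging_output = p.communicate()
    p.wait()
    return std_output, logging_output


def aws_terminate_instance(name):
    # 'status' only prints to stdout when the instance is known.
    std_output, _ = _manage_instance(name, 'status')
    if not std_output:
        return FAILURE, INSTANCE_NOT_FOUND_MSG % name

    # Illustrative checks only; the real patch's parsing of the terminate
    # output is not reproduced verbatim in this thread.
    _, logging_output = _manage_instance(name, 'terminate')
    if 'Terminating %s' % name in logging_output and 'Dry run' not in logging_output:
        return SUCCESS, "Instance '%s' has been terminated" % name
    return FAILURE, ("Instance `%s` could not be terminated. Output received: %r"
                     % (name, logging_output))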
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch

Review of attachment 8542804 [details] [diff] [review]:
-----------------------------------------------------------------

No showstoppers, just a few nits.

::: slaveapi/clients/aws.py
@@ +1,5 @@
> +import logging
> +import os
> +import re
> +import subprocess
> +from slaveapi.actions.results import SUCCESS, FAILURE

Should this be:
..actions.results import SUCCESS, FAILURE

@@ +6,5 @@
> +
> +log = logging.getLogger(__name__)
> +from ..global_state import config
> +
> +INSTANCE_NOT_FOUND_MSG = "Instance '%s' could not be determined. Does it exist?"

I would prefer "Instance '%s' not found. Does it exist?"

@@ +20,5 @@
> +
> +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> +    (std_output, logging_output) = p.communicate()
> +    p.wait()

What's the timeout situation like here, i.e. what's the longest we'll ever need to wait if this fails?

@@ +40,5 @@
> +    std_output, logging_output = _manage_instance(name, 'status')
> +
> +    # we rely on print statements (std out) for instance status
> +    if std_output:  # instance exists
> +        # TODO - this is fragile. aws_manage_instances.py 'status' should return a dict ready string

Yes, it should take a flag to output JSON (or similar). Can you file a follow-up bug for that, please?

::: slaveapi/web/slave.py
@@ +9,5 @@
>  from ..actions.buildslave_uptime import buildslave_uptime
>  from ..actions.buildslave_last_activity import buildslave_last_activity
>  from ..actions.disable import disable
>  from ..slave import Slave as SlaveClass
> +from slaveapi.actions.aws_terminate_instance import aws_terminate_instance

Should this take the same form as the above imports? i.e.

from ..actions.aws_terminate_instance import aws_terminate_instance
Attachment #8542804 - Flags: review?(coop) → review+
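[Editor's note] Regarding the JSON follow-up coop asks for above: nothing in cloud-tools emits JSON today, so the following is purely a hypothetical sketch of what the consumer side could look like once such a flag exists. The '--json' option and the dict contents are invented for illustration; _manage_instance refers to the sketch earlier in this thread.

import json

def _instance_status(name):
    # Hypothetical: assumes a future aws_manage_instances.py that accepts a
    # '--json' flag; today the script only prints human-readable text.
    std_output, _ = _manage_instance(name, 'status', options=['--json'])
    if not std_output:
        return None  # instance unknown
    return json.loads(std_output)  # e.g. {"name": ..., "state": "running"}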
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch

Fixed nits.

interdiff: http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch

checked in: http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66

> @@ +20,5 @@
> > +
> > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > +    (std_output, logging_output) = p.communicate()
> > +    p.wait()
>
> What's the timeout situation like here, i.e. what's the longest we'll ever
> need to wait if this fails?

For aws_manage_instances.py, I don't see a big concern about monitoring a timeout. For context, aws_create_instance.py can take a _long_ time with many different hang points, whereas the manage script simply makes one request call. I can look into adding a timeout in the cloud-tools script if you think this is a blocker.
Attachment #8542804 - Flags: checked-in+
(In reply to Jordan Lund (:jlund) from comment #4)
> Comment on attachment 8542804 [details] [diff] [review]
> 141230_bug_1045114_terminate_instance-slaveapi.patch
>
> fixed nits. interdiff:
> http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch
>
> checked in:
> http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66
>
> > @@ +20,5 @@
> > > +
> > > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > +    (std_output, logging_output) = p.communicate()
> > > +    p.wait()
> >
> > What's the timeout situation like here, i.e. what's the longest we'll ever
> > need to wait if this fails?
>
> for aws_manage_instances.py, I don't see there being a big concern for
> monitoring timeout. for context, aws_create_instance.py can take a _long_
> time with many different hang points, the manage script simply makes one
> request call. I can look into adding a timeout in the cloud tools script if
> you think this is a blocker.

That p.wait could be a very long time then. If we expect it to be a long time, can you do some yielding to gevent here? Specifically, a p.poll() ... time.sleep(N) loop, with some upper limit on how long we are willing to wait, say 4 hours or 6 hours or something? Slaveapi only has a limited number of "[gevent] threads" as well, so I worry that if we get 20 requests, we'll suddenly not be processing anything else.
...actually if it works with the gevent we have, you might be better off not doing the poll dance and just using http://www.gevent.org/gevent.subprocess.html instead of vanilla subprocess. (my comment on a max wait still stands though)
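[Editor's note] For reference, a minimal sketch combining both of Callek's suggestions: gevent's cooperative subprocess module plus an upper bound on the wait. The helper name, MAX_WAIT value, and poll interval are illustrative, not part of the slaveapi patch.

import time

import gevent
from gevent import subprocess  # cooperative drop-in for the stdlib module

MAX_WAIT = 4 * 60 * 60  # seconds; upper limit on how long we are willing to wait


def run_with_limit(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    deadline = time.time() + MAX_WAIT
    while p.poll() is None:
        if time.time() > deadline:
            p.kill()
            raise RuntimeError("command exceeded %ds: %r" % (MAX_WAIT, cmd))
        gevent.sleep(5)  # yield to other greenlets instead of blocking them
    return p.communicate()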
(In reply to Justin Wood (:Callek) from comment #5)
> (In reply to Jordan Lund (:jlund) from comment #4)
> > Comment on attachment 8542804 [details] [diff] [review]
> > 141230_bug_1045114_terminate_instance-slaveapi.patch
> >
> > fixed nits. interdiff:
> > http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch
> >
> > checked in:
> > http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66
> >
> > > @@ +20,5 @@
> > > > +
> > > > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > > +    (std_output, logging_output) = p.communicate()
> > > > +    p.wait()
> > >
> > > What's the timeout situation like here, i.e. what's the longest we'll ever
> > > need to wait if this fails?
> >
> > for aws_manage_instances.py, I don't see there being a big concern for
> > monitoring timeout. for context, aws_create_instance.py can take a _long_
> > time with many different hang points, the manage script simply makes one
> > request call. I can look into adding a timeout in the cloud tools script if
> > you think this is a blocker.
>
> That p.wait could be a very long time then, if we expect it to be a long
> time, can you do some yielding to gevent here?

I meant that, IMO, this won't take a long time; I wasn't explicit enough. aws_create_instance.py is a script that can take a long time, but aws_manage_instances.py (what is used for this bug) makes just one request call that takes a few seconds to run. I don't anticipate it hanging, but I can look into it, and into adding a timeout for everything, as I finish the aws_create_instance work. Does that make sense?
Attached patch slaveapi version bump (deleted) — Splinter Review
Attachment #8543406 - Flags: review?(coop)
Attachment #8543406 - Flags: review?(coop) → review+
Comment on attachment 8543406 [details] [diff] [review]
slaveapi version bump

in production: https://hg.mozilla.org/build/puppet/rev/e79bfeaf9e37

testing on dev slaveapi node
Attachment #8543406 - Flags: checked-in+
Assignee: nobody → jlund
Depends on: 1126928
This is done, correct?
Flags: needinfo?(jlund)
Yes, and thanks for cleaning this up.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Component: Tools → General