Bug 1045114 (Closed) - Opened 10 years ago, Closed 10 years ago

implement 'terminate' action in slaveapi

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264] )

Attachments

(2 files)

No description provided.
Blocks: 1044788
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264]
coop, feel free to bounce the review if you are not familiar with slaveapi. I had time today to finish this off and test it.

This solution is not pretty; in the future we can adjust cloud-tools/aws_manage_instances.py to output a return code so we don't need to parse its output. However, it gets the job done for now.

Some examples of this:

* when an instance exists but there is a problem terminating the slave (I used dry_run to simulate this):

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate
{
  "aws_terminate_instance": {
    "33147664": {
      "finish_timestamp": 1419996261.226675,
      "request_timestamp": 1419996223.114596,
      "start_timestamp": 1419996223.115705,
      "state": 3,
      "text": "Instance `tst-linux64-ec2-jlund` could not be terminated. Output received: '2014-12-30 19:24:18,741 - INFO - Found tst-linux64-ec2-jlund (i-aa538d47)...\n2014-12-30 19:24:18,741 - INFO - Terminating tst-linux64-ec2-jlund...\n2014-12-30 19:24:18,741 - INFO - Dry run mode, skipping...\n'"
    }
  }
}

* when an instance doesn't exist:

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/foo-bar/actions/terminate
{
  "aws_terminate_instance": {
    "45016336": {
      "finish_timestamp": 1419997048.643463,
      "request_timestamp": 1419997034.281548,
      "start_timestamp": 1419997034.282599,
      "state": 3,
      "text": "Instance 'foo-bar' could not be determined. Does it exist?"
    }
  }
}

* finally, when it exists and the instance actually is terminated:

> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate
{
  "aws_terminate_instance": {
    "44702480": {
      "finish_timestamp": 1419996772.016453,
      "request_timestamp": 1419996741.390275,
      "start_timestamp": 1419996741.391327,
      "state": 2,
      "text": "Instance 'tst-linux64-ec2-jlund' has been terminated"
    }
  }
}
Attachment #8542804 - Flags: review?(coop)
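[Editor's note] For anyone reading along without the patch open, here is a rough Python sketch of the shape of the change, pieced together from the JSON results above and the hunks quoted in the review that follows. The config lookup, the helper signature, and the exact output-parsing conditions are assumptions made for illustration, not the patch's literal code; the return convention (a result constant plus a message) simply mirrors the state/text fields shown in the JSON above.

# Illustrative sketch only; see the hedges above.
import subprocess

from ..actions.results import SUCCESS, FAILURE
from ..global_state import config

INSTANCE_NOT_FOUND_MSG = "Instance '%s' could not be determined. Does it exist?"

# Path to cloud-tools/aws_manage_instances.py. How the real patch resolves this
# is not shown in the thread, so this key name and default are placeholders.
query_script = config.get("aws_manage_instances_script",
                          "/builds/aws_manage_instances.py")


def _manage_instance(name, action, options=None):
    # Shell out to aws_manage_instances.py, capturing stdout (its print output)
    # and stderr (its logging output) separately, as in the quoted hunks.
    p = subprocess.Popen(['python', query_script] + (options or []) + [action, name],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    std_output, logging_output = p.communicate()
    p.wait()
    return std_output, logging_output


def aws_terminate_instance(name):
    # 'status' only prints to stdout when the instance is known.
    std_output, _ = _manage_instance(name, 'status')
    if not std_output:
        return FAILURE, INSTANCE_NOT_FOUND_MSG % name

    # Illustrative checks only; the real patch's parsing of the terminate
    # output is not reproduced verbatim in this thread.
    _, logging_output = _manage_instance(name, 'terminate')
    if 'Terminating %s' % name in logging_output and 'Dry run' not in logging_output:
        return SUCCESS, "Instance '%s' has been terminated" % name
    return FAILURE, ("Instance `%s` could not be terminated. Output received: %r"
                     % (name, logging_output))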
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch

Review of attachment 8542804 [details] [diff] [review]:
-----------------------------------------------------------------

No showstoppers, just a few nits.

::: slaveapi/clients/aws.py
@@ +1,5 @@
> +import logging
> +import os
> +import re
> +import subprocess
> +from slaveapi.actions.results import SUCCESS, FAILURE

Should this be:
..actions.results import SUCCESS, FAILURE

@@ +6,5 @@
> +
> +log = logging.getLogger(__name__)
> +from ..global_state import config
> +
> +INSTANCE_NOT_FOUND_MSG = "Instance '%s' could not be determined. Does it exist?"

I would prefer "Instance '%s' not found. Does it exist?"

@@ +20,5 @@
> +
> +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> +    (std_output, logging_output) = p.communicate()
> +    p.wait()

What's the timeout situation like here, i.e. what's the longest we'll ever need to wait if this fails?

@@ +40,5 @@
> +    std_output, logging_output = _manage_instance(name, 'status')
> +
> +    # we rely on print statements (std out) for instance status
> +    if std_output:  # instance exists
> +        # TODO - this is fragile. aws_manage_instances.py 'status' should return a dict ready string

Yes, it should take a flag to output JSON (or similar). Can you file a follow-up bug for that, please?

::: slaveapi/web/slave.py
@@ +9,5 @@
>  from ..actions.buildslave_uptime import buildslave_uptime
>  from ..actions.buildslave_last_activity import buildslave_last_activity
>  from ..actions.disable import disable
>  from ..slave import Slave as SlaveClass
> +from slaveapi.actions.aws_terminate_instance import aws_terminate_instance

Should this take the same form as the above imports? i.e.

from ..actions.aws_terminate_instance import aws_terminate_instance
Attachment #8542804 - Flags: review?(coop) → review+
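[Editor's note] Regarding the JSON follow-up coop asks for above: nothing in cloud-tools emits JSON today, so the following is purely a hypothetical sketch of what the consumer side could look like once such a flag exists. The '--json' option and the dict contents are invented for illustration; _manage_instance refers to the sketch earlier in this thread.

import json

def _instance_status(name):
    # Hypothetical: assumes a future aws_manage_instances.py that accepts a
    # '--json' flag; today the script only prints human-readable text.
    std_output, _ = _manage_instance(name, 'status', options=['--json'])
    if not std_output:
        return None  # instance unknown
    return json.loads(std_output)  # e.g. {"name": ..., "state": "running"}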
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch

Fixed nits.

interdiff: http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch

checked in: http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66

> @@ +20,5 @@
> > +
> > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > +    (std_output, logging_output) = p.communicate()
> > +    p.wait()
>
> What's the timeout situation like here, i.e. what's the longest we'll ever
> need to wait if this fails?

For aws_manage_instances.py, I don't see a big concern about monitoring a timeout. For context, aws_create_instance.py can take a _long_ time with many different hang points, whereas the manage script simply makes one request call. I can look into adding a timeout in the cloud-tools script if you think this is a blocker.
Attachment #8542804 - Flags: checked-in+
(In reply to Jordan Lund (:jlund) from comment #4)
> Comment on attachment 8542804 [details] [diff] [review]
> 141230_bug_1045114_terminate_instance-slaveapi.patch
>
> fixed nits. interdiff:
> http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch
>
> checked in:
> http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66
>
> > @@ +20,5 @@
> > > +
> > > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > +    (std_output, logging_output) = p.communicate()
> > > +    p.wait()
> >
> > What's the timeout situation like here, i.e. what's the longest we'll ever
> > need to wait if this fails?
>
> for aws_manage_instances.py, I don't see there being a big concern for
> monitoring timeout. for context, aws_create_instance.py can take a _long_
> time with many different hang points, the manage script simply makes one
> request call. I can look into adding a timeout in the cloud tools script if
> you think this is a blocker.

That p.wait could be a very long time then. If we expect it to be a long time, can you do some yielding to gevent here? Specifically, a p.poll() ... time.sleep(N) loop, with some upper limit on how long we are willing to wait, say 4 hours or 6 hours or something? Slaveapi only has a limited number of "[gevent] threads" as well, so I worry that if we get 20 requests, we'll suddenly not be processing anything else.
...actually if it works with the gevent we have, you might be better off not doing the poll dance and just using http://www.gevent.org/gevent.subprocess.html instead of vanilla subprocess. (my comment on a max wait still stands though)
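[Editor's note] For reference, a minimal sketch combining both of Callek's suggestions: gevent's cooperative subprocess module plus an upper bound on the wait. The helper name, MAX_WAIT value, and poll interval are illustrative, not part of the slaveapi patch.

import time

import gevent
from gevent import subprocess  # cooperative drop-in for the stdlib module

MAX_WAIT = 4 * 60 * 60  # seconds; upper limit on how long we are willing to wait


def run_with_limit(cmd):
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    deadline = time.time() + MAX_WAIT
    while p.poll() is None:
        if time.time() > deadline:
            p.kill()
            raise RuntimeError("command exceeded %ds: %r" % (MAX_WAIT, cmd))
        gevent.sleep(5)  # yield to other greenlets instead of blocking them
    return p.communicate()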
(In reply to Justin Wood (:Callek) from comment #5)
> (In reply to Jordan Lund (:jlund) from comment #4)
> > Comment on attachment 8542804 [details] [diff] [review]
> > 141230_bug_1045114_terminate_instance-slaveapi.patch
> >
> > fixed nits. interdiff:
> > http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch
> >
> > checked in:
> > http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66
> >
> > > @@ +20,5 @@
> > > > +
> > > > +    p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > > +                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > > +    (std_output, logging_output) = p.communicate()
> > > > +    p.wait()
> > >
> > > What's the timeout situation like here, i.e. what's the longest we'll ever
> > > need to wait if this fails?
> >
> > for aws_manage_instances.py, I don't see there being a big concern for
> > monitoring timeout. for context, aws_create_instance.py can take a _long_
> > time with many different hang points, the manage script simply makes one
> > request call. I can look into adding a timeout in the cloud tools script if
> > you think this is a blocker.
>
> That p.wait could be a very long time then, if we expect it to be a long
> time, can you do some yielding to gevent here?

I meant that, IMO, this won't take a long time; I wasn't explicit enough. aws_create_instance.py is a script that can take a long time, but aws_manage_instances.py (what is used for this bug) makes just one request call that takes a few seconds to run. I don't anticipate it hanging, but I can look into it, and into adding a timeout for everything, as I finish the aws_create_instance work. Does that make sense?
Attached patch slaveapi version bump (deleted) — Splinter Review
Attachment #8543406 - Flags: review?(coop)
Attachment #8543406 - Flags: review?(coop) → review+
Comment on attachment 8543406 [details] [diff] [review]
slaveapi version bump

in production: https://hg.mozilla.org/build/puppet/rev/e79bfeaf9e37

testing on dev slaveapi node
Attachment #8543406 - Flags: checked-in+
Assignee: nobody → jlund
Depends on: 1126928
This is done, correct?
Flags: needinfo?(jlund)
Yes, and thanks for cleaning this up.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Component: Tools → General