Closed
Bug 1045114
Opened 10 years ago
Closed 10 years ago
implement 'terminate' action in slaveapi
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jlund, Assigned: jlund)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264] )
Attachments
(2 files)
(deleted),
patch
|
coop
:
review+
jlund
:
checked-in+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
coop
:
review+
jlund
:
checked-in+
|
Details | Diff | Splinter Review |
No description provided.
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250]
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2250] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260]
Updated•10 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2260] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2264]
Comment 1•10 years ago
|
||
I’m picturing an endpoint like https://secure.pub.build.mozilla.org/slaveapi/slaves/tst-linux64-spot-1069/actions/terminate
I landed support for that endpoint in slave health this morning: https://hg.mozilla.org/build/slave_health/rev/eee55200bab0
The reboot button becomes terminate instance for AWS nodes:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class=test&type=tst-linux64-spot&name=tst-linux64-spot-1069
Assignee | ||
Comment 2•10 years ago
|
||
coop, feel free to bounce review if you are not familiar with slaveapi
had time today to finish this off and test. This solution is not pretty; in the future we can adjust cloud-tools/aws_manage_instances.py to output a return code so we don't need to parse output from it. However, it gets the job done for now.
some examples of this:
* when an instance exists but there is a problem with terminating the slave (I used dry_run to simulate this):
> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate aws-manage [ada7315] untracked
{
"aws_terminate_instance": {
"33147664": {
"finish_timestamp": 1419996261.226675,
"request_timestamp": 1419996223.114596,
"start_timestamp": 1419996223.115705,
"state": 3,
"text": "Instance `tst-linux64-ec2-jlund` could not be terminated. Output received: '2014-12-30 19:24:18,741 - INFO - Found tst-linux64-ec2-jlund (i-aa538
d47)...\n2014-12-30 19:24:18,741 - INFO - Terminating tst-linux64-ec2-jlund...\n2014-12-30 19:24:18,741 - INFO - Dry run mode, skipping...\n'"
}
}
}%
* when an instance doesn't exist
> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/foo-bar/actions/terminate aws-manage [135b455] (!) untracked
{
"aws_terminate_instance": {
"45016336": {
"finish_timestamp": 1419997048.643463,
"request_timestamp": 1419997034.281548,
"start_timestamp": 1419997034.282599,
"state": 3,
"text": "Instance 'foo-bar' could not be determined. Does it exist?"
}
}
}%
* finally, when it exists and the instance actually is terminated:
> curl slaveapi-dev1.srv.releng.scl3.mozilla.com:8086/slaves/tst-linux64-ec2-jlund/actions/terminate aws-manage [0ed9781] untracked
{
"aws_terminate_instance": {
"44702480": {
"finish_timestamp": 1419996772.016453,
"request_timestamp": 1419996741.390275,
"start_timestamp": 1419996741.391327,
"state": 2,
"text": "Instance 'tst-linux64-ec2-jlund' has been terminated"
}
}
}%
Attachment #8542804 -
Flags: review?(coop)
Comment 3•10 years ago
|
||
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch
Review of attachment 8542804 [details] [diff] [review]:
-----------------------------------------------------------------
No showstoppers, just a few nits.
::: slaveapi/clients/aws.py
@@ +1,5 @@
> +import logging
> +import os
> +import re
> +import subprocess
> +from slaveapi.actions.results import SUCCESS, FAILURE
Should this be:
..actions.results import SUCCESS, FAILURE
@@ +6,5 @@
> +
> +log = logging.getLogger(__name__)
> +from ..global_state import config
> +
> +INSTANCE_NOT_FOUND_MSG = "Instance '%s' could not be determined. Does it exist?"
I would prefer "Instance '%s' not found. Does it exist?"
@@ +20,5 @@
> +
> + p = subprocess.Popen(['python', query_script] + options + [action, name],
> + stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> + (std_output, logging_output) = p.communicate()
> + p.wait()
What's the timeout situation like here, i.e. what's the longest we'll ever need to wait if this fails?
@@ +40,5 @@
> + std_output, logging_output = _manage_instance(name, 'status')
> +
> + # we rely on print statements (std out) for instance status
> + if std_output: # instance exists
> + # TODO - this is fragile. aws_manage_instances.py 'status' should return a dict ready string
Yes, it should take a flag to output JSON (or similar). Can you file a follow-up bug for that, please?
::: slaveapi/web/slave.py
@@ +9,5 @@
> from ..actions.buildslave_uptime import buildslave_uptime
> from ..actions.buildslave_last_activity import buildslave_last_activity
> from ..actions.disable import disable
> from ..slave import Slave as SlaveClass
> +from slaveapi.actions.aws_terminate_instance import aws_terminate_instance
Should this take the same form as the above imports?
i.e.
from ..actions.aws_terminate_instance import aws_terminate_instance
Attachment #8542804 -
Flags: review?(coop) → review+
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Comment 4•10 years ago
|
||
Comment on attachment 8542804 [details] [diff] [review]
141230_bug_1045114_terminate_instance-slaveapi.patch
fixed nits. interdiff: http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-interdiff.patch
checked in: http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=08447554d2b07c3198314c71723ec52e099dec66
> @@ +20,5 @@
> > +
> > + p = subprocess.Popen(['python', query_script] + options + [action, name],
> > + stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > + (std_output, logging_output) = p.communicate()
> > + p.wait()
>
> What's the timeout situation like here, i.e. what's the longest we'll ever
> need to wait if this fails?
for aws_manage_instances.py, I don't see there being a big concern for monitoring timeout. for context, aws_create_instance.py can take a _long_ time with many different hang points, the manage script simply makes one request call. I can look into adding a timeout in the cloud tools script if you think this is a blocker.
Attachment #8542804 -
Flags: checked-in+
Comment 5•10 years ago
|
||
(In reply to Jordan Lund (:jlund) from comment #4)
> Comment on attachment 8542804 [details] [diff] [review]
> 141230_bug_1045114_terminate_instance-slaveapi.patch
>
> fixed nits. interdiff:
> http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-
> interdiff.patch
>
> checked in:
> http://git.mozilla.org/?p=build/slaveapi.git;a=commit;
> h=08447554d2b07c3198314c71723ec52e099dec66
>
> > @@ +20,5 @@
> > > +
> > > + p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > + stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > + (std_output, logging_output) = p.communicate()
> > > + p.wait()
> >
> > What's the timeout situation like here, i.e. what's the longest we'll ever
> > need to wait if this fails?
>
> for aws_manage_instances.py, I don't see there being a big concern for
> monitoring timeout. for context, aws_create_instance.py can take a _long_
> time with many different hang points, the manage script simply makes one
> request call. I can look into adding a timeout in the cloud tools script if
> you think this is a blocker.
That p.wait could be a very long time then, if we expect it to be a long time, can you do some yielding to gevent here?
specifically a p.poll() ... time.sleep(N) ... with some upper limit on how long we are willing to wait, say 4 hours, or 6 hours or something?
Slaveapi only has a limited number of "[gevent] threads" as well, so I worry if we get 20 requests, we'll suddenly not be processing anything else....
Comment 6•10 years ago
|
||
...actually if it works with the gevent we have, you might be better off not doing the poll dance and just using
http://www.gevent.org/gevent.subprocess.html instead of vanilla subprocess. (my comment on a max wait still stands though)
Assignee | ||
Comment 7•10 years ago
|
||
(In reply to Justin Wood (:Callek) from comment #5)
> (In reply to Jordan Lund (:jlund) from comment #4)
> > Comment on attachment 8542804 [details] [diff] [review]
> > 141230_bug_1045114_terminate_instance-slaveapi.patch
> >
> > fixed nits. interdiff:
> > http://people.mozilla.org/~jlund/20150102_aws_terminate_instance_slaveapi-
> > interdiff.patch
> >
> > checked in:
> > http://git.mozilla.org/?p=build/slaveapi.git;a=commit;
> > h=08447554d2b07c3198314c71723ec52e099dec66
> >
> > > @@ +20,5 @@
> > > > +
> > > > + p = subprocess.Popen(['python', query_script] + options + [action, name],
> > > > + stdout=subprocess.PIPE, stderr=subprocess.PIPE)
> > > > + (std_output, logging_output) = p.communicate()
> > > > + p.wait()
> > >
> > > What's the timeout situation like here, i.e. what's the longest we'll ever
> > > need to wait if this fails?
> >
> > for aws_manage_instances.py, I don't see there being a big concern for
> > monitoring timeout. for context, aws_create_instance.py can take a _long_
> > time with many different hang points, the manage script simply makes one
> > request call. I can look into adding a timeout in the cloud tools script if
> > you think this is a blocker.
>
> That p.wait could be a very long time then, if we expect it to be a long
> time, can you do some yielding to gevent here?
I was meaning that IMO this won't take a long time. I wasn't explicit enough: aws_create_instance.py is a script that can take a long time. But aws_manage_instances.py (what is used for this bug) is just one request call that takes a few seconds to run. I don't anticipate it 'hanging' but I can look into it as I finish the aws_create_instance and adding a timeout for everything. Does that make sense?
Assignee | ||
Comment 8•10 years ago
|
||
Attachment #8543406 -
Flags: review?(coop)
Assignee | ||
Updated•10 years ago
|
Updated•10 years ago
|
Attachment #8543406 -
Flags: review?(coop) → review+
Assignee | ||
Comment 9•10 years ago
|
||
Comment on attachment 8543406 [details] [diff] [review]
slaveapi version bump
in production: https://hg.mozilla.org/build/puppet/rev/e79bfeaf9e37
testing on dev slaveapi node
Attachment #8543406 -
Flags: checked-in+
Comment 10•10 years ago
|
||
This is done, correct?
Updated•10 years ago
|
Flags: needinfo?(jlund)
Assignee | ||
Comment 11•10 years ago
|
||
yes and thanks for cleaning
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(jlund)
Resolution: --- → FIXED
Updated•7 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•