Closed
Bug 1437973
Opened 7 years ago
Closed 7 years ago
Write and deploy a new script for bug 1372172 -- Automatic rebooting of 'impaired' instances
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jhford, Assigned: dustin)
References
Details
Attachments
(1 file, 1 obsolete file)
(deleted), text/plain
Bug 1372172 is about something broken with our Windows 10 64-Bit GPU instances which is causing them to lose network connectivity. There's a hacky workaround which checks the EC2 api and reboots all instances which have been impaired for more than 10 minutes.
That script relied on internal implementation details of the provisioner, and as a result it broke when we changed those details.
The best way to find instances which are managed by the provisioner is through the provisioner endpoint:
https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu, which will give a list of objects which contain the InstanceId for all managed instances as well as the region, when they were started and what their state is.
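That per-instance state can be reduced to per-region lists of instance IDs before any EC2 calls are made. A minimal sketch (the `InstanceId`, `Region`, and `State` field names are assumptions based on the description above, not taken from the live endpoint):

```javascript
// Group instance IDs by region so that each region's EC2 client can
// be handed a single batched call. Field names are assumptions based
// on the provisioner state response described above.
function groupByRegion(instances) {
  const byRegion = {};
  for (const inst of instances) {
    if (!byRegion[inst.Region]) {
      byRegion[inst.Region] = [];
    }
    byRegion[inst.Region].push(inst.InstanceId);
  }
  return byRegion;
}

// Hypothetical data shaped like the endpoint's output:
const grouped = groupByRegion([
  {InstanceId: 'i-0aaa', Region: 'us-west-2', State: 'running'},
  {InstanceId: 'i-0bbb', Region: 'us-east-1', State: 'running'},
  {InstanceId: 'i-0ccc', Region: 'us-west-2', State: 'pending'},
]);
console.log(grouped); // { 'us-west-2': ['i-0aaa', 'i-0ccc'], 'us-east-1': ['i-0bbb'] }
```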
As an interim step, I've written a new version of this script which should reboot the instances in this bad state.
A couple questions:
- why are we globbing the workerType? Is it just to make checking the key pair name easier, or do we really *need* globbing?
- rebootInstances vs. terminateInstances: which should we be doing? We use terminateInstances elsewhere.
In order to run this script, you should do the following:
mkdir kill-impaired
cd kill-impaired
yarn add aws-sdk || npm install aws-sdk
curl -L -o script.js <URL of this attachment>
node script.js
The script writes a YAML file containing a log of its output. Each document in the file is an object with start and end timestamps as well as per-region lists of which instances were rebooted.
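For illustration, one document in that log might look like the following (a hypothetical shape matching the description; the actual keys in the script's output may differ):

```yaml
---
start: '2018-02-13T00:00:00Z'
end: '2018-02-13T00:01:12Z'
us-west-2:
  - i-0123456789abcdef0
us-east-1: []
```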
If the DRY_RUN environment variable is set to any value (including false, no, or 0), no action is taken; information is only printed to the screen and the log file.
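The ten-minute impairment check itself can be written as a pure filter over `describeInstanceStatus` results. A sketch using the `SystemStatus`/`ImpairedSince` shape from the EC2 DescribeInstanceStatus response (the helper name and sample data are hypothetical, not from the attached script):

```javascript
// Select instance IDs whose system status has been 'impaired' for
// longer than maxAgeMs, based on the ImpairedSince timestamp in the
// status details (shape per EC2 DescribeInstanceStatus).
function impairedTooLong(statuses, now, maxAgeMs) {
  return statuses
    .filter(s => s.SystemStatus && s.SystemStatus.Status === 'impaired')
    .filter(s => (s.SystemStatus.Details || []).some(
      d => d.ImpairedSince && now - d.ImpairedSince.getTime() > maxAgeMs))
    .map(s => s.InstanceId);
}

// Per the DRY_RUN semantics above: the presence of the variable, not
// its value, disables action.
const dryRun = 'DRY_RUN' in process.env;

const now = Date.now();
const toReboot = impairedTooLong([
  {InstanceId: 'i-old', SystemStatus: {Status: 'impaired', Details: [
    {Name: 'reachability', Status: 'failed',
     ImpairedSince: new Date(now - 15 * 60 * 1000)}]}},
  {InstanceId: 'i-new', SystemStatus: {Status: 'impaired', Details: [
    {Name: 'reachability', Status: 'failed',
     ImpairedSince: new Date(now - 2 * 60 * 1000)}]}},
], now, 10 * 60 * 1000);
console.log(toReboot); // [ 'i-old' ]
```

Only `i-old` crosses the ten-minute threshold; the resulting IDs would then be passed to rebootInstances (or terminateInstances) per region.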
Reporter
Comment 1•7 years ago
Dustin, per IRC, could you take a stab at deploying this somewhere? I've got it running under screen on my work desktop for tonight, but we should figure out a real deployment.
Oh, and to run the script, you'll need
export AWS_SECRET_ACCESS_KEY=<snip>
export AWS_ACCESS_KEY_ID=<snip>
as well, or have something such that new aws.EC2() would pick up your creds.
Flags: needinfo?(dustin)
Reporter
Comment 2•7 years ago
More expansive glob pattern for hostnames.
Attachment #8950677 - Attachment is obsolete: true
Assignee
Updated•7 years ago
Assignee: relops → dustin
Flags: needinfo?(dustin)
Comment 3•7 years ago
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #0)
>
> - rebootInstances over terminateInstances.... which should we be doing? We
> use terminateInstances elsewhere
From #taskcluster archaeology, it came out of a conversation between grenade and garndt, hoping that by switching to a reboot they might get more data in the logs than they had been when terminating instances.
< grenade> garndt: we could adapt the kill script to reboot instead of terminate. We might get better log data in papertrail.
<&garndt> grenade: that might help, although I did manually reboot one of the machines yesterday, and all I got in the logs was that it started things up, and then some validation check failed and then self terminated
<&garndt> we certainly could try that though
Unclear if it's been useful, though.
Assignee
Comment 4•7 years ago
Trees closed for this again, and the impaired-instance script is only finding a few instances. I see a few more in the console, but most of those are from the pre-deployment ownership.
I clicked the terminate-all button for everything windows with a backlog.
Perhaps we're in the instances-never-claim-work state again? Or instances-take-forever-to-start?
Assignee
Comment 5•7 years ago
I created a repo:
https://github.com/djmitche/winstance-slayer
with a modified version of jhford's modification of grenade's script:
- accept more args via env vars
- look for creds in secrets if necessary
- terminate or reboot
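The terminate-or-reboot switch in the list above could be driven by an environment variable. A sketch with a hypothetical ACTION variable (the repo's actual variable names may differ):

```javascript
// Map a hypothetical ACTION env var to the EC2 client method to call.
// Defaults to rebooting; anything other than 'reboot'/'terminate' is
// rejected rather than silently ignored.
function chooseMethod(action) {
  switch ((action || 'reboot').toLowerCase()) {
    case 'reboot':
      return 'rebootInstances';
    case 'terminate':
      return 'terminateInstances';
    default:
      throw new Error(`unknown action: ${action}`);
  }
}

console.log(chooseMethod('terminate')); // terminateInstances
```

Usage against an aws-sdk v2 EC2 client would then look like `ec2[chooseMethod(process.env.ACTION)]({InstanceIds: ids})`.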
I created an AWS IAM Policy, WinstanceSlayer:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:RebootInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus"
      ],
      "Resource": "*"
    }
  ]
}
I created an IAM user, `winstance-slayer`, with this policy.
I created a secret, `project/releng/winstance-slayer/aws-creds`, containing that user's creds.
I created a task:
https://tools.taskcluster.net/groups/Hkrp4jqmS42QZ5WAnBmuJw/tasks/Hkrp4jqmS42QZ5WAnBmuJw/details
It worked!
I created a hook:
https://tools.taskcluster.net/hooks/project-releng/winstance-slayer
I created a role:
https://tools.taskcluster.net/auth/roles/hook-id%3Aproject-releng%2Fwinstance-slayer
and triggered it:
https://tools.taskcluster.net/groups/EZ70DvtoQ_yptC6KZWUvEQ/tasks/EZ70DvtoQ_yptC6KZWUvEQ/runs/0/logs/public%2Flogs%2Flive.log
and set it to run every 10 minutes.
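For reference, a hook that fires every 10 minutes is typically expressed with a cron-style `schedule` in the hook definition, alongside the task definition linked above. A hypothetical fragment (the exact field layout expected by the hooks service is an assumption here):

```json
{
  "schedule": ["0 */10 * * * *"]
}
```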
Assignee
Comment 6•7 years ago
I added a cache for the git checkout, so this should hit github and yarn a little less hard.
I'm happy with this as-is. It's nicely visible for anyone else who needs to poke at it (if you do, feel free to fork the repo and point the hook at your fork, or I'm happy to grant collaborator access).
It's worth noting this is very much not fixing the backlog.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED