Closed Bug 1437973 Opened 7 years ago Closed 7 years ago

Write and deploy a new script for bug 1372172 -- Automatic rebooting of 'impaired' instances

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Assigned: dustin)

References

Details

Attachments

(1 file, 1 obsolete file)

Attached file script to terminate impaired instances (obsolete) (deleted) —
Bug 1372172 is about something broken with our Windows 10 64-Bit GPU instances which is causing them to lose network connectivity. There's a hacky workaround which checks the EC2 api and reboots all instances which have been impaired for more than 10 minutes. This script was using internal implementation details of the provisioner and as a result broken when we changed those internal implementation details. The best way to find instances which are managed by the provisioner is through the provisioner endpoint: https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu, which will give a list of objects which contain the InstanceId for all managed instances as well as the region, when they were started and what their state is. As an interim step, I've written a new version of this script which should reboot the instances in this bad state. A couple questions: - why are we globing the workerType? is it just to make checking the key pair name easier or do we really *need* globing? - rebootInstances over terminateInstances.... which should we be doing? We use terminateInstances elsewhere In order to run this script, you should do the following: mkdir kill-impaired cd kill-impaired yarn add aws-sdk || npm install aws-sdk curl -L -o script.js <URL of this attachment> node script.js There will be a Yaml file which contains a JSON log of the output. Each document in the document is an object with a start, end timestamps as well as per-region lists of which instances were rebooted. If the DRY_RUN flag is set to any value (including false, no or 0), no action will happen, just printing out information to screen and the log file.
Dustin, per IRC, could you take a stab at deploying this somewhere? I've got it running under screen on my work desktop for tonight, but we should figure out a real deployment.

Oh, and to run the script, you'll also need:

export AWS_SECRET_ACCESS_KEY=<snip>
export AWS_ACCESS_KEY_ID=<snip>

or have something set up such that new aws.EC2() would pick up your creds.
Flags: needinfo?(dustin)
More expansive glob pattern for hostnames.
Attachment #8950677 - Attachment is obsolete: true
Assignee: relops → dustin
Flags: needinfo?(dustin)
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #0)
> - rebootInstances over terminateInstances... which should we be doing? We
> use terminateInstances elsewhere

From #taskcluster archaeology, it came out of a conversation between grenade and garndt, hoping that by switching to a reboot they might get more data in the logs than they had been getting when terminating instances.

< grenade> garndt: we could adapt the kill script to reboot instead of terminate. We might get better log data in papertrail.
<&garndt> grenade: that might help, although I did manually reboot one of the machines yesterday, and all I got in the logs was that it started things up, and then some validation check failed and then self terminated
<&garndt> we certainly could try that though

Unclear if it's been useful, though.
Trees are closed for this again, and the impaired-instance script is only finding a few instances. I see a few more in the console, but most of those predate this deployment. I clicked the terminate-all button for every Windows worker type with a backlog. Perhaps we're in the instances-never-claim-work state again? Or instances-take-forever-to-start?
I created a repo: https://github.com/djmitche/winstance-slayer with a modified version of jhford's modification of grenade's script:

- accept more args via env vars
- look for creds in secrets if necessary
- terminate or reboot

I created an AWS IAM policy, WinstanceSlayer:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ec2:RebootInstances",
        "ec2:TerminateInstances",
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus"
      ],
      "Resource": "*"
    }
  ]
}

I created an IAM user, `winstance-slayer`, with this policy.

I created a secret, `project/releng/winstance-slayer/aws-creds`, containing that user's creds.

I created a task: https://tools.taskcluster.net/groups/Hkrp4jqmS42QZ5WAnBmuJw/tasks/Hkrp4jqmS42QZ5WAnBmuJw/details

It worked!

I created a hook: https://tools.taskcluster.net/hooks/project-releng/winstance-slayer

I created a role: https://tools.taskcluster.net/auth/roles/hook-id%3Aproject-releng%2Fwinstance-slayer

and triggered it: https://tools.taskcluster.net/groups/EZ70DvtoQ_yptC6KZWUvEQ/tasks/EZ70DvtoQ_yptC6KZWUvEQ/runs/0/logs/public%2Flogs%2Flive.log

and set it to run every 10 minutes.
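For the "look for creds in secrets if necessary" piece, the shape is roughly the following. This is a sketch under assumptions: that the script uses taskcluster-client's Secrets API via the taskcluster proxy, and that the secret body holds AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY keys; the repo has the real code.

// Sketch: use AWS creds from the environment if present, otherwise
// fetch them from the secrets service (here via the taskcluster proxy,
// which works inside a task that has the matching secrets scope).
// The key names in the secret body are assumptions.
const aws = require('aws-sdk');
const taskcluster = require('taskcluster-client');

async function makeEC2(region) {
  let {AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY} = process.env;
  if (!AWS_ACCESS_KEY_ID || !AWS_SECRET_ACCESS_KEY) {
    const secrets = new taskcluster.Secrets({
      baseUrl: 'http://taskcluster/secrets/v1',
    });
    const {secret} = await secrets.get(
      'project/releng/winstance-slayer/aws-creds');
    ({AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY} = secret);
  }
  return new aws.EC2({
    region,
    accessKeyId: AWS_ACCESS_KEY_ID,
    secretAccessKey: AWS_SECRET_ACCESS_KEY,
  });
}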
I added a cache for the git checkout, so this should hit GitHub and yarn a little less hard. I'm happy with this as-is. It's nicely visible for anyone else who needs to poke at it (if you do, feel free to fork the repo and point the hook at your fork, or I'm happy to grant collaborator access). It's worth noting that this very much does not fix the backlog.
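For anyone reproducing the hook, the cache piece of a docker-worker payload generally looks like the following. The image, cache name, paths, and command here are illustrative guesses, not necessarily what this hook actually uses; the hook's role also needs the matching docker-worker:cache:* scope for the named cache.

{
  "image": "node:8",
  "cache": {
    "winstance-slayer-checkout": "/checkout"
  },
  "command": [
    "/bin/bash", "-c",
    "if [ -d /checkout/repo ]; then git -C /checkout/repo pull; else git clone https://github.com/djmitche/winstance-slayer /checkout/repo; fi && cd /checkout/repo && yarn install && node script.js"
  ],
  "maxRunTime": 600
}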
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED