Closed Bug 1290922 Opened 8 years ago Closed 8 years ago

do not store metadata in the EC2 KeyPair field

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jhford, Unassigned)

References

Details

Attachments

(1 file)

Until now, we've been using the KeyPair.Name field in the launch specification to save information on which worker type an instance or spot request belong to, and which provisioner manages them. This works really well, but has the downside that we need to automatically create and delete KeyPairs when creating worker types. As part of the work to build a GPG Token passing system for artifacts, we're going to start putting a GPG key onto AMIs. The risk here is that if an attacker compromises the Heroku account or the provisioner itself, they might obtain the AWS credentials that the provisioner uses. So long as the provisioner has CreateKey permissions, an attacker could theoretically create a new ssh key with the ec2 api, launch a worker type, ssh in with their key and obtain the GPG key used to sign/signature artifacts. The goal of this bug is to store the critical workerType information in a different field that has zero security concerns. This way, we can have an IAM profile for the provisioner that cannot create key pairs. By removing key pair creation outside of the provisioner, you can no longer exploit this path to go from compromising the provisioner or heroku to having the GPG key.
My initial approach was to build a database of spot request id to worker type mappings, store them in azure and then set them as tags as soon as the API allowed us. I got it mostly throught through but there are just too many oddities to consider this a valid approach. Here's the branch in case anyone's curious: https://github.com/taskcluster/aws-provisioner/tree/idea-to-store-workerType-in-tags
Attached file v1 (deleted) —
I just realised that I need to also expire things from this file, since we don't want it to grow unbounded. That change should be pretty trivial, so it'd take only a quick review. Can you take a look at this code and let me know what you think? I'll write up the expiring patch, so you may or may not see the review at the time that you see the next commit.
Attachment #8776612 - Flags: review?(jopsen)
Hi Selena, I don't know which bug this one should be blocking. Do you have it handy?
Flags: needinfo?(sdeckelmann)
Note to self: create the Production key pairs in all the regions using the format ${provisionerId}-ssh-key.
Flags: needinfo?(sdeckelmann)
Jonas and I had a quick meeting about this. We talked about 5 strategies that we could take to solve this bug: 1. consider that if someone has hacked heroku or has or has our aws credentials, we've lost already. 2. use a network interface or something 3. use security groups 4. use the code as specified in the PR with a change to cover rogue instances 5. lock down createKeyPair to specific key-fingerprint. We chose 4, and rejected the others because: 1. we have an idea to implement which does address this attack path 2. network interfaces impose restrictions on worker types and we don't know all the details of how this works 3. security groups are limited to 500, which means we'd be limited to 500 worker types 5. our reading of the IAM policy guide suggests that we can only lock resources down by name and not by value. This would require us to know every possible worker type name before they're created, or it would require someone to manually create a keypair in the EC2 console to create a workertype, an error prone process and one that ruins our api The solution we've decided is to add another feature to the PR that I've made here. The case that we were concerned with was around data loss. The two known methods of data loss were: * EC2 api processes request, response is not sent because of any http fault. * EC2 api result value is returned, but the Azure blob storage system goes down and the provisioner restarts, thus not persisting the SpotInstanceRequestId. The case that concerns us is that if we do not get this value back from the API, we might have a rouge instance that can never be cleaned up. The change to the PR that we've chose is to: 1. change workerTypeForSRID to workerTypeForResource 2. In workerTypeForResource, if we cannot find the workerType in spot request id cache, we will use DescribeInstanceAttribute on the instance's UserData to determine which workerType it is. We can assume that anything which has unparsable user data is not provisioner owned, since the provisioner enforces JSON formatting for UserData. This change will let us have the same assurances as the current KeyPair system while letting us limit the IAM profile to a single SSH key that we can create, and lock the IAM profile down to using.
Comment on attachment 8776612 [details] v1 See comments on github... This seems pretty good though...
Attachment #8776612 - Flags: review?(jopsen)
WONTFIX in favour of bug 1299148
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Component: AWS-Provisioner → Services
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: