Closed Bug 971907 Opened 11 years ago Closed 11 years ago

Backchannel for deleting user data from Mozilla Services

Categories

(Cloud Services Graveyard :: Server: Token, defect, P2)


Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: telliott, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

Attachments

(1 file, 1 obsolete file)

Similar to the current account portal, we need an option for users to go onto the FxA website and delete their account. When they do so, it needs to purge data from services associated with that account.

Because of token auth, the FxA portal can't do sync deletes - it has no way to associate the user with anything in the sync subsystem. That means the tokenserver will need to do the delete on behalf of the user. However, in the new system, where a user's FxA ID will be used for more than just services (marketplace, etc), we can't rely on tokenserver to have information about all of a user's services.

ckarlof suggested we use a PubSub model. That seems likely to be correct in the long run, but given our deadlines on this (it needs to be in place for launch), I'd prefer to defer a final solution until we have a better grip on the scope, what our partners will want, and what will need to happen. For now, let's just get sync working.

This means two things for now:

1) Establish a backchannel between the FxA server and tokenserver that can notify on a delete request. I think we can make this asynchronous (the synchronous version we use in current sync has problems if the dbs are busy, and we're adding yet another layer of indirection).

2) Have a tokenserver script that can take a userid from the FxA server and purge their account.
Blocks: 969626
OS: Mac OS X → All
Hardware: x86 → All
Toby: what's our timeline for the "let's just get sync working" scope?
We should have something in place for Fx29 launch. Doesn't have to be pretty, just some backchannel that gets the job done. Ryan and I brainstormed a bit about how that would look and security tradeoffs (shared db vs queue, etc), so he's likely off sketching something out now.
/cc danny for context
Per IRL conversation w Chris and Danny, let's aim to have something in place before FF29 goes to beta (march 18)
Hmmm. I'm in two minds here, so here are two strawmen for how this might work.

Option 1: SQS
==============

Create a dedicated "sync account deletion" queue hosted in Amazon SQS. The FxA server drops a job in there whenever an account is deleted. Our tokenserver maintenance scripts periodically check for jobs in the queue and process the deletes asynchronously, deleting each job as it completes. This is pretty classic job-queue stuff, so SQS should be well suited for it.

Things I'm not thrilled about:

* The one-off-ness of the solution. Depending on how quickly fxa-profile-server, marketplace, WMF etc all roll around, we may find ourselves managing several account-deletion queues in short order and/or rushing out a better solution.

* Another service to provision, monitor, secure, etc. In theory SQS will "just work" and will magically auto-scale as needed, and we can grant the relevant machines access automatically with IAM permissions. In theory.

Option 2: Quick-n-Dirty Deletion Log in MySQL
==============================================

Have the FxA server record each deletion event in a big old MySQL table along the lines of:

    CREATE TABLE deletion_events (
        timestamp BIGINT NOT NULL PRIMARY KEY,  -- yes, only one deletion allowed per millisecond
        uuid VARCHAR(36) NOT NULL
    );

Tokenserver maintenance scripts can periodically query this table for new deletion events:

    SELECT * FROM deletion_events WHERE timestamp > $LAST_SEEN_TS;

And we just periodically purge events that are more than, say, a week old. This is pretty ugly but it would have some advantages:

* We could just host this table inside the FxA database, and have the app provide a (privileged?) web API through which other services can query it. One less moving part for our Ops team, and a whole set of failure modes not to have to worry about (e.g. what happens when we delete an account but the job queue is down?).

* Additional services can just go ahead and read from this log for their own cleanup purposes, without us having to add extra handling for them on the FxA side.

* Despite an RDB being less-than-ideally-suited for this use case, I doubt the event volume here would be enough to cause any serious problems with the approach.

I'm not convinced either approach would be substantially simpler/faster than the other. Thoughts, anyone?
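For concreteness, Option 2's write/poll cycle can be sketched like so, with an in-memory sqlite3 database standing in for the real MySQL table. The `record_deletion` and `poll_deletions` helper names are hypothetical, not actual FxA or tokenserver code:

```python
import sqlite3
import time

# In-memory stand-in for the real MySQL deletion log; the table layout
# follows the CREATE TABLE sketch above.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE deletion_events (
        timestamp INTEGER NOT NULL PRIMARY KEY,  -- one deletion per millisecond
        uuid TEXT NOT NULL
    )
""")

def record_deletion(uuid):
    # Called by the FxA server whenever an account is deleted.
    db.execute("INSERT INTO deletion_events VALUES (?, ?)",
               (int(time.time() * 1000), uuid))

def poll_deletions(last_seen_ts):
    # Called periodically by tokenserver maintenance scripts; returns
    # (timestamp, uuid) pairs newer than the last-seen watermark.
    return db.execute(
        "SELECT timestamp, uuid FROM deletion_events WHERE timestamp > ?",
        (last_seen_ts,),
    ).fetchall()
```

The consumer only has to persist its last-seen timestamp between runs, which is what makes the periodic purge of old events safe.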
(In reply to Ryan Kelly [:rfkelly] from comment #6)
> * we could just host this table inside the FxA database, and have the app
> provide a (privileged?) web api through which other services can query it.
> One less moving part for our Ops team, and a whole set of failure modes not
> to have to worry about (e.g. what happens when we delete an account but the
> job queue is down?)

I'm not a fan of this from a security standpoint. We don't want tokenserver/future services to have direct access to the FxA database. (I'm fine with it living in its own space somewhere.)
> I'm not a fan of this from a security standpoint. We don't want tokenserver/future
> services to have direct access to the FxA database.

Hence "have the app provide a (privileged?) web api through which other services can query it." :-)

But yeah, we could push it out into a separate persistent queue/database somewhere if it improves the security of the setup.
/cc ops because one way or another, this will have to have some stuff running on the admin server.
Whiteboard: [qa+]
Blocks: 975903
Funnily enough, I'm currently liking a half-way hybrid between the two options outlined above:

1) When the FxA server deletes an account, write its uuid into a separate table in the FxA database so that we can find it for further processing.

2) Have a background process, internal to the FxA deployment, that pulls things out of this table and pushes them onto an SQS queue for downstream apps to deal with.

3) Have a tokenserver admin process that pulls accounts off this queue and cleans them up.

Point (1) makes it much easier to keep things in sync with the actual state of the FxA database, and not risk deleting data for an account that was queued but not actually deleted, or risk orphaning data for an account that was deleted but not queued. Point (2) means we don't have to commit to any sort of API change in the FxA server process itself. And (3) means we can offload a lot of the queue-management logic to SQS.

Gene, does the current FxA deployment plan include an "admin box" from which we could run periodic maintenance tasks such as this? (akin to the tokenserver "admin box" from which we can perform db management and cleanup tasks)
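Step (2) of this hybrid could look something like the following sketch. The `pending_deletions` table name and the injected `publish` callable are assumptions (in the real deployment, publish would post to SQS, e.g. via boto), with sqlite3 standing in for MySQL:

```python
import sqlite3

def drain_pending_deletions(db, publish):
    """Push each queued uuid downstream, then remove it from the table.

    Rows are removed only after a successful publish, so a crash between
    the two steps re-sends the event rather than losing it -- which means
    downstream consumers must tolerate duplicate deletion events.
    """
    rows = db.execute("SELECT uuid FROM pending_deletions").fetchall()
    for (uuid,) in rows:
        publish(uuid)  # hypothetical: would be an SQS send in production
        db.execute("DELETE FROM pending_deletions WHERE uuid = ?", (uuid,))
    db.commit()
    return len(rows)
```

Keeping the table as the source of truth and treating the queue as a best-effort fan-out is what lets the process crash at any point without orphaning or prematurely deleting data.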
rfk is proposing the FxA server does stuff.
Flags: needinfo?(dcoates)
Flags: needinfo?(gene)
> rfk is proposing the FxA server does stuff.

Any solution will require the FxA server to do some non-zero amount of stuff here; precisely how much remains an open question. We can probably ship with "fxa-auth-server posts to an SQS queue and forgets about it", but should iterate on that as things move forward to prevent corner cases where the queue gets out of sync with the db.
> Gene, does current FxA deployment plan include an "admin box" form which we could run
> periodic maintenance tasks such as this? (akin the the tokenserver "admin box" from
> which can perform db management and cleanup tasks)

No. Can the FxA auth app handle this MySQL table and SQS injection?
Flags: needinfo?(gene)
There is an "admin" server planned for FxA. :kparlante and I have discussed one that queries the database and outputs basic stats. It could also have other miscellaneous functionality for managing/querying the DB as well.

Also, some notes from chatting with :rfkelly yesterday:

- FxA app servers will post to an SNS topic
- TokenServer CFN will add SQS queue listeners to that SNS topic
- TokenServer app servers can pop messages off the SQS queue and perform the delete/clean-up

The main idea is to separate concerns and complexity. When the delete message goes out and is delivered to 3rd parties, it is up to those 3rd parties to do the delete.
Assignee: nobody → rfkelly
Flags: needinfo?(dcoates)
To expand slightly on Benson's comment above, here's what I'm looking at for a first iteration:

* Heka on the FxA webheads watches for successful account-delete requests. When it sees one, it posts a message to a dedicated SNS endpoint.
* The tokenserver stack manages its own SQS queue, and ensures that it is a listener for the appropriate SNS endpoint.
* The tokenserver admin box runs a periodic maintenance script, which pulls jobs off the SQS queue and does the deletion.

Future relying services can take a similar approach, adding their own SQS queue to the FxA SNS endpoint and doing whatever deletion is appropriate. We can also iterate on the reliability of the FxA=>SNS pathway in the future, but this should be good enough for now.
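One detail worth noting for the tokenserver consumer in this design: when SNS delivers into an SQS queue, the published payload arrives wrapped in a JSON envelope whose "Message" field holds the original message as a string, so the consumer has to unwrap it first. A sketch of the unwrapping; the `{"event": "delete", "uid": ...}` payload shape is illustrative, not the actual FxA message format:

```python
import json

def unwrap_sns_message(sqs_body):
    # Parse the SNS delivery envelope, then the payload string inside it.
    envelope = json.loads(sqs_body)
    return json.loads(envelope["Message"])

# Example of what a delivery might look like on the queue (topic ARN and
# payload are made up for illustration):
example_body = json.dumps({
    "Type": "Notification",
    "TopicArn": "arn:aws:sns:us-east-1:123456789012:fxa-account-events",
    "Message": json.dumps({"event": "delete", "uid": "0123abcd"}),
})
```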
+1 to use of SNS topic as the "fanout" to multiple SQS queues for app-specific actions. I'm not sure there's a win to having heka watch and post to the SNS topic vs the FxA app code just doing it directly, but that's an implementation detail for you guys to figure out.
> I'm not sure there's a win to having heka watch and post to the SNS topic vs the FxA app
> code just doing it directly

Related discussion here: https://github.com/mozilla/fxa-auth-server/issues/608

But yeah, that's a potential rat-hole that I'll be quick to jump out of if heka->SNS proves more fiddly than it's worth.
Could always just SMTP to SNS from heka, as gross as that might sound... but No New Libs (tm) required.
> Could always just SMTP to SNS from heka, as gross as that might sound

This is...I...but...hmm. Yeah. That's gross, but it's not crazy. I'll take a look.
Hmm, I may be missing it, but I can't see a good way to get data *into* SNS via SMTP. Notifications out, yes, but not notifications in. I'll do some more digging on the morrow.
Attached patch ts-process-account-deletions.diff (obsolete) (deleted) — Splinter Review
Here's a first cut at the tokenserver side of the equation. It's a simple script that will poll an SQS queue, pulling deletion events off of it and handling them appropriately. It re-uses some of the cleanup logic from Bug 984297, and like that bug, it needs more testing before review but is far enough along for general feedback. Benson, the idea here would be to run this as a background process on some box with appropriate IAM permissions to handle the SQS queue. Either TS webheads or admin box would be appropriate hosts.
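The poll-and-delete loop this script implements can be sketched as follows. FakeQueue and `mark_deleted` are hypothetical stand-ins for the real SQS/boto objects and the tokenserver cleanup logic, but the ordering -- acknowledge the message only after the account has been handled -- is the important part:

```python
# Minimal sketch of the polling script's core loop. Messages are deleted
# from the queue only after successful handling, so a crash mid-processing
# causes redelivery rather than a lost deletion event (handlers must
# therefore be idempotent).
class FakeQueue:
    """In-memory stand-in for an SQS queue, for local experimentation."""
    def __init__(self, messages):
        self.pending = list(messages)

    def get_message(self):
        # A real SQS consumer would long-poll here.
        return self.pending[0] if self.pending else None

    def delete_message(self, msg):
        self.pending.remove(msg)

def process_account_deletions(queue, mark_deleted):
    processed = 0
    while True:
        msg = queue.get_message()
        if msg is None:
            break
        mark_deleted(msg["uid"])   # handle first...
        queue.delete_message(msg)  # ...then ack
        processed += 1
    return processed
```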
Attachment #8392101 - Flags: feedback?(telliott)
Attachment #8392101 - Flags: feedback?(bwong)
Depends on: 984297
I like this. I'd probably run the SQS worker on every TS app server, supervised via a circus watcher to restart it if it crashes.

Looks like a few things need to be in place to wire everything up:

- we need an SNS topic, probably created as its own stack, and the FxA servers need to know about it
- the TS stack needs its own SQS queue which subscribes to the SNS topic
- additional SQS IAM permissions for app servers

These are all pretty trivial to do. Let me know when the FxA auth server code has landed and we can wire up the infrastructure bits to make it work.
Additional concern: after we delete their data, the client might use an existing certificate to re-create it, not realizing that the account has been deleted. We probably need to tweak the user record in some way to indicate deletion, so that we don't revive the account. One possibility: set a really high generation number so that the existing certificates are invalidated.
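The "really high generation number" idea can be sketched as follows. The record layout and function names are hypothetical, but they show why bumping the generation invalidates previously-issued certificates:

```python
# A deleted account gets a generation far in the future, so any
# previously-issued certificate (which embeds an older generation)
# fails the freshness check instead of silently re-creating data.
FAR_FUTURE = 1 << 62  # larger than any real millisecond timestamp

def mark_account_deleted(record):
    record["generation"] = FAR_FUTURE

def certificate_is_valid(record, cert_generation):
    # Tokenserver-style freshness check: reject certs whose embedded
    # generation predates the account's recorded generation.
    return cert_generation >= record["generation"]
```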
Comment on attachment 8392101 [details] [diff] [review]
ts-process-account-deletions.diff

clearing mostlygeek feedback flag
Attachment #8392101 - Flags: feedback?(bwong)
We'll probably want to schedule the actual data deletion to occur after all the user's existing tokens have had time to expire.
Perhaps we can just mark the account as deleted, and let the usual old-data-cleanup process (Bug 984297) handle the deletion. Would this introduce an unacceptable delay in processing the user's request? It would certainly simplify things.
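Marking first and purging later amounts to a grace-period check in the cleanup script: a replaced record only becomes eligible once it is older than the longest possible token lifetime, so live tokens expire before the data goes. A minimal sketch; the function name and default lifetime are assumptions, not the actual Bug 984297 code:

```python
DEFAULT_TOKEN_DURATION = 3600  # assumed maximum token lifetime, in seconds

def is_purgeable(replaced_at, now, grace_period=DEFAULT_TOKEN_DURATION):
    # A record replaced at time `replaced_at` may be purged only once the
    # grace period has fully elapsed; any token issued before the account
    # was marked has expired by then.
    return replaced_at <= now - grace_period
```

Passing `grace_period=0` (as the purge script's --grace-period=0 flag does later in this bug) purges immediately, at the cost of possibly racing live clients.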
Whatever you do, remember that the user who deletes the account may be doing it to untangle a mess, and may want to reestablish a new account immediately. Since the email address is the account id, the user will be stuck if the email address is in effect locked by a deferred delete process.
The new iteration of the id will have a new generation of cert, which will cause it to be assigned to a new node. There won't be any overlap here.
A good point; fortunately I think we're clear here because:

> Since the email address is the account id

From the tokenserver's perspective, the "email address" is a unique, firefox-accounts-specific identifier. Once deleted, it will never re-appear or be re-used. If the user creates a new firefox accounts account, they'll get a new internal identifier.
The FxA=>SNS part of this should roll with the next FxA train: https://github.com/mozilla/fxa-auth-server/pull/629

I think we should go with "operational reasons mean a small delay in deleting the data" and let the purge process from Bug 984297 handle the cleanup.
Updated patch, using the tweaked API from Bug 984232 and compatible with the final message sent by FxA in the above-linked PR.
Attachment #8392101 - Attachment is obsolete: true
Attachment #8392101 - Flags: feedback?(telliott)
Attachment #8393282 - Flags: review?(telliott)
Comment on attachment 8393282 [details] [diff] [review]
ts-process-account-deletions.diff

Review of attachment 8393282 [details] [diff] [review]:
-----------------------------------------------------------------

Feels a little weird to have separate processes for marking the db and deleting the data, but I can see why you did that and it'll work.
Attachment #8393282 - Flags: review?(telliott) → review+
> Feels a little weird to have separate processes for marking the db and deleting the data

Agree. If it weren't for the timing issues that Chris mentions in Comment 25, I think we'd try harder to do the cleanup inline.
Committed in https://github.com/mozilla-services/tokenserver/commit/da71fb36e9f5847c37583ad95a992eb3cd34c7f1

> [:mostlygeek]
> These are all pretty trivial to do. Let me know when the FxA auth server code has landed and
> we can wire up the infrastructure bits to make it work.

We should have all the pieces in place for this to roll out in the next set of deploys. Ben, please set up the infra stuff you describe in Comment 22.
Assignee: rfkelly → bwong
Priority: -- → P2
Where are we at on this? Maybe QA can help coordinate the deploy.
Flags: needinfo?(bwong)
This will happen today or tomorrow while :rfkelly is onsite...
Depends on: 933015
Flags: needinfo?(bwong)
Depends on: 993015
No longer depends on: 933015
I did a basic successful run-through of this in stage - created an account, deleted it, watched the notification propagate through to tokenserver for garbage collection. Here's what I did:

* Created an SNS topic and an SQS queue, and hooked up the topic to publish into the queue.
  * sns topic arn: arn:aws:sns:us-east-1:142069644989:fxa-stage-account-events-rfkelly
  * queue name: fxa-account-deletion
* Made sure that the fxa and ts webheads had IAM permissions to access the topic and the queue respectively (ok ok, I just made them "allow all" within the aws env, but you get the idea...)
* Updated fxa-auth-server to publish to the SNS topic:
  * pass the topic arn in as the "snsTopicArn" setting in default.json
  * change circus to run ./scripts/start-server.sh, which runs key_server.js and notifier.js in a pipeline
  * something like this: https://github.com/rfk/puppet-config/commit/547376e813eeaea43aa19b30a3114974a49d89f1
* Ran the process_account_deletions script on one of the tokenserver webheads:
  * ./bin/python -m tokenserver.scripts.process_account_deletions -v ./token-prod.ini fxa-account-deletion
  * Ideally we run this via circus, e.g. https://github.com/rfk/puppet-config/commit/24ea052d56916c59a75d87d9929af687a8793f65
* Created an FxA account, faked a sync to give it a node assignment, then deleted the account via the web interface.
* Watched the output of process_account_deletions to check that it received the deletion event, then checked the row in the "users" table to see that it was marked as replaced.
* Ran the purge_old_records script on one of the tokenserver webheads:
  * ./bin/python -m tokenserver.scripts.purge_old_records -v ./token-prod.ini --grace-period=0
  * grace-period=0 means it will purge straight away even though there might be "live clients" using the account
  * Ideally we run this from the node admin server as a background process
* Watched the output of purge_old_records to check that it found the deleted record and sent a DELETE request to the sync node.
* Checked that the row in the "users" table was in fact deleted.

So I'm pretty happy that this is all working as intended. I did not confirm that the data-deletion request to the sync node worked as intended; this is something we should check explicitly in a future test. We also need to iterate on the robustness of this, esp. checking for problems while running the process_account_deletions and purge_old_records scripts. Currently they will print errors to stderr; not sure what the best approach is there from an ops POV.
It seems having this depend on bug 993015 was a mistake.
No longer depends on: 993015
Depends on: 996751
Depends on: 986204
No longer depends on: 986204
Depends on: 993537
All wired up into the latest TokenServer: https://github.com/mozilla-services/puppet-config/pull/365
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Re-opened because we also need Bug 996751 before calling this complete.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is all live and deployed, woot
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
OK.
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard