Closed Bug 993537 Opened 11 years ago Closed 11 years ago

Deploy tokenserver tag rpm-1.2.0-2 to stage

Categories

(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)

task
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rfkelly, Assigned: mostlygeek)

References

Details

(Whiteboard: [qa+])

Please deploy tokenserver tag rpm-1.2.0-1 (rev 56a520fce1f60daaa0bc3cda33c1d32c8524865b) to stage.

This version includes important server-side fixes to allow cleaner node migration:

* Bug 988134 - X-Client-State tracking prevents proper node-reassignment
* Bug 988137 - do node-reassignment when all records are marked replaced

It also includes some extra sanity-checks for correct client behavior around generation change and x-client-state.

Bug 988134 requires a schema change to remove a no-longer-correct index:

    DROP INDEX clientstate_idx ON users;

As a side note, we should put better db-migration-handling code into the tokenserver. I'll be pretty surprised if this turns out to be the last tweak we need to make, and managing them through deployment bugs is far from ideal.

Note for future prod deployment: we also had schema changes in a previous deploy that did not make it to prod: Bug 986204.
Whiteboard: [qa+]
Regarding schema handling, the FxA guys have this: https://github.com/mozilla/fxa-auth-server/blob/master/bin/db_patcher.js. It essentially uses a table to keep track of the version the database is at, and gives ops a little script to run to patch the database.
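For illustration, here is a minimal Python sketch of that version-table idea. It is not how db_patcher.js or the tokenserver actually work: the `schema_version` table, the `PATCHES` list, and both function names are hypothetical, and it uses an in-process SQLite database as a stand-in for the real MySQL one (SQLite's `DROP INDEX` takes no `ON users` clause).

```python
import sqlite3

# Hypothetical ordered patches: (target version, SQL to reach it).
# The table and column definitions here are illustrative only.
PATCHES = [
    (1, "CREATE TABLE users (uid INTEGER PRIMARY KEY, email TEXT, client_state TEXT)"),
    (2, "CREATE INDEX clientstate_idx ON users (client_state)"),
    (3, "DROP INDEX clientstate_idx"),  # MySQL form: DROP INDEX clientstate_idx ON users
]


def current_version(conn):
    """Return the highest applied patch number, or 0 for a fresh database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def apply_patches(conn, patches=PATCHES):
    """Apply every patch newer than the recorded version; return those applied."""
    applied = []
    version = current_version(conn)
    for number, sql in sorted(patches):
        if number <= version:
            continue  # already applied on a previous run
        conn.execute(sql)
        conn.execute("INSERT INTO schema_version (version) VALUES (?)", (number,))
        conn.commit()
        applied.append(number)
    return applied


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(apply_patches(conn))  # first run applies everything
    print(apply_patches(conn))  # second run is a no-op
```

Because the applied version is recorded in the database itself, re-running the script is harmless, which is the property that makes it safe to hand to ops.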
Some older discussion on python db migrations in Bug 777650. Benson, do you want me to see if I can get something more formal up-and-running for this deploy, or just consider it for future?
Flags: needinfo?(bwong)
*NEAR* future is fine. For this deploy I can just drop the index. Dropping an index is a pretty safe operation.
Flags: needinfo?(bwong)
Pause this deploy due to re-opening of Bug 988643, we'll need to figure out something better there.
Something to consider as well: do we want to keep the number of TS instances as is? Or do we want to play with the math a bit to get either

1. Two larger instances than what we have
2. More of the same size instances

Thinking ahead for load testing and for scaling Stage to the "right spot" sooner rather than later.
Three m3.medium instances will max out the number of IOPS to the RDS before we max out the CPU.
Depends on: 988643, 777650, 995767
It would be great to get these deployed in the coming week. I'm in transit Monday but will be available to work on this on Tuesday if we can get the changes through review. (Then in transit again, available PDT Friday, then on PTO for a week unless it really hits the fan)
I will be around Monday after lunchtime. Same for Tuesday. Rest of the week - normal business hours. Just let me know if we get that Tuesday window (or later in the week).
No longer depends on: 777650, 988643, 995767
Summary: Deploy tokenserver tag rpm-1.2.0-1 to stage → Deploy tokenserver tag rpm-1.2.0-2 to stage
Per IRL discussion with Benson, let's deploy this *without* doing the database index changes so that we can get the latest updates for Bug 971907 live. I've tagged rpm-1.2.0-2 with a couple of tweaks to those scripts.

The plan:

* build and deploy rpm-1.2.0-2 of tokenserver with:
  * SQS setup necessary for Bug 971907
  * increased number of webheads, let's go with 3 per Comment 6 above
* run tokenserver-only loadtest to confirm it's not broken
* get all that goodness out to prod

I will then follow up with:

* a new deployment for the db migrations stuff
* a new bug for enabling the purge_old_records script in the node-management server
Blocks: 971907
In addition to tokenserver loadtest, we will need to verify that the deleted-account-notification stuff is working with this deploy. This will require some inspection of the tokenserver db and some client-side work by QA.

Here's a sketch of the process:

* Confirm with ops that stage fxa-auth-server is plugged into the SNS/SQS setup for account deletions
* Confirm with ops that the "process_account_deletions" script is running on tokenserver webheads and is writing stdout/stderr to a file.
* Create a new account on stage fxa-auth-server, log into firefox using this account, and sync with it.
* Go into about:config and pull out the numeric uid for this user (I believe it can be found from either "services.sync.username" or "services.sync.clusterURL")
* On the tokenserver db, find the user record for this uid and confirm that its replaced_at column is NULL:

    SELECT * FROM users WHERE uid=<uid>;

* Note the email address associated with the account, which will be returned in the above query.
* Go to the stage auth server and delete the account using the web-based management flow.
* Watch the log output from the new "process_account_deletions" script on the tokenserver webheads. One of them should get the account-deletion message, process it, and log about it.
* Query for all tokenserver db records associated with the account:

    SELECT * FROM users WHERE email = <the email noted above>

* Verify that there's only the one record from before, that its `replaced_at` column is no longer NULL, and that its `generation` column is a very large integer.

:jbonacci does this make sense?
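The final db checks in that process could be scripted roughly as follows. This is only a sketch under stated assumptions: the `users` table and its `uid`, `email`, `generation` and `replaced_at` columns come from the steps above, but the function name is made up, the `min_generation` threshold is an arbitrary stand-in for "a very large integer", and the `?` placeholder style is the one used by Python's `sqlite3` driver (a MySQL driver would typically use `%s`).

```python
import sqlite3


def check_deleted_account(conn, email, min_generation=10**12):
    """Return a list of problems found for a deleted account; empty means OK.

    min_generation is an assumed cut-off for "a very large integer";
    the real value written by the deletion script may differ.
    """
    rows = conn.execute(
        "SELECT uid, generation, replaced_at FROM users WHERE email = ?",
        (email,),
    ).fetchall()

    problems = []
    if len(rows) != 1:
        problems.append("expected exactly 1 record, found %d" % len(rows))
    for uid, generation, replaced_at in rows:
        if replaced_at is None:
            problems.append("uid %s: replaced_at is still NULL" % uid)
        if generation is None or generation < min_generation:
            problems.append("uid %s: generation %r is not very large" % (uid, generation))
    return problems
```

An empty return value corresponds to the "verify" step passing; anything else lists which expectation failed for which uid.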
:rfkelly yes
Sounds like about 2 hours work after a good load test result on TS Stage.

Dependencies:
1. Deployment of train-10 to FxA Stage
2. Ideally, a good load test result on FxA Stage
(The load tests will quickly tell us if we broke anything)
(and they can be run in parallel)

So:
1. Deployments to TS Stage and FxA Stage
2. Successful load tests on both
3. Run steps from https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10

I can work with :jrgm, :gene, and :mostlygeek on this as needed...
Status: NEW → ASSIGNED
Blocks: 996915
Assignee: nobody → bwong
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Verified the launch of a new TS stack for Stage: ts-s-2014-04-16
This has 3 m3.medium instances to more closely match Production.
Verified that no new changes to Verifier Stage or Sync Server Stage are needed.
Verified that the old stack is down.
Verified that the DNS is now pointing to the new stack.

Continuing with items from here: https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10
Starting with a two-hour TS load test while we wait for FxA Stage to come online...
Also, these are verified: * Confirm with ops that stage fxa-auth-server is plugged into the SNS/SQS setup for account deletions * Confirm with ops that the "process_account_deletions" script is running on tokenserver webheads and is writing stdout/stderr to a file.
Here are the results of the load test - from the Loads dashboard:

Status
    Test was launched by jbonacci
    Run Id: 3ceaebd3-bf28-4f45-a409-e7e3dd55908c
    Duration: 2 h and 9 sec.
    Started: 2014-04-16 23:22:06 UTC
    Ended: 2014-04-17 00:25:24 UTC
    State: Ended

Configuration
    Users: [20]
    Hits: None
    Agents: 5
    Duration: 7200
    Server URL: https://token.stage.mozaws.net

Results
    Tests over: 1529874
    Successes: 1529861
    Failures: 0
    Errors: 0
    TCP Hits: 1560302
    Opened web sockets: 0
    Total web sockets: 0
    Bytes/websockets: 0
    Requests / second (RPS): 216

Custom metrics
    addFailure: 13

We should be able to pull the server-side metrics off of Stackdriver.

A TS node showed the following breakdown of 200s and 401s:
    200s: 525708
    401s: 15842
    Total: 541550
of which the 401s contribute about 2.9%, which is probably good enough.
The other two nodes showed similar stats.

I call this a pass.
Tomorrow we move on to manual testing of TS and FxA Auth in Stage.
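As a quick sanity check on the per-node numbers above, the total and the 401 share can be recomputed from the two counts. The helper name below is made up for illustration; the counts are the ones reported for the single TS node.

```python
def status_share(counts):
    """Return (total requests, {status: percentage of total})."""
    total = sum(counts.values())
    share = {status: 100.0 * n / total for status, n in counts.items()}
    return total, share


# Counts reported for one TS node in the comment above.
total, share = status_share({200: 525708, 401: 15842})
print(total)                 # 541550
print(round(share[401], 1))  # 2.9
```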
Further Stage testing is blocked by bug 997964
OK, bug 997964 has been resolved and verified.
I have Fx29b8 running on my Mac.
I have a new-ish profile with an account created.
I am pointing to Stage FxA, TS, Verifier, Sync.
So, tomorrow (Friday 4/18), we will finish out this ticket with some manual testing.

Test settings:
services.sync.log.appender.file.logOnError = Yes
services.sync.log.appender.file.logOnSuccess = Yes
services.sync.log.appender.file.level = Trace
identity.fxaccounts.remote.force_auth.uri = https://accounts.stage.mozaws.net/force_auth?service=sync&context=fx_desktop_v1
identity.fxaccounts.remote.signin.uri = https://accounts.stage.mozaws.net/signin?service=sync&context=fx_desktop_v1
identity.fxaccounts.remote.signup.uri = https://accounts.stage.mozaws.net/signup?service=sync&context=fx_desktop_v1
services.sync.tokenServerURI = https://token.stage.mozaws.net/1.0/sync/1.5
identity.fxaccounts.auth.uri = https://api-accounts.stage.mozaws.net/v1
identity.fxaccounts.settings.uri = https://accounts.stage.mozaws.net/settings

And also,
services.sync.clusterURL = https://sync-1-us-east-1.stage.mozaws.net/1.5/BLAH/
OK. :rfkelly and I covered everything here: https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10 There are some pretty serious UX/UI and functional issues surrounding account deletion, but all are independent of this Stage deploy and test.
Status: RESOLVED → VERIFIED
Blocks: 998511