Closed
Bug 993537
Opened 11 years ago
Closed 11 years ago
Deploy tokenserver tag rpm-1.2.0-2 to stage
Categories
(Cloud Services :: Operations: Deployment Requests - DEPRECATED, task)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: rfkelly, Assigned: mostlygeek)
References
Details
(Whiteboard: [qa+])
Please deploy tokenserver tag rpm-1.2.0-1 (rev 56a520fce1f60daaa0bc3cda33c1d32c8524865b) to stage. This version includes an important server-side fix to allow cleaner node migration:
Bug 988134 - X-Client-State tracking prevents proper node-reassignment
Bug 988137 - do node-reassignment when all records are marked replaced
It also includes some extra sanity-checks for correct client behavior around generation change and x-client-state.
Bug 988134 requires a schema change to remove a no-longer-correct index:
DROP INDEX clientstate_idx ON users;
As a side note, we should put better db-migration-handling code into the tokenserver. I'll be pretty surprised if this turns out to be the last tweak we need to make, and managing them through deployment bugs is far from ideal.
Note for future prod deployment: we also had schema changes in a previous deploy that did not make it to prod: Bug 986204.
Updated•11 years ago
Whiteboard: [qa+]
Assignee
Comment 1•11 years ago
Regarding schema handling the FxA guys have this: https://github.com/mozilla/fxa-auth-server/blob/master/bin/db_patcher.js
Which essentially uses a table to keep track of the version the database is at and gives a little script for ops to run to patch the database.
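A minimal sketch of that version-table idea, written in Python since the tokenserver is a Python service. It is not the FxA patcher and not tokenserver code: the `schema_version` table, the `PATCHES` list, and the version numbers are assumptions made for illustration, and it runs against an in-memory SQLite database, so the statement drops the MySQL-style `ON users` suffix.

```python
import sqlite3

# Hypothetical patch list: (target_version, SQL). Only the index name comes
# from this bug; SQLite spells DROP INDEX without the "ON users" suffix.
PATCHES = [
    (1, "DROP INDEX clientstate_idx"),
]


def current_version(conn):
    """Return the recorded schema version, creating the tracking table if needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT version FROM schema_version").fetchone()
    if row is None:
        conn.execute("INSERT INTO schema_version (version) VALUES (0)")
        return 0
    return row[0]


def apply_patches(conn, patches=PATCHES):
    """Apply every patch newer than the recorded version, in ascending order."""
    version = current_version(conn)
    for target, sql in sorted(patches):
        if target > version:
            conn.execute(sql)
            conn.execute("UPDATE schema_version SET version = ?", (target,))
            version = target
    conn.commit()
    return version


def demo():
    """Build a throwaway db that still has the old index, then patch it twice."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, client_state TEXT)")
    conn.execute("CREATE INDEX clientstate_idx ON users (client_state)")
    first = apply_patches(conn)
    second = apply_patches(conn)  # nothing left to apply on the second run
    remaining = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND name = 'clientstate_idx'")]
    return first, second, remaining
```

Because the recorded version gates each patch, running the script a second time is a no-op, which is the property that makes it safe to hand to ops.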
Reporter
Comment 2•11 years ago
Some older discussion on python db migrations in Bug 777650.
Benson, do you want me to see if I can get something more formal up-and-running for this deploy, or just consider it for future?
Flags: needinfo?(bwong)
Assignee
Comment 3•11 years ago
*NEAR* future is fine. For this deploy I can just drop the index. Dropping an index is a pretty safe operation.
Flags: needinfo?(bwong)
Reporter
Comment 4•11 years ago
Pause this deploy due to the re-opening of Bug 988643; we'll need to figure out something better there.
Comment 5•11 years ago
Something to consider as well:
Do we want to keep the number of TS instances as is?
or do we want to play with the math a bit to get either
1. Two larger instances than what we have
2. More of the same size instances
Thinking ahead for load testing and for scaling Stage to the "right spot" sooner rather than later.
Assignee
Comment 6•11 years ago
Three m3.medium instances will max out the number of IOPS to the RDS before we max out the CPU.
Reporter
Comment 7•11 years ago
It would be great to get these deployed in the coming week. I'm in transit Monday but will be available to work on this on Tuesday if we can get the changes through review. (Then in transit again, available PDT Friday, then on PTO for a week unless it really hits the fan)
Comment 8•11 years ago
I will be around Monday after lunchtime. Same for Tuesday.
Rest of the week - normal business hours.
Just let me know if we get that Tuesday window (or later in the week).
Reporter
Comment 9•11 years ago
Per IRL discussion with Benson, let's deploy this *without* doing the database index changes so that we can get the latest updates for Bug 971907 live. I've tagged rpm-1.2.0-2 with a couple of tweaks to those scripts.
The plan:
* build and deploy rpm-1.2.0-2 of tokenserver with:
  * SQS setup necessary for Bug 971907
  * increased number of webheads, let's go with 3 per Comment 6 above
* run tokenserver-only loadtest to confirm it's not broken
* get all that goodness out to prod
I will then follow up with:
* a new deployment for the db migrations stuff
* a new bug for enabling the purge_old_records script in the node-management server
Reporter
Comment 10•11 years ago
In addition to tokenserver loadtest, we will need to verify that the deleted-account-notification stuff is working with this deploy. This will require some inspection of the tokenserver db and some client-side work by QA. Here's a sketch of the process:
* Confirm with ops that stage fxa-auth-server is plugged into the SNS/SQS setup for account deletions
* Confirm with ops that the "process_account_deletions" script is running on tokenserver webheads and is writing stdout/stderr to a file.
* Create a new account on stage fxa-auth-server, log into firefox using this account, and sync with it.
* Go into about:config and pull out the numeric uid for this user (I believe it can be found from either "services.sync.username" or "services.sync.clusterURL")
* On the tokenserver db, find the user record for this uid and confirm that its replaced_at column is NULL:
SELECT * FROM users WHERE uid=<uid>;
* Note the email address associated with the account, which will be returned in the above query.
* Go to the stage auth server and delete the account using the web-based management flow.
* Watch the log output from the new "process_account_deletions" script on the tokenserver webheads. One of them should get the account-deletion message, process it, and log about it.
* Query for all tokenserver db records associated with the account:
SELECT * FROM users WHERE email = '<the email noted above>';
* Verify that there's only the one record from before, that its `replaced_at` column is no longer NULL, and that its `generation` column is a very large integer.
:jbonacci does this make sense?
Comment 11•11 years ago
:rfkelly yes
Sounds like about 2 hours' work after a good load test result on TS Stage.
Dependencies:
1. Deployment of train-10 to FxA Stage
2. Ideally, a good load test result on FxA Stage
(The load tests will quickly tell us if we broke anything)
(and they can be run in parallel)
So:
1. Deployments to TS Stage and FxA Stage
2. Successful load tests on both
3. Run steps from https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10
I can work with :jrgm, :gene, and :mostlygeek on this as needed...
Status: NEW → ASSIGNED
Assignee
Updated•11 years ago
Assignee: nobody → bwong
Assignee
Updated•11 years ago
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 12•11 years ago
Verified the launch of a new TS stack for Stage: ts-s-2014-04-16
This has 3 m3.medium instances to more closely match Production.
Verified that no new changes to Verifier Stage or Sync Server Stage are needed.
Verified that the old stack is down.
Verified that the DNS is now pointing to the new stack.
Continuing with items from here: https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10
Starting with a two-hour TS load test while we wait for FxA Stage to come online...
Comment 13•11 years ago
Also, these are verified:
* Confirm with ops that stage fxa-auth-server is plugged into the SNS/SQS setup for account deletions
* Confirm with ops that the "process_account_deletions" script is running on tokenserver webheads and is writing stdout/stderr to a file.
Comment 14•11 years ago
Here are the results of the load test - from the Loads dashboard:
Status
Test was launched by jbonacci
Run Id 3ceaebd3-bf28-4f45-a409-e7e3dd55908c
Duration 2 h and 9 sec.
Started 2014-04-16 23:22:06 UTC
Ended 2014-04-17 00:25:24 UTC
State Ended
Configuration
Users [20]
Hits None
Agents 5
Duration 7200
Server URL https://token.stage.mozaws.net
Results
Tests over 1529874
Successes 1529861
Failures 0
Errors 0
TCP Hits 1560302
Opened web sockets 0
Total web sockets 0
Bytes/websockets 0
Requests / second (RPS) 216
Custom metrics
addFailure 13
We should be able to pull the server-side metrics off of Stackdriver.
A TS node showed the following breakdown of 200s and 401s:
200s: 525708
401s: 15842
Total: 541550
of which the 401s contribute about 2.9%, which is probably good enough
The other two nodes showed similar stats.
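For reference, the percentage quoted for that node can be reproduced from the counts above. This is just the arithmetic; the function and variable names are illustrative.

```python
def percent(part, total):
    """Return part as a percentage of total, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


# Counts reported for one TS node in this comment.
ok_200 = 525708
unauthorized_401 = 15842
total = ok_200 + unauthorized_401

share_401 = percent(unauthorized_401, total)

# Loads dashboard totals: tests that were not successes should match addFailure.
tests_over = 1529874
successes = 1529861
unaccounted = tests_over - successes
```

The difference between the tests run and the successes equals the 13 reported under addFailure, so the dashboard numbers are internally consistent.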
I call this a pass.
Tomorrow we move on to manual testing of TS and FxA Auth in Stage.
Comment 15•11 years ago
Further Stage testing is blocked by bug 997964
Comment 16•11 years ago
OK, bug 997964 has been resolved and verified.
I have Fx29b8 running on my Mac.
I have a new-ish profile with an account created.
I am pointing to Stage FxA, TS, Verifier, Sync.
So, tomorrow (Friday 4/18), we will finish out this ticket with some manual testing.
Test settings:
services.sync.log.appender.file.logOnError = Yes
services.sync.log.appender.file.logOnSuccess = Yes
services.sync.log.appender.file.level = Trace
identity.fxaccounts.remote.force_auth.uri = https://accounts.stage.mozaws.net/force_auth?service=sync&context=fx_desktop_v1
identity.fxaccounts.remote.signin.uri = https://accounts.stage.mozaws.net/signin?service=sync&context=fx_desktop_v1
identity.fxaccounts.remote.signup.uri = https://accounts.stage.mozaws.net/signup?service=sync&context=fx_desktop_v1
services.sync.tokenServerURI = https://token.stage.mozaws.net/1.0/sync/1.5
identity.fxaccounts.auth.uri = https://api-accounts.stage.mozaws.net/v1
identity.fxaccounts.settings.uri = https://accounts.stage.mozaws.net/settings
And also,
services.sync.clusterURL = https://sync-1-us-east-1.stage.mozaws.net/1.5/BLAH/
Comment 17•11 years ago
OK. :rfkelly and I covered everything here:
https://bugzilla.mozilla.org/show_bug.cgi?id=993537#c10
There are some pretty serious UX/UI and functional issues surrounding account deletion, but all are independent of this Stage deploy and test.
Status: RESOLVED → VERIFIED