Closed Bug 1389544 Opened 7 years ago Closed 7 years ago

[Postmortem][releaseduty] 20% of all Firefox BETA populations upgraded to 56.0b2 for 30mins

Component: Release Engineering :: Release Automation: Other
Type: enhancement
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WORKSFORME
Reporter: mtabara
Assignee: Unassigned
Whiteboard: [releaseduty]

I'll follow up in a bit with a description of what happened. It may be worth having a post-mortem on this.
Summary: [Postmortem][releaseduty] 20% of all populations upgraded to 56.0b2 for 30mins → [Postmortem][releaseduty] 20% of all Firefox BETA populations upgraded to 56.0b2 for 30mins
Sorry for the delay in following up with more information here. To describe what happened on Friday night:

1. What were we supposed to do?
We were supposed to ship 56.0b2 to the existing 56.0b1 population, keeping everyone else on 55.0-build3.

2. What actually happened?
For ~30 min, 20% of the whole eligible beta population had background updates turned on to upgrade to 56.0b2.

3. Why did this happen?
A misunderstanding on my side, plus delays in fixing the problem due to the signoffs required in Balrog.

=== tl;dr steps
* Beta QE signoff arrives and RelMan asks us to push to the beta channel, mentioning "< 56 users should stay on 55.0 rc".
* I let automation schedule the rules change, amend it to what I think RelMan meant, and shorten the wait so that we ship at time X.
* QE + RelEng sign the rules off, pending RelMan signing off as well.
* RelMan asks for clarification, as the scheduled rules in Balrog don't reflect what they asked for.
* Back and forth asking for clarifications in #releaseduty between RelEng and RelMan.
* The fog clears and I schedule another rule to reflect RelMan's intent at time Y (Y = X + 1.5h).
* Before I manage to amend the automation rule as well, RelMan signs it off and it goes into effect.
* At this point, the default rule is in effect, with 20% of the whole eligible beta population having background updates turned on to upgrade to 56.0b2. The correction rule is scheduled to take effect only in ~1h and still lacks RelMan + QE signoff.
* I amend the timing and the initial rule to go into effect as soon as possible, but this still needs RelMan/QE signoff.
* Another RelMan shows up; we still have no QE to sign off. :bhearsum suggests we temporarily grant the QE role to somebody else to resolve the emergency.
* ~20 mins of going back and forth to explain the current situation and to determine whether or not it is a good idea to use the temp-grant-QE-role hack.
* RelMan and RelEng are on the same page; another RelEnger is temporarily granted the QE role and signs off.
* Changes go live; we amend the existing rule and schedule the auxiliary one to reflect RelMan's initial intent.
* We're all good.
* The second RelEnger's temporary QE role is removed.
* File postmortem bug.

More details, actors involved, timings and initial conclusions will be addressed in the postmortem document. I'll try to find a good time slot that fits both the RelEng and RelMan calendars.
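The signoff deadlock in the steps above can be illustrated with a small model. This is only a sketch under stated assumptions, not Balrog's actual code or API: the `ScheduledChange` class, the role names, and the placeholder time `x` are all hypothetical, chosen to mirror the narrative (a scheduled change takes effect only once RelEng, RelMan and QE have all signed off and its scheduled time has passed).

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical role names taken from the narrative; not Balrog's real identifiers.
REQUIRED_ROLES = frozenset({"releng", "relman", "qe"})


@dataclass
class ScheduledChange:
    """Toy model of a scheduled rule change that needs signoffs (assumption)."""
    description: str
    when: datetime
    signoffs: set = field(default_factory=set)

    def sign_off(self, role: str) -> None:
        if role not in REQUIRED_ROLES:
            raise ValueError(f"unknown role: {role}")
        self.signoffs.add(role)

    def missing_signoffs(self) -> frozenset:
        return REQUIRED_ROLES - self.signoffs

    def is_live(self, now: datetime) -> bool:
        # Live only when every required role has signed AND the time has come.
        return not self.missing_signoffs() and now >= self.when


# Arbitrary placeholder for "time X"; the real timestamps are not in this bug.
x = datetime(2000, 1, 1, 0, 0)

# The automation-generated rule: fully signed off, so it goes live at X.
default_rule = ScheduledChange("56.0b2 to all beta at 20% background rate", when=x)
for role in ("qe", "releng", "relman"):
    default_rule.sign_off(role)

# The correction scheduled for Y = X + 1.5h: only RelEng has signed so far.
correction = ScheduledChange("56.0b2 only for 56.0b1 users",
                             when=x + timedelta(hours=1.5))
correction.sign_off("releng")

print(default_rule.is_live(x))                      # the unintended rule is in effect
print(correction.is_live(x + timedelta(hours=2)))   # the fix is still blocked
print(sorted(correction.missing_signoffs()))        # roles still needed
```

In this model the correction stays blocked no matter how early it is rescheduled, which is why the temporary QE role grant was needed to unblock it.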
Assignee: nobody → mtabara
Status: NEW → ASSIGNED
Depends on: 1392871
Post-mortem done. There's no need to keep this bug assigned to me; we're still tracking the action items here.
Assignee: mtabara → nobody
Status: ASSIGNED → NEW
Whiteboard: [releaseduty]
I think we can close this umbrella bug. There are three action items (filed as dependent bugs) on RelMan's plate and one on ours, but I guess they can all be handled separately at this point.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME