Closed
Bug 1458188
Opened 7 years ago
Closed 3 years ago
Monitoring/alerting for bouncer aliases
Categories
(Release Engineering :: Release Automation: Bouncer, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: sfraser, Assigned: sfraser)
The bouncer aliases are updated with each release, and it would be good to have an extra check on the end of a release graph to ensure the URL points to expected values. A periodic check would also be very useful.
We have some choices about configuration for these checks, and where to alert.
Alerting could go to:
* Email
* IRC
* Slack
Configuration could be:
a. No configuration, just notification if the value changes.
b. Configuration from Balrog rules for 'latest version' to ensure the values match
c. Manually configured data source of expected values.
'a' is probably least effort, but requires a human to decide if the change is expected, and requiring context like that for operational health alerts is an anti-pattern.
'b' would work, assuming we can formulate Balrog queries that are equivalent to the bouncer ones. The new rules would also not have been signed off, and so would not be live, when the in-graph test runs.
'c' has a layer of extra work attached, but is likely the most reliable indicator of an unexpected change.
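To make the difference between 'a' and 'c' concrete, here is a minimal sketch. The function names and the alias-to-target mappings are hypothetical, chosen only for illustration; nothing here is an existing bouncer or Balrog API.

```python
from typing import Dict, List


def changed_aliases(previous: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Option 'a': report every alias whose target differs from the last observation."""
    return sorted(alias for alias in current if previous.get(alias) != current[alias])


def unexpected_aliases(expected: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Option 'c': report only aliases whose target differs from the configured value."""
    return sorted(alias for alias in expected if current.get(alias) != expected[alias])


# Hypothetical values: a release has just moved the alias forward, as intended.
previous = {"firefox-latest": "Firefox-59.0.2"}
current = {"firefox-latest": "Firefox-60.0"}
expected = {"firefox-latest": "Firefox-60.0"}

print(changed_aliases(previous, current))     # 'a' alerts; a human must judge it
print(unexpected_aliases(expected, current))  # 'c' stays quiet; the change was expected
```

With 'a' every legitimate release produces an alert, whereas 'c' only alerts when the observed target disagrees with what was configured.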
Since we'd like to run this as part of the release graphs as well as a periodic check, I'm not convinced nagios is the best fit for this. We could put the same code in multiple places and have nagios as well as the in-tree variant, but that increases code support complexity.
Given the multiple locations, I think I will:
1. Use python, and pytest, to manage the tests
2. Separate out the expected values from the test logic, so that they can be provided by nagios/balrog checker/something else
3. For a first deployment, get this into a periodic test, somewhere, as its actual location is less important than it running.
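A rough sketch of what points 1 and 2 could look like, with the expected values passed in rather than hard-coded in the test logic. The endpoint `https://download.mozilla.org/`, the alias names, the version fragments, and every function name below are assumptions made for illustration; none of them come from this bug.

```python
from typing import Callable, Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Bouncer endpoint; it is not named anywhere in this bug.
BOUNCER_URL = "https://download.mozilla.org/"


def alias_url(alias: str, os_name: str = "win64", lang: str = "en-US") -> str:
    """Build the Bouncer URL for one alias/platform/locale combination."""
    query = urlencode({"product": alias, "os": os_name, "lang": lang})
    return f"{BOUNCER_URL}?{query}"


def resolve(url: str) -> str:
    """Follow redirects and return the final URL without reading the body."""
    with urlopen(url, timeout=30) as response:
        return response.geturl()


def find_mismatches(
    expected: Dict[str, str], resolver: Callable[[str], str] = resolve
) -> List[Tuple[str, str]]:
    """Return (alias, final_url) pairs whose final URL lacks the expected fragment."""
    mismatches = []
    for alias, fragment in expected.items():
        final_url = resolver(alias_url(alias))
        if fragment not in final_url:
            mismatches.append((alias, final_url))
    return mismatches


def test_find_mismatches_flags_unexpected_target():
    # Hypothetical expectations and a fake resolver, so the logic is testable offline.
    expected = {"firefox-latest": "/60.0/", "firefox-beta-latest": "/61.0b3/"}
    fake_target = "https://example.invalid/releases/60.0/win64/en-US/setup.exe"

    def fake_resolver(url: str) -> str:
        return fake_target

    assert find_mismatches(expected, fake_resolver) == [
        ("firefox-beta-latest", fake_target)
    ]
```

Because `expected` is just a mapping, the same `find_mismatches` call could be fed from a nagios config, a Balrog-derived source, or a hand-maintained file, and the resolver can be swapped out so the in-graph and periodic variants share one code path.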
Assignee
Updated•7 years ago
Assignee: nobody → sfraser
Comment 1•7 years ago
Thanks for filing this bug, this is great information gathered.
Just FYI, there's a larger effort, not only for bouncer aliases but for all tasks that are leaves in the release graph, to track and test them before and after to prevent issues.
Tracking bug is 1445946. I'll chain it here for reference as it might be useful later on.
Comment 2•7 years ago
mbrandt has some periodic tests checking bouncer aliases and if they align with product-details data. Should we use them instead or maybe adopt them somehow?
Assignee
Comment 3•7 years ago
(In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> mbrandt has some periodic tests checking bouncer aliases and if they align
> with product-details data. Should we use them instead or maybe adopt them
> somehow?
Seems like a good idea, we avoid reinventing too much. We could add another notification path to them, perhaps.
Comment 4•7 years ago
(In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> mbrandt has some periodic tests checking bouncer aliases and if they align
> with product-details data. Should we use them instead or maybe adopt them
> somehow?
IIRC after talking to catlee about this, he was totally fine duping some of that work/logic. We definitely want to run those more often (or before/after we do changes). For periodic, I think we can rely on mbrandt's stuff, but in the RelEng harness we should be doing these checks before/after for sure.
The concern was that the last time we got hit by this, when beta aliases updated release aliases, even the tests that run periodically found the issue 1.5h after the fact. That was good, but it could have been better if we had chained a task after the bouncer aliases to sanity-check that. And others as well; this is not the only task that needs coverage.
++ to more notification for this, great idea.
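As an illustration of the kind of check a chained task could run right after the alias update, here is a minimal sketch that flags release aliases pointing at a beta version. The alias names, the version strings, and the assumption that beta versions look like `60.0b12` are illustrative only and not taken from this bug.

```python
import re
from typing import Dict, Iterable, List

# Assumption: beta versions look like "60.0b12"; release versions have no "b" part.
BETA_VERSION = re.compile(r"\d+\.\d+b\d+")


def release_aliases_on_beta(
    alias_versions: Dict[str, str], release_aliases: Iterable[str]
) -> List[str]:
    """Return the release aliases whose current version looks like a beta build."""
    return sorted(
        alias
        for alias in release_aliases
        if BETA_VERSION.fullmatch(alias_versions.get(alias, ""))
    )


# Hypothetical state after a beta release wrongly updated the release alias.
observed = {"firefox-latest": "60.0b12", "firefox-beta-latest": "60.0b12"}
print(release_aliases_on_beta(observed, ["firefox-latest"]))
```

Run as a leaf task chained after the alias update, a non-empty result would fail the graph immediately rather than waiting for the next periodic run.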
Comment 5•7 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)
> (In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> > mbrandt has some periodic tests checking bouncer aliases and if they align
> > with product-details data. Should we use them instead or maybe adopt them
> > somehow?
>
> Seems like a good idea, we avoid reinventing too much. We could add another
> notification path to them, perhaps.
I am happy to assist in any way that I can. Our bouncer tests currently run on a 15 cronjob.
+1 to adding more/better notification paths. It would also be interesting to chain the tests to run as a step vs cron. We're using Jenkins, so this should in theory be configurable.
Updated•7 years ago
Component: General Automation → General
Assignee
Comment 6•7 years ago
(In reply to Matt Brandt [:mbrandt] from comment #5)
> (In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #3)
> > (In reply to Rail Aliiev [:rail] ⌚️ET from comment #2)
> > > mbrandt has some periodic tests checking bouncer aliases and if they align
> > > with product-details data. Should we use them instead or maybe adopt them
> > > somehow?
> >
> > Seems like a good idea, we avoid reinventing too much. We could add another
> > notification path to them, perhaps.
>
> I am happy to assist in any way that I can. Our bouncer tests currently run
> on a 15 cronjob.
> +1 to adding more/better notification paths. It would also be interesting to
> chain the tests to run as a step vs cron. We're using Jenkins, so this
> should in theory be configurable.
How difficult would it be to add notification paths? Or to run the test on demand, as part of release promotion?
Flags: needinfo?(mbrandt)
Comment 7•6 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #6)
> How difficult would it be to add notification paths? Or to run the test on
> demand, as part of release promotion?
Adding a notification path should be straightforward. Were you thinking of email, IRC, etc.?
On demand would be a bit more work, out of my area of experience, but in theory also possible.
Flags: needinfo?(mbrandt) → needinfo?(sfraser)
Assignee
Comment 8•6 years ago
Which notification path do you use at the moment? Let's get something running on the same method, but to RelEng, too, at least as a first pass.
If you point me at the code I can have a look at running it on demand for our purposes.
Flags: needinfo?(sfraser) → needinfo?(mbrandt)
Comment 9•6 years ago
We're currently using several paths: IRC, email, and Treeherder.
https://github.com/mozilla-services/go-bouncer/blob/master/tests/e2e/Jenkinsfile#L48-L81
Flags: needinfo?(mbrandt)
Updated•6 years ago
Flags: needinfo?(sfraser)
Assignee
Comment 10•6 years ago
Apologies, course & travel, this got away from me. I'm not sure from reading the Jenkinsfile what actually does the work, there. Getting it to run in-tree would likely mean rewriting it.
Could irc#releaseduty be added to notifications?
Flags: needinfo?(sfraser)
Comment 11•6 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #10)
> Apologies, course & travel, this got away from me. I'm not sure from reading
> the Jenkinsfile what actually does the work, there. Getting it to run
> in-tree would likely mean rewriting it.
No worries, I've been offline and on pto for a bit myself. Maybe we can explore this for next quarter.
> Could irc#releaseduty be added to notifications?
This looks fairly straightforward; I've opened this PR: https://github.com/mozilla-services/go-bouncer/pull/243. How does that sound, :sfraser?
Flags: needinfo?(sfraser)
Comment 12•6 years ago
A caveat that I forgot to mention: if this were to get merged, failed builds would be reported to the channel in a format that includes a URL to the Jenkins build. To view the failure you'd need to configure a proxy to the bastion: https://mana.mozilla.org/wiki/display/TestEngineering/qa-master.fxtest.jenkins.stage.mozaws.net.
Assignee
Comment 13•6 years ago
(In reply to Matt Brandt [:mbrandt] from comment #11)
> > Could irc#releaseduty be added to notifications?
> This looks fairly straightforward, I've opened this pr
> https://github.com/mozilla-services/go-bouncer/pull/243. How does that sound,
> :sfraser?
Works for me. Thank you!
Flags: needinfo?(sfraser)
Updated•6 years ago
Component: General → Release Automation: Bouncer
QA Contact: catlee
Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)
Comment 15•3 years ago
I think we have a task that checks bouncer aliases, no? Resolved?
QA Contact: mtabara
Comment 16•3 years ago
Bug 1469803 took care of this maybe?
Comment 17•3 years ago
I think this is done; we run this frequently for each release tree, afaik.
Updated•3 years ago
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED