Closed Bug 914877 Opened 11 years ago Closed 7 years ago

Determine which nagios alerts should go to the sheriffs list

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: hwine, Unassigned)

This bug is to come up with the plan; then we can move it to server operations when it's actionable for them.
From Ed Morley [:edmorley UTC+1] in bug 914699 comment #0:
> Bug 914570 would have become apparent sooner (502 response from self-serve),
> if we had a nagios alert set up against self-serve, that emailed sheriffs at
> m dot org. (It's possible there are already nagios alerts for it, but if so,
> they'll only be alerting in #buildduty).
We can certainly have some/all of the nagios alerts also go to the sheriffs list. That's a simple addition to the configuration.
Let's figure out if there's a simple "rule" about which alerts sheriffs care about, or if it's better to go "à la carte" and just ask to be added on a per-instance basis.
FYI, we're in the midst of doing a review & revamp of alerts, so it's a great time.
Thank you for filing this! :-) I've added a few dependent bugs. As for what else to add: as long as the alerts aren't too noisy (#buildduty is currently a bit noisy, but it includes warnings for per-slave issues, which we wouldn't need), I'd be up for getting most of them sent to sheriffs@m dot org.
Depends on: 912463, 914699
Summary: Determine which nagios alerts should go to sheriffs → Determine which nagios alerts should go to the sheriffs list
Sorry meant to ask: is there a list somewhere of the releng Nagios alerts that we can skim through?
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Sorry meant to ask: is there a list somewhere of the releng Nagios alerts
> that we can skim through?
Not at the moment -- that list will be built as part of the effort on bug 885560. Currently, all the notices are independent -- we want to have them set up hierarchically, such that we don't get per slave alerts when a whole data center is unreachable (as one example). My assumption is that sheriffs will be more interested in that level of alert.
Note that there are (at least) 2 levels of nagios access -- getting alerts sent to you/list/channel, and being able to browse the web interface. Do you have any interest in being able to browse the web ui?
Since it's the first of the new alert systems, how about we set up the sheriff's list to receive the "data center connectivity lost" emails. That will shake out the issues of white listing all the nagios hosts in the list, etc.
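The hierarchical setup described in this comment (no per-slave alerts when a whole data center is unreachable) can be illustrated with a small sketch. This is not Nagios configuration and not the actual releng design; the `PARENTS` topology, the host names, and both function names are hypothetical, and the parent map is assumed to contain no cycles.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each node names its parent (None for a top-level node).
PARENTS: Dict[str, Optional[str]] = {
    "datacenter-a": None,
    "slave-001": "datacenter-a",
    "slave-002": "datacenter-a",
    "datacenter-b": None,
    "slave-101": "datacenter-b",
}


def has_down_ancestor(node: str, down: Set[str],
                      parents: Dict[str, Optional[str]]) -> bool:
    """Return True if any ancestor of `node` is itself reported down."""
    parent = parents.get(node)
    while parent is not None:
        if parent in down:
            return True
        parent = parents.get(parent)
    return False


def alerts_to_send(down: Set[str],
                   parents: Dict[str, Optional[str]]) -> List[str]:
    """Keep only the down nodes whose ancestors are all still up."""
    return sorted(n for n in down if not has_down_ancestor(n, down, parents))
```

With this sketch, a data center outage that also takes down its slaves yields a single alert for the data center, while an isolated slave failure still alerts on its own.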
Adding this as a reminder here. We should add something to cover:
00:56:33 INFO - ##### Running clobber step.
00:56:33 INFO - #####
00:56:33 INFO - Running main action method: clobber
00:56:33 INFO - rmtree: /builds/slave/b2g_m-cen_hamachi_dep-00000000/build/upload
00:56:33 INFO - rmtree: /builds/slave/b2g_m-cen_hamachi_dep-00000000/build/testdata
00:56:33 INFO - Running command: ['/builds/slave/b2g_m-cen_hamachi_dep-00000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-t', '168', 'http://clobberer.pvt.build.mozilla.org/index.php', u'mozilla-central', u'b2g_mozilla-central_hamachi_dep', u'b2g_m-cen_hamachi_dep-00000000', u'bld-centos6-hp-009', u'http://buildbot-master57.srv.releng.use1.mozilla.com:8001/'] in /builds/slave
00:56:33 INFO - Copy/paste: /builds/slave/b2g_m-cen_hamachi_dep-00000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php mozilla-central b2g_mozilla-central_hamachi_dep b2g_m-cen_hamachi_dep-00000000 bld-centos6-hp-009 http://buildbot-master57.srv.releng.use1.mozilla.com:8001/
00:57:03 INFO - Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master57.srv.releng.use1.mozilla.com%3A8001%2F&slave=bld-centos6-hp-009&builddir=b2g_m-cen_hamachi_dep-00000000&branch=mozilla-central&buildername=b2g_mozilla-central_hamachi_dep
00:57:03 ERROR - Error contacting server
00:57:03 ERROR - Error contacting server for clobberer information.
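One possible way to cover this failure is to scan job logs for the error lines shown in the excerpt. The following is a minimal sketch, assuming log lines keep the `HH:MM:SS LEVEL - message` shape above; `find_clobberer_errors` and its regular expression are hypothetical and are not part of clobberer.py or of any existing Nagios check.

```python
import re
from typing import Iterable, List

# Matches lines shaped like "00:57:03 ERROR - Error contacting server ...".
# The pattern is an assumption based only on the log excerpt in this bug.
CLOBBERER_ERROR = re.compile(
    r"^\d{2}:\d{2}:\d{2}\s+ERROR\s+-\s+Error contacting server"
)


def find_clobberer_errors(lines: Iterable[str]) -> List[str]:
    """Return the log lines that report a failure to reach the clobberer."""
    return [line.rstrip("\n") for line in lines if CLOBBERER_ERROR.match(line)]


SAMPLE_LOG = [
    "00:56:33 INFO - Running main action method: clobber",
    "00:57:03 ERROR - Error contacting server",
    "00:57:03 ERROR - Error contacting server for clobberer information.",
]

matches = find_clobberer_errors(SAMPLE_LOG)
```

A check built on this idea could alert when `matches` is non-empty for recent jobs; how many matches should trigger an alert is left open here.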
(In reply to Hal Wine [:hwine] (use needinfo) from comment #3)
> Not at the moment -- that list will be built as part of the effort on bug
> 885560. Currently, all the notices are independent -- we want to have them
> set up hierarchically, such that we don't get per slave alerts when a whole
> data center is unreachable (as one example). My assumption is that sheriffs
> will be more interested in that level of alert.
Yeah that sounds preferable :-)
> Note that there are (at least) 2 levels of nagios access -- getting alerts
> sent to you/list/channel, and being able to browse the web interface. Do you
> have any interest in being able to browse the web ui?
If it's not too much trouble...?
> Since it's the first of the new alert systems, how about we set up the
> sheriff's list to receive the "data center connectivity lost" emails. That
> will shake out the issues of white listing all the nagios hosts in the list,
> etc.
Sounds good to me! Thank you :-)
We need a Nagios alert for the failure case in bug 917251 too (no idea if there is an IRC Nagios alert for releng, or if there was, whether anyone watches them :-()
Depends on: 917279
+1 for comment 4 (clobberer), has happened again.
(Note: comment 7 is just me keeping track; it is not intended as a ping for anyone.)
Depends on: 921826
(In reply to Ed Morley [:edmorley UTC+0] from comment #9)
> Another one to add to the todo list:
>
> HTTP 500s on
> https://git.mozilla.org/external/google/gerrit/git-repo.git/clone.bundle
> bug 920096
Actually, this should be on 500's from either git or hg .m.o, not any specific repository. I'll note that the root cause of this particular case was exceptional, and the associated processes causing the issue have since been discontinued. If this is not an easy win, we may want to re-evaluate the cost/benefit.
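An alert on 5xx responses, as discussed here and for the 502 from self-serve in comment 0, could be sketched as a Nagios-style check. Nagios plugins report state through their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Everything else below is an assumption for illustration: the function names are hypothetical, and treating 4xx as WARNING is a choice made for this sketch, not something the bug specifies.

```python
import urllib.error
import urllib.request
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def nagios_status_for_http(code: int) -> int:
    """Map an HTTP status code to a Nagios state (the mapping is an assumption)."""
    if 500 <= code <= 599:
        return CRITICAL
    if 400 <= code <= 499:
        return WARNING
    if 200 <= code <= 399:
        return OK
    return UNKNOWN


def check_url(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch `url` and return (nagios_state, one-line status message)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            code = resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the status code is on the error.
        code = err.code
    except (urllib.error.URLError, OSError) as err:
        return CRITICAL, f"CRITICAL - {url} unreachable: {err}"
    state = nagios_status_for_http(code)
    return state, f"{LABELS[state]} - {url} returned HTTP {code}"
```

A wrapper script would call `check_url` with the service's front page rather than a specific repository, print the message, and exit with the returned state.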
QA Contact: john+bugzilla
Depends on: 1017551
Component: Other → Tools
QA Contact: hwine
Component: Tools → General
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE