Closed
Bug 914877
Opened 11 years ago
Closed 7 years ago
Determine which nagios alerts should go to the sheriffs list
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: hwine, Unassigned)
This bug is to come up with the plan, then we can move it to server operations when it's actionable for them.
From Ed Morley [:edmorley UTC+1] in bug 914699 comment #0:
> Bug 914570 would have become apparent sooner (502 response from self-serve),
> if we had a nagios alert set up against self-serve, that emailed sheriffs at
> m dot org. (It's possible there are already nagios alerts for it, but if so,
> they'll only be alerting in #buildduty).
We can certainly have some/all of the nagios alerts also go to the sheriffs list. That's a simple addition to the configuration.
Let's figure out whether there's a simple "rule" about which alerts sheriffs care about, or whether it's better to go "à la carte" and just ask to be added on a per-instance basis.
FYI, we're in the midst of doing a review & revamp of alerts, so it's a great time.
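The two options above (a simple "rule" versus per-instance opt-in) can be sketched together in a few lines. This is only an illustration under stated assumptions: the `Alert` fields, the recipient names, and the `OPT_IN_SERVICES` set are hypothetical and do not come from any real Nagios configuration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical recipient names, used only for illustration.
SHERIFFS = "sheriffs-list"
BUILDDUTY = "#buildduty"


@dataclass(frozen=True)
class Alert:
    service: str  # e.g. "self-serve"
    scope: str    # "service", "datacenter", or "slave" (assumed categories)
    state: str    # "WARNING" or "CRITICAL"


# Per-instance additions: services the sheriffs asked for explicitly (assumed).
OPT_IN_SERVICES = {"self-serve"}


def recipients(alert: Alert) -> List[str]:
    """Return who should be notified about an alert."""
    targets = [BUILDDUTY]  # everything keeps going to the existing channel
    # Assumed rule: sheriffs get critical alerts that are not per-slave noise.
    rule_match = alert.scope != "slave" and alert.state == "CRITICAL"
    if rule_match or alert.service in OPT_IN_SERVICES:
        targets.append(SHERIFFS)
    return targets
```

With this shape, a rule covers the common case and the opt-in set handles exceptions, so both approaches can coexist while the list of alerts is still being reviewed.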
Comment 1•11 years ago
Thank you for filing this! :-)
I've added a few dependent bugs; as for what else to add, as long as the alerts aren't too noisy (#buildduty is currently a bit noisy, but it includes warnings for per-slave issues, which we wouldn't need), then I'd be up for getting most of them sent to sheriffs@m dot org.
Updated•11 years ago
Summary: Determine which nagios alerts should go to sheriffs → Determine which nagios alerts should go to the sheriffs list
Comment 2•11 years ago
Sorry meant to ask: is there a list somewhere of the releng Nagios alerts that we can skim through?
Comment 3•11 years ago (Reporter)
(In reply to Ed Morley [:edmorley UTC+1] from comment #2)
> Sorry meant to ask: is there a list somewhere of the releng Nagios alerts
> that we can skim through?
Not at the moment -- that list will be built as part of the effort on bug 885560. Currently, all the notices are independent -- we want to have them set up hierarchically, such that we don't get per slave alerts when a whole data center is unreachable (as one example). My assumption is that sheriffs will be more interested in that level of alert.
Note that there are (at least) 2 levels of nagios access -- getting alerts sent to you/list/channel, and being able to browse the web interface. Do you have any interest in being able to browse the web ui?
Since it's the first of the new alert systems, how about we set up the sheriffs list to receive the "data center connectivity lost" emails? That will shake out issues such as whitelisting all the nagios hosts on the list.
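The hierarchical setup described in this comment (no per-slave alerts when a whole data center is unreachable) can be illustrated with a small sketch. Nagios host definitions have a `parents` directive that serves a similar purpose; the code below only shows the idea, and the topology in `PARENTS` is made up for the example.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each host maps to its parent (None at the top level).
PARENTS: Dict[str, Optional[str]] = {
    "datacenter-a": None,
    "slave-001": "datacenter-a",
    "slave-002": "datacenter-a",
    "datacenter-b": None,
    "slave-101": "datacenter-b",
}


def has_down_ancestor(host: str, down: Set[str]) -> bool:
    """Return True if any ancestor of `host` is itself reported down."""
    parent = PARENTS.get(host)
    while parent is not None:
        if parent in down:
            return True
        parent = PARENTS.get(parent)
    return False


def alerts_to_send(down: Set[str]) -> List[str]:
    """Keep only the highest-level failures; children of a down parent are suppressed."""
    return sorted(host for host in down if not has_down_ancestor(host, down))
```

If `datacenter-a` and both of its slaves are reported down together, only the `datacenter-a` alert remains, which is the level of alert assumed to interest sheriffs.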
Comment 4•11 years ago
Adding this as a reminder here.
We should add something to cover:
00:56:33 INFO - ##### Running clobber step.
00:56:33 INFO - #####
00:56:33 INFO - Running main action method: clobber
00:56:33 INFO - rmtree: /builds/slave/b2g_m-cen_hamachi_dep-00000000/build/upload
00:56:33 INFO - rmtree: /builds/slave/b2g_m-cen_hamachi_dep-00000000/build/testdata
00:56:33 INFO - Running command: ['/builds/slave/b2g_m-cen_hamachi_dep-00000000/scripts/external_tools/clobberer.py', '-s', 'scripts', '-s', 'logs', '-s', 'buildprops.json', '-s', 'token', '-t', '168', 'http://clobberer.pvt.build.mozilla.org/index.php', u'mozilla-central', u'b2g_mozilla-central_hamachi_dep', u'b2g_m-cen_hamachi_dep-00000000', u'bld-centos6-hp-009', u'http://buildbot-master57.srv.releng.use1.mozilla.com:8001/'] in /builds/slave
00:56:33 INFO - Copy/paste: /builds/slave/b2g_m-cen_hamachi_dep-00000000/scripts/external_tools/clobberer.py -s scripts -s logs -s buildprops.json -s token -t 168 http://clobberer.pvt.build.mozilla.org/index.php mozilla-central b2g_mozilla-central_hamachi_dep b2g_m-cen_hamachi_dep-00000000 bld-centos6-hp-009 http://buildbot-master57.srv.releng.use1.mozilla.com:8001/
00:57:03 INFO - Checking clobber URL: http://clobberer.pvt.build.mozilla.org/index.php?master=http%3A%2F%2Fbuildbot-master57.srv.releng.use1.mozilla.com%3A8001%2F&slave=bld-centos6-hp-009&builddir=b2g_m-cen_hamachi_dep-00000000&branch=mozilla-central&buildername=b2g_mozilla-central_hamachi_dep
00:57:03 ERROR - Error contacting server
00:57:03 ERROR - Error contacting server for clobberer information.
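A check for this failure could follow the standard Nagios plugin convention: print one status line and exit with 0 (OK), 1 (WARNING), 2 (CRITICAL), or 3 (UNKNOWN). The sketch below is an assumption about how such a check might look, not an existing releng plugin; the function names, the 10-second timeout, and the choice to treat 4xx responses as WARNING are all illustrative.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status):
    """Map an HTTP status (None if the server was unreachable) to (exit code, message)."""
    if status is None:
        return CRITICAL, "CRITICAL - error contacting server"
    if status >= 500:
        return CRITICAL, "CRITICAL - HTTP %d" % status
    if status >= 400:
        return WARNING, "WARNING - HTTP %d" % status
    return OK, "OK - HTTP %d" % status


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError but still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def main(argv):
    """Run the check for the URL in argv[1] and return the plugin exit code."""
    if len(argv) != 2:
        print("UNKNOWN - expected exactly one URL argument")
        return UNKNOWN
    code, message = classify(fetch_status(argv[1]))
    print(message)
    return code
```

A wrapper script would end with `sys.exit(main(sys.argv))` so that Nagios sees the exit code. Pointing it at the clobberer URL from the log above would turn "Error contacting server" into a CRITICAL result.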
Comment 5•11 years ago
(In reply to Hal Wine [:hwine] (use needinfo) from comment #3)
> Not at the moment -- that list will be built as part of the effort on bug
> 885560. Currently, all the notices are independent -- we want to have them
> set up hierarchically, such that we don't get per slave alerts when a whole
> data center is unreachable (as one example). My assumption is that sheriffs
> will be more interested in that level of alert.
Yeah that sounds preferable :-)
> Note that there are (at least) 2 levels of nagios access -- getting alerts
> sent to you/list/channel, and being able to browse the web interface. Do you
> have any interest in being able to browse the web ui?
If it's not too much trouble...?
>
> Since it's the first of the new alert systems, how about we set up the
> sheriff's list to receive the "data center connectivity lost" emails. That
> will shake out the issues of white listing all the nagios hosts in the list,
> etc.
Sounds good to me!
Thank you :-)
Comment 6•11 years ago
We need a Nagios alert for the failure case in bug 917251 too. (No idea whether there is an IRC Nagios alert for releng or, if there is, whether anyone watches it :-( )
Comment 9•11 years ago
Another one to add to the todo list:
HTTP 500s on https://git.mozilla.org/external/google/gerrit/git-repo.git/clone.bundle
bug 920096
Comment 10•11 years ago (Reporter)
(In reply to Ed Morley [:edmorley UTC+0] from comment #9)
> Another one to add to the todo list:
>
> HTTP 500s on
> https://git.mozilla.org/external/google/gerrit/git-repo.git/clone.bundle
> bug 920096
Actually, this should alert on 500s from either git.m.o or hg.m.o as a whole, not from any specific repository.
I'll note that the root cause of this particular case was exceptional, and the associated processes causing the issue have since been discontinued. If this is not an easy win, we may want to re-evaluate the cost/benefit.
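Alerting per host rather than per repository could be sketched as below. This is a hypothetical illustration: it assumes observations arrive as `(url, status)` pairs, expands "git or hg .m.o" to `git.mozilla.org` and `hg.mozilla.org`, and adds a `threshold` parameter as one possible way to trade noise against coverage.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed expansion of "git or hg .m.o".
WATCHED_HOSTS = {"git.mozilla.org", "hg.mozilla.org"}


def hosts_with_server_errors(observations, threshold=1):
    """Count 5xx responses per watched host, ignoring which repository was hit.

    `observations` is an iterable of (url, http_status) pairs. Hosts with at
    least `threshold` server errors are returned with their error counts.
    """
    counts = Counter()
    for url, status in observations:
        host = urlsplit(url).hostname
        if host in WATCHED_HOSTS and 500 <= status <= 599:
            counts[host] += 1
    return {host: n for host, n in counts.items() if n >= threshold}
```

Raising `threshold` would suppress one-off errors, which may matter if the cost/benefit of this alert is in question.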
Updated•11 years ago (Reporter)
QA Contact: john+bugzilla
Updated•10 years ago
Blocks: byebyebuildduty
Updated•10 years ago
Component: Other → Tools
QA Contact: hwine
Updated•7 years ago (Assignee)
Component: Tools → General
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE