603343 - (releng-nagios) [Tracking bug] cleanup nagios configs, so monitoring RelEng systems is less noisy

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Description

•

14 years ago

Between Friday night and first thing Monday morning, I got 1094 nagios messages. Its impossible to see if there are any *real* problems in the midst of all that noise. This bug is to track fixing configs to correctly handle how our RelEng systems behave routinely, so whenever we *do* get a nagios alert, it is something we notice! One example (bug#575472, already fixed) was configuring nagios to treat mobile phones as devices that behave differently to desktop machines, and have different time thresholds for reporting errors on reboot - phones run so much slower!

bhearsum@mozilla.com (:bhearsum)

Comment 1

•

14 years ago

For what it's worth, monitoring using the web page or IRC is much easier than mail, because of all the frequent status changes. For example, this query shows all things that are *currently* FAILing or WARNing: https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 This one shows things currently FAILing or WARNing that have not been acknowledged. Generally, this are things that currently require attention: https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346 A couple ideas on how to make things better - Sort things on the nagios display. Having sections for all of the platform/slave type combinations would help us find patterns quicker, I think. Having the less redundant things such as masters in their own section would make it easier to find critical problems with them.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Comment 2

•

14 years ago

(In reply to comment #1) > For what it's worth, monitoring using the web page or IRC is much easier than > mail, because of all the frequent status changes. For example, this query shows > all things that are *currently* FAILing or WARNing: > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 > > This one shows things currently FAILing or WARNing that have not been > acknowledged. Generally, this are things that currently require attention: > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346 Actually, that misses the point. Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the nagios master shows what is *currently* failing, but those errors still includes a bunch of noisy/flapping alerts. A human has to then weed through those looking which alerts are real, and which are noise. Having noisy/flapping nagios is the problem. The real fix is to debug the nagios thresholds and configs so that all (or at least the vast majority) of nagios alerts are actually real valid problems. This bug is to track identifying and fixing those noisy/flapping nagios configs. > A couple ideas on how to make things better > - Sort things on the nagios display. Having sections for all of the > platform/slave type combinations would help us find patterns quicker, I think. > Having the less redundant things such as masters in their own section would > make it easier to find critical problems with them. I guess that might be helpful, but it feels like a separate issue.

OS: Mac OS X → All

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=589006

bhearsum@mozilla.com (:bhearsum)

Comment 3

•

14 years ago

(In reply to comment #2) > (In reply to comment #1) > > For what it's worth, monitoring using the web page or IRC is much easier than > > mail, because of all the frequent status changes. For example, this query shows > > all things that are *currently* FAILing or WARNing: > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 > > > > This one shows things currently FAILing or WARNing that have not been > > acknowledged. Generally, this are things that currently require attention: > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346 > > Actually, that misses the point. I don't think it entirely misses the point -- it's still a ton easier than matching up errors and recoveries in e-mail. > Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the > nagios master shows what is *currently* failing, but those errors still > includes a bunch of noisy/flapping alerts. In my experience, there's very few flapping alerts. Which ones have you noticed to be flapping? > > A couple ideas on how to make things better > > - Sort things on the nagios display. Having sections for all of the > > platform/slave type combinations would help us find patterns quicker, I think. > > Having the less redundant things such as masters in their own section would > > make it easier to find critical problems with them. > > I guess that might be helpful, but it feels like a separate issue. OK, I'll file it separately, I didn't realize this bug was limited to your original idea.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Comment 4

•

14 years ago

(In reply to comment #3) > (In reply to comment #2) > > (In reply to comment #1) > > > For what it's worth, monitoring using the web page or IRC is much easier than > > > mail, because of all the frequent status changes. For example, this query shows > > > all things that are *currently* FAILing or WARNing: > > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15 > > > > > > This one shows things currently FAILing or WARNing that have not been > > > acknowledged. Generally, this are things that currently require attention: > > > https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?host=all&servicestatustypes=28&hoststatustypes=15&serviceprops=270346 > > > > Actually, that misses the point. > I don't think it entirely misses the point -- it's still a ton easier than > matching up errors and recoveries in e-mail. Organizing the errors is helpful, of course, but the root problem of eliminating the nagios noise is the topic here. > > Yes, you can skip some of the nagios noise in email/irc. Yes, looking at the > > nagios master shows what is *currently* failing, but those errors still > > includes a bunch of noisy/flapping alerts. > > In my experience, there's very few flapping alerts. Which ones have you noticed > to be flapping? Not complete list, but from quick glance in #build just now: 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: python.exe: stopped (critical) 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1 > > > A couple ideas on how to make things better > > > - Sort things on the nagios display. Having sections for all of the > > > platform/slave type combinations would help us find patterns quicker, I think. > > > Having the less redundant things such as masters in their own section would > > > make it easier to find critical problems with them. > > > > I guess that might be helpful, but it feels like a separate issue. > > OK, I'll file it separately, I didn't realize this bug was limited to your > original idea. I've linked your bug, thanks.

Depends on: 603684

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter

Comment 5

•

14 years ago

(In reply to comment #4) > (In reply to comment #3) > > (In reply to comment #2) > > > (In reply to comment #1) > > In my experience, there's very few flapping alerts. Which ones have you noticed > > to be flapping? > > Not complete list, but from quick glance in #build just now: > 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: > python.exe: stopped (critical) > 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1 jhford tells me this random example I picked happened to be when he was working on this slave. Here's another different random example of flapping from this morning's set of nagios alerts: Host: try-mac-slave40.build State: CRITICAL Date/Time: 10-13-2010 19:43:03 FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444 bytes Host: try-linux-slave04.build State: OK Date/Time: 10-13-2010 19:43:11 Additional Info: FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes

bhearsum@mozilla.com (:bhearsum)

Comment 6

•

14 years ago

(In reply to comment #5) > (In reply to comment #4) > > (In reply to comment #3) > > > (In reply to comment #2) > > > > (In reply to comment #1) > > > In my experience, there's very few flapping alerts. Which ones have you noticed > > > to be flapping? > > > > Not complete list, but from quick glance in #build just now: > > 12:01:22 < nagios> [67] mw32-ix-slave11.build:buildbot is CRITICAL: CRITICAL: > > python.exe: stopped (critical) > > 12:06:28 < nagios> mw32-ix-slave11.build:buildbot is OK: OK: python.exe: 1 > > jhford tells me this random example I picked happened to be when he was working > on this slave. > > > > Here's another different random example of flapping from this morning's set of > nagios alerts: > > Host: try-mac-slave40.build > State: CRITICAL > Date/Time: 10-13-2010 19:43:03 > FILE_AGE CRITICAL: /builds/slave/twistd.log is 432149 seconds old and 835444 > bytes > Host: try-linux-slave04.build > State: OK > Date/Time: 10-13-2010 19:43:11 > Additional Info: > FILE_AGE OK: /builds/slave/twistd.log is 205 seconds old and 346783 bytes (Sorry for distracting from the point of this bug again, but I want to clarify these). I suspect they are both legitimate. We have very low load on 10.5 mac build machines these days, since the universal build change. Linux VMs are also less loaded beacuse of all the ix build machines.

Armen [:armenzg]

Comment 7

•

14 years ago

> (Sorry for distracting from the point of this bug again, but I want to clarify > these). I suspect they are both legitimate. We have very low load on 10.5 mac > build machines these days, since the universal build change. Linux VMs are also > less loaded beacuse of all the ix build machines. Time to think what to do with these underused guys?

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Reporter