Closed
Bug 625867
Opened 14 years ago
Closed 14 years ago
investigate why a few slaves are causing nagios PING CRITICAL alerts
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: bear, Assigned: zandr)
References
Details
it's not often, but they come in clumps:
[09] talos-r3-fed64-033.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[11] talos-r3-fed64-017.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[15] talos-r3-fed-008.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[16] linux-ix-slave20.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
If I react slowly and look, the slaves are fine; only once have I caught one while it was truly offline.
I also never see the alert saying that they are OK again.
Updated•14 years ago
Assignee: server-ops-releng → zandr
Blocks: releng-nagios
Comment 2•14 years ago
Maybe they took slightly longer than usual to reboot after completing a job and tripped the nagios threshold on ping? (We had a similar problem before with the n810s, hence the idea.)
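A rough way to sanity-check this theory, sketched in Python. It relies on the general Nagios behaviour that a failing service is re-checked at the retry interval and only alerts once `max_check_attempts` consecutive checks have failed. The numbers in the usage note are hypothetical; the real thresholds for these slaves are not stated in this bug.

```python
def seconds_until_hard_critical(max_check_attempts: int, retry_interval_s: int) -> int:
    """Approximate time from the first failed check to the alerting (HARD) state.

    The first failure counts as attempt 1, so the remaining
    (max_check_attempts - 1) re-checks are spaced retry_interval_s apart.
    """
    return (max_check_attempts - 1) * retry_interval_s


def reboot_could_trip_alert(reboot_s: int, max_check_attempts: int,
                            retry_interval_s: int) -> bool:
    """True if a reboot lasting reboot_s seconds can outlast the retry window.

    Worst case: the first failed check lands right as the reboot starts.
    """
    return reboot_s > seconds_until_hard_critical(max_check_attempts, retry_interval_s)
```

With hypothetical settings of 3 attempts and a 60 second retry interval, the window is 120 seconds, so a 90 second reboot would pass unnoticed while a 180 second reboot could alert.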
Comment 3•14 years ago
That's one theory. I don't know what the thresholds are right now.
I'll be taking a deeper look at this on Monday, and will reassign to myself at that point unless zandr has different plans.
Comment 4•14 years ago
OK, here's a funny one. I'm logged into linux-ix-slave08 right now, and it's pingable. However, Nagios has
Current Status: CRITICAL (for 60d 19h 48m 33s) (Has been acknowledged)
Last Check Time: 01-15-2011 10:33:57
Two problems here:
* somehow nagios has been pinging this box for two months and not getting a response
* PING failure is not causing Nagios to think that the host is down (which would appear as a host problem, not a service problem, and would silence all service alerts for the host). In fact, despite a lot of known-down hosts, our "hosts" list is entirely green!
Comment 5•14 years ago
It *does* look like Nagios is avoiding checking the other services when PING is down - it hasn't looked at the number of running buildbot processes since November, either.
I submitted a passive "OK" result for the host - let's see if that gets it back into action.
Comment 6•14 years ago
It did not get it back into action. For a while, the service was green/OK, but now it's CRITICAL again (although with no alert to #build). The host has been up for almost 2h, and it's not been 2h since I posted the last comment, so in principle it should have been pingable for that entire time.
I'm tcpdumping ICMP messages on the machine now. I see pings from nagios to other hosts, but not yet to linux-ix-slave08.
Comment 7•14 years ago
OK, I tried twice, and I don't see any pings to linux-ix-slave08 via tcpdump on that host[1], yet nagios reverts to marking the PING service on that slave as CRITICAL.
Maybe it's time to punt this to netops to see if the pings are blocked for some reason? We could also do a simple test by running 'ping' on the command line on bm-admin01 while running tcpdump on the slave.
[1] note that I do see a lot of pings to various talos-* slaves, but not to any non-talos* slaves
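The ping-plus-tcpdump test suggested above could be set up roughly like this. This is only a sketch in Python that builds the two command lines; the packet count of 5 and the capture filter are assumptions, and the filter assumes bm-admin01 is the host the pings originate from.

```python
import shlex
from typing import List


def ping_command(target: str, count: int = 5) -> List[str]:
    """Command to run on the Nagios side (bm-admin01): send a few echo requests."""
    return ["ping", "-c", str(count), target]


def capture_command(source: str) -> List[str]:
    """Command to run on the slave: show only ICMP packets arriving from source."""
    return ["tcpdump", "-n", f"icmp and src host {source}"]


if __name__ == "__main__":
    # Print both commands so they can be pasted into the two shells.
    print("on bm-admin01:", shlex.join(ping_command("linux-ix-slave08")))
    print("on the slave: ", shlex.join(capture_command("bm-admin01")))
```

If ping reports 100% loss while tcpdump on the slave shows nothing arriving, the requests are going somewhere else or being dropped on the way, rather than being ignored by the slave.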
Comment 8•14 years ago
linux-ix-slave08 will be fallout from it moving to SCL and the nagios config not getting regenerated (it caches IP addresses or something). It moved at bug 612299 comment #14. IT can fix that up pretty quick.
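A minimal sketch of how stale addresses like this could be spotted, assuming the addresses Nagios is using can be exported as a hostname-to-IP mapping (that export step is hypothetical and not shown). It compares each configured address with what DNS returns now.

```python
import socket
from typing import Callable, Dict, List, Tuple


def find_stale_addresses(
    configured: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> List[Tuple[str, str, str]]:
    """Return (host, configured_ip, current_ip) for every host whose DNS
    address no longer matches the address in the monitoring config."""
    stale = []
    for host, configured_ip in sorted(configured.items()):
        try:
            current_ip = resolve(host)
        except OSError:
            # Host does not resolve at all; that is a different problem.
            continue
        if current_ip != configured_ip:
            stale.append((host, configured_ip, current_ip))
    return stale
```

The `resolve` parameter defaults to a real DNS lookup but can be swapped for a lookup table when trying the function out offline.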
Comment 9•14 years ago
(In reply to comment #0)
> it's not often, but they come in clumps:
Is this different from the usual sequence: a talos box falls over, the nagios report isn't actioned (adding the box to the reboot bug, acking nagios), and nagios repeats the alert every 2 hours?
Comment 10•14 years ago
e.g. for talos-r3-fed64-033:
* the last job it did ended at 2011-01-12 04:59:45 (status db) [all times in PDT]
* the PING alert started failing 2011-01-12 05:02, which will be the reboot after that last job
* nagios thinks it made these notifications:
https://nagios.mozilla.org/nagios/cgi-bin/notifications.cgi?host=talos-r3-fed64-033.build&service=PING
* if you step back in time on that there are *no* OK notifications between Jan 12 and now
* therefore this box has been down for 4 days
So what am I missing? This looks just like the usual 'talos box failing to reboot' issue.
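The reasoning in the list above can be sketched in Python: walk the notification history in time order and keep the first CRITICAL seen since the last OK. The list-of-tuples input format is an assumption made for illustration; it is not what notifications.cgi actually returns.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def current_outage_start(
    notifications: List[Tuple[datetime, str]],
) -> Optional[datetime]:
    """Return when the ongoing outage began, or None if the last state is OK.

    Repeated CRITICAL notifications do not reset the start time;
    any OK notification does.
    """
    start: Optional[datetime] = None
    for when, state in sorted(notifications):
        if state == "OK":
            start = None
        elif state == "CRITICAL" and start is None:
            start = when
    return start
```

Fed a history with a first CRITICAL at 2011-01-12 05:02 and only repeat CRITICALs after it, the function returns that first timestamp, which matches the conclusion that the box has been down since the reboot after its last job.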
Comment 11•14 years ago
bug 626872 filed to reset the IP cache as nthomas mentioned in comment #8 (oops, I blamed that comment on zandr, sorry!)
I'm keeping this open to try to find an example that's not linux-ix-slave08.
Comment 12•14 years ago
OK, no luck - all of the other ping failures are either transient (by design) or real.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → INVALID
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations