625867 - investigate why a few slaves are causing nagios PING CRITICAL alerts

Reporter

Description

•

14 years ago

it's not often, but they come in clumps:

[09] talos-r3-fed64-033.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[11] talos-r3-fed64-017.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[15] talos-r3-fed-008.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[16] linux-ix-slave20.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%

If I react slowly and look, the slaves are fine. I have only once caught one when it was truly offline. I also never see the alert saying that they are OK again.

Mike Taylor [:bear]

Reporter

Updated

•

14 years ago

Assignee: server-ops-releng → zandr

Blocks: releng-nagios

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 2

•

14 years ago

Maybe they took slightly longer than usual to reboot after completing a job and tripped the nagios threshold on ping? (We had a similar problem before with the n810s, hence the idea.)

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

14 years ago

That's one theory. I don't know what the thresholds are right now. I'll be taking a deeper look at this on Monday, and will reassign to myself at that point unless zandr has different plans.

Dustin J. Mitchell [:dustin] (he/him)

Comment 4

•

14 years ago

OK, here's a funny one. I'm logged into linux-ix-slave08 right now, and it's pingable. However, Nagios has

Current Status: CRITICAL (for 60d 19h 48m 33s) (Has been acknowledged)
Last Check Time: 01-15-2011 10:33:57

Two problems here:

* somehow nagios has been pinging this box for two months and not getting a response
* PING failure is not causing Nagios to think that the host is down (which would appear as a host problem, not a service problem, and would silence all service alerts for the host).

In fact, despite a lot of known-down hosts, our "hosts" list is entirely green!
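To make the second problem concrete, here is a minimal sketch of how one might list hosts whose PING service is CRITICAL while the host itself still shows as up. The `ServiceStatus` record and its field names are assumptions made for illustration; they are not Nagios's actual status format, and the sample rows are made up apart from the hostnames.

```python
from dataclasses import dataclass


@dataclass
class ServiceStatus:
    # Hypothetical record: one row per (host, service) pair.
    host: str
    service: str
    state: str       # service state, e.g. "OK" or "CRITICAL"
    host_state: str  # host state as the monitor sees it, "UP" or "DOWN"


def ping_critical_but_host_up(statuses):
    """Return hosts whose PING service is CRITICAL while the host is marked UP."""
    return sorted({
        s.host
        for s in statuses
        if s.service == "PING" and s.state == "CRITICAL" and s.host_state == "UP"
    })


sample = [
    ServiceStatus("linux-ix-slave08", "PING", "CRITICAL", "UP"),
    ServiceStatus("linux-ix-slave08", "buildbot", "OK", "UP"),
    ServiceStatus("linux-ix-slave20", "PING", "OK", "UP"),
]
suspects = ping_critical_but_host_up(sample)
```

Any host returned by this kind of check is in the inconsistent state described above: the service view says unreachable, the host view says green.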

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

14 years ago

It *does* look like Nagios is avoiding checking the other services when PING is down - it hasn't looked at the number of running buildbot processes since November, either. I submitted a passive "OK" result for the host - let's see if that gets it back into action.

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

14 years ago

It did not get it back into action. For a while, the service was green/OK, but now it's CRITICAL again (although with no alert to #build). The host has been up for almost 2h, and it's not been 2h since I posted the last comment, so in principle it should have been pingable for that entire time. I'm tcpdumping ICMP messages on the machine now. I see pings from nagios to other hosts, but not yet to linux-ix-slave08.

Dustin J. Mitchell [:dustin] (he/him)

Comment 7

•

14 years ago

OK, I tried twice, and I don't see any pings to linux-ix-slave08 via tcpdump on that host[1], yet nagios reverts to marking the PING service on that slave as CRITICAL. Maybe it's time to punt this to netops to see if the pings are blocked for some reason? We could also do a simple test by running 'ping' on the command line on bm-admin01 while running tcpdump on the slave.

[1] note that I do see a lot of pings to various talos-* slaves, but not to any non-talos* slaves

Nick Thomas [:nthomas] (UTC+12)

Comment 8

•

14 years ago

linux-ix-slave08 will be fallout from it moving to SCL and the nagios config not getting regenerated (it caches IP addresses or something). It moved at bug 612299 comment #14. IT can fix that up pretty quick.

Nick Thomas [:nthomas] (UTC+12)

Comment 9

•

14 years ago

(In reply to comment #0)
> it's not often, but they come in clumps:

Is this different from the usual sequence: a talos box falls over, the nagios report isn't actioned (adding to the reboot bug, acking nagios), and nagios repeats the alert every 2 hours?

Nick Thomas [:nthomas] (UTC+12)

Comment 10

•

14 years ago

e.g. for talos-r3-fed64-033:

* the last job it did ended at 2011-01-12 04:59:45 (status db) [all times in PDT]
* the PING alert started failing 2011-01-12 05:02, which will be the reboot after that last job
* nagios thinks it made these notifications: https://nagios.mozilla.org/nagios/cgi-bin/notifications.cgi?host=talos-r3-fed64-033.build&service=PING
* if you step back in time on that there are *no* OK notifications between Jan 12 and now
* therefore this box has been down for 4 days

So what am I missing? Looks just like the usual 'talos box failing to reboot' issue.
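The reasoning in that list can be sketched as a small function over the notification history: walk the notifications in time order and remember the first non-OK one since the last OK. The `(timestamp, state)` tuple format is an assumption for illustration, not what notifications.cgi actually returns, and the repeat timestamps and the "now" value are made up to match the 2-hourly repeats and the 4 days mentioned in this bug.

```python
from datetime import datetime


def down_since(notifications):
    """Return when the current outage began, or None if the latest state is OK.

    `notifications` is an iterable of (datetime, state) tuples, where state is
    a string such as "OK" or "CRITICAL".
    """
    start = None
    for when, state in sorted(notifications):
        if state == "OK":
            start = None          # recovery ends any outage in progress
        elif start is None:
            start = when          # first failure after the last OK
    return start


history = [
    (datetime(2011, 1, 11, 20, 0), "OK"),
    (datetime(2011, 1, 12, 5, 2), "CRITICAL"),   # first failure, per the list above
    (datetime(2011, 1, 12, 7, 2), "CRITICAL"),   # hypothetical 2-hourly repeats
    (datetime(2011, 1, 12, 9, 2), "CRITICAL"),
]
outage_start = down_since(history)
assumed_now = datetime(2011, 1, 16, 6, 0)        # hypothetical "now"
days_down = (assumed_now - outage_start).days
```

With no OK notification after Jan 12, the outage start stays pinned at the first CRITICAL, which is the basis for the "down for 4 days" conclusion.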

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

14 years ago

Bug 626872 filed to reset the IP cache as nthomas mentioned in comment #8 (oops, I blamed that comment on zandr, sorry!). I'm keeping this open to try to find an example that's not linux-ix-slave08.

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

•

14 years ago

OK, no luck - all of the other ping failures are either transient (by design) or real.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → INVALID

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

14 years ago

Depends on: 626872

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations


Categories

(Infrastructure & Operations :: RelOps: General, task)

Tracking

(Not tracked)

People

(Reporter: bear, Assigned: zandr)

References

Details

Crash Data

Security

(public)

User Story
