Closed
Bug 886637
Opened 11 years ago
Closed 11 years ago
change releng slave notifications to not alert host down until 7 hour duration
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hwine, Assigned: ashish)
References
Details
(Whiteboard: [reit-nagios])
Attachments
(1 file, 1 obsolete file)
We now have automation to handle the majority of host down issues. The automation runs every 6 hours.
This means that no human is going to take an action on such a host down until the automation has a chance to fix things.
Please change the slave host down notifications to not alert until they have been down for over 7 hours. We don't want to change ec2 hosts.
This includes:
- build slaves
- try slaves
- talos slaves
- tegra slaves
- panda slaves
The following hosts should NOT be modified - they continue to be handled by humans:
- buildbot masters (build & test)
- machines in ec2
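The 7-hour figure follows from the 6-hour automation cycle plus some headroom. A minimal sketch of that arithmetic, assuming a one-hour margin (inferred from the numbers above, not stated explicitly) and a delay expressed in minutes:

```python
from datetime import timedelta

AUTOMATION_INTERVAL = timedelta(hours=6)  # automation cadence from the description
SAFETY_MARGIN = timedelta(hours=1)        # assumption: headroom for the run to finish


def notification_delay_minutes(interval: timedelta, margin: timedelta) -> int:
    """Return the delay, in whole minutes, before a host-down alert should fire."""
    return int((interval + margin).total_seconds() // 60)


delay = notification_delay_minutes(AUTOMATION_INTERVAL, SAFETY_MARGIN)
print(delay)  # 420
```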
Assignee
Updated•11 years ago
Assignee: server-ops → ashish
Comment 1•11 years ago
Hal: I presume this should also apply to all service checks as well as host checks since they will alert much sooner than 7 hours? Also, what mechanism will you be using to catch when large numbers (or entire silos) of hosts go offline at once due to a service disruption? Will this be done manually by the sheriffs now?
Flags: needinfo?(hwine)
Reporter
Comment 2•11 years ago
Amy:
1. Good question -- I think nagios doesn't check services on "known down" boxes. But, I don't know if it makes that decision on "soft" or "hard" downs. If "soft", then we're okay. Let's assume soft until proven otherwise.
2. The "entire silo" question is being addressed separately as part of the "let's get nagios working" effort.
I.e. this is a "progress not perfection" step, and we'll be watching for the challenges.
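For context on the soft/hard question: Nagios treats a failing check as SOFT until it has failed max_check_attempts times in a row, at which point the state becomes HARD. The simplified Python model below illustrates only that rule. It is a sketch of the general concept, not Nagios code, and it does not settle which state Nagios consults for service checks on a down host. The class and state names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class HostState:
    """Toy model of SOFT/HARD state tracking (names are illustrative)."""
    max_check_attempts: int
    failed_attempts: int = 0

    def record_check(self, ok: bool) -> str:
        """Record one check result and return 'UP', 'SOFT_DOWN', or 'HARD_DOWN'."""
        if ok:
            self.failed_attempts = 0
            return "UP"
        # Cap the counter so it never exceeds max_check_attempts.
        self.failed_attempts = min(self.failed_attempts + 1, self.max_check_attempts)
        if self.failed_attempts < self.max_check_attempts:
            return "SOFT_DOWN"
        return "HARD_DOWN"


host = HostState(max_check_attempts=3)
states = [host.record_check(ok) for ok in (False, False, False, True)]
print(states)  # ['SOFT_DOWN', 'SOFT_DOWN', 'HARD_DOWN', 'UP']
```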
Flags: needinfo?(hwine)
Reporter
Comment 3•11 years ago
Ashish - what is the current ETR on this -- it will make life much better for buildduty.
Flags: needinfo?(ashish)
Comment 4•11 years ago
re loss of entire silos, the cluster checks on nagios1.private.releng.scl3.mozilla.com should catch problems.
Assignee
Comment 5•11 years ago
(In reply to Hal Wine [:hwine] from comment #3)
> Ashish - what is the current ETR on this -- it will make life much better
> for buildduty.
Need a little assistance here. A list of hostgroups would be most helpful [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!
:arr Is it ok to move the hosts to the "very lazy" category and bump first_notification_delay to 420m?
Flags: needinfo?(ashish) → needinfo?(arich)
Comment 6•11 years ago
Very lazy is just the tegras, I believe, and since they're included in this list, I think that would work. For the pandas, we don't want to change the timing for the imaging servers, so I'm not sure what magic you would need to do to make that happen (since I'm not 100% sure how the panda check is set up to query mozpool).
Flags: needinfo?(arich)
Comment 7•11 years ago
Also, as part of this you probably want to revisit the cluster checks with releng to make sure that they're set at appropriate thresholds (and add cluster checks in silos where they don't exist).
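A cluster check of the kind mentioned here alerts on how much of a hostgroup is down rather than on individual hosts. A rough Python sketch of that threshold logic follows. The function name and the 10%/25% thresholds are hypothetical placeholders, not the values configured on the releng Nagios servers.

```python
def cluster_status(down: int, total: int,
                   warn_pct: float = 10.0, crit_pct: float = 25.0) -> str:
    """Classify a hostgroup by the percentage of its hosts that are down.

    warn_pct and crit_pct are illustrative defaults, not real releng settings.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct_down = 100.0 * down / total
    if pct_down >= crit_pct:
        return "CRITICAL"
    if pct_down >= warn_pct:
        return "WARNING"
    return "OK"


print(cluster_status(2, 80))   # 2.5% down  -> OK
print(cluster_status(10, 80))  # 12.5% down -> WARNING
print(cluster_status(40, 80))  # 50% down   -> CRITICAL
```

With per-host alerts delayed by 7 hours, a check like this is what would surface a silo-wide outage promptly, so the thresholds deserve review per hostgroup.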
Reporter
Comment 8•11 years ago
list taken from https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=summary
Attachment #771370 -
Flags: review?(jhopkins)
Reporter
Comment 9•11 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #5)
> (In reply to Hal Wine [:hwine] from comment #3)
> > Ashish - what is the current ETR on this -- it will make life much better
> > for buildduty.
>
> Need a little assistance here. A list of hostgroups would be most helpful
> [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!
See attachment. Once :jhopkins approves, I hope that removes all blockers to this.
Comment 10•11 years ago
Comment on attachment 771370 [details]
nagios service groups to delay notification on
hwine: The list looks good except for a handful of possible omissions:
HP centos6 mock builders (bld-centos6-hp)
staging windows xp 32 bit talos servers (staging-talos-r3-xp)
windows 8 64 bit talos servers (t-w864-ix)
windows 2008R2 64 bit build hosts (w64-ix-slaves)
Attachment #771370 -
Flags: review?(jhopkins) → review+
Updated•11 years ago
Flags: needinfo?(hwine)
Reporter
Comment 11•11 years ago
with additions from comment 10
:ashish - ETR?
Attachment #771370 -
Attachment is obsolete: true
Flags: needinfo?(hwine) → needinfo?(ashish)
Assignee
Comment 12•11 years ago
All these hostgroups (the one extra being mac-partner-repack) now have their first_notification_delay bumped to 420 minutes:
bld-centos6-hp
linux64-ix-slaves
linux-ix-slaves
mac-partner-repack
mtv1-bld-linux64-ix
mw32-ix-slaves
pandas
prod-talos-linux32-ix
prod-talos-linux64-ix
prod-talos-mtnlion-r5
prod-talos-r3-fed
prod-talos-r3-fed64
prod-talos-r3-w7
prod-talos-r3-xp
prod-talos-r4-lion
prod-talos-r4-snow
prod-t-w732-ix
prod-t-xp32-ix
r5-production-builders
r5-try-builders
scl1-bld-linux64-ix
staging-talos-r3-xp
t-w864-ix
tegras
w64-ix-slaves
mac-partner-repack was included because it is in the same host classification that all the other hostgroups belong to, so I included it as well for simplicity.
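For readers unfamiliar with the directive: first_notification_delay is a standard Nagios host/service setting, measured in time units that equal minutes under the default interval_length of 60 seconds. The Python sketch below renders a host template carrying that setting. The template name releng-slave-delayed is hypothetical, and this is not necessarily how the change was applied in the releng configuration.

```python
def render_host_template(name: str, delay_minutes: int) -> str:
    """Render a Nagios host template that delays the first notification.

    The template name passed in is illustrative; real configs may differ.
    """
    lines = [
        "define host {",
        f"    name                      {name}",
        f"    first_notification_delay  {delay_minutes}",
        "    register                  0",  # template only, not a real host
        "}",
    ]
    return "\n".join(lines)


config = render_host_template("releng-slave-delayed", 420)
print(config)
```

Hosts in the listed hostgroups would then inherit the delay by referencing such a template through their `use` directive.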
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(ashish)
Resolution: --- → FIXED
Reporter
Comment 13•11 years ago
Ashish - thanks, and thanks for noting the repack machine.
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard