Closed Bug 886637 Opened 11 years ago Closed 11 years ago

change releng slave notifications to not alert host down until 7 hour duration

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: [reit-nagios])

Attachments

(1 file, 1 obsolete file)

nagios service groups to delay notification on 11 years ago Hal Wine [:hwine] (use NI) (deleted), text/plain	jhopkins : review+	Details
nagios service groups to delay notification on 11 years ago Hal Wine [:hwine] (use NI) (deleted), text/plain		Details

Hal Wine [:hwine] (use NI)

Reporter

Description

11 years ago

We now have automation to handle the majority of host down issues. The automation runs every 6 hours. This means that no human is going to take an action on such a host down until the automation has a chance to fix things.

Please change the slave host down notifications to not alert until they have been down for over 7 hours. We don't want to change ec2 hosts.

This includes:
- build slaves
- try slaves
- talos slaves
- tegra slaves
- panda slaves

The following hosts should NOT be modified - they continue to be handled by humans:
- buildbot masters (build & test)
- machines in ec2
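The requested policy can be sketched roughly as follows. This is only an illustration: the class labels and the `should_alert` helper are hypothetical names, not part of the actual nagios configuration.

```python
# Hypothetical sketch of the requested alerting policy; all names are illustrative.
AUTOMATION_INTERVAL_HOURS = 6  # from the description: the automation runs every 6 hours
ALERT_DELAY_HOURS = 7          # requested: alert only after more than 7 hours down

# Slave classes whose host-down alerts should be delayed.
DELAYED_CLASSES = {"build", "try", "talos", "tegra", "panda"}


def should_alert(host_class: str, hours_down: float) -> bool:
    """Return True if a host-down alert should be sent for this host."""
    if host_class in DELAYED_CLASSES:
        # Give the automation at least one full cycle before paging a human.
        return hours_down > ALERT_DELAY_HOURS
    # Buildbot masters, ec2 machines and anything else keep alerting right away.
    return hours_down > 0
```

The delay is deliberately longer than the automation interval, so a down slave sees at least one automation run before anyone is notified.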

Ashish Vijayaram [:ashish]

Assignee

Updated

11 years ago

Assignee: server-ops → ashish

Amy Rich [:arr] [:arich]

Comment 1

11 years ago

Hal: I presume this should also apply to all service checks as well as host checks since they will alert much sooner than 7 hours? Also, what mechanism will you be using to catch when large numbers (or entire silos) of hosts go offline at once due to a service disruption? Will this be done manually by the sheriffs now?

Flags: needinfo?(hwine)

Hal Wine [:hwine] (use NI)

Reporter

Comment 2

11 years ago

Amy:
1. Good question -- I think nagios doesn't check services on "known down" boxes. But, I don't know if it makes that decision on "soft" or "hard" downs. If "soft", then we're okay. Let's assume soft until proven otherwise.
2. The "entire silo" question is being addressed separately as part of the "let's get nagios working". I.e. this is a "progress not perfection" step, and we'll be watching for the challenges.

Flags: needinfo?(hwine)

Hal Wine [:hwine] (use NI)

Reporter

Comment 3

11 years ago

Ashish - what is the current ETR on this -- it will make life much better for buildduty.

Flags: needinfo?(ashish)

Nick Thomas [:nthomas] (UTC+12)

Comment 4

11 years ago

re loss of entire silos, the cluster checks on nagios1.private.releng.scl3.mozilla.com should catch problems.

Ashish Vijayaram [:ashish]

Assignee

Comment 5

11 years ago

(In reply to Hal Wine [:hwine] from comment #3)
> Ashish - what is the current ETR on this -- it will make life much better
> for buildduty.

Need a little assistance here. A list of hostgroups would be most helpful [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!

:arr Is it ok to move the hosts to the "very lazy" category and bump first_notification_delay to 420m?
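For reference, Nagios expresses `first_notification_delay` in time units of `interval_length` seconds, which default to 60, so 420 units correspond to the 7 hours requested in the description. A minimal sketch of that conversion, assuming the default `interval_length` (the `delay_units` helper is a hypothetical name):

```python
def delay_units(hours: float, interval_length_s: int = 60) -> int:
    """Convert a delay in hours to Nagios time units.

    One Nagios time unit lasts `interval_length` seconds (60 by default),
    so with the default setting one unit is one minute.
    """
    return round(hours * 3600 / interval_length_s)


print(delay_units(7))  # 420 with the default interval_length
```

If the installation overrides `interval_length`, the value would need to be scaled accordingly.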

Flags: needinfo?(ashish) → needinfo?(arich)

Amy Rich [:arr] [:arich]

Comment 6

11 years ago

Very lazy is just the tegras, I believe, and since they're included in this list, I think that would work. For the pandas, we don't want to change the timing for the imaging servers, so I'm not sure what magic you would need to do to make that happen (since I'm not 100% sure how the panda check is set up to query mozpool).

Flags: needinfo?(arich)

Amy Rich [:arr] [:arich]

Comment 7

11 years ago

Also, as part of this you probably want to revisit the cluster checks with releng to make sure that they're set at appropriate thresholds (and add cluster checks in silos where they don't exist).
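A cluster check of the kind discussed here generally compares how many hosts in a group are down against warning and critical thresholds. The sketch below only illustrates that idea: the function name and threshold semantics are assumptions, and the real plugin and thresholds used on the releng nagios servers may differ.

```python
def cluster_state(host_up: dict[str, bool], warn: int, crit: int) -> str:
    """Classify a group of hosts by how many of them are down.

    `warn` and `crit` are the numbers of down hosts at which the group
    is reported as WARNING or CRITICAL (assumed semantics).
    """
    down = sum(1 for up in host_up.values() if not up)
    if down >= crit:
        return "CRITICAL"
    if down >= warn:
        return "WARNING"
    return "OK"
```

With per-host notifications delayed by 7 hours, a check like this is what would surface a whole silo going offline at once.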

Hal Wine [:hwine] (use NI)

Reporter

Comment 8

11 years ago

Attached file nagios service groups to delay notification on (obsolete) (deleted) — Details

list taken from https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=summary

Attachment #771370 - Flags: review?(jhopkins)

Hal Wine [:hwine] (use NI)

Reporter

Comment 9

11 years ago

(In reply to Ashish Vijayaram [:ashish] from comment #5)
> (In reply to Hal Wine [:hwine] from comment #3)
> > Ashish - what is the current ETR on this -- it will make life much better
> > for buildduty.
>
> Need a little assistance here. A list of hostgroups would be most helpful
> [https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?hostgroup=all&style=grid]. Thanks!

See attachment, once :jhopkins approves I hope that removes all blockers to this.

John Hopkins (:jhopkins)

Comment 10

11 years ago

Comment on attachment 771370 [details]
nagios service groups to delay notification on

hwine: The list looks good except for a handful of possible omissions:
- HP centos6 mock builders (bld-centos6-hp)
- staging windows xp 32 bit talos servers (staging-talos-r3-xp)
- windows 8 64 bit talos servers (t-w864-ix)
- windows 2008R2 64 bit build hosts (w64-ix-slaves)

Attachment #771370 - Flags: review?(jhopkins) → review+

John Hopkins (:jhopkins)

Updated

11 years ago

Flags: needinfo?(hwine)

Hal Wine [:hwine] (use NI)

Reporter

Comment 11

11 years ago

Attached file nagios service groups to delay notification on (deleted) — Details

with additions from comment 10

:ashish - ETR?

Attachment #771370 - Attachment is obsolete: true

Flags: needinfo?(hwine) → needinfo?(ashish)

Ashish Vijayaram [:ashish]

Assignee

Comment 12

11 years ago

All these hostgroups (extra being mac-partner-repack) have their first_notification_delay bumped to 420 mins:

bld-centos6-hp
linux64-ix-slaves
linux-ix-slaves
mac-partner-repack
mtv1-bld-linux64-ix
mw32-ix-slaves
pandas
prod-talos-linux32-ix
prod-talos-linux64-ix
prod-talos-mtnlion-r5
prod-talos-r3-fed
prod-talos-r3-fed64
prod-talos-r3-w7
prod-talos-r3-xp
prod-talos-r4-lion
prod-talos-r4-snow
prod-t-w732-ix
prod-t-xp32-ix
r5-production-builders
r5-try-builders
scl1-bld-linux64-ix
staging-talos-r3-xp
t-w864-ix
tegras
w64-ix-slaves

mac-partner-repack got included because it was in the same classification of hosts that all the other hostgroups belong to, so I included it as well for simplicity.
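How the delay was wired into these hostgroups is not shown in this bug. As a rough illustration only, a Nagios host template carrying the setting could be rendered like this; the template name `delayed-releng-slave` and the `render_host_template` helper are hypothetical, while `name`, `first_notification_delay` and `register` are standard Nagios object directives.

```python
def render_host_template(name: str, delay_units: int) -> str:
    """Render a Nagios host template that delays the first notification."""
    return (
        "define host {\n"
        f"    name                      {name}\n"
        f"    first_notification_delay  {delay_units}\n"
        "    register                  0\n"  # template only, not a real host
        "}\n"
    )


# Hypothetical template name; 420 time units is 7 hours at the default interval_length.
print(render_host_template("delayed-releng-slave", 420))
```

Hosts in the listed hostgroups would then inherit from such a template with the `use` directive, assuming the configuration is organised that way.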

Status: NEW → RESOLVED

Closed: 11 years ago

Flags: needinfo?(ashish)

Resolution: --- → FIXED

Hal Wine [:hwine] (use NI)

Reporter

Comment 13

11 years ago

Ashish - thanks, and thanks for noting the repack machine.

Nobody; OK to take it and work on it

Updated

10 years ago

Product: mozilla.org → mozilla.org Graveyard


Categories

(mozilla.org Graveyard :: Server Operations, task)
