Closed Bug 617166 Opened 14 years ago Closed 14 years ago

Reduce timeout on idle-slave nagios checks

Tracking

(Not tracked)

Status:

RESOLVED WORKSFORME

People

(Reporter: nthomas, Assigned: arich)

References

Details

(Whiteboard: [buildslaves])

Nick Thomas [:nthomas] (UTC+12)

Reporter

Description

•

14 years ago

We currently have poor visibility into build and test slaves which are hung or idle because of crashes, blocking dialogs that block reboot, or buildbot config errors. For example, yesterday I found about 10 talos-r3-xp-NNN machines which were out of action. We should add nagios checks so that this state is reported to us in the same way as other errors.

This could go a few ways, depending on how fine grained we want to be and what nagios lets us do:

1) Something like the Nightly builds checks we run on stage. We'd report all the machines that haven't done a build recently, or all machines of a certain class. This would probably run on some machine that can query the statusdb.

2) We add a last build check for each slave, and get nagios to report that service against that host. Imagine https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=talos-r3-fed-019 with a PING and a last build check. The actual check runs on some other machine, but it's reported there for convenience (dunno if nagios lets you do this).

We can already query statusdb for the last build of a given slave via buildapi, and could work up some more SQL to return the last build for every active slave (the list of which can be gotten from config.py).
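A minimal sketch of how option 1 might look, assuming the last build times have already been fetched from statusdb and the active slave list from config.py. The function name `find_idle_slaves`, the sample slave numbers, and the 12-hour cutoff are all hypothetical and only there for illustration; no real buildapi or statusdb call is shown.

```python
from datetime import datetime, timedelta, timezone


def find_idle_slaves(last_builds, active_slaves, now, max_idle):
    """Return sorted names of active slaves with no build within max_idle.

    last_builds maps slave name -> datetime of its last build.
    A slave with no recorded build at all is reported as idle too.
    """
    idle = []
    for name in active_slaves:
        last = last_builds.get(name)
        if last is None or now - last > max_idle:
            idle.append(name)
    return sorted(idle)


# Hypothetical sample data standing in for a statusdb query result.
now = datetime(2010, 12, 7, 12, 0, tzinfo=timezone.utc)
last_builds = {
    "talos-r3-xp-001": now - timedelta(hours=1),
    "talos-r3-xp-002": now - timedelta(hours=30),
    "talos-r3-fed-019": now - timedelta(hours=3),
}
# Hypothetical stand-in for the active slave list from config.py.
active_slaves = [
    "talos-r3-xp-001",
    "talos-r3-xp-002",
    "talos-r3-xp-003",
    "talos-r3-fed-019",
]

idle = find_idle_slaves(last_builds, active_slaves, now, timedelta(hours=12))
print(idle)  # ['talos-r3-xp-002', 'talos-r3-xp-003']
```

Driving the loop from the active slave list rather than from the query result means a slave that has never reported a build still shows up, which is the case a per-row SQL query would miss.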

Nick Thomas [:nthomas] (UTC+12)

Reporter

Updated

•

14 years ago

OS: Mac OS X → All

Priority: -- → P3

Whiteboard: [buildslaves]

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Updated

•

14 years ago

Blocks: releng-nagios

Chris Cooper [:coop] (he/him)

Comment 1

•

14 years ago

Where does catlee's current idle slave script live? It sounds like we'd just want to extend that with some more smarts.

Chris AtLee [:catlee]

Comment 2

•

14 years ago

(In reply to comment #1)
> Where does catlee's current idle slave script live? It sounds like we'd just
> want to extend that with some more smarts.

It runs as part of the talos regression detection actually... although there's no requirement for that any more.

Mike Taylor [:bear]

Comment 3

•

14 years ago

The nagios check for idle slaves should be adjusted to this:

warnings - 7 hrs
error - 8 hrs

This will allow a more aggressive detection once the idle-slave-reboot bug 565397 is landed.

Assignee: nobody → zandr

Depends on: 565397
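The thresholds proposed in comment 3 could be expressed roughly as below. This is only a sketch: the function name `idle_status` and the message format are assumptions, and it is not the existing idle-slave check. The numeric codes follow the standard nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from datetime import timedelta

# Standard nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def idle_status(idle_for, warn=timedelta(hours=7), crit=timedelta(hours=8)):
    """Map how long a slave has been idle to a (code, message) pair.

    Defaults are the 7 hr warning / 8 hr error values from comment 3.
    """
    hours = idle_for.total_seconds() / 3600
    if idle_for >= crit:
        return CRITICAL, f"CRITICAL: idle for {hours:.1f}h"
    if idle_for >= warn:
        return WARNING, f"WARNING: idle for {hours:.1f}h"
    return OK, f"OK: idle for {hours:.1f}h"


code, message = idle_status(timedelta(hours=7, minutes=30))
print(code, message)  # 1 WARNING: idle for 7.5h
```

A real plugin would print the message and exit with the returned code so that nagios picks up the state.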

Mike Taylor [:bear]

Updated

•

14 years ago

Assignee: zandr → server-ops-releng

Component: Release Engineering → Server Operations: RelEng

QA Contact: release → zandr

Phong Tran [:phong]

Updated

•

14 years ago

Assignee: server-ops-releng → zandr

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

14 years ago

Summary: Nagios checks for last job on build/test slaves → Reduce timeout on idle-slave nagios checks

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

14 years ago

Depends on: 627126

Amy Rich [:arr] [:arich]

Assignee

Updated

•

14 years ago

Assignee: zandr → arich

Amy Rich [:arr] [:arich]

Assignee

Comment 4

•

14 years ago

Is there a hook for this to become a nagios check yet, or should I hand this bug over to someone in the releng group so that something can be whipped up on the back end first?

Dustin J. Mitchell [:dustin] (he/him)

Comment 5

•

14 years ago

I don't think this is necessary anymore - we're replacing those checks with the idleizer, and they're *already* noisy even at the long time span we have set right now.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WORKSFORME

Nobody; OK to take it and work on it

Updated

•

11 years ago

Component: Server Operations: RelEng → RelOps

Product: mozilla.org → Infrastructure & Operations


Categories

(Infrastructure & Operations :: RelOps: General, task, P3)
