Closed Bug 617166 Opened 14 years ago Closed 14 years ago

Reduce timeout on idle-slave nagios checks

Categories

(Infrastructure & Operations :: RelOps: General, task, P3)

x86
All

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: nthomas, Assigned: arich)

References

Details

(Whiteboard: [buildslaves])

We currently have poor visibility into build and test slaves that are hung or idle because of crashes, blocking dialogs that prevent a reboot, or buildbot config errors. For example, yesterday I found about 10 talos-r3-xp-NNN machines that were out of action. We should add nagios checks so that this state is reported to us in the same way as other errors.

This could go a few ways, depending on how fine-grained we want to be and what nagios lets us do:

1) Something like the Nightly builds checks we run on stage. We'd report all the machines that haven't done a build recently, or all machines of a certain class. This would probably run on some machine that can query the statusdb.

2) We add a last build check for each slave, and get nagios to report that service against that host. Imagine https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=talos-r3-fed-019 with a PING and a last build check. The actual check runs on some other machine, but it's reported there for convenience (dunno if nagios lets you do this).

We can already query statusdb for the last build of a given slave via buildapi, and could work up some more SQL to return the last build for every active slave (the list of which can be gotten from config.py).
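As a rough illustration of option 1, here is a minimal sketch of the "which slaves haven't built recently" logic. It assumes the last-build times have already been fetched from somewhere (statusdb, buildapi, or otherwise) into a plain mapping; the function name `find_idle_slaves`, its parameters, and the sample slave numbers are hypothetical and not taken from any existing script.

```python
from datetime import datetime, timedelta


def find_idle_slaves(last_build_by_slave, now, max_idle):
    """Return the sorted names of slaves with no build within max_idle of now.

    last_build_by_slave maps a slave name to the datetime of its last build,
    or to None if no build is on record (assumed input shape).
    """
    idle = []
    for name, last_build in last_build_by_slave.items():
        # A slave with no recorded build is treated as idle.
        if last_build is None or now - last_build > max_idle:
            idle.append(name)
    return sorted(idle)


if __name__ == "__main__":
    now = datetime(2010, 12, 7, 12, 0)
    # Hypothetical sample data, not real statusdb output.
    sample = {
        "talos-r3-xp-001": now - timedelta(hours=2),
        "talos-r3-xp-002": now - timedelta(hours=30),
        "talos-r3-fed-019": None,
    }
    print(find_idle_slaves(sample, now, timedelta(hours=8)))
```

Keeping the comparison separate from the data source means the same function could back either a single aggregate report or a per-host check.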
OS: Mac OS X → All
Priority: -- → P3
Whiteboard: [buildslaves]
Where does catlee's current idle slave script live? It sounds like we'd just want to extend that with some more smarts.
(In reply to comment #1)
> Where does catlee's current idle slave script live? It sounds like we'd just
> want to extend that with some more smarts.

It runs as part of the talos regression detection actually... although there's no requirement for that any more.
The nagios check for idle slaves should be adjusted to this:

warnings - 7 hrs
error - 8 hrs

This will allow more aggressive detection once the idle-slave-reboot bug 565397 has landed.
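For illustration, the proposed thresholds could map onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) roughly as sketched below. This is an assumption-laden sketch, not the existing check: the function name `idle_check_status` is hypothetical, "error" is assumed to mean Nagios CRITICAL, and the thresholds are assumed to be inclusive.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def idle_check_status(idle_hours, warn_hours=7.0, crit_hours=8.0):
    """Map a slave's idle time in hours to a Nagios exit code.

    Defaults follow the thresholds proposed in this bug; treating the
    boundaries as inclusive is an assumption.
    """
    if idle_hours is None or idle_hours < 0:
        # No usable data about the last job.
        return UNKNOWN
    if idle_hours >= crit_hours:
        return CRITICAL
    if idle_hours >= warn_hours:
        return WARNING
    return OK
```

A real plugin would also print a one-line status message and exit with the returned code, for example via `sys.exit(idle_check_status(hours))`.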
Assignee: nobody → zandr
Depends on: 565397
Assignee: zandr → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Assignee: server-ops-releng → zandr
Summary: Nagios checks for last job on build/test slaves → Reduce timeout on idle-slave nagios checks
Assignee: zandr → arich
Is there a hook for this to become a nagios check yet, or should I hand this bug over to someone in the releng group so that something can be whipped up on the back end first?
I don't think this is necessary anymore - we're replacing those checks with the idleizer, and they're *already* noisy even at the long time span we have set right now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations