Closed
Bug 617166
Opened 14 years ago
Closed 14 years ago
Reduce timeout on idle-slave nagios checks
Categories
(Infrastructure & Operations :: RelOps: General, task, P3)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: nthomas, Assigned: arich)
References
Details
(Whiteboard: [buildslaves])
We currently have poor visibility into build and test slaves that are hung or idle because of crashes, blocking dialogs that prevent a reboot, or buildbot config errors. For example, yesterday I found about 10 talos-r3-xp-NNN machines that were out of action. We should add nagios checks so that this state is reported to us in the same way as other errors.
This could go a few ways, depending on how fine-grained we want to be and what nagios lets us do:
1) something like the Nightly builds checks we run on stage. We'd report all the machines that haven't done a build recently, or all machines of a certain class. This would probably run on some machine that can query the statusdb.
2) we add a last build check for each slave, and get nagios to report that service against that host. Imagine https://nagios.mozilla.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=talos-r3-fed-019 with a PING and a last build check. The actual check runs on some other machine, but it's reported there for convenience (dunno if nagios lets you do this)
We can already query statusdb for the last build of a given slave via buildapi, and could work up some more SQL to return the last build for every active slave (which can be gotten from config.py).
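As a rough illustration of option 1, the sketch below reports every active slave whose last build is older than a chosen cutoff. It assumes the last-build times have already been fetched (for example from statusdb) into a plain dict and that the active slave list has been read from config.py; the function name `find_idle_slaves`, its parameters, and the data shapes are hypothetical and not part of buildapi.

```python
from datetime import datetime, timedelta


def find_idle_slaves(active_slaves, last_build_times, now, max_idle):
    """Return (slave, idle_duration) pairs for slaves idle longer than max_idle.

    active_slaves    -- iterable of slave names (assumed to come from config.py)
    last_build_times -- dict of slave name -> datetime of its last build
                        (assumed to be the result of a statusdb query)
    now              -- datetime to measure idleness against
    max_idle         -- timedelta; slaves idle longer than this are reported

    A slave with no recorded build at all is reported with a duration of None.
    """
    idle = []
    for slave in sorted(active_slaves):
        last_build = last_build_times.get(slave)
        if last_build is None:
            idle.append((slave, None))
        elif now - last_build > max_idle:
            idle.append((slave, now - last_build))
    return idle


# Example with made-up slave names and times.
now = datetime(2010, 12, 7, 12, 0)
last_builds = {
    "talos-r3-xp-001": now - timedelta(hours=1),
    "talos-r3-xp-002": now - timedelta(hours=9),
}
for name, idle_for in find_idle_slaves(
    ["talos-r3-xp-001", "talos-r3-xp-002", "talos-r3-xp-003"],
    last_builds,
    now,
    max_idle=timedelta(hours=8),
):
    print(name, "never built" if idle_for is None else f"idle for {idle_for}")
```

Reporting slaves with no recorded build separately is a design choice in this sketch: a machine that has never taken a job is probably misconfigured, which is one of the failure modes described above.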
Reporter
Updated•14 years ago
OS: Mac OS X → All
Priority: -- → P3
Whiteboard: [buildslaves]
Updated•14 years ago
Blocks: releng-nagios
Comment 1•14 years ago
Where does catlee's current idle slave script live? It sounds like we'd just want to extend that with some more smarts.
Comment 2•14 years ago
(In reply to comment #1)
> Where does catlee's current idle slave script live? It sounds like we'd just
> want to extend that with some more smarts.
It actually runs as part of the talos regression detection, although there's no requirement for that any more.
Comment 3•14 years ago
The nagios check for idle slaves should be adjusted to this:
warnings - 7 hrs
error - 8 hrs
This will allow more aggressive detection once the idle-slave-reboot work in bug 565397 has landed.
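To make the proposed thresholds concrete, here is a minimal sketch of a per-slave check using the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). It assumes that "error" above corresponds to Nagios CRITICAL and that the slave's last build time is supplied by the caller; `check_idle` and its message format are hypothetical, not the existing idle-slave check.

```python
from datetime import datetime, timedelta

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Thresholds proposed in this comment.
WARN_AFTER = timedelta(hours=7)
CRIT_AFTER = timedelta(hours=8)


def check_idle(last_build, now, warn_after=WARN_AFTER, crit_after=CRIT_AFTER):
    """Return (exit_code, message) for a slave whose last build was at last_build.

    last_build may be None when no build is recorded; that case is reported
    as UNKNOWN rather than guessed at.
    """
    if last_build is None:
        return UNKNOWN, "IDLE UNKNOWN - no last build recorded"
    idle = now - last_build
    hours = idle.total_seconds() / 3600
    # Test the critical threshold first so it takes precedence over warning.
    if idle >= crit_after:
        return CRITICAL, f"IDLE CRITICAL - no build for {hours:.1f} hours"
    if idle >= warn_after:
        return WARNING, f"IDLE WARNING - no build for {hours:.1f} hours"
    return OK, f"IDLE OK - last build {hours:.1f} hours ago"


# Example with a made-up timestamp: 7.5 hours idle falls in the warning band.
now = datetime(2010, 12, 7, 12, 0)
code, message = check_idle(now - timedelta(hours=7, minutes=30), now)
print(code, message)
```

A real plugin would print the message and exit with the returned code, so that nagios can attach the result to the slave's host entry.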
Assignee: nobody → zandr
Depends on: 565397
Updated•14 years ago
Assignee: zandr → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
Updated•14 years ago
Assignee: server-ops-releng → zandr
Updated•14 years ago
Summary: Nagios checks for last job on build/test slaves → Reduce timeout on idle-slave nagios checks
Assignee
Updated•14 years ago
Assignee: zandr → arich
Assignee
Comment 4•14 years ago
Is there a hook for this to become a nagios check yet, or should I hand this bug over to someone in the releng group so that something can be whipped up on the back end first?
Comment 5•14 years ago
I don't think this is necessary anymore - we're replacing those checks with the idleizer, and they're *already* noisy even at the long time span we have set right now.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations