add nagios check for taskcluster worker processes
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
People
(Reporter: dhouse, Assigned: dhouse)
References
(Blocks 1 open bug)
Details
User Story
nagios service check configuration options: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#service
Attachments
(5 files, 1 obsolete file)
(deleted), patch | ryanc: review+, dividehex: review+, dragrom: review+, dhouse: checked-in+
(deleted), text/x-github-pull-request | dhouse: checked-in+
(deleted), patch | dividehex: review+, dhouse: checked-in+
(deleted), text/x-github-pull-request | dhouse: checked-in+
(deleted), patch | dhouse: checked-in+
Comment 37 • 6 years ago (Assignee)
These checks have been running, and CIDuty has pointed me to them when there are problems.
Example for Linux in mdc1: https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-linux64-moonshot&style=overview
For now I will leave them not alerting via IRC for every worker. There is some noise because Nagios does not know about quarantined workers or other activity, so the checks are not helpful as alerts, but they are helpful for checking state and monitoring.
If we keep using Nagios long-term and do not rely on other monitoring for worker alerts, it may be useful to change these checks to post to IRC and alert the production group. Alternatively, it may be better to switch to a percentage alert that fires when N% of workers are down.
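A rough sketch of what the percentage alert could look like, written as plain Python and not tied to the checks that were actually deployed. The function name `percent_down_status`, the worker names, and the 10%/25% thresholds are all assumptions for illustration. The only fixed convention used is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Quarantined workers are skipped so they do not count as down, which is the noise mentioned above.

```python
from typing import Dict, Iterable, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def percent_down_status(
    worker_up: Dict[str, bool],
    quarantined: Iterable[str] = (),
    warn_pct: float = 10.0,   # assumed threshold, not from the bug
    crit_pct: float = 25.0,   # assumed threshold, not from the bug
) -> Tuple[int, str]:
    """Return (exit_code, message) for a pool of workers.

    worker_up maps a (hypothetical) worker name to True when its
    worker process is running. Quarantined workers are excluded
    before the percentage is computed.
    """
    skip = set(quarantined)
    considered = {name: up for name, up in worker_up.items() if name not in skip}
    if not considered:
        return UNKNOWN, "UNKNOWN: no workers to evaluate"

    down = [name for name, up in considered.items() if not up]
    pct = 100.0 * len(down) / len(considered)
    summary = f"{len(down)}/{len(considered)} workers down ({pct:.1f}%)"

    if pct >= crit_pct:
        return CRITICAL, f"CRITICAL: {summary}"
    if pct >= warn_pct:
        return WARNING, f"WARNING: {summary}"
    return OK, f"OK: {summary}"
```

A wrapper script would print the message and exit with the returned code, so that Nagios raises one alert per pool instead of one per worker. How the up/down and quarantine state is collected is left open here.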