Closed
Bug 627825
Opened 14 years ago
Closed 11 years ago
review nagios alerts for builds-running, builds-pending
Categories
(Release Engineering :: General, defect, P5)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Assigned: ashish)
References
Details
(Whiteboard: [monitoring][nagios])
To reduce nagios spam, these alerts were intentionally over-relaxed until bug#625978 is fixed and we have some idea of how quickly we can expect the files to be posted.
Once bug#625978 is fixed, we should update this bug with the new nagios thresholds we want, and kick this bug over to ServerOps. For now, filing to track, and leaving in RelEng.
Updated•13 years ago
Component: Release Engineering → Release Engineering: Automation (General)
QA Contact: release → catlee
Hardware: x86 → All
Whiteboard: [monitoring][nagios]
Comment 1•13 years ago
So bug 627821 (relaxing the checks) was WONTFIX so we may never have eased them.
Amy, what are the age thresholds for these nagios checks on dm-wwwbuild01:
http_age - build-4hr
http_age - builds-pending
http_age - builds-running
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 2•11 years ago (Reporter)
(In reply to Nick Thomas [:nthomas] from comment #1)
> So bug 627821 (relaxing the checks) was WONTFIX so we may never have eased
> them.
>
> Amy, what are the age thresholds for these nagios checks on
> dm-wwwbuild01:
> http_age - build-4hr
> http_age - builds-pending
> http_age - builds-running
arr: ping? what is the current threshold on these alerts?
ed: also, as sheriffs closed the trees because of these build json files being stale over the weekend (bug#926245), any opinions on what threshold you'd be looking for?
Flags: needinfo?(emorley)
Flags: needinfo?(arich)
Comment 3•11 years ago
(In reply to John O'Duinn [:joduinn] from comment #2)
> ed: also, as sheriffs closed the trees because of these build json files
> being stale over the weekend (bug#926245), any opinions on what threshold
> you'd be looking for?
I requested "http_age - build-4hr" be adjusted in bug 914686 to...
* check_interval: 300s
* file age threshold: 300s
...I'm presuming this affected both our email alert and the #releng IRC alert (arr: would be good to confirm they are linked?).
Something similar for builds-running and builds-pending would be ideal :-)
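For illustration only, here is a minimal sketch of what a 300s file age check amounts to, assuming the check compares the HTTP Last-Modified header of the file against the current time. The function names are hypothetical, and treating the single 300s threshold as a critical threshold is an assumption; this is not the actual plugin behind "http_age".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def file_age_seconds(last_modified: str, now: datetime) -> float:
    """Age in seconds of a file, given its HTTP Last-Modified header value."""
    modified = parsedate_to_datetime(last_modified)
    if modified.tzinfo is None:
        # Assumption: a header without a usable zone is treated as UTC.
        modified = modified.replace(tzinfo=timezone.utc)
    return (now - modified).total_seconds()


def check_file_age(last_modified: str, now: datetime, threshold: float = 300.0):
    """Return (exit_code, message); CRITICAL only when age exceeds threshold."""
    age = file_age_seconds(last_modified, now)
    if age > threshold:
        return CRITICAL, f"CRITICAL: file is {age:.0f}s old (threshold {threshold:.0f}s)"
    return OK, f"OK: file is {age:.0f}s old"
```

A real check would first fetch the header from the server (for example with a HEAD request) and pass the current UTC time as `now`.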
Flags: needinfo?(emorley)
Comment 4•11 years ago
The SRE team takes care of nagios now, so tagging ashish to answer comment 1, since he'll probably be up soon.
Flags: needinfo?(arich) → needinfo?(ashish)
Comment 5•11 years ago (Assignee)
Given the age of the original request, these checks don't exist anymore, likely gone with dm-wwwbuild01:
> http_age - builds-pending
> http_age - builds-running
From what I gather, "http_age - builds-4hr" is now "http file age - /buildjson/builds-4hr.js.gz". I'm positive the other two checks were not lost in migration (to the current Nagios infrastructure).
I'll be glad to add these back as a priority. Please let me know which server to check these on and the other parameters: check_interval, thresholds, contact groups.
Flags: needinfo?(ashish)
Comment 6•11 years ago (Assignee)
Okay, in the general interest of keeping things monitored, I've gone ahead and added checks for builds-pending.js and builds-running.js. Same thresholds and config as for builds-4hr.js.gz:
Check interval: 300s
Check failures before alert: 3
File age threshold: 300s
URL: https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=builddata.pub.build.mozilla.org
Please reopen this bug for any changes.
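As a rough guide to what these settings mean in practice, the sketch below estimates how long after a file's last update an alert could fire. It assumes failed checks are retried at the same 300s interval and that the alert goes out on the third consecutive failure; the retry spacing is not stated in this bug, and the function name is hypothetical.

```python
def alert_delay_bounds(check_interval: int, failures_before_alert: int,
                       age_threshold: int) -> tuple:
    """Return (earliest, latest) seconds from the file's last update to an alert.

    Assumption: every check, including retries, runs check_interval apart.
    """
    # The file must first exceed the age threshold, then fail N checks in a row.
    earliest = age_threshold + (failures_before_alert - 1) * check_interval
    # Worst case: the file goes stale just after a check ran, adding one interval.
    latest = earliest + check_interval
    return earliest, latest


# Settings from this comment: 300s interval, 3 failures, 300s age threshold.
print(alert_delay_bounds(300, 3, 300))
```

Under those assumptions, a file that stops updating would alert somewhere between 15 and 20 minutes after its last write.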
Assignee: nobody → ashish
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Component: General Automation → General