Closed
Bug 625978
Opened 14 years ago
Closed 13 years ago
flapping nagios alerts for builds-running, builds-pending
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: joduinn, Unassigned)
References
Details
(Keywords: buildapi)
Not sure what's causing this flapping alert; filing to track fixing the root cause. Leaving in RelEng until we figure out what needs fixing.
16:38:45 < nagios> [19] dm-wwwbuild01:http_age - builds-running is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:30 ago - 138425 bytes in 0.009 second response time
16:38:45 < nagios> [20] dm-wwwbuild01:http_age - builds-pending is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:29 ago - 148134 bytes in 0.011 second response time
16:41:47 < nagios> dm-wwwbuild01:http_age - builds-running is OK: HTTP OK: HTTP/1.1 200 OK - 131430 bytes in 0.011 second response time
16:41:47 < nagios> dm-wwwbuild01:http_age - builds-pending is OK: HTTP OK: HTTP/1.1 200 OK - 143845 bytes in 0.013 second response time
16:58:43 < nagios> [29] dm-wwwbuild01:http_age - builds-pending is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:17 ago - 148940 bytes in 0.009 second response time
16:58:43 < nagios> [30] dm-wwwbuild01:http_age - builds-running is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:19 ago - 132278 bytes in 0.007 second response time
17:01:46 < nagios> dm-wwwbuild01:http_age - builds-running is OK: HTTP OK: HTTP/1.1 200 OK - 127032 bytes in 0.008 second response time
17:01:46 < nagios> dm-wwwbuild01:http_age - builds-pending is OK: HTTP OK: HTTP/1.1 200 OK - 144640 bytes in 0.009 second response time
Updated•14 years ago
Assignee: nobody → nrthomas
Priority: -- → P2
Comment 1•14 years ago
The setup
* the two files get on cruncher updated by cron, every 5 minutes
* there's a cron job that pulls them over to build.m.o every 5 minutes
* the nagios check allows the files to be up to 7 mins stale on build.m.o
Recently cruncher and/or the db slave has been running slower than usual, and still is even after bug 623821 improved other jobs using the same host & db.
I'm going to shift the cruncher cronjob a little earlier w.r.t. the sync job to cope with the slower script time.
We could also make the pull more frequent, or make it a push when the script completes.
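For reference, the alert text in the description ("HTTP CRITICAL: ... Last modified 0:09:30 ago") matches the output of the stock check_http plugin with a maximum document age. A minimal sketch of such a check, assuming the hostname, URL path and the 7-minute threshold (420s) rather than quoting the real nagios config:

    # goes CRITICAL when the file's Last-Modified header is older than 7 minutes
    check_http -H dm-wwwbuild01 -u /builds-running.js -M 420

With two independent 5-minute crons feeding a 7-minute staleness window, any generation run that takes a bit longer than usual leaves the copy on build.m.o past the threshold until the next sync, which is exactly the flapping pattern above.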
Comment 2•14 years ago
(In reply to comment #1)
> * the two files get on cruncher updated by cron, every 5 minutes
Once more in English,
* the two files get updated on cruncher by cron, every 5 minutes
The cruncher cron was at 4,9,14 ... past the hour, and is now at 2,7,17,...
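In crontab terms the shift described in comment 1 is just a change to the minute field. A sketch with a made-up script path (only the minute offsets come from this bug):

    # before: generate builds-pending/builds-running at :04, :09, :14, ... past the hour
    # 4-59/5 * * * *  /path/to/generate_build_reports.sh
    # after: run a couple of minutes earlier, at :02, :07, ...
    2-57/5 * * * *  /path/to/generate_build_reports.sh

so each run has a bit more headroom to finish before the separate sync cron copies the files over to build.m.o.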
Comment 3•14 years ago
Hasn't flapped over the weekend; let's see what the week brings.
Comment 4•14 years ago
I'm putting this in a downtime until 18:00 Monday.
Comment 5•14 years ago
No alerts since the 22nd. Calling it FIXED by the work catlee has done to add indexes.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 6•14 years ago
This is happening again. nthomas is busy, so if someone else can have a look, please do.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•14 years ago
Assignee: nrthomas → nobody
Comment 7•14 years ago
I put the nagios services in a downtime until Monday morning, but that isn't meant to indicate a choice of priority on fixing the underlying problem.
Comment 8•14 years ago
We got some more of these alerts today and yesterday. It turns out the rsync from cruncher to build.m.o runs every 2 minutes now (it used to be 5 mins), so I set the cron that generates builds-{running,pending} to run every 2 mins as well, on alternate minutes. That should help a bunch with races between the 5 min cron, the 2 min rsync, and the nagios testing interval.
catlee suggested we move to a model of
generate content && rsync
rather than doing it separately.
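A sketch of that combined model as a single cron entry on cruncher (script, paths and destination are hypothetical; only the "generate then push" idea comes from this bug):

    # push the freshly generated files only after generation succeeds
    */2 * * * *  /path/to/generate_build_reports.sh && rsync -a /var/www/buildapi/ dm-wwwbuild01:/var/www/buildapi/

That way the copy on build.m.o only changes after a successful run, and the file age nagios sees tracks the real end-to-end time instead of the race between two independent crons.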
Comment 9•14 years ago
Is this a releng process bug at this point, or is there a request for IT?
Comment 10•14 years ago
Nothing for IT to do right now. We know we have a leak somewhere in self-serve, hopefully catlee can track that down sometime.
Updated•14 years ago
No longer blocks: releng-nagios
Updated•14 years ago
Whiteboard: [buildapi][selfserve]
Comment 11•13 years ago
We haven't had a problem with this for some time, and cruncher got a bunch more memory in bug 730319.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering