Closed Bug 625978 Opened 14 years ago Closed 13 years ago

flapping nagios alerts for builds-running, builds-pending

Categories

(Release Engineering :: General, defect, P2)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Unassigned)

References

Details

(Keywords: buildapi)

Not sure what's causing this flapping alert; filing to track fixing the root cause. Leaving in RelEng until we figure out what needs fixing.

16:38:45 < nagios> [19] dm-wwwbuild01:http_age - builds-running is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:30 ago - 138425 bytes in 0.009 second response time
16:38:45 < nagios> [20] dm-wwwbuild01:http_age - builds-pending is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:29 ago - 148134 bytes in 0.011 second response time
16:41:47 < nagios> dm-wwwbuild01:http_age - builds-running is OK: HTTP OK: HTTP/1.1 200 OK - 131430 bytes in 0.011 second response time
16:41:47 < nagios> dm-wwwbuild01:http_age - builds-pending is OK: HTTP OK: HTTP/1.1 200 OK - 143845 bytes in 0.013 second response time
16:58:43 < nagios> [29] dm-wwwbuild01:http_age - builds-pending is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:17 ago - 148940 bytes in 0.009 second response time
16:58:43 < nagios> [30] dm-wwwbuild01:http_age - builds-running is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:19 ago - 132278 bytes in 0.007 second response time
17:01:46 < nagios> dm-wwwbuild01:http_age - builds-running is OK: HTTP OK: HTTP/1.1 200 OK - 127032 bytes in 0.008 second response time
17:01:46 < nagios> dm-wwwbuild01:http_age - builds-pending is OK: HTTP OK: HTTP/1.1 200 OK - 144640 bytes in 0.009 second response time
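For anyone reading the alerts above: an http_age style check boils down to comparing the page's Last-Modified header against a maximum allowed age. Below is a minimal sketch of that comparison, not the actual Nagios plugin. The function name, the example timestamps and the use of UTC are assumptions made for illustration; the 7 minute default is the staleness limit mentioned later in this bug.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def http_age_status(last_modified, now, critical_after=timedelta(minutes=7)):
    """Return (state, age) for an HTTP Last-Modified header value.

    Hypothetical helper: the real check's options and thresholds may differ.
    """
    modified = parsedate_to_datetime(last_modified)  # timezone-aware for "GMT"
    age = now - modified
    state = "CRITICAL" if age > critical_after else "OK"
    return state, age


# Made-up timestamps, chosen so the stale case is 9m30s old.
now = datetime(2011, 1, 14, 16, 38, 45, tzinfo=timezone.utc)
stale_header = "Fri, 14 Jan 2011 16:29:15 GMT"
fresh_header = "Fri, 14 Jan 2011 16:36:45 GMT"

for header in (stale_header, fresh_header):
    state, age = http_age_status(header, now)
    print(f"builds-running is {state}: last modified {age} ago")
```

Note that the HTTP response itself is 200 OK in both states; only the age of the content flips the result, which is why a slightly late file update shows up as a flapping alert.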
Assignee: nobody → nrthomas
Priority: -- → P2
The setup:
* the two files get on cruncher updated by cron, every 5 minutes
* there's a cron job that pulls them over to build.m.o every 5 minutes
* the nagios check allows the files to be up to 7 mins stale on build.m.o

Recently cruncher and/or the db slave has been running slower than usual, and still is after bug 623821 improved other jobs using the same host & db. I'm going to shift the cruncher cronjob a little earlier w.r.t. the sync job to cope with the slower script time. We could also make the pull more frequent, or make it a push when the script completes.
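To make the timing argument concrete, here is a rough model of how stale the copy on build.m.o can get. It is only a sketch under stated assumptions: the sync job is assumed to fire at 0,5,10,... past the hour and to copy the newest completed file instantly with its mtime preserved, and the script runtimes (0.5 and 1.5 minutes) are invented. The generator offsets 4 and 2 are the old and new cron minutes mentioned in this bug; the function name is hypothetical.

```python
def worst_case_age(gen_offset, runtime, sync_offset=0.0, period=5.0):
    """Worst-case age in minutes of the synced copy, just before the next sync.

    Model (assumed): the generator starts at gen_offset + k*period and writes
    its file `runtime` minutes later; the sync at sync_offset + k*period
    copies the newest file completed at or before that moment.
    """
    sync_time = sync_offset + 10 * period        # any sync well after startup
    first_done = gen_offset + runtime            # first completed file
    cycles = (sync_time - first_done) // period  # full cycles completed since
    newest_mtime = first_done + cycles * period  # newest file the sync can see
    return (sync_time + period) - newest_mtime


THRESHOLD = 7.0  # minutes of staleness the nagios check tolerates

for gen_offset, runtime in [(4, 0.5), (4, 1.5), (2, 1.5)]:
    age = worst_case_age(gen_offset, runtime)
    verdict = "CRITICAL" if age > THRESHOLD else "OK"
    print(f"generate at :{gen_offset:02d}, runtime {runtime} min -> "
          f"worst age {age} min ({verdict})")
```

In this model a script that overruns the sync slot costs a whole extra period of staleness, which is why moving the generator earlier gives more headroom than trimming a few seconds off the script.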
(In reply to comment #1)
> * the two files get on cruncher updated by cron, every 5 minutes

Once more in English,
* the two files get updated on cruncher by cron, every 5 minutes

Was at 4,9,14,... past the hour, now at 2,7,12,...
Hasn't flapped over the weekend, let's see what the week brings.
Depends on: 627821
I'm putting this in a downtime until 18:00 Monday.
No alerts since the 22nd. Calling it FIXED by the work catlee has done to add indexes.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
This is happening again. nthomas is busy, so if someone else can have a look, please do.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee: nrthomas → nobody
I put the nagios services in a downtime until Monday morning, but that isn't meant to indicate a choice of priority on fixing the underlying problem.
We got some more of these alerts today and yesterday. Turns out the rsync from cruncher to build.m.o runs every 2 minutes now (used to be 5 mins), so I set the cron that generates builds-{running,pending} to run every 2 mins as well, on alternate minutes. That should help a bunch with races between the 5 min cron, the 2 min rsync, and the nagios testing interval. catlee suggested we move to a model of "generate content && rsync" rather than doing the two separately.
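A minimal sketch of the "generate content && rsync" model, assuming both steps can be run as ordinary commands from one wrapper. The function name is hypothetical and the commands below are harmless stand-ins, not the real report script or rsync invocation.

```python
import subprocess
import sys


def generate_then_sync(generate_cmd, sync_cmd):
    """Run generate_cmd, then sync_cmd only if generation exited 0.

    Mirrors the shell idiom `generate && rsync`: the sync can no longer race
    a half-finished or late generator run, because it is triggered by it.
    """
    generated = subprocess.run(generate_cmd)
    if generated.returncode != 0:
        return False  # leave the previously published copy in place
    synced = subprocess.run(sync_cmd)
    return synced.returncode == 0


# Stand-in commands (assumptions): a no-op that succeeds and one that fails.
succeed = [sys.executable, "-c", "pass"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]

print(generate_then_sync(succeed, succeed))
print(generate_then_sync(fail, succeed))
```

With a single chained job the only schedule left to tune is how often the wrapper runs relative to the nagios staleness limit.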
Is this a releng process bug at this point, or is there a request for IT?
Nothing for IT to do right now. We know we have a leak somewhere in self-serve, hopefully catlee can track that down sometime.
No longer blocks: releng-nagios
Whiteboard: [buildapi][selfserve]
We haven't had a problem with this for some time, and cruncher got a bunch more memory in bug 730319.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 13 years ago
Resolution: --- → FIXED
Keywords: buildapi
Whiteboard: [buildapi][selfserve]
Product: mozilla.org → Release Engineering
Blocks: 926246