Closed Bug 921826 Opened 11 years ago Closed 11 years ago

setup nagios monitoring to warn if buildapi is dead

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Ms2ger, Unassigned)

Last night we hit a few instances of this error on the server:

(2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")

It seemed to struggle along for a while after this, but then we started seeing an increasing number of this type of error:

TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30

I've restarted self-serve, and the URLs above are responding correctly now.
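For context on the second error: the wording matches SQLAlchemy's QueuePool, where at most "size" plus "overflow" connections (here 5 + 10 = 15) can be checked out at once, and any further request waits up to the timeout (30 seconds) before failing. Below is a toy model of that behaviour using only the Python standard library. It is not SQLAlchemy code and not buildapi's actual configuration; `ToyPool`, `checkout`, and `checkin` are hypothetical names used purely for illustration.

```python
import threading


class ToyPool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy itself)."""

    def __init__(self, size=5, overflow=10, timeout=30.0):
        self.limit = size + overflow  # hard cap on simultaneous checkouts
        self.timeout = timeout
        self._slots = threading.Semaphore(self.limit)

    def checkout(self):
        # Wait up to `timeout` seconds for a free slot, then give up.
        if not self._slots.acquire(timeout=self.timeout):
            raise TimeoutError(
                f"pool limit of {self.limit} reached, "
                f"timed out after {self.timeout}s"
            )

    def checkin(self):
        # Return a slot so that a waiting request can proceed.
        self._slots.release()


# Short timeout so the demonstration finishes quickly.
pool = ToyPool(size=5, overflow=10, timeout=0.1)

# Simulate 15 requests that never return their connection,
# for example because they are stuck on an unreachable database host.
for _ in range(pool.limit):
    pool.checkout()

try:
    pool.checkout()  # the 16th request has no slot available
    exhausted = False
except TimeoutError:
    exhausted = True
```

In this model, once every slot is held by a stalled request, each new request fails after the timeout until something releases the slots, which is consistent with a restart of the service clearing the condition.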
nagios didn't seem to alert for this, unless I'm missing it somewhere
Severity: blocker → major
Thanks, reopened the trees.
nightly wait-times report for buildpool and trybuildpool also failed out with "Internal Server Error"... Related?
(In reply to Chris AtLee [:catlee] from comment #2)
> Last night we hit a few instances of this error on the server:
>
> (2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")
>
> It seemed to struggle along for a while after this, but we started seeing
> an increasing number of this type of error:
>
> TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
> timed out, timeout 30
>
> I've restarted self-serve, and the URLs above are responding correctly now.

(In reply to Chris AtLee [:catlee] from comment #3)
> nagios didn't seem to alert for this, unless I'm missing it somewhere

fox2mike: can you verify whether nagios alerts for this, and if so, to where?
Flags: needinfo?(shyam)
It would be easier to alert on the URL I provided, http://builddata.pub.build.mozilla.org/buildjson/builds-running.js, not returning a 200 status than to try to exercise the API. That URL is also public, so the check would not need valid LDAP credentials built in.
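A minimal sketch of such a check, assuming a Nagios-style plugin that reports its result through the standard plugin exit codes (0 for OK, 2 for CRITICAL). The function name `check_url`, the 10-second timeout, and the injectable `opener` parameter are illustrative assumptions, not part of any existing check.

```python
import sys
import urllib.error
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes

URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-running.js"


def check_url(url, timeout=10, opener=urllib.request.urlopen):
    """Return (exit_code, status_line) for a Nagios-style HTTP check."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return CRITICAL, f"CRITICAL - HTTP {err.code} from {url}"
    except OSError as err:
        # Covers URLError, DNS failures, refused connections, and timeouts.
        return CRITICAL, f"CRITICAL - {url} unreachable: {err}"
    if status == 200:
        return OK, f"OK - HTTP 200 from {url}"
    return CRITICAL, f"CRITICAL - HTTP {status} from {url}"


def main():
    # Print one status line and exit with the matching plugin code.
    code, line = check_url(URL)
    print(line)
    sys.exit(code)
```

Calling `main()` from a small wrapper script would let Nagios run this as a plugin. Because the URL is public, the sketch needs no credentials; it only distinguishes "returned 200" from everything else and does not inspect the response body.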
(In reply to Chris AtLee [:catlee] from comment #2)
> Last night we hit a few instances of this error on the server:
>
> (2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")

Can you give me an example machine that saw this error and in which DC?
Flags: needinfo?(shyam)
(In reply to Chris AtLee [:catlee] from comment #3)
> nagios didn't seem to alert for this, unless I'm missing it somewhere

catlee, can you also elaborate on what you expected nagios to alert for here?
Blocks: 926246
Is this bug still relevant? Maybe for the nagios problem reporting from Comment 9?
(In reply to Frank Wein [:mcsmurf] from comment #10)
> Is this bug here is still relevant? Maybe for the nagios problem reporting
> from Comment 9?

mcsmurf: Thanks for finding this bug. The original outage is long since solved. You are correct that the remaining work here is to set up nagios monitoring in case of future occurrences. Therefore, I've tweaked the summary and moved the bug to the correct component.

hwine, fox2mike: to avoid tree closures, we need monitoring on this. What's the best long-term solution? And if that takes a long time to set up, what's the quick-to-set-up, monitor-it-now solution to avoid long tree closures?
Assignee: nobody → infra
Blocks: re-nagios
Component: Tools → Infrastructure: Monitoring
Flags: needinfo?(shyam)
Flags: needinfo?(hwine)
Product: Release Engineering → Infrastructure & Operations
QA Contact: hwine → jdow
Summary: buildapi is dead → setup nagios monitoring to warn if buildapi is dead
Version: unspecified → other
Lowering severity so oncall doesn't get paged
Severity: major → normal
(In reply to Ludovic Hirlimann [:Usul] (away dadying) from comment #12)
> Lowering so oncall don't get paged

:Usul, thanks for doing that - and agreed, totally no need to page oncall over the holidays for this!
Blocks: 914877
We have buildapi monitored for functional response. See:

https://nagios.mozilla.org/releng-scl3/cgi-bin/extinfo.cgi?type=2&host=buildapi01.build.scl1.mozilla.com&service=http_expect+buildapi01.build.scl1

Nothing more to do here -- escalation & sheriff notification is taken care of in the blocked bugs.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(shyam)
Flags: needinfo?(hwine)
Resolution: --- → FIXED
Component: Infrastructure: Monitoring → Infrastructure: Other