Closed
Bug 921826
Opened 11 years ago
Closed 11 years ago
setup nagios monitoring to warn if buildapi is dead
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Ms2ger, Unassigned)
At least
https://secure.pub.build.mozilla.org/buildapi/running
https://secure.pub.build.mozilla.org/buildapi/pending
report internal server errors. Closing all trees.
Comment 1•11 years ago
http://builddata.pub.build.mozilla.org/buildjson/builds-running.js
returns a server error as well.
Comment 2•11 years ago
Last night we hit a few instances of this error on the server:
(2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")
It seemed to struggle along for a while after this, but then we started seeing an increasing number of errors of this type:
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
I've restarted self-serve, and the URLs above are responding correctly now.
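For readers unfamiliar with the second error: the wording is that of SQLAlchemy's QueuePool, and the numbers mean at most 5 pooled plus 10 overflow connections, with callers waiting up to 30 seconds for a free one. A minimal stdlib-only sketch of that behaviour follows. It is not SQLAlchemy's implementation; `BoundedPool`, `PoolTimeout`, and `count_successful_checkouts` are hypothetical names used only for illustration.

```python
import threading


class PoolTimeout(Exception):
    """Raised when no connection slot frees up within the timeout."""


class BoundedPool:
    """Hypothetical stand-in for a connection pool with a hard cap."""

    def __init__(self, size=5, overflow=10, timeout=30.0):
        self.size = size
        self.overflow = overflow
        self.timeout = timeout
        # At most size + overflow connections may be checked out at once.
        self._slots = threading.BoundedSemaphore(size + overflow)

    def checkout(self):
        # Block for up to `timeout` seconds waiting for a free slot.
        if not self._slots.acquire(timeout=self.timeout):
            raise PoolTimeout(
                "limit of size %d overflow %d reached, timeout %s"
                % (self.size, self.overflow, self.timeout)
            )

    def checkin(self):
        # Return a slot so that a waiting caller can proceed.
        self._slots.release()


def count_successful_checkouts(pool, attempts):
    """Check out `attempts` connections without ever returning one."""
    succeeded = 0
    for _ in range(attempts):
        try:
            pool.checkout()
            succeeded += 1
        except PoolTimeout:
            break
    return succeeded
```

If connections are never returned (for example because requests hang on an unreachable database host), every slot stays occupied and each new request fails after the timeout, which would match the growing number of errors described above.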
Comment 3•11 years ago
nagios didn't seem to alert for this, unless I'm missing it somewhere
Updated•11 years ago
Severity: blocker → major
Reporter
Comment 4•11 years ago
Thanks, reopened the trees.
Comment 5•11 years ago
The nightly wait-times reports for buildpool and trybuildpool also failed with "Internal Server Error"... Related?
Comment 6•11 years ago
(In reply to Chris AtLee [:catlee] from comment #2)
> Last night we hit a few instances of this error on the server:
>
> (2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")
>
> It seemed to struggle along for a while after this, but we started seeing
> an increasing number of this type of error:
>
> TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection
> timed out, timeout 30
>
> I've restarted self-serve, and the URLs above are responding correctly now.
(In reply to Chris AtLee [:catlee] from comment #3)
> nagios didn't seem to alert for this, unless I'm missing it somewhere
fox2mike: can you verify if nagios alerts for this, and if so, to where?
Flags: needinfo?(shyam)
Comment 7•11 years ago
It would be easier to alert when the URL I provided, "http://builddata.pub.build.mozilla.org/buildjson/builds-running.js", does not return a 200 status than to try to exercise the API. That URL is also public, so the check would not need valid LDAP credentials built in.
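A rough sketch of such a check, written as a Nagios-style plugin in Python. The exit codes follow the usual plugin convention (0 = OK, 2 = CRITICAL); the function names, the 10-second timeout, and the message wording are assumptions for illustration, not an existing check.

```python
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2

# Public URL named in this comment; no LDAP credentials required.
URL = "http://builddata.pub.build.mozilla.org/buildjson/builds-running.js"


def classify(status):
    """Map an HTTP status code (None if unreachable) to (exit code, message)."""
    if status == 200:
        return OK, "OK - HTTP 200"
    if status is None:
        return CRITICAL, "CRITICAL - no HTTP response"
    return CRITICAL, "CRITICAL - HTTP %d" % status


def fetch_status(url, timeout=10):
    """Return the HTTP status code for `url`, or None if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 500.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, or timeout.
        return None


def main():
    """Run the check once and return the plugin exit code."""
    code, message = classify(fetch_status(URL))
    print(message)
    return code
```

To use it as a plugin, the script would end with `raise SystemExit(main())` so that Nagios sees the exit code.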
Comment 8•11 years ago
(In reply to Chris AtLee [:catlee] from comment #2)
> Last night we hit a few instances of this error on the server:
>
> (2005, "Unknown MySQL server host 'buildbot-ro-vip.db.scl3.mozilla.com' (1)")
Can you give me an example machine that saw this error, and tell me which DC it is in?
Flags: needinfo?(shyam)
Comment 9•11 years ago
(In reply to Chris AtLee [:catlee] from comment #3)
> nagios didn't seem to alert for this, unless I'm missing it somewhere
catlee,
Can you also elaborate on what you expected nagios to alert on here?
Comment 10•11 years ago
Is this bug still relevant? Maybe for the nagios problem reporting from comment 9?
Comment 11•11 years ago
(In reply to Frank Wein [:mcsmurf] from comment #10)
> Is this bug still relevant? Maybe for the nagios problem reporting
> from Comment 9?
mcsmurf: Thanks for finding this bug. The original outage was solved long ago. You are correct that the remaining work here is to set up nagios monitoring in case of future occurrences. Therefore, I've tweaked the summary and moved the bug to the correct component.
hwine, fox2mike: to avoid tree closures, we need monitoring on this. What's the best long-term solution? And if that takes a long time to set up, what's the quick-to-set-up, monitor-it-now solution to avoid long tree closures?
Assignee: nobody → infra
Blocks: re-nagios
Component: Tools → Infrastructure: Monitoring
Flags: needinfo?(shyam)
Flags: needinfo?(hwine)
Product: Release Engineering → Infrastructure & Operations
QA Contact: hwine → jdow
Summary: buildapi is dead → setup nagios monitoring to warn if buildapi is dead
Version: unspecified → other
Comment 13•11 years ago
(In reply to Ludovic Hirlimann [:Usul] (away dadying) from comment #12)
> Lowering so oncall don't get paged
:Usul, thanks for doing that - and agreed, totally no need to page oncall over the holidays for this!
Comment 14•11 years ago
We have buildapi monitored for functional response. See: https://nagios.mozilla.org/releng-scl3/cgi-bin/extinfo.cgi?type=2&host=buildapi01.build.scl1.mozilla.com&service=http_expect+buildapi01.build.scl1
Nothing more to do here -- escalation & sheriff notification is taken care of in the blocked bugs.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(shyam)
Flags: needinfo?(hwine)
Resolution: --- → FIXED
Updated•10 years ago
Component: Infrastructure: Monitoring → Infrastructure: Other