Closed
Bug 740360
Opened 13 years ago
Closed 13 years ago
balrog needs to make sure connections are always closed
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: bhearsum)
References
Details
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
catlee
:
review+
bhearsum
:
checked-in+
|
Details | Diff | Splinter Review |
We hit an issue this morning where Rail's Balrog client was getting "Bad Status Line" error when making requests. After poking around the server I found that there was a bunch of seemingly hung httpd workers, as well as a bunch of open sockets. Here's everything I found:
- 21 connections to the mysql server in a CLOSE_WAIT state.
-- Sheeri looked at the mysql server, and found no connections from the aus4-dev node, so as far as that server is concerned those connections are dead
- ~50 connections to zlb nodes in a TIME_WAIT state (actual number varies a little bit)
- This exception in the application log:
ERROR:auslib.web.base:Exception on /releases/Firefox-mozilla-aurora-nightly-20120329042012/builds/Linux_x86-gcc3/kk [PUT]
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/flask/app.py", line 1292, in wsgi_app
response = self.full_dispatch_request()
File "/usr/lib/python2.6/site-packages/flask/app.py", line 1062, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/usr/lib/python2.6/site-packages/flask/app.py", line 1060, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/lib/python2.6/site-packages/flask/app.py", line 1047, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "/usr/lib/python2.6/site-packages/flask/views.py", line 56, in view
return self.dispatch_request(*args, **kwargs)
File "/usr/lib/python2.6/site-packages/flask/views.py", line 112, in dispatch_request
return meth(*args, **kwargs)
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 40, in put
return self._put(*args, transaction=trans, **kwargs)
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 40, in put
return self._put(*args, transaction=trans, **kwargs)
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 11, in decorated
return f(*args, changed_by=username, **kwargs)
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/web/views/base.py", line 24, in decorated
if not db.permissions.hasUrlPermission(username, url, method, urlOptions=extra):
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 752, in hasUrlPermission
if self.select(where=[self.username==username, self.permission=='admin'], transaction=transaction):
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 22, in convertRows
for row in fn(*args, **kwargs):
File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/auslib/db.py", line 191, in select
return query.execute().fetchall()
File "/usr/lib/python2.6/site-packages/sqlalchemy/sql/expression.py", line 2667, in execute
return e._execute_clauseelement(self, multiparams, params)
File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 2297, in _execute_clauseelement
connection = self.contextual_connect(close_with_result=True)
File "/usr/lib/python2.6/site-packages/sqlalchemy/engine/base.py", line 2334, in contextual_connect
self.pool.connect(),
File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 210, in connect
return _ConnectionFairy(self).checkout()
File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 371, in __init__
rec = self._connection_record = pool._do_get()
File "/usr/lib/python2.6/site-packages/sqlalchemy/pool.py", line 685, in _do_get
(self.size(), self.overflow(), self._timeout))
TimeoutError: QueuePool limit of size 5 overflow 10 reached, connection timed out, timeout 30
ERROR:auslib.web.views.releases:Caught ISA 500 error.
DEBUG:auslib.web.views.releases:Balrog version is: 0.1.1
DEBUG:auslib.web.views.releases:Request path is: /releases/Firefox-mozilla-aurora-nightly-20120329042012/builds/Linux_x86-gcc3/kk
- Immediately before the above exception, a IndexError: list index out of range Exception, which is caused by bug 738755 (which has been fixed by not pulled onto the aus4-dev nodes yet)
- In the access log:
10.8.33.240 - stage-ffxbld [29/Mar/2012:06:43:25 -0700] "PUT /releases/Firefox-mozilla-central-dummy-nightly-20120329031156/builds/Linux_x86_64-gcc3/ko HTTP/1.1" 500 541 "-" "python-requests/0.10.8"
- In the error log:
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] mod_wsgi (pid=9620): Exception occurred processing WSGI script '/data/www/aus4-admin-dev.allizom.org/aus4-admin/admin.wsgi'.
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] Traceback (most recent call last):
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1306, in __call__
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] return self.wsgi_app(environ, start_response)
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1294, in wsgi_app
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] response = self.make_response(self.handle_exception(e))
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/usr/lib/python2.6/site-packages/flask/app.py", line 1120, in make_response
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] return self.response_class.force_type(rv, request.environ)
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/vendor/lib/python/werkzeug/wrappers.py", line 711, in force_type
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] response = BaseResponse(*_run_wsgi_app(response, environ))
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] File "/data/www/aus4-admin-dev.allizom.org/aus4-admin/vendor/lib/python/werkzeug/test.py", line 818, in run_wsgi_app
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] app_iter = app(environ, start_response)
[Thu Mar 29 06:43:55 2012] [error] [client 10.8.33.240] TypeError: 'TimeoutError' object is not callable
Assignee | ||
Comment 1•13 years ago
|
||
One thing we did find was that the admin application was pointed at a read-only database node. That node happened to be accidentally configured to be rw, but it makes me wonder if something else about this caused issues.
solarce changed that to point at the proper node and kicked apache, so at this point we're back to a working state. Not much more we can do unless the problem happens again.
Assignee | ||
Comment 2•13 years ago
|
||
Catlee thought that ISA 500 errors might be somehow hanging up db connections or something like that. We tested this theory by running a couple of hundred requests that we know will generate 500 errors quickly, and indeed, we got the server hung up the same way again. I'm going to be getting sudo access to the nodes, and should be able to debug more efficiently with that. I'll see if i can figure out what's up!
Comment 3•13 years ago
|
||
sudo has been given via puppet changes.
Assignee | ||
Comment 4•13 years ago
|
||
After a ton more debugging I finally figured out (with tons of help from arr, sheeri, solarce, and catlee) that the code that manages the Connection objects was never calling .close(), which makes perfect sense given the CLOSE_WAIT sockets.
After changing the application code to do that, I haven't been able to reproduce the problem.
One thing I'm still a bit unsure of is whether or not there's an issue with some networking equipment killing some connections. While debugging, sheeri found errors like this on the mysql server:
120330 9:48:20 [Warning] Aborted connection 30397518 to db: 'aus4_dev' user:
'aus4_dev' host: '10.8.70.207' (Got timeout reading communication packets)
And I'm not sure if those are explained by having too many open connections already. I don't think we need to debug that further for now, but if more issues crop up it's something to look into it.
Assignee | ||
Updated•13 years ago
|
Summary: balrog admin application gets hung sometimes → balrog needs to make sure connections are always closed
Assignee | ||
Comment 5•13 years ago
|
||
I also switched insert/delete/update to use the context manager to simplify those sections of code.
Attachment #610997 -
Flags: review?(catlee)
Assignee | ||
Comment 6•13 years ago
|
||
16:42 <@catlee> so are you sure you want transactions to always close their connections?
16:44 < bhearsum> pretty sure. an AUSTransaction is basically a Connection + Transaction - the
connection is only held by the AUSTransaction
16:44 < bhearsum> i guess that could change at some point
16:45 < bhearsum> but the reason i wrote AUSTransaction was to let all the code that executes things
work with one object, rather than managing both a Connection and Transaction
16:45 <@catlee> ok
16:46 <@catlee> so it's weird that it's not the one opening the connection
16:46 < bhearsum> hmmm
16:46 < bhearsum> yeah, i can see that
16:46 < bhearsum> maybe it should take an engine instead, and open the connection itself?
16:46 <@catlee> yeah, that would make more sense I think
16:50 < bhearsum> alright, i'll do that
Attachment #610997 -
Attachment is obsolete: true
Attachment #610997 -
Flags: review?(catlee)
Attachment #611021 -
Flags: review?(catlee)
Updated•13 years ago
|
Attachment #611021 -
Flags: review?(catlee) → review+
Assignee | ||
Comment 7•13 years ago
|
||
Comment on attachment 611021 [details] [diff] [review]
pass engine to AUSTransaction
Landed. Jenkins run is here: https://jenkins.mozilla.org/job/Balrog/68/ (oddly, I had to force it).
Attachment #611021 -
Flags: checked-in+
Assignee | ||
Comment 8•13 years ago
|
||
Thanks again everyone for all your help diagnosing this.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•