[traceback] SystemExit: 1
Categories
(Socorro :: Antenna, task, P1)
Tracking
(Not tracked)
People
(Reporter: willkg, Assigned: willkg)
References
Details
Attachments
(4 files)
Sentry issue: https://sentry.prod.mozaws.net/operations/socorro-new-stage/issues/5397097/events/28630511/
Traceback:
SystemExit: 1
File "raven/middleware.py", line 20, in common_exception_handling
yield
File "raven/middleware.py", line 100, in __call__
iterable = self.application(environ, start_response)
File "falcon/api.py", line 244, in __call__
responder(req, resp, **params)
File "antenna/health_resource.py", line 113, in on_get
resource.check_health(state)
File "antenna/breakpad_resource.py", line 177, in check_health
self.crashpublish.check_health(state)
File "antenna/ext/pubsub/crashpublish.py", line 125, in check_health
self.publisher.get_topic(self.topic_path)
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 45, in <lambda>
fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw) # noqa
File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/gapic/publisher_client.py", line 467, in get_topic
request, retry=retry, timeout=timeout, metadata=metadata
File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
return wrapped_func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 179, in retry_target
return target()
File "/usr/local/lib/python3.6/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
return callable_(*args, **kwargs)
File "grpc/_channel.py", line 548, in __call__
wait_for_ready)
File "grpc/_channel.py", line 537, in _blocking
event = call.next_event()
File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 310, in grpc._cython.cygrpc.SegregatedCall.next_event
File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 151, in grpc._cython.cygrpc._next_call_event
File "python3.6/threading.py", line 239, in __enter__
def __enter__(self):
File "gunicorn/workers/base.py", line 196, in handle_abort
sys.exit(1)
This has been happening in Antenna on stage since we got the Pub/Sub stuff working.
This bug covers looking into it.
Assignee
Comment 1•6 years ago
I think what's going on is that I switched gunicorn to use --preload, so it loads the application (including the pubsub bits) in the arbiter and then spins off workers. The workers do the gevent monkey patching when they start up; the arbiter never runs it at all, so the application gets imported before any monkey patching has happened.
I think the problem is that the pubsub code slides through Cython land, so the monkey patching isn't catching it.
I think we can fix this by going back to not preloading. I'm going to test that out first.
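As a rough sketch of the difference (assuming a standard gunicorn config file; the exact settings and module paths Antenna uses may differ):

    # gunicorn.conf.py -- illustrative sketch only, not Antenna's actual config.
    worker_class = "gevent"

    # With preload_app = True, the arbiter imports the application (including
    # the pubsub/grpc bits) once, before any worker exists and before the
    # workers' gevent monkey patching runs.
    # With preload_app = False, each worker imports the application itself,
    # after it has already done the gevent monkey patching.
    preload_app = False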
Assignee
Comment 2•6 years ago
Assignee
Comment 3•6 years ago
willkg merged PR #306: "bug 1538202: remove preload" in 8f86c92.
Assignee
Comment 4•6 years ago
This deployed to stage and it's been over an hour since the last event.
Since this was only a problem on stage, I'm going to mark this FIXED.
Assignee
Comment 5•6 years ago
Ugh. I just saw another one, so this didn't fix it. Reopening.
Assignee
Comment 6•6 years ago
Assignee
Comment 7•6 years ago
Assignee
Comment 8•6 years ago
We had a few deploy issues with the stage collector today. It's entirely possible that the nix-preload change fixed the problem, but old nodes were still reporting issues. Fun.
I haven't seen another instance in 4 hours. I'm going to mark this as FIXED. Again.
Assignee
Comment 9•6 years ago
I'm keeping tabs on Sentry and while the frequency has dropped radically, we still saw a couple over the weekend. I'm going to reopen this and work on it today.
Assignee
Comment 10•6 years ago
The gist of the problem is that grpc is written in C and uses Python threads, but isn't getting monkeypatched by gevent's monkey patching stuff. The grpcio Python library has an additional experimental gevent initialization step that I don't see documented anywhere other than this issue (hoping I'm wrong on this):
https://github.com/grpc/grpc/issues/4629#issuecomment-376962677
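For reference, that initialization step looks roughly like this (a sketch assuming grpcio's grpc.experimental.gevent module; the ordering relative to gevent's monkey patching is the important part):

    # Sketch of the experimental grpc/gevent setup from the issue above.
    # gevent monkey patching has to happen before grpc gets used.
    from gevent import monkey
    monkey.patch_all()

    import grpc.experimental.gevent as grpc_gevent

    # Tell grpc to cooperate with gevent instead of relying on native threads.
    grpc_gevent.init_gevent()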
I can't get that to work with Antenna--the tests hang, gunicorn segfaults, and their issue tracker has some follow up issues with grpc and gevent. I think this is the end of the road for Antenna's gevent architecture.
I'm looking at Antenna-rewrite options right now. If we go with a rewrite, I'll mark this as WONTFIX.
Assignee
Comment 11•6 years ago
Assignee
Comment 12•6 years ago
The thing I was struggling with was segfaults when running Antenna and hanging when running tests. I had an epiphany in the shower and decided to make another attempt at getting it working by moving around where the monkey patching happens. In researching that, I bumped into this:
https://github.com/grpc/grpc/issues/18422
I downgraded grpcio to 1.18.0, which ended the segfaulting; I moved the monkey patching to the patch() stage of the GeventWorker; I nixed preloading, which ended the non-publishing; and I removed monkey patching from the tests, which ended the tests hanging. Now everything is working, and I feel like I understand enough of why it works now and not before to be ok with moving forward.
While I'm ok moving forward right now, I do think we should do a rewrite. Among other things, there's this comment from a few months ago:
https://github.com/grpc/grpc/issues/17414#issuecomment-445296733
But I don't want to do a rewrite now if I don't have to, so I want to make one more attempt at getting things "working".
I'll let this sit on stage for a few days. If that looks ok, great. If not, then back to the rewrite plan.
willkg merged PR #308: "bug 1538202: fix SystemExit (attempt 3)" in e7d0249.
Assignee
Comment 13•6 years ago
Assignee
Comment 14•6 years ago
willkg merged PR #310: "bug 1538202: fix the docker CMD line to use the new worker" in a6a67f9.
I missed something in the last PR so grpc was getting initialized correctly in local dev, but not in stage/prod. This fixes that.
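Roughly, that amounts to pointing gunicorn at the custom worker class; a sketch (the module path here is a guess, not necessarily Antenna's real layout):

    # Equivalent of passing --worker-class on the gunicorn command line in the
    # docker CMD; the worker module path below is hypothetical.
    worker_class = "antenna.gunicorn_worker.GeventGrpcWorker"
    preload_app = False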
Assignee
Comment 15•6 years ago
We haven't seen another issue in 3 days.
A couple of interesting things about this issue:
- It never affected publishing crash ids. That code path does synchronous publishing that doesn't use threads. It does that to guarantee that it either publishes the crash id or gets an error, in which case it tosses the crash id in a queue to try again later. It's easier to prove correctness if that whole operation happens "here and now" rather than "at some point, maybe, in an execution context far far away". (See the sketch below.)
- It only affected the /__heartbeat__ endpoint, which calls publisher.get_topic() to make sure the topic exists and Antenna has access to it. That code path uses threads in grpcio.
So one of the things I did after landing the last fix was to hammer the /__heartbeat__ endpoint to try to trigger the issue rather than wait for the stars to align and for it to get triggered on its own. I wasn't able to trigger it. That's not proof it's fixed, but it increased my confidence.
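To illustrate the publish-synchronously-or-requeue idea from the first point, a minimal sketch (the helper and retry_queue names are hypothetical, not Antenna's actual API):

    # Hypothetical sketch only; Antenna's real code is structured differently.
    def publish_or_requeue(publisher, topic_path, crash_id, retry_queue):
        try:
            # publisher.publish() returns a future; .result() blocks until
            # Pub/Sub acknowledges the message or raises, so the whole
            # operation is effectively synchronous.
            publisher.publish(topic_path, crash_id.encode("utf-8")).result()
        except Exception:
            # On any failure, hold on to the crash id to try again later.
            retry_queue.append(crash_id)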
Anyhow, no instances in 3 days. I think we're good now. I'm going to mark this as FIXED.
Assignee
Comment 16•6 years ago
Also, the fix was three-fold:
- Downgrade to grpcio 1.18.0, which doesn't segfault.
- Subclass GeventWorker with this:

      import grpc.experimental.gevent as grpc_gevent
      import gunicorn.workers.ggevent

      class GeventGrpcWorker(gunicorn.workers.ggevent.GeventWorker):
          """Gevent worker that also patches grpc."""

          def patch(self):
              super().patch()
              grpc_gevent.init_gevent()

  This calls grpc's init_gevent right after the gevent monkey patching, right after Gunicorn has spun off a worker.
- Stop using --preload, which loads everything (including grpcio) in the Gunicorn arbiter before any patching has occurred.