Closed Bug 1538202 Opened 6 years ago Closed 6 years ago

[traceback] SystemExit: 1

Categories

(Socorro :: Antenna, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

Sentry issue: https://sentry.prod.mozaws.net/operations/socorro-new-stage/issues/5397097/events/28630511/

Traceback:

SystemExit: 1
  File "raven/middleware.py", line 20, in common_exception_handling
    yield
  File "raven/middleware.py", line 100, in __call__
    iterable = self.application(environ, start_response)
  File "falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "antenna/health_resource.py", line 113, in on_get
    resource.check_health(state)
  File "antenna/breakpad_resource.py", line 177, in check_health
    self.crashpublish.check_health(state)
  File "antenna/ext/pubsub/crashpublish.py", line 125, in check_health
    self.publisher.get_topic(self.topic_path)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 45, in <lambda>
    fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/gapic/publisher_client.py", line 467, in get_topic
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 179, in retry_target
    return target()
  File "/usr/local/lib/python3.6/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "grpc/_channel.py", line 548, in __call__
    wait_for_ready)
  File "grpc/_channel.py", line 537, in _blocking
    event = call.next_event()
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 310, in grpc._cython.cygrpc.SegregatedCall.next_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 151, in grpc._cython.cygrpc._next_call_event
  File "python3.6/threading.py", line 239, in __enter__
    def __enter__(self):
  File "gunicorn/workers/base.py", line 196, in handle_abort
    sys.exit(1)

This has been happening in Antenna on stage since we got the Pub/Sub stuff working.

This bug covers looking into it.

I think what's going on is that I switched gunicorn to use --preload and so it's loading the application including the pubsub bits and then spinning off workers. When the workers start up, the monkeypatch at that point. Thus the arbiter doesn't run gevent.monkeypatch at all.

I think the problem is that the pubsub code slides through cython land and so monkeypatching isn't catching that.

I think we can fix this by going back to not preloading. I'm going to test that out first.

Assignee: nobody → willkg
Blocks: 1518281
Status: NEW → ASSIGNED
Priority: -- → P1

This deployed to stage and it's been over an hour since the last event.

Since this was only a problem on stage, I'm going to mark this FIXED.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED

Ugh. I just saw another one, so this didn't fix it. Reopening.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

We had a few deploy issues with the stage collector today. It's entirely possible that the nix-preload change fixed the problem, but old nodes were still reporting issues. Fun.

I haven't seen another instance in 4 hours. I'm going to mark this as FIXED. Again.

Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED

I'm keeping tabs on Sentry and while the frequency has dropped radically, we still saw a couple over the weekend. I'm going to reopen this and work on it today.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

The gist of the problem is that grpc is written in C and uses Python threads, but isn't getting monkeypatched by gevent's monkey patching stuff. The grpcio Python library has an additional experimental gevent initialization step that I don't see documented anywhere other than this issue (hoping I'm wrong on this):

https://github.com/grpc/grpc/issues/4629#issuecomment-376962677

I can't get that to work with Antenna--the tests hang, gunicorn segfaults, and their issue tracker has some follow up issues with grpc and gevent. I think this is the end of the road for Antenna's gevent architecture.

I'm looking at Antenna-rewrite options right now. If we go with a rewrite, I'll mark this as WONTFIX.

The thing I was struggling with was segfaults when running Antenna and hanging when running tests. I had an epiphany in the shower and decided to try another attempt to get it working by moving around where the monkey patching was happening. In researching that, I bumped into this:

https://github.com/grpc/grpc/issues/18422

I downgraded grpcio to 0.18.0 which ended the segfaulting, I moved the monkeypatching to the patch() stage of the GeventWorker, I nixed preloading which ended non-publishing, and I removed monkeypatching from the tests which ended tests hanging. Now everything is working and I feel like I understand enough of why it's working now and not before to be ok with moving forward.

While I'm ok moving forward right now, I do think we should do a rewrite. Amongst other things is this comment from a few months ago:

https://github.com/grpc/grpc/issues/17414#issuecomment-445296733

But I don't want to do a rewrite now if I don't have to, so I want to try one more attempt at getting things "working".

I'll let this sit on stage for a few days. If that looks ok, great. If not, then back to the rewrite plan.

willkg merged PR #308: "bug 1538202: fix SystemExit (attempt 3)" in e7d0249.

willkg merged PR #310: "bug 1538202: fix the docker CMD line to use the new worker" in a6a67f9.

I missed something in the last PR so grpc was getting initialized correctly in local dev, but not in stage/prod. This fixes that.

We haven't seen another issue in 3 days.

Couple of interesting things about this issue:

  1. It never affected publishing crash ids. That code path does synchronous publishing that doesn't use threads. It does that so as to guarantee that it either published the crash id or gets an error which causes it to toss the crash id in a queue to try again later. It's easier to prove correctness if that whole operation is happening "here and now" rather than "at some point, maybe, in an execution context far far away".

  2. It only affected the /__heartbeat__ endpoint which calls publisher.get_topic() to make sure the topic exists and Antenna has access to it. That code path uses threads in grpcio.

So one of the things I did after landing the last fix was to hammer the /__heartbeat__ endpoint to try to trigger the issue rather than wait for the stars to align and it to get triggered on its own. I wasn't able to trigger it. That's not proof it's fixed, but it increased my confidence.

Anyhow, no instances in 3 days. I think we're good now. I'm going to mark this as FIXED.

Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED

Also, the fix was three fold:

  1. downgrade to grpcio 0.18.0 which doesn't segfault

  2. subclass GeventWorker with this:

    import grpc.experimental.gevent as grpc_gevent
    import gunicorn.workers.ggevent
    
    
    class GeventGrpcWorker(gunicorn.workers.ggevent.GeventWorker):
        """Gevent worker that also patches grpc."""
        def patch(self):
            super().patch()
            grpc_gevent.init_gevent()
    

    which calls grpc's init_gevent right after gevent monkey patching right after Gunicorn has spun off a worker.

  3. stop using --preload which loads everything including grpcio in the Gunicorn arbiter before patching has occurred

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: