Closed Bug 1538202 Opened 6 years ago Closed 6 years ago

[traceback] SystemExit: 1

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

pr 306: bug 1538202: remove preload 6 years ago Will Kahn-Greene [:willkg] ET needinfo? me (deleted), text/x-github-pull-request		Details
pr 307: bug 1538202: add monkey.patch_all() calls to guarantee patching 6 years ago Will Kahn-Greene [:willkg] ET needinfo? me (deleted), text/x-github-pull-request		Details
pr 308: bug 1538202: fix SystemExit (attempt 3) 6 years ago Will Kahn-Greene [:willkg] ET needinfo? me (deleted), text/x-github-pull-request		Details
pr 310: bug 1538202: fix the docker CMD line to use the new worker 6 years ago Will Kahn-Greene [:willkg] ET needinfo? me (deleted), text/x-github-pull-request		Details

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Description

•

6 years ago

Sentry issue: https://sentry.prod.mozaws.net/operations/socorro-new-stage/issues/5397097/events/28630511/

Traceback:

SystemExit: 1
  File "raven/middleware.py", line 20, in common_exception_handling
    yield
  File "raven/middleware.py", line 100, in __call__
    iterable = self.application(environ, start_response)
  File "falcon/api.py", line 244, in __call__
    responder(req, resp, **params)
  File "antenna/health_resource.py", line 113, in on_get
    resource.check_health(state)
  File "antenna/breakpad_resource.py", line 177, in check_health
    self.crashpublish.check_health(state)
  File "antenna/ext/pubsub/crashpublish.py", line 125, in check_health
    self.publisher.get_topic(self.topic_path)
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/_gapic.py", line 45, in <lambda>
    fx = lambda self, *a, **kw: wrapped_fx(self.api, *a, **kw)  # noqa
  File "/usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/gapic/publisher_client.py", line 467, in get_topic
    request, retry=retry, timeout=timeout, metadata=metadata
  File "/usr/local/lib/python3.6/site-packages/google/api_core/gapic_v1/method.py", line 143, in __call__
    return wrapped_func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 270, in retry_wrapped_func
    on_error=on_error,
  File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 179, in retry_target
    return target()
  File "/usr/local/lib/python3.6/site-packages/google/api_core/timeout.py", line 214, in func_with_timeout
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/google/api_core/grpc_helpers.py", line 57, in error_remapped_callable
    return callable_(*args, **kwargs)
  File "grpc/_channel.py", line 548, in __call__
    wait_for_ready)
  File "grpc/_channel.py", line 537, in _blocking
    event = call.next_event()
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 310, in grpc._cython.cygrpc.SegregatedCall.next_event
  File "src/python/grpcio/grpc/_cython/_cygrpc/channel.pyx.pxi", line 151, in grpc._cython.cygrpc._next_call_event
  File "python3.6/threading.py", line 239, in __enter__
    def __enter__(self):
  File "gunicorn/workers/base.py", line 196, in handle_abort
    sys.exit(1)

This has been happening in Antenna on stage since we got the Pub/Sub stuff working.

This bug covers looking into it.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 1

•

6 years ago

I think what's going on is that I switched gunicorn to use --preload and so it's loading the application including the pubsub bits and then spinning off workers. When the workers start up, the monkeypatch at that point. Thus the arbiter doesn't run gevent.monkeypatch at all.

I think the problem is that the pubsub code slides through cython land and so monkeypatching isn't catching that.

I think we can fix this by going back to not preloading. I'm going to test that out first.

Assignee: nobody → willkg

Blocks: 1518281

Status: NEW → ASSIGNED

Priority: -- → P1

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 2

•

6 years ago

Attached file pr 306: bug 1538202: remove preload (deleted) — Details

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 3

•

6 years ago

willkg merged PR #306: "bug 1538202: remove preload" in 8f86c92.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 4

•

6 years ago

This deployed to stage and it's been over an hour since the last event.

Since this was only a problem on stage, I'm going to mark this FIXED.

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 5

•

6 years ago

Ugh. I just saw another one, so this didn't fix it. Reopening.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 6

•

6 years ago

Attached file pr 307: bug 1538202: add monkey.patch_all() calls to guarantee patching (deleted) — Details

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 7

•

6 years ago

willkg merged PR #307: "bug 1538202: add monkey.patch_all() calls to guarantee patching" in a3a35c3.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 8

•

6 years ago

We had a few deploy issues with the stage collector today. It's entirely possible that the nix-preload change fixed the problem, but old nodes were still reporting issues. Fun.

I haven't seen another instance in 4 hours. I'm going to mark this as FIXED. Again.

Status: REOPENED → RESOLVED

Closed: 6 years ago → 6 years ago

Resolution: --- → FIXED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 9

•

6 years ago

I'm keeping tabs on Sentry and while the frequency has dropped radically, we still saw a couple over the weekend. I'm going to reopen this and work on it today.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 10

•

6 years ago

The gist of the problem is that grpc is written in C and uses Python threads, but isn't getting monkeypatched by gevent's monkey patching stuff. The grpcio Python library has an additional experimental gevent initialization step that I don't see documented anywhere other than this issue (hoping I'm wrong on this):

https://github.com/grpc/grpc/issues/4629#issuecomment-376962677

I can't get that to work with Antenna--the tests hang, gunicorn segfaults, and their issue tracker has some follow up issues with grpc and gevent. I think this is the end of the road for Antenna's gevent architecture.

I'm looking at Antenna-rewrite options right now. If we go with a rewrite, I'll mark this as WONTFIX.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 11

•

6 years ago

Attached file pr 308: bug 1538202: fix SystemExit (attempt 3) (deleted) — Details

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 12

•

6 years ago

The thing I was struggling with was segfaults when running Antenna and hanging when running tests. I had an epiphany in the shower and decided to try another attempt to get it working by moving around where the monkey patching was happening. In researching that, I bumped into this:

https://github.com/grpc/grpc/issues/18422

I downgraded grpcio to 0.18.0 which ended the segfaulting, I moved the monkeypatching to the patch() stage of the GeventWorker, I nixed preloading which ended non-publishing, and I removed monkeypatching from the tests which ended tests hanging. Now everything is working and I feel like I understand enough of why it's working now and not before to be ok with moving forward.

While I'm ok moving forward right now, I do think we should do a rewrite. Amongst other things is this comment from a few months ago:

https://github.com/grpc/grpc/issues/17414#issuecomment-445296733

But I don't want to do a rewrite now if I don't have to, so I want to try one more attempt at getting things "working".

I'll let this sit on stage for a few days. If that looks ok, great. If not, then back to the rewrite plan.

willkg merged PR #308: "bug 1538202: fix SystemExit (attempt 3)" in e7d0249.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 13

•

6 years ago

Attached file pr 310: bug 1538202: fix the docker CMD line to use the new worker (deleted) — Details

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 14

•

6 years ago

willkg merged PR #310: "bug 1538202: fix the docker CMD line to use the new worker" in a6a67f9.

I missed something in the last PR so grpc was getting initialized correctly in local dev, but not in stage/prod. This fixes that.

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 15

•

6 years ago

We haven't seen another issue in 3 days.

Couple of interesting things about this issue:

It never affected publishing crash ids. That code path does synchronous publishing that doesn't use threads. It does that so as to guarantee that it either published the crash id or gets an error which causes it to toss the crash id in a queue to try again later. It's easier to prove correctness if that whole operation is happening "here and now" rather than "at some point, maybe, in an execution context far far away".
It only affected the /__heartbeat__ endpoint which calls publisher.get_topic() to make sure the topic exists and Antenna has access to it. That code path uses threads in grpcio.

So one of the things I did after landing the last fix was to hammer the /__heartbeat__ endpoint to try to trigger the issue rather than wait for the stars to align and it to get triggered on its own. I wasn't able to trigger it. That's not proof it's fixed, but it increased my confidence.

Anyhow, no instances in 3 days. I think we're good now. I'm going to mark this as FIXED.

Status: REOPENED → RESOLVED

Closed: 6 years ago → 6 years ago

Resolution: --- → FIXED

Will Kahn-Greene [:willkg] ET needinfo? me

Assignee

Comment 16

•

6 years ago

Also, the fix was three fold:

downgrade to grpcio 0.18.0 which doesn't segfault

subclass GeventWorker with this:

import grpc.experimental.gevent as grpc_gevent
import gunicorn.workers.ggevent


class GeventGrpcWorker(gunicorn.workers.ggevent.GeventWorker):
    """Gevent worker that also patches grpc."""
    def patch(self):
        super().patch()
        grpc_gevent.init_gevent()

which calls grpc's init_gevent right after gevent monkey patching right after Gunicorn has spun off a worker.

stop using --preload which loads everything including grpcio in the Gunicorn arbiter before patching has occurred

You need to log in before you can comment on or make changes to this bug.

Bugzilla

[traceback] SystemExit: 1

Categories

(Socorro :: Antenna, task, P1)

Tracking

(Not tracked)

People

(Reporter: willkg, Assigned: willkg)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(4 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Attachment

General

Description

File Name

Content Type