Closed Bug 1678152 Opened 4 years ago Closed 3 years ago

Some (parent process) crashes don't yield stack traces

Categories

(Toolkit :: Crash Reporting, defect, P1)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: jgilbert, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

I've looked at the logs but it seems that Firefox is always exiting cleanly, albeit with an error. I may be looking at the wrong thing, though; can you point me to the crash?

I don't think Firefox should be exiting cleanly-but-with-error? For my purposes any early exit here is a crash.


[task 2020-11-12T20:57:26.606Z] 20:57:26     INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html
[task 2020-11-12T20:57:26.764Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277
[task 2020-11-12T20:57:26.889Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.890Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.891Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.892Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.964Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.965Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.965Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:27.170Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.266Z] 20:57:27     INFO - TEST-INFO | Main app process: killed by SIGSEGV
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - Buffered messages logged at 20:57:26
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | A valid string reason is expected 
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | Reason cannot be empty 
[task 2020-11-12T20:57:27.268Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.268Z] 20:57:27     INFO - Buffered messages logged at 20:57:27
[task 2020-11-12T20:57:27.269Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.270Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.271Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.271Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.272Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.272Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.273Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.274Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.274Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.275Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line 
[task 2020-11-12T20:57:27.276Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line 
[task 2020-11-12T20:57:27.276Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.277Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.277Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.279Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.279Z] 20:57:27     INFO - Buffered messages finished
[task 2020-11-12T20:57:27.280Z] 20:57:27    ERROR - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | application terminated with exit code -11

There's a SIGSEGV mentioned above, in https://firefoxci.taskcluster-artifacts.net/XsNTZuC-QxmTR3WWUgUXXw/0/public/logs/live_backing.log
From https://treeherder.mozilla.org/jobs?repo=try&revision=1c16512bcb9ee6ef4ec6644ca672c8dcb1af732a&selectedTaskRun=c_kIkS3zTGyQDbKeRhn4HA.0

Ah yes, sorry I missed that. What's happening here is that we're segfaulting after we uninstalled the crash reporter's signal handler so the signal is not being caught.

Looking at the log it seems that we hit some kind of IPC issue:

[task 2020-11-12T20:57:26.764Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277

And then all the content processes quit in response to that:

[task 2020-11-12T20:57:27.170Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.

Deactivating the exception handler happens very late during shutdown, but there's still some activity afterwards, though not much. You can try commenting out the line where we disable the crash reporter and see if you get a minidump. Given that there's some IPC stuff going awry, it's possible that the crash is not happening on the main thread, which would explain why the exception handler gets disabled before you hit the SIGSEGV.

Unfortunately the crash reporting machinery has some odd dependencies on XPCOM so it might not work; I've wanted to make it work all the way to the actual shutdown for a long time.

Blocks: 1669532

If we're not getting crash stacks because we're not hitting our crash reporting code, this could mean we're missing crash reports for a class of parent process crashes from users in the wild, which would be relatively severe.

Severity: -- → S2

Yes, it is :-( We get some of those crashes on the Microsoft health dashboards for Windows builds but there's nothing of the sort on Linux. There's an effort to enable things like Windows Error Reporting on our Windows TaskCluster runners so that we at least catch those crashes there.

If this bug blocks WebGL remoting on Linux, then it also blocks Fission on Linux. Tentatively tracking this bug for Fission M7 Beta milestone.

Fission Milestone: --- → M7

Kris will be investigating this bug.

Assignee: nobody → kwright

This is a quick hack that prevents the exception handler from being uninstalled during shutdown, which means it should still be able to catch crashes and write minidumps right until the end. The patch has some significant side effects, one of which is that it leaks memory, so be prepared to see those failures. It might also do funky things if a crash is generated way too late - which might lead to a cascade of crashes - but at least the first minidump should be written out correctly.

(In reply to Gabriele Svelto [:gsvelto] from comment #9)

Created attachment 9193989 [details]
[TEST PATCH] Don't remove the exception handler on shutdown

Jeff, when you have time, can you try reproducing your WebGL test crash with Gabriele's patch? You might be able to get a more meaningful stack trace.

Flags: needinfo?(jgilbert)

Still SIGSEGV without stacks.

This is so odd. There's no trace of the crash; it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like exit(-11) or somesuch. Normally this type of crash also causes some spew from the IPC code because the connection is abruptly severed, but there's none here. I think your only way out here is Pernosco, which should be able to capture the point of the crash.

Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.

(In reply to Jeff Gilbert [:jgilbert] from comment #14)

Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.

Jeff, any luck reproducing this Linux crash in Pernosco?

Flags: needinfo?(jgilbert)
OS: Unspecified → Linux
Whiteboard: fission-x11
Whiteboard: fission-x11 → fission-x11, fission-linux

Not yet but I'm trying to fit it in.

Pernosco seems to have caught it very deep in a call stack, so this might be a stack exhaustion segfault?

Flags: needinfo?(jgilbert)
Attached file pernosco stack (deleted) —

We can work around this particular class of bugs individually, but if this class of bugs isn't producing crash stacks, that's dangerous.
Are there any stack-overflow caveats for our crash handling? Do we expect to handle this case?

Flags: needinfo?(gsvelto)

On Linux we use an alternate stack for the exception handler so stack exhaustion shouldn't be an issue. It should segfault (or whatever happens in that case), the stack would be switched with the signal handling stack and then the exception handler would run from there.

BTW I had a look at the relevant code in mesa and it seems to push only a few bytes per frame so even though the stack is deep I'm not sure if it's deep enough to run out of stack space. Does pernosco save the crash reason too (i.e. the signal number and code)? It would be interesting to know what it is.

Flags: needinfo?(gsvelto)

Linux bugs don't block Fission Beta experiment (on Windows and macOS), so I am moving Linux Fission bugs from Fission milestone M7 to M7a.

Fission Milestone: M7 → M7a

Jeff, any updates on this Linux crash stack issue? Was the stack overflow in Mozilla or Mesa code?

I assume this crash stack issue should still block shipping WebGL remoting on Linux.

Flags: needinfo?(jgilbert)

Yes, remoting is still blocked here.

Flags: needinfo?(jgilbert)

Fission M7a will not block on this, since we will have semi-headless mode for Linux that jld is working on.

Fission Milestone: M7a → Future

(In reply to Gabriele Svelto [:gsvelto] from comment #13)

This is so odd. There's no trace of the crash, it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like exit(-11) or somesuch.

There's another way this can happen: if a thread has SIGSEGV blocked and then does something that would raise SIGSEGV, the signal is unblocked and the handler is reset to SIG_DFL before processing it. (The rationale is that it's assumed the signal was blocked for a reason, and it may not be safe to run the handler at that point. This behavior is also used for other cases of important synchronous signals, like SIGSYS raised from seccomp-bpf.)

In particular, sometimes things will block all signals, for reasons of varying validity; e.g., this code in glib (not glibc). And here's something suspiciously similar in Mesa (which helpfully omits SIGSYS, but not SIGSEGV). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it). But (as gsvelto reminded me out-of-band while I was writing this) any signal handler will also block its own signal while it's running, unless SA_NODEFER is used, and we do have a number of SIGSEGV handlers; they can also opt in to blocking other signals (the sa_mask member of struct sigaction).

For the case of SIGSYS in child processes, we're already using symbol interposition to prevent blocking it. (It doesn't work for code that inlines the syscall, which glibc's dynamic linker necessarily does, but glibc is usually fairly cautious about this kind of thing.) Something similar could be tried here to see if that's the problem; we might need to define the symbols in the executable, or maybe have the test harness LD_PRELOAD libmozsandbox.so.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #25)

In particular, sometimes things will block all signals, for reasons of varying validity; e.g., [this code in glib (not glibc)][glib-oops]. And here's [something suspiciously similar in Mesa][mesa-oops] (which helpfully omits SIGSYS, but not SIGSEGV). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it).

The Mesa case is particularly suspicious given the issue here. The commit message states that it was a workaround for some signal handling in X, which is really sad. I think we'll have to audit both threads and handlers that block signals to get to the bottom of this.

I wonder if we'll have to interpose ourselves everywhere to make sure we can catch crashes even when they happen in threads which the authors have deemed not suitable for handling signals. Which BTW is not trivial either; we had bugs in Breakpad in the past when several threads entered the signal handler simultaneously.

I ran an experiment on Try, using symbol interposition to prevent SIGSEGV (and a few others just in case: BUS, ILL, FPE) from being blocked via the usual libc APIs. And there still isn't a crash report. I also added logging, so I can see that the interposition is being loaded in the parent process (i.e., I didn't edit the wrong Python class when I tried to set the env var). I can also see in local testing, using procfs (SigBlk in the status file), that the signals in question aren't blocked on threads that are otherwise blocking everything. So I tried again, with logging of changes to the signal handlers, but that wasn't any more enlightening. One thing I haven't tested yet is forcing SA_ONSTACK, but that would only help if there's another handler applied after Breakpad which segfaults in the handler itself, and I don't know if there's any evidence for that.

One thing I noticed is the driver info, which might be trying to tell me that it's vmwgfx?

GL_VENDOR: VMware, Inc.
mVendor: VMware, Inc.
GL_RENDERER: llvmpipe (LLVM 10.0.0, 256 bits)
mRenderer: Unknown

I'll see if I can get a Pernosco trace; comment #18 indicates that some kind of crash is reproducible that way, and hopefully it's the same one (even though I assume it's using software rendering because of rr).

I got Pernosco working, and assuming the crash under rr is the same as the normal one, I think this is an OOM: it crashes on a direct call instruction where the target is readable, and the memory below the stack pointer is readable, but the entire rest of the page above the stack pointer is zeroes and the prologue subtracts 0xbd8 from it, suggesting that that call is the first thing to touch that page. That might explain why the crash reporter wouldn't work, although it is possible to crash during a “normal” OOM detected by mozjemalloc, so I'm not 100% sure I understand what's going on here.

It might also explain why the ASan runs didn't fail, if they have higher resource limits or otherwise manage memory differently in some way that matters here.

But this result suggests an experiment: mlock the compositor thread's stack and see what happens.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #28)

I think this is an OOM

On further inspection, no. This is a stack overflow; the page in question was the guard page (even though it appeared readable in the replay debugger), as determined by inspecting the execution of pthread_create that created the stack. But that returns us to the question of why the crash reporter would have failed.

It looks like WasmTrapHandler was stacked on top of the crash reporter's signal handler, but it's also SA_ONSTACK so that should work.

However, sigaltstack is per-thread and it seems to be used only on each process's main thread, if I'm reading the trace (and the code) correctly; this means that off-main-thread stack exhaustion (in this case, the compositor thread) cannot be caught by the crash reporter.

Interesting find! I had no idea sigaltstack() worked per-thread, but in hindsight it makes sense given that multiple threads can handle signals concurrently - and hence would need separate stacks to do so. I'm sure I can write a test for that. As for the fix, that's trickier: we'd need to call sigaltstack() on every thread and I don't think that's doable, plus it would have a non-negligible memory impact.

I've double-checked what happens in Chrome and Crashpad. Crashpad recommends that every thread of an application using it call sigaltstack() (through a Crashpad-specific call). It also provides a replacement for pthread_create() that does it automatically. It sounds like we might have to do it too.

If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object like what the profiler does. (In general there are a few things we tend to do on our own threads, including some stuff in XPCOM that pokes at the stack memory, so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)

As for memory, it looks like the chromium code to create the alternative stack just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). Our current alternative stack uses calloc which will probably(?) fault in the memory, which would be the opposite trade-off.

On the topic of fixing the compositor thread stack size, I did some Try runs: 512kB works, 448kB works, 384kB fails. The current size is 320kB, set in bug 1716120 because of Windows crashes which appeared to be stack overflows (but the last comments on that bug indicate there were some unanswered questions about that) and which went away with a higher stack size.

Assignee: kwright → jld
Priority: -- → P1

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #33)

If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object [like what the profiler does][prof-init]. (In general there are [a few things we tend to do on our own threads][broker-init], including [some stuff in XPCOM that pokes at the stack memory][xpc-init], so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)

I'd go for that if you're OK with it. The alternative would be a whack-a-mole game of finding which threads might be overflowing their stacks while not getting crashes from them.

As for memory, it looks like [the chromium code to create the alternative stack][cs-stack] just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). [Our current alternative stack][fx-stack] uses calloc which will probably(?) fault in the memory, which would be the opposite trade-off.

Thinking about this, I had considered mmap()'ing too; good to know that Chrome uses it. It will probably use a minuscule amount of extra memory in the kernel for the mapping but no actual pages. I think that mmap()'ing the alternate stack + interposition is the way to go. If I'm not too busy with Web MIDI I'll try to cook a patch before the end of the week.

Stealing this bug, I've got a POC which includes a unit test that demonstrates the issue on Linux and a potential fix.

Assignee: jld → gsvelto
Status: NEW → ASSIGNED
Depends on: 1744126
Attachment #9253512 - Attachment description: WIP: Bug 1678152 - Catch all stack overflows on Linux → Bug 1678152 - Catch all stack overflows on Linux r=jld
Fission Milestone: Future → ---
Whiteboard: fission-x11, fission-linux
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3cf2b111807a Catch all stack overflows on Linux r=jld

Backed out for causing hazard bustage on Promise.cpp.

Push with failures

Failure log

Backout link

[task 2021-12-15T04:29:47.841Z] Spawned process 19668
[task 2021-12-15T04:31:43.024Z] Renaming hazards.tmp1 -> rootingHazards.1.of.8
[task 2021-12-15T04:32:02.104Z] Renaming hazards.tmp4 -> rootingHazards.4.of.8
[task 2021-12-15T04:32:04.333Z] Renaming hazards.tmp7 -> rootingHazards.7.of.8
[task 2021-12-15T04:32:28.737Z] Renaming hazards.tmp3 -> rootingHazards.3.of.8
[task 2021-12-15T04:32:46.991Z] Renaming hazards.tmp5 -> rootingHazards.5.of.8
[task 2021-12-15T04:32:50.616Z] Renaming hazards.tmp2 -> rootingHazards.2.of.8
[task 2021-12-15T04:33:06.691Z] Renaming hazards.tmp6 -> rootingHazards.6.of.8
[task 2021-12-15T04:33:54.604Z] Renaming hazards.tmp8 -> rootingHazards.8.of.8
[task 2021-12-15T04:33:54.604Z] Running gather-hazards to generate rootingHazards.txt
[task 2021-12-15T04:33:54.604Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' cat rootingHazards.1.of.8 rootingHazards.2.of.8 rootingHazards.3.of.8 rootingHazards.4.of.8 rootingHazards.5.of.8 rootingHazards.6.of.8 rootingHazards.7.of.8 rootingHazards.8.of.8 > rootingHazards.txt
[task 2021-12-15T04:33:54.605Z] Spawned process 20123
[task 2021-12-15T04:33:54.608Z] Renaming gather-hazards.tmp1 -> rootingHazards.txt
[task 2021-12-15T04:33:54.608Z] Running explain to generate ['hazards.txt', 'unnecessary.txt', 'refs.txt']
[task 2021-12-15T04:33:54.608Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /usr/bin/python3 /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/explain.py rootingHazards.txt gcFunctions.txt hazards.txt unnecessary.txt refs.txt
[task 2021-12-15T04:33:54.609Z] Spawned process 20124
[task 2021-12-15T04:34:02.718Z] Wrote explained_hazards.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote unnecessary.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote refs.tmp1
[task 2021-12-15T04:34:02.718Z] Found 2 hazards 230 unsafe references 0 missing
[task 2021-12-15T04:34:02.874Z] Renaming explained_hazards.tmp1 -> hazards.txt
[task 2021-12-15T04:34:02.874Z] Renaming unnecessary.tmp1 -> unnecessary.txt
[task 2021-12-15T04:34:02.874Z] Renaming refs.tmp1 -> refs.txt
[task 2021-12-15T04:34:02.874Z] Running heapwrites to generate heapWriteHazards.txt
[task 2021-12-15T04:34:02.874Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /builds/worker/workspace/obj-haz-shell/dist/bin/js /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/analyzeHeapWrites.js > heapWriteHazards.txt
[task 2021-12-15T04:34:02.875Z] Spawned process 20125
[task 2021-12-15T04:34:41.470Z] Renaming heapwrites.tmp1 -> heapWriteHazards.txt
[task 2021-12-15T04:34:41.535Z] + check_hazards /builds/worker/workspace/haz-browser
[task 2021-12-15T04:34:41.535Z] + set +e
[task 2021-12-15T04:34:41.536Z] ++ grep -c 'Function.*has unrooted.*live across GC call' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.537Z] + NUM_HAZARDS=2
[task 2021-12-15T04:34:41.537Z] ++ grep -c '^Function.*takes unsafe address of unrooted' /builds/worker/workspace/haz-browser/refs.txt
[task 2021-12-15T04:34:41.539Z] + NUM_UNSAFE=230
[task 2021-12-15T04:34:41.539Z] ++ grep -c '^Function.* has unnecessary root' /builds/worker/workspace/haz-browser/unnecessary.txt
[task 2021-12-15T04:34:41.541Z] + NUM_UNNECESSARY=1207
[task 2021-12-15T04:34:41.541Z] ++ grep -c '^Dropped CFG' /builds/worker/workspace/haz-browser/build_xgill.log
[task 2021-12-15T04:34:41.678Z] + NUM_DROPPED=0
[task 2021-12-15T04:34:41.678Z] ++ perl -lne 'print $1 if m!found (\d+)/\d+ allowed errors!' /builds/worker/workspace/haz-browser/heapWriteHazards.txt
[task 2021-12-15T04:34:41.682Z] + NUM_WRITE_HAZARDS=0
[task 2021-12-15T04:34:41.682Z] ++ grep -c '^Function.*expected hazard.*but none were found' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.683Z] + NUM_MISSING=0
[task 2021-12-15T04:34:41.683Z] + set +x
[task 2021-12-15T04:34:41.683Z] TinderboxPrint: rooting hazards<br/>2
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unsafe references to unrooted GC pointers)<br/>230
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unnecessary roots)<br/>1207
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: missing expected hazards<br/>0
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: heap write hazards<br/>0
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'this' of type 'js::PromiseObject*' live across GC call at js/src/builtin/Promise.cpp:5931
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'mover' of type 'js::TenuringTracer' live across GC call at js/src/gc/Nursery.cpp:1228
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | 2 rooting hazards detected
[task 2021-12-15T04:34:41.685Z] TinderboxPrint: documentation<br/><a href='https://wiki.mozilla.org/Javascript:Hazard_Builds#Diagnosing_a_rooting_hazards_failure'>static rooting hazard analysis failures</a>, visit "Inspect Task" link for hazard details
Flags: needinfo?(gsvelto)

I don't think this failure is related to my patch, I'll re-try and see if I can re-land.

Flags: needinfo?(gsvelto)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: needinfo?(gsvelto)

This needs changes to the TimeStamp class in order to work; I'll put them in another bug.

Flags: needinfo?(gsvelto)
Depends on: 1748094
Depends on: 1751041
Attachment #9253512 - Attachment description: Bug 1678152 - Catch all stack overflows on Linux r=jld → Bug 1678152 - Catch all stack overflows on Linux r=glandium
Attachment #9253512 - Attachment description: Bug 1678152 - Catch all stack overflows on Linux r=glandium → Bug 1678152 - Catch all stack overflows on Linux r=jld
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/46640f068ae4 Catch all stack overflows on Linux r=jld

The problem here is that I've put the interposer inside mozglue (so that it would be loaded early at startup) but forgot that mozglue gets linked into other stuff - such as certutil in this case. In this particular instance the interposer can't find the real pthread_create() call and thus crashes the tool.

Flags: needinfo?(gsvelto)

I think I'll have to pull the pthread_create() interposer out of mozglue, put it into its own library and then link it early only when building libxul.

(In reply to Gabriele Svelto [:gsvelto] from comment #46)

I think I'll have to pull the pthread_create() interposer out of mozglue, put it into its own library and then link it early only when building libxul.

You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #47)

You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.

Yes, I tried with libxul last night only to realize that Firefox was linking libpthread before it was linking libxul. Now I'm trying with the firefox and xpcshell executables directly and it seems that does the trick (which has the upside of not linking this into things that don't care about crash reporting).

I tweaked the patch to also include the plugin-container executable as suggested by :jld on Matrix and tested that it works as expected. Let's try to land this again.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cf94448b84dc Catch all stack overflows on Linux r=jld
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
Regressions: 1752566
Depends on: 1752697
Regressions: 1752697
Blocks: 1754578
No longer depends on: 1752697
Blocks: 1768794