Closed Bug 1678152 Opened 4 years ago Closed 3 years ago

Some (parent process) crashes don't yield stack traces

Categories

(Toolkit :: Crash Reporting, defect, P1)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
98 Branch
Tracking Status
firefox98 --- fixed

People

(Reporter: jgilbert, Assigned: gsvelto)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

I've looked at the logs but it seems that Firefox is always exiting cleanly, albeit with an error. I may be looking at the wrong thing, though; can you point me to the crash?

I don't think Firefox should be exiting cleanly-but-with-error? For my purposes any early exit here is a crash.


[task 2020-11-12T20:57:26.606Z] 20:57:26     INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html
[task 2020-11-12T20:57:26.764Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277
[task 2020-11-12T20:57:26.889Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.890Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.891Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.892Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.908Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.964Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.965Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.965Z] 20:57:26     INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:27.170Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.266Z] 20:57:27     INFO - TEST-INFO | Main app process: killed by SIGSEGV
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - Buffered messages logged at 20:57:26
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | A valid string reason is expected 
[task 2020-11-12T20:57:27.267Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | Reason cannot be empty 
[task 2020-11-12T20:57:27.268Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.268Z] 20:57:27     INFO - Buffered messages logged at 20:57:27
[task 2020-11-12T20:57:27.269Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.270Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.271Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.271Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.272Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.272Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.273Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.274Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.274Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.275Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line 
[task 2020-11-12T20:57:27.276Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line 
[task 2020-11-12T20:57:27.276Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.277Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.277Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines 
[task 2020-11-12T20:57:27.278Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.279Z] 20:57:27     INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line 
[task 2020-11-12T20:57:27.279Z] 20:57:27     INFO - Buffered messages finished
[task 2020-11-12T20:57:27.280Z] 20:57:27    ERROR - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | application terminated with exit code -11

There's a SIGSEGV mentioned above, in https://firefoxci.taskcluster-artifacts.net/XsNTZuC-QxmTR3WWUgUXXw/0/public/logs/live_backing.log
From https://treeherder.mozilla.org/jobs?repo=try&revision=1c16512bcb9ee6ef4ec6644ca672c8dcb1af732a&selectedTaskRun=c_kIkS3zTGyQDbKeRhn4HA.0

Ah yes, sorry I missed that. What's happening here is that we're segfaulting after we uninstalled the crash reporter's signal handler so the signal is not being caught.

Looking at the log it seems that we hit some kind of IPC issue:

[task 2020-11-12T20:57:26.764Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26     INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277

And then all the content processes quit in response to that:

[task 2020-11-12T20:57:27.170Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27     INFO - GECKO(1510) | Exiting due to channel error.

Deactivating the exception handler happens very late during shutdown, but there's still some activity afterwards, though not much. You can try commenting out the line where we disable the crash reporter and see if you get a minidump. Given that there's some IPC stuff going awry, it's possible that the crash is not happening on the main thread, which would explain why the exception handler gets disabled before you hit the SIGSEGV.

Unfortunately the crash reporting machinery has some odd dependencies on XPCOM so it might not work; I've wanted to make it work all the way to the actual shutdown for a long time.

Blocks: 1669532

If we're not getting crash stacks because we're not hitting our crash reporting code, this could mean we're missing crash reports for a class of parent process crashes from users in the wild, which would be relatively severe.

Severity: -- → S2

Yes, it is :-( We get some of those crashes on the Microsoft health dashboards for Windows builds but there's nothing of the sort on Linux. There's an effort to enable things like Windows Error Reporting on our Windows TaskCluster runners so that we at least catch those crashes there.

If this bug blocks WebGL remoting on Linux, then it also blocks Fission on Linux. Tentatively tracking this bug for Fission M7 Beta milestone.

Fission Milestone: --- → M7

Kris will be investigating this bug.

Assignee: nobody → kwright

This is a quick hack that prevents the exception handler from being uninstalled during shutdown, which means it should still be able to catch crashes and write minidumps right until the end. The patch has some significant side effects, one of which is that it leaks memory, so be prepared to see those failures. It might also do funky things if a crash is generated way too late - which might lead to a cascade of crashes - but at least the first minidump should be written out correctly.

(In reply to Gabriele Svelto [:gsvelto] from comment #9)

Created attachment 9193989 [details]
[TEST PATCH] Don't remove the exception handler on shutdown

Jeff, when you have time, can you try reproducing your WebGL test crash with Gabriele's patch? You might be able to get a more meaningful stack trace.

Flags: needinfo?(jgilbert)

Still SIGSEGV without stacks.

This is so odd. There's no trace of the crash; it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like exit(-11) or somesuch. Normally this type of crash also causes some spew from the IPC code because the connection is abruptly severed, but there's none here. I think your only way out here is Pernosco, which should be able to capture the point of the crash.

Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.

(In reply to Jeff Gilbert [:jgilbert] from comment #14)

Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.

Jeff, any luck reproducing this Linux crash in Pernosco?

Flags: needinfo?(jgilbert)
OS: Unspecified → Linux
Whiteboard: fission-x11
Whiteboard: fission-x11 → fission-x11, fission-linux

Not yet but I'm trying to fit it in.

Pernosco seems to have caught it very deep in a call stack, so this might be a stack exhaustion segfault?

Flags: needinfo?(jgilbert)
Attached file pernosco stack (deleted) —

We can work around this particular class of bugs individually, but if this class of bugs isn't producing crash stacks, that's dangerous.
Are there any stack-overflow caveats for our crash handling? Do we expect to handle this case?

Flags: needinfo?(gsvelto)

On Linux we use an alternate stack for the exception handler so stack exhaustion shouldn't be an issue. It should segfault (or whatever happens in that case), the stack would be switched with the signal handling stack and then the exception handler would run from there.

BTW I had a look at the relevant code in mesa and it seems to push only a few bytes per frame so even though the stack is deep I'm not sure if it's deep enough to run out of stack space. Does pernosco save the crash reason too (i.e. the signal number and code)? It would be interesting to know what it is.

Flags: needinfo?(gsvelto)

Linux bugs don't block Fission Beta experiment (on Windows and macOS), so I am moving Linux Fission bugs from Fission milestone M7 to M7a.

Fission Milestone: M7 → M7a

Jeff, any updates on this Linux crash stack issue? Was the stack overflow in Mozilla or Mesa code?

I assume this crash stack issue should still block shipping WebGL remoting on Linux.

Flags: needinfo?(jgilbert)

Yes, remoting is still blocked here.

Flags: needinfo?(jgilbert)

Fission M7a will not block on this, since we will have semi-headless mode for Linux that jld is working on.

Fission Milestone: M7a → Future

(In reply to Gabriele Svelto [:gsvelto] from comment #13)

This is so odd. There's no trace of the crash, it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like exit(-11) or somesuch.

There's another way this can happen: if a thread has SIGSEGV blocked and then does something that would raise SIGSEGV, the signal is unblocked and the handler is reset to SIG_DFL before processing it. (The rationale is that it's assumed the signal was blocked for a reason, and it may not be safe to run the handler at that point. This behavior is also used for other cases of important synchronous signals, like SIGSYS raised from seccomp-bpf.)

In particular, sometimes things will block all signals, for reasons of varying validity; e.g., this code in glib (not glibc). And here's something suspiciously similar in Mesa (which helpfully omits SIGSYS, but not SIGSEGV). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it). But (as gsvelto reminded me out-of-band while I was writing this) any signal handler will also block its own signal while it's running, unless SA_NODEFER is used, and we do have a number of SIGSEGV handlers; they can also opt in to blocking other signals (the sa_mask member of struct sigaction).

For the case of SIGSYS in child processes, we're already using symbol interposition to prevent blocking it. (It doesn't work for code that inlines the syscall, which glibc's dynamic linker necessarily does, but glibc is usually fairly cautious about this kind of thing.) Something similar could be tried here to see if that's the problem; we might need to define the symbols in the executable, or maybe have the test harness LD_PRELOAD libmozsandbox.so.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #25)

In particular, sometimes things will block all signals, for reasons of varying validity; e.g., [this code in glib (not glibc)][glib-oops]. And here's [something suspiciously similar in Mesa][mesa-oops] (which helpfully omits SIGSYS, but not SIGSEGV). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it).

The Mesa case is particularly suspicious given the issue here. The commit message states that it was a workaround for some signal handling in X, which is really sad. I think we'll have to audit both threads and handlers that block signals to get to the bottom of this.

I wonder if we'll have to interpose ourselves everywhere to make sure we can catch crashes even when they happen in threads which the authors have deemed not suitable for handling signals. Which BTW is not trivial either; we had bugs in Breakpad in the past when several threads entered the signal handler simultaneously.

I ran an experiment on Try, using symbol interposition to prevent SIGSEGV (and a few others just in case: BUS, ILL, FPE) from being blocked via the usual libc APIs. And there still isn't a crash report. I also added logging, so I can see that the interposition is being loaded in the parent process (i.e., I didn't edit the wrong Python class when I tried to set the env var). I can also see in local testing, using procfs (SigBlk in the status file), that the signals in question aren't blocked on threads that are otherwise blocking everything. So I tried again, with logging of changes to the signal handlers, but that wasn't any more enlightening. One thing I haven't tested yet is forcing SA_ONSTACK, but that would only help if there's another handler applied after Breakpad which segfaults in the handler itself, and I don't know if there's any evidence for that.

One thing I noticed is the driver info, which might be trying to tell me that it's vmwgfx?

GL_VENDOR: VMware, Inc.
mVendor: VMware, Inc.
GL_RENDERER: llvmpipe (LLVM 10.0.0, 256 bits)
mRenderer: Unknown

I'll see if I can get a Pernosco trace; comment #18 indicates that some kind of crash is reproducible that way, and hopefully it's the same one (even though I assume it's using software rendering because of rr).

I got Pernosco working, and assuming the crash under rr is the same as the normal one, I think this is an OOM: it crashes on a direct call instruction where the target is readable, and the memory below the stack pointer is readable, but the entire rest of the page above the stack pointer is zeroes and the prologue subtracts 0xbd8 from it, suggesting that that call is the first thing to touch that page. That might explain why the crash reporter wouldn't work, although it is possible to crash during a “normal” OOM detected by mozjemalloc, so I'm not 100% sure I understand what's going on here.

It might also explain why the ASan runs didn't fail, if they have higher resource limits or otherwise manage memory differently in some way that matters here.

But this result suggests an experiment: mlock the compositor thread's stack and see what happens.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #28)

I think this is an OOM

On further inspection, no. This is a stack overflow; the page in question was the guard page (even though it appeared readable in the replay debugger), as determined by inspecting the execution of pthread_create that created the stack. But that returns us to the question of why the crash reporter would have failed.

It looks like WasmTrapHandler was stacked on top of the crash reporter's signal handler, but it's also SA_ONSTACK so that should work.

However, sigaltstack is per-thread and it seems to be used only on each process's main thread, if I'm reading the trace (and the code) correctly; this means that off-main-thread stack exhaustion (in this case, the compositor thread) cannot be caught by the crash reporter.

Interesting find! I had no idea sigaltstack() worked per-thread, but in hindsight it makes sense given that multiple threads can handle signals concurrently - and hence would need separate stacks to do so. I'm sure I can write a test for that. As for the fix, that's trickier: we'd need to call sigaltstack() on every thread and I don't think that's doable, plus it would have a non-negligible memory impact.

I've double-checked what happens in Chrome and Crashpad. Crashpad recommends that every thread of an application using it call sigaltstack() (through a Crashpad-specific call). It also provides a replacement for pthread_create() that does it automatically. It sounds like we might have to do it too.

If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object like what the profiler does. (In general there are a few things we tend to do on our own threads, including some stuff in XPCOM that pokes at the stack memory, so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)

As for memory, it looks like the chromium code to create the alternative stack just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). Our current alternative stack uses calloc which will probably(?) fault in the memory, which would be the opposite trade-off.

On the topic of fixing the compositor thread stack size, I did some Try runs: 512kB works, 448kB works, 384kB fails. The current size is 320kB, set in bug 1716120 because of Windows crashes which appeared to be stack overflows (but the last comments on that bug indicate there were some unanswered questions about that) and which went away with a higher stack size.

Assignee: kwright → jld
Priority: -- → P1

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #33)

If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object [like what the profiler does][prof-init]. (In general there are [a few things we tend to do on our own threads][broker-init], including [some stuff in XPCOM that pokes at the stack memory][xpc-init], so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)

I'd go for that if you're OK with it. The alternative would be a whack-a-mole game of finding which threads might be overflowing their stacks while not getting crashes from them.

As for memory, it looks like [the chromium code to create the alternative stack][cs-stack] just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). [Our current alternative stack][fx-stack] uses calloc which will probably(?) fault in the memory, which would be the opposite trade-off.

Thinking about this, I had considered mmap()'ing too; good to know that Chrome uses it. It will probably use a minuscule amount of extra memory in the kernel for the mapping but no actual pages. I think that mmap()'ing the alternate stack + interposition is the way to go. If I'm not too busy with Web MIDI I'll try to cook a patch before the end of the week.

Stealing this bug, I've got a POC which includes a unit test that demonstrates the issue on Linux and a potential fix.

Assignee: jld → gsvelto
Status: NEW → ASSIGNED
Depends on: 1744126
Attachment #9253512 - Attachment description: WIP: Bug 1678152 - Catch all stack overflows on Linux → Bug 1678152 - Catch all stack overflows on Linux r=jld
Fission Milestone: Future → ---
Whiteboard: fission-x11, fission-linux
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3cf2b111807a Catch all stack overflows on Linux r=jld

Backed out for causing hazard bustage on Promise.cpp.

Push with failures

Failure log

Backout link

[task 2021-12-15T04:29:47.841Z] Spawned process 19668
[task 2021-12-15T04:31:43.024Z] Renaming hazards.tmp1 -> rootingHazards.1.of.8
[task 2021-12-15T04:32:02.104Z] Renaming hazards.tmp4 -> rootingHazards.4.of.8
[task 2021-12-15T04:32:04.333Z] Renaming hazards.tmp7 -> rootingHazards.7.of.8
[task 2021-12-15T04:32:28.737Z] Renaming hazards.tmp3 -> rootingHazards.3.of.8
[task 2021-12-15T04:32:46.991Z] Renaming hazards.tmp5 -> rootingHazards.5.of.8
[task 2021-12-15T04:32:50.616Z] Renaming hazards.tmp2 -> rootingHazards.2.of.8
[task 2021-12-15T04:33:06.691Z] Renaming hazards.tmp6 -> rootingHazards.6.of.8
[task 2021-12-15T04:33:54.604Z] Renaming hazards.tmp8 -> rootingHazards.8.of.8
[task 2021-12-15T04:33:54.604Z] Running gather-hazards to generate rootingHazards.txt
[task 2021-12-15T04:33:54.604Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' cat rootingHazards.1.of.8 rootingHazards.2.of.8 rootingHazards.3.of.8 rootingHazards.4.of.8 rootingHazards.5.of.8 rootingHazards.6.of.8 rootingHazards.7.of.8 rootingHazards.8.of.8 > rootingHazards.txt
[task 2021-12-15T04:33:54.605Z] Spawned process 20123
[task 2021-12-15T04:33:54.608Z] Renaming gather-hazards.tmp1 -> rootingHazards.txt
[task 2021-12-15T04:33:54.608Z] Running explain to generate ['hazards.txt', 'unnecessary.txt', 'refs.txt']
[task 2021-12-15T04:33:54.608Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /usr/bin/python3 /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/explain.py rootingHazards.txt gcFunctions.txt hazards.txt unnecessary.txt refs.txt
[task 2021-12-15T04:33:54.609Z] Spawned process 20124
[task 2021-12-15T04:34:02.718Z] Wrote explained_hazards.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote unnecessary.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote refs.tmp1
[task 2021-12-15T04:34:02.718Z] Found 2 hazards 230 unsafe references 0 missing
[task 2021-12-15T04:34:02.874Z] Renaming explained_hazards.tmp1 -> hazards.txt
[task 2021-12-15T04:34:02.874Z] Renaming unnecessary.tmp1 -> unnecessary.txt
[task 2021-12-15T04:34:02.874Z] Renaming refs.tmp1 -> refs.txt
[task 2021-12-15T04:34:02.874Z] Running heapwrites to generate heapWriteHazards.txt
[task 2021-12-15T04:34:02.874Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /builds/worker/workspace/obj-haz-shell/dist/bin/js /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/analyzeHeapWrites.js > heapWriteHazards.txt
[task 2021-12-15T04:34:02.875Z] Spawned process 20125
[task 2021-12-15T04:34:41.470Z] Renaming heapwrites.tmp1 -> heapWriteHazards.txt
[task 2021-12-15T04:34:41.535Z] + check_hazards /builds/worker/workspace/haz-browser
[task 2021-12-15T04:34:41.535Z] + set +e
[task 2021-12-15T04:34:41.536Z] ++ grep -c 'Function.*has unrooted.*live across GC call' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.537Z] + NUM_HAZARDS=2
[task 2021-12-15T04:34:41.537Z] ++ grep -c '^Function.*takes unsafe address of unrooted' /builds/worker/workspace/haz-browser/refs.txt
[task 2021-12-15T04:34:41.539Z] + NUM_UNSAFE=230
[task 2021-12-15T04:34:41.539Z] ++ grep -c '^Function.* has unnecessary root' /builds/worker/workspace/haz-browser/unnecessary.txt
[task 2021-12-15T04:34:41.541Z] + NUM_UNNECESSARY=1207
[task 2021-12-15T04:34:41.541Z] ++ grep -c '^Dropped CFG' /builds/worker/workspace/haz-browser/build_xgill.log
[task 2021-12-15T04:34:41.678Z] + NUM_DROPPED=0
[task 2021-12-15T04:34:41.678Z] ++ perl -lne 'print $1 if m!found (\d+)/\d+ allowed errors!' /builds/worker/workspace/haz-browser/heapWriteHazards.txt
[task 2021-12-15T04:34:41.682Z] + NUM_WRITE_HAZARDS=0
[task 2021-12-15T04:34:41.682Z] ++ grep -c '^Function.*expected hazard.*but none were found' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.683Z] + NUM_MISSING=0
[task 2021-12-15T04:34:41.683Z] + set +x
[task 2021-12-15T04:34:41.683Z] TinderboxPrint: rooting hazards<br/>2
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unsafe references to unrooted GC pointers)<br/>230
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unnecessary roots)<br/>1207
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: missing expected hazards<br/>0
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: heap write hazards<br/>0
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'this' of type 'js::PromiseObject*' live across GC call at js/src/builtin/Promise.cpp:5931
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'mover' of type 'js::TenuringTracer' live across GC call at js/src/gc/Nursery.cpp:1228
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | 2 rooting hazards detected
[task 2021-12-15T04:34:41.685Z] TinderboxPrint: documentation<br/><a href='https://wiki.mozilla.org/Javascript:Hazard_Builds#Diagnosing_a_rooting_hazards_failure'>static rooting hazard analysis failures</a>, visit "Inspect Task" link for hazard details
Flags: needinfo?(gsvelto)

I don't think this failure is related to my patch, I'll re-try and see if I can re-land.

Flags: needinfo?(gsvelto)

There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: needinfo?(gsvelto)

This needs changes to the TimeStamp class in order to work; I'll put them in another bug.

Flags: needinfo?(gsvelto)
Depends on: 1748094
Depends on: 1751041
Attachment #9253512 - Attachment description: Bug 1678152 - Catch all stack overflows on Linux r=jld → Bug 1678152 - Catch all stack overflows on Linux r=glandium
Attachment #9253512 - Attachment description: Bug 1678152 - Catch all stack overflows on Linux r=glandium → Bug 1678152 - Catch all stack overflows on Linux r=jld
Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/46640f068ae4 Catch all stack overflows on Linux r=jld

The problem here is that I've put the interposer inside mozglue (so that it would be loaded early at startup) but forgot that mozglue gets linked into other stuff - such as certutil in this case. In this particular instance the interposer can't find the real pthread_create() call and thus crashes the tool.

Flags: needinfo?(gsvelto)

I think I'll have to pull the pthread_create() interposer out of mozglue, put it into its own library and then link it early only when building libxul.

(In reply to Gabriele Svelto [:gsvelto] from comment #46)

I think I'll have to pull the pthread_create() interposer out of mozglue, put it into its own library and then link it early only when building libxul.

You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.

Flags: needinfo?(jld)

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #47)

You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.

Yes, I tried with libxul last night only to realize that Firefox was linking libpthread before it was linking libxul. Now I'm trying with the firefox and xpcshell executables directly and it seems that does the trick (which has the upside of not linking this into things that don't care about crash reporting).

I tweaked the patch to also include the plugin-container executable as suggested by :jld on Matrix and tested that it works as expected. Let's try to land this again.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cf94448b84dc Catch all stack overflows on Linux r=jld
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch
Regressions: 1752566
Depends on: 1752697
Regressions: 1752697
Blocks: 1754578
No longer depends on: 1752697
Blocks: 1768794