Some (parent process) crashes don't yield stack traces
Categories
(Toolkit :: Crash Reporting, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox98 | --- | fixed |
People
(Reporter: jgilbert, Assigned: gsvelto)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
These linux crashes have no stacks:
https://treeherder.mozilla.org/jobs?repo=try&revision=1c16512bcb9ee6ef4ec6644ca672c8dcb1af732a
Assignee | ||
Comment 1•4 years ago
|
||
I've looked at the logs but it seems that Firefox is always exiting cleanly, albeit with an error. I may be looking at the wrong thing though, can you point me to the crash?
Reporter | ||
Comment 2•4 years ago
|
||
I don't think Firefox should be exiting cleanly-but-with-error? For my purposes any early exit here is a crash.
Reporter | ||
Comment 3•4 years ago
|
||
[task 2020-11-12T20:57:26.606Z] 20:57:26 INFO - TEST-START | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html
[task 2020-11-12T20:57:26.764Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277
[task 2020-11-12T20:57:26.889Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.890Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.891Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.892Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: NS_ENSURE_TRUE(nsContentUtils::IsJavascriptMIMEType(type)) failed: file /builds/worker/checkouts/gecko/dom/script/ScriptLoader.cpp:1647
[task 2020-11-12T20:57:26.908Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.908Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.908Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.964Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robust_buffer_access_behavior marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:26.965Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: Robustness supported, strategy is not LOSE_CONTEXT_ON_RESET!: file /builds/worker/checkouts/gecko/gfx/gl/GLContext.cpp:971
[task 2020-11-12T20:57:26.965Z] 20:57:26 INFO - GECKO(1510) | [Parent 1510, Compositor] WARNING: robustness marked as unsupported: file /builds/worker/checkouts/gecko/gfx/gl/GLContextFeatures.cpp:628
[task 2020-11-12T20:57:27.170Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.266Z] 20:57:27 INFO - TEST-INFO | Main app process: killed by SIGSEGV
[task 2020-11-12T20:57:27.267Z] 20:57:27 INFO - Buffered messages logged at 20:57:26
[task 2020-11-12T20:57:27.267Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | A valid string reason is expected
[task 2020-11-12T20:57:27.267Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | Reason cannot be empty
[task 2020-11-12T20:57:27.268Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines
[task 2020-11-12T20:57:27.268Z] 20:57:27 INFO - Buffered messages logged at 20:57:27
[task 2020-11-12T20:57:27.269Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in multiple lines
[task 2020-11-12T20:57:27.270Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line
[task 2020-11-12T20:57:27.271Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+,-,/,*] operators in expression in one line
[task 2020-11-12T20:57:27.271Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines
[task 2020-11-12T20:57:27.272Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in multiple lines
[task 2020-11-12T20:57:27.272Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line
[task 2020-11-12T20:57:27.273Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [+] operators in expression in one line
[task 2020-11-12T20:57:27.274Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines
[task 2020-11-12T20:57:27.274Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in multiple lines
[task 2020-11-12T20:57:27.275Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line
[task 2020-11-12T20:57:27.276Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 10 [-] operators in expression in one line
[task 2020-11-12T20:57:27.276Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines
[task 2020-11-12T20:57:27.277Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in multiple lines
[task 2020-11-12T20:57:27.277Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line
[task 2020-11-12T20:57:27.278Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+,-,/,*] operators in expression in one line
[task 2020-11-12T20:57:27.278Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines
[task 2020-11-12T20:57:27.278Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in multiple lines
[task 2020-11-12T20:57:27.278Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line
[task 2020-11-12T20:57:27.279Z] 20:57:27 INFO - TEST-PASS | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | shader with 100 [+] operators in expression in one line
[task 2020-11-12T20:57:27.279Z] 20:57:27 INFO - Buffered messages finished
[task 2020-11-12T20:57:27.280Z] 20:57:27 ERROR - TEST-UNEXPECTED-FAIL | dom/canvas/test/webgl-conf/generated/test_conformance__glsl__bugs__long-expressions-should-not-crash.html | application terminated with exit code -11
There's a SIGSEGV mentioned above, in https://firefoxci.taskcluster-artifacts.net/XsNTZuC-QxmTR3WWUgUXXw/0/public/logs/live_backing.log
From https://treeherder.mozilla.org/jobs?repo=try&revision=1c16512bcb9ee6ef4ec6644ca672c8dcb1af732a&selectedTaskRun=c_kIkS3zTGyQDbKeRhn4HA.0
Assignee | ||
Comment 4•4 years ago
|
||
Ah yes, sorry I missed that. What's happening here is that we're segfaulting after we uninstalled the crash reporter's signal handler so the signal is not being caught.
Looking at the log it seems that we hit some kind of IPC issue:
[task 2020-11-12T20:57:26.764Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Attempt to __delete__ missing or closed actor: file /builds/worker/workspace/obj-build/ipc/ipdl/PWebGLChild.cpp:111
[task 2020-11-12T20:57:26.806Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/checkouts/gecko/ipc/glue/ProtocolUtils.cpp:511
[task 2020-11-12T20:57:26.807Z] 20:57:26 INFO - GECKO(1510) | [Child 1573, Main Thread] WARNING: Error sending reply: file /builds/worker/workspace/obj-build/ipc/ipdl/PDocumentChannelChild.cpp:277
And then all the content processes quit in response to that :
[task 2020-11-12T20:57:27.170Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.173Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.175Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
[task 2020-11-12T20:57:27.185Z] 20:57:27 INFO - GECKO(1510) | Exiting due to channel error.
Deactivating the exception handler happens very late during shutdown but there's still stuff happening afterwards but not much. You can try commenting out the line where we disable the crash reporter and see if you get a minidump. Given that there's some IPC stuff going awry it might be possible that the crash is not happening on the main thread which would explain why the exception handler gets disabled before you hit the SIGSEGV.
Unfortunately the crash reporting machinery has some odd dependencies on XPCOM so it might not work; I've wanted to make it work all the way to the actual shutdown for a long time.
Reporter | ||
Comment 5•4 years ago
|
||
If we're not getting crash stacks because we're not hitting our crash reporting code, this could mean we're missing crash reports for a class of parent process crashes from users in the wild, which would be relatively severe.
Assignee | ||
Comment 6•4 years ago
|
||
Yes, it is :-( We get some of those crashes on the Microsoft health dashboards for Windows builds but there's nothing of the sort on Linux. There's an effort to enable things like Windows Error Reporting on our Windows TaskCluster runners so that we at least catch those crashes there.
Comment 7•4 years ago
|
||
If this bug blocks WebGL remoting on Linux, then it also blocks Fission on Linux. Tentatively tracking this bug for Fission M7 Beta milestone.
Assignee | ||
Comment 9•4 years ago
|
||
This is a quick hack that prevents the exception handler from being uninstalled during shutdown, which means it should still be able to catch crashes and write minidumps right until the end. The patch has some significant side effects, one of which is that it leaks memory so be prepared to see those failures. It might also make funky things if a crash is generated way to late - which might lead to a cascade of crashes - but at least the first minidump should be written out correctly.
Comment 10•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #9)
Created attachment 9193989 [details]
[TEST PATCH] Don't remove the exception handler on shutdown
Jeff, when you have time, can you try reproducing your WebGL test crash with Gabriele's patch? You might be able to get a more meaningful stack trace.
Reporter | ||
Comment 11•4 years ago
|
||
Crashing test re-enabled:
https://treeherder.mozilla.org/jobs?repo=try&revision=f869b4d6a4dc887ea7812d68dcc22ff04ab3b7fe
Crash reporter UnsetExceptionHandler disabled per gsvelto's patch:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f39bafe79232dfd2f4ccb7eb672976eeb9e30b75
Reporter | ||
Comment 12•4 years ago
|
||
Still SIGSEV without stacks.
Assignee | ||
Comment 13•4 years ago
|
||
This is so odd. There's no trace of the crash, it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like exit(-11)
or somesuch. Normally this type of crashes also cause some spew from the IPC code because the connection is abruptly severed but there's none here. I think you're only way out here is Pernosco, that should be able to capture the point of the crash.
Reporter | ||
Comment 14•4 years ago
|
||
Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.
Comment 15•4 years ago
|
||
(In reply to Jeff Gilbert [:jgilbert] from comment #14)
Alright, well good to know we're all confused I guess! I'll revisit this now that bug 1640607 landed.
Jeff, any luck reproducing this Linux crash in Pernosco?
Updated•4 years ago
|
Reporter | ||
Comment 16•4 years ago
|
||
Not yet but I'm trying to fit it in.
Reporter | ||
Comment 17•4 years ago
|
||
Pernosco seems to have caught it very deep in a call stack, so this might be a stack exhaustion segfault?
Reporter | ||
Comment 18•4 years ago
|
||
Reporter | ||
Comment 19•4 years ago
|
||
This particular class of bugs, we can work around individually, but if this class of bugs is not getting crash stacks, that's dangerous.
Are there any stack-overflow caveats for our crash handling? Do we expect to handle this case?
Assignee | ||
Comment 20•4 years ago
|
||
On Linux we use an alternate stack for the exception handler so stack exhaustion shouldn't be an issue. It should segfault (or whatever happens in that case), the stack would be switched with the signal handling stack and then the exception handler would run from there.
BTW I had a look at the relevant code in mesa and it seems to push only a few bytes per frame so even though the stack is deep I'm not sure if it's deep enough to run out of stack space. Does pernosco save the crash reason too (i.e. the signal number and code)? It would be interesting to know what it is.
Comment 21•4 years ago
|
||
Linux bugs don't block Fission Beta experiment (on Windows and macOS), so I am moving Linux Fission bugs from Fission milestone M7 to M7a.
Comment 22•4 years ago
|
||
Jeff, any updates on this Linux crash stack issue? Was the stack overflow in Mozilla or Mesa code?
I assume this crash stack issue should still block shipping WebGL remoting on Linux.
Comment 24•4 years ago
|
||
Fission M7a will not block on this, since we will have semi-headless mode for Linux that jld is working on.
Comment 25•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #13)
This is so odd. There's no trace of the crash, it seems like the exception handler didn't see it at all. I'm starting to doubt that this is really a SIGSEGV at all. I'm wondering if the main process is doing something like
exit(-11)
or somesuch.
There's another way this can happen: if a thread has SIGSEGV
blocked and then does something that would raise SIGSEGV
, the signal is unblocked and the handler is reset to SIG_DFL
before processing it. (The rationale is that it's assumed the signal was blocked for a reason, and it may not be safe to run the handler at that point. This behavior is also used for other cases of important synchronous signals, like SIGSYS
raised from seccomp-bpf.)
In particular, sometimes things will block all signals, for reasons of varying validity; e.g., this code in glib (not glibc). And here's something suspiciously similar in Mesa (which helpfully omits SIGSYS
, but not SIGSEGV
). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it). But (as gsvelto reminded me out-of-band while I was writing this) any signal handler will also block its own signal while it's running, unless SO_NODEFER
is used, and we do have a number of SIGSEGV
handlers; they can also opt in to blocking other signals (the sa_mask
member of struct sigaction
).
For the case of SIGSYS
in child processes, we're already using symbol interposition to prevent blocking it. (It doesn't work for code that inlines the syscall, which glibc's dynamic linker necessarily does, but glibc is usually fairly cautious about this kind of thing.) Something similar could be tried here to see if that's the problem; we might need to define the symbols in the executable, or maybe have the test harness LD_PRELOAD
libmozsandbox.so
.
Assignee | ||
Comment 26•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #25)
In particular, sometimes things will block all signals, for reasons of varying validity; e.g., [this code in glib (not glibc)][glib-oops]. And here's [something suspiciously similar in Mesa][mesa-oops] (which helpfully omits
SIGSYS
, but notSIGSEGV
). Both of those examples create a thread with all signals already blocked when it's started (inherited from the thread that creates it).
The mesa case is particularly suspicious given the issue here. The commit message states that it was a workaround for some signal-handling in X which is really sad. I think we'll have to audit both threads and handlers that block signals to get to the bottom of this.
I wonder if we'll have to interpose ourselves everywhere to make sure we can catch crashes even when they happen in threads which the authors have deemed not suitable for handling signals. Which BTW is not trivial either, we had bugs in breakpad in the past when several threads entered the signal handler simultaneously.
Comment 27•3 years ago
|
||
I ran an experiement on Try, using symbol interposition to prevent SIGSEGV (and a few others just in case: BUS, ILL, FPE) from being blocked via the usual libc APIs. And there still isn't a crash report. I also added logging, so I can see that the interposition is being loaded in the parent process (i.e., I didn't edit the wrong Python class when I tried to set the env var). I can also see in local testing, using procfs (SigBlk
in the status
file), that the signals in question aren't blocked on threads that are otherwise blocking everything. So I tried again, with logging of changes to the signal handlers, but that wasn't any more enlightening. One thing I haven't tested yet is forcing SA_ONSTACK
, but that would only help if there's another handler applied after Breakpad which segfaults in the handler itself, and I don't know if there's any evidence for that.
One thing I noticed is the driver info, which might be trying to tell me that it's vmwgfx
?
GL_VENDOR: VMware, Inc.
mVendor: VMware, Inc.
GL_RENDERER: llvmpipe (LLVM 10.0.0, 256 bits)
mRenderer: Unknown
I'll see if I can get a Pernosco trace; comment #18 indicates that some kind of crash is reproducible that way, and hopefully it's the same one (even though I assume it's using software rendering because of rr
).
Comment 28•3 years ago
|
||
I got Pernosco working, and assuming the crash under rr
is the same as the normal one, I think this is an OOM: it crashes on a direct call
instruction where the target is readable, and the memory below the stack pointer is readable, but the entire rest of the page above the stack pointer is zeroes and the prologue subtracts 0xbd8
from it, suggesting that that call
is the first thing to touch that page. That might explain why the crash reporter wouldn't work, although it is possible to crash during a “normal” OOM detected by mozjemalloc, so I'm not 100% sure I understand what's going on here.
It might also explain why the ASan runs didn't fail, if they have higher resource limits or otherwise manage memory differently in some way that matters here.
But this result suggests an experiment: mlock
the compositor thread's stack and see what happens.
Comment 29•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #28)
I think this is an OOM
On further inspection, no. This is a stack overflow; the page in question was the guard page (even though it appeared readable in the replay debugger), as determined by inspecting the execution of pthread_create
that created the stack. But that returns us to the question of why the crash reporter would have failed.
Comment 30•3 years ago
|
||
It looks like WasmTrapHandler
was stacked on top of the crash reporter's signal handler, but it's also SA_ONSTACK
so that should work.
However, sigaltstack
is per-thread and it seems to be used only on each process's main thread, if I'm reading the trace (and the code) correctly; this means that off-main-thread stack exhaustion (in this case, the compositor thread) cannot be caught by the crash reporter.
Assignee | ||
Comment 31•3 years ago
|
||
Interesting find! I had no idea sigaltstack()
worked per-thread but in hindsight it makes sense given that multiple threads can handle signals concurrently - and hence would need separate stacks to do so. I'm sure I can write a test for that. As for the fix that's trickier, we'd need to call sigaltstack()
on every thread and I don't think that's doable, plus it would have non-negligible memory impact.
Assignee | ||
Comment 32•3 years ago
|
||
I've double-checked what happens in Chrome and Crashpad. Crashpad recommends that every thread of an application using it call sigaltstack()
(through a Crashpad-specific call). It also provides a replacement for pthread_create()
that does it automatically. It sounds like we might have to do it too.
Comment 33•3 years ago
|
||
If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object like what the profiler does. (In general there are a few things we tend to do on our own threads, including some stuff in XPCOM that pokes at the stack memory, so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)
As for memory, it looks like the chromium code to create the alternative stack just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). Our current alternative stack uses calloc
which will probably(?) fault in the memory, which would be the opposite trade-off.
Comment 34•3 years ago
|
||
On the topic of fixing the compositor thread stack size, I did some Try runs: 512kB works, 448kB works, 384kB fails. The current size is 320kB, set in bug 1716120 because of Windows crashes which appeared to be stack overflows (but the last comments on that bug indicate there were some unanswered questions about that) and which went away with a higher stack size.
Updated•3 years ago
|
Assignee | ||
Comment 35•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #33)
If we wanted something a little less ambitious, like opting-in on threads that are likely to cause problems (e.g., the compositor thread: calling third-party driver code to do shader compilation with a fixed stack limit that we chose by adjusting it until it stopped crashing), we could use a RAII object [like what the profiler does][prof-init]. (In general there are [a few things we tend to do on our own threads][broker-init], including [some stuff in XPCOM that pokes at the stack memory][xpc-init], so there's precedent, but of course if we want to cover all the threads we'll need interposition like in Chromium.)
I'd go for that if you're OK with it. The alternative would be a whack-a-mole game of finding which threads might be overflowing their stacks while not getting crashes from them.
As for memory, it looks like [the chromium code to create the alternative stack][cs-stack] just mmaps it, so it wouldn't be allocated until it's used (assuming memory is available, but I think Linux will try killing things in that case anyway, so I'm not sure what the actual impact is on handling OOMs). [Our current alternative stack][fx-stack] uses
calloc
which will probably(?) fault in the memory, which would be the opposite trade-off.
Thinking about this I had thought about mmap()
'ing, good to know that Chrome uses it too. It will probably use a minuscule amount of extra memory in the kernel for the mapping but no actual pages. I think that mmap()'ing the alternate stack + interposition is the way to go. If I'm not too busy with Web MIDI I'll try to cook a patch before the end of the week.
Assignee | ||
Comment 36•3 years ago
|
||
Stealing this bug, I've got a POC which includes a unit test that demonstrates the issue on Linux and a potential fix.
Assignee | ||
Comment 37•3 years ago
|
||
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 38•3 years ago
|
||
Comment 39•3 years ago
|
||
Backed out for causing hazard bustage on Promise.cpp.
[task 2021-12-15T04:29:47.841Z] Spawned process 19668
[task 2021-12-15T04:31:43.024Z] Renaming hazards.tmp1 -> rootingHazards.1.of.8
[task 2021-12-15T04:32:02.104Z] Renaming hazards.tmp4 -> rootingHazards.4.of.8
[task 2021-12-15T04:32:04.333Z] Renaming hazards.tmp7 -> rootingHazards.7.of.8
[task 2021-12-15T04:32:28.737Z] Renaming hazards.tmp3 -> rootingHazards.3.of.8
[task 2021-12-15T04:32:46.991Z] Renaming hazards.tmp5 -> rootingHazards.5.of.8
[task 2021-12-15T04:32:50.616Z] Renaming hazards.tmp2 -> rootingHazards.2.of.8
[task 2021-12-15T04:33:06.691Z] Renaming hazards.tmp6 -> rootingHazards.6.of.8
[task 2021-12-15T04:33:54.604Z] Renaming hazards.tmp8 -> rootingHazards.8.of.8
[task 2021-12-15T04:33:54.604Z] Running gather-hazards to generate rootingHazards.txt
[task 2021-12-15T04:33:54.604Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' cat rootingHazards.1.of.8 rootingHazards.2.of.8 rootingHazards.3.of.8 rootingHazards.4.of.8 rootingHazards.5.of.8 rootingHazards.6.of.8 rootingHazards.7.of.8 rootingHazards.8.of.8 > rootingHazards.txt
[task 2021-12-15T04:33:54.605Z] Spawned process 20123
[task 2021-12-15T04:33:54.608Z] Renaming gather-hazards.tmp1 -> rootingHazards.txt
[task 2021-12-15T04:33:54.608Z] Running explain to generate ['hazards.txt', 'unnecessary.txt', 'refs.txt']
[task 2021-12-15T04:33:54.608Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /usr/bin/python3 /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/explain.py rootingHazards.txt gcFunctions.txt hazards.txt unnecessary.txt refs.txt
[task 2021-12-15T04:33:54.609Z] Spawned process 20124
[task 2021-12-15T04:34:02.718Z] Wrote explained_hazards.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote unnecessary.tmp1
[task 2021-12-15T04:34:02.718Z] Wrote refs.tmp1
[task 2021-12-15T04:34:02.718Z] Found 2 hazards 230 unsafe references 0 missing
[task 2021-12-15T04:34:02.874Z] Renaming explained_hazards.tmp1 -> hazards.txt
[task 2021-12-15T04:34:02.874Z] Renaming unnecessary.tmp1 -> unnecessary.txt
[task 2021-12-15T04:34:02.874Z] Renaming refs.tmp1 -> refs.txt
[task 2021-12-15T04:34:02.874Z] Running heapwrites to generate heapWriteHazards.txt
[task 2021-12-15T04:34:02.874Z] PATH="/builds/worker/fetches/gcc/bin:/builds/worker/fetches/sixgill/usr/bin:${PATH}" XDB='/builds/worker/fetches/sixgill/usr/bin/xdb.so' SOURCE='/builds/worker/checkouts/gecko' ANALYZED_OBJDIR='/builds/worker/workspace/obj-analyzed-browser' /builds/worker/workspace/obj-haz-shell/dist/bin/js /builds/worker/checkouts/gecko/js/src/devtools/rootAnalysis/analyzeHeapWrites.js > heapWriteHazards.txt
[task 2021-12-15T04:34:02.875Z] Spawned process 20125
[task 2021-12-15T04:34:41.470Z] Renaming heapwrites.tmp1 -> heapWriteHazards.txt
[task 2021-12-15T04:34:41.535Z] + check_hazards /builds/worker/workspace/haz-browser
[task 2021-12-15T04:34:41.535Z] + set +e
[task 2021-12-15T04:34:41.536Z] ++ grep -c 'Function.*has unrooted.*live across GC call' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.537Z] + NUM_HAZARDS=2
[task 2021-12-15T04:34:41.537Z] ++ grep -c '^Function.*takes unsafe address of unrooted' /builds/worker/workspace/haz-browser/refs.txt
[task 2021-12-15T04:34:41.539Z] + NUM_UNSAFE=230
[task 2021-12-15T04:34:41.539Z] ++ grep -c '^Function.* has unnecessary root' /builds/worker/workspace/haz-browser/unnecessary.txt
[task 2021-12-15T04:34:41.541Z] + NUM_UNNECESSARY=1207
[task 2021-12-15T04:34:41.541Z] ++ grep -c '^Dropped CFG' /builds/worker/workspace/haz-browser/build_xgill.log
[task 2021-12-15T04:34:41.678Z] + NUM_DROPPED=0
[task 2021-12-15T04:34:41.678Z] ++ perl -lne 'print $1 if m!found (\d+)/\d+ allowed errors!' /builds/worker/workspace/haz-browser/heapWriteHazards.txt
[task 2021-12-15T04:34:41.682Z] + NUM_WRITE_HAZARDS=0
[task 2021-12-15T04:34:41.682Z] ++ grep -c '^Function.*expected hazard.*but none were found' /builds/worker/workspace/haz-browser/rootingHazards.txt
[task 2021-12-15T04:34:41.683Z] + NUM_MISSING=0
[task 2021-12-15T04:34:41.683Z] + set +x
[task 2021-12-15T04:34:41.683Z] TinderboxPrint: rooting hazards<br/>2
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unsafe references to unrooted GC pointers)<br/>230
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: (unnecessary roots)<br/>1207
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: missing expected hazards<br/>0
[task 2021-12-15T04:34:41.684Z] TinderboxPrint: heap write hazards<br/>0
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'this' of type 'js::PromiseObject*' live across GC call at js/src/builtin/Promise.cpp:5931
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | unrooted 'mover' of type 'js::TenuringTracer' live across GC call at js/src/gc/Nursery.cpp:1228
[task 2021-12-15T04:34:41.685Z] TEST-UNEXPECTED-FAIL | hazards | 2 rooting hazards detected
[task 2021-12-15T04:34:41.685Z] TinderboxPrint: documentation<br/><a href='https://wiki.mozilla.org/Javascript:Hazard_Builds#Diagnosing_a_rooting_hazards_failure'>static rooting hazard analysis failures</a>, visit "Inspect Task" link for hazard details
Assignee | ||
Comment 40•3 years ago
|
||
I don't think this failure is related to my patch, I'll re-try and see if I can re-land.
Comment 41•3 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 42•3 years ago
|
||
This needs changes to the TimeStamp class in order to work; I'll put them in another bug.
Updated•3 years ago
|
Updated•3 years ago
|
Comment 43•3 years ago
|
||
Comment 44•3 years ago
|
||
Backed out 2 changesets (Bug 1678152, Bug 1751041) for causing failures in test_missing_intermediate.js CLOSED TREE
Log: https://treeherder.mozilla.org/logviewer?job_id=365391373&repo=autoland&lineNumber=8975
Backout: https://hg.mozilla.org/integration/autoland/rev/ea291de828a9258194e0a74ed91248b6764ceebc
Assignee | ||
Comment 45•3 years ago
|
||
The problem here is that I've put the interposer inside mozglue (so that it would be loaded early at startup) but forgot that mozglue gets linked into other stuff - such as certutil in this case. In this particular instance the interposer can't find the real pthread_create()
call and thus crashes the tool.
Assignee | ||
Comment 46•3 years ago
|
||
I think I'll have to pull the pthread_create()
interposer out of mozglue, put it into its own library and then link it early only when building libxul.
Comment 47•3 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #46)
I think I'll have to pull the
pthread_create()
interposer out of mozglue, put it into its own library and then link it early only when building libxul.
You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.
Assignee | ||
Comment 48•3 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #47)
You might need to link it into the executables and not libxul, in order for it to be early enough in the search order.
Yes, I tried with libxul yesterday night only to realize that Firefox was linking libpthread before it was linking libxul. Now I'm trying with the firefox and xpcshell executables directly and it seems that it does the trick (which has the upside of not linking this into things that don't care about crash reporting).
Assignee | ||
Comment 49•3 years ago
|
||
I tweaked the patch to also include the plugin-container executable as suggested by :jld on Matrix and tested that it works as expected. Let's try to land this again.
Comment 50•3 years ago
|
||
Comment 51•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Description
•