Re-enable stackwalking on local macOS builds when rust-based stackwalking is available
Categories
(Core :: XPCOM, task)
Tracking
()
People
(Reporter: Gijs, Assigned: n.nethercote)
References
Details
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
Details |
As per summary - in bug 1543447 we're landing a change to disable stackwalking on local macOS usage because of performance issues. This bug is here to re-enable it when we've fixed the perf issues by switching to rust.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
Now that fix_stacks.py
is being used instead of fix_macosx_stack.py
,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.
Updated•5 years ago
|
Assignee | ||
Comment 2•5 years ago
|
||
Now that
fix_stacks.py
is being used instead offix_macosx_stack.py
,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.
This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.
Assignee | ||
Comment 3•5 years ago
|
||
I also wonder if we still need the MOZ_DISABLE_STACK_FIX
environment variable that Matt added in bug 1543447.
Reporter | ||
Comment 4•5 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #2)
Now that
fix_stacks.py
is being used instead offix_macosx_stack.py
,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.
30s is better, but it'd be helpful if the system could immediately output something that indicated that that was what was happening for the next 30-odd seconds. I also imagine that the 30s is a number on some types of machines, and that number will differ on slower / different machines...
(In reply to Nicholas Nethercote [:njn] from comment #3)
I also wonder if we still need the
MOZ_DISABLE_STACK_FIX
environment variable that Matt added in bug 1543447.
I think we should keep it (perhaps renaming it if we want to opt people back in to having them by default) to make it possible to not have this happen locally when e.g. running tests, given that when debugging test failures, timing can be very important. In any case, if the assertions for which we gather stacks are unrelated to what you're doing, you hit a few of them for every run, and you're trying to run verify
, now a 1-2 minute task has become a 5-10 minute task, and that'd still be annoying.
Assignee | ||
Comment 5•5 years ago
|
||
I also imagine that the 30s is a number on some types of machines, and that number will differ on slower / different machines...
That's a safe bet.
you hit a few of them for every run
Fortunately the ~30 second cost should be a one-time hit for the first stack trace, and later stack traces will be almost instantaneous, because a single fix-stacks
process will be used for the entire test run.
Comment 6•5 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #2)
Now that
fix_stacks.py
is being used instead offix_macosx_stack.py
,
stack-fixing time has dropped from about 14 minutes to about 30 seconds on my
new MacBook Pro.This can't land until bug 1619837 does, but while we're waiting: does this seem like a reasonable length of time to wait for a stack trace? I'm needinfo'ing various people who hit this problem in or around bug 1543447.
30s is still a surprising amount of time to see no output on a test run, but I think if we can do what Gijs suggested (output a message saying that a stack is being processed and there will be a short delay) then I think at least for me I'd be happy with having this on as a default.
(In reply to Nicholas Nethercote [:njn] from comment #5)
Fortunately the ~30 second cost should be a one-time hit for the first stack trace, and later stack traces will be almost instantaneous, because a single
fix-stacks
process will be used for the entire test run.
Is there any way to pre-cache the data that fix-stacks needs?
Assignee | ||
Comment 7•5 years ago
|
||
Pre-caching is difficult. fix-stacks
needs to slurp up a lot of debuginfo, and it does this on a binary-by-binary basis. If your stack trace involves libxul, there will be a wait, because libxul is enormous. If your stack trace only involves other libraries it'll be much faster.) Also, on Linux and Windows all the debuginfo for each binary is in a single file (on Linux it's typically in the binary itself, on Windows it's in a separate PDB file), but on Mac the debuginfo is spread across all the .o
files that fed into the binary. So there's just a lot of disk access, which is why it takes 10s of seconds, and Mac tends to be the slowest platform.
We could package that data into a more convenient form ahead of time, but it would still be a large amount of data and so the start-up time would likely be better but still significant. Also, the cost of the packaging step would presumably be added to build time, which would hurt every build. Or we'd have a separate packaging step that you'd have to run before running tests, which would be annoying. (The standard Mac approach is to do exactly this kind of packaging via dsymutil
, but unfortunately it is extremely slow, which is why we read directly from .o
files.)
So in short, the current approach of doing things on demand seems like the least worst option. But a warning message should be easy and will help a lot.
Comment 8•5 years ago
|
||
30 seconds sounds reasonable, if there's a message, though I do worry that this is the sort of thing that could increase further over time.
I'm also worried about platform tools (like Instruments) that can't use fix-stacks, and currently hang for 15+ minutes when used. My initial testing in bug 1543447 suggested that the slowness was specific to libgkrust.a (which also explains why this has regressed so much over the last few years). Has there been any investigation as to why that is, and if it's possible to output symbols from rust in a way that isn't as slow?
It'd also be really nice to have a mozconfig/mozbuild option to run a persistent fix-stack process (spawned when a build completes, and prodded to pre-cache XUL symbols), so that it'd be ready for test runs.
Comment 9•5 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #8)
It'd also be really nice to have a mozconfig/mozbuild option to run a persistent fix-stack process (spawned when a build completes, and prodded to pre-cache XUL symbols), so that it'd be ready for test runs.
Yeah this would be great. I've frequently run the same test multiple times without a rebuild in-between and so every time I'd see the slow first run to figure out the symbols
Comment 10•5 years ago
|
||
30 seconds seems fine, though I'd still like for a way to disable it. I guess leaving the existing env var does have the disadvantage that anybody who did so will continue to have broken stacks until they change something.
Comment 11•5 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #10)
30 seconds seems fine, though I'd still like for a way to disable it. I guess leaving the existing env var does have the disadvantage that anybody who did so will continue to have broken stacks until they change something.
To that end it would be nice to also add a message when this is disabled telling folks how to re-enable it.
Updated•5 years ago
|
Assignee | ||
Comment 12•5 years ago
|
||
The new patch prints both requested warnings: one if you have MOZ_DISABLE_STACK_FIX
set, and one when the first stack frame gets fixed. Other suggestions (fixing Instruments, magic pre-caching, persistent fix-stacks processes) are out of scope for this bug.
Comment 13•5 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #12)
The new patch prints both requested warnings: one if you have
MOZ_DISABLE_STACK_FIX
set, and one when the first stack frame gets fixed. Other suggestions (fixing Instruments, magic pre-caching, persistent fix-stacks processes) are out of scope for this bug.
This is awesome, thanks!
Assignee | ||
Comment 14•5 years ago
|
||
I've frequently run the same test multiple times without a rebuild in-between and so every time I'd see the slow first run to figure out the symbols
If you know in advance that you'll be running the same test multiple times, you can do ./mach buildsymbols
first. It takes a while to run, but stack-fixing using breakpad symbols is much faster on my Mac than native stack fixing.
Comment 15•5 years ago
|
||
Comment 16•5 years ago
|
||
bugherder |
Updated•5 years ago
|
Description
•