We don't capture native stacks for markers anymore on Linux
Categories
(Core :: Gecko Profiler, defect, P2)
Tracking
()
People
(Reporter: julienw, Assigned: canova)
References
(Regression)
Details
(Keywords: regression, Whiteboard: [fxp])
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
Details |
STR:
- Capture a profile with any marker that captures a stack (eg:
Reflow
orDoFlushPendingNotifications
)
The markers only have non-native stacks.
According to Florian this seems to be Linux only, at least he says this works on MacOS.
mozregression gives this pushlog https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=448f017b7781d8504c547599bddb50814e71e3c7&tochange=c8933e49a64bb79e25d333e7c832880048cb5712 which points to bug 1716959.
Gerald can you please have a look?
Reporter | ||
Updated•2 years ago
|
Reporter | ||
Updated•2 years ago
|
Updated•2 years ago
|
Comment 1•2 years ago
|
||
Set release status flags based on info from the regressing bug 1716959
Comment 2•2 years ago
|
||
:gerald, since you are the author of the regressor, bug 1716959, could you take a look?
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 3•2 years ago
|
||
Set release status flags based on info from the regressing bug 1716959
Updated•2 years ago
|
Reporter | ||
Comment 4•2 years ago
|
||
Hey Gerald, can you give some light about how this regression could have happened? Thanks :-)
It is quite possible that bug 1716959 caused this, as it moved some things around, that are used for stack-walking.
Unfortunately I'm not a Linux guru, so I'm not sure what's happening.
I did try to revert parts of the patches in bug 1716959, but could not get stack-walking working again. Either I did something wrong, or that bug it not the real cause? (And I couldn't run mozregression either, to test that theory.)
I'll try again, but we may need to get real experts involved...
Updated•2 years ago
|
Reporter | ||
Comment 6•2 years ago
|
||
Test various features, for example disable stack sampling, see if this makes a difference and possibly get a pernosco session about that.
Reporter | ||
Comment 7•2 years ago
|
||
Changing features doesn't change anything.
I'm trying to get a pernosco session but it's a bit difficult right now, but I'm still looking at this.
I did confirm again that the things that work on MacOS don't work on Linux.
Other markers easy to spot: ViewManagerFlush or NotifyObservers.
Reporter | ||
Comment 8•2 years ago
|
||
I think I managed to capture a pernosco session.
This specific view is when capturing a stack for a NotifyObservers
text marker.
This other view is when capturing the cause for a ViewManagerFlush
marker.
This is the profile recorded during this rr session => https://share.firefox.dev/3DlESO3
I hope somebody will be able to look at it, because this is quite out of my skills.
Assignee | ||
Comment 9•2 years ago
|
||
I spent some time to investigate the issue that we have and I found a fix to it.
We are using getcontext
the get the user context when we need to do a sync backtrace capture. And we may use 2 different getcontext
s depending on the platform and libc because it's not a part of the standard C library and no longer part of the Posix standard. If libc has it, then we use the libc provided one. If it doesn't, we fallback to breakpad_getcontext.
I'm not sure in which cases we don't have getcontext
, but my ubuntu linux machine with the current mozilla-central clang has it. I tried both with libc and breakpad ones though to make sure.
It looks like breakpad's getcontext
was working without a problem. But libc getcontext
wasn't working properly (which seems like it's the mostly used one from the complaints).
Also I tested with different build flags, and found out that libc getcontext
is working when built with no optimizations. When we build with optimization (starting from -O1), it stops working properly.
The next thing I tested was to call the getcontext
in the parent right after initializing Registers here instead of calling them inside Registers::SyncPopulate
. It appears that this indeed fixes the issue. When I searched for it, it's mentioned in a few places that the saved context is invalid once the function that called getcontext
returns. We need to call the getcontext
while the frame where we called it is still on the stack. Because stack pointer address will be invalid once the SyncPopulate
returns. So that explains why it works when we call it inside the profiler_capture_backtrace_into
function but not inside the Registers::SyncPopulate
method.
I have a patch now to move this getcontext
function out of that method and inline to its parent with a macro. That way the user context will be valid during the stack walking. I tested it manually and it seems like it's working well.
Assignee | ||
Comment 10•2 years ago
|
||
See my comment on here for more context of my investigation:
https://bugzilla.mozilla.org/show_bug.cgi?id=1779257#c9
The saved context is invalid once the function that called getcontext
returns. We need to call the getcontext
while the frame where we called it is
still on the stack. That's why this patch is moving the call to getcontext
to
parent function by inlining the SyncPopulate content by using a macro instead.
This has to be a macro instead of a function because stack pointer address will
be invalid once the Registers::SyncPopulate
returns. I tried to change this
method to inline but that didn't help either.
Updated•2 years ago
|
(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #9)
Because stack pointer address will be invalid once the
SyncPopulate
returns. So that explains why it works when we call it inside theprofiler_capture_backtrace_into
function but not inside theRegisters::SyncPopulate
method.
This reminds me of bug 1714501. Your patch may actually fix that bug as well! (But I'm just passing by, you may want to check for yourself -- Thank you for working on this.)
Assignee | ||
Comment 12•2 years ago
|
||
Hey Gerald! Good to see you around :)
(In reply to Gerald Squelart (he/him) from comment #11)
(In reply to Nazım Can Altınova [:canova][:canaltinova on phabricator] from comment #9)
Because stack pointer address will be invalid once the
SyncPopulate
returns. So that explains why it works when we call it inside theprofiler_capture_backtrace_into
function but not inside theRegisters::SyncPopulate
method.This reminds me of bug 1714501. Your patch may actually fix that bug as well! (But I'm just passing by, you may want to check for yourself -- Thank you for working on this.)
Oh! That looks like the exact problem! I didn't know about this bug, thanks for pointing it. It answers why we see different results for non-opt and optimized builds. Probably it's because of the tail-call optimization happening on the optimized ones.
And it mentions that macOS get the values from the caller, which means that now after it's inlined to the parent, we might not see profiler_capture_backtrace_into
frame anymore. This could be a good thing (I remember some people mentioning that it being useless as it's always obviously there), but probably we should be consistent with other platforms or decide intentionally. I will investigate that a bit more.
Assignee | ||
Comment 13•2 years ago
|
||
Update about macOS:
It looks like we don't see profiler_capture_backtrace_into
frame on the optimized builds already. It appears that Registers::SyncPopulate
is being inlined on macOS. So changing this function to macro doesn't have any visible affect after all. Here are before/after profiles:
macOS before the fix: https://share.firefox.dev/3xtITwz
macOS after the fix: https://share.firefox.dev/3XTFPoB
Updated•2 years ago
|
Updated•2 years ago
|
Comment 14•2 years ago
|
||
Comment 15•2 years ago
|
||
bugherder |
Updated•2 years ago
|
Description
•