Gecko Profiler causes a deadlock during shmem allocation, when profiling https://discord.com/hypesquad
Categories
(Core :: Gecko Profiler, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox-esr78 | --- | unaffected |
firefox79 | --- | unaffected |
firefox80 | --- | wontfix |
firefox81 | --- | fixed |
People
(Reporter: dholbert, Assigned: aosmond)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: regression)
Attachments
(2 files)
STR:
-
Starting with a fresh profile, visit https://profiler.firefox.com/ and click the button to enable the profiler menu button.
-
Ctrl+Shift+1 to start profiling
ACTUAL RESULTS:
The animated guy with a backpack freezes after you start profiling. It's kinda janky before that, but it just freezes entirely when the profiler starts.
EXPECTED RESULTS:
No such freezing.
This is blocking investigation of why the rendering is janky. In bug 1658282 comment 6, mstange suggests it's because the profiler causes a deadlock during shmem allocation.
This bug here covers that profiler issue.
Reporter | ||
Comment 1•4 years ago
|
||
Here are two profiles that I tried to capture & ran into this issue:
https://share.firefox.dev/2DZyFeS
https://share.firefox.dev/3aiSdXQ
Reporter | ||
Comment 2•4 years ago
|
||
Transferring ni=gerald from bug 1658282.
Reporter | ||
Comment 3•4 years ago
|
||
This is a regression -- I can profile the site without interrupting its animation (i.e. no deadlock) in Firefox 79 release.
Thank you Daniel.
Some early unprocessed thoughts:
The "nativeallocations" feature is not used, that would have been the main culprit.
"screenshots" is also not used, that could have caused hiccups.
The profiler sampler overhead is at 20%, that's fairly high!
Anyway I can reproduce it on Linux (the animation doesn't freeze for me, but gets jankier), and I can see a lot of time spend in posix_fallocate, I'll investigate further...
mozregression points at bug 1582954, which started using posix_fallocate
, and that's the function that gets stuck when profiling.
Andrew, you introduced posix_fallocate
in bug 1582954, and it's causing significant pauses when profiling. I'm not sure if it's only affecting the Gecko Profiler, or if it's really causing pauses by itself (in which case, it would also be the "regressed-by" for bug 1658282).
Could you please have a look?
Updated•4 years ago
|
(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)
(it would also be the "regressed-by" for bug 1658282)
Forget about that part, bug 1658282 is a graphics issue pre-dating your patch.
Comment 8•4 years ago
|
||
Set release status flags based on info from the regressing bug 1582954
Updated•4 years ago
|
Assignee | ||
Comment 9•4 years ago
|
||
I can reproduce with the STR. Investigating.
Assignee | ||
Comment 10•4 years ago
|
||
It doesn't happen if I profile, but disable profiling the main thread. While the profiles suggest it is blocked, it actually appears to be getting EINTR many times in succession, and we loop retrying.
Assignee | ||
Comment 11•4 years ago
|
||
Reducing the sample time also avoids the problem.
Assignee | ||
Comment 12•4 years ago
|
||
Reviewing the posix_fallocate implementation:
Either the read or write hits EINTR, and then we need to start all over again. This is why larger allocations are more vulnerable. If we were handling EINTR directly inside this loop, I imagine we would avoid the problem, as we would be making incremental progress.
Assignee | ||
Comment 13•4 years ago
|
||
Just for the record, given the profiler registers a frequent signal, SIGPROF,
it is unlikely this impacts users outside of the profiler.
Assignee | ||
Comment 14•4 years ago
|
||
kvark linked me to:
https://github.com/pmem/issues/issues/168
https://github.com/pmem/pmdk/pull/3580/files
which tries incrementally smaller chunks to work around this.
The only alternative would be to implement posix_fallocate ourselves.
Assignee | ||
Comment 15•4 years ago
|
||
We already have mozilla::fallocate which implements its own fallback if we don't have posix_fallocate:
Perhaps the best course of action is to improve this implementation to support EINTR and use our fallback path if we get EINTR-ed...
Thank you for the analysis.
So if I understand correctly, the profiler sampler's SIGPROF
is interrupting posix_fallocate
? And that's why it happens more often when allocating a lot and/or sampling more frequently, right?
So one alternate solution (apart from modifying posix_fallocate
itself) would be to stop the profiler from interrupting the thread in this case.
I think we could use (abuse!) AUTO_PROFILER_THREAD_SLEEP
and AUTO_PROFILER_THREAD_WAKE
around the call to posix_fallocate
:
- If
MOZ_GECKO_PROFILER
is not #defined, these macros are empty. - Otherwise, If the profiler is not running, they will only do one TLS read.
- Otherwise (i.e., the profiler is running), they will do one TLS read and one atomic write; after
AUTO_PROFILER_THREAD_SLEEP
the next sampling loop will still take one sample (thereby interruptingposix_fallocate
), but the following loops will not sample that thread anymore, allowingposix_fallocate
to do its job to completion;AUTO_PROFILER_THREAD_WAKE
will then allow the sampler to resume sampling.
What do you think?
Comment 17•4 years ago
|
||
I like this suggestion. You could even argue that it would be reasonable to put AUTO_PROFILER_THREAD_SLEEP
there even if it didn't have these beneficial effects, because no interesting CPU work happens inside posix_fallocate
.
Great!
Andrew, since you're able to reproduce the issue could you please try that? You just need to add AUTO_PROFILER_THREAD_SLEEP
before the posix_fallocate
call, and AUTO_PROFILER_THREAD_WAKE
after. [edit: See comment 19, no need for _WAKE
, but you'll need to enclose _SLEEP
and fallocate
in a scope]
If that works, ship it! 😉
And I see that we have more posix_fallocate
calls in our code: https://searchfox.org/mozilla-central/search?q=posix_fallocate
So as a follow-up (in this bug or another one) we should consider doing the same in these other places, or combine them all into a profiler-friendly wrapper...
Comment 19•4 years ago
|
||
(In reply to Gerald Squelart [:gerald] (he/him) from comment #18)
You just need to add
AUTO_PROFILER_THREAD_SLEEP
before theposix_fallocate
call, andAUTO_PROFILER_THREAD_WAKE
after.
It's an RAII class (it starts with "AUTO_"), so just AUTO_PROFILER_THREAD_SLEEP
should be fine, no wake after. You only need the wake thing if you want to temporarily "unsleep" from the inside.
Assignee | ||
Comment 20•4 years ago
|
||
posix_fallocate iterates over each page/block in a shmem to ensure the
OS allocates memory to back it. Large shmems will cause many read/write
calls to be made, and when profiling, it is very likely a SIGPROF signal
will interrupt us at sufficiently high sampling rates. Most attempts at
retrying will fail for the same reason, and this can cause the threads
to block for an indeterminate period of time.
To work around this we put the profiler thread to sleep during the
posix_fallocate call. This will avoid us getting repeatably interrupted
until it can be completed.
Assignee | ||
Comment 21•4 years ago
|
||
I verified this works. Thanks for the suggestion :).
Comment 22•4 years ago
|
||
Comment 23•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Updated•4 years ago
|
Description
•