Closed Bug 1658847 Opened 4 years ago Closed 4 years ago

Gecko Profiler causes a deadlock during shmem allocation, when profiling https://discord.com/hypesquad

Categories

(Core :: Gecko Profiler, defect, P1)

defect

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox79 --- unaffected
firefox80 --- wontfix
firefox81 --- fixed

People

(Reporter: dholbert, Assigned: aosmond)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(2 files)

STR:

  1. Starting with a fresh profile, visit https://profiler.firefox.com/ and click the button to enable the profiler menu button.

  2. Visit https://discord.com/hypesquad

  3. Ctrl+Shift+1 to start profiling

ACTUAL RESULTS:
The animated guy with a backpack freezes after you start profiling. It's kinda janky before that, but it just freezes entirely when the profiler starts.

EXPECTED RESULTS:
No such freezing.

This is blocking investigation of why the rendering is janky. In bug 1658282 comment 6, mstange suggests it's because the profiler causes a deadlock during shmem allocation.

This bug here covers that profiler issue.

Here are two profiles that I tried to capture & ran into this issue:
https://share.firefox.dev/2DZyFeS
https://share.firefox.dev/3aiSdXQ

Transferring ni=gerald from bug 1658282.

Flags: needinfo?(gsquelart)

This is a regression -- I can profile the site without interrupting its animation (i.e. no deadlock) in Firefox 79 release.

Keywords: regression

Thank you Daniel.

Some early unprocessed thoughts:
The "nativeallocations" feature is not used, that would have been the main culprit.
"screenshots" is also not used, that could have caused hiccups.
The profiler sampler overhead is at 20%, that's fairly high!

Anyway I can reproduce it on Linux (the animation doesn't freeze for me, but gets jankier), and I can see a lot of time spend in posix_fallocate, I'll investigate further...

Assignee: nobody → gsquelart
Severity: -- → S3
Flags: needinfo?(gsquelart)
Priority: -- → P1
Attached file mozregression (deleted) —

mozregression points at bug 1582954, which started using posix_fallocate, and that's the function that gets stuck when profiling.

Andrew, you introduced posix_fallocate in bug 1582954, and it's causing significant pauses when profiling. I'm not sure if it's only affecting the Gecko Profiler, or if it's really causing pauses by itself (in which case, it would also be the "regressed-by" for bug 1658282).

Could you please have a look?

Flags: needinfo?(aosmond)
Regressed by: 1582954
Has Regression Range: --- → yes

(In reply to Gerald Squelart [:gerald] (he/him) from comment #6)

(it would also be the "regressed-by" for bug 1658282)

Forget about that part, bug 1658282 is a graphics issue pre-dating your patch.

Set release status flags based on info from the regressing bug 1582954

I can reproduce with the STR. Investigating.

It doesn't happen if I profile, but disable profiling the main thread. While the profiles suggest it is blocked, it actually appears to be getting EINTR many times in succession, and we loop retrying.

Reducing the sample time also avoids the problem.

Reviewing the posix_fallocate implementation:

https://sourceware.org/git/?p=glibc.git;a=blob_plain;f=sysdeps/posix/posix_fallocate.c;hb=d614a7539657941a9201c236b2f15afac18e1213

Either the read or write hits EINTR, and then we need to start all over again. This is why larger allocations are more vulnerable. If we were handling EINTR directly inside this loop, I imagine we would avoid the problem, as we would be making incremental progress.

Just for the record, given the profiler registers a frequent signal, SIGPROF,

https://searchfox.org/mozilla-central/rev/fa7f47027917a186fb2052dee104cd06c21dd76f/mozglue/baseprofiler/core/platform-linux-android.cpp#299

it is unlikely this impacts users outside of the profiler.

kvark linked me to:

https://github.com/pmem/issues/issues/168
https://github.com/pmem/pmdk/pull/3580/files

which tries incrementally smaller chunks to work around this.

The only alternative would be to implement posix_fallocate ourselves.

We already have mozilla::fallocate which implements its own fallback if we don't have posix_fallocate:

https://searchfox.org/mozilla-central/rev/fa7f47027917a186fb2052dee104cd06c21dd76f/xpcom/glue/FileUtils.cpp#80

Perhaps the best course of action is to improve this implementation to support EINTR and use our fallback path if we get EINTR-ed...

Thank you for the analysis.

So if I understand correctly, the profiler sampler's SIGPROF is interrupting posix_fallocate? And that's why it happens more often when allocating a lot and/or sampling more frequently, right?

So one alternate solution (apart from modifying posix_fallocate itself) would be to stop the profiler from interrupting the thread in this case.
I think we could use (abuse!) AUTO_PROFILER_THREAD_SLEEP and AUTO_PROFILER_THREAD_WAKE around the call to posix_fallocate:

  • If MOZ_GECKO_PROFILER is not #defined, these macros are empty.
  • Otherwise, If the profiler is not running, they will only do one TLS read.
  • Otherwise (i.e., the profiler is running), they will do one TLS read and one atomic write; after AUTO_PROFILER_THREAD_SLEEP the next sampling loop will still take one sample (thereby interrupting posix_fallocate), but the following loops will not sample that thread anymore, allowing posix_fallocate to do its job to completion; AUTO_PROFILER_THREAD_WAKE will then allow the sampler to resume sampling.

What do you think?

I like this suggestion. You could even argue that it would be reasonable to put AUTO_PROFILER_THREAD_SLEEP there even if it didn't have these beneficial effects, because no interesting CPU work happens inside posix_fallocate.

Great!

Andrew, since you're able to reproduce the issue could you please try that? You just need to add AUTO_PROFILER_THREAD_SLEEP before the posix_fallocate call, and AUTO_PROFILER_THREAD_WAKE after. [edit: See comment 19, no need for _WAKE, but you'll need to enclose _SLEEP and fallocate in a scope]
If that works, ship it! 😉

And I see that we have more posix_fallocate calls in our code: https://searchfox.org/mozilla-central/search?q=posix_fallocate
So as a follow-up (in this bug or another one) we should consider doing the same in these other places, or combine them all into a profiler-friendly wrapper...

(In reply to Gerald Squelart [:gerald] (he/him) from comment #18)

You just need to add AUTO_PROFILER_THREAD_SLEEP before the posix_fallocate call, and AUTO_PROFILER_THREAD_WAKE after.

It's an RAII class (it starts with "AUTO_"), so just AUTO_PROFILER_THREAD_SLEEP should be fine, no wake after. You only need the wake thing if you want to temporarily "unsleep" from the inside.

posix_fallocate iterates over each page/block in a shmem to ensure the
OS allocates memory to back it. Large shmems will cause many read/write
calls to be made, and when profiling, it is very likely a SIGPROF signal
will interrupt us at sufficiently high sampling rates. Most attempts at
retrying will fail for the same reason, and this can cause the threads
to block for an indeterminate period of time.

To work around this we put the profiler thread to sleep during the
posix_fallocate call. This will avoid us getting repeatably interrupted
until it can be completed.

I verified this works. Thanks for the suggestion :).

Flags: needinfo?(aosmond)
Pushed by aosmond@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4a008bd49ea4 Put profiler thread to sleep when allocating memory for shmems. r=mstange
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
Regressions: 1659965
Assignee: gsquelart → aosmond
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: