Closed Bug 1683404 Opened 4 years ago Closed 3 years ago

ThreadSanitizer: data race [@ nsTimerImpl::Shutdown] vs. [@ nsTimerImpl::CancelImpl]

Categories

(Core :: XPCOM, defect)

defect

Tracking

()

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: Gankra, Assigned: KrisWright)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

Attached file tsan-backtrace-dec-2020.txt (deleted) —

Found while testing out turning on the full shutdown code for tsan (that doesn't just exit). Presumably not a security issue since this code is only for testing, and doesn't run in production?

Looks like gThread needs to be atomically managed.

Permalinks to problematic lines:

https://searchfox.org/mozilla-central/rev/58e955b8005b4845c42e874a6b9aa5b223ef0d82/xpcom/threads/nsTimerImpl.cpp#288

https://searchfox.org/mozilla-central/rev/58e955b8005b4845c42e874a6b9aa5b223ef0d82/xpcom/threads/nsTimerImpl.cpp#402

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

I think I've seen a TSan bug before for this timer shutdown race, but maybe not.

We do run this code in non-debug builds in the parent process.

I think bug 1647417 is the other bug I was thinking of.

Good catch, thanks!

Assignee: nobody → a.beingessner
Status: NEW → ASSIGNED
Keywords: leave-open
Pushed by abeingessner@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cc7b2b5c39b3 supress issue found while enabling full tsan shutdown. r=decoder
Assignee: a.beingessner → nobody
Status: ASSIGNED → NEW

It looks like this race may be involved in some recent shutdown hangs. I'm preparing a patch for this, but given how we use gThread any fix is nontrivial. I'm trying to wrap the thread itself in a threadsafe event target, but I'm not sure how to subvert races late into shutdown. Maybe this theoretical wrapper class should shut down gThread with timer shutdown but the wrapper itself shouldn't be dereferenced until after we shut down nsThreadManager in case of late timer checks.

Assignee: nobody → kwright

=Note that this work needs to be rebased on top of Bug 1711090

This particular race is a tricky one - there's no perfect solution to protecting the timer thread from being called in cancel while being dereferenced. This ensures that we won't run into that problem by locking all of our TimerThread calls behind a mutex inside a wrapper class. Then we hold onto the wrapper class until after we shutdown nsThreadManager, in which case no background thread pools should be active anymore to call nsTimerImpl::Cancel.

For reference, the worst-case scenario happens when we dereference gThread between these two calls, in which case we will get stuck on a dereferenced mutex. By putting the check and the actual call into gThread behind a mutex maybe we can prevent this issue.

This may not be the most elegant solution, so if there is a better way to fix this then it is more than welcome.

Try push shows that the issue is fixed with the above patch in the stack: https://treeherder.mozilla.org/jobs?repo=try&revision=a9e9135464b2750a8b6fd2a19408e3537b39dbe2

Attachment #9222548 - Attachment description: Bug 1683404 - (WIP) Wrap the timer thread behind a mutex. → Bug 1683404 - Wrap the timer thread behind a mutex.
Pushed by kwright@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/339b1169ea97 Wrap the timer thread behind a mutex. r=nika https://hg.mozilla.org/integration/autoland/rev/bc3f26b4edba Remove tsan suppressions for nsTimerImpl. r=Gankra

Is there more work to do here, or can we remove the leave-open and resolve?

Flags: needinfo?(kwright)

(In reply to Julien Cristau [:jcristau] from comment #13)

Is there more work to do here, or can we remove the leave-open and resolve?

Looks like we're not running into this intermittent anymore after I removed the suppression, so I'll resolve it.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(kwright)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: