Closed Bug 1683404 Opened 4 years ago Closed 3 years ago

ThreadSanitizer: data race [@ nsTimerImpl::Shutdown] vs. [@ nsTimerImpl::CancelImpl]

Tracking

()

Status:

RESOLVED FIXED

Milestone:

91 Branch

Tracking Flags:

Tracking

Status

firefox91

---

fixed

People

(Reporter: Gankra, Assigned: KrisWright)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

tsan-backtrace-dec-2020.txt 4 years ago Aria Beingessner [:Gankra] (deleted), text/plain		Details
Bug 1683404 - supress issue found while enabling full tsan shutdown. r?decoder 4 years ago Aria Beingessner [:Gankra] (deleted), text/x-phabricator-request		Details
Bug 1683404 - Wrap the timer thread behind a mutex. 3 years ago Kris Wright :KrisWright (deleted), text/x-phabricator-request		Details
Bug 1683404 - Remove tsan suppressions for nsTimerImpl. 3 years ago Kris Wright :KrisWright (deleted), text/x-phabricator-request		Details

Aria Beingessner [:Gankra]

Reporter

Description

•

4 years ago

Attached file tsan-backtrace-dec-2020.txt (deleted) — Details

Found while testing out turning on the full shutdown code for tsan (that doesn't just exit). Presumably not a security issue since this code is only for testing, and doesn't run in production?

Looks like gThread needs to be atomically managed.

Permalinks to problematic lines:

https://searchfox.org/mozilla-central/rev/58e955b8005b4845c42e874a6b9aa5b223ef0d82/xpcom/threads/nsTimerImpl.cpp#288

https://searchfox.org/mozilla-central/rev/58e955b8005b4845c42e874a6b9aa5b223ef0d82/xpcom/threads/nsTimerImpl.cpp#402

General information about TSan reports

Why fix races?

Data races are undefined behavior and can cause crashes as well as correctness issues. Compiler optimizations can cause racy code to have unpredictable and hard-to-reproduce behavior.

Rating

If you think this race can cause crashes or correctness issues, it would be great to rate the bug appropriately as P1/P2 and/or indicating this in the bug. This makes it a lot easier for us to assess the actual impact that these reports make and if they are helpful to you.

False Positives / Benign Races

Typically, races reported by TSan are not false positives [1], but it is possible that the race is benign. Even in this case it would be nice to come up with a fix if it is easily doable and does not regress performance. Every race that we cannot fix will have to remain on the suppression list and slows down the overall TSan performance. Also note that seemingly benign races can possibly be harmful (also depending on the compiler, optimizations and the architecture) [2][3].

[1] One major exception is the involvement of uninstrumented code from third-party libraries.
[2] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
[3] How to miscompile programs with "benign" data races: https://www.usenix.org/legacy/events/hotpar11/tech/final_files/Boehm.pdf
Suppressing unfixable races

If the bug cannot be fixed, then a runtime suppression needs to be added in mozglue/build/TsanOptions.cpp. The suppressions match on the full stack, so it should be picked such that it is unique to this particular race. The bug number of this bug should also be included so we have some documentation on why this suppression was added.

Andrew McCreight [:mccr8]

Comment 1

•

4 years ago

I think I've seen a TSan bug before for this timer shutdown race, but maybe not.

We do run this code in non-debug builds in the parent process.

Andrew McCreight [:mccr8]

Comment 2

•

4 years ago

I think bug 1647417 is the other bug I was thinking of.

Aria Beingessner [:Gankra]

Reporter

Comment 4

•

4 years ago

Good catch, thanks!

Aria Beingessner [:Gankra]

Reporter

Comment 5

•

4 years ago

Attached file Bug 1683404 - supress issue found while enabling full tsan shutdown. r?decoder (deleted) — Details

Depends on D100267

Phabricator Automation

Updated

•

4 years ago

Assignee: nobody → a.beingessner

Status: NEW → ASSIGNED

Aria Beingessner [:Gankra]

Reporter

Updated

•

4 years ago

Keywords: leave-open

Pulsebot

Comment 6

•

4 years ago

Pushed by abeingessner@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cc7b2b5c39b3 supress issue found while enabling full tsan shutdown. r=decoder

Bogdan Tara[:bogdan_tara | bogdant]

Comment 7

•

4 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/cc7b2b5c39b3

Aria Beingessner [:Gankra]

Reporter

Updated

•

4 years ago

Assignee: a.beingessner → nobody

Status: ASSIGNED → NEW

Kris Wright :KrisWright

Assignee

Comment 8

•

3 years ago

It looks like this race may be involved in some recent shutdown hangs. I'm preparing a patch for this, but given how we use gThread any fix is nontrivial. I'm trying to wrap the thread itself in a threadsafe event target, but I'm not sure how to subvert races late into shutdown. Maybe this theoretical wrapper class should shut down gThread with timer shutdown but the wrapper itself shouldn't be dereferenced until after we shut down nsThreadManager in case of late timer checks.

Assignee: nobody → kwright

Kris Wright :KrisWright

Assignee

Comment 9

•

3 years ago

Attached file Bug 1683404 - Wrap the timer thread behind a mutex. (deleted) — Details

=Note that this work needs to be rebased on top of Bug 1711090

This particular race is a tricky one - there's no perfect solution to protecting the timer thread from being called in cancel while being dereferenced. This ensures that we won't run into that problem by locking all of our TimerThread calls behind a mutex inside a wrapper class. Then we hold onto the wrapper class until after we shutdown nsThreadManager, in which case no background thread pools should be active anymore to call nsTimerImpl::Cancel.

For reference, the worst-case scenario happens when we dereference gThread between these two calls, in which case we will get stuck on a dereferenced mutex. By putting the check and the actual call into gThread behind a mutex maybe we can prevent this issue.

This may not be the most elegant solution, so if there is a better way to fix this then it is more than welcome.

Kris Wright :KrisWright

Assignee

Comment 10

•

3 years ago

Attached file Bug 1683404 - Remove tsan suppressions for nsTimerImpl. (deleted) — Details

Try push shows that the issue is fixed with the above patch in the stack: https://treeherder.mozilla.org/jobs?repo=try&revision=a9e9135464b2750a8b6fd2a19408e3537b39dbe2

Phabricator Automation

Updated

•

3 years ago

Attachment #9222548 - Attachment description: Bug 1683404 - (WIP) Wrap the timer thread behind a mutex. → Bug 1683404 - Wrap the timer thread behind a mutex.

Pulsebot

Comment 11

•

3 years ago

Pushed by kwright@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/339b1169ea97 Wrap the timer thread behind a mutex. r=nika https://hg.mozilla.org/integration/autoland/rev/bc3f26b4edba Remove tsan suppressions for nsTimerImpl. r=Gankra

Marian-Vasile Laza

Comment 12

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/339b1169ea97
https://hg.mozilla.org/mozilla-central/rev/bc3f26b4edba

Julien Cristau [:jcristau]

Comment 13

•

3 years ago

Is there more work to do here, or can we remove the leave-open and resolve?

Flags: needinfo?(kwright)

Kris Wright :KrisWright

Assignee

Comment 14

•

3 years ago

(In reply to Julien Cristau [:jcristau] from comment #13)

Is there more work to do here, or can we remove the leave-open and resolve?

Looks like we're not running into this intermittent anymore after I removed the suppression, so I'll resolve it.

Status: NEW → RESOLVED

Closed: 3 years ago

Flags: needinfo?(kwright)

Keywords: leave-open

Resolution: --- → FIXED

Julien Cristau [:jcristau]

Updated

•

3 years ago

status-firefox91: --- → fixed

Target Milestone: --- → 91 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

Blocks: 1704391

You need to log in before you can comment on or make changes to this bug.