Closed Bug 1735972 Opened 3 years ago Closed 3 years ago

Datachannel hangs on close - race condition causing iloop with a lock held.

Tracking

()

Status:

RESOLVED FIXED

Milestone:

95 Branch

Tracking Flags:

Tracking

Status

firefox-esr78

---

wontfix

firefox-esr91

---

fixed

firefox93

---

wontfix

firefox94

---

wontfix

firefox95

fixed

People

(Reporter: drno, Assigned: jesup)

References

(Regression)

Details

(Keywords: regression)

Attachments

(2 files)

datachannel_logs.tar.gz 3 years ago Nils Ohlmeier [:drno] (deleted), application/x-gzip		Details
Bug 1735972: reset current channel if that channel is closed r=bwc 3 years ago Randell Jesup [:jesup] (needinfo me) (deleted), text/x-phabricator-request	RyanVM : approval-mozilla-esr91+	Details

Nils Ohlmeier [:drno]

Reporter

Description

•

3 years ago

Attached file datachannel_logs.tar.gz (deleted) — Details

I'm able to reproduce content process hangs (100% CPU usage) when keeping https://test-network.callstats.io/ open for a few minutes.

Here is a Firefox profiler run for it https://share.firefox.dev/3BJLwLq

This reproduces for me on Nightly 95 and Release 93. Both running on macOS 11.6.

I was able to capture logs with DataChannel:5 for it. Unfortunately the logs get truncated. And as soon as I add 'sync' to the MOZ_LOG setting the bug no longer reproduces.

Byron Campen [:bwc]

Comment 1

•

3 years ago

So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. We might be able to fix the deadlock fairly easily, but other reentrancy bugs may remain.

jib, is there something in the spec I missed that would protect from this if we implemented it?

Flags: needinfo?(jib)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 2

•

3 years ago

[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.

We're not deadlocked, we're i-looping with a lock held.

It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.

Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.

There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.

Status: UNCONFIRMED → NEW

tracking-firefox95: --- → ?

Ever confirmed: true

Flags: needinfo?(docfaraday)

Keywords: regression

Regressed by: 1556795

Summary: Datachannel hangs on close → Datachannel hangs on close - race condition causing iloop with a lock held.

BMO Automation

Updated

•

3 years ago

Has Regression Range: --- → yes

BugBot [:suhaib / :marco/ :calixte]

Comment 3

•

3 years ago

Set release status flags based on info from the regressing bug 1556795

status-firefox93: --- → affected

status-firefox94: --- → affected

status-firefox95: --- → affected

status-firefox-esr78: --- → affected

status-firefox-esr91: --- → affected

Byron Campen [:bwc]

Comment 4

•

3 years ago

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #2)

[Tracking Requested - why for this release]: Can cause any page using datachannels to lock up, chewing a core.

We're not deadlocked, we're i-looping with a lock held.

It's i-looping with the lock held on Socket Thread. It's a regression from bug 1556795 - a rework of the channel allocation logic. We're in a threshold event, and call SendDeferredMessages(). We have the lock. We call Channels::Get(0) (i = 0, end = 1). We don't get a channel (returns nullptr), so we call UpdateCurrentStreamIndex(), which calls Channels::GetNextChannel(), but the channel array is empty, so we return nullptr, and we return 0 and loop. So i never gets above 0, and we don't find a channel 0, so we never stop looping.

Probably it's a race condition with a close that modified the channel array. Since end=1, GetCurrentStreamIndex() must have returned 1, and i was 1 for the first iteration.

There might still be a spec issue, but that would be a separate bug. We do dispatch an event for the close stuff.

Hmm, the profile did not have the socket thread, so did not see this happening. We're still going reentrant on main, which is a huge problem I think. I'll look into this.

Assignee: nobody → docfaraday

Flags: needinfo?(docfaraday)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 5

•

3 years ago

Attached file Bug 1735972: reset current channel if that channel is closed r=bwc (deleted) — Details

Depends on D128738

Byron Campen [:bwc]

Comment 6

•

3 years ago

Oh, you were working a patch for this already.

Assignee: docfaraday → rjesup

Jan-Ivar Bruaroey [:jib] (needinfo? me)

Comment 7

•

3 years ago

(In reply to Byron Campen [:bwc] from comment #1)

So, we're deadlocking because we're going reentrant while firing a "closed" event. Looking at webrtc-pc, I actually don't see any "queue a task" step that would protect from this. ... jib, is there something in the spec I missed that would protect from this if we implemented it?

If you mean the "close" event, the spec says: "the user agent MUST queue a task to run the following steps: ... Fire an event named close at channel.".

Flags: needinfo?(jib)

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

status-firefox93: affected → wontfix

status-firefox94: affected → wontfix

status-firefox-esr78: affected → wontfix

Byron Campen [:bwc]

Updated

•

3 years ago

Severity: -- → S2

Priority: -- → P2

Byron Campen [:bwc]

Updated

•

3 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1728616

Pulsebot

Comment 8

•

3 years ago

Pushed by rjesup@wgate.com: https://hg.mozilla.org/integration/autoland/rev/aadf149cc1c8 reset current channel if that channel is closed r=bwc

Iulian Moraru

Comment 9

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/aadf149cc1c8

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox95: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 95 Branch

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

tracking-firefox95: ? → +

Ryan VanderMeulen [:RyanVM]

Comment 10

•

3 years ago

Is this something you wanted to nominate for ESR91 approval?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 11

•

3 years ago

Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc

ESR Uplift Approval Request

If this is not a sec:{high,crit} bug, please state case for ESR consideration: Can cause a core to get chewed forever
User impact if declined: A datachannel may lock up randomly, chewing a core, slowing down the machine and wasting battery if mobile/laptop.
Fix Landed on Version: 95
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Extremely simple patch. Just resets the current stream value if it's been deleted
String or UUID changes made by this patch: None

Flags: needinfo?(rjesup)

Attachment #9246425 - Flags: approval-mozilla-esr91?

Ryan VanderMeulen [:RyanVM]

Comment 12

•

3 years ago

Comment on attachment 9246425 [details]
Bug 1735972: reset current channel if that channel is closed r=bwc

Approved for 91.4esr.

Attachment #9246425 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+

Ryan VanderMeulen [:RyanVM]

Comment 13

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-esr91/rev/5124138abc28

status-firefox-esr91: affected → fixed

You need to log in before you can comment on or make changes to this bug.