Closed Bug 1535699 Opened 5 years ago Closed 5 years ago

Properly clear the client (i.e. window) and controller (i.e. Service Worker) on a load info when performing a "process switch"/"process swap"

Categories

(Core :: DOM: Service Workers, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla68
Root Cause Poor Architecture
Fission Milestone M5
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- disabled
firefox66 --- wontfix
firefox67 blocking verified
firefox68 + disabled
firefox69 --- fixed
firefox70 --- fixed

People

(Reporter: perry, Assigned: perry)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [qa-triaged])

Crash Data

Attachments

(3 files)

This is possibly the same thing as bug 1530168.

The cause of the crash seems to be when a Service Worker registered under origin A attempts to intercept a request to origin A, but the Service Worker doesn't actually handle fetch events, and the request to origin A ultimately redirects to origin B. Sometimes (i.e. intermittently), the load info for the cross-origin redirect (i.e. the redirect channel to origin B), incorrectly inherits the controller that was marked on the request to origin A.

Assignee: nobody → perry
Priority: -- → P2
Assignee: perry → bugmail

Debugger detective work by :perry appears to indicate that when the process swap occurs, the "redirect mode" is getting corrupted from being MANUAL to FOLLOW, at least in the HttpChannelChild, which is causing the ClientChannelHelper to get confused. We believe this is due to FOLLOW being the default for all channels with SetRedirectMode being called to propagate the value on redirect. However, it appears likely that although an attempt is made to propagate this value in the parent process, that the value is not being propagated in the child process.

When doing pair debugging after this, we determined that in all likelihood the reason Perry was seeing this intermittently in his Service Worker parent-intercept patches was an initial simplification to randomly select the process to spawn the ServiceWorker in and it was selecting the extensions process, which necessitated a process swap for any opened pages. (Because extensions have their own process, the RemoteWorkerService is created in that process as well, and the RemoteWorker random process selection operates over all known parents without applying any remotetype pool clustering, so it's possible to spawn a ServiceWorker in a process it should not be in.) Perry is addressing the process selection logic in his patch-set for ServiceWorkerPrivate remoting and so I'm taking over this patch.

It's certain that bug 1530168 which was highly correlated to webextensions is experiencing this. In short the scenario there is that:

  • ServiceWorker interception happens in the child initially, so initial channel opening happens in the child process.
  • Interception gets reset and so a channel connection is opened.
  • Process swap logic triggers.
  • The bug with redirect mode not propagating happens. And no other logic cares about redirect mode being lost due to process swap. Process swap only happens for top-level tabs and for navigation, which means it's not exposed to the fetch() API. Process swap also only happens when there's a remoteType mismatch; we don't currently process swap based on origin because we're not ready for that.
Crash Signature: [@ mozilla::dom::ClientSource::SetController]

+500 crashes with the recently released 67 beta 3, marking as blocking for the release.

Priority: P2 → P1
Severity: normal → major

Andrew, did you make any progress on finding a solution for this top crasher? Thanks

Flags: needinfo?(bugmail)
Severity: major → critical

Andrew Sutherland told me that he's still working on tracking down the root cause here. Comment 1 is still valid.

Could we get a progress update please? Thanks

Andrew Sutherland told me he's going to have a patch for this up for review Soon^TM

Any news of a patch? We are building beta 14, so that leaves only 2 betas before RC week. Thanks

Flags: needinfo?(overholt)

Formulating a test to reproduce this has been more involved than expected, should have it and the fix soon. (This is a serious invariant violation and we really are lacking test coverage in this area. Our previous ability to trigger the bug was due to ServiceWorker remoted WorkerPrivate logic accidentally choosing the wrong process, which is not a realistic means of triggering the failure, especially because we don't want to enable that specific scenario ever.)

Flags: needinfo?(overholt)
Flags: needinfo?(bugmail)

Andrew, we are building beta 15 today, the last beta is on Friday and then this is RC week. Will we get a patch to at least mitigate some of crashes before we ship 67? Thanks

Flags: needinfo?(bugmail)
Assignee: bugmail → perry
Flags: needinfo?(bugmail)

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

Beta/Release Uplift Approval Request

  • User impact if declined: Intermittent crashes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Possibly risky because it touches network request logic, but the changes are correctly propagating requests' state (to my understanding).
  • String changes made/needed:
Attachment #9062293 - Flags: approval-mozilla-beta?

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

Patch is r+, the try build looks good so far and it fixes our top crasher for 67. It's not landed yet on mozilla-central but given that we need to go to build beta 16 now and that this is our last beta build that would allow us to evaluate the impact on crashes, I am taking it for uplift. Thanks!

Attachment #9062293 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Pushed by bugmail@asutherland.org:
https://hg.mozilla.org/integration/autoland/rev/e12876ee3835
Propagate redirect mode across cross-process redirects. r=asuth

Keywords: checkin-needed
Attached patch beta-patch.diff (deleted) — — Splinter Review

Beta/Release Uplift Approval Request

  • User impact if declined: Intermittent crashes.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): Possibly risky because it touches network request logic, but the changes are correctly propagating requests' state (to my understanding).
  • String changes made/needed:
Attachment #9062368 - Flags: approval-mozilla-beta?
Comment on attachment 9062368 [details] [diff] [review]
beta-patch.diff

Approved updated patch applying to beta branch correctly.
Attachment #9062368 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9062293 [details]
Bug 1535699 - Propagate redirect mode across cross-process redirects. r?asuth

unsetting approval on older uplift patch

Attachment #9062293 - Flags: approval-mozilla-beta+ → approval-mozilla-beta-
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

crash reports for this signature are continuing to come in in the early stages of b16 - the crash stack and crash reason seems to have remained the same like they were before the patch.

Flags: needinfo?(perry)

(In reply to [:philipp] from comment #20)

crash reports for this signature are continuing to come in in the early stages of b16 - the crash stack and crash reason seems to have remained the same like they were before the patch.

I think we'll need to add some diagnostic assertions. The cause of the crash is likely in another process, and the reported stack isn't providing enough information. I'm working on this now and keeping the needinfo open until then.

Clearly not fixed, reopening.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(Adjusting the subject because there are actually 2 problems going on, one of which we fixed, but which are very similar in nature.)

Restating my current understanding of what's happening:

  • :valentin is pursuing a fix that corrects a bug in nsHttpChannel where currently cache redirects are processed before performing process swap. https://treeherder.mozilla.org/#/jobs?repo=try&revision=219297ea3a0e045122f6abc668a14ceca9d8d2fd is the most recent fix attempt for this, but there was an assertion failure at https://treeherder.mozilla.org/#/jobs?repo=try&revision=219297ea3a0e045122f6abc668a14ceca9d8d2fd&selectedJob=245570125
    • This is the most correct thing, as the ServiceWorker child intercept logic and ClientChannelHelper logic really can't handle where the process swap is happening right now.
    • The fix will likely need to bake on trunk for a little bit before we'd consider uplift.
  • :perry is pursuing adapting the test from this bug to also be a test for this specific manifestation of the nsHttpChannel problem here.
  • The stop-gap solution is to disable the "browser.tabs.remote.useHTTPResponseProcessSelection" preference which was enabled in bug 1528360 initially targeted at Firefox 67.
    • By disabling this preference, cross process swaps will use a less-fancy mechanism based on sessionstore and which means that POST requests that get process swapped will lose their data and break.
    • We can turn this back on in the Necko dot release when we uplift valentin's fix.
    • I'll prepare the review request for this now. (Which is just flipping the pref and adding a comment.)
Summary: Properly clear the controller (i.e. Service Worker) on a load info for cross-origin redirect channels → Properly clear the client (i.e. window) and controller (i.e. Service Worker) on a load info when performing a "process switch"/"process swap"

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes in the situation where:
  • An HTTP(S) link is being opened in a non-"web" process. Usually this would mean a "file" process, a WebExtension "extension" process, or potentially the about:newtab fancy "privileged" process type.
  • The link being opened is to a cached redirect.
  • The target of the redirect matches an existing ServiceWorker registration.

Note that :perry is working on refining an automated test for this situation.

Also note that this has already been discussed with :nika and :pascalc. I've asked :nika for review as a sanity-checking step.

Also also note: Although I prepared this patch locally against beta, the phabricator UI seems to suggest it's against mozilla-central and the UI also seems to suggest lando will land it against the wrong place. Our phabricator and lando docs don't seem to cover this. I'll file a bug now, because it seems like they should.

  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: So, I've manually verified the pref flip avoids the crash scenario, but if others want to reproduce, see https://bugzilla.mozilla.org/show_bug.cgi?id=1547524#c0. I've uploaded the provided HTML snippet from there as an attachment, but per the STR, the file must be saved to disk and opened via file URI. So, like, on my system it might be file:///tmp/save-me-open-me-via-file-protocol.html
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is un-flipping a pref flip that has not yet shipped to release. (Bug 1528360 was that pref flip, currently riding the trains to a Fx67 release if we don't stop it.)
  • String changes made/needed: no
Attachment #9064111 - Flags: approval-mozilla-beta?
Flags: qe-verify+

(I filed bug 1550847 for the lando docs stuff.)

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Approved for our beta branch before our first merge, thanks!

Attachment #9064111 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: needinfo?(perry)
Comment on attachment 9062368 [details] [diff] [review]
beta-patch.diff

Clearing approval flags to get this off the uplift queries.
Attachment #9062368 - Flags: approval-mozilla-beta+
Attachment #9064111 - Flags: approval-mozilla-beta+

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Beta/Release Uplift Approval Request

Flags: needinfo?(bugmail)
Attachment #9064111 - Flags: approval-mozilla-release?

Comment on attachment 9064111 [details]
Bug 1535699 - Disable httpResponseProcessSelection on Fx67 for now. r=nika a=pascalc

Approved for 67 RC, thanks

Attachment #9064111 - Flags: approval-mozilla-release? → approval-mozilla-release+
Whiteboard: [qa-triaged]

Verified as fixed on Firefox 67.0 on Windows 10 x 64, Windows 7 x32, Mac OS X 10.14 and on Ubuntu 18.04 x64 following the steps from bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1547524#c0.

Build ID: 20190516215225
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0

Verified as fixed on Firefox 67.0 RC build2 on Windows 10 x 64, Windows 7 x32, Mac OS X 10.14 and on Ubuntu 18.04 x64.

Are we expecting to flip this for 68 too, or land and uplift the fix for bug 1551601 instead?

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(bugmail)

(In reply to Julien Cristau [:jcristau] from comment #39)

Are we expecting to flip this for 68 too, or land and uplift the fix for bug 1551601 instead?

My understanding and opinion is we want to land and uplift bug 1551601 instead.

Flags: needinfo?(bugmail)

some instability with this crash signature is still ongoing on nightly builds (20190528214841 & later) which have received the fix from bug 1551601.

Thank you for tracking this :philipp! I'm not seeing any obvious complicating factors, so this suggests something is still going on with the process switch.

:valentin, any thoughts from your recent fix about other edge cases that might exist as it relates to redirects and process switches?

The only additional permutation that's jumping to mind is a webextension using webRequest to meddle with the network. All of the more recent crashes seem to be happening on profiles involving uBlock0, avira, or a boat-load of jetpack(?!) extensions.

Options at this point seem to be:

  • Keep watching the crashes, see what correlations stack up.
  • Augment the channels to maintain a best-effort circular string buffer on document loads that tracks the general life-cycle of the channel so that when we hit this assertion we can use that as a crash string to shed some light on what's happening dynamically.

Note that we're planning to switch over to parent intercept soon on nightly, so this ends up being more about what we uplift to beta, I think.

Maybe this will help?
I'm using 68.0b5 Developer Edition and the crash happens every time I try to visit YouTube from my GroupSpeedDial add-on. Even if I disable all other add-ons.
Also it happens on all of my devices. But when I try to reproduce it on a clean profile, it won't crash.

It started in 67 version with more broken pages but with the released patches it got fixed on all of them, except YouTube.
Also the link has to be "http://www.youtube.com/", if I change it to https, it won't crash.
And it crashes only when you use left click (opening link using middle mouse click in new tab won't crash).

One of crashes with disabled add-ons:
https://crash-stats.mozilla.org/report/index/eeed3c48-2034-4889-8f3f-f15720190531

I'm author of the GroupSpeedDial add-on and I can tell you it's just a simple href link, there is no special click event.

<a tabindex="0" id="dial_10" draggable="true" href="http://www.youtube.com/" class=""><div><div><img src="blob:moz-extension://f969338e-1785-4f56-8dd7-a76f09540887/11b6bacd-ada0-419f-888c-90a8cd6c76b7"><span>YouTube</span></div><img draggable="false" src="blob:moz-extension://f969338e-1785-4f56-8dd7-a76f09540887/1cfe747c-27b7-49b4-bbcb-250f188d4279"></div></a>

(In reply to juraj.masiar from comment #43)

Maybe this will help?

Are you able to reproduce on nightly? The state of things as I understand them:

  • Release=67: Shouldn't crash because we disabled the process switch pref (httpResponseProcessSelection).
  • Beta=68: Should crash because the fix from bug 1551601 isn't there and the process switch pref isn't disabled.
  • Nightly=69: Shouldn't crash because of the fix from bug 1551601 but clearly at least a few people are experiencing the crash.

So the crash is unfortunately expected. (And we are planning to uplift bug 1551601 to beta to address this. The fix just took a little longer than expected and we need to give it a few days to have confidence that it won't make things worse for beta/dev-edition users.)

Your STR for 68 is consistent with the bug as we understand it. A clean profile won't trigger a crash because you don't have the ServiceWorker installed for youtube and/or you don't have the redirect in your http cache.

Bug 1554217 disables httpResponseProcessSelection in beta starting with next week's 68.0b7

I've just tested Nightly and It's crashing as well, using 69.0a1 (2019-06-02) (64-bit), with disabled addons (except mine).
My crash:
bp-33debae5-5880-4736-811c-57a460190603

Clearing ni while I work towards a better fix for bug 1551601. Considering there are other bugs also broken with httpResponseProcessSelection, maybe it's OK to leave it off in beta until we are confident with a fix.

Flags: needinfo?(valentin.gosu)

(In reply to juraj.masiar from comment #46)

I've just tested Nightly and It's crashing as well, using 69.0a1 (2019-06-02) (64-bit), with disabled addons (except mine).
My crash:
bp-33debae5-5880-4736-811c-57a460190603

Thanks for the repro. Given the backout of bug 1551601 in bug 1555966 on May 31st, this is unfortunately now expected. Our current situation is then:

  • Release=67: Shouldn't crash because httpResponseProcessSelection was disabled in comment 36.
  • Beta=68: Shouldn't crash in 68.0b7 onwards because httpResponseProcessSelection was disabled in Bug 1554217.
  • Nightly=69: Will crash because httpResponseProcessSelection is still enabled and bug 1551601 got backed out.

The workaround right now would be to ensure that anyone using your extension has set the pref "browser.tabs.remote.useHTTPResponseProcessSelection" to false. Alternately, if you change how your extension triggers the page load, that might help too. (Like use the tabs API to explicitly open the new tab with an initial http/https URL.)

I think it's up to release management to determine whether we should flip the pref here on nightly too.

Adding RyanVM for Comment 8, since he is release owner for 69.

Flags: needinfo?(ryanvm)

I think we should disable it on Nightly until bug 1551601 is ready to re-land.

Flags: needinfo?(ryanvm)

I'll push the pref disabling to nightly and follow-up.

https://crash-stats.mozilla.org/report/index/01b5d706-dbd2-444f-87f0-09db70190624

We also experienced 3 tab crashes with the same signature [@ mozilla::dom::ClientSource::SetController ] while browsing random videos on youtube on Windows 10x64 with the latest FF Nightly. (2019-06-27)

Crash occurred when watching a longer video all while fission was active.

Fission Milestone: --- → M5

Bug 1551601 ended up landing and sticking and the crash rate commensurately dropped consistent with that.

Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?

Root cause analysis from :asuth: This was an emergent problem and the process swap implementation was a big pile of hacks. There was surprise complexity/interaction and general complexity from child intercept, which itself was architecturally a bad idea when originally implemented, except okay in the exact Firefox OS fission-like process model.

Root Cause: ? → Poor Architecture
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: