Open Bug 1797558 Opened 2 years ago Updated 1 year ago

Navigation from extension page to another extension page results in a unusable tab on Fenix (and RefBrow)

Categories

(GeckoView :: Extensions, defect, P1)

All
Android

Tracking

(firefox106 wontfix, firefox107 affected, firefox108 affected, firefox109 affected, firefox110 affected, firefox111 affected)

Tracking Status
firefox106 --- wontfix
firefox107 --- affected
firefox108 --- affected
firefox109 --- affected
firefox110 --- affected
firefox111 --- affected

People

(Reporter: robwu, Assigned: boek)

References

(Depends on 1 open bug)

Details

(Whiteboard: [geckoview:m109][geckoview:m111][geckoview:m112][geckoview:m113][bugz:In Progress][addons-jira])

Attachments

(3 files)

STR:

  1. Open uBlock Origin's panel (install if needed)
  2. Click on Dashboard.
  3. Click on the eye icon, e.g. at uBlock Filters. This will trigger a navigation to a page within the extension a moz-extension:-URL
    4 .Observe that the tab is created but cannot be interacted with, and still renders whatever what was previously displayed (e.g. the displayed tab from step 3).
  4. Attach the debugger to that new tab via about:debugging on desktop.
  5. Switch to Inspect element, and discover that suddenly the bug disappears - content is painted again, and can be interacted with!
    • THIS IS A STRANGE OBSERVATION that may potentially shed some light on the underlying issue.
    • More info: hitting the "side effect free" evaluation mode of the devtools restores the connection to the hidden tab. $('body') is not side-effect free, so the devtools doesn't run it. document.body is side-effect free, and immediately triggers a preview + highlights the page. That results in the page being painted.

Versions reproduced on:

  • org.mozilla.reference.browser: Firefox Beta 106.0b3 (Build #2015907027), e80324116+; AC: 106.0.2, 50dd304103, GV: 106.0-20220929195234
  • org.mozilla.firefox_beta: Reference Browser 1.0.2231 (Build #42140610 GV: 105.0a1-20220731065953)

Does NOT happen on:

  • org.mozilla.fenix: Firefox Nightly 106.0a1 (Build #2015901675), b43243685+; AC: 106.0.20220901143315, f54031ee15; GV: 106.0a1-20220901095452
    • In the JS console, I can see that WebExtensionPolicy.useRemoteWebExtensions is true. Aka extensions.webextensions.remote=true.
    • When I flip the preference to false, the bug also reproduces on Nightly.

Because of the last observation, I am marking it as a dependency of bug 1535365. However, I still believe that it's worth investigating this bug, to get a better understanding of why this bug occurs, in the hope that it may also fix this issue with the same symptoms, but more difficult to reproduce: https://github.com/mozilla-mobile/fenix/issues/24563

Arturo, you mentioned being able to reproduce this and saw some interesting logs. Could you attach the logs that you deem relevant, along with any relevant analysis that you can come up with?

Flags: needinfo?(amejiamarmol)

Since this bug affects both Fenix and RB, it sounds like a GV (or AC) bug.

In the JS console, I can see that WebExtensionPolicy.useRemoteWebExtensions is true. Aka extensions.webextensions.remote=true.

Is the extension process enabled in Fenix or GV Nightly?? Did we leave it enabled in Nightly to continue testing it while we wait for MV3?

Since Android MV3 won't be ready until next year, I think we should disable the extension process in all channels. We want Nightly users to test the configuration that we're shipping.

Severity: -- → S2
Component: WebExtensions → Extensions
Priority: -- → P2
Product: Fenix → GeckoView
Whiteboard: [geckoview:m109?]

(In reply to Chris Peterson [:cpeterson] from comment #2)

Since this bug affects both Fenix and RB, it sounds like a GV (or AC) bug.

In the JS console, I can see that WebExtensionPolicy.useRemoteWebExtensions is true. Aka extensions.webextensions.remote=true.

Is the extension process enabled in Fenix or GV Nightly?? Did we leave it enabled in Nightly to continue testing it while we wait for MV3?

It's off by default on GV. The pref might have been a leftover from a previous debugging session.
I mentioned the pref and all details/observations, with the hope that the ability to observe different behaviors can help with pinpointing the issue and fix.

(In reply to Rob Wu [:robwu] from comment #1)

Arturo, you mentioned being able to reproduce this and saw some interesting logs. Could you attach the logs that you deem relevant, along with any relevant analysis that you can come up with?

Sure, after opening a new tab and switch a couple of times between the new the old tab, the old tab became responsive again.

Logs:

Not recording metadata (canAddUri=false) for: moz-extension://d50c45a6-02ef-41a1-8686-7e458d541b14/asset-viewer.html?url=ublock-filters
2022-10-24 17:04:37.853 26451-26795 lla.fenix.debug         org.mozilla.fenix.debug              I  Explicit concurrent copying GC freed 231988(8713KB) AllocSpace objects, 17(1168KB) LOS objects, 75% free, 11MB/45MB, paused 61us,52us total 71.819ms
2022-10-24 17:04:37.866 26451-26462 StrictMode              org.mozilla.fenix.debug              D  StrictMode policy violation: android.os.strictmode.LeakedClosableViolation: A resource was acquired at attached stack trace but never released. See java.io.Closeable for information on avoiding resource leaks.
                                                                                                    	at android.os.StrictMode$AndroidCloseGuardReporter.report(StrictMode.java:1986)
                                                                                                    	at dalvik.system.CloseGuard.warnIfOpen(CloseGuard.java:336)
                                                                                                    	at android.view.Surface.finalize(Surface.java:310)
                                                                                                    	at java.lang.Daemons$FinalizerDaemon.doFinalize(Daemons.java:319)
                                                                                                    	at java.lang.Daemons$FinalizerDaemon.runInternal(Daemons.java:306)
                                                                                                    	at java.lang.Daemons$Daemon.run(Daemons.java:140)
                                                                                                    	at java.lang.Thread.run(Thread.java:1012)
                                                                                                    Caused by: java.lang.Throwable: Explicit termination method 'Surface.release' not called
                                                                                                    	at dalvik.system.CloseGuard.openWithCallSite(CloseGuard.java:288)
                                                                                                    	at dalvik.system.CloseGuard.open(CloseGuard.java:257)
                                                                                                    	at android.view.Surface.setNativeObjectLocked(Surface.java:760)
                                                                                                    	at android.view.Surface.updateNativeObject(Surface.java:599)
                                                                                                    	at android.view.Surface.copyFrom(Surface.java:626)
                                                                                                    	at android.view.Surface.<init>(Surface.java:268)
                                                                                                    	at org.mozilla.gecko.gfx.SurfaceControlManager.getChildSurface(SurfaceControlManager.java:76)
                                                                                                    	at org.mozilla.geckoview.GeckoSession$Compositor.syncResumeResizeCompositor(Native Method)
                                                                                                    	at org.mozilla.geckoview.GeckoSession.onSurfaceChanged(GeckoSession.java:6086)
                                                                                                    	at org.mozilla.geckoview.GeckoDisplay.surfaceChanged(GeckoDisplay.java:149)
                                                                                                    	at org.mozilla.geckoview.GeckoView$Display.acquire(GeckoView.java:121)
                                                                                                    	at org.mozilla.geckoview.GeckoView.setSession(GeckoView.java:525)
                                                                                                    	at mozilla.components.browser.engine.gecko.GeckoEngineView.render(GeckoEngineView.kt:116)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter.renderTab(EngineViewPresenter.kt:89)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter.onTabToRender(EngineViewPresenter.kt:62)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter.access$onTabToRender(EngineViewPresenter.kt:24)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter$start$1$3.emit(EngineViewPresenter.kt:46)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter$start$1$3.emit(EngineViewPresenter.kt:46)
                                                                                                    	at mozilla.components.support.ktx.kotlinx.coroutines.flow.FlowKt$ifAnyChanged$$inlined$filter$1$2.emit(Emitters.kt:223)
                                                                                                    	at mozilla.components.feature.session.engine.EngineViewPresenter$start$1$invokeSuspend$$inlined$map$1$2.emit(Emitters.kt:223)
                                                                                                    	at kotlinx.coroutines.flow.FlowKt__ChannelsKt.emitAllImpl$FlowKt__ChannelsKt(Channels.kt:62)
                                                                                                    	at kotlinx.coroutines.flow.FlowKt__ChannelsKt.access$emitAllImpl$FlowKt__ChannelsKt(Channels.kt:1)
                                                                                                    	at kotlinx.coroutines.flow.FlowKt__ChannelsKt$emitAllImpl$1.invokeSuspend(Unknown Source:14)
                                                                                                    	at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
                                                                                                    	at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
                                                                                                    	at android.os.Handler.handleCallback(Handler.java:942)
                                                                                                    	at android.os.Handler.dispatchMessage(Handler.java:99)
                                                                                                    	at android.os.Looper.loopOnce(Looper.java:201)
                                                                                                    	at android.os.Looper.loop(Looper.java:288)
                                                                                                    	at android.app.ActivityThread.main(ActivityThread.java:7898)
                                                                                                    	at java.lang.reflect.Method.invoke(Native Method)
                                                                                                    	at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:548)
 ```                                                                                                   	at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:936)
Flags: needinfo?(amejiamarmol)

Tagging jnicol, as part of the stack we have gecko.gfx specifically org.mozilla.gecko.gfx.SurfaceControlManager.getChildSurface(SurfaceControlManager.java:76).

Do you have some context around what could be happening?

Thanks in advance!

Flags: needinfo?(jnicol)

That stack trace means we are not explicitly releasing the Surface we created using SurfaceControlManager. I should fix that, but it should be harmless. If I disable our usage of SurfaceControlManager altogether I can still reproduce this bug, so I don't think it's relevant. I imagine it's just showing up at the logcat at this time because we set our remaining reference to the surface to null in SyncPauseCompositor, which gets called for the tab we are switching away from just before we switch to the new tab.

Flags: needinfo?(jnicol)

When adding logging to verify the above, I can see that SyncResumeResizeCompositor() never gets called for the tab we are switching to. That's because mCompositorReady is false here. This means we are never receiving the COMPOSITOR_CONTROLLER_OPEN message from gecko. This happens automatically after gecko creates a compositor, which happens automatically whenever nsWindow::GetWindowRenderer() is called. But I can see that never happens for this webextension window.

Is there anything special about this type of window that would cause gecko not to paint to it as usual?

I can also reproduce this, with the exception of the behavior described when the debugger is attached; for me, it did not load the page corectly, after attaching.
I can also confirm that having useRemoteWebExtensions fixes the problem

I cannot see anything wrong with the tab load, but the content is not painted.

One additional thing I can add to all the observations made here is that when activating Talkback, it reads the correct content as if the page was actually loaded correctly.

(In reply to Jamie Nicol [:jnicol] from comment #7)

Is there anything special about this type of window that would cause gecko not to paint to it as usual?

The "window" here is a regular content window. The only special thing about it is that it is loaded in the main process instead of a content process.
More observations:

  • x Reloading the tab does NOT fix the issue.
  • x Copying the URL and pasting does NOT fix the issue.
  • x Typing a different URL (e.g. dashboard.html instead of asset-viewer.html) does NOT fix the issue.
  • v Navigating elsewhere (to a website) and then navigating back does "fix" the issue.

(In reply to Mihai Adrian Carare from comment #8)

I can also reproduce this, with the exception of the behavior described when the debugger is attached; for me, it did not load the page corectly, after attaching.

It is not just attaching the debugger, but also activating the inspector or triggering the "side effect free" evaluation, e.g. typing document.body in the console.

One additional thing I can add to all the observations made here is that when activating Talkback, it reads the correct content as if the page was actually loaded correctly.

I can confirm - enabling Talkback in Android OS > Settings > Accessibility, followed by running the STR in Firefox triggers the bug, and after tapping inside the content area, Talkback will select the are and read the tapped content from the non-painted tab whenever I tap. The tab is not painted.

TalkBack working correctly makes sense, if the problem is just that nothing is triggering a paint but the page is still loaded.

Likewise that plausibly explains why the devtools inspector can fix it, because presumably that does trigger a paint.

It does sound plausible that loading a window from the main process behaves differently than content, and for some reason we don't paint. My next step would be to use the debugger to find out where nsWindow::GetWindowRenderer gets called from for "normal" tabs, and see where that differs for this tab.

I am being affected by the same bug in Firefox Nightly for Android.

The issue is happening since a couple of weeks (so already several builds ago) and when entering NoScript or Dark Reader options.

(In reply to Jamie Nicol [:jnicol] from comment #10)

My next step would be to use the debugger to find out where nsWindow::GetWindowRenderer gets called from for "normal" tabs, and see where that differs for this tab.

Jamie, are you working on this bug? Or was your comment a debugging suggestion for whoever will be working on this bug?

(In reply to Jamie Nicol [:jnicol] from comment #6)

That stack trace means we are not explicitly releasing the Surface we created using SurfaceControlManager. I should fix that, but it should be harmless.

Should I file a new bug for that Surface LeakedClosableViolation warning?

Flags: needinfo?(jnicol)
Priority: P2 → P1

(In reply to Chris Peterson [:cpeterson] from comment #12)

(In reply to Jamie Nicol [:jnicol] from comment #10)

My next step would be to use the debugger to find out where nsWindow::GetWindowRenderer gets called from for "normal" tabs, and see where that differs for this tab.

Jamie, are you working on this bug? Or was your comment a debugging suggestion for whoever will be working on this bug?

No I'm not working on this, that was a suggestion.

That stack trace means we are not explicitly releasing the Surface we created using SurfaceControlManager. I should fix that, but it should be harmless.

Should I file a new bug for that Surface LeakedClosableViolation warning?

Yes that would be great, thanks.

Flags: needinfo?(jnicol)

(In reply to Jamie Nicol [:jnicol] from comment #13)

That stack trace means we are not explicitly releasing the Surface we created using SurfaceControlManager. I should fix that, but it should be harmless.

Should I file a new bug for that Surface LeakedClosableViolation warning?

I filed bug 1800541.

We were able to reproduce this issue inside the GeckoView sample application, this should rule out Android Components. Going to spend some time today trying to debug nsWindow::GetWindowRenderer to see if we can track down what is different about these tabs.

(In reply to Jeff Boek [:boek] from comment #15)

Going to spend some time today trying to debug nsWindow::GetWindowRenderer to see if we can track down what is different about these tabs.

Tentatively assigning this bug to Jeff since he is investigating it.

Assignee: nobody → jboek
Whiteboard: [geckoview:m109?] → [geckoview:m109]
It seems like uploading an attachment didn't fail. It just posted it anyway .
Attached file getWindowRenderer - mozilla.org (deleted) —
It seems like uploading an attachment didn't fail. It just posted it anyway.

Here is what I've found so far

When accessing a tab pointed to something like https://mozilla.org I get what looks like a normal looking order of calls:
(Sorry for the gist links, attachments on Bugzilla were failing to upload for me)
https://gist.github.com/boek/f9ed619bae4b10c0e37fc5c47eea54f7

When the uBlock origin dashboard page is open, it seems to be looping with this stack infinitely. I'm not sure if this is relevant but it seemed odd:
https://gist.github.com/boek/fd26e5cb38f73a1ea150559afb376ee7

When opening the broken tab or when you switch tabs and navigate back nsWindow::GetWindowRenderer doesn't ever seem to be called.

This is my first time debugging native code so it is possible I may not have something setup correctly.

Assignee: jboek → nobody
Whiteboard: [geckoview:m109] → [geckoview:m109?]

When accessing a tab pointed to something like https://mozilla.org I get what looks like a normal looking order of calls:
(Sorry for the gist links, attachments on Bugzilla were failing to upload for me)
https://gist.github.com/boek/f9ed619bae4b10c0e37fc5c47eea54f7

We can see here the paint is called from the refresh driver tick here. So I think the interesting place to break the debugger and get the callstack is actually when mViewManagerFlushIsPending gets set true here. It would be interesting to see whether that occurs for the broken extension tab or not.

When the uBlock origin dashboard page is open, it seems to be looping with this stack infinitely. I'm not sure if this is relevant but it seemed odd:
https://gist.github.com/boek/fd26e5cb38f73a1ea150559afb376ee7

I think this just means the page has an animated SVG or something like that on it.

(In reply to Jamie Nicol [:jnicol] from comment #20)

We can see here the paint is called from the refresh driver tick here. So I think the interesting place to break the debugger and get the callstack is actually when mViewManagerFlushIsPending gets set true here. It would be interesting to see whether that occurs for the broken extension tab or not.

I see, interesting. it looks like when switching from mozilla.org to the broken tab it does get set to true once: https://gist.github.com/boek/cbb4dc1094dcec721c565306818253fc
Switching from the broken tab back to mozilla.org: it is also only called once but the trace is significantly different: https://gist.github.com/boek/d33681986daae74af4f301ca2113f807

I'll keep poking around to see if I can see anything else

When the uBlock origin dashboard page is open, it seems to be looping with this stack infinitely. I'm not sure if this is relevant but it seemed odd:
https://gist.github.com/boek/fd26e5cb38f73a1ea150559afb376ee7

I think this just means the page has an animated SVG or something like that on it.
Oooh, good to know. I tried to reproduce on a few pages with just SVGs but I wasn't getting the same behavior so I wasn't sure why it was happening.

So I think I found out why this is broken, but what I am not sure about is how.

When creating a new working tab it will eventually create a layout manager
https://searchfox.org/mozilla-central/rev/c1180ea13e73eb985a49b15c0d90e977a1aa919c/widget/android/nsWindow.cpp#2386

Here is the stack trace:
https://gist.github.com/boek/24ba936c44829286261de5ba393e40ad

When creating a new broken tab the furthest up the stack that it reaches is

mozilla::dom::binding_detail::GenericGetter<…>(JSContext *, unsigned int, JS::Value *) BindingUtils.cpp:3169

where then it repeatedly bounces around the frames marked with ✅, never reaching the frame above

mozilla::dom::FrameLoader_Binding::get_loadContext(JSContext *, JS::Handle<…>, void *, JSJitGetterCallArgs) FrameLoaderBinding.cpp:338

Following on from Jeff's good work in comment 22 finding the stacktrace for when the compositor is created in the working case: I'm not sure if there's a clever way to follow a callstack from C++ to javascript. But naively we can see the places in javascript which could call FrameLoader_Binding::get_loadContext() are quite few: https://searchfox.org/mozilla-central/search?q=symbol:%23loadContext&redirect=false

And with some dump()s we can narrow it down to here.

With the devtools debugger I can see the javascript callstack is:

construct (chrome://global/content/elements/browser-custom-element.js#983)
connectedCallback (chrome://global/content/elements/browser-custom-element.js#324)
init (chrome://geckoview/content/geckoview.js#88)
startup (chrome://geckoview/content/geckoview.js#546)
<anonymous> (chrome://geckoview/content/geckoview.xhtml#12)

In geckoview.xhtml we can see this comes from a "DOMContentLoaded" event. Which matches the dom::Document::DispatchContentLoadedEvents() call in the native stacktrace just before it goes into javascript code. Neat!

Interestingly, when opening the broken tab, browser-custom-element.js' construct() function does in fact get called! But with a slightly different stacktrace:

construct (chrome://global/content/elements/browser-custom-element.js#983)
connectedCallback (chrome://global/content/elements/browser-custom-element.js#324)
init (chrome://geckoview/content/geckoview.js#89)
startup (chrome://geckoview/content/geckoview.js#548)
handleNewSession (resource://gre/modules/GeckoViewNavigation.jsm#379)
createContentWindow (resource://gre/modules/GeckoViewNavigation.jsm#414)

And crucially, this.isRemoteBrowser is false here, meaning we never call loadContext (which eventually leads to calling GetWindowRenderer() / CreateCompositor()).

I guess (though haven't verified) that the reason this worked when extensions.webextensions.remote was true is that isRemoteBrowser would therefore be true.

One last thing to note is that when loading the other ublock pages in the steps to reproduce this bug, (eg the panel and the dashboard), when construct() is initially called the browser remoteness is true, meaning we create the compositor successfully, and then immediately afterwards finishChangeRemoteness() is called to make it false. But with the broken page, it is false from the beginning.

This is getting really far outside my sphere of knowledge now.. perhaps somebody else might know:
a) Why we don't call this code (which leads to creating the compositor) when isRemoteBrowser is false.
b) Why for the broken page isRemoteBrowser is initially false, whereas for other pages it starts off as true and is later switched to false.

The reason isRemoteBrowser is false for the broken page (which unlike the working pages is opened via GeckoViewNavigation.jsm) is because of this line.

Priority: P2 → P1
Assignee: nobody → jboek
Whiteboard: [geckoview:m109?] → [geckoview:m109?][geckoview:m111?]
Whiteboard: [geckoview:m109?][geckoview:m111?] → [geckoview:m109?][geckoview:m111?][bugz:In Progress]
Whiteboard: [geckoview:m109?][geckoview:m111?][bugz:In Progress] → [geckoview:m109?][geckoview:m111?][bugz:Todo]
Whiteboard: [geckoview:m109?][geckoview:m111?][bugz:Todo] → [geckoview:m109?][geckoview:m111?][bugz:In Progress]
Whiteboard: [geckoview:m109?][geckoview:m111?][bugz:In Progress] → [geckoview:m109?][geckoview:m111?][geckoview:m112?][bugz:In Progress]
Attached image Black screen (deleted) —

Spent some more time debugging this today trying to invalidate paint explicitly by both changing the opacity and toggling the visibility in SetActive. This did seem to remove the last frame from drawn from the previous tab, but it didn't repaint the new tab.

Spent some more time today trying to

Whiteboard: [geckoview:m109?][geckoview:m111?][geckoview:m112?][bugz:In Progress] → [geckoview:m109?][geckoview:m111?][geckoview:m112?][geckoview:m113?][bugz:In Progress]

Post sprint-planning: The question mark is for sprint planning to consider if we want to add it to the sprint or not. Removing the ? because we have agreed to continue working on this.

Whiteboard: [geckoview:m109?][geckoview:m111?][geckoview:m112?][geckoview:m113?][bugz:In Progress] → [geckoview:m109][geckoview:m111][geckoview:m112][geckoview:m113][bugz:In Progress]

We'll need to fix this bug for Android Addons GA.

Whiteboard: [geckoview:m109][geckoview:m111][geckoview:m112][geckoview:m113][bugz:In Progress] → [geckoview:m109][geckoview:m111][geckoview:m112][geckoview:m113][bugz:In Progress][addons-jira]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: