Closed Bug 1662808 Opened 4 years ago Closed 2 years ago

TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null (reftest.js)

Categories

(Remote Protocol :: Marionette, task, P3)

Default
task

Tracking

(firefox107 fixed)

RESOLVED FIXED
107 Branch
Tracking Status
firefox107 --- fixed

People

(Reporter: whimboo, Assigned: whimboo)

References

(Blocks 1 open bug)

Details

(Whiteboard: [marionette-fission-reserve] [fission:android:m2])

Attachments

(3 obsolete files)

A lot of commands are going through the JSWindowActor implementation. When the currently selected browsing context or top-level browsing context closes, the call into the child actor should not hang but return immediately.

Whiteboard: [marionette-fission-mvp] → [marionette-fission-mvp][complex]

Tentatively tracking marionette-fission-mvp bugs for Fission M7 Beta milestone. We would like Marionette tests to work with Fission before we ride the trains to Beta.

Fission Milestone: --- → M7

Tentatively assigning myself to this.
I will first come up with a proposal to make sure we agree on what should be implemented here.

From what I can see, all the commands are relying on JSWindowActor::sendQuery. Any pending call to sendQuery should throw when the window global is destroyed (which would happen for any navigation, tab closing etc...). So I don't think commands should hang (unless we hang when an exception happens).
I'll try to compare what happens for each command with the actor implementation and without the actor implementation when the context closes.

Assignee: nobody → jdescottes
Status: NEW → ASSIGNED

Yes, with the concept of actors we should not hang, but we will have to find a way to maybe re-run the command if the actor gets destroyed due to some remoteness changes. I assume there is a specific error thrown in such a case?

Some analysis and investigation for this bug.

Introduction

The intent is to find a good strategy for Marionette/Webdriver commands around the AbortError that will be thrown by JSWindowActor's sendQuery calls whenever a pair of JSWindowActors is destroyed. This will happen whenever the corresponding window global is destroyed, which can be triggered in many use cases: closing the tab, reloading the page (or the frame), navigating with or without process change etc...

The goal should be to reduce the number of errors when enabling marionette actors. An example try run can be found at https://treeherder.mozilla.org/#/jobs?repo=try&revision=c91d5773c8d01ca7b0d2bf27a4b14c262052b654 (recent central + enabling marionette actors pref). As we can see we have many failures here. We can call out 3 main types of failing tests:

  • marionette harness tests (Mn)
  • awsy tests (ab)
  • webplatform tests (Wd, WdH, Wr, wpt)

Mn & ab tests both fail with similar errors:

"AbortError: Actor 'MarionetteFrame' destroyed before query 'MarionetteFrameParent:XYZ' was resolved"

Webplatform tests show several different types of failures. Most frequent ones:

  • "TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null" in test_no_browsing_context tests
  • "Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>"

Retrying commands

The initial idea is that after an AbortError, we should be able to retry the command until we can get the correct response.

For instance if we think about WebDriver:FindElement. When an AbortError is received for this command, there should be no harm in simply retrying the query with the next pair of JSWindowActor.
If we get a new pair of JSWindowActors, we should be able to run the command again and get a final result.
If we can't get a new pair of Actors, it means we don't have a new global available and we should probably throw a meaningful error, most likely NoSuchWindowError. Apart from this use case, retrying sounds safe.

I made a first simple implementation of an actor proxy which implements this retry logic. Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e5d3f0c30d44a3da229de1ee85766d72e5cb08e .

This simple change fixed the Mn & ab test suites!

But it increased the number of failures in Wd tests, such as:

/webdriver/tests/new_window/new.py | test_no_browsing_context - setup error

Retrying commands with no side effects only

Inspecting the logs for the new errors mentioned above show that the issue is linked to a click on "#remove-parent" button triggered by a test fixture: https://searchfox.org/mozilla-central/rev/61728de8273c04fe2417c475fc0637e8b79210d7/testing/web-platform/tests/webdriver/tests/support/fixtures.py#230

Adding logs, we can see that this click triggers a destroy of the window global. With the approach retrying any command as soon as an AbortError was received, we ended up clicking again on the button. But even though we got an AbortError, the click() was still correctly performed, it's simply that it destroyed the window global in the process.

Based on this, I modified the proxy to avoid retrying all the commands that might modify the page:

  • clickElement
  • executeScript
  • sendKeysToElement
  • performActions
  • releaseActions
  • singleTap

All those commands might indirectly or directly destroy the window global. And if we get an AbortError, there is no way to know if we got the error before the child actor received the query? before it could run the action? after it executed the action?

We could probably come up with a complex solution to record which queries have been properly executed in the child actors, by storing information in a singleton in the Child actor's JSM (although this would only live as long as the process itself is alive). I think that coming up with such a solution is a bit out of scope, and it should be acceptable in a first step to retry only the actions which can't have a side effect.

Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=d41e92f0937f6f9b73c05865eb9d8010e06e686c

This fixed all of the errors "test_no_browsing_context - setup error".

Fix getActor inconsistencies

At this stage we only have a handful of permanent failures left:

  • /webdriver/tests/switch_to_parent_frame/switch.py
    • test_stale_element_from_iframe
    • Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>
  • /webdriver/tests/switch_to_frame/switch.py
    • test_frame_id_null
    • Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>
  • /webdriver/tests/release_actions/release.py
    • test_no_browsing_context
    • AssertionError: unknown error (500): TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null
  • /webdriver/tests/switch_to_frame/switch.py
    • test_no_browsing_context[None]
    • AssertionError: unknown error (500): TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null

For the remaining test_no_browsing_context issues, I think the problem comes from inconsistencies between assert.open() and getActor(), present in both releaseActions and switchToFrame. Basically we pass different options to assert.open and to getActor. Fixing them fixes those two tests.

Note that releaseActions is being modified by Henrik in Bug 1671372. I am not sure if switchToFrame needs to be updated similarly or not. For the purpose of this experiment, I am simply re-applying the arguments used in assert.open

Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f964047384dc31dc06f07e64053efcaa6d5c992

I believe the two StaleElementReferenceException errors are linked to Bug 1665718 - Clean up element store in parent process when a tab closes

There are also tests on Fission which no longer timeout, but that shouldn't be an issue.

Proposal

Based on the investigation above, I propose to add a wrapper/proxy invoked when calling getActor() that will automatically retry all actions that can't have any side effect on the page. I will attach corresponding patches on the bug so that you can take a look at some example implementation (they are not intended for review, I only wrote them to support my investigation).

We will need to discuss more about the long term plan for actions that can have a side effect.

There are also edge cases that would be very difficult to support. For instance, imagine an execute script that:

  • is heavily async
  • triggers a navigation
  • returns some information from the page AFTER the navigation

This kind of script is nearly impossible to support in a Fission world. I have no idea how much execute_script is used in the wild, so if you know that it is widely used to run very complex navigation scenarios, we should probably discuss about this as well.

Let me know if you have feedback on any of this!

Depends on D94086

jgraham also mentioned the use case of action sequences that trigger a navigation, pause, and then interact with the new page. If this is currently supported (to be verified), this would need a custom implementation for actors. We either need to have a way to resume an action sequence from the content process, or we should keep the sequence in the driver layer and only send individual actions to actors.

Looks like the "DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>" are already tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1671370

Depends on: 1671370
Attachment #9182570 - Attachment is obsolete: true
Attachment #9182571 - Attachment is obsolete: true

Please note that most of the discussion on this bug actually belongs to bug 1671347, which is about handling the AbortError by using a proxy. As such I requested to get the patch moved over to the other bug.

Here we should really only discuss the closing of a frame or window.

Comment on attachment 9182569 [details]
Bug 1662808 - [marionette] Add proxy for MarionetteFrameActor to retry queries after an AbortError

Revision D94086 was moved to bug 1671347. Setting attachment 9182569 [details] to obsolete.

Attachment #9182569 - Attachment is obsolete: true
No longer blocks: 1671347
Depends on: 1671347
Summary: Handle unexpected closing of active frame and window for Fission → Handle unexpected closing of active tab and window for Fission
Assignee: jdescottes → nobody
Status: ASSIGNED → NEW

We haven't seen problems with that yet. Lowering the priority for now.

Priority: P2 → P3

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #12)

We haven't seen problems with that yet. Lowering the priority for now.

Better to move it out of MVP into the reserved backlog.

Whiteboard: [marionette-fission-mvp][complex] → [marionette-fission-reserve]

Looks like that everything is working as expected, and nothing remains to be done here.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME

Actually this is still present and easy to fix. Here the error stack:

_a/test_minimized.py TestMinimizedTestCase.test_usercase - UnknownException: TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null
stacktrace:
    getMarionetteCommandsActorProxy/get/<@chrome://marionette/content/actors/MarionetteCommandsParent.jsm:332:29
Traceback (most recent call last):
  File "/Users/henrik/code/gecko/testing/marionette/harness/marionette_harness/marionette_test/testcases.py", line 214, in run
    testMethod()
  File "/Users/henrik/code/gecko/_a/test_minimized.py", line 112, in test_usercase
    self.marionette.get_url()
  File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 1476, in get_url
    return self._send_message("WebDriver:GetCurrentURL", key="value")
  File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/decorators.py", line 27, in _
    return func(*args, **kwargs)
  File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 629, in _send_message
    self._handle_error(err)
  File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 651, in _handle_error

Using browsingContextFn()?.currentWindowGlobal?.getActor(..) and checking for a valid actor here should be the way to go.

Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

Does this marionette-fission-reserve bug need to block shipping Fission MVP?

Fission Milestone: M7 → ?

For now, I am assuming marionette-fission-reserve bugs don't need to block shipping Fission MVP.

Fission Milestone: ? → Future

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)

_a/test_minimized.py TestMinimizedTestCase.test_usercase - UnknownException: TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null

This somewhat smells to be related to bug 1691348 comment 9. I will check again once bug 1691348 has been fixed.

As it looks like this happens more often for web-platform reftests (see bug 1684664). Maybe it would help to run these kind of tests to make it easier reproducible and verifiable.

I believe this is not just a Fission bug. I am seeing it pretty consistently in a web automation flow with both Firefox 86 and 88.0b2. Here's a log from this morning:

Marionette threw an error: TypeError: browsingContextFn().currentWindowGlobal is null
getMarionetteCommandsActorProxy/get/<@chrome://marionette/content/actors/MarionetteCommandsParent.jsm:327:29
GeckoDriver.prototype.takeScreenshot@chrome://marionette/content/driver.js:2392:26

It looks like the issue still exist on the trunk, and I had a suspicion that this might be related to null changes related to bug 1669961, so I am going to try reverting to the latest ESR build.

Nathan, do you have some lines of code that reproduce this issue? Or maybe provide an excerpt of the trace log? Thanks!

Flags: needinfo?(nbryant)

The lines of code are in the Elixir language with our in-house-built Webdriver client library. If it helps, I can tell you that this is from a call to /session/{sess_id}/moz/screenshot/full. I will see what can be done about the trace log but I have already deployed ESR to production, which is the only place we're seeing the issue.

Flags: needinfo?(nbryant)

Hi Henrik, he's a little more info. Our overall processing flow is as follows:

  1. Call /session/{sess_id}/element/{elem_id}/click, the website in question may return to this in various ways, in one case of interest, I think something happens with frames.
  2. We implement a "wait until element hidden" loop by repeatedly calling a find elements (/session/{sess_id}/elements) with a selector for a <div> that shows a "wait, processing" spinner. Sometimes this succeeds but this time it returned:
{
  "value": {
    "error": "no such window",
    "message": "Browsing context has been discarded",
    "stacktrace": "WebDriverError@chrome://marionette/content/error.js:181:5\nNoSuchWindowError@chrome://marionette/content/error.js:415:5\nassert.that/<@chrome://marionette/content/assert.js:460:13\nassert.open@chrome://marionette/content/assert.js:168:4\nGeckoDriver.prototype.findElements@chrome://marionette/content/driver.js:2100:10\ndespatch@chrome://marionette/content/server.js:297:40\nexecute@chrome://marionette/content/server.js:267:16\nonPacket/<@chrome://marionette/content/server.js:240:20\nonPacket@chrome://marionette/content/server.js:241:9\n_onJSONObjectReady/<@chrome://marionette/content/transport.js:504:20\n"
  }
}
  1. Error handling is lax on the previous call because it's just an "await until hidden" loop, so we attempt to find the next element we're interested in, which also fails: as you can see from 2), the context has been discarded.
  2. At this point we attempt a diagnostic screenshot. My observation in the past has been that geckodriver would return an empty base64 string in some cases where an iframe has vanished, and we could then rescue it by switching to the parent frame and retrying the screenshot. But something has changed, either on the site we are automating or in Firefox and this no longer works, screenshot fails on the first attempt with this TypeError from my previous comment.

Hope that helps.

(In reply to Nathan Bryant from comment #23)

  1. We implement a "wait until element hidden" loop by repeatedly calling a find elements (/session/{sess_id}/elements) with a selector for a <div> that shows a "wait, processing" spinner. Sometimes this succeeds but this time it returned:
{
  "value": {
    "error": "no such window",
    "message": "Browsing context has been discarded",

This error shouldn't happen anymore with Firefox 87 when it is caused by a command as executed after switching to a frame and the frame hasn't finished loading yet. See bug 1691348.

  1. At this point we attempt a diagnostic screenshot. My observation in the past has been that geckodriver would return an empty base64 string in some cases where an iframe has vanished, and we could then rescue it by switching to the parent frame and retrying the screenshot. But something has changed, either on the site we are automating or in Firefox and this no longer works, screenshot fails on the first attempt with this TypeError from my previous comment.

Yes, similar to the above bug. You would have to test with 87 and later. You mentioned 88.0b2 in your first comment, does that exact error still happen there?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #24)

Yes, similar to the above bug. You would have to test with 87 and later. You mentioned 88.0b2 in your first comment, does that exact error still happen there?

Yes, the TypeError continues to occur on 88.0b2. We recently upgraded from 86 to 88.0b2 and that did clear up some other null-handling issues, but not this one in particular.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #24)

This error shouldn't happen anymore with Firefox 87 when it is caused by a command as executed after switching to a frame and the frame hasn't finished loading yet. See bug 1691348.

The thing is, we don't explicitly switch frames here. Not with a webdriver command, anyway. I'm not sure what's happening with frames, but it's something that the target website is initiating.

(In reply to Nathan Bryant from comment #26)

The thing is, we don't explicitly switch frames here. Not with a webdriver command, anyway. I'm not sure what's happening with frames, but it's something that the target website is initiating.

Ok, so when you have steps for this or a trace log, which might also help, please file a new bug, and lets keep this one for the currentWindowGlobal being null. Thanks!

Summary: Handle unexpected closing of active tab and window for Fission → TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null
Fission Milestone: Future → ---

This can actually happen when we are trying to access the currentWindowGlobal too early for a newly created browsing context. With bug 1747359 fixed we will now wait until a tab to be used has been finished loading. And we are going to specifically check for an not yet existing currentWindowGlobal and ensure to wait for one. So this bug should no longer happen once bug 1747359 is fixed.

Depends on: 1747359

Nathan, could you maybe check again with your testcase if it looks better in the most recent Firefox Nightly build from today? If not, are you modifying the pageLoadStrategy capability so that WebDriver:ElementClick returns before the page has been finished loading? Thanks!

Flags: needinfo?(nbryant)

Redirect a needinfo that is pending on an inactive user to the triage owner.
:whimboo, could you please find another way to get the information or close the bug as INCOMPLETE if it is not actionable?

For more information, please visit auto_nag documentation.

Flags: needinfo?(nbryant) → needinfo?(hskupin)
Flags: needinfo?(hskupin)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #29)

Nathan, could you maybe check again with your testcase if it looks better in the most recent Firefox Nightly build from today? If not, are you modifying the pageLoadStrategy capability so that WebDriver:ElementClick returns before the page has been finished loading? Thanks!

:whimboo, for what it's worth, we have upgraded several Firefox versions and I have not seen this bug come up again in our production environment in a long, long time.

That's good to hear! Thanks for your feedback. Nevertheless we get still random failures for reftests (see the blocked bug 1684664). It's due to the following usage:

https://searchfox.org/mozilla-central/rev/973000acec0cbf7211e0fad89ca00c352aeb8384/remote/marionette/reftest.js#679-681

As such I would propose we limit this bug to only handle the reftest.js usage.

Status: REOPENED → NEW
Summary: TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null → TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null (reftest.js)

We are seeing a lot (~300 different tests) of these failures in WPT reftest on Android Fission builds. The error is:

TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null
loadTestUrl@chrome://remote/content/marionette/reftest.js:679:21

I am able to consistently reproduce the error locally when I run any of these tests. With --disable-fission flag the error does not reproduce and the test passes.

This blocks the Android Fission project currently, can we consider raising the priority of this bug to P1?

Flags: needinfo?(hskupin)

Here is a log from the terminal: https://paste.mozilla.org/fPYtO3Z8 Command used was ./mach wpt testing/web-platform/tests/acid/acid3/test.html

Blocks: 1714654

I discussed with owlish on Element and my suggestion is to simply [continue in the while loop if the currentWindowGlobal is null](https://searchfox.org/mozilla-central/rev/4f4c8e0e84d5a728244f1e820dda14e5cdb81e71/remote/marionette/reftest.js#675-682. To check for a valid navigation we would need a valid window as well. For me this check is fine given that we won't have the time to get bug 1669787 implemented.

Owlish will report back if that works given that it seems to be easily to get reproduced. Thanks!

Flags: needinfo?(hskupin) → needinfo?(bugzeeeeee)

I had a quick look myself and as it looks like the browsing context gets discarded when the page has been finished loading:

09-05 13:31:32.837  8345  8413 I Gecko   : 1662377492837	Marionette	DEBUG	Window URL does not match the expected URL "about:blank" !== "http://web-platform.test:8000/acid/acid3/reference.sub.html"
09-05 13:31:32.886  8307  8332 D GeckoViewSessionStore: observe browsing-context-discarded

There quite a few lines with the Marionette debug output but finally when the ACID3 page has finished loading I can see only a single browsing-context-discarded notification. If another browsing context gets created there should be an observer notification as well. Owlish I assume that there is some kind of bug in the GeckoView implementation?

Whiteboard: [marionette-fission-reserve] → [marionette-fission-reserve] [fission:android:m2]
Depends on: 1794553
Flags: needinfo?(bugzeeeeee)

Owlish, are you still seeing this issue with Fission enabled and a recent mozilla-central checkout (fix for bug 1794553 included)?

Flags: needinfo?(bugzeeeeee)

Chris, I would like to get some feedback from the GeckoView team if it's working now. At least I cannot see the problem anymore when testing locally. Maybe you could forward the needinfo to the right person? Maybe Owlish is covered with other work right now? Thanks!

Flags: needinfo?(cpeterson)

Owlish will start a try run to see if she can still reproduce the error.

Flags: needinfo?(cpeterson)

I ran try build, and I am not seeing the failure in the reftests anymore. I am closing this for now.

Status: NEW → RESOLVED
Closed: 4 years ago2 years ago
Flags: needinfo?(bugzeeeeee)
Resolution: --- → WORKSFORME

That's great to hear. Thanks a lot for the feedback. Given that bug 1794553 was the only code change in that area I'm going to mark the bug as fixed instead.

Assignee: nobody → hskupin
Resolution: WORKSFORME → FIXED
Target Milestone: --- → 107 Branch
Product: Testing → Remote Protocol
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: