TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null (reftest.js)
Categories
(Remote Protocol :: Marionette, task, P3)
Tracking
(firefox107 fixed)
Tracking | Status | |
---|---|---|
firefox107 | --- | fixed |
People
(Reporter: whimboo, Assigned: whimboo)
References
(Blocks 1 open bug)
Details
(Whiteboard: [marionette-fission-reserve] [fission:android:m2])
Attachments
(3 obsolete files)
A lot of commands are going through the JSWindowActor implementation. When the currently selected browsing context or top-level browsing context closes, the call into the child actor should not hang but return immediately.
Assignee | ||
Updated•4 years ago
|
Comment 1•4 years ago
|
||
Tentatively tracking marionette-fission-mvp bugs for Fission M7 Beta milestone. We would like Marionette tests to work with Fission before we ride the trains to Beta.
Comment 2•4 years ago
|
||
Tentatively assigning myself to this.
I will first come up with a proposal to make sure we agree on what should be implemented here.
From what I can see, all the commands are relying on JSWindowActor::sendQuery
. Any pending call to sendQuery
should throw when the window global is destroyed (which would happen for any navigation, tab closing etc...). So I don't think commands should hang (unless we hang when an exception happens).
I'll try to compare what happens for each command with the actor implementation and without the actor implementation when the context closes.
Updated•4 years ago
|
Assignee | ||
Comment 3•4 years ago
|
||
Yes, with the concept of actors we should not hang, but we will have to find a way to maybe re-run the command if the actor gets destroyed due to some remoteness changes. I assume there is a specific error thrown in such a case?
Comment 4•4 years ago
|
||
Some analysis and investigation for this bug.
Introduction
The intent is to find a good strategy for Marionette/Webdriver commands around the AbortError
that will be thrown by JSWindowActor's sendQuery
calls whenever a pair of JSWindowActors is destroyed. This will happen whenever the corresponding window global is destroyed, which can be triggered in many use cases: closing the tab, reloading the page (or the frame), navigating with or without process change etc...
The goal should be to reduce the number of errors when enabling marionette actors. An example try run can be found at https://treeherder.mozilla.org/#/jobs?repo=try&revision=c91d5773c8d01ca7b0d2bf27a4b14c262052b654 (recent central + enabling marionette actors pref). As we can see we have many failures here. We can call out 3 main types of failing tests:
- marionette harness tests (Mn)
- awsy tests (ab)
- webplatform tests (Wd, WdH, Wr, wpt)
Mn & ab tests both fail with similar errors:
"AbortError: Actor 'MarionetteFrame' destroyed before query 'MarionetteFrameParent:XYZ' was resolved"
Webplatform tests show several different types of failures. Most frequent ones:
- "TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null" in
test_no_browsing_context
tests - "Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>"
Retrying commands
The initial idea is that after an AbortError, we should be able to retry the command until we can get the correct response.
For instance if we think about WebDriver:FindElement. When an AbortError is received for this command, there should be no harm in simply retrying the query with the next pair of JSWindowActor.
If we get a new pair of JSWindowActors, we should be able to run the command again and get a final result.
If we can't get a new pair of Actors, it means we don't have a new global available and we should probably throw a meaningful error, most likely NoSuchWindowError
. Apart from this use case, retrying sounds safe.
I made a first simple implementation of an actor proxy which implements this retry logic. Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=0e5d3f0c30d44a3da229de1ee85766d72e5cb08e .
This simple change fixed the Mn & ab test suites!
But it increased the number of failures in Wd tests, such as:
/webdriver/tests/new_window/new.py | test_no_browsing_context - setup error
Retrying commands with no side effects only
Inspecting the logs for the new errors mentioned above show that the issue is linked to a click on "#remove-parent" button triggered by a test fixture: https://searchfox.org/mozilla-central/rev/61728de8273c04fe2417c475fc0637e8b79210d7/testing/web-platform/tests/webdriver/tests/support/fixtures.py#230
Adding logs, we can see that this click triggers a destroy of the window global. With the approach retrying any command as soon as an AbortError was received, we ended up clicking again on the button. But even though we got an AbortError, the click() was still correctly performed, it's simply that it destroyed the window global in the process.
Based on this, I modified the proxy to avoid retrying all the commands that might modify the page:
- clickElement
- executeScript
- sendKeysToElement
- performActions
- releaseActions
- singleTap
All those commands might indirectly or directly destroy the window global. And if we get an AbortError, there is no way to know if we got the error before the child actor received the query? before it could run the action? after it executed the action?
We could probably come up with a complex solution to record which queries have been properly executed in the child actors, by storing information in a singleton in the Child actor's JSM (although this would only live as long as the process itself is alive). I think that coming up with such a solution is a bit out of scope, and it should be acceptable in a first step to retry only the actions which can't have a side effect.
Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=d41e92f0937f6f9b73c05865eb9d8010e06e686c
This fixed all of the errors "test_no_browsing_context - setup error".
Fix getActor inconsistencies
At this stage we only have a handful of permanent failures left:
- /webdriver/tests/switch_to_parent_frame/switch.py
- test_stale_element_from_iframe
- Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>
- /webdriver/tests/switch_to_frame/switch.py
- test_frame_id_null
- Failed: DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>
- /webdriver/tests/release_actions/release.py
- test_no_browsing_context
- AssertionError: unknown error (500): TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null
- /webdriver/tests/switch_to_frame/switch.py
- test_no_browsing_context[None]
- AssertionError: unknown error (500): TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null
For the remaining test_no_browsing_context
issues, I think the problem comes from inconsistencies between assert.open()
and getActor()
, present in both releaseActions and switchToFrame. Basically we pass different options to assert.open and to getActor. Fixing them fixes those two tests.
Note that releaseActions is being modified by Henrik in Bug 1671372. I am not sure if switchToFrame needs to be updated similarly or not. For the purpose of this experiment, I am simply re-applying the arguments used in assert.open
Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f964047384dc31dc06f07e64053efcaa6d5c992
I believe the two StaleElementReferenceException errors are linked to Bug 1665718 - Clean up element store in parent process when a tab closes
There are also tests on Fission which no longer timeout, but that shouldn't be an issue.
Proposal
Based on the investigation above, I propose to add a wrapper/proxy invoked when calling getActor() that will automatically retry all actions that can't have any side effect on the page. I will attach corresponding patches on the bug so that you can take a look at some example implementation (they are not intended for review, I only wrote them to support my investigation).
We will need to discuss more about the long term plan for actions that can have a side effect.
There are also edge cases that would be very difficult to support. For instance, imagine an execute script that:
- is heavily async
- triggers a navigation
- returns some information from the page AFTER the navigation
This kind of script is nearly impossible to support in a Fission world. I have no idea how much execute_script is used in the wild, so if you know that it is widely used to run very complex navigation scenarios, we should probably discuss about this as well.
Let me know if you have feedback on any of this!
Comment 5•4 years ago
|
||
Comment 6•4 years ago
|
||
Depends on D94086
Comment 7•4 years ago
|
||
Depends on D94087
Comment 8•4 years ago
|
||
jgraham also mentioned the use case of action sequences that trigger a navigation, pause, and then interact with the new page. If this is currently supported (to be verified), this would need a custom implementation for actors. We either need to have a way to resume an action sequence from the content process, or we should keep the sequence in the driver layer and only send individual actions to actors.
Comment 9•4 years ago
|
||
Looks like the "DID NOT RAISE <class 'webdriver.error.StaleElementReferenceException'>" are already tracked in https://bugzilla.mozilla.org/show_bug.cgi?id=1671370
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 10•4 years ago
|
||
Please note that most of the discussion on this bug actually belongs to bug 1671347, which is about handling the AbortError
by using a proxy. As such I requested to get the patch moved over to the other bug.
Here we should really only discuss the closing of a frame or window.
Comment 11•4 years ago
|
||
Comment on attachment 9182569 [details]
Bug 1662808 - [marionette] Add proxy for MarionetteFrameActor to retry queries after an AbortError
Revision D94086 was moved to bug 1671347. Setting attachment 9182569 [details] to obsolete.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 12•4 years ago
|
||
We haven't seen problems with that yet. Lowering the priority for now.
Updated•4 years ago
|
Assignee | ||
Comment 13•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #12)
We haven't seen problems with that yet. Lowering the priority for now.
Better to move it out of MVP into the reserved backlog.
Assignee | ||
Comment 14•4 years ago
|
||
Looks like that everything is working as expected, and nothing remains to be done here.
Assignee | ||
Comment 15•4 years ago
|
||
Actually this is still present and easy to fix. Here the error stack:
_a/test_minimized.py TestMinimizedTestCase.test_usercase - UnknownException: TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null
stacktrace:
getMarionetteCommandsActorProxy/get/<@chrome://marionette/content/actors/MarionetteCommandsParent.jsm:332:29
Traceback (most recent call last):
File "/Users/henrik/code/gecko/testing/marionette/harness/marionette_harness/marionette_test/testcases.py", line 214, in run
testMethod()
File "/Users/henrik/code/gecko/_a/test_minimized.py", line 112, in test_usercase
self.marionette.get_url()
File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 1476, in get_url
return self._send_message("WebDriver:GetCurrentURL", key="value")
File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/decorators.py", line 27, in _
return func(*args, **kwargs)
File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 629, in _send_message
self._handle_error(err)
File "/Users/henrik/code/gecko/testing/marionette/client/marionette_driver/marionette.py", line 651, in _handle_error
Using browsingContextFn()?.currentWindowGlobal?.getActor(..)
and checking for a valid actor here should be the way to go.
Comment 16•4 years ago
|
||
Does this marionette-fission-reserve
bug need to block shipping Fission MVP?
Comment 17•4 years ago
|
||
For now, I am assuming marionette-fission-reserve
bugs don't need to block shipping Fission MVP.
Assignee | ||
Comment 18•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #15)
_a/test_minimized.py TestMinimizedTestCase.test_usercase - UnknownException: TypeError: can't access property "getActor", browsingContextFn().currentWindowGlobal is null
This somewhat smells to be related to bug 1691348 comment 9. I will check again once bug 1691348 has been fixed.
Assignee | ||
Comment 19•4 years ago
|
||
As it looks like this happens more often for web-platform reftests (see bug 1684664). Maybe it would help to run these kind of tests to make it easier reproducible and verifiable.
Comment 20•4 years ago
|
||
I believe this is not just a Fission bug. I am seeing it pretty consistently in a web automation flow with both Firefox 86 and 88.0b2. Here's a log from this morning:
Marionette threw an error: TypeError: browsingContextFn().currentWindowGlobal is null
getMarionetteCommandsActorProxy/get/<@chrome://marionette/content/actors/MarionetteCommandsParent.jsm:327:29
GeckoDriver.prototype.takeScreenshot@chrome://marionette/content/driver.js:2392:26
It looks like the issue still exist on the trunk, and I had a suspicion that this might be related to null changes related to bug 1669961, so I am going to try reverting to the latest ESR build.
Assignee | ||
Comment 21•4 years ago
|
||
Nathan, do you have some lines of code that reproduce this issue? Or maybe provide an excerpt of the trace log? Thanks!
Comment 22•4 years ago
|
||
The lines of code are in the Elixir language with our in-house-built Webdriver client library. If it helps, I can tell you that this is from a call to /session/{sess_id}/moz/screenshot/full
. I will see what can be done about the trace log but I have already deployed ESR to production, which is the only place we're seeing the issue.
Comment 23•4 years ago
|
||
Hi Henrik, he's a little more info. Our overall processing flow is as follows:
- Call /session/{sess_id}/element/{elem_id}/click, the website in question may return to this in various ways, in one case of interest, I think something happens with frames.
- We implement a "wait until element hidden" loop by repeatedly calling a find elements (/session/{sess_id}/elements) with a selector for a <div> that shows a "wait, processing" spinner. Sometimes this succeeds but this time it returned:
{
"value": {
"error": "no such window",
"message": "Browsing context has been discarded",
"stacktrace": "WebDriverError@chrome://marionette/content/error.js:181:5\nNoSuchWindowError@chrome://marionette/content/error.js:415:5\nassert.that/<@chrome://marionette/content/assert.js:460:13\nassert.open@chrome://marionette/content/assert.js:168:4\nGeckoDriver.prototype.findElements@chrome://marionette/content/driver.js:2100:10\ndespatch@chrome://marionette/content/server.js:297:40\nexecute@chrome://marionette/content/server.js:267:16\nonPacket/<@chrome://marionette/content/server.js:240:20\nonPacket@chrome://marionette/content/server.js:241:9\n_onJSONObjectReady/<@chrome://marionette/content/transport.js:504:20\n"
}
}
- Error handling is lax on the previous call because it's just an "await until hidden" loop, so we attempt to find the next element we're interested in, which also fails: as you can see from 2), the context has been discarded.
- At this point we attempt a diagnostic screenshot. My observation in the past has been that geckodriver would return an empty base64 string in some cases where an iframe has vanished, and we could then rescue it by switching to the parent frame and retrying the screenshot. But something has changed, either on the site we are automating or in Firefox and this no longer works, screenshot fails on the first attempt with this TypeError from my previous comment.
Hope that helps.
Assignee | ||
Comment 24•4 years ago
|
||
(In reply to Nathan Bryant from comment #23)
- We implement a "wait until element hidden" loop by repeatedly calling a find elements (/session/{sess_id}/elements) with a selector for a <div> that shows a "wait, processing" spinner. Sometimes this succeeds but this time it returned:
{ "value": { "error": "no such window", "message": "Browsing context has been discarded",
This error shouldn't happen anymore with Firefox 87 when it is caused by a command as executed after switching to a frame and the frame hasn't finished loading yet. See bug 1691348.
- At this point we attempt a diagnostic screenshot. My observation in the past has been that geckodriver would return an empty base64 string in some cases where an iframe has vanished, and we could then rescue it by switching to the parent frame and retrying the screenshot. But something has changed, either on the site we are automating or in Firefox and this no longer works, screenshot fails on the first attempt with this TypeError from my previous comment.
Yes, similar to the above bug. You would have to test with 87 and later. You mentioned 88.0b2 in your first comment, does that exact error still happen there?
Comment 25•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #24)
Yes, similar to the above bug. You would have to test with 87 and later. You mentioned 88.0b2 in your first comment, does that exact error still happen there?
Yes, the TypeError continues to occur on 88.0b2. We recently upgraded from 86 to 88.0b2 and that did clear up some other null-handling issues, but not this one in particular.
Comment 26•4 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #24)
This error shouldn't happen anymore with Firefox 87 when it is caused by a command as executed after switching to a frame and the frame hasn't finished loading yet. See bug 1691348.
The thing is, we don't explicitly switch frames here. Not with a webdriver command, anyway. I'm not sure what's happening with frames, but it's something that the target website is initiating.
Assignee | ||
Comment 27•4 years ago
|
||
(In reply to Nathan Bryant from comment #26)
The thing is, we don't explicitly switch frames here. Not with a webdriver command, anyway. I'm not sure what's happening with frames, but it's something that the target website is initiating.
Ok, so when you have steps for this or a trace log, which might also help, please file a new bug, and lets keep this one for the currentWindowGlobal
being null. Thanks!
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 28•3 years ago
|
||
This can actually happen when we are trying to access the currentWindowGlobal
too early for a newly created browsing context. With bug 1747359 fixed we will now wait until a tab to be used has been finished loading. And we are going to specifically check for an not yet existing currentWindowGlobal
and ensure to wait for one. So this bug should no longer happen once bug 1747359 is fixed.
Assignee | ||
Comment 29•3 years ago
|
||
Nathan, could you maybe check again with your testcase if it looks better in the most recent Firefox Nightly build from today? If not, are you modifying the pageLoadStrategy
capability so that WebDriver:ElementClick
returns before the page has been finished loading? Thanks!
Comment 30•2 years ago
|
||
Redirect a needinfo that is pending on an inactive user to the triage owner.
:whimboo, could you please find another way to get the information or close the bug as INCOMPLETE
if it is not actionable?
For more information, please visit auto_nag documentation.
Assignee | ||
Updated•2 years ago
|
Comment 31•2 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #29)
Nathan, could you maybe check again with your testcase if it looks better in the most recent Firefox Nightly build from today? If not, are you modifying the
pageLoadStrategy
capability so thatWebDriver:ElementClick
returns before the page has been finished loading? Thanks!
:whimboo, for what it's worth, we have upgraded several Firefox versions and I have not seen this bug come up again in our production environment in a long, long time.
Assignee | ||
Comment 32•2 years ago
|
||
That's good to hear! Thanks for your feedback. Nevertheless we get still random failures for reftests (see the blocked bug 1684664). It's due to the following usage:
As such I would propose we limit this bug to only handle the reftest.js usage.
Comment 33•2 years ago
|
||
We are seeing a lot (~300 different tests) of these failures in WPT reftest on Android Fission builds. The error is:
TypeError: can't access property "getActor", browsingContext.currentWindowGlobal is null
loadTestUrl@chrome://remote/content/marionette/reftest.js:679:21
I am able to consistently reproduce the error locally when I run any of these tests. With --disable-fission
flag the error does not reproduce and the test passes.
This blocks the Android Fission project currently, can we consider raising the priority of this bug to P1?
Comment 34•2 years ago
|
||
Here is a log from the terminal: https://paste.mozilla.org/fPYtO3Z8 Command used was ./mach wpt testing/web-platform/tests/acid/acid3/test.html
Assignee | ||
Comment 35•2 years ago
|
||
I discussed with owlish on Element and my suggestion is to simply [continue in the while loop if the currentWindowGlobal
is null
](https://searchfox.org/mozilla-central/rev/4f4c8e0e84d5a728244f1e820dda14e5cdb81e71/remote/marionette/reftest.js#675-682. To check for a valid navigation we would need a valid window as well. For me this check is fine given that we won't have the time to get bug 1669787 implemented.
Owlish will report back if that works given that it seems to be easily to get reproduced. Thanks!
Assignee | ||
Comment 36•2 years ago
|
||
I had a quick look myself and as it looks like the browsing context gets discarded when the page has been finished loading:
09-05 13:31:32.837 8345 8413 I Gecko : 1662377492837 Marionette DEBUG Window URL does not match the expected URL "about:blank" !== "http://web-platform.test:8000/acid/acid3/reference.sub.html"
09-05 13:31:32.886 8307 8332 D GeckoViewSessionStore: observe browsing-context-discarded
There quite a few lines with the Marionette debug output but finally when the ACID3 page has finished loading I can see only a single browsing-context-discarded
notification. If another browsing context gets created there should be an observer notification as well. Owlish I assume that there is some kind of bug in the GeckoView implementation?
Updated•2 years ago
|
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 37•2 years ago
|
||
Owlish, are you still seeing this issue with Fission enabled and a recent mozilla-central checkout (fix for bug 1794553 included)?
Assignee | ||
Comment 38•2 years ago
|
||
Chris, I would like to get some feedback from the GeckoView team if it's working now. At least I cannot see the problem anymore when testing locally. Maybe you could forward the needinfo to the right person? Maybe Owlish is covered with other work right now? Thanks!
Comment 39•2 years ago
|
||
Owlish will start a try run to see if she can still reproduce the error.
Updated•2 years ago
|
Comment 40•2 years ago
|
||
I ran try build, and I am not seeing the failure in the reftests anymore. I am closing this for now.
Assignee | ||
Comment 41•2 years ago
|
||
That's great to hear. Thanks a lot for the feedback. Given that bug 1794553 was the only code change in that area I'm going to mark the bug as fixed instead.
Updated•2 years ago
|
Description
•