Closed Bug 1699983 Opened 4 years ago Closed 4 years ago

Perma macOS shutdown hangs devtools/server/tests/xpcshell/test_extension_storage_actor.js | Test timed out + browser/components/extensions/test/xpcshell/test_ext_urlbar.js | Test timed out when Gecko 88 switches to Late Beta on 2021-04-02

Categories

(Toolkit :: General, defect)

defect

Tracking

()

VERIFIED FIXED
89 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox86 --- unaffected
firefox87 --- unaffected
firefox88 + fixed
firefox89 + verified

People

(Reporter: aryx, Assigned: jstutte)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

The issue doesn't affect early beta and at least some tests are related to storage which indicates a dependency on the old storage implementation.

[task 2021-03-20T21:08:25.346Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.manager	DEBUG	before shutdown"
[task 2021-03-20T21:08:25.346Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.manager	DEBUG	shutdown"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.manager	DEBUG	Calling shutdown blocker for XPIProvider"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.xpi	DEBUG	shutdown"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.xpi-utils	DEBUG	shutdown"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.manager	DEBUG	Calling shutdown blocker for PluginProvider"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224729	addons.manager	DEBUG	Calling shutdown blocker for GMPProvider"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  "CONSOLE_MESSAGE: (info) 1616274224738	addons.manager	DEBUG	Async provider shutdown done"
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | WARNING: At least one completion condition is taking too long to complete. Conditions: [{"name":"ContentParent: id=11c6f1000","state":{"remoteTypePrefix":"prealloc"},"filename":"/builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp","lineNumber":3540,"stack":["resource://gre/modules/nsAsyncShutdown.jsm:addBlocker:162","/Users/cltbld/tasks/task_1616259816/build/tests/xpcshell/head.js:_do_main:239","/Users/cltbld/tasks/task_1616259816/build/tests/xpcshell/head.js:_execute_test:567","-e:null:1"]}] Barrier: profile-before-change
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | FATAL ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"ContentParent: id=11c6f1000","state":{"remoteTypePrefix":"prealloc"},"filename":"/builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp","lineNumber":3540,"stack":["resource://gre/modules/nsAsyncShutdown.jsm:addBlocker:162","/Users/cltbld/tasks/task_1616259816/build/tests/xpcshell/head.js:_do_main:239","/Users/cltbld/tasks/task_1616259816/build/tests/xpcshell/head.js:_execute_test:567","-e:null:1"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | WARNING: No crash reporter available
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | \x07[Parent 7401, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp:3540
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | [Parent 7401, Main Thread] ###!!! ABORT: file /builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp:3540
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | ###!!! [Child][MessageChannel] Error: (msgtype=0x3A00E2,name=PContent::Msg_GraphicsError) Channel error: cannot send/recv
[task 2021-03-20T21:08:25.347Z] 21:08:25     INFO -  PID 7401 | [GFX1-]: Receive IPC close with reason=AbnormalShutdown
Flags: needinfo?(jstutte)

Hi Paul, am I right that the use of prompter.jsm depends on configurations and thus we might just end up with an untested configuration here in late beta? I doubt that local storage is the direct culprit this time (though it might be involved in causing the hang we do not catch here). Can you point me somewhere how to configure the release behavior locally here? (Edit) Just realizing that it's Mac only, which I cannot test anyway.

Brain dump so far:

SpinEventLoopUntilOrQuitis used only in few places other than prompter.jsm:

browser/base/content/tabbrowser.js
toolkit/components/passwordmgr/LoginManagerAuthPrompter.jsm
toolkit/content/widgets/browser-custom-element.js

Of particular interest might be browser/base/content/tabbrowser.js as (according to the comment) we seem to prompt right afterwards, though apparently only if the beforeUnloadPromises really resolved. I wonder if we should be more explicit about errors to avoid any doubt of what is going to happen in the abort case.

It might also be the case, that there are other process types involved here for which we do not advance the correct shutdown phase, but then I would not understand why this happens only with late beta.

Flags: needinfo?(jstutte) → needinfo?(pbz)

So I added some warnings and made a central-as-late-beta simulation on MacOS (thanks to aryx for the instructions).

It seems that AppShutdown::AdvanceShutdownPhase is never called here which might cause a lack of shutdown notifications in general.

Are XPCShell tests (that involve XPI, it seems) invoked differently under MacOS (and this only on the release channel) ? Sounds weird.

I will launch the same test also for linux.

Selecting the tasks for linux I noticed, that we do not execute XPCShell tests at all on linux shippable? That could explain the OS difference, at least.

Anyway, here is the try run on linux-asan and macos in parallel as late beta. This might give us a hook for a pernosco session.

(Edit) I just realized that I was not using the right way to print out messages to the log. So I updated things and gave it a new try (link updated, too).

Flags: needinfo?(aryx.bugmail)

The XPCshell tasks run on Linux shippable for central-as-beta simulations and regular beta pushes. ./mach try chooser --full offers XPCshell tests for shippable builds both on Linux and macOS here.

Flags: needinfo?(aryx.bugmail)

(In reply to Jens Stutte [:jstutte] from comment #3)

Anyway, here is the try run on linux-asan and macos in parallel as late beta. This might give us a hook for a pernosco session.

From my first try runs it has been confirmed to be MacOS only, as reported. So no real hope for a pernosco session here.

So this time the failure is apparently happening on a different test. Filtering the output for timeouts yields:

[task 2021-03-23T16:01:25.472Z] 16:01:25     INFO -  browser/components/tests/unit/test_distribution.js failed or timed out, will retry.
[task 2021-03-23T16:02:49.257Z] 16:02:49     INFO -  extensions/pref/autoconfig/test/unit/test_autoconfig_nonascii.js failed or timed out, will retry.
[task 2021-03-23T16:06:15.859Z] 16:06:15     INFO -  TEST-TIMEOUT | xpcshell.ini:browser/components/extensions/test/xpcshell/test_ext_urlbar.js | took 300120ms
[task 2021-03-23T16:06:57.547Z] 16:06:57     INFO -  TEST-TIMEOUT | xpcshell-child-process.ini:dom/indexedDB/test/unit/test_keys.js | took 300038ms
[task 2021-03-23T16:07:33.208Z] 16:07:33     INFO -  TEST-TIMEOUT | dom/push/test/xpcshell/test_unregister_invalid_json.js | took 300079ms
[task 2021-03-23T16:07:33.209Z] 16:07:33     INFO -  dom/push/test/xpcshell/test_unregister_invalid_json.js failed or timed out, will retry.
[task 2021-03-23T16:08:16.759Z] 16:08:16     INFO -  toolkit/components/telemetry/tests/unit/test_ChildScalars.js failed or timed out, will retry.
[task 2021-03-23T16:08:27.489Z] 16:08:27     INFO -  toolkit/components/url-classifier/tests/unit/test_listmanager.js failed or timed out, will retry.
[task 2021-03-23T16:09:35.090Z] 16:09:35     INFO -  TEST-TIMEOUT | netwerk/test/unit_ipc/test_alt-data_stream_wrap.js | took 300037ms
[task 2021-03-23T16:11:03.107Z] 16:11:03     INFO -  TEST-TIMEOUT | xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_responseBody.js | took 300012ms
[task 2021-03-23T16:11:13.042Z] 16:11:13     INFO -  TEST-TIMEOUT | xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript_triggeringPrincipal.js | took 300008ms
[task 2021-03-23T16:15:39.664Z] 16:15:39     INFO -  TEST-START | security/manager/ssl/tests/unit/test_ocsp_timeout.js
[task 2021-03-23T16:16:16.303Z] 16:16:16     INFO -  TEST-PASS | security/manager/ssl/tests/unit/test_ocsp_timeout.js | took 36591ms
[task 2021-03-23T16:18:21.769Z] 16:18:21     INFO -  xpcshell.ini:browser/components/extensions/test/xpcshell/test_ext_urlbar.js failed or timed out, will retry.
[task 2021-03-23T16:18:50.576Z] 16:18:50     INFO -  xpcshell-child-process.ini:dom/indexedDB/test/unit/test_keys.js failed or timed out, will retry.
[task 2021-03-23T16:21:35.155Z] 16:21:35     INFO -  netwerk/test/unit_ipc/test_alt-data_stream_wrap.js failed or timed out, will retry.
[task 2021-03-23T16:22:59.502Z] 16:22:59     INFO -  xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_responseBody.js failed or timed out, will retry.
[task 2021-03-23T16:23:13.035Z] 16:23:13     INFO -  xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript_triggeringPrincipal.js failed or timed out, will retry.
[task 2021-03-23T16:23:29.457Z] 16:23:29  WARNING -  TEST-UNEXPECTED-TIMEOUT | xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_responseBody.js | Test timed out
[task 2021-03-23T16:23:29.529Z] 16:23:29     INFO -  PID 5688 | FATAL ERROR: AsyncShutdown timeout in profile-before-change Conditions: [{"name":"ContentParent: id=11fc8c800","state":{"remoteTypePrefix":"web"},"filename":"/builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp","lineNumber":3540,"stack":["resource://gre/modules/nsAsyncShutdown.jsm:addBlocker:162","resource://testing-common/ExtensionXPCShellUtils.jsm:_initBrowser:274","/Users/cltbld/tasks/task_1616512595/build/tests/xpcshell/head.js:_do_main:239","/Users/cltbld/tasks/task_1616512595/build/tests/xpcshell/head.js:_execute_test:567","-e:null:1"]},{"name":"ContentParent: id=12092f000","state":{"remoteTypePrefix":"prealloc"},"filename":"/builds/worker/checkouts/gecko/dom/ipc/ContentParent.cpp","lineNumber":3540,"stack":["resource://gre/modules/nsAsyncShutdown.jsm:addBlocker:162","/Users/cltbld/tasks/task_1616512595/build/tests/xpcshell/head.js:_do_main:239","/Users/cltbld/tasks/task_1616512595/build/tests/xpcshell/head.js:_execute_test:567","-e:null:1"]}] At least one completion condition failed to complete within a reasonable amount of time. Causing a crash to ensure that we do not leave the user with an unresponsive process draining resources.

It seems, we encounter several timeouts, some of them reported as INFO and not as error. And then something tries to abort the child process probably through an IPC message to the content processes:

[task 2021-03-23T16:23:29.527Z] 16:23:29     INFO -  PID 5688 | [Child 5689: Main Thread]: D/AppShutdown AppShutdown::AdvanceShutdownPhase called with 1
[task 2021-03-23T16:23:29.527Z] 16:23:29     INFO -  PID 5688 | [Child 5689: Main Thread]: I/AppShutdown AppShutdown::AdvanceShutdownPhase advancing to 1
[task 2021-03-23T16:23:29.527Z] 16:23:29     INFO -  PID 5688 | [Child 5690: Main Thread]: D/AppShutdown AppShutdown::AdvanceShutdownPhase called with 1
[task 2021-03-23T16:23:29.528Z] 16:23:29     INFO -  PID 5688 | [Child 5690: Main Thread]: I/AppShutdown AppShutdown::AdvanceShutdownPhase advancing to 1

This is the only touching point with the code I changed here, but it happens way after we already messed up, it seems. And the AsyncShutdown timeout is happening immediately after, thus is not induced by AppShutdown::AdvanceShutdownPhase (which potentially spins up the terminator thread, but this is not causing the FATAL ERROR here - it would crash hard after a timeout).

I might not be the right person to investigate this deeply, but I cannot really see how the patch from bug 1696397 could cause those initial timeouts. The abort case of SpinEventLoopUntilOrQuit is never hit, it seems, so this should not influence anything. On the linux executions I see far less timeouts which cause a retry that then works. And I do not see any invocation of AppShutdown::AdvanceShutdownPhase at all. To me all I see in the MacOS log just shows a sane Async Shutdown Blocker behavior after some unforeseen timeouts of unknown nature. Which might just be caused by slow execution?

To be sure, we could simulate, if a backout of the patch from bug 1696397 really would solve this problem? My hg "expertise" is not enough to do this on my own, though...

Flags: needinfo?(aryx.bugmail)

Before we do this, I pushed another try that tracks each call to nsThreadManager::SpinEventLoopUntilInternal. If we find some of those next to the tests that time out and that have a value >0 attached to them, I might be wrong.

Flags: needinfo?(aryx.bugmail)

It's from bug 1696397. The pushlog lists a push by Hiro before the one with bug 1696397. The try push based on the former didn't hit the issue while the one on the latter did. The imported patch got used to fix some bustage.

And test_ext_webRequest_responseBody.js was always one of the failing tests as far as I checked but not the first one.

Thanks for clarifying. On my local tests, xpcshell seems to use a lot of nsThreadManager::SpinEventLoopUntil but I do not see the nsThreadManager::SpinEventLoopUntilOrQuit variant, which is the only one that should have changed behavior. Let's see if this is different in the try push. I am not an xpcshell expert at all, but there might be something we do different when we retry a timed out test execution.

So, just to summarize my findings so far (based on the latest push):

  • there seem to be no calls to SpinEventLoopUntilOrQuit involved:
13 matches
[task 2021-03-23T20:20:14.771Z] 20:20:14     INFO -  PID 4597 | [Parent 4597: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:20:14.772Z] 20:20:14     INFO -  PID 4597 | [Parent 4597: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshell/head.js:_do_main) (0)
[task 2021-03-23T20:20:14.773Z] 20:20:14     INFO -  PID 4597 | [Parent 4597: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshell/head.js:_do_main)|XPIProvider.jsm:awaitPromise (0)
[task 2021-03-23T20:20:15.136Z] 20:20:15     INFO -  PID 4597 | [Parent 4597: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:20:15.136Z] 20:20:15     INFO -  PID 4597 | [Parent 4597: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshel/head.js:_execute_test) (0)
[task 2021-03-23T20:37:29.190Z] 20:37:29     INFO -  PID 4819 | [Parent 4819: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:37:29.191Z] 20:37:29     INFO -  PID 4819 | [Parent 4819: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshell/head.js:_do_main) (0)
[task 2021-03-23T20:37:29.272Z] 20:37:29     INFO -  PID 4819 | [Parent 4819: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:37:29.272Z] 20:37:29     INFO -  PID 4819 | [Parent 4819: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshel/head.js:_execute_test) (0)
[task 2021-03-23T20:54:50.825Z] 20:54:50     INFO -  PID 5027 | [Parent 5027: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:54:50.825Z] 20:54:50     INFO -  PID 5027 | [Parent 5027: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshell/head.js:_do_main) (0)
[task 2021-03-23T20:54:50.955Z] 20:54:50     INFO -  PID 5027 | [Parent 5027: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for PerTestCoverageUtils.jsm:awaitPromise (0)
[task 2021-03-23T20:54:50.955Z] 20:54:50     INFO -  PID 5027 | [Parent 5027: Main Thread]: D/nsThreadManager SpinEventLoopUntilInternal called for Test(xpcshel/head.js:_execute_test) (0)

These are all instances of calls to SpinEventLoopUntilInternal. The (n) indicates the shutdown phase we should abort the loop for, 0 means no abort at all. So no SpinEventLoopUntilOrQuit has been called, apparently. And in fact no additional message indicates that we aborted a loop due to shutdown (I added those messages on the push, too).

  • There seems to be a correlation between earlier INFO ... xxx test timed out, will retry. and later WARNING ... xxx | Test timed out :
13 matches
[task 2021-03-23T19:32:37.337Z] 19:32:37     INFO -  extensions/pref/autoconfig/test/unit/test_autoconfig_nonascii.js failed or timed out, will retry.
[task 2021-03-23T19:38:48.897Z] 19:38:48     INFO -  toolkit/components/url-classifier/tests/unit/test_listmanager.js failed or timed out, will retry.
[task 2021-03-23T19:47:49.498Z] 19:47:49     INFO -  xpcshell.ini:browser/components/extensions/test/xpcshell/test_ext_history.js failed or timed out, will retry.
[task 2021-03-23T19:48:36.512Z] 19:48:36     INFO -  xpcshell-child-process.ini:dom/indexedDB/test/unit/test_keys.js failed or timed out, will retry.
[task 2021-03-23T19:48:44.132Z] 19:48:44     INFO -  xpcshell.ini:browser/components/extensions/test/xpcshell/test_ext_urlbar.js failed or timed out, will retry.
[task 2021-03-23T19:51:32.918Z] 19:51:32     INFO -  netwerk/test/unit_ipc/test_alt-data_stream_wrap.js failed or timed out, will retry.
[task 2021-03-23T19:53:04.033Z] 19:53:04     INFO -  xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_filterResponseData.js failed or timed out, will retry.
[task 2021-03-23T19:53:06.037Z] 19:53:06     INFO -  xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_responseBody.js failed or timed out, will retry.
[task 2021-03-23T19:53:14.320Z] 19:53:14     INFO -  xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript_triggeringPrincipal.js failed or timed out, will retry.
[task 2021-03-23T19:55:27.250Z] 19:55:27  WARNING -  TEST-UNEXPECTED-TIMEOUT | xpcshell.ini:browser/components/extensions/test/xpcshell/test_ext_urlbar.js | Test timed out
[task 2021-03-23T19:55:27.395Z] 19:55:27     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "SkippableTimer :: UrlbarProviderExtension notification timer :: Timed out!" {file: "resource:///modules/UrlbarUtils.jsm" line: 1935}]
[task 2021-03-23T19:55:27.402Z] 19:55:27     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "SkippableTimer :: UrlbarProviderExtension notification timer :: Timed out!" {file: "resource:///modules/UrlbarUtils.jsm" line: 1935}]
[task 2021-03-23T20:12:27.622Z] 20:12:27  WARNING -  TEST-UNEXPECTED-TIMEOUT | xpcshell-e10s.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_responseBody.js | Test timed out
  • It still just happens only on MacOS, which is puzzling me additionally. The later WARNING - TEST-UNEXPECTED-TIMEOUT are not happening under linux at all.

So from what I understand so far from the logs the move from nsThreadManager::SpinEventLoopUntilOrShutdown to nsThreadManager::SpinEventLoopUntilOrQuit seems not relevant here. But I might be overlooking something obvious here, :kmag ?

Flags: needinfo?(kmaglione+bmo)

(In reply to Jens Stutte [:jstutte] from comment #11)

  • There seems to be a correlation between earlier INFO ... xxx test timed out, will retry. and later WARNING ... xxx | Test timed out :

Yes. When a test fails when run in parallel, the harness tries running it again on its own in case there's a race that we may not hit when under less load.

So from what I understand so far from the logs the move from nsThreadManager::SpinEventLoopUntilOrShutdown to nsThreadManager::SpinEventLoopUntilOrQuit seems not relevant here. But I might be overlooking something obvious here, :kmag ?

These are all timeouts in async shutdown blockers waiting for ContentParent shutdown. The only way they'd have anything to do with your changes is if something's spinning a nested event loop in a child process and was affected by your changes. That seems unlikely, though.

Flags: needinfo?(kmaglione+bmo)

(In reply to Kris Maglione [:kmag] from comment #12)

The only way they'd have anything to do with your changes is if something's spinning a nested event loop in a child process and was affected by your changes.

For what it's worth, spinEventLoopUntilOrQuit in Prompter.jsm here: https://searchfox.org/mozilla-central/rev/cc4a17cea338fe236626d838ca96e9bf6a775675/toolkit/components/prompts/src/Prompter.jsm#1065 can be called in the content process. For example if a site calls window.alert.

(In reply to Jens Stutte [:jstutte] from comment #1)

Hi Paul, am I right that the use of prompter.jsm depends on configurations and thus we might just end up with an untested configuration here in late beta?

I'm not aware of any late-beta specific behavior of Prompter.jsm itself, however its consumers (via nsIPrompt, or nsIPromptService) could behave differently.

Flags: needinfo?(pbz)

(In reply to Paul Zühlcke [:pbz] from comment #13)

(In reply to Kris Maglione [:kmag] from comment #12)

The only way they'd have anything to do with your changes is if something's spinning a nested event loop in a child process and was affected by your changes.

For what it's worth, spinEventLoopUntilOrQuit in Prompter.jsm here: https://searchfox.org/mozilla-central/rev/cc4a17cea338fe236626d838ca96e9bf6a775675/toolkit/components/prompts/src/Prompter.jsm#1065 can be called in the content process. For example if a site calls window.alert.

The log messages I added seem to indicate, that this is not the case here. I added some more messages (also for direct non-JS SpinEventLoopUntil calls) and made a new try push.

My only remaining doubt is, if we still arrive here on all processes, thus I included also observer messages in the log.

I made a try push with a doubled timeout. It shows the same behavior, which means we clearly have some live lock here.

I am not able to force the same behavior locally under linux, even with ridiculously short timeouts.

(In reply to Jens Stutte [:jstutte] from comment #6)

To be sure, we could simulate, if a backout of the patch from bug 1696397 really would solve this problem? My hg "expertise" is not enough to do this on my own, though...

I do not see any path forward than trying a backout push and in case this heals compare the logs with the failing one.

Flags: needinfo?(aryx.bugmail)

(In reply to Paul Zühlcke [:pbz] from comment #13)

(In reply to Kris Maglione [:kmag] from comment #12)

The only way they'd have anything to do with your changes is if something's spinning a nested event loop in a child process and was affected by your changes.

For what it's worth, spinEventLoopUntilOrQuit in Prompter.jsm here: https://searchfox.org/mozilla-central/rev/cc4a17cea338fe236626d838ca96e9bf6a775675/toolkit/components/prompts/src/Prompter.jsm#1065 can be called in the content process. For example if a site calls window.alert.

This is happening in xpcshell tests, which don't support any sort of prompts (though I did check that the failing tests weren't using them just in case). And, even if they did, the processes that are blocking shutdown are usually preallocated processes, which wouldn't ever get a chance to run code that could trigger the prompt service.

OK, one possible significant difference I do see is that between bug 1689953 and bug 1696397, we now fire "quit-application" observers in content processes at ContentChild shutdown, which we didn't before. That could mean that we're triggering observers that we didn't before, and it definitely means that we're calling a couple of observers that listen for both "content-child-will-shutdown" and "quit-application" an additional time.

Yeah, might be. I kind of anticipated the need to make this cleaner in bug 1697745. A faster hack could be to have a variant of AdvanceShutdownPhase that does not fire any observers.

Assignee: nobody → jstutte
Status: NEW → ASSIGNED
Pushed by jstutte@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/89517310d74e Avoid firing unexpected shutdown observer notification in content process. r=kmag,xpcom-reviewers
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

Comment on attachment 9211363 [details]
Bug 1699983: Avoid firing unexpected shutdown observer notification in content process. r?kmag,#xpcom-reviewers

Beta/Release Uplift Approval Request

  • User impact if declined: This fixes a regression caused by D107619. We have no evidence of concrete user impact, yet, but in order to be sure we will not cause any harm with the other patch we should uplift this.

Though we do not have a specific test for it, the regression has been detected by our automatic testing.

  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): It is not risky and reverts a slightly risky change made by D107619, that is firing unexpected shutdown observer notifications in the content process.
  • String changes made/needed:
Attachment #9211363 - Flags: approval-mozilla-beta?

Verified fixed with latest central-as-late-beta simulation. Thank you to everybody involved.

Status: RESOLVED → VERIFIED

Comment on attachment 9211363 [details]
Bug 1699983: Avoid firing unexpected shutdown observer notification in content process. r?kmag,#xpcom-reviewers

Thanks indeed for getting this sorted out! Approved for 88.0b4.

Attachment #9211363 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: