Intermittent shutdown timeouts in CI with native menus
Categories
(Core :: Widget: Cocoa, defect, P1)
Tracking
()
People
(Reporter: mstange, Assigned: mstange)
References
Details
(Whiteboard: [proton-context-menus][mac:mr1] [proton-uplift])
Attachments
(7 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details |
Now that native context menus are enabled on macOS in Nightly (bug 1700679), we are seeing an increase in jobs that are terminated by [@ mozilla::(anonymous namespace)::RunWatchdog(void*)]
after a shutdown timeout, such as this one.
These occurrences have NativeMenuMac::OpenMenu
on the stack. This indicates that the nested event loop for a native menu is never exited.
(Bug 1358898 is the general bug about shutdown timeouts in CI.)
I debugged this a little by pushing logging instrumentation to try, and it seems that this happens if we open another native menu before the previous native menu was able to exit its nested event loop.
Comment 1•4 years ago
|
||
Could this be related to bug 1705365 ?
Assignee | ||
Comment 2•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #1)
Could this be related to bug 1705365 ?
I don't think so - that bug was filed before native context menus were enabled, and it doesn't have NativeMenuMac::OpenMenu on the stack.
Updated•4 years ago
|
Assignee | ||
Comment 3•4 years ago
|
||
On CI, where we open and close menu items in quick succession, we sometimes got
into a state where the new menu was opened while we were still in the old menu's
nested event loop. So we had the following sequence of events:
- old menu +[NSMenu popUpContextMenu:withEvent:forView:]
- nested event loop for old menu
- old menu -[NSMenu cancelTrackingWithoutAnimation]
- new menu +[NSMenu popUpContextMenu:withEvent:forView:]
- nested event loop for new menu
- new menu -[NSMenu cancelTrackingWithoutAnimation]
- new menu's event loop is exited, but old menu's event loop remains on the
stack
- shutdown hang here
MOZMenuOpeningCoordinator makes sure that +[NSMenu popUpContextMenu:withEvent:forView:]
is always called in sequence, never in a nested fashion.
Comment 5•4 years ago
|
||
Backed out for causing bc failures on browser_library_bookmark_pages.js.
Failure log: https://treeherder.mozilla.org/logviewer?job_id=337785220&repo=autoland
Backout link: https://hg.mozilla.org/integration/autoland/rev/b1bc688b1805f1755c61532c64d7ec3f7d9fec9e
Updated•4 years ago
|
Assignee | ||
Comment 6•4 years ago
|
||
I cannot reproduce the browser_library_bookmark_pages.js timeout locally. I even tried running the entire directory.
Try push with debugging instrumentation: https://treeherder.mozilla.org/jobs?repo=try&revision=975e0a40dfd4576d129d5abe53539c052e169663
Assignee | ||
Comment 7•4 years ago
|
||
Something very strange is happening in that test. Profile: https://share.firefox.dev/3sZhuxZ
The first menu opens. While it is open, i.e. while we're in the tracking event loop for the first menu, we call activateItem. activateItem posts a runnable for the menuitem command, and then calls -[NSMenu cancelTrackingWithoutAnimation]
. Then the runnable runs before we we had a chance to exit the nested event loop for the menu. (This is the part that happens on try but not on my machine - on my machine, we exit the nested event loop before we run the runnable.) During the command event, we open a sheet window and spawn another nested event loop, under AppWindow::ShowModal
. So now we have two nested event loops on the stack: the event loop for the first menu (which already has had cancelTrackingWithoutAnimation
called on it) and inside of that the modal event loop for the sheet window. Then, at some point, we close the sheet window and exit the modal event loop. At this point, you'd expect that we'd leave the nested event loop for the first menu, due to the cancelTrackingWithoutAnimation
call earlier! But we don't. The menu's event loop stays on the stack. I'm not sure why. We now try to open the second menu. However, MOZMenuOpeningCoordinator delays this opening because the first menu's event loop is still on the stack. And then we go idle, stuck in the nested event loop of the first menu, and the second menu never opens.
It seems that our request to close the first menu somehow got lost. Maybe cancelTrackingWithoutAnimation
doesn't expect us to launch another nested event loop before returning back into the menu's event loop?
I'll try to make it so that the command event from activateItem only runs after we exit the menu's event loop. That should avoid this bug.
Assignee | ||
Comment 8•4 years ago
|
||
That seems to work.
Assignee | ||
Comment 9•4 years ago
|
||
Otherwise the singleton would be leaked. It's not a lot of bytes, but a future
patch in this series is going to add an nsDeque member to this class, and
nsDeque has leak detection which would cause test failures in debug builds if
we didn't clean up MOZMenuOpeningCoordinator on shutdown.
Depends on D113373
Assignee | ||
Comment 10•4 years ago
|
||
It doesn't need to be cancelable - the Run() method already nulls out mMenuItem,
so once Run() has been called once, it won't do anything on subsequent invocations.
This patch also improves safety a bit, by moving the pending command runnables
out of the mPendingCommandRunnables field before running them, so that the Run()
method has no way of invalidating the array that's being iterated over.
Depends on D113731
Assignee | ||
Comment 11•4 years ago
|
||
This is a robust way to queue runnables which won't run until after the nested
event loop of the open NSMenu has been exited.
NS_DispatchToCurrentThread is not a reliable way to achieve that, because it
puts the runnables into the Gecko event loop, and there's nothing that prevents
us from potentially visiting the Gecko event loop inside the NSMenu's event loop,
even after cancelTracking(WithoutAnimation) has been called.
Depends on D113732
Assignee | ||
Comment 12•4 years ago
|
||
This avoids bugs where a menu close request (cancelTrackingWithoutAnimation) would be dropped
if the command event did certain unexpected things. I'm not 100% sure what the exact
circumstances for the bad behavior were (see bug 1707598 comment 7 for details), but
this seems like the right thing to do anyway.
Depends on D113733
Assignee | ||
Comment 13•4 years ago
|
||
Assignee | ||
Comment 14•4 years ago
|
||
It looks like these patches now cause another test failure, in 4 out of 5 runs, in bc7 browser/components/extensions/test/browser/browser_ext_browserAction_contextMenu.js. I'll debug that one now. It doesn't reproduce locally for me.
Assignee | ||
Comment 15•4 years ago
|
||
This also makes sure we call ActivateItemAfterMenuClosing on the submenu that
immediately contains the activated menu item, and not on the root menu.
This usually doesn't make a difference, except in the case where something
calls FlushMenuClosedRunnable afterwards; in that case, before this patch, we
might accidentally fired the popuphidden event for the submenu before the
command event for the clicked item.
Depends on D113733
Assignee | ||
Comment 16•4 years ago
|
||
In the next patch, it will make different decisions for how to dispatch the
MenuClosedAsync event based on this flag.
Depends on D113987
Assignee | ||
Comment 17•4 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #14)
It looks like these patches now cause another test failure, in 4 out of 5 runs, in bc7 browser/components/extensions/test/browser/browser_ext_browserAction_contextMenu.js. I'll debug that one now. It doesn't reproduce locally for me.
This was happening because of changed microtask ordering; the test was expecting telemetry events after a popuphidden event, and one of those telemetry events was inserted from a microtask after the command event. I've addressed this by making -[MOZMenuOpeningCoordinator _runMenu]
dispatch the "after menu closed runnables" to the regular event loop instead of calling ->Run()
on them directly; this way, the Gecko event loop will make sure there are appropriate microtask checkpoints between them.
There was another test failure, in devtools/client/framework/test/browser_menu_api.js - this one was caused by a missing / delayed popuphidden event for a submenu which was closed with openMenu(false)
; the popuphidden event was delayed until after the entire menu closed and exited its event loop, but obviously that's too late if we're only closing a submenu and leaving the rest of the menu open. I've addressed this bug with the newly-added patches which let MenuClosed() know whether the entire menu is closing, so that it can pick the appropriate way to dispatch the MenuClosedAsync runnable.
Assignee | ||
Comment 18•4 years ago
|
||
Comment 19•4 years ago
|
||
Comment 20•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8f867acbaa94
https://hg.mozilla.org/mozilla-central/rev/50adb1fd03f3
https://hg.mozilla.org/mozilla-central/rev/be5032506244
https://hg.mozilla.org/mozilla-central/rev/d0c2e0e4b6a5
https://hg.mozilla.org/mozilla-central/rev/5df0f67ca1c6
https://hg.mozilla.org/mozilla-central/rev/058fd48682c5
https://hg.mozilla.org/mozilla-central/rev/2cd944025913
Comment 21•4 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/81480fec4ca9
https://hg.mozilla.org/releases/mozilla-beta/rev/2b85fcf9113a
https://hg.mozilla.org/releases/mozilla-beta/rev/5e4b0e56d9ba
https://hg.mozilla.org/releases/mozilla-beta/rev/365aabdbb867
https://hg.mozilla.org/releases/mozilla-beta/rev/f146ec71d0a2
https://hg.mozilla.org/releases/mozilla-beta/rev/d0677098dc5e
https://hg.mozilla.org/releases/mozilla-beta/rev/643b73cca9fd
Updated•4 years ago
|
Description
•