Closed Bug 1740301 Opened 3 years ago Closed 3 years ago

IOUtils.exists causes intermittent crashes in Talos' sessionrestore_no_auto_restore test when called via XPIInstall.jsm

Categories

(Toolkit Graveyard :: OS.File, defect, P2)

Tracking

(firefox97 fixed)

RESOLVED FIXED
Tracking Status
firefox97 --- fixed

People

(Reporter: standard8, Unassigned)

References

Details

(Whiteboard: [fixed by bug 1744794])

Bug 1733540 tried to replace OS.File usage in XPIInstall.jsm with IOUtils. When that landed, it intermittently caused crashes (bug 1736189) in Talos' sessionrestore_no_auto_restore tests.

I have tried simplifying the patch to handle just the change needed to drop OS.File from the early startup process (prior to the first browser window being drawn). I tested simply replacing the one instance of the OS.File.exists call that startup currently hits: https://hg.mozilla.org/try/rev/d6432cc4ab48f4ed37e8f13b9b1685165db30fa7

That replacement still caused crashes.

On Linux, the crash is a SIGABRT, and the top frames are:

libxul.so!PollWrapper(_GPollFD*, unsigned int, int) [nsAppShell.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 61 + 0xf]
libglib-2.0.so.0!g_main_context_iterate.isra.26 [gmain.c : 3897 + 0x24]
libglib-2.0.so.0!g_main_context_iteration [gmain.c : 3963 + 0x14]
libxul.so!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 1091 + 0x18d]
libxul.so!mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [MessagePump.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 107 + 0x2e]

On Windows, it says "No crash". Though the top frames are:

win32u.dll + 0x96e4
user32.dll + 0x2029d
xul.dll!static mozilla::widget::WinUtils::WaitForMessage(unsigned long) [WinUtils.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 832 + 0x1d]
xul.dll!nsAppShell::ProcessNextNativeEvent(bool) [nsAppShell.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 739 + 0xa]
xul.dll!nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) [nsBaseAppShell.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 259 + 0x42]
xul.dll!nsThread::ProcessNextEvent(bool, bool*) [nsThread.cpp:a4f4332a57e88004402b6a9a7f401c0fa8e078e9 : 1091 + 0x52]

So in both cases, it looks like this happens whilst we're waiting for events.

This turns out to be very specific. If I remove just the exists change from the patch in bug 1733540, then there's no crashes (for this test at least): https://treeherder.mozilla.org/jobs?repo=try&revision=34305df8b0cd74e500a3de027364518fcc5784a2

Hi Barret, I'm wondering if you might be able to take a look at this crash?

I believe that making this line use IOUtils is enough to trigger the crash.

I haven't been able to reproduce this locally yet though.

One suggestion from one of the bugs was that this could potentially be due to some of the async shutdown blocking.

Flags: needinfo?(brennie)

It could be due to async shutdown blocking, but the async shutdown barrier code warns after 10 seconds if there is a blocker that hasn't yet resolved, which I don't see in the log.

We could add some logging to IOUtils around its async shutdown to see if we get stuck inside.

Flags: needinfo?(brennie)

I added some logging and pushed to try.

One of the jobs failed with a crash but it does not seem to be a shutdown crash at all, so I've re-triggered the jobs to run 5x.

The guess about the fact it was shutdown related may have been wrong. The log seems similar to some of the others that we saw previously - note that Mac & Windows didn't seem to have the crash stack information whereas Linux did.

Thinking about it given that the line in XPIInstall.jsm is triggered on startup, this is probably more likely to be startup, though I don't get why it would be intermittent and only show up on Talos.

I've fired off another try push with very verbose logging around all dispatching and resolving/rejecting of IOUtils operations https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a662849e723ad7224e9fcfdaf062d8c57b43ccf

I took a peek at the generated logs and it seems like every call to IOUtils.exists() is resolving or rejecting appropriately -- we aren't getting stuck in the event queue, so something even weirder is going on.

It would be good if we could get a reproduction of this via rr to run on pernosco and see what is going on.

The severity field is not set for this bug.
:barret, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(brennie)
Severity: -- → S3
Flags: needinfo?(brennie)
Priority: -- → P2
Depends on: 1744794

I believe the main issue in the Talos test has been fixed by bug 1744794.

I've just filed bug 1745430 on potentially adding a test for shutdown during startup.

Hence calling this fixed by bug 1744794.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Whiteboard: [fixed by bug 1744794]
Product: Toolkit → Toolkit Graveyard
You need to log in before you can comment on or make changes to this bug.