Open Bug 1675068 Opened 4 years ago Updated 1 year ago

Service workers delaying shutdown

Categories

(Core :: DOM: Service Workers, defect, P2)

defect

Tracking

()

Tracking Status
firefox-esr78 --- affected
firefox85 --- wontfix
firefox86 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox96 --- wontfix
firefox97 --- wontfix
firefox98 --- wontfix
firefox104 --- wontfix
firefox105 --- wontfix
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 --- affected

People

(Reporter: ytausky, Unassigned)

References

(Depends on 1 open bug)

Details

Crash Data

After the nasty deadlock in bug 1588152 was resolved, it turns out that there are still reports of service workers delaying the shutdown. The top 3 async shutdown timeout reasons, accounting for 60% of the crashes, are:

  1. {"shutdownStates":"parent process IPDL background thread, ","pendingPromises":1,"acceptingPromises":false}
  2. {"shutdownStates":"","pendingPromises":0,"acceptingPromises":true}
  3. {"shutdownStates":"content process main thread, ","pendingPromises":1,"acceptingPromises":false}
Assignee: nobody → ytausky
Status: NEW → ASSIGNED
Crash Signature: [@ AsyncShutdownTimeout | profile-change-teardown | ServiceWorkerShutdownBlocker: shutting down Service Workers ]

(In reply to Yaron Tausky [:ytausky] from comment #0)

  1. {"shutdownStates":"content process main thread, ","pendingPromises":1,"acceptingPromises":false}

I just encountered a crash today that falls in this category: https://crash-stats.mozilla.org/report/index/bac8b5b8-29da-487f-a42d-71cf40201104#tab-metadata

I have semi-reliable steps to reproduce. When I do startup profiling on Windows (start with MOZ_PROFILER_STARTUP=1 in the environment), sometimes when I attempt to collect the profile, the profiler tab doesn't load, and remain stuck with the animated loading indicator, and the profiler.firefox.com domain name as the tab title, but the tab doesn't actually load. The profiler website uses a service worker. When a profiler tab fails to load like this, Firefox shutdown never completes cleanly.

On a slow test laptop, I reproduce quite consistently when doing a cold startup profile. On my devlopment machine I could reproduce about half the time with my local build (so I might be able to test things there).

Severity: -- → S3
Priority: -- → P2

I woke my laptop, immediately updated Nightly and after ~2-3 minutes of the updater running in the background I got this crash:
https://crash-stats.mozilla.org/report/index/0c366038-3f4a-42d8-bdbe-181e70201116

¡Hola!

Build ID 20201211213049 crashed here:

https://crash-stats.mozilla.org/report/index/7bb232a4-9947-4e39-acdf-a71050201216#tab-details

Updated flags per the crash data available.

¡Gracias!
Alex

OS: Unspecified → All
Hardware: Unspecified → All
Assignee: ytausky → jstutte

This signature started to get crash reports again at the end of last week (10-20 crash reports per Nighty). Regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=720d5125a9b4aa6750806ed9b51fbd0811da10c4&tochange=59134b451eec1454be5f2e489176b4b59f1ddb5a

Yulia, could bug 1742438 be related?

Flags: needinfo?(ystartsev)

Yes, possibly. Hopefully I have a fix for this regression already.

Flags: needinfo?(ystartsev)

The bug is marked as tracked for firefox104 (nightly). However, the bug still has low severity.

:jstutte, could you please increase the severity for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jstutte)

(In reply to Yulia Startsev [:yulia] from comment #7)

Yes, possibly. Hopefully I have a fix for this regression already.

Yulia, could you please link the bug with the fix here?

Flags: needinfo?(jstutte) → needinfo?(ystartsev)

(discussed with jens offline)

I took a look, and this appears to still be happening. I believe this is due to the cache promise cleanup potentially sometimes being late. it is similar to the points made by Yaron above a few years ago, but now that it is done independently of the ScriptLoader shutdown it may be lagging or failing. I'll investigate.

I think I found the underlying issue. Cancellation was changed into "informing" rather than triggering a kill of the worker. However, it is possible for the worker to be killed without going through cancellation, which means we don't do clean up. I am looking in to how to make this always go through the cancellation path / do cleanup.

(removing myself as assignee as Yulia seems to be on top of it)

Assignee: jstutte → ystartsev

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:yulia, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ystartsev)
Keywords: topcrash

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit auto_nag documentation.

Keywords: topcrash

I am still tracking this bug, as there is a bit more work to do regarding the cancellation path. Will update once I am finished.

Flags: needinfo?(ystartsev)

I've reviewed the crashes in the last week, and they do not appear to be related to the ScriptLoader shutdown any more, they appear to be hanging in the JS interpreter, which looks unrelated.

Indeed we are back to normal here. Unassigning Yulia but we keep it open for further monitoring.

Assignee: ystartsev → nobody
Status: ASSIGNED → NEW

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:jmarshall, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmarshall)
Keywords: topcrash
Flags: needinfo?(jmarshall) → needinfo?(jstutte)

:jstutte this has started spiking in recent nightly builds.
Could someone take a look, is it related to Bug 1812490?

The spike seemed to start with landing of bug 1811195, which has a remote worker related part. Then we landed the fix for bug 1812490 and numbers went down a lot, but seem to be still significantly higher than before. My assumption would be, that one of the two conditions we changed in bug 1811195 is not triggered anymore since bug 1812490 landed, but the other is?

:asuth, could it be that the bail out in BackgroundParentImpl::AllocPRemoteWorkerControllerParent is doing more harm than good?

Flags: needinfo?(jstutte) → needinfo?(bugmail)

(In reply to Jens Stutte [:jstutte] from comment #20)

:asuth, could it be that the bail out in BackgroundParentImpl::AllocPRemoteWorkerControllerParent is doing more harm than good?

From here that looks very likely.

Depends on: 1813559
Flags: needinfo?(bugmail)

(In reply to Jens Stutte [:jstutte] from comment #20)

:asuth, could it be that the bail out in BackgroundParentImpl::AllocPRemoteWorkerControllerParent is doing more harm than good?

Yeah, I think this would probably break the state machine and so removing it again in https://phabricator.services.mozilla.com/D168264 makes sense.

This is a reminder regarding comment #8!

The bug is marked as tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 9 days. However, the bug still isn't assigned.

(In reply to Release mgmt bot [:suhaib / :marco/ :calixte] from comment #23)

This is a reminder regarding comment #8!

The bug is marked as tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 9 days. However, the bug still isn't assigned.

We do not yet see data after bug 1813559 landed (which is tracked, too), the hope would be that we can untrack this bug again after that landed.

This is a reminder regarding comment #8!

The bug is marked as tracked for firefox111 (nightly). We have limited time to fix this, the soft freeze is in 8 days. However, the bug still isn't assigned.

Removing tracking for 111, the crash volume dropped after Bug 1813559 landed.

I can reproduce this bug by opening all my 40+ bookmarks and closing the browser before all pages are loaded. Sometimes it happens and sometimes it doesn’t, but most of the time I get the same crash.
STR:
1 - Open Firefox Nightly 102.0a1
2 - Right-click bookmark bar -> Open all bookmarks
3 - Close the browser right after opening all the bookmarks without waiting for them to finish loading

Results: Firefox closes most of the processes, but a few particular processes remain open for a while until I get the crash notification.

Expected results: Firefox should close all the tabs.

Using Windows 11

Crash report: https://crash-stats.mozilla.org/report/index/205110cc-31e3-4c5c-ab0d-2df8a0230221

Could it be that the bot is getting confused around merge day about how to query the nightly topcrash criteria (similar to https://github.com/mozilla/bugbug/issues/351) ?

Flags: needinfo?(smujahid)

(In reply to Jens Stutte [:jstutte] from comment #32)

Could it be that the bot is getting confused around merge day about how to query the nightly topcrash criteria

Thank you for reporting this! The bot account for the release day and includes the previous version in the first week after the release date. However, it seems that contradictions in the versions on the release day caused this. I filed an issue to follow up on this: https://github.com/mozilla/bugbot/issues/2100

(similar to https://github.com/mozilla/bugbug/issues/351) ?

I guess you mean https://github.com/mozilla/bugbug/issues/3516

Flags: needinfo?(smujahid)

(In reply to Suhaib Mujahid [:suhaib] from comment #33)

(In reply to Jens Stutte [:jstutte] from comment #32)

Could it be that the bot is getting confused around merge day about how to query the nightly topcrash criteria

Thank you for reporting this! The bot account for the release day and includes the previous version in the first week after the release date. However, it seems that contradictions in the versions on the release day caused this. I filed an issue to follow up on this: https://github.com/mozilla/bugbot/issues/2100

It seems it is happening again.

Flags: needinfo?(smujahid)

:jstutte I will prioritize working on #2100 to prevent this from happening again.

Flags: needinfo?(smujahid)

"Perf" key word?

This crash may fall into the same category : 

https://crash-stats.mozilla.org/report/index/3e31cb79-abdb-46e7-83da-0f8680230819#tab-bugzilla

I see Signature: [@ AsyncShutdownTimeout | profile-change-teardown | ServiceWorkerShutdownBlocker: shutting down Service Workers ]

You need to log in before you can comment on or make changes to this bug.