Closed Bug 1189659 Opened 9 years ago Closed 9 years ago

New Register/Update tasks should terminate existing installing worker

Categories

(Testing :: web-platform-tests, defect)

Type: defect
Priority: Not set
Severity: normal
Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla45

Tracking Status:
firefox42 --- wontfix
firefox43 --- wontfix
firefox44 --- fixed
firefox45 --- fixed
b2g-v2.5 --- fixed

People

(Reporter: noemi, Assigned: catalinb)


Attachments

(6 files, 12 obsolete files)

(deleted), text/x-review-board-request, baku: review+
(deleted), patch
(deleted), patch
(deleted), patch
(deleted), patch
(deleted), patch, ehsan.akhgari: review+
Test run such as |./mach web-platform-tests _mozilla/service-workers/service-worker/register-wait-forever-in-install-worker.https.html| fails with:

Harness status: Timeout
Found 1 tests
1 Timeout
* register worker that calls waitUntil with a promise that never resolves in oninstall -> timed out

Traces: https://pastebin.mozilla.org/8838598
According to the spec, a new Register/Update task should terminate an existing installing worker. We do not do this yet, leading to the test failure. The actual implementation is going to need some thinking. The easiest thing to do would be to immediately expire the timeout on the installing worker (once the timers patch lands). Then override Cancel() on the lifecycle tasks so that when they get canceled, or when they get destroyed 'uncleanly' (due to the worker shutting down), they dispatch a runnable to the main thread that indicates the task failed, so that the RegisterJob can continue as if the install failed.
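For illustration, here is a minimal sketch of that approach: a cancelable lifecycle runnable whose Cancel() hands a "failed" result back to the main thread so the RegisterJob can continue as if the install failed. The class and member names (LifecycleEventRunnable, mCallback, LifeCycleEventCallback) are placeholders, not the actual patch; only the dispatch-to-main-thread pattern comes from the description above.

  // Hypothetical sketch, not the landed code: report failure when the
  // runnable is canceled (e.g. the worker is shutting down) instead of
  // leaving the register job waiting forever.
  class LifecycleEventRunnable final : public WorkerRunnable
  {
  public:
    NS_IMETHOD
    Cancel() override
    {
      // Tell the main-thread callback the install step failed, then let
      // the RegisterJob continue down its failure path.
      mCallback->SetResult(false);
      MOZ_ALWAYS_TRUE(NS_SUCCEEDED(NS_DispatchToMainThread(mCallback)));
      return WorkerRunnable::Cancel();
    }

  private:
    RefPtr<LifeCycleEventCallback> mCallback; // placeholder callback type
  };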
(Clarifying title a bit)
Summary: "Harness status: Timeout" when running "register-wait-forever-in-install-worker.https.html" test → New Register/Update tasks should terminate existing installing worker
Bug 1189659 - Fail lifecycle task when worker terminates with pending promise. r?baku

Dispatches a runnable to the main thread to indicate a failed lifecycle task if the worker starts shutting down and the lifecycle task promise has not resolved. The ServiceWorkerRegisterJob constructors terminate an existing installing worker. These two together ensure that a new register() or update() can proceed even if an old job is blocked on installation.

Update web-platform-tests expected data.
Attachment #8665641 - Flags: review?(amarchesini)
Comment on attachment 8665641 [details]
MozReview Request: Bug 1189659 - Fail lifecycle task when worker terminates with pending promise. r?baku

https://reviewboard.mozilla.org/r/20281/#review18289

::: testing/web-platform/meta/subresource-integrity/subresource-integrity.html.ini:17
(Diff revision 1)
> +

If you remove empty lines at the end of any .ini file, don't add it here.
Attachment #8665641 - Flags: review?(amarchesini) → review+
It seems like there are some race conditions which manifest in the form of intermittent test failures. Note the difference between the debug and opt test runs of mochitest-4. I haven't really looked into it in detail. Please ignore the failures in dom/push/test/test_serviceworker_lifetime.html; those are caused by something else in my patch queue.
Bumping this. I think we want to do something, at least to make sure that |event.waitUntil(new Promise(() => {}));| doesn't create a zombie worker.
Blocks: 1189023
No longer blocks: 1180622
Assignee: nobody → catalin.badea392
(In reply to Ehsan Akhgari (don't ask for review please) [Away Nov 3-19] from comment #8)
> Bumping this. I think we want to do something, at least to make sure that
> |event.waitUntil(new Promise(() => {}));| doesn't create a zombie worker.

The zombie worker bug happens because our job queue in ServiceWorkerManager.cpp expects the lifecycle promises to be resolved or rejected at some point before it can continue.
Note: for install events we need to fail the installation if the worker times out, rejects the waitUntil promise, or otherwise fails. For the activate event, though, we need to allow the SW to proceed to become the active worker. This is spec'd but was not immediately obvious to me. See: https://github.com/slightlyoff/ServiceWorker/issues/776
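To make that asymmetry concrete, here is a rough sketch; the function and helper names (HandleLifeCycleResult, FailInstall, ContinueInstall, ContinueActivate) are hypothetical and only illustrate the control flow, not the actual implementation.

  // Hypothetical sketch: only the install step is gated on the lifecycle
  // event's outcome; activation proceeds even if waitUntil rejected or the
  // worker timed out, per the spec discussion linked above.
  void
  HandleLifeCycleResult(const nsAString& aEventName, bool aSucceeded)
  {
    if (aEventName.EqualsLiteral("install")) {
      if (!aSucceeded) {
        FailInstall();     // hypothetical: reject the register/update job
        return;
      }
      ContinueInstall();   // hypothetical: move the worker to "installed"
      return;
    }

    // "activate": the worker becomes the active worker regardless.
    ContinueActivate();    // hypothetical
  }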
Status: NEW → ASSIGNED
Attachment #8679199 - Attachment is obsolete: true
Attachment #8689182 - Attachment is obsolete: true
Please note the changes in bug 1186856 that I intend to land later tonight. That test times out with these patches applied.
Flags: needinfo?(catalin.badea392)
(In reply to Ben Kelly [:bkelly] from comment #15)
> Please note the changes in bug 1186856 that I intend to land later tonight.
> That test times out with these patches applied.

Thanks for the heads up. It's a bug in my patches, introduced after some crazy git rebases.
Flags: needinfo?(catalin.badea392)
Attachment #8689678 - Attachment is obsolete: true
Attachment #8689789 - Flags: review?(bkelly)
Attachment #8689679 - Attachment is obsolete: true
Attachment #8689790 - Flags: review?(bkelly)
Attachment #8689711 - Attachment is obsolete: true
Attachment #8689792 - Flags: review?(bkelly)
Comment on attachment 8689789 [details] [diff] [review]
Part 1 - Continue service worker job queue when life cycle events expire.

Review of attachment 8689789 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comments addressed.

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +355,5 @@
> +  NS_IMETHOD
> +  Cancel() override
> +  {
> +    mCallback->SetResult(false);
> +    MOZ_ALWAYS_TRUE(NS_SUCCEEDED(NS_DispatchToMainThread(mCallback)));

It seems this should call ReportResult(mWorkerPrivate->GetJSContext(), false)? Right now Cancel() is not setting mDone.

@@ +438,5 @@
> +
> +  void
> +  ReportResult(JSContext* aCx, bool aResult)
> +  {
> +    if (mDone) {

Please add mWorkerPrivate->AssertIsOnWorkerThread() to make it clear mDone handling is thread safe.

@@ +449,5 @@
> +      NS_RUNTIMEABORT("Failed to dispatch life cycle event handler.");
> +    }
> +
> +    mWorkerPrivate->RemoveFeature(aCx, this);
> +    mDone = true;

To minimize the potential for races, I think it's better to set mDone immediately after checking mDone. That way you can be sure that if anything you call in this method triggers re-entry, it won't double execute.

@@ +480,5 @@
>   }
> 
>   RefPtr<xpc::ErrorReport> xpcReport = new xpc::ErrorReport();
>   xpcReport->Init(report.report(), report.message(),
> +                 /* aIsChrome = */ false, mWorkerPrivate->WindowID());

This needs to be rebased. I just removed this error reporting code in bug 1225470. It has not merged to central yet.

@@ +529,2 @@
>   } else {
> +    watcher->ReportResult(aCx, false);

Is this necessary? It seems the watcher will get de-referenced and report the failure in its destructor. The comment above seems to suggest this is what you want to happen.
Attachment #8689789 - Flags: review?(bkelly) → review+
Comment on attachment 8689789 [details] [diff] [review]
Part 1 - Continue service worker job queue when life cycle events expire.

Review of attachment 8689789 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +413,5 @@
> +  MOZ_ASSERT(mWorkerPrivate);
> +  mWorkerPrivate->AssertIsOnWorkerThread();
> +  JSContext* cx = mWorkerPrivate->GetJSContext();
> +
> +  if (NS_WARN_IF(!mWorkerPrivate->AddFeature(cx, this))) {

Can you add a comment about why a feature is needed? It's not immediately obvious to me what this is protecting against. Doesn't the ServiceWorkerPrivate hold the worker open while the lifecycle is progressing?
Comment on attachment 8689790 [details] [diff] [review]
Part 2 - Remove set of scopes being updated from ServiceWorkerManager.

Review of attachment 8689790 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comments.

::: dom/workers/ServiceWorkerPrivate.cpp
@@ +126,5 @@
>   bool
>   WorkerRun(JSContext* aCx, WorkerPrivate* aWorkerPrivate) override
>   {
>     aWorkerPrivate->AssertIsOnWorkerThread();
> +    mCallback->SetResult(aWorkerPrivate->WorkerScriptExecutedSuccessfully());

Nice! I was not aware of WorkerScriptExecutedSuccessfully. Much cleaner since we simply need to reject promises with a TypeError and not the specific script error.

@@ +139,5 @@
> +  {
> +    mCallback->SetResult(false);
> +    MOZ_ALWAYS_TRUE(NS_SUCCEEDED(NS_DispatchToMainThread(mCallback)));
> +
> +    mDone = true;

nit: It would be nice if the callback SetResult(), dispatch to main thread, and mDone = true were all done in a single method called from both WorkerRun() and Cancel(). It minimizes the chance we miss some cleanup step.
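As an aside, a minimal sketch of what that single method could look like, built only from the lines quoted above (mCallback, mDone, the main-thread dispatch); it is illustrative, not the landed code.

  // Hypothetical helper called from both WorkerRun() and Cancel(), so the
  // result reporting and cleanup cannot diverge between the two paths.
  void
  ReportResult(bool aResult)
  {
    if (mDone) {
      return;
    }
    // Set mDone before doing anything that could re-enter this method,
    // as suggested in the review of part 1.
    mDone = true;

    mCallback->SetResult(aResult);
    MOZ_ALWAYS_TRUE(NS_SUCCEEDED(NS_DispatchToMainThread(mCallback)));
  }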
Attachment #8689790 - Flags: review?(bkelly) → review+
(In reply to Ben Kelly [:bkelly] from comment #21)
> Can you add a comment about why a feature is needed? It's not immediately
> obvious to me what this is protecting against. Doesn't the
> ServiceWorkerPrivate hold the worker open while the lifecycle is progressing?

It does, but we can get stuck when registering just one service worker and that service worker waits forever in the install handler. In that case we wait for the keepAliveToken to expire and terminate the worker, so we need the feature to catch this and continue the job, which here means failing it.
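As an illustration of that answer, here is a sketch of how such a worker feature could catch the shutdown. The Notify() signature follows the usual WorkerFeature pattern, but treat the details (status names, ReportResult) as assumptions rather than the exact landed code.

  // Hypothetical: the watcher is added via AddFeature() on the worker
  // thread. When the worker starts shutting down (e.g. the keepAliveToken
  // expired while the lifecycle promise is still pending), Notify() fires
  // and we fail the job so the queue in ServiceWorkerManager can continue.
  bool
  Notify(JSContext* aCx, workers::Status aStatus) override
  {
    if (aStatus < Terminating) {
      // Not shutting down yet; nothing to do.
      return true;
    }

    ReportResult(aCx, false); // report failure back to the main thread
    return true;
  }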
(In reply to Ben Kelly [:bkelly] from comment #20)
> It seems this should call ReportResult(mWorkerPrivate->GetJSContext(),
> false)? Right now Cancel() is not setting mDone.

Well, ReportResult is a member of LifeCycleEventWatcher, which is only created if the runnable is executed (WorkerRun -> DispatchLifeCycleEvent). The assumption is that WorkerRun and Cancel are never both called for the same runnable. Should I still make this change?

> @@ +529,2 @@
> >   } else {
> > +    watcher->ReportResult(aCx, false);
>
> Is this necessary? It seems the watcher will get de-referenced and report
> the failure in its destructor. The comment above seems to suggest this is
> what you want to happen.

I usually don't like to rely on destructor logic because the cycle collector may delay things, but I think you're right.
Comment on attachment 8689792 [details] [diff] [review]
Part 3 - Use separate synchronization queues for service worker register jobs and install jobs.

Review of attachment 8689792 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comments addressed. Thanks!

::: dom/workers/ServiceWorkerManager.cpp
@@ +153,5 @@
>   NS_DECL_ISUPPORTS
> 
> +  enum QueueType
> +  {
> +    // Adding a new entry here requires updating ServiceWorkerQueue::GetQueue.

Do we have a bug to add an activation queue as defined in the spec? Or do we handle those semantics in a different way?

@@ +155,5 @@
> +  enum QueueType
> +  {
> +    // Adding a new entry here requires updating ServiceWorkerQueue::GetQueue.
> +    RegistrationQueue,
> +    InstallationQueue

We already have a JobType enumeration in the ServiceWorkerRegisterJob. Can we pull that out to a public enumeration and then add installing to it?

Then we can have a single type used for both the queue and the jobs. The JobQueue could internally map register/update to the registration queue.

I think having multiple enumerations that partly overlap, but don't fully line up, will be confusing to maintain in the future.

@@ +201,5 @@
> +      NS_WARNING("Pending/running jobs still around on shutdown!");
> +    }
> +  }
> +
> +  nsTArray<RefPtr<ServiceWorkerJob>> mJobs;

I assume this is only accessed on the main thread, but the JobQueue code seems to be missing thread assertions. Can you add AssertIsOnMainThread() throughout the JobQueue methods?

@@ +239,5 @@
>   void
> +  CancelJobs(QueueData& aQueue);
> +
> +  QueueData&
> +  GetQueue(ServiceWorkerJob::QueueType aType)

Rather than expose an owned internal data structure that then gets operated on, I think it would be cleaner to simply add the type code to public Append() and Pop() methods. Something like:

  mJobQueue->Append(mJobType, aJob);

And

  mJobQueue->Pop(mJobType);

This makes the public interface of the JobQueue clear and doesn't require multi-step state to be handled properly by the caller. All the checking can be enforced internal to the JobQueue itself.

@@ +954,5 @@
> +  // Done() with the failure code instead.
> +  // Callers MUST hold a strong ref before calling this!
> +  void
> +  Fail(ErrorResult& aRv)
> +  {

Are there any changes to this method? Or was it just moved into the base class? I'm going to just assume it was moved without changes for now.

@@ +1127,5 @@
> +      mRegistration->mWaitingWorker->WorkerPrivate()->TerminateWorker();
> +      mRegistration->mWaitingWorker->UpdateState(ServiceWorkerState::Redundant);
> +
> +      nsresult rv = serviceWorkerScriptCache::PurgeCache(mRegistration->mPrincipal,
> +                                                         mRegistration->mWaitingWorker->CacheName());

nit: I know you just moved this line, but line wrapping.

@@ +1425,5 @@
>   void
>   ServiceWorkerJobQueue::CancelJobs()
>   {
> +    CancelJobs(mRegistrationJobQueue);
> +    CancelJobs(mInstallationJobQueue);

Does ordering matter here? If so, it might be good to comment on that. I assume canceling the registration queue first is best so as not to fill the installation queue after it has been cleared.

@@ -1529,2 @@
>   ServiceWorkerJobQueue* queue = GetOrCreateJobQueue(originSuffix, cleanedScope);
> -  MOZ_ASSERT(queue);

It seems this assertion is dropped, but queue is still used below even if it's nullptr? It's hard to tell if another patch changed that because it's off the splinter diff context.
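To make the suggested Append()/Pop() shape concrete, here is a sketch of a queue whose public surface takes the type directly; the names QueueData, mJobs, mRegistrationJobQueue and mInstallationJobQueue come from the quoted diff, while the method bodies are hypothetical.

  // Hypothetical sketch of the suggested public interface: callers pass the
  // queue type to Append()/Pop() and never touch the internal QueueData.
  class ServiceWorkerJobQueue final
  {
  public:
    void
    Append(ServiceWorkerJob::QueueType aType, ServiceWorkerJob* aJob)
    {
      AssertIsOnMainThread();
      QueueData& queue = GetQueue(aType);
      bool wasEmpty = queue.mJobs.IsEmpty();
      queue.mJobs.AppendElement(aJob);
      if (wasEmpty) {
        aJob->Start(); // hypothetical: kick off the job if the queue was idle
      }
    }

    void
    Pop(ServiceWorkerJob::QueueType aType)
    {
      AssertIsOnMainThread();
      QueueData& queue = GetQueue(aType);
      queue.mJobs.RemoveElementAt(0);
      if (!queue.mJobs.IsEmpty()) {
        queue.mJobs[0]->Start(); // hypothetical: start the next queued job
      }
    }

  private:
    // All type checking stays internal to the queue.
    QueueData&
    GetQueue(ServiceWorkerJob::QueueType aType)
    {
      return aType == ServiceWorkerJob::RegistrationQueue
               ? mRegistrationJobQueue : mInstallationJobQueue;
    }

    QueueData mRegistrationJobQueue;
    QueueData mInstallationJobQueue;
  };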
Attachment #8689792 - Flags: review?(bkelly) → review+
(In reply to Cătălin Badea (:catalinb) from comment #23)
> It does, but we can get stuck when registering just one service worker and
> that service worker waits forever in the install handler. In that case we
> wait for the keepAliveToken to expire and terminate the worker, so we need
> the feature to catch this and continue the job, which here means failing it.

Ok, I think a comment to that effect would be useful.

(In reply to Cătălin Badea (:catalinb) from comment #24)
> Well, ReportResult is a member of LifeCycleEventWatcher, which is only
> created if the runnable is executed (WorkerRun -> DispatchLifeCycleEvent).
> The assumption is that WorkerRun and Cancel are never both called for the
> same runnable. Should I still make this change?

Oh, good point. I missed that they were different objects. You can forget this one.

> > Is this necessary? It seems the watcher will get de-referenced and report
> > the failure in its destructor. The comment above seems to suggest this is
> > what you want to happen.
>
> I usually don't like to rely on destructor logic because the cycle collector
> may delay things, but I think you're right.

Either way is fine. This one was more of a nit.
(In reply to Ben Kelly [:bkelly] from comment #25)
> Do we have a bug to add an activation queue as defined in the spec? Or do
> we handle those semantics in a different way?

We handle that part by racing :). I couldn't think of a situation where this has a negative side effect, though. Filed bug 1226441 to investigate it later.

> We already have a JobType enumeration in the ServiceWorkerRegisterJob. Can
> we pull that out to a public enumeration and then add installing to it?
>
> Then we can have a single type used for both the queue and the jobs. The
> JobQueue could internally map register/update to the registration queue.
>
> I think having multiple enumerations that partly overlap, but don't fully
> line up, will be confusing to maintain in the future.

Good point!

> Are there any changes to this method? Or was it just moved into the base
> class? I'm going to just assume it was moved without changes for now.

There is a small change: the previous version would check if mRegistration is null and either use mRegistration->mScope/mScriptSpec or mScope/mScriptSpec for error reporting:
https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerManager.cpp?from=ServiceWorkerManager.cpp#1121

However, mRegistration is never actually null when we call Fail(); this was already enforced by a "hidden" assert in MaybeRemoveRegistration, which is always called from Fail():
https://dxr.mozilla.org/mozilla-central/source/dom/workers/ServiceWorkerManager.cpp?from=ServiceWorkerManager.cpp#1153

I added an assert to make this obvious and changed it to always use mRegistration when constructing the ErrorResult. ServiceWorkerJobBase doesn't have the mScope/mScriptSpec members.

The other change is that Fail is now a protected member of ServiceWorkerJobBase, which I think is better. The reason it was public before was to be accessible from HandleError.

> Does ordering matter here? If so, it might be good to comment on that. I
> assume canceling the registration queue first is best so as not to fill the
> installation queue after it has been cleared.

It doesn't really matter, Cancel just sets a flag on the job objects. I'll add a comment.

> It seems this assertion is dropped, but queue is still used below even if
> it's nullptr? It's hard to tell if another patch changed that because it's
> off the splinter diff context.

This change was made by mistake.

Thanks for the review!
Unfortunately these patches seem to break the updates triggered by navigation or registration.

Go to this page:

https://adriftwith.me/sandbox/service_worker_reload/go.html

Open the web console and refresh a few times.

Without these patches we get an install message with a new time on every refresh. This is due to the SW updating.

With these patches we don't get the console message. In addition, checking about:serviceworkers shows that the service worker is not updating. The uuid remains unchanged.

Can you please investigate what is going on here? I don't think we can include these patches in the release if they break updates.
Flags: needinfo?(catalin.badea392)
(In reply to Ben Kelly [:bkelly] from comment #28)
> Unfortunately these patches seem to break the updates triggered by
> navigation or registration.
> ...
> Can you please investigate what is going on here? I don't think we can
> include these patches in the release if they break updates.

I'm still investigating this, but it kind of works for me.

Testing with yesterday's m-c + my patches:
Refreshing the web page logs the install and activate messages in the browser console (not the web console), twice for each refresh (2 install events and 2 activate events). Is this because we're trying to soft update and then register performs another update? Next, refreshing about:serviceworkers after each refresh of the demo page reveals a new active worker cache UUID and a new waiting worker as well.

Testing with today's inbound + my patches:
Refreshing the web page again logs to the browser console, but only once this time (1 install + 1 activate message). Refreshing about:serviceworkers again reveals new UUIDs for the active and waiting workers.

What's the patch queue you're using?
Flags: needinfo?(catalin.badea392)
You see it on each refresh? I saw the events the first time the page loaded, but not when hitting refresh repeatedly.
> What's the patch queue you're using?

Just the patches from this bug. Do you have local changes compared to what is uploaded here?
Attachment #8689789 - Attachment is obsolete: true
Attachment #8689790 - Attachment is obsolete: true
Attached patch Part 4 - Fix race in test_install_event.html. (obsolete) (deleted) — Splinter Review
The test does the following:
1. Registers a service worker and waits for the updatefound event.
2. Registers a new service worker that throws in the install handler.
3. Calls getRegistration and checks that there's an active worker.

The problem is that if the first service worker doesn't finish installing by the time the second registration reaches step 4.2 of [[Update]], it will be terminated and never reach the activating state.
Updates look good with latest patches. Sorry for comment 28. I must have been running with the patches that had the bad git rebase.
Attachment #8690144 - Flags: review?(bkelly)
(In reply to Cătălin Badea (:catalinb) from comment #35)
> Created attachment 8690144 [details] [diff] [review]
> Part 4 - Fix race in test_install_event.html.
>
> The test does the following:
> 1. Registers a service worker and waits for the updatefound event.
> 2. Registers a new service worker that throws in the install handler.
> 3. Calls getRegistration and checks that there's an active worker.
>
> The problem is that if the first service worker doesn't finish installing by
> the time the second registration reaches step 4.2 of [[Update]], it will be
> terminated and never reach the activating state.

This is the try run with the mochitest failing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a2c0b180c6b
Attachment #8690144 - Flags: review?(bkelly) → review+
Catalin, what's the status of this bug?
Flags: needinfo?(catalin.badea392)
Just need to land bug 1227932 first.
Depends on: 1227932
Flags: needinfo?(catalin.badea392)
rebased.
Attachment #8690130 - Attachment is obsolete: true
rebased
Attachment #8690131 - Attachment is obsolete: true
Attachment #8690144 - Attachment is obsolete: true
In test-skip-waiting.html it is possible to run all skipWaiting() calls before the worker reaches the "installing" state, with all of them failing. This might cause the test to time out. Filed 1189659 to fix the actual bug, and tweaked the test to always pass for now.
Attachment #8692480 - Flags: review?(ehsan)
Attachment #8692480 - Flags: review?(ehsan) → review+
Comment on attachment 8692143 [details] [diff] [review]
Part 1 - Continue service worker job queue when life cycle events expire.

Approval Request Comment
Request for entire patch series (parts 1-5)
[Feature/regressing bug #]: Service Workers
[User impact if declined]: SW will ship in 44 and this fixes a bug in common usage scenarios.
[Describe test coverage new/current, TreeHerder]: Covered by web platform tests.
[Risks and why]: Low, SW-targeted change.
[String/UUID change made/needed]: None
Attachment #8692143 - Flags: approval-mozilla-aurora?
Comment on attachment 8692143 [details] [diff] [review]
Part 1 - Continue service worker job queue when life cycle events expire.

SW is planned for FF44 and these patches have been in Nightly for a few days, so it seems safe to uplift to Aurora44.
Attachment #8692143 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8692144 [details] [diff] [review] Part 2 - Remove set of scopes being updated from ServiceWorkerManager. Aurora44+
Attachment #8692144 - Flags: approval-mozilla-aurora+
Comment on attachment 8692145 [details] [diff] [review] Part 3 - Use separate synchronization queues for service worker register jobs and install jobs. Aurora44+
Attachment #8692145 - Flags: approval-mozilla-aurora+
Comment on attachment 8692146 [details] [diff] [review] Part 4 - Fix race in test_install_event.html. Aurora44+
Attachment #8692146 - Flags: approval-mozilla-aurora+
Comment on attachment 8692480 [details] [diff] [review] Part 5 - Fix race in skip-waiting.https.html and add some logging for SkipWaitingFlag in ServiceWorkerManager. Aurora44+
Attachment #8692480 - Flags: approval-mozilla-aurora+
Problems during uplift of part 2:

grafting 317032:8a417a349a61 "Bug 1189659 - Part 2 - Remove set of scopes being updated from ServiceWorkerManager. r=bkelly"
merging dom/workers/ServiceWorkerManager.cpp
merging dom/workers/ServiceWorkerPrivate.cpp
merging dom/workers/ServiceWorkerPrivate.h
warning: conflicts while merging dom/workers/ServiceWorkerManager.cpp! (edit, then use 'hg resolve --mark')
Flags: needinfo?(catalin.badea392)