Closed
Bug 1324820
Opened 8 years ago
Closed 8 years ago
IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments
Categories
(Core :: DOM: Content Processes, defect, P1)
Tracking
RESOLVED
FIXED
mozilla53
| | Tracking | Status |
| --- | --- | --- |
| platform-rel | --- | ? |
| firefox-esr45 | --- | unaffected |
| firefox50 | --- | unaffected |
| firefox51 | --- | unaffected |
| firefox52 | --- | unaffected |
| firefox53 | --- | fixed |
People
(Reporter: kjozwiak, Assigned: tnguyen)
References
(Blocks 1 open bug)
Details
(Keywords: regression, Whiteboard: [platform-rel-Twitter])
Crash Data
Attachments
(2 files, 3 obsolete files)
Created a new bug as per bug 1279293 comment 63. When you close Firefox while the browser is autoplaying a GIF from one of Twitter's "Moments" pages, you'll receive the [@ IPCError-browser | ShutDownKill] crash every time the browser is shut down. I can reproduce this 100% of the time using the following STR:
* install the latest version of m-c
* open https://twitter.com/i/moments/810878086774456320
* let one of the random gifs start autoplaying and quickly shutdown FX
You'll notice that the shutdown process hangs and it takes several seconds for Firefox to close correctly.
I've created a quick video of the STR that I used to reproduce the crash on a brand new m-c installation under macOS 10.12.2:
* https://youtu.be/SQ6x5iWVaqg
Crash Reports:
* bp-ff8ee990-5b23-43c3-a3e8-4e48e2161219
* bp-1146cec8-6ed7-42cf-9571-67f872161219
* bp-1384ea42-7d95-46f8-be8b-a3d5e2161219
* bp-e296be83-7fa1-4ede-afb0-455cb2161219
Reporter
Updated•8 years ago
Has STR: --- → yes
Summary: IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter oments → IPCError-browser | ShutDownKill received when closing FX while a gif is autoplaying from Twitter moments
Assignee
Comment 2•8 years ago
Thanks for reporting this, I am looking at this bug.
At first look, I think the reason is that XMLHttpRequestMainThread::SendInternal opened an HTTP channel (AsyncOpen2). Although Firefox is shutting down, the channel is still suspended and never gets its callback. The 6-second ShutDownKill timer then fires and causes a silent crash.
Flags: needinfo?(tnguyen)
Updated•8 years ago
Comment 3•8 years ago
This bug is a regression caused by bug 1315386, per Thomas.
Assignee
Comment 4•8 years ago
Assignee
Comment 5•8 years ago
MozReview-Commit-ID: IGsPCkrEEuG
Assignee
Comment 6•8 years ago
This silent crash does not occur if I compile and build Firefox from source. I've just pushed my patch to try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7c5f5fb81ba97448a92c3cc99018db65c41e68a&selectedJob=33499220
Could you please download the build from the try run
https://queue.taskcluster.net/v1/task/Ro5Ghbi-SR-o6TP6OUl39g/runs/0/artifacts/public/build/target.dmg
and see if the patch fixes this bug?
Flags: needinfo?(kjozwiak)
Reporter
Comment 7•8 years ago
I reproduced the original issue using the STR from comment #0 with the following build:
* fx53.0a1, buildid: 20161228030213, changeset: d7b6af32811b
** https://archive.mozilla.org/pub/firefox/nightly/2016/12/2016-12-28-03-02-13-mozilla-central/
Using the build that Thomas provided in comment#6, I went through the following test cases:
* going through STR using regular windows/tabs
* going through STR using default container window/tabs
* going through STR using custom container window/tabs
* going through STR using PB window/tabs
* opened the website from the STR in two different tabs and closed FX
** container/regular tab
** two default tabs
** container/PB
** default container/custom container
* checked about:crashes after every test case to ensure that new silent crashes haven't been introduced
Platforms Used: (I'll check Win/Linux once this lands into m-c)
* macOS 10.12.2 - PASSED
Thomas, it looks like the try build has fixed the issue. I tried several different test cases and couldn't reproduce the original crash.
Flags: needinfo?(kjozwiak)
Assignee
Updated•8 years ago
Attachment #8822138 - Flags: review?(gpascutto)
Comment 8•8 years ago
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill
This looks correct (DoLookup does a check for shuttingdown and will immediately resolve with null if so), but at the same time feels also quite icky. We're resolving callbacks when we already got notified that we're supposed to be shutting down.
Tracing this, it seems to be caused by sync XHR, which would partially explain why it's possible for the missing callback to block anything at all. I see there's work done to resolve issues where sync XHR times out for any reason. Would it make sense to add checks here to prevent the spinning when the browser is shutting down? i.e. break out the spinning loop on shutdown.
I'm going to f? a few people here because this may be a case where the real problem is not in this code.
https://dxr.mozilla.org/mozilla-central/rev/143bb4b9249e528e658f6ccc449991794b8675f8/dom/xhr/XMLHttpRequestMainThread.cpp#3006
Attachment #8822138 - Flags: feedback?(bugs)
Attachment #8822138 - Flags: feedback?(amarchesini)
Comment 9•8 years ago
Comment on attachment 8822138 [details] [diff] [review]
Resolve all pending lookups to fix IPCError-browser | ShutDownKill
Can't give any feedback about this patch, but indeed, cancelling sync XHR when the TabChild/ContentChild is going away makes sense, if doable.
Note, we also need to check all the other cases where we spin the event loop: modal dialogs at least, but possibly also some random JS scripts.
Attachment #8822138 - Flags: feedback?(bugs)
Comment 10•8 years ago
Some discussion from #developers:
21:08 <gcp> the patch there is a fix, but I'm not sure if that's the correct fix
21:08 <gcp> i.e. going back and issuing a bunch of callbacks in shutdown code
21:08 <@bz> yeah, that's a bit odd.
21:09 <@bz> I'd think necko would cancel the channel and send OnStopRequest to the XHR....
21:09 <gcp> necko is probably blocked on the classifier returning from the classification callback
21:09 <gcp> though that wasn't in the stack
21:16 <gcp> oh, the original bug is already in DOM :P
21:18 <@bz> gcp: necko should cancel the classification and move on, I'd think
21:19 <@bz> gcp: just like it would if the user navigated away from the page!
21:19 — @bz really hopes we would not keep waiting on the classifier then.
Can we shelve the current patch and investigate why necko is unable to unstick itself on shutdown here? (note mozilla::dom::TabChild::RecvDestroy() in the stack)
Updated•8 years ago
platform-rel: --- → ?
Whiteboard: [platform-rel-Twitter]
Assignee
Comment 11•8 years ago
Thanks gcp and smaug for looking at this. It is indeed blocked in a sync XHR until the channel stops and sends OnStopRequest to the XHR (which then sets mFlagSyncLooping to false).
I am considering not waiting on the classifier and instead cancelling all the channels in the pending queue. However, since we start shutting down the classifier very early, at "quit-application", might we lose some tasks during shutdown, for example cancelling an XHR request issued from an unload event? Hmm, I don't know, maybe I am wrong.
Comment 12•8 years ago
Crash volume for signature 'IPCError-browser | ShutDownKill':
- nightly (version 53): 79125 crashes from 2016-11-14.
- aurora (version 52): 118636 crashes from 2016-11-14.
- beta (version 51): 412191 crashes from 2016-11-14.
- release (version 50): 3115 crashes from 2016-11-01.
- esr (version 45): 132 crashes from 2016-07-06.
Crash volume on the last weeks (Week N is from 01-02 to 01-08):
W. N-1 W. N-2 W. N-3 W. N-4 W. N-5 W. N-6 W. N-7
- nightly 11114 11304 13841 11315 10841 9916 7074
- aurora 18482 20233 21166 20277 19016 13352 152
- beta 4854 11234 104993 116974 80635 70864 21296
- release 381 401 650 560 456 367 157
- esr 3 3 8 2 4 10 7
Affected platforms: Windows, Mac OS X, Linux
Crash rank on the last 7 days:
Browser Content Plugin
- nightly #1
- aurora #1
- beta #1
- release #29
- esr #1
status-firefox50: --- → affected
status-firefox51: --- → affected
status-firefox52: --- → affected
status-firefox-esr45: --- → affected
Assignee
Comment 14•8 years ago
Hi :smaug
Should we cancel all pending channels in the classifier queue after the "quit-application" event?
Flags: needinfo?(bugs)
Comment 15•8 years ago
I guess so. Do we not cancel the channel automatically? Why doesn't Necko call OnStopRequest or such?
Or is the issue that OnStopRequest happens later, and classifier doesn't expect it to happen so late?
Flags: needinfo?(bugs)
Assignee
Comment 16•8 years ago
Oh, the channel was not cancelled automatically. Another approach, which I think is more reasonable, is to not add any pending channel once we are shutting down.
That way, we simply do not accept any request after the "quit-application" event rather than cancelling pending requests, but we may reject requests (if there are any) issued from unload events.
Assignee
Updated•8 years ago
Attachment #8822138 - Flags: review?(gpascutto)
Attachment #8822138 - Flags: feedback?(amarchesini)
Assignee
Comment 17•8 years ago
MozReview-Commit-ID: LfxKf8JNOCC
Assignee
Updated•8 years ago
Attachment #8822138 - Attachment is obsolete: true
Assignee
Comment 18•8 years ago
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels
This crash signature did not appear before (or only rarely appeared); it started occurring frequently after we made the classifier shutdown-aware (stop doing classifier work while shutting down).
I did not trace why necko does not send OnStopRequest (or cancel), but I think we still need to resolve pending lookups, otherwise there might be a leak in the worker. Updated the patch to send NS_ERROR_ABORT to cancel the channel (or we could observe shutdown in nsChannelClassifier).
Attachment #8824356 - Flags: feedback?(gpascutto)
Comment 19•8 years ago
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels
Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/components/url-classifier/nsUrlClassifierDBService.cpp
@@ +1179,5 @@
>
> NS_IMETHODIMP
> nsUrlClassifierClassifyCallback::HandleEvent(const nsACString& tables)
> {
> nsresult response = TablesToResponse(tables);
If we're shutting down, it would be nice to avoid this call too.
So perhaps `nsresult response = NS_OK;` and then do `response = TablesToResponse(tables);` after the `if (gShuttingDownThread)` check?
Assignee
Comment 20•8 years ago
Thanks Francois.
I can see the channel is cancelled during shutdown, but there is no OnStopRequest. IMHO, a suspended request is not supposed to dispatch events (such as OnStopRequest) until the request is resumed, which then leads to a hang in the sync XHR here.
For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the channel and then resume it, which resolves the hang. But I am wondering: do we need to resume all suspended channels when Firefox is shutting down, to avoid the same accidental issue later?
Flags: needinfo?(dd.mozilla)
Assignee
Comment 21•8 years ago
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I can see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which then leads to a
> hang in the sync XHR here.
More clearly: after Cancel(), we need to call Resume() to send out OnStopRequest.
Comment 22•8 years ago
Comment on attachment 8824356 [details] [diff] [review]
Resolve all pending lookups and cancel pending channels
Review of attachment 8824356 [details] [diff] [review]:
-----------------------------------------------------------------
> but, I think we still need to resolve pending lookups, otherwise there might be a leak in worker
I don't understand. This patch will cause nsIUrlClassifierLookupCallback->LookupComplete() to be called on the entries in mPendingLookups. How does that stop a leak?
Attachment #8824356 - Flags: feedback?(gpascutto)
Assignee
Comment 23•8 years ago
Oh, if there is a pending lookup that has not been resolved, in other words mPendingLookups (in mWorker) contains an element that was not freed.
That occurs at shutdown, and it might cause some intermittent leak test failures, I think.
https://dxr.mozilla.org/mozilla-central/source/toolkit/components/url-classifier/nsUrlClassifierDBService.cpp#309
Comment 24•8 years ago
It's an nsTArray, it will free the elements on destruction. There's no raw pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting a leakcheck fail because shutdown hangs in the middle, so destructors are never run.
Unless I missed something obvious or there are side effects in other code, I do not believe this is valid reasoning for executing these callbacks when we're supposed to be shutting down.
Assignee
Comment 25•8 years ago
Thanks gcp,
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never run.
Shawn told me that there seems to be a highly intermittent leak in the XHR tests, and I found:
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierClassifyCallback
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBService
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierDBServiceWorker
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsUrlClassifierLookupCallback
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 10 nsWeakReference
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 nsXMLHttpRequestXPCOMifier
07:14:13 INFO - TEST-INFO | leakcheck | default process: leaked 1 xptiInterfaceInfo
Probably nsUrlClassifierLookupCallback leaks in the middle of shutting down
Comment hidden (typo)
Assignee
Comment 27•8 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #24)
> It's an nsTArray, it will free the elements on destruction. There's no raw
> pointers in it, only an nsCOMPtr, so it can't leak. You're probably getting
> a leakcheck fail because shutdown hangs in the middle, so destructors are
> never called
Oh, you are right. I looked at the code again. Thank you.
Assignee
Comment 28•8 years ago
Shawn pushed a patch in bug 1324856 to resolve the loop in sync XHR (in XMLHttpRequestMainThread); the basic idea is to catch "shutdown" and then break the loop, but I am not sure it is a complete fix. I am taking a look at the necko part today to see why necko cannot unstick itself; I just turned on nsHttp and ipdl logging and am tracing the log.
- OnStopRequest is expected to be received in XHR, otherwise it will hang.
- In this case the parent channel is suspended (and also its pump), and there will be no OnStopRequest until it is resumed.
- The same happens in the child channel: no OnStopRequest will be received from the parent and dispatched to listeners.
- The child and parent channels are cancelled during shutdown but cannot delete themselves (I could not see any "Destroying HttpChannelParent" or HttpChannelChild log lines related to the channels). These channels are supposed to be deleted after getting OnStopRequest and then sending/receiving the DeleteChannel and DeleteSelf messages.
The OnStopRequest event seems to be pending because the channel (or its pump) is suspended. I am not sure that resuming and then dispatching OnStopRequest (while Firefox is shutting down) is a good idea; I am unsure whether we need to resume the channel or the pump, in the parent or in the child, or just send out OnStopRequest (or do nothing and fix only XMLHttpRequestMainThread).
I am probably completely wrong and will continue looking at this to see what I can do, but I would like to ni? necko for ideas.
Hi Patrick, Dragana, do you have any idea about this bug?
Thanks
Assignee
Comment 29•8 years ago
MozReview-Commit-ID: 4urL2hoxm3L
Assignee
Updated•8 years ago
Attachment #8824356 - Attachment is obsolete: true
Assignee
Comment 30•8 years ago
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR
Hi Patrick,
Would you please comment on the patch? Do we need to resume the suspended channel in this case?
Thanks
Attachment #8825728 - Flags: feedback?(mcmanus)
Comment 31•8 years ago
(In reply to Thomas Nguyen[:tnguyen] (use ni? plz) from comment #20)
> Thanks Francois.
> I can see the channel is cancelled during shutdown, but there is no
> OnStopRequest. IMHO, a suspended request is not supposed to dispatch events
> (such as OnStopRequest) until the request is resumed, which then leads to a
> hang in the sync XHR here.
> For now, if we send NS_ERROR_ABORT to nsChannelClassifier we can cancel the
> channel and then resume it, which resolves the hang. But I am wondering: do
> we need to resume all suspended channels when Firefox is shutting down, to
> avoid the same accidental issue later?
Necko expects that every suspend has a matching resume, and it will not send any OnStartRequest, OnDataAvailable or OnStopRequest while the channel is suspended. Whoever suspends a channel needs to take care of resuming it; any more complex rule would be hard to follow. Necko expects that a channel is resumed exactly as many times as it is suspended.
Flags: needinfo?(dd.mozilla)
Comment 32•8 years ago
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR
Review of attachment 8825728 [details] [diff] [review]:
-----------------------------------------------------------------
I think we should resolve this in a different way. We should see what has suspended the channel and find a way for exactly that code to resume it. From that point OnStopRequest will be fired and XHR will not hang.
Assignee
Comment 33•8 years ago
Comment on attachment 8825728 [details] [diff] [review]
Fire OnStopRequest when shutting down to avoid shutdown hang in XHR
Thanks Dragana for clarifying the point.
Attachment #8825728 - Flags: feedback?(mcmanus)
Thomas,
I found that OnStopRequest will be fired in the non-e10s version, but not in e10s.
Assignee
Comment 36•8 years ago
MozReview-Commit-ID: 35jBJb1Q8Rz
Assignee
Updated•8 years ago
Attachment #8825728 - Attachment is obsolete: true
Assignee
Updated•8 years ago
Attachment #8826125 - Flags: feedback?(dd.mozilla)
Comment 37•8 years ago
Comment on attachment 8826125 [details] [diff] [review]
Make sure we resume the channel if there's no classifier callback
Francois is a better person to give you feedback on this.
Attachment #8826125 - Flags: feedback?(dd.mozilla) → feedback?(francois)
Updated•8 years ago
Attachment #8826125 - Flags: feedback?(francois) → feedback+
Comment hidden (mozreview-request)
Comment 39•8 years ago
Do you know why it works in non-e10s version and not in e10s?
Are you sure that Resume will not be called twice, once on the normal path and once on this shutdown path?
At least set mSuspendedChannel to false.
Assignee
Comment 40•8 years ago
(In reply to Dragana Damjanovic [:dragana] from comment #39)
> Do you know why it works in non-e10s version and not in e10s?
The XHR request is sent very late (when Firefox is already shutting down). In e10s, in my tests, it only occurs when we have such a late XHR request.
> Are you sure that Resume will not be called twice, once on the normal path
> and once on this shutdown path?
> At least set mSuspendedChannel to false.
Thanks Dragana.
I think that is already in the patch: mSuspendedChannel = false prevents Resume and Cancel from being called twice. But I have not requested review yet; I am running the XHR tests locally to see if there is still an intermittent leak involving nsUrlClassifierLookupCallback.
Thanks
Assignee
Updated•8 years ago
Attachment #8826458 - Flags: review?(gpascutto)
Comment 41•8 years ago
mozreview-review
Comment on attachment 8826458 [details]
Bug 1324820 - Make sure we resume the channel if there's no classifier callback
https://reviewboard.mozilla.org/r/104420/#review105674
r+ as per comment 32 saying this is the approach to use.
Attachment #8826458 - Flags: review?(gpascutto) → review+
Assignee
Comment 42•8 years ago
Assignee
Updated•8 years ago
Keywords: checkin-needed
Comment 43•8 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/bb0fc7b8fd04
Make sure we resume the channel if there's no classifier callback r=gcp
Keywords: checkin-needed
Comment 44•8 years ago
bugherder
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Comment 45•8 years ago
Too late for 51 as we are now in RC mode, but happy to take a patch for 52!
Thomas, could you file the uplift request? Thanks
Flags: needinfo?(tnguyen)
Assignee
Comment 46•8 years ago
status-firefox50: affected → wontfix
status-firefox51: affected → wontfix
status-firefox-esr45: affected → wontfix
The IPCError-browser | ShutDownKill signature is caused by various things, and what I fixed here is only one of them: the case related to nsChannelClassifier. The status flags above may not be exact, and I am not sure whether uplifting to 52 would help in this case.
Kamil, did you see a similar issue in 52?
Flags: needinfo?(tnguyen) → needinfo?(kjozwiak)
Reporter
Comment 47•8 years ago
> Kamil, did you see a similar issue in 52?
Thomas, I couldn't reproduce the crash using the STR from comment #0 with the other release channels (listed below). It seems like this particular case only affected m-c.
I made sure I could still reproduce the original issue with a version of m-c from before comment #44. Once I reproduced the crash, I went through the following channels:
* fx52.0a2 - couldn't reproduce
** buildid: 20170120004016, changeset: b2e633d19d68
* fx51.0b14 - couldn't reproduce
** buildid: 20170118123726, changeset: ea82b5e20cbb
* fx50.0.2 - couldn't reproduce
** buildid: 20161129173726, changeset: cc272f7d48d3
Flags: needinfo?(kjozwiak)
Comment 48•8 years ago
The status flags were set automatically by a bot in this bug. Given Kamil's comments and the fix at hand, I'm setting them all to unaffected to get this off the radar.
Updated•8 years ago
Blocks: shutdownkill
Updated•8 years ago
Updated•7 years ago
Blocks: IPCError_ShutDownKill