Closed
Bug 1051567
Opened 10 years ago
Closed 8 years ago
[e10s] Crash in [@ mozilla::ipc::MessageChannel::OnChannelErrorFromLink]
Categories
(Core :: IPC, defect, P4)
Tracking
()
People
(Reporter: szx, Assigned: kanru)
References
(Depends on 1 open bug, )
Details
(5 keywords)
Crash Data
Attachments
(2 files)
(deleted), text/plain | Details
(deleted), text/x-review-board-request | billm: review+ | lizzard: approval-mozilla-aurora+ | lizzard: approval-mozilla-beta+ | Details
User Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 (Beta/Release)
Build ID: 20140807212602
Actual results:
This is the stack trace:
03e0f298 580a1751 mozalloc!mozalloc_abort(char * msg = 0x03e0f2e0 "[4864] ###!!! ABORT: Aborting on channel error.: file c:/builds/moz2_slave/rel-m-beta-w32_bld-00000000000/build/ipc/glue/MessageChannel.cpp, line 1532")+0x2a [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\memory\mozalloc\mozalloc_abort.cpp @ 30]
03e0f6d0 58112d78 xul!NS_DebugBreak(unsigned int aSeverity = 3, char * aStr = 0x58ecf8f0 "Aborting on channel error.", char * aExpr = 0x00000000 "", char * aFile = 0x58ecf178 "c:/builds/moz2_slave/rel-m-beta-w32_bld-00000000000/build/ipc/glue/MessageChannel.cpp", int aLine = 0n1532)+0x1ff [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\xpcom\base\nsdebugimpl.cpp @ 451]
03e0f6f0 58113e02 xul!mozilla::ipc::MessageChannel::OnChannelErrorFromLink(void)+0x4e [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\glue\messagechannel.cpp @ 1532]
03e0f6fc 581016bc xul!mozilla::ipc::ProcessLink::OnChannelError(void)+0x1b [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\glue\messagelink.cpp @ 356]
03e0f70c 580ff0c1 xul!IPC::Channel::ChannelImpl::OnIOCompleted(struct base::MessagePumpForIO::IOContext * context = 0x0276d004, unsigned long bytes_transfered = 0, unsigned long error = 0x6d)+0x7c [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\chrome\common\ipc_channel_win.cc @ 452]
03e0f734 580ff17f xul!base::MessagePumpForIO::WaitForIOCompletion(unsigned long timeout = 0xffffffff, class base::MessagePumpForIO::IOHandler * filter = 0x00000000)+0x74 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 524]
03e0f744 580ff2ed xul!base::MessagePumpForIO::WaitForWork(void)+0x19 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 501]
03e0f750 580fe981 xul!base::MessagePumpForIO::DoRunLoop(void)+0x50 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 463]
03e0f770 580fedbb xul!base::MessagePumpWin::RunWithDispatcher(class base::MessagePump::Delegate * delegate = 0x03e0f7e8, class base::MessagePumpWin::Dispatcher * dispatcher = 0x00000000)+0x3c [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.cc @ 55]
03e0f77c 581072ae xul!base::MessagePumpWin::Run(class base::MessagePump::Delegate * delegate = 0x03e0f7e8)+0xb [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_pump_win.h @ 78]
03e0f7b4 5810737d xul!MessageLoop::RunHandler(void)+0x51 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_loop.cc @ 223]
03e0f7d4 58109d40 xul!MessageLoop::Run(void)+0x19 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\message_loop.cc @ 197]
03e0f8c0 585f35a6 xul!base::Thread::ThreadMain(void)+0xa4 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\ipc\chromium\src\base\thread.cc @ 171]
03e0f8c4 754a919f xul!ThreadEntry(void * arg = 0x027251c8)+0x9 [c:\builds\moz2_slave\rel-m-beta-w32_bld-00000000000\build\tools\profiler\platform-win32.cc @ 248]
03e0f8d0 771ea22b KERNEL32!BaseThreadInitThunk+0xe
03e0f914 771ea201 ntdll!__RtlUserThreadStart+0x20
03e0f924 00000000 ntdll!_RtlUserThreadStart+0x1b
Updated•10 years ago
|
Status: UNCONFIRMED → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Comment 2•10 years ago
|
||
Reopening per bug 1047160 comment 6.
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: DUPLICATE → ---
Reporter | ||
Comment 3•10 years ago
|
||
By the way, I still have the crash dump on my disk in case you would like me to extract something else from it.
Updated•10 years ago
|
Status: REOPENED → UNCONFIRMED
Component: Untriaged → IPC
Ever confirmed: false
Product: Firefox → Core
Reporter | ||
Comment 4•10 years ago
|
||
I've had a similar crash today; it seems to happen when I close the browser window.
As best I can tell this is actually the child process crashing.
Comment 6•10 years ago
|
||
I am getting this crash pretty much daily on an e10s build.
Comment 7•10 years ago
|
||
Updated•10 years ago
|
Assignee: nobody → mrbkap
Updated•10 years ago
|
Comment 8•10 years ago
|
||
I wonder if this could be related to bug 1035454.
Comment 9•10 years ago
|
||
(In reply to Blake Kaplan (:mrbkap) from comment #8)
> I wonder if this could be related to bug 1035454.
Hmm, I have no idea how that could be relevant here. Can you please clarify? (I'd be happy to run some diagnostics if you've got things for me to check.)
Comment 10•10 years ago
|
||
My nightly is pretty unusable with this crash, and it's getting worse every day... I'm probably going to stop dogfooding e10s until this bug is fixed.
Comment 11•10 years ago
|
||
Brad, I think we should not enable e10s on Nightly before we fix this bug. It makes the browser very crashy. I unfortunately don't have STRs but I get these crashes every hour or so these days. Youtube hits this a *lot*.
Flags: needinfo?(blassey.bugs)
Comment 12•10 years ago
|
||
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #11)
> Brad, I think we should not enable e10s on Nightly before we fix this bug.
> It makes the browser very crashy. I unfortunately don't have STRs but I get
> these crashes every hour or so these days. Youtube hits this a *lot*.
Part of the reason to enable on nightly is to get a volume of crash data that will allow us to prioritize crash bugs.
Right now I see 56 crashes over the last 30 days and they're all on Windows. Jim, have you seen this signature?
Flags: needinfo?(blassey.bugs) → needinfo?(jmathies)
Comment 13•10 years ago
|
||
Looks like there's one on mac too -
https://crash-stats.mozilla.com/query/?product=Firefox&version=ALL%3AALL&range_value=1&range_unit=weeks&date=09%2F08%2F2014+18%3A00%3A00&query_search=signature&query_type=contains&query=MessageChannel%3A%3AOnChannelErrorFromLink&reason=&release_channels=&build_id=&process_type=any&hang_type=any
I haven't seen this on Windows. Looks like these are all child aborts related to some sort of failed channel connection. To get here _pipe has to be valid but a write operation had to fail.
http://hg.mozilla.org/releases/mozilla-release/annotate/529a45c94e5a/ipc/chromium/src/chrome/common/ipc_channel_win.cc#l365
Flags: needinfo?(jmathies)
Comment 14•10 years ago
|
||
Over four weeks -
https://crash-stats.mozilla.com/query/?product=Firefox&version=ALL%3AALL&range_value=4&range_unit=weeks&date=09%2F08%2F2014+18%3A00%3A00&query_search=signature&query_type=contains&query=MessageChannel%3A%3AOnChannelErrorFromLink&reason=&release_channels=&build_id=&process_type=any&hang_type=any
I'm not seeing a reason to block on this for mac builds, but Windows might be an issue. I think this abort should show up as crashed tabs.
Comment 15•10 years ago
|
||
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #12)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #11)
> > Brad, I think we should not enable e10s on Nightly before we fix this bug.
> > It makes the browser very crashy. I unfortunately don't have STRs but I get
> > these crashes every hour or so these days. Youtube hits this a *lot*.
>
> Part of the reason to enable on nightly is to get a volume of crash data
> that will allow us to prioritize crash bugs.
According to the crash reporter, not a single one of these crashes has been submitted. They all trigger the built-in OS X crash reporter, which is why I'm trying to bring this to your attention.
Comment 16•10 years ago
|
||
testcase |
This link reproduces this crash for me very reliably: <http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/machine-translation.html>
Keywords: testcase
Comment 17•10 years ago
|
||
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #16)
> This link reproduces this crash for me very reliably:
> <http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/machine-
> translation.html>
(The reason is that the said link triggers bug 1079422 which causes an unrelated crash in the content process, and then I get a popup reporting the crash this bug is filed for.)
Comment 18•10 years ago
|
||
Just for the record, I am hitting this crash 1+ times a day, and not a single one of them has triggered breakpad so far.
Comment 19•10 years ago
|
||
I noticed this crash happens frequently on treeherder these days.
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=7235805
Updated•10 years ago
|
Comment 20•10 years ago
|
||
(In reply to Hsin-Yi Tsai [:hsinyi] from comment #19)
> I noticed this crash happens frequently on treeherder these days.
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=7235805
Hi Blake,
Are you able to help make some progress here? Thank you.
Flags: needinfo?(mrbkap)
Comment 21•10 years ago
|
||
retriaging:
<RyanVM|sheriffduty> jimm: oh god, it's *that* ?
<RyanVM|sheriffduty> lord knows we have plenty of OnChannelErrorFromLink issues
<jimm> um, actually OnChannelErrorFromLink aborts are pretty rare afaik
<jimm> we have a few meta signatures
<jimm> this isn't one of the bad ones
<RyanVM|sheriffduty> jimm: bug 1142693?
<RyanVM|sheriffduty> only our top failure on OSX by a country mile
Comment 22•10 years ago
|
||
Comment 23•10 years ago
|
||
note - bug 1152372 reproduces on Windows.
Updated•10 years ago
|
OS: Windows 8.1 → All
Comment 24•10 years ago
|
||
A portion of the mac related crashes are covered by bug 1142693.
Comment 25•10 years ago
|
||
> note - bug 1152372 reproduces on Windows.
This bug looks like a poorly filed bug that blames some ipc code for a python automation problem. I don't think this crash happens on Windows.
AFAICT this is a test only crash too -
https://crash-stats.mozilla.com/search/?product=Firefox&version=40.0a1&signature=~OnChannelErrorFromLink&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
and I think we might have a fix for it in sub bug 1142693.
Comment 26•10 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #25)
> > note - bug 1152372 reproduces on Windows.
>
> This bug looks like a poorly filed bug that blames some ipc code for a
> python automation problem. I don't think this crash happens on Windows.
>
> AFAICT this is a test only crash too
I hit this a few days ago on Windows using the latest nightly during normal browsing, though I don't remember what I was doing at the time.
bp-c49b2000-9c7c-4dc6-a79a-9ac502150406
Comment 27•10 years ago
|
||
(In reply to Trevor Rowbotham from comment #26)
> (In reply to Jim Mathies [:jimm] from comment #25)
> > > note - bug 1152372 reproduces on Windows.
> >
> > This bug looks like a poorly filed bug that blames some ipc code for a
> > python automation problem. I don't think this crash happens on Windows.
> >
> > AFAICT this is a test only crash too
>
> I hit this a few days ago on Windows using the latest nightly during normal
> browsing, though I don't remember what I was doing at the time.
>
> bp-c49b2000-9c7c-4dc6-a79a-9ac502150406
Yep, that's this crash on Windows. Still very rare in the wild, or possibly we are having problems getting crash reports submitted -
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozalloc_abort%28char+const*+const%29+|+NS_DebugBreak+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AOnChannelErrorFromLink%28%29
Updated•10 years ago
|
Comment 28•9 years ago
|
||
I found a way to reproduce it on FF40/41 with e10s enabled (Win 7).
https://crash-stats.mozilla.com/report/index/8e8c2c6d-5045-4e6b-95a6-f6da92150604
I'm bisecting right now.
Comment 29•9 years ago
|
||
I filed bug 1171307 because I'm not sure if it's the same underlying issue.
Updated•9 years ago
|
Priority: -- → P4
Updated•9 years ago
|
Updated•8 years ago
|
Crash Signature: [@ mozalloc_abort | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::OnChannelErrorFromLink ]
Comment 30•8 years ago
|
||
This signature is still happening sometimes on Nightly.
Comment 31•8 years ago
|
||
This happens when Nightly is opened on Mac OS X 10.12 Sierra. Usually 2-4 crashes are recorded as soon as one just starts Nightly with clean profile (not 100% reproducible).
Here are some crash reports:
bp-4282b3bd-682f-4a19-8f7d-6d0d82160714
bp-81c3cdb7-6c98-46df-b637-05a542160714
bp-867ca5a6-4fdf-4641-b224-6aeb12160714
bp-d674a63b-37d5-4e3e-8941-101f42160714
bp-bf1bce20-8023-475b-a780-33b682160714
bp-9b2ffe73-2674-41b2-8545-470c82160714
Updated•8 years ago
|
status-firefox50:
--- → affected
Version: 32 Branch → 50 Branch
Comment 32•8 years ago
|
||
Crash volume for signature 'mozalloc_abort | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::OnChannelErrorFromLink':
- nightly (version 50): 336 crashes from 2016-06-06.
- aurora (version 49): 742 crashes from 2016-06-07.
- beta (version 48): 5 crashes from 2016-06-06.
- release (version 47): 5 crashes from 2016-05-31.
- esr (version 45): 0 crash from 2016-04-07.
Crash volume on the last weeks:
            Week N-1  Week N-2  Week N-3  Week N-4  Week N-5  Week N-6  Week N-7
- nightly        139        36        27        19        35        37        10
- aurora         203        83        71        76       127       106        11
- beta             1         1         2         1         0         0         0
- release          0         0         0         0         3         1         0
- esr              0         0         0         0         0         0         0
Affected platform: Mac OS X
Comment 33•8 years ago
|
||
This signature started rising on Nightly using the 20160801074053 build. It is now the top crash on Nightly.
status-firefox51:
--- → affected
Comment 34•8 years ago
|
||
FYI - Seems to be related: I see a crash with Firefox48 (release) and Selenium-beta2 (Java client), platform: Windows 8.1, when invoking .quit()
1470659291121 Marionette INFO Listening on port 55104
[Child 8720] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 343
[Child 8720] WARNING: pipe error: 109: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 343
1470659292014 Marionette INFO startBrowser 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292021 Marionette INFO sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292157 Marionette INFO sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
1470659292379 Marionette INFO sendAsync 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
[Child 1340] WARNING: pipe error: 232: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 497
[Child 1340] ###!!! ABORT: Aborting on channel error.: file c:/builds/moz2_slave/m-rel-w32-00000000000000000000/build/src/ipc/glue/MessageChannel.cpp, line 2046
Exception in thread "main" org.openqa.selenium.remote.UnreachableBrowserException: Error communicating with the remote browser. It may have died.
Build info: version: 'unknown', revision: '2aa21c1', time: '2016-08-02 14:59:43 -0700'
System info: host: 'lnz-geralde3', ip: '192.168.56.1', os.name: 'Windows 8.1', os.arch: 'amd64', os.version: '6.3', java.version: '1.8.0_45'
Driver info: driver.version: RemoteWebDriver
Capabilities [{rotatable=false, raisesAccessibilityExceptions=false, marionette=true, appBuildId=20160726073904, version=, platform=XP, proxy={}, command_id=1, specificationLevel=0, acceptSslCerts=false, browserVersion=48.0, platformVersion=6.3, XULappId={ec8030f7-c20a-464f-9b0e-13a3a9e97384}, browserName=Firefox, takesScreenshot=true, takesElementScreenshot=true, platformName=Windows_NT, device=desktop}]
Session ID: 91f6547b-743d-4e5e-8b57-e1c5bb4d4cbd
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:670)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:706)
at org.openqa.selenium.remote.RemoteWebDriver.quit(RemoteWebDriver.java:531)
at FirefoxSmokeTest.main(FirefoxSmokeTest.java:20)
Caused by: java.lang.IllegalStateException: UnixUtils may not be used on Windows
at org.openqa.selenium.os.ProcessUtils.getProcessId(ProcessUtils.java:188)
at org.openqa.selenium.os.UnixProcess$SeleniumWatchDog.getPID(UnixProcess.java:222)
at org.openqa.selenium.os.UnixProcess$SeleniumWatchDog.access$300(UnixProcess.java:201)
at org.openqa.selenium.os.UnixProcess.destroy(UnixProcess.java:132)
at org.openqa.selenium.os.CommandLine.destroy(CommandLine.java:155)
at org.openqa.selenium.remote.service.DriverService.stop(DriverService.java:196)
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:94)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:649)
... 3 more
Comment 35•8 years ago
|
||
[Tracking Requested - why for this release]:
As mentioned in comment 33 this is a topcrash in Nightly now. So asking for tracking the 51 release.
tracking-firefox51:
--- → ?
Keywords: topcrash
Comment 37•8 years ago
|
||
Assuming bug 1293090 is correct that we're hitting this same crash in automation, this may not be limited to "Nightly" the trunk channel. It may be specific to nightly builds, the ones we do at 3am, as distinct from the builds we do on every push. That would be a major party foul on someone's part, because the binaries and the behavior of tests are not supposed to differ between nightlies and on-push builds.
Is it possible to tell the difference between crash reports from nightly-Nightly users and the rare default-update-channel users of on-push builds, to see whether this is happening on non-nightly builds?
Comment 38•8 years ago
|
||
[Tracking Requested - why for this release]: Nominating this for 50 tracking as it currently sits as the #2 top crash on 50, and it is being identified as a startup crash. It sits at #4 on 51.
I asked a few folks to try to get an answer to Phil's question in Comment 37, but so far I haven't found someone who has the answer.
tracking-firefox50:
--- → ?
Hello mccr8, I've seen you work on IPC related OOM crashes in e10s. Just wondering, is this something you can help investigate/fix? Please let me know.
Hello overholt, this was mentioned as a top crash in the channel meeting today and as the engineering owner for Fx50, could you please help me find an owner who can investigate this? Thanks!
Flags: needinfo?(overholt)
Flags: needinfo?(continuation)
Comment 40•8 years ago
|
||
This seems quite messy. Does bug 1152372 still reproduce this crash?
Jed, can you please take a quick look and see if anything jumps out at you?
Flags: needinfo?(overholt) → needinfo?(jld)
Comment 41•8 years ago
|
||
(In reply to Bogdan Maris, QA [:bogdan_maris][PTO 08-22 Aug] from comment #31)
> bp-4282b3bd-682f-4a19-8f7d-6d0d82160714
> bp-81c3cdb7-6c98-46df-b637-05a542160714
> bp-867ca5a6-4fdf-4641-b224-6aeb12160714
> bp-d674a63b-37d5-4e3e-8941-101f42160714
> bp-bf1bce20-8023-475b-a780-33b682160714
> bp-9b2ffe73-2674-41b2-8545-470c82160714
These all show a content process that's starting up, has sent a PCrashReporter constructor (which is sync) to the parent, and gets an IPC channel error while waiting for a reply.
These are all on OS X, which has a history of OS bugs affecting the features we use for IPC (e.g., bug 1142693), so that might be part of it. It would help to have a little more detail on the I/O error that caused the crash, but I'm not seeing anything useful in crash-stats.
Flags: needinfo?(jld)
Comment 42•8 years ago
|
||
This doesn't look OOM related as far as I can tell.
Flags: needinfo?(continuation)
Updated•8 years ago
|
Assignee: mrbkap → nobody
Severity: normal → critical
Keywords: crash
Summary: Crash in mozilla::ipc::MessageChannel::OnChannelErrorFromLink → [e10s] Crash in [@ mozilla::ipc::MessageChannel::OnChannelErrorFromLink]
Updated•8 years ago
|
Comment 43•8 years ago
|
||
This has become a high-volume crash in our Marionette tests for OS X debug builds; see all the bugs marked as blocked. It looks like Firefox crashes randomly during the test job. Is there anything we can do here soon to spare the sheriffs from having to star that many test failures? That would be great! Thanks.
Assignee | ||
Comment 44•8 years ago
|
||
I looked at some of the blocked bugs, they all crash when the ContentChild is creating the PCrashReporter actor.
gecko.log has this line:
[Child 1935] WARNING: Message needs unreceived descriptors channel:1129c3000 message-type:4849673 header()->num_fds:1 num_fds:0 fds_i:0: file /builds/slave/m-in-m64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 482
Bill, does this ring any alarm bells?
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 45•8 years ago
|
||
Indeed it looks very similar to bug 1142693.
Flags: needinfo?(wmccloskey) → needinfo?(jld)
Assignee | ||
Comment 46•8 years ago
|
||
https://mozilla-releng-blobs.s3.amazonaws.com/blobs/mozilla-inbound/sha512/4867c171ae0c3bb1face4f6b2e9025270ee8db27020d976805503f10fff636ed30aaf63144de66bd8c59a7273c0b12330c9485f8a42664b7af18f7cd3bd00b65
The gecko.log also has this line:
[Parent 1931] WARNING: FileDescriptorSet destroyed with unconsumed descriptors: file /builds/slave/m-in-m64-000000000000000000000/build/src/ipc/chromium/src/chrome/common/file_descriptor_set_posix.cc, line 22
Assignee | ||
Comment 47•8 years ago
|
||
One possibility is that we are leaking fds, or that other processes consumed too many fds, so the child process failed to create a new one. However, the fact that all the crashes are in PCrashReporterConstructor is suspicious.
Comment 48•8 years ago
|
||
Kan-Ru, out of interest, do those automation crashes map with those reported to crashstats? If not we may have another underlying issue?
Assignee | ||
Comment 49•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #48)
> Kan-Ru, out of interest, do those automation crashes map with those reported
> to crashstats? If not we may have another underlying issue?
They have similar crash stacks, so I assume they are the same crashes. That means if we fix this we are not only fixing automation but also real crashes.
Assignee | ||
Comment 50•8 years ago
|
||
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #47)
> One possibility is that we are leaking fds or other processes consumed too
> many fds so the child process failed to create a new one. However all the
> crashes are in PCrashReporterConstructor is suspicious.
The reason we crash at PCrashReporterConstructor is that it's the first sync message sent from ContentChild to the parent, which forces us to consume the incoming messages.
The msgtype 4849673 in the log is PContent::Msg_InitCompositor__ID, so it looks like something is wrong either in the new Endpoints code or in the out-of-process compositor code.
Flags: needinfo?(dvander)
bug 1293580 is similar, I think. Mac runs out of fds a lot and we don't seem to know why. We can fail on either the sending side (by failing to create a channel), or on the receiving side, when SCM_RIGHTS or something fails to find a new descriptor.
Meanwhile, the old IPDL bridging model looked like this:
1. Allocate fds for a channel.
2. On failure, return.
3. On success, send bridge messages.
Most consumers assumed that bridges never failed, because even if they did, nothing would "appear" wrong. The bridge would simply never happen on the other side, and functionality would be silently broken/missing (and in some cases would probably crash later).
Endpoints work differently. The caller must check for failure, and if you send an invalid endpoint, IPDL will crash. When we switched the compositor to use Endpoints, suddenly Mac's file descriptor problems exacerbated the fact that we're not handling errors properly.
We can and should fix our error-checking of Endpoints, which I'll do in bug 1293580. But this won't solve the fact that Mac is running out of descriptors way too often, and that will lead to broken behavior.
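The difference between the two models can be illustrated with a small sketch. Everything in it is hypothetical (Endpoint, CreateEndpoint, SendCompositorEndpoint and the simulated descriptor budget are illustration-only names, not IPDL APIs); it only shows the shape of a caller that checks for allocation failure before sending, instead of passing an invalid endpoint along.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for one side of an IPC channel.
struct Endpoint {
  int fd;
};

// Hypothetical allocator: fails once the simulated descriptor budget is spent.
std::optional<Endpoint> CreateEndpoint(int& fds_left) {
  if (fds_left <= 0) {
    return std::nullopt;
  }
  --fds_left;
  return Endpoint{100 + fds_left};  // any non-negative value will do here
}

enum class SendResult { Sent, SkippedAllocationFailed };

// Checks the allocation result before sending. An unchecked caller would
// pass an invalid endpoint on, which is the case described as crashing.
SendResult SendCompositorEndpoint(int& fds_left) {
  std::optional<Endpoint> endpoint = CreateEndpoint(fds_left);
  if (!endpoint) {
    return SendResult::SkippedAllocationFailed;
  }
  assert(endpoint->fd >= 0);
  // A real caller would hand the endpoint to the other process here.
  return SendResult::Sent;
}
```

With a budget of one descriptor, the first call reports Sent and the second reports SkippedAllocationFailed, so the caller can decide how to degrade rather than crash.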
Can we get anyone to try bug 1296756? It'd be great if we could see how many of each descriptor type is open.
Flags: needinfo?(dvander)
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 52•8 years ago
|
||
I'll take a look at bug 1296756
Comment 53•8 years ago
|
||
I don't think I have anything useful to add at this point.
Flags: needinfo?(jld)
Assignee | ||
Comment 54•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090
OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0
The total number of open fds is 11, which is not particularly high and makes sense because the process is starting up. I'm not sure why SCM_RIGHTS failed to create the fd, though.
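For reference, a total like the one above can be obtained with a simple probe. This is a minimal sketch, not the logging patch from the try push: CountOpenDescriptors and its max_fd cap of 1024 are assumptions made for illustration, and it reports only a total, not a per-type breakdown.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>

// Counts descriptors in [0, max_fd) that are currently open in this process.
// The default cap is an assumption that keeps the scan cheap.
int CountOpenDescriptors(int max_fd = 1024) {
  assert(max_fd >= 0);
  int count = 0;
  for (int fd = 0; fd < max_fd; ++fd) {
    // F_GETFD fails with EBADF for a descriptor that is not open.
    if (fcntl(fd, F_GETFD) != -1) {
      ++count;
    }
  }
  return count;
}
```

Calling it before and after pipe() should show the total rising by two.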
(In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #54)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090
>
> OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0
>
> total opened fds is 11, not particular high, which makes sense because the
> process is starting up. Not sure why SCM_RIGHTS failed to create fd though.
Where were you able to tell that SCM_RIGHTS failed?
Assignee | ||
Comment 56•8 years ago
|
||
(In reply to David Anderson [:dvander] from comment #55)
> (In reply to Kan-Ru Chen [:kanru] (UTC+8) from comment #54)
> > https://treeherder.mozilla.org/#/
> > jobs?repo=try&revision=f0cfacd0d39a&selectedJob=26416090
> >
> > OpenedFileDescriptors: 11 2 0 2 0 0 0 7 0
> >
> > total opened fds is 11, not particular high, which makes sense because the
> > process is starting up. Not sure why SCM_RIGHTS failed to create fd though.
>
> Where were you able to tell that SCM_RIGHTS failed?
I can't 100% tell that SCM_RIGHTS failed but the log is added after the "Message needs unreceived descriptors" error. It means we have completely received the message but not the fds. It doesn't look like the sendmsg failed.
Fortunately or unfortunately, this is easily reproducible on try so I can try to log more information.
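The symptom described here, where the message bytes arrive but the descriptors do not, is what a receiver sees when the sender omits the SCM_RIGHTS control message. The sketch below shows that with plain POSIX calls on a socketpair. The helper names (SendByte, ReceiveByteAndCountFds, FdsSeenByReceiver) are hypothetical and the code is unrelated to the Chromium IPC layer.

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#include <cassert>
#include <cstring>

// Sends one payload byte over `sock`, optionally with `fd_to_pass` attached
// as SCM_RIGHTS ancillary data. Returns true if the byte was written.
bool SendByte(int sock, int fd_to_pass, bool attach_fd) {
  char byte = 'x';
  iovec iov{&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  if (attach_fd) {
    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);
    cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
    assert(cmsg != nullptr);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));
  }
  return sendmsg(sock, &msg, 0) == 1;
}

// Reads one byte from `sock` and returns how many descriptors arrived with
// it, or -1 if the read failed. Received descriptors are closed right away.
int ReceiveByteAndCountFds(int sock) {
  char byte = 0;
  iovec iov{&byte, 1};
  alignas(cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  msghdr msg{};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  if (recvmsg(sock, &msg, 0) != 1) {
    return -1;
  }
  int count = 0;
  for (cmsghdr* c = CMSG_FIRSTHDR(&msg); c != nullptr;
       c = CMSG_NXTHDR(&msg, c)) {
    if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
      int fd = -1;
      std::memcpy(&fd, CMSG_DATA(c), sizeof(int));
      close(fd);
      ++count;
    }
  }
  return count;
}

// Round trip on a local socketpair; a pipe end serves as the descriptor
// being passed. Returns the receiver's descriptor count, or -1 on error.
int FdsSeenByReceiver(bool attach_fd) {
  int pair[2];
  int pipe_fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, pair) != 0) {
    return -1;
  }
  if (pipe(pipe_fds) != 0) {
    close(pair[0]);
    close(pair[1]);
    return -1;
  }
  int seen = -1;
  if (SendByte(pair[0], pipe_fds[0], attach_fd)) {
    seen = ReceiveByteAndCountFds(pair[1]);
  }
  close(pipe_fds[0]);
  close(pipe_fds[1]);
  close(pair[0]);
  close(pair[1]);
  return seen;
}
```

In both cases sendmsg succeeds and the payload byte is delivered; only the descriptor count on the receiving side differs.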
Assignee | ||
Comment 57•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb05a72031b2
Not sure what has happened, but with my logging patch I can't reproduce this on try anymore.
Comment 58•8 years ago
|
||
Currently this bug is #3 top browser crash on Nightly. I know various people have chimed in on this bug - is someone actually willing to take ownership of it?
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → kchen
Assignee | ||
Comment 59•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ee086ef5822
From the log I found that the failure steps are always like this:
1. parent: 1 file descriptor to send
2. parent: successfully writes 64 bytes
3. child: read 4096 bytes
4. parent: 1 file descriptor to send
5. parent: failed to write 64 bytes, EAGAIN
6. parent: successfully writes 64 bytes
7. child: read 128 bytes
8. child: only receives 1 file descriptor
I think after step 5. we forgot to pack the file descriptors into the message to send.
I'll have a patch to review pretty soon if this try is green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0be1080bcea
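The failure sequence above can be modeled in a few lines. This is a simplified simulation under stated assumptions, not the code in ipc_channel_posix.cc: FakeSocket, Message, Sender and FdsDeliveredAfterOneEagain are hypothetical names, and an optional byte offset stands in for partial_write_iter_. It assumes the pre-fix logic attached descriptors only when no partial write had been recorded, so a retry after an EAGAIN that wrote zero bytes went out without them.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Hypothetical non-blocking socket: tallies what successful writes carried
// and can be told to fail the next write, mimicking EAGAIN.
struct FakeSocket {
  bool fail_next_write = false;
  std::size_t bytes_sent = 0;
  std::size_t fds_sent = 0;

  // Returns the number of bytes written, or -1 to mimic EAGAIN.
  long Write(std::size_t len, std::size_t num_fds) {
    if (fail_next_write) {
      fail_next_write = false;
      return -1;
    }
    bytes_sent += len;
    fds_sent += num_fds;
    return static_cast<long>(len);
  }
};

struct Message {
  std::size_t size;
  std::size_t num_fds;
};

struct Sender {
  bool fixed;                                 // selects post-fix behavior
  std::optional<std::size_t> partial_offset;  // models partial_write_iter_

  // Attempts to send `msg`; returns true once it has been fully written.
  bool TrySend(FakeSocket& sock, const Message& msg) {
    // Pre-fix: attach fds only if no attempt was recorded yet.
    // Post-fix: attach fds whenever we are still at the start of the message.
    const bool attach_fds = fixed ? partial_offset.value_or(0) == 0
                                  : !partial_offset.has_value();
    if (!partial_offset) {
      partial_offset = 0;
    }
    assert(*partial_offset < msg.size);
    const long written =
        sock.Write(msg.size - *partial_offset, attach_fds ? msg.num_fds : 0);
    if (written < 0) {
      return false;  // EAGAIN: the caller retries later
    }
    *partial_offset += static_cast<std::size_t>(written);
    if (*partial_offset < msg.size) {
      return false;
    }
    partial_offset.reset();
    return true;
  }
};

// Sends one 64-byte message carrying one fd, with the first write failing.
std::size_t FdsDeliveredAfterOneEagain(bool fixed) {
  FakeSocket sock;
  sock.fail_next_write = true;
  Sender sender{fixed, std::nullopt};
  const Message msg{64, 1};
  while (!sender.TrySend(sock, msg)) {
  }
  return sock.fds_sent;
}
```

In this model the pre-fix sender delivers all 64 bytes but no descriptor after a single EAGAIN, while the post-fix sender delivers the descriptor with the retried first chunk.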
Comment hidden (mozreview-request) |
Assignee | ||
Updated•8 years ago
|
Blocks: 1262671
Keywords: regression
Assignee | ||
Comment 61•8 years ago
|
||
tracking-firefox49? because we think this is a regression from bug 1262671, which landed in 49
tracking-firefox49:
--- → ?
Assignee | ||
Comment 62•8 years ago
|
||
I noticed this bug was opened 2 years ago, so the patch in comment 60 can be fixing the original issue. Let's use this bug to fix the recent spike in this signature and open a new one if there are still more crashes with this signature later.
Assignee | ||
Comment 63•8 years ago
|
||
I meant the patch can't be fixing the original issue.
Updated•8 years ago
|
Comment 64•8 years ago
|
||
mozreview-review |
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.
https://reviewboard.mozilla.org/r/75320/#review73268
Thanks for tracking this down Kan-Ru!
::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc:583
(Diff revision 1)
> static const int tmp = CMSG_SPACE(sizeof(
> int[FileDescriptorSet::MAX_DESCRIPTORS_PER_MESSAGE]));
> char buf[tmp];
>
> - if (partial_write_iter_.isNothing() &&
> + if ((partial_write_iter_.isNothing() ||
> + partial_write_iter_.value().Data() == msg->Buffers().Iter().Data()) &&
Instead of Iter().Data() you can use Start().
It also might make sense to move this code:
http://searchfox.org/mozilla-central/rev/064025c802c22cd5ad122746733cbd34ea47393c/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#614-617
up above this check here. Then we can remove the isNothing() check and just check if Data() == Start().
Attachment #8786335 -
Flags: review+
Attachment #8786335 -
Flags: review?(dvander)
Assignee | ||
Comment 65•8 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #64)
> Comment on attachment 8786335 [details]
> Bug 1051567 - Make sure we resend file descriptors for the first chunk of a
> message.
>
> https://reviewboard.mozilla.org/r/75320/#review73268
>
> Thanks for tracking this down Kan-Ru!
>
> ::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc:583
> (Diff revision 1)
> > static const int tmp = CMSG_SPACE(sizeof(
> > int[FileDescriptorSet::MAX_DESCRIPTORS_PER_MESSAGE]));
> > char buf[tmp];
> >
> > - if (partial_write_iter_.isNothing() &&
> > + if ((partial_write_iter_.isNothing() ||
> > + partial_write_iter_.value().Data() == msg->Buffers().Iter().Data()) &&
>
> Instead of Iter().Data() you can use Start().
I can't use Start() because msg->Buffers() is marked as const. I added a const overload to BufferList::Start(). I assume you would rs+ this change ;)
> It also might make sense to move this code:
> http://searchfox.org/mozilla-central/rev/
> 064025c802c22cd5ad122746733cbd34ea47393c/ipc/chromium/src/chrome/common/
> ipc_channel_posix.cc#614-617
> up above this check here. Then we can remove the isNothing() check and just
> check if Data() == Start().
Sounds good.
Comment hidden (mozreview-request) |
Comment 67•8 years ago
|
||
Pushed by kchen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/acb978a84753
Make sure we resend file descriptors for the first chunk of a message. r=billm
Comment 68•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 10 years ago → 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Comment 69•8 years ago
|
||
Please nominate this for Aurora/Beta approval when you get a chance. Also, glory hallelujah at Mn-e10s now!
Flags: needinfo?(kchen)
Assignee | ||
Comment 70•8 years ago
|
||
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.
Approval Request Comment
[Feature/regressing bug #]: bug 1262671
[User impact if declined]: Users on Linux and/or Mac platforms might see an immediate content process crash after startup
[Describe test coverage new/current, TreeHerder]: Landed on m-c and fixed many Mn-e10s intermittent failures on Mac
[Risks and why]: Low. This just restores the behavior before bug 1262671.
[String/UUID change made/needed]: n/a
Flags: needinfo?(kchen)
Attachment #8786335 -
Flags: approval-mozilla-beta?
Attachment #8786335 -
Flags: approval-mozilla-aurora?
Comment on attachment 8786335 [details]
Bug 1051567 - Make sure we resend file descriptors for the first chunk of a message.
Let's take this as it reverts some of the behavior which made Mn-e10s tests fail. If we land it right away and it sticks this can make it to the beta 10 build today.
Attachment #8786335 -
Flags: approval-mozilla-beta?
Attachment #8786335 -
Flags: approval-mozilla-beta+
Attachment #8786335 -
Flags: approval-mozilla-aurora?
Attachment #8786335 -
Flags: approval-mozilla-aurora+
Comment 72•8 years ago
|
||
bugherder uplift |
Comment 73•8 years ago
|
||
bugherder uplift |