Closed
Bug 1475970
Opened 6 years ago
Closed 6 years ago
[FreeBSD] e10s no longer works
Categories
(Core :: IPC, defect)
Tracking
()
RESOLVED
FIXED
mozilla63
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | unaffected |
firefox-esr60 | --- | unaffected |
firefox61 | --- | unaffected |
firefox62 | --- | unaffected |
firefox63 | --- | fixed |
People
(Reporter: jbeich, Assigned: jbeich)
References
Details
(Keywords: regression)
Attachments
(1 file)
After mozilla-central changeset 05a2a487661e none of web sites load. Affects even about: pages. Can be worked around by browser.tabs.remote.autostart -> false in <profile-dir>/user.js $ c++ -v FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1) Target: x86_64-unknown-freebsd12.0 Thread model: posix InstalledDir: /usr/bin $ ./mach bootstrap $ ./mach build $ ./mach run
status-firefox61:
--- → unaffected
status-firefox62:
--- → unaffected
status-firefox-esr52:
--- → unaffected
status-firefox-esr60:
--- → unaffected
Flags: needinfo?(bugmail)
Comment 1•6 years ago
|
||
I reviewed some of the patches and I have a FreeBSD box that's already set up for Firefox builds, so I'll take a look.
Comment 2•6 years ago
|
||
I tried looking at this, and I don't know what's going on yet. The content processes aren't crashing, and they're not obviously blocked on anything; they're just⦠not doing anything. I ktraced and the two parts of the preferences are being mapped.
Flags: needinfo?(jld)
Comment 3•6 years ago
|
||
I tried debugging this some more. I defined IPC_MESSAGE_DEBUG_EXTRA at the top of ipc_channel_posix.cc and set MOZ_LOG=chromium:5 in the environment; I could see the parent process sending a lot of messages, but the child got only the PContent::RefreshScreens. I attached gdb to the parent process; it seemed to be working as expected, but I noticed that, with the SetXPCOMProcessAttributes message, each sendmsg call was transferring only 8KiB (the total message size was somewhere around 64KiB). And then I debugged the child process, to see what it was receiving. This looks bad: (gdb) p/x message_length $15 = 0x72637c81 (gdb) p/x ((IPC::Message::Header*)((IPC::Message*)(incoming_message_.mStorage))->header_) $18 = 0x816243000 (gdb) x/s $18 0x816243000: "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"... The commit linked in comment #0 changed the SetXPCOMProcessAttributes message to include a file descriptor, where previously I think it didn't. We have other messages that pass descriptors, of course, but I don't know if we have any larger than 8k. So my best guess right now is that there's a bug somewhere in the stack, relating to short writes and attached SCM_RIGHTS, such that we lose the data from that chunk and treat some arbitrary part of the message body as the header. Other things I should follow up on: (1) if there's a sockopt we should be setting to transfer more than 8k at a time, because that seems kind of inefficient, (2) if there's a sockopt we could set on Linux on debug builds to do the opposite and exercise the short write handling, if that's the problem here or maybe even if it isn't, and (3) given that IPC has a maximum message size and this is well above it, if we can assert that the length field is in range at this level of the stack.
Comment 4•6 years ago
|
||
Fwiw on OpenBSD while testing diff from bug #1400051 i've been able to start nightly with a new profile, my build had the changes from bug #1463587 (ie it was https://hg.mozilla.org/integration/mozilla-inbound/rev/f2bef68d99ba and https://reviewboard-hg.mozilla.org/gecko/rev/2bf5a083d1324361534974b880b360f3415001f3 on top), and it seems e10s worked as 3 content process were spawned and i could browse a bit... I can of course provide more info if needed, ie set some env var for live debugging, only gdb/stepping in the process will be complicated.
Comment 5•6 years ago
|
||
Jed, are you going to assign this to yourself for follow up?
Flags: needinfo?(jld)
Comment 6•6 years ago
|
||
(In reply to David Durst [:ddurst] (REO for 63) from comment #5) > Jed, are you going to assign this to yourself for follow up? Note FreeBSD is a tier 3 platform, so this doesn't block release.
Comment 7•6 years ago
|
||
Here's the start of what the content process receives, the first time it recvmsg()s after the MOZ_DEBUG_CHILD_PROCESS delay: (gdb) x/80xw msg.msg_iov[0].iov_base 0x8162280c8: 0x00000004 0x80000000 0x0000ffff 0x00000001 0x8162280d8: 0x00000000 0xffffffff 0xffffffff 0x00000000 0x8162280e8: 0x0000fb46 0x00000080 0x7fffffff 0x002e005b 0x8162280f8: 0x00000001 0x00000000 0xffffffff 0xffffffff 0x816228108: 0x00000000 0x00000001 0x00000000 0x00000000 0x816228118: 0x00000f00 0x00000870 0x1a2732de 0x00000000 0x816228128: 0x00000000 0x00000f00 0x00000870 0x4563e731 0x816228138: 0x00000000 0x00000000 0x00000f00 0x00000870 0x816228148: 0x0bbc4812 0x00000000 0x00000000 0x00000f00 0x816228158: 0x00000870 0x4d197c3d 0x00000018 0xf1bc2fc5 0x816228168: 0x00000018 0xf4d42a3a 0x3f800000 0x4d9b43aa 0x816228178: 0x3f800000 0x970abf2a 0x431d3414 0x5c9cbc7e 0x816228188: 0x8a7074e4 0x72637c61 0x73637c68 0x65657c62 0x816228198: 0x7461667c 0x6c69667c 0x6273687c 0x7c74687c 0x8162281a8: 0x6a7c7a68 0x616b7c76 0x6a6b7c62 0x7c726b7c 0x8162281b8: 0x692d756b 0x756b7c71 0x7c72742d 0x7c6d776b 0x8162281c8: 0x6c7c676c 0x6e6d7c69 0x7c6e6d2d 0x6e7c736d 0x8162281d8: 0x676e7c61 0x7c766e7c 0x7c61746f 0x2d706170 0x8162281e8: 0x707c6e61 0x612d7061 0x75717c77 0x7a75717c 0x8162281f8: 0x7c6e727c 0x737c7772 0x64737c63 0x7c67737c The header is 8 words, starting with the payload length (not including the header). The first one is the hello message; the payload is the pid. The second one has type 0x002e005b (PContent::RefreshScreens) and length 0x80, so its payload should start at 0x81622810c and end at 0x81622818c, but that looks like text: (gdb) x/s 0x81622818c 0x81622818c: "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"... We've seen this text before. Now, what does the parent process think it sent? The variables I want don't have debug info at the breakpoint I set, but they're still around: (gdb) p/x $r13 $1 = 0x3fec (gdb) p/x $r14 $2 = 0x195c0 It thinks it sent 0x3fec octets of a possible 0x195c0. Where does this mystery string start? (gdb) p strstr(msgh.msg_iov[3].iov_base, "a|crh|csb|ee") $11 = 0x82268d77c "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"... (gdb) p/x 0x82268d77c - (size_t)(msgh.msg_iov[3].iov_base) $14 = 0x77c (gdb) p/x msgh.msg_iov[0].iov_len + msgh.msg_iov[1].iov_len + msgh.msg_iov[2].iov_len + 0x77c $17 = 0x27bc There are three other occurrences, but they're all outside what sendmsg reported writing and they're all odd (unaligned) offsets, so I suspect it's the first one. I have no idea what's going with those offsets. Also, the child process never received the SCM_RIGHTS that the parent sent with the sendmsg() call where the data was mangled. Conclusion: This looks like an OS bug. I'd suggest writing a small C program to try to reproduce this case β socketpair with some data buffered, a message with SCM_RIGHTS (and split into several iovecs, maybe) that can't be completely written in the remaining space.
Flags: needinfo?(jld)
Summary: e10s no longer works → [FreeBSD] e10s no longer works
Comment 8•6 years ago
|
||
Also, before I went in with gdb, I tried a simple experiment on Linux and set SO_SNDBUF to small values (8k and 4k) on all IPC channels; I can see the short writes with strace(1), but it doesn't cause this bug.
Apparently, the underlying FreeBSD bug has been known for years. After applying the proposed fix e10s works fine.
Comment hidden (mozreview-request) |
Comment 11•6 years ago
|
||
mozreview-review |
Comment on attachment 8993511 [details] Bug 1475970 - Disable e10s on FreeBSD by default due to kernel bug. https://reviewboard.mozilla.org/r/258212/#review265260 Just...wow.
Attachment #8993511 -
Flags: review?(nfroyd) → review+
Comment 12•6 years ago
|
||
Pushed by nfroyd@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/197fcba26a38 Disable e10s on FreeBSD by default due to kernel bug. r=froydnj
Updated•6 years ago
|
Assignee: nobody → jbeich
Status: NEW → ASSIGNED
Comment 13•6 years ago
|
||
We can probably work around this easily enough by just moving the shared map file descriptor from SetXPCOMProcessAttributes to an UpdateSharedData call right afterwards. Looking at it again, I notice that I forgot to update SetXPCOMProcessAttributes to send BlobImpls, anyway, so it would probably be nice to remove the duplication, anyway. Also... it may be worth considering using shared memory rather than arrays to send some of that font data. That's kind of a lot of data to pass to every process by message, and it doesn't change very often.
Comment 14•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/197fcba26a38
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Comment 15•6 years ago
|
||
The referenced kernel bug https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=181741 was fixed by https://reviews.freebsd.org/rS337328 I'm currently running a build with e10s enabled, seems to work fine. Actually, I never ever experienced these crashes⦠Can we at least check for older __FreeBSD_version (whatever 12 is before r337328 / 11 before r337329) instead of just __FreeBSD__?
Assignee | ||
Comment 16•6 years ago
|
||
(In reply to greg from comment #15) This bug was backed out via mozilla-central changeset 2eaa478b8979. However, the underlying issue is also responsible for bug 1476130 which can be worked around via "sysctl net.local.stream.recvspace=16384".
Comment 17•6 years ago
|
||
(In reply to Jan Beich from comment #16) Oops. That's my git mistake. I though I checked out the latest revision, but that did not in fact happen and what I was looking at was a couple months old :D
You need to log in
before you can comment on or make changes to this bug.
Description
•