Closed Bug 1475970 Opened 6 years ago Closed 6 years ago

[FreeBSD] e10s no longer works

Categories

(Core :: IPC, defect)

Unspecified
FreeBSD
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox61 --- unaffected
firefox62 --- unaffected
firefox63 --- fixed

People

(Reporter: jbeich, Assigned: jbeich)

References

Details

(Keywords: regression)

Attachments

(1 file)

After mozilla-central changeset 05a2a487661e none of web sites load. Affects even about: pages. Can be worked around by browser.tabs.remote.autostart -> false in <profile-dir>/user.js

$ c++ -v
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
Target: x86_64-unknown-freebsd12.0
Thread model: posix
InstalledDir: /usr/bin

$ ./mach bootstrap
$ ./mach build
$ ./mach run
Flags: needinfo?(bugmail)
I reviewed some of the patches and I have a FreeBSD box that's already set up for Firefox builds, so I'll take a look.
Blocks: 1463587
No longer depends on: 1463587
Flags: needinfo?(jld)
I tried looking at this, and I don't know what's going on yet.  The content processes aren't crashing, and they're not obviously blocked on anything; they're just… not doing anything.  I ktraced and the two parts of the preferences are being mapped.
Flags: needinfo?(jld)
I tried debugging this some more.  I defined IPC_MESSAGE_DEBUG_EXTRA at the top of ipc_channel_posix.cc and set MOZ_LOG=chromium:5 in the environment; I could see the parent process sending a lot of messages, but the child got only the PContent::RefreshScreens.

I attached gdb to the parent process; it seemed to be working as expected, but I noticed that, with the SetXPCOMProcessAttributes message, each sendmsg call was transferring only 8KiB (the total message size was somewhere around 64KiB).

And then I debugged the child process, to see what it was receiving.  This looks bad:

(gdb) p/x message_length
$15 = 0x72637c81

(gdb) p/x ((IPC::Message::Header*)((IPC::Message*)(incoming_message_.mStorage))->header_)
$18 = 0x816243000
(gdb) x/s $18
0x816243000:    "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"...

The commit linked in comment #0 changed the SetXPCOMProcessAttributes message to include a file descriptor, where previously I think it didn't.  We have other messages that pass descriptors, of course, but I don't know if we have any larger than 8k.


So my best guess right now is that there's a bug somewhere in the stack, relating to short writes and attached SCM_RIGHTS, such that we lose the data from that chunk and treat some arbitrary part of the message body as the header.


Other things I should follow up on: (1) if there's a sockopt we should be setting to transfer more than 8k at a time, because that seems kind of inefficient, (2) if there's a sockopt we could set on Linux on debug builds to do the opposite and exercise the short write handling, if that's the problem here or maybe even if it isn't, and (3) given that IPC has a maximum message size and this is well above it, if we can assert that the length field is in range at this level of the stack.
Fwiw on OpenBSD while testing diff from bug #1400051 i've been able to start nightly with a new profile, my build had the changes from bug #1463587 (ie it was https://hg.mozilla.org/integration/mozilla-inbound/rev/f2bef68d99ba and https://reviewboard-hg.mozilla.org/gecko/rev/2bf5a083d1324361534974b880b360f3415001f3 on top), and it seems e10s worked as 3 content process were spawned and i could browse a bit... 

I can of course provide more info if needed, ie set some env var for live debugging, only gdb/stepping in the process will be complicated.
Jed, are you going to assign this to yourself for follow up?
Flags: needinfo?(jld)
(In reply to David Durst [:ddurst] (REO for 63) from comment #5)
> Jed, are you going to assign this to yourself for follow up?

Note FreeBSD is a tier 3 platform, so this doesn't block release.
Here's the start of what the content process receives, the first time it recvmsg()s after the MOZ_DEBUG_CHILD_PROCESS delay:

(gdb) x/80xw msg.msg_iov[0].iov_base
0x8162280c8:    0x00000004      0x80000000      0x0000ffff      0x00000001
0x8162280d8:    0x00000000      0xffffffff      0xffffffff      0x00000000
0x8162280e8:    0x0000fb46      0x00000080      0x7fffffff      0x002e005b
0x8162280f8:    0x00000001      0x00000000      0xffffffff      0xffffffff
0x816228108:    0x00000000      0x00000001      0x00000000      0x00000000
0x816228118:    0x00000f00      0x00000870      0x1a2732de      0x00000000
0x816228128:    0x00000000      0x00000f00      0x00000870      0x4563e731
0x816228138:    0x00000000      0x00000000      0x00000f00      0x00000870
0x816228148:    0x0bbc4812      0x00000000      0x00000000      0x00000f00
0x816228158:    0x00000870      0x4d197c3d      0x00000018      0xf1bc2fc5
0x816228168:    0x00000018      0xf4d42a3a      0x3f800000      0x4d9b43aa
0x816228178:    0x3f800000      0x970abf2a      0x431d3414      0x5c9cbc7e
0x816228188:    0x8a7074e4      0x72637c61      0x73637c68      0x65657c62
0x816228198:    0x7461667c      0x6c69667c      0x6273687c      0x7c74687c
0x8162281a8:    0x6a7c7a68      0x616b7c76      0x6a6b7c62      0x7c726b7c
0x8162281b8:    0x692d756b      0x756b7c71      0x7c72742d      0x7c6d776b
0x8162281c8:    0x6c7c676c      0x6e6d7c69      0x7c6e6d2d      0x6e7c736d
0x8162281d8:    0x676e7c61      0x7c766e7c      0x7c61746f      0x2d706170
0x8162281e8:    0x707c6e61      0x612d7061      0x75717c77      0x7a75717c
0x8162281f8:    0x7c6e727c      0x737c7772      0x64737c63      0x7c67737c

The header is 8 words, starting with the payload length (not including the header).  The first one is the hello message; the payload is the pid.  The second one has type 0x002e005b (PContent::RefreshScreens) and length 0x80, so its payload should start at 0x81622810c and end at 0x81622818c, but that looks like text:

(gdb) x/s 0x81622818c
0x81622818c:    "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"...

We've seen this text before.

Now, what does the parent process think it sent?  The variables I want don't have debug info at the breakpoint I set, but they're still around:

(gdb) p/x $r13
$1 = 0x3fec
(gdb) p/x $r14
$2 = 0x195c0

It thinks it sent 0x3fec octets of a possible 0x195c0.  Where does this mystery string start?

(gdb) p strstr(msgh.msg_iov[3].iov_base, "a|crh|csb|ee")
$11 = 0x82268d77c "a|crh|csb|ee|fat|fil|hsb|ht|hz|jv|kab|kj|kr|ku-iq|ku-tr|kwm|lg|li|mn-mn|ms|na|ng|nv|ota|pap-an|pap-aw|qu|quz|rn|rw|sc|sd|sg|sn|su|ty|za:fontversion=155320:capability=otlayout\\:DFLT otlayout\\:arab otla"...
(gdb) p/x 0x82268d77c - (size_t)(msgh.msg_iov[3].iov_base)
$14 = 0x77c
(gdb) p/x msgh.msg_iov[0].iov_len + msgh.msg_iov[1].iov_len + msgh.msg_iov[2].iov_len + 0x77c
$17 = 0x27bc

There are three other occurrences, but they're all outside what sendmsg reported writing and they're all odd (unaligned) offsets, so I suspect it's the first one.  I have no idea what's going with those offsets.

Also, the child process never received the SCM_RIGHTS that the parent sent with the sendmsg() call where the data was mangled.


Conclusion: This looks like an OS bug.

I'd suggest writing a small C program to try to reproduce this case β€” socketpair with some data buffered, a message with SCM_RIGHTS (and split into several iovecs, maybe) that can't be completely written in the remaining space.
Flags: needinfo?(jld)
Summary: e10s no longer works → [FreeBSD] e10s no longer works
Also, before I went in with gdb, I tried a simple experiment on Linux and set SO_SNDBUF to small values (8k and 4k) on all IPC channels; I can see the short writes with strace(1), but it doesn't cause this bug.
Apparently, the underlying FreeBSD bug has been known for years. After applying the proposed fix e10s works fine.
Comment on attachment 8993511 [details]
Bug 1475970 - Disable e10s on FreeBSD by default due to kernel bug.

https://reviewboard.mozilla.org/r/258212/#review265260

Just...wow.
Attachment #8993511 - Flags: review?(nfroyd) → review+
Pushed by nfroyd@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/197fcba26a38
Disable e10s on FreeBSD by default due to kernel bug. r=froydnj
Assignee: nobody → jbeich
Status: NEW → ASSIGNED
We can probably work around this easily enough by just moving the shared map file descriptor from SetXPCOMProcessAttributes to an UpdateSharedData call right afterwards. Looking at it again, I notice that I forgot to update SetXPCOMProcessAttributes to send BlobImpls, anyway, so it would probably be nice to remove the duplication, anyway.

Also... it may be worth considering using shared memory rather than arrays to send some of that font data. That's kind of a lot of data to pass to every process by message, and it doesn't change very often.
Depends on: 1477129
https://hg.mozilla.org/mozilla-central/rev/197fcba26a38
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
The referenced kernel bug https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=181741 was fixed by https://reviews.freebsd.org/rS337328

I'm currently running a build with e10s enabled, seems to work fine. Actually, I never ever experienced these crashes…

Can we at least check for older __FreeBSD_version (whatever 12 is before r337328 / 11 before r337329) instead of just __FreeBSD__?
(In reply to greg from comment #15)
This bug was backed out via mozilla-central changeset 2eaa478b8979. However, the underlying issue is also responsible for bug 1476130 which can be worked around via "sysctl net.local.stream.recvspace=16384".
(In reply to Jan Beich from comment #16)
Oops. That's my git mistake. I though I checked out the latest revision, but that did not in fact happen and what I was looking at was a couple months old :D
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: