Closed
Bug 1235633
Opened 9 years ago
Closed 9 years ago
[e10s] crash in OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink
Categories
(Core :: IPC, defect)
Tracking
()
RESOLVED
FIXED
mozilla48
People
(Reporter: tracy, Assigned: billm)
References
Details
(Keywords: crash, topcrash-win, Whiteboard: [MemShrink:P1])
Crash Data
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
jld
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-00b84fbe-0ca0-4be2-b160-ccca52151219.
=============================================================
This crash is #10, with 132 crashes, on the 44beta a/b experiment.
Frame Module Signature Source
0 xul.dll NS_ABORT_OOM(unsigned int) xpcom/base/nsDebugImpl.cpp
1 xul.dll mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) ipc/glue/MessageChannel.cpp
2 kernelbase.dll ReadFile
3 kernel32.dll ReadFileImplementation
4 kernel32.dll ReadFileImplementation
5 nss3.dll _MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c
6 xul.dll mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) ipc/glue/MessageLink.cpp
7 xul.dll IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext*, unsigned long) ipc/chromium/src/chrome/common/ipc_channel_win.cc
8 xul.dll IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext*, unsigned long, unsigned long) ipc/chromium/src/chrome/common/ipc_channel_win.cc
Updated•9 years ago
|
Comment 1•9 years ago
|
||
This is happening on the line:
mPending.push_back(aMsg);
IIRC, this is the point where a message is copied out of the raw buffer. I think this is just a fundamental problem where things that send huge messages are going to hit heap fragmentation issues.
These crash reports are also missing all of the nice information about system memory usage that you usually see (Total Virtual Memory, Available Virtual Memory, System Memory Use Percentage). Maybe that doesn't work in a content process? I'll file a bug about that.
This particular allocation size should also get that size of the allocation reported. I don't know if that's related to the previous issue or not.
Component: XPCOM → IPC
Comment 2•9 years ago
|
||
Bug 1156484 is in the same part of the code, except in that crash report the memory information (and allocation size) is showing up properly.
Updated•9 years ago
|
Blocks: e10s-crashes
Comment 3•9 years ago
|
||
Also, why isn't the allocation size annotated for this case? (that's what the "unknown" in the signature means)
Comment 4•9 years ago
|
||
Ah, that seems to be bug 1236108, we really need to fix that to even be able to analyze if e10s is nearing ship-readiness.
Comment 5•9 years ago
|
||
renom'ing. Top crash in beta experiment. Poiru, can you indicate where this is in the top crash list?
Flags: needinfo?(birunthan)
Comment 6•9 years ago
|
||
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5)
> renom'ing. Top crash in beta experiment. Poiru, can you indicate where this
> is in the top crash list?
This is #4 (2.08%) for content processes. See bug 1249209 comment 2 for full list.
Flags: needinfo?(birunthan)
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → wmccloskey
Comment 7•9 years ago
|
||
Message's copy constructor goes to Pickle's copy constructor, which calls realloc() to have enough memory to store payload. If realloc() fails, NS_ABORT_OOM() is called and that is where this crash happens.
realloc() winds up calling imalloc(). Assume the size is a large number, and huge_malloc() is used. If it is chunk_alloc() to return null, it has to fail both chunk_recycle() and chunk_alloc_mmap(). Note chunk_alloc_mmap() goes to VirtualAlloc() on Windows. I wonder how did it fail.
Comment 8•9 years ago
|
||
The most common cause of OOM with large allocations is a lack of contiguous address space. Because of bug 1236108, these crash reports do not contain information about contiguous address space or memory usage, so it is hard to be sure if that is happening here.
Comment 9•9 years ago
|
||
I wonder how hard it would be to get data on IPC protocols using large messages, broken out by IPDL actor class and message (or the low-level message ID, although the mapping from that back to something meaningful varies with IPDL changes).
Assignee | ||
Comment 10•9 years ago
|
||
I think we can avoid the copy here. I'm working on a patch.
Comment 11•9 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #10)
> I think we can avoid the copy here. I'm working on a patch.
That'd be awesome!
Assignee | ||
Comment 12•9 years ago
|
||
How does this look, Jed? This patch changes the code so that the Message passed to OnMessageReceivedFromLink owns its data. Therefore we can just move it to the queue rather than copying it. The goal of the patch is to ensure that we only have one copy of the message data around at a time, at least for large messages.
I also added some code to properly size the buffer for the message once we have received enough data to know how big it will be. Right now we rely on std::string to do exponential growth, which is good, but not as good as allocating with the right size from the beginning.
If this looks okay, then I'll fix up the Windows code in the same way. I made the new Buffer class as much like std::string as possible so that it fits into both the POSIX and Windows implementation with few changes. Perhaps in the future we could do a better job sharing the code in these two classes, but I'm not sure it's worth it.
Attachment #8726014 -
Flags: feedback?(jld)
Updated•9 years ago
|
Whiteboard: [MemShrink]
Updated•9 years ago
|
Whiteboard: [MemShrink] → [MemShrink:P1]
Comment 13•9 years ago
|
||
Comment on attachment 8726014 [details] [diff] [review]
patch
Review of attachment 8726014 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry for the delay. I haven't looked at this quite as much as for r?, but it seems reasonable and a definite improvement for the case where the first message is large and the rest of the buffer is small.
::: ipc/chromium/src/base/buffer.cc
@@ +92,5 @@
> +Buffer::trade_bytes(size_t count)
> +{
> + char* result = mBuffer;
> + mSize = mReserved = mSize - count;
> + mBuffer = (char*)malloc(mReserved);
What happens if mSize == count?
@@ +95,5 @@
> + mSize = mReserved = mSize - count;
> + mBuffer = (char*)malloc(mReserved);
> + MOZ_RELEASE_ASSERT(mBuffer);
> + memcpy(mBuffer, result + count, mSize);
> + return result;
Could the buffer be realloc'ed down to the requested size here? (Specifically: does jemalloc do something reasonable with that in cases where the part to be dropped is small?) I'm concerned about the case where the remaining part of the buffer is also large, and possibility of making things worse in some other case while trying to improve this one.
::: ipc/glue/MessageChannel.cpp
@@ +638,5 @@
> }
> };
>
> void
> +MessageChannel::OnMessageReceivedFromLink(Message& aMsg)
It would be nice if this could be an rvalue reference to avoid the Move(), or at least move the Move() farther up the stack to where it's more obvious it the Message isn't used afterwards.
This would apply to the other const removals too, I think.
Attachment #8726014 -
Flags: feedback?(jld) → feedback+
Updated•9 years ago
|
Assignee: wmccloskey → cyu
Comment 14•9 years ago
|
||
[@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ] looks like the same issue: during the call to push_back in MessageChannel::OnMessageReceivedFromLink(), the copy constructor for Pickle() attempts to allocate a new block of memory and fails. The allocation sizes I see in the handful of reports on Nightly so far range from about 262KB to 1.3MB.
Crash Signature: [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] → [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] [@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ]
Updated•9 years ago
|
Assignee | ||
Updated•9 years ago
|
Assignee: cyu → wmccloskey
Comment 15•9 years ago
|
||
On further examination, most (but not all) of the crashes in the Pickle ctor are under PPluginScriptableObjectParent::CallInvoke(), in the parent process, so they aren't really e10s-specific. I'm not sure if I should split that into a separate bug or what.
Assignee | ||
Comment 16•9 years ago
|
||
I still don't have a green try run on Windows, but I think I'm close. This should be ready to review.
Attachment #8726014 -
Attachment is obsolete: true
Attachment #8736506 -
Flags: review?(jld)
Updated•9 years ago
|
Assignee | ||
Comment 17•9 years ago
|
||
Actually, the try push for this patch came back and it's green on Windows now.
Comment 18•9 years ago
|
||
Comment on attachment 8736506 [details] [diff] [review]
patch
Review of attachment 8736506 [details] [diff] [review]:
-----------------------------------------------------------------
Some minor things, one not-so-minor thing, but otherwise looks good.
::: ipc/chromium/src/base/buffer.cc
@@ +76,5 @@
> +void
> +Buffer::assign(const char* bytes, size_t length)
> +{
> + if (bytes >= mBuffer && bytes < mBuffer + mReserved) {
> + MOZ_RELEASE_ASSERT(length <= mSize);
Nit: this assertion could be stronger — bytes + length <= mBuffer + mSize should hold, I think?
@@ +91,5 @@
> +void
> +Buffer::erase(size_t start, size_t count)
> +{
> + mSize -= count;
> + memmove(mBuffer + start, mBuffer + start + count, mSize);
The third argument should be `mSize - start`. (This won't have shown up in testing because the only call site passes 0 for start.)
@@ +110,5 @@
> + char* result = mBuffer;
> + mSize = mReserved = mSize - count;
> + mBuffer = mReserved ? (char*)malloc(mReserved) : nullptr;
> + MOZ_RELEASE_ASSERT(!mReserved || mBuffer);
> + memcpy(mBuffer, result + count, mSize);
Nit: You might want to skip the memcpy if mBuffer is nullptr and mSize is 0. This is one of those cases that's technically undefined behavior even though there's not much reason for it to do anything weird, but this part might be cleaner anyway with a single `if (mReserved)`.
::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc
@@ +439,5 @@
> + // overflow buffer.
> + MOZ_RELEASE_ASSERT(p == overflowp);
> + buf = input_overflow_buf_.trade_bytes(len);
> +
> + // At this point the remaining data is at the from of
Typo nit: “at the front of”?
(Also in ipc_channel_win.)
Attachment #8736506 -
Flags: review?(jld) → review+
Comment 20•9 years ago
|
||
Comment 21•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox48:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment 22•9 years ago
|
||
Bill, should we uplift this fix to Aurora 47 in preparation for our Beta 47 test? This was the top e10s crash from our Beta 46 test.
status-firefox47:
--- → ?
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 23•9 years ago
|
||
Let's wait a few days to backport. This needs to stabilize. We have until 4/25, right?
Flags: needinfo?(wmccloskey)
Assignee | ||
Comment 24•9 years ago
|
||
Comment on attachment 8736506 [details] [diff] [review]
patch
I'd like to get this into Aurora so that we get data for the next e10s experiment. It has been on Nightly for a week and I think we've shaken out all the bugs. It's not particularly configuration-dependent, so I wouldn't expect new problems to appear in other channels.
Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: more OOMs with e10s
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: has been on Nightly for a while. risk is moderate and mostly restricted to e10s.
[String/UUID change made/needed]: none
Attachment #8736506 -
Flags: approval-mozilla-aurora?
Updated•9 years ago
|
status-firefox46:
--- → wontfix
Comment on attachment 8736506 [details] [diff] [review]
patch
Fixes some e10s OOMs, Aurora47+
Attachment #8736506 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Assignee | ||
Comment 26•9 years ago
|
||
Ryan, please let me land this. Setting needinfo so you see this :-).
Flags: needinfo?(wmccloskey)
Flags: needinfo?(ryanvm)
Updated•9 years ago
|
Flags: needinfo?(ryanvm) → needinfo?(cbook)
Comment 28•9 years ago
|
||
Bill, this has conflicts when uplifting, can you take a look, thanks!
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Comment 29•9 years ago
|
||
I believe that Bill's intent was for him to take care of it himself ;)
Comment 30•9 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #29)
> I believe that Bill's intent was for him to take care of it himself ;)
oh sorry misread this that as we should check this in :)
Assignee | ||
Comment 31•9 years ago
|
||
http//hg.mozilla.org/releases/mozilla-aurora/rev/5649dee23169
Flags: needinfo?(wmccloskey)
Comment 32•9 years ago
|
||
Marking status-firefox47=fixed as per Aurora uplift in comment 31:
https://hg.mozilla.org/releases/mozilla-aurora/rev/5649dee23169
Bill, just curious here: we are seeing around 300 crashes a week with this signature on 46.0.1 release. I wondered if we have a bunch of people running 46.0.1 with e10s enabled. I don't think so, so in that case, we may still be seeing this crash signature in some other context. Do you think it's worth filing a followup bug? Or might this be related to whatever is going on in bug 1273258?
Flags: needinfo?(wmccloskey)
Comment 34•9 years ago
|
||
Liz, that has nothing to do with e10s (although I'm seeing only about 40 crashes per week): https://crash-stats.mozilla.com/search/?signature=%3DOOM+%7C+unknown+%7C+NS_ABORT_OOM+%7C+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AOnMessageReceivedFromLink&signature=%3DOOM+%7C+large+%7C+NS_ABORT_OOM+%7C+Pickle%3A%3APickle&version=46.0.1&_facets=process_type&_facets=plugin_name&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
These are crashes in the plugin subsystem where somebody is passing a very large string to a scriptable plugin via JS. see e.g. https://crash-stats.mozilla.com/report/index/d5d5b4ea-5cbf-415d-82bc-d609e2160511 which is a representative example. I don't know if we can avoid crashing in this case and just reject the call back through MessageChannel::Call.
Flags: needinfo?(wmccloskey)
You need to log in
before you can comment on or make changes to this bug.
Description
•