Closed Bug 1235633 Opened 9 years ago Closed 9 years ago

[e10s] crash in OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink

Categories

(Core :: IPC, defect)

Product:

Component:

Version:

44 Branch

Platform:

x86

Windows NT

Type:

defect

Priority:

Not set

Severity:

critical

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla48

Tracking Flags:

Tracking

Status

e10s

m9+

---

firefox46

---

wontfix

firefox47

---

fixed

firefox48

---

fixed

People

(Reporter: tracy, Assigned: billm)

References

Details

(Keywords: crash, topcrash-win, Whiteboard: [MemShrink:P1])

Crash Data

Attachments

(1 file, 1 obsolete file)

patch 9 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) (deleted), patch	jld : feedback+	Details \| Diff \| Splinter Review
patch 9 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) (deleted), patch	jld : review+ ritu : approval-mozilla-aurora+	Details \| Diff \| Splinter Review

Tracy Walker [:tracy]

Reporter

Description

•

9 years ago

This bug was filed from the Socorro interface and is report bp-00b84fbe-0ca0-4be2-b160-ccca52151219. ============================================================= This crash is #10, with 132 crashes, on the 44beta a/b experiment. Frame Module Signature Source 0 xul.dll NS_ABORT_OOM(unsigned int) xpcom/base/nsDebugImpl.cpp 1 xul.dll mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message const&) ipc/glue/MessageChannel.cpp 2 kernelbase.dll ReadFile 3 kernel32.dll ReadFileImplementation 4 kernel32.dll ReadFileImplementation 5 nss3.dll _MD_CURRENT_THREAD nsprpub/pr/src/md/windows/w95thred.c 6 xul.dll mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message const&) ipc/glue/MessageLink.cpp 7 xul.dll IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext*, unsigned long) ipc/chromium/src/chrome/common/ipc_channel_win.cc 8 xul.dll IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext*, unsigned long, unsigned long) ipc/chromium/src/chrome/common/ipc_channel_win.cc

Tracy Walker [:tracy]

Reporter

Updated

•

9 years ago

No longer blocks: 1234647

Tracy Walker [:tracy]

Reporter

Updated

•

9 years ago

Blocks: 1234647

Brad Lassey [:blassey] (use needinfo?)

Updated

•

9 years ago

tracking-e10s: ? → +

Andrew McCreight [:mccr8]

Comment 1

•

9 years ago

This is happening on the line: mPending.push_back(aMsg); IIRC, this is the point where a message is copied out of the raw buffer. I think this is just a fundamental problem where things that send huge messages are going to hit heap fragmentation issues. These crash reports are also missing all of the nice information about system memory usage that you usually see (Total Virtual Memory, Available Virtual Memory, System Memory Use Percentage). Maybe that doesn't work in a content process? I'll file a bug about that. This particular allocation size should also get that size of the allocation reported. I don't know if that's related to the previous issue or not.

Component: XPCOM → IPC

Andrew McCreight [:mccr8]

Comment 2

•

9 years ago

Bug 1156484 is in the same part of the code, except in that crash report the memory information (and allocation size) is showing up properly.

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1236108

Biru Mohanathas [:poiru]

Updated

•

9 years ago

Blocks: 1246180

Biru Mohanathas [:poiru]

Updated

•

9 years ago

Blocks: e10s-crashes

Comment 3

•

9 years ago

Also, why isn't the allocation size annotated for this case? (that's what the "unknown" in the signature means)

Comment 4

•

9 years ago

Ah, that seems to be bug 1236108, we really need to fix that to even be able to analyze if e10s is nearing ship-readiness.

Brad Lassey [:blassey] (use needinfo?)

Comment 5

•

9 years ago

renom'ing. Top crash in beta experiment. Poiru, can you indicate where this is in the top crash list?

tracking-e10s: + → ?

Flags: needinfo?(birunthan)

Biru Mohanathas [:poiru]

Comment 6

•

9 years ago

(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5) > renom'ing. Top crash in beta experiment. Poiru, can you indicate where this > is in the top crash list? This is #4 (2.08%) for content processes. See bug 1249209 comment 2 for full list.

Flags: needinfo?(birunthan)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

9 years ago

Assignee: nobody → wmccloskey

tracking-e10s: ? → m9+

Chris Peterson [:cpeterson]

Updated

•

9 years ago

Blocks: 1251376

Ting-Yu Chou [:ting] (away)

Comment 7

•

9 years ago

Message's copy constructor goes to Pickle's copy constructor, which calls realloc() to have enough memory to store payload. If realloc() fails, NS_ABORT_OOM() is called and that is where this crash happens. realloc() winds up calling imalloc(). Assume the size is a large number, and huge_malloc() is used. If it is chunk_alloc() to return null, it has to fail both chunk_recycle() and chunk_alloc_mmap(). Note chunk_alloc_mmap() goes to VirtualAlloc() on Windows. I wonder how did it fail.

Andrew McCreight [:mccr8]

Comment 8

•

9 years ago

The most common cause of OOM with large allocations is a lack of contiguous address space. Because of bug 1236108, these crash reports do not contain information about contiguous address space or memory usage, so it is hard to be sure if that is happening here.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 9

•

9 years ago

I wonder how hard it would be to get data on IPC protocols using large messages, broken out by IPDL actor class and message (or the low-level message ID, although the mapping from that back to something meaningful varies with IPDL changes).

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 10

•

9 years ago

I think we can avoid the copy here. I'm working on a patch.

Ting-Yu Chou [:ting] (away)

Comment 11

•

9 years ago

(In reply to Bill McCloskey (:billm) from comment #10) > I think we can avoid the copy here. I'm working on a patch. That'd be awesome!

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 12

•

9 years ago

Attached patch patch (obsolete) (deleted) — Details — Splinter Review

How does this look, Jed? This patch changes the code so that the Message passed to OnMessageReceivedFromLink owns its data. Therefore we can just move it to the queue rather than copying it. The goal of the patch is to ensure that we only have one copy of the message data around at a time, at least for large messages. I also added some code to properly size the buffer for the message once we have received enough data to know how big it will be. Right now we rely on std::string to do exponential growth, which is good, but not as good as allocating with the right size from the beginning. If this looks okay, then I'll fix up the Windows code in the same way. I made the new Buffer class as much like std::string as possible so that it fits into both the POSIX and Windows implementation with few changes. Perhaps in the future we could do a better job sharing the code in these two classes, but I'm not sure it's worth it.

Attachment #8726014 - Flags: feedback?(jld)

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Whiteboard: [MemShrink]

Nicholas Nethercote [inactive]

Updated

•

9 years ago

Whiteboard: [MemShrink] → [MemShrink:P1]

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 13

•

9 years ago

Comment on attachment 8726014 [details] [diff] [review] patch Review of attachment 8726014 [details] [diff] [review]: ----------------------------------------------------------------- Sorry for the delay. I haven't looked at this quite as much as for r?, but it seems reasonable and a definite improvement for the case where the first message is large and the rest of the buffer is small. ::: ipc/chromium/src/base/buffer.cc @@ +92,5 @@ > +Buffer::trade_bytes(size_t count) > +{ > + char* result = mBuffer; > + mSize = mReserved = mSize - count; > + mBuffer = (char*)malloc(mReserved); What happens if mSize == count? @@ +95,5 @@ > + mSize = mReserved = mSize - count; > + mBuffer = (char*)malloc(mReserved); > + MOZ_RELEASE_ASSERT(mBuffer); > + memcpy(mBuffer, result + count, mSize); > + return result; Could the buffer be realloc'ed down to the requested size here? (Specifically: does jemalloc do something reasonable with that in cases where the part to be dropped is small?) I'm concerned about the case where the remaining part of the buffer is also large, and possibility of making things worse in some other case while trying to improve this one. ::: ipc/glue/MessageChannel.cpp @@ +638,5 @@ > } > }; > > void > +MessageChannel::OnMessageReceivedFromLink(Message& aMsg) It would be nice if this could be an rvalue reference to avoid the Move(), or at least move the Move() farther up the stack to where it's more obvious it the Message isn't used afterwards. This would apply to the other const removals too, I think.

Attachment #8726014 - Flags: feedback?(jld) → feedback+

Jonathan Howard

Updated

•

9 years ago

Depends on: 1257486

Jim Mathies [:jimm]

Updated

•

9 years ago

Assignee: wmccloskey → cyu

Chris Peterson [:cpeterson]

Updated

•

9 years ago

Blocks: e10s-rc

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1256541

Andrew McCreight [:mccr8]

Comment 14

•

9 years ago

[@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ] looks like the same issue: during the call to push_back in MessageChannel::OnMessageReceivedFromLink(), the copy constructor for Pickle() attempts to allocate a new block of memory and fails. The allocation sizes I see in the handful of reports on Nightly so far range from about 262KB to 1.3MB.

Crash Signature: [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] → [@ OOM | unknown | NS_ABORT_OOM | mozilla::ipc::MessageChannel::OnMessageReceivedFromLink] [@ OOM | large | NS_ABORT_OOM | Pickle::Pickle ]

Jim Mathies [:jimm]

Updated

•

9 years ago

Blocks: e10s-oom
No longer blocks: e10s-crashes

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

9 years ago

Assignee: cyu → wmccloskey

Andrew McCreight [:mccr8]

Comment 15

•

9 years ago

On further examination, most (but not all) of the crashes in the Pickle ctor are under PPluginScriptableObjectParent::CallInvoke(), in the parent process, so they aren't really e10s-specific. I'm not sure if I should split that into a separate bug or what.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 16

•

9 years ago

Attached patch patch (deleted) — Details — Splinter Review

I still don't have a green try run on Windows, but I think I'm close. This should be ready to review.

Attachment #8726014 - Attachment is obsolete: true

Attachment #8736506 - Flags: review?(jld)

Andrew McCreight [:mccr8]

Updated

•

9 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1261094

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 17

•

9 years ago

Actually, the try push for this patch came back and it's green on Windows now.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 18

•

9 years ago

Comment on attachment 8736506 [details] [diff] [review] patch Review of attachment 8736506 [details] [diff] [review]: ----------------------------------------------------------------- Some minor things, one not-so-minor thing, but otherwise looks good. ::: ipc/chromium/src/base/buffer.cc @@ +76,5 @@ > +void > +Buffer::assign(const char* bytes, size_t length) > +{ > + if (bytes >= mBuffer && bytes < mBuffer + mReserved) { > + MOZ_RELEASE_ASSERT(length <= mSize); Nit: this assertion could be stronger — bytes + length <= mBuffer + mSize should hold, I think? @@ +91,5 @@ > +void > +Buffer::erase(size_t start, size_t count) > +{ > + mSize -= count; > + memmove(mBuffer + start, mBuffer + start + count, mSize); The third argument should be `mSize - start`. (This won't have shown up in testing because the only call site passes 0 for start.) @@ +110,5 @@ > + char* result = mBuffer; > + mSize = mReserved = mSize - count; > + mBuffer = mReserved ? (char*)malloc(mReserved) : nullptr; > + MOZ_RELEASE_ASSERT(!mReserved || mBuffer); > + memcpy(mBuffer, result + count, mSize); Nit: You might want to skip the memcpy if mBuffer is nullptr and mSize is 0. This is one of those cases that's technically undefined behavior even though there's not much reason for it to do anything weird, but this part might be cleaner anyway with a single `if (mReserved)`. ::: ipc/chromium/src/chrome/common/ipc_channel_posix.cc @@ +439,5 @@ > + // overflow buffer. > + MOZ_RELEASE_ASSERT(p == overflowp); > + buf = input_overflow_buf_.trade_bytes(len); > + > + // At this point the remaining data is at the from of Typo nit: “at the front of”? (Also in ipc_channel_win.)

Attachment #8736506 - Flags: review?(jld) → review+

Comment 20

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/dd3e03fcb06b

Carsten Book [:Tomcat]

Comment 21

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/dd3e03fcb06b

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox48: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla48

Chris Peterson [:cpeterson]

Comment 22

•

9 years ago

Bill, should we uplift this fix to Aurora 47 in preparation for our Beta 47 test? This was the top e10s crash from our Beta 46 test.

status-firefox47: --- → ?

Flags: needinfo?(wmccloskey)

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Depends on: 1263292

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Depends on: 1263281

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 23

•

9 years ago

Let's wait a few days to backport. This needs to stabilize. We have until 4/25, right?

Flags: needinfo?(wmccloskey)

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Depends on: 1263457

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1263763

Alice0775 White

Updated

•

9 years ago

Depends on: 1264398

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1265036

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 24

•

9 years ago

Comment on attachment 8736506 [details] [diff] [review] patch I'd like to get this into Aurora so that we get data for the next e10s experiment. It has been on Nightly for a week and I think we've shaken out all the bugs. It's not particularly configuration-dependent, so I wouldn't expect new problems to appear in other channels. Approval Request Comment [Feature/regressing bug #]: e10s [User impact if declined]: more OOMs with e10s [Describe test coverage new/current, TreeHerder]: [Risks and why]: has been on Nightly for a while. risk is moderate and mostly restricted to e10s. [String/UUID change made/needed]: none

Attachment #8736506 - Flags: approval-mozilla-aurora?

Andrew McCreight [:mccr8]

Updated

•

9 years ago

status-firefox46: --- → wontfix

status-firefox47: ? → affected

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 25

•

9 years ago

Comment on attachment 8736506 [details] [diff] [review] patch Fixes some e10s OOMs, Aurora47+

Attachment #8736506 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 26

•

9 years ago

Ryan, please let me land this. Setting needinfo so you see this :-).

Flags: needinfo?(wmccloskey)

Flags: needinfo?(ryanvm)

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Flags: needinfo?(ryanvm) → needinfo?(cbook)

Ryan VanderMeulen [:RyanVM]

Comment 27

•

9 years ago

(I rarely do uplifts anymore, FWIW)

Flags: needinfo?(wkocher)

Carsten Book [:Tomcat]

Comment 28

•

9 years ago

Bill, this has conflicts when uplifting, can you take a look, thanks!

Flags: needinfo?(wkocher)

Flags: needinfo?(cbook)

Ryan VanderMeulen [:RyanVM]

Comment 29

•

9 years ago

I believe that Bill's intent was for him to take care of it himself ;)

Carsten Book [:Tomcat]

Comment 30

•

9 years ago

(In reply to Ryan VanderMeulen [:RyanVM] from comment #29) > I believe that Bill's intent was for him to take care of it himself ;) oh sorry misread this that as we should check this in :)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 31

•

9 years ago

http//hg.mozilla.org/releases/mozilla-aurora/rev/5649dee23169

Flags: needinfo?(wmccloskey)

Chris Peterson [:cpeterson]

Comment 32

•

9 years ago

Marking status-firefox47=fixed as per Aurora uplift in comment 31: https://hg.mozilla.org/releases/mozilla-aurora/rev/5649dee23169

status-firefox47: affected → fixed

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1266578

Ryan VanderMeulen [:RyanVM]

Updated

•

9 years ago

Depends on: 1267106

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Depends on: 1268616

Updated

•

9 years ago

Depends on: 1273258

Liz Henry (:lizzard Please n-i to RyanVM, jcristau, or pascal)

Comment 33

•

9 years ago

Bill, just curious here: we are seeing around 300 crashes a week with this signature on 46.0.1 release. I wondered if we have a bunch of people running 46.0.1 with e10s enabled. I don't think so, so in that case, we may still be seeing this crash signature in some other context. Do you think it's worth filing a followup bug? Or might this be related to whatever is going on in bug 1273258?

Flags: needinfo?(wmccloskey)

Benjamin Smedberg

Comment 34

•

9 years ago

Liz, that has nothing to do with e10s (although I'm seeing only about 40 crashes per week): https://crash-stats.mozilla.com/search/?signature=%3DOOM+%7C+unknown+%7C+NS_ABORT_OOM+%7C+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AOnMessageReceivedFromLink&signature=%3DOOM+%7C+large+%7C+NS_ABORT_OOM+%7C+Pickle%3A%3APickle&version=46.0.1&_facets=process_type&_facets=plugin_name&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports These are crashes in the plugin subsystem where somebody is passing a very large string to a scriptable plugin via JS. see e.g. https://crash-stats.mozilla.com/report/index/d5d5b4ea-5cbf-415d-82bc-d609e2160511 which is a representative example. I don't know if we can avoid crashing in this case and just reject the call back through MessageChannel::Call.

Flags: needinfo?(wmccloskey)

You need to log in before you can comment on or make changes to this bug.