Closed Bug 1369041 Opened 7 years ago Closed 7 years ago

Crash in mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms

Categories

(Toolkit :: Telemetry, defect, P1)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: calixte, Assigned: chutten)

References

Details

(Keywords: crash, Whiteboard: [measurement:client])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-5a2cf234-be9a-48e6-b2b1-8ab420170531.
=============================================================
I got this crash while trying to load https://people-mozilla.org/~cdenizet/callgraph.json. The backtrace from gdb is the following:

Thread 1 "Web Content" received signal SIGSEGV, Segmentation fault.
mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
149         MOZ_CRASH("IPC message size is too large");
(gdb) bt
#0  mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
#1  0x00007fc0723fa5d9 in mozilla::ipc::MessageChannel::Send (this=0x7fc07f268120, aMsg=aMsg@entry=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageChannel.cpp:936
#2  0x00007fc072698dfa in mozilla::dom::PContentChild::SendAccumulateChildHistograms (this=0x7fc07f268020, accumulations=...) at /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/ipdl/PContentChild.cpp:4823
#3  0x00007fc074280040 in SendAccumulatedData<mozilla::dom::ContentChild> (ipcActor=0x7fc07f268020) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:264
#4  mozilla::TelemetryIPCAccumulator::IPCTimerFired (aTimer=aTimer@entry=0x0, aClosure=aClosure@entry=0x0) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:297
#5  0x00007fc074280261 in (anonymous namespace)::<lambda()>::operator() (__closure=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:122
#6  mozilla::detail::RunnableFunction<(anonymous namespace)::DispatchIPCTimerFired()::<lambda()> >::Run(void) (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.h:460
#7  0x00007fc07207710e in mozilla::SchedulerGroup::Runnable::Run (this=0x7fbdb4ee3b50) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/SchedulerGroup.cpp:359
#8  0x00007fc072084531 in nsThread::ProcessNextEvent (this=0x7fc07f2d8ba0, aMayWait=<optimized out>, aResult=0x7ffc4a79882f) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThread.cpp:1322
#9  0x00007fc072082835 in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0x7fc07f2d8ba0, aMayWait=aMayWait@entry=false) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.cpp:472
#10 0x00007fc0723f2e06 in mozilla::ipc::MessagePump::Run (this=0x7fc07f2cb920, aDelegate=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessagePump.cpp:96
#11 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#12 MessageLoop::Run (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#13 0x00007fc07361d233 in nsBaseAppShell::Run (this=0x7fc06b9f3f40) at /home/calixte/dev/mozilla/mozilla-central.hg/widget/nsBaseAppShell.cpp:156
#14 0x00007fc0742cdfcb in XRE_RunAppShell () at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:893
#15 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#16 MessageLoop::Run (this=this@entry=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#17 0x00007fc0742ce39f in XRE_InitChildProcess (aArgc=13, aArgv=0x7ffc4a798d68, aChildData=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:709
#18 0x00005590ad2aefd9 in content_process_main (bootstrap=0x7fc07f2ba0a0, argc=16, argv=0x7ffc4a798d68) at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/../../ipc/contentproc/plugin-container.cpp:64
#19 0x00005590ad2aeb1f in main (argc=16, argv=0x7ffc4a798d68, envp=0x7ffc4a798df0) at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/nsBrowserApp.cpp:285
(gdb) p msg->header_->payload_size
$2 = 1075819928
Crash Signature: [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms] → [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms][@ mozilla::ipc::ProcessLink::SendMessageW | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms ]
Hi Georg, it seems the crash results from a telemetry IPC message whose size is too large. Is there anything we can do in the telemetry IPC code?
Flags: needinfo?(gfritzsche)
Chris, can you take a look?
Flags: needinfo?(gfritzsche) → needinfo?(chutten)
What is "too large"? The telemetry IPC code has some watermarks[1][2] to try and avoid egregiously-sized IPC messages that _ought_ to prevent such things (unless they're the wrong size, or otherwise ineffective... or if the main thread is too wigged out to serve the request in a timely fashion or the parent process can't receive it for whatever reason) [1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#46-47,50 [2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#141,156,172,214
Flags: needinfo?(chutten) → needinfo?(htsai)
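For readers unfamiliar with that code, here is a minimal sketch of the high-water-mark pattern those links point at. The constants, struct layout, and function names below are placeholders, not the actual TelemetryIPCAccumulator code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants and names only -- the real ones live in
// toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp.
const size_t kAccumulationsArrayHighWaterMark = 5 * 1024;

struct Accumulation {
  uint32_t mId;      // histogram ID
  uint32_t mSample;  // accumulated sample value
};

std::vector<Accumulation> gAccumulations;

void ArmIPCTimer() { /* schedule the regular batched send (stub) */ }
void DispatchImmediateSend() { /* ask the main thread to flush now (stub) */ }

void AccumulateChild(uint32_t aId, uint32_t aSample) {
  gAccumulations.push_back({aId, aSample});
  if (gAccumulations.size() > kAccumulationsArrayHighWaterMark) {
    // Over the watermark: flush as soon as possible instead of waiting for
    // the periodic timer, so the eventual IPC message stays small.
    DispatchImmediateSend();
  } else {
    ArmIPCTimer();
  }
}
```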
(In reply to Chris H-C :chutten from comment #4)
> What is "too large"? The telemetry IPC code has some watermarks[1][2] to try
> and avoid egregiously-sized IPC messages that _ought_ to prevent such things
> (unless they're the wrong size, or otherwise ineffective... or if the main
> thread is too wigged out to serve the request in a timely fashion or the
> parent process can't receive it for whatever reason)
>
> [1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#46-47,50
> [2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#141,156,172,214

Hi Chris, I was guessing at the "too large" problem because I saw |MOZ_CRASH("IPC message size is too large");| in comment 0. If it isn't a size issue, do you have ideas about what is going on with this crash signature, or could you point me to the right person? Thank you.
Flags: needinfo?(htsai)
Flags: needinfo?(chutten)
For information, I added the payload size at the end of the backtrace:

(gdb) p msg->header_->payload_size
$2 = 1075819928
A gig? Oh wow, that's impressive. How could that have happened...

I find it unlikely that it is full of legitimately-accumulated telemetry data. Each accumulation is two uint32_t... Oh, but a KeyedAccumulation has an nsCString key. That could be a problem, since I don't see any code limiting those keys' lengths. That could also explain some of the OOMs we occasionally see in Telemetry IPC code due to outrageously-large allocations...

Georg, is there supposed to be a length limit for keyed{Histograms,Scalars,etc}' keys?

Irrespective of this, a sensible fix might be to rewrite the watermarks in terms of the size of the structs we're sending.
Flags: needinfo?(chutten) → needinfo?(gfritzsche)
This would be bug 1275035. Possibly the right fix here is to land that with a conservative value (+ some asserts and error telemetry?).
Flags: needinfo?(gfritzsche)
(In reply to Calixte Denizet (:calixte) from comment #0)
> #2 0x00007fc072698dfa in mozilla::dom::PContentChild::SendAccumulateChildHistograms (this=0x7fc07f268020, accumulations=...)
>    at /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/ipdl/PContentChild.cpp:4823

But note that this is not in SendAccumulateChildKeyedHistograms.
Then I am baffled. struct Accumulation is exactly two uint32_t. payload_size is 1075819928. Assuming payload_size is in bytes, that's over 134 Million accumulations. That's over 67k per millisecond (2s timer _ought_ to preclude these accumulations from being over 2000ms old). That seems rather impossible.
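A quick worked check of that arithmetic, assuming payload_size is in bytes and each Accumulation is two uint32_t (8 bytes):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t payloadBytes = 1075819928;  // payload_size from the gdb session above
  const uint64_t bytesPerAccumulation = 2 * sizeof(uint32_t);  // struct Accumulation: two uint32_t
  const uint64_t accumulations = payloadBytes / bytesPerAccumulation;  // ~134 million
  const uint64_t perMs = accumulations / 2000;  // the batch timer fires roughly every 2 seconds
  std::printf("%llu accumulations, ~%llu per millisecond over 2000 ms\n",
              static_cast<unsigned long long>(accumulations),
              static_cast<unsigned long long>(perMs));
  return 0;
}
```

This prints about 134 million accumulations and ~67k per millisecond, matching the figures above.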
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Component: DOM → Telemetry
Product: Core → Toolkit
Whiteboard: [measurement:client]
Priority: -- → P1
Per talking through it this week, running up accumulation arrays of that size is actually possible (recording into Telemetry is faster than assumed). The problem is presumably that some scenarios accumulate Telemetry too fast while we wait for the main thread to clear out and send the accumulations.

The next steps here are:
- truncating the accumulations when going over a limit (sketched below)
- adding metrics to track that this happens
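A rough sketch of the truncation idea; the discard factor, counter name, and threshold here are illustrative, while the actual change is in the attached patch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants/names; the real values come from the patch under review.
const size_t kAccumulationsArrayHighWaterMark = 5 * 1024;
// Multiplier over the watermark beyond which we stop buffering and drop data.
const size_t kWaterMarkDiscardFactor = 100;

struct Accumulation {
  uint32_t mId;
  uint32_t mSample;
};

std::vector<Accumulation> gAccumulations;
uint32_t gDiscardedAccumulations = 0;

void AccumulateChild(uint32_t aId, uint32_t aSample) {
  if (gAccumulations.size() >=
      kWaterMarkDiscardFactor * kAccumulationsArrayHighWaterMark) {
    // The main thread hasn't flushed for a long time: drop the sample and
    // count the loss so it can be reported once a flush finally happens.
    ++gDiscardedAccumulations;
    return;
  }
  gAccumulations.push_back({aId, aSample});
}
```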
Attachment #8876267 - Flags: feedback?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

This looks like the right approach, but I'm not sure about:
- the way this sends the accumulations
- the chosen factor, and whether it is sufficient to protect us from unwanted OOMs

::: toolkit/components/telemetry/Scalars.yaml:478
(Diff revision 1)
>     - telemetry-client-dev@mozilla.com
>   release_channel_collection: opt-out
>   record_in_processes:
>     - 'main'
>
> +telemetry.discarded:

OK, I kind of wish we had bug 1343855 for scalars already; that would allow us to specify all those properties once.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:56
(Diff revision 1)
> // With the current limits, events cost us about 1100 bytes each.
> // This limits memory use to about 10MB.
> const size_t kEventsArrayHighWaterMark = 10000;
> +// If we are starved we can overshoot the watermark.
> +// This is the multiplier over which we will discard data.
> +const size_t kWaterMarkDiscardFactor = 100;

Is this low enough? What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:59
(Diff revision 1)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;

Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:299
(Diff revision 1)
> +  scalarsToSend.AppendElement(ScalarAction{
> +    ScalarID::TELEMETRY_DISCARDED_ACCUMULATIONS, ScalarActionType::eAdd,
> +    Some(AsVariant(gDiscardedAccumulations)) });

This seems fragile and duplicates the accumulation details. Can we add an IPC message for this and record the scalars in the parent?
Attachment #8876267 - Flags: review?(gfritzsche)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152486

data-r=me - please do a pi-request if this is something we should add to mission control monitoring (sounds like it should be!)
Attachment #8876267 - Flags: review+
Attachment #8876267 - Flags: feedback?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

> Is this low enough?
> What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

Prior art for this was here: https://bugzilla.mozilla.org/show_bug.cgi?id=1338555#c8

Current watermarks have us at:
40k for accumulations
280k for keyed accumulations
400k for scalars
880k for keyed scalars
???? for events
----
1.6M + ???

There's a 256M limit on IPC messages (also according to bug 1338555).

With a factor of 100, we're looking at 160M + 100 * (size of events watermark). If the events watermark is 1M, we might still hit the limit if every buffer is full at once.

> Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

{Keyed}Accumulations is the name of the type, so it seemed the most sensible to use.

> This seems fragile and duplicates the accumulation details.
> Can we add an IPC message for this and record the scalars in the parent?

Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but everything is strongly typed. Using the scalar mechanism itself seems an excellent way to ensure we have the flexibility to change our data reporting requirements in the future, if necessary.

I suppose we could add custom IPC for this, but that seems fragile as well, given the number of processes we need to support already (and into the future). I'd much rather report in-band than out-of-band.
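For reference, a back-of-the-envelope bound from those numbers (the per-buffer sizes are the estimates quoted above, events excluded):

```cpp
#include <cstdio>

int main() {
  // Approximate per-buffer sizes at the current watermarks, as estimated above (KB).
  const double accumulationsKB      = 40;
  const double keyedAccumulationsKB = 280;
  const double scalarsKB            = 400;
  const double keyedScalarsKB       = 880;
  const double totalKB = accumulationsKB + keyedAccumulationsKB +
                         scalarsKB + keyedScalarsKB;  // ~1.6 MB, events not included
  const double discardFactor = 100;
  const double ipcLimitMB = 256;  // IPC message size limit cited above
  std::printf("worst case ~%.0f MB buffered vs. %.0f MB IPC limit (events excluded)\n",
              totalKB * discardFactor / 1024, ipcLimitMB);
  return 0;
}
```

This works out to roughly 156 MB, in line with the ~160M figure above.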
(In reply to Benjamin Smedberg [:bsmedberg] from comment #14)
> Comment on attachment 8876267 [details]
> bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg
>
> https://reviewboard.mozilla.org/r/147708/#review152486
>
> data-r=me - please do a pi-request if this is something we should add to
> mission control monitoring (sounds like it should be!)

I thought Mission Control was more interested in user-impacting criteria, not Telemetry Client Health?
Flags: needinfo?(benjamin)
Mission control is intended to be a system for monitoring all incoming telemetry data in a permanent/reliable way.
Flags: needinfo?(benjamin)
Then I may just happen to have a list of Telemetry Health criteria that I may wish to add:
- all of these new telemetry.discarded scalars
- TELEMETRY_FAILURE_TYPE
- ...actually, probably just about all of the TELEMETRY_* histograms that aren't TELEMETRY_TEST_*
- ping latency
- ping duplicates
- missing subsessions (discontinuous info.profileSubsessionCounter)

Is there a process for having these added? MC seems pretty early-days as of yet, but presumably we could make a table of "measurement, location (hgram, scalar, existing table, etc.), type (count, threshold, etc.), alerting threshold, alert emails" rows.
NI for comment 18.
Flags: needinfo?(benjamin)
(In reply to Chris H-C :chutten from comment #15)
> There's a 256M limit on IPC messages (also according to bug 1338555)
>
> With a factor of 100, we're looking at 160M + 100 * (size of events
> watermark). If the events watermark is 1M, we might still hit the limit if
> every buffer's full at once.

The rough estimate here puts us at a ~10M upper bound:
https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#48

This seems problematic. Do we have a reason to use a higher factor than, say, 2x or 5x?

> > Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?
>
> {Keyed}Accumulations is the name of the type, so it seemed the most sensible
> to use.

I see, we could rename those too. This could be a good mentored follow-up bug?

> > This seems fragile and duplicates the accumulation details.
> > Can we add an IPC message for this and record the scalars in the parent?
>
> Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but
> everything is strongly typed. Using the scalar mechanism itself seems an
> excellent way to ensure we have the flexibility to change our data reporting
> requirements in the future, if necessary.
>
> I suppose we could add custom IPC for this, but that seems fragile as well,
> given the number of processes we need to support already (and into the
> future).

This leaks out / duplicates the scalar IPC serialization; I don't think we should do that. One specific concern is that we change semantics on this without changing the types, then overlook changing this (as it's hard to discover).

We could:
1) properly share that serialization code, or
2) send up this information in a separate message, or
3) add test coverage that assures this doesn't break

2) doesn't seem too bad either; it would involve:
- invoking the IPC message for this on whatever IPC process actor we have (similar to [1])
- routing this through TelemetryIPC.h/.cpp

The first part should ensure we don't forget to add this for the different supported processes (it produces a compile error).

1: https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#231
gfritzsche, please email pi-request@mozilla.com and we'll add it to the mission control backlog (expect that to take a couple of months).
Flags: needinfo?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review155148

Thanks.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:22
(Diff revision 2)
> +using mozilla::AsVariant;
> +using mozilla::Some;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:34
(Diff revision 2)
> using mozilla::Telemetry::Accumulation;
> +using mozilla::Telemetry::DiscardedData;
> using mozilla::Telemetry::KeyedAccumulation;
> using mozilla::Telemetry::ScalarActionType;
> using mozilla::Telemetry::ScalarAction;
> +using mozilla::Telemetry::ScalarID;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:60
(Diff revision 2)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;
> +uint32_t gDiscardedScalarActions = 0;
> +uint32_t gDiscardedKeyedScalarActions = 0;
> +uint32_t gDiscardedChildEvents = 0;

Can we track this in a `DiscardedData`? Then zero-initialization & reset are trivial (C++ value-initialization semantics).

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:303
(Diff revision 2)
> +  discardedData = {
> +    gDiscardedAccumulations,
> +    gDiscardedKeyedAccumulations,
> +    gDiscardedScalarActions,
> +    gDiscardedKeyedScalarActions,
> +    gDiscardedChildEvents};

Nit: The closing bracket should be on a new line.
Attachment #8876267 - Flags: review?(gfritzsche) → review+
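To illustrate the value-initialization point from the review, a minimal sketch with assumed field names (the real DiscardedData is an IPC-serializable struct in the Telemetry code, not this exact shape):

```cpp
#include <cstdint>

// Hypothetical field names for illustration only.
struct DiscardedData {
  uint32_t mDiscardedHistogramAccumulations;
  uint32_t mDiscardedKeyedHistogramAccumulations;
  uint32_t mDiscardedScalarActions;
  uint32_t mDiscardedKeyedScalarActions;
  uint32_t mDiscardedChildEvents;
};

// Value-initialization ("{}") zeroes every member of the aggregate, so there
// is no need for five separately initialized globals.
DiscardedData gDiscardedData{};

// Resetting after the counts have been sent is a single assignment.
void ResetDiscardedData() {
  gDiscardedData = DiscardedData{};
}
```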
Blocks: 1375043
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fde21799bb80
Allow child processes to discard data when overwhelmed r=bsmedberg,gfritzsche f?bsmedberg
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Does this need to be considered for uplift or can it ride the 56 train?
Flags: needinfo?(chutten)
As mentioned over email, this is a very unlikely crash in a situation we already only tolerably support (megs-and-megs of JSON) so there's no great rush.
Flags: needinfo?(chutten)