Closed Bug 1369041 Opened 7 years ago Closed 7 years ago

Crash in mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms

Categories

(Toolkit :: Telemetry, defect, P1)

Unspecified
Linux
defect

Tracking


RESOLVED FIXED
mozilla56
Tracking Status
firefox-esr52 --- unaffected
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- fixed

People

(Reporter: calixte, Assigned: chutten)

References

Details

(Keywords: crash, Whiteboard: [measurement:client])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is report bp-5a2cf234-be9a-48e6-b2b1-8ab420170531.
=============================================================
I got this crash while trying to load https://people-mozilla.org/~cdenizet/callgraph.json. The backtrace from gdb is the following:

Thread 1 "Web Content" received signal SIGSEGV, Segmentation fault.
mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
149         MOZ_CRASH("IPC message size is too large");
(gdb) bt
#0  mozilla::ipc::ProcessLink::SendMessage (this=<optimized out>, msg=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageLink.cpp:149
#1  0x00007fc0723fa5d9 in mozilla::ipc::MessageChannel::Send (this=0x7fc07f268120, aMsg=aMsg@entry=0x7fbd9d831c80) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessageChannel.cpp:936
#2  0x00007fc072698dfa in mozilla::dom::PContentChild::SendAccumulateChildHistograms (this=0x7fc07f268020, accumulations=...) at /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/ipdl/PContentChild.cpp:4823
#3  0x00007fc074280040 in SendAccumulatedData<mozilla::dom::ContentChild> (ipcActor=0x7fc07f268020) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:264
#4  mozilla::TelemetryIPCAccumulator::IPCTimerFired (aTimer=aTimer@entry=0x0, aClosure=aClosure@entry=0x0) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:297
#5  0x00007fc074280261 in (anonymous namespace)::<lambda()>::operator() (__closure=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:122
#6  mozilla::detail::RunnableFunction<(anonymous namespace)::DispatchIPCTimerFired()::<lambda()> >::Run(void) (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.h:460
#7  0x00007fc07207710e in mozilla::SchedulerGroup::Runnable::Run (this=0x7fbdb4ee3b50) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/SchedulerGroup.cpp:359
#8  0x00007fc072084531 in nsThread::ProcessNextEvent (this=0x7fc07f2d8ba0, aMayWait=<optimized out>, aResult=0x7ffc4a79882f) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThread.cpp:1322
#9  0x00007fc072082835 in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0x7fc07f2d8ba0, aMayWait=aMayWait@entry=false) at /home/calixte/dev/mozilla/mozilla-central.hg/xpcom/threads/nsThreadUtils.cpp:472
#10 0x00007fc0723f2e06 in mozilla::ipc::MessagePump::Run (this=0x7fc07f2cb920, aDelegate=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/glue/MessagePump.cpp:96
#11 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#12 MessageLoop::Run (this=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#13 0x00007fc07361d233 in nsBaseAppShell::Run (this=0x7fc06b9f3f40) at /home/calixte/dev/mozilla/mozilla-central.hg/widget/nsBaseAppShell.cpp:156
#14 0x00007fc0742cdfcb in XRE_RunAppShell () at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:893
#15 0x00007fc0723b66e9 in MessageLoop::RunHandler (this=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:231
#16 MessageLoop::Run (this=this@entry=0x7ffc4a798a30) at /home/calixte/dev/mozilla/mozilla-central.hg/ipc/chromium/src/base/message_loop.cc:211
#17 0x00007fc0742ce39f in XRE_InitChildProcess (aArgc=13, aArgv=0x7ffc4a798d68, aChildData=<optimized out>) at /home/calixte/dev/mozilla/mozilla-central.hg/toolkit/xre/nsEmbedFunctions.cpp:709
#18 0x00005590ad2aefd9 in content_process_main (bootstrap=0x7fc07f2ba0a0, argc=16, argv=0x7ffc4a798d68) at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/../../ipc/contentproc/plugin-container.cpp:64
#19 0x00005590ad2aeb1f in main (argc=16, argv=0x7ffc4a798d68, envp=0x7ffc4a798df0) at /home/calixte/dev/mozilla/mozilla-central.hg/browser/app/nsBrowserApp.cpp:285
(gdb) p msg->header_->payload_size
$2 = 1075819928
Crash Signature: [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms] → [@ mozilla::ipc::ProcessLink::SendMessage | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms][@ mozilla::ipc::ProcessLink::SendMessageW | IPC_Message_Name=PContent::Msg_AccumulateChildHistograms ]
Hi Georg, it seems the crash results from a telemetry IPC message whose size is too large. Is there anything we can do in the telemetry IPC code?
Flags: needinfo?(gfritzsche)
Chris, can you take a look?
Flags: needinfo?(gfritzsche) → needinfo?(chutten)
What is "too large"? The telemetry IPC code has some watermarks[1][2] to try and avoid egregiously-sized IPC messages that _ought_ to prevent such things (unless they're the wrong size, or otherwise ineffective... or if the main thread is too wigged out to serve the request in a timely fashion or the parent process can't receive it for whatever reason) [1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#46-47,50 [2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#141,156,172,214
Flags: needinfo?(chutten) → needinfo?(htsai)
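For readers unfamiliar with that code, here is a minimal sketch of the high-water-mark pattern those links point at. The constants, struct layout, and function names below are placeholders, not the actual TelemetryIPCAccumulator code:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants and names only -- the real ones live in
// toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp.
const size_t kAccumulationsArrayHighWaterMark = 5 * 1024;

struct Accumulation {
  uint32_t mId;      // histogram ID
  uint32_t mSample;  // accumulated sample value
};

std::vector<Accumulation> gAccumulations;

void ArmIPCTimer() { /* schedule the regular batched send (stub) */ }
void DispatchImmediateSend() { /* ask the main thread to flush now (stub) */ }

void AccumulateChild(uint32_t aId, uint32_t aSample) {
  gAccumulations.push_back({aId, aSample});
  if (gAccumulations.size() > kAccumulationsArrayHighWaterMark) {
    // Over the watermark: flush as soon as possible instead of waiting for
    // the periodic timer, so the eventual IPC message stays small.
    DispatchImmediateSend();
  } else {
    ArmIPCTimer();
  }
}
```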
(In reply to Chris H-C :chutten from comment #4)
> What is "too large"? The telemetry IPC code has some watermarks[1][2] to try
> and avoid egregiously-sized IPC messages that _ought_ to prevent such things
> (unless they're the wrong size, or otherwise ineffective... or if the main
> thread is too wigged out to serve the request in a timely fashion or the
> parent process can't receive it for whatever reason)
>
> [1]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#46-47,50
> [2]: http://searchfox.org/mozilla-central/source/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#141,156,172,214

Hi Chris, I was guessing at the "too large" problem because I saw |MOZ_CRASH("IPC message size is too large");| in comment 0. If it isn't a size issue, do you have ideas about what is going on with this crash signature, or could you point me to the right person? Thank you.
Flags: needinfo?(htsai)
Flags: needinfo?(chutten)
For information, I added the payload size at the end of the backtrace:

(gdb) p msg->header_->payload_size
$2 = 1075819928
A gig? Oh wow, that's impressive. How could that have happened...

I find it unlikely that it is full of legitimately-accumulated telemetry data. Each accumulation is two uint32_t... Oh, but a KeyedAccumulation has an nsCString key. That could be a problem, since I don't see any code limiting those keys' lengths. That could also explain some of the OOMs we occasionally see in Telemetry IPC code due to outrageously-large allocations...

Georg, is there supposed to be a length limit for keyed{Histograms,Scalars,etc}' keys?

Irrespective of this, a sensible fix might be to rewrite the watermarks in terms of the size of the structs we're sending.
Flags: needinfo?(chutten) → needinfo?(gfritzsche)
This would be bug 1275035. Possibly the right fix here is to land that with a conservative value (+ some asserts and error telemetry?).
Flags: needinfo?(gfritzsche)
(In reply to Calixte Denizet (:calixte) from comment #0)
> #2 0x00007fc072698dfa in mozilla::dom::PContentChild::SendAccumulateChildHistograms (this=0x7fc07f268020, accumulations=...)
>    at /home/calixte/dev/mozilla/mozilla-central.hg/obj-x86_64-pc-linux-gnu/ipc/ipdl/PContentChild.cpp:4823

But note that this is not in SendAccumulateChildKeyedHistograms.
Then I am baffled. struct Accumulation is exactly two uint32_t. payload_size is 1075819928. Assuming payload_size is in bytes, that's over 134 Million accumulations. That's over 67k per millisecond (2s timer _ought_ to preclude these accumulations from being over 2000ms old). That seems rather impossible.
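A quick worked check of that arithmetic, assuming payload_size is in bytes and each Accumulation is two uint32_t (8 bytes):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t payloadBytes = 1075819928;  // payload_size from the gdb session above
  const uint64_t bytesPerAccumulation = 2 * sizeof(uint32_t);  // struct Accumulation: two uint32_t
  const uint64_t accumulations = payloadBytes / bytesPerAccumulation;  // ~134 million
  const uint64_t perMs = accumulations / 2000;  // the batch timer fires roughly every 2 seconds
  std::printf("%llu accumulations, ~%llu per millisecond over 2000 ms\n",
              static_cast<unsigned long long>(accumulations),
              static_cast<unsigned long long>(perMs));
  return 0;
}
```

This prints about 134 million accumulations and ~67k per millisecond, matching the figures above.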
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Component: DOM → Telemetry
Product: Core → Toolkit
Whiteboard: [measurement:client]
Priority: -- → P1
Per talking through it this week, running up accumulation arrays of that size is actually possible (recording into Telemetry is faster than assumed). The problem is presumably that some scenarios accumulate Telemetry too fast while we wait for the main thread to clear out and send the accumulations.

The next steps here are:
- truncating the accumulations when going over a limit (sketched below)
- adding metrics to track that this happens
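A rough sketch of the truncation idea; the discard factor, counter name, and threshold here are illustrative, while the actual change is in the attached patch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants/names; the real values come from the patch under review.
const size_t kAccumulationsArrayHighWaterMark = 5 * 1024;
// Multiplier over the watermark beyond which we stop buffering and drop data.
const size_t kWaterMarkDiscardFactor = 100;

struct Accumulation {
  uint32_t mId;
  uint32_t mSample;
};

std::vector<Accumulation> gAccumulations;
uint32_t gDiscardedAccumulations = 0;

void AccumulateChild(uint32_t aId, uint32_t aSample) {
  if (gAccumulations.size() >=
      kWaterMarkDiscardFactor * kAccumulationsArrayHighWaterMark) {
    // The main thread hasn't flushed for a long time: drop the sample and
    // count the loss so it can be reported once a flush finally happens.
    ++gDiscardedAccumulations;
    return;
  }
  gAccumulations.push_back({aId, aSample});
}
```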
Attachment #8876267 - Flags: feedback?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

This looks like the right approach, but I'm not sure about:
- the way this sends the accumulations
- the chosen factor, and whether it is sufficient to protect us from unwanted OOMs

::: toolkit/components/telemetry/Scalars.yaml:478
(Diff revision 1)
>     - telemetry-client-dev@mozilla.com
>   release_channel_collection: opt-out
>   record_in_processes:
>     - 'main'
>
> +telemetry.discarded:

OK, I kind of wish we had bug 1343855 for scalars already; that would allow us to specify all those properties once.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:56
(Diff revision 1)
> // With the current limits, events cost us about 1100 bytes each.
> // This limits memory use to about 10MB.
> const size_t kEventsArrayHighWaterMark = 10000;
> +// If we are starved we can overshoot the watermark.
> +// This is the multiplier over which we will discard data.
> +const size_t kWaterMarkDiscardFactor = 100;

Is this low enough? What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:59
(Diff revision 1)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;

Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:299
(Diff revision 1)
> +  scalarsToSend.AppendElement(ScalarAction{
> +    ScalarID::TELEMETRY_DISCARDED_ACCUMULATIONS, ScalarActionType::eAdd,
> +    Some(AsVariant(gDiscardedAccumulations)) });

This seems fragile and duplicates the accumulation details. Can we add an IPC message for this and record the scalars in the parent?
Attachment #8876267 - Flags: review?(gfritzsche)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152486

data-r=me - please do a pi-request if this is something we should add to mission control monitoring (sounds like it should be!)
Attachment #8876267 - Flags: review+
Attachment #8876267 - Flags: feedback?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review152386

> Is this low enough?
> What max. memory use does this factor amount to for histograms, keyed histograms, scalars, ...?

Prior art for this was here: https://bugzilla.mozilla.org/show_bug.cgi?id=1338555#c8

Current watermarks have us at:
40k for accumulations
280k for keyed accumulations
400k for scalars
880k for keyed scalars
???? for events
----
1.6M + ???

There's a 256M limit on IPC messages (also according to bug 1338555).

With a factor of 100, we're looking at 160M + 100 * (size of events watermark). If the events watermark is 1M, we might still hit the limit if every buffer is full at once.

> Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?

{Keyed}Accumulations is the name of the type, so it seemed the most sensible to use.

> This seems fragile and duplicates the accumulation details.
> Can we add an IPC message for this and record the scalars in the parent?

Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but everything is strongly typed. Using the scalar mechanism itself seems an excellent way to ensure we have the flexibility to change our data reporting requirements in the future, if necessary.

I suppose we could add custom IPC for this, but that seems fragile as well, given the number of processes we need to support already (and into the future). I'd much rather report in-band than out-of-band.
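For reference, a back-of-the-envelope bound from those numbers (the per-buffer sizes are the estimates quoted above, events excluded):

```cpp
#include <cstdio>

int main() {
  // Approximate per-buffer sizes at the current watermarks, as estimated above (KB).
  const double accumulationsKB      = 40;
  const double keyedAccumulationsKB = 280;
  const double scalarsKB            = 400;
  const double keyedScalarsKB       = 880;
  const double totalKB = accumulationsKB + keyedAccumulationsKB +
                         scalarsKB + keyedScalarsKB;  // ~1.6 MB, events not included
  const double discardFactor = 100;
  const double ipcLimitMB = 256;  // IPC message size limit cited above
  std::printf("worst case ~%.0f MB buffered vs. %.0f MB IPC limit (events excluded)\n",
              totalKB * discardFactor / 1024, ipcLimitMB);
  return 0;
}
```

This works out to roughly 156 MB, in line with the ~160M figure above.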
(In reply to Benjamin Smedberg [:bsmedberg] from comment #14)
> Comment on attachment 8876267 [details]
> bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg
>
> https://reviewboard.mozilla.org/r/147708/#review152486
>
> data-r=me - please do a pi-request if this is something we should add to
> mission control monitoring (sounds like it should be!)

I thought Mission Control was more interested in user-impacting criteria, not Telemetry Client Health?
Flags: needinfo?(benjamin)
Mission control is intended to be a system for monitoring all incoming telemetry data in a permanent/reliable way.
Flags: needinfo?(benjamin)
Then I may just happen to have a list of Telemetry Health criteria that I may wish to add:
- all of these new telemetry.discarded scalars
- TELEMETRY_FAILURE_TYPE
- ...actually, probably just about all of the TELEMETRY_* histograms that aren't TELEMETRY_TEST_*
- ping latency
- ping duplicates
- missing subsessions (discontinuous info.profileSubsessionCounter)

Is there a process for having these added? MC seems pretty early-days as of yet, but presumably we could make a table of "measurement, location (hgram, scalar, existing table, etc.), type (count, threshold, etc.), alerting threshold, alert emails" rows.
NI for comment 18.
Flags: needinfo?(benjamin)
(In reply to Chris H-C :chutten from comment #15)
> There's a 256M limit on IPC messages (also according to bug 1338555)
>
> With a factor of 100, we're looking at 160M + 100 * (size of events
> watermark). If the events watermark is 1M, we might still hit the limit if
> every buffer's full at once.

The rough estimate here puts us at a ~10M upper bound:
https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#48

This seems problematic. Do we have a reason to use a higher factor than, say, 2x or 5x?

> > Can we name these `gDiscarded{Keyed}HistogramAccumulations` to avoid confusion?
>
> {Keyed}Accumulations is the name of the type, so it seemed the most sensible
> to use.

I see, we could rename those too. This could be a good mentored follow-up bug?

> > This seems fragile and duplicates the accumulation details.
> > Can we add an IPC message for this and record the scalars in the parent?
>
> Which part is fragile? The Some(AsVariant(...)) stuff was a bit odd, but
> everything is strongly typed. Using the scalar mechanism itself seems an
> excellent way to ensure we have the flexibility to change our data reporting
> requirements in the future, if necessary.
>
> I suppose we could add custom IPC for this, but that seems fragile as well,
> given the number of processes we need to support already (and into the
> future).

This leaks out / duplicates the scalar IPC serialization; I don't think we should do that. One specific concern is that we change semantics on this without changing the types, then overlook changing this (as it's hard to discover).

We could:
1) properly share that serialization code, or
2) send up this information in a separate message, or
3) add test coverage that assures this doesn't break

2) doesn't seem too bad either; it would involve:
- invoking the IPC message for this on whatever IPC process actor we have (similar to [1])
- routing this through TelemetryIPC.h/.cpp

The first part should ensure we don't forget to add this for the different supported processes (it produces a compile error).

1: https://dxr.mozilla.org/mozilla-central/rev/91134c95d68cbcfe984211fa3cbd28d610361ef1/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#231
gfritzsche, please email pi-request@mozilla.com and we'll add it to the mission control backlog (expect that to take a couple of months).
Flags: needinfo?(benjamin)
Comment on attachment 8876267 [details]
bug 1369041 - Allow child processes to discard data when overwhelmed f?bsmedberg

https://reviewboard.mozilla.org/r/147708/#review155148

Thanks.

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:22
(Diff revision 2)
> +using mozilla::AsVariant;
> +using mozilla::Some;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:34
(Diff revision 2)
> using mozilla::Telemetry::Accumulation;
> +using mozilla::Telemetry::DiscardedData;
> using mozilla::Telemetry::KeyedAccumulation;
> using mozilla::Telemetry::ScalarActionType;
> using mozilla::Telemetry::ScalarAction;
> +using mozilla::Telemetry::ScalarID;

Unused?

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:60
(Diff revision 2)
> +uint32_t gDiscardedAccumulations = 0;
> +uint32_t gDiscardedKeyedAccumulations = 0;
> +uint32_t gDiscardedScalarActions = 0;
> +uint32_t gDiscardedKeyedScalarActions = 0;
> +uint32_t gDiscardedChildEvents = 0;

Can we track this in a `DiscardedData`? Then zero-initialization & reset are trivial (C++ value-initialization semantics).

::: toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp:303
(Diff revision 2)
> +  discardedData = {
> +    gDiscardedAccumulations,
> +    gDiscardedKeyedAccumulations,
> +    gDiscardedScalarActions,
> +    gDiscardedKeyedScalarActions,
> +    gDiscardedChildEvents};

Nit: The closing bracket should be on a new line.
Attachment #8876267 - Flags: review?(gfritzsche) → review+
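To illustrate the value-initialization point from the review, a minimal sketch with assumed field names (the real DiscardedData is an IPC-serializable struct in the Telemetry code, not this exact shape):

```cpp
#include <cstdint>

// Hypothetical field names for illustration only.
struct DiscardedData {
  uint32_t mDiscardedHistogramAccumulations;
  uint32_t mDiscardedKeyedHistogramAccumulations;
  uint32_t mDiscardedScalarActions;
  uint32_t mDiscardedKeyedScalarActions;
  uint32_t mDiscardedChildEvents;
};

// Value-initialization ("{}") zeroes every member of the aggregate, so there
// is no need for five separately initialized globals.
DiscardedData gDiscardedData{};

// Resetting after the counts have been sent is a single assignment.
void ResetDiscardedData() {
  gDiscardedData = DiscardedData{};
}
```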
Blocks: 1375043
Pushed by chutten@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fde21799bb80
Allow child processes to discard data when overwhelmed r=bsmedberg,gfritzsche f?bsmedberg
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla56
Does this need to be considered for uplift or can it ride the 56 train?
Flags: needinfo?(chutten)
As mentioned over email, this is a very unlikely crash in a situation we already only tolerably support (megs-and-megs of JSON) so there's no great rush.
Flags: needinfo?(chutten)