Closed Bug 1127914 Opened 10 years ago Closed 10 years ago

Duplicate telemetry histograms for double submission

Categories

(Toolkit :: Telemetry, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla39
Tracking Status
firefox39 --- fixed

People

(Reporter: gfritzsche, Assigned: gfritzsche)

References

Details

(Whiteboard: [ready])

Attachments

(3 files, 8 obsolete files)

(deleted), patch
gfritzsche
: review+
Details | Diff | Splinter Review
(deleted), patch
gfritzsche
: review+
Details | Diff | Splinter Review
(deleted), patch
gfritzsche
: review+
Details | Diff | Splinter Review
Per the discussion on the FHR/Telemetry unification concerns [0] we want to go with client-side submission of both "reset per subsession" and "old style telemetry" histograms. This means that we have to duplicate histograms. Bug 1126867 showed that this has acceptable memory costs. 0:"Suggestions for the new unified FHR/Telemetry/Experiment ping", https://mail.mozilla.org/pipermail/fhr-dev/2015-January/thread.html#382
Assignee: nobody → gfritzsche
Blocks: 1127919
Attached patch Part 1: Duplicate keyed histograms (obsolete) (deleted) — Splinter Review
This duplicates keyed histogram data and allows to reset and retrieve the duplicate separately. Notes: * I'm keeping this to the minimum required for handling duplicates in TelemetrySession.jsm * I don't intend to document the duplicate features as this is only for a temporary telemetry-internal use-case
Attachment #8558148 - Flags: review?(vdjeric)
Status: NEW → ASSIGNED
Attached patch Part 1: Duplicate keyed histograms (obsolete) (deleted) — Splinter Review
Fixed minor oversight.
Attachment #8558148 - Attachment is obsolete: true
Attachment #8558148 - Flags: review?(vdjeric)
Attachment #8558159 - Flags: review?(vdjeric)
Attached patch Part 2: Duplicate normal histograms (obsolete) (deleted) — Splinter Review
Attachment #8558646 - Flags: review?(vdjeric)
This still has some test issues, but gives context on the usage of part 1 & 2.
Comment on attachment 8558646 [details] [diff] [review] Part 2: Duplicate normal histograms Review of attachment 8558646 [details] [diff] [review]: ----------------------------------------------------------------- Given that we expect subsession histograms to eventually completely replace the full-session histograms, shouldn't we make the full-session histograms be the "duplicates"? ::: toolkit/components/telemetry/Telemetry.cpp @@ +993,5 @@ > } > > +Histogram* > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId, > + Histogram& existing) - add a comment header for these functions explaining what they do & the motivation @@ +995,5 @@ > +Histogram* > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId, > + Histogram& existing) > +{ > + const TelemetryHistogram &p = gHistograms[existingId]; - I'd prefer a check for nullness of gHistograms[existingId] here, rather than in this function's callers - pick a more descriptive name than p @@ +1009,5 @@ > + } > + > + Histogram::SampleSet ss; > + existing.SnapshotSample(&ss); > + clone->AddSampleSet(ss); There's a chance of a race condition here, SnapshotSample() is not protected with any locks! http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/histogram.cc#389 This could cause the sub-session histogram to miss samples added to the full-session histogram. This wasn't a big deal before because HistogramFrom was only used to create snapshots of histograms at startup. It gets more sensitive for the new use case @@ +1019,5 @@ > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId) > +{ > + Histogram *existing = nullptr; > + nsresult rv = GetHistogramByEnumId(existingId, &existing); > + if (NS_FAILED(rv)) { why not put this check into the other CloneHistogram function? That way we'll know the check is done even if some future code calls the 3-parameter version of CloneHistogram directly @@ +1027,5 @@ > + return CloneHistogram(newName, existingId, *existing); > +} > + > +Histogram* > +GetDuplicateHistogram(Histogram& existing) same comment about the "duplicate" name @@ +1057,5 @@ > +nsresult > +HistogramAdd(Histogram& histogram, int32_t value) > +{ > + histogram.Add(value); > + if (Histogram* duplicate = GetDuplicateHistogram(histogram)) { I feel like something very visible should happen if we're not able to create a duplicate histogram (e.g. an assert) @@ +2194,5 @@ > + Histogram* original = h; > + if (duplicates) { > + h = GetDuplicateHistogram(*h); > + if (!h) { > + continue; why would we want to continue here? ::: toolkit/components/telemetry/nsITelemetry.idl @@ +58,5 @@ > /** > + * As histogramSnapshots, except this contains the internally duplicated histograms for subsession telemetry. > + */ > + [implicit_jscontext] > + readonly attribute jsval duplicateHistogramSnapshots; Duplicate is ambiguous, especially since it's not clear which is which. Let's use another name, maybe "subsessionHistogramSnapshots" or "dailyHistogramSnapshots". ::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js @@ +604,5 @@ > Assert.ok(!registered.has("TELEMETRY_TEST_KEYED_FLAG")); > Assert.ok(registered.has("TELEMETRY_TEST_KEYED_RELEASE_OPTOUT")); > } > > +function test_duplicates() { Add another test for calling clear() when there's data in both snapshots @@ +630,5 @@ > + h.clear(true); > + h.add(1); > + snapshot = Telemetry.histogramSnapshots; > + duplicates = Telemetry.duplicateHistogramSnapshots; > + Assert.deepEqual(original[ID], duplicates[ID]); this is a little hard to parse.. why not just check for value of the histogram's .sum field directly? @@ +637,5 @@ > + // Increasing the count again means the duplicate should now match the second snapshot. > + h.add(1); > + snapshot = Telemetry.histogramSnapshots; > + duplicates = Telemetry.duplicateHistogramSnapshots; > + Assert.deepEqual(original2[ID], duplicates[ID]); ditto
Attachment #8558646 - Flags: review?(vdjeric)
Comment on attachment 8558159 [details] [diff] [review] Part 1: Duplicate keyed histograms Review of attachment 8558159 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +3877,5 @@ > +nsresult > +KeyedHistogram::Add(const nsCString& key, uint32_t sample) > +{ > + if (!TelemetryImpl::CanRecord()) { > + return NS_ERROR_FAILURE; doesn't this mean we're going to report a JS error and throw an exception(?) if Telemetry is turned off? @@ +3883,5 @@ > + > + Histogram* histogram = GetHistogram(key, false); > + Histogram* duplicate = GetHistogram(key, true); > + if (!histogram || !duplicate) { > + return NS_ERROR_FAILURE; in general, i don't like throwing exceptions from Telemetry code.. they're nevercaught by the callers and they're hard to spot when it starts happening. Asserts (or segfaults for OOMs) would be preferable
Attachment #8558159 - Flags: review?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #5) > Given that we expect subsession histograms to eventually completely replace > the full-session histograms, shouldn't we make the full-session histograms > be the "duplicates"? I don't think that matters much, we have to remove one code-path either way and only submit/reset one of them. > @@ +1009,5 @@ > > + } > > + > > + Histogram::SampleSet ss; > > + existing.SnapshotSample(&ss); > > + clone->AddSampleSet(ss); > > There's a chance of a race condition here, SnapshotSample() is not protected > with any locks! > > http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/ > histogram.cc#389 > > This could cause the sub-session histogram to miss samples added to the > full-session histogram. > This wasn't a big deal before because HistogramFrom was only used to create > snapshots of histograms at startup. It gets more sensitive for the new use > case I vaguely remember that histograms and the telemetry API should only be used from the same thread. Is that not the case anymore? I'm asking because the cloning i added is only triggered from the histogram accumulate/add code paths. If we need this, where would you prefer to protect this? Synchronizing Telemetry call paths or on the histograms themselves? > @@ +1019,5 @@ > > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId) > > +{ > > + Histogram *existing = nullptr; > > + nsresult rv = GetHistogramByEnumId(existingId, &existing); > > + if (NS_FAILED(rv)) { > > why not put this check into the other CloneHistogram function? That way > we'll know the check is done even if some future code calls the 3-parameter > version of CloneHistogram directly I did not want to call GetHistogramByEnumId() unnecessarily - one caller / call path already has the histogram, so the lookup is not needed. > @@ +2194,5 @@ > > + Histogram* original = h; > > + if (duplicates) { > > + h = GetDuplicateHistogram(*h); > > + if (!h) { > > + continue; > > why would we want to continue here? Because keyed histograms follow a different code path and we don't get duplicates for the existing duplicate histograms - GetDuplicateHistogram() fails for them.
needinfo? on the open questions in comment 7.
Flags: needinfo?(vdjeric)
(In reply to Vladan Djeric (:vladan) from comment #5) > @@ +1057,5 @@ > > +nsresult > > +HistogramAdd(Histogram& histogram, int32_t value) > > +{ > > + histogram.Add(value); > > + if (Histogram* duplicate = GetDuplicateHistogram(histogram)) { > > I feel like something very visible should happen if we're not able to create > a duplicate histogram (e.g. an assert) For simplicity the additions are limited to registered histograms. For all other histograms, GetDuplicateHistogram() (or now GetSubsessionHistogram()), returns null, which is not a failure. We should handle failure cases in GetSubsessionHistogram().
Attached patch Part 1: Duplicate keyed histograms (obsolete) (deleted) — Splinter Review
Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558159 - Attachment is obsolete: true
Attachment #8562368 - Flags: review?(vdjeric)
Attached patch Part 2: Duplicate normal histograms (obsolete) (deleted) — Splinter Review
Addressed feedback except the lock protection - i would handle that in a separate patch depending on the answer to the open question.
Attachment #8558646 - Attachment is obsolete: true
Attachment #8562373 - Flags: review?(vdjeric)
This takes care of submitting the duplicates for all session reasons other than the "classic"/"full-session" ones.
Attachment #8559251 - Attachment is obsolete: true
Attachment #8562426 - Flags: review?(vdjeric)
No longer blocks: 1127919
Blocks: 1127919
No longer blocks: 1069869
> I vaguely remember that histograms and the telemetry API should only be used > from the same thread. Is that not the case anymore? That's not correct. Some of the Telemetry.idl methods get called from other threads and there are some C++ methods exposed via Telemetry.h that access the same data from multiple threads (e.g. RecordSlowSQLStatement). The histograms can be accumulated from a not-main thread (e.g. COMPOSITE_TIME) or even multiple threads (e.g. MOZ_SQLITE_PLACES_READ_MS). These accumulations are not synchronized! The thinking is that Telemetry doesn't have to be 100% accurate because we only look at the results in aggregate, and we definitely don't want to pay the cost of synchronizing, we just need to make sure there's no data corruption. > I'm asking because the cloning i added is only triggered from the histogram > accumulate/add code paths. Accumulate is not synchronized, so cloning could race with accumulate. > If we need this, where would you prefer to protect this? Synchronizing > Telemetry call paths or on the histograms themselves? On second thought, I think this is just something we might have to tolerate. Let's document Telemetry behavior wrt threads in your in-tree doc. > > @@ +1019,5 @@ > > > +CloneHistogram(const nsACString& newName, Telemetry::ID existingId) > > > +{ > > > + Histogram *existing = nullptr; > > > + nsresult rv = GetHistogramByEnumId(existingId, &existing); > > > + if (NS_FAILED(rv)) { > > > > why not put this check into the other CloneHistogram function? That way > > we'll know the check is done even if some future code calls the 3-parameter > > version of CloneHistogram directly > > I did not want to call GetHistogramByEnumId() unnecessarily - one caller / > call path already has the histogram, so the lookup is not needed. You could only call GetHistogramByEnumId() in the 3-parameter version of CloneHistogram, but only if the ID is some known invalid value. But that would be messier cause the IDs are an auto-generated uint32_t enum. So nevermind :) > > @@ +2194,5 @@ > > > + Histogram* original = h; > > > + if (duplicates) { > > > + h = GetDuplicateHistogram(*h); > > > + if (!h) { > > > + continue; > > > > why would we want to continue here? > > Because keyed histograms follow a different code path and we don't get > duplicates for the existing duplicate histograms - GetDuplicateHistogram() > fails for them. Oh I see, GetDuplicateHistogram returns a nullptr if the histogram is keyed or if there's something really wrong with Telemetry (e.g. histogram ID not found).
Flags: needinfo?(vdjeric)
Comment on attachment 8562373 [details] [diff] [review] Part 2: Duplicate normal histograms Review of attachment 8562373 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/Telemetry.cpp @@ +993,5 @@ > } > > +// This clones a histogram |existing| with the id |existingId| to a > +// new histogram with the name |newName|. > +// For simplicity this is limited to registered histograms. nit: use the /** .. */ style @@ +1036,5 @@ > +Histogram* > +GetSubsessionHistogram(Histogram& existing) > +{ > + Telemetry::ID id; > + nsresult rv = TelemetryImpl::GetHistogramEnumId(existing.histogram_name().c_str(), &id); existing.id() has the histogram ID @@ +1236,5 @@ > + bool onlySubsession = false; > + JS::CallArgs args = JS::CallArgsFromVp(argc, vp); > + > + if (args.length() >= 1) { > + if (!(args[0].isNumber() || args[0].isBoolean())) { do we really want to allow number arguments here? @@ +1248,4 @@ > Histogram *h = static_cast<Histogram*>(JS_GetPrivate(obj)); > + MOZ_ASSERT(h); > + if(!onlySubsession) { > + h->Clear(); it likely won't matter, but clear() is going to be racey too
Attachment #8562373 - Flags: review?(vdjeric) → review+
Comment on attachment 8562368 [details] [diff] [review] Part 1: Duplicate keyed histograms Review of attachment 8562368 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/tests/unit/test_nsITelemetry.js @@ +610,5 @@ > + const KEY = "foo"; > + > + // Both original and subsession should start out the same. > + h.clear(); > + Assert.ok(!("foo" in h.snapshot())); use KEY in these asserts instead of "foo"
Attachment #8562368 - Flags: review?(vdjeric) → review+
Comment on attachment 8562426 [details] [diff] [review] Part 3: Submit duplicate histograms for non-classic payload reasons Review of attachment 8562426 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/telemetry/TelemetrySession.jsm @@ +792,5 @@ > * to |this.getSimpleMeasurements| and |this.getMetadata|, > * respectively. > */ > + assemblePayloadWithMeasurements: function(simpleMeasurements, info, reason) { > + const classicReasons = [ this is pretty ugly, but I see how it allows you to make fewer changes. @@ +807,5 @@ > let payloadObj = { > ver: PAYLOAD_VERSION, > simpleMeasurements: simpleMeasurements, > + histograms: this.getHistograms(isSubsession), > + keyedHistograms: this.getKeyedHistograms(isSubsession), which code is going to call clear() on the subsession histograms? ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js @@ +615,5 @@ > + // Both classic and subsession payload histograms should start the same. > + count.clear(); > + keyed.clear(); > + let classic = TelemetrySession.getPayload(); > + let subsession = TelemetrySession.getPayload("environment-change"); might want to explicitly test getPayload("saved-session") == getPayload() @@ +641,5 @@ > + Assert.ok(KEYED_ID in subsession.keyedHistograms); > + Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1); > + Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1); > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1); > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1); Couldn't you just verify classic has the values you want, and then just Assert.deepEqual(classic, subsession)? @@ +654,5 @@ > + Assert.ok(!(COUNT_ID in subsession.histograms)); > + Assert.ok(KEYED_ID in classic.keyedHistograms); > + Assert.ok(KEYED_ID in subsession.keyedHistograms); > + Assert.deepEqual(classic.keyedHistograms[KEYED_ID], {}); > + Assert.deepEqual(subsession.keyedHistograms[KEYED_ID], {}); ditto
Attachment #8562426 - Flags: review?(vdjeric) → review+
There's a lot of changes here, so make sure to do an end-to-end test. You can use this dummy Telemetry server, instructions in README: https://github.com/vdjeric/gzipServer
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #17) > There's a lot of changes here, so make sure to do an end-to-end test. You > can use this dummy Telemetry server, instructions in README: > > https://github.com/vdjeric/gzipServer You can also run the real Telemetry server locally - instructions: https://github.com/mozilla/telemetry-server#httpserverjs
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #16) > @@ +807,5 @@ > > let payloadObj = { > > ver: PAYLOAD_VERSION, > > simpleMeasurements: simpleMeasurements, > > + histograms: this.getHistograms(isSubsession), > > + keyedHistograms: this.getKeyedHistograms(isSubsession), > > which code is going to call clear() on the subsession histograms? This will happen in bug 1120362, with the session split. > ::: toolkit/components/telemetry/tests/unit/test_TelemetryPing.js > @@ +615,5 @@ > > + // Both classic and subsession payload histograms should start the same. > > + count.clear(); > > + keyed.clear(); > > + let classic = TelemetrySession.getPayload(); > > + let subsession = TelemetrySession.getPayload("environment-change"); > > might want to explicitly test getPayload("saved-session") == getPayload() Between collecting the two payloads, a number of histograms and other values may actually get updated (GC_REASON2, uptime, ...). Also, we only duplicate registered histograms into the subsession. We can't do a deep equality test, but i've added: * verifying that we have the registered histograms in the subsession payload * deep equality between subsession and classic for specific known-to-be-stable histograms > @@ +641,5 @@ > > + Assert.ok(KEYED_ID in subsession.keyedHistograms); > > + Assert.equal(classic.keyedHistograms[KEYED_ID]["a"].sum, 1); > > + Assert.equal(classic.keyedHistograms[KEYED_ID]["b"].sum, 1); > > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["a"].sum, 1); > > + Assert.equal(subsession.keyedHistograms[KEYED_ID]["b"].sum, 1); > > Couldn't you just verify classic has the values you want, and then just > Assert.deepEqual(classic, subsession)? See above.
Attached patch Part 1: Duplicate keyed histograms (obsolete) (deleted) — Splinter Review
Attachment #8562368 - Attachment is obsolete: true
Attachment #8564286 - Flags: review+
Attachment #8564287 - Flags: review+
Attachment #8562373 - Attachment is obsolete: true
Attachment #8562426 - Attachment is obsolete: true
Attachment #8564288 - Flags: review+
> > which code is going to call clear() on the subsession histograms? > > This will happen in bug 1120362, with the session split. Ok, you'll want the histogram clear() to happen as close as possible to the snapshotting.. the more time elapses between capturing the snapshot and resetting the histograms, the more data gets lost. so for example you wouldn't want to have any yields from a Task in between the snapshot & clear
.. a snapshotAndClear method would be ideal
Good point, added moved that to bug 1120362.
Ok, the only failures seem to be hitting MOZ_ASSERT on Android: 03:26:17 INFO - 02-14 03:26:01.593 F/MOZ_Assert( 2258): Assertion failure: this->mRecursionLevel > 0, at /builds/slave/try-and-api-11-d-0000000000000/build/xpcom/glue/pldhash.cpp:567 https://tbpl.mozilla.org/php/getParsedLog.php?id=59037148&tree=Try&full=1 https://tbpl.mozilla.org/php/getParsedLog.php?id=59037188&tree=Try&full=1 https://tbpl.mozilla.org/php/getParsedLog.php?id=59037236&tree=Try&full=1 So, we should not hit the duplicate code on non-desktop anyway, i'll address that. We should still understand the issue though in case this is not Android-specific.
Blocks: 1069869
No longer blocks: 1127919
Whiteboard: [ready]
Rebase.
Attachment #8564286 - Attachment is obsolete: true
Attachment #8568521 - Flags: review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: