Closed
Bug 704176
Opened 13 years ago
Closed 13 years ago
Telemetry sends bad buckets sometimes
Categories
(Toolkit :: Telemetry, defect)
Tracking
()
RESOLVED
FIXED
mozilla12
People
(Reporter: xstevens, Assigned: froydnj)
References
Details
Attachments
(3 files, 3 obsolete files)
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
taras.mozilla
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
I've been doing some analysis for taras and Telemetry sends bad buckets for some histograms which is causing the telemetry dashboard to try to display many more buckets than are possible.
One example from 10/23/2011 is on Firefox 8.0 build ID 20111011182523. There was a single doc that came up with buckets 6,19,59 for EARLY_GLUESTARTUP_READ_TRANSFER. No other builds on that version, that day used those buckets. I'll attach my data files.
I've also seen this happen in other versions and other histograms as well. Firefox 7.0.1 and CYCLE_COLLECTOR for instance.
Reporter | ||
Comment 1•13 years ago
|
||
Reporter | ||
Comment 2•13 years ago
|
||
Also here are the CYCLE_COLLECTOR bucket counts by version by day. This just shows what versions are over their bucket_count (should be 50) for that histogram.
http://people.mozilla.org/~xstevens/telemetry/704176/
Reporter | ||
Comment 3•13 years ago
|
||
cc'ing taras and glandium.
glandium: taras asked me to see if you could look into this since he's out this week.
Updated•13 years ago
|
Assignee: nobody → tglek
Comment 4•13 years ago
|
||
Xavier, I could give you a list of buckets in every histogram, but we could also port the bucket generation algorithm to java so we dont have to do this every time we add a histogram.
Assignee | ||
Comment 6•13 years ago
|
||
We're going to piggyback on this feature to help detect corruption or machine mismatches when reading in histograms in bug 707320.
Assignee: taras.mozilla → nfroyd
Blocks: 707320
Assignee | ||
Comment 7•13 years ago
|
||
Here's a patch to do some corruption detection on the client side. We'll just not reflect things into JS that we know to be bad.
I'm a little unsure of what to do for JSHistogram_Snapshot. Giving back garbage data (as we might do before) isn't desirable, but just punting and handing back nothing doesn't seem like the right idea either. I've opted to punt and hand back nothing in this patch. (In any event, that only really matters for cloned histograms and user-created histograms, which are not the focus here.)
Attachment #586512 -
Flags: feedback?(taras.mozilla)
Comment 8•13 years ago
|
||
Comment on attachment 586512 [details] [diff] [review]
don't reflect corrupt histograms
s/SKIP/CORRUPT/. I think this should throw an exception instead of returning an empty snapshot object which could lead to undefined values propagating.
You should also add a histogram to track detected corruption. Might make sense to clear the histogram after detecting corruption too.
f+ since this is relatively close to correct.
Attachment #586512 -
Flags: feedback?(taras.mozilla) → feedback+
Assignee | ||
Comment 9•13 years ago
|
||
Histogram::FindCorruption actually logs some instances of corruption (not all) in secret histograms. So those would show up in telemetry pings, I guess?
I am a little leery of updating the histogram while you're looping over them in GetHistogramSnapshots. One could make a pre-pass to find corrupted instances and then only reflect the ones that are OK, though...do we have proper bit vectors somewhere?
Comment 10•13 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #9)
> Histogram::FindCorruption actually logs some instances of corruption (not
> all) in secret histograms. So those would show up in telemetry pings, I
> guess?
sounds good
>
> I am a little leery of updating the histogram while you're looping over them
> in GetHistogramSnapshots. One could make a pre-pass to find corrupted
> instances and then only reflect the ones that are OK, though...do we have
> proper bit vectors somewhere?
not aware of any(short of vector<bool>)
Assignee | ||
Comment 11•13 years ago
|
||
Part 2 is going to maintain a corruption flag for every statically defined histogram. To do that, we need some way of mapping from a histogram (histogram name) to the ID associated with it. This function is the way to do that.
I know that this function is logically a boolean function, but kept the nsresult return code so that we get more precise errors in GetHistogramByName.
Attachment #587056 -
Flags: review?(taras.mozilla)
Assignee | ||
Comment 12•13 years ago
|
||
OK, so this addresses feedback comments. We now identify all corrupt (static) histograms prior to reflection and mark them as such. We don't reflect Histogram's secret histograms; we instead maintain our own for the different types of corruption that FindCorruption can report.
I chose not to accumulate histograms in ReflectHistogramSnapshot because a) we're already doing it in HistogramSnapshots and friends; and b) if we're not calling ReflectHistogramSnapshot from HistogramSnapshots, then we're calling it for a user-defined histogram, which we'll just report an error for. (And with an eye to the future, when we call ReflectHistogramSnapshot when reifying snapshots that came from disk, we don't want to measure corruption in those, just the ones we're accumulating this session.)
Attachment #586512 -
Attachment is obsolete: true
Attachment #587057 -
Flags: review?(taras.mozilla)
Updated•13 years ago
|
Attachment #587056 -
Flags: review?(taras.mozilla) → review+
Comment 13•13 years ago
|
||
Comment on attachment 587057 [details] [diff] [review]
part 2 - don't reflect corrupt histograms
don't really like that histograms are looked up by string.
You are doing 2x as many string comparisons than sensible.
strcmp(name, "Histogram.InconsistentCountHigh") check should go into NS_FAILED, but I guess that lazily creates histograms, so up to you if you want to create an immutable version of GetHistogramEnumId.
Attachment #587057 -
Flags: review?(taras.mozilla) → review+
Comment 14•13 years ago
|
||
This could benefit from some sort of test coverage, but I don't see how you'd do that.
Assignee | ||
Comment 15•13 years ago
|
||
Address review comments, in particular moving strcmps out of the common path. Tidied comments, too.
Attachment #587057 -
Attachment is obsolete: true
Attachment #587322 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Comment 16•13 years ago
|
||
http://hg.mozilla.org/integration/mozilla-inbound/rev/465bacf25d61
http://hg.mozilla.org/integration/mozilla-inbound/rev/47c464033fa8
Keywords: checkin-needed
Target Milestone: --- → mozilla12
Comment 17•13 years ago
|
||
backed out due to build bustage
../components/telemetry/Telemetry.o: In function `IdentifyCorruptHistograms':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:529: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:552: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
../components/telemetry/Telemetry.o: In function `(anonymous namespace)::TelemetryImpl::GetHistogramSnapshots(JSContext*, JS::Value*)':
/builds/slave/m-in-lnx/build/obj-firefox/toolkit/components/telemetry/../../../../toolkit/components/telemetry/Telemetry.cpp:601: undefined reference to `(anonymous namespace)::TelemetryImpl::gCorruptHistograms'
collect2: ld returned 1 exit status
Assignee | ||
Comment 18•13 years ago
|
||
Things work better if you actually declare storage for static variables.
Attachment #587322 -
Attachment is obsolete: true
Attachment #588594 -
Flags: review+
Assignee | ||
Updated•13 years ago
|
Keywords: checkin-needed
Comment 19•13 years ago
|
||
Try run for cb131e9edb0b is complete.
Detailed breakdown of the results available here:
https://tbpl.mozilla.org/?tree=Try&rev=cb131e9edb0b
Results (out of 277 total builds):
exception: 2
success: 231
warnings: 28
failure: 16
Builds (or logs if builds failed) available at:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/philringnalda@gmail.com-cb131e9edb0b
Comment 20•13 years ago
|
||
Pfft. Only 46 failures? I'll push that.
https://hg.mozilla.org/integration/mozilla-inbound/rev/33e46ccc6d5c
https://hg.mozilla.org/integration/mozilla-inbound/rev/af6585a74bb9
Keywords: checkin-needed
Comment 21•13 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/33e46ccc6d5c
https://hg.mozilla.org/mozilla-central/rev/af6585a74bb9
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Comment 22•13 years ago
|
||
Are we doing any kind of consistency checking on the server side? This is important, per the end-to-end principal.
You need to log in
before you can comment on or make changes to this bug.
Description
•