1311100 - [meta] Fix the measurement semantics for the Telemetry JS API

Reporter

Description

•

8 years ago

Right now we have a lot of code that uses try...catch wrappers around their calls to Services.telemetry because... well, it's not clear, actually. The current state has a few downsides: - it's verbose, and annoying to remember - we regularly include other code in the try catch, for which we'll (often silently) eat errors - we don't get any feedback (e.g. via telemetry...) about exceptions / issues with our calls to telemetry (e.g. broken params, expired histograms, whatever). It would be great if there was clarity about the exact circumstances under which telemetry calls fail (ie why we have all these try...catch wrappers). I've asked around, and not really found anybody who knew of something material. I thought it was possible for telemetry to be null if Firefox was built with the right options, but on looking at the code that seems not to be the case. It would be better if: - Telemetry guaranteed that all calls that fetch or operate on histograms are failsafe, irrespective of whether it failed to initialize, can't submit data, etc. This could potentially include passing back a no-op shim to getHistogramById calls that pass an unknown histogram ID - if passed anything broken, telemetry code is responsible for: -- not throwing JS exceptions -- reporting errors to the browser console -- potentially logging opt-in telemetry about broken callsites or something like that. This would allow removing all the consumer try...catch gunk and having a unified way of finding out what histograms (and/or their callsites) are broken, and in what circumstances telemetry is broken. As a general principle, it seems to me that no code that does any telemetry gathering genuinely appreciates being interrupted by a JS exception if reporting such telemetry fails. Does this make sense? Am I missing something? How hard would it be to implement this? Is there anything that prevents us removing these try...catch statements from consumers today?

Flags: needinfo?(gfritzsche)

Flags: needinfo?(chutten)

Benjamin Smedberg

Comment 1

•

8 years ago

I have a feeling we shouldn't wrap try/catch around telemetry, but you should work with Georg on this. Some additional notes/thoughts: 1) it was (is?) possible to disable telemetry at build time, so any calls would fail. I don't know whether this is still true, but let's remove this build option if so. 2) Code records telemetry to an unknown histogram. I have a feeling this *should* throw an exception. It indicates a programming error that should break tests. 3) Code records telemetry to an expired/disabled histogram. I'm 99.44% sure this doesn't throw, and it shouldn't. 4) Code records an illegal value. I'm torn on this one. Could we make this not-throw, but if this happens in an automated test we cause the test to fail?

Chris H-C :chutten

Comment 2

•

8 years ago

Sadly I don't really know if the current exception-throwing situation is by design or by happenstance. Also, I don't know if I agree. Exceptions should be thrown in exceptional cases. Accumulating incorrectly, to an expired or unknown histogram... these seem exceptional to me, and exactly the fault of the calling code. Calling telemetry at the "wrong" time? Yeah, that shouldn't throw (if it does). Timing things incorrectly isn't the caller's fault.

Flags: needinfo?(chutten)

:Gijs (he/him)

Reporter

Comment 3

•

8 years ago

(In reply to Chris H-C :chutten from comment #2) > Also, I don't know if I agree. Exceptions should be thrown in exceptional > cases. Accumulating incorrectly, to an expired or unknown histogram... these > seem exceptional to me, and exactly the fault of the calling code. I would disagree on the expired histogram. Given histogram foo, expires in 53 - when do I remove the code accumulating to and/or the histogram itself? I can't do it before merge day because now it'll be removed from 52. I can't do it afterwards because the world will explode with exceptions immediately after the version increment. I therefore suspect that this case doesn't currently throw, and would support it staying that way. As for unknown histograms, that might make sense, because those are normally hardcoded constants anyway so we should always hit those in tests etc. Invalid values to a known histogram, less so, because often the value is computed somehow and so if there are edgecases in that computation they are more likely to not be caught by tests or manual code exercising. Breaking the code that is doing the computation when the value doesn't match what the histogram expects seems wrong and would likely break other running code. It would be more appropriate if we had an alternative route of feeding information about this back through telemetry (e.g. a keyed histogram where the keys are... other histogram IDs!) that increments when we get called with invalid values. That histogram should always be empty, and we should file bugs for cases where it isn't, but I'm less convinced that we should break the caller "in the wild" on release builds if this happens.

Benjamin Smedberg

Comment 4

•

8 years ago

Agree with Gijs, his rationale is a better explanation of what I was thinking! In particular, it is not exceptional to record to an expired histogram. We want teams to be able to leave expired histograms in the tree without penalty, if they decide they may want to resurrect them later.