Closed Bug 1376493 Opened 7 years ago Closed 7 years ago

Aggregate String Scalars as Simple Counts

Categories

(Data Platform and Tools :: General, enhancement, P1)

Points:
2

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: bugzilla, Assigned: frank)

References

Details

In bug 1323069 I added a new scalar probe, A11Y_INSTANTIATORS. This probe contains string values so it is not displayed on t.m.o. Ideally we would like to be able to generate a histogram from that probe where each unique, non-empty string receives its own bucket.
Frank, is this something you could help with?
Flags: needinfo?(fbertsch)
Note this will help us understand risk as we ship Windows e10s a11y support...
Hi David, apologies for the delay. Chutten and I discussed this and came up with a solution for displaying string scalars. Basically, we'll end up with a single number per string, just the total number of instances of that string across all pings. It will be displayed like keyed scalars are now, but instead of a distribution, just a single number. I'll make this bug track those changes.
Assignee: nobody → fbertsch
Points: --- → 2
Component: Datasets: General → Datasets: Telemetry Aggregates
Flags: needinfo?(fbertsch)
Priority: -- → P1
Summary: Need to be able to aggregate A11Y_INSTANTIATORS scalar → Aggregate String Scalars as Simple Counts
Blocks: 1378389
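The simple-count aggregation described above (one total per unique, non-empty string across all pings) can be sketched as follows. This is a minimal illustration only, not the python_mozaggregator implementation: the flat dictionary ping layout and the function name `count_string_scalar` are assumptions made for the example.

```python
from collections import Counter


def count_string_scalar(pings, scalar_name):
    """Count how many pings reported each non-empty string value.

    `pings` is assumed to be an iterable of flat dicts mapping scalar
    names to values; real telemetry pings are nested differently.
    """
    counts = Counter()
    for ping in pings:
        value = ping.get(scalar_name)
        # Skip missing, non-string, and empty values.
        if isinstance(value, str) and value:
            counts[value] += 1
    return counts


# Hypothetical sample data, not real telemetry values.
pings = [
    {"A11Y_INSTANTIATORS": "foo.exe"},
    {"A11Y_INSTANTIATORS": "bar.exe"},
    {"A11Y_INSTANTIATORS": "foo.exe"},
    {"A11Y_INSTANTIATORS": ""},
    {},
]
counts = count_string_scalar(pings, "A11Y_INSTANTIATORS")
```

Each string then maps to a single number rather than a distribution, which matches the "keyed scalar with one value per key" display proposed in the comment.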
Benjamin: We are running into a question on the PR [0] of whether we need to limit the strings to just those that occur in more than 1% (or some other percentage) of incoming pings. 1% is the threshold used for the hardware report. If that is required for string scalars, do we also need to do the same for keyed histograms? [0] https://github.com/mozilla/python_mozaggregator/pull/49
Flags: needinfo?(benjamin)
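For reference, the kind of threshold being asked about could look like the sketch below. It assumes per-string counts where each ping contributes at most one count per string, and the function name `filter_rare_strings` and the sample values are hypothetical. The 1% default comes from the question above; nothing here reflects what the PR actually does.

```python
def filter_rare_strings(counts, total_pings, min_fraction=0.01):
    """Keep only strings seen in more than `min_fraction` of pings."""
    if total_pings <= 0:
        return {}
    return {
        string: n
        for string, n in counts.items()
        if n / total_pings > min_fraction
    }


# Hypothetical counts out of 1000 pings.
sample_counts = {"common.exe": 500, "rare.exe": 3}
kept = filter_rare_strings(sample_counts, total_pings=1000)
```

With these sample numbers only "common.exe" survives, since 3 out of 1000 pings is below the 1% cutoff.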
In order to reduce risk, I do not think we should display this data by default/automatically on telemetry.mozilla.org. The identification risk of strings is naturally higher. That doesn't mean that we can't ever do it, but I'd prefer that teams explicitly review incoming data to ensure that it's the data they expect before making it public. So to start out, I recommend analyzing this using a dataset (is this data included in main-summary?) in STMO, or using an ATMO query. Once you've reviewed the results, it's ok to publish using the STMO publishing facility, and that's less risky than automatically publishing aggregates.
Flags: needinfo?(benjamin)
> In order to reduce risk, I do not think we should display this data by
> default/automatically on telemetry.mozilla.org. The identification risk of
> strings is naturally higher.
In that case, I can create a whitelist of string scalars to aggregate. We should also update the documentation to mention that teams who want their string scalars aggregated need to file a bug so that we can add them. Benjamin, would you want teams to request a data review from a data steward before asking to make a string scalar public?
Flags: needinfo?(benjamin)
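A whitelist of this kind might be applied as in the sketch below. The scalar names, the flat `scalars` mapping, and the function name `select_string_scalars` are all placeholders invented for illustration, and the actual list would only contain scalars that had passed review.

```python
def select_string_scalars(scalars, allowlist):
    """Return only the string-valued scalars whose names are allowlisted."""
    return {
        name: value
        for name, value in scalars.items()
        if name in allowlist and isinstance(value, str)
    }


# Placeholder names; none of these are real probes.
reviewed = frozenset({"EXAMPLE_REVIEWED_SCALAR", "EXAMPLE_UINT_SCALAR"})
scalars = {
    "EXAMPLE_REVIEWED_SCALAR": "value-a",
    "EXAMPLE_UNREVIEWED_SCALAR": "value-b",
    "EXAMPLE_UINT_SCALAR": 3,
}
selected = select_string_scalars(scalars, reviewed)
```

Only the reviewed string scalar is passed on to aggregation. Unreviewed strings are dropped, so nothing becomes public by default.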
Yes. I think the risk profile of that is enough that we'd like to review those.
Flags: needinfo?(benjamin)
Here is a query listing all the A11Y_INSTANTIATORS strings, with their counts [0]. David, is it acceptable to make these public? Benjamin, requesting a data-review. Please let us know if you need more information. [0] https://sql.telemetry.mozilla.org/queries/5484
Flags: needinfo?(dbolter)
Flags: needinfo?(benjamin)
Thank you! I don't have a need to make this public. Sorting by count, high to low, is pretty interesting!
Flags: needinfo?(surkov.alexander)
Flags: needinfo?(jmathies)
Flags: needinfo?(dbolter)
Flags: needinfo?(aklotz)
> Thank you! I don't have a need to make this public.
What I mean is, if/when we add this to the aggregator, all of these will be public on TMO. I want to make sure these aren't surprising to you, and all of the values are expected. Or do you mean you don't need them on TMO any longer?
I'd prefer that David's team publish a one-off or curated report with this data rather than making it public by default. So I guess that counts as data-review denied?
Flags: needinfo?(benjamin)
I've imported the dataset from comment 8 into a Google Sheet and shared it with some folks. We can add notes/findings to the rows there and decide what we want to do next...
Flags: needinfo?(jmathies)
For now, let's not fix this in the aggregator.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(surkov.alexander)
Flags: needinfo?(aklotz)
Resolution: --- → WONTFIX
Component: Datasets: Telemetry Aggregates → General