[DoH] Some clients are sending over 1000 heuristics events per hour, mildly flooding the telemetry.events table
Categories
(Firefox :: Security, task, P2)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox89 | --- | fixed |
People
(Reporter: valentin, Assigned: nhnt11)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [trr])
Attachments
(2 files)
Description
The rollout study showed that for some users the "net-change" event was triggered far too often. This is not great, as those clients would be running heuristics constantly.
We probably want to disable DoH for some of these users.
Comment 1 • 4 years ago (Assignee)
I was wondering how relevant this is now that we've implemented debouncing in bug 1654520, so I crunched some data...
I looked at some percentiles of event count per client-hour for each evaluateReason over the last week. Here they are for netchange:
- 90th percentile: consistently below 15 events per client per hour, except for one spike of 20
- 95th percentile: consistently below 50 events per client per hour, except for one spike of 53
- 98th percentile: consistently below 90 events per client per hour
- 99th percentile: consistently below 90 events per client per hour, except for one spike of 106
- Max: hovering around 500 events per client per hour, with a few spikes, the worst of which was ~2800
Note that while the 99th percentile of the connectivity evaluateReason was around 50 events per client per hour, the max was consistently around 1k, with spikes going up to 2k.
Maybe this is enough for our data folks to give a verdict on whether we need to do anything about this.
Query: https://sql.telemetry.mozilla.org/queries/78858/source#196006
Comment 2 • 4 years ago
For perspective, events with category "doh" make up about 17% of the total event count from desktop. Heuristic events are about 4% of the total event count.
Comment 3 • 4 years ago (Assignee)
Thanks Jeff, that's a great way to think about this. I wanted to see the shares of other event categories, so I made this query: https://sql.telemetry.mozilla.org/queries/78910/source#196108
I definitely don't think DoH's fraction should be so high, especially considering that it's more like ~35% if you exclude the security category.
I'll chat with the team to figure out what the right thing is to do here.
Comment 4 • 4 years ago (Assignee)
Some more details:
1. This shows how the event fraction went up when we shipped DoH by default in the US: https://sql.telemetry.mozilla.org/queries/78912/source#196111
2. This shows that the event share went down significantly around the end of Aug 2020 (when Fx80 was released, with the patch from bug 1654520): https://sql.telemetry.mozilla.org/queries/78913/source#196113 (EDIT: the updated query shows that the event count stayed consistent; only the fraction went down)
3. This shows a further sharp decline in the fraction of these events around mid-Dec 2020 (Fx84 release): https://sql.telemetry.mozilla.org/queries/78914/source#196115
I need to look into item 3 more; I couldn't find anything that landed in 84 that seems like it might have catalyzed this decline.
Comment 5 • 4 years ago
The Fx84 release is when security events spiked up and the overall number of events approximately doubled, which likely explains (3). See https://bugzilla.mozilla.org/show_bug.cgi?id=1684920
Comment 6 • 4 years ago (Assignee)
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #5)
> The Fx84 release is when security events spiked up and the overall number of events approximately doubled, which likely explains (3). See https://bugzilla.mozilla.org/show_bug.cgi?id=1684920
Oh that makes a ton of sense, thank you. A more meaningful metric is probably to look at the event counts rather than fractions, to avoid wild goose chases like this.
Comment 7 • 4 years ago (Assignee)
I updated all the queries to show both the raw count and the fraction. This revealed that the dip at the end of Aug 2020 was also a case of the fraction going down while the count stayed roughly constant.
Comment 8 • 4 years ago (Assignee)
Looks like we have about 44k clients (0.2%) generating 99% of events, and about 2k clients (0.01%) generating 50% of events: https://sql.telemetry.mozilla.org/queries/78969/source#196231
This seems to be low-hanging fruit. I suspect we can "fix" this by introducing some throttling (sketched below):
- The first time the network goes up, run heuristics as usual, but start a debounce timer.
- If the network oscillates N times before the debounce timer fires, multiply the delay by P, multiply N by Q, and reset the timer.
- Reset N and the delay when the timer fires.
We just need to choose N, P, Q, and the initial delay such that we catch all of these volatile networks.
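To make the shape of the proposal concrete, here is a rough sketch of the backoff idea. All names and values in it (INITIAL_DELAY_MS, runHeuristics, etc.) are hypothetical placeholders, not identifiers from the actual DoH code:
```js
// Rough sketch of the proposed backoff, for illustration only.
const INITIAL_DELAY_MS = 1000; // initial debounce delay (illustrative value)
const INITIAL_N = 5;           // oscillations tolerated before backing off (illustrative)
const P = 2;                   // factor to grow the delay by
const Q = 2;                   // factor to grow the oscillation threshold by

let delay = INITIAL_DELAY_MS;
let threshold = INITIAL_N;
let oscillations = 0;
let timer = null;

function runHeuristics() {
  // Placeholder for the real heuristics run (DNS lookups etc.).
}

function onNetworkUp() {
  if (!timer) {
    // First change after a quiet period: run immediately, then start debouncing.
    runHeuristics();
    startTimer();
    return;
  }
  oscillations++;
  if (oscillations >= threshold) {
    // The network is oscillating: back off by growing the delay and threshold.
    delay *= P;
    threshold *= Q;
    oscillations = 0;
    clearTimeout(timer);
    startTimer();
  }
}

function startTimer() {
  timer = setTimeout(() => {
    // Quiet period elapsed: reset all backoff state.
    timer = null;
    delay = INITIAL_DELAY_MS;
    threshold = INITIAL_N;
    oscillations = 0;
  }, delay);
}
```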
Comment 9 • 4 years ago (Assignee)
This prevents clients on volatile networks from running heuristics repeatedly. Repeated runs have two downsides: a performance hit from the DNS lookups the heuristics perform, and flooded telemetry, since we record an event for each run.
With this patch, when a heuristics run is first triggered, we run it as usual but start a timer in the background. Any runs triggered before the timer fires are coalesced into a count. When the timer fires, if the count exceeded a limit, we restart the timer and reset the count. If the count was within the limit, we run heuristics just once, provided it was triggered at least once while throttling.
To make this testable deterministically, I had to make a few subtle changes,
e.g. to not await calls to runHeuristics.
I also took the opportunity to replace BrowserTestUtils.waitForCondition with
the more canonical copy that lives in TestUtils, and to put the confirmationNS
pref into the prefs object so it doesn't need to be explicitly cleaned up.
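Read literally, the throttling described above behaves roughly like the following sketch. The names and values (THROTTLE_MS, RATE_LIMIT, runHeuristics) are hypothetical placeholders, not the identifiers used in the actual patch:
```js
// Illustrative sketch of the coalescing/throttling behaviour described above.
const THROTTLE_MS = 60000; // throttle window (illustrative value)
const RATE_LIMIT = 2;      // triggers tolerated per window (illustrative value)

let throttleTimer = null;
let triggerCount = 0;

function runHeuristics() {
  // Placeholder for the real heuristics run.
}

function maybeRunHeuristics() {
  if (throttleTimer) {
    // A throttle window is already open: coalesce this trigger into a count.
    triggerCount++;
    return;
  }
  // First trigger after a quiet period: run immediately and open a window.
  runHeuristics();
  startWindow();
}

function startWindow() {
  triggerCount = 0;
  throttleTimer = setTimeout(() => {
    throttleTimer = null;
    if (triggerCount > RATE_LIMIT) {
      // The count exceeded the limit: keep throttling with a fresh window.
      startWindow();
    } else if (triggerCount > 0) {
      // Within the limit and triggered at least once: run just once now.
      runHeuristics();
    }
  }, THROTTLE_MS);
}
```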
Comment 10 • 4 years ago
Comment 11 • 4 years ago (bugherder)
Comment 12 • 4 years ago (Assignee)
For posterity, here is the intended response curve of the rate limiting filter, for a rate limit threshold value of 2.
Until the number of inputs exceeds 1 per timeout period, the output mirrors the input. Beyond 1 input per timeout period, the output is clipped to 1 per timeout period, and beyond the rate limit threshold, the output is suppressed completely.
The linearly increasing part of the curve covers clients with "normal" network conditions, the middle part covers volatility that we're willing to tolerate by throttling, and the suppressed part covers clients with excessive volatility by suppressing heuristics for them completely.
Note: the implementation doesn't apply this continuously; the timeout is more of a polling interval. So this response curve, while accurate for a single timeout period, is only an approximation of the implementation's behavior over longer durations. I offer a grain of salt along with this graph :)
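Stated as a function of the number of triggers in one timeout period, the intended curve (with the threshold fixed at 2, as in the attached graph) can be restated as the small function below. This is purely an illustration of the description above, not code from the patch:
```js
// Illustrative restatement of the intended response curve for a single
// timeout period. "inputs" is the number of heuristics triggers during the
// period; the return value is the number of runs actually performed.
function runsPerTimeout(inputs, rateLimitThreshold = 2) {
  if (inputs <= 1) {
    return inputs; // output mirrors input
  }
  if (inputs <= rateLimitThreshold) {
    return 1; // tolerated volatility: clipped to one run per period
  }
  return 0; // excessive volatility: heuristics suppressed completely
}
```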