[DoH] Some clients are sending over 1000 heuristics events per hour, mildly flooding the telemetry.events table
Categories
(Firefox :: Security, task, P2)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox89 | --- | fixed |
People
(Reporter: valentin, Assigned: nhnt11)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [trr])
Attachments
(2 files)
Description
The rollout study showed that for some users the "net-change" event was triggered far too often. This is not great, as those clients would be running heuristics constantly.
We probably want to disable DoH for some of these users.
Comment 1 • 4 years ago (Assignee)
I was wondering how relevant this is now that we've implemented debouncing in bug 1654520, so I crunched some data...
I looked at some percentiles of event count per client-hour for each evaluateReason over the last week. Here they are for netchange:
- 90th percentile: consistently below 15 events per client per hour, except for one spike of 20
- 95th percentile: consistently below 50 events per client per hour, except for one spike of 53
- 98th percentile: consistently below 90 events per client per hour
- 99th percentile: consistently below 90 events per client per hour, except for one spike of 106
- Max: hovering around 500 events per client per hour, with a few spikes, the worst of which was ~2800
Note that while the 99th percentile of the connectivity evaluateReason was around 50 events per client per hour, the max was consistently around 1k, with spikes going up to 2k.
Maybe this is enough for our data folks to give a verdict on whether we need to do anything about this.
Query: https://sql.telemetry.mozilla.org/queries/78858/source#196006
Comment 2 • 4 years ago
For perspective, events with category "doh" make up about 17% of the total event count from desktop. Heuristic events are about 4% of the total event count.
Comment 3 • 4 years ago (Assignee)
Thanks Jeff, that's a great way to think about this. I wanted to see the shares of other event categories, so I made this query: https://sql.telemetry.mozilla.org/queries/78910/source#196108
I definitely don't think DoH's fraction should be so high, especially considering that it's more like ~35% if you exclude the security category.
I'll chat with the team to figure out what the right thing is to do here.
Comment 4 • 4 years ago (Assignee)
Some more details:
1. This shows how the event fraction went up when we shipped DoH by default in the US: https://sql.telemetry.mozilla.org/queries/78912/source#196111
2. This shows that the event share went down significantly around the end of Aug 2020 (when Fx80 was released, with the patch from bug 1654520): https://sql.telemetry.mozilla.org/queries/78913/source#196113 (EDIT: the updated query shows that the event count stayed consistent; only the fraction went down)
3. This shows a further sharp decline in the fraction of these events around mid-Dec 2020 (Fx84 release): https://sql.telemetry.mozilla.org/queries/78914/source#196115
I need to look into item 3 more; I couldn't find anything that landed in 84 that seems like it might have catalyzed this decline.
Comment 5 • 4 years ago
The Fx84 release is when security events spiked up and the overall number of events approximately doubled, which likely explains (3). See https://bugzilla.mozilla.org/show_bug.cgi?id=1684920
Comment 6 • 4 years ago (Assignee)
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #5)
> The Fx84 release is when security events spiked up and the overall number of events approximately doubled, which likely explains (3). See https://bugzilla.mozilla.org/show_bug.cgi?id=1684920
Oh that makes a ton of sense, thank you. A more meaningful metric is probably to look at the event counts rather than fractions, to avoid wild goose chases like this.
Comment 7 • 4 years ago (Assignee)
I updated all the queries to show both the raw count and the fraction. This revealed that the dip at the end of Aug 2020 was also a case of the fraction going down while the count stayed roughly constant.
Comment 8 • 4 years ago (Assignee)
Looks like we have about 44k clients (0.2%) generating 99% of events, and about 2k clients (0.01%) generating 50% of events: https://sql.telemetry.mozilla.org/queries/78969/source#196231
This seems to be low-hanging fruit. I suspect we can "fix" this by introducing some throttling (sketched below):
- The first time the network goes up, run heuristics as usual, but start a debounce timer.
- If the network oscillates N times before the debounce timer fires, multiply the delay by P, multiply N by Q, and reset the timer.
- Reset N and the delay when the timer fires.
We just need to choose N, P, Q, and the initial delay such that we catch all of these volatile networks.
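To make the shape of the proposal concrete, here is a rough sketch of the backoff idea. All names and values in it (INITIAL_DELAY_MS, runHeuristics, etc.) are hypothetical placeholders, not identifiers from the actual DoH code:
```js
// Rough sketch of the proposed backoff, for illustration only.
const INITIAL_DELAY_MS = 1000; // initial debounce delay (illustrative value)
const INITIAL_N = 5;           // oscillations tolerated before backing off (illustrative)
const P = 2;                   // factor to grow the delay by
const Q = 2;                   // factor to grow the oscillation threshold by

let delay = INITIAL_DELAY_MS;
let threshold = INITIAL_N;
let oscillations = 0;
let timer = null;

function runHeuristics() {
  // Placeholder for the real heuristics run (DNS lookups etc.).
}

function onNetworkUp() {
  if (!timer) {
    // First change after a quiet period: run immediately, then start debouncing.
    runHeuristics();
    startTimer();
    return;
  }
  oscillations++;
  if (oscillations >= threshold) {
    // The network is oscillating: back off by growing the delay and threshold.
    delay *= P;
    threshold *= Q;
    oscillations = 0;
    clearTimeout(timer);
    startTimer();
  }
}

function startTimer() {
  timer = setTimeout(() => {
    // Quiet period elapsed: reset all backoff state.
    timer = null;
    delay = INITIAL_DELAY_MS;
    threshold = INITIAL_N;
    oscillations = 0;
  }, delay);
}
```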
Comment 9 • 4 years ago (Assignee)
This prevents clients on volatile networks from running heuristics repeatedly. Repeated runs have two downsides: a performance hit from the DNS lookups the heuristics perform, and flooded telemetry, since we record an event for each run.
With this patch, when a heuristics run is first triggered, we run it as usual but start a timer in the background. Any runs triggered before the timer fires are coalesced into a count. When the timer fires, if the count exceeded a limit, we restart the timer and reset the count. If the count was within the limit, we run heuristics just once, provided it was triggered at least once while throttling.
To make this testable deterministically, I had to make a few subtle changes,
e.g. to not await calls to runHeuristics.
I also took the opportunity to replace BrowserTestUtils.waitForCondition with
the more canonical copy that lives in TestUtils, and to put the confirmationNS
pref into the prefs object so it doesn't need to be explicitly cleaned up.
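Read literally, the throttling described above behaves roughly like the following sketch. The names and values (THROTTLE_MS, RATE_LIMIT, runHeuristics) are hypothetical placeholders, not the identifiers used in the actual patch:
```js
// Illustrative sketch of the coalescing/throttling behaviour described above.
const THROTTLE_MS = 60000; // throttle window (illustrative value)
const RATE_LIMIT = 2;      // triggers tolerated per window (illustrative value)

let throttleTimer = null;
let triggerCount = 0;

function runHeuristics() {
  // Placeholder for the real heuristics run.
}

function maybeRunHeuristics() {
  if (throttleTimer) {
    // A throttle window is already open: coalesce this trigger into a count.
    triggerCount++;
    return;
  }
  // First trigger after a quiet period: run immediately and open a window.
  runHeuristics();
  startWindow();
}

function startWindow() {
  triggerCount = 0;
  throttleTimer = setTimeout(() => {
    throttleTimer = null;
    if (triggerCount > RATE_LIMIT) {
      // The count exceeded the limit: keep throttling with a fresh window.
      startWindow();
    } else if (triggerCount > 0) {
      // Within the limit and triggered at least once: run just once now.
      runHeuristics();
    }
  }, THROTTLE_MS);
}
```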
Comment 10 • 4 years ago
Comment 11 • 4 years ago (bugherder)
Comment 12 • 4 years ago (Assignee)
For posterity, here is the intended response curve of the rate limiting filter, for a rate limit threshold value of 2.
Until the number of inputs exceeds 1 per timeout period, the output mirrors the input. Beyond 1 input per timeout period, the output is clipped to 1 per timeout period, and beyond the rate limit threshold, the output is suppressed completely.
The linearly increasing part of the curve covers clients with "normal" network conditions, the middle part covers volatility that we're willing to tolerate by throttling, and the suppressed part covers clients with excessive volatility by suppressing heuristics for them completely.
Note: the implementation doesn't apply this continuously; the timeout is more of a polling interval. So this response curve, while accurate for a single timeout period, is only an approximation of the implementation's behavior over longer durations. I offer a grain of salt along with this graph :)
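Stated as a function of the number of triggers in one timeout period, the intended curve (with the threshold fixed at 2, as in the attached graph) can be restated as the small function below. This is purely an illustration of the description above, not code from the patch:
```js
// Illustrative restatement of the intended response curve for a single
// timeout period. "inputs" is the number of heuristics triggers during the
// period; the return value is the number of runs actually performed.
function runsPerTimeout(inputs, rateLimitThreshold = 2) {
  if (inputs <= 1) {
    return inputs; // output mirrors input
  }
  if (inputs <= rateLimitThreshold) {
    return 1; // tolerated volatility: clipped to one run per period
  }
  return 0; // excessive volatility: heuristics suppressed completely
}
```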