1249373 - saved-session ping size and ping frequency are causing bandwidth issues on Android

Reporter

Description

•

9 years ago

Looking at the last 7 days of Nightly, I see that we send a lot of raw data. Some people have ~34MB of raw pings sent per day. This happens on Wifi and Cell Data. We need to look into how to mitigate this issue. I did some exploration in this doc: https://docs.google.com/document/d/1-YXxlKutU31BS5WjEjd0niDA3PE9oa8QfxqB6rGzQHA/edit#

:Margaret Leibovic

Comment 1

•

9 years ago

To clarify, this is only about the old opt-in telemetry, right? So not an issue caused by the new core ping.

Flags: needinfo?(mark.finkle)

Mark Finkle (:mfinkle) (use needinfo?)

Reporter

Comment 2

•

9 years ago

(In reply to :Margaret Leibovic from comment #1)
> To clarify, this is only about the old opt-in telemetry, right? So not an
> issue caused by the new core ping.

Yes. This is an issue with "saved-session" telemetry managed in Gecko's Telemetry system.

Flags: needinfo?(mark.finkle)

Chris H-C :chutten

Comment 3

•

9 years ago

Do we have logs or a record of the Content-Encoding negotiation to ensure that we are sending compressed payloads? All we need is a couple of example HTTP header exchanges to see both the encoding and the length of the content.

A simple mitigation strategy is one advanced by most app stores: only use data when reasonably certain it will travel on a Wi-Fi transport.
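For illustration, a minimal sketch of how the raw and compressed size of a ping could be compared. It assumes the ping is a JSON document and that gzip is the encoding in question; the thread does not confirm which encoding is negotiated. The sample ping and the helper name `payload_sizes` are made up for this example.

```python
import gzip
import json


def payload_sizes(ping: dict) -> tuple[int, int]:
    """Return (raw_bytes, gzip_bytes) for a JSON-serializable ping."""
    raw = json.dumps(ping, separators=(",", ":")).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Hypothetical ping with many similar histogram entries (not real telemetry).
sample = {
    "payload": {
        "histograms": {f"HIST_{i}": {"values": {"0": i, "1": 0}} for i in range(500)}
    }
}
raw_size, gz_size = payload_sizes(sample)
print(f"raw={raw_size} bytes, gzip={gz_size} bytes")
```

On the wire, the same information would come from the request's Content-Encoding and Content-Length headers.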

Georg Fritzsche [:gfritzsche]

Comment 4

•

9 years ago

In Gecko JS, do we have info for "on Wifi" vs. "on cell data"? A simple first cut would be to not upload the pings on cell data.

Re. ping size - this actually improved a lot over the last two quarters from the UT work. We found we had some unbounded fields in there leading to rather extreme ping sizes.

The ping count per client seems high; in the short term, shooting for one per Gecko lifetime (instead of per "app went to background") seems feasible. Everything else would mean stitching data across sessions etc., which probably gets complicated and is of questionable value.

Chris H-C :chutten

Comment 5

•

9 years ago

There is a proposed Network Information API that we could probably add support for: https://developer.mozilla.org/en-US/docs/Web/API/Network_Information_API

:Margaret Leibovic

Comment 6

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> A first simply cut would be to simply not upload the pings on cell data.

Yes. Here's an example: http://mxr.mozilla.org/mozilla-central/source/mobile/android/modules/HomeProvider.jsm#85

Mark Finkle (:mfinkle) (use needinfo?)

Reporter

Comment 7

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> A first simply cut would be to simply not upload the pings on cell data.

Not uploading pings on cell data would be OK for some data, like histograms, but I don't think we want to lose UI telemetry. Maybe we could drop certain payloads on cell data and see what effect that has on ping size?

> Re. ping size - this actually improved a lot over the last two quarters from
> the UT work.
> We found we had some unbounded fields in there leading to rather extreme
> ping sizes.

Android is not using UT. Were the unbounded fields fixed in "saved-session" or only "main"?

> The ping count per client seems high, short-term shooting for one per Gecko
> life-time (instead of per "app went to background") seems feasible.
> Everything else would mean stitching data across sessions etc., which
> probably gets complicated and is of questionable value.

I'm less worried about ping count. It's the ping size that is the main driver. That said, we could look at just archiving when on cell data.

Georg Fritzsche [:gfritzsche]

Comment 8

•

9 years ago

(In reply to Mark Finkle (:mfinkle) from comment #7)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> > In Gecko JS, do we have info for "on Wifi" vs. "on cell data"?
> > A first simply cut would be to simply not upload the pings on cell data.
>
> Not uploading pings on cell data would be OK for some data, like histograms,
> but I don't think we want to lose UI telemetry. Maybe we could drop certain
> payloads on cell data and see what affect that has on ping size?

This might be a complicated road; it depends on which part is affecting analysis. We can also screen the top-level payload properties and see which we are definitely not using at all on Android: https://gecko.readthedocs.org/en/latest/toolkit/components/telemetry/telemetry/main-ping.html

Suspects standing out to me:
* chromeHangs
* threadHangStats
* fileIOReports
* lateWrites

> > Re. ping size - this actually improved a lot over the last two quarters from
> > the UT work.
> > We found we had some unbounded fields in there leading to rather extreme
> > ping sizes.
>
> Android is not using UT. Were the unbounded fields fixed in "saved-session"
> or only "main"?

Both, it's mostly the same data.

> > The ping count per client seems high, short-term shooting for one per Gecko
> > life-time (instead of per "app went to background") seems feasible.
> > Everything else would mean stitching data across sessions etc., which
> > probably gets complicated and is of questionable value.
>
> I'm less worried about ping count. It's the ping size that is the main
> driver. That said, we could look at just archiving when on cell data.

So, this would be a pretty simple thing to do in the short term (just blocking upload on cell data in TelemetrySend.jsm). Should we do this part now or wait for a more involved design?
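As a rough illustration of the short-term option (block upload on cell data, keep the ping archived), here is a minimal policy sketch. It is written in Python for brevity and is not the TelemetrySend.jsm code; the `Connection` enum and `next_action` function are hypothetical names.

```python
from enum import Enum


class Connection(Enum):
    # Hypothetical connection states; the real values would come from the platform.
    WIFI = "wifi"
    CELLULAR = "cellular"
    NONE = "none"


def next_action(connection: Connection) -> str:
    """Decide what to do with a pending ping: 'upload' or 'archive'."""
    # Only Wi-Fi triggers an upload; everything else keeps the ping on disk.
    return "upload" if connection is Connection.WIFI else "archive"


for conn in Connection:
    print(conn.value, "->", next_action(conn))
```

Archiving rather than dropping keeps the data available for a later upload, which matches the "just archiving when on cell data" idea from comment 7.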

Richard Newman [:rnewman]

Updated

•

9 years ago

tracking-fennec: --- → ?

OS: Unspecified → Android

Hardware: Unspecified → All

Mark Finkle (:mfinkle) (use needinfo?)

Reporter

Comment 9

•

9 years ago

I ran a slightly tweaked version of Alessio's script [1] on Fennec Nightly for the last week. 100% of the pings were in the "maximum" bucket. Not a surprise since Fennec only uses "saved-session".

The results: (payload section, median value, number of pings)

[('payload', 98876.0, 13115),
 ('payload/histograms', 75398.0, 13115),
 ('payload/threadHangStats', 17015.0, 13115),
 ('payload/keyedHistograms', 2952.0, 13115),
 ('environment', 1998.0, 13115),
 ('payload/simpleMeasurements', 732.0, 13115),
 ('payload/info', 630.0, 13115),
 ('environment/addons/activeAddons', 596.0, 7140),
 ('environment/addons', 507.0, 13076),
 ('environment/settings', 488.0, 13115),
 ('payload/UIMeasurements', 437.0, 11051),
 ('environment/addons/theme', 275.0, 670),
 ('environment/addons/activePlugins', 241.0, 1548),
 ('payload/addonDetails', 109.0, 13115),
 ('payload/chromeHangs', 97.0, 13115)]

[1] https://gist.github.com/Dexterp37/c2e1c1d4de4ba22bc4cf#file-bug-1215545-buckets-ipynb
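To put the medians from comment 9 in proportion, the sketch below expresses a few sections as a share of the median 'payload' size. This is only a rough indication: medians of individual sections do not have to add up to the median of the whole, and the values are assumed here to be bytes, which the comment does not state.

```python
# Median sizes copied from the results in comment 9 (unit assumed to be bytes).
medians = {
    "payload": 98876.0,
    "payload/histograms": 75398.0,
    "payload/threadHangStats": 17015.0,
    "payload/keyedHistograms": 2952.0,
    "payload/chromeHangs": 97.0,
}

total = medians["payload"]
shares = {
    section: round(100 * size / total, 1)
    for section, size in medians.items()
    if section != "payload"
}
# Print the largest contributors first.
for section, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{section}: {pct}% of the median payload")
```

By this measure histograms come to roughly three quarters of the median payload and threadHangStats to roughly a sixth, while chromeHangs is negligible.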

:Margaret Leibovic

Updated

•

9 years ago

Assignee: nobody → mark.finkle

tracking-fennec: ? → 48+

Georg Fritzsche [:gfritzsche]

Comment 10

•

9 years ago

I wonder if anyone is even looking at threadHangStats, addonDetails & chromeHangs for Android. Those seem like candidates for dropping.
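A minimal sketch of what dropping those sections could look like, assuming the ping is a plain dictionary with a top-level "payload" key. The function name `strip_sections` and the sample ping are hypothetical, and the sketch is in Python rather than in the client's own code.

```python
import copy

# Sections named in this comment as candidates for dropping on Android.
UNUSED_ON_ANDROID = ("threadHangStats", "addonDetails", "chromeHangs")


def strip_sections(ping: dict, sections=UNUSED_ON_ANDROID) -> dict:
    """Return a copy of the ping without the given payload sections."""
    pruned = copy.deepcopy(ping)  # leave the caller's ping untouched
    for name in sections:
        pruned.get("payload", {}).pop(name, None)
    return pruned


# Made-up ping for demonstration only.
ping = {"payload": {"histograms": {"A": 1}, "threadHangStats": [1, 2], "chromeHangs": {}}}
print(sorted(strip_sections(ping)["payload"]))
```

Working on a copy keeps the full ping available locally in case the dropped sections turn out to be needed.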

Michael Comella (:mcomella) [NI reported issues only; no longer employed by Mozilla]

Comment 11

•

8 years ago

I'll take a look to get a sense of the amount of work necessary here.

Assignee: mark.finkle → michael.l.comella

Michael Comella (:mcomella) [NI reported issues only; no longer employed by Mozilla]

Comment 12

•

8 years ago

I repeated Finkle's experiment in comment 9 with the latest builds and got the same result (still running so I don't have the break-down yet) - the pings are still quite large (> 15.5 Kb).

A summary of options from this thread:
* only upload on wifi (e.g. archive on cell data)
* remove fields unused on fennec
* ensure all fields are bounded
* only upload certain payloads on cell data (but this could be complicated - comment 8)

Michael Comella (:mcomella) [NI reported issues only; no longer employed by Mozilla]

Comment 13

•

8 years ago

We're going to wait until our Telemetry roadmap meeting to decide if this is important to do right now, or if we can wait until we move this to the Java uploader implementation in a few months (which would undo any work we do right now).

Michael Comella (:mcomella) [NI reported issues only; no longer employed by Mozilla]

Comment 14

•

8 years ago

Unfortunately, we didn't specifically address this during the telemetry meeting.

While it's unclear how much the pings have improved in the past few months, looking at Finkle's doc in comment 0, it looks like it's a MB per day (on average) and a max of 34 MB per day, which could be pretty bad in extreme data scenarios (e.g. roaming).

Margaret, do you have an opinion on how we should move forward with this bug, if we even should?

Assignee: michael.l.comella → nobody

Flags: needinfo?(margaret.leibovic)

:Margaret Leibovic

Comment 15

•

8 years ago

(In reply to Michael Comella (:mcomella) from comment #14)
> Unfortunately, we didn't specifically address this during the telemetry
> meeting.
>
> While it's unclear how much the pings have improved in the past few months,
> looking at Finkle's doc in comment 0, it looks like it's a MB per day (on
> average) and a max of 34 MB per day, which could be pretty bad in extreme
> data scenarios (e.g. roaming).
>
> Margaret, do you have an opinion on how we should move forward with this
> bug, if we even should?

I think we need to understand how much data we're currently sending, and have a way to monitor that over time. So, as a first step and bare minimum, let's get a system in place for this. Maybe Georg can help make a server-side analysis for this?

Then, we need to decide how much data is too much data to be sending. We could try to wait until there's a wifi connection to make an upload.

I think Barbara should be involved in helping make this decision. Given that the core ping is opt-out instead of opt-in, it's more important to address size issues with it than it would be with our other telemetry pings.
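One possible shape for such a server-side analysis is sketched below: sum ping sizes per client per day, then report the mean and the maximum. The record layout (client id, submission date, size in bytes) and all values are hypothetical; the thread does not describe the actual data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (client_id, submission_date, ping_size_bytes).
records = [
    ("client-a", "day-1", 100_000),
    ("client-a", "day-1", 120_000),
    ("client-b", "day-1", 90_000),
    ("client-a", "day-2", 110_000),
]


def daily_bytes_per_client(rows):
    """Sum ping sizes per (client, day)."""
    totals = defaultdict(int)
    for client_id, day, size in rows:
        totals[(client_id, day)] += size
    return dict(totals)


totals = daily_bytes_per_client(records)
print("mean bytes per client-day:", mean(totals.values()))
print("max bytes per client-day:", max(totals.values()))
```

Tracking both numbers over time would show whether the average and the worst case (the 34 MB per day mentioned earlier in this bug) are moving.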

Flags: needinfo?(margaret.leibovic) → needinfo?(bbermes)

Michael Comella (:mcomella) [NI reported issues only; no longer employed by Mozilla]

Comment 16

•

8 years ago

(In reply to :Margaret Leibovic from comment #15)
> I think Barbara should be involved in helping make this decision. Given that
> the core ping is opt-out instead of opt-in, it's more important to address
> size issues with it than it would be with our other telemetry pings.

I may be misinterpreting but this is for the saved-session ping, not the core ping.

Summary: Ping size and ping frequency are causing bandwidth issues on Android → saved-session ping size and ping frequency are causing bandwidth issues on Android

Georg Fritzsche [:gfritzsche]

Comment 17

•

8 years ago

(In reply to Michael Comella (:mcomella) from comment #16)
> (In reply to :Margaret Leibovic from comment #15)
> > I think Barbara should be involved in helping make this decision. Given that
> > the core ping is opt-out instead of opt-in, it's more important to address
> > size issues with it than it would be with our other telemetry pings.
>
> I may be misinterpreting but this is for the saved-session ping, not the
> core ping.

It is. From the incoming data we concluded that we don't have ping size or frequency concerns with the "core" ping client-side.

:Margaret Leibovic

Comment 18

•

8 years ago

Apologies, I didn't pay close enough attention here. If this is for opt-in telemetry, I'm less concerned, although I do think we should at least have a system in place to know how much data we're sending. Data-driven data decisions! :)

Barbara Bermes [:barbara] - NI please!

Updated

•

8 years ago

Flags: needinfo?(bbermes)

James Willcox (:snorp) (jwillcox@mozilla.com) (he/him)

Updated

•

8 years ago

tracking-fennec: 48+ → ---

Sebastian Kaspari (:sebastian; :pocmo)

Updated

•

8 years ago

Flags: needinfo?(s.kaspari)

Sebastian Kaspari (:sebastian; :pocmo)

Comment 19

•

8 years ago

I don't have the time right now to explore this, but I'm marking this for the Taipei team. This is a bigger thing and needs some investigation (and should be prioritized). We can talk more about this during our Taipei week.

Flags: needinfo?(s.kaspari)

Whiteboard: [TPE-1]

Nevin Chen(Not active on Bugzilla)

Comment 20

•

8 years ago

Maybe we can check for WIFI or Cell data here? http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/mobile/android/base/java/org/mozilla/gecko/telemetry/schedulers/TelemetryUploadAllPingsImmediatelyScheduler.java#20

Nevin Chen(Not active on Bugzilla)

Comment 21

•

8 years ago

(In reply to Nevin Chen [:nechen] from comment #20)
> Maybe we can check for WIFI or Cell data here?
> http://searchfox.org/mozilla-central/rev/78ac0ceba97bd2deed847a8d0ae86ccf7a8887bf/mobile/android/base/java/org/mozilla/gecko/telemetry/schedulers/TelemetryUploadAllPingsImmediatelyScheduler.java#20

Sorry, the code above is for the core ping. For the main ping upload, I can't find the code in the front end. Maybe it's in here [1]? The only thing I found is this [2], but I think it only reads the health report.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=1156253
[2] http://searchfox.org/mozilla-central/rev/7cb75d87753de9103253e34bc85592e26378f506/mobile/android/chrome/content/aboutHealthReport.js#98

Flags: needinfo?(rnewman)

Flags: needinfo?(gfritzsche)

Richard Newman [:rnewman]

Comment 22

•

8 years ago

(In reply to Nevin Chen [:nechen] from comment #21)
> Sorry the code above is for core ping. For main ping upload ....I can't find
> the code in front end.

Last I was involved in this, the telemetry pings in question were managed and uploaded by Gecko, even on Android. (This is a bad thing: it means collection, composition, and upload only happen when the browser is running, and thus compete for resources with the browser itself at the most critical moments.)

If that's still true, the relevant code is e.g., https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySend.jsm#435

> The only thing I found is this [2]. But I think it only reads the health
> report....

Correct: aboutHealthReport has nothing to do with telemetry upload.

Flags: needinfo?(rnewman)

Nevin Chen(Not active on Bugzilla)

Comment 23

•

8 years ago

Hi Georg,

Can I just add this check [1] to here [2]?

Thank you!
Best

[1] https://dxr.mozilla.org/mozilla-central/rev/1b9293be51637f841275541d8991314ca56561a5/mobile/android/modules/HomeProvider.jsm#85
[2] https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/TelemetrySend.jsm#435

Georg Fritzsche [:gfritzsche]

Comment 24

•

7 years ago

Is TelemetrySend.jsm actually currently used on Android to upload these pings? I'm not sure if we use the Java uploader or TelemetrySend; we should check that first.

If this is in TelemetrySend, do we actually want to keep sending data from there or move this to the Java uploader?

If we keep this in TelemetrySend, is it possible to avoid repeated checks? E.g. is there an observer topic that we could listen to for transitions between the states "on local network" and "not on local network"? Then we can properly shut down all sending activity while not on a local network.

Last but not least: Do we know how many pre-release users are connected to a local network at least, say, once a week? Suppose 90% of pre-release users never or rarely connect to a local network: could we afford to lose their data?
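The observer idea can be sketched as follows: keep one boolean that is updated only when a network transition is reported, and let it gate all sending, instead of querying the connection type before each upload. This is a Python illustration with hypothetical names (`PingSender`, `on_network_changed`); whether a suitable observer topic exists is exactly the open question above.

```python
class PingSender:
    """Queues pings and only sends while on a local (e.g. Wi-Fi) network."""

    def __init__(self, send):
        self._send = send        # callable that performs the actual upload
        self._on_local = False   # updated only by network notifications
        self._pending = []

    def on_network_changed(self, on_local: bool) -> None:
        """Hypothetical observer callback, fired once per transition."""
        self._on_local = on_local
        if on_local:
            self._flush()

    def submit(self, ping) -> None:
        """Queue a ping; send right away only if currently on a local network."""
        self._pending.append(ping)
        if self._on_local:
            self._flush()

    def _flush(self) -> None:
        while self._pending:
            self._send(self._pending.pop(0))


sent = []
sender = PingSender(sent.append)
sender.submit("ping-1")            # held: not on a local network yet
sender.on_network_changed(True)    # the transition triggers a flush
sender.submit("ping-2")            # sent immediately
print(sent)
```

Pings submitted while off the local network stay queued, so nothing is discarded by the state change itself.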

Flags: needinfo?(gfritzsche)

Chris H-C :chutten

Comment 25

•

7 years ago

FWIW Fennec uses TelemetrySend according to TELEMETRY_SEND_SUCCESS: https://mzl.la/2rUZVn2

Richard Newman [:rnewman]

Comment 26

•

7 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #24)
> Do we know how many pre-release users are connected to a local network at
> least, say, once a week?

I think a more fundamental issue is: there are some populations who only use cellular data, and so any scheme that alters submission behavior based on network will totally obscure those populations, which will cause existence conclusions to be wrong.

It doesn't necessarily matter how big that population is, proportionally -- if we have, e.g., a bug that causes us to only run an updater on wifi, then we will see no evidence of that bug if we only report telemetry on wifi!

If we do choose whether to do an upload _right now_ based on connectivity -- which isn't a bad option for first-world users -- it shouldn't wait too long, and we shouldn't discard data.

IMO the solution for this bug tends more towards "don't send 34MB of data per day", rather than "only send 34MB when on wifi". Remember that "wifi" doesn't mean "unmetered", it doesn't mean "fast", and it doesn't mean "low-power". That might mean changing data representations or pruning what's collected.
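The "defer, but not for too long, and never discard" constraint could be expressed as a bounded deferral, sketched below. The 24-hour limit, the function name `should_upload`, and the use of plain timestamps in seconds are all assumptions for illustration; the thread does not propose a specific deadline.

```python
MAX_DEFER_SECONDS = 24 * 60 * 60  # assumed limit; the thread does not give a number


def should_upload(on_wifi: bool, queued_at: float, now: float,
                  max_defer: float = MAX_DEFER_SECONDS) -> bool:
    """Upload on Wi-Fi, or on any network once the ping has waited too long."""
    waited = now - queued_at
    return on_wifi or waited >= max_defer


# A ping queued at t=0 on a cellular-only device:
print(should_upload(on_wifi=False, queued_at=0, now=3600))       # one hour in
print(should_upload(on_wifi=False, queued_at=0, now=2 * 86400))  # two days in
```

With a deadline like this, cellular-only populations still report, just later, so they are not hidden from analysis. It does not reduce the amount of data sent, which is the larger concern raised in this comment.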

Georg Fritzsche [:gfritzsche]

Updated

•

6 years ago

Component: Telemetry → Metrics

Product: Toolkit → Firefox for Android

Firefox Bug Husbandry Bot

Comment 27

•

6 years ago

Re-triaging per https://bugzilla.mozilla.org/show_bug.cgi?id=1473195 Needinfo :susheel if you think this bug should be re-triaged.

Priority: -- → P5

BMO Automation

Comment 28

•

4 years ago

We have completed our launch of our new Firefox on Android. The development of the new versions use GitHub for issue tracking. If the bug report still reproduces in a current version of [Firefox on Android nightly](https://play.google.com/store/apps/details?id=org.mozilla.fenix) an issue can be reported at the [Fenix GitHub project](https://github.com/mozilla-mobile/fenix/). If you want to discuss your report please use [Mozilla's chat](https://wiki.mozilla.org/Matrix#Connect_to_Matrix) server https://chat.mozilla.org and join the [#fenix](https://chat.mozilla.org/#/room/#fenix:mozilla.org) channel.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → INCOMPLETE

BMO Automation

Updated

•

4 years ago

Product: Firefox for Android → Firefox for Android Graveyard