Open Bug 1722082 Opened 3 years ago Updated 2 years ago

ClientEnvironment.jsm and TelemetryEnvironment.jsm require browser's AttributionCode which breaks on non-browser apps (e.g. Fenix)

Categories

(Toolkit :: Telemetry, defect, P4)

All
Android
defect

Tracking

()

REOPENED
Tracking Status
firefox92 --- affected

People

(Reporter: Gijs, Unassigned)

References

Details

In bug 1721627 I'm experimenting with making tests fail if we request internal resources that aren't present. The idea is to prevent patches from landing if they mis-spell (or forget to hg add, or forget to include in jar.mn or similar) subresources like CSS files and images, by failing if we hit chrome: or resource: URLs in tests that are not available (ie where the internal jar/file code cannot find the relevant file).

On android, resource:///modules/AttributionCode.jsm is getting flagged up as a URL that is not present. Some of these are guarded in try...catches, but some are not.
The relevant bits should probably be behind AppConstants.MOZ_BUILD_APP == "browser" checks.

Though, to be fair, I also do not see any non-test uses of the ClientEnvironment static attribution property, so perhaps removing that altogether would be the easier option?

I'm a little confused as to why the Environment JSMs are present in Fenix. Telemetry doesn't work on Fenix (well, bits of the core do, but they're libxul, not JSMs), so there shouldn't be any of it around, no?

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Chris H-C :chutten from comment #1)

I'm a little confused as to why the Environment JSMs are present in Fenix. Telemetry doesn't work on Fenix (well, bits of the core do, but they're libxul, not JSMs), so there shouldn't be any of it around, no?

I'm afraid I'm not an android expert, and I don't know to what degree these JSMs being present is expected or not.

At least the commit that added the AttributionCode.jsm bit to ClientEnvironment (bug 1578885) mentions remote settings and normandy, which might exist on android? I also don't see anything in https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/moz.build that excludes these JSMs from omni.ja for android, but I'd expect a bunch of toolkit would break if Services.telemetry was suddenly null?

In terms of why I know this is happening, I'm afraid all I have is a C++ stack which is not much use without the JS info: https://treeherder.mozilla.org/logviewer?job_id=346178124&repo=try&lineNumber=25450 . The patch logged:

07-23 17:27:11.038  4177  4192 I Gecko   : Missing file: resource:///modules/AttributionCode.jsm

on logcat, on this try job and https://searchfox.org/mozilla-central/search?q=AttributionCode.jsm only shows ClientEnvironment.jsm and TelemetryEnvironment.jsm in non-browser/, non-test code. I guess the former isn't part of telemetry though, despite what the name might suggest - so perhaps I filed this in the wrong place? As I noted in comment #0 I don't see how we're accessing that property at all, but if that's the culprit then we must be doing so, somehow! I'm hoping Rob can help here...

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(rhelmer)

(In reply to :Gijs (he/him) from comment #2)

(In reply to Chris H-C :chutten from comment #1)

I'm a little confused as to why the Environment JSMs are present in Fenix. Telemetry doesn't work on Fenix (well, bits of the core do, but they're libxul, not JSMs), so there shouldn't be any of it around, no?

I'm afraid I'm not an android expert, and I don't know to what degree these JSMs being present is expected or not.

At least the commit that added the AttributionCode.jsm bit to ClientEnvironment (bug 1578885) mentions remote settings and normandy, which might exist on android? I also don't see anything in https://searchfox.org/mozilla-central/source/toolkit/components/telemetry/moz.build that excludes these JSMs from omni.ja for android, but I'd expect a bunch of toolkit would break if Services.telemetry was suddenly null?

In terms of why I know this is happening, I'm afraid all I have is a C++ stack which is not much use without the JS info: https://treeherder.mozilla.org/logviewer?job_id=346178124&repo=try&lineNumber=25450 . The patch logged:

07-23 17:27:11.038  4177  4192 I Gecko   : Missing file: resource:///modules/AttributionCode.jsm

on logcat, on this try job and https://searchfox.org/mozilla-central/search?q=AttributionCode.jsm only shows ClientEnvironment.jsm and TelemetryEnvironment.jsm in non-browser/, non-test code. I guess the former isn't part of telemetry though, despite what the name might suggest - so perhaps I filed this in the wrong place? As I noted in comment #0 I don't see how we're accessing that property at all, but if that's the culprit then we must be doing so, somehow! I'm hoping Rob can help here...

ClientEnvironment.jsm (at least at the time) was used by Normandy as well as Activity Stream and Remote Settings, the goal with bug 1578885 was to have the attribution data available very early in startup so it could be used for experiments (showing a user a different first-run page if they had a certain attribution code, etc).

This is hard to debug without more info, like the JS call stack or something... offhand though I'd say it could be coming from one of those three components, is ASRouter or newtab about:welcome used on Android? https://searchfox.org/mozilla-central/search?q=AttributionCode.jsm&path=

Flags: needinfo?(rhelmer)

I am struggling to figure out how ClientEnvironment.jsm is implicated, but looking at https://searchfox.org/mozilla-central/search?q=ClientEnvironment.jsm&path= I see that Nimbus and messaging-system import it as well.

I've now spent half a day trying to get a local android build to work, to no avail (the emulator refuses to start). Alternatively I could keep debugging through try, but that's slow and tedious (though...).

Depends on: 1722416

(In reply to :Gijs (he/him) from comment #5)

I've now spent half a day trying to get a local android build to work, to no avail (the emulator refuses to start). Alternatively I could keep debugging through try, but that's slow and tedious (though...).

I have finally gotten this working, but of course it now turns out I cannot reproduce locally... 🙃

I also tried reproducing with the original rev, but I can't get the emulator to work locally with that, so that's not going to fly either. I also tried some other suspicions I had, but I didn't get anywhere - I guess this may need to remain a mystery... If it comes back we can reopen with more details. I'll make sure the patch in 1721627 logs JS stacks on debug builds on infra if it fails.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE

This went orange when I actually landed the patch: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=fBM6W5jET9G93oUKWSDduQ.0&revision=54e2a2ae58ddc48dcfab4bdcbcdc855d50aae7b9

Here's the JS stack:

08-02 17:25:33.565 3346 3361 D Gecko : 0 _updateAttribution() ["resource://gre/modules/TelemetryEnvironment.jsm":1670:6]
08-02 17:25:33.565 3346 3361 D Gecko : 1 _updateSettings() ["resource://gre/modules/TelemetryEnvironment.jsm":1591:9]
08-02 17:25:33.565 3346 3361 D Gecko : 2 EnvironmentCache() ["resource://gre/modules/TelemetryEnvironment.jsm":945:16]
08-02 17:25:33.565 3346 3361 D Gecko : 3 getGlobal() ["resource://gre/modules/TelemetryEnvironment.jsm":101:25]
08-02 17:25:33.565 3346 3361 D Gecko : 4 get currentEnvironment() ["resource://gre/modules/TelemetryEnvironment.jsm":108:11]
08-02 17:25:33.565 3346 3361 D Gecko : 5 assemblePing() ["resource://gre/modules/TelemetryControllerParent.jsm":425:0]
08-02 17:25:33.565 3346 3361 D Gecko : 6 _submitPingLogic() ["resource://gre/modules/TelemetryControllerParent.jsm":487:24]
08-02 17:25:33.565 3346 3361 D Gecko : 7 InterpretGeneratorResume() ["self-hosted":1482:33]
08-02 17:25:33.565 3346 3361 D Gecko : 8 AsyncFunctionNext() ["self-hosted":692:26]

Is the expectation thus that this doesn't run on android/fenix?

Status: RESOLVED → REOPENED
Flags: needinfo?(chutten)
Resolution: INCOMPLETE → ---

My expectation is that the code shouldn't even be present on Fenix, let alone be running. Firefox Telemetry isn't sent from Fenix, Glean data is.

To throw in a theory: perhaps Firefox Telemetry is present on Android builds because of Fennec (though it had its own written-in-Java Telemetry networking stack, IIRC), and Fenix inherited it?

ni?pocmo Do you have any idea of the history/mechanics are that result in Firefox Telemetry being present and active in Fenix (where it isn't used to send Telemetry)?

Flags: needinfo?(chutten)
Flags: needinfo?(s.kaspari)

I have to forward this question to agi.

Flags: needinfo?(s.kaspari) → needinfo?(agi)

I think the patch from bug 1723628 (https://hg.mozilla.org/mozilla-central/rev/a349aee25be1) may have fixed the immediate issue here, but it seems worth keeping this bug open to figure out the question of how this code is even running on android, and if we have an opportunity to improve android perf and app size by unshipping some of the relevant telemetry bits on android...

We chatted a little bit with :chutten on matrix, this is likely due to the telemetry ping that we send. I'm not sure if anybody is using that (maybe gfx is?) Unfortunately we're temporarily really low on staff and I don't have time to look at this more in depth.

Flags: needinfo?(agi)

I've done some more research and the environment-esque datapoints GFX echoes to Geckoview don't come from TelemetryEnvironment.jsm.

Near as I can tell there's no data reason for Telemetry JSMs to be included in Geckoview builds.

Agi, is there a "quick" way we can try removing toolkit/components/telemetry/app/* from gv builds to see if things go awry? I imagine they take up a lot of space and removing them could be a quick perf win. Is it a matter of editing some manifest somewhere that I can try?

Flags: needinfo?(agi)

(In reply to Chris H-C :chutten from comment #12)

I've done some more research and the environment-esque datapoints GFX echoes to Geckoview don't come from TelemetryEnvironment.jsm.

Near as I can tell there's no data reason for Telemetry JSMs to be included in Geckoview builds.

Agi, is there a "quick" way we can try removing toolkit/components/telemetry/app/* from gv builds to see if things go awry? I imagine they take up a lot of space and removing them could be a quick perf win. Is it a matter of editing some manifest somewhere that I can try?

You can try changing the relevant moz.build files that include the jsm files into omni.ja, so that they don't do that on android. No idea how much that's gonna break though! :-)

That ^ just add if CONFIG["MOZ_BUILD_APP"] != "mobile/android" to the highest level moz.build that includes those files.

Flags: needinfo?(agi)

Adding a naive if CONFIG["MOZ_BUILD_APP"] != "mobile/android" and not CONFIG["TESTS"]: to toolkit/components/telemetry/moz.build just above adding the JSMs was enough to get tests to fail entertainingly. I'm not even sure it's my code change that caused these failures because... well, the mochitests are in a retry loop and xpcshell can't find something called @/data/local/tmp/test_root/xpc/head.js:1739:69 whatever that means. (Maybe that's the line and column of an unresolved import?). They're definitely broken. And I definitely made a change. But I don't immediately see how the latter causes the former from the logs.

I guess I should set up a local Android build if I want to pursue this more. I wonder if choosing a non-desktop option in mach bootstrap will ruin my Desktop build config...

(In reply to Chris H-C :chutten (PTO, back August 16) from comment #15)

I guess I should set up a local Android build if I want to pursue this more. I wonder if choosing a non-desktop option in mach bootstrap will ruin my Desktop build config...

Just running bootstrap will not - it will just download all the android stuff you need for the android build. You could probably get away with an artifact build if this is just about the JSMs and not about the compiled bits of telemetry. However, I would strongly encourage you to keep a separate objdir and mozconfig file for the android build, unless you really like clobbering and rebuilding things all the time. mozconfigwrapper can help with that.

Thanks for the tips! I've been manually keeping separate objdirs using commented-out lines in my mozconfig for years. Next time I set up my build environment I'm definitely getting mozconfigwrapper to do it for me.

Hm, that's odd. about:memory on geckoview-example has very few Telemetry jsms at all. Fenix has a few more in the parent process.... but no matter what there's no TelemetryEnvironment anyplace. What's up with that.

Maybe this isn't worth that much investigation after all? But then what fell afoul of the original test?

(In reply to Chris H-C :chutten from comment #17)

Thanks for the tips! I've been manually keeping separate objdirs using commented-out lines in my mozconfig for years. Next time I set up my build environment I'm definitely getting mozconfigwrapper to do it for me.

Hm, that's odd. about:memory on geckoview-example has very few Telemetry jsms at all. Fenix has a few more in the parent process.... but no matter what there's no TelemetryEnvironment anyplace. What's up with that.

Maybe this isn't worth that much investigation after all? But then what fell afoul of the original test?

My suspicion is that some infrequent thing causes us to attempt to submit a telemetry ping. I imagine that rules out the main ping (unless it's only happening to test jobs that happen to run when the subsession split happens, ie across midnight or whatever? That would be fun...).

If we're not sure what's loading these files I have a few options to suggest:

  • stop packaging them entirely on android (ie make some/all of https://searchfox.org/mozilla-central/rev/d3683dbb252506400c71256ef3994cdbdfb71ada/toolkit/components/telemetry/moz.build#94 conditional on us not being on android) and push to try, see what breaks
  • less drastically, make importing the TelemetryEnvironment.jsm file break or log something unique (which I guess is not super different from the former point, now that bug 1721627 has landed - attempting to import the file that isn't packaged will cause a crash if it happens in automated tests)
  • use profiler markers (they register imports) to figure out what is importing the file first
  • try to reason about what would/could have caused the _submitPingLogic call stack in comment 7...

(In reply to :Gijs (he/him) from comment #18)

(In reply to Chris H-C :chutten from comment #17)

Thanks for the tips! I've been manually keeping separate objdirs using commented-out lines in my mozconfig for years. Next time I set up my build environment I'm definitely getting mozconfigwrapper to do it for me.

Hm, that's odd. about:memory on geckoview-example has very few Telemetry jsms at all. Fenix has a few more in the parent process.... but no matter what there's no TelemetryEnvironment anyplace. What's up with that.

Maybe this isn't worth that much investigation after all? But then what fell afoul of the original test?

My suspicion is that some infrequent thing causes us to attempt to submit a telemetry ping. I imagine that rules out the main ping (unless it's only happening to test jobs that happen to run when the subsession split happens, ie across midnight or whatever? That would be fun...).

If we're not sure what's loading these files I have a few options to suggest:

That's comment 12 . Failed entertainingly. Maybe I'll try a reduced set now that I have a working Android build.

  • less drastically, make importing the TelemetryEnvironment.jsm file break or log something unique (which I guess is not super different from the former point, now that bug 1721627 has landed - attempting to import the file that isn't packaged will cause a crash if it happens in automated tests)
  • use profiler markers (they register imports) to figure out what is importing the file first

Since bug 1721627 I'm more interested in just how much of Telemetry I can remove from GV without exploding things, so I'm disinclined to pursue these.

  • try to reason about what would/could have caused the _submitPingLogic call stack in comment 7...

Sadly it could be anything calling submitExternalPing: that stack is just the async (I don't know what to call it... the Promise resumption?) of that sync call.


I'll try nixxing everything from my local Android build and see where I can go from here. Getting tests to run is a bit of a bear, so I may still be on a long-latency try-bottlenecked dev cycle in the meantime.

Further investigation will take lower priority in favour of migrating Telemetry to Glean so we can "just" tear it all out from all shipping products.

Severity: -- → S4
Priority: -- → P4
No longer blocks: 1721627
Depends on: 1721627

For the next person who sees this: I pushed a try build to experiment with this that:

  1. removed the JSMs (on Android)
  2. also removed the component category registrations (on Android)

Everything fails! The xpcshell tests show lots of places that import Telemetry, like the test harness itself and the AddonManager, for example. The mochitests crash early, but I don't see quite where. So doing this properly requires a good deal more effort than "just don't package" the relevant JSMs.

You need to log in before you can comment on or make changes to this bug.