Closed Bug 1716847 Opened 3 years ago Closed 2 years ago

Glean errors for baseline.duration on Firefox iOS are unusually high

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: travis_, Assigned: travis_)

Attachments

(1 file)

While forking our Fenix error query to create one for Firefox iOS (I was looking into something for Nimbus), I noticed that the glean.baseline.duration metric was reporting 120k-140k errors per day, affecting 6-7% of clients.

Whiteboard: [telemetry:glean-rs:m?]

This is probably not related to bug 1780035 after all. The assumed shared root cause was overflowing preinit queues, but increasing the preinit queue did not help with these errors.

Duplicate of this bug: 1806448
Priority: P3 → P2

Errors went from ~100k per day to ~400k around 2022-12-16 and kept going up. A new ramp-up started on 2022-12-22, leveling off at 3.4 million since 2022-12-29.

https://mozilla.cloud.looker.com/explore/firefox_ios/metrics?qid=VeWCiLhEC04siMCnXMyAqm&origin_space=746&toggle=fil,vis

Assignee: nobody → tlong

Before I forget to update this with the latest progress:

This is now easily reproducible for me locally, and I've narrowed it down to baseline.duration generating an InvalidState error because start was called on an already started counter. I believe this may be a race condition between this line:

https://github.com/mozilla/glean/blob/0591aecadb762ac93e70bc85b8605f7a1ea409f0/glean-core/ios/Glean/Scheduler/GleanLifecycleObserver.swift#L28

And this line:

https://github.com/mozilla/glean/blob/0591aecadb762ac93e70bc85b8605f7a1ea409f0/glean-core/ios/Glean/Scheduler/GleanLifecycleObserver.swift#L35

This race condition exists because we expected the lifecycle observer to be created after the willEnterForeground event had occurred, so for the first foreground when launching the app we expected to need to explicitly do all the foreground things when creating the observer. But there was always a handful of errors in the past, and now there are a lot of them, which indicates that this wasn't working quite like we expected. Something has changed recently that causes the init to happen more consistently before willEnterForeground fires, so now we are calling handleForegroundEvent twice when launching the app.

To fix this, I think it is still important to keep the handleForegroundEvent call in the init of the observer to handle the case where it gets initialized after the event has occurred, but we should add a check to ensure that calls to handleForegroundEvent don't happen without a call to handleBackgroundEvent in between. This may mean yet another flag, but I'm looking at ways to handle this better, so I'm happy to entertain any ideas or counterproposals about what might be going on here.
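To make the proposed check concrete, here is a minimal sketch of the "flag" idea. It is not the actual Glean code and not necessarily the fix that landed: SketchTimer, SketchLifecycleObserver, guardEnabled, and errorsAfterLaunch are hypothetical names, the timer only imitates "start on an already started timer records an error", and the sketch is single-threaded, so it shows the event ordering problem but ignores any synchronization a real observer might need.

```swift
// Hypothetical stand-in for the duration timer; NOT the Glean API.
final class SketchTimer {
    private(set) var isRunning = false
    private(set) var invalidStateErrors = 0

    func start() {
        if isRunning {
            // Imitates the reported symptom: start called while already started.
            invalidStateErrors += 1
            return
        }
        isRunning = true
    }

    func stop() {
        isRunning = false
    }
}

// Hypothetical observer; method names follow the comment above, the rest is assumed.
final class SketchLifecycleObserver {
    private let duration: SketchTimer
    private let guardEnabled: Bool
    private var isForeground = false

    init(duration: SketchTimer, guardEnabled: Bool) {
        self.duration = duration
        self.guardEnabled = guardEnabled
        // Kept in init to cover the case where the observer is created
        // after the first foreground event has already fired.
        handleForegroundEvent()
    }

    func handleForegroundEvent() {
        // The proposed check: ignore a foreground event unless a
        // background event happened since the last one.
        if guardEnabled && isForeground { return }
        isForeground = true
        duration.start()
    }

    func handleBackgroundEvent() {
        if guardEnabled && !isForeground { return }
        isForeground = false
        duration.stop()
    }
}

// Simulates a launch where init runs first and the foreground
// notification arrives afterwards, followed by one background/foreground cycle.
func errorsAfterLaunch(guardEnabled: Bool) -> Int {
    let timer = SketchTimer()
    let observer = SketchLifecycleObserver(duration: timer, guardEnabled: guardEnabled)
    observer.handleForegroundEvent() // late willEnterForeground
    observer.handleBackgroundEvent()
    observer.handleForegroundEvent() // legitimate second foreground
    return timer.invalidStateErrors
}
```

Calling errorsAfterLaunch(guardEnabled: false) reproduces the double start at launch in this model, while guardEnabled: true drops the redundant foreground call and still lets the later, legitimate foreground event start the timer again.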

Priority: P2 → P1

Waiting for this to be released so we can see whether it reduces the InvalidState errors we are seeing before closing this.

Looks like this, combined with the other recent iOS HttpUploader updates, has done what we hoped and reduced the errors we were seeing. Currently this is only in beta, but I don't think we need to wait for v112 to reach release before calling it good. The beta data was clear enough to call this fixed.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED