Closed Bug 1566488 Opened 5 years ago Closed 5 years ago

Investigate 'metrics' ping 4AM-ending measurement windows weirdness

Categories

(Data Platform and Tools :: Glean: SDK, task, P3)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m8])

Attachments

(1 file)

Why are there so many of them? Why are so many of them 0-length?

From bug 1552507 comment 12:

Measurement Windows are distributed a little too much on the 0-length to my taste, especially since half of them have an end_time in the 4AM hour.

Yes, this is particularly concerning to me. I wonder if there's some bug in glean-ac about figuring out the 'start_time'. I'm a bit curious about the distribution of sequence numbers for this chunk of pings having 0-length and end-time at 4AM. If they are mostly at seq=0, this might indicate a problem in collecting the first "start_time".

Priority: -- → P3
Whiteboard: [telemetry:glean-rs:m?]
Blocks: 1552507
Whiteboard: [telemetry:glean-rs:m?] → [telemetry:glean-rs:m7]
Whiteboard: [telemetry:glean-rs:m7] → [telemetry:glean-rs:m8]
Assignee: nobody → mdroettboom

Looking at all the metrics pings that came in the last 24 hours, 78% of them measured a start/end delta of 0 (meaning less than the 1 minute resolution of those time stamps). 24% of all pings had a zero delta and an end time at 4AM. Those values are very high for metrics pings, which should not ever be sent more than once a day.

There is only one scenario in which a delta of 0 would be considered "normal". On the very first startup of the app by a user, the app will see that no ping has been sent today (or ever) and will schedule it immediately. Since the ping will start and end quite quickly, this is expected. We can detect these by filtering for pings with a seq of 0. This makes up 8% of the pings in the test data set (and are not included in the problematic percentages above).

I believe these anomalies can be traced to the fact that ping start times are not persisted to disk and exist only in an in-memory map and, on mobile platforms, the application is frequently killed by the OS.

Scenario 1

SYMPTOM: A delta of 0, not ending at 4AM.

The user hasn't run the app at all today, and the app is not running. The user starts the app at some time of day other than 4AM. The metrics ping scheduler sees that no ping was sent today, so schedules it to go out immediately. The ping collector sees that no start time is recorded for the ping, so sets it to app startup time, which is approximately "now". The end time gets set to "now".

Scenario 2

SYMPTOM: A delta of 0, ending at 4AM.

The user starts the app. The metrics ping scheduler sees that a ping has already been sent today, so it schedules the work manager to wake up the app at 4am tomorrow. Before the end of the day, the user closes the app, and the OS eventually kills the app. Work Manager wakes the app up at 4am to send a ping. The ping collector sees that no start time is recorded for the ping, so it sets it to app startup time which is approximately "now" at 4am. The end time also gets set to 4am.

Flags: needinfo?(tlong)
Flags: needinfo?(alessio.placitelli)

Another detail: This bug exists in glean-ac, but doesn't seem to exist in glean-core (based on reading alone). glean-core persists the start time as a user-lifetime metric. A triage question is whether we intend to fix this in glean-ac or just wait for the migration to glean-core to fix it.

Yeah, I definitely agree that the ping start time is meaningless if it's just held in memory, at least for the metrics ping. This could also be a problem for custom pings that are scheduled outside of the normal application lifetime. I'm pretty sure we can take care of this for the metrics ping, but I'm a little unsure as how this would get persisted for those. The other thing is that we probably don't want to persist the start time for the baseline and events ping longer than the application lifetime and just use the objectStartTime.

Flags: needinfo?(tlong)

(In reply to Michael Droettboom [:mdroettboom] from comment #1)

I believe these anomalies can be traced to the fact that ping start times are not persisted to disk and exist only in an in-memory map and, on mobile platforms, the application is frequently killed by the OS.

This is definitely a bug. If you look at the scheduling examples, we expect start_time to be the time when the "measurement window" started.

Flags: needinfo?(alessio.placitelli)
Attached file GitHub Pull Request (deleted) —

Closing because fix landed. bug 1587548 will validate that the fix worked once the fix rides the trains through Fenix.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: