Closed Bug 1602824 Opened 5 years ago Closed 5 years ago

Improve the 'metrics' ping schedule

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: mdroettboom)

References

Details

(Whiteboard: [telemetry:glean-rs:m11])

Attachments

(1 file)

In bug 1520838 we designed the 'metrics' ping and defined its schedule (see the design document).

After using it for a while in Fenix, we found that it has a few flaws that need to be addressed.

This bug is for discussing how to address the reported problems and figure out a plan of action.

Priority: -- → P1
Whiteboard: [telemetry:glean-rs:m11]
Summary: Redesign the 'metrics' ping → Redesign the 'metrics' ping schedule

Hey Rebecca and Emily!

Thank you both for allowing us to discuss this at your DS meeting. This is the bug we're using to track the re-design efforts for this ping. Who can we consult with, on the Data Science team, to move this forward?

Flags: needinfo?(rweiss)
Flags: needinfo?(ethompson)
Blocks: 1587548
Blocks: 1536930

Note mobile telemetry is going to be critical over the next 5 months... does this redesign add risk?

(In reply to David Bolter [:davidb] (NeedInfo me for attention) from comment #2)

Note mobile telemetry is going to be critical over the next 5 months... does this redesign add risk?

Minimal. This re-design is required to smooth out some flaws (see comment 0) that might affect analyses. From my point of view (SDK), this is a required step forward.

Of course this should be agreed with Data Science, that's why they're flagged :)

For example, this is partly already biting us in bug 1601091

Just to clarify: this change would only affect case 3 from the docs for the metrics ping. It won't affect the baseline or events pings.

This is about increasing the quality of the collection. I think that not doing this is riskier than doing it.

Hi Alessio. Saptarshi, Marissa and Corey can be looped in to discuss the metrics ping redesign.

Flags: needinfo?(sguha)
Flags: needinfo?(rweiss)
Flags: needinfo?(mgorlick)
Flags: needinfo?(ethompson)
Flags: needinfo?(cdowhygelund)

Why not send a ping when the user brings the application to the foreground? If the user never does, a) no ping is collected and b) any pending ping will not be sent. In the case of (b), this might affect getting information on 'last known use' before the user stopped using the app.

Not suggesting these are robust ideas, happy to see what others have thought of

Flags: needinfo?(sguha)

The baseline ping functions the opposite way: it is sent when the application is moved to the background. Why don't we follow that approach and collect metrics while the app is in the foreground, then send them when the app is backgrounded? That makes it consistent with the baseline ping. It also solves issue b) of pending pings never being sent.

Is there a concern of varying time windows between pings in such a case? Or "empty" pings when a user is flipping quickly between apps? What was the driver for sending the ping every 24 hours at 4am?

Flags: needinfo?(cdowhygelund)

(In reply to Emily Thompson from comment #5)

Hi Alessio. Saptarshi, Marissa and Corey can be looped in to discuss the metrics ping redesign.

Thank you Emily. I've scheduled a meeting to get us all on the same page.

(In reply to "Saptarshi Guha[:joy]" from comment #6)

Why not send a ping when the user brings the application to the foreground? If the user never does, a) no ping is collected and b) any pending ping will not be sent. In the case of (b), this might affect getting information on 'last known use' before the user stopped using the app.

The metrics ping is much bigger than the baseline ping, especially due to the GV metrics. While "going to foreground" is a good signal we could track, I think we should still not send it every time we get to foreground. We could, for example, check "was the ping sent in the last 24 hours?" or "was the ping already sent today?" every time we get to foreground. If a ping wasn't sent, we collect and send one.

This is one of the potential solutions we thought of. However it has implications :)

(In reply to Corey Dow-Hygelund [:ccd] from comment #7)

The baseline ping functions the opposite way: it is sent when the application is moved to the background. Why don't we follow that approach and collect metrics while the app is in the foreground, then send them when the app is backgrounded? That makes it consistent with the baseline ping. It also solves issue b) of pending pings never being sent.

In Glean (and in Firefox Desktop) collection and sending of a ping are already decoupled. For this specific bug, we're considering the collection of the ping, which is basically cutting the boundary between "old data" and "data that will be sent with the next ping".

The metrics ping is much bigger than the baseline ping: it's roughly equivalent to the main-ping on Desktop, carrying the bulk of the metrics. Sending it on every foreground could be demanding :)

Moreover, given that Glean is also used in non-mobile environments, "foreground" might mean something very different (think of Desktop vs Android vs a VR device).

Is there a concern of varying time windows between pings in such a case? Or "empty" pings when a user is flipping quickly between apps? What was the driver for sending the ping every 24 hours at 4am?

These are all good questions. I scheduled a meeting for tomorrow so that all of us can get on the same page about the meaning of this ping and its problems. All the relevant reading material is in the meeting invite.

Since this is the source of some scary rumors, I'm renaming this bug :)

Summary: Redesign the 'metrics' ping schedule → Improve the 'metrics' ping schedule
Assignee: nobody → alessio.placitelli
Blocks: 1601080
Blocks: 1601960

Mike volunteered to write a proposal doc about this. Re-assigning :D

Assignee: alessio.placitelli → mdroettboom
Flags: needinfo?(mgorlick)
Blocks: 1595546
Blocks: 1601263

Proposed solution for sign-off from Data Science:

The root of the problem is starting up the application using the Work Manager at 4am, when it isn't otherwise running. It ends up running in an incomplete state that has many of the issues listed above.

Therefore we propose changing the behavior to use a regular timer rather than the WorkManager to trigger the collection of the metrics ping. This would mean the metrics ping could only be collected when the application is running. And here "application is running" is a superset of the "application is in the foreground", i.e. the application can be running even when it is not visible on the screen. The salient difference is that if the OS shuts down the application completely, we will no longer be using the Work Manager to wake it back up.

A check would be performed at startup to schedule this timer, which would be largely similar to the metrics ping scheduling currently implemented. The main difference would be that there is no expectation of waking the application up when it isn't already running. Secondly, the important corner case of not mixing metrics from multiple versions of the application would be handled.

On startup, the following cases will be detected:

  1. the application was just installed;
  2. the application is a different version than the last time it was run;
  3. the application was just started (after a crash or a long inactivity period);
  4. the application was open and the 4AM due time was hit.

In the first case, since the application was just installed, if the due time for the current calendar day has passed, a metrics ping is immediately generated and scheduled for sending. Otherwise, if the due time for the current calendar day has not passed, a ping collection is scheduled for that time.

In the second case, if the application is of a different version than the last time it was run, a metrics ping is collected and sent immediately.

In the third case, if the metrics ping was already collected on the current calendar day, a new collection will be scheduled for the next calendar day, at 4AM. If no collection happened yet, and the due time for the current calendar day has passed, a metrics ping is immediately generated and scheduled for sending.

In the fourth case, similarly to the previous case, if the metrics ping was already collected on the current calendar day when we hit 4AM, then a new collection is scheduled for the next calendar day. Otherwise, the metrics ping is immediately collected and scheduled for sending.

Lastly, whenever a metrics ping is sent, the next one will be rescheduled using the same criteria described above.
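
The four startup cases can be folded into one decision function. The following is only a rough Python sketch of that reading, not the Glean implementation: the function name `plan_metrics_ping`, its parameters, and the convention that `last_collected` is `None` on a fresh install are all hypothetical. It also assumes that a start before 4AM with no collection yet simply waits for today's 4AM, which the third case does not spell out.

```python
from datetime import datetime, time, timedelta
from typing import Optional, Tuple

DUE_TIME = time(hour=4)  # the 4AM due time from the proposal


def plan_metrics_ping(
    now: datetime,
    last_collected: Optional[datetime],
    version_changed: bool = False,
) -> Tuple[bool, datetime]:
    """Return (collect_now, next_due) for the startup or timer check.

    Hypothetical helper: last_collected is None for a fresh install.
    """
    due_today = datetime.combine(now.date(), DUE_TIME)
    due_tomorrow = due_today + timedelta(days=1)
    collected_today = (
        last_collected is not None and last_collected.date() == now.date()
    )

    if version_changed:
        # Case 2: collect immediately so versions are never mixed.
        return True, due_tomorrow
    if collected_today:
        # Cases 3 and 4: already collected today, wait for the next 4AM.
        return False, due_tomorrow
    if now >= due_today:
        # Cases 1, 3 and 4: today's due time has passed with no collection.
        return True, due_tomorrow
    # Case 1 (and, by assumption, case 3): due time not reached yet.
    return False, due_today
```

Because every immediate collection is followed by a reschedule for the next calendar day, the same function can be called again after each ping, which matches the "rescheduled using the same criteria" rule.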

Additionally, to make these sorts of issues more readily diagnosable in the future, a "reason code" metric will be added to the metrics ping that describes the reason the ping was sent. These reasons are:

  • new_install: This is the first time the application was run after installation
  • different_version: The version of the application is different than the last time it was run
  • overdue: It is past 4am, and there was no ping sent today at 4am
  • same_day: It is before 4am, so the ping is scheduled to be sent today at 4am (if app is running)
  • next_day: It is after 4am, so the ping is scheduled to be sent the next day at 4am (if app is running)
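
To make the reason codes concrete, here is a small Python sketch of how one might be picked at startup. It is an illustration under assumptions, not the SDK's code: `Reason`, `startup_reason` and all parameter names are hypothetical, and a missing `last_version` is assumed to mean a fresh install. The bullets do not say what happens when a ping was already sent earlier on the same calendar day; the sketch assumes `next_day` in that situation, in line with the third startup case.

```python
from datetime import datetime, time
from enum import Enum
from typing import Optional


class Reason(Enum):
    NEW_INSTALL = "new_install"
    DIFFERENT_VERSION = "different_version"
    OVERDUE = "overdue"
    SAME_DAY = "same_day"
    NEXT_DAY = "next_day"


def startup_reason(
    now: datetime,
    last_sent: Optional[datetime],
    last_version: Optional[str],
    current_version: str,
) -> Reason:
    """Pick a reason code (hypothetical helper, names are assumptions)."""
    if last_version is None:
        # Assumption: no stored version means a fresh install.
        return Reason.NEW_INSTALL
    if last_version != current_version:
        return Reason.DIFFERENT_VERSION

    due_today = datetime.combine(now.date(), time(hour=4))
    sent_today = last_sent is not None and last_sent.date() == now.date()
    if sent_today:
        # Assumption: a ping already sent today defers to tomorrow's 4AM.
        return Reason.NEXT_DAY
    if now < due_today:
        return Reason.SAME_DAY
    return Reason.OVERDUE
```
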
Flags: needinfo?(sguha)
Flags: needinfo?(mgorlick)
Flags: needinfo?(cdowhygelund)
Attached file: GitHub Pull Request (deleted)

The timing noted above should be our best option because it is the most reliable way to collect the data. However, it is worth noting that metrics ping latencies will increase and this is not well accounted for in our analytics. This can lead to spurious results. It will be important to educate data stakeholders that submission times are not a reliable way to join data to get cross-ping insights.

I have done a deep dive on ping latencies and how this timing affects analysis with respect to retention and understanding user behavior. A few recommendations for analysis emerged. https://docs.google.com/document/d/1tG1sT1oBJp9R-ZbOr47lKZ77K-Ycg4lcSUnh-FT3e-8/edit#

Recommendations for mobile analytics:

  • Only use the baseline ping to compute retention in GLEAN.
      • If we decide we need to use submission_date as our spine, we should never combine these pings to estimate retention, because we will be double counting users: once on the day of activity and again the day after, when we see the metrics ping.
  • Only combine different pings if you are using the start time and end time embedded in each ping.
      • Latencies between when the actions happen and when telemetry is received are longer for the new GLEAN telemetry than for legacy systems.
      • This also means that we can’t easily include insights provided on the metrics ping to inform retention. This ping holds some critical metrics for understanding performance and revenue proxies. It is not meaningful to join on Submission Date.
          • Search, flash_usage, default_search_engine, default_browser.
  • If you have decided to use start and end time and want the data to be stable over time and facilitate ETL, 99% of data seen in a 90 day window has landed in our data warehouse after 6 days.
  • We need a new event to measure New User Acquisition and guarantee receipt and the attachment of the adjust campaign_id to a user’s first session (in flight for Fenix, but what about other mobile products?).
      • Using profile_created_date to determine when a new user has created their profile is valid and no additional ETL work or filters are needed.
      • We should add these events to measures of retention to better understand users that are bouncing off of our products.
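
The double-counting concern with submission_date can be seen on a toy example. The records below are made up purely for illustration (hypothetical client id and dates), and `active_clients_by_day` is a hypothetical helper, not an existing query or API: one day of activity yields a baseline ping that day and a metrics ping received the day after.

```python
from collections import defaultdict
from datetime import date

# Hypothetical toy records: (client_id, ping_type, submission_date).
pings = [
    ("client-a", "baseline", date(2020, 1, 10)),  # day of activity
    ("client-a", "metrics", date(2020, 1, 11)),   # same activity, seen a day later
]


def active_clients_by_day(records, ping_types):
    """Count distinct clients per submission date for the given ping types."""
    by_day = defaultdict(set)
    for client_id, ping_type, submission_date in records:
        if ping_type in ping_types:
            by_day[submission_date].add(client_id)
    return {day: len(clients) for day, clients in by_day.items()}


baseline_only = active_clients_by_day(pings, {"baseline"})
combined = active_clients_by_day(pings, {"baseline", "metrics"})
```

With the baseline ping alone, the client is active on one day. With both pings combined on submission date, the same single day of use shows up as activity on two consecutive days, which would inflate retention.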

Flags: needinfo?(mgorlick)

(In reply to Marissa Gorlick from comment #14)

• Latencies between when the actions happen and when telemetry is received are longer for the new GLEAN telemetry than for legacy systems.

While orthogonal to this particular bug, this is very concerning to me. I left a few comments on the document you linked to understand a bit more of this, as we don't expect latency of Glean pings to be worse than the one from legacy telemetry.

Based on our discussion this appears okay. Do note that for new users we might never get their ping (if my memory serves correctly), which would affect 'new user analysis'.

+r

Flags: needinfo?(sguha)

r+

Flags: needinfo?(cdowhygelund)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED