Closed Bug 1566484 Opened 5 years ago Closed 5 years ago

[Glean] Investigate the DAU/WAU discrepancy between "metrics" pings and "baseline"/"events" pings

Categories

(Data Science :: Investigation, task)

x86_64
Windows 10
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: flawrence)

References

Details

Brief Description of the request (required):

In bug 1552507 we found that the sample used for the validation contains 7% more clients if you measure by "metrics" pings instead of "events" pings. Measured per day, that rises to 27% more clients, which suggests that "metrics" pings are more likely to come from the same clients over multiple days, while "events" and "baseline" pings come from different clients over those days.
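To make the two measurements concrete, here is a minimal sketch in plain Python of how the overall and per-day client ratios can diverge. The function names and the (client_id, day) tuples are made up for illustration; this is not the validation query from bug 1552507.

```python
from collections import defaultdict

def distinct_clients(pings):
    """Number of distinct client_ids across all pings."""
    return len({client_id for client_id, _day in pings})

def mean_daily_clients(pings):
    """Average number of distinct client_ids per day (assumes at least one ping)."""
    by_day = defaultdict(set)
    for client_id, day in pings:
        by_day[day].add(client_id)
    return sum(len(clients) for clients in by_day.values()) / len(by_day)

# Made-up pings as (client_id, day) pairs; not real Fenix data.
metrics_pings = [("a", 1), ("b", 1), ("c", 1), ("a", 2), ("b", 2), ("c", 2)]
events_pings = [("a", 1), ("b", 1), ("a", 2), ("c", 2)]

# Same three clients overall, but "metrics" pings repeat every day.
overall_ratio = distinct_clients(metrics_pings) / distinct_clients(events_pings)
daily_ratio = mean_daily_clients(metrics_pings) / mean_daily_clients(events_pings)
```

In this toy data both ping types cover the same three clients over the whole sample, yet the per-day count is higher for "metrics" pings because the same clients show up on every day.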

We'd like some help understanding if this is an artefact of how things are scheduled or a bug in the code, since it's big enough to take seriously.

Business purpose for this request (required):

Glean SDK validation for Fenix.

Requested timelines for the request or how this fits into roadmaps or critical decisions (required):

This data is powering Fenix DAU/MAU/WAU, so it's important enough to require timely action.

Links to any assets (e.g Start of a PHD, BRD; any document that helps describe the project):

Name of Data Scientist (If Applicable):

  • FWIW, Felix helped design the 'metrics' ping scheduling, so he might have a bit more context here.
Blocks: 1552507
Assignee: nobody → flawrence
Status: NEW → ASSIGNED

My first suspicion here is that maybe the metrics ping is getting sent even if the user didn't open Fenix that day. If they didn't open Fenix then it wouldn't have generated any events, and wouldn't have sent any baseline pings because it was never sent to background. This was inspired by chutten's observation that most main pings were getting sent at/around 4AM, which indicates that the ping is getting sent even if the phone is not in active use at the time.

I tried to look into this by looking at total_uri_count - I was going to see if most of the clients with missing events/baseline pings had total_uri_count == 0, and that clients with total_uri_count > 0 had the correct behaviour. But on 20190715, the vast majority of main pings did not contain total_uri_count. Did this probe get removed from the main ping in newer builds of Fenix and glean, or is this a problem in its own right?

  • jesse for visibility, since this probably impacts Fenix metrics in GUD

According to Fenix's metrics.yaml total_uri_count is sent... in the "baseline" ping?!

(( I would think that such a metric wouldn't need to bloat the small and speedy "baseline" ping and would instead be present in the "metrics" ping, but. ))

You could look at the search-related probes I found in the "metrics" ping instead, if you'd like. I summarized some findings in bug 1552507 comment#12.

Ha. Thanks for the link to metrics.yaml. It looks like the only user-activity-related metric in the main ping is search_count. I guess I'll have to use that. Happily for this analysis it's meant to be in both the main ping and the baseline ping.

I think we're receiving the main ping even from clients who didn't use the browser that day, and this is why we're receiving main pings from more users than baseline pings. It would be good to have a mobile engineer verify that this is happening or at least plausible.

I took all main pings received on 20190724, aggregated by client_id and checked whether the client had any searches using any search engine. I did the same for the baseline pings, but using three days of data, to account for lag in receiving pings. Then I left joined from the main pings to the baseline pings, on client_id.

Of the clients for whom we received a main ping, there were roughly 37% for whom we did not receive a baseline ping in this period.

Of the clients for whom we received a main ping indicating that they made searches (i.e. actually used the browser during the time period), there were only 0.5% for whom we did not receive a baseline ping in this period.

Of the clients for whom we received a main ping indicating that they did not make searches (i.e. they may or may not have used the browser during the time period), a whopping 52% did not have a baseline ping.

This is consistent with the hypothesis that relatively few pings are getting lost but the main ping is sometimes being sent even when the user does not use the browser.

For symmetry I checked the reverse: I started with baseline pings from 20190724 and looked for clients with no main ping in the surrounding days. There were 3.6%. Of those who made searches (according to a baseline ping), 2.6% were missing a main ping. Of those who did not make searches, 4.8% were missing a main ping. These numbers are all fairly small, in line with the hypothesis.
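The left join and the three percentages above can be sketched in plain Python. This is an illustration only, not the notebook's code: `missing_rates`, the dictionary layout and the client IDs are all made up for the example.

```python
def missing_rates(primary, secondary_ids):
    """Share of clients in `primary` that never appear in `secondary_ids`.

    primary: dict mapping client_id -> total search count seen in one ping type.
    secondary_ids: set of client_ids seen in the other ping type.
    Returns the share overall, for clients with searches, and for clients without.
    """
    def rate(client_ids):
        client_ids = list(client_ids)
        if not client_ids:
            return None  # no clients in this group, so no rate to report
        missing = sum(1 for c in client_ids if c not in secondary_ids)
        return missing / len(client_ids)

    searched = [c for c, n in primary.items() if n > 0]
    no_search = [c for c, n in primary.items() if n == 0]
    return {
        "all": rate(primary),
        "searched": rate(searched),
        "no_search": rate(no_search),
    }

# Made-up data: search counts from one day of one ping type,
# and client_ids from a few days of the other ping type.
metrics_clients = {"a": 3, "b": 0, "c": 0, "d": 1}
baseline_clients = {"a", "c", "d"}
rates = missing_rates(metrics_clients, baseline_clients)
```

Swapping the two inputs gives the reverse check, starting from baseline pings and looking for a missing ping of the other type.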

From here:

  • A mobile engineer should check whether the hypothesis is plausible from a technical point of view. If so, then
  • We should probably configure glean not to send a main ping if there has been no browser use since the last main ping was sent
  • Since the % of missing pings reduced a lot but not to zero, someone should decide whether there is cause for further investigation or whether we are happy to write off the remainder as being due to users going offline, network issues, solar flares, etc.
  • Either ask me for something else or close the bug :)

Notebook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/149123/

(In reply to Felix Lawrence from comment #5)

I think we're receiving the main ping even from clients who didn't use the browser that day, and this is why we're receiving main pings from more users than baseline pings. It would be good to have a mobile engineer verify that this is happening or at least plausible.

I had a nice chat with Sebastian (just cc'd) about this hypothesis. There are different parts to this story. In general, yes, this is possible.
It also depends on what "use the browser" means: no search? If we mean that the user opens the app (i.e. launches the app process), then leaves the app (a 'baseline' ping is generated at this point) and the process stays around while the user doesn't switch back to the app, then:

Sebastian: This is in the realm of possibilities, yes. It's pretty much undefined when exactly Android is going to kill the app process. If the user doesn't do much else and the device is powerful enough then the app process could stick around. However if this affects many users then I would be very surprised if this would be the reason.

Sebastian also mentioned that there are other things we should verify. Fenix has a bunch of background services that:

Sebastian: (e.g. for custom tabs, accessed by third-party apps) will create the application object, start a background service, but never actually visually launch the app

In these cases, the application will be there.

From here:

  • We should probably configure glean not to send a main ping if there has been no browser use since the last main ping was sent

Should we? That would require a strict definition of what 'browser usage' is. Would opening a custom tab mean that the browser was used? Do we get a baseline ping in that case?

I'd be in favour of investigating the 'custom tabs' case a bit more, to figure out what Fenix does, exactly. I'd be in favour of documenting your findings, Felix, in the metrics ping docs edge cases section. We could also potentially add a built-in metric, a boolean, that says "something browser-like was displayed". What do you think?

  • Since the % of missing pings reduced a lot but not to zero, someone should decide whether there is cause for further investigation or whether we are happy to write off the remainder as being due to users going offline, network issues, solar flares, etc.

I'd never expect such a percentage to drop to 0% :) I'd be happy to see it around ~1%, though. We have:

  • 0.5% for whom we received the 'metrics' ping and did not receive a baseline ping in this period;
  • 3.6% for whom we received the 'baseline' ping and did not receive a metrics ping in this period.

Is that right? I'd expect the second chunk to be related to either people using Fenix one-off (and never coming back!) or to users killing the process manually. Or uninstalling the app. Would it be easy to check that, Felix?

Flags: needinfo?(flawrence)

(In reply to Alessio Placitelli [:Dexter] from comment #6)

If we mean that the user opens the app (i.e. launches the app process), then leaves the app (a 'baseline' ping is generated at this point) and the process stays around while the user doesn't switch back to the app, then:

"leaves the app" is determined by registering to the ProcessLifecycleOwner and observing ON_STOP, right? A quick jump to the "recent apps" screen and swiping the app immediately away may not trigger this event since this kills the process in most situations (and there's some delay in ProcessLifecycleOwner to avoid being triggered when switching between screens). Is this important here too (should be verified with some testing though!)?

Another important, but small, note (if you'll forgive the interruption) is that it isn't a "main" ping, it's a "metrics" ping.

I was taking an extremely conservative (broad) definition of "use the browser" - something along the lines of "had the browser on screen at any point that day/measurement window". There is no metric that measures this directly, so I looked at "made a search", which is a sufficient condition for browser usage but not a necessary one.

This is the first I've heard about custom tabs etc. Note from :chutten's investigation, most metrics pings are sent at or shortly after 4AM local time (i.e. the earliest they're scheduled to be sent), which implies that for many users some fenix process is active at 4AM. I believe that for most of these cases, the user is not actually using the phone at 4AM. Without knowledge of the source code, it is possible that some process is running through the night; it is also possible that the ping is scheduled in such a way that starts the process in order to send the ping (e.g. it sets it up with an AlarmManager??)
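One way to quantify the "sent at or shortly after 4AM local time" observation is to bucket pings by local hour. A minimal sketch, assuming each ping carries an ISO 8601 timestamp with a UTC offset; the timestamps and the function name are hypothetical, and this is not :chutten's actual query.

```python
from collections import Counter
from datetime import datetime

def share_sent_in_hour(local_timestamps, hour):
    """Fraction of pings whose local wall-clock hour equals `hour`."""
    hours = Counter(datetime.fromisoformat(ts).hour for ts in local_timestamps)
    return hours[hour] / len(local_timestamps)

# Made-up timestamps with UTC offsets; the field layout is an assumption.
sample = [
    "2019-07-24T04:00:12+02:00",
    "2019-07-24T04:03:51-07:00",
    "2019-07-24T04:17:05+05:30",
    "2019-07-24T13:42:00+01:00",
]
share_at_4am = share_sent_in_hour(sample, 4)
```

The hour is read in each ping's own offset, so clients in different time zones land in the same 4AM bucket.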

From here:

  • We should probably configure glean not to send a main ping if there has been no browser use since the last main ping was sent

Should we? That would require a strict definition of what 'browser usage' is. Would opening a custom tab mean that the browser was used? Do we get a baseline ping in that case?

I still don't know what a custom tab is, but I would guess that we would want a ping if a custom tab has been used, so that we could collect telemetry about that. But we should avoid the case where someone installs Fenix, opens it once, decides never to use it again but doesn't uninstall it, and it continues phoning home forever...

We could also potentially add a built-in metric, a boolean, that says "something browser-like was displayed". What do you think?

Yes, this sounds like a good idea; something that lets us count the number of active browser users per day.

I'd be in favour of investigating the 'custom tabs' case a bit more, to figure out what Fenix does, exactly.

I agree that someone needs to think through Fenix's telemetry and how it relates to custom tabs and other features. Who would be best placed to do this? Fenix engineers and PMs, reviewed by who?

We have:

  • 0.5% for whom we received the 'metrics' ping and did not receive a baseline ping in this period;
  • 3.6% for whom we received the 'baseline' ping and did not receive a metrics ping in this period.

Is that right? I'd expect the second chunk to be related to either people using Fenix one-off (and never coming back!) or to users killing the process manually. Or uninstalling the app.

That sounds about right. Note that in my analysis, "never" means "not that day or the next day", and I used submission_date_s3. If the metrics ping is sent at the earliest time (4AM local time), that could be up to 24 hours after the baseline ping so it's fairly likely that the metrics ping has a later submission_date_s3 than the baseline ping. If the metrics ping is sent after the earliest time then it may well be received two days after the baseline ping. And you can add "goes offline" and "turns the phone off" to "killing the process manually" and "uninstalling the app".
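The effect of the matching window can be shown with a small sketch. `has_match` and the dates are made up for illustration; the real analysis compared submission_date_s3 values in the notebook.

```python
from datetime import date, timedelta

def has_match(day, other_days, window_days=1):
    """True if any date in `other_days` falls within [day, day + window_days]."""
    last_day = day + timedelta(days=window_days)
    return any(day <= d <= last_day for d in other_days)

# A baseline ping received on the 24th whose metrics ping only arrives on the 26th.
baseline_day = date(2019, 7, 24)
metrics_days = [date(2019, 7, 26)]

matched_short = has_match(baseline_day, metrics_days)                 # 24th to 25th
matched_long = has_match(baseline_day, metrics_days, window_days=2)   # 24th to 26th
```

With the one-day window the client counts as missing a metrics ping even though the ping did arrive, just a day too late for the window.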

Would it be easy to check that, Felix?

Not easy enough to be worthwhile, I suspect, unless you're really concerned? I could look at metrics pings received over a longer window, which would catch pings where the phone comes back online or the process restarts. But to do this rigorously and avoid false assurances, we'd need to check that the later pings were indeed for the correct time period - i.e. that we didn't miss a metrics ping then receive the next metrics ping and declare that no pings were missing. To do this properly would be quite time consuming and is unlikely to yield great insights IMO?

"leaves the app" is determined by registering to the ProcessLifecycleOwner and observing ON_STOP, right? A quick jump to the "recent apps" screen and swiping the app immediately away may not trigger this event since this kills the process in most situations (and there's some delay in ProcessLifecycleOwner to avoid being triggered when switching between screens).

This sounds like an annoying edge case, but not something that more than a handful of users will do on a given day?

Another important, but small, note (if you'll forgive the interruption) is that it isn't a "main" ping, it's a "metrics" ping.

Oops!

Flags: needinfo?(flawrence)

(In reply to Felix Lawrence from comment #9)

it is also possible that the ping is scheduled in such a way that starts the process in order to send the ping (e.g. it sets it up with an AlarmManager??)

Dang, that was it. We filed bug 1570647 for this.

However, there's a twist: we should be sending no ping at all if nothing was recorded. A few things changed in Fenix recently, so we always have a few metrics in the metrics ping. However, that wasn't the case when Chris performed his analysis.

This might indicate a different, additional problem. Any chance you could use the same sample that Chris used (the sample is defined here) and check if the same problem is there?

From here:

  • We should probably configure glean not to send a main ping if there has been no browser use since the last main ping was sent

Should we? That would require a strict definition of what 'browser usage' is. Would opening a custom tab mean that the browser was used? Do we get a baseline ping in that case?

I still don't know what a custom tab is, but I would guess that we would want a ping if a custom tab has been used, so that we could collect telemetry about that. But we should avoid the case where someone installs Fenix, opens it once, decides never to use it again but doesn't uninstall it, and it continues phoning home forever...

We could also potentially add a built-in metric, a boolean, that says "something browser-like was displayed". What do you think?

Yes, this sounds like a good idea; something that lets us count the number of active browser users per day.

I'd be in favour of investigating the 'custom tabs' case a bit more, to figure out what Fenix does, exactly.

I agree that someone needs to think through Fenix's telemetry and how it relates to custom tabs and other features. Who would be best placed to do this? Fenix engineers and PMs, reviewed by who?

I had a lovely chat with Sebastian about this. If we go by your initial suggestion ("not send a metrics ping if there was no browser usage"), we should be covered, even for custom tabs. For context, custom tabs are for embedding Fenix in apps. See here.

Would it be easy to check that, Felix?

Not easy enough to be worthwhile, I suspect, unless you're really concerned? I could look at metrics pings received over a longer window, which would catch pings where the phone comes back online or the process restarts. But to do this rigorously and avoid false assurances, we'd need to check that the later pings were indeed for the correct time period - i.e. that we didn't miss a metrics ping then receive the next metrics ping and declare that no pings were missing. To do this properly would be quite time consuming and is unlikely to yield great insights IMO?

All right, ignore the above :)

"leaves the app" is determined by registering to the ProcessLifecycleOwner and observing ON_STOP, right? A quick jump to the "recent apps" screen and swiping the app immediately away may not trigger this event since this kills the process in most situations (and there's some delay in ProcessLifecycleOwner to avoid being triggered when switching between screens).

This sounds like an annoying edge case, but not something that more than a handful of users will do on a given day?

Good point.

Flags: needinfo?(flawrence)

However, there's a twist: we should be sending no ping at all if nothing was recorded. A few things changed in Fenix recently, so we always have a few metrics in the metrics ping. However, that wasn't the case when Chris performed his analysis.

This might indicate a different, additional problem. Any chance you could use the same sample that Chris used (the sample is defined here) and check if the same problem is there?

I looked at metrics pings from '20190703', which lies in the range of days :chutten investigated. 32% had no matching baseline ping, roughly similar to :chutten's "27%" figure for a related quantity.

Again, very few (0.7%) users who made searches had no baseline ping. Of those who made no searches, 47% had no baseline ping. So it looks like the same effect was present at the start of July.

What are these metrics that are always present in the metrics ping? I saw metrics.default_browser, search.default_engine.code, metrics.default_moz_browser, search.default_engine.name, metrics.mozilla_products and some other metrics present in plenty of pings that had no searches - if the presence of these "metrics" is enough to trigger a ping to be sent, then this was already happening at the start of July in the period :chutten used for his analysis. My quick analysis of this is at the bottom of the same notebook.

Flags: needinfo?(flawrence)

(In reply to Felix Lawrence from comment #11)

What are these metrics that are always present in the metrics ping? I saw metrics.default_browser, search.default_engine.code, metrics.default_moz_browser, search.default_engine.name, metrics.mozilla_products and some other metrics present in plenty of pings that had no searches - if the presence of these "metrics" is enough to trigger a ping to be sent, then this was already happening at the start of July in the period :chutten used for his analysis. My quick analysis of this is at the bottom of the same notebook.

The presence of all metrics that have lifetime: application here is enough to make sure a metrics ping is collected. However, the change that added/fixed these metrics was merged on the 18th of July.
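As a toy model of why application-lifetime metrics keep the ping from ever being empty, consider the sketch below. `ToyPingStore` is hypothetical and is not the Glean SDK; it only assumes that an empty store produces no ping and that ping-lifetime values are cleared on collection while application-lifetime values are kept.

```python
class ToyPingStore:
    """Toy stand-in for a metrics store; not the Glean SDK implementation."""

    def __init__(self):
        self.values = {}  # metric name -> (value, lifetime)

    def record(self, name, value, lifetime="ping"):
        self.values[name] = (value, lifetime)

    def collect(self):
        """Return the ping payload, or None if nothing is recorded."""
        if not self.values:
            return None  # assumed behaviour: no ping when nothing was recorded
        payload = {name: value for name, (value, _lifetime) in self.values.items()}
        # Assumed behaviour: only application-lifetime values survive collection.
        self.values = {
            name: (value, lifetime)
            for name, (value, lifetime) in self.values.items()
            if lifetime == "application"
        }
        return payload

store = ToyPingStore()
store.record("search_count", 2)  # ping lifetime
first = store.collect()          # contains search_count
second = store.collect()         # nothing left, so no ping

store.record("default_browser", True, lifetime="application")
third = store.collect()          # contains default_browser
fourth = store.collect()         # still contains default_browser
```

Under these assumptions, once any application-lifetime metric is set, every scheduled collection yields a ping, whether or not the user did anything in between.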

Anyway, I'm satisfied with the answers there, thanks Felix for the investigation and for pointing us in the right direction!

Feel free to close this bug, as we have bug 1570647 for tracking the fix.

OK, great. Glad to have helped.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED