Closed Bug 1548819 Opened 5 years ago Closed 5 years ago

Compare telemetry data vs glean data in Firefox-tv

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: chutten)

References

Details

(Whiteboard: [telemetry:mobilesdk:m10] )

This bug is for validating the data reported by the legacy telemetry system and Glean, to make sure numbers are roughly in the same ballpark.

Depends on: 1535052
Whiteboard: [telemetry:mobilesdk:m10]
Priority: -- → P3
Depends on: 1552295

I know I shared this in other places, but putting it here to make it easier to find...

This is the existing telemetry dashboard for Firefox for FireTV:
https://sql.telemetry.mozilla.org/dashboard/firefox-fire-tv

Blocks: 1552884
Type: defect → task

According to the FFTV team, the release containing Glean has been pushed back:

AshleyT   [11 minutes ago]
Hi Travis! We initially planned to release this week, however, we had some concerns over quality of the changes so we had to hold back. That means staged rollout will start June 10th and we should be at 100% by June 12th
Depends on: 1550767

This can no longer block Fenix MVP because any code changes from this analysis would need to get into a-c by the 11th in order to make code freeze on the 14th.

No longer blocks: 1552884

Hey look, we have data now! I'll look into this in this iteration.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: P3 → P1

I think I might have some ideas about where to go here... what do you think, Su?

FFTV (Firefox for Fire TV) version 3.10 launched with both the Glean SDK and a legacy mobile Telemetry implementation. No part of FFTV used the Glean SDK beyond initialization, so it only sends "baseline" pings. Thus this comparison will limit itself to comparing those "baseline" pings to the closest legacy mobile Telemetry analogue, the "core" ping.

"core" pings and "baseline" pings are both sent when the application is backgrounded. This allows us to directly compare important things like:

  • client counts
  • ping counts
  • ping latency
  • ping duplicate counts
  • reported OS, device, and application types and versions
  • timezone and locale
  • geo-ip population distribution
  • application foreground duration

We can also fuzzily compare things like:

  • The "baseline" ping's first_run_date to the "core" ping's profileCreationDate (but only for new profiles created and applications first run since 3.10)

Things we can't look at are:

  • Search counts (because they aren't recorded in FFTV's "baseline" pings)

My plan is to use sql.tmo to generate the data cubes and then use Iodide for the report. Sound like a plan?
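The direct comparison above could be sketched roughly like this. Note this is an illustrative sketch only: the metric names and counts are hypothetical placeholders, not the actual sql.tmo schema or report numbers.

```python
# Sketch of the direct "core" vs "baseline" comparison described above.
# All metric names and counts are hypothetical placeholders, not report data.

def relative_difference(core_value: float, baseline_value: float) -> float:
    """Fractional difference of the "baseline" value relative to "core"."""
    return (baseline_value - core_value) / core_value

# Hypothetical weekly aggregates for the two ping types.
core = {"client_count": 100_000, "ping_count": 450_000, "duplicate_count": 120}
baseline = {"client_count": 98_200, "ping_count": 455_000, "duplicate_count": 900}

for metric in core:
    diff = relative_difference(core[metric], baseline[metric])
    flag = "in the same ballpark" if abs(diff) < 0.05 else "worth investigating"
    print(f"{metric}: {diff:+.2%} ({flag})")
```

The 5% threshold here is an arbitrary choice for the sketch; the report itself reasons about each metric narratively rather than against a fixed cutoff.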

Flags: needinfo?(shong)

I think this is an excellent plan, I wouldn't do anything differently.

I will also note, when other data sources (such as the different event pings in FFTV) are migrated to Glean, a similar validation should be performed[*], following the above gameplan.

[*] If data consistency is important to the data end user.

Flags: needinfo?(shong)

Report's up: https://stage.iodide.nonprod.dataops.mozgcp.net/notebooks/39/?viewMode=report

ni?Alessio for reviewing the report. Please feel free to include Data Science on reviewing the methodology (such as it is. It's really mostly just a narrative-driven exploration).

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #7)

Report's up: https://stage.iodide.nonprod.dataops.mozgcp.net/notebooks/39/?viewMode=report

ni?Alessio for reviewing the report. Please feel free to include Data Science on reviewing the methodology (such as it is. It's really mostly just a narrative-driven exploration).

This analysis looks great to me, thanks Chris! I have a couple of observations and requests but, overall, the approach looks solid to me and the narrative builds up nicely.

General recommendations

  • Would you kindly label your axes?
  • Would you kindly give a title to your plots?
  • Would you kindly always show the Y=0 in your plots (e.g. Android SDK version looks confusing otherwise)?

Device manufacturer and model difference

This is both interesting and odd. I've double checked the code and we collect the same data points. FFTV / service-telemetry only does a bit more truncation. Here's the Glean code for comparison.

When you say "The under-reporting by Glean of Amazon-AFTT can mostly be explained away by the under-reporting of clients overall... but only mostly explained.", what do you mean by "mostly" explained? Can you expand a bit?

Android SDK Version

"Of interest to me here is the number of excess "baseline" pings coming from Android 22. Maybe Glean's duplicates problem correlates with the Android SDK in use? (Exuent stage left, singing "Blame WorkManager" sung to the tune of "Blame Canada")"

This is very interesting. It might definitely be a bug in the WorkManager. Can you summarize the relevant stats in the paragraph? Also, the plot is a bit unclear here (I dropped my considerations in the general recommendations section).

Inferred Geo

Do you think the mismatch we see might have something to do with different lookup dataset being used on our end?
Glean uses the same code converted to Kotlin that's being used in FFTV

Other notes

  • Both the Glean SDK and FFTV's telemetry collect and generate their pings on on_stop. See the FFTV code. However service-telemetry does not use the WorkManager.
    Moreover, service-telemetry has a problem which might result in partial pings being written to disk and uploaded (thus potentially rejected from the ingestion).
  • Another slight difference is that we're using slightly different techniques for observing the application lifecycle, as androidx offers two APIs. We're using LifecycleObserver while FFTV is using lifecycle.Observer. This shouldn't make much difference, though.
Flags: needinfo?(alessio.placitelli) → needinfo?(chutten)

(In reply to Alessio Placitelli [:Dexter] from comment #8)

"Of interest to me here is the number of excess "baseline" pings coming from Android 22. Maybe Glean's duplicates problem correlates with the Android SDK in use? (Exuent stage left, singing "Blame WorkManager" sung to the tune of "Blame Canada")"

This is very interesting. It might definitely be a bug in the WorkManager. Can you summarize the relevant stats in the paragraph? Also, the plot is a bit unclear here (I dropped my considerations in the general recommendations section).

This is VERY possible, especially since WorkManager uses different mechanisms under the hood based on the SDK level, since different APIs are available. With an older SDK version, it would be using more legacy scheduling tools that may have different behavior or bugs associated with them than a newer SDK tool.

Both the Glean SDK and FFTV's telemetry collect and generate their pings on on_stop. See the FFTV code. However service-telemetry does not use the WorkManager.

Yes, service-telemetry does not use WorkManager, but WorkManager acts as a wrapper for the same JobScheduler that service-telemetry uses, though only on SDK versions 23+ where JobScheduler is available. Earlier SDK versions force WorkManager to use a custom AlarmManager + BroadcastReceiver implementation for API 14-22 (see here for more info). What is odd is that JobScheduler appears to have been added in SDK 21, but WorkManager opts to use a custom implementation rather than the JobScheduler until SDK 23...
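The backend selection described above can be modeled as a simple threshold function. This is Python purely for illustration (WorkManager itself is Java/Kotlin), and it is a model of the behavior described in this comment, not WorkManager source code.

```python
# A model (not WorkManager source) of the backend selection described above:
# API 23+ gets JobScheduler; API 14-22 gets the AlarmManager-based fallback.

def workmanager_backend(sdk_level: int) -> str:
    """Which scheduling backend WorkManager would use at this SDK level."""
    if sdk_level >= 23:
        return "JobScheduler"
    if sdk_level >= 14:
        # JobScheduler exists from SDK 21, but WorkManager keeps using its
        # AlarmManager + BroadcastReceiver fallback until SDK 23.
        return "AlarmManager + BroadcastReceiver"
    raise ValueError("WorkManager supports API levels 14 and above")

# SDK 22 (Android 5.1), where the excess "baseline" pings cluster, lands on
# the legacy fallback path; SDK 23 is the first level to use JobScheduler.
print(workmanager_backend(22))
print(workmanager_backend(23))
```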

(In reply to Alessio Placitelli [:Dexter] from comment #8)

(In reply to Chris H-C :chutten from comment #7)

Report's up: https://stage.iodide.nonprod.dataops.mozgcp.net/notebooks/39/?viewMode=report

ni?Alessio for reviewing the report. Please feel free to include Data Science on reviewing the methodology (such as it is. It's really mostly just a narrative-driven exploration).

This analysis looks great to me, thanks Chris! I have a couple of observations and requests but, overall, the approach looks solid to me and the narrative builds up nicely.

General recommendations

  • Would you kindly label your axes?
  • Would you kindly give a title to your plots?

Sorry, yes, that's totally laziness on my side. I thought the surrounding discussion would be enough for understanding, but that's no excuse.

  • Would you kindly always show the Y=0 in your plots (e.g. Android SDK version looks confusing otherwise)?

UGGGGGGGGHHHHH, it was so hard to get the two y axes' 0-lines to line up at all and now you want them visible?! That Android SDK plot is more trouble than it's worth. Should've just used a table... grumblegrumble

(( but as a result, I now better understand how the code works, so thank you for pushing me on this ))

Device manufacturer and model difference

This is both interesting and odd. I've double checked the code and we collect the same data points. FFTV / service-telemetry only does a bit more truncation. Here's the Glean code for comparison.

When you say "The under-reporting by Glean of Amazon-AFTT can mostly be explained away by the under-reporting of clients overall... but only mostly explained.", what do you mean by "mostly" explained? Can you expand a bit?

That actually appears to be an error. There are 1.3k fewer AFTT clients, and 1.8k fewer total clients (baseline-reported in the sample). It could be 100% explained by the under-reporting.

Maybe I was trying to balance the books using the 877 over-reported (by Glean) AFTM clients in my head. I'll take that sentence out.

Android SDK Version

"Of interest to me here is the number of excess "baseline" pings coming from Android 22. Maybe Glean's duplicates problem correlates with the Android SDK in use? (Exuent stage left, singing "Blame WorkManager" sung to the tune of "Blame Canada")"

This is very interesting. It might definitely be a bug in the WorkManager. Can you summarize the relevant stats in the paragraph? Also, the plot is a bit unclear here (I dropped my considerations in the general recommendations section).

Which relevant stats did you have in mind? Something about "With over 95% of the extra "baseline" pings coming from Android SDK 22..."?

Inferred Geo

Do you think the mismatch we see might have something to do with different lookup dataset being used on our end?
Glean uses the same code converted to Kotlin that's being used in FFTV

This isn't Inferred Geo, this is Locale, no?

I don't know enough about Kotlin/Java differences to have an informed opinion about what might be happening there. However, given that the difference is more-or-less echoed in Timezone differences I'd be suspicious about that. Though I suppose Timezone could be related to the client-reported Country for the locale...

Well, let's look at ja-JP: That one's weird no matter how we look at it. It's showing a difference in client reporting of locale and timezone and in server reporting of inferred geo. To me these clues are all pointing towards a population difference, because how else would Austria show up in our geoIP result and in reported locale?

I think I can make this a little more clear in the report. Specifically that "Inferred Geo" is server-side (it's a geo-IP lookup). I was expecting Inferred Geo to be different if we were doing it differently on the Server (should have Data Engineering look at this...), but I wasn't expecting it to agree with client-reported differences.

Other notes

  • Both the Glean SDK and FFTV's telemetry collect and generate their pings on on_stop. See the FFTV code. However service-telemetry does not use the WorkManager.
    Moreover, service-telemetry has a problem which might result in partial pings being written to disk and uploaded (thus potentially rejected from the ingestion).
  • Another slight difference is that we're using slightly different techniques for observing the application lifecycle, as androidx offers two APIs. We're using LifecycleObserver while FFTV is using lifecycle.Observer. This shouldn't make much difference, though.

Would either of these notes be something observable in the report? If you're positing that Glean delivery is more reliable than Telemetry delivery because of WorkManager, we'd see more pings and clients over the sample.
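The summary stat proposed above ("over 95% of the extra "baseline" pings coming from Android SDK 22...") could be computed along these lines. The per-SDK ping counts below are made-up placeholders, not numbers from the report.

```python
# Sketch of the "share of excess pings by SDK level" statistic.
# Per-SDK ping counts are hypothetical placeholders, not report data.

core_pings = {21: 5_000, 22: 40_000, 23: 30_000, 25: 20_000}
baseline_pings = {21: 5_100, 22: 48_000, 23: 30_200, 25: 20_100}

# Excess "baseline" pings relative to "core", per SDK level.
excess = {sdk: baseline_pings[sdk] - core_pings[sdk] for sdk in core_pings}
total_excess = sum(count for count in excess.values() if count > 0)

for sdk, extra in sorted(excess.items()):
    if extra > 0:
        print(f"SDK {sdk}: {extra / total_excess:.1%} of the excess pings")
```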

(In reply to Chris H-C :chutten from comment #10)

Device manufacturer and model difference

This is both interesting and odd. I've double checked the code and we collect the same data points. FFTV / service-telemetry only does a bit more truncation. Here's the Glean code for comparison.

When you say "The under-reporting by Glean of Amazon-AFTT can mostly be explained away by the under-reporting of clients overall... but only mostly explained.", what do you mean by "mostly" explained? Can you expand a bit?

That actually appears to be an error. There are 1.3k fewer AFTT clients, and 1.8k fewer total clients (baseline-reported in the sample). It could be 100% explained by the under-reporting.

Maybe I was trying to balance the books using the 877 over-reported (by Glean) AFTM clients in my head. I'll take that sentence out.

Ah, great, glad we have one less problem to understand at this time :-D

Android SDK Version

"Of interest to me here is the number of excess "baseline" pings coming from Android 22. Maybe Glean's duplicates problem correlates with the Android SDK in use? (Exuent stage left, singing "Blame WorkManager" sung to the tune of "Blame Canada")"

This is very interesting. It might definitely be a bug in the WorkManager. Can you summarize the relevant stats in the paragraph? Also, the plot is a bit unclear here (I dropped my considerations in the general recommendations section).

Which relevant stats did you have in mind? Something about "With over 95% of the extra "baseline" pings coming from Android SDK 22..."?

Yes, something like that.

Inferred Geo

Do you think the mismatch we see might have something to do with different lookup dataset being used on our end?
Glean uses the same code converted to Kotlin that's being used in FFTV

This isn't Inferred Geo, this is Locale, no?

Ah, good point. Yes, it's locale. I got confused, apologies.

I don't know enough about Kotlin/Java differences to have an informed opinion about what might be happening there. However, given that the difference is more-or-less echoed in Timezone differences I'd be suspicious about that. Though I suppose Timezone could be related to the client-reported Country for the locale...

Well, let's look at ja-JP: That one's weird no matter how we look at it. It's showing a difference in client reporting of locale and timezone and in server reporting of inferred geo. To me these clues are all pointing towards a population difference, because how else would Austria show up in our geoIP result and in reported locale?

I think I can make this a little more clear in the report. Specifically that "Inferred Geo" is server-side (it's a geo-IP lookup). I was expecting Inferred Geo to be different if we were doing it differently on the Server (should have Data Engineering look at this...), but I wasn't expecting it to agree with client-reported differences.

Maybe we should have a conversation with the Data Science team about this. Maybe specifically with the people that deal with FFTV data.

Other notes

  • Both the Glean SDK and FFTV's telemetry collect and generate their pings on on_stop. See the FFTV code. However service-telemetry does not use the WorkManager.
    Moreover, service-telemetry has a problem which might result in partial pings being written to disk and uploaded (thus potentially rejected from the ingestion).
  • Another slight difference is that we're using slightly different techniques for observing the application lifecycle, as androidx offers two APIs. We're using LifecycleObserver while FFTV is using lifecycle.Observer. This shouldn't make much difference, though.

Would either of these notes be something observable in the report?

I'm afraid not :(

If you're positing that Glean delivery is more reliable than Telemetry delivery because of WorkManager, we'd see more pings and clients over the sample.

Yes, you're right. I'd also add that the scale of the potential problem in service-telemetry couldn't possibly skew this anyway (it should only happen in rare circumstances).

Hey Emily!

Has any Data Scientist on your team ever looked at the Firefox for FireTV data? We're seeing some interesting things we'd like to understand a bit more. See Chris' analysis in comment 7 for more context (especially the locale/geo oddities).

Flags: needinfo?(ethompson)

Megan has in the past, and I was going to start poking at the data in Amplitude.

Want to create a bug for us to investigate? I can get someone to look at it next week possibly (unless you'd like it sooner!)

Flags: needinfo?(ethompson)
Flags: needinfo?(chutten)

(In reply to Emily Thompson from comment #13)

Megan has in the past, and I was going to start poking at the data in amplitude.

Want to create a bug for us to investigate? I can get someone to look at it next week possibly (unless you'd like it sooner!)

Thank you Emily! I filed bug 1564785 for this. This is not terribly urgent, next week or within the next 2 weeks sounds great :)

Please let us know if/when more information is required!

Hey Chris!

The analysis looks good to me. I think we can close this bug. The one last recommendation I'd have would be to summarize your findings in a comment on the bug, in case the iodide link stops working (or just to simplify discoverability). I filed bug 1564785 so that somebody with FireTV population expertise could give a second look at the oddities you found. Of course, I know you're a super curious person ;) Feel free to chime in there as you see fit!

Flags: needinfo?(chutten)

Conclusion

Glean's "baseline" ping shows very similar characteristics to Telemetry's "core" ping on Firefox for Fire TV. Both are sent at similar times and frequencies, both report similar features of their environment, and both measure time in roughly the same way. Glean seems to be getting the job done.

There is still the matter of the "duplicates problem", though we now serendipitously have some evidence that it is mostly restricted to Android SDK 22 (Android 5.1 Lollipop).

The biggest question marks are the cases where aspects of the environment that should be the same aren't. Everything from device manufacturer to locale to timezone shows surprising differences. However, the self-consistency of these differences suggests to me that Glean reports from a slightly different population (on the order of a hundredth of a percent of clients per week) than Telemetry does.

Specifically, Glean seems to hear more from clients in Japan, Britain, Mexico, and Barbados, but from fewer clients overall.

Given how resilient our analyses to date have been to garbage data, this seems like situation normal to me. A more focused investigation (perhaps ignoring Android SDK 22 users?) might turn up more information, but the differences are small enough that I have no trouble recommending the use of Glean "baseline" pings as a replacement for Mobile Telemetry "core" pings in Firefox for Fire TV.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(chutten)
Resolution: --- → FIXED