Closed Bug 1552507 Opened 5 years ago Closed 5 years ago

Validate the 'metrics' ping in Fenix

Categories

(Data Platform and Tools :: Glean: SDK, task, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Dexter, Assigned: chutten)

References

Details

The 'metrics' ping will carry the bulk of metrics and has a very specific schedule. This bug is about verifying that it behaves as expected.

Documentation for the ping is here

Assignee: nobody → chutten
Priority: -- → P1

Testable parts of the docs seem to be:

  • What distribution do measurement window lengths take? They shouldn't be shorter than 24 hours outside of timezone changes. (measurement windows end at 4am local)
  • Do measurement windows overlap?
  • Do we see pings of both the "was just started" and "was open at 4AM" varieties? In what proportion?
  • Meta-metrics: what glean.error-category metrics do we see and at what frequencies?
  • Cross-comparison: do the proportions of client_info and ping_info fields seem reasonable compared to those seen in "baseline" pings? (e.g. do we see the same proportion of seq dupes/holes? Do we see the same client count?)
  • Ping delay and clock skew. I'd expect the delay between ping_info.end_time and metadata/Timestamp to be very short.
  • Should probably also do some light validation on some of Fenix's metrics

Need anything else?

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #1)

Need anything else?

This sounds about right. I'm not sure how complex that is, but I'd also keep an eye out for our edge cases.

Flags: needinfo?(alessio.placitelli)

In my preliminary population analysis (i.e., "how many pings and clients are we even looking at here?") I noticed that there were fewer "metrics" pings than "baseline" pings, which was expected.

The client count was also much lower, which wasn't expected.

The client count curves for "baseline" and "events" pings are very close to each other, but "metrics"... well, it's doing its own thing.

To me, "metrics" looks a little like an adoption curve. It's as though there was only recently a build that started sending them and it's still being picked up by our users. If this is true, a validation analysis at this time would probably just need to be repeated to ensure there wasn't weird adoption selection effects in the population I was analyzing.

Alessio, do you know if "metrics" pings are newly-sent from Fenix builds for whatever reason? Should I hold off on validation at least until the population levels off?

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #3)

To me, "metrics" looks a little like an adoption curve. It's as though there was only recently a build that started sending them and it's still being picked up by our users. If this is true, a validation analysis at this time would probably just need to be repeated to ensure there wasn't weird adoption selection effects in the population I was analyzing.

Alessio, do you know if "metrics" pings are newly-sent from Fenix builds for whatever reason? Should I hold off on validation at least until the population levels off?

This is a bit concerning. I agree with you: the curves look like adoption curves. However, some non-event metrics were added as long as 2 months ago, so this doesn't match up with the "adoption" theory. Unless there was some bug in Glean or Fenix that prevented pings from being generated and was recently fixed.

Another thing we could check would be to see if we're getting rejected at ingestion due to some schema mismatch. But looking at the ingestion stats for both the events and the metrics, it really looks like we're not sending stuff.

Flags: needinfo?(alessio.placitelli)

Travis, Mike: thoughts on comment 3 and comment 4?

As a side note, I filed bug 1554860 for looking into this.

Flags: needinfo?(tlong)
Flags: needinfo?(mdroettboom)

(In reply to Chris H-C :chutten from comment #3)

Alessio, do you know if "metrics" pings are newly-sent from Fenix builds for whatever reason? Should I hold off on validation at least until the population levels off?

One other thing to keep in mind is that the metrics ping is not sent when going to background (which is a frequent operation), but rather sent on a schedule or when starting. So we expect to see fewer pings there, but not fewer clients. Unless there's a bug, of course.

Cross-posting this from Slack:

Could the difference in numbers be due to Fenix only just recently adding things that go into the metrics ping? I think they started off with a lot of events and only added metrics later.

Also, could it be affected by the fact that we may not always send a metrics ping if we didn't record any new data?

Flags: needinfo?(tlong)

(In reply to Travis Long from comment #7)

Cross-posting this from Slack:

Could the difference in numbers be due to Fenix only just recently adding things that go into the metrics ping? I think they started off with a lot of events and only added metrics later.

I had the same thought, but the first non-event metric was added 2 months ago: https://github.com/mozilla-mobile/fenix/blame/master/app/metrics.yaml#L315

Also, could it be affected by the fact that we may not always send a metrics ping if we didn't record any new data?

If we're talking about raw ping numbers, yes. If we're talking about clients, I'm not sure.

Just clearing needinfo -- I don't think I have anything to add.

Flags: needinfo?(mdroettboom)

Given the timing of Whistler and the fixes in bug 1556684, this is pushed to the week after Whistler.

Status: NEW → ASSIGNED
Priority: P1 → P2
Depends on: 1554860
Priority: P2 → P1

New doc location: https://mozilla.github.io/glean/book/user/pings/metrics.html

Scope

Let's look at "metrics" pings from the week of July 1 to July 7 (submission_date_s3 >= "20190701" AND submission_date_s3 < "20190708"). I expect some dampening on US and CA pings given the national holidays in that week, but that shouldn't matter too much.
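
In case it helps reproduce the sample, here's a minimal sketch of the selection, assuming the week's pings have already been loaded into a pandas DataFrame named pings with string submission_date_s3 and client_id columns (the names are my assumption of the flattened layout):

```python
# `pings` is assumed to be a flattened pandas DataFrame of "metrics" pings with a
# string submission_date_s3 column in YYYYMMDD form and a client_id column.
sample = pings[
    (pings["submission_date_s3"] >= "20190701")
    & (pings["submission_date_s3"] < "20190708")
]
print(f"{len(sample):,} pings from {sample['client_id'].nunique():,} clients")
```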

Ping and Client Counts

The sample comprises 560,415 "metrics" pings received from 102,537 clients. Over the same period, we received 3,339,626 "baseline" pings from 94,993 clients and 2,188,659 "events" pings from 95,711 clients. That's about 7% more clients reported over the week by "metrics" pings than by "events" pings.

Looking at these numbers per day, the differences are even more striking: we hear from at least 27% more clients via "metrics" pings per day over the sample than we hear from via "baseline" or "events" pings.

(( For fun I also plotted them per hour so we can see the time-of-day effects caused by how the "metrics" ping is scheduled. ))
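
The per-day comparison was done along these lines (a sketch, assuming one flattened DataFrame per ping type over the same week, each with client_id and submission_date_s3 columns; the frame names are made up):

```python
import pandas as pd

def daily_clients(df: pd.DataFrame, label: str) -> pd.Series:
    """Distinct clients per submission day for one ping type."""
    return df.groupby("submission_date_s3")["client_id"].nunique().rename(label)

# metrics_df, baseline_df, and events_df are assumed to hold the week's pings
# for each ping type, flattened the same way as the sample above.
daily = pd.concat(
    [
        daily_clients(metrics_df, "metrics"),
        daily_clients(baseline_df, "baseline"),
        daily_clients(events_df, "events"),
    ],
    axis=1,
)
# How much larger the daily "metrics" client count is than the larger of the other two.
daily["metrics_excess"] = daily["metrics"] / daily[["baseline", "events"]].max(axis=1) - 1
print(daily)
```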

Sequence holes and duplicates

Over the entire sample we received 1,664 duplicate pings. That's about 0.3% of the pings under study, which is a little lower than the dupe rate seen for "baseline" pings.

Client-wise, these dupes came from 1,385 clients, which makes up 1.35% of the clients under study.
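
The dupe check here is the usual one: count every extra submission of the same document_id (a sketch, assuming each row in the sample carries a document_id column):

```python
# Each submission beyond the first with the same document_id counts as a duplicate.
per_doc = sample.groupby("document_id").size()
n_dupes = int((per_doc - 1).sum())
duped_docs = per_doc[per_doc > 1].index
n_duped_clients = sample.loc[sample["document_id"].isin(duped_docs), "client_id"].nunique()
print(f"{n_dupes} duplicate pings from {n_duped_clients} clients")
```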

"metrics" Ping Scheduling

The "metrics" ping is sent whenever there is anything to send at any time the application is opened after 4AM local (client's clock). If the app is open across the 4AM boundary, it is sent at 4AM. If the app isn't running at 4AM but there's something recorded from before 4AM, then whenever the app is next used it is sent immediately.

So we can look at the end_time to determine roughly how often the app is open and running at 4AM... which turns out to be quite a lot of the time.

Over half (54%) of all "metrics" pings are sent in the 4AM hour (and, judging by the slightly-inflated numbers in the 3AM hour, there might be a few hundred more pings that are just a rounding error away from being included).

This is larger than I'd expected. I'd have thought it much more likely that this distribution would follow user activity patterns. Maybe Fenix processes live longer than I expected.

Edge Cases

The docs call out five specific scheduling examples. We should see which ones we can distinguish by looking at the data.

Example 1 "Crossing due time with the application closed" is covered already by the 46% of pings sent not in the 4AM hour. Example 2 "Crossing due time and changing timezones" we can find evidence by looking for pings not sent at 4AM and having different timezones for start_time and end_time. Example 3 "The application doesn’t run in a week" is just an extreme case of Example 1 (and "doesn't run in a week" is incompatible with the length of the sample period) so we'll skip it. Example 4 "The application doesn’t run for a week, and when it’s finally re-opened the timezone has changed" ditto. Example 5 "The user changes timezone in an extreme enough fashion that they cross 4AM twice on the same date" does not seem like it's something we could detect from the submitted data. It seems like a special case of Example 2, but I'm not sure how to identify them specifically.

So let's focus on Example 2. As we can see from the scheduling plot, the case where start_time and end_time do not have matching timezones is rare, but it does happen about 0.001% of the time each hour... except at 4AM. We'd expect there to be 30 or so pings in that hour, but there are 0.
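
The check itself is just comparing the UTC offsets embedded in the two timestamps; a sketch on the sample DataFrame, assuming start_time/end_time are kept as the raw ISO 8601 strings with trailing offsets (e.g. "2019-07-05T04:00-05:00"):

```python
def utc_offset(ts: str) -> str:
    """Pull the trailing UTC offset (e.g. '-05:00') out of an ISO 8601 timestamp."""
    return ts[-6:]

sample["tz_changed"] = (
    sample["start_time"].map(utc_offset) != sample["end_time"].map(utc_offset)
)
sample["end_hour"] = sample["end_time"].str[11:13].astype(int)  # local wall-clock hour

# Example 2 candidates: timezone changed within the window, broken out by closing hour.
print(sample.groupby("end_hour")["tz_changed"].mean())
```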

This could be weak evidence of a bug in the scheduling (because I would expect clients who have changed timezones to be just as likely as clients who haven't to hit the 4AM scheduling point).

Measurement Window Lengths

end_time - start_time should be 0-ish a lot of the time, thanks to all the new installs that happened during the week. Aside from those, what predictions can we make from the docs about what this should look like?

I originally thought we wouldn't see anything with durations less than 24h, but that was predicated on the idea that long-running Fenix sessions didn't exist. Given that over half of these pings were sent at 4AM local, I'm hypothesizing more than half of these pings will come in under or around 24h in length.
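
For reference, the window lengths were computed along these lines (a sketch; both ends are converted to UTC so the subtraction reflects elapsed time rather than wall-clock differences):

```python
import pandas as pd

# Normalize both timestamps to UTC so the difference is real elapsed time.
start_utc = pd.to_datetime(sample["start_time"], utc=True)
end_utc = pd.to_datetime(sample["end_time"], utc=True)
sample["window_hours"] = (end_utc - start_utc).dt.total_seconds() / 3600.0

# Rough shape of the distribution: ~0-length, ~a day, two days, or longer.
bins = [-0.01, 1, 25, 49, float("inf")]
print(pd.cut(sample["window_hours"], bins=bins).value_counts().sort_index())
```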

And though I'm not wrong, it's still a little weird.

Look at where all those 320k 4AM pings are on the plot. They're under 24h, and an overwhelming half of the 0-hour-duration-windows are closed at exactly 4AM? What's going on there?

And those lovely 2, 3, 4-day measurement windows... how are they only being closed at 4AM more than a day later? Can Fenix be asleep for the first 4AM but awake for the next? Or am I misunderstanding how "metrics" pings are scheduled (which is easy for me to do, apparently. Can't tell if that's because the scheduling is complex or because I'm just not getting it) and this is all perfectly explicable?

Measurement Window Overlaps

This was a "fun" one to write a query for. 757 pings (0.135% of the sample) report an end_time after the following ping's start_time. Surprisingly only 23 are only overlapping by 1 minute, leaving 734 pings overlapping their following pings by more than just a rounding/timing error.

I'm pretty much willing to chalk this up to the usual nonsense we receive. Nothing here indicates to me any problems with how Glean is recording or reporting the timestamps for the measurement windows.


That's it for today. Tomorrow we actually look at some metrics: glean.error.{invalid_value|invalid_label} and anything that catches my fancy from Fenix's metrics. And then the Conclusion and my Recommendations.

"metrics" Metrics

I looked at a bunch of things. Come look at what I found.

First, in Glean Error Metrics:

  • 5% of pings contain glean.error.invalid_value
    • Most of those (74%) are invalid values for glean.baseline.duration
    • The other sizeable chunk (25.6%) are invalid values for search.default_engine.submission_url
    • (and there are some others, but only 1.5% of pings have 'em so I didn't look much closer)
    • (( Yes, 74 + 25.6 + 1.5 > 100. This is because some pings have more than one kind of invalid value. ))
  • 0.36% of pings contain glean.error.invalid_label
    • All of them are because of metrics.search_count

This means that we have a strong signal that there's a bug in Glean around durations (which we already kinda/sorta knew because of bug 1525600), and there's signal that the search metrics aren't using Glean properly (or they are relying on our error behaviour as part of everyday collection).
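
For those following along, the tallies above came from something shaped like this (a sketch; I'm assuming the labeled counter glean.error.invalid_value has been flattened to an invalid_value column holding a {metric_name: count} dict per ping, which is not the literal parquet layout):

```python
def names_metric(errors, metric_name):
    """True if this ping's glean.error.invalid_value names the given metric."""
    return isinstance(errors, dict) and metric_name in errors

has_invalid = sample["invalid_value"].map(lambda e: isinstance(e, dict) and len(e) > 0)
print(f"{has_invalid.mean():.1%} of pings contain glean.error.invalid_value")

# A ping can name more than one metric, which is why these shares sum past 100%.
for metric in ("glean.baseline.duration", "search.default_engine.submission_url"):
    share = sample.loc[has_invalid, "invalid_value"].map(lambda e: names_metric(e, metric)).mean()
    print(f"  {share:.1%} of those name {metric}")
```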

In Fenix Metrics, I took a broad survey of metrics I found in the sample. This wasn't complete, and since I don't know how these metrics are supposed to be collected, some of these points may not be terribly useful as they're artefacts of the collection. Unsurprisingly, most of these metrics are about search:

  • 30% of pings have search counts of any kind (suggestion or action)
    • 78% of the pings showed search activity from Google
    • 18% activity from DuckDuckGo
    • 5.3% from others
  • 0 pings showed clients with custom default search engines (search.default_engine.name set to custom)
  • 60% of pings came from clients with more than 1 (Fenix) Mozilla product installed (length of metrics.mozilla_products greater than 1)
  • 66% of pings came from clients with a non-mozilla browser as default (metrics.default_moz_browser present, but set to '')
  • 25% of pings reported metrics.default_moz_browser of org.mozilla.fenix
    • There are 0.03% more pings reporting metrics.default_browser as true than there are pings reporting metrics.default_moz_browser of org.mozilla.fenix
    • Just under 1% of pings have metrics.default_browser missing or set to NULL

The disagreement between metrics.default_moz_browser and metrics.default_browser is interesting to me even though it's small. Also, why do metrics.default_browser and metrics.default_moz_browser have "ping" lifetimes? Shouldn't they be "application" if we actually want them in every ping? It certainly seems like they're wanted in every ping. I'd ask the owner, but fenix-core@moz isn't a person. At least they expire.
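
The cross-check behind that comparison is simple enough (a sketch, assuming the two metrics are flattened to default_browser (boolean) and default_moz_browser (string) columns):

```python
is_default = sample["default_browser"].fillna(False).astype(bool)
is_fenix_default = sample["default_moz_browser"] == "org.mozilla.fenix"

print(f"default_browser == true:                  {is_default.mean():.2%} of pings")
print(f"default_moz_browser == org.mozilla.fenix: {is_fenix_default.mean():.2%} of pings")
print(f"pings where the two disagree:             {(is_default != is_fenix_default).mean():.2%}")
```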

Conclusions

The "metrics" ping appears to continue the trend of other Glean pings by being reliable and internally-consistent. It also continues the trend of having a small but noticeable duplicates issue, and weird strangeness that will no doubt make sense in time, but not today.

Like, for instance, how this sample contains 7% more clients if you measure by "metrics" pings instead of "events" pings. And if you move that to per-day, that's 27% more clients (suggesting that "metrics" pings are more likely to come from the same clients over multiple days, while "events" and "baseline" pings come from different clients over those days).

And the scheduling has two mysteries: Why so many at 4AM, and why do none of them have different timezones on start_time and end_time?

Measurement Windows are distributed a little too heavily toward 0-length for my taste, especially since half of them have an end_time in the 4AM hour.

And as we now finally look at internal Glean error metrics, we notice that there is a problem with glean.baseline.duration and we may have an opportunity to educate the owners of search.default_engine.submission_url and metrics.search_count on how to avoid errors in recording (and to reach out in case this might lead to errors in measurement).

Recommendations

  1. Investigate the weird DAU/WAU discrepancy between "metrics" pings and "baseline"/"events" pings. It might be nothing (an artefact of how things are scheduled, perhaps), but it's big enough to take seriously.
  2. Nail down "the duplicates problem". It didn't lead to too much uncertainty in this analysis, but I'm getting tired of looking over my shoulder every time because it might be there.
  3. Dive deeper into the 4AM-ending measurement windows. Why are there so many of them? Why are so many of them 0-length?
  4. Investigate what might be going on with glean.baseline.duration that's causing it to error out so often. I expected this rate to be well below 1%.
  5. Reach out to the owners of Fenix's search probes about how two of them are erroring out some of the time. It might indicate an error in measurement (which would be bad, so we should let them know) or it might just be noise (in which case it would affect analysis, so we should let them know).

At the very least, I think #1 and #3 should be investigated and explained before these pings can be considered "ready".


Alessio, your turn. What do you think? Did I miss anything?

Flags: needinfo?(alessio.placitelli)

(In reply to Chris H-C :chutten from comment #12)

I looked at a bunch of things. Come look at what I found.

First, in Glean Error Metrics:

  • 5% of pings contain glean.error.invalid_value
    • Most of those (74%) are invalid values for glean.baseline.duration

This is interesting. We know that, when using the debug view and triggering a 'baseline' ping manually, we might not get a duration. However, this seems a bit too frequent (5% of all the pings for a week!). The new Glean SDK pings all hit the same endpoint, even if they are tagged as debug. I thought that pings tagged as debug would not show up in 'the real data'. With that said:

  • we need to make sure 'debug view' pings are not in the sample for this analysis;
  • if they are, we need a way to tell them apart from real data pings.
  • we report errors both in the 'metrics' ping and in the ping the errors originated in. Would you kindly cross-check that the figures for 'glean.error.invalid_value' check out against the 'baseline' pings coming from the same dates/same version?
    • The other sizeable chunk (25.6%) are invalid values for search.default_engine.submission_url

Looks like Fenix is sending very long URLs for submission_url. This is weird, given that all the expected values should be <= 76 characters (see bug 1552905).

This should be checked with the Fenix team.

    • (and there are some others, but only 1.5% of pings have 'em so I didn't look much closer)
    • (( Yes, 74 + 25.6 + 1.5 > 100. This is because some pings have more than one kind of invalid value. ))
  • 0.36% of pings contain glean.error.invalid_label
    • All of them are because of metrics.search_count

This is interesting: an invalid label for metrics.search_count suggests they are using a label that either fails our regex or is way too long.

This should be checked with the Fenix team.

In Fenix Metrics, I took a broad survey of metrics I found in the sample. This wasn't complete, and since I don't know how these metrics are supposed to be collected, some of these points may not be terribly useful as they're artefacts of the collection. Unsurprisingly, most of these metrics are about search:

...
  • Just under 1% of pings have metrics.default_browser missing or set to NULL

This is very weird. Reminds me of bug 1525600, except that it's a different type.

The disagreement between metrics.default_moz_browser and metrics.default_browser is interesting to me even though it's small. Also, why do metrics.default_browser and metrics.default_moz_browser have "ping" lifetimes? Shouldn't they be "application" if we actually want them in every ping? It certainly seems like they're wanted in every ping. I'd ask the owner, but fenix-core@moz isn't a person. At least they expire.

This should be checked with the Fenix team.

Conclusions

Like, for instance, how this sample contains 7% more clients if you measure by "metrics" pings instead of "events" pings. And if you move that to per-day, that's 27% more clients (suggesting that "metrics" pings are more likely to come from the same clients over multiple days, while "events" and "baseline" pings come from different clients over those days).

This is a nice hypothesis!

And the scheduling has two mysteries: Why so many at 4AM, and why do none of them have different timezones on start_time and end_time?

Did we expect many users to cross timezones at 4AM? It might be worth checking in with Felix about this.

Measurement Windows are distributed a little too heavily toward 0-length for my taste, especially since half of them have an end_time in the 4AM hour.

Yes, this is particularly concerning to me. I wonder if there's some bug in glean-ac around figuring out the 'start_time'. I'm a bit curious about the distribution of sequence numbers for the chunk of pings with 0-length windows and an end_time at 4AM. If they are mostly at seq=0, this might indicate a problem in collecting the first "start_time".
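
(A quick way to check that, sketched against the sample DataFrame and the window_hours/end_hour columns from the analysis sketches above:)

```python
# Distribution of sequence numbers among the 0-length, 4AM-ending pings.
suspect = sample[(sample["window_hours"] < 1) & (sample["end_hour"] == 4)]
print(suspect["seq"].value_counts().head(10))
print(f"share at seq == 0: {(suspect['seq'] == 0).mean():.1%}")
```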

Recommendations

  1. Investigate the weird DAU/WAU discrepancy between "metrics" pings and "baseline"/"events" pings. It might be nothing (an artefact of how things are scheduled, perhaps), but it's big enough to take seriously.
  2. Nail down "the duplicates problem". It didn't lead to too much uncertainty in this analysis, but I'm getting tired of looking over my shoulder every time because it might be there.
  3. Dive deeper into the 4AM-ending measurement windows. Why are there so many of them? Why are so many of them 0-length?
  4. Investigate what might be going on with glean.baseline.duration that's causing it to error out so often. I expected this rate to be well below 1%.
  5. Reach out to the owners of Fenix's search probes about how two of them are erroring out some of the time. It might indicate an error in measurement (which would be bad, so we should let them know) or it might just be noise (in which case it would affect analysis, so we should let them know).

At the very least, I think #1 and #3 should be investigated and explained before these pings can be considered "ready".


Alessio, your turn. What do you think? Did I miss anything?

Thank you for this great, in-depth analysis. Excellent work, as usual. I will file follow-up bugs for the issues you've discovered. I left a few comments above, but it might make sense to dig into these in their respective bugs and call this analysis complete.

Flags: needinfo?(alessio.placitelli)

(In reply to Alessio Placitelli [:Dexter] from comment #13)

(In reply to Chris H-C :chutten from comment #12)

I looked at a bunch of things. Come look at what I found.

First, in Glean Error Metrics:

  • 5% of pings contain glean.error.invalid_value
    • Most of those (74%) are invalid values for glean.baseline.duration

This is interesting. We know that, when using the debug view and triggering a 'baseline' ping manually, we might not get a duration. However, this seems a bit too frequent (5% of all the pings for a week!). The new Glean SDK pings all hit the same endpoint, even if they are tagged as debug. I thought that pings tagged as debug would not show up in 'the real data'. With that said:

  • we need to make sure 'debug view' pings are not in the sample for this analysis;
  • if they are, we need a way to tell them apart from real data pings.
  • we report errors both in the 'metrics' ping and in the ping the errors originated in. Would you kindly cross-check that the figures for 'glean.error.invalid_value' check out against the 'baseline' pings coming from the same dates/same version?

Hey Arkadiusz, what do you think of the above? Are debug view pings getting into the same dataset as non-debug ones?

Flags: needinfo?(akomarzewski)

(In reply to Chris H-C :chutten from comment #12)

Recommendations

  1. Investigate the weird DAU/WAU discrepancy between "metrics" pings and "baseline"/"events" pings. It might be nothing (an artefact of how things are scheduled, perhaps), but it's big enough to take seriously.

I filed bug 1566484 for this, asking Data Science for help.

  2. Nail down "the duplicates problem". It didn't lead to too much uncertainty in this analysis, but I'm getting tired of looking over my shoulder every time because it might be there.

We have bug 1547234 for this. This will become a priority right after the GeckoView+Glean work.

  3. Dive deeper into the 4AM-ending measurement windows. Why are there so many of them? Why are so many of them 0-length?

I filed bug 1566488.

  4. Investigate what might be going on with glean.baseline.duration that's causing it to error out so often. I expected this rate to be well below 1%.

We might be able to clear this as part of this bug. See comment 14.

  5. Reach out to the owners of Fenix's search probes about how two of them are erroring out some of the time. It might indicate an error in measurement (which would be bad, so we should let them know) or it might just be noise (in which case it would affect analysis, so we should let them know).

Doing this right now :)

(In reply to Alessio Placitelli [:Dexter] from comment #15)

  5. Reach out to the owners of Fenix's search probes about how two of them are erroring out some of the time. It might indicate an error in measurement (which would be bad, so we should let them know) or it might just be noise (in which case it would affect analysis, so we should let them know).

Doing this right now :)

https://github.com/mozilla-mobile/fenix/issues/4082

(In reply to Alessio Placitelli [:Dexter] from comment #14)

Hey Arkadiusz, what do you think of the above? Are debug view pings getting into the same dataset as non-debug ones?

Yes, they are, that was one of our requirements.

As for telling them apart from the real pings - we can do this, but only on GCP because headers aren't stored in direct-to-parquet tables on AWS (org_mozilla_fenix_metrics_parquet).

I ran the following query to see how many debug metrics pings we got in July (there was 1):
https://sql.telemetry.mozilla.org/queries/63651/source

(In reply to Alessio Placitelli [:Dexter] from comment #15)

(In reply to Chris H-C :chutten from comment #12)

Recommendations

  4. Investigate what might be going on with glean.baseline.duration that's causing it to error out so often. I expected this rate to be well below 1%.

We might be able to clear this as part of this bug. See comment 14.

Since this is a bug, I filed bug 1566757 for it.

(In reply to Arkadiusz Komarzewski [:akomar] from comment #17)

(In reply to Alessio Placitelli [:Dexter] from comment #14)

Hey Arkadiusz, what do you think of the above? Are debug view pings getting into the same dataset as non-debug ones?

Yes, they are, that was one of our requirements.

As for telling them apart from the real pings - we can do this, but only on GCP because headers aren't stored in direct-to-parquet tables on AWS (org_mozilla_fenix_metrics_parquet).

I ran the following query to see how many debug metrics pings we got in July (there was 1):
https://sql.telemetry.mozilla.org/queries/63651/source

Thank you, Arkadiusz. This suggests that there's indeed a bug in the Glean SDK.

Flags: needinfo?(akomarzewski)
Depends on: 1566484, 1547234, 1566488
No longer depends on: 1554860
Depends on: 1566764
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED