Closed Bug 1525603 Opened 6 years ago Closed 5 years ago

Validate incoming "baseline" ping data after fixes have landed

Categories

(Toolkit :: Telemetry, enhancement, P1)

Points:
2

Tracking


RESOLVED FIXED
Tracking Status
firefox67 --- affected

People

(Reporter: Dexter, Assigned: chutten)

References

Details

(Whiteboard: [telemetry:mobilesdk:m7])

+++ This bug was initially created as a clone of Bug #1520182 +++

Bug 1520182 provided an initial report of the data reported by glean and identified a few issues. We should run the analysis again once all blocker bugs have landed or been fixed.

Depends on: 1525578
Depends on: 1525600
Depends on: 1508965
Depends on: 1525045
Depends on: 1525540
Priority: P1 → P2
Depends on: 1534309
Assignee: nobody → chutten
Status: NEW → ASSIGNED
Points: --- → 2
Priority: P2 → P1
Depends on: 1541084
Whiteboard: [telemetry:mobilesdk:m6] → [telemetry:mobilesdk:m7]

Time to give "baseline" pings another look.

Scope

The queries are limited to pings coming from builds after April 11 (app_build > '11010000', i.e., builds containing the client_id and first_run_date fix) received between April 12 and 24.

This is about 6200 pings and 520 clients.
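Roughly, that scope filter amounts to something like the following (a minimal PySpark sketch, not the actual notebook; the table name is the parquet dataset mentioned in the Delay section below, and the column names app_build, submission_date_s3, and client_id are assumptions about its schema):

```python
# Sketch of the scope filter: builds after April 11 (app_build > '11010000'),
# received between April 12 and 24. Column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

baseline = (
    spark.table("org_mozilla_reference_browser_baseline_parquet")
    .filter(F.col("app_build") > "11010000")
    .filter(F.col("submission_date_s3").between("20190412", "20190424"))
)

baseline.agg(
    F.count("*").alias("pings"),                    # ~6200
    F.countDistinct("client_id").alias("clients"),  # ~520
).show()
```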

First things first, let's see what's changed since last time.

Ping and Client Counts

Aggregate

We're still seeing about 100 DAU and in the range of 500-600 pings per day. Nothing much has changed since March, after the Fenix call for testers.

This is a limited population, but that's what we have to work with so let's get to it.

Per-client, Per-day

Nothing much to say here. We're looking at far fewer clients than before, and they appear to be more dedicated: slightly elevated pings-per-client and average-pings-per-day rates. We seem to be mostly past the "one and done" clients who pop in for a single session and are never seen again (though there are still a few of those in the population), and we see fewer outrageous pings-per-day numbers, though a higher proportion of clients with outrageously-high pings-per-client counts (best guess: testing profiles).
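(The per-client, per-day figures boil down to a grouping along these lines; a sketch where `baseline` is the scoped DataFrame from the Scope sketch above and the column names remain assumptions.)

```python
# Pings per client per day, then per-client totals and averages (assumed schema).
from pyspark.sql import functions as F

per_client_day = (
    baseline  # scoped DataFrame from the Scope sketch
    .groupBy("client_id", "submission_date_s3")
    .agg(F.count("*").alias("pings_that_day"))
)

per_client = per_client_day.groupBy("client_id").agg(
    F.sum("pings_that_day").alias("total_pings"),
    F.countDistinct("submission_date_s3").alias("active_days"),
    F.avg("pings_that_day").alias("avg_pings_per_day"),
)

per_client.describe("total_pings", "avg_pings_per_day").show()
```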

Sequence Numbers

Distribution

Now that's what I'm talking about. Look at those sequence numbers distribute... there are two clear cohorts: the old guard with seq above 480, and then a wealth of fresher clients mostly below 200.

And if you flip over to #pings - #clients we see that there are a few more dupes of {client_id, seq} pairs. The only pattern here seems to mimic the overall population distribution from the first graph (there are more dupes where there are more users sending more pings). This means it's unlikely that there's an underlying bias causing dupes to happen at different parts of the sequence (i.e., we're just as likely to see these dupes from deep in a long-sequence client's lifecycle as we are from the first seq of a fresh client). Seems to be random.
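(For reference, the "#pings - #clients" view reduces to counting repeated {client_id, seq} pairs, roughly as below; seq is an assumed column name for the ping sequence number.)

```python
# Duplicate {client_id, seq} pairs: any pair received more than once (assumed schema).
from pyspark.sql import functions as F

dupes = (
    baseline  # scoped DataFrame from the Scope sketch
    .groupBy("client_id", "seq")
    .agg(F.count("*").alias("n"))
    .filter(F.col("n") > 1)
)

dupes.agg(
    F.count("*").alias("duped_pairs"),
    F.sum(F.col("n") - 1).alias("extra_pings"),
).show()
```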

Holes

We're still seeing holes, but far fewer of them. And now we're starting to see dupes. Overall, though, the combined holes + dupes rate is lower than the holes rate was last time.

Still a bit high, though. 59 clients in the population of 523 had dupes or holes in their sequence record (11.3%, of which 8.6% are dupes). And these numbers are what remain after ingest's deduping has already reduced them.
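(For concreteness, the per-client hole/dupe classification can be sketched as below; this is a rough reconstruction under the same schema assumptions, not the exact query.)

```python
# Per client: a "hole" means the seq range spans more values than we received;
# a "dupe" means we received more pings than distinct seq values (assumed schema).
from pyspark.sql import functions as F

per_client_seq = (
    baseline  # scoped DataFrame from the Scope sketch
    .groupBy("client_id")
    .agg(
        F.count("*").alias("pings"),
        F.countDistinct("seq").alias("distinct_seqs"),
        (F.max("seq") - F.min("seq") + 1).alias("expected_seqs"),
    )
    .withColumn("has_dupes", F.col("pings") > F.col("distinct_seqs"))
    .withColumn("has_holes", F.col("expected_seqs") > F.col("distinct_seqs"))
)

per_client_seq.filter("has_dupes OR has_holes").count()  # 59 of 523 clients
```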

I hope it's because of population effects (we don't exactly have a lot of clients running &browser, so it's not unreasonable to expect outsized effects of rare and weird clients). To look into this hypothesis we'd probably want to take a look at Fenix (which at the very least I use like a normal browser so should generate reasonable-looking data).

Field Compositions

Now, it's not quite as easy to compare durations' distributions because the bucket layout of NUMERIC_HISTOGRAM isn't stable. However, we can tell that it's still an exponential distribution with a heavy emphasis on short (under 14s) durations. There are fewer high values in the thousands-of-seconds (tens of minutes) range. Seems reasonable to me.

Also, we're starting to see far more null durations: 140 out of the ~6200 pings, or about 2%. Zero-length durations are up there as well, at 2.5%.
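(Those rates fall out of a simple aggregation like the one below; duration is an assumed column name, recorded in seconds.)

```python
# Fraction of pings with null or zero duration (assumed schema).
from pyspark.sql import functions as F

total = baseline.count()  # scoped DataFrame from the Scope sketch
nulls = baseline.filter(F.col("duration").isNull()).count()
zeros = baseline.filter(F.col("duration") == 0).count()
print(f"null: {nulls / total:.1%}, zero: {zeros / total:.1%}")
```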

On the plus side, the other fields (os and os_version, device_manufacturer and device_model, and architecture) are all very well-behaved now, with no obvious faults.

Delay

Due to reasons, there is no HTTP Date header in org_mozilla_reference_browser_baseline_parquet, so the delay verification is limited to submission delay (from the ping being created until it's on our servers), without adjustments for clock skew (so if a client's clock is really out to lunch, it'll distort the delay calculation).

And the delay is limited to per-minute resolution, as that's the resolution of end_time.
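(Concretely, the submission delay here is just the gap between the server-side receive timestamp and the client's end_time, something like the sketch below; submission_timestamp is an assumed column name, and the parsing of Glean's offset-bearing, minute-resolution end_time string is glossed over.)

```python
# Submission delay in minutes = server receive time minus client end_time
# (assumed columns; ISO 8601 parsing details glossed over).
from pyspark.sql import functions as F

delays = (
    baseline  # scoped DataFrame from the Scope sketch
    .withColumn("end_ts", F.to_timestamp("end_time"))
    .withColumn("received_ts", F.to_timestamp("submission_timestamp"))
    .withColumn(
        "delay_minutes",
        (F.unix_timestamp("received_ts") - F.unix_timestamp("end_ts")) / 60.0,
    )
)

delays.approxQuantile("delay_minutes", [0.5, 0.85, 0.95, 0.99], 0.01)
```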

2% of pings are received 2-3min before they're recorded. (time travellers)
85% of pings are received within a minute of their recording.

To get to 95% (5900 pings) we need to go out to 61min. Quite a lot quicker than Desktop's "Wait 2 days" rule of thumb.

In fact, under 4% of pings take over 3 hours to be received... Now, given the slight clumping around 60min it isn't unreasonable to assume that there's some artificiality here. Maybe a misconfigured timezone here, an artificially-truncated timestamp there...

In short, I'm not sure how close this is to a real distribution and given the outsized effects weird clients can have in a sample of this size, all I'm willing to venture is that the aggregate delay is a lot lower than Desktop's and doesn't appear to have a systematic issue.

More analysis needed:

  • Clock skew adjustments
  • Checking to see if there are commonalities within the group of long-delayed pings. Maybe they're all sent from certain clients, or at certain times of day, or at certain parts of the app lifecycle. "You only have to wait an hour to get 95% of the pings" is only useful if the 95% we receive in that hour are representative (outside of their delay) of the population of the 100%.

Conclusion

I conclude that "baseline" pings are almost ready, though I wouldn't trust analyses against the &browser population to be helpful.

Recommendations

  • Figure out what's with these dupes. We're getting too many of them.
  • Consider how we might be able to judge another (say, Fenix's) population for suitability as a test bed for further verification analyses.
    • Larger is immediately better, but is there some way to take the population distribution out of the equation so we can evaluate the pings without wondering how much is due to "weird clients"?
    • Not saying that verifying that pings act properly in the face of "weird clients" isn't valuable on its own (it is), but we should try to split that verification from the more pressing matter of "Do the pings work?"
  • Add Date headers to the metadata to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.
  • Unless &browser's population inflates to at least 1k DAU or we find some way to explore the client population's composition of weirdness, perform no further validation analyses against it.

Alessio, please take a look and let me know your questions, concerns, and corrections.

Flags: needinfo?(alessio.placitelli)

Taking a closer look at the dupes, only about two-thirds of them are full dupes (i.e., they share the same document id). Over a third have different document ids.
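(In query form the split is roughly: for each duplicated {client_id, seq} pair, check whether more than one distinct document id is involved. A sketch, with document_id as an assumed column name:)

```python
# Of the duplicated {client_id, seq} pairs, how many are "full" dupes (same
# document_id) vs. distinct documents reusing a seq (assumed schema).
from pyspark.sql import functions as F

dupe_kinds = (
    baseline  # scoped DataFrame from the Scope sketch
    .groupBy("client_id", "seq")
    .agg(
        F.count("*").alias("n"),
        F.countDistinct("document_id").alias("n_docids"),
    )
    .filter(F.col("n") > 1)
    .withColumn("full_dupe", F.col("n_docids") == 1)
)

dupe_kinds.groupBy("full_dupe").count().show()  # ~two-thirds full dupes
```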

(In reply to Chris H-C :chutten from comment #2)

> Taking a closer look at the dupes, only about two-thirds of them are full dupes (i.e., they share the same document id). Over a third have different document ids.

Mh, interesting. I wonder if de-duping is catching stuff on the pipeline at all.

(In reply to Chris H-C :chutten from comment #1)

> Sequence Numbers
>
> Distribution
>
> Now that's what I'm talking about. Look at those sequence numbers distribute... there are two clear cohorts: the old guard with seq above 480, and then a wealth of fresher clients mostly below 200.
>
> And if you flip over to #pings - #clients we see that there are a few more dupes of {client_id, seq} pairs. The only pattern here seems to mimic the overall population distribution from the first graph (there are more dupes where there are more users sending more pings). This means it's unlikely that there's an underlying bias causing dupes to happen at different parts of the sequence (i.e., we're just as likely to see these dupes from deep in a long-sequence client's lifecycle as we are from the first seq of a fresh client). Seems to be random.

This dupes thing is starting to concern me a bit. I think I need to check more in depth that the pipeline is working as we expect and how many dupes it is catching.

> Holes

Yes, this seems like a good hypothesis that needs to be verified with a bigger population.

> Field Compositions
>
> Now, it's not quite as easy to compare durations' distributions because the bucket layout of NUMERIC_HISTOGRAM isn't stable. However, we can tell that it's still an exponential distribution with a heavy emphasis on short (under 14s) durations. There are fewer high values in the thousands-of-seconds (tens of minutes) range. Seems reasonable to me.
>
> Also, we're starting to see far more null durations: 140 out of the ~6200 pings, or about 2%. Zero-length durations are up there as well, at 2.5%.

With respect to null durations, we should wait until we further transition to GCP to see if that gets fixed.
Regarding the zero-length durations, which can be actionable now, I see a different figure: 80 pings over 6210... so 1.2%? This seems to be fairly stable compared to the old analysis, which reported 1.1% of pings with 0 duration.

Given the size of the effect, I'm not too concerned. It would still be interesting to see how start_time and end_time behave compared to duration, especially in these weird cases of "null" or "0".

> Delay
>
> Due to reasons, there is no HTTP Date header in org_mozilla_reference_browser_baseline_parquet, so the delay verification is limited to submission delay (from the ping being created until it's on our servers), without adjustments for clock skew (so if a client's clock is really out to lunch, it'll distort the delay calculation).

Gah, that's sad :( Sorry for not catching this earlier.

> In short, I'm not sure how close this is to a real distribution and given the outsized effects weird clients can have in a sample of this size, all I'm willing to venture is that the aggregate delay is a lot lower than Desktop's and doesn't appear to have a systematic issue.

WHOZAA! Great news :)

> More analysis needed:
>
>   • Clock skew adjustments
>   • Checking to see if there are commonalities within the group of long-delayed pings. Maybe they're all sent from certain clients, or at certain times of day, or at certain parts of the app lifecycle. "You only have to wait an hour to get 95% of the pings" is only useful if the 95% we receive in that hour are representative (outside of their delay) of the population of the 100%.

These are good points for follow-up analyses, maybe on a bigger population.

> Conclusion
>
> I conclude that "baseline" pings are almost ready, though I wouldn't trust analyses against the &browser population to be helpful.

I second your conclusions. Your analysis looks sound.

> Recommendations
>
>   • Figure out what's with these dupes. We're getting too many of them.

I filed bug 1547234 for tracking the problem down.

>   • Consider how we might be able to judge another (say, Fenix's) population for suitability as a test bed for further verification analyses.
>     • Larger is immediately better, but is there some way to take the population distribution out of the equation so we can evaluate the pings without wondering how much is due to "weird clients"?
>     • Not saying that verifying that pings act properly in the face of "weird clients" isn't valuable on its own (it is), but we should try to split that verification from the more pressing matter of "Do the pings work?"

We "might" have something lined up for FFTV. Or, worst case, there's Fenix Beta lined up.

>   • Add Date headers to the metadata to enable clock skew calculations. It didn't occur to me until I was performing the delay calculation how much skew could affect things.

@Frank, how hard would it be to do that? Given the GCP transition, does it make sense to do it?

>   • Unless &browser's population inflates to at least 1k DAU or we find some way to explore the client population's composition of weirdness, perform no further validation analyses against it.

I agree, unless there's any follow-up analysis that might help us with the dupes.

Flags: needinfo?(alessio.placitelli) → needinfo?(fbertsch)

You're right about the 0 durations (1.2%). You're also right to ask about duration vs. [start_time, end_time] (though IIRC the _times are per-minute, so I don't know that we'll get the necessary resolution); I'll make a note to check that in future analyses.
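(For the record, that cross-check would look something like the sketch below, under the same schema assumptions as earlier; since both _times only have minute resolution, the comparison is necessarily coarse.)

```python
# Compare reported duration against the wall-clock span end_time - start_time,
# focusing on the null/zero-duration pings (assumed columns; minute resolution).
from pyspark.sql import functions as F

spans = (
    baseline  # scoped DataFrame from the Scope sketch
    .withColumn("start_ts", F.to_timestamp("start_time"))
    .withColumn("end_ts", F.to_timestamp("end_time"))
    .withColumn(
        "span_minutes",
        (F.unix_timestamp("end_ts") - F.unix_timestamp("start_ts")) / 60.0,
    )
)

(
    spans
    .filter(F.col("duration").isNull() | (F.col("duration") == 0))
    .select("duration", "span_minutes")
    .summary("min", "50%", "max")
    .show()
)
```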

Frank filed https://github.com/mozilla-services/mozilla-pipeline-schemas/issues/323 for adding the Date header.

Looks like we're done... FOR NOW.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(fbertsch)
Resolution: --- → FIXED

> Regarding the zero-length durations, which can be actionable now, I see a different figure: 80 pings over 6210... so 1.2%? This seems to be fairly stable compared to the old analysis, which reported 1.1% of pings with 0 duration.

Keep in mind this just means < 1 second. So I don't think zero-length duration should really be considered anything special.

There was discussion in this bug about whether to add +1 to durations, but ultimately it was decided not to.

(In reply to Chris H-C :chutten from comment #1)

> Holes
>
> We're still seeing holes, but far fewer of them. And now we're starting to see dupes. Overall, though, the combined holes + dupes rate is lower than the holes rate was last time.
>
> Still a bit high, though. 59 clients in the population of 523 had dupes or holes in their sequence record (11.3%, of which 8.6% are dupes). And these numbers are what remain after ingest's deduping has already reduced them.
>
> I hope it's because of population effects (we don't exactly have a lot of clients running &browser, so it's not unreasonable to expect outsized effects of rare and weird clients). To look into this hypothesis we'd probably want to take a look at Fenix (which at the very least I use like a normal browser so should generate reasonable-looking data).

I know this is closed, but I believe I can explain some of the holes. Since we added the "ping tagging" capability to GleanDebugActivity, tagged pings are diverted to the Debug View on GCP. If the application is then run again without ping tagging, pings are sent to AWS as normal. From the AWS perspective, where the validation queries were run, the diverted pings would therefore show up as holes.
