Closed Bug 1800079 Opened 2 years ago Closed 2 years ago

Instrument the failure rate of `sendStandalonePing` by user

Categories

(Toolkit :: Telemetry, task, P1)

task

Tracking

()

RESOLVED FIXED
109 Branch
Tracking Status
firefox109 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

Details

Attachments

(2 files)

sendStandalonePing is used by two subsystems: PingCentre and PingSender2. It is a best-effort "let's try this once with no retry" API.

PingSender2, by nature of its system, has reliable retry because the pings it's sending are sourced from disk. If sendStandalonePing fails, then the ping'll get picked up as pending by the next Firefox run.

PingCentre does not retry. Its ping payloads are assembled in memory, handed to sendStandalonePing and then are let go.

In order to figure out how many pings of what types are going awry, we should (in a reliable retry system like Glean) instrument the number of send failures. This is a data-oriented collection, not a network-oriented one, so we're more interested in which data isn't making it out (or at least how much) than we are the reasons why (network errors). At least for now.

I'm interested in this: what's the driving use-case for collecting this data? Was there any problem? (not opposed to it)

Do we want to instrument error return codes?

Flags: needinfo?(chutten)

There appears to be mounting evidence that the events_aggregates dataset fuelled by PingCentre-sent activity stream data is missing information. events_aggregates is in the running to be used for the derivation of new datasets to be used by DS for... you know, I didn't ask why. :skahmann would know.

Given that events_aggregates has the most history of our contextual services datasets, it would be beneficial to know what factor or rule of thumb to apply to its historical figures to correct them.

(( And this should help us understand just how reliable (or unreliable! ...but probably reliable) Glean-reported newtab interaction data is, with the eventual goal of switching over to it (some time after it has more than just a couple months of history) ))

(In reply to Alessio Placitelli [:Dexter] from comment #1)

Do we want to instrument error return codes?

I don't, but it'd be easy enough to add a TELEMETRY_SEND_FAILURE_TYPE-esque metric for those if someone else wanted it. Just let me know.

I'm more interested in how many of which kinds of pings aren't making it through, which may involve instrumenting higher up (in PingCentre itself, or even in TelemetryFeed).

Flags: needinfo?(chutten) → needinfo?(skahmann)

(In reply to Chris H-C :chutten from comment #2)

There appears to be mounting evidence that the events_aggregates dataset fuelled by PingCentre-sent activity stream data is missing information. events_aggregates is in the running to be used for the derivation of new datasets to be used by DS for... you know, I didn't ask why. :skahmann would know.

Thank you for your reply!
Indeed, I'm not surprised by this. We saw the same evidence while working on other incidents (can't recall specifically which one). I believe this is by design and was an intentional choice of the original team: I was told that this was done "because of latency".

In case of either transient or prolonged network outages, we would just lose the data.

:skahmann, very curious about what effect you're seeing!

Attached file data collection review request (deleted) —
Attachment #9302959 - Flags: data-review?(tlong)

:Dexter and :chutten, :rburwei@mozilla.com and I are adding a suite of new contextual services datasets to split the suggest and tiles products: two derived from event_aggregates (one per product) and similarly, two "clients_daily" tables without advertiser info for more general access. We're interested in understanding the gap between the event_aggregates metrics (by context_id, from custom contextual services pings) and the new client_id metrics (these are pulled from main, events, etc.) so we can expectation-set with stakeholders. We've seen a variety of counts when comparing different metrics and OS. In August, we found a daily impact of +13% Suggest impressions from event_aggregates vs main. (https://docs.google.com/document/d/1TjcQhavVhimb7QSX2PZsH-JVvpV1hOPZamCWl07V2VU/edit#bookmark=id.61i78m5bjjvx) Now, when we compare event_aggregates to the new client-level tables, aggregating over 3 months, we're seeing an impact of +1-6% from event aggregates depending upon the device/ product (https://docs.google.com/document/d/1DLpuG5dQNm0Bd4JVN7Mj9ZGWClP2i12V3sDi5i9fX1A/edit#bookmark=id.qp5l6wbaq1o4)

Flags: needinfo?(skahmann)

Comment on attachment 9302959 [details]
data collection review request

Data Review

  1. Is there or will there be documentation that describes the schema for the ultimate data set in a public, complete, and accurate way?

Yes, through the metrics.yaml file and the Glean Dictionary.

  1. Is there a control mechanism that allows the user to turn the data collection on and off?

Yes, through the standard Telemetry preference in the application settings.

  1. If the request is for permanent data collection, is there someone who will monitor the data over time?

N/A, collection to expire in version 115

  1. Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical data

  1. Is the data collection request for default-on or default-off?

Default-on

  1. Does the instrumentation include the addition of any new identifiers (whether anonymous or otherwise; e.g., username, random IDs, etc. See the appendix for more details)?

No

  1. Is the data collection covered by the existing Firefox privacy notice?

Yes

  1. Does the data collection use a third-party collection tool?

No

Result

data-review+

Attachment #9302959 - Flags: data-review?(tlong) → data-review+

Backed out for causing xpcshell failures on test_PingCentre.js.

Push with failures

Failure log

Backout link

[task 2022-11-14T20:26:19.275Z] 20:26:19     INFO -  TEST-START | browser/modules/test/unit/test_PingCentre.js
[task 2022-11-14T20:26:19.661Z] 20:26:19  WARNING -  TEST-UNEXPECTED-FAIL | browser/modules/test/unit/test_PingCentre.js | xpcshell return code: 0
[task 2022-11-14T20:26:19.661Z] 20:26:19     INFO -  TEST-INFO took 382ms
[task 2022-11-14T20:26:19.661Z] 20:26:19     INFO -  >>>>>>>
[task 2022-11-14T20:26:19.662Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: Couldn't get the user appdata directory. Crash events may not be produced.: file /builds/worker/checkouts/gecko/toolkit/crashreporter/nsExceptionHandler.cpp:2906
[task 2022-11-14T20:26:19.662Z] 20:26:19     INFO -  (xpcshell/head.js) | test MAIN run_test pending (1)
[task 2022-11-14T20:26:19.663Z] 20:26:19     INFO -  (xpcshell/head.js) | test run_next_test 0 pending (2)
[task 2022-11-14T20:26:19.663Z] 20:26:19     INFO -  (xpcshell/head.js) | test MAIN run_test finished (2)
[task 2022-11-14T20:26:19.663Z] 20:26:19     INFO -  running event loop
[task 2022-11-14T20:26:19.664Z] 20:26:19     INFO -  browser/modules/test/unit/test_PingCentre.js | Starting setup setup
[task 2022-11-14T20:26:19.664Z] 20:26:19     INFO -  (xpcshell/head.js) | test setup pending (2)
[task 2022-11-14T20:26:19.665Z] 20:26:19     INFO -  (xpcshell/head.js) | test run_next_test 0 finished (2)
[task 2022-11-14T20:26:19.665Z] 20:26:19     INFO -  (xpcshell/head.js) | test run_next_test 1 pending (2)
[task 2022-11-14T20:26:19.665Z] 20:26:19     INFO -  (xpcshell/head.js) | test setup finished (2)
[task 2022-11-14T20:26:19.666Z] 20:26:19     INFO -  browser/modules/test/unit/test_PingCentre.js | Starting test_enabled
[task 2022-11-14T20:26:19.666Z] 20:26:19     INFO -  (xpcshell/head.js) | test test_enabled pending (2)
[task 2022-11-14T20:26:19.667Z] 20:26:19     INFO -  TEST-PASS | browser/modules/test/unit/test_PingCentre.js | test_enabled - [test_enabled : 56] Telemetry should be on - true == true
[task 2022-11-14T20:26:19.667Z] 20:26:19     INFO -  (xpcshell/head.js) | test run_next_test 1 finished (2)
<...>
[task 2022-11-14T20:26:19.685Z] 20:26:19     INFO -  TEST-PASS | browser/modules/test/unit/test_PingCentre.js | test_sendStructuredIngestionPing_failure - [test_sendStructuredIngestionPing_failure : 154] "undefined" == "undefined"
[task 2022-11-14T20:26:19.685Z] 20:26:19  WARNING -  TEST-UNEXPECTED-FAIL | browser/modules/test/unit/test_PingCentre.js | test_sendStructuredIngestionPing_failure - [test_sendStructuredIngestionPing_failure : 157] 1 == "undefined"
[task 2022-11-14T20:26:19.685Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/tests/browser/modules/test/unit/test_PingCentre.js:test_sendStructuredIngestionPing_failure:157
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/head.js:_run_next_test/<:1775
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/head.js:_run_next_test:1775
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/head.js:run:819
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/head.js:_do_main:240
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  /builds/worker/workspace/build/tests/xpcshell/head.js:_execute_test:597
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  -e:null:1
[task 2022-11-14T20:26:19.686Z] 20:26:19     INFO -  exiting test
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  (xpcshell/head.js) | test run_next_test 10 finished (2)
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  Unexpected exception NS_ERROR_ABORT:
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  _abort_failed_test@/builds/worker/workspace/build/tests/xpcshell/head.js:875:20
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  do_report_result@/builds/worker/workspace/build/tests/xpcshell/head.js:976:5
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  Assert<@/builds/worker/workspace/build/tests/xpcshell/head.js:73:21
[task 2022-11-14T20:26:19.687Z] 20:26:19     INFO -  Assert.prototype.report@resource://testing-common/Assert.jsm:240:10
[task 2022-11-14T20:26:19.688Z] 20:26:19     INFO -  equal@resource://testing-common/Assert.jsm:282:8
[task 2022-11-14T20:26:19.688Z] 20:26:19     INFO -  test_sendStructuredIngestionPing_failure@/builds/worker/workspace/build/tests/xpcshell/tests/browser/modules/test/unit/test_PingCentre.js:157:10
[task 2022-11-14T20:26:19.688Z] 20:26:19     INFO -  _run_next_test/<@/builds/worker/workspace/build/tests/xpcshell/head.js:1775:22
[task 2022-11-14T20:26:19.688Z] 20:26:19     INFO -  _run_next_test@/builds/worker/workspace/build/tests/xpcshell/head.js:1775:38
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  run@/builds/worker/workspace/build/tests/xpcshell/head.js:819:9
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  _do_main@/builds/worker/workspace/build/tests/xpcshell/head.js:240:6
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  _execute_test@/builds/worker/workspace/build/tests/xpcshell/head.js:597:5
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  @-e:1:1
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  exiting test
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  "CONSOLE_MESSAGE: (info) TELEMETRY PING (test_topic): {"experiments":{},"locale":"en-US","version":"109.0a1","release_channel":"default","event":"fake_event","value":"fake_value"}
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  "
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "Structured Ingestion ping failure with error: undefined" {file: "resource:///modules/PingCentre.jsm" line: 137}]
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  sendStructuredIngestionPing/<@resource:///modules/PingCentre.jsm:137:10
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  _do_main@/builds/worker/workspace/build/tests/xpcshell/head.js:240:6
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  _execute_test@/builds/worker/workspace/build/tests/xpcshell/head.js:597:5
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  @-e:1:1
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  "
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/checkouts/gecko/xpcom/base/nsCycleCollector.cpp:3387
[task 2022-11-14T20:26:19.689Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: OOPDeinit() without successful OOPInit(): file /builds/worker/checkouts/gecko/toolkit/crashreporter/nsExceptionHandler.cpp:3374
[task 2022-11-14T20:26:19.690Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: NS_ENSURE_TRUE(Preferences::InitStaticMembers()) failed: file /builds/worker/checkouts/gecko/modules/libpref/Preferences.cpp:4628
[task 2022-11-14T20:26:19.690Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: XPCOM object PerThreadData destroyed from static ctor/dtor: file /builds/worker/checkouts/gecko/xpcom/base/nsTraceRefcnt.cpp:215
[task 2022-11-14T20:26:19.691Z] 20:26:19     INFO -  PID 11625 | [Parent 11625, Main Thread] WARNING: XPCOM object SourceList destroyed from static ctor/dtor: file /builds/worker/checkouts/gecko/xpcom/base/nsTraceRefcnt.cpp:215
[task 2022-11-14T20:26:19.691Z] 20:26:19     INFO -  <<<<<<<
[task 2022-11-14T20:26:19.691Z] 20:26:19     INFO -  INFO | Result summary:
[task 2022-11-14T20:26:19.692Z] 20:26:19     INFO -  INFO | Passed: 190
[task 2022-11-14T20:26:19.692Z] 20:26:19  WARNING -  INFO | Failed: 1
[task 2022-11-14T20:26:19.692Z] 20:26:19  WARNING -  One or more unittests failed.
[task 2022-11-14T20:26:19.692Z] 20:26:19     INFO -  INFO | Todo: 0
[task 2022-11-14T20:26:19.693Z] 20:26:19     INFO -  INFO | Retried: 1
[task 2022-11-14T20:26:19.693Z] 20:26:19     INFO -  SUITE-END | took 215s
Flags: needinfo?(chutten)

Odd. It works on my machine. I'll take a closer look tomorrow.

Flags: needinfo?(chutten)

Ah, stub.rejects() is just asynchronous enough that we need an async on sendStructuredIngestionPing in the test.

This could have impact on instrumentation if the rejection isn't serviced before shutdown, but oh well.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

One way to get to this code path is turning off your internet / work offline. Then every new tab opening results in 1 error and closing the tab results in 3 more errors.

That could explain some of what we're seeing. It would also affect Glean, but if someone ever moved to work online again then Glean would be able to report the failures.

I'm not sure that it covers all of the effect we're seeing. So far on Nightly, 1 in 5 clients each day are losing at least one ping, and the number of lost pings averages about 5 or 6: https://sql.telemetry.mozilla.org/queries/88826/source#220001

Might just be a Nightly thing. We'll see how these numbers change as Fx109 rides the trains.

Blocks: 1818340
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: