Closed Bug 1632138 Opened 5 years ago Closed 5 years ago

Telemetry error from _migrateAppDataPings - unable to remove file - invalid path component

Categories

(Toolkit :: Telemetry, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla77
Tracking Status
firefox77 --- fixed

People

(Reporter: Gijs, Assigned: chutten)

References

Details

Attachments

(1 file)

When I start a self-built nightly, the browser console shows:

1587562231779	Toolkit.Telemetry	ERROR	TelemetryStorage::_migrateAppDataPings - failed to remove file my\homedir\AppData\Roaming\Mozilla\Firefox\Pending Pings\6A822F6A-38A0-46E2-BE5D-E17AD14B8B3A: TypeError: invalid path component(resource://gre/modules/osfile/ospath_win.jsm:150:13) JS Stack trace: join@ospath_win.jsm:150:13
pingFilePath@TelemetryStorage.jsm:2028:18
savePing@TelemetryStorage.jsm:1418:28
async*savePing@TelemetryStorage.jsm:372:33
_migrateAppDataPings@TelemetryStorage.jsm:1691:32
async*_scanPendingPings@TelemetryStorage.jsm:1741:16
loadPendingPingList@TelemetryStorage.jsm:1719:39
loadPendingPingList@TelemetryStorage.jsm:302:33
_checkPendingPings@TelemetrySend.jsm:825:40
setup@TelemetrySend.jsm:772:18
setup@TelemetrySend.jsm:211:30
setupTelemetry/this._delayedInitTask<@TelemetryController.jsm:817:31
async*_runTask@DeferredTask.jsm:333:18
_timerCallback/<@DeferredTask.jsm:305:20
_timerCallback@DeferredTask.jsm:324:9
_startTimer/callback/<@DeferredTask.jsm:179:18
ChromeUtils::IdleDispatch handler*_startIdleDispatch@DeferredTask.jsm:192:17
callback@DeferredTask.jsm:173:14

My Windows user profile path has no non-latin characters in, though it does have a space. Other than that, I'm unsure what is causing this.

Could you provide the contents of my\homedir\AppData\Roaming\Mozilla\Firefox\Pending Pings\6A822F6A-38A0-46E2-BE5D-E17AD14B8B3A ? It could be that it doesn't have an id and so when we try to join the profile ping storage dir with the id it fails.

Flags: needinfo?(gijskruitbosch+bugs)
{"build_channel":"beta","build_version":"76.0","default_browser":"firefox","os_locale":"en-US","os_version":"10.0.19603.1000","previous_default_browser":"firefox"}

is the entire file, apparently... so yes, it appears not to have an id. Perhaps the code should catch that case? It seems right now, this file is never removed... I'm also not sure how this ping ended up being so empty?

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(chutten)

Well that's weird. No Telemetry ping should be without a type and a id (and some other stuff)

Ah, that's the "default-browser" ping. ni?Molly, does the "default-browser" ping really not follow the Common Ping Format? If so... there might be some problems in store.

Flags: needinfo?(chutten) → needinfo?(mhowell)

No, it doesn't; it's a custom ping generated outside of Firefox, I didn't know there was any need for those to use the common format. This kind of sounds like I should just not try to use that directory at all.

It's not strictly on topic, but I don't really understand how that file is actually still there at all at that point; I immediately pass it to pingsender after generating it, so I thought it would get deleted right away.

Flags: needinfo?(mhowell)

Pingsender deletes the file on success, otherwise leaves it behind for Firefox to pick up and try sending when next it starts up. Firefox tends to be better at networking than the system default configuration pingsender uses which is why pingsender is used as a best effort transmission mechanism, not a reliable one.

Okay. How serious do you think this is, what else might go wrong other than this error, and what do you think ought to be done?

Flags: needinfo?(chutten)

Well, luckily this ping is located in "old ping limbo". Once we fail to copy it into the profile's ping storage it shouldn't be interacted with by any other part of the Telemetry subsystem at all. (If it had been in the profile storage there is a lot of other code we'd have to worry about.) And the migration code is fault tolerant per-file so it shouldn't stop us from grabbing migratable pings as part of our normal operations. In short: it shouldn't gum up the works.

Some Concerns and Solutions:

  1. The logging will only get worse -- each unsent ping will clog up the folder, and each failure to migrate it into the profile's ping storage will log. I don't think we impose any limits on the size or number of files in that directory so eventually there might be perf problems. Eventually.
    a) We should patch TelemetryStorage to remove any file it can't migrate so that current instances of the problem self-limit.
    b) Consider patching the default browser agent to put the non-migratable pings someplace else, and cleaning the directory between invocations.
  2. Evidence of send failures - Pingsender is "best effort" and this is evidence that it can fail. I'm not sure how esoteric :Gijs' net setup is on that machine, but it's evidence that pings we were expecting are not being delivered.
    a) Consider making the "default-browser" ping adhere to the Common Ping Format. Then we can skip 1(b) and gain the benefit of using Firefox's Telemetry to reliably deliver the data at some future time. (As well as getting a record in about:telemetry of past sent pings)
    b) Consider adding a sequence number to the "default-browser" ping to detect and monitor temporary reporting failures.
  3. Evidence that developer-built Firefox tries to send "default-browser" pings - developer-built Firefoxen do not upload Telemetry regardless of the value of datareporting.healthreport.uploadEnabled because there isn't any value in its analysis. I'm not 100% clear on the scope of the "default-browser" ping (maybe there is value in analyses of developer builds), but it might be worth considering mirroring this behaviour from Telemetry.
    a) Consider (if appropriate) mirroring Telemetry's baroque (but useful for testing) logic for splitting "data will be collected" from "data will be uploaded" (which for this particular purpose hinges on this code here which is known at compile-time so shouldn't require mirroring state between processes)

How's this sound? Does it address your questions?

Flags: needinfo?(chutten) → needinfo?(mhowell)

(In reply to Chris H-C :chutten from comment #7)

1b) Consider patching the default browser agent to put the non-migratable pings someplace else, and cleaning the directory between invocations.

That I can definitely do.

2a) Consider making the "default-browser" ping adhere to the Common Ping Format. Then we can skip 1(b) and gain the benefit of using Firefox's Telemetry to reliably deliver the data at some future time. (As well as getting a record in about:telemetry of past sent pings)

I have some questions about the common ping format if that's what we're going to be using here:

  1. Is everything listed in the docs other than clientId and environment strictly required? I definitely can't have either of those things here, but some of the stuff in the application block I would need to figure out how to even gather.
  2. Would the type value just be default-browser? Does it need to be namespaced at all?
  3. Since we're already receiving pings, I don't know what the migration/versioning story would look like. Certainly this would be a new schema version (which, I'm already dreading doing that), but wouldn't that make it somewhere between really annoying and impossible to keep using the existing data?

2b) Consider adding a sequence number to the "default-browser" ping to detect and monitor temporary reporting failures.

Do you mean just as a payload field? I can't find any other pings that have something like that, I'd love to see an example or hear some more details of how to do that in a way that it would be helpful.

3a) Consider (if appropriate) mirroring Telemetry's baroque (but useful for testing) logic for splitting "data will be collected" from "data will be uploaded" (which for this particular purpose hinges on this code here which is known at compile-time so shouldn't require mirroring state between processes)

Yeah, I do think using that same logic to determine when to not upload these sounds sensible; I can copy all that over.

How's this sound? Does it address your questions?

Yes, that was super helpful, thank you. I hate that I still feel so far out of my depth whenever I'm doing anything anywhere near telemetry, but I always find some new way to be completely clueless.

Flags: needinfo?(mhowell) → needinfo?(chutten)

(In reply to Molly Howell (she/her) [:mhowell] from comment #8)

2a) Consider making the "default-browser" ping adhere to the Common Ping Format. Then we can skip 1(b) and gain the benefit of using Firefox's Telemetry to reliably deliver the data at some future time. (As well as getting a record in about:telemetry of past sent pings)

I have some questions about the common ping format if that's what we're going to be using here:

  1. Is everything listed in the docs other than clientId and environment strictly required? I definitely can't have either of those things here, but some of the stuff in the application block I would need to figure out how to even gather.

To my knowledge yes (ni?frank for surety). It pertains to the application under instrumentation, though, so you don't need to ferry information from Firefox. If the agent's application name is "Cody Banks" then put "Cody Banks" in there (though maybe drop the space). And if the field is unknown to the agent then a value of "unknown" or some other sentinel would be fine.

  1. Would the type value just be default-browser? Does it need to be namespaced at all?

There are no namespaces in Telemetry doctypes, so "default-browser" should do.

  1. Since we're already receiving pings, I don't know what the migration/versioning story would look like. Certainly this would be a new schema version (which, I'm already dreading doing that), but wouldn't that make it somewhere between really annoying and impossible to keep using the existing data?

My guess would be that this'll end up in a new table. A UNION across tables in SQL is quite performant. Who's your contact on Data Platform for schema, dataset, Data Engineeringish stuff? They'll have some opinions about this, I reckon.

2b) Consider adding a sequence number to the "default-browser" ping to detect and monitor temporary reporting failures.

Do you mean just as a payload field? I can't find any other pings that have something like that, I'd love to see an example or hear some more details of how to do that in a way that it would be helpful.

Actually, now that I think about it, sequence numbers are useless without an identifier that is stable over time (like client_id). Let's ignore this.

How's this sound? Does it address your questions?

Yes, that was super helpful, thank you. I hate that I still feel so far out of my depth whenever I'm doing anything anywhere near telemetry, but I always find some new way to be completely clueless.

Don't let my calm demeanour fool you: Windows things give me the screaming willies. That's why they pay our princely salaries: so we are willing to each know different pieces of the Nightmare : D

Flags: needinfo?(chutten) → needinfo?(fbertsch)
Assignee: nobody → chutten
Status: NEW → ASSIGNED

Okay, here's what I want to do. I'm going to file two bugs, one for copying over the logic to skip uploading the ping in non-official builds, and the second to switch to the common ping format, because I think those things plus your patch are the solutions we've settled on. The former I will write a patch for right now, the latter will take longer and I may or may not be the one doing it, I don't know yet.

(In reply to Chris H-C :chutten from comment #9)

To my knowledge yes (ni?frank for surety). It pertains to the application under instrumentation, though, so you don't need to ferry information from Firefox. If the agent's application name is "Cody Banks" then put "Cody Banks" in there (though maybe drop the space). And if the field is unknown to the agent then a value of "unknown" or some other sentinel would be fine.

Is the agent really a separate "application" from Firefox, in this sense? I have no idea how any of this information is ever used so I don't know what should be expected, but to me that seems like a surprising implication.

My guess would be that this'll end up in a new table. A UNION across tables in SQL is quite performant.

That's what I was afraid of. My worry wasn't about performance, it was about making it harder to write those queries. But I guess there's no helping it.

Who's your contact on Data Platform for schema, dataset, Data Engineeringish stuff? They'll have some opinions about this, I reckon.

I... don't really have one? I've always done things either on my own or as one-off requests. Maybe that's why I have so many problems.

To my knowledge yes (ni?frank for surety). It pertains to the application under instrumentation, though, so you don't need to ferry information from Firefox. If the agent's application name is "Cody Banks" then put "Cody Banks" in there (though maybe drop the space). And if the field is unknown to the agent then a value of "unknown" or some other sentinel would be fine.

The schema for Application is here. If you scroll to the bottom it has a required field, those are all the fields that are required to be present in ping for it to be considered valid.

Similarly the environment schema is here. This one gets confusing because of the nesting; some parent fields are required but only a subset of their children are.

Since we're already receiving pings, I don't know what the migration/versioning story would look like. Certainly this would be a new schema version (which, I'm already dreading doing that), but wouldn't that make it somewhere between really annoying and impossible to keep using the existing data?

This won't be too bad. We can support you here with a simplified view, if that would make things easier.

Flags: needinfo?(fbertsch)

Okay, here's what I want to do. I'm going to file two bugs, one for copying over the logic to skip uploading the ping in non-official builds, and the second to switch to the common ping format, because I think those things plus your patch are the solutions we've settled on. The former I will write a patch for right now, the latter will take longer and I may or may not be the one doing it, I don't know yet.

Sounds like a plan to me.

(In reply to Chris H-C :chutten from comment #9)

To my knowledge yes (ni?frank for surety). It pertains to the application under instrumentation, though, so you don't need to ferry information from Firefox. If the agent's application name is "Cody Banks" then put "Cody Banks" in there (though maybe drop the space). And if the field is unknown to the agent then a value of "unknown" or some other sentinel would be fine.

Is the agent really a separate "application" from Firefox, in this sense? I have no idea how any of this information is ever used so I don't know what should be expected, but to me that seems like a surprising implication.

Up to you. In the past the application name was used to distinguish between Firefox, Thunderbird, and Fennec (since those all could send Telemetry). Since there's unlikely to be more than one application ever sending this ping you'll probably never need it.

(In reply to Frank Bertsch [:frank] from comment #12)

To my knowledge yes (ni?frank for surety). It pertains to the application under instrumentation, though, so you don't need to ferry information from Firefox. If the agent's application name is "Cody Banks" then put "Cody Banks" in there (though maybe drop the space). And if the field is unknown to the agent then a value of "unknown" or some other sentinel would be fine.

The schema for Application is here. If you scroll to the bottom it has a required field, those are all the fields that are required to be present in ping for it to be considered valid.

Similarly the environment schema is here. This one gets confusing because of the nesting; some parent fields are required but only a subset of their children are.

Actually, the schema for Application is here (copy-pasta, I think). Notable is that the entire Environment is itself optional, so you can skip it all. Required application fields include:

"architecture",
"buildId",
"channel",
"name",
"platformVersion",
"version",
"vendor",
"xpcomAbi"

So it looks like there's gonna be a lot of empty stings/sentinel values/"unknown"s since e.g. our Cody Banks here doesn't have XPCOM. Your choices here for what the values will be should be based on what you yourself would find useful in your analyses. If it isn't applicable or useful, give it a sentinel for now.

Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9fcdb9a46da6 Remove ping files that we can't migrate r=janerik
Severity: -- → normal
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: