Closed Bug 1696565 Opened 4 years ago Closed 3 years ago

YSOD on browser.xhtml with UNDEFINED_ENTITY

Categories

(Toolkit :: Startup and Profile System, defect)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: zbraniecki, Unassigned)

References

(Blocks 1 open bug)

Details

User Story

Volume: ~3000/day on release

This is similar to bug 1696551, but this time the XML file loads correctly, but the DTD it depends on errors out.

This correlation is two-way. Every DTD corruption of brand.dtd can be correlated to browser.xhtml YSOD in line 744:15 and charsetMenu.dtd leads to YSOD in line 960:7 - which makes sense. Missing DTD triggers YSOD in the first reference

Session 1

  • 17472, zero_byte_load: [{"f":[{"v":"sync"},{"v":"true"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/pocket/SaveToPocket.jsm"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18299, zero_byte_load: [{"f":[{"v":"sync"},{"v":"true"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/en-US/locale/branding/brand.dtd"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18330, ysod: [{"f":[{"v":"error_code"},{"v":"11"}]},{"f":[{"v":"location"},{"v":"744:15"}]},{"f":[{"v":"last_line"},{"v":"
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/search-extensions/bing/manifest.json"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/search-extensions/wikipedia/manifest.json"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/search-extensions/google/manifest.json"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/content/browser/browser.xhtml"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_UNEXPECTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/search-extensions/ddg/manifest.json"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 18736, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/search-extensions/amazondotcom/manifest.json"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 33852, zero_byte_load: [{"f":[{"v":"sync"},{"v":"false"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/toolkit/content/gfxsanity/sanitytest.html"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FAILURE"}]},{"f":[{"v":"cancelled"},{"v":"true"}]}]
  • 1334870, zero_byte_load: [{"f":[{"v":"sync"},{"v":"true"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/res/activity-stream/lib/ShortURL.jsm"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]
  • 1334885, zero_byte_load: [{"f":[{"v":"sync"},{"v":"true"}]},{"f":[{"v":"file_name"},{"v":"omni.ja!/chrome/browser/res/activity-stream/lib/ASRouterNewTabHook.jsm"}]},{"f":[{"v":"status"},{"v":"NS_ERROR_FILE_CORRUPTED"}]},{"f":[{"v":"cancelled"},{"v":"false"}]}]

Other sessions have the same NS_ERROR_FILE_CORRUPTED status from different files like:

  • omni.ja!/chrome/toolkit/res/normandy/Normandy.jsm
  • omni.ja!/chrome/toolkit/content/global/customElements.js
  • omni.ja!/chrome/toolkit/content/gfxsanity/sanityparent.html
  • omni.ja!/chrome/en-US/locale/en-US/global/layout/xmlparser.properties
  • omni.ja!/chrome/toolkit/content/global/process-content.js
  • omni.ja!/actors/LinkHandlerChild.jsm
  • omni.ja!/chrome/browser/res/activity-stream/lib/ASRouterNewTabHook.jsm

and in one case omni.ja!/chrome/browser/content/browser/browser.xhtml with NS_ERROR_UNEXPECTED.

User Story: (updated)

Flagging you since there seem to be a good correlation with corrupted JSM loads, so this YSOD may have the same root cause as bug 1616059. Wondering if there's any chance that we can now track it better.

Is there a chance that we are somehow corrupting the omni.ja during updates?

Flags: needinfo?(kmaglione+bmo)

We are definitely winding up with corrected omnijar files in the wild that cause any number of files. Whether the corruption happens during update or at some other time is a question I can't currently answer. I suppose ideally we should have the updater verify the integrity of files after they've been written (if it already doesn't; I have no reason to believe it does), which would at least answer that part of the question.

Flags: needinfo?(kmaglione+bmo)
Product: Core → Toolkit
Component: General → Application Update

(In reply to Kris Maglione [:kmag] from comment #2)

I suppose ideally we should have the updater verify the integrity of files after they've been written (if it already doesn't; I have no reason to believe it does), which would at least answer that part of the question.

Correct, the updater doesn't currently do any extra checking of the written files. There is verification of the data we've read from the patch (via the signature on the mar), and of the original file that's being patched (via a source file crc check), but that's it.

Correct, the updater doesn't currently do any extra checking of the written files.

How much work would it be to fix that?

Flags: needinfo?(agashlin)

Ignoring, for a moment, the issue of how difficult this would be to do, I'm reluctant to have the updater spend even more CPU time updating. If we stage, I wouldn't mind taking extra verification time at that point. But if a user is updating on startup, they have to wait for the update to apply. And I'm reluctant to increase the amount of time the user spends waiting for Firefox to launch.

But if a user is updating on startup, they have to wait for the update to apply. And I'm reluctant to increase the amount of time the user spends waiting for Firefox to launch.

Am I correct to read that there's a trade-off between ability to mitigate corrupted omni.ja and time it takes to perform an update?

If so, there is likely a threshold at which the degraded quality of bricking users browsers outweights the cost of verifying the update, am I correct?

Flags: needinfo?(ksteuber)

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #6)

Am I correct to read that there's a trade-off between ability to mitigate corrupted omni.ja and time it takes to perform an update?

If, at update installation time, you want to verify the installed files after we install them, that will take additional update time.

In my opinion, we already verify the update sufficiently.

When the MAR is downloaded, we verify the signature. If any byte in the MAR is not as-expected, this signature check will fail and the update will not install.

When we are patching (i.e. using a partial MAR), each file's integrity is verified before we patch it. A failure results in the entire update being rolled back.

Flags: needinfo?(ksteuber)

In my opinion, we already verify the update sufficiently.

That means that the MAR update is not a vector of corrupted omni.ja, and therefore additional verification wouldn't yield any value and would cost time.

Which is directly countering Kris' point:

Whether the corruption happens during update or at some other time is a question I can't currently answer. I suppose ideally we should have the updater verify the integrity of files after they've been written (if it already doesn't; I have no reason to believe it does), which would at least answer that part of the question.

Could we perform an experiment to validate that claim? How costly would it be to add additional verification for a small percentage of population, or just nightly, for a short period of time and cross-check it against corrupted omni.ja YSODs telemetry?
We have anecdotal evidence that there are cases where a user performs an update and they see YSOD on browser.xhtml next (:nalexander documented one in bug 1675823 comment 7). But we also see from telemetry that shows that the volume of browser.xhtml YSODs is steady on release channel, rather than correlated with release rollouts which would indicate that updates are not causative here.

I don't think we'll get a hard correlation, and we need to make decisions on investments in such experiments based on our intuition and knowledge of the domain, hence I'm trying to get people's positions.

Unless I'm missing something important (@agashlin please chime in here if I am), I believe this would be a pretty large engineering effort to do as an experiment. The big question that I ask when I think about verifying these files is "what are we comparing them to?" I don't believe that the MAR contains the expected hash of the resulting file. So how would we do this, exactly? Would we add the file hashes to the MAR? Would we phone home to Mozilla, mid-update, to ask what the correct hash is supposed to be?

I'm not sure how to talk about how difficult this would be, because I'm not really sure how we would want to accomplish it.

Will Firefox send telemetry in the YSOD state? If not, I don't see what good this would do us. If so, I suspect it would be much easier, engineering-wise, to verify the installation files at that point, rather than at update time.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #4)

Correct, the updater doesn't currently do any extra checking of the written files.

How much work would it be to fix that?

  • Extend the MAR format to include checksums for the updated files. (This would need to be done in a backwards-compatible way to avoid introducing a watershed, it could be an additional file like the manifest or precomplete files.)
  • Modify the MAR generation tools (libmar, mardor are the ones I know of) to generate this new format.
  • Modify the updater to use this data (possibly only when staging an update as :bytesized suggested in comment 5)
    • Load it from the MAR
    • Check. To protect against I/O errors this probably wants to reopen and read back the output file.
    • Write an error status code specific to a mismatch (so it can be bubbled up to telemetry). Restoring backups should be handled by existing code.
  • Tests
Flags: needinfo?(agashlin)

Will Firefox send telemetry in the YSOD state?

It does. And we are collecting it. Here's a dashboard - https://sql.telemetry.mozilla.org/dashboard/ysods - notice top-right Release, document YSOD volume. We believe some part of it (this bug documents the case studies) are caused by NS_ERROR_FILE_CORRUPTED on DTD file.

I've filed 1700205 for further discussion of verifying within the updater, but I think the immediate issue belongs in a different component. I wasn't sure if it would be Toolkit :: Startup and Profile System or Core :: Networking: JAR, but either seemed more likely than Update.

Component: Application Update → Startup and Profile System

Can this bug be closed or have a severity assigned?

Flags: needinfo?(zbraniecki)

Yes.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(zbraniecki)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.