Closed Bug 1530146 Opened 6 years ago Closed 6 years ago

Tab Crash - Viewing Facebook [@ js::ContextChecks::check]

Categories

(Core :: DOM: Core & HTML, defect, P1)

x86_64
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla67
Tracking Status
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 --- unaffected
firefox67 + fixed

People

(Reporter: sciguyryan, Assigned: bzbarsky)

References

(Blocks 1 open bug)

Details

(5 keywords, Whiteboard: [post-critsmash-triage])

Crash Data

Attachments

(3 files)

Hi!

Since nightly build 2019-02-22 I have been hitting frequent crashes when attempting to visit Facebook. An example report is: https://crash-stats.mozilla.org/report/index/d609e039-a58b-4563-8432-4042b0190224#tab-details

So far I have been unable to narrow down the exact cause and haven't been able to build a reproducible test case. So far I have not noticed it impacting any other sites.

The crash happens a few seconds after loading the page. It only impacts the individual tab.

There are 195 crashes (from 38 installations) in nightly 67 starting with buildid 20190223041557. In analyzing the backtrace, the regression may have been introduced by patch [1] to fix bug 1528146.

[1] https://hg.mozilla.org/mozilla-central/rev?node=e657fe2ca2a3

Blocks: clouseau, 1528146
Component: JavaScript Engine → DOM
Flags: needinfo?(bzbarsky)
Keywords: regression

Hmm. I don't see anything in the backtrace that would implicate bug 1528146...

Do you have a link to the pushlog for the relevant nightly, by any chance?

Flags: needinfo?(bzbarsky) → needinfo?(sciguyryan)

The pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=826b59e57fe4&tochange=6924dd16f7b1

About the backtrace:
https://crash-stats.mozilla.com/report/index/3f23aea2-2732-4a9b-8281-682970190224

see frame 43:
https://hg.mozilla.org/mozilla-central/annotate/6924dd16f7b1e2e6ce71a8eb4cbe330d3e4745dc/dom/base/nsContentSink.cpp#l1565

In comparing backtrace and hg history, it's the only patch I see, I know that the position in the backtrace is pretty low, but the # of crashes is high enough so...

Tracking as it has a high volume of crashes on Nightly and may explode in beta.

Priority: -- → P1

About the backtrace:

Ah, that's a quite different backtrace than the one in comment 0. ;)

Given that bug 1528146 was in for a few days last week before being backed out without this popping up, it's fairly unlikely that it's really involved. Especially since all it's doing in this case is triggering some script, that runs other script, that eventually does something.

Bug 1523843 landed in that same push and is a more likely candidate for issues here, in theory, in that it actually affects compartment boundaries.

There were also various spidermonkey changes, which might in theory be involved. I really wish the compartment checks gave a better idea of what fails the assert. :(

Jan, any idea how to go about debugging this? We could try disabling bug 1523843 and seeing if this goes away, I guess... Or we could make the InternalCallOrConstruct code provide more information about what compartment issues it's running into, on nightly?

Flags: needinfo?(sciguyryan) → needinfo?(jdemooij)

So I poked at the crash reports some. If I search for "js::ContextChecks::check" on crash-stats, look at reports, sort by build id, I see the following:

  1. All the Windows and Android stacks are cut off.

  2. The Linux/Mac stacks mostly pass through PrecompiledScript::ExecuteInGlobal and that's triggered via nsContentSink::NotifyDocElementCreated invoking ExtensionPolicyService::CheckContentScripts. They fail the compartment checks in js::InternalCallOrConstruct.

  3. There is a separate crash under ReportIsNotFunction that I just filed as bug 1530388.

As far as point 2 goes, a bunch of these are coming off the HTML5 parser, not the old parser, so they are NOT the notification I added in bug 1528146, but the notification we used to fire all along.

So tcampbell found https://crash-stats.mozilla.com/report/index/41652358-8427-401a-9460-279470190225 which says we're coming through js::InternalCallOrConstruct with a callee that is a JSFunction backed by a JSNative. At that point, chances are we're failing a cx->check() call in CallJSNative.

What's not clear is whether it's the one for args or rval, and if the former which thing in CallArgs it's for...

I wonder whether it's worth adding explicit MOZ_CRASH diagnostics to CallJSNative that would say which (if any) of the checks is failing, temporarily.

Kris notes that TamperMonkey seems likely to be related...

Ryan, can you give us a list of Tampermonkey user scripts you have installed, or at least those which would run on Facebook? A full list would be better, since it's possible some scripts may run in iframes.

Flags: needinfo?(sciguyryan)

I also hit these, although they might also be cut off since I'm on Windows:

I do have TamperMonkey installed, although it only has one script on this machine ATM, which runs on github.com to make it a bit more accessible. The script is here.

Marco, were you loading any particular page(s) when you hit those?

The stacks being cut off is not really a problem; what I'd really love is a way to reproduce so I can debug. Reproducing just once is enough as long as I can manage to catch it under rr... ;)

Flags: needinfo?(mzehe)

(In reply to Kris Maglione [:kmag] from comment #9)

Ryan, can you give us a list of Tampermonkey user scripts you have installed, or at least those which would run on Facebook? A full list would be better, since it's possible some scripts may run in iframes.

Hi Kris.

I do have TamperMonkey installed however none of the scripts are running on Facebook or any of its domains. Disabling TamperMonkey for the whole domain does not prevent the crashing either. However disabling TamperMonkey completely does prevent the crash, which is fascinating.

Flags: needinfo?(sciguyryan)

I noticed some crashes on support.microsoft.com on crash-stats that had the CanvasBlocker extension installed.

I can reproduce this 100% reliably with that extension on https://support.microsoft.com/nl-be/help/12415/windows-10-recovery-options

https://crash-stats.mozilla.org/report/index/a859bf2d-dab6-4367-a4fb-604b90190226

So in a debug build we fail this debug-only check:

https://searchfox.org/mozilla-central/rev/b10ae6b7a50d176a813900cbe9dc18c85acd604b/js/src/vm/Interpreter-inl.h#501

At this point our |obj| is a CrossOriginObjectWrapper proxy, |key| is "HTMLCanvasElement" and |res| is a function.

DumpJSStack:

0 preIntercept/forEachFunction/</<(object = "HTMLCanvasElement", 0, HTMLCanvasElement) ["moz-extension://183e0772-4c8f-1544-ad6c-c67925b21926/lib/intercept.js":81:24]

intercept.js:81 is this line:

var constructor = getWrapped(window)[object];

Flags: needinfo?(jdemooij) → needinfo?(bzbarsky)

We have a CrossOriginObjectWrapper => XrayWaiver => nsOuterWindowProxy. I think the JSAutoRealm in the XrayWaiver code switches compartments here (the wrapper and cx are probably in different compartments):

https://searchfox.org/mozilla-central/rev/b10ae6b7a50d176a813900cbe9dc18c85acd604b/js/xpconnect/wrappers/WrapperFactory.cpp#51

Flags: needinfo?(mzehe)
Keywords: topcrash

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

Marco, were you loading any particular page(s) when you hit those?

No, just your plain old facebook.com which brings up the news feed. Takes a few seconds, then crashes the tab. I am logged in automatically. One extra bit is that I am using our Facebook Container add-on. So no special page, just the news feed that comes up by default.

We have a CrossOriginObjectWrapper => XrayWaiver => nsOuterWindowProxy

Aha! That is not a thing that should be happening as I planned this stuff; certainly the code in XrayWaiver can't deal with it.

Should be pretty simple to write a testcase for this... Working on that and a fix.

Assignee: nobody → bzbarsky
Group: dom-core-security

This isn't a UAF

Keywords: csectype-uaf

Isn't this a compartment mismatch? We generally mark those as a UAF because they can cause the GC to not mark a live object.

(In reply to Andrew McCreight [:mccr8] from comment #19)

Isn't this a compartment mismatch? We generally mark those as a UAF because they can cause the GC to not mark a live object.

I think we only run into the compartment mismatch in special corner cases because of the weird double wrapper. I don't think this should affect GC behavior unless one of those corner cases lets through a cross-compartment object that doesn't hit a compartment check.

I could be wrong, though.

OK, I've confirmed this is a regression from bug 1523843.

Blocks: 1523843
No longer blocks: 1528146

Alright, so some more information on what's going on, now that I have the steps from comment 13 in rr.

We have a toplevel page at https://support.microsoft.com/nl-be/help/12415/windows-10-recovery-options which has a subframe at https://support.microsoft.com/api/content/SignedOut. The two pages are same-compartment (after bug 1523843) but are not same-origin, because the parent page sets document.domain to "microsoft.com".

We have an extension running in a sandbox targeting the subframe. It has an expanded principal, as expected, wrapping a moz-extension principal and the https://support.microsoft.com/api/content/SignedOut principal. The sandbox has the wantXrays flag so it can get same-origin Xrays and hence XrayWaivers.

We first seem to run into trouble in xpc::WrapperFactory::Rewrap trying to create a cross-compartment wrapper for an object. The object is an XrayWaiver for the subframe's WindowProxy. The principal of the XrayWaiver is the toplevel page's principal. Presumably that's because we first created an XrayWaiver for that WindowProxy when the subframe contained initial about:blank (which inherits the parent's principal). Then we did the actual subframe load, but since we're using the same compartment we kept the same WindowProxy object (though with a new global and principal) and hence also kept using the same XrayWaiver.

Since the two ContentPrincipals involved are not same-origin with each other for purposes of this code, we end up in the block at https://searchfox.org/mozilla-central/source/js/xpconnect/wrappers/WrapperFactory.cpp#698-704 and decide that our CCW should be a CrossOriginObjectWrapper. That's because both targetSubsumesOrigin and originSubsumesTarget are both false, so we never worry about the wantXrays flags, since they're not relevant to not-same-origin cases.

But really, the fact that targetSubsumesOrigin is false is really wrong here. "target" is the extension sandbox. "origin" really ought to be the window that sandbox is targeting, but we're getting it off the XrayWaiver, so we're ending up with the parent window, which the sandbox does not subsume...

This business of having a waiver that's not same-origin with the thing it's targeting seems pretty weird. Maybe we need a different approach for bug 1526624. Alternately, maybe Rewrap needs to strip off waivers and only reapply them if we're about to decide to use an Xray or something... I still need to double-check exactly how the sandbox is getting its hands on this not-same-origin waiver; that's coming up next.

Flags: needinfo?(bobbyholley)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #22)

It has an expanded principal, as expected, wrapping a moz-extension principal and the https://support.microsoft.com/api/content/SignedOut principal. The sandbox has the wantXrays flag so it can get same-origin Xrays and hence XrayWaivers.

Ugh. I didn't realize we were still setting that flag on those sandboxes. The fact that they have expanded principals should be enough for them to get X-rays for the objects they want X-rays for. We don't really want them to have same-origin X-rays...

In any case, I'm not sure that's really relevant to the problem here. It seems like we'd still have it with just an expanded principal.

And, either way, the case of a content script created for a window before it's document.domain changes is already pretty broken, I think, since we don't update the principal of the sandbox. And when we did remap wrappers when document.domain changed, we ran into other issues (i.e., bug 1414783 where we lose expandos for the window wrappers).

OK, so as expected the place the waiver comes from is the extension calling its getWrapped function which calls xpc::wrappedJSObject_getter which calls xpc::WrapperFactory::WaiveXrayAndWrap. AllowWaiver in there returns true, because there we're comparing the context compartment (the extension's sandbox) to the object compartment (the compartment both web pages are in) and doing a Subsumes() check, which ignores document.domain. And that passes in this case. And even if we compared Realms or did SubsumesConsideringDomain() there it would still pass, because at that point we're comparing the extension sandbox to the subframe's OuterWindowProxy.

Right, the wantXrays flag is not really relevant here, except insofar as there are cases in which it suppresses the use of CrossOriginObjectWrapper, so I wanted to understand why that wasn't happening in this case.

the case of a content script created for a window before it's document.domain changes is already pretty broken

I think it actually works kinda fine in that case: the expanded principal stores a pointer to the principal that the document.domain change mutates, so SubsumesConsideringDomain() still passes.

The issue in this case is that the XrayWaivers are per-compartment, not per-realm, given how we fixed bug 1526624. Which means that we should really never be examining any Realm-dependent state (including principals) on them. Ideally js::GetNonCCWObjectRealm would just assert against being called on an XrayWaiver, as would JS::GetNonCCWObjectGlobal and maybe in general any place we assert !IsCrossCompartmentWrapper(). Maybe we need a semantically distinct !IsObjectWithoutRealm() assert or something.

Again, unless we change how waivers work to be per-Realm.

Attached patch Mochitest for this bug (deleted) — Splinter Review

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #25)

the case of a content script created for a window before it's document.domain changes is already pretty broken

I think it actually works kinda fine in that case: the expanded principal stores a pointer to the principal that the document.domain change mutates, so SubsumesConsideringDomain() still passes.

Oh, that's a good point. In that case, I guess just not remapping wrappers may have been enough to fix the existing problems. I was expecting to need to do more work after that.

Alright, bholley and I talked about comment 22. The basic options seem to be:

  1. Change XrayWaiver to be per-realm, not per-compartment. This basically means adjusting FixWaiverAfterTransplant to handle the same-compartment different-realm case.

  2. Add asserts against getting realms, principals, etc off XrayWaivers (just like we already have for CCWs). Fix all the resulting fallout.

I took a brief stab at #2 and it's somewhat invasive and kinda annoying. So I'm going to try #1.

Flags: needinfo?(bobbyholley)
Flags: needinfo?(bzbarsky)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #29)

Created attachment 9047271 [details]
Bug 1530146 part 1. Switch XrayWaiver to always being same-realm with its target. r=bholley

I wonder if I should be concerned that the second email I got about this revision was plain text rather than "The content for this message can only be transmitted over a secure channel" ...

(In reply to Kris Maglione [:kmag] from comment #31)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #29)

Created attachment 9047271 [details]
Bug 1530146 part 1. Switch XrayWaiver to always being same-realm with its target. r=bholley

I wonder if I should be concerned that the second email I got about this revision was plain text rather than "The content for this message can only be transmitted over a secure channel" ...

I got the same thing. You should file a bug about it.

Group: dom-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

(In reply to Kris Maglione [:kmag] from comment #31)

I wonder if I should be concerned that the second email I got about this revision was plain text rather than "The content for this message can only be transmitted over a secure channel" ...

I filed bug 1531823 for this issue. Thanks for pointing it out, Kris.

Component: DOM → DOM: Core & HTML
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: