Closed
Bug 1110931
Opened 10 years ago
Closed 10 years ago
Intermittent crash at !InFreeList
Categories
(Core :: JavaScript: GC, defect)
Core
JavaScript: GC
Tracking
()
RESOLVED
FIXED
mozilla38
Tracking | Status | |
---|---|---|
firefox35 | --- | unaffected |
firefox36 | --- | fixed |
firefox37 | --- | fixed |
firefox38 | --- | fixed |
firefox-esr31 | --- | unaffected |
b2g-v1.4 | --- | unaffected |
b2g-v2.0 | --- | unaffected |
b2g-v2.0M | --- | unaffected |
b2g-v2.1 | --- | unaffected |
b2g-v2.1S | --- | unaffected |
b2g-v2.2 | --- | fixed |
b2g-master | --- | fixed |
People
(Reporter: terrence, Assigned: jonco)
References
(Blocks 1 open bug)
Details
(Keywords: sec-moderate)
Attachments
(5 files, 1 obsolete file)
(deleted),
patch
|
jonco
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
sfink
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
terrence
:
review+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
terrence
:
review+
abillings
:
approval-mozilla-aurora+
abillings
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
The fuzzers are seeing an intermittent crash under !InFreeList for strings because the free list is corrupt.
The facts so far:
* This is the last test in CheckMarkedThing, so we are fairly sure that it is actually the free list that is corrupt and not the pointer or GC memory.
* The check we are doing is "is the thing live" -- for non-strings this takes the IsThingPoisoned path, since poisoning works. However, since inline string data is in the word we test, we cannot check the poisoning and instead walk the free list. Other types might experience this as well if we used this mechanism.
* This crashes >= 1 time per hour for decoder, so we should be able to test patches relatively easily.
Candidates:
* A buffer overflow.
* A race with an allocation in the parsing thread.
The parser checks isHeapBusy, which checks a volatile enum type. I think as a first pass we should try making the misc volatiles in the GC into atomics.
Attachment #8535734 -
Flags: review?(jcoppeard)
Assignee | ||
Comment 1•10 years ago
|
||
Comment on attachment 8535734 [details] [diff] [review]
atomicize_heapstate-v0.diff
Review of attachment 8535734 [details] [diff] [review]:
-----------------------------------------------------------------
Yes, let's move to using atomics over volatile regardless.
Attachment #8535734 -
Flags: review?(jcoppeard) → review+
Updated•10 years ago
|
Keywords: sec-moderate
Reporter | ||
Comment 2•10 years ago
|
||
This push is not a fix, just a test, so marking leave-open and not requesting sec approval.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e7664dbdbfd6
Keywords: leave-open
Comment 3•10 years ago
|
||
Reporter | ||
Comment 4•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/9bba75c1ee47
And backed out for regressing Splay. Christian, did this help at all for the day it was on m-c?
Flags: needinfo?(choller)
Comment 5•10 years ago
|
||
Yes it did. As far as I can see, the crashes stopped yesterday when the machines cycled their trees. Since then, no reports came in. I guess it will start again once this lands on m-c and the machines recycle, then we have confirmation.
If this is the case, then I guess we need to land this and take the performance penalties (or figure out how to avoid them).
Flags: needinfo?(choller)
Comment 6•10 years ago
|
||
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/9bba75c1ee47
Assignee: nobody → terrence
Keywords: leave-open
Comment 7•10 years ago
|
||
Looks like I spoke too soon. There were some reports coming in on a revision that had this temporary fix baked in. So it's probably not solving our problem. Any further ideas what we could do? Or other variables this could be racing on?
Reporter | ||
Comment 8•10 years ago
|
||
Would you say that the volume was lower? There might be multiple races here.
Flags: needinfo?(choller)
Reporter | ||
Comment 9•10 years ago
|
||
Please double-check me, but it looks like interFrameGC can only ever be touched from the main thread.
Attachment #8538578 -
Flags: review?(jcoppeard)
Comment 10•10 years ago
|
||
It seemed lower, but I'm not sure if that's significant or just coincidence. Is it maybe possible to add more assertions that make the failure more reproducible?
Flags: needinfo?(choller)
Reporter | ||
Comment 11•10 years ago
|
||
heapState appears to only be set in TraceSession and checked externally, but only heavily in assertions, so I think it should be fine to atomicize it. I'm going to carry the r+ and land as soon as the 10.10 builders on AWFY are back up.
The other volatile -- majorGCRequested -- is, I think, both the most likely to be the performance issue and also unlikely to be implicated in this bug, given its one-way nature and lack of correctness implications.
Attachment #8538584 -
Flags: review+
Reporter | ||
Comment 12•10 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #10)
> It seemed lower, but I'm not sure if that's significant or just coincidence.
> Is it maybe possible to add more assertions that make the failure more
> reproducible?
Yes, we were considering making the more intrusive assertion on all types, not just strings. I'll prep a patch for that too and we can get that in as well, assuming it doesn't show up too frequently on treeherder.
Comment 13•10 years ago
|
||
Although it's of course preferred to put the changes into mozilla-central first, we can also test patches for you, if it's not possible to land them due to test breakage (and that test breakage can't be resolved easily). I can use one machine to do that sort of testing (on m-c it gets much more coverage of course).
Reporter | ||
Updated•10 years ago
|
Attachment #8538578 -
Flags: review?(jcoppeard) → review?(sphink)
Reporter | ||
Comment 14•10 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #13)
> Although it's of course preferred to put the changes into mozilla-central
> first, we can also test patches for you, if it's not possible to land them
> due to test breakage (and that test breakage can't be resolved easily). I
> can use one machine to do that sort of testing (on m-c it gets much more
> coverage of course).
The patch appears to run fine on jit-tests, so it's probably not going to be a show-stopper.
Comment 15•10 years ago
|
||
Comment on attachment 8538578 [details] [diff] [review]
nonvolatile_interFrameGC-v0.diff
Review of attachment 8538578 [details] [diff] [review]:
-----------------------------------------------------------------
(In reply to Terrence Cole [:terrence] from comment #9)
> Created attachment 8538578 [details] [diff] [review]
> nonvolatile_interFrameGC-v0.diff
>
> Please double-check me, but it looks like interFrameGC can only ever be
> touched from the main thread.
Certainly looks that way to me. But that is determined by how the embedding calls us, so I'd be more comfortable if there were an assertion that we are on the main thread in notifyDidPaint.
Attachment #8538578 -
Flags: review?(sphink) → review+
Reporter | ||
Comment 16•10 years ago
|
||
Keywords: leave-open
Comment 17•10 years ago
|
||
Reporter | ||
Comment 18•10 years ago
|
||
Backed out for regressing svgx.
https://hg.mozilla.org/integration/mozilla-inbound/rev/70f64ef5569e
Comment 19•10 years ago
|
||
Something landed here.
https://hg.mozilla.org/mozilla-central/rev/e98f0dc6b1c3
Reporter | ||
Comment 20•10 years ago
|
||
The patches in here are quite speculative -- I opened a cover bug to land them since it seems there are going to be regressions to track.
Assignee | ||
Comment 21•10 years ago
|
||
I made decoder a patch to print out the contents of the arena when this assert went off. The results are here.
The assertion failing |(first & ~ArenaMask) == (next->first & ~ArenaMask)| is failing because |next->first & ~ArenaMask| == 0x4b4b4b4b4b4b4000, i.e. it looks like we're seeing the poison pattern rather than a valid address. This is for the free span at 0x7f7cd84e1118 (from the stack trace).
However, looking at the memory dump next == 0x7f7cd84e1190 which is a valid empty free span with next->first == 0. So I don't understand what's going on here.
Assignee | ||
Comment 22•10 years ago
|
||
I think this might be because we are walking the free list in a minor GC, when the background sweep thread can be running and concurrently modifying it.
Here's a patch to not do this in a minor GC (it's only used for an assertion anyway).
Assignee | ||
Comment 23•10 years ago
|
||
Decoder reports that the previous patch fixed the issue.
As suggested, this is a more specific fix that doesn't attempt to walk the free list if background sweeping is running.
Assignee: terrence → jcoppeard
Attachment #8551870 -
Attachment is obsolete: true
Attachment #8552392 -
Flags: review?(terrence)
Reporter | ||
Comment 24•10 years ago
|
||
Comment on attachment 8552392 [details] [diff] [review]
bug1110931-fix-assertion-race-with-bg-sweep
Review of attachment 8552392 [details] [diff] [review]:
-----------------------------------------------------------------
Ouch. Did we always not lock when updating the arena lists?
Attachment #8552392 -
Flags: review?(terrence) → review+
Assignee | ||
Comment 25•10 years ago
|
||
(In reply to Terrence Cole [:terrence] from comment #24)
We don't lock when updating the free list in sweeping as the main thread doesn't access this for arenas that are being swept - except as it turns out for this assertion.
Assignee | ||
Comment 26•10 years ago
|
||
I guess this has been a problem since bug 964214, but bug 989390 probably made it more likely.
Flags: needinfo?(abillings)
Assignee | ||
Comment 27•10 years ago
|
||
Sorry, didn't mean to needinfo you there.
Flags: needinfo?(abillings)
Assignee | ||
Comment 28•10 years ago
|
||
Comment 29•10 years ago
|
||
Going with dates from bug 989390.
status-firefox35:
--- → unaffected
status-firefox36:
--- → affected
status-firefox37:
--- → affected
status-firefox38:
--- → affected
status-firefox-esr31:
--- → unaffected
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Updated•10 years ago
|
Assignee | ||
Comment 31•10 years ago
|
||
This is fixed now.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 10 years ago
Keywords: leave-open
Resolution: --- → FIXED
Comment 33•10 years ago
|
||
Please request Aurora/Beta approval on the appropriate patches when you get a chance :)
status-b2g-v1.4:
--- → unaffected
status-b2g-v2.0:
--- → unaffected
status-b2g-v2.0M:
--- → unaffected
status-b2g-v2.1:
--- → unaffected
status-b2g-v2.1S:
--- → unaffected
status-b2g-v2.2:
--- → affected
status-b2g-master:
--- → fixed
Flags: needinfo?(jcoppeard)
Assignee | ||
Comment 34•10 years ago
|
||
Comment on attachment 8552392 [details] [diff] [review]
bug1110931-fix-assertion-race-with-bg-sweep
Approval Request Comment
[Feature/regressing bug #]: Bug 989390.
[User impact if declined]: None, the code is executed as part of a debug mode assertion only.
[Describe test coverage new/current, TreeHerder]: On m-c for a week.
[Risks and why]: None.
[String/UUID change made/needed]: None.
Flags: needinfo?(jcoppeard)
Attachment #8552392 -
Flags: approval-mozilla-beta?
Attachment #8552392 -
Flags: approval-mozilla-aurora?
Updated•10 years ago
|
Attachment #8552392 -
Flags: approval-mozilla-beta?
Attachment #8552392 -
Flags: approval-mozilla-beta+
Attachment #8552392 -
Flags: approval-mozilla-aurora?
Attachment #8552392 -
Flags: approval-mozilla-aurora+
Comment 35•10 years ago
|
||
Comment 36•10 years ago
|
||
Updated•10 years ago
|
Group: core-security
You need to log in
before you can comment on or make changes to this bug.
Description
•