Closed Bug 1110931 Opened 10 years ago Closed 10 years ago

Intermittent crash at !InFreeList

Categories

(Core :: JavaScript: GC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox35 --- unaffected
firefox36 --- fixed
firefox37 --- fixed
firefox38 --- fixed
firefox-esr31 --- unaffected
b2g-v1.4 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.0M --- unaffected
b2g-v2.1 --- unaffected
b2g-v2.1S --- unaffected
b2g-v2.2 --- fixed
b2g-master --- fixed

People

(Reporter: terrence, Assigned: jonco)

References

(Blocks 1 open bug)

Details

(Keywords: sec-moderate)

Attachments

(5 files, 1 obsolete file)

Attached patch atomicize_heapstate-v0.diff (deleted) — Splinter Review
The fuzzers are seeing an intermittent crash under !InFreeList for strings because the free list is corrupt. The facts so far: * This is the last test in CheckMarkedThing, so we are fairly sure that it is actually the free list that is corrupt and not the pointer or GC memory. * The check we are doing is "is the thing live" -- for non-strings this takes the IsThingPoisoned path, since poisoning works. However, since inline string data is in the word we test, we cannot check the poisoning and instead walk the free list. Other types might experience this as well if we used this mechanism. * This crashes >= 1 time per hour for decoder, so we should be able to test patches relatively easily. Candidates: * A buffer overflow. * A race with an allocation in the parsing thread. The parser checks isHeapBusy, which checks a volatile enum type. I think as a first pass we should try making the misc volatiles in the GC into atomics.
Attachment #8535734 - Flags: review?(jcoppeard)
Comment on attachment 8535734 [details] [diff] [review] atomicize_heapstate-v0.diff Review of attachment 8535734 [details] [diff] [review]: ----------------------------------------------------------------- Yes, let's move to using atomics over volatile regardless.
Attachment #8535734 - Flags: review?(jcoppeard) → review+
This push is not a fix, just a test, so marking leave-open and not requesting sec approval. https://hg.mozilla.org/integration/mozilla-inbound/rev/e7664dbdbfd6
Keywords: leave-open
https://hg.mozilla.org/integration/mozilla-inbound/rev/9bba75c1ee47 And backed out for regressing Splay. Christian, did this help at all for the day it was on m-c?
Flags: needinfo?(choller)
Yes it did. As far as I can see, the crashes stopped yesterday when the machines cycled their trees. Since then, no reports came in. I guess it will start again once this lands on m-c and the machines recycle, then we have confirmation. If this is the case, then I guess we need to land this and take the performance penalties (or figure out how to avoid them).
Flags: needinfo?(choller)
Assignee: nobody → terrence
Keywords: leave-open
Looks like I spoke too soon. There were some reports coming in on a revision that had this temporary fix baked in. So it's probably not solving our problem. Any further ideas what we could do? Or other variables this could be racing on?
Would you say that the volume was lower? There might be multiple races here.
Flags: needinfo?(choller)
Please double-check me, but it looks like interFrameGC can only ever be touched from the main thread.
Attachment #8538578 - Flags: review?(jcoppeard)
It seemed lower, but I'm not sure if that's significant or just coincidence. Is it maybe possible to add more assertions that make the failure more reproducible?
Flags: needinfo?(choller)
Attached patch heapstate_atomic-v1.diff (deleted) — Splinter Review
heapState appears to only be set in TraceSession and checked externally, but only heavily in assertions, so I think it should be fine to atomicize it. I'm going to carry the r+ and land as soon as the 10.10 builders on AWFY are back up. The other volatile -- majorGCRequested -- is, I think, both the most likely to be the performance issue and also unlikely to be implicated in this bug, given its one-way nature and lack of correctness implications.
Attachment #8538584 - Flags: review+
(In reply to Christian Holler (:decoder) from comment #10) > It seemed lower, but I'm not sure if that's significant or just coincidence. > Is it maybe possible to add more assertions that make the failure more > reproducible? Yes, we were considering making the more intrusive assertion on all types, not just strings. I'll prep a patch for that too and we can get that in as well, assuming it doesn't show up too frequently on treeherder.
Although it's of course preferred to put the changes into mozilla-central first, we can also test patches for you, if it's not possible to land them due to test breakage (and that test breakage can't be resolved easily). I can use one machine to do that sort of testing (on m-c it gets much more coverage of course).
Attachment #8538578 - Flags: review?(jcoppeard) → review?(sphink)
(In reply to Christian Holler (:decoder) from comment #13) > Although it's of course preferred to put the changes into mozilla-central > first, we can also test patches for you, if it's not possible to land them > due to test breakage (and that test breakage can't be resolved easily). I > can use one machine to do that sort of testing (on m-c it gets much more > coverage of course). The patch appears to run fine on jit-tests, so it's probably not going to be a show-stopper.
Comment on attachment 8538578 [details] [diff] [review] nonvolatile_interFrameGC-v0.diff Review of attachment 8538578 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Terrence Cole [:terrence] from comment #9) > Created attachment 8538578 [details] [diff] [review] > nonvolatile_interFrameGC-v0.diff > > Please double-check me, but it looks like interFrameGC can only ever be > touched from the main thread. Certainly looks that way to me. But that is determined by how the embedding calls us, so I'd be more comfortable if there were an assertion that we are on the main thread in notifyDidPaint.
Attachment #8538578 - Flags: review?(sphink) → review+
Depends on: 1117817
The patches in here are quite speculative -- I opened a cover bug to land them since it seems there are going to be regressions to track.
Attached file crash_data.txt (deleted) —
I made decoder a patch to print out the contents of the arena when this assert went off. The results are here. The assertion failing |(first & ~ArenaMask) == (next->first & ~ArenaMask)| is failing because |next->first & ~ArenaMask| == 0x4b4b4b4b4b4b4000, i.e. it looks like we're seeing the poison pattern rather than a valid address. This is for the free span at 0x7f7cd84e1118 (from the stack trace). However, looking at the memory dump next == 0x7f7cd84e1190 which is a valid empty free span with next->first == 0. So I don't understand what's going on here.
Attached patch bug1110931-dont-walk-free-list-in-minor-gc (obsolete) (deleted) — Splinter Review
I think this might be because we are walking the free list in a minor GC, when the background sweep thread can be running and concurrently modifying it. Here's a patch to not do this in a minor GC (it's only used for an assertion anyway).
Decoder reports that the previous patch fixed the issue. As suggested, this is a more specific fix that doesn't attempt to walk the free list if background sweeping is running.
Assignee: terrence → jcoppeard
Attachment #8551870 - Attachment is obsolete: true
Attachment #8552392 - Flags: review?(terrence)
Comment on attachment 8552392 [details] [diff] [review] bug1110931-fix-assertion-race-with-bg-sweep Review of attachment 8552392 [details] [diff] [review]: ----------------------------------------------------------------- Ouch. Did we always not lock when updating the arena lists?
Attachment #8552392 - Flags: review?(terrence) → review+
(In reply to Terrence Cole [:terrence] from comment #24) We don't lock when updating the free list in sweeping as the main thread doesn't access this for arenas that are being swept - except as it turns out for this assertion.
I guess this has been a problem since bug 964214, but bug 989390 probably made it more likely.
Flags: needinfo?(abillings)
Sorry, didn't mean to needinfo you there.
Flags: needinfo?(abillings)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is fixed now.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Keywords: leave-open
Resolution: --- → FIXED
Please request Aurora/Beta approval on the appropriate patches when you get a chance :)
Comment on attachment 8552392 [details] [diff] [review] bug1110931-fix-assertion-race-with-bg-sweep Approval Request Comment [Feature/regressing bug #]: Bug 989390. [User impact if declined]: None, the code is executed as part of a debug mode assertion only. [Describe test coverage new/current, TreeHerder]: On m-c for a week. [Risks and why]: None. [String/UUID change made/needed]: None.
Flags: needinfo?(jcoppeard)
Attachment #8552392 - Flags: approval-mozilla-beta?
Attachment #8552392 - Flags: approval-mozilla-aurora?
Attachment #8552392 - Flags: approval-mozilla-beta?
Attachment #8552392 - Flags: approval-mozilla-beta+
Attachment #8552392 - Flags: approval-mozilla-aurora?
Attachment #8552392 - Flags: approval-mozilla-aurora+
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: