Open Bug 1476239 Opened 6 years ago Updated 2 years ago

Investigate recent increase in GCMarker::processMarkStackTop crash rate

Categories

(Core :: JavaScript: GC, defect, P5)

61 Branch
defect

Tracking

()

REOPENED
mozilla63
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- affected
firefox61 - wontfix
firefox62 - wontfix
firefox63 - wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox72 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 --- fix-optional

People

(Reporter: jonco, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: stalled)

Crash Data

Attachments

(5 files)

There have always been crashes in GCMarker::processMarkStackTop but these seem to have increased recently. This bug aims to investigate what's happening and see if there's anything we can do to improve the situation.
The address 0xfc0a0 is very common to crash on. This means that we're trying to access the mark bits for a null pointer. Some ideas: - replace MarkStack memory allocation with use of Vector - poison mark stack memory in debug builds with something that shows a destinctive crash address - assert cell pointers are valid on push/pop to/from the mark stack - check the logic in GCMarker::saveValueRanges / restoreValueArray and add more assertions
Attached patch bug1476239-1-mark-stack-alloc (deleted) — Splinter Review
Replace manual memory management with a Vector. The vector is still resized manually since we want to limit mark stack size. MarkStack now uses an index for the top of the stack rather than a raw pointer but I doubt this will have any noticeable performance impact.
Attachment #8994909 - Flags: review?(sphink)
Attached patch bug1476239-2-poison-mark-stack (deleted) — Splinter Review
Poison usused parts of the mark stack on allocation and when yielding to the mutator.
Attachment #8994910 - Flags: review?(sphink)
Attached patch bug1476239-3-check-cell-ptrs (deleted) — Splinter Review
Assert cell pointers are valid when pushed onto or popped off the mark stack.
Attachment #8994911 - Flags: review?(sphink)
Attached patch bug1476239-4-saved-ranges (deleted) — Splinter Review
Refactor save/restore of value ranges. Asserts validity in the constructor of ValueArray and SavedValueArray. There is also one functional change: GCMarker::pushValueArray now skips pushing empty ranges.
Attachment #8994913 - Flags: review?(sphink)
Comment on attachment 8994909 [details] [diff] [review] bug1476239-1-mark-stack-alloc Review of attachment 8994909 [details] [diff] [review]: ----------------------------------------------------------------- This does nicely remove some code duplicate with Vector, which is nice. ::: js/src/gc/GCMarker.h @@ +204,3 @@ > > public: > + explicit MarkStackIter(MarkStack& stack); I didn't see where this needed to be non-const?
Attachment #8994909 - Flags: review?(sphink) → review+
Comment on attachment 8994910 [details] [diff] [review] bug1476239-2-poison-mark-stack Review of attachment 8994910 [details] [diff] [review]: ----------------------------------------------------------------- WFM. I guess it would be overkill to poison on every pop, and assert that space pushed into is poisoned.
Attachment #8994910 - Flags: review?(sphink) → review+
Attachment #8994911 - Flags: review?(sphink) → review+
Attachment #8994913 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #6) > I didn't see where this needed to be non-const? GCMarker::saveValueRanges() calls its saveValueArray() method which mutates the stack. > I guess it would be overkill to poison on every pop, and assert that space pushed into is poisoned. Yeah, I considered this but thought it would be overkill. We could still do it if you think it's worth it.
Pushed by jcoppeard@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d895bcb50ce7 Replace MarkStack memory allocation code with a Vector r=sfink https://hg.mozilla.org/integration/mozilla-inbound/rev/aa709ea6ba7c Poison unused mark stack memory r=sfink https://hg.mozilla.org/integration/mozilla-inbound/rev/d8b161307ded Check cell pointers are valid when pushed/popped to/from the mark stack r=sfink https://hg.mozilla.org/integration/mozilla-inbound/rev/246e495c481e Improve assertions around saved slots/elements ranges r=sfink
Is this something we should be considering for Beta uplift?
Flags: needinfo?(jcoppeard)
Attached patch bug1476239-beta (deleted) — Splinter Review
Approval Request Comment [Feature/Bug causing the regression]: Unknown. [User impact if declined]: This is not a fix for anything specific but should make the mark stack code more robust so possibly this will reduce crashes. [Is this code covered by automated tests?]: Yes. [Has the fix been verified in Nightly?]: Yes. [Needs manual test from QE? If yes, steps to reproduce]: No. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: Low risk. [Why is the change risky/not risky?]: It's a fair amount of change. It's been on central for a while though so I'd say it was low risk. [String changes made/needed]: None.
Flags: needinfo?(jcoppeard)
Attachment #8998143 - Flags: approval-mozilla-beta?
Comment on attachment 8998143 [details] [diff] [review] bug1476239-beta This may be a little risky but it has a chance to fix a bad crash. This is huge top crash on 61 release, while we can see moderate crash volume in beta it may be hard to tell if it will fix the issue on 62 release. But, let's land it now, so that if it makes things obviously worse we'll have a chance to address that.
Attachment #8998143 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This crash is still there on nightly 63 and beta 62. The volume on beta is almost the same as before the patch landed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
These patches haven't made any difference to the crash rate. This seems like another bug where heap corruption causes crashes during GC, and may be caused by a bug anywhere in the system. I'm not sure what else we can do here.
Assignee: jcoppeard → nobody
Keywords: stalled
Priority: -- → P5

Here's my crash report about the same issue, in case it will help at all. 06f2a57b-bd57-439e-8246-9d4c60181216

I had FF 65 at the time.

Oddly enough, there was a one day spike in 20190225215823 which generated over 3000 crashes, but only 2 installs - something seems strange about that.

Jonco: Could we look a this bug again? Specifically with the new large spike in crashes as indicated in Comment 19.

Flags: needinfo?(jcoppeard)

This looks like a problem with one or two machines getting stuck in a loop and crashing over and over again. I don't think it represents a general increase in the frequency of this crash.

Looking at the crashes themselves again they appear to be memory corruption so I still don't see a way forward here.

Flags: needinfo?(jcoppeard)

This bug report has come up in crash report https://crash-stats.mozilla.org/report/index/163a4f35-bb5f-44a5-8501-f525c0200422#tab-details which happened on a Zoom web client call.

That crash signature was like 800 crashes per release 6 months ago but it was 1000 crashes per day in 74 and 75 so this is becoming more frequent.

Bonjour Pascal,

My Nightly just crashed like
https://crash-stats.mozilla.org/report/index/5b10f423-b43e-4f97-8c32-f65700210716
on chat.google.com

Updating flags here per
https://crash-stats.mozilla.org/signature/?product=Firefox&signature=js%3A%3AGCMarker%3A%3AprocessMarkStackTop

I wonder if a new bug should be filed as the crash I experienced is seemingly on Fission though.

Should this one be a duplicate of https://bugzilla.mozilla.org/show_bug.cgi?id=719114 perhaps?

Merci,
Alex

Flags: needinfo?(pascalc)
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: