Closed Bug 1298639 Opened 8 years ago Closed 8 years ago

BackgroundHangMonitor reads JS during compacting GC

Categories

(Core :: JavaScript: GC, defect)

Unspecified
Windows 10
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox49 --- unaffected
firefox50 --- unaffected
firefox51 --- fixed

People

(Reporter: marcia, Assigned: jonco)

References

Details

(5 keywords)

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-2e60b532-9d4a-48ab-a56a-61fc22160827.
=============================================================

Seen while looking at crash stats: http://bit.ly/2bpqQeT. Windows crash that seems to have started with 20160826030226 build. 

ni on :jonco and :bkelly (since I think :jonco is on PTO) to weigh in.
Flags: needinfo?(jcoppeard)
Flags: needinfo?(bkelly)
One notable thing is that the crash address is always 0xbad0bad9. I don't see that appearing anywhere of note in the code base, other than 0xbad0 and 0xbad9 are the values for some Hangul (Korean alphabet) letters, which is probably just a coincidence?
(In reply to Nicholas Nethercote [:njn] from comment #1)
> One notable thing is that the crash address is always 0xbad0bad9. I don't
> see that appearing anywhere of note in the code base,

There's this in RelocationOverlay:

  static const uintptr_t Relocated = uintptr_t(0xbad0bad1);
The only thing that jumps out at me is that the stack includes SPSProfiler and we recently tried to make it so profiling works on Worker threads again in bug 1297773.  Its unclear to me from the stack if its a worker thread, though.
Flags: needinfo?(bkelly)
This looks bad. I looked at a number of crashes. In the five or so I looked at, the main thread was compacting, while the background hang monitor thread, is poking around in JS engine data structures. The background hang monitor code is old, so I don't know why these crashes are showing up now.

The function IsChromeJSScript(JSScript* aScript) in ThreadStackHelper.cpp has this comment:
  // May be called from another thread or inside a signal handler.
  // We assume querying the script is safe but we must not manipulate it.
Which seems a little dubious at best.

Terrence, can you take a look?

This should probably be in XPCOM but I'll put it in JS:GC for now.
Group: javascript-core-security
Component: JavaScript Engine → JavaScript: GC
Flags: needinfo?(terrence)
Summary: Crash in JSScript::code → BackgroundHangMonitor reads JS during compacting GC
In the stacks I looked at, the main thread was waiting on a lock inside js::GCParallelTask::join().

This is probably an immediate regression from bug 1288579, which was probably in the 8-26 build first. That was supposed to fix a hang, but maybe it caused one? If we're hanging more, then that would trigger the hang monitor. Alternatively, it may have improved the hang so that the hang monitor can actually run.
Blocks: 1288579
All 13 of these crashes are on Windows.
Keywords: regression
(In reply to Andrew McCreight [:mccr8] from comment #4)
> This looks bad. I looked at a number of crashes. In the five or so I looked
> at, the main thread was compacting, while the background hang monitor
> thread, is poking around in JS engine data structures. The background hang
> monitor code is old, so I don't know why these crashes are showing up now.

We just started moving JSScripts recently.

> The function IsChromeJSScript(JSScript* aScript) in ThreadStackHelper.cpp
> has this comment:
>   // May be called from another thread or inside a signal handler.
>   // We assume querying the script is safe but we must not manipulate it.
> Which seems a little dubious at best.

Yeah, more than a little. I can't image how this was ever supposed to work, but I suppose there's probably some awful hack somewhere unrelated-seeming in the engine that makes it not immediately explode.
It looks like this is also showing up all over the place on Treeherder, too.
Blocks: 1294009
Bug 1294009 was filed before bug 1288579 landed, which suggests that in fact we were hitting these kinds of crashes earlier, which makes sense. The stack in comment 0 there does not involve compacting GC on the main thread, but does involve some other JS.
I just filed bug 1299275 and put it on my xmas list for Santa
Summary: BackgroundHangMonitor reads JS during compacting GC → Intermittent browser_markup_events_jquery_1.7.js,browser_perf-recording-actor-01.js,test_bug260264.html | application crashed [@ js::ProfileEntry::pc()] or [@ js::ProfileEntry::pc() const volatile] -- BackgroundHangMonitor reads JS during compacting GC
Apparently duping made it impossible to star on treeherder...
Summary: Intermittent browser_markup_events_jquery_1.7.js,browser_perf-recording-actor-01.js,test_bug260264.html | application crashed [@ js::ProfileEntry::pc()] or [@ js::ProfileEntry::pc() const volatile] -- BackgroundHangMonitor reads JS during compacting GC → BackgroundHangMonitor reads JS during compacting GC
This has been triggered by refactoring of JSScript in bug 1288715.  Now there is an additional dereference to get the start of the script's bytecode we are crashing because the script is being moved and has had the magic value written over the field holding the shared script data pointer.
Attached patch bug1298639-hang-monitor (obsolete) (deleted) — Splinter Review
The problem is that we now need to dereference the script's shared script data pointer to get the address of the bytecode, and while the script is being moved that is overwritten with the magic value.

In fact we don't need the bytecode address though because it's just used to calculate the PC address from an offset stored on the profiling stack, and the PC address is then converted back into the offset when we can calculate the line number.

Instead we can add an API to get the line number given the offset and then we don't need to access this script for this.
Assignee: nobody → jcoppeard
Flags: needinfo?(jcoppeard)
Attachment #8788195 - Flags: review?(shu)
Comment on attachment 8788195 [details] [diff] [review]
bug1298639-hang-monitor

Review of attachment 8788195 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

::: js/src/jsscript.cpp
@@ +2925,5 @@
> +unsigned
> +js::PCToLineNumber(JSScript* script, int32_t pcOffset, unsigned* columnp)
> +{
> +    /* Cope with InterpreterFrame.pc value prior to entering Interpret. */
> +    MOZ_ASSERT(pcOffset >= -1);

>:|

::: js/src/vm/SPSProfiler.cpp
@@ +485,5 @@
> +JS_FRIEND_API(int32_t)
> +ProfileEntry::pcOffset() const volatile
> +{
> +    MOZ_ASSERT(isJs());
> +    return lineOrPc;

Perhaps this field should be more accurately renamed lineOrPcOffset?
Attachment #8788195 - Flags: review?(shu) → review+
I assume the line number is just used for displaying output, and not indexing into the array or anything? Given that and that it requires the profiler to be running, I'm going to mark this sec-moderate. Thanks for fixing it.
Flags: needinfo?(terrence)
Comment on attachment 8788195 [details] [diff] [review]
bug1298639-hang-monitor

Jan informs me that there are still crashes with this patch.  This is due to the fact that we still need to call JSScript::notes which will also dereference the shared script data pointer.
Attachment #8788195 - Attachment is obsolete: true
Blocks: 1297706
Attached patch bug1298639-hang-monitor v2 (obsolete) (deleted) — Splinter Review
This is a much simpler approach: check whether the script is being moved whenever we give out the pointer.
Comment on attachment 8788861 [details] [diff] [review]
bug1298639-hang-monitor v2

Jan confirmed that this fixes the problem.
Attachment #8788861 - Flags: review?(shu)
Comment on attachment 8788861 [details] [diff] [review]
bug1298639-hang-monitor v2

Review of attachment 8788861 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/SPSProfiler.cpp
@@ +487,5 @@
> +{
> +    MOZ_ASSERT(isJs());
> +    // We could have interrupted a compacting GC before these pointers have been
> +    // updated.
> +    return MaybeForwarded(reinterpret_cast<JSScript*>(spOrScript));

Is there any guarantee that we haven't interrupted in the middle of installing a forwarding pointer? This can't happen when profiling because shu just added an AutoSuppressProfilerSampling during compaction, but I don't know the interaction that the BackgroundHangMonitor has with this code.
Comment on attachment 8788861 [details] [diff] [review]
bug1298639-hang-monitor v2

Review of attachment 8788861 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/SPSProfiler.cpp
@@ +487,5 @@
> +{
> +    MOZ_ASSERT(isJs());
> +    // We could have interrupted a compacting GC before these pointers have been
> +    // updated.
> +    return MaybeForwarded(reinterpret_cast<JSScript*>(spOrScript));

Nick's right here. Looks like BackgroundHangMonitor uses ThreadStackHelper to capture stacks of hangs. ThreadStackHelper holds a PseudoStack* directly [1], and straight up reads the it when taking a backtrace [2]. I see nothing preventing ThreadStackHelper's signal handler from interrupting in the middle of writing the forwarding pointer for a JSScript on the stack.

The ThreadStackHelper code really needs to use the public ProfilingFrameIterator.

[1] http://searchfox.org/mozilla-central/source/xpcom/threads/ThreadStackHelper.h#67
[2] http://searchfox.org/mozilla-central/source/xpcom/threads/ThreadStackHelper.cpp#507
Attachment #8788861 - Flags: review?(shu)
Maybe for now we can check rt->isProfilerSamplingEnabled? That would at least stop the oranges today and let me land the syntax-parse-arrows patch. But maybe that's not necessary if it's easy to convert to ProfilingFrameIterator.
(In reply to Shu-yu Guo [:shu] from comment #22)
Indeed, I'm not claiming this fixes all the races in this code.  There's currently no synchronisation so the current architecture is always going to have problems.
Attached patch bug1298639-hang-monitor v3 (obsolete) (deleted) — Splinter Review
As suggested by Jan, here's a patch that whether profiling is suspended before handing out JSScript pointers.

I rearranged the compacting code a little to make sure that relocated arenas are only freed after we leave the suppression region.  This means that we can still get the runtime from script pointer even if it is in the process of being moved.  The pointers will all have been updated by the time we leave the suppression.

I had to add a rawScript() method for a couple of places where we do need to get the script whatever state it's in.  I also changed the callers to tolerate getting null script pointers back.
Attachment #8788861 - Attachment is obsolete: true
Comment on attachment 8789735 [details] [diff] [review]
bug1298639-hang-monitor v3

Review of attachment 8789735 [details] [diff] [review]:
-----------------------------------------------------------------

This is pretty gross, but let's get Jan unblocked.

::: js/src/jsgc.cpp
@@ +5480,2 @@
>          if (relocateArenas(zone, reason, relocatedArenas, sliceBudget)) {
> +

Extra newline.

::: js/src/vm/SPSProfiler.cpp
@@ +491,5 @@
> +        return nullptr;
> +
> +    // If profiling is supressed then we can't trust the script pointers to be
> +    // valid as they could be in the process of being moved by a compacting GC
> +    // (although it's still OK to get the runtime from them).

This is very, very subtle, but I have no better ideas.
Attachment #8789735 - Flags: review?(shu) → review+
Attached patch bug1298639-hang-monitor-3 (deleted) — Splinter Review
Updated following review.

Jan, did your try run pass with this?  If so could you land it for me as I'm out next week.
Flags: needinfo?(jdemooij)
Attachment #8789934 - Flags: review+
Blocks: 1301890
Blocks: 1301962
Blocks: 1302165
Blocks: 1302326
Comment on attachment 8789735 [details] [diff] [review]
bug1298639-hang-monitor v3

Sure, I'll land this today.
Attachment #8789735 - Attachment is obsolete: true
https://hg.mozilla.org/integration/mozilla-inbound/rev/99ae6aab3d0b1ebc2be01e4dd95bbcc04c7422d5

(In reply to Jon Coppeard (:jonco) (PTO until 19th September) from comment #27)
> Jan, did your try run pass with this?

Yes looks green :)
Flags: needinfo?(jdemooij)
https://hg.mozilla.org/mozilla-central/rev/99ae6aab3d0b
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
No longer blocks: 1298141
No longer blocks: 1301890
No longer blocks: 1301962
No longer blocks: 1302165
No longer blocks: 1302326
No longer blocks: 1300471
Group: javascript-core-security → core-security-release
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: