Closed
Bug 1302914
Opened 8 years ago
Closed 8 years ago
browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js,browser_dbg_cmd-blackbox.js | application crashed [@ js::ProfileEntry::script()]
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla52
People
(Reporter: intermittent-bug-filer, Assigned: jandem)
References
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
(deleted),
patch
|
froydnj
:
review+
ritu
:
approval-mozilla-aurora+
ritu
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
Filed by: philringnalda [at] gmail.com
https://treeherder.mozilla.org/logviewer.html#?job_id=35884044&repo=mozilla-inbound
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win64/1473875528/mozilla-inbound_win8_64_test-mochitest-e10s-browser-chrome-5-bm109-tests1-windows-build445.txt.gz
Reporter | ||
Updated•8 years ago
|
Reporter | ||
Updated•8 years ago
|
Comment 2•8 years ago
|
||
So, things are better than they were before but we're still not out of the woods it'd appear :(
Flags: needinfo?(jdemooij)
Assignee | ||
Comment 3•8 years ago
|
||
I opened one of these minidumps in Visual Studio - I think we're crashing in ProfileEntry::script doing script->zoneFromAnyThread()->runtimeFromAnyThread(), so the script pointer is probably bogus.
The bug that was fixed was that we could get a thread stack while doing a compacting GC that moved scripts. But here, the main thread (I assume that's the one we're sampling because there are no active worker threads) is doing non-GC, non-SPS stuff when we crash, so I've no idea *why* the profiler stack contains bogus script pointers now. We trace these as roots so the GC should trace and update them just fine.
It's interesting that a lot of these crashes (like the 3 linked here) are Windows 8 x64. Maybe a difference in thread scheduling behavior is provoking the background hang monitor, I don't know.
I tried to reproduce this on Linux with a very low timeout value for the background hang monitor, maybe I can try that on Windows later...
Reporter | ||
Updated•8 years ago
|
Updated•8 years ago
|
Summary: Intermittent browser/extensions/pdfjs/test/browser_pdfjs_zoom.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js | application crashed [@ js::ProfileEntry::script()]
Updated•8 years ago
|
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js | application crashed [@ js::ProfileEntry::script()]
Updated•8 years ago
|
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js | application crashed [@ js::ProfileEntry::script()]
Updated•8 years ago
|
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js | application crashed [@ js::ProfileEntry::script()]
Comment 6•8 years ago
|
||
Any luck, Jan? This is starting to pile up a lot of dupes similar to the last one :(
Assignee | ||
Comment 7•8 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #6)
> Any luck, Jan? This is starting to pile up a lot of dupes similar to the
> last one :(
I'll look at this later today, after finishing a few other things.
Assignee | ||
Comment 8•8 years ago
|
||
Reports this week are also all on Win8 x64, AFAICS. I'll try to reproduce this on Windows 10...
Updated•8 years ago
|
Summary: browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js | application crashed [@ js::ProfileEntry::script()] → browser_pdfjs_zoom.js,browser_windowless_troubleshoot_crash.js,browser_yahoo_behavior.js,browser_bug1204626.js,browser_dbg_search-symbols.js,browser_cookies_exceptions.js,browser_dbg_cmd-blackbox.js | application crashed [@ js::ProfileEntry::script()]
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 10•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #8)
> I'll try to reproduce this on Windows 10...
No luck with this. Today's attempt is to do some Try pushes with different background hang monitor timeouts, to see if that makes these more frequent.
Assignee | ||
Comment 11•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #10)
> Today's attempt is to do some Try pushes with different
> background hang monitor timeouts, to see if that makes these more frequent.
OK, I do get some oranges per push with this, so that's promising. Unfortunately I have Win8 test jobs queued for 2 hours, so this is probably going to take a long time :/ I'll see what happens when I Try these BHM changes on Linux64 or OS X.
Assignee | ||
Comment 12•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #11)
> OK, I do get some oranges per push with this, so that's promising.
> Unfortunately I have Win8 test jobs queued for 2 hours, so this is probably
> going to take a long time :/ I'll see what happens when I Try these BHM
> changes on Linux64 or OS X.
No oranges on other platforms. I'm now Try-servering a patch that makes SPSProfiler::size_ an Atomic<uint32_t>*, to see if that changes anything.
Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #12)
> No oranges on other platforms. I'm now Try-servering a patch that makes
> SPSProfiler::size_ an Atomic<uint32_t>*, to see if that changes anything.
This didn't help. It makes a synchronization issue while pushing/popping stack entries less likely. I'm now trying a patch to enable the BackgroundHangMonitor in debug builds, to see whether that also fails and/or hits assertion failures.
Assignee | ||
Comment 14•8 years ago
|
||
Enabling the BackgroundHangMonitor in debug builds was super useful. I got some of the same crashes, but I also got a few assertion failures. The assertion failures were interesting, because we were failing an assert like this one:
MOZ_ASSERT(entry->isJs());
But the caller(s) *also* checked/asserted entry->isJs(), so this should be impossible. Except if some other thread is modifying |entry| while we are accessing it. How can that happen if we just suspended the thread we're sampling?
It turns out SuspendThread on Windows doesn't actually suspend the thread *immediately*, it's asynchronous: when SuspendThread returns, the thread can run for a little while longer [0]. That's what's happening here: the background hang monitor thread suspends the main thread, but this thread runs for a while longer, modifying our profiler stack while we're reading from it.
The blog post suggests calling GetThreadContext, to ensure the thread is actually suspended. I did that and I can no longer reproduce this on Try with that change \o/.
[0] https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743
Assignee | ||
Comment 15•8 years ago
|
||
See comment 14.
I fixed the one in ThreadStackHelper.cpp, and also other SuspendThread calls in HangMonitor.cpp and profiler/core/platform-win32.cc
This also fixes an unrelated bug in BackgroundHangThread::ReportHang that was exposed by enabling this code in debug builds: we called mHangStack.back() on an empty Vector.
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Flags: needinfo?(jdemooij)
Attachment #8796095 -
Flags: review?(nfroyd)
Comment 16•8 years ago
|
||
Comment on attachment 8796095 [details] [diff] [review]
Patch
Review of attachment 8796095 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with the fixes below, or a justification of why CONTEXT_CONTROL is OK to use in the non-platform-win32.cc cases. Thanks for tracking this one down.
::: xpcom/threads/BackgroundHangMonitor.cpp
@@ +400,5 @@
> // mManager->mLock IS locked
>
> // Remove unwanted "js::RunScript" frame from the stack
> + if (!mHangStack.empty()) {
> + for (const char** f = &mHangStack.back(); f >= mHangStack.begin(); f--) {
Nice, thanks for fixing this bug.
Now that I have a second chance to look at this, can you fix this loop to use integer indices like so:
for (size_t i = 0; i < mHangStack.size(); i++) {
const char** f = mHangStack.begin() + i;
if (!mHangStack.IsInBuffer(*f) && !strcmp(*f, "js::RunScript")) {
mHangStack.erase(f);
}
}
That way, we're not depending on undefined behavior with the loop exit comparison when f moves past mHangStack.begin().
Followup bug is fine, if you prefer to do it that way.
::: xpcom/threads/HangMonitor.cpp
@@ +163,5 @@
> + // SuspendThread is asynchronous, so the thread may still be running. Use
> + // GetThreadContext to ensure it's really suspended.
> + // See https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743.
> + CONTEXT context;
> + context.ContextFlags = CONTEXT_CONTROL;
Why does using CONTEXT_CONTROL not cause crashes here compared to platform-win32.cc? I think we want to replicate the logic from there in this code, too, replete with the same comment.
::: xpcom/threads/ThreadStackHelper.cpp
@@ +250,5 @@
> + // SuspendThread is asynchronous, so the thread may still be running. Use
> + // GetThreadContext to ensure it's really suspended.
> + // See https://blogs.msdn.microsoft.com/oldnewthing/20150205-00/?p=44743.
> + CONTEXT context;
> + context.ContextFlags = CONTEXT_CONTROL;
Same comment here.
Attachment #8796095 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 17•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #16)
> Why does using CONTEXT_CONTROL not cause crashes here compared to
> platform-win32.cc? I think we want to replicate the logic from there in
> this code, too, replete with the same comment.
The JS engine uses CONTEXT_CONTROL when interrupting JS code, without any known problems, and the crashes in bug 1120126 seem more related to the stack walking code and/or the data we get out of the CONTEXT than to GetThreadContext itself. Here we don't use the CONTEXT at all.
Mind if I land this as is, and if any crashes show up we switch to CONTEXT_FULL?
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 18•8 years ago
|
||
Also, I think none of this code is used in release builds (BackgroundHangMonitor et al are disabled there), and interrupts are far less frequent than what the profiler does, so we could also just go with CONTEXT_FULL...
Comment 19•8 years ago
|
||
(In reply to Jan de Mooij [:jandem] from comment #17)
> (In reply to Nathan Froyd [:froydnj] from comment #16)
> > Why does using CONTEXT_CONTROL not cause crashes here compared to
> > platform-win32.cc? I think we want to replicate the logic from there in
> > this code, too, replete with the same comment.
>
> The JS engine uses CONTEXT_CONTROL when interrupting JS code, without any
> known problems, and the crashes in bug 1120126 seem more related to the
> stack walking code and/or the data we get out of the CONTEXT than to
> GetThreadContext itself. Here we don't use the CONTEXT at all.
Ah, I didn't look far enough. MozStackWalk for this particular case eventually uses CONTEXT_FULL:
http://dxr.mozilla.org/mozilla-central/source/mozglue/misc/StackWalk.cpp#306
and the ThreadStackHelper case weirdly doesn't actually walk the stack. We should really clean all this up sometime. :(
You're right, we don't need to bother with CONTEXT_FULL in the other two places. Land away!
Comment 20•8 years ago
|
||
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/02428eab5922
Use GetThreadContext after calling SuspendThread to ensure threads are really suspended. r=froydnj
Assignee | ||
Comment 21•8 years ago
|
||
(I moved the mHangStack fix from this bug to bug 1306317.)
Updated•8 years ago
|
status-firefox51:
--- → affected
status-firefox52:
--- → affected
Comment 22•8 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 49•8 years ago
|
||
Please request Aurora/Beta approval on this when you're reasonably confident that it's baked for long enough.
Assignee | ||
Comment 51•8 years ago
|
||
Comment on attachment 8796095 [details] [diff] [review]
Patch
Approval Request Comment
[Feature/regressing bug #]: Old bug.
[User impact if declined]: Crashes when using the browser / profiler. Intermittent oranges.
[Describe test coverage new/current, TreeHerder]: Landed a few days ago. No more of these intermittents since then.
[Risks and why]: Low risk, on m-c without problems.
[String/UUID change made/needed]: None.
Flags: needinfo?(jdemooij)
Attachment #8796095 -
Flags: approval-mozilla-beta?
Attachment #8796095 -
Flags: approval-mozilla-aurora?
Comment on attachment 8796095 [details] [diff] [review]
Patch
Fixes an intermittent, verified on Nightly52, Aurora51+, Beta50+
Attachment #8796095 -
Flags: approval-mozilla-beta?
Attachment #8796095 -
Flags: approval-mozilla-beta+
Attachment #8796095 -
Flags: approval-mozilla-aurora?
Attachment #8796095 -
Flags: approval-mozilla-aurora+
Comment 53•8 years ago
|
||
bugherder uplift |
Flags: in-testsuite+
Comment 54•8 years ago
|
||
bugherder uplift |
You need to log in
before you can comment on or make changes to this bug.
Description
•