Closed Bug 1447643 Opened 7 years ago Closed 6 years ago

Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase

Categories

(Core :: JavaScript: GC, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: kats)

References

(Depends on 1 open bug)

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(1 file)

Filed by: ncsoregi [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=169407754&repo=mozilla-central https://queue.taskcluster.net/v1/task/AcyqZ_9NTuezoIto7Jn5oQ/runs/0/artifacts/public/logs/live_backing.log 11:25:37 INFO - TEST-START | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_ext_content_security_policy.js 11:25:38 INFO - TEST-PASS | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript.js | took 244944ms 11:25:38 INFO - mozcrash Saved minidump as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp 11:25:38 INFO - mozcrash Saved app info as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.extra 11:25:38 WARNING - PROCESS-CRASH | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const)] 11:25:38 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\xpc-other-nrhzta\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp 11:25:38 INFO - Operating system: Windows NT 11:25:38 INFO - 10.0.15063 11:25:38 INFO - CPU: amd64 11:25:38 INFO - family 6 model 63 stepping 2 11:25:38 INFO - 8 CPUs 11:25:38 INFO - GPU: UNKNOWN 11:25:38 INFO - Crash reason: EXCEPTION_BREAKPOINT 11:25:38 INFO - Crash address: 0x7fffca0d8c39 11:25:38 INFO - Assertion: Unknown assertion type 0x00000000 11:25:38 INFO - Process uptime: 27 seconds 11:25:38 INFO - Thread 0 (crashed) 11:25:38 INFO - 0 xul.dll!static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const) [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 873 + 0x0] 11:25:38 INFO - rax = 0x0000000000000000 rdx = 0x0000000000000000 11:25:38 INFO - rcx = 0x00000000ffffffff rbx = 0x00007fffd7a1a989 11:25:38 INFO - rsi = 0x00007fffd7930979 rdi = 0x00007ff801cf8d20 11:25:38 INFO - rbp = 0x00007fffdca316a0 rsp = 0x000000793c3fdc80 11:25:38 INFO - r8 = 0x000000793c3fc638 r9 = 0x0000023204e38183 11:25:38 INFO - r10 = 0x0000000000000000 r11 = 0x000000793c3fdb00 11:25:38 INFO - r12 = 0x000000793c3fdfb8 r13 = 0x000000000000001b 11:25:38 INFO - r14 = 0x00007fffdca2af60 r15 = 0x0000023205b55570 11:25:38 INFO - rip = 0x00007fffca0d8c39 11:25:38 INFO - Found by: given as instruction pointer in context 11:25:38 INFO - 1 xul.dll!js::gcstats::Statistics::endSlice() [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1049 + 0x9] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe2c0 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffca0d7bf8 11:25:38 INFO - Found by: call frame info 11:25:38 INFO - 2 xul.dll!js::gc::GCRuntime::gcCycle(bool,js::SliceBudget &,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7407 + 0x32] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe3a0 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fc9659 11:25:38 INFO - Found by: call frame info 11:25:38 INFO - 3 xul.dll!js::gc::GCRuntime::collect(bool,js::SliceBudget,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7538 + 0x14] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe4b0 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fcb754 11:25:38 INFO - Found by: call frame info 11:25:38 INFO - 4 xul.dll!js::gc::GCRuntime::gcSlice(JS::gcreason::Reason,__int64) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7627 + 0x2e] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe610 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fcc2dc 11:25:38 INFO - Found by: call frame info 11:25:38 INFO - 5 xul.dll!nsJSContext::GarbageCollectNow(JS::gcreason::Reason,nsJSContext::IsIncremental,nsJSContext::IsShrinking,__int64) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1225 + 0xd] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe680 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc06cfba8 11:25:38 INFO - Found by: call frame info 11:25:38 INFO - 6 xul.dll!InterSliceGCRunnerFired(mozilla::TimeStamp,void *) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1834 + 0x1d] 11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0 11:25:38 INFO - rsp = 0x000000793c3fe720 r12 = 0x000000793c3fdfb8 11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60 11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc06d96c2 11:25:38 INFO - Found by: call frame info
Pretty sure this isn't actually the right component, but its less wrong than Add-ons Manager
Component: Add-ons Manager → WebExtensions: Request Handling
This is a GC assertion failure: 11:25:39 INFO - PID 8960 | Parent Sweep time = 187.338ms with 2.396ms remaining, child Sweep Shape time 2.887ms 11:25:39 INFO - PID 8960 | Assertion failure: ok (Inconsistent time data), at z:/build/build/src/js/src/gc/Statistics.cpp:873
Component: WebExtensions: Request Handling → JavaScript: GC
Product: Toolkit → Core
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INCOMPLETE
Summary: Intermittent toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase → Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase
This is the same as bug 1400153.
Depends on: 1400153
Attached patch Drop assertion (deleted) — Splinter Review
Attachment #9020763 - Flags: review?(sphink)
Comment on attachment 9020763 [details] [diff] [review] Drop assertion Redirecting to GC triage owner due to lack of movement
Attachment #9020763 - Flags: review?(sphink) → review?(jcoppeard)
Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and is also showing up in some of the logs for these failures.
Comment on attachment 9020763 [details] [diff] [review] Drop assertion Review of attachment 9020763 [details] [diff] [review]: ----------------------------------------------------------------- I don't mind turning off this assertion for now, but I'd rather we only did this on Windows (nearly every failure is on Windows). If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN ?
Attachment #9020763 - Flags: review?(jcoppeard) → review+
(In reply to Jon Coppeard (:jonco) from comment #49) > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and > is also showing up in some of the logs for these failures. Indeed, that's a good point. I took a quick look at that assertion failure and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing non-monotonic values, but I don't see how that would happen since the low-res time just comes from GetTickCount64 on Windows, which should be monotonic. (In reply to Jon Coppeard (:jonco) from comment #50) > I don't mind turning off this assertion for now, but I'd rather we only did > this on Windows (nearly every failure is on Windows). > > If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN > ? Thanks, will do.
Pushed by kgupta@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c916195bb47d Remove frequently failing assertion. r=jonco
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Assignee: nobody → kats
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51) > (In reply to Jon Coppeard (:jonco) from comment #49) > > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and > > is also showing up in some of the logs for these failures. > > Indeed, that's a good point. I took a quick look at that assertion failure > and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing > non-monotonic values, but I don't see how that would happen since the > low-res time just comes from GetTickCount64 on Windows, which should be > monotonic. Yeah, that was the purpose of this assertion. The GC statistics code doesn't actually care that much about this problem (it'll result in a smattering of weird telemetry results, but that's about it). But I had heard that other code also depends on this timer being monotonic, and the belief was that all remaining issues (all on Windows) had been fixed. So I added an assert to help track it down. When I saw that it was firing, I also found that the logic was pretty complex, so it couldn't immediately be assumed to be the clock going backward. I added a bunch of information to make that easier to track down, but then never circled back to really look at it. I'm going to needinfo Nathan here in case this is of use to him; otherwise, it may as well just be removed where it's problematic (#ifdef XP_WIN). nfroyd, there should be a bunch of log files with fairly verbose diagnostic output that at least tries to point a finger at what's happening.
Flags: needinfo?(nfroyd)
(In reply to Steve Fink [:sfink] [:s:] from comment #57) > nfroyd, there should be a bunch of log files with fairly verbose diagnostic > output that at least tries to point a finger at what's happening. Where are those log files? This sounds vaguely like bug 1487778, but we "fixed" that two months ago by adding monotonicity guards in the TimeStamp code. So I'm not completely sure what the next step is. Honza fixed the aforementioned bug, perhaps the log files might inspire a similar fix someplace else?
Flags: needinfo?(nfroyd)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51) > (In reply to Jon Coppeard (:jonco) from comment #49) > > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and > > is also showing up in some of the logs for these failures. > > Indeed, that's a good point. I took a quick look at that assertion failure > and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing > non-monotonic values, but I don't see how that would happen since the > low-res time just comes from GetTickCount64 on Windows, which should be > monotonic. As Nathan points in comment 58, I've only fixed monotonicity of QueryPerformanceCounter (the hi-res version) = TimeStamp::Now. The low res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic on a single thread. It's a software based counter, not HW, and is under control of the OS.
(In reply to Nathan Froyd [:froydnj] from comment #58) > (In reply to Steve Fink [:sfink] [:s:] from comment #57) > > nfroyd, there should be a bunch of log files with fairly verbose diagnostic > > output that at least tries to point a finger at what's happening. > > Where are those log files? Bug 1400153 (the log messages all have that bug number in them now). > This sounds vaguely like bug 1487778, but we "fixed" that two months ago by > adding monotonicity guards in the TimeStamp code. So I'm not completely > sure what the next step is. Honza fixed the aforementioned bug, perhaps the > log files might inspire a similar fix someplace else? (In reply to Honza Bambas (:mayhemer) from comment #59) > As Nathan points in comment 58, I've only fixed monotonicity of > QueryPerformanceCounter (the hi-res version) = TimeStamp::Now. The low > res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic > on a single thread. It's a software based counter, not HW, and is under > control of the OS. Hm. It looks like we're using a function called ReallyNow(), to bypass record-and-replay, which calls NowUnfuzzed(), which calls NowUnfuzzed(aHighResolution=true), which on Windows calls NowInternal(aHighResolution=true), which does something depending on sUseQPC (that's probably your stuff?) Ok, so looking at an example log https://treeherder.mozilla.org/logviewer.html#?job_id=209485576&repo=mozilla-inbound&lineNumber=70490 I see Parent Mark During Sweeping time = 499.000ms with 498.890ms remaining, child Mark Gray time 499.000ms Assertion failure: ok (Inconsistent time data; see bug 1400153), at z:/build/build/src/js/src/gc/Statistics.cpp:981 That first line comes from if (selfTimes[parent] < childTime) { fprintf(stderr, "Parent %s time = %.3fms with %.3fms remaining, child %s time %.3fms\n", phases[parent].name, times[parent].ToMilliseconds(), selfTimes[parent].ToMilliseconds(), phases[child].name, childTime.ToMilliseconds()); } which means that we have a little piece of the time recording tree where the parent node (labeled "Mark During Sweeping") was calculated to take 499.000ms total (end timestamp minus start timestamp), but it has a child "Mark Gray" that took all of that 499.000ms plus at least one other child that took 1.11ms. Children are actions that run within the scope of their parents (as in, a parent starts, then a child starts and eventually ends, perhaps more children run, then finally the parent ends -- but the whole thing *can* be repeated and the totals accumulated, though the inequalities should still hold.) SWEEP_MARK ("Mark During Sweeping") has a whole bunch of potential children https://searchfox.org/mozilla-central/source/js/src/gc/GenerateStatsPhases.py#103 but we're noticing the imbalance here on the 2nd to last possible child. This is a little indirect. For other assertion sites, I have more specific logging. Let me look at some others... huh. Every single one of them is detected in the same place (line 981). I would have expected, or at least hoped, that a real problem would show up in one of the other sites... and since you disabled this specific check for Windows, I see no bug 1400153 failures in the last 21 days. So it is possible that this check (currently on line 981) is bogus, given that none of the other checks are seeing anything? This may be bad logic after all. Though I don't know why it would only happen on Windows? The possibility of having multiple runs of these phases is annoying, because it makes it harder to simply record start and end times and report them in these crashes. I have an assertion that when a parent ends, it checks to be sure its recorded end is >= all of its children's. And the same for start times. I don't understand how this can happen if those constraints hold. Wow, this log https://treeherder.mozilla.org/logviewer.html#?job_id=207654318&repo=autoland&lineNumber=5395 is off by 8ms! All these failures are in LongestPhaseSelfTimeInMajorGC(), which is called for both the regular case and a funky case where we're recording parallel times in another thread. But that 8ms case is the non-parallel one. If I really wanted to track this down, I guess I'd create an in-memory rolling log of all start and end times and dump it out on failure. Sadly, I'm not feeling that motivated at the moment; the code already handles these failures gracefully in non-debug builds, and it doesn't happen often enough to be worth a lot of effort.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: