Closed
Bug 1447643
Opened 7 years ago
Closed 6 years ago
Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase
Categories
(Core :: JavaScript: GC, defect, P5)
Core
JavaScript: GC
Tracking
()
RESOLVED
FIXED
mozilla65
Tracking | Status | |
---|---|---|
firefox-esr60 | --- | unaffected |
firefox63 | --- | wontfix |
firefox64 | --- | wontfix |
firefox65 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: kats)
References
(Depends on 1 open bug)
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(1 file)
(deleted),
patch
|
jonco
:
review+
|
Details | Diff | Splinter Review |
Filed by: ncsoregi [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=169407754&repo=mozilla-central
https://queue.taskcluster.net/v1/task/AcyqZ_9NTuezoIto7Jn5oQ/runs/0/artifacts/public/logs/live_backing.log
11:25:37 INFO - TEST-START | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_ext_content_security_policy.js
11:25:38 INFO - TEST-PASS | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript.js | took 244944ms
11:25:38 INFO - mozcrash Saved minidump as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp
11:25:38 INFO - mozcrash Saved app info as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.extra
11:25:38 WARNING - PROCESS-CRASH | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const)]
11:25:38 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\xpc-other-nrhzta\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp
11:25:38 INFO - Operating system: Windows NT
11:25:38 INFO - 10.0.15063
11:25:38 INFO - CPU: amd64
11:25:38 INFO - family 6 model 63 stepping 2
11:25:38 INFO - 8 CPUs
11:25:38 INFO - GPU: UNKNOWN
11:25:38 INFO - Crash reason: EXCEPTION_BREAKPOINT
11:25:38 INFO - Crash address: 0x7fffca0d8c39
11:25:38 INFO - Assertion: Unknown assertion type 0x00000000
11:25:38 INFO - Process uptime: 27 seconds
11:25:38 INFO - Thread 0 (crashed)
11:25:38 INFO - 0 xul.dll!static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const) [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 873 + 0x0]
11:25:38 INFO - rax = 0x0000000000000000 rdx = 0x0000000000000000
11:25:38 INFO - rcx = 0x00000000ffffffff rbx = 0x00007fffd7a1a989
11:25:38 INFO - rsi = 0x00007fffd7930979 rdi = 0x00007ff801cf8d20
11:25:38 INFO - rbp = 0x00007fffdca316a0 rsp = 0x000000793c3fdc80
11:25:38 INFO - r8 = 0x000000793c3fc638 r9 = 0x0000023204e38183
11:25:38 INFO - r10 = 0x0000000000000000 r11 = 0x000000793c3fdb00
11:25:38 INFO - r12 = 0x000000793c3fdfb8 r13 = 0x000000000000001b
11:25:38 INFO - r14 = 0x00007fffdca2af60 r15 = 0x0000023205b55570
11:25:38 INFO - rip = 0x00007fffca0d8c39
11:25:38 INFO - Found by: given as instruction pointer in context
11:25:38 INFO - 1 xul.dll!js::gcstats::Statistics::endSlice() [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1049 + 0x9]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe2c0 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffca0d7bf8
11:25:38 INFO - Found by: call frame info
11:25:38 INFO - 2 xul.dll!js::gc::GCRuntime::gcCycle(bool,js::SliceBudget &,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7407 + 0x32]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe3a0 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fc9659
11:25:38 INFO - Found by: call frame info
11:25:38 INFO - 3 xul.dll!js::gc::GCRuntime::collect(bool,js::SliceBudget,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7538 + 0x14]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe4b0 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fcb754
11:25:38 INFO - Found by: call frame info
11:25:38 INFO - 4 xul.dll!js::gc::GCRuntime::gcSlice(JS::gcreason::Reason,__int64) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7627 + 0x2e]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe610 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc9fcc2dc
11:25:38 INFO - Found by: call frame info
11:25:38 INFO - 5 xul.dll!nsJSContext::GarbageCollectNow(JS::gcreason::Reason,nsJSContext::IsIncremental,nsJSContext::IsShrinking,__int64) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1225 + 0xd]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe680 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc06cfba8
11:25:38 INFO - Found by: call frame info
11:25:38 INFO - 6 xul.dll!InterSliceGCRunnerFired(mozilla::TimeStamp,void *) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1834 + 0x1d]
11:25:38 INFO - rbx = 0x00007fffd7a1a989 rbp = 0x00007fffdca316a0
11:25:38 INFO - rsp = 0x000000793c3fe720 r12 = 0x000000793c3fdfb8
11:25:38 INFO - r13 = 0x000000000000001b r14 = 0x00007fffdca2af60
11:25:38 INFO - r15 = 0x0000023205b55570 rip = 0x00007fffc06d96c2
11:25:38 INFO - Found by: call frame info
Comment 1•7 years ago
|
||
Pretty sure this isn't actually the right component, but its less wrong than Add-ons Manager
Component: Add-ons Manager → WebExtensions: Request Handling
Comment 2•7 years ago
|
||
This is a GC assertion failure:
11:25:39 INFO - PID 8960 | Parent Sweep time = 187.338ms with 2.396ms remaining, child Sweep Shape time 2.887ms
11:25:39 INFO - PID 8960 | Assertion failure: ok (Inconsistent time data), at z:/build/build/src/js/src/gc/Statistics.cpp:873
Component: WebExtensions: Request Handling → JavaScript: GC
Product: Toolkit → Core
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 11•6 years ago
|
||
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Comment hidden (Intermittent Failures Robot) |
Comment 13•6 years ago
|
||
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Comment 15•6 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → INCOMPLETE
Comment 16•6 years ago
|
||
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Updated•6 years ago
|
Summary: Intermittent toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase → Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 42•6 years ago
|
||
Attachment #9020763 -
Flags: review?(sphink)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 48•6 years ago
|
||
Comment on attachment 9020763 [details] [diff] [review]
Drop assertion
Redirecting to GC triage owner due to lack of movement
Attachment #9020763 -
Flags: review?(sphink) → review?(jcoppeard)
Comment 49•6 years ago
|
||
Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and is also showing up in some of the logs for these failures.
Comment 50•6 years ago
|
||
Comment on attachment 9020763 [details] [diff] [review]
Drop assertion
Review of attachment 9020763 [details] [diff] [review]:
-----------------------------------------------------------------
I don't mind turning off this assertion for now, but I'd rather we only did this on Windows (nearly every failure is on Windows).
If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN ?
Attachment #9020763 -
Flags: review?(jcoppeard) → review+
Assignee | ||
Comment 51•6 years ago
|
||
(In reply to Jon Coppeard (:jonco) from comment #49)
> Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> is also showing up in some of the logs for these failures.
Indeed, that's a good point. I took a quick look at that assertion failure and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing non-monotonic values, but I don't see how that would happen since the low-res time just comes from GetTickCount64 on Windows, which should be monotonic.
(In reply to Jon Coppeard (:jonco) from comment #50)
> I don't mind turning off this assertion for now, but I'd rather we only did
> this on Windows (nearly every failure is on Windows).
>
> If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN
> ?
Thanks, will do.
Comment 52•6 years ago
|
||
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c916195bb47d
Remove frequently failing assertion. r=jonco
Comment 53•6 years ago
|
||
bugherder |
Status: REOPENED → RESOLVED
Closed: 6 years ago → 6 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Updated•6 years ago
|
Assignee: nobody → kats
Updated•6 years ago
|
status-firefox63:
--- → wontfix
status-firefox64:
--- → wontfix
status-firefox-esr60:
--- → unaffected
Comment hidden (Intermittent Failures Robot) |
Comment 57•6 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51)
> (In reply to Jon Coppeard (:jonco) from comment #49)
> > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> > is also showing up in some of the logs for these failures.
>
> Indeed, that's a good point. I took a quick look at that assertion failure
> and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing
> non-monotonic values, but I don't see how that would happen since the
> low-res time just comes from GetTickCount64 on Windows, which should be
> monotonic.
Yeah, that was the purpose of this assertion. The GC statistics code doesn't actually care that much about this problem (it'll result in a smattering of weird telemetry results, but that's about it). But I had heard that other code also depends on this timer being monotonic, and the belief was that all remaining issues (all on Windows) had been fixed. So I added an assert to help track it down. When I saw that it was firing, I also found that the logic was pretty complex, so it couldn't immediately be assumed to be the clock going backward. I added a bunch of information to make that easier to track down, but then never circled back to really look at it. I'm going to needinfo Nathan here in case this is of use to him; otherwise, it may as well just be removed where it's problematic (#ifdef XP_WIN).
nfroyd, there should be a bunch of log files with fairly verbose diagnostic output that at least tries to point a finger at what's happening.
Flags: needinfo?(nfroyd)
Comment 58•6 years ago
|
||
(In reply to Steve Fink [:sfink] [:s:] from comment #57)
> nfroyd, there should be a bunch of log files with fairly verbose diagnostic
> output that at least tries to point a finger at what's happening.
Where are those log files?
This sounds vaguely like bug 1487778, but we "fixed" that two months ago by adding monotonicity guards in the TimeStamp code. So I'm not completely sure what the next step is. Honza fixed the aforementioned bug, perhaps the log files might inspire a similar fix someplace else?
Flags: needinfo?(nfroyd)
Comment 59•6 years ago
|
||
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51)
> (In reply to Jon Coppeard (:jonco) from comment #49)
> > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> > is also showing up in some of the logs for these failures.
>
> Indeed, that's a good point. I took a quick look at that assertion failure
> and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing
> non-monotonic values, but I don't see how that would happen since the
> low-res time just comes from GetTickCount64 on Windows, which should be
> monotonic.
As Nathan points in comment 58, I've only fixed monotonicity of QueryPerformanceCounter (the hi-res version) = TimeStamp::Now. The low res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic on a single thread. It's a software based counter, not HW, and is under control of the OS.
Comment 60•6 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #58)
> (In reply to Steve Fink [:sfink] [:s:] from comment #57)
> > nfroyd, there should be a bunch of log files with fairly verbose diagnostic
> > output that at least tries to point a finger at what's happening.
>
> Where are those log files?
Bug 1400153 (the log messages all have that bug number in them now).
> This sounds vaguely like bug 1487778, but we "fixed" that two months ago by
> adding monotonicity guards in the TimeStamp code. So I'm not completely
> sure what the next step is. Honza fixed the aforementioned bug, perhaps the
> log files might inspire a similar fix someplace else?
(In reply to Honza Bambas (:mayhemer) from comment #59)
> As Nathan points in comment 58, I've only fixed monotonicity of
> QueryPerformanceCounter (the hi-res version) = TimeStamp::Now. The low
> res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic
> on a single thread. It's a software based counter, not HW, and is under
> control of the OS.
Hm. It looks like we're using a function called ReallyNow(), to bypass record-and-replay, which calls NowUnfuzzed(), which calls NowUnfuzzed(aHighResolution=true), which on Windows calls NowInternal(aHighResolution=true), which does something depending on sUseQPC (that's probably your stuff?)
Ok, so looking at an example log https://treeherder.mozilla.org/logviewer.html#?job_id=209485576&repo=mozilla-inbound&lineNumber=70490 I see
Parent Mark During Sweeping time = 499.000ms with 498.890ms remaining, child Mark Gray time 499.000ms
Assertion failure: ok (Inconsistent time data; see bug 1400153), at z:/build/build/src/js/src/gc/Statistics.cpp:981
That first line comes from
if (selfTimes[parent] < childTime) {
fprintf(stderr,
"Parent %s time = %.3fms with %.3fms remaining, child %s time %.3fms\n",
phases[parent].name,
times[parent].ToMilliseconds(),
selfTimes[parent].ToMilliseconds(),
phases[child].name,
childTime.ToMilliseconds());
}
which means that we have a little piece of the time recording tree where the parent node (labeled "Mark During Sweeping") was calculated to take 499.000ms total (end timestamp minus start timestamp), but it has a child "Mark Gray" that took all of that 499.000ms plus at least one other child that took 1.11ms. Children are actions that run within the scope of their parents (as in, a parent starts, then a child starts and eventually ends, perhaps more children run, then finally the parent ends -- but the whole thing *can* be repeated and the totals accumulated, though the inequalities should still hold.) SWEEP_MARK ("Mark During Sweeping") has a whole bunch of potential children https://searchfox.org/mozilla-central/source/js/src/gc/GenerateStatsPhases.py#103 but we're noticing the imbalance here on the 2nd to last possible child.
This is a little indirect. For other assertion sites, I have more specific logging. Let me look at some others... huh. Every single one of them is detected in the same place (line 981). I would have expected, or at least hoped, that a real problem would show up in one of the other sites... and since you disabled this specific check for Windows, I see no bug 1400153 failures in the last 21 days. So it is possible that this check (currently on line 981) is bogus, given that none of the other checks are seeing anything? This may be bad logic after all. Though I don't know why it would only happen on Windows?
The possibility of having multiple runs of these phases is annoying, because it makes it harder to simply record start and end times and report them in these crashes. I have an assertion that when a parent ends, it checks to be sure its recorded end is >= all of its children's. And the same for start times. I don't understand how this can happen if those constraints hold.
Wow, this log https://treeherder.mozilla.org/logviewer.html#?job_id=207654318&repo=autoland&lineNumber=5395 is off by 8ms! All these failures are in LongestPhaseSelfTimeInMajorGC(), which is called for both the regular case and a funky case where we're recording parallel times in another thread. But that 8ms case is the non-parallel one.
If I really wanted to track this down, I guess I'd create an in-memory rolling log of all start and end times and dump it out on failure. Sadly, I'm not feeling that motivated at the moment; the code already handles these failures gracefully in non-debug builds, and it doesn't happen often enough to be worth a lot of effort.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•