Closed
Bug 1315285
Opened 8 years ago
Closed 8 years ago
Frequent Windows IPC-related test hangs with jemalloc4 enabled
Categories
(Core :: Memory Allocator, defect, P3)
Tracking
()
RESOLVED
FIXED
mozilla53
Tracking | Status | |
---|---|---|
firefox49 | --- | unaffected |
firefox50 | --- | unaffected |
firefox51 | --- | unaffected |
firefox52 | --- | fixed |
firefox53 | --- | fixed |
People
(Reporter: RyanVM, Assigned: ting)
References
Details
(Keywords: crash, hang, regression)
Not sure this is really jemalloc4's fault or not, but starting with the Memory Allocator component for now until there's a better sense of what's going on here. While testing the forthcoming jemalloc 4.3.0 release on Try, glandium and I noticed that Win8 in particular was hitting a lot of hangs along the line of the logs below (the dom-level* tests seem to be best for reliably triggering it): https://treeherder.mozilla.org/logviewer.html#?job_id=30453758&repo=try https://treeherder.mozilla.org/logviewer.html#?job_id=30453765&repo=try Both win32 and win64 Talos also have been seeing hangs like: https://treeherder.mozilla.org/logviewer.html#?job_id=30404691&repo=try https://treeherder.mozilla.org/logviewer.html#?job_id=30399341&repo=try I tried bisecting this to a broken jemalloc commit only to realize that it reproduced with the in-tree 4.1.1 version as well. I was eventually able to bisect it down to the debugging patch from bug 1293501 as the culprit. To reproduce on Try, you should only need to make the following changes to the win32/win64 opt build configs, respectively: https://hg.mozilla.org/try/diff/031fb63ab74b/browser/config/mozconfigs/win32/common-opt https://hg.mozilla.org/try/diff/031fb63ab74b/browser/config/mozconfigs/win64/common-opt
Flags: needinfo?(janus926)
Reporter | ||
Comment 1•8 years ago
|
||
Another stack: https://treeherder.mozilla.org/logviewer.html#?job_id=30456590&repo=try
Assignee | ||
Comment 2•8 years ago
|
||
For now I have no ideas how could VirtualProtect in necko cause hangs. Any ideas about what objects the ZwWaitForMultipleObjects is waiting for in memset?
Flags: needinfo?(janus926)
Assignee | ||
Comment 3•8 years ago
|
||
4 of them are with EXCEPTION_ACCESS_VIOLATION_WRITE: https://treeherder.mozilla.org/logviewer.html#?job_id=30453758&repo=try 07:30:34 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 07:30:34 INFO - Crash address: 0x88486fb808 07:30:34 INFO - Assertion: Unknown assertion type 0x00000000 07:30:34 INFO - Process uptime: 42 seconds 07:30:34 INFO - Thread 1 (crashed) 07:30:34 INFO - 0 KERNELBASE.dll!PostQueuedCompletionStatus + 0x108 https://treeherder.mozilla.org/logviewer.html#?job_id=30453765&repo=try 07:28:52 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 07:28:52 INFO - Crash address: 0x84e72e78e8 07:28:52 INFO - Assertion: Unknown assertion type 0x00000000 07:28:52 INFO - Process uptime: 36 seconds 07:28:52 INFO - Thread 1 (crashed) 07:28:52 INFO - 0 xul.dll!mozilla::BufferList<js::SystemAllocPolicy>::BufferList<js::SystemAllocPolicy>(mozilla::BufferList<js::SystemAllocPolicy> &&) [BufferList.h:533f238d2e70 : 97 + 0x0] https://treeherder.mozilla.org/logviewer.html#?job_id=30399341&repo=try 10:51:31 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 10:51:31 INFO - Crash address: 0x11bb0cc 10:51:31 INFO - Assertion: Unknown assertion type 0x00000000 10:51:31 INFO - Process uptime: 1 seconds 10:51:31 INFO - Thread 2 (crashed) 10:51:31 INFO - 0 xul.dll!IPC::Channel::ChannelImpl::OutputQueuePop() [ipc_channel_win.cc:2ce763de722e : 103 + 0x0] https://treeherder.mozilla.org/logviewer.html#?job_id=30456590&repo=try 08:30:35 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 08:30:35 INFO - Crash address: 0x63c1f6a0e8 08:30:35 INFO - Assertion: Unknown assertion type 0x00000000 08:30:35 INFO - Process uptime: 42 seconds 08:30:35 INFO - 08:30:35 INFO - Thread 1 (crashed) 08:30:35 INFO - 0 xul.dll!mozilla::BufferList<InfallibleAllocPolicy>::BufferList<InfallibleAllocPolicy>(mozilla::BufferList<InfallibleAllocPolicy> &&) [BufferList.h:8a6d7d7e632d : 97 + 0x0]
Assignee | ||
Comment 4•8 years ago
|
||
Those stacks don't seem the culprit I'm hunting for in bug 1293501 as those come with jemalloc4. Note the debug patch may stay a bit longer in the tree (and merge to Aurora) because I haven't seen any access violation write crashes that seems from it.
Comment 5•8 years ago
|
||
... except the patch in bug 1293501 looks wrong to me. First, it wastes a page by element in the array, even though only the beginning of that array is meant to be protected. Second, the padding is not large enough to put what follows it out of the protected page. Third, I don't see anything to handle the array being reallocated when its size changes. Which means the protected page stays there, and can end up being given away to something else by jemalloc, while the array itself is not protected anymore. Essentially, it seems to me you're *very* lucky that those crashes are not happening with mozjemalloc already. In fact, for all I know, there may well be unidentified crashes due to that patch.
Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #5) > First, it wastes a page by element in the array, even though only the > beginning of that array is meant to be protected. Yes. But if I add padding to only first element, I'll need to overwrite the array functions for accessing the elements, which I think is too much work for debug. > Second, the padding is not large enough to put what follows it out of the > protected page. I am not sure I understand, why is this needed? From the crash reports earlier, I found it's likely the first nsEntry in the array has nsCString mData 0x0, but the other fields remains valid. So the protected page includes: |nsTArrayHeader|nsEntry.header|padding|nsEntry.value.mData| which covers mData's orginal address and the shifted address. > Third, I don't see anything to handle the array being reallocated when its > size changes. Which means the protected page stays there, and can end up > being given away to something else by jemalloc, while the array itself is > not protected anymore. The only way to alternate the array is through nsHttpRequestHead APIs like: https://dxr.mozilla.org/mozilla-central/rev/86f702229e32c6119d092e86431afee576f033a1/netwerk/protocol/http/nsHttpRequestHead.cpp#162 which each of them has ReentrantMonitorAutoEnter in the beginning of the function. The patch will unprotect the page in the constructor of ReentrantMonitorAutoEnter, and reprotect the page (of course by the updated array mHdr) in its destructor. Also when nsHttpRequestHead is destructed, the page will set back to readwrite. > Essentially, it seems to me you're *very* lucky that those crashes are not > happening with mozjemalloc already. In fact, for all I know, there may well > be unidentified crashes due to that patch. Could be, but from crash-stats I don't see significant access violation write crashes seems to be caused by this: https://crash-stats.mozilla.com/search/?reason=~EXCEPTION_ACCESS_VIOLATION_WRITE&build_id=%3E%3D20161012030211&product=Firefox&version=52.0a1&platform=Windows&date=%3E%3D2016-11-01T09%3A24%3A00.000Z&date=%3C2016-11-08T09%3A24%3A00.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
Updated•8 years ago
|
Priority: -- → P3
Reporter | ||
Updated•8 years ago
|
status-firefox53:
--- → affected
Reporter | ||
Updated•8 years ago
|
Reporter | ||
Comment 7•8 years ago
|
||
Fixed by the bug 1293501 backout.
Assignee: nobody → janus926
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Reporter | ||
Updated•8 years ago
|
Version: unspecified → Trunk
You need to log in
before you can comment on or make changes to this bug.
Description
•