Open Bug 1719481 Opened 3 years ago Updated 2 years ago

Intermittent ccov PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread + 0x16a]

Categories

(Testing :: Code Coverage, defect)

defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, Whiteboard: [stockwell fixed:other])

Crash Data

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=344563965&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/LT5kLaJnR_e_kwAUmqJSFg/runs/0/artifacts/public/logs/live_backing.log


[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO - PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread + 0x16a]
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO - Mozilla crash reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO - Crash dump filename: C:\Users\task_1625659277\AppData\Local\Temp\tmptkgwu0vx.mozrunner\minidumps\da69e453-e7a0-4138-8dc8-0760c21b39d3.dmp
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO - Operating system: Windows NT
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO -                   10.0.17134 
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO - CPU: amd64
[task 2021-07-07T12:25:06.987Z] 12:25:06     INFO -      family 6 model 94 stepping 3
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -      8 CPUs
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - 
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - GPU: UNKNOWN
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - 
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - Crash reason:  EXCEPTION_BREAKPOINT
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - Crash address: 0xba5eb1c7
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - Process uptime: 73 seconds
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - 
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO - Thread 19 (crashed) - Shutdown Hang Terminator 0  xul.dll!void mozilla::nsTerminator::AdvancePhase(enum mozilla::ShutdownPhase) + 0x447
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -     rax = 0x00007ffec36033ee   rdx = 0x0000000000000000
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -     rcx = 0x00007ffef1af4bb0   rbx = 0x00000219bbe59d30
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -     rsi = 0x00007ffefeae3ca0   rdi = 0x0000000000000276
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -     rbp = 0x0000000000000000   rsp = 0x000000d3e7fbfc50
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -      r8 = 0x0000000000000008    r9 = 0x00007ffefd290cec
[task 2021-07-07T12:25:06.988Z] 12:25:06     INFO -     r10 = 0x00000fffdfa72be7   r11 = 0x0022008000000002
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     r12 = 0x00007ffedbaf7c20   r13 = 0x00000219ca46b138
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     r14 = 0x00007ffedbaceef4   r15 = 0x00007fff001f82c0
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     rip = 0x00007ffeba5eb1c7
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     Found by: given as instruction pointer in context
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -  1  nss3.dll!PR_NativeRunThread + 0x16a
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     rbx = 0x00000219bbe59d30   rbp = 0x0000000000000000
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     rsp = 0x000000d3e7fbfca0   r12 = 0x00007ffedbaf7c20
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     r13 = 0x00000219ca46b138   r14 = 0x00007ffedbaceef4
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     r15 = 0x00007fff001f82c0   rip = 0x00007ffedb95186a
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     Found by: call frame info
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -  2  nss3.dll!PR_MD_CREATE_THREAD + 0xd9
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     rbx = 0x00000219bbe59d30   rbp = 0x0000000000000000
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     rsp = 0x000000d3e7fbfd10   r12 = 0x00007ffedbaf7c20
[task 2021-07-07T12:25:06.989Z] 12:25:06     INFO -     r13 = 0x00000219ca46b138   r14 = 0x00007ffedbaceef4
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     r15 = 0x00007fff001f82c0   rip = 0x00007ffedb93dc69
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     Found by: call frame info
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -  3  ucrtbase.dll!RtlpHpSegPageRangeShrink + 0xda
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     rbx = 0x00000219bbe59d30   rbp = 0x0000000000000000
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     rsp = 0x000000d3e7fbfd40   r12 = 0x00007ffedbaf7c20
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     r13 = 0x00000219ca46b138   r14 = 0x00007ffedbaceef4
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     r15 = 0x00007fff001f82c0   rip = 0x00007ffefd26c4be
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO -     Found by: call frame info
[task 2021-07-07T12:25:06.990Z] 12:25:06     INFO - 
Component: Mochitest → XPCOM
Product: Testing → Core

Update:
There have been 59 failures within the last 7 days:
• 44 failures on Windows 10 x64 2004 CCov WebRender opt
• 15 failures on Windows 10 x64 CCov WebRender opt
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=347570272&repo=mozilla-central&lineNumber=4467

[task 2021-08-05T20:11:44.686Z] 20:11:44     INFO - REFTEST TEST-START | layout/reftests/svg/as-image/defer-unsupported-1.svg == layout/reftests/svg/as-image/defer-unsupported-1-ref.svg
[task 2021-08-05T20:11:44.687Z] 20:11:44     INFO - REFTEST TEST-LOAD | file:///Z:/task_162819332576699/build/tests/reftest/tests/layout/reftests/svg/as-image/defer-unsupported-1.svg | 140 / 141 (99%)
[task 2021-08-05T20:11:44.762Z] 20:11:44     INFO - REFTEST TEST-LOAD | file:///Z:/task_162819332576699/build/tests/reftest/tests/layout/reftests/svg/as-image/defer-unsupported-1-ref.svg | 140 / 141 (99%)
[task 2021-08-05T20:11:44.826Z] 20:11:44     INFO - REFTEST TEST-PASS | layout/reftests/svg/as-image/defer-unsupported-1.svg == layout/reftests/svg/as-image/defer-unsupported-1-ref.svg | image comparison, max difference: 0, number of differing pixels: 0
[task 2021-08-05T20:11:44.826Z] 20:11:44     INFO - REFTEST TEST-END | layout/reftests/svg/as-image/defer-unsupported-1.svg == layout/reftests/svg/as-image/defer-unsupported-1-ref.svg
[task 2021-08-05T20:11:44.841Z] 20:11:44     INFO - REFTEST INFO | Slowest test took 995ms (file:///Z:/task_162819332576699/build/tests/reftest/tests/layout/reftests/svg/as-image/img-novb-widthAndHeight-all-1-ref.html)
[task 2021-08-05T20:11:44.841Z] 20:11:44     INFO - REFTEST INFO | Total canvas count = 4
[task 2021-08-05T20:11:44.903Z] 20:11:44     INFO - 1628194304903	Marionette	TRACE	Received observer notification quit-application
[task 2021-08-05T20:11:44.905Z] 20:11:44     INFO - 1628194304903	Marionette	INFO	Stopped listening on port 2828
[task 2021-08-05T20:11:44.905Z] 20:11:44     INFO - 1628194304904	Marionette	DEBUG	Marionette stopped listening
[task 2021-08-05T20:11:44.975Z] 20:11:44     INFO - JavaScript error: resource:///modules/Interactions.jsm, line 230: NS_ERROR_FAILURE: Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIUserIdleService.removeIdleObserver]
[task 2021-08-05T20:11:44.983Z] 20:11:44     INFO - ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2021-08-05T20:11:45.393Z] 20:11:45     INFO - ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2021-08-05T20:11:45.394Z] 20:11:45     INFO - ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2021-08-05T20:11:45.394Z] 20:11:45     INFO - ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2021-08-05T20:11:46.795Z] 20:11:46     INFO - ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2021-08-05T20:12:52.809Z] 20:12:52    ERROR - TEST-UNEXPECTED-FAIL | Last test finished | application terminated with exit code 1
[task 2021-08-05T20:12:52.809Z] 20:12:52     INFO - REFTEST INFO | Downloading symbols from: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Fejj4_-TTPK135s8vNnWqw/artifacts/public/build/target.crashreporter-symbols.zip
[task 2021-08-05T20:12:55.929Z] 20:12:55     INFO - REFTEST INFO | Copy/paste: Z:/task_162819332576699/fetches\minidump_stackwalk\minidump_stackwalk.exe C:\Users\task_162819332576699\AppData\Local\Temp\tmpwosl7qvy.mozrunner\minidumps\e9aa6647-cb09-45dc-97c0-efa2fdb5d848.dmp C:\Users\task_162819332576699\AppData\Local\Temp\tmpohhk74m4
[task 2021-08-05T20:12:59.727Z] 20:12:59     INFO - REFTEST INFO | Saved minidump as Z:\task_162819332576699\build\blobber_upload_dir\e9aa6647-cb09-45dc-97c0-efa2fdb5d848.dmp
[task 2021-08-05T20:12:59.732Z] 20:12:59     INFO - REFTEST INFO | Saved app info as Z:\task_162819332576699\build\blobber_upload_dir\e9aa6647-cb09-45dc-97c0-efa2fdb5d848.extra
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - REFTEST PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread + 0x16a]
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Mozilla crash reason: MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Crash dump filename: C:\Users\task_162819332576699\AppData\Local\Temp\tmpwosl7qvy.mozrunner\minidumps\e9aa6647-cb09-45dc-97c0-efa2fdb5d848.dmp
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Operating system: Windows NT
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -                   10.0.19041 
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - CPU: amd64
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -      family 6 model 63 stepping 2
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -      6 CPUs
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - 
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - GPU: UNKNOWN
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - 
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Crash reason:  EXCEPTION_BREAKPOINT
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Crash address: 0xacd98107
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Process uptime: 104 seconds
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - 
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO - Thread 17 (crashed) - Shutdown Hang Terminator 0  xul.dll!void mozilla::nsTerminator::AdvancePhase(enum mozilla::ShutdownPhase) + 0x447
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     rax = 0x00007ffeb5f5be7b   rdx = 0x0000000000000000
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     rcx = 0x00007ffee46ba8e0   rbx = 0x000002208fd59b30
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     rsi = 0x00007ffeed0fada0   rdi = 0x0000000000000276
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     rbp = 0x0000000000000000   rsp = 0x00000037540cfd60
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -      r8 = 0x0000000000000002    r9 = 0x00000037540cfc08
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     r10 = 0x0000000000000013   r11 = 0x2200000080082000
[task 2021-08-05T20:12:59.776Z] 20:12:59     INFO -     r12 = 0x00007ffed6607c20   r13 = 0x000002209a219538
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r14 = 0x00007ffed65def24   r15 = 0x00007ffeeeb9faa0
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rip = 0x00007ffeacd98107
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     Found by: given as instruction pointer in context
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -  1  nss3.dll!PR_NativeRunThread + 0x16a
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rbx = 0x000002208fd59b30   rbp = 0x0000000000000000
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rsp = 0x00000037540cfdb0   r12 = 0x00007ffed6607c20
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r13 = 0x000002209a219538   r14 = 0x00007ffed65def24
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r15 = 0x00007ffeeeb9faa0   rip = 0x00007ffed646186a
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     Found by: call frame info
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -  2  nss3.dll!PR_MD_CREATE_THREAD + 0xd9
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rbx = 0x000002208fd59b30   rbp = 0x0000000000000000
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rsp = 0x00000037540cfe20   r12 = 0x00007ffed6607c20
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r13 = 0x000002209a219538   r14 = 0x00007ffed65def24
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r15 = 0x00007ffeeeb9faa0   rip = 0x00007ffed644dc69
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     Found by: call frame info
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -  3  ucrtbase.dll!RtlpReAllocateHeap + 0x1ce
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rbx = 0x000002208fd59b30   rbp = 0x0000000000000000
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     rsp = 0x00000037540cfe50   r12 = 0x00007ffed6607c20
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r13 = 0x000002209a219538   r14 = 0x00007ffed65def24
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     r15 = 0x00007ffeeeb9faa0   rip = 0x00007ffeecb01bb2
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO -     Found by: call frame info
[task 2021-08-05T20:12:59.777Z] 20:12:59     INFO - 
Whiteboard: [stockwell needswork:owner]

Nika, as the owner of this component, could you help us assign it to someone?
Thank you.

Flags: needinfo?(nika)

FWIW, it looks like these are windows10-64 ccov WebRender shutdown hangs, spread across various test suites.

In a few logs I looked at, the main thread was in XPCOM shutdown, doing mozilla::ipc::BrowserProcessSubThread::`scalar deleting destructor'

I managed to get a minidump open, though only for bug 1685204. I'm expecting that these two crashes have the same cause. We're hanging within ChildReaper::WillDestroyCurrentMessageLoop here: https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/ipc/chromium/src/chrome/common/process_watcher_win.cc#38 in response to the I/O thread trying to exit. The main thread is then hanging trying to join the I/O thread from BrowserProcessSubThread's destructor.

I'm guessing these are only appearing in debug & ccov builds because toolkit.shutdown.fastShutdownStage is disabled for these builds (https://searchfox.org/mozilla-central/rev/162a8620f126c27de70766808708696260f2256a/modules/libpref/init/StaticPrefList.yaml#11013-11017).

ni? :jld who I know has been touching the ChildReaper code recently, and might have changed something which is leading to these hangs.

Flags: needinfo?(nika) → needinfo?(jld)
Component: XPCOM → IPC
Crash Signature: [@ PR_NativeRunThread + 0x16a] → [@ PR_NativeRunThread + 0x16a] [@ _PR_NativeRunThread(void*)]

I've been touching the code, but I haven't landed those patches, and they're only to the Unix side, and they don't change this behavior, which is intentional: On builds that need to do a full orderly shutdown because of instrumentation (leak checking, ccov, etc.), the parent process blocks at this point waiting for child processes to exit.

What would help is a way to get the stack for the child process in this case. Nika, you pointed out the other week that this should be converted to use nsIAsyncShutdown instead of just blocking in late shutdown like this, and I see it has a state property for exporting information into a hang crash… but in this case I think we'd want a way to say “do a paired minidump with pid N instead” (rather than trying to somehow get all the process's stacks in a useful format and encoded as nsIPropertyBags/nsIVariants).

Crash Signature: [@ PR_NativeRunThread + 0x16a] [@ _PR_NativeRunThread(void*)] → [@ PR_NativeRunThread + 0x16a] [@ _PR_NativeRunThread(void*)]
Flags: needinfo?(jld)

There have been 109 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • windows10-64-2004-ccov-qr

The crashes here are all code coverage build related. Maybe they stopped on Friday, or failing jobs haven't been classified over the weekend yet. Marco, could you have a look please?

Component: IPC → Code Coverage
Flags: needinfo?(mcastelluccio)
Product: Core → Testing
Summary: Intermittent PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread + 0x16a] → Intermittent ccov PROCESS-CRASH | Last test finished | application crashed [@ PR_NativeRunThread + 0x16a]

From the crash reason (MOZ_CRASH(Shutdown hanging after all known phases and workers finished.)), I'd say this was another instance where ccov instrumentation is slowing down shutdown too much, and with the move to Azure this got much worse.

It looks like this stopped on October 8th though, so maybe bug 1733611 was enough to fix this (or at least bring it back to be rarely happening).

Flags: needinfo?(mcastelluccio)

No failures on trunk since the 9th of October.

Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]

The severity field for this bug is set to S4. However, the following bug duplicate has higher severity:

:marco, could you consider increasing the severity of this bug to S3?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mcastelluccio)

Very rare nowadays, let's keep it as S4.

Flags: needinfo?(mcastelluccio)
Crash Signature: [@ PR_NativeRunThread + 0x16a] [@ _PR_NativeRunThread(void*)] → [@ PR_NativeRunThread + 0x16a] [@ _PR_NativeRunThread]
You need to log in before you can comment on or make changes to this bug.