Intermittent zombiecheck | child process XXXX still alive after shutdown
Categories
(Core :: General, defect, P3)
Tracking
()
People
(Reporter: KWierso, Assigned: froydnj)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])
Attachments
(2 files)
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
gbrown
:
review+
|
Details | Diff | Splinter Review |
Updated•9 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 3•9 years ago
|
||
Comment 4•9 years ago
|
||
Comment 5•9 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 20•9 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 37•8 years ago
|
||
Comment 38•8 years ago
|
||
Comment 39•8 years ago
|
||
Comment 40•8 years ago
|
||
Comment 41•8 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 104•6 years ago
|
||
There are 30 failures associated to this bug in the last 7 days. These are occurring mainly on windows10-64 all builds.
:selena can you take a look at this?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 107•6 years ago
|
||
Nathan - can you triage this?
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 109•6 years ago
|
||
OK, we're winding up here:
On asan builds, we don't have the crashreporter, so we just give up and don't even try to get a minidump. That seems...reasonable? I think such problems are separate, so perhaps we'll tackle those in a separate bug.
On builds where we do have the crashreporter, we try to terminate the process and get a minidump. In this case, we get logging like the following:
09:58:28 ERROR - TEST-UNEXPECTED-FAIL | zombiecheck | child process 6340 still alive after shutdown
09:58:28 INFO - Killing process: 6340
09:58:28 INFO - TEST-INFO | started process screenshot
09:58:28 INFO - TEST-INFO | screenshot: exit 0
09:58:28 WARNING - mozcrash kill_pid(): wait failed (-1) terminating pid 6340: error 5
09:58:28 INFO - zombiecheck | Checking for orphan process with PID: 6748
09:58:28 INFO - zombiecheck | Checking for orphan process with PID: 2100
09:58:28 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/OPr-Y2JsR6Cl6A4b_qT34Q/artifacts/public/build/target.crashreporter-symbols.zip
09:58:31 INFO - mozcrash Copy/paste: Z:\task_1547630818\build\win32-minidump_stackwalk.exe c:\users\task_1547630818\appdata\local\temp\tmp305rzp.mozrunner\minidumps\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp c:\users\task_1547630818\appdata\local\temp\tmpfz5aiv
09:58:34 INFO - mozcrash Saved minidump as Z:\task_1547630818\build\blobber_upload_dir\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp
09:58:34 INFO - PROCESS-CRASH | Main app process exited normally | application crashed [None]
Notice the mozcrash_kill_pid()
warning: we couldn't wait for the process after termination:
The error is ERROR_ACCESS_DENIED
. dmajor helped me understand that the WaitForSingleObject call in the above actually needs to request SYNCHRONIZE privileges on the process handle:
https://docs.microsoft.com/en-us/windows/desktop/ProcThread/process-security-and-access-rights
presumably if we did that, at least that error would go away.
The other peculiar thing is that the stacks that we do get are all complete garbage. To wit, one example:
09:58:34 INFO - Crash dump filename: c:\users\task_1547630818\appdata\local\temp\tmp305rzp.mozrunner\minidumps\7cb47b5a-dfbf-48b1-b598-ebaa66adbaa1.dmp
09:58:34 INFO - Operating system: Windows NT
09:58:34 INFO - 10.0.15063
09:58:34 INFO - CPU: amd64
09:58:34 INFO - family 6 model 85 stepping 4
09:58:34 INFO - 8 CPUs
09:58:34 INFO -
09:58:34 INFO - GPU: UNKNOWN
09:58:34 INFO -
09:58:34 INFO - No crash
09:58:34 INFO - Process uptime: 64 seconds
09:58:34 INFO -
09:58:34 INFO - Thread 0
09:58:34 INFO - 0 ntdll.dll!NtWaitForSingleObject + 0x14
09:58:34 INFO - rax = 0x000001f2a9c68080 rdx = 0x0000000000000060
09:58:34 INFO - rcx = 0x000001f2a9c68080 rbx = 0x0000000000000000
09:58:34 INFO - rsi = 0x00000000ffffffff rdi = 0x000000000000011c
09:58:34 INFO - rbp = 0x0000008e8b78fa10 rsp = 0x0000008e8b78f868
09:58:34 INFO - r8 = 0xf1190c019cc37d11 r9 = 0x000000000000002a
09:58:34 INFO - r10 = 0x000001f2a9c602e0 r11 = 0x0000000000000000
09:58:34 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000000
09:58:34 INFO - r14 = 0x0000000000000000 r15 = 0x00007ff76491bde8
09:58:34 INFO - rip = 0x00007ff9e4315424
09:58:34 INFO - Found by: given as instruction pointer in context
09:58:34 INFO - 1 KERNELBASE.dll!RtlpHpLargeAlloc + 0x1bb
09:58:34 INFO - rbx = 0x0000000000000000 rbp = 0x0000008e8b78fa10
09:58:34 INFO - rsp = 0x0000008e8b78f870 r12 = 0x0000000000000000
09:58:34 INFO - r13 = 0x0000000000000000 r14 = 0x0000000000000000
09:58:34 INFO - r15 = 0x00007ff76491bde8 rip = 0x00007ff9e136988f
09:58:34 INFO - Found by: call frame info
and the other threads are not much better, being primarily internal ntdll.dll symbols.
Talking with bobowen on #developers suggested one of two possibilities:
- We are hitting a race where we really did think we launched a new child process, but we terminated Firefox before the child process is actually anywhere close to being initialized, and so zombiecheck discovers this process in process, as it were.
- We really did launch the child processes we thought we did, but they were terminated in some way, and by the end of the test run, Windows managed to re-use the PID for one (or more) of our child processes...and we think the child process is still alive, but it's actually some Windows-internal thing. Ye Olde Stack Overflowe answers suggest that PID recycling is definitely A Thing in Windows (https://stackoverflow.com/a/31257979), so this suggestion is not completely wild.
Number 2 is easier to deal with than number 1, so I have some patches up that attempt to check that child processes we think are alive are actually Firefox processes. We'll do a bunch of retriggers on those and see if these not-super-frequent failures show up. Leaving needinfo for a reminder.
Assignee | ||
Comment 110•6 years ago
|
||
We're at a thousand-ish retriggers across a couple different testsuites on win64 opt, and no incidences of the orange so far. I don't think that's quite to a 95% confidence level, but...let's run with it:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b9fd94d578e6242e72e04b0597751a3994c6287a
Patches incoming.
Assignee | ||
Comment 111•6 years ago
|
||
Assignee | ||
Comment 112•6 years ago
|
||
Comment 113•6 years ago
|
||
Comment 114•6 years ago
|
||
Comment 115•6 years ago
|
||
I was going to suggest that you could get the process' parent pid and compare against the Firefox pid but on Windows that's extremely complicated to do so your solution seems fine.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 117•6 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #114)
::: testing/mochitest/runtests.py
@@ +385,5 @@
if namelen == 0:
# Still an active process, so conservatively assume it's Firefox.
return True
return pName.value.endswith(('firefox.exe', 'plugin-container.exe'))
I'm sure you know more definitively that I do: Are those the only
possibilities, for Windows firefox processes?
That's a good question! I don't know definitively, but looking at a Firefox install on my local Windows machine, the only other .exe files that I see are all utility related, so I think we're in the clear here.
Comment 118•6 years ago
|
||
Comment 119•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/072ba80a8f3e
https://hg.mozilla.org/mozilla-central/rev/9fb802aba6b9
Updated•6 years ago
|
Comment 120•6 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Updated•5 years ago
|
Description
•