Open Bug 1806050 Opened 2 years ago Updated 2 years ago

CreateRemoteThread (used to crash hanging child processes) doesn't work on Windows ccov builds

Categories

(Core :: IPC, defect)

Unspecified
Windows
defect

Tracking

()

People

(Reporter: jld, Unassigned)

References

Details

Attachments

(1 file)

Discovered in bug 1805761: on Windows ccov builds (and apparently only ccov builds?), the use of CreateRemoteThread by ProcessWatcher to crash a possibly-hanging child process — including if it's intentionally blocked in Sleep rather than in coverage runtime shutdown — fails with ERROR_ACCESS_DENIED (error 5).

Thus, we do not get a crash report as promised by the log message. We also don't kill the process, and proceed to do an infinite wait for it to exit on its own (like the status quo before bug 1793525), which isn't ideal.

It would be nice to understand why that happens, but also we could just not attempt to get a crash report on that build type, or fall back to TerminateProcess when it fails.

Attached patch catchntstatus.patch (deleted) — Splinter Review

The Win32 error ERROR_ACCESS_DENIED is not very explicit, but in the case of CreateRemoteThread, the last error is likely set by converting from a more explicit NTSTATUS value. It is possible to get the NTSTATUS value by catching the call to ntdll!RtlNtStatusToDosError, and retrieving the first argument. Normally I would recommend hooking the function to do this; but I'm not sure of the interaction between ccov builds and our hooking code. So instead, you can use the (dirty) attached patched. It sets, handles, and unsets a one-shot breakpoint on RtlNtStatusToDosError. Then you can find the meaning for the reported NTSTATUS code here.

Assignee: nobody → yjuglaret
Assignee: yjuglaret → nobody

One reason for getting ERROR_ACCESS_DENIED would be that the original NTSTATUS reported within CreateRemoteThread is 0xC000010A STATUS_PROCESS_IS_TERMINATING. I've actually caught this NTSTATUS locally if playing with kShutdownWaitMs. If we manage to reproduce the intermittent with the patch and confirm that this is the NTSTATUS that gets caught, then I believe the correct fix would be to accept ERROR_ACCESS_DENIED as a benign error -- we cannot create the remote thread because the process is already terminating, but since the process is already terminating it's OK that we can't create the remote thread to crash it, we are in the happy scenario where the process isn't hanging. I have a try push here where we can try to relaunch jobs until we get the intermittent failure.

That seems to be what's happening indeed: reproduced in Wd2 and in Wd4.

[Parent 4588, IPC I/O Parent] WARNING: Process 4496 may be hanging at shutdown; will wait for up to 8000ms: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:119
[Parent 4588, IPC I/O Parent] WARNING: Process 4496 hanging at shutdown; attempting crash report (fatal error): file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:154
[Parent 4588, IPC I/O Parent] WARNING: Caught NTSTATUS 0xc000010a
: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:214
[Parent 4588, IPC I/O Parent] WARNING: CreateRemoteThread: error 5: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:235
[Parent 4588, IPC I/O Parent] WARNING: Process 1612 hanging at shutdown; attempting crash report (fatal error): file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:154
[Parent 4588, IPC I/O Parent] WARNING: Caught NTSTATUS 0xc000010a
: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:214
[Parent 4588, IPC I/O Parent] WARNING: CreateRemoteThread: error 5: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:235
[Parent 4588, IPC I/O Parent] WARNING: Process 8344 hanging at shutdown; attempting crash report (fatal error): file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:154
[Parent 4588, IPC I/O Parent] WARNING: Caught NTSTATUS 0xc000010a
: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:214
[Parent 4588, IPC I/O Parent] WARNING: CreateRemoteThread: error 5: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:235
[Parent 5032, IPC I/O Parent] WARNING: Process 6456 may be hanging at shutdown; will wait for up to 8000ms: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:119
[Parent 5032, IPC I/O Parent] WARNING: Process 6456 hanging at shutdown; attempting crash report (fatal error): file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:154
[Parent 5032, IPC I/O Parent] WARNING: Caught NTSTATUS 0xc000010a
: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:214
[Parent 5032, IPC I/O Parent] WARNING: CreateRemoteThread: error 5: file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:235
[Parent 5032, IPC I/O Parent] WARNING: Process 7408 hanging at shutdown; attempting crash report (fatal error): file Z:/task_167120411251431/build/src/ipc/chromium/src/chrome/common/process_watcher_win.cc:154
[Parent 5032, IPC I/O Parent] WARNING: Caught NTSTATUS 0xc000010a

I hope this clarifies what's happening and helps you decide how it should be fixed. I personally think that failing with ERROR_ACCESS_DENIED is a good case where the fatal error shouldn't be reported -- since the process is indeed terminating so not hanging. Maybe the fatal error should be reported after the call to CreateRemoteThread -- and only if the call did not fail with ERROR_ACCESS_DENIED?

I personally think that failing with ERROR_ACCESS_DENIED is a good case where the fatal error shouldn't be reported -- since the process is indeed terminating so not hanging.

That sounds like it could lead to false positives. If the ERROR_ACCESS_DENIED instead arises from the source process actually not having privileges to create a remote thread in the target process (which is probably not true now, but is a plausible unintended side effect of future changes to our test setup or sandboxing), that will silently make the test useless.

Also, the fact that it's internally reporting STATUS_PROCESS_IS_TERMINATING may not actually mean we're not hanging! I suspect — based, admittedly, on nothing but questionable intuition — that this error code may be produced after the target process has exited the main program context and is currently within a DLL_PROCESS_DETACH handler (which may have arbitrary misbehavior).

We could avoid the first issue by falling back to NtTerminateProcess and checking the return code (STATUS_PROCESS_IS_TERMINATING good; anything else bad). I'm not sure what we can do about the second, though, if it's real.

Thanks Ray for these interesting remarks. After testing with the code listed below -- here are some results:

  • your intuition is correct that calling CreateRemoteThread on a process which is stuck in a DLL_PROCESS_DETACH handler fails with STATUS_PROCESS_IS_TERMINATING (converted to ERROR_ACCESS_DENIED);
  • however calling NtTerminateProcess on such a process terminates it (as in, will stop execution of the DLL_PROCESS_DETACH handlers and make the process return the provided exit code) and returns 0 (STATUS_SUCCESS), not STATUS_PROCESS_IS_TERMINATING.
// IsTerminating.exe
#include <windows.h>
#include <stdio.h>

constexpr bool kReproduceError = true;

int main(int argc, char* argv[])
{
    if (argc <= 1) {
        printf("Hello from parent\n");
        STARTUPINFOW startupInfo{};
        startupInfo.cb = sizeof(startupInfo);
        PROCESS_INFORMATION processInfo{};
        wchar_t commandLine[] = L"IsTerminating.exe -child";
        if (!CreateProcessW(nullptr, commandLine, nullptr, nullptr, TRUE, 0, nullptr, nullptr, &startupInfo, &processInfo)) {
            printf("CreateProcessW - %lu\n", GetLastError());
            return 0;
        }
        Sleep(1000);
        DWORD threadId = 0;
        HANDLE thread = CreateRemoteThread(processInfo.hProcess, nullptr, 0, nullptr, nullptr, 0, &threadId);
        printf("HANDLE=%p threadId=%lu LastError=%d\n", thread, threadId, GetLastError());
        auto NtTerminateProcess = reinterpret_cast<DWORD(*)(HANDLE, DWORD)>(GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtTerminateProcess"));
        printf("NtTerminateProcess=0x%lx\n", NtTerminateProcess(processInfo.hProcess, 0));
    }
    else {
        printf("Hello from child\n");
        if constexpr (!kReproduceError) {
            Sleep(2000);
        }
        LoadLibraryW(L"DllIsTerminating.dll");
    }
    return 0;
}
// DllIsTerminating.dll
#include <windows.h>
#include <stdio.h>

BOOL APIENTRY DllMain( HMODULE hModule, DWORD  ul_reason_for_call, LPVOID lpReserved)
{
    switch (ul_reason_for_call)
    {
    case DLL_PROCESS_ATTACH:
    case DLL_THREAD_ATTACH:
    case DLL_THREAD_DETACH:
        break;
    case DLL_PROCESS_DETACH:
        printf("I WILL NEVER TERMINATE\n");
        while (1) {
        }
        break;
    }
    return TRUE;
}

The output is:

Hello from parent
Hello from child
I WILL NEVER TERMINATE
HANDLE=0000000000000000 threadId=0 LastError=5
NtTerminateProcess=0x0

Windows distinguishes between an exiting process ("clean exit") and a terminating process ("forced exit"). Trying to create a thread (remotely or not) in an exiting process fails with STATUS_PROCESS_IS_TERMINATING (converted to ERROR_ACCESS_DENIED) even though strictly speaking the process is currently exiting, not terminating. However calling NtTerminateProcess on an exiting process succeeds and makes the process a terminating one, so for example calling NtTerminateProcess again right after the first call fails with STATUS_PROCESS_IS_TERMINATING. Since creating new threads is not an option for an exiting process, all we can hope to do with these processes is to either terminate them or interact with the only remaining thread in the process.

So I think our options are as follows:

  • for processes that may be hanging on shutdown code that runs before ExitProcess is called, we can call CreateRemoteThread like with the current patch;
  • that may fail because the process has actually already reached ExitProcess and is thus exiting, in which case:
    • if it is safe to assume that all DLLs' DLL_PROCESS_DETACH handlers cannot block, we can simply let these processes end their lives peacefully;
    • if it is not safe to assume this, then:
      • if we do not care to know where exiting processes get stuck, we can call TerminateProcess to just force the process to terminate;
      • if we do care, we could try to generate a dump based on the context of the remaining thread in the process, so the call stack will reflect the location where the thread is currently stuck.

I believe we are heading towards the last option, so here is how I think it can be achieved: enumerate threads to find the only remaining thread in the process; suspend it; get its context; then generate a dump from the captured context. In fact we seem to be doing something very similar here and we should try using this function and see if it works with exiting processes.

If it works we may actually want to consider using this function instead of the CreateRemoteThread solution for (some?) non-exiting processes as well; the main difference between the two approaches seems to be that with WriteMinidumpForChild we would have to decide of an arbitrary thread to blame, which can be difficult if there are many threads, and then the same hang might show up with different stacks; whereas with CreateRemoteThread the blamed thread is the one we create so the call stacks would always be the same, including for different hangs, which is quite bad for triage.

(Edited: It looks much simpler to generate the dump from the captured context directly, if that works, rather than trying to crash the exiting process, which I was considering as a possibility at first.)

Severity: -- → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: