Closed Bug 1397301 Opened 7 years ago Closed 3 years ago

Crash in sandbox::SharedMemIPCClient::DoCall (Avast)

Categories

(Core :: Security: Process Sandboxing, defect, P3)

56 Branch
x86_64
Windows
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- affected
firefox59 --- affected

People

(Reporter: philipp, Unassigned)

References

Details

(Keywords: crash, Whiteboard: sb+)

Crash Data

This bug was filed from the Socorro interface and is 
report bp-99a8ec3d-932f-4645-9e19-ee3060170906.
=============================================================
Crashing Thread (34)
Frame 	Module 	Signature 	Source
0 	firefox.exe 	sandbox::SharedMemIPCClient::DoCall(sandbox::CrossCallParams*, sandbox::CrossCallReturn*) 	security/sandbox/chromium/sandbox/win/src/sharedmem_ipc_client.cc:105
1 	firefox.exe 	sandbox::CrossCall<sandbox::SharedMemIPCClient, wchar_t*, unsigned int, sandbox::InOutCountedBuffer>(sandbox::SharedMemIPCClient&, unsigned int, wchar_t* const&, unsigned int const&, sandbox::InOutCountedBuffer const&, sandbox::CrossCallReturn*) 	security/sandbox/chromium/sandbox/win/src/crosscall_client.h:366
2 	firefox.exe 	TargetNtQueryAttributesFile 	security/sandbox/chromium/sandbox/win/src/filesystem_interception.cc:233
3 	ntdll.dll 	LdrpResolveFileName 	
4 	ntdll.dll 	LdrpSearchPath 	
5 	ntdll.dll 	RtlSubAuthorityCountSid 	
6 	ntdll.dll 	LdrpFindOrMapDll 	
7 	ntdll.dll 	LdrpFindOrMapDll 	
8 	mozglue.dll 	patched_LdrLoadDll 	mozglue/build/WindowsDllBlocklist.cpp:779
9 	kernelbase.dll 	NlsConvertStringToIntegerW 	
10 	kernelbase.dll 	LoadLibraryExA 	
11 	kernel32.dll 	LoadLibraryA 	
12 		@0x8f0012 	
13 	kernel32.dll 	BaseThreadInitThunk 	
14 	ntdll.dll 	RtlUserThreadStart

these crashes seem to be more frequent on 64bit versions of firefox, so they are now on the rise after we started automatically migrating users to win64 on 56.0b9.
on nightly the signature got more frequent after 55.0a1 build 20170307030205, maybe something is regressing since then and tickling this crash more often.
The ones I've looked at are complaining that answer at [1] is null, in what I think are SSE instructions being used for the memcpy.

The thing is in the dump it isn't.
At least for a 64-bit process.
So, I wonder if there is some confusion between 32-bit and 64-bit somehow during the migration.

[1] https://hg.mozilla.org/releases/mozilla-beta/file/97bee4069ef3/security/sandbox/chromium/sandbox/win/src/sharedmem_ipc_client.cc#l105
Looking at this report: https://crash-stats.mozilla.com/report/index/6e9aaa8b-344d-496d-9273-e84dd0170910#tab-details

The failure code is EXCEPTION_ACCESS_VIOLATION_READ on address 0, but the instruction is a write:

  firefox!sandbox::SharedMemIPCClient::DoCall+0xc2
  00007ff7`55349352 0f2903          movaps  xmmword ptr [rbx],xmm0

Note that rbx (aka `answer`) ends in hex digit "8" so it's not 16-byte aligned. I suspect the reason code is bogus and this is actually an alignment error in the movaps. Spot-checking several reports from that build they all have rbx ending in 8.

Either the compiler is wrong to assume that `answer` is 16-byte aligned, or the caller is wrong in not living up to this requirement. My guess is the former, but I defer to Bob in case he knows something about this code that I don't.
Flags: needinfo?(bobowencode)
(In reply to David Major [:dmajor] from comment #3)
> Looking at this report:
> https://crash-stats.mozilla.com/report/index/6e9aaa8b-344d-496d-9273-
> e84dd0170910#tab-details
> 
> The failure code is EXCEPTION_ACCESS_VIOLATION_READ on address 0, but the
> instruction is a write:
> 
>   firefox!sandbox::SharedMemIPCClient::DoCall+0xc2
>   00007ff7`55349352 0f2903          movaps  xmmword ptr [rbx],xmm0
> 
> Note that rbx (aka `answer`) ends in hex digit "8" so it's not 16-byte
> aligned. I suspect the reason code is bogus and this is actually an
> alignment error in the movaps. Spot-checking several reports from that build
> they all have rbx ending in 8.
> 
> Either the compiler is wrong to assume that `answer` is 16-byte aligned, or
> the caller is wrong in not living up to this requirement. My guess is the
> former, but I defer to Bob in case he knows something about this code that I
> don't.

Thanks for looking at this, I don't know of anything about this code that would change your opinion.

Do you think this could be confusion between 32-bit/64-bit, parent/child during migration, it seemed to spike in Beta 9?
I guess maybe not given that answer always seems to be a stack allocated local variable in the child, so maybe this spike is just down to running more 64-bit versions.

Looking at the SANDBOX_FAILED_LAUNCH telemetry, we do seem to be getting a lot more launch errors for 64-bit as well.
Flags: needinfo?(bobowencode) → needinfo?(dmajor)
(In reply to Bob Owen (:bobowen) from comment #4)
> Do you think this could be confusion between 32-bit/64-bit, parent/child
> during migration, it seemed to spike in Beta 9?
> I guess maybe not given that answer always seems to be a stack allocated
> local variable in the child, so maybe this spike is just down to running
> more 64-bit versions.

My first guess would have been that in 56b9 something tickled the compiler into using `movaps` instructions where it didn't before, but my disassembly of 56b8 says this is not true.

My second guess would be that until 56b8 the pointer math worked out such that rbx _was_ always aligned in practice. Do you know how to exercise this code? Could you try breaking on the movaps in 56b8 and see what your rbx looks like and how it was calculated?
Flags: needinfo?(dmajor)
> My second guess would be that until 56b8 the pointer math worked out such
> that rbx _was_ always aligned in practice.

Hm, actually, this isn't true either. The crash has been seen on rare occasion going back to version 55 nightlies.

Maybe it's right that it's more correlated with the win64 rollout than the actual contents of the builds.
I don't have access to memory dumps but it'd be interesting to know what the stack frames/addresses look like. IIRC the stack must be aligned to 16 bytes at function calls on Win64. Maybe someone lower in the stack messed this up and SharedMemIPCClient is merely the first thing that falls over when we're unaligned.
So, it appears to be at that weird line in the stacks (for example from the crash referenced in comment 3):

12 		@0x6e0012

That's where the addresses start ending with 8.
Flags: needinfo?(dmajor)
(In reply to Bob Owen (:bobowen) from comment #8)
> So, it appears to be at that weird line in the stacks (for example from the
> crash referenced in comment 3):
> 
> 12 		@0x6e0012
> 
> That's where the addresses start ending with 8.

If the callee of BaseThreadInitThunk is some random hex address then this sounds like someone is injecting code/threads into our process. :-(

Any chance the stack still contains the name of the DLL they're trying to load? (Just to confirm that it's something shady.) And I'd be curious to see what `!address 0x6e0012` says -- we might be able to expand our patched_BaseThreadInitThunk code to 64-bit builds.
Flags: needinfo?(dmajor) → needinfo?(bobowencode)
I don't see anything in the stack (although I'm not all that familiar with poking around in it).

In the dump for that crash windbg  actually says 0x6e0013

0:031> !address 0x6e0013
                                     
Mapping file section regions...
Mapping module regions...
Mapping PEB regions...
Mapping TEB and stack regions...
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for mswsock.dll - 
Mapping heap regions...
Mapping page heap regions...
Mapping other regions...
Mapping stack trace database regions...
Mapping activation context regions...

Usage:                  <unknown>
Base Address:           00000000`006e0000
End Address:            00000000`006e1000
Region Size:            00000000`00001000 (   4.000 kB)
State:                  00001000          MEM_COMMIT
Protect:                00000040          PAGE_EXECUTE_READWRITE
Type:                   00020000          MEM_PRIVATE
Allocation Base:        00000000`006e0000
Allocation Protect:     00000040          PAGE_EXECUTE_READWRITE


Content source: 0 (invalid), length: 5b1f9b5
Flags: needinfo?(bobowencode)
Interestingly in the crash dump for this report [1], when I trace the stack it tries to load symbols for snxhk64.dll which apparently is an Avast DLL.

[1] https://crash-stats.mozilla.com/report/index/4b3b1ea4-9152-461d-8d37-33b690170912
(In reply to Bob Owen (:bobowen) from comment #10)
> Protect:                00000040          PAGE_EXECUTE_READWRITE

It sounds like this thread would be blocked if we expand the patched_BaseThreadInitThunk filtering to 64-bits (i.e. remove ifdef X86) because it fails the PAGE_EXECUTE_READ requirement at ShouldBlockThread: https://dxr.mozilla.org/mozilla-central/rev/f9a5e9ed62103c84e4cde915f4d08f1ce71be83e/mozglue/build/WindowsDllBlocklist.cpp#782,784,794
> It sounds like this thread would be blocked if we expand the
> patched_BaseThreadInitThunk filtering to 64-bits (i.e. remove ifdef X86)

Do you want to take on this patch? I can review.

I'm worried about uplifting this to 56 so near release though. The 32-bit version of this code didn't cause any major regressions, but touching these codepaths is always risky. We may have to be prepared to pull it if it causes more problems than it solves.
Depends on: 1372823
(In reply to David Major [:dmajor] from comment #13)
> > It sounds like this thread would be blocked if we expand the
> > patched_BaseThreadInitThunk filtering to 64-bits (i.e. remove ifdef X86)
> 
> Do you want to take on this patch? I can review.
> 
> I'm worried about uplifting this to 56 so near release though. The 32-bit
> version of this code didn't cause any major regressions, but touching these
> codepaths is always risky. We may have to be prepared to pull it if it
> causes more problems than it solves.

Thanks for your help dmajor,
Whiteboard: sb?
(posted too soon) ... 
Thanks for your help dmajor, sounds like this would be a useful change.

Not sure over uplifting, as this crash doesn't seem too bad and has dropped since Fx56b9, but maybe we should try and get it into 57 though.
I don't know if this is related, but in Lenovo DLL bug 1369361, we are seeing a problem where the 64-bit Lenovo DLLs are not blocked even though they have been on the DLL blocklist since Firefox 36 bug 1123778. Those crash reports have an annotation saying "BlocklistInitFailed: 1". That annotation points to a LdrLoadDll detour that dmajor added four years ago (Firefox 27 bug 951827).
(In reply to Chris Peterson [:cpeterson] from comment #16)
> I don't know if this is related, but in Lenovo DLL bug 1369361, we are
> seeing a problem where the 64-bit Lenovo DLLs are not blocked even though
> they have been on the DLL blocklist since Firefox 36 bug 1123778. 

I think the root cause of bug 1369361 is "something external is breaking our blocklist" which means it's unrelated to Lenovo software per se. Those crashes are merely fallout: we're unable to block Lenovo because we're unable to block anyone at all. And this bug is about thread injection so it's probably unrelated as well.

> Those
> crash reports have an annotation saying "BlocklistInitFailed: 1". That
> annotation points to a LdrLoadDll detour that dmajor added four years ago
> (Firefox 27 bug 951827).

FTR the LdrLoadDll hook existed long before that. In bug 951827 we merely tweaked it.
Priority: -- → P3
Whiteboard: sb? → sb+
(In reply to David Major [:dmajor] from comment #13)
> > It sounds like this thread would be blocked if we expand the
> > patched_BaseThreadInitThunk filtering to 64-bits (i.e. remove ifdef X86)
> 
> Do you want to take on this patch? I can review.
> 
> I'm worried about uplifting this to 56 so near release though. The 32-bit
> version of this code didn't cause any major regressions, but touching these
> codepaths is always risky. We may have to be prepared to pull it if it
> causes more problems than it solves.

Bob, do you plan to work on this BaseThreadInitThunk patch for 58?

There are about 500 crash reports from Beta 57. We started to migrate 1% of 32-bit Firefox 56.0 users to 64-bit 56.0.1 yesterday, so we will probably see more of these crashes on the Release channel.
Flags: needinfo?(bobowencode)
(In reply to Chris Peterson [:cpeterson] from comment #18)
...
> Bob, do you plan to work on this BaseThreadInitThunk patch for 58?
> 
> There are about 500 crash reports from Beta 57. We started to migrate 1% of
> 32-bit Firefox 56.0 users to 64-bit 56.0.1 yesterday, so we will probably
> see more of these crashes on the Release channel.

Yes I'm already working on bug 1372823.
I tried to land it once, but it causes intermittent oranges to become permanent, probably because of underlying races.
The first of those was fairly easy to fix because it was just an incorrect assertion the next one is a little harder to track down.
Flags: needinfo?(bobowencode)
¡Hola Bob!

Ended up here from https://support.mozilla.org/questions/1179750 where the user is crashing like https://crash-stats.mozilla.com/report/index/8406e2a8-f9b0-41f7-9986-9e9ce1171016 in Release Version 56.0.1 Build ID 20171002220106

What shall I suggest there?

Reinstall 32-bit Firefox?

¡Gracias!
Alex
Flags: needinfo?(bobowencode)
(In reply to alex_mayorga from comment #20)
...
> What shall I suggest there?
> 
> Reinstall 32-bit Firefox?

Either that or find a different anti-virus, it looks like it is Avast injecting a thread to load their DLL that is causing the crash. They are messing up the alignment in the call stack.
It would be good if they could report the issue to Avast.

I think that the fix landed in bug 1372823 would stop it, but we're not going to uplift this to 57 at this stage.
Flags: needinfo?(bobowencode)
There are still some crash reports after BaseThreadInitThunk bug 1372823 landed in Nightly 58, but the volume is very low. There have only been 97 reports from Beta 58 and 6 from Nightly 59.
Summary: Crash in sandbox::SharedMemIPCClient::DoCall → Crash in sandbox::SharedMemIPCClient::DoCall (Avast)

Hello! I am closing this issue with RESOLVED->WORKSFORME since there where no changes made and there is no recent crash signatures in the past 6 months.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.