Closed Bug 331545 Opened 19 years ago Closed 17 years ago

Use _InterlockedCompareExchange for js_CompareAndSwap

Categories

(Core :: JavaScript Engine, enhancement, P1)

x86
Windows XP
enhancement

Tracking

()

RESOLVED FIXED
mozilla1.9beta4

People

(Reporter: mmoy, Assigned: mmoy)

References

Details

(Keywords: perf)

Attachments

(1 file, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060318 Firefox/1.5.0.1 (mmoy CE 1.5.0.1 K8B-X69)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.1) Gecko/20060318 Firefox/1.5.0.1 (mmoy CE 1.5.0.1 K8B-X69)

Mozilla currently uses inline assembly for jsCompareAndSwap. The advantage
of inline assembler is that you have access to the entire instruction set.
The disadvantage is that you generally have to load variables from the stack.
And that the compiler can't optimize your code as it is directed to leave it
alone.

In VS2005, the _InterlockedCompareExchange intrinsic was added which means
that you can get the lock cmpxchg [xxx], yyy inline while allowing the compiler
to optmize the arguments.

static JS_INLINE int
js_CompareAndSwap(jsword *w, jsword ov, jsword nv)
{
    __asm {
        mov eax, ov
        mov ecx, nv
        mov ebx, w
        lock cmpxchg [ebx], ecx
        sete al
        and eax, 1h
    }
}

Regarding the current implementation above, the proposed change to this
would allow the compiler to load the three arguments in an optimized
way and this can mean using data that's already in registers instead of
loading the data onto the stack and then having the data moved from the
stack back to the stated registers.

There are a few places where the compare is thrown away and the proposed
change then doesn't need the and statement. The reason for the use of a
helper function also allows saving the and statement in the current
code.

Here's an example of generated code in one place:

; 724  :         if (!js_CompareAndSwap(&tl->owner, me, 0))

	mov	eax, DWORD PTR _me$[ebp]
	xor	edx, edx
	mov	edi, ecx
	lock	 cmpxchg DWORD PTR [edi], edx
	sete	al
	test	al, 1
	jne	SHORT $LN2@js_SetSlot

There are some pretty good code generation improvements throughout jslock.c
compared to the current code.

On Windows 64: at the moment, I think that Windows 64 just punts on the
interlocked exchange. My Win64 Thunderbird sometimes has spinning hourglass
problems when started up and this may be related to the lack of locking here
(my assumption). This solution can not be used for Windows 64 because Microsoft
desupported Inline Assembler for their x64 compilers. I could not find another
intrinsics solution on Windows 64 from the documentation. However, I accidently
stumbled on a C fragment that generates correct code. In fact it generates only
one instruction for the whole sequence and optimizes the condition code so that
the sete/and 1 aren't needed. I would like to figure out how this works before
presenting it though. And I'd like to see if it fixes the Thunderbird locking
problem. But I would need to figure out how to build Thunderbird with VS2005
first. To my knowledge, noone in the unofficial building community has done
this. I don't know if the Mozilla team has done this either.

So please have a look at this patch and let me know what I need to do to get
it in.

There are other places in NSPR that could use some of the new intrinsics as
well (I'm already using them in my builds) and I'll look at filing bugzilla
entries on those at a future point.

Reproducible: Always
Assignee: nobody → general
Severity: normal → enhancement
Status: UNCONFIRMED → NEW
Component: General → JavaScript Engine
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
I also have an improvement for older compilers that don't support the Intrinsic and this could be used in the branch. This solution only saves one instruction
but js_CompareAndSwap is used quite a bit.

I think that it would be advantageous for Mozilla to move to VS2005 as quickly
as possible on the Branch and the 1.5 line as that would allow the use of
many improvements in the new compiler.
GCC sinnce 4.1 has started to support this as built-in, __sync_bool_compare_and_swap, see http://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html#Atomic-Builtins

That built-in follows "Intel Itanium Processor-specific Application Binary Interface, section 7.4". Perhaps this __sync_bool_compare_and_swap is supported by other compilers as well?
That "#if defined(_WIN32) && defined(_M_IX86)" should be expanded to include a test for the MSVC compiler, as _InterlockedCompareExchange isn't available on other Windows compilers.
VS2005 is required for the Trunk and that's the only line that I had planned
for this patch. I have an improvement that works for VS2003 as well but I'd
like to get this one through first.

BTW, anyone know which flags I should set on this bug?

(In reply to comment #4)
> That "#if defined(_WIN32) && defined(_M_IX86)" should be expanded to include a
> test for the MSVC compiler, as _InterlockedCompareExchange isn't available on
> other Windows compilers.
Oh, OK.  I didn't know that VS2005 is now required for Windows builds of the trunk.  

So GCC is being abandoned completely on the WIN32 platform?  Surprising, for an open-source project.
While VC8 is now the only "supported" compiler on win32 for the trunk, there's no reason to explicitly break other compilers. Some people still use mingw/gcc, for example. I may be misunderstanding, but if all that's required to make this work with other compilers is an extra condition in an ifdef, why not do it?
(In reply to comment #7)
> While VC8 is now the only "supported" compiler on win32 for the trunk, there's
> no reason to explicitly break other compilers. Some people still use mingw/gcc,
> for example. I may be misunderstanding, but if all that's required to make this
> work with other compilers is an extra condition in an ifdef, why not do it?

I have another improvement in this code for VC6 and VS2003 which I was planning
on doing after this code. I also have Win64 code that's sitting in one of my
patch directories. I didn't want to do all of of them at the same time though
to try to keep the environments and testing simple.

The code here shouldn't have any effect on GCC/MingW. 

I can add in a version check (anyone know the base for the VS2005 series) 
but it will take me about a week or two to do the setups, builds and testing 
for  VS2003 and VS2005 as I'm pretty busy with other stuff right now.
Oops. I didn't understand the original question. I thought that I had a test
in there for MSVC. I had another look and I think I missed that.
GCC is supported as far as I know. As a practical matter, though, noone builds
with it on Windows as you lose Java support. And maybe some other plugins.
The GCC build tends to break a lot as there are few, if any, regular builders
on it. So when someone comes along to try it out (for something like 3DNow
compiler optimizations) they run into problems.

But it was not my intention was to break GCC.

(In reply to comment #6)
> Oh, OK.  I didn't know that VS2005 is now required for Windows builds of the
> trunk.  
> 
> So GCC is being abandoned completely on the WIN32 platform?  Surprising, for an
> open-source project.
> 

FYI, your patch works on VS2003 (VC v7.1) also.

When applied, js_CompareAndSwapHelper() is correctly inlined in js_CompareAndSwap(), which in turn is inlined into the calling code.  The more rational register allocation you're going for is seen with this compiler, at least when building with -O2 optimization.

Here's a snippet from js_Enqueue()

; Line 972
        mov     eax, DWORD PTR [ebx]
; Line 974
        test    eax, eax
        je      SHORT $L7470
        mov     ecx, eax
        or      ecx, 1
        lock     cmpxchg DWORD PTR [ebx], ecx
        sete    al
        test    al, 1
        je      SHORT $L7470

The end result is that the cmpxchg/sete instructions are inserted into the code with no stack handling involved.
Pls see bug #334826 for more use of Win32 intrinsic functions.
Flags: blocking1.9?
Attachment #216081 - Flags: review?(brendan)
FWIW, this looks like a good idea.

This depends on a resolution to Bug 353962.
Depends on: 353962
I'd like to get the resolved one way or the other before ship
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Comment on attachment 216081 [details] [diff] [review]
jslock.c patch to use _InterlockedCompareExchange in js_CompareAndSwap

mmoy: is 64-bit thunderbird still hour-glassing for you with JS thinlocks re-enabled? If so it would be good to get stacks and see one stuck on the atom table thinlock.

This is good stuff, let's get it in.

/be
Attachment #216081 - Flags: review?(brendan)
Attachment #216081 - Flags: review+
Attachment #216081 - Flags: approval1.9+
(In reply to comment #16)
> (From update of attachment 216081 [details] [diff] [review])
> mmoy: is 64-bit thunderbird still hour-glassing for you with JS thinlocks
> re-enabled? If so it would be good to get stacks and see one stuck on the atom
> table thinlock.
> 
> This is good stuff, let's get it in.
> 
> /be
> 

The 64-bit build has been fine for quite some time, at least a year I think. I do all of my x64 stuff on AMD systems though. I haven't heard of any users reporting hourglass problems on the 32 or 64 bit Thunderbird builds in a long time so I'm assuming the problem has been fixed.
Assignee: general → mmoy
Keywords: checkin-needed
Keywords: checkin-needed
Attached patch unbitrotten patch (deleted) — Splinter Review
For check-in.

Checking in js/src/jslock.c;
/cvsroot/mozilla/js/src/jslock.c,v  <--  jslock.c
new revision: 3.68; previous revision: 3.67
done
Attachment #216081 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 17 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M11
Comment on attachment 298058 [details] [diff] [review]
unbitrotten patch

>Index: js/src/jslock.c
>===================================================================
> static JS_INLINE int
>-js_CompareAndSwap(jsword *w, jsword ov, jsword nv)
>+js_CompareAndSwapHelper(jsword *w, jsword ov, jsword nv)
> {

>+static JS_INLINE int
>+js_CompareAndSwapHelper(jsword *w, jsword ov, jsword nv)
>+{

how does this compile - there are two |js_CompareAndSwapHelper|s now and no |js_CompareAndSwap|?

also, what about comment 4 - need an MSVC ifdef, no? or do we not care about breaking mingw?
(In reply to comment #19)
> how does this compile - there are two |js_CompareAndSwapHelper|s now and no
> |js_CompareAndSwap|?

I'm dumb. Fixed and committed. :(
the fact that this didn't break on any of the msvc x86 tinderboxen means the new code isn't getting compiled anyway - which means NSPR_LOCK is getting defined here, and that seems like bad news:

http://mxr.mozilla.org/mozilla/source/js/src/jslock.h#192

sounds like _M_IX86 isn't getting defined for the processor du jour (64bit). http://msdn2.microsoft.com/en-us/library/b0084kay(VS.80).aspx only lists values for 386 thru PIII, so maybe we need to or an _M_X64 check in there too? (what about Pentium 4? argh microsoft!)

reopening so this gets back on radar and we get an answer one way or another.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is why there is a dependency on 353962.

For Windows x64, I have another patch which puts in support in jslock.c and jslock.h. But none of this will work until thin locks are put back in.
Keywords: perf
(In reply to comment #22)
> This is why there is a dependency on 353962.
> 
> For Windows x64, I have another patch which puts in support in jslock.c and
> jslock.h. But none of this will work until thin locks are put back in.
> 

What do you need to get that done?  If we can use the OS routines on MacOS (the only plat we saw the problems reported extensively) and turn back on thin locks for nightlies/b3 we can get wider help testing that there are no further issues.
If it is a chip problem, then there's nothing that we can do about it short of shipping two images: one for chips that have the problem and one for chips that don't have the problem.

We don't know exactly what the Mac OSX routines do and Brendan would prefer the use of Mozilla's thinlocks as they've worked for some time.

This is from the Mac OSX documentation:

     These functions are thread and multiprocessor safe.  For each function,
     there is a version that does and another that does not incorporate a
     memory barrier.  Barriers strictly order memory access on a weakly-ordered
     ordered architecture such as PPC.  All loads and stores executed in
     sequential program order before the barrier will complete before any load
     or store executed after the barrier.  On a uniprocessor, the barrier
     operation is typically a nop.  On a multiprocessor, the barrier can be
     quite expensive.

All of the recent Macs are dual-core except for the Mac Mini I think which came in a Core Solo variant. I don't know if the Mac OSX locking routines are more expensive or less expensive than Mozilla's thick locks. It seems to me that those seeing the problems are running more in the way of extensions and I think that we should collect more information on their environments if we can't reproduce their problems to see if there is something in common. My user base is very small so perhaps none of them use the particular combination of extensions and hardware where the problem is seen.

======================================

The errata item that I suspect is here:

AH 39. Cache Data Access Request from One Core Hitting a Modified
Line in the L1 Data Cache of the Other Core May Cause Unpredictable
System Behavior

Problem: When request for data from Core 1 results in a L1 cache
miss, the request is sent to the L2 cache. If this request hits
a modified line in the L1 data cache of Core 2, certain internal
conditions may cause incorrect data to be returned to the Core 1.

Implication: This erratum may cause unpredictable system behavior.

Workaround: It is possible for the BIOS to contain a workaround for
this erratum.

Status: For the steppings affected, see the Summary Tables of Changes.

The steppings with the problem are B2 and L2. L2 is Allendale and I
don't think that Apple uses Allendale parts. The problem is fixed in
steppings A1, E1, M0, and G0.

At least one user reported a hanging problem with a Core Duo processor
though Brendan believes that to be a different issue due to the use
of thick locks for quite some time.
(In reply to comment #24)
> We don't know exactly what the Mac OSX routines do and Brendan would prefer the
> use of Mozilla's thinlocks as they've worked for some time.

I am fine with using an OS routine that claims to be a leaf "lock cmpxchg" subroutine, where we can disassemble and read it. I would not want to make a system call to do this -- has to be in-process fast-calling-convention code. Not convinced without measurement that it must be inlined in jslock.c callers.

> All of the recent Macs are dual-core except for the Mac Mini I think which came
> in a Core Solo variant. I don't know if the Mac OSX locking routines are more
> expensive or less expensive than Mozilla's thick locks. It seems to me that
> those seeing the problems are running more in the way of extensions and I think
> that we should collect more information on their environments if we can't
> reproduce their problems to see if there is something in common. My user base
> is very small so perhaps none of them use the particular combination of
> extensions and hardware where the problem is seen.

Just want to reiterate that I disabled thin locks over a year ago, so no one reporting a hang can blame it on them. There are probably several hang bugs not properly filed because I left bug 353962 open and it has drawn other lightning strikes.

> The errata item that I suspect is here:
> 
> AH 39. Cache Data Access Request from One Core Hitting a Modified
> Line in the L1 Data Cache of the Other Core May Cause Unpredictable
> System Behavior
> 
> Problem: When request for data from Core 1 results in a L1 cache
> miss, the request is sent to the L2 cache. If this request hits
> a modified line in the L1 data cache of Core 2, certain internal
> conditions may cause incorrect data to be returned to the Core 1.
> 
> Implication: This erratum may cause unpredictable system behavior.
> 
> Workaround: It is possible for the BIOS to contain a workaround for
> this erratum.
> 
> Status: For the steppings affected, see the Summary Tables of Changes.
> 
> The steppings with the problem are B2 and L2. L2 is Allendale and I
> don't think that Apple uses Allendale parts. The problem is fixed in
> steppings A1, E1, M0, and G0.

Thanks, this is helpful. It's not the erratum I cited in email, though (note to schrep for communication with chip vendors).

> At least one user reported a hanging problem with a Core Duo processor
> though Brendan believes that to be a different issue due to the use
> of thick locks for quite some time.

Right, I don't believe our thick lock usage is suffering from chip bugs biting the atomic instructions used by, e.g., pthreads.

/be
(In reply to comment #25)
> (In reply to comment #24)
> Not convinced without measurement that it must be inlined in jslock.c callers.

ISTR someone doing such measurements recently, although it may have been on a tier 2 platform. Better register allocation, no stack homing and spilling, etc. Someone please cite the bug.

/be
Moving TM to beta4 since we are still waiting on testing what to do here
Target Milestone: mozilla1.9beta3 → mozilla1.9beta4
Attached patch Possible Fix from Arjan van de Ven (obsolete) (deleted) — Splinter Review
Comment on attachment 300705 [details] [diff] [review]
Possible Fix from Arjan van de Ven

This is the fix for bug 353962 and I've attached it there, along with the change to re-enable JS thin locks.

/be
Attachment #300705 - Attachment is obsolete: true
Comment on attachment 216081 [details] [diff] [review]
jslock.c patch to use _InterlockedCompareExchange in js_CompareAndSwap

>+#pragma intrinsic(_InterlockedCompareExchange)
Unfortunately this only works with /Oi or /O2 so you break people building with --disable-optmize
(In reply to comment #30)
> (From update of attachment 216081 [details] [diff] [review])
> >+#pragma intrinsic(_InterlockedCompareExchange)
> Unfortunately this only works with /Oi or /O2 so you break people building with
> --disable-optmize
> 

Just in case someone else cannot build his Mozilla in the debug mode on Windows anymore - it looks like upgrading from VC7.1 to VC8 solves the problem. The upgrade was smooth following http://developer.mozilla.org/en/docs/Windows_Build_Prerequisites .
(In reply to comment #30)
> (From update of attachment 216081 [details] [diff] [review])
> >+#pragma intrinsic(_InterlockedCompareExchange)
> Unfortunately this only works with /Oi or /O2 so you break people building with
> --disable-optmize

Should I put in a check for the MSVC version around this code?
The blocker is fixed, we build with VC8. File a follow up for VC7.1.
Status: REOPENED → RESOLVED
Closed: 17 years ago17 years ago
Resolution: --- → FIXED
Depends on: 416813
Pete, bug 416813 fixed things for VC7.1 and the patch is in. Please update and confirm over there that it works for you (feel free to VERIFY the bug). Thanks,

/be
Will do. 

FYI: I was building FIREFOX_3_0b3_RELEASE tag
Flags: in-testsuite-
Flags: in-litmus-
Blocks: songbird
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: