Closed
Bug 551539
Opened 15 years ago
Closed 15 years ago
Consider using memcmp on at least Mac in StringMatch for long patterns
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: bzbarsky, Assigned: luke)
Details
(Whiteboard: fixed-in-tracemonkey)
Attachments
(3 files, 2 obsolete files)
memcmp is a lot faster than our own loop stuff for long patterns on mac, because it uses MMX. We should consider using it.
Assignee | ||
Comment 1•15 years ago
|
||
As the bug summary says, the patch does memcmp for big patterns when memcmp is fast enough. I wasn't exactly sure what the right predicate was for this, so I went with:
return
#if !__GNUC__ || __APPLE__
patlen > 256 ? Duff<MemCmp>(text, textlen, pat, patlen)
:
#endif
Duff<ManualCmp>(text, textlen, pat, patlen);
Is this at all a sensible default for non-Linux GCC's?
Still need to actually test on Mac and Windows, but, strangely enough, its no slower and actually 8% faster on Linux.
Assignee: general → lw
Status: NEW → ASSIGNED
Reporter | ||
Comment 2•15 years ago
|
||
The attached patch doesn't have the predicate listed in comment 1....
I tested memcmp a bit on Windows (64-bit Win7), and while it seems to be a bit slower than our loop on the 2-char pattern and long string testcase luke wrote, and at least 2x faster on the long pattern + long string testcase.
Assignee | ||
Comment 3•15 years ago
|
||
Somebody forgot to hg qref...
Assignee | ||
Comment 4•15 years ago
|
||
So here I just use #if !__linux__.
Attachment #431685 -
Attachment is obsolete: true
Somewhere else in the code we runtime-select SSE2 versions of ... UTF8 decoding, perhaps? There is now shared code to do the runtime selection of SSE2 without reinventing it, though we might have to reinvent it due to layering reasons here.
Anyway, I guess what I'm really asking is: should we do and SSE2-ized memcmp for use here, so that we can run it on all platforms?
Reporter | ||
Comment 6•15 years ago
|
||
Luke and I did talk about this. It was sounding like a lot of work; certainly more than just using memcmp for a start.
Note that the Mac memcmp uses MMX internally. Linux just uses straight loops and compares (hence slow, due to the function call overhead). Not sure what Windows does, offhand.
I do wonder whether we could get better perf out of SSE/2 than out of MMX here....
If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a win for it, since they can depend on it as easily as MMX for their product line.
Assignee | ||
Comment 8•15 years ago
|
||
Measuring the "Short" and "Long" results
From To
OSX 10.6 74/84 65/49
OSX 10.5 86/126 99/49
Linux x86 97/89 87/96
Windows x86 68/133 80/79
Long improves as expected. Whats interesting is that "Short" is affected slightly negatively on 10.5 and Windows and positively on 10.6 and Linux. The "Short" test is, theoretically, no different, but (as bug 526173 demonstrates) the speed on x86 depends heavily on the compiler's register allocation choices. Interestingly, I was able to take out the explicit register allocation hack added by bug 526173 with no measurable change in 10.5 on bz's test in bug 526173.
Reporter | ||
Comment 9•15 years ago
|
||
The test in bug 526173 uses long strings that fall into the memcmp codepath now...
Assignee | ||
Comment 10•15 years ago
|
||
The text is long, but the patterns are short, so no memcpy.
Assignee | ||
Comment 11•15 years ago
|
||
err, memcmp.
Comment 12•15 years ago
|
||
(In reply to comment #7)
> If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a
> win for it, since they can depend on it as easily as MMX for their product
> line.
The right implementation using SSE2 should definitely win over MMX. The only scenario I can think where MMX might win over SSE2 would be an SSE2 implementation that has not taken care of the desired alignment (mod 16). Unaligned accesses can have penalties that depend on the uArch implementation.
A good SSE2 implementation would have to take care of the beginning few bytes (mod 16), then process the bulk of the operation in macro mode, and may have to take care of the remaining bytes at the end as well. If the size of memory/string is large enough, the gain is much larger than the overhead of the required setup.
You may want to consult the following patches submitted by my colleague, H.J. Lu at Intel:
SSE2/SSE3 (Jan. 12, 2010)
http://sourceware.org/ml/libc-alpha/2010-01/msg00016.html
SSSE3/SSS4.2 (Feb. 12, 2010): Note memcmp 3X improvement :
http://sourceware.org/ml/libc-alpha/2010-02/msg00028.html
Another point to remember is that since these primitives are important, later processors may also improve the performance of the plain simple implementations. In other words, the tradeoffs can vary over time, ensuring job security for SW people ;)
-moh
Reporter | ||
Comment 13•15 years ago
|
||
> The text is long, but the patterns are short, so no memcpy.
Er, indeed. Well, the register allocation bug in question seems to not be very robust to minor code changes... ;)
Reporter | ||
Comment 14•15 years ago
|
||
> implementation that has not taken care of the desired alignment (mod 16).
This could be an issue because the alignment would need to happen for _both_ strings at once. If their relative alignment is not 0 mod 16, you can't win, right?
Comment 15•15 years ago
|
||
Right. Here's a possible pseudocode:
if (nbytes <= sizeThreshold) {
...
} else {
int srcMOD16 = ((int) src) % 16;
int dstMOD16 = ((int) dst) % 16;
if ((srcMOD16 | dstMOD16) == 0) {
int rem = nbytes % 128;
if (nbytes != rem) {
...
}
if (rem) {
...
}
} else if (srcMOD16 == dstMOD16) {
int rem = 16 - srcMOD16;
...
} else {
...
}
}
The complexity can be discouraging, but the payoff can be really high too.
Assignee | ||
Comment 16•15 years ago
|
||
Perhaps we should file a separate 'fast memcmp on Linux' bug?
Assignee | ||
Comment 17•15 years ago
|
||
It seemed like "#if !defined(__linux__)" was more appropriate than "#if !__linux__" as a guard.
Attachment #431729 -
Attachment is obsolete: true
Attachment #432227 -
Flags: review?
Assignee | ||
Updated•15 years ago
|
Attachment #432227 -
Flags: review? → review?(bzbarsky)
Reporter | ||
Comment 18•15 years ago
|
||
Comment on attachment 432227 [details] [diff] [review]
tweaked
r=bzbarsky
Attachment #432227 -
Flags: review?(bzbarsky) → review+
Assignee | ||
Comment 19•15 years ago
|
||
Whiteboard: fixed-in-tracemonkey
Assignee | ||
Comment 20•15 years ago
|
||
Hmm, backed out for issues on Windows and Linux64.
Assignee | ||
Comment 21•15 years ago
|
||
Oh, psh, over zealous assert; my eyes ignored the #if __i386__. However, this does reveal that __i386__ isn't defined by MSVC for x86 systems. _M_IX86 is, however, so this seems to indicate using:
#if defined(__i386__) || defined(_M_IX86)
around the patlen == 1 special case.
Comment 22•15 years ago
|
||
Just the other day, I was looking at the performance profile of the latest trunk and noticed that in jsnum.h where js_DoubleToECMAInt32 has different code versions based on #ifdef __i386__, the else part was picked on Win32 (while the then part should have been picked).
#ifdef __i386__
...
#else
... // look at the attached profile
#endif
Perhaps this is the same problem revelead by Luke's case. We may need a scan for similar cases.
Assignee | ||
Comment 23•15 years ago
|
||
MSDN indicates that _M_IX86 is what I need, so pushed with that and the assert fix.
http://hg.mozilla.org/tracemonkey/rev/881eacfc665c
Comment 24•15 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•