Closed Bug 551539 Opened 15 years ago Closed 15 years ago

Consider using memcmp on at least Mac in StringMatch for long patterns

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bzbarsky, Assigned: luke)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files, 2 obsolete files)

memcmp is a lot faster than our own loop stuff for long patterns on mac, because it uses MMX. We should consider using it.
Attached patch patch (obsolete) (deleted) — Splinter Review
As the bug summary says, the patch does memcmp for big patterns when memcmp is fast enough. I wasn't exactly sure what the right predicate was for this, so I went with: return #if !__GNUC__ || __APPLE__ patlen > 256 ? Duff<MemCmp>(text, textlen, pat, patlen) : #endif Duff<ManualCmp>(text, textlen, pat, patlen); Is this at all a sensible default for non-Linux GCC's? Still need to actually test on Mac and Windows, but, strangely enough, its no slower and actually 8% faster on Linux.
Assignee: general → lw
Status: NEW → ASSIGNED
The attached patch doesn't have the predicate listed in comment 1.... I tested memcmp a bit on Windows (64-bit Win7), and while it seems to be a bit slower than our loop on the 2-char pattern and long string testcase luke wrote, and at least 2x faster on the long pattern + long string testcase.
Somebody forgot to hg qref...
Attached patch hg qrefed (obsolete) (deleted) — Splinter Review
So here I just use #if !__linux__.
Attachment #431685 - Attachment is obsolete: true
Somewhere else in the code we runtime-select SSE2 versions of ... UTF8 decoding, perhaps? There is now shared code to do the runtime selection of SSE2 without reinventing it, though we might have to reinvent it due to layering reasons here. Anyway, I guess what I'm really asking is: should we do and SSE2-ized memcmp for use here, so that we can run it on all platforms?
Luke and I did talk about this. It was sounding like a lot of work; certainly more than just using memcmp for a start. Note that the Mac memcmp uses MMX internally. Linux just uses straight loops and compares (hence slow, due to the function call overhead). Not sure what Windows does, offhand. I do wonder whether we could get better perf out of SSE/2 than out of MMX here....
If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a win for it, since they can depend on it as easily as MMX for their product line.
Attached file bug test (deleted) —
Measuring the "Short" and "Long" results From To OSX 10.6 74/84 65/49 OSX 10.5 86/126 99/49 Linux x86 97/89 87/96 Windows x86 68/133 80/79 Long improves as expected. Whats interesting is that "Short" is affected slightly negatively on 10.5 and Windows and positively on 10.6 and Linux. The "Short" test is, theoretically, no different, but (as bug 526173 demonstrates) the speed on x86 depends heavily on the compiler's register allocation choices. Interestingly, I was able to take out the explicit register allocation hack added by bug 526173 with no measurable change in 10.5 on bz's test in bug 526173.
The test in bug 526173 uses long strings that fall into the memcmp codepath now...
The text is long, but the patterns are short, so no memcpy.
err, memcmp.
(In reply to comment #7) > If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a > win for it, since they can depend on it as easily as MMX for their product > line. The right implementation using SSE2 should definitely win over MMX. The only scenario I can think where MMX might win over SSE2 would be an SSE2 implementation that has not taken care of the desired alignment (mod 16). Unaligned accesses can have penalties that depend on the uArch implementation. A good SSE2 implementation would have to take care of the beginning few bytes (mod 16), then process the bulk of the operation in macro mode, and may have to take care of the remaining bytes at the end as well. If the size of memory/string is large enough, the gain is much larger than the overhead of the required setup. You may want to consult the following patches submitted by my colleague, H.J. Lu at Intel: SSE2/SSE3 (Jan. 12, 2010) http://sourceware.org/ml/libc-alpha/2010-01/msg00016.html SSSE3/SSS4.2 (Feb. 12, 2010): Note memcmp 3X improvement : http://sourceware.org/ml/libc-alpha/2010-02/msg00028.html Another point to remember is that since these primitives are important, later processors may also improve the performance of the plain simple implementations. In other words, the tradeoffs can vary over time, ensuring job security for SW people ;) -moh
> The text is long, but the patterns are short, so no memcpy. Er, indeed. Well, the register allocation bug in question seems to not be very robust to minor code changes... ;)
> implementation that has not taken care of the desired alignment (mod 16). This could be an issue because the alignment would need to happen for _both_ strings at once. If their relative alignment is not 0 mod 16, you can't win, right?
Right. Here's a possible pseudocode: if (nbytes <= sizeThreshold) { ... } else { int srcMOD16 = ((int) src) % 16; int dstMOD16 = ((int) dst) % 16; if ((srcMOD16 | dstMOD16) == 0) { int rem = nbytes % 128; if (nbytes != rem) { ... } if (rem) { ... } } else if (srcMOD16 == dstMOD16) { int rem = 16 - srcMOD16; ... } else { ... } } The complexity can be discouraging, but the payoff can be really high too.
Perhaps we should file a separate 'fast memcmp on Linux' bug?
Attached patch tweaked (deleted) — Splinter Review
It seemed like "#if !defined(__linux__)" was more appropriate than "#if !__linux__" as a guard.
Attachment #431729 - Attachment is obsolete: true
Attachment #432227 - Flags: review?
Attachment #432227 - Flags: review? → review?(bzbarsky)
Comment on attachment 432227 [details] [diff] [review] tweaked r=bzbarsky
Attachment #432227 - Flags: review?(bzbarsky) → review+
Whiteboard: fixed-in-tracemonkey
Hmm, backed out for issues on Windows and Linux64.
Oh, psh, over zealous assert; my eyes ignored the #if __i386__. However, this does reveal that __i386__ isn't defined by MSVC for x86 systems. _M_IX86 is, however, so this seems to indicate using: #if defined(__i386__) || defined(_M_IX86) around the patlen == 1 special case.
Just the other day, I was looking at the performance profile of the latest trunk and noticed that in jsnum.h where js_DoubleToECMAInt32 has different code versions based on #ifdef __i386__, the else part was picked on Win32 (while the then part should have been picked). #ifdef __i386__ ... #else ... // look at the attached profile #endif Perhaps this is the same problem revelead by Luke's case. We may need a scan for similar cases.
MSDN indicates that _M_IX86 is what I need, so pushed with that and the assert fix. http://hg.mozilla.org/tracemonkey/rev/881eacfc665c
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: