Closed Bug 551539 Opened 15 years ago Closed 15 years ago

Consider using memcmp on at least Mac in StringMatch for long patterns

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: bzbarsky, Assigned: luke)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files, 2 obsolete files)

patch 15 years ago Luke Wagner [:luke] (deleted), patch		Details \| Diff \| Splinter Review
hg qrefed 15 years ago Luke Wagner [:luke] (deleted), patch		Details \| Diff \| Splinter Review
bug test 15 years ago Luke Wagner [:luke] (deleted), application/javascript		Details
tweaked 15 years ago Luke Wagner [:luke] (deleted), patch	bzbarsky : review+	Details \| Diff \| Splinter Review
profile indicating #if __i386__ not handled properly 15 years ago Moh Haghighat (deleted), image/x-png		Details

Boris Zbarsky [:bzbarsky]

Reporter

Description

•

15 years ago

memcmp is a lot faster than our own loop stuff for long patterns on mac, because it uses MMX. We should consider using it.

Luke Wagner [:luke]

Assignee

Comment 1

•

15 years ago

Attached patch patch (obsolete) (deleted) — Details — Splinter Review

As the bug summary says, the patch does memcmp for big patterns when memcmp is fast enough. I wasn't exactly sure what the right predicate was for this, so I went with: return #if !__GNUC__ || __APPLE__ patlen > 256 ? Duff<MemCmp>(text, textlen, pat, patlen) : #endif Duff<ManualCmp>(text, textlen, pat, patlen); Is this at all a sensible default for non-Linux GCC's? Still need to actually test on Mac and Windows, but, strangely enough, its no slower and actually 8% faster on Linux.

Assignee: general → lw

Status: NEW → ASSIGNED

Boris Zbarsky [:bzbarsky]

Reporter

Comment 2

•

15 years ago

The attached patch doesn't have the predicate listed in comment 1.... I tested memcmp a bit on Windows (64-bit Win7), and while it seems to be a bit slower than our loop on the 2-char pattern and long string testcase luke wrote, and at least 2x faster on the long pattern + long string testcase.

Luke Wagner [:luke]

Assignee

Comment 3

•

15 years ago

Somebody forgot to hg qref...

Luke Wagner [:luke]

Assignee

Comment 4

•

15 years ago

Attached patch hg qrefed (obsolete) (deleted) — Details — Splinter Review

So here I just use #if !__linux__.

Attachment #431685 - Attachment is obsolete: true

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 5

•

15 years ago

Somewhere else in the code we runtime-select SSE2 versions of ... UTF8 decoding, perhaps? There is now shared code to do the runtime selection of SSE2 without reinventing it, though we might have to reinvent it due to layering reasons here. Anyway, I guess what I'm really asking is: should we do and SSE2-ized memcmp for use here, so that we can run it on all platforms?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 6

•

15 years ago

Luke and I did talk about this. It was sounding like a lot of work; certainly more than just using memcmp for a start. Note that the Mac memcmp uses MMX internally. Linux just uses straight loops and compares (hence slow, due to the function call overhead). Not sure what Windows does, offhand. I do wonder whether we could get better perf out of SSE/2 than out of MMX here....

Mike Shaver (:shaver -- probably not reading bugmail closely)

Comment 7

•

15 years ago

If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a win for it, since they can depend on it as easily as MMX for their product line.

Luke Wagner [:luke]

Assignee

Comment 8

•

15 years ago

Attached file bug test (deleted) — Details

Measuring the "Short" and "Long" results From To OSX 10.6 74/84 65/49 OSX 10.5 86/126 99/49 Linux x86 97/89 87/96 Windows x86 68/133 80/79 Long improves as expected. Whats interesting is that "Short" is affected slightly negatively on 10.5 and Windows and positively on 10.6 and Linux. The "Short" test is, theoretically, no different, but (as bug 526173 demonstrates) the speed on x86 depends heavily on the compiler's register allocation choices. Interestingly, I was able to take out the explicit register allocation hack added by bug 526173 with no measurable change in 10.5 on bz's test in bug 526173.

Boris Zbarsky [:bzbarsky]

Reporter

Comment 9

•

15 years ago

The test in bug 526173 uses long strings that fall into the memcmp codepath now...

Luke Wagner [:luke]

Assignee

Comment 10

•

15 years ago

The text is long, but the patterns are short, so no memcpy.

Luke Wagner [:luke]

Assignee

Comment 11

•

15 years ago

err, memcmp.

Moh Haghighat

Comment 12

•

15 years ago

(In reply to comment #7) > If Apple's GCC uses MMX rather than SSE2, I suspect it means that SSE2 isn't a > win for it, since they can depend on it as easily as MMX for their product > line. The right implementation using SSE2 should definitely win over MMX. The only scenario I can think where MMX might win over SSE2 would be an SSE2 implementation that has not taken care of the desired alignment (mod 16). Unaligned accesses can have penalties that depend on the uArch implementation. A good SSE2 implementation would have to take care of the beginning few bytes (mod 16), then process the bulk of the operation in macro mode, and may have to take care of the remaining bytes at the end as well. If the size of memory/string is large enough, the gain is much larger than the overhead of the required setup. You may want to consult the following patches submitted by my colleague, H.J. Lu at Intel: SSE2/SSE3 (Jan. 12, 2010) http://sourceware.org/ml/libc-alpha/2010-01/msg00016.html SSSE3/SSS4.2 (Feb. 12, 2010): Note memcmp 3X improvement : http://sourceware.org/ml/libc-alpha/2010-02/msg00028.html Another point to remember is that since these primitives are important, later processors may also improve the performance of the plain simple implementations. In other words, the tradeoffs can vary over time, ensuring job security for SW people ;) -moh

Boris Zbarsky [:bzbarsky]

Reporter

Comment 13

•

15 years ago

> The text is long, but the patterns are short, so no memcpy. Er, indeed. Well, the register allocation bug in question seems to not be very robust to minor code changes... ;)

Boris Zbarsky [:bzbarsky]

Reporter

Comment 14

•

15 years ago

> implementation that has not taken care of the desired alignment (mod 16). This could be an issue because the alignment would need to happen for _both_ strings at once. If their relative alignment is not 0 mod 16, you can't win, right?

Moh Haghighat

Comment 15

•

15 years ago

Right. Here's a possible pseudocode: if (nbytes <= sizeThreshold) { ... } else { int srcMOD16 = ((int) src) % 16; int dstMOD16 = ((int) dst) % 16; if ((srcMOD16 | dstMOD16) == 0) { int rem = nbytes % 128; if (nbytes != rem) { ... } if (rem) { ... } } else if (srcMOD16 == dstMOD16) { int rem = 16 - srcMOD16; ... } else { ... } } The complexity can be discouraging, but the payoff can be really high too.

Luke Wagner [:luke]

Assignee

Comment 16

•

15 years ago

Perhaps we should file a separate 'fast memcmp on Linux' bug?

Luke Wagner [:luke]

Assignee

Comment 17

•

15 years ago

Attached patch tweaked (deleted) — Details — Splinter Review

It seemed like "#if !defined(__linux__)" was more appropriate than "#if !__linux__" as a guard.

Attachment #431729 - Attachment is obsolete: true

Attachment #432227 - Flags: review?

Luke Wagner [:luke]

Assignee

Updated

•

15 years ago

Attachment #432227 - Flags: review? → review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Reporter

Comment 18

•

15 years ago

Comment on attachment 432227 [details] [diff] [review] tweaked r=bzbarsky

Attachment #432227 - Flags: review?(bzbarsky) → review+

Luke Wagner [:luke]

Assignee

Comment 19

•

15 years ago

http://hg.mozilla.org/tracemonkey/rev/bc5039752e41

Whiteboard: fixed-in-tracemonkey

Luke Wagner [:luke]

Assignee

Comment 20

•

15 years ago

Hmm, backed out for issues on Windows and Linux64.

Luke Wagner [:luke]

Assignee

Comment 21

•

15 years ago

Oh, psh, over zealous assert; my eyes ignored the #if __i386__. However, this does reveal that __i386__ isn't defined by MSVC for x86 systems. _M_IX86 is, however, so this seems to indicate using: #if defined(__i386__) || defined(_M_IX86) around the patlen == 1 special case.

Moh Haghighat

Comment 22

•

15 years ago

Attached image profile indicating #if __i386__ not handled properly (deleted) — Details

Just the other day, I was looking at the performance profile of the latest trunk and noticed that in jsnum.h where js_DoubleToECMAInt32 has different code versions based on #ifdef __i386__, the else part was picked on Win32 (while the then part should have been picked). #ifdef __i386__ ... #else ... // look at the attached profile #endif Perhaps this is the same problem revelead by Luke's case. We may need a scan for similar cases.

Luke Wagner [:luke]

Assignee

Comment 23

•

15 years ago

MSDN indicates that _M_IX86 is what I need, so pushed with that and the assert fix. http://hg.mozilla.org/tracemonkey/rev/881eacfc665c

Robert Sayre

Comment 24

•

15 years ago

http://hg.mozilla.org/mozilla-central/rev/881eacfc665c

Status: ASSIGNED → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.