Closed Bug 1358599 Opened 7 years ago Closed 7 years ago

Use runtime checks for GC pre-barriers instead of patchable jumps

Categories

(Core :: JavaScript Engine: JIT, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla55
Performance Impact high
Tracking Status
firefox55 --- fixed

People

(Reporter: djvj, Assigned: djvj)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 6 obsolete files)

Currently, pre-barriers emitted in jitcode use patchable jumps around GC pre-barriers, that look something like this:

    JUMP skip_barrier;
    <BARRIER CODE>
  skip_barrier:
    ...

At both the start and end of an incremental garbage collection, we toggle all the JUMPs to CMPs and fall through to the barrier code.

This behaviour was written prior to our jitcode being made w^x.  Prior to this, this was a relatively cheap operation that saved runtime cost.  After w^x however, this operation becomes a lot more expensive, necessitating multiple mprotect() calls both before and after the toggle/patch code.

Checking this at runtime would look something like

    LOAD reg, *addressOfIncrementalGCFlag
    JZ reg, skip_barrier
    <BARRIER CODE>
  skip_barrier:
    ...

It's an extra load, which should be relatively fast as it's always from the same address and should be in L1 most of the time.  The JZ should be well-predicted since it'll always be true or false for extended periods of time.  Runtime cost is probably not that high, and in exchange, we get to eliminate toggleBarriers() entirely.
I think it's particularly egregious that the mprotect() and patching code touches ALL jitcode, both hot and cold.  Also, we would save some non-trivial amount of memory by not keeping the tables around to track the patchable jumps.
Whoa, nice catch.
If this does regress benchmarks, one option is to change this only for Baseline and IC code.
I'm not too concerned if this regresses Octane somewhat.  If it regresses Speedometer, then yeah.
(In reply to Kannan Vijayan [:djvj] from comment #4)
> I'm not too concerned if this regresses Octane somewhat.  If it regresses
> Speedometer, then yeah.

Speedometer is not the holy grail of benchmarks though. We can take a small hit on a typical |o1.foo = o2| SetProp micro-benchmark, anything more than that is unacceptable.
Of course, I'm not suggesting that Speedometer is a new gold standard.  My statement was more about caring less about Octane regressions than we would have in a previous life.  I'm going off of speedometer being a better reflection of web page behaviour, where we have more ground to cover than comp-heavy benchmarks than Octane, which we have already paid a lot of attention to.
Ok, it turns out that we need a few more instructions than I previously thought.  Namely, since we need a register, and we can't be sure of where we're getting that register, we need a push/pop around the gated barrier code.

I think we can eliminate this in many circumstances where we know that there's a register free.  In Ion, we can get the compiler to provide a scratch register, which may come for free.  In baseline, we use register ad-hoc, and may be able to use BaselineScratchReg in most circumstances, or some other register that we know is free.

For now, I'm just always doing the push/pop and seeing how well it does.
There are branchTest32 and branch32 overloads that take AbsoluteAddress + Imm32. Does that help?
I looked around for these, didn't find them.  When you say "AbsoluteAddress + Imm32", do you mean a "void (*)(AbsoluteAddress, Imm32)" signature?  If so, that's exactly what I need.  Where are they defined?
Ah, nm.  I kept looking for cmp32, and test32, didn't look for the fused test-and-branch methods.  It's been too long since I played with the MacroAssembler.
Attached patch runtime-pre-barriers.patch (obsolete) (deleted) — Splinter Review
Initial patch for runtime pre-barriers.  It compiles, no testing yet.  Turns out there was already a pre-existing MacroAssembler method to do this stuff.  It also turns out that the pre-existing method was doing a 'cmpl' while the underlying var was a 'bool', which seemed weird, so I switched the var over to a 'uint32_t'.
Attached patch runtime-pre-barriers.patch (obsolete) (deleted) — Splinter Review
Rebased patch.  Try shows jit+gc marking failures in Linux and Linux64:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6a4fccc4f41f3c5291715e89289e99bd1e421d30
Attachment #8862114 - Attachment is obsolete: true
Attached patch runtime-pre-barriers.patch (obsolete) (deleted) — Splinter Review
Updated patch fixing previous issues.  First off, I wasn't actually removing the patchable jump.  Secondly, I was using the inverted conditional in the branch instruction.  This one passes the jit-tests on commandline.  Running in try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f10eea49b062b5886279800f9d99bb3c3b4f2c8c
Attachment #8862454 - Attachment is obsolete: true
I haven't had time to try this on speedometer yet, but since I was concerned about regressions on octane, I ran this in js-shell against our in-tree octane.  20 runs, lowest and highest scores removed.  Reference on left, modified (no-toggleBarriers) on right.

Upshot this is a strict win on even octane:

       Richards -         27893 +- 104         26097 +- 166
      DeltaBlue -         41144 +- 453         42658 +- 531
         Crypto -         26629 +- 121         27320 +- 172
       RayTrace -         82233 +- 372         83586 +- 634
    EarleyBoyer -         21113 +- 250         22099 +- 168
         RegExp -           3135 +- 36           3225 +- 32
          Splay -         10905 +- 151         14317 +- 114
   SplayLatency -          7451 +- 132         17196 +- 116
   NavierStokes -         27744 +- 157         28197 +- 161
          PdfJS -          9907 +- 101         10346 +- 228
       Mandreel -         21093 +- 230         21987 +- 239
MandreelLatency -         25493 +- 284         26278 +- 378
        Gameboy -         33309 +- 853         35900 +- 900
       CodeLoad -         15860 +- 124         16279 +- 117
          Box2D -         31333 +- 538         31821 +- 669
           zlib -         68672 +- 381         69875 +- 441
     Typescript -         19585 +- 154         20097 +- 134
Score (version 9) -        21440 +- 82         23306 +- 127

Can someone tell me how to get reliable speedometer numbers?
Attachment #8862559 - Flags: review?(jdemooij)
Attachment #8862559 - Flags: review?(jcoppeard)
Props to :tcampbell for pointing this out as a potential perf issue.
Attached patch runtime-pre-barriers.patch (obsolete) (deleted) — Splinter Review
Latest patch.  Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=00e88c8966d3d746e0af7fe1e667e66b1aacb736
Attachment #8862559 - Attachment is obsolete: true
Attachment #8862559 - Flags: review?(jdemooij)
Attachment #8862559 - Flags: review?(jcoppeard)
Attachment #8862609 - Flags: review?(sphink)
Attachment #8862609 - Flags: review?(jdemooij)
Attachment #8862609 - Flags: review?(sphink) → review+
Whiteboard: [qf] → [qf:p1]
Attached patch runtime-pre-barriers.patch (obsolete) (deleted) — Splinter Review
After :bz pointed me to other places where |toggleBarriers| is called, I went and cleaned all of those up too.  New patch.  Going to keep r+ from :sfink since the delta is all related to jitcode.
Attachment #8862609 - Attachment is obsolete: true
Attachment #8862609 - Flags: review?(jdemooij)
Attachment #8862718 - Flags: review?(jdemooij)
Comment on attachment 8862718 [details] [diff] [review]
runtime-pre-barriers.patch

Review of attachment 8862718 [details] [diff] [review]:
-----------------------------------------------------------------

Simpler and faster, nice! Some suggestions for additional cleanup/removal, leave-open + follow-up patch is fine.

::: js/src/jit/MacroAssembler.h
@@ +1664,4 @@
>      }
>  
>      template <typename T>
>      void patchableCallPreBarrier(const T& address, MIRType type) {

Maybe inline callPreBarrier into this one, and rename to just callPreBarrier?

@@ +1667,5 @@
>      void patchableCallPreBarrier(const T& address, MIRType type) {
>          Label done;
>  
>          // All barriers are off by default.
>          // They are enabled if necessary at the end of CodeGenerator::generate().

Nit: can remove this comment too

@@ +1669,5 @@
>  
>          // All barriers are off by default.
>          // They are enabled if necessary at the end of CodeGenerator::generate().
> +
> +        branchTestNeedsIncrementalBarrier(Assembler::Equal, &done);

Nit: branchTestNeedsIncrementalBarrier asserts cond is Zero or NonZero, so Zero would be more consistent (it's the same value at least on x86/ARM).

Btw, branchTestNeedsIncrementalBarrier is also called in CodeGenerator::emitArrayPopShift, but there's no good reason for it. Mind filing a follow-up bug to remove that one?

@@ -1668,5 @@
>  
>          // All barriers are off by default.
>          // They are enabled if necessary at the end of CodeGenerator::generate().
> -        CodeOffset nopJump = toggledJump(&done);
> -        writePrebarrierOffset(nopJump);

writePrebarrierOffset and the assembler's preBarriers_ can be removed too.
Attachment #8862718 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #19)
> ::: js/src/jit/MacroAssembler.h
> @@ +1664,4 @@
> >      }
> >  
> >      template <typename T>
> >      void patchableCallPreBarrier(const T& address, MIRType type) {
> 
> Maybe inline callPreBarrier into this one, and rename to just callPreBarrier?

Good call.  Inlining actually cleans up the code a bit, especially in the case where we're barriering |MIRType::Value|.  The |branchTestGCThing| on the value can jump directly to the final (cacheline-aligned) |done| label, instead of the double-hopping as it would currently be.

I'm renaming the whole thing |guardedCallPreBarrier|.

> writePrebarrierOffset and the assembler's preBarriers_ can be removed too.

Done.  Also removed the preBarriers table on the JitCode structs too.  This stuff is like spring cleaning.  The more you remove, the more you find.

Other nits addressed.  There are enough changes here that I'd be more comfortable if you took another look at it, so putting up the delta patch for r?.
Attached patch remove-more-pre-barrier-code.patch (obsolete) (deleted) — Splinter Review
Remove more pre barrier related code that is no longer needed.  Delta from previous patch.
Attachment #8862852 - Flags: review?(jdemooij)
> Btw, branchTestNeedsIncrementalBarrier is also called in CodeGenerator::emitArrayPopShift, but there's no good reason for it. Mind filing a follow-up bug to remove that one?

I'm not seeing how we can remove this.  Don't we need a pre barrier on pop and shift?  I'm not seeing any other code that calls into |patchableCallPreBarrier| in that codegen.
Comment on attachment 8862852 [details] [diff] [review]
remove-more-pre-barrier-code.patch

Review of attachment 8862852 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM. Just a heads up, I reviewed a patch today that added a patchableCallPreBarrier to CodeGenerator.cpp so you might want to double check on tip.

::: js/src/jit/MacroAssembler.h
@@ +1663,4 @@
>          jump(&done);
>  
>          haltingAlign(8);
>          bind(&done);

Pre-existing but I don't know why this haltingAlign is here. We don't emit it elsewhere for similar code paths. If we removed it our code would be a bit smaller and we could rm the jump(&done) right before it. r=me either way.
Attachment #8862852 - Flags: review?(jdemooij) → review+
Assignee: nobody → kvijayan
Yeah,  I ran into a compile error over it.  Fixed it in my patch.

This is just a guess, but maybe cache lines?  The common case is the guarded code not getting executed.  If we jump past it, and the target address lands at the beginning of a cache line, we get a full cache line of instructions to execute in one go.

Not sure how much it's worth though..
Latest try looks good: https://treeherder.mozilla.org/#/jobs?repo=try&revision=d58abab48b81a63a6952722d636ce1eb490530e8

Pushing after one last local build.
Attached patch runtime-pre-barriers.patch (deleted) — Splinter Review
Final patch rebased and passing try.  Forwarding r+.
Attachment #8862718 - Attachment is obsolete: true
Attachment #8862852 - Attachment is obsolete: true
Attachment #8863229 - Flags: review+
Pushed by kvijayan@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0879ee58fcdc
Use runtime guards for jitcode pre-barriers instead of patchable jumps. r=jandem r=sfink
https://hg.mozilla.org/mozilla-central/rev/0879ee58fcdc
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
The verdict on AWFY is a ~5% Octane-richards regression. I'll file a follow-up bug to get rid of that haltingAlign(8) because that still seems silly.
Blocks: 1361021
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: