Open Bug 1808448 Opened 2 years ago Updated 1 year ago

Investigate tweaking the JIT tiering thresholds to improve Speedometer3 score

Categories

(Core :: JavaScript Engine, task, P3)

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

It might be worthwhile to tweak the JIT tiering thresholds to target the Speedometer3 score.

It can't hurt to do some experiments in this area, but we've experimented with this in the past and found that our JIT thresholds are balanced pretty well for most workloads.

Blocks: speedometer3
Severity: -- → N/A
Priority: -- → P3

I will note that there are several of these kinds of knobs that are somewhat interrelated. For example, there are knobs for inlining heuristics which interact with JIT tiering thresholds: if we inline a hot function into its 10 call sites, then it takes 10 times as many calls to get it to Ion (assuming the 10 call sites call it equally often). I hate to be the person saying "what about machine learning?", but it may be interesting to see what happens if we expose all of these knobs to an ML agent that runs a few benchmarks over and over and tries to increase the score by turning the knobs in various ways.
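As a very rough illustration of that idea, here is a sketch of a black-box tuning loop in Python. It is not hooked up to anything real: the knob names, their ranges, and the synthetic score function are all placeholders, and in practice run_benchmark() would have to launch the shell or browser with the chosen settings and parse the reported score.

import random

# Placeholder knobs and ranges -- these names and bounds are invented for the
# sketch, not actual SpiderMonkey option names.
KNOBS = {
    "baseline-warmup-threshold": (10, 500),
    "ion-warmup-threshold": (100, 5000),
    "max-inlinable-bytecode-size": (50, 400),
}

def run_benchmark(settings):
    """Stand-in for running the benchmark with `settings` and returning a score."""
    # Fake score with an arbitrary sweet spot at the middle of each range, so
    # the loop below has something to optimize.
    return -sum(((settings[k] - (lo + hi) / 2) / (hi - lo)) ** 2
                for k, (lo, hi) in KNOBS.items())

def random_neighbor(settings):
    """Nudge one randomly chosen knob, staying inside its allowed range."""
    knob = random.choice(list(KNOBS))
    lo, hi = KNOBS[knob]
    step = max(1, (hi - lo) // 10)
    neighbor = dict(settings)
    neighbor[knob] = max(lo, min(hi, settings[knob] + random.randint(-step, step)))
    return neighbor

def hill_climb(iterations=200):
    """Stochastic hill climbing: keep a knob change only if the score improves."""
    current = {k: random.randint(lo, hi) for k, (lo, hi) in KNOBS.items()}
    best_score = run_benchmark(current)
    for _ in range(iterations):
        candidate = random_neighbor(current)
        score = run_benchmark(candidate)
        if score > best_score:
            current, best_score = candidate, score
    return current, best_score

if __name__ == "__main__":
    print(hill_climb())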

If there's low-hanging fruit here, I expect that it's in tweaking inlining heuristics (where we've done relatively little tuning) instead of tiering thresholds.

In a previous iteration of this discussion I wrote an email about why tweaking the thresholds is unlikely to be a big win. Copying it here:

Thesis: We should not expect our overall JIT performance to be particularly sensitive to the exact value of compilation thresholds, and we probably shouldn't spend much effort trying to tune those thresholds.

Underlying intuition: the marginal cost of compiling some function F one iteration later is the difference in execution time between a single lower-tier iteration of F and a single higher-tier iteration of F. That's not very big. To get to a higher tier, we must expect to run a lot of iterations of F post-compilation. The cumulative cost of those iterations is much larger than the cumulative cost of a few extra lower-tier iterations on the margin.

For example, we currently need a hit count of at least 1000 before we consider Ion-compiling a function. To pay back the costs of Ion compilation, we need thousands of future iterations. Even if we assume that Ion is 10x faster than baseline, increasing the threshold by 10 only costs us the equivalent of ~100 post-compilation iterations. If that makes up a significant fraction of the overall runtime of the function, we probably shouldn't be Ion compiling in the first place!
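A trivial back-of-the-envelope check of that arithmetic (the 10x figure is the assumption carried over from the paragraph above):

# Costs measured in units of one post-compilation (Ion) iteration.
ion_iter = 1.0         # cost of one Ion iteration
baseline_iter = 10.0   # assume baseline is 10x slower than Ion
threshold_bump = 10    # e.g. raising the Ion threshold from 1000 to 1010

# Marginal cost: 10 iterations run in baseline that would otherwise run in Ion.
extra_cost = threshold_bump * (baseline_iter - ion_iter)
print(extra_cost)      # 90.0, i.e. roughly the cost of ~100 Ion iterations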

In other words: whenever we compile a function, we're making a bet that the post-compilation execution time of the function will dominate the overall execution time of the function. If we're right, then the exact threshold doesn't matter much. If we're wrong, then we pay a pretty big cost -- we spend a bunch of time doing a compilation for no reason -- but the cost of slightly decreasing the threshold only applies to the very rare functions that almost reached the old threshold. For example, lowering the Ion threshold from 1000 to 990 can only cause us to waste time compiling functions if they run exactly 990 to 1000 times. That isn't likely to be very common.

Because the marginal effects of adjusting the threshold are small, we shouldn't expect to see big wins from finding an extremely precise value for compilation thresholds. It is quite plausible that our current thresholds fall within the fairly generous sweet spot.

I threw together a quick-and-dirty Python script to verify my mathematical intuition, and the results seem to agree with what I expected. The script generates a roughly exponentially distributed set of iteration counts. Then, given values for S (the speedup from compiling) and C (the cost of compilation), it compares two strategies:

  1. Oracle: calculate the break-even threshold -- that is to say, the number of iterations for which immediately compiling is exactly as cheap as never compiling. Compile functions if they will run that many iterations or more.
  2. Threshold: compile functions once they reach a given threshold. The script tests thresholds to find a good one for the specific values of S and C. Interestingly, the best threshold is remarkably consistent from run to run: for example, for S = 10 and C = 1000, over a couple dozen runs, the best threshold was always within [793, 800].

I report the best threshold, the difference between the best threshold and the oracle, and the range of thresholds that give a total execution time within 1% of the best threshold. For realistic inputs, the oracle's advantage is unimpressive. A couple of examples:

S = 2 (compiling code makes it 2 times faster)
C = 100 (compiling a function takes 100 times as long as executing it)
breakEven : 200.0 iterations
Oracle cost: 4545844.0
Best threshold: 40 iterations
Best threshold cost: 4558282 (1.002736 of oracle)
Thresholds in (31, 334) are within 1% of the best threshold.

S = 10 (compiling code makes it 10 times faster)
C = 1000 (compiling a function takes 1000 times as long as executing it)
breakEven : 1111.11111111 iterations
Oracle cost: 1181347.8
Best threshold: 800 iterations
Best threshold cost: 1249874 (1.058007 of oracle)
Thresholds in (763, 947) are within 1% of the best threshold.
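For reference, here is a minimal reconstruction of the kind of script described above. It is not the original script, and the distribution parameters are guesses, so the exact numbers it prints will differ from the ones above; it only follows the same stated assumptions (each lower-tier iteration costs 1, each compiled iteration costs 1/S, and compiling costs C).

import random

# Model assumptions: each lower-tier iteration costs 1, each compiled iteration
# costs 1/S, compiling costs C, and per-function iteration counts are roughly
# exponentially distributed (the mean below is an arbitrary choice).
def generate_workload(num_functions=10000, mean_iterations=300):
    return [int(random.expovariate(1.0 / mean_iterations)) + 1
            for _ in range(num_functions)]

def oracle_cost(iterations, S, C):
    # Break-even count n satisfies n == C + n/S, i.e. n == C*S/(S-1).
    break_even = C * S / (S - 1)
    return sum(C + n / S if n >= break_even else n for n in iterations)

def threshold_cost(iterations, S, C, T):
    # Run up to T iterations in the lower tier, then compile if still running.
    return sum(T + C + (n - T) / S if n >= T else n for n in iterations)

def best_threshold(iterations, S, C, max_threshold=5000):
    costs = {T: threshold_cost(iterations, S, C, T) for T in range(1, max_threshold)}
    best = min(costs, key=costs.get)
    within_1pct = [T for T, c in costs.items() if c <= 1.01 * costs[best]]
    return best, costs[best], (min(within_1pct), max(within_1pct))

if __name__ == "__main__":
    for S, C in [(2, 100), (10, 1000)]:
        workload = generate_workload()
        oracle = oracle_cost(workload, S, C)
        T, cost, sweet_spot = best_threshold(workload, S, C)
        print(f"S={S} C={C} breakEven={C * S / (S - 1):.1f} oracle={oracle:.1f}")
        print(f"  best threshold={T} cost={cost:.1f} ({cost / oracle:.6f} of oracle)")
        print(f"  thresholds in {sweet_spot} are within 1% of the best")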

This is obviously an extremely simple model, but I think the argument generalizes to more complicated circumstances.

One interesting detail: the "sweet spot" within 1% of the best threshold extends further up than down: that is, the marginal cost of lowering the threshold is higher than the marginal cost of raising it. I'm not sure if this has any practical implications, though.

tl;dr: Performance is more sensitive to tweaking heuristics that decide whether to do something (e.g. inlining) than to heuristics that decide when to do something (warmup thresholds).

Whiteboard: [sp3]