Open Bug 1404950 Opened 7 years ago Updated 2 years ago

new Promise(resolve => {}) is a performance footgun

Categories

(Core :: JavaScript Engine, enhancement, P3)

Tracking

Performance Impact low
Tracking Status
firefox57 --- wontfix

People

(Reporter: ehsan.akhgari, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf)

It's common for JS code to want to do this:

  var p = new Promise(resolve => { /* code */ });

Right now this is super expensive in SpiderMonkey, because we need to clone the lambda function every single time we create a new Promise. This shows up quite badly in many profiles, for example https://perfht.ml/2fE6qTL from bug 1400534. It forces people into crazy workarounds that avoid the idiomatic way of writing this code, such as https://reviewboard.mozilla.org/r/185408/diff/1 (from bug 1404652), which basically sidesteps the issue by storing the callback in a variable and passing that to all Promise constructors.
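For concreteness, here is a minimal sketch of the two patterns being contrasted (the names are illustrative, not taken from the actual patch):

  // Idiomatic: a fresh arrow function is cloned on every call.
  function makeDeferredIdiomatic() {
    let deferred = {};
    deferred.promise = new Promise((resolve, reject) => {
      deferred.resolve = resolve;
      deferred.reject = reject;
    });
    return deferred;
  }

  // Workaround: hoist the executor into a shared variable so it is
  // only created once. This relies on the executor running
  // synchronously inside the Promise constructor (which the spec
  // guarantees), but it is clearly more fragile and less readable.
  var currentDeferred;
  var sharedExecutor = (resolve, reject) => {
    currentDeferred.resolve = resolve;
    currentDeferred.reject = reject;
  };
  function makeDeferredHoisted() {
    currentDeferred = {};
    currentDeferred.promise = new Promise(sharedExecutor);
    return currentDeferred;
  }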
On this silly microbenchmark (perils noted):

  var count = 100000;
  var start = performance.now();
  for (var i = 0; i < count; ++i) {
    new Promise((r) => void(0));
  }
  var stop = performance.now();
  alert((stop - start)/count * 1e6);

I see ~450ns in Firefox, ~230ns in Safari, ~160ns in Chrome...
Flags: needinfo?(jdemooij)
Blocks: 1342037
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #1)
> On this silly microbenchmark (perils noted):
> I see ~450ns in Firefox, ~230ns in Safari, ~160ns in Chrome...

Is this with async stacks enabled or disabled? Until bug 1280819 is fixed, that'll make a huge difference.
Flags: needinfo?(bzbarsky)
> Is this with async stacks enabled or disabled?

Disabled. I always have them disabled, because otherwise performance measurements are sort of meaningless. :(

With async stacks enabled, the Firefox number is ~1050ns.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (still digging out from vacation mail) from comment #3)
> > Is this with async stacks enabled or disabled?
>
> Disabled. I always have them disabled, because otherwise performance
> measurements are sort of meaningless. :(
>
> With async stacks enabled, the Firefox number is ~1050ns.

OK, that's quite unfortunate. I think V8 at least inlines Promise constructors, so that's probably something we have to look at.
If we know that the escaping closure won't be mutated and won't be used for its identity (which we might be able to know in this case, Promise being built-in?), then the closure creation could be hoisted by the compiler, I guess. More generally we should have some kind of lambda lifting, of course.
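As a rough illustration of what lambda lifting means at the source level (hand-written here; the real transformation would happen inside the compiler, and bind() is only used to show the shape of it):

  // Before: the arrow function captures |x|, so a closure environment
  // must be materialized on every call.
  function before(x) {
    return new Promise(resolve => resolve(x));
  }

  // After lifting: the free variable |x| becomes an explicit parameter
  // of a top-level function. Note that bind() still allocates a bound
  // function object, so this hand-written version doesn't actually win
  // anything; a compiler doing this internally could avoid that.
  function liftedExecutor(x, resolve, reject) {
    resolve(x);
  }
  function after(x) {
    return new Promise(liftedExecutor.bind(null, x));
  }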
A few thoughts:

(1) We used to have a no-clone optimization for certain lambdas, but it's really hard to do for JS because of things like f.caller escaping the function. It's possible we could bring back a limited form of this, maybe just for strict mode.

(2) Boris' benchmark in comment 1 spends most of its time in the Promise constructor, not the lambda cloning, AFAICS. For the micro-benchmark below I get 370-400 ns in the shell (with --no-async-stacks). If I manually hoist the lambda it improves to 350-380 ns - a very small win.

(3) This runs in Ion. If the chrome JS code is really hot, we should double-check it's Ion-compiled, because lambda cloning will be relatively slower in Baseline.

  alert = print;
  performance = Date;
  function f() {
    var count = 100000;
    var start = performance.now();
    for (var i = 0; i < count; ++i) {
      new Promise(r => void 0);
    }
    var stop = performance.now();
    alert((stop - start)/count * 1e6);
  }
  f();
Self-hosting the Promise constructor would be nice, because these JS -> C++ -> JS calls are bad for perf and hard to optimize. A profile does show a bunch of overhead here that we can probably eliminate.
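To make the idea concrete, a very rough conceptual sketch of what a self-hosted constructor looks like. This is NOT the actual SpiderMonkey implementation; the two helpers are hypothetical stand-ins for engine-internal intrinsics (the second is named after the spec operation CreateResolvingFunctions), stubbed out here so the sketch runs:

  // Hypothetical stand-ins for engine-internal intrinsics.
  function CreatePromiseObject() {
    return { state: "pending", value: undefined };
  }
  function CreateResolvingFunctions(promise) {
    return {
      resolve(v) { if (promise.state === "pending") { promise.state = "fulfilled"; promise.value = v; } },
      reject(e)  { if (promise.state === "pending") { promise.state = "rejected";  promise.value = e; } },
    };
  }

  // The constructor body is itself written in (privileged) JS inside
  // the engine, so user JS -> constructor -> executor never has to
  // cross into C++ and back.
  function SelfHostedPromiseConstructor(executor) {
    var promise = CreatePromiseObject();
    var pair = CreateResolvingFunctions(promise);
    try {
      executor(pair.resolve, pair.reject);
    } catch (e) {
      pair.reject(e);
    }
    return promise;
  }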
(In reply to Jan de Mooij [:jandem] from comment #7)
> A profile does show a bunch of overhead here that we can probably eliminate.

In particular, 6-7% is under JS::dbg::onNewPromise. This is a public API that can now be made private (since Promise got moved into SpiderMonkey), and then we can easily fast-path this. I'll patch.
Depends on: 1405330
I'll spend some more time on this tomorrow. Allocating the PromiseObject and the resolving functions can hopefully be optimized quite a bit.
(In reply to Jan de Mooij [:jandem] from comment #6)
> (3) This runs in Ion. If the chrome JS code is really hot, we should double
> check it's Ion compiled, because lambda cloning will be relatively slower in
> Baseline.

AFAIK the chrome JS code in question in the profile in comment 0 is very hot, but it doesn't run in loops: it runs in response to HTTP traffic originating from the child process, observed as IPC traffic in the parent process. So basically it's the same function running many, many times, but not in a loop. Is that something that will get Ion-compiled?
(In reply to :Ehsan Akhgari (needinfo please, extremely long backlog) from comment #10)
> AFAIK the chrome JS code in question in the profile in comment 0 is very hot,
> but it doesn't run in loops: it runs in response to HTTP traffic originating
> from the child process, observed as IPC traffic in the parent process. So
> basically it's the same function running many, many times, but not in a
> loop. Is that something that will get Ion-compiled?

It should get Baseline-compiled after 10 calls, and Ion-compiled after 1000.
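(For anyone trying to reproduce this outside a loop: a simple way to make sure a callback has crossed those thresholds before measuring is to warm it up explicitly. The 10/1000 counts are the ones quoted above and may vary between builds; a minimal sketch:)

  function hot() {
    return new Promise(r => r());
  }
  // Warm up well past the quoted Ion threshold before timing anything.
  for (var i = 0; i < 1500; i++) {
    hot();
  }
  // ...only start measuring hot() after this point...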
I'm not sure this is a great testcase for the lambda clone overhead I've been seeing. I suspect we have optimizations to prevent cloning the lambda in the case where it doesn't actually capture anything. When I run a separate test with an actual lambda capture, I see significantly more overhead when passing an arrow function literal as opposed to a reference:

  "use strict";

  let iterations = 1024 * 1024;

  {
    let start = performance.now();
    for (let i = 0; i < iterations; i++) {
      let deferred = {};
      deferred.promise = new Promise((resolve, reject) => {
        deferred.resolve = resolve;
        deferred.reject = reject;
      });
    }
    let end = performance.now();
    let delta = (end - start) * 1000 * 1000;
    document.body.appendChild(document.createTextNode(`Per-iteration: ${Math.round(delta / iterations)}ns`));
  }

  document.body.appendChild(document.createElement("br"));

  {
    let start = performance.now();
    let deferred;
    let lambda = (resolve, reject) => {
      deferred.resolve = resolve;
      deferred.reject = reject;
    };
    for (let i = 0; i < iterations; i++) {
      deferred = {};
      deferred.promise = new Promise(lambda);
    }
    let end = performance.now();
    let delta = (end - start) * 1000 * 1000;
    document.body.appendChild(document.createTextNode(`Per-iteration: ${Math.round(delta / iterations)}ns`));
  }

Results:

  Per-iteration: 646ns
  Per-iteration: 507ns

In the patch Ehsan mentioned in comment 0, that was adding up to 5-10% of the total overhead in a pretty hot code path (and possibly more, since it seems to increase GC pressure, but I'm not sure how much). And, unfortunately, I'm not sure how much of that code path manages to run in Ion, since the non-syntactic scope stuff we turned on in bug 1381961 is known to break Ion in a lot of cases, and it's sometimes hard to tell which JIT is being used from profiles. I do know that at least *some* of the code runs in Ion, though.
At least when testing in the shell, the overhead in comment #12 for the per-iteration lambda seems to stem from using |let| instead of |var| for the |deferred| variable.
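(For illustration, a minimal sketch of the two variants being compared; |n| stands in for the iteration count, and this is not the exact test from comment 12:)

  // |let|: the loop body gets a fresh lexical binding for |deferred|
  // on each iteration, so each cloned arrow function presumably needs
  // its own environment on the scope chain.
  function withLet(n) {
    for (let i = 0; i < n; i++) {
      let deferred = {};
      deferred.promise = new Promise((resolve, reject) => {
        deferred.resolve = resolve;
        deferred.reject = reject;
      });
    }
  }

  // |var|: |deferred| is hoisted to function scope, so every clone can
  // share the same function-level environment.
  function withVar(n) {
    for (var i = 0; i < n; i++) {
      var deferred = {};
      deferred.promise = new Promise((resolve, reject) => {
        deferred.resolve = resolve;
        deferred.reject = reject;
      });
    }
  }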
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #12)
> I suspect we have optimizations to prevent cloning the lambda
> in the case where it doesn't actually capture anything.

If it does not capture anything, then what you probably noticed is the lack of scope chain creation to store the captured variables. We have optimizations, in IonMonkey, to avoid cloning the lambda if it does not escape. As the Promise constructor is not in JS, we actually have no way to infer that, and thus the lambdas are always considered escaping, which I guess is likely the default with promises anyway.
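(A minimal illustration of the distinction being made here:)

  // Non-capturing: the arrow function has no free variables, so no
  // scope-chain environment has to be allocated for it.
  new Promise(resolve => resolve(42));

  // Capturing: |value| is free inside the arrow function, so the
  // engine must materialize an environment to hold it.
  let value = 42;
  new Promise(resolve => resolve(value));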
(In reply to Nicolas B. Pierron [:nbp] from comment #14)
> If it does not capture anything, then what you probably noticed is the lack
> of scope chain creation to store the captured variables.

That seems closer to reality, assuming scope chain creation is the most expensive part. It would also explain why switching from `let` to `var` in the loop made things faster. A more exhaustive test:

  let iterations = 1024 * 1024;

  function run(callback) {
    let start = performance.now();
    for (let i = 0; i < iterations; i++) {
      callback();
    }
    let end = performance.now();
    let delta = (end - start) * 1000 * 1000;
    document.body.appendChild(document.createTextNode(`Per-iteration ${callback.name}(): ${Math.round(delta / iterations)}ns`));
    document.body.appendChild(document.createElement("br"));
  }

  let deferred;
  let lambda = (resolve, reject) => {
    deferred.resolve = resolve;
    deferred.reject = reject;
  };

  function newPromiseCloneNewScope() {
    let deferred = {};
    deferred.promise = new Promise((resolve, reject) => {
      deferred.resolve = resolve;
      deferred.reject = reject;
    });
    return deferred;
  }

  function newPromiseCloneSameScope() {
    deferred = {};
    deferred.promise = new Promise((resolve, reject) => {
      deferred.resolve = resolve;
      deferred.reject = reject;
    });
    return deferred;
  }

  function newPromiseNoClone() {
    deferred = {};
    deferred.promise = new Promise(lambda);
    return deferred;
  }

  run(newPromiseCloneNewScope);
  run(newPromiseCloneSameScope);
  run(newPromiseNoClone);

  {
    let start = performance.now();
    let deferred;
    for (let i = 0; i < iterations; i++) {
      deferred = {};
      deferred.promise = new Promise((resolve, reject) => {
        deferred.resolve = resolve;
        deferred.reject = reject;
      });
    }
    let end = performance.now();
    let delta = (end - start) * 1000 * 1000;
    document.body.appendChild(document.createTextNode(`Per-iteration for () CloneNoNewScope: ${Math.round(delta / iterations)}ns`));
    document.body.appendChild(document.createElement("br"));
  }

Results:

  Per-iteration newPromiseCloneNewScope(): 763ns
  Per-iteration newPromiseCloneSameScope(): 618ns
  Per-iteration newPromiseNoClone(): 513ns
  Per-iteration for () CloneNoNewScope: 534ns

But, unfortunately, this doesn't help my original use case very much, since every Promise constructor call is in a new function invocation, and the no-clone version is still by far the fastest.
Depends on: 1405999
Depends on: 1406870
Keywords: perf
Priority: -- → P3
Whiteboard: [qf:p3]
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #1)
> I see ~450ns in Firefox, ~230ns in Safari, ~160ns in Chrome...

Boris, this should be quite a bit faster with the latest Nightly, if you want to try it. Still a lot more to shave off, though.
Yes, an m-c build from today looks somewhat faster (now ~260ns in Firefox vs. ~170ns in Safari; I think the machine was running cooler than during my last measurements, so less thermal throttling...).
Cancelling needinfo for now. We've improved this a lot; there's still more to do, though.
Flags: needinfo?(jdemooij)
Performance Impact: --- → P3
Whiteboard: [qf:p3]
Severity: normal → S3