Optimize indirect calls by not pessimizing same-instance calls
Categories
(Core :: JavaScript: WebAssembly, enhancement, P2)
Tracking
()
People
(Reporter: dbezhetskov, Assigned: dbezhetskov)
References
(Blocks 1 open bug)
Details
(Keywords: perf-alert)
Attachments
(20 files, 9 obsolete files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
application/x-javascript
|
Details | |
Bug 1639153 - Part 1: Reserve two slots after stack arguments for the future tls preservation. r=lth
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
application/x-tar
|
Details |
Assignee | ||
Comment 1•5 years ago
|
||
Updated•5 years ago
|
Comment 2•5 years ago
|
||
I'm a bit confused about putting 2 extra words in each FunctionTableElem; I had been thinking the goal state is to get tables down to a single word. What's the benefit of performing the call_indirect signature check after pushing the frame? If the signature happens before pushing a frame, the trampoline can first check the signature, then push the frame, then call the tail entry (which would be after the signature check). Then there are only two entries per function.
Also nit: could we rename "tail entry" to "trampoline entry"? "Tail" keeps making me thing "tail call entry".
Assignee | ||
Comment 3•5 years ago
|
||
Hi, I understand your confusion.
Our final target is still to have two entries table and trampoline.
Final goal:
call_indirect -> call table_entry -> Trampoline stub -> target function
Trampoline stub:
check signature
check for instance equality
switch to callee instance
call the trampoline_entry
switch to caller instance
However, it's hard to achieve this in one commit and if I succeed to create such big commit (I tried to do this) it takes so much time because I don't know what could go wrong, so Andy and I decided to split this work into series of extremely short patches. It's maintained in the Steps to get there
section of the abi-2020 doc (https://docs.google.com/document/d/1oi6ROZJuE-hBb21Tgq-XcUegPktecF52gADzBJcYL34).
Long story short:
I will push inlined trampoline stub right in call_indirect and then adapt stack maps and frame iteration.
Next, I'm going to finally remove Frame::tls.
Then we move inlined trampoline into the stub. Here we don't need to care about runtime stuff because it's already done in the previous steps.
Because I'm going to inline trampoline stub on caller side we need to move signature check after the frame is pushed. Because obviously, we can't do the check on the caller side and this is why I moved it.
WDYT?
Assignee | ||
Comment 4•5 years ago
|
||
An additional question about the final design and the current signature check optimization:
AFAIU a thunk (aka trampoline aka decorator) is a fixed size stub with some connected data and it will be generated independently from the function.
So, we can't bake cmp with immediate into it, as we do know.
We can only load funcTypeId from memory and then do a check, right?
Comment 5•5 years ago
|
||
To add on -- we will eventually need to be able to perform signature checks with a frame already on the stack, in the tail-call case. So it's more consistent if the frame is always present when signature-check traps are raised.
Comment 6•5 years ago
|
||
To expand also on the number of function entry points -- right now only the function is in a position to check its signature. So we need a "call entry" (target of a call instruction) and a "tail entry" (target of a jump), and for each we need a checked and unchecked variant. Before Dmitry's work there are two entries (checked and unchecked call). Dmitry's current patch adds checked and unchecked tail entries.
In the future it would be possible to put the signature in the table associating code pointer, JSFunction*, and TlsData, and arrange to only check signatures in the trampoline -- if that's what we want to do. (After all, anyone that calls direct already knows that the callee has the right type.) That would allow us to reduce to only a call and a tail entry. Not sure if a good optimization or not, though.
WDYT Luke?
Comment 7•5 years ago
|
||
Apologies for the battery of comments -- a tail entry is indeed that: an entry for a tail call. Whether it's a tail call resulting from a WebAssembly tail-call proposal (something we're not working on) or the result of a jump from a trampoline is a detail -- the salient point is that the fixed part of the frame is already on the stack.
Comment 8•5 years ago
|
||
Haha another small apology -- Luke you may be wondering, why implement tail entries when we are not working on tail calls?
The answer is in the design doc:
As an optimization, when control enters an indirect entry trampoline, the trampoline can check if the incoming WasmTlsReg is actually same-instance, and in that case can jump to the type-checking tail entry. In that case there will be no trampoline frame and the callee will return directly to the caller.
So in that case, the trampoline can tail-call the callee, by jumping to the callee's tail entry. But because the trampoline can't currently check the callee's signature, we target the checked tail entry.
AFAIU the unchecked tail entry would only ever be invoked by a direct tail call from the tail calls proposal, so I guess that bit isn't strictly necessary.
Comment 9•5 years ago
|
||
Thanks for all the responses and info; understanding that the goal here is to break up a larger patch into smaller independent patches makes total sense. Would it make sense to hold off on landing the patches until the final state is better than the current state? I have a light concern that we get stuck for a while at an in-between state.
Also, "tail call entry" now makes sense; I like the symmetry between trampolines and tail calls :)
The 4 entry points also make sense. Instead of recording dynamic offsets for all of them, would it be possible (now or later) to have static offsets, such that from a pointer to any one entry point you can compute the others? The reason I ask is thinking ahead toward the function-references proposal when you have one function pointer (a funcref
or (ref (func ...))
) that can be called with a fixed or dynamic signature with a normal or tail call, so it seems like we'd want to have function references point to one canonical entry point and have all the other entry points be static offsets made at the callsite.
(In reply to Dmitry Bezhetskov from comment #4)
An additional question about the final design and the current signature check optimization:
AFAIU a thunk (aka trampoline aka decorator) is a fixed size stub with some connected data and it will be generated independently from the function.So, we can't bake cmp with immediate into it, as we do know.
We can only load funcTypeId from memory and then do a check, right?
Given a wasm function signature (and no other context), you can statically classify it as having either an immediate representation (where the signature is literally baked into the immediate, which means the immediate is invariant over executions and thus can be (de)serialized without patched) or requires an indirect representation. Thus, if the signature is classified as an immediate, then the trampoline can indeed bake in the cmp-with-immediate. (And if the signature is classified as indirect, then the trampoline can load it from the callee TlsData.)
Comment 10•5 years ago
|
||
Good morning! Good questions, Luke :) Point by point:
- About patch landing order. Completely agreed that we need to avoid shipping regressions. We operate a bit on instinct here as to what is a regression and what isn't; my impression was that in the planned change series at https://docs.google.com/document/d/1oi6ROZJuE-hBb21Tgq-XcUegPktecF52gADzBJcYL34/edit#heading=h.22ki0jodkf2h, no change would regress in a significant way.
I understand your concern to also be about leaving the tree in an awkward state for a long time, but I think Dmitry is on a roll at this point and when the inline trampoline code lands -- which is the next patch or so -- then we should start to see net performance improvements.
If that were all, I would be inclined to agree with you to hold off until a series can land, but on our side there has been the risk of making patches too big to review, or patches that don't do the right thing, or patches that simply need some style fixes. I think the pressure of getting something good enough to incrementally land has been a real positive thing for progress on this feature and so weighing everything, I would continue to land small steps toward the goal -- as long as we are all generally on board about the goal and the general plan for getting there.
-
Static offsets would be great! Right now I think that might not be possible because of whether the function has an immediate signature check or not. But yes, I think you are right we can get there.
-
Immediate signature checks: I think we were under the impression that in the final design, trampoline code would not be signature-dependent -- and so if you were to consider moving signature checks to trampolines, probably you would make all checks load the expected signature from the storage associated with the trampoline instead of having immediate comparisons. I was also unsure if immediate vs load-from-memory was important here -- the branch would surely predict perfectly, and it would "just" be another couple µops in the pipeline per indirect call. Dunno. WDYT here? As I said we have to operate a bit on instincts and ours could be miscalibrated.
Comment 11•5 years ago
|
||
Thanks for the explanations!
- Makes sense, sgtm
- I was thinking that differences of that nature could be covered up by nop'ing up to the desired static offset. But agreed that we don't need to do this now.
- Agreed that immediate vs load-from-memory is probably not a huge deal, but just in the interest of not giving it up unnecessarily: how would the trampoline be signature independent? My operating assumption has been that tramplines are statically specialized to the callee (so that they can use a direct call, and also bake in the callee's instance pointer as an immediate), and thus they could know the callee's expected signature as an immediate
Getting back to whether the signature check happens before or after pushing the frame: it makes sense that, for tail calls, the callee frame will have already been pushed, so this should be allowed. However, I think doing so for all call entries will cause extra prologue code to be generated for the two checked entries per function (since they must push their own frame and then jump over the frame-pushing of the unchecked entry). If the signature can be checked before pushing the frame, then the checked entries can check and then fall-through to the frame-pushing of the unchecked entries. IIRC from my measurements, b/c of the large number of small functions in Unity-type apps, prologue-generated machine code is a significant fraction of overall code size. Also icache locality would be somewhat improved. But I forget: are there any concrete problems with checking the signature before pushing the frame?
Comment 12•5 years ago
|
||
Are there any concrete problems with checking the signature before pushing the frame?
There are two nits and one serious problem.
- Nit: With tail calls, you have to deal with the signature check trap being raised either with or without the frame.
- Nit: If overhead is a problem and tail calls will come, then we should find a solution that allows tail calls without overhead.
- Problem: If a function is called from a trampoline, it needs to not set FP and instead use the FP set by the trampoline.
I think this last issue is the motivator to have checked call entries and checked tail entries. WDYT @luke ?
Assignee | ||
Comment 13•5 years ago
|
||
Assignee | ||
Comment 14•5 years ago
|
||
Luke, what is the plan for import calls? They use Frame::tls too, I suppose we are going to handle call_import in the same manner as call_indirect? It means that we are going to call thunk to switch instances.
WDYT?
Comment 15•5 years ago
|
||
Yeah, good question. I wasn't thinking about import calls b/c they are usually just simplified versions of indirect calls in that you statically know you're calling out of the instance. Thus, if it was easy, it would make sense to statically (at the callsite) do something instead of using a trampoline. That being said, I don't see an easy way to do that in this case so yeah, probably best just to call a trampoline just like indirect calls.
Assignee | ||
Comment 16•5 years ago
|
||
Assignee | ||
Comment 17•5 years ago
|
||
There are still a few remaining uses of masm.loadWasmTlsRegFromFrame():
- CodeGenerator::visitWasmCall
here we reload Tls because Ion assumes that WasmTlsReg is non-volatile.
After implementing trampoline and uses them for call_import and call_indirect we can safely remove this. - MacroAssembler::callWithABI
afaik we set tls here only for wasm BuiltinThunks. A BuiltinThunk uses WasmTlsReg only in the prologue/epilogue, so when we delete Frame::tls this will become obsolete and we can remove it then.
Comment 18•4 years ago
|
||
Comment 19•4 years ago
|
||
Backed out changeset 578160fb0ac3 (bug 1639153) for WasmFrameIter.cpp related bustage
Backout link: https://hg.mozilla.org/integration/autoland/rev/dac0f3345b5984106adcf22d20e2bceb4a8fb4e6
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305426059&repo=autoland&lineNumber=5524
task 2020-06-08T06:29:22.459Z] make[3]: Leaving directory '/builds/worker/workspace/build/src/obj-spider/js/src/irregexp'
[task 2020-06-08T06:29:22.459Z] /builds/worker/fetches/gcc/bin/g++ -std=gnu++17 -o Unified_cpp_js_src_frontend0.o -c -I/builds/worker/workspace/build/src/obj-spider/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 -DJS_CACHEIR_SPEW -DJS_STRUCTURED_SPEW -DEXPORT_JS_API -DMOZ_HAS_MOZGLUE -I/builds/worker/workspace/build/src/js/src/frontend -I/builds/worker/workspace/build/src/obj-spider/js/src/frontend -I/builds/worker/workspace/build/src/obj-spider/js/src -I/builds/worker/workspace/build/src/js/src -I/builds/worker/workspace/build/src/obj-spider/dist/include -I/builds/worker/workspace/build/src/obj-spider/dist/include/nspr -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-spider/js/src/js-confdefs.h -Wall -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wduplicated-cond -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -Wformat-overflow=2 -Wno-noexcept-type -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -freorder-blocks -O3 -fno-omit-frame-pointer -funwind-tables -Werror -fno-strict-aliasing -Werror=format -Wno-shadow -Wno-attributes -MD -MP -MF .deps/Unified_cpp_js_src_frontend0.o.pp Unified_cpp_js_src_frontend0.cpp
[task 2020-06-08T06:29:22.459Z] js/src/frontend/Unified_cpp_js_src_frontend1.o
[task 2020-06-08T06:29:23.263Z] /builds/worker/fetches/gcc/bin/g++ -std=gnu++17 -o Unified_cpp_js_src_jit10.o -c -I/builds/worker/workspace/build/src/obj-spider/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 -DJS_CACHEIR_SPEW -DJS_STRUCTURED_SPEW -DEXPORT_JS_API -DMOZ_HAS_MOZGLUE -I/builds/worker/workspace/build/src/js/src/jit -I/builds/worker/workspace/build/src/obj-spider/js/src/jit -I/builds/worker/workspace/build/src/obj-spider/js/src -I/builds/worker/workspace/build/src/js/src -I/builds/worker/workspace/build/src/obj-spider/dist/include -I/builds/worker/workspace/build/src/obj-spider/dist/include/nspr -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-spider/js/src/js-confdefs.h -Wall -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wduplicated-cond -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -Wformat-overflow=2 -Wno-noexcept-type -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -freorder-blocks -O3 -fno-omit-frame-pointer -funwind-tables -Werror -fno-strict-aliasing -Werror=format -Wno-shadow -Wno-attributes -MD -MP -MF .deps/Unified_cpp_js_src_jit10.o.pp Unified_cpp_js_src_jit10.cpp
[task 2020-06-08T06:29:23.264Z] js/src/jit/Unified_cpp_js_src_jit11.o
[task 2020-06-08T06:29:23.986Z] /builds/worker/fetches/gcc/bin/g++ -std=gnu++17 -o Unified_cpp_js_src_wasm1.o -c -I/builds/worker/workspace/build/src/obj-spider/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 -DJS_CACHEIR_SPEW -DJS_STRUCTURED_SPEW -DEXPORT_JS_API -DMOZ_HAS_MOZGLUE -I/builds/worker/workspace/build/src/js/src/wasm -I/builds/worker/workspace/build/src/obj-spider/js/src/wasm -I/builds/worker/workspace/build/src/obj-spider/js/src -I/builds/worker/workspace/build/src/js/src -I/builds/worker/workspace/build/src/obj-spider/dist/include -I/builds/worker/workspace/build/src/obj-spider/dist/include/nspr -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-spider/js/src/js-confdefs.h -Wall -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wduplicated-cond -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -Wformat-overflow=2 -Wno-noexcept-type -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -freorder-blocks -O3 -fno-omit-frame-pointer -funwind-tables -Werror -fno-strict-aliasing -Werror=format -Wno-shadow -Wno-attributes -MD -MP -MF .deps/Unified_cpp_js_src_wasm1.o.pp Unified_cpp_js_src_wasm1.cpp
[task 2020-06-08T06:29:23.986Z] js/src/wasm/Unified_cpp_js_src_wasm2.o
[task 2020-06-08T06:29:26.656Z] In file included from Unified_cpp_js_src_wasm1.cpp:11:0:
[task 2020-06-08T06:29:26.656Z] /builds/worker/workspace/build/src/js/src/wasm/WasmFrameIter.cpp: In function 'void js::wasm::GenerateFunctionPrologue(js::jit::MacroAssembler&, const js::wasm::FuncTypeIdDesc&, const mozilla::Maybe<unsigned int>&, js::wasm::FuncOffsets*)':
[task 2020-06-08T06:29:26.656Z] /builds/worker/workspace/build/src/js/src/wasm/WasmFrameIter.cpp:581:3: error: static assertion failed: code aligned
[task 2020-06-08T06:29:26.656Z] static_assert(CheckedTailEntryOffset % CodeAlignment == 0, "code aligned");
[task 2020-06-08T06:29:26.656Z] ^~~~~~~~~~~~~
[task 2020-06-08T06:29:28.430Z] Compiling regex v1.3.3
[task 2020-06-08T06:29:28.430Z] Running `CARGO=/builds/worker/fetches/rustc/bin/cargo CARGO_MANIFEST_DIR=/builds/worker/workspace/build/src/third_party/rust/regex CARGO_PKG_AUTHORS='The Rust Project Developers' CARGO_PKG_DESCRIPTION='An implementation of regular expressions for Rust. This implementation uses
[task 2020-06-08T06:29:28.430Z] finite automata and guarantees linear time matching on all inputs.
[task 2020-06-08T06:29:28.430Z] ' CARGO_PKG_HOMEPAGE='https://github.com/rust-lang/regex' CARGO_PKG_NAME=regex CARGO_PKG_REPOSITORY='https://github.com/rust-lang/regex' CARGO_PKG_VERSION=1.3.3 CARGO_PKG_VERSION_MAJOR=1 CARGO_PKG_VERSION_MINOR=3 CARGO_PKG_VERSION_PATCH=3 CARGO_PKG_VERSION_PRE= LD_LIBRARY_PATH='/builds/worker/workspace/build/src/obj-spider/release/deps:/builds/worker/fetches/rustc/lib:/builds/worker/workspace/build/src/obj-spider/dist/bin:/builds/worker/fetches/gcc/lib64' /builds/worker/fetches/rustc/bin/rustc --crate-name regex /builds/worker/workspace/build/src/third_party/rust/regex/src/lib.rs --error-format=json --json=diagnostic-rendered-ansi,artifacts --crate-type lib --emit=dep-info,metadata,link -C opt-level=2 --cfg 'feature="aho-corasick"' --cfg 'feature="default"' --cfg 'feature="memchr"' --cfg 'feature="perf"' --cfg 'feature="perf-cache"' --cfg 'feature="perf-dfa"' --cfg 'feature="perf-inline"' --cfg 'feature="perf-literal"' --cfg 'feature="std"' --cfg 'feature="thread_local"' --cfg 'feature="unicode"' --cfg 'feature="unicode-age"' --cfg 'feature="unicode-bool"' --cfg 'feature="unicode-case"' --cfg 'feature="unicode-gencat"' --cfg 'feature="unicode-perl"' --cfg 'feature="unicode-script"' --cfg 'feature="unicode-segment"' -C metadata=59cb2151c37e484b -C extra-filename=-59cb2151c37e484b --out-dir /builds/worker/workspace/build/src/obj-spider/release/deps -C linker=/builds/worker/workspace/build/src/build/cargo-linker -L dependency=/builds/worker/workspace/build/src/obj-spider/release/deps --extern aho_corasick=/builds/worker/workspace/build/src/obj-spider/release/deps/libaho_corasick-ae60b24f6684d433.rmeta --extern memchr=/builds/worker/workspace/build/src/obj-spider/release/deps/libmemchr-f1b26a4756e91620.rmeta --extern regex_syntax=/builds/worker/workspace/build/src/obj-spider/release/deps/libregex_syntax-a209697783b4651f.rmeta --extern thread_local=/builds/worker/workspace/build/src/obj-spider/release/deps/libthread_local-df70bf44d84bd469.rmeta --cap-lints warn`
[task 2020-06-08T06:29:30.442Z] /builds/worker/workspace/build/src/config/rules.mk:746: recipe for target 'Unified_cpp_js_src_wasm1.o' failed
[task 2020-06-08T06:29:30.442Z] make[3]: *** [Unified_cpp_js_src_wasm1.o] Error 1
[task 2020-06-08T06:29:30.443Z] make[3]: Leaving directory '/builds/worker/workspace/build/src/obj-spider/js/src/wasm'
[task 2020-06-08T06:29:30.443Z] /builds/worker/workspace/build/src/config/recurse.mk:74: recipe for target 'js/src/wasm/target-objects' failed
[task 2020-06-08T06:29:30.443Z] make[2]: *** [js/src/wasm/target-objects] Error 2
[task 2020-06-08T06:29:30.443Z] make[2]: *** Waiting for unfinished jobs....
[task 2020-06-08T06:29:30.443Z] /builds/worker/fetches/gcc/bin/g++ -std=gnu++17 -o Unified_cpp_js_src_frontend1.o -c -I/builds/worker/workspace/build/src/obj-spider/dist/system_wrappers -include /builds/worker/workspace/build/src/config/gcc_hidden.h -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -fstack-protector-strong -DNDEBUG=1 -DTRIMMED=1 -DJS_CACHEIR_SPEW -DJS_STRUCTURED_SPEW -DEXPORT_JS_API -DMOZ_HAS_MOZGLUE -I/builds/worker/workspace/build/src/js/src/frontend -I/builds/worker/workspace/build/src/obj-spider/js/src/frontend -I/builds/worker/workspace/build/src/obj-spider/js/src -I/builds/worker/workspace/build/src/js/src -I/builds/worker/workspace/build/src/obj-spider/dist/include -I/builds/worker/workspace/build/src/obj-spider/dist/include/nspr -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/build/src/obj-spider/js/src/js-confdefs.h -Wall -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wsign-compare -Wtype-limits -Wunreachable-code -Wwrite-strings -Wno-invalid-offsetof -Wduplicated-cond -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Wno-error=maybe-uninitialized -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=coverage-mismatch -Wno-error=free-nonheap-object -Wformat -Wformat-overflow=2 -Wno-noexcept-type -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -freorder-blocks -O3 -fno-omit-frame-pointer -funwind-tables -Werror -fno-strict-aliasing -Werror=format -Wno-shadow -Wno-attributes -MD -MP -MF .deps/Unified_cpp_js_src_frontend1.o.pp Unified_cpp_js_src_frontend1.cpp
[task 2020-06-08T06:29:30.443Z] js/src/frontend/Unified_cpp_js_src_frontend2.o
[task 2020-06-08T06:29:31.950Z] In file included from Unified_cpp_js_src_frontend0.cpp:29:0:
[task 2020-06-08T06:29:31.950Z] /builds/worker/workspace/build/src/js/src/frontend/BytecodeEmitter.cpp: In member function 'bool js::frontend::BytecodeEmitter::emitSetOrInitializeDestructuring(js::frontend::ParseNode*, js::frontend::DestructuringFlavor)':
[task 2020-06-08T06:29:31.950Z] /builds/worker/workspace/build/src/js/src/frontend/BytecodeEmitter.cpp:2653:48: warning: 'kind' may be used uninitialized in this function [-Wmaybe-uninitialized]
[task 2020-06-08T06:29:31.950Z] NameOpEmitter noe(this, name, loc, kind);
[task 2020-06-08T06:29:31.950Z] ^
[task 2020-06-08T06:29:33.100Z] /builds/worker/workspace/build/src/js/src/frontend/Parser.cpp: In member function 'typename ParseHandler::ClassNodeType js::frontend::GeneralParser<ParseHandler, Unit>::classDefinition(js::frontend::YieldHandling, js::frontend::GeneralParser<ParseHandler, Unit>::ClassContext, js::frontend::DefaultHandling) [with ParseHandler = js::frontend::FullParseHandler; Unit = char16_t]':
[task 2020-06-08T06:29:33.100Z] /builds/worker/workspace/build/src/js/src/frontend/Parser.cpp:7356:16: warning: 'innerName' may be used uninitialized in this function [-Wmaybe-uninitialized]
[task 2020-06-08T06:29:33.100Z] NameNodeType innerName;
...
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
Same issue as above, different line: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305432836&repo=autoland&lineNumber=5453
Comment 22•4 years ago
|
||
Comment 23•4 years ago
|
||
Comment 24•4 years ago
|
||
Backed out for build bustages.
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305595073&repo=autoland&lineNumber=4976
Backout: https://hg.mozilla.org/integration/autoland/rev/7c2f0754a3118ee806ae019bce830012c53c7a47
Comment 25•4 years ago
|
||
Assignee | ||
Comment 26•4 years ago
|
||
issues with signature moving are resolved and now the changes are landed.
The problem was in --disable-jit configuration, I messed up with MacroAssembler-none.h.
Comment 27•4 years ago
|
||
bugherder |
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 28•4 years ago
|
||
Updated•4 years ago
|
Comment 29•4 years ago
|
||
Comment 30•4 years ago
|
||
bugherder |
Comment 31•4 years ago
|
||
Comment 32•4 years ago
|
||
Backed outfor causing bustage at ion-error-ool.js.
Backout link: https://hg.mozilla.org/integration/autoland/rev/c36c8243afb8117b01288cd9f9956d34633d41fd
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=f82a44c95497c4821a758ba031078945a188bc21&selectedTaskRun=Z09HoE7UROS8aY2B6stcxQ.0
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306466971&repo=autoland&lineNumber=44620
Comment 33•4 years ago
|
||
Comment 34•4 years ago
|
||
Backed out changeset b620533c6d63 (Bug 1639153) for causing build bustages in WasmFrameIter.cpp CLOSED TREE
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306641550&repo=autoland&lineNumber=4845
Backout: https://hg.mozilla.org/integration/autoland/rev/f2a097374fad969ca31468e65b0839108baca5ff
Comment 35•4 years ago
|
||
Comment 37•4 years ago
|
||
bugherder |
Assignee | ||
Comment 38•4 years ago
|
||
Updated•4 years ago
|
Comment 39•4 years ago
|
||
(In reply to Pulsebot from comment #33)
Pushed by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b620533c6d63
Refactor wasm::Frame to reduce number of gnarly casts. r=lth,wingo
Hi, it seems this change caused build failure on mips platform, is it ok that I submit the fix patch to this bug? Or should I file a new bug? Thanks!
Assignee | ||
Comment 40•4 years ago
|
||
Hu Zhao, I think it is OK to submit patches here if the number of patches is small.
BTW, I wonder that we support mips platform at all.
Comment 41•4 years ago
|
||
Comment 42•4 years ago
|
||
(In reply to Dmitry Bezhetskov from comment #40)
BTW, I wonder that we support mips platform at all.
We do have MIPS support but it's a tier-3 platform, supported not by Mozilla but by external contributors (most of them at Longsoon). There are MIPS simulators in the tree in the same way as arm/arm64, and sometimes we try to keep the MIPS code working by running them, but usually we just change things as we want and wait for the MIPS support team to catch up.
Comment 43•4 years ago
|
||
Comment 44•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Comment 45•4 years ago
|
||
Microbenchmark for call_indirect. On my Xeon workstation, I get these numbers with yesterday's mozilla-central (JS shell, configure with --disable-debug --enable-optimize --enable-release, run with --wasm-compiler=ion):
external: 388.2451171875
internal: 333.218017578125
This suggests that in the code that has landed so far, indirect calls to module-internal functions are about 14% faster than indirect calls to imported functions.
Comment 46•4 years ago
|
||
Updated to add direct calls as a comparison. These come in at 187ms roughly, ie, at about 56% of the module-internal indirect calls.
Assignee | ||
Comment 47•4 years ago
|
||
For my amd machine it is:
branches/default/tip Fri Jun 26:
external: 256.566162109375
internal: 254.14599609375
direct: 186.510986328125
branches/default/tip Fri Jun 26 + additional trampoline frame:
external: 420.4189453125
internal: 279.97607421875
direct: 186.052978515625
Direct calls aren't affected.
Internal: 9% slowdown. It's because we add additional "if" there, before we just load/reload tls from the frame.
External: 64% slowdown. Pushing additional trampoline frame + if + additional call/return into/from trampoline.
If we didn't push Frame::tls we would save:
2 push for internals, one for the function's frame and one for the trampoline
3 push for external
Updated•4 years ago
|
Assignee | ||
Comment 48•4 years ago
|
||
Luke, if we had the same approach with trampolines for import calls it would that they will slower than now. As you say there isn't an easy way to make a workaround for import calls because we need to somehow signalize for frame iteration that instance is changed.
Actually, I can say that there will be the same slowdown as for cross-instance case for call_indirect. It is about 50% slowdown compared to the previous version where we just WasmTlsReg <- callee's tls
and call callee.code
.
The current plan is good, but it has also such consequences, lemme show trade-offs for current plan:
- very fast direct calls, actually no overhead at all
- a little bit more code, about 4 bytes per function for nops
- import calls and cross-instance call_indirect become slow. We add constant overhead for this, additional call/return + push trampoline's data
- trampoline doesn't need to copy stack args because we address it from FP.
I don't have a whole picture of wasm in SpiderMonkey, but I think you have. So, these tradeoffs are ok for us or not? I'm asking because saw the web IDL binding proposal and it looks like we are going to use import calls to call WebIDL.
There is another tradeoff we can do. We can fuse your original idea with the current plan.
Let me describe this a little bit more:
Your original idea was to add slots after stack arguments and then track tls via CallSiteDesc.
I've investigated this earlier and I can say that it is doable if the number of slots is equal 2. (I have abandoned PR for this).
One for calleeTls and one other one for callerTls. We need to preserve it only on possible cross-instance calls.
The current plan with addressing from FP and trampolines allows us don't copy arguments when we call a trampoline.
So, with this fusion we have:
- a little bit slowdown for all calls included direct calls because we always allocate two slots after stack args, even if there is no any.
- import calls don't need to call a trampoline because runtime already knows about this transition.
- the amount of memory by wasm function is increased by 2 slots for each call.
- trampolines already have slots for callerTls and calleeTls so it needs only to prepare a frame for the callee and jump into it.
- trampoline doesn't need to copy stack args because we address it from FP.
You may ask, what is actually slowdown for direct calls because they are very frequent in wasm demos.
I've added to slots for all calls and measured on the benchmark (provided by Lars Hansen). There we call the function with no stack arguments so it's the most frequent cause for zengarden and other benchmarks.
I've run this test many times and calculate the average. The test machine is my hp laptop with ubuntu 18.04 and inter core i7 on board.
Base revision is (5c9a4697057bd5ef8e15e6223e049674107ca433 Wed Jul 1 Bug 1147091 - Add StableSort to nsTArray_Impl. r=froydnj Simon Giesecke <sgiesecke@mozilla.com>). Flags: --disable-debug --enable-optimize --enable-release, run with --wasm-compiler=ion
with two additional slots:
external: 267.17
internal: 254.44
direct: 140.26
reference:
external: 269.44
internal: 260.42
direct: 145.84
It looks strange but this is what we have. I compared many times but it looks like there no overhead for all calls and even some improvements..
The amount of memory is increased, of course, it is about 2 slots for each call (depends on the alignment).
So, wdyt about these two cases?
First one is ok and constant overhead doesn't scare us
or we increase each call by two slots and have cheap import calls and cheap trampolines?
Updated•4 years ago
|
Assignee | ||
Comment 49•4 years ago
|
||
One possible disadvantage of the second approach is that for same-instance call_indirect we need to preserve tls because for runtime we can't distinguish between import call and call_indirect they both have kind == dynamic, but it's just two store instruction.
Comment 50•4 years ago
|
||
I think slowing down cross-instance indirect and import calls is not good, so anything we can do to mitigate that is worth thinking hard about.
I don't think "adding two extra slots" should cost anything in the no-stack-args case if we do things right. Ion creates the stack frame with enough space for all outgoing arguments on function entry, so there should be no reason to make any adjustment of the SP just prior to the call - the adjustment should be made on function entry when we create the local frame. I thought that there might be an added small cost in functions that make zero-stack-arg calls and that themselves do not currently need to allocate any stack memory, like int f(int x) { return g(x)+1 }
, but currently we're allocating a frame (8 bytes unused memory) for those already to ensure proper stack alignment in the callee, so there would be no regression there either.
I'm a little more worried about code size, frankly, because programs are becoming large quickly. In the case of my function f
above I see 14 bytes of NOPs on x64 currently:
00000002 55 push %rbp
00000003 48 8b ec mov %rsp, %rbp
00000006 0f 1f 00 nopl %eax, (%rax)
00000009 0f 1f 80 00 00 00 00 nopl %eax, (%rax)
00000010 41 83 fa 23 cmp $0x23, %r10d
00000014 0f 84 0c 00 00 00 jz 0x0000000000000026
0000001A 0f 0b ud2
0000001C 0f 1f 40 00 nopl %eax, (%rax)
00000020 41 56 push %r14
00000022 55 push %rbp
00000023 48 8b ec mov %rsp, %rbp
00000026 48 83 ec 08 sub $0x08, %rsp
...
Comment 51•4 years ago
|
||
Just to check: when you say that all calls would have 2 extra slots (for callerTLS/calleeTLS), are those slots always allocated, but only initialized in the cross-instance cases by the trampolines? If so, then I agree it should be practically free so that seems like a good idea.
Assignee | ||
Comment 52•4 years ago
|
||
(In reply to Luke Wagner [:luke] from comment #51)
Just to check: when you say that all calls would have 2 extra slots (for callerTLS/calleeTLS), are those slots always allocated, but only initialized in the cross-instance cases by the trampolines? If so, then I agree it should be practically free so that seems like a good idea.
We need to fill these slots for indirect calls. Imports calls will always store tls and calls via trampolines will fill them according to the trampoline logic.
Assignee | ||
Comment 53•4 years ago
|
||
We are going to remove Frame::tls and support trampolines for indirect calls, so we need to get rid of using Frame::tls.
In this and the followup patches I will iteratively remove all dependencies of Frame::tls and remove it eventually.
In this patch I changed wasm ABI to allocate two stack slots after stack args to preserve caller's and callee's tls'es in the near future.
Assignee | ||
Comment 54•4 years ago
|
||
This is a followup patch for removing Frame::tls.
Now, we are preserving caller's and callee's tls'es for all possible cross-instance calls in the previously allocated abi slots.
We will use them in the near future to untie c++ runtime from Frame::tls and to reload WasmTlsReg after a cross-instance call.
Comment 55•4 years ago
|
||
I'm afraid I'm about to go on vacation for a week, and probably at this point in time, Lars needs to be the better reviewer. I think he's also on vacation for I think the next two weeks, so I'm sorry about the delay.
Assignee | ||
Comment 56•4 years ago
|
||
Luke, thanks for the notice.
If you could review patches after the vacation it would be awesome. If you can't it is well, ok, /sigh/ for me to wait for Lars. In the last case I'll change you on Lars after hi returns from vacation.
BTW, patches are very simple and extremely short.
Assignee | ||
Comment 57•4 years ago
|
||
This is the third part of series of patches to Frame without tls pointer.
Here we preserve initial tls in all entry stubs and then use it to find a proper tls instance for a given frame.
To find the TlsData* for specific frame we start from a entry stub's tls
and then track tls through all possible cross-instance calls. This logic
is implemented in GetNearestEffectiveTls procedure.
Then, we use this new procedure to make singal handling free from Frame::tls.
Assignee | ||
Comment 58•4 years ago
|
||
Here we replace usage of Frame::tls in frame iteration with GetNearestEffectiveTls.
We also maintain current tls for frame iteration object to not to call GetNearestEffectiveTls everytime.
Assignee | ||
Comment 59•4 years ago
|
||
Here we remove remaining uses of Frame::tls. There are many places where
we use it: in debug frames, in profiling frames, in jit activation, etc.
All these places require short fixes to use our new scheme for getting
tls, so I gathered them together.
Assignee | ||
Comment 60•4 years ago
|
||
Let's see following code and let's assume that wasm-compiler is ion:
call foo
call bar
Ion can clobber tls inside foo and then just go with clobbered tls into bar.
It will crash if bar uses tls. At compile-time we don't know whether bar will use tls or not.
It works well when we restore tls each time when we return from a function because of the current frame structure.
But now, when we want to get rid of that Frame::tls we should guarantee that Ion doesn't clobber tls inside a function.
In fact we forces Ion to restore tls iff it clobbers it and it is a rare case.
Baseline doesn't need such logic because of its private stack slot for tls.
Assignee | ||
Comment 61•4 years ago
|
||
We remove Frame::tls and fix prologue/epilogue. Runtime, Ion and Baseline are ready for this move already.
Assignee | ||
Comment 62•4 years ago
|
||
When I did measurement I suddenly discovered that there is some small performance regression at patch 7.
My machine is not representative at all, but it is just a little bit mysterious.
At 6 part performance of direct calls: 185
At 7 part it is: 193
I haven't seen any major performance regression at patch 7.
I started investigation, compared the codes and discovered that SM debug (--enable-debug --disable-optimize) generates code which executes faster than SM release (--disable-debug --enable-optimize --enable-release).
At 7 part:
Debug: 183
Release: 195
Debug code:
0x2802daadd2b0 push %rbp
0x2802daadd2b1 mov %rsp,%rbp
0x2802daadd2b4 sub $0x20,%rsp
0x2802daadd2b8 cmp %rsp,0x30(%r14)
0x2802daadd2bc jb 0x2802daadd2c4
0x2802daadd2c2 ud2
0x2802daadd2c4 mov %edi,0x1c(%rsp)
0x2802daadd2c8 xor %eax,%eax
0x2802daadd2ca mov %eax,0x18(%rsp)
0x2802daadd2ce xchg %ax,%ax
0x2802daadd2d0 cmpl $0x0,0x38(%r14)
0x2802daadd2d5 je 0x2802daadd2dd
0x2802daadd2db ud2
0x2802daadd2dd mov 0x1c(%rsp),%eax
0x2802daadd2e1 test %eax,%eax
0x2802daadd2e3 je 0x2802daadd311
0x2802daadd2e9 sub $0x1,%eax
0x2802daadd2ec mov %eax,0x1c(%rsp)
0x2802daadd2f0 mov $0x7,%edi
0x2802daadd2f5 test $0xf,%spl
0x2802daadd2f9 je 0x2802daadd300
0x2802daadd2ff int3
0x2802daadd300 callq 0x2802daadd0a0
0x2802daadd305 mov %eax,%eax
0x2802daadd307 add 0x18(%rsp),%eax
0x2802daadd30b mov %eax,0x18(%rsp)
0x2802daadd30f jmp 0x2802daadd2ce
0x2802daadd311 mov 0x18(%rsp),%eax
0x2802daadd315 add $0x20,%rsp
0x2802daadd319 pop %rbp
0x2802daadd31a retq
Release code:
0x1256f09c4290 push %rbp
0x1256f09c4291 mov %rsp,%rbp
0x1256f09c4294 sub $0x20,%rsp
0x1256f09c4298 cmp %rsp,0x30(%r14)
0x1256f09c429c jb 0x1256f09c42a4
0x1256f09c42a2 ud2
0x1256f09c42a4 mov %edi,0x1c(%rsp)
0x1256f09c42a8 xor %eax,%eax
0x1256f09c42aa mov %eax,0x18(%rsp)
0x1256f09c42ae xchg %ax,%ax
0x1256f09c42b0 cmpl $0x0,0x38(%r14)
0x1256f09c42b5 je 0x1256f09c42bd
0x1256f09c42bb ud2
0x1256f09c42bd mov 0x1c(%rsp),%eax
0x1256f09c42c1 test %eax,%eax
0x1256f09c42c3 je 0x1256f09c42e6
0x1256f09c42c9 sub $0x1,%eax
0x1256f09c42cc mov %eax,0x1c(%rsp)
0x1256f09c42d0 mov $0x7,%edi
0x1256f09c42d5 callq 0x1256f09c40a0
0x1256f09c42da mov %eax,%eax
0x1256f09c42dc add 0x18(%rsp),%eax
0x1256f09c42e0 mov %eax,0x18(%rsp)
0x1256f09c42e4 jmp 0x1256f09c42ae
0x1256f09c42e6 mov 0x18(%rsp),%eax
0x1256f09c42ea add $0x20,%rsp
0x1256f09c42ee pop %rbp
0x1256f09c42ef retq
The only difference which I can see is:
test $0xf,%spl
je 0x2802daadd300
int3
Looks like inserting this dynamic check is speeding up the code.
Looks like an alignment problem.
I'll check the SM behavior on my laptop and compare results.
Assignee | ||
Comment 63•4 years ago
|
||
If I increase the loop body:
(func (export "run_direct") (param $count i32) (result i32)
(local $sum i32)
(block $END
(loop $LOOP
(br_if $END (i32.eqz (local.get $count)))
(local.set $count (i32.sub (local.get $count) (i32.const 1)))
(local.set $sum
(i32.add (call $internal (i32.const 7))
(local.get $sum)))
(local.set $sum
(i32.add (call $internal (i32.const 7))
(local.get $sum)))
(br $LOOP)))
(local.get $sum))
then difference between the reference and part7 PR is very small 254.25 vs 255 so I think the previous anomaly affects only very small loops.
Assignee | ||
Comment 64•4 years ago
|
||
I also compared reference and my stack of patches on https://github.com/lars-t-hansen/embenchen in release mode with --wasm-compiler=ion
The results (lower score is better):
The first column is score for the reference and the second column is for the my stack of patches.
Looks like everything is good!
box2d: 700 691
bullet: 895 903
conditionals: 213 214
copy: 463 465
corrections: 1356 1356
fannkuch: 2334 2257
fasta: 616 616
fib: 639 612
ifs: 197 190
binarytrees: 3235 3234
memops: 979 976
primes: 1451 1454
raybench: 1430 1444
rust-fannkuch: 2189 2192
skinning: 541 539
zlib: 987 988
I also compared external and internal calls with the previous patched benchmark:
Reference:
external: 257.699951171875
internal: 254.968017578125
direct: 258.489990234375
My stack of patches:
external: 258.5400390625
internal: 234.716064453125
direct: 256.7490234375
To sum up, I can conclude that direct call aren't affected by modulo the first test, internal calls became even faster, and external calls aren't affected, so removing Frame::tls is a good independent step, imho.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 65•4 years ago
|
||
Comment 66•4 years ago
|
||
Backed out for bustage on MacroAssembler.h
backout: https://hg.mozilla.org/integration/autoland/rev/eeed6471b7255c0d93d96ff2a202567256ac5a30
failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=312226941&repo=autoland&lineNumber=21791
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - In file included from /builds/worker/checkouts/gecko/js/src/jit/x86/SharedICRegisters-x86.h:10:
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - /builds/worker/checkouts/gecko/js/src/jit/MacroAssembler.h(4202,5): error: use of undeclared identifier 'increaseStackOffset'
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - increaseStackOffset(wasm::FrameWithTls::sizeWithoutFrame());
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - ^
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - 1 error generated.
[task 2020-08-06T06:29:23.793Z] 06:29:23 ERROR - make[4]: *** [/builds/worker/checkouts/gecko/config/rules.mk:748: Parser.obj] Error 1
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - make[4]: Leaving directory '/builds/worker/workspace/obj-build/js/src/frontend'
[task 2020-08-06T06:29:23.793Z] 06:29:23 ERROR - make[3]: *** [/builds/worker/checkouts/gecko/config/recurse.mk:72: js/src/frontend/target-objects] Error 2
[task 2020-08-06T06:29:23.793Z] 06:29:23 INFO - make[3]: *** Waiting for unfinished jobs....
Updated•4 years ago
|
Comment 67•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 68•4 years ago
|
||
bugherder |
Comment 69•4 years ago
|
||
Unverified but plausible regression: bug 1657741.
Comment 70•4 years ago
|
||
Backed out 3 changesets (bug 1639153) on lth's request
Backout link: https://hg.mozilla.org/mozilla-central/rev/24fc091bbbf6f2cf776618c1dc67f3d77cbb1feb
Updated•4 years ago
|
Assignee | ||
Comment 71•4 years ago
|
||
We have a tls dependency in callWithABI for wasm because we call builtins via BuiltinThunk.
Last one requires that tls to be loaded. In this patch I gradually extend callWithABI interface to
pass offset to tls. This allows us to preserve tls and load it by offset in callWithABI and frees us from
the Frame::tls dependency.
Assignee | ||
Comment 72•4 years ago
|
||
x86 has few register so to do div/mod for i64 it call the runtime
and clobber almost all gp registers including WasmTlsReg.
To be able to call c++ runtime via Builtin thunk we need to set up WasmTlsReg.
In this patch I create dependencies from MIR level to Codegen to be sure that WasmTlsReg is alive
when we call runtime.
Assignee | ||
Comment 73•4 years ago
|
||
To be able to call c++ runtime via Builtin thunk we need to set up WasmTlsReg.
In this patch I create dependencies from MIR level to Codegen to be sure that WasmTlsReg is alive
when we call runtime in div/mod i64 for arm.
Assignee | ||
Comment 74•4 years ago
|
||
In this patch we add a tls dependency for the remaining nodes which use
BuiltinThunk to call c++ runtime. By ABI requirements WasmTlsReg should
be set.
Assignee | ||
Comment 75•4 years ago
|
||
We generate builtin call for Mod operation for Double types, so we need
to add a tls dependency. In this patch I've added it.
Assignee | ||
Comment 76•4 years ago
|
||
We generate builtin call for MTruncateToInt32 operation for floating points types,
so we need to add a tls dependency.
I inserted NYI for arm64 because Ion doesn't support arm64.
Comment 77•4 years ago
|
||
Comment 78•4 years ago
|
||
Comment 79•4 years ago
|
||
bugherder |
Comment 80•4 years ago
|
||
Comment 81•4 years ago
|
||
Backed outfor bustages at Lowering.cpp.
Backout link: https://hg.mozilla.org/integration/autoland/rev/f0551b411d485a75e1f60ac837768d9cf59ef91e
Failure log 1: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=315285369&repo=autoland&lineNumber=3754
Failure log 2: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=315285384&repo=autoland&lineNumber=28767
Comment 82•4 years ago
|
||
Comment 83•4 years ago
|
||
Backed out 5 changesets (bug 1639153) for bustages complaining about Lowering.cpp.
Backout link: https://hg.mozilla.org/integration/autoland/rev/3010f73c9eec545c42715659f3593516ef686f70
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=315409723&repo=autoland&lineNumber=33796
[task 2020-09-11T05:59:11.889Z] 05:59:11 INFO - make[4]: Entering directory '/builds/worker/workspace/obj-build/js/src/jit'
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - /builds/worker/fetches/sccache/sccache /builds/worker/fetches/clang/bin/clang++ -std=gnu++17 -o Unified_cpp_js_src_jit8.o -c -I/builds/worker/workspace/obj-build/dist/system_wrappers -include /builds/worker/checkouts/gecko/config/gcc_hidden.h -U_FORTIFY_SOURCE -DNDEBUG=1 -DTRIMMED=1 -DWASM_SUPPORTS_HUGE_MEMORY -DJS_CACHEIR_SPEW -DJS_STRUCTURED_SPEW -DJS_HAS_CTYPES -DFFI_BUILDING -DEXPORT_JS_API -DMOZ_HAS_MOZGLUE -I/builds/worker/checkouts/gecko/js/src/jit -I/builds/worker/workspace/obj-build/js/src/jit -I/builds/worker/workspace/obj-build/js/src -I/builds/worker/checkouts/gecko/js/src -I/builds/worker/workspace/obj-build/dist/include -I/builds/worker/workspace/obj-build/dist/include/nspr -fPIC -DMOZILLA_CLIENT -include /builds/worker/workspace/obj-build/js/src/js-confdefs.h -Qunused-arguments -Qunused-arguments -Wall -Wbitfield-enum-conversion -Wempty-body -Wignored-qualifiers -Woverloaded-virtual -Wpointer-arith -Wshadow-field-in-constructor-modified -Wsign-compare -Wtype-limits -Wunreachable-code -Wunreachable-code-return -Wwrite-strings -Wno-invalid-offsetof -Wclass-varargs -Wempty-init-stmt -Wfloat-overflow-conversion -Wfloat-zero-conversion -Wloop-analysis -Wc++2a-compat -Wcomma -Wimplicit-fallthrough -Wunused-function -Wunused-variable -Werror=non-literal-null-conversion -Wstring-conversion -Wtautological-overlap-compare -Wtautological-unsigned-enum-zero-compare -Wtautological-unsigned-zero-compare -Wno-error=tautological-type-limit-compare -Wno-inline-new-delete -Wno-error=deprecated-declarations -Wno-error=array-bounds -Wno-error=backend-plugin -Wno-error=return-std-move -Wno-error=atomic-alignment -Wno-error=deprecated-copy -Wformat -Wformat-security -Wno-gnu-zero-variadic-macro-arguments -Werror=implicit-function-declaration -Wno-noexcept-type -Wno-psabi -Wno-unknown-warning-option -D_GLIBCXX_USE_CXX11_ABI=0 -fno-sized-deallocation -fno-aligned-new -fsanitize=bool,bounds,enum,integer-divide-by-zero,object-size,pointer-overflow,return,vla-bound -fno-sanitize-recover=bool,bounds,enum,integer-divide-by-zero,object-size,pointer-overflow,return,vla-bound -fsanitize-blacklist=/builds/worker/workspace/obj-build/js/src/ubsan_blacklist.txt -fsanitize=address -fcrash-diagnostics-dir=/builds/worker/artifacts -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -pipe -g -Xclang -load -Xclang /builds/worker/workspace/obj-build/build/clang-plugin/libclang-plugin.so -Xclang -add-plugin -Xclang moz-check -O2 -gline-tables-only -fno-omit-frame-pointer -funwind-tables -Werror -fno-strict-aliasing -Werror=format -Wno-shadow -MD -MP -MF .deps/Unified_cpp_js_src_jit8.o.pp Unified_cpp_js_src_jit8.cpp
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - In file included from Unified_cpp_js_src_jit8.cpp:2:
[task 2020-09-11T05:59:11.896Z] 05:59:11 ERROR - /builds/worker/checkouts/gecko/js/src/jit/Lowering.cpp:2362:16: error: unused variable 'opd' [-Werror,-Wunused-variable]
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - MDefinition* opd = truncate->input();
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - ^
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - 1 error generated.
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - /builds/worker/checkouts/gecko/config/rules.mk:723: recipe for target 'Unified_cpp_js_src_jit8.o' failed
[task 2020-09-11T05:59:11.896Z] 05:59:11 ERROR - make[4]: *** [Unified_cpp_js_src_jit8.o] Error 1
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - make[4]: Leaving directory '/builds/worker/workspace/obj-build/js/src/jit'
[task 2020-09-11T05:59:11.896Z] 05:59:11 INFO - make[4]: *** Waiting for unfinished jobs....
Comment 84•4 years ago
|
||
Comment 85•4 years ago
|
||
Comment 86•4 years ago
|
||
Comment 87•4 years ago
|
||
Comment 88•4 years ago
|
||
The backout from comment 87 is because the stack from comment 85 wasn't supposed to be landed .
Assignee | ||
Comment 89•4 years ago
|
||
@abutkovits , let's land this one https://phabricator.services.mozilla.com/D89243
The naming might be confusing but 6.5 is an internal numbering.
I want to land this 6.5 separately because actually it is independent patch.
Comment 90•4 years ago
|
||
Comment 91•4 years ago
|
||
The revision was landed: https://hg.mozilla.org/integration/autoland/rev/174a622bf334e32f911e97ca76c36b0a80d51f94
Comment 92•4 years ago
|
||
Comment 93•4 years ago
|
||
bugherder |
Comment 94•4 years ago
|
||
bugherder |
Comment 95•4 years ago
|
||
Comment 96•4 years ago
|
||
Updated•4 years ago
|
Comment 97•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Comment 98•4 years ago
|
||
Assignee | ||
Comment 99•4 years ago
|
||
Create the proxy stub lazily for calls which go through tables.
This stub will preserve caller tls, load callee's pinned register state and etc.
In the followup patches we will fill them with the that logic, but right now let's concentrate on the runtime part.
We bake callee and tls into the trampoline stub to be able to do the switch in the near future.
Comment 100•4 years ago
|
||
Backed out from beta as well, as requested by lth:
https://hg.mozilla.org/releases/mozilla-beta/rev/a9d230c1649f
Comment 101•4 years ago
|
||
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Comment 102•4 years ago
|
||
Comment 103•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/0e268b8d01fe
https://hg.mozilla.org/mozilla-central/rev/f0f44cfd6a07
https://hg.mozilla.org/mozilla-central/rev/c5ef1152da2b
https://hg.mozilla.org/mozilla-central/rev/bba9e8259714
https://hg.mozilla.org/mozilla-central/rev/e2105a52bcdd
https://hg.mozilla.org/mozilla-central/rev/81e645d5ccfd
https://hg.mozilla.org/mozilla-central/rev/bd8381646b47
Assignee | ||
Comment 104•4 years ago
|
||
Hi, could you please land the stack of patches together:
[https://phabricator.services.mozilla.com/D91081, https://phabricator.services.mozilla.com/D82881]
There are patches Bug 1639153 Part 1 - Part 5, then Part 7 and finally this one https://phabricator.services.mozilla.com/D91081.
Comment 105•4 years ago
|
||
Hi Dmitry, the stack of patches cannot be landed due to the following requirement "Revision is missing a Testing Policy Project Tag." .
Please check https://phabricator.services.mozilla.com/D83064#2965334 and add the Check-in Needed flag after the requirement is fulfilled.
Comment 107•4 years ago
|
||
Patch failed to land with :
"Details: We're sorry, Lando could not rebase your commits for you automatically. Please manually rebase your commits and try again.
hg error in cmd: hg import --no-commit -s 95 /tmp/tmpdrpaobho: applying /tmp/tmpdrpaobho
patching file js/src/wasm/WasmInstance.cpp
Hunk #1 FAILED at 1745
1 out of 1 hunks FAILED -- saving rejects to file js/src/wasm/WasmInstance.cpp.rej
abort: patch failed to apply"
Assignee | ||
Comment 108•4 years ago
|
||
This patch makes use of the new "Baldrdash2020" ABI support in Cranelift
to support the "ABI 2020" refactor in the Wasm compiler.
Comment 109•4 years ago
|
||
Dmitry, could you please add the missing testing-tags for the patches in this lando stack so that they can be landed?
https://lando.services.mozilla.com/D93190/
Assignee | ||
Comment 110•4 years ago
|
||
Hi :malexandru, I've added the missing test tags.
Comment 111•4 years ago
|
||
Comment 112•4 years ago
|
||
Backed out for spidermonkey bustage
backout: https://hg.mozilla.org/integration/autoland/rev/5b5ffbe4add97c98003026e0d9a730302161b646
failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=319118736&repo=autoland&lineNumber=88070
[task 2020-10-20T11:26:36.939Z] /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/host/bin/dump_syms /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/js/src/build/libmozjs-84a1.so
[task 2020-10-20T11:26:36.939Z] Unexpected error: 'ascii' codec can't decode byte 0x90 in position 3510: ordinal not in range(128)
[task 2020-10-20T11:26:36.941Z] Traceback (most recent call last):
[task 2020-10-20T11:26:36.941Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py", line 1086, in <module>
[task 2020-10-20T11:26:36.941Z] main()
[task 2020-10-20T11:26:36.941Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py", line 1081, in main
[task 2020-10-20T11:26:36.941Z] dumper.Process(args[2], options.count_ctors)
[task 2020-10-20T11:26:36.941Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py", line 523, in Process
[task 2020-10-20T11:26:36.941Z] self.ProcessFile(file_to_process, count_ctors=count_ctors)
[task 2020-10-20T11:26:36.941Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py", line 538, in ProcessFile
[task 2020-10-20T11:26:36.941Z] file, arch_num, arch, vcs_root, dsymbundle, count_ctors=count_ctors
[task 2020-10-20T11:26:36.941Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py", line 590, in ProcessFileWork
[task 2020-10-20T11:26:36.941Z] for line in proc.stdout:
[task 2020-10-20T11:26:36.941Z] File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
[task 2020-10-20T11:26:36.941Z] return codecs.ascii_decode(input, self.errors)[0]
[task 2020-10-20T11:26:36.941Z] UnicodeDecodeError: 'ascii' codec can't decode byte 0x90 in position 3510: ordinal not in range(128)
[task 2020-10-20T11:26:36.960Z] Traceback (most recent call last):
[task 2020-10-20T11:26:36.960Z] File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
[task 2020-10-20T11:26:36.960Z] "main", mod_spec)
[task 2020-10-20T11:26:36.960Z] File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
[task 2020-10-20T11:26:36.960Z] exec(code, run_globals)
[task 2020-10-20T11:26:36.960Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/python/mozbuild/mozbuild/action/dumpsymbols.py", line 106, in <module>
[task 2020-10-20T11:26:36.960Z] sys.exit(main(sys.argv[1:]))
[task 2020-10-20T11:26:36.960Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/python/mozbuild/mozbuild/action/dumpsymbols.py", line 102, in main
[task 2020-10-20T11:26:36.960Z] args.count_ctors)
[task 2020-10-20T11:26:36.960Z] File "/builds/worker/workspace/sm-package/mozjs-84.0a1.0/python/mozbuild/mozbuild/action/dumpsymbols.py", line 83, in dump_symbols
[task 2020-10-20T11:26:36.960Z] out_files = subprocess.check_output(args, universal_newlines=True)
[task 2020-10-20T11:26:36.960Z] File "/usr/lib/python3.6/subprocess.py", line 336, in check_output
[task 2020-10-20T11:26:36.960Z] **kwargs).stdout
[task 2020-10-20T11:26:36.960Z] File "/usr/lib/python3.6/subprocess.py", line 418, in run
[task 2020-10-20T11:26:36.960Z] output=stdout, stderr=stderr)
[task 2020-10-20T11:26:36.960Z] subprocess.CalledProcessError: Command '['/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/_virtualenvs/init_py3/bin/python', '/builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py', '-c', '--vcs-info', '--install-manifest=/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/_build_manifests/install/dist_include,/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/include', '-s', '/builds/worker/workspace/sm-package/mozjs-84.0a1.0', '/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/host/bin/dump_syms', '/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/crashreporter-symbols', '/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/js/src/build/libmozjs-84a1.so']' returned non-zero exit status 1.
[task 2020-10-20T11:26:36.960Z] Running: /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/_virtualenvs/init_py3/bin/python /builds/worker/workspace/sm-package/mozjs-84.0a1.0/toolkit/crashreporter/tools/symbolstore.py -c --vcs-info --install-manifest=/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/_build_manifests/install/dist_include,/builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/include -s /builds/worker/workspace/sm-package/mozjs-84.0a1.0 /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/host/bin/dump_syms /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/dist/crashreporter-symbols /builds/worker/workspace/sm-package/mozjs-84.0a1.0/obj-spider/js/src/build/libmozjs-84a1.so
Comment 113•4 years ago
|
||
This is kind of an odd failure.
It appears that running obj-spider/dist/host/bin/dump_syms .../obj-spider/js/src/build/libmozjs-84a1.so
is outputting a non-ASCII character (0x90), and python is choking on it. It's Python 3.6, and it's using univeral_newlines=True
, so I would have expected this to be decoding to utf8, not ASCII, but I'm perpetually confused by this encoding stuff. Perhaps the system encoding is wrong for some reason.
I don't know if this means there's an actual symbol with a weird name, or if there was an error message with a non-ASCII character in it, or what.
I'm going to see if I can diagnose it with an interactive shell job.
Comment 114•4 years ago
|
||
This seems to be some kind of problem with the crashreporter?
Specifically: I ran the above command on an interactive loaner and saved stdout and stderr to files. The stdout file looked mostly ok, but had one line with a mangled filename:
FILE 1761 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/convert/mod.rs
FILE 1762 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/convert/num.rs
FILE 1763 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/ffi.rs
FILE 1764 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/fmt/builders.rs
FILE 1765 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/fmt/k^B@^A^E^N^C^U<90>^B^G
FILE 1766 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/fmt/mod.rs
FILE 1767 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/fmt/num.rs
FILE 1768 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/hash/mod.rs
FILE 1769 /rustc/4fb7144ed159f94491249e86d5bbd033b5d60550/src/libcore/hash/sip.rs
That's cut & paste from the interactive terminal web page. The actual bytes can be inferred, but the only one that is higher than 0x7f is the 0x90.
That directory contains 4 files. The above entry should say float.rs
. The filesystem is fine; the name is only mangled in the dump_syms
output.
So I'm guessing the problem has nothing to do with this bug; it probably just changed the directory listing enough to produce the particular invalid character here rather than other invalid characters.
I'm not sure who to needinfo here, and have to run right now, so I'll post what I know.
Assignee | ||
Comment 115•4 years ago
|
||
I think we can try to land the stack again because 1673103 is fixed.
Comment 116•4 years ago
|
||
Comment 117•4 years ago
|
||
Backed out 7 changesets (bug 1639153) for Spidermonkey failures in wasm/simd/cvt-x64-ion-codegen.j. CLOSED TREE
Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=320144587&repo=autoland&lineNumber=26350
Push with failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=42728e99fbc55663b2e573c822e1a0b529c325ae
Backout:
https://hg.mozilla.org/integration/autoland/rev/1c593eaefecfd789d3b3f9cc5be3f5d65441b417
Assignee | ||
Comment 118•4 years ago
|
||
The issue was fixed, let's land the stack.
Comment 119•4 years ago
|
||
Comment 120•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/b24505537d70
https://hg.mozilla.org/mozilla-central/rev/50772b4ff6aa
https://hg.mozilla.org/mozilla-central/rev/3cfa16c9f9a9
https://hg.mozilla.org/mozilla-central/rev/5dfb9fc15fba
https://hg.mozilla.org/mozilla-central/rev/5737845e8af2
https://hg.mozilla.org/mozilla-central/rev/107ac71d3eb1
https://hg.mozilla.org/mozilla-central/rev/49de28d6a75a
Comment 121•4 years ago
|
||
Comment 122•4 years ago
|
||
The last 7 changesets got backed out for crashes reported by users:
[@ WasmTrapHandler ] from 14 installations when I checked
https://crash-stats.mozilla.org/report/index/f0eb42d8-158a-426a-9034-f57830201031
Please also check if these are related (2 installations, all startup crashes):
[@ core::ops::function::Fn::call<T> | std::sys_common::backtrace::__rust_end_short_backtrace<T> ]
crash stack 1: https://crash-stats.mozilla.org/report/index/655326bc-3a73-4588-adad-d1db40201031
crash stack 2: https://crash-stats.mozilla.org/report/index/85a6bb2b-2f8b-4de0-914d-821eb0201031
Comment 123•4 years ago
|
||
Comment 124•4 years ago
|
||
Assignee | ||
Comment 125•4 years ago
|
||
[@ WasmTrapHandler ] from 14 installations when I checked
Hm, the main crash (https://crash-stats.mozilla.org/report/index/f0eb42d8-158a-426a-9034-f57830201031) isn't in my code, and the corresponding stack doesn't point to me, are you sure that my changes trigger the crash?
How can I reproduce it?
Please also check if these are related (2 installations, all startup crashes):
They are rust crashes on windows x64 becausenot implemented: No FastCall variant of Baldrdash2020
iiuc we don't support baseline+cranelift configuration on win x64.
Comment 126•4 years ago
|
||
(In reply to Dmitry Bezhetskov from comment #125)
[@ WasmTrapHandler ] from 14 installations when I checked
Hm, the main crash (https://crash-stats.mozilla.org/report/index/f0eb42d8-158a-426a-9034-f57830201031) isn't in my code, and the corresponding stack doesn't point to me, are you sure that my changes trigger the crash?
How can I reproduce it?
Example urls to reproduce the issue (according to crash reports):
- https://juraj-husek-game-studio.itch.io/last-mortem
- https://mega.nz/ - and various sub pages (unknown if you need to be logged in)
Assignee | ||
Comment 127•4 years ago
|
||
I think I catch this one: https://crash-stats.mozilla.org/report/index/f0eb42d8-158a-426a-9034-f57830201031, at least I can reproduce a crash with my changes. I use a different workflow from this bug (https://bugzilla.mozilla.org/show_bug.cgi?id=1657917) but hope that the underlying problem is the same.
I'm investigating ...
Assignee | ||
Comment 128•4 years ago
|
||
Finally fixed the issue.
The problem was in MacroAssembler::wasmCallIndirect
when callee.which() == wasm::CalleeDesc::AsmJSTable
. I made a typo when worked on part 2. I preserved WasmTlsReg twice on WasmCallerTLSOffsetBeforeCall but I needed to preserve it on WasmCaller* and on WasmCallee*.
Because of this
GetNearestEffectiveTls(frame)
return some garbage value and then
Instance* instance = GetNearestEffectiveTls(frame)->instance;
led to SIGSEGV.
Comment 129•4 years ago
|
||
Comment 130•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/8e95ddbb6b20
https://hg.mozilla.org/mozilla-central/rev/a5cd8e8ce4b5
https://hg.mozilla.org/mozilla-central/rev/aeacd48f565e
https://hg.mozilla.org/mozilla-central/rev/cee7eaa8c402
https://hg.mozilla.org/mozilla-central/rev/40031ad88dbd
https://hg.mozilla.org/mozilla-central/rev/5718a8d121f9
https://hg.mozilla.org/mozilla-central/rev/7017395d8df9
Assignee | ||
Comment 131•4 years ago
|
||
Fill the stub logic: preserve tls's, switch pinned register state
for callee, call callee and then restore the initial state.
Since trampoline stub produces additional interstitial frame we tag
callerFP to indicate such frames inside frame iterator.
Stack maps are also adjusted not to count non-gc slots.
Updated•4 years ago
|
Assignee | ||
Comment 132•4 years ago
|
||
Add the dynamic check and the fastpath for call_indirect
when target instance and source instance are actually the same.
Assignee | ||
Comment 133•4 years ago
|
||
When we know that instances are equal at instantiate time we replace
pointer to the trampoline with pointer to the checked call entry.
Assignee | ||
Comment 134•4 years ago
|
||
I'm just dumping some interstitial results for our 2/3 of this long optimization.
Recently I've uploaded a few patches that have demonstrated some performance results that I want to share.
To measure the performance I used my ubuntu intel core i7 laptop with 16gb of RAM.
Also, I've used calls_bench (posted above) for tests.
Master:
external 348
internal 294
import 225
direct 140
D90992: add a simple proxy trampoline patch that actually only jumps into the real code:
external 369
internal 328
import 231
direct 142
Well, one jump introduced its cost.
D102591: implement trampoline basic logic - switch tls back and forth:
external 357
internal 391
import 231
direct 142
Trampoline logic has its cost and so internal calls get significant slowdown.
D103573: let's add a dynamic check if call_indirect actually calls a function from our instance
external 397
internal 259
import 227
direct 142
Internal calls now actually become faster, but we can do better.
D103574: at instantiating time we can check if instances are equal and replace the trampoline pointer with a pointer to the function call checked entry.
external 405
internal 209
import 226
direct 141
In the end, we get a 16% slowdown for external calls and a 29% speedup for internal calls without any significant effect for import and direct calls.
I don't know how it affects wild-industrial use cases but the numbers are looked promising.
The last part of the optimization will be changing the representation of WebAssembly.Table.Elem that we can use for speeding up table.get/set inside wasm.
Comment 135•4 years ago
|
||
14 April integration test of current patch set:
- Applying the current four-patch set to this morning's mozilla-central went smoothly (one small formatting conflict).
- x64-release-debug builds completed OK.
- x64-release-debug jit-test runs on wasm/ and asm.js/ (without --tbpl) had 91 failures: assertions and crashes around stack traversal, reference types, and exception handling.
- x86-release-debug jit-tests ditto had 80 failures, same categories
- arm64-release-debug builds require --disable-cranelift for the time being, this is OK.
- arm64-release-debug jit-test runs on wasm/ and asm.js/ (without --tbpl) had 130 failures: assertions and crashes as above but also some UNIMPLEMENTED simulator crashes and a crash warning about an unaligned memory access (somebody forgot to untag a pointer?)
- arm-release-debug tests ditto had 114 failures, about as for arm64, including simulator crashes
It's fairly high priority to make the patches that are up for review pass at least the integration tests, so that the review sees code that is plausibly correct.
Updated•4 years ago
|
Assignee | ||
Comment 136•4 years ago
|
||
Introduce indirect stubs for call_indirect.
Use special interstitial trampolines for:
a) call_indirect some_imported_table idx
because we don't know what are calling
b) call_indirect our_private_table wasm_func_idx
because we actually know that this is a cross-instance call_instance call
for all other cases we can use raw function pointers because
(a) we don't change the instance
and (b) our wasm ABI guarantee that after the call tls* will be in the valid state, so
the following cases use raw function calls:
- call_indirect our_private_table js_func_idx
- call_indirect our_private_table internal_wasm_func
Trampolines or indirect stubs needs to be bulk created only for exported functions
and for internal functions that are inserted into exported tables.
Assignee | ||
Comment 137•3 years ago
|
||
Introduce indirect stubs for call_indirect.
Use special interstitial trampolines for:
a) call_indirect some_imported_table idx
because we don't know what are calling
b) call_indirect our_private_table wasm_func_idx
because we actually know that this is a cross-instance call_instance call
for all other cases we can use raw function pointers because
(a) we don't change the instance
and (b) our wasm ABI guarantee that after the call tls* will be in the valid state, so
the following cases use raw function calls:
- call_indirect our_private_table js_func_idx
- call_indirect our_private_table internal_wasm_func
Trampolines or indirect stubs needs to be bulk created only for exported functions
and for internal functions that are inserted into exported tables.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 138•3 years ago
|
||
(Some initial notes, I will edit these as more data become available.)
Base rev: mozilla-central 7c3ea3514425. Applied the latest (Jul 8) rev of D117123. Built it with ../configure --disable-debug --enable-optimize --enable-release --enable-debug-symbols
and also made a comparable build from the base rev.
The benchmark is the call_indirect_ubench.js attached above. I run the JS shell with --wasm-compiler=ion
, no other switches, loading the benchmark file on the command line.
Consistently on my Xeon E5-2637 system with Fedora 33 in the JS shell, comparing with-the-patch to without-the-patch:
"external" calls (ie cross-module indirect calls) drop running time by 14% (highly surprising, these ought to be be slower)
"internal" calls (ie same-module indirect calls) drop running time by 23% (almost precisely what Dmitry has reported)
"direct" calls increase running time by 6% (very surprising, these ought to be invariant)
On a Core-i7 with macOS 11 (this is a 2018 MacBook Pro), ditto:
"external" calls slow down slightly, but really not much (very nice)
"internal" calls speed up by about 14% (so less than on the Xeon and what Dmitry reported)
"direct" calls are roughly invariant (as desired)
On an Apple M1 with macOS 11 (M1 MacBook Pro):
"external" calls slow down by a lot (running time goes from 270 to 430, ie increasing by 60%)
"internal" calls speed up by about 15% (comparable to i7)
"direct" calls are roughly invariant (as desired)
Another set of benchmarks. This is doubly-recursive fibonacci(40) using indirect calls. There are four cases:
- DIR: direct calls
- IIP: indirect calls, intra-module with a private table
- IIE: indirect calls, intra-module but with a shared (exported) table
- IIX: indirect calls, inter-module (requires two modules and a shared table)
Baseline Improved
DIR IIP IIE IIX DIR IIP IIE IIX
Xeon 753 1410 1425 1425 754 1070 1300 2080
i7 647 1130 1130 1140 645 836 1100 1520
M1 650 1050 1045 1045 650 770 980 1420
On all CPUs, in the baseline case the indirect programs all have roughly the same performance, which is what we want to see, and the direct case is much faster. Also, the direct cases have the same performance with the old and the new code, as expected & desired. For the indirect cases with the improved code, we see that there's a significant cost to the stubs: the IIX case takes almost twice as long as the IIP case. That said, the IIX case is "only" about 30% - 40% slower than the IIX case of the baseline code. The IIE case shows off an optimization where we can't statically say that a call is intramodule but we can detect it dynamically (by comparing the Tls values). This optimization is very effective and will be worthwhile if we think that this case will be common.
Comment 139•3 years ago
|
||
Fib benchmarks.
Assignee | ||
Comment 140•3 years ago
|
||
Thanks Lars for the testing,
I've run your fib tests too on my threadripper/Ubuntu 20.04 machine with the same configurations.
I've got the following results:
Baseline Improved Improved - dynamic check
DIR IIP IIE IIX DIR IIP IIE IIX DIR IIP IIE IIX
633 1187 1180 1256 635 849 1029 1993 635 853 1398 1799
I've included numbers for the version where we don't do the dynamic check because imho it is a rare case and removing it improving results for IIX a little bit.
Assignee | ||
Comment 141•3 years ago
|
||
I think about call_indirect
usage and find this research (https://www.usenix.org/system/files/sec20-lehmann.pdf) Table 2. It looks like the number of call_indirect
can be significant while almost all wasm usages are single module usages, so I think we should care less about truly external cases, wdyt?
Comment 142•3 years ago
|
||
almost all wasm usages are single module usages
I'm a little skeptical about that (is our population all wasm programs on the net or only the prominent/important ones; there are fewer of the latter, but the number of instances of the latter probably dwarfs the number of instances of the former) but I think that when a program is multi-module this will tend to be because it uses dynamic linking, and the cost is probably affordable.
I'm probably willing to believe that we can afford the extra cost of truly external calls. I think we should probably keep the dynamic check for now but once we have better heuristics for when a table is private we should reevaluate it.
Comment 143•3 years ago
|
||
I've done a little bit of memory profiling, not very systematic yet, just testing the waters.
The methodology for this experiment is to disable the wasm baseline jit (because some programs will be compiled without tiering and for the programs that are, object code caching will restore the optimized code on reload) and then load some wasm-using pages, logging the wasm machine code allocations.
Google earth w/o patch: 34668544 + 65536
Google earth w/ patch: 31260672 + 5242880 + 65536
This illustrates (a) that the patch can have a significant positive effect on the machine code size for the program itself (the 30+MB allocation goes from 34.7MB to 31.3MB), because indirect calls are represented by a more compact machine code sequence and (b) that the indirect stubs can take a significant amount of space, here about 17% of the machine code. The overall increase caused by the patch is only 5% however, due to the mentioned savings.
PSPDFKIT ("standalone" demo) w/o patch: 23265280
PSPDFKIT w/ patch: 23003136 + 1835008
Here, there are scant savings from more compact code, and though there's a relatively smaller amount of stub code for indirect calls, there is a total increase of 7% in executable memory.
ruffle.rs is a flash emulator. Go to the site, then click "demo", then select "Saturday morning watchmen" from the top right menu.
w/o patch: 1900544 + 65536 + 8126464 + 65536
w/ patch: 1835008 + 65536 + 131072 + 65536 + 8060928 + 65536 + 131072 + 262144 + 65536
Here we may see the effect of lazily creating some stubs, and/or the effect of storing the stub with the calling module. In both runs there are two large blobs, call them "first page" and "second page". During the second run, there is (131072 + 65536) extra memory allocated for the first page and (131072 + 262144 + 65536) for the second page, this memory is all for indirect stubs. All told we're seeing a 5% increase in executable memory for the program.
(Obviously there are some quantization effects in all of these numbers since only full 64KB pages can be allocated.)
I don't think these increases are terribly scary. I do think that we need to focus a bit on bringing these numbers down, if we can, by generating more compact stubs (removing the dynamic check for the intra-instance call would help) or by generating fewer (more aggressive determination of whether a table is private to an instance). I don't think that should block landing.
Comment 144•3 years ago
|
||
It's a little harder to test the autocad web app (web.autocad.com), there's probably a bug somewhere... in ion... Anyway, disable ion and enable baseline, login to the site and "stay logged in" and then bring up a blank tab, close the autocad tab, and then quit. Do this in both browsers. Now when the browser starts it should start in the blank tab. (Without this, it could look like the browser tried to load the autocad site on startup, also weird.) To test, load web.autocad.com, click on "samples", and select the roller assembly sample, this comes up quickly enough. Then Ctrl-L and type about:blank and enter, to unload the page.
Here there are many allocations, so I've captured a log and processed the log. In the end, without the patch the executable memory for wasm reaches 188.7MB at peak, and with the patch it reaches 200.0MB, a 6% increase. Since this is baseline code, we should expect the actual relative increase to be larger than that (with a measured peak being lower). If I can generate Ion numbers, I will post an update. (Edit: Ion problem tracked by bug 1728781.)
It could look like there are about 100 more allocations with the patch than without; these would be for individual pages that are needed for lazy indirect stubs, I expect. (Edit: Indeed this works out really well: 100 allocations should work out to about 7MB, there's also a separate 7.5MB allocation in the run with the patch, for 14.5MB total; meanwhile, the program's code with the patch is about 3.5MB smaller. 14.5-3.5=11, which is very close to the difference between the runs.)
Autocad was always considered a hard problem due to its structure; it's encouraging that this looks not particularly worse than other sites, so far.
Comment 145•3 years ago
|
||
Preliminary Ion numbers for web.autocad.com: without the patch the peak is 141.4MB, with the patch the peak is 150.0 MB, for a 6% increase here as well. This is OK - we want to try to shrink it, but we can live with it. Again, there are about 100 more individual lazy allocations totalling about 6.5MB, and an additional large 7.5MB eager allocation, but a 6.5MB savings on program object code.
Comment 146•3 years ago
|
||
Summarizing the memory profiling:
- we are up about 6% on peak executable memory consumption across all programs
- this is not great but is totally acceptable; we should however dig into these numbers a bit to look for savings
- if we remove the dynamic same-instance check from the stubs the eager stub allocations will likely be more compact
- once we improve the code for avoiding even eager stub allocations for known-same-instance calls, eager stub allocations may go away, though later bulk-upgrade of tables when they are discovered to not be same-instance will remove those savings in some cases
- once we use luke's clever idea for lazy stubs, most executable allocations for lazy stubs will disappear because we can allocate them densely instead one on every page. this will be very helpful for eg autocad, which has many lazy allocations, but for the other applications i don't know see that it will help much.
Possible bug: I'm not sure why we need to be allocating 64KB of executable memory for each lazy stub, as this computer uses 4KB pages, but that's what it looks like. This may be worth looking into soon. We allocate in 4KB chunks and this is the best we can do, probably.
Comment 147•3 years ago
|
||
Comment 148•3 years ago
|
||
Backed out for causing SM bustages. CLOSED TREE
Backout link : https://hg.mozilla.org/integration/autoland/rev/e0721c12a97a7b3497dbd2df3570839e5ed501fd
Push with failures : https://treeherder.mozilla.org/jobs?repo=autoland&resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&revision=d6506b5ee07b5c65f889ce7c2214b0172f205898&searchStr=SM&selectedTaskRun=Qb3LeVEMSjWZJXXJXSdV3Q.0
Link to failure log : https://treeherder.mozilla.org/logviewer?job_id=351620844&repo=autoland
Comment 149•3 years ago
|
||
Comment 150•3 years ago
|
||
bugherder |
Comment 151•3 years ago
|
||
Backed out for causing Bug 1731031.
Comment 152•3 years ago
|
||
Assignee | ||
Updated•3 years ago
|
Comment 153•3 years ago
|
||
bugherder |
Comment 154•3 years ago
|
||
Backed out for causing the Bug 1733016.
Comment 155•3 years ago
|
||
Backout merged to central: https://hg.mozilla.org/mozilla-central/rev/187c2cd787ba
Comment 156•3 years ago
|
||
(In reply to Cristian Tuns from comment #151)
Backed out for causing Bug 1731031.
== Change summary for alert #31797 (as of Mon, 11 Oct 2021 11:56:06 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
2% | wasm-godot-baseline | macosx1015-64-shippable-qr | webrender | 499.30 -> 487.07 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31797
Updated•3 years ago
|
Comment 157•3 years ago
|
||
Wohoo, I can reproduce on try (second 'p' failure on https://treeherder.mozilla.org/jobs?repo=try&revision=088f448eeb74e9b3ea34cc54705ee09ebbe1dc19&group_state=expanded&selectedTaskRun=fRNFivQtQQmwxP9SMsFYYQ.0), this is the same fault in a new test case. The symptoms are the same too: we get an indirect call to null to an element that should have been initialized by an active element segment initializer. This is the third case, the other two are https://treeherder.mozilla.org/logviewer?job_id=353031198&repo=autoland&lineNumber=47748 and https://treeherder.mozilla.org/logviewer?job_id=352999489&repo=autoland&lineNumber=28275.
(I'm not really any closer to understanding what this is but I'm currently going over the patch line by line to see if there's anything I've missed.)
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 158•3 years ago
|
||
Remaining blocking bug is considered fixed, so closing this as fixed. Further work to be filed as individual bugs under the umbrella of bug 1742930.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 159•3 years ago
|
||
Bug 1742053, which landed a refactoring of the last work here, was backed out by bug 1753061. We're not going to attempt to re-land, but will be adopting a different strategy, see bug 1754377.
Updated•3 years ago
|
Description
•