Indirect stubs code has quadratic table insertion time (was: Thread pool issue in emscripten web application)
Categories
(Core :: JavaScript: WebAssembly, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
thunderbird_esr91 | --- | unaffected |
firefox-esr91 | --- | unaffected |
firefox96 | --- | wontfix |
firefox97 | + | fixed |
firefox98 | + | fixed |
People
(Reporter: konstantinos.zagklis, Assigned: lth)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression)
Attachments
(2 files)
(deleted),
text/plain
|
Details | |
(deleted),
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:96.0) Gecko/20100101 Firefox/96.0
Steps to reproduce:
We ran an emscripten application that was working with FF 95 on the latest release (96) and it has become unresponsive.
The application utilizes emscriptens emsdk to build a wasm file, which has a thread pool in its C++ code.
More details in the attached file.
Actual results:
The application has been experiencing severely degraded performance and is unusable.
It hangs completely after a while.
Expected results:
The application should have worked as in FF 95.
Reporter | ||
Comment 1•3 years ago
|
||
Maybe this emscripten issue is related?
https://github.com/emscripten-core/emscripten/issues/15978
Comment 2•3 years ago
|
||
The Bugbug bot thinks this bug should belong to the 'Core::Javascript: WebAssembly' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Assignee | ||
Comment 3•3 years ago
|
||
Do you happen to have either an online presence for this application or an archive of it you can share (either here or privately with me)? This is going to be hard to debug otherwise.
And/or, can you test with the latest Firefox Nightly? (See https://www.mozilla.org/en-US/firefox/channel/desktop/ for downloads.) I fixed an oom-during-streaming-compilation issue earlier this week and it would be useful to know if that was the same problem as what you're seeing.
Assignee | ||
Comment 4•3 years ago
|
||
From an initial profile in a clean local release build, it looks like the insert() method of the indirect stubs table is the problem, and more generally, that initElems takes 100% of the CPU for long stretches, due to insert and to a very, very heavily contended lock (or a very buggy lock). For a large thread pool it could take a long time to initialize everything. Not conclusive, but very suggestive. See https://share.firefox.dev/3Imxe6a.
Assignee | ||
Comment 5•3 years ago
|
||
27:36.37 INFO: Last good revision: d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0
27:36.37 INFO: First bad revision: 2200d1f5c9563332ec8162220ffa392b98a850da
27:36.37 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0&tochange=2200d1f5c9563332ec8162220ffa392b98a850da
The pushlog is exactly the landing of the indirect stubs code.
Assignee | ||
Comment 6•3 years ago
|
||
(Lock contention is just a symptom; renaming again.)
The problem here is the following. The application has a table containing 47951 functions. Initially we create indirect stubs for these after baseline-compiling the code; this takes 53ms. But then there are additional worker threads. Because the table is in the Module and the Module is shared, these threads also create stubs and insert them in the table (this is logically the right thing to do). But then we tier-2 compile the code, and because code pointers are different for tier-2 code we create new stubs for all the threads and insert those in the table too, for a total of ... stubs. In the end, the table has 815167 entries and creating one set of 48K stubs takes 55 seconds.
This is not the only thing going on; there appears to be more than one module involved too, but the second module is not a problem.
The main problem is the insertion code. It uses a binary search to find the insertion point (yeah!) but then it moves all elements that are in the way one spot to the right (well, ok) and then it does this repeatedly for every element it is trying to insert, and that basically means the code is O(n^2) in the size of the table (boo hiss).
None of this can happen in parallel, so all the threads are completely serialized when they do this, and so getting the page up and running can take a very long time indeed.
The stub creation happens because the table is not private to the module, and maybe it could be avoided, but that's the architecture we have right now and we're actively discussing what to do about that and that discussion will take some time to complete. So in the mean time the most appropriate fix here is to make the insertion code much faster, either by using a better algorithm or by using a better data structure than one linear table per module.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 7•3 years ago
|
||
Break the indirect stubs table it into per-tls tables to avoid the
very long insertion times resulting from having one shared table per
module.
Assignee | ||
Comment 8•3 years ago
|
||
The patch fixes the problem and is fairly contained. It could be smarter (but then it would be less contained). It needs to be cleaned up in various ways before landing. We'll want to uplift this to beta for sure.
Reporter | ||
Comment 9•3 years ago
|
||
Please post here when the beta is live so we can do a smoke test and verify our application is working as expected
Assignee | ||
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 10•3 years ago
|
||
Comment 11•3 years ago
|
||
bugherder |
Assignee | ||
Comment 12•3 years ago
|
||
Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury
Beta/Release Uplift Approval Request
- User impact if declined: Some wasm applications will have really nasty hang-like behavior on startup.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Fairly contained reengineering of a data structure to avoid quadratic behavior when table grows large.
- String changes made/needed:
Comment 13•3 years ago
|
||
Comment on attachment 9259921 [details]
Bug 1750930 - Make the indirect stub table two-level. r?yury
Approved for 97.0b7.
Comment 14•3 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 15•3 years ago
|
||
(In reply to Konstantinos from comment #9)
Please post here when the beta is live so we can do a smoke test and verify our application is working as expected
The fix is in the FF98 Nightly builds now. I don't know if it made it into FF97 Beta 7 but it will be in FF97 Beta 8, which is built tomorrow, and should be available to beta users by Wednesday of this week. We're not fixing FF96. FF97 releases on February 8.
Reporter | ||
Comment 16•3 years ago
|
||
Reply to Lars T Hansen
Tested on Firefox developer edition 97.0b8 and 98.0a1
- Too slow (compare to previous than FF 96 and hromium based browsers)
- Application times out in a lot of Network operations
- Some functionalities not working at all
- Cannot perform VoIP calls like FF 95 and older
- No code errors, ONLY time outs
Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?
Reporter | ||
Comment 17•3 years ago
|
||
Tested on Firefox developer edition 97.0b8 and 98.0a1
1.Too slow (compare to previous than FF 96 and chromium based browsers)
2.Application times out in a lot of Network operations
3.Some functionalities not working at all
4.Cannot perform VoIP calls like FF 95 and older
5.No code errors, ONLY time outs
Our application it's not really usable anymore in FF 96 and newer versions
Is this solution final ?
Assignee | ||
Comment 18•3 years ago
|
||
Hi Konstantinos,
I will re-triage today and we'll see about what to do here. The original report was however about the slowness in establishing the initial connection (quoted below), and generally we try not to handle multiple different problems with the same bug report, so we'll likely want to close this one and file additional bugs for additional problems. I'm leaving it open for the time being.
Recording additional information from submitter, communicated through email:
You can access our application from this url: login.teamviewer.com/Connect
Use the following credentials
E-Mail: <REDACTED>
Password: <REDACTED>
In the application's main screen, fill in the text box this id: <REDACTED> and press connect when the button appears.
Using FF 95.0.2 the button will appear in ~20sec
Using FF 96.0.1 up to Firefox Nightly 98.0a1 will take more than 5min and the application won't work
Assignee | ||
Comment 19•3 years ago
|
||
The original report's problem with the startup time remains fixed (mostly -- there's a slightly longer lag with the fix than there was in FF95 but this is acceptable to me) but once started the application seems significantly laggier in recent builds than in FF95, I've received a video through other chanels with a side-by-side comparison.
According to the profiler there appears to be tremendous lock contention in js::wasm::Code::lookupIndirectStubRange which is called from Table::getFuncRef which is called from WasmTableObject::getImpl which is called from WasmTableObject::get. That is, the code very heavily reads functions from Table<funcref>. This is bound to be slow regardless, but the lock is a new thing that came in with the indirect stubs.
Profile for Nightly build 2022-01-30: https://share.firefox.dev/3ALszbn
(Unfortunately, FF95 does not appear to have worker profiling, but I'm going to try to obtain a good profile from FF96 before the indirect stubs landed, in the hope that worker profiles are available there.)
The reporter confirms that the application is compiled with -O2, so there should be no artifacts of debug code.
Assignee | ||
Comment 20•3 years ago
|
||
Profile for FF96 just before the indirect stubs landed: https://share.firefox.dev/3IPgi8K.
The application has 29 worker threads in the "WebCOOP+COEP Content" process. Certainly this creates room for some contention, should there be a lock.
The profiles are not completely comparable, but one might start the FF98 profile (previous comment) around 25s and focus only on the WebCOOP+COEP process. The execution time for most threads is dominated by futexwait - they're mostly idle - but for those threads that have long periods of activity initially in the segment starting at 25s, 20%-55% of the time is spent either in lock contention (dominates) or in LookupInSorted. Clearly these threads will tend to serialize each other since the lookups cannot happen in parallel. And the other threads are performing the same operations, which will tend to further exacerbate overhead and cause contention.
It was already known that we would want to avoid the lock in lookupIndirectStubRange, but it was assumed that applications would not hammer on this code in the manner of TeamViewer and that the lock would be affordable.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 21•3 years ago
|
||
I'm going to close this and spin off bug 1752870 for the new issue, which is related but different. Will copy comment 19 and comment 20 into that and populate it with STR, and will cc key people.
Assignee | ||
Updated•3 years ago
|
Description
•