Closed Bug 1291355 (stylo-jemalloc) Opened 8 years ago Closed 8 years ago

stylo: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them

Categories

(Core :: Memory Allocator, defect, P1)

Tracking

RESOLVED WORKSFORME

People

(Reporter: bholley, Assigned: erahm)

References

Details

Attachments

(2 files)

We currently enable multiple jemalloc arenas for stylo to eliminate the locking overhead for allocations. A recent email thread on this topic is as follows:

froydnj:
> I'd need more context on why it was added in the first place. One of
> the things I'd be curious about would be why the block comment
> immediately above that define is no longer valid--were Stylo folks
> finding that the allocator arena lock was a bottleneck with Stylo
> doing parallel layout-y things? You said on IRC that Cameron added
> it; maybe he has more context here?

heycam:
> Yes, there was a marked drop off in performance as the number of threads
> was increased (it peaked at 2 or 3 threads IIRC) without turning on this
> jemalloc option.

froydnj:
> Thanks. More questions. :)
>
> What are the threads here? Number of threads for Stylo to do
> off-main-thread layout? How bad is the performance dropoff? How many
> threads do we want to run with here? Do we have any idea how much the
> memory fragmentation story increases (e.g. what is the usual size of
> the jemalloc arenas)? I'm all about the performance win of Stylo, but
> if it comes at serious memory fragmentation costs, it's probably not
> going to be a win for our 32-bit users.

heycam:
> I didn't write down the numbers, so working from memory from a couple of
> months ago, but I think performance peaked at 2 or 3 threads, then went
> down again to be similar to single threaded performance at 6 threads or
> something.
>
> I don't really know what the plan is for deciding how many threads to
> run with. For the moment we use round(num_cpus * 3 / 4), but I imagine
> we'll need to test with real workloads to see how the choice of
> threadcount interacts with all the other off main thread stuff Gecko is
> doing.
>
> I have no idea about [fragmentation]. What would be a way I can measure the
> jemalloc arena sizes in a non-stylo Firefox with and without this
> option?
Depends on: 1291356
This should have been filed somewhere else, IMHO.
Component: CSS Parsing and Computation → Memory Allocator
> > I have no idea about [fragmentation]. What would be a way I can measure the
> > jemalloc arena sizes in a non-stylo Firefox with and without this
> > option?

A good (though imperfect) proxy for this would be |explicit/heap-overhead/bin-unused| in the about:memory report. The total for |explicit/heap-overhead| is probably interesting too, as that would also include |bookkeeping| (jemalloc's internal memory usage).
Summary: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them → stylo: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them
Priority: -- → P3
Alias: stylo-jemalloc
Blocks: stylo-nightly
No longer blocks: stylo
So, I just profiled this a bit, on mac and linux, with varying numbers of threads. Some findings:

(1) I get pretty equivalent numbers with and without MOZ_MEMORY_NARENAS_DEFAULT_ONE (i.e. making the #define at [1] unconditional). That is to say, I cannot reproduce Cameron's original findings. It may be because of bug 1335941, whereby the atomic contention is changing the interactions between the threads and drowning out the signal somehow. Or maybe something in our more-modern stylo architecture makes this less of an issue (which would be great!). We'll need to optimize and implement more stuff to find out.

(2) The parallel traversal seems about 15% slower with jemalloc4, which is not all that inconsistent with the results in bug 1219914 comment 44. These are early numbers, but they sure don't look good for jemalloc4.

[1] http://searchfox.org/mozilla-central/rev/d20e4431d0cf40311afa797868bc5c58c54790a2/memory/mozjemalloc/jemalloc.c#152
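(For readers without the tree handy, the knob being toggled here is just a preprocessor define in mozjemalloc's jemalloc.c. A minimal sketch of the two configurations being compared; the MOZ_STYLO guard is as described elsewhere in this bug, the comments are paraphrase:)

  /* Stylo configuration: the define is guarded, so MOZ_STYLO builds leave it
   * undefined and get multiple arenas ... */
  #ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  #endif

  /* ... and "making the #define unconditional" forces the single-arena
   * behaviour for every build: */
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE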
Ah, forgot a detail: jemalloc 4 as we configure it doesn't have multiple arenas enabled by default. You can change that in memory/build/jemalloc_config.cpp by removing "narenas:1" from MOZ_MALLOC_OPTIONS.
(In reply to Mike Hommey [:glandium] from comment #4)
> Ah, forgot a detail: jemalloc 4 as we configure it doesn't have multiple
> arenas enabled by default. You can change that in
> memory/build/jemalloc_config.cpp by removing "narenas:1" from
> MOZ_MALLOC_OPTIONS.

Ok, I'll measure that when I get back to remeasuring this stuff, ideally after bug 1335941 lands.
Priority: P3 → P2
smaug just did some profiling and said that multiple jemalloc arenas made Speedometer's innerHTML microbenchmark 10% slower, which is significant. That might be optimizable somehow, but it doesn't bode so well for multiple arenas so far...
Eric and Julian are looking into this stuff. Getting to the bottom of this is very high priority, because if we need to do any custom arena stuff we need to start designing that now.
Assignee: nobody → erahm
Priority: P2 → P1
jemalloc or mozjemalloc?
Note, it was just some particular sub-sub-test of Speedometer (bug 1347525), on linux. I don't know how a different malloc impl would affect the overall Speedometer score. But what is clear is that the things running in talos aren't, in general, enough for measurements. (jemalloc4 also seems to be slower than the current mozjemalloc (without the stylo-enabled stuff).)
Also, comment 6 goes the opposite way of what used to be said, which was that multiple arenas made things faster... From comment 6 alone, I'd say what you want is per-page or per-something arenas, not per-thread arenas (which is essentially what mozjemalloc multiple arenas do). While not exposed to gecko, jemalloc (not mozjemalloc) has an API to get an arena and do allocations in that arena (which is BTW one of the good reasons to want to switch to jemalloc)
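(For context, the per-arena allocation API that jemalloc 4 exposes looks roughly like the sketch below, per the jemalloc 4 manual. Treat it as illustrative only: the mallctl name for creating an arena changes across versions -- newer releases use "arenas.create" rather than "arenas.extend" -- and our builds may add a symbol prefix.)

  #include <stdio.h>
  #include <jemalloc/jemalloc.h>

  int main(void) {
    unsigned arena;
    size_t len = sizeof(arena);
    /* Ask jemalloc for a fresh arena and get back its index. */
    if (mallctl("arenas.extend", &arena, &len, NULL, 0) != 0)
      return 1;
    /* Allocate and free explicitly within that arena. */
    void *p = mallocx(1024, MALLOCX_ARENA(arena));
    printf("allocated %p in arena %u\n", p, arena);
    dallocx(p, MALLOCX_ARENA(arena));
    return 0;
  }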
One thing to remember, need to test this all on different platforms. Memory allocations seem to be in general faster on linux than on MacOS and Windows, so perhaps jemalloc and mozjemalloc behave quite differently on different OSes.
(In reply to Mike Hommey [:glandium] (VAC: Apr 20-May 4) from comment #10)
> Also, comment 6 goes the opposite way of what used to be said, which was
> that multiple arenas made things faster...

It made things faster for parallel style system performance. smaug is measuring something unrelated (single-thread innerHTML performance).

> From comment 6 alone, I'd say
> what you want is per-page or per-something arenas, not per-thread arenas
> (which is essentially what mozjemalloc multiple arenas do).

The point is to avoid locking contention during the parallel traversal. per-page doesn't help with that, but a special style-system arena that was keyed by thread would.
I still seem to be getting unreliable stacks when profiling so I tested with DUMP_STYLE_STATISTICS=1 instead. I did a test run of loading the myspace album art test 5 times (fresh start each time) and dumping style statistics with multiple arenas enabled in mozjemalloc. I then split out the traversal_time_ms metric (I believe that's what we care about). I repeated this without multiple arenas. The results seem to indicate one arena is faster.

> Multiple Arenas
> Run 1         Run 2         Run 3         Run 4         Run 5         Average
> 1.202983956   0.8736519958  0.6747109583  1.075475011   1.014448004   0.968253985
> 0.8492839988  1.255198964   1.157934021   1.200310013   1.199208025   1.132387004
> 1.161297027   1.219962025   1.278500014   1.354901004   1.227325993   1.248397212
> 34.46693602   27.65948197   34.41471403   28.23663497   27.24114002   30.4037814
> 1.535873977   1.517786004   1.559063967   0.9672119631  1.055503963   1.327087975
> 1825.349511   1719.367721   1867.61279    1653.857773   1923.593605   1797.95628
>
> One Arena
> Run 1         Run 2         Run 3         Run 4         Run 5         Average       Delta
> 0.8032089681  1.298232004   0.9981150506  1.084592019   1.032647968   1.043359202   0.07510521682
> 0.8614579565  0.7898909971  1.031210006   1.362927025   1.133467013   1.035790599   -0.09659640491
> 0.7925549871  0.8089640178  0.7779609878  1.176575024   0.7930260035  0.8698162041  -0.3785810084
> 33.69354497   28.15638104   24.87073396   28.59738801   34.56510697   29.97663099   -0.427150412
> 1.547398977   1.599415031   1.90692296    1.554740011   1.513577008   1.624410797   0.2973228227
> 1700.755901   1722.02633    1793.781668   1715.529505   1805.659356   1747.550552   -50.40572801

Example command line:
> DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_1.txt
> DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_2.txt
> DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_3.txt
> DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_4.txt
> DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_5.txt
> cat stylo_multiple_arenas_perf_*.txt | grep traversal_time_ms | cut -d ',' -f 3 | pr -5 -t
Sorry I wasn't clear, this is on a desktop w/
- Ubuntu 16.04 LTS
- clang 4.0
- CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
- 32GB RAM

config:
> ac_add_options --enable-stylo
> ac_add_options --disable-stylo-build-bindgen
For the Obama test we see better perf with multiple arenas.

> Multiple Arenas
> Run 1         Run 2         Run 3         Run 4         Run 5         Average
> 0.6914929836  1.024849014   1.304770994   1.078581961   0.7182080299  0.9635805967
> 0.8394150063  1.143292      0.8399279905  1.260520949   0.8171049994  0.980052189
> 0.7503610104  0.9098720038  0.9900479927  1.15671102    1.115054998   0.9844094049
> 31.57984099   33.03947899   32.99107699   32.70683304   35.11506302   33.08645861
>
> One Arena
> Run 1         Run 2         Run 3         Run 4         Run 5         Average       Delta
> 0.7722550072  1.428720017   1.032161003   0.7012890419  1.329610008   1.052807015   0.08922641864
> 1.053375017   0.8312190184  1.222328981   1.186915964   1.234023017   1.1055724     0.1255202107
> 0.8040550165  0.8170910296  1.552518981   0.7702309522  1.132268982   1.015232992   0.03082358744
> 103.652606    104.5992      100.010437    101.929293    99.28652795   101.8956128   68.8091542
Thanks for the measurements! It's really great to be making progress here. For our purposes, I think we only care about the "big" traversal, which generally seems to be the last one in the list.

For posterity, can you note how many worker threads you were using? Assuming you weren't passing anything for STYLO_THREADS, I'm guessing the answer would be 6 based on [1] with 8 logical CPUs (4 cores x 2 hyperthreads), but it's good to be sure.

So in general your results validate heycam's findings. The myspace testcase (assuming you're using the non-"orig" one) has a small DOM and an enormous stylesheet, so the time will be dominated by selector matching, which doesn't allocate much. So we're looking at something like 1750 vs 1800 ms, which to me means that multiple arenas don't help with contention here, and they happen to be a bit slower in general, as we've measured elsewhere.

Conversely, the wikipedia testcase has a large DOM, and so a significant amount of time is spent allocating style structs for each element. On that testcase we see multiple arenas taking us from 100ms to 33ms, basically a 3x improvement, which is huge. That's on the order of the entire win that we'd expect from the parallelism, so stylo isn't much of a win without it.

The next step here is to dig into what exactly in jemalloc is slow in the single-arena case. Is there anything obvious that can be fixed to mostly-eliminate the contention? If not, we have two options:
(1) Try to get multiple jemalloc arenas enabled, which potentially has significant performance implications across the platform.
(2) Allocate and manage style structs using manual, per-thread + per-type arenas in Rust.

My guess is that (2) is going to be a lot easier (and has the advantages discussed in bug 1347852), though it has a few downsides:
* It doesn't completely eliminate the overhead, since we do some amount of malloc-ing for other stuff.
* It's more complex, and we don't get to re-use Arc.
* There's some degree of memory waste, unless we implement something less-arena-like and more-malloc-like.

[1] http://searchfox.org/mozilla-central/rev/24eec01c4e43c0f029b0148900b5f8ed508f7e21/servo/components/style/gecko/global_style_data.rs#36
Depends on: 1356988
Blocks: 1356988
No longer depends on: 1356988
No longer blocks: 1356988
(In reply to Eric Rahm [:erahm] from comment #15)
> For the Obama test we see better perf with multiple arenas.

I'm assuming that "Multiple arenas" means

  #ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  #endif

and "Single arena" means

  //#ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  //#endif

I can reproduce Eric's results for the Obama test. Numbers are for traversal_time_ms. I ran each test twice, for STYLO_THREADS={1,2,3,4,6,8}. There's a lot of noise but the trend is clear.

Multiple arenas:
  1 thr  181.9  187.9
  2 thr   97.7  111.6
  3 thr   74.9   75.5
  4 thr   54.2   65.9
  6 thr   47.1   48.6
  8 thr   57.8   46.6

Single arena:
  1 thr  177.3  181.8
  2 thr  109.4  109.7
  3 thr   99.9  106.3
  4 thr  110.1  111.2
  6 thr  145.0  135.0
  8 thr  170.1  173.8

The multiple arena case continues to improve out to at least 8 threads. The single arena case does benefit from parallelism but gets worse beyond 3 threads.

This is on a low-end Xeon with 4 cores and 8 threads, Fedora 25, Clang 3.9.
I spent much of the day chasing this, but mostly got confused. Here's a summary. Maybe someone can point at a better direction.

* System: Fedora 25, x86_64, clang 3.9, rustc 1.16.0, low end Xeon, 2.5 GHz
* Using the Obama test, since that shows marked slowdowns with a single arena.
* From comment #15 and comment #17 there is clearly something ungood going on.

My goal was to compare single-vs-multiple arena behaviour whilst profiling with Valgrind's Callgrind tool. This collects counts of instructions, memory accesses, cache misses, branch mispredicts and global bus operations, and gives us a good chance of spotting bad interactions with the microarchitecture.

Because Callgrind is slow and to avoid getting swamped with irrelevant data, I hacked up the patch shown in the next comment. This is intended to enable data collection only when traverse_subtree() is active, which IIUC is what the cost "traversal_time_ms" pertains to. I was aided in this enterprise by emilio. But it produces costs of only about 17 million instructions for the test case, which is far too small to account for the 100-ish millisecond cost. So something is not right there.

Some questions:

(1) Is there a way to reduce the number of processes in the picture? Preferably to 1? Having to guess which process' profile to look at is confusing.

(2) Is there a way to have Firefox quit after processing of the test page is complete? This would remove noise resulting from idling / other activities between finishing drawing and shutdown.

(3) Is traverse_subtree() the right place to bracket profiler start/stop hints? If not, where is a better place? Would it be better to put them around guaranteed single-threaded regions, so as to reduce confusion resulting from individual calls to traverse_subtree() being split into multiple threads?
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #18)
> But it produces costs of only about 17 million instructions for the test case,
> which is far too small to account for the 100-ish millisecond cost.

So instead I compared multiple-vs-single arena profiles for the process containing around 650-680 calls to traverse_subtree. Results are

  What               Multi     Single    Simulated?
  insns              11.83 G   12.11 G
  D1 misses          153.4 M   150.4 M   y
  LL misses          11.26 M   10.60 M   y
  Mispredicts        74.2 M    77.5 M    y
  Global bus events  22.2 M    25.2 M

Given that the entries marked "Simulated" are derived from Callgrind's simulation of caches and branch predictors, not from the hardware, and also given that Valgrind serialises threaded execution, I wouldn't pay too much attention to this. The only thing that particularly stands out is the change in global bus events, since that signifies locking. Indeed, the Multi version has 8.4 million calls to each of pthread_mutex_{lock,unlock}, the single one has 9.9 million such calls.

Almost all of that change seems to come from "<F as alloc::boxed::FnBox<A>>::call_box", called from "std::sys::imp::thread::Thread::new::thread_start". Does that make any sense to anyone?
That seems kind of weird. Are we starting new threads for each bit of the traversal or something? I would think that the rayon thread pool would spin up N threads and then not spin up any more...
Julian and I spent a bit of time debugging comment 20's results over IRC. We discovered that libxul has several symbols that demangle to "<F as alloc::boxed::FnBox<A>>::call_box". They all look like:

  _ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h7aeb0dbca779e468E

differing only in the hash at the end. valgrind is chopping off the hash, and lumping all the functions together in the callgraph, so even though Thread::new::thread_start is calling one flavor of ::call_box, that particular flavor isn't getting disambiguated from all the others. Needless to say, this sort of thing is extremely frustrating for trying to debug Rust code.

There are a few ::call_box instances that call pthread_mutex_{lock,unlock}; it's difficult to find callers for them because they're apparently only called via traits, i.e. function pointers. I tried starting a Stylo build, hoping to set some breakpoints, but only got crashes while trying to twiddle refcounts from Stylo. I'll keep trying.

Julian pointed out that even if we figure out what's causing those extra global bus events, and even if all those extra bus events are causing some sort of slowdown, it doesn't do a great job explaining why changing the number of arenas makes any difference. So we're still a bit in the dark on that one.
(In reply to Julian Seward [:jseward] from comment #18)
> (1) Is there a way to reduce the number of processes in the picture?
> Preferably to 1? Having to guess which process' profile to look at is
> confusing.

yes, just pass --disable-e10s to mach.

> (2) Is there a way to have Firefox quit after processing of the test page is
> complete? This would remove noise resulting from idling / other activities
> between finishing drawing and shutdown.

You could panic!() at the completion of a traversal that takes more than 30ms or so.

> (3) Is traverse_subtree() the right place to bracket profiler start/stop
> hints?

traverse_dom is probably better, since it has less initialization overhead. But note that there are two versions, one in sequential.rs and one in parallel.rs. Those functions are the top-level driver of the traversal. In the parallel case, that is where we kick over to the worker threads.

> If not, where is a better place? Would it be better to put them around
> guaranteed single-threaded regions, so as to reduce confusion resulting from
> individual calls to traverse_subtree() being split into multiple threads?

I'm not sure what you mean by "guaranteed single-threaded regions". If you want the actual work of style computation (which may occur in parallel on several threads), then you should look at compute_style. If you want only the cascade logic and not the selector matching (since I suspect the former is more responsible for the allocations), you should look at cascade_primary_or_pseudo.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> > (2) Is there a way to have Firefox quit after processing of the test page is
> > complete? This would remove noise resulting from idling / other activities
> > between finishing drawing and shutdown.
>
> You could panic!() at the completion of a traversal that takes more than
> 30ms or so.

I don't want to gate it on any specific time period since that's going to change, dramatically so when running on any Valgrind-based profiling tool.

There's surely some place in the Stylo machinery where it knows that it has finished all traversal(s) (or whatever it is that we want to profile here). I could put a panic!() there. But .. where is that place?
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #24)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> > > (2) Is there a way to have Firefox quit after processing of the test page is
> > > complete? This would remove noise resulting from idling / other activities
> > > between finishing drawing and shutdown.
> >
> > You could panic!() at the completion of a traversal that takes more than
> > 30ms or so.
>
> I don't want to gate it on any specific time period since that's going to
> change, dramatically so when running on any Valgrind-based profiling tool.

Ok - maybe gate it on the number of elements styled instead? That's also in the traversal statistics.

> There's surely some place in the Stylo machinery where it knows that it has
> finished all traversal(s) (or whatever it is that we want to profile here).
> I could put a panic!() there. But .. where is that place?

Gecko calls into Servo to do styling for lots of reasons, most of which are for tiny subtrees or when there's very little work to be performed. I could probably come up with something, but for our purposes here it's just much easier to do the hacky thing and base it on the amount of work done.
Flags: needinfo?(bobbyholley)
Depends on: 1356701
Depends on: 1354215
No longer blocks: stylo, stylo-nightly
I profiled Fx/Stylo running the Obama test for both single and multiple arenas, and studied counts of instructions, global memory operations, pthread_mutex_* calls and malloc calls, and looked at jemalloc and Rayon hot paths. My interim conclusions follow. I should point out that as a result of (5), much of this is my best guess, but nothing more. The important points are (4) and (5).

(1) The problem [slowdown in single arena case] is not caused by mozjemalloc's locking strategy. In both configurations mozjemalloc does one pthread_mutex_lock and _unlock per malloc and per free.

(2) This is instead caused by excessive inter-core cache line movement resulting from the use of a single arena:

    - If all threads use a single malloc lock, the containing line is going to be bounced around all cores.

    - The same goes for mozjemalloc's metadata, and in particular the use of the structure |run| in |arena_malloc_small|.

    - The same goes for the data in the blocks themselves. Imagine one thread frees a block, and that area is quickly reallocated to a different thread on a different core. When the new thread initialises the block, the hardware will have to move the associated line(s) to the new core (that is, take a bunch of L2 misses) because there is no way to tell it [on Intels] that the old contents of the block are no longer needed.

    This behaviour will be heavily dependent on the exact access patterns from Stylo and on how the kernel schedules threads across cores. This might go some way to explaining the apparently inconsistent malloc-perf results that we've seen previously.

(3) So far I have not spotted any way to improve the parallel performance of mozjemalloc configured for single arenas.

(4) As a result of (2) and (3), I think that single-arena mozjemalloc is a poor starting point for a scalable high performance parallel allocator. We are in effect using jemalloc in a way for which it was not intended. We would be better off switching to multi-arena by default and fixing whatever perf problems we can.

(5) We have poor tooling for investigating such phenomena (inter core cache line transfers). So far the only tool I have found that can do this is Linux's "perf" tool, and even that is hard to use and provides non-exact attribution of costs to specific source locations.

Some secondary conclusions:

(6) The single-arena case appears to cause more activity in Rayon (v0.6.0). In particular Rayon makes a very large number of calls to steal_work(), around 4 million in the multi-arena case and 5.5 million in the single-arena case. This is visible when profiling with Valgrind/Callgrind. Whether this is real or merely an artefact of the profiling, I don't know.

(7) I tried to investigate (6) by hacking on the vendored Rayon (third_party/rust/rayon) but was thwarted by the build system, which refused to compile the modified sources due to checksum mismatches. We need a way to make this possible.

(8) Ongoing efforts to reduce the allocation rate are worthwhile regardless of the results of this investigation.

(9) I wrote a small test program to assess the effect of cache line sharing, to see if the effects are real, and to give a small test case to try out with tools (see attachment). It creates an array of counters with variable spacing and increments them from different threads. Increasing the spacing from 4 bytes (all in the same line) to 64 bytes (all in different lines) reduces the run time by a factor of 6 on a 4 core, 8 thread Xeon.
The following commands are interesting:

  for ((sp = 0; sp < 100; sp = sp+8 )) ; \
    do /usr/bin/time ./contention 8 $sp ; echo ; done

  perf stat -e cycles:u,instructions:u,branches:u,branch-misses:u,L1-dcache-load-misses:u,L1-icache-load-misses:u,LLC-load-misses:u,LLC-store-misses:u,l2_rqsts.miss:u ./contention 8 0

and the same, but with "8 64" instead of "8 0".

Next steps are:

(10) Cut out mozjemalloc into a standalone test program and try some synthetic workloads on it, also watching it with "perf".

(11) See if I can get anything useful out of "perf" for Fx/Stylo as a whole.

(12) Look at multi-arena Stylo perf regressions.

(13) See if it is possible to modify Valgrind/Callgrind to collect cache line transfer numbers.

(14) Look into modifying the in-tree Rayon, and/or trying out a more recent version (0.7.0).
Attached file contention.c (deleted) —
Small program for experimenting with cache line contention.
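(The attachment is what was actually measured; the sketch below just illustrates the idea for readers without it. The program name, iteration count, and minimum stride are invented for illustration: N threads each increment their own counter, spaced <stride-bytes> apart, so a small stride makes the counters share cache lines and the lines ping-pong between cores.)

  /* contention-style sketch.  Build: cc -O2 -pthread contention_sketch.c
   * Usage: ./a.out <nthreads> <stride-bytes> */
  #include <pthread.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define ITERS 100000000UL

  static unsigned char *counters;   /* nthreads counters, 'stride' bytes apart */
  static long stride;

  static void *worker(void *arg) {
    long idx = (long)(intptr_t)arg;
    volatile unsigned *c = (volatile unsigned *)(counters + idx * stride);
    for (unsigned long i = 0; i < ITERS; i++)
      (*c)++;                       /* hammer this thread's counter */
    return NULL;
  }

  int main(int argc, char **argv) {
    long nthreads = argc > 1 ? atol(argv[1]) : 8;
    stride = argc > 2 ? atol(argv[2]) : 4;
    if (stride < (long)sizeof(unsigned))
      stride = sizeof(unsigned);    /* keep counters from overlapping */
    counters = calloc(nthreads, stride);
    pthread_t tids[nthreads];
    for (long i = 0; i < nthreads; i++)
      pthread_create(&tids[i], NULL, worker, (void *)(intptr_t)i);
    for (long i = 0; i < nthreads; i++)
      pthread_join(tids[i], NULL);
    printf("done: %ld threads, stride %ld bytes\n", nthreads, stride);
    return 0;
  }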
(In reply to Julian Seward [:jseward] from comment #26)
> (4) As a result of (2) and (3), I think that single-arena mozjemalloc is a
> poor starting point for a scalable high performance parallel allocator. We
> are in effect using jemalloc in a way for which it was not intended. We
> would be better off switching to multi-arena by default and fixing whatever
> perf problems we can.

Sounds reasonable, but we can't switch to a slower-on-main-thread malloc. Do we know the performance characteristics of some other malloc impls than jemalloc?
This is excellent analysis Julian, thank you! The cache line contention issue makes total sense, and the thrashing to page in garbage memory after free() gives us important insight into our path forward (see below).

My guess is that the main-thread perf regression from enabling multiple jemalloc arenas (the one that smaug measured) is due to the required TLS lookup on every malloc. I would bet we could eliminate that by simply having a global flag that we set before a parallel stylo traversal to enable/disable multiple arenas during that time only.

However, I'm concerned that this (and the whole multiple-jemalloc-arena business in general) would just mask the underlying problem here. Specifically, the multiple-jemalloc-arenas approach makes sense if threads mostly own their data - that is to say, the mapping between thread id and data under manipulation remains relatively stable. This is fine for something like DOM Workers, but totally falls apart with Rayon/Work-Stealing, where the association between thread and data is very loose.

More concretely: multiple jemalloc arenas will work fine for the initial style pass, where they basically guarantee that threads will only free heap data that they own. But when processing restyles, we have only a 1/N chance that a given element will be restyled by the same OS thread that originally styled it. So each time the DOM is restyled, we'll free a bunch of memory in other threads' arenas, and allocate new style structs in different arenas than their neighbors. This will trend towards zero locality over many restyles, which is a terrible characteristic to have: it will be hard to detect on our benchmarks (which is why we originally determined that multiple jemalloc arenas fixed the problem), but will cause ever-worsening performance for users as they spend time on highly-dynamic pages.

So I think we're going to need to roll our own custom arenas. I'll get working on a design.
Oh, and Julian - continuing to understand this problem, especially the different characteristics of rayon 0.6 and 0.7, sounds great. You should be able to modify the in-tree rayon with a [replace] and then do |./mach vendor rust|. This is a patch for what a successful replace looks like: https://gist.github.com/bholley/85dc2277aaa732bce04cc9e672915595
(In reply to Olli Pettay [:smaug] from comment #28)
> Sounds reasonable, but we can't switch to a slower-on-main-thread malloc.

This statement really depends on how much slower the malloc would be; it's not an absolute rule.

> Do we know the performance characteristics of some other malloc impls than
> jemalloc?

Only anecdotally. tcmalloc (http://goog-perftools.sourceforge.net/doc/tcmalloc.html) is faster than glibc's default malloc (with some pretty graphs to prove it, though it's not clear how old those graphs are), but (anecdotally) consumes more memory than jemalloc. jemalloc is supposed to do a better job than dlmalloc (http://g.oswego.edu/dl/html/malloc.html), which is the basis of glibc's memory allocator, called ptmalloc (http://www.malloc.de/en/).
(In reply to Nathan Froyd [:froydnj] from comment #32)
> (In reply to Olli Pettay [:smaug] from comment #28)
> > Sounds reasonable, but we can't switch to a slower-on-main-thread malloc.
>
> This statement really depends on how much slower the malloc would be; it's
> not an absolute rule.

Let's say that if it would slow down the Quantum Flow perf metrics, it is a no-no.
Changing the Gecko-wide allocator is totally out of scope for this bug. I'm pretty sure we can get the best of both worlds by simply toggling multiple jemalloc arenas when we start and stop the parallel traversal. The question is whether that's good enough, or whether we should arena-allocate anyway. We might even want to do both.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #34)
> Changing the Gecko-wide allocator is totally out of scope for this bug. I'm
> pretty sure we can get the best of both worlds by simply toggling multiple
> jemalloc arenas when we start and stop the parallel traversal.

Emilio and I discussed this option a bit more earlier today, and he convinced me that it might be more workable than I thought - at least worth investigating.

From Julian's analysis, it seems like there are two reasons for the bad performance:
(A) Two threads simultaneously allocating (and trying to initialize) memory in the same cache line.
(B) One thread freeing memory, and another thread subsequently reusing it, causing us to do unnecessary work transferring the garbage data between cache lines.

Multiple jemalloc arenas solve (A) but not necessarily (B). However, emilio pointed out that we can likely defer most frees until after the parallel traversal, especially since the style context generally holds the ComputedValues alive until after the traversal is complete (though this doesn't help Servo). It wouldn't be perfect, but it would be a lot easier than arena allocation, and so potentially a good place to start (we _may_ still want arenas to deal with the sequential allocation overhead, but I want to see how low I can get that first). I'll try to have a look at this tomorrow.

A lot of the efficacy here depends on how much of the problem we think is (A) vs (B). Julian, do you have a sense of this?
Flags: needinfo?(jseward)
Depends on: 1361258
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #29)
> My guess is that the main-thread perf regression from enabling multiple
> jemalloc arenas (the one that smaug measured) is due to the required TLS
> lookup on every malloc. [..]

Having peered at the jemalloc sources, I agree. And it's potentially dangerous, if TLS on some platform (Windows?) is significantly more expensive than on others. Do you know if anyone has tried multi-arena on Windows?

> [..] Rayon/Work-Stealing, where the association between thread and data
> is very loose.

Hmm, that's an interesting point -- I hadn't thought of that. Maybe some refinement of Rayon, to give it placement or locality hints, or some such, would be something we should consider.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #35)
> (A) Two threads simultaneously allocating (and trying to initialize) memory
> in the same cache line.
> (B) One thread freeing memory, and another thread subsequently reusing it,
> causing us to do unnecessary work transferring the garbage data between
> cache lines.
>
> Multiple jemalloc arenas solve (A) but not necessarily (B).

For reasons I don't [yet] understand, multi-arena jemalloc appears to lock and unlock for every allocation/free. Hence even in this case we'll be bouncing around at least the cache line containing the lock. I had gotten the impression that multi-arena meant "one arena per thread" but perhaps that's not so.

> A lot of the efficacy here depends on how much of the problem we think is
> (A) vs (B). Julian, do you have a sense of this?

Unfortunately no. Without tooling to find these misses we're pretty much shooting in the dark. I'll see if I can hack something up (and generally keep digging.)
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #37)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #35)
> > (A) Two threads simultaneously allocating (and trying to initialize) memory
> > in the same cache line.
> > (B) One thread freeing memory, and another thread subsequently reusing it,
> > causing us to do unnecessary work transferring the garbage data between
> > cache lines.
> >
> > Multiple jemalloc arenas solve (A) but not necessarily (B).
>
> For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> and unlock for every allocation/free. Hence even in this case we'll be
> bouncing around at least the cache line containing the lock. I had gotten
> the impression that multi-arena meant "one arena per thread" but perhaps
> that's not so.

This is indeed not so, but (newer?) versions of jemalloc provide for thread-specific arenas (called "caches" in the documentation, so maybe they're not the same thing...

http://jemalloc.net/jemalloc.3.html#tcache.create
(In reply to Nathan Froyd [:froydnj] from comment #38)
> (In reply to Julian Seward [:jseward] from comment #37)
> > For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> > and unlock for every allocation/free. [..]
>
> This is indeed not so, but (newer?) versions of jemalloc provide for
> thread-specific arenas (called "caches" in the documentation, so maybe
> they're not the same thing...

Right. So what I *expected* to see was a fast path in which the calling thread uses the TLS mechanism to get a pointer to its own private data area, which is the only potentially expensive bit, then tries to allocate in there, and only takes a lock if it has to fall back to allocating from central facilities for whatever reason.

But that's not what it does.

> [..] but (newer?) versions of jemalloc provide for thread-specific arenas

I'm getting the impression that we're not running the most recent or most capable version of jemalloc for some reason, but I don't know the back story. Can you give me a pointer to it?
Flags: needinfo?(nfroyd)
(In reply to Julian Seward [:jseward] from comment #39)
> I'm getting the impression that we're not running the most recent or most
> capable version of jemalloc for some reason, but I don't know the back story.
> Can you give me a pointer to it?

I think bug 1219914 is the primary impediment. (Bug 762449 may have other blockers.)
(In reply to Julian Seward [:jseward] from comment #39)
> (In reply to Nathan Froyd [:froydnj] from comment #38)
> > (In reply to Julian Seward [:jseward] from comment #37)
> > > For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> > > and unlock for every allocation/free. [..]
> >
> > This is indeed not so, but (newer?) versions of jemalloc provide for
> > thread-specific arenas (called "caches" in the documentation, so maybe
> > they're not the same thing...
>
> Right. So what I *expected* to see was a fast path in which the calling
> thread uses the TLS mechanism to get a pointer to its own private data
> area, which is the only potentially expensive bit, then tries
> to allocate in there, and only takes a lock if it has to fall back
> to allocating from central facilities for whatever reason.

Well, that would make sense to me, because there's always the possibility that thread A frees memory from the arena that thread B is allocating from. That can happen regardless of whether we use multiple arenas or not, so I'd expect there to be a lock acquisition for main-thread allocations with or without multiple arenas. Are you saying there's _more_ locking with multi-arenas, or just that the locking isn't eliminated? Our stylo performance with multiple arenas seems fine, it's more a question of whether we regress main-thread performance at all.

> But that's not what it does.
>
> > [..] but (newer?) versions of jemalloc provide for thread-specific arenas
>
> I'm getting the impression that we're not running the most recent or most
> capable version of jemalloc for some reason, but I don't know the back story.
> Can you give me a pointer to it?

Upgrading mozjemalloc is a total black hole and out of scope for us. We should consider ourselves to be stuck with mozjemalloc for the stylo timeframe.
Flags: needinfo?(nfroyd) → needinfo?(jseward)
> Upgrading mozjemalloc is a total black hole and out of scope for us. We should
> consider ourselves to be stuck with mozjemalloc for the stylo timeframe.

I agree. glandium has spent lots of time in the past trying to get jemalloc4 up to snuff, without success. Changing the allocator is *hard*, because *everything* depends on it.
(In reply to Andrew McCreight [:mccr8] from comment #40)
> (In reply to Julian Seward [:jseward] from comment #39)
> > I'm getting the impression that we're not running the most recent or most
> > capable version of jemalloc for some reason, but I don't know the back story.
> > Can you give me a pointer to it?
>
> I think bug 1219914 is the primary impediment. (Bug 762449 may have other
> blockers.)

See also bug 1357301 comment 8 and bug 1357301 comment 9. It's a horribly broken mess.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #41)
> Well, that would make sense to me, because there's always the possibility
> that thread A frees memory from the arena that thread B is allocating from.
> That can happen regardless of whether we use multiple arenas or not, so I'd
> expect there to be a lock acquisition for main-thread allocations with or
> without multiple arenas.

Ah yes. I hadn't thought properly about this. Sorry for the noise.

> Are you saying there's _more_ locking with
> multi-arenas, or just that the locking isn't eliminated?

Just that locking isn't eliminated or even reduced -- it's the same as with single arenas. But I see now that's hard to avoid.

> Upgrading mozjemalloc is a total black hole and out of scope for us.

I understand. I wasn't proposing that. I just wanted to understand the history a bit.
Flags: needinfo?(jseward)
Do you have any cycles today or tomorrow to poke at making multiple arenas dynamically toggleable (i.e. the arenas are always there, but we only do the TLS lookup when a static bool is set)? I was hoping to get to it today but a lot of stuff came up, and not sure I will tomorrow either. We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in ServoBindings.{cpp.h} to flip the bit, and then invoke that function in parallel.rs.
Flags: needinfo?(jseward)
(And then the key would be to measure the performance on smaug's microbenchmarks with the extra arenas existing but toggled off).
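(For illustration only, the toggle being proposed would look roughly like the sketch below: a static flag plus a gated arena choice inside mozjemalloc. Everything here other than the Gecko_SetMultipleJemallocArenasEnabled name and its caller in parallel.rs is a hypothetical stand-in, not the eventual patch.)

  #include <stdbool.h>

  /* Stand-ins for mozjemalloc internals. */
  typedef struct arena_s arena_t;
  extern arena_t *arenas[];                  /* arenas[0] is the default arena */
  extern arena_t *thread_local_arena(void);  /* TLS lookup for this thread's arena */

  static bool gMultipleArenasEnabled = false;  /* flipped around the parallel traversal */

  /* Exposed through ServoBindings.{cpp,h} and called from parallel.rs. */
  void Gecko_SetMultipleJemallocArenasEnabled(bool aEnabled) {
    gMultipleArenasEnabled = aEnabled;
  }

  /* Inside malloc: one well-predicted branch when the flag is clear, and the
   * TLS lookup only while a stylo traversal is in flight. */
  static arena_t *choose_arena(void) {
    return gMultipleArenasEnabled ? thread_local_arena() : arenas[0];
  }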
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> Do you have any cycles today or tomorrow to poke at making multiple arenas
> dynamically toggleable

Yes, looking at it now.
Flags: needinfo?(jseward)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> [..] making multiple arenas dynamically toggleable [..]

Patch is at bug 1361258 comment 1. Looks good for the parallel traversal but I haven't tested smaug's microbenchmarks with it yet.
(In reply to Julian Seward [:jseward] from comment #37)
> I had gotten
> the impression that multi-arena meant "one arena per thread" but perhaps
> that's not so.

IIRC, it's one arena per cpu (or maybe 2), and threads get an arena associated randomly, with the hope they'll be evenly distributed. This means threads still can fight for the same arena.

(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> Do you have any cycles today or tomorrow to poke at making multiple arenas
> dynamically toggleable (i.e. the arenas are always there, but we only do the
> TLS lookup when a static bool is set)? I was hoping to get to it today but a
> lot of stuff came up, and not sure I will tomorrow either.
>
> We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in
> ServoBindings.{cpp.h} to flip the bit, and then invoke that function in
> parallel.rs.

Nice hack to test things out, but sounds quite horrible as something to land. It seems to me what's needed here is an API where code can get a jemalloc arena, and use it for allocations. Jemalloc 4 has an API for that, and we've been wanting to use it for e.g. isolating some kinds of allocations from others. Considering the state of things with jemalloc 4, we could consider adding a similar feature to mozjemalloc. Then code can opt-in to allocate their own arena-per-thread.
(In reply to Mike Hommey [:glandium] from comment #49)

In the longer term (post 57, not in scope of this bug) I suspect we are going to have to take the hit of moving to an allocator which is as thread friendly as possible, since we don't want to be serialising parallel code either by allocator locking or by the allocator moving memory between threads and causing a lot of coherence misses. Also, having a way to observe and quantify coherence misses (cross core traffic) would be helpful.
(In reply to Mike Hommey [:glandium] from comment #49)
> (In reply to Julian Seward [:jseward] from comment #37)
> > I had gotten
> > the impression that multi-arena meant "one arena per thread" but perhaps
> > that's not so.
>
> IIRC, it's one arena per cpu (or maybe 2), and threads get an arena
> associated randomly, with the hope they'll be evenly distributed. This means
> threads still can fight for the same arena.
>
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> > Do you have any cycles today or tomorrow to poke at making multiple arenas
> > dynamically toggleable (i.e. the arenas are always there, but we only do the
> > TLS lookup when a static bool is set)? I was hoping to get to it today but a
> > lot of stuff came up, and not sure I will tomorrow either.
> >
> > We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in
> > ServoBindings.{cpp.h} to flip the bit, and then invoke that function in
> > parallel.rs.
>
> Nice hack to test things out, but sounds quite horrible as something to
> land.

Why, exactly? It seems like this basically gets us multiple arenas for stylo without infecting the rest of Gecko with it, which is exactly what we want.

> It seems to me what's needed here is an API where code can get a
> jemalloc arena, and use it for allocations. Jemalloc 4 has an API for that,
> and we've been wanting to use it for e.g. isolating some kinds of
> allocations from others. Considering the state of things with jemalloc 4, we
> could consider adding a similar feature to mozjemalloc. Then code can opt-in
> to allocate their own arena-per-thread.

How would we pass that arena through Rust stdlib code that just does malloc?
(In reply to Julian Seward [:jseward] from comment #50)
> (In reply to Mike Hommey [:glandium] from comment #49)
>
> In the longer term (post 57, not in scope of this bug) I suspect we are
> going to have to take the hit of moving to an allocator which is as
> thread friendly as possible

I don't think that argument is going to hold water. Gecko is single-threaded enough that we can't make single-thread allocation performance or memory usage worse.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #52)
> (In reply to Julian Seward [:jseward] from comment #50)
> > (In reply to Mike Hommey [:glandium] from comment #49)
> >
> > In the longer term (post 57, not in scope of this bug) I suspect we are
> > going to have to take the hit of moving to an allocator which is as
> > thread friendly as possible
>
> I don't think that argument is going to hold water. Gecko is single-threaded
> enough that we can't make single-thread allocation performance or memory
> usage worse.

My bad -- ambiguous phrasing. By "take the hit" I meant "deal with the hassle of", not "accept the performance loss".
In any case, I'd like to get this sorted out one way or another within the next few days. We're targeting building stylo by default for end-of-week, and we can't do that with the MOZ_STYLO #define in mozjemalloc.c.

Originally smaug had measured a main-thread performance hit from multiple arenas, but from bug 1361258 comment 22 it sounds like he can't reproduce that anymore. We basically have two options:

(1) Take some more measurements on multiple arenas, and enable them everywhere. This probably involves doing a bit more measurement on malloc-bound main-thread testcases to be sure we don't measure any impact of the TLS lookup, and measuring on AWSY and whatnot to make sure we don't increase fragmentation.

(2) Enable Julian's "adaptive" solution.

What do people prefer?
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jseward)
Flags: needinfo?(bugs)
(1) perhaps. I (and others) should have rerun some tests with MOZ_STYLO, since right now I can't seem to get a similar regression. Did I make some mistake there? Did I run with jemalloc? Did I build stylo (which itself seems to cause a major perf regression, bug 1362982)? I apologize if I caused wasted time :(
Flags: needinfo?(bugs)
Does 'building stylo by default for end-of-week' mean not enabled by default, but all the code, including the additions to Element will be compiled? I assume so. Has anyone measured performance with stylo builds (even if stylo wasn't enabled)?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> (1) Take some more measurements on multiple arenas, and enable them
> everywhere. This probably involves doing a bit more measurement on
> malloc-bound main-thread testcases to be sure we don't measure any impact of
> the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> increase fragmentation.
> (2) Enable Julian's "adaptive" solution.

(1) is preferable, since it's less complexity, and it's where we'd like to get to eventually, but it is also more risky. I am particularly concerned that the TLS lookups may be slow on other P1 platforms.

(2) is safer.

I suggest the following: check whether we can find any slowdowns with multiple arenas on OSX, Windows and Android. Also check AWSY etc. If any problems appear, do (2), else do (1).

I can check OSX easily enough. I *could* do Win32 on a VM, but I'm not happy to trust perf results from a VM, and really we need to test Win64. Can anyone test it natively? And can anyone test Android?
Flags: needinfo?(jseward)
(In reply to Olli Pettay [:smaug] from comment #56)
> Does 'building stylo by default for end-of-week' mean not enabled by
> default, but all the code, including the additions to Element will be
> compiled? I assume so.

Yes.

> Has anyone measured performance with stylo builds (even if stylo wasn't
> enabled)?

That would be a good thing to do. I've looked through all the MOZ_STYLO code and have been assuming that none of it would affect non-stylo perf, but it's certainly possible that the extra word in Element might change something. It's even possible that you might have been testing with this, and that this would explain the difference between your numbers and Julian's.

If either you or Julian could run some measurements with MOZ_STYLO enabled but layout.css.servo.enabled set to false, that would be splendid.
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #57)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> > (1) Take some more measurements on multiple arenas, and enable them
> > everywhere. This probably involves doing a bit more measurement on
> > malloc-bound main-thread testcases to be sure we don't measure any impact of
> > the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> > increase fragmentation.
> > (2) Enable Julian's "adaptive" solution.
>
> (1) is preferable, since it's less complexity, and it's where we'd like
> to get to eventually, but it is also more risky. I am particularly
> concerned that the TLS lookups may be slow on other P1 platforms.
>
> (2) is safer.
>
> I suggest the following: check whether we can find any slowdowns with
> multiple arenas on OSX, Windows and Android. Also check AWSY etc.
> If any problems appear, do (2), else do (1).

This sounds good to me.

> I can check OSX easily enough. I *could* do Win32 on a VM, but I'm not
> happy to trust perf results from a VM, and really we need to test Win64.
> Can anyone test it natively? And can anyone test Android?

I'd suggest measuring the perf first on osx. If there's no regression spotted, write up a step-by-step guide to perform the tests on windows, and get somebody with a native device to do the test (I think Nathan has one?). May also be worth testing both win32 and win64, though perhaps win32 + macosx64 is good enough.
https://bugzilla.mozilla.org/show_bug.cgi?id=1361258#c24 suggests that (1) might be a space problem, with Win7-32 RSS regressions. What sort of latitude do we have w.r.t. space? Does anybody know why only Win7-32 would regress on space?
To reduce the overhead of multiple arenas, could we use just one arena in the chrome process and multiple arenas in the content processes?
IIRC based on some profiles ehsan has seen, stylo would be particularly useful in the parent process, and less in child processes.
Yes, and we're getting rid of the old style system shortly after we ship in any case.
(In reply to Olli Pettay [:smaug] from comment #62)
> IIRC based on some profiles ehsan has seen, stylo would be particularly
> useful in the parent process, and less in child processes.

That's correct.
(In reply to Julian Seward [:jseward] from comment #57)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> > (1) Take some more measurements on multiple arenas, and enable them
> > everywhere. This probably involves doing a bit more measurement on
> > malloc-bound main-thread testcases to be sure we don't measure any impact of
> > the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> > increase fragmentation.
> > (2) Enable Julian's "adaptive" solution.
>
> (1) is preferable, since it's less complexity, and it's where we'd like
> to get to eventually, but it is also more risky. I am particularly
> concerned that the TLS lookups may be slow on other P1 platforms.

Are we talking about an additional TLS lookup per malloc? TLS lookups do show up in profiles in other code every now and then, is there any reason to believe they wouldn't here?
Yeah, in general I'm still of the opinion that 'adaptive' is the right path forward here.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #65)
> > I am particularly concerned that the TLS lookups may be slow on other P1 platforms.
>
> Are we talking about an additional TLS lookup per malloc? TLS lookups do
> show up in profiles in other code every now and then, is there any reason to
> believe they wouldn't here?

Yes, one TLS lookup per malloc. For Linux and Mac it is done as an in-line read of a __thread variable, so I'm not convinced it would show up in a profile. For Windows it's a call to TlsGetValue which I hope would get inlined.

If the platform doesn't support TLS, which is the case when MOZ_MEMORY_ANDROID is defined, then an arena is chosen by hashing: pthread_self() % narenas, which is potentially a 32-cycle hit, if the processor's % implementation can produce only one bit per cycle. This is why I'm concerned.
(In reply to Julian Seward [:jseward] from comment #67) > (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from > comment #65) > > > I am particularly concerned that the TLS lookups may be slow on other P1 platforms. > > > > Are we talking about an additional TLS lookup per malloc? TLS lookups do > > show up in profiles in other code every now and then, is there any reason to > > believe they wouldn't here? > > Yes, one TLS lookup per malloc. For Linux and Mac it is done as an in-line > read of a > __thread variable, so I'm not convinced it would show up in a profile. For > Windows > it's a call to TlsGetValue which I hope would get inlined. Is it? How can the read be inline when we're not linked statically? I'd assume we need at least a function call to __tls_get_addr, even on Linux.
Copy/paste from my email earlier today:

I don't see why rush things for doing multiple-arenas for stylo somehow (adaptive or not). You can still build stylo by default without addressing that.

The adaptive approach is too hackish. It makes *everything* use multiple arenas during the time stylo does its work. Which means *every* other running thread will get memory from arenas other than arenas[0] while multiple-arenas are enabled for stylo. In turn, this means on long running processes, increased fragmentation for any long term allocations that might end up on !arenas[0], unfreed runs, etc.

As I said to Julian yesterday, a middle ground, since rust can't allow us to pass an arena to every allocation, would be to have stylo threads opt-in to use an arena other than arenas[0]. That is, in terms of the current patch in the bug, we'd keep the Gecko_SetMultipleJemallocArenasEnabled thing, but every thread would still have its attributed arena stay arenas[0], until some other function is called that would attribute a different arena. That function would then be run at rust-thread-spawn time.
Flags: needinfo?(mh+mozilla)
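(Again purely as illustration, the opt-in shape described in comment 69 might look something like the sketch below. Every name here is a hypothetical stand-in; the real work happened in bug 1361258.)

  #include <stddef.h>

  typedef struct arena_s arena_t;
  extern arena_t *arenas[];           /* arenas[0] is the default arena */
  extern arena_t *arena_new(void);    /* hypothetical: create a fresh arena */

  /* Each thread's arena; NULL means "keep using arenas[0]". */
  static __thread arena_t *tls_arena = NULL;

  /* Hypothetical opt-in, to be run from the stylo worker-thread start handler
   * (rayon 0.7's Configuration::start_handler), so only stylo workers ever
   * leave arenas[0] and long-lived main-thread allocations stay put. */
  void moz_thread_use_private_arena(void) {
    tls_arena = arena_new();
  }

  /* Inside malloc: threads that never opted in take the NULL fast path. */
  static arena_t *choose_arena(void) {
    return tls_arena ? tls_arena : arenas[0];
  }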
(In reply to Mike Hommey [:glandium] from comment #69)
> I don't see why rush things for doing multiple-arenas for stylo somehow
> (adaptive or not). You can still build stylo by default without addressing
> that.

Yes, true, but you can't get much parallel speedup out of it with single arenas. See comment 17 for some motivating numbers.

> The adaptive approach is too hackish. It makes *everything* use multiple
> arenas during the time stylo does its work. [..]

In effect it makes the heap behaviour more scheduling-sensitive -- and hence, in the case of space problems, unreproducible -- than it is at present. To what extent that is a problem, I have no idea.

> As I said to Julian yesterday, a middle ground, since rust can't allow us to
> pass an arena to every allocation, would be to have stylo threads opt-in to
> use an arena other than arenas[0]. [..]

That would be more principled, but it would take longer to do. We'd need to hack Rayon to provide us with hooks at worker thread start and exit, unless it already provides those. Also, this would necessitate a TLS lookup for every malloc call, even those made by non-Stylo threads, in order to find out which arena to use. By contrast, the current proposal has a fixed overhead of just one (perfectly predictable) branch per call.
(In reply to Julian Seward [:jseward] from comment #70)
> Also, this would necessitate a TLS lookup for every
> malloc call, even those made by non-Stylo threads, in order to find out which
> arena to use. By contrast, the current proposal has a fixed overhead of just
> one (perfectly predictable) branch per call.

No, you can still gate everything on Gecko_SetMultipleJemallocArenasEnabled having been called.
(In reply to Julian Seward [:jseward] from comment #70)
> That would be more principled, but it would take longer to do. We'd need to
> hack Rayon to provide us with hooks at worker thread start and exit, unless
> it already provides those.

https://docs.rs/rayon/0.7.0/rayon/struct.Configuration.html#method.start_handler
(In reply to Mike Hommey [:glandium] from comment #69)
> Copy/paste from my email earlier today:
> I don't see why rush things for doing multiple-arenas for stylo somehow
> (adaptive or not). You can still build stylo by default without addressing
> that.

The point of building stylo by default is so that people don't need special MOZ_STYLO builds to work on stylo, and that includes performance profiling and talos runs. The performance of stylo without multiple arenas is bad enough that I consider this a blocker to building by default.

> The adaptive approach is too hackish. It makes *everything* use multiple
> arenas during the time stylo does its work. Which means *every* other
> running thread will get memory from arenas other than arenas[0] while
> multiple-arenas are enabled for stylo. In turn, this means on long running
> processes, increased fragmentation for any long term allocations that might
> end up on !arenas[0], unfreed runs, etc.

That's a great point - I didn't think about that.

> As I said to Julian yesterday, a middle ground, since rust can't allow us to
> pass an arena to every allocation, would be to have stylo threads opt-in to
> use an arena other than arenas[0]. That is, in terms of the current patch in
> the bug, we'd keep the Gecko_SetMultipleJemallocArenasEnabled thing, but
> every thread would still have its attributed arena stay arenas[0], until
> some other function is called that would attribute a different arena. That
> function would then be run at rust-thread-spawn time.

This sounds like a great plan, and I think it should be easy to implement. Glandium's points in comment 71 and comment 72 are spot-on.
Note that this dovetails very nicely with the recent merge of rayon 0.7, which includes froydnj's work to add the start/exit handlers. Great work by everyone involved in getting this over the line. :-)
Initial patches (from Mike) for the comment 69 proposal, plus some measurements are at bug 1361258 comment 35.
(In reply to Julian Seward [:jseward] from comment #75)
> Initial patches (from Mike) for the comment 69 proposal, plus some
> measurements are at bug 1361258 comment 35.

Do the current patches solve the kludge-y modulus bucketing issue, and guarantee that a separate arena is associated with each thread?
Flags: needinfo?(jseward)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #76)
> (In reply to Julian Seward [:jseward] from comment #75)
> > Initial patches (from Mike) for the comment 69 proposal, plus some
> > measurements are at bug 1361258 comment 35.
>
> Do the current patches solve the kludge-y modulus bucketing issue, and
> guarantee that a separate arena is associated with each thread?

No, but the next iteration will, actually. Do we care about Android at the moment?
Flags: needinfo?(bobbyholley)
Flags: needinfo?(jseward)
(In reply to Mike Hommey [:glandium] from comment #77)
> Do we care about Android at the moment?

No. We won't be enabling Stylo V1 on Android. We'll enable it on Android in a later release.
(In reply to Mike Hommey [:glandium] from comment #77)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #76)
> > (In reply to Julian Seward [:jseward] from comment #75)
> > > Initial patches (from Mike) for the comment 69 proposal, plus some
> > > measurements are at bug 1361258 comment 35.
> >
> > Do the current patches solve the kludge-y modulus bucketing issue, and
> > guarantee that a separate arena is associated with each thread?
>
> No, but the next iteration will, actually.

Great!

> Do we care about Android at the moment?

Not for now, and probably not for 57. We will for 58 (or shortly thereafter).
Flags: needinfo?(bobbyholley)
It seems like this bug is useful as a tracking bug, but its title should be updated for that, because what the title talks about was done in bug 1361258.
It's not clear to me what this bug tracks, exactly. I think we have a reasonable solution in bug 1361258, and should file new issues if there's more work to do.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
No longer blocks: stylo-nightly-build
