Closed Bug 1375925 (opened 7 years ago, closed 7 years ago)

stylo: Investigate why thread pool memory usage is so high

Categories

(Core :: CSS Parsing and Computation, defect)


Tracking


RESOLVED FIXED

People

(Reporter: bholley, Assigned: glandium)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink])

See the discussion surrounding bug 1374824 comment 42. I'm forking this into a separate bug to reduce noise.
Whiteboard: [MemShrink]
Seems like the most straightforward thing to do is shut down the rayon threadpool when not in use.
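For illustration, a minimal sketch of what that could look like with rayon's current builder API; this is not the actual Gecko/Servo code, and the names STYLE_THREAD_POOL, ensure_pool, and release_pool are hypothetical:

```rust
// Hedged sketch: dropping a rayon::ThreadPool joins and terminates its worker
// threads, so "shutting down when not in use" could amount to holding the pool
// in an Option and clearing it after a period of quiescence. All names here
// are hypothetical, not the real Gecko/Servo ones.
use std::sync::Mutex;
use rayon::{ThreadPool, ThreadPoolBuilder};

static STYLE_THREAD_POOL: Mutex<Option<ThreadPool>> = Mutex::new(None);

fn ensure_pool(num_threads: usize) {
    let mut pool = STYLE_THREAD_POOL.lock().unwrap();
    if pool.is_none() {
        *pool = Some(
            ThreadPoolBuilder::new()
                .num_threads(num_threads)
                .build()
                .expect("failed to build style thread pool"),
        );
    }
}

fn release_pool() {
    // Dropping the pool releases the worker threads (and their stacks).
    *STYLE_THREAD_POOL.lock().unwrap() = None;
}
```

As the following comments point out, though, much of the cost comes from the per-thread arenas rather than the idle threads themselves, so the win would depend on how often the pool gets spun up and down.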
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #1)
> Seems like the most straightforward thing to do is shut down the rayon
> threadpool when not in use.

Well, most of the overhead comes from the per-thread arenas, right? That part is kinda orthogonal to whether the thread pool is actually running, and is hard to make on-demand.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #2)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #1)
> > Seems like the most straightforward thing to do is shut down the rayon
> > threadpool when not in use.
>
> Well, most of the overhead comes from the per-thread arenas, right? That
> part is kinda orthogonal to whether the thread pool is actually running, and
> is hard to make on-demand.

We only use per-thread arenas for stylo threads; we request/relinquish them in the thread startup and shutdown functions [1]. Shutdown also unregisters the profiler, which is part of what's using the memory.

[1] https://github.com/servo/servo/blob/6ec95ecb9f5159eafa8d3051c259d566d518f29d/components/style/gecko/global_style_data.rs#L36-L53
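For context, a rough sketch of the shape of those startup/shutdown hooks; this is not the exact code at [1], and the FFI signature of Gecko_SetJemallocThreadLocalArena is an assumption:

```rust
// Rough sketch of the per-thread hooks referenced in [1], not the exact Servo
// code. Each style worker opts into a thread-local jemalloc arena on startup
// and relinquishes it on exit; profiler registration/unregistration happens
// in the same hooks. The FFI signature below is an assumption.
extern "C" {
    fn Gecko_SetJemallocThreadLocalArena(enabled: bool);
}

fn thread_startup(_index: usize) {
    unsafe { Gecko_SetJemallocThreadLocalArena(true) };
    // ... register this thread with the Gecko profiler ...
}

fn thread_shutdown(_index: usize) {
    // ... unregister this thread from the Gecko profiler ...
    unsafe { Gecko_SetJemallocThreadLocalArena(false) };
}
```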
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #3)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #2)
> > (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> > #1)
> > > Seems like the most straightforward thing to do is shut down the rayon
> > > threadpool when not in use.
> >
> > Well, most of the overhead comes from the per-thread arenas, right? That
> > part is kinda orthogonal to whether the thread pool is actually running, and
> > is hard to make on-demand.
>
> We only use per-thread arenas for stylo threads; we request/relinquish them
> in the thread startup and shutdown functions [1].

Yes, but we don't currently have any code to intelligently reclaim the arenas when they're set inactive. See [1].

In general, lots of stuff triggers styling. I guess we could shut down the thread pools if we find ourselves in a totally quiescent state, but we'd need to measure how often we're spinning them up and down, and deal with the memory fragmentation from acquiring and releasing the arenas all the time. In general, I'm pretty skeptical.

[1] http://searchfox.org/mozilla-central/rev/3291398f10dcbe192fb52e74974b172616c018aa/memory/mozjemalloc/mozjemalloc.cpp#2374
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #4)
> (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> #3)
> > (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #2)
> > > (In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment
> > > #1)
> > > > Seems like the most straightforward thing to do is shut down the rayon
> > > > threadpool when not in use.
> > >
> > > Well, most of the overhead comes from the per-thread arenas, right? That
> > > part is kinda orthogonal to whether the thread pool is actually running, and
> > > is hard to make on-demand.
> >
> > We only use per-thread arenas for stylo threads; we request/relinquish them
> > in the thread startup and shutdown functions [1].
>
> Yes, but we don't currently have any code to intelligently reclaim the
> arenas when they're set inactive. See [1].

Sure, but we can fix that ;) glandium just did it that way because it was the quickest way to get stylo built by default. We should block enabling stylo by default on fixing that.
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #5)
> > Yes, but we don't currently have any code to intelligently reclaim the
> > arenas when they're set inactive. See [1].
>
> Sure, but we can fix that ;) glandium just did it that way because it was
> the quickest way to get stylo built by default. We should block enabling
> stylo by default on fixing that.

At present it's never called though, right? It would only be called if we tried to spin the thread pool up and down dynamically, which I'm skeptical of per comment 4.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from bug 1374824 comment #45)
> (In reply to Mike Hommey [:glandium] from bug 1374824 comment #42)
> > Some more data:
> > - Somehow, talos is getting 6(!) style threads per process. I could only
> > gather jemalloc stats for 2 processes, but both of them had 6 style threads.
> > That's 6MB per process, which would bring us to 18MB for 3 processes.
>
> Yes, that matches my mental model. I was a bit surprised in comment 37 to
> see you only talking about three threads, but I realized now you were
> talking about a local machine, which is presumably 2-core (or 4
> non-hyperthreaded cores).
>
> My understanding is that the Talos machines are 4-core intel chips, and that
> we run 6 threads on those cores (per the formula bz described). You can set
> STYLO_THREADS=X as an environment variable to tweak the count if it helps
> your investigation.

I don't know; I think I was expecting a machine with enough cores to trigger 6 style threads to also trigger more than 2 content processes.

> > - In the data I got for the 2 processes, the difference, as seen by jemalloc
> > itself, is around 15.7MB. That's 3+MB more than the expected overhead from
> > the 12 arenas with one chunk per arena, and according to jemalloc, those are
> > all due to fragmentation.
>
> Hm, that's weird. The whole point of allocating 1MB arenas is to avoid
> fragmentation...

To make my observation clearer: I see a similar amount of data allocated in the main arena in both cases, but a difference in how much is mapped for it. The only explanation is different allocation patterns with different fragmentation. It could just as much be noise.

> > - Extrapolated to 3 processes, we're still far from 50MB, so either the
> > process I didn't get a jemalloc report for has even more overhead than the 2
> > others somehow, or there are some non-jemalloc allocations happening
> > somehow. I'm also not excluding jemalloc under-reporting its own overhead
> > for those arenas.
> >
> > - All style arenas but one per process have 3112 bytes allocated in small
> > allocations, and 36864 bytes in one large allocation. The other one per
> > process has 5192 bytes in small allocations, and 36864 bytes in one large
> > allocation. The small allocations have the following exact sizes: 1 * 8,
> > 2 * 16, 1 * 1024, 1 * 2048... the arena with 5192 bytes adds 1 * 32 and
> > 1 * 2048.
> >
> > - I tried identifying where these allocations come from, but I hit
> > https://github.com/rust-lang/rust/issues/41775 which prevents getting
> > good backtraces. Some manual work identified some things related to rust
> > native strings and ffi strings, one 576-byte allocation from
> > tools/profiler/core/platform.cpp:2021, one allocation for the thread name
> > for the profile, one 32808-byte allocation from
> > tools/profiler/core/ThreadInfo.cpp:27, etc. All in all, that's a lot of
> > things due to the profiler registration added in
> > https://github.com/servo/servo/commit/82bffefbf7820b58d2ddbf0440326af799d1e36b
> > happening after the thread switched to use a separate arena, per
> > https://github.com/servo/servo/commit/642cd08f21727ee35dc3dace14d0c9ad5837f380 .
> > It actually seems to make sense to move those allocations to the main arena.
>
> Yeah, agreed. We really only need the fancy per-thread arenas for the style
> traversal.

> > If I force that to happen, I'm down to 2096 bytes for most arenas, or 4176
> > bytes for one of them per process, and it looks like they could all come
> > from rayon, but I'm still waiting for rr to hit the breakpoint I've set in
> > jemalloc. I might send a PR to move those allocations to the main arena,
> > but I first want to see why I was getting jemalloc segfaults in gdb in that
> > case, suggesting some race condition.

Funnily enough, after an hour of trying, rr doesn't hit a breakpoint in Gecko_SetJemallocThreadLocalArena or in choose_arena when it returns an arena != arenas[0], which makes no sense. The breakpoints are hit in gdb, but then I get random segfaults... I guess I should debug those...
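As an aside, the STYLO_THREADS override quoted above is just an environment variable; a hedged sketch of how such an override can be read follows, with style_thread_count a hypothetical helper and the default-count formula deliberately left out:

```rust
// Hypothetical helper illustrating a STYLO_THREADS override; the real
// default-count formula ("per the formula bz described") is not reproduced.
use std::env;

fn style_thread_count(default_count: usize) -> usize {
    env::var("STYLO_THREADS")
        .ok()
        .and_then(|s| s.parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(default_count)
}
```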
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #5)
> Sure, but we can fix that ;) glandium just did it that way because it was
> the quickest way to get stylo built by default. We should block enabling
> stylo by default on fixing that.

That's bug 1364359, btw.

Note that, interestingly, the allocations that do happen on those arenas might also prevent them from being entirely cleaned up... maybe in that case we could reassign the remaining chunks to the main arena (but we need to keep in mind that we can't do the same thing for arenas meant to separate allocations for security purposes).

Now, I'd argue that if the Tp5 regression had been exactly number of processes * number of style threads * 1MB, I wouldn't have cared that much: those nproc * nthread * 1MB chunks are largely unused anyway, and don't consume much RSS until styling happens. But the regression is not exactly nproc * nthread * 1MB, and that's why I'm trying to find out what it is.
So, this turns out to be "simply" the thread stacks, which default to 2MB + a guard page.

Now, arguably, considering thread stack sizes differ across platforms, it would be better to explicitly set a size for them that we expect to work.
Depends on: 1376883
(In reply to Mike Hommey [:glandium] from comment #9)
> So, this turns out to be "simply" the thread stacks, which default to 2MB +
> a guard page.
>
> Now, arguably, considering thread stack sizes differ across platforms, it
> would be better to explicitly set a size for them that we expect to work.

OK, filed bug 1376883 to shrink the stack sizes.
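For illustration, a minimal sketch of what requesting an explicit stack size could look like with rayon's current builder API; the 512 KiB figure is only a placeholder, not the value chosen in bug 1376883:

```rust
use rayon::ThreadPoolBuilder;

// Placeholder value; the size actually chosen for Gecko lives in bug 1376883.
const STYLE_THREAD_STACK_SIZE: usize = 512 * 1024;

fn build_style_pool(num_threads: usize) -> rayon::ThreadPool {
    // Request an explicit per-thread stack size instead of relying on the
    // platform default (~2MB plus a guard page).
    ThreadPoolBuilder::new()
        .num_threads(num_threads)
        .stack_size(STYLE_THREAD_STACK_SIZE)
        .build()
        .expect("failed to build style thread pool")
}
```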
I guess we can close this bug now. I'm leaving myself a ni?, though, so I don't forget to follow through on the last part of bug 1374824 comment 42, after first confirming that the weird race-condition-like crash under gdb is not an actual problem.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(mh+mozilla)
Resolution: --- → FIXED
Flags: needinfo?(mh+mozilla)