Open Bug 1834977 Opened 1 year ago Updated 1 year ago

layout.css.stylo-threads=0 and/or layout.css.stylo-parallelism-threshold=0 improves sp3 numbers on Android significantly

Categories

(Core :: CSS Parsing and Computation, defect)

Tracking

()

People

(Reporter: smaug, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(1 file)

Seems like it would be good to understand this a little better. I could imagine it might be device-dependent, and we don't want to overrotate on the devices we happen to have in CI.

Yeah, we've reproduced this on multiple devices, but others don't show the same issue.

On my device and Jesup's there's a somewhat reasonable explanation, which is that there's only one P-core even though there are multiple E-cores.

But on the A51 in automation there are four and four, so that might not be it.

So my current theory, which I need to verify, is about the priority of the stylo threads. Other potential avenues for investigation are whether more style sharing is somehow disproportionately better on Android, whether Android has a lot more context-switching overhead, or something related to jemalloc/TLS arenas.

But yeah I'd avoid disabling parallelism without really understanding this.

Here are some results from testing different values for stylo-threads on a variety of devices. I was struggling to get results on some lower-end devices (perhaps I was hitting the memory pressure clearing markers causing an exception issue), but I can try again if need be.

Also caveat that the results were fairly noisy, and perhaps I should have been capturing more data than just the score, but it was fairly time consuming already.

https://docs.google.com/spreadsheets/d/1FcZoy85SH3eXFARCyFYOx2ZnYnctB8e-9pISOi-bgzw/edit?usp=sharing

To me it looks like disabling parallelism doesn't really regress anywhere. It seems to make little difference on homogeneous cores. On a 4+4 big/little configuration we sometimes see some improvement. And on devices where the CPU configuration is 1+1+6 or 2+2+4 we see a large improvement.
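To make the "1+1+6", "2+2+4" and "4+4" labels above concrete, here is a minimal sketch of how one might classify a device by grouping cores on their maximum frequency. It assumes the standard Linux/Android sysfs file `cpuinfo_max_freq` is readable for every core; the function names `cluster_sizes` and `read_max_freqs_khz` are made up for illustration and are not part of Gecko or Stylo.

```rust
use std::collections::BTreeMap;
use std::fs;

/// Groups per-core maximum frequencies (in kHz) into clusters and returns
/// the number of cores in each cluster, fastest cluster first.
/// Hypothetical helper: cores with identical max frequency are treated
/// as one cluster, which is an assumption, not a guarantee.
fn cluster_sizes(max_freqs_khz: &[u64]) -> Vec<usize> {
    let mut counts: BTreeMap<u64, usize> = BTreeMap::new();
    for &freq in max_freqs_khz {
        *counts.entry(freq).or_insert(0) += 1;
    }
    // BTreeMap iterates in ascending key order; reverse it so the
    // highest-frequency cluster comes first.
    counts.values().rev().copied().collect()
}

/// Reads cpuinfo_max_freq for cpu0, cpu1, ... from sysfs and stops at the
/// first core whose file is missing or unparsable (Linux/Android only).
fn read_max_freqs_khz() -> Vec<u64> {
    let mut freqs = Vec::new();
    for cpu in 0.. {
        let path = format!("/sys/devices/system/cpu/cpu{cpu}/cpufreq/cpuinfo_max_freq");
        let parsed = fs::read_to_string(&path)
            .ok()
            .and_then(|s| s.trim().parse::<u64>().ok());
        match parsed {
            Some(freq) => freqs.push(freq),
            None => break,
        }
    }
    freqs
}
```

Calling `cluster_sizes(&read_max_freqs_khz())` on a device would give the cluster layout to record next to each benchmark score. Offline cores may not expose the file, so the result should be treated as a hint.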

(In reply to Emilio Cobos Álvarez (:emilio) from comment #3)

A try with a bigger parallelism threshold: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=46e0f500a3998d351f8c72b787d34414fc4edf25&newProject=try&newRevision=e3778acc9590726d0eff9f9e0b8556693ecb1e14&page=1

This seems to improve Android and Windows. This just doubles the parallelism threshold but not the work unit size, so that we guarantee we have full work for two threads before we switch to parallel mode. Does that change seem objectionable, Bobby?
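As a rough illustration of the change described above, the sketch below models the switch to parallel mode as a comparison against a threshold expressed in whole work units. It is not Stylo's actual code: `should_go_parallel`, its parameters, and the unit size of 16 used in the example are all assumptions, and the special meaning of pref values such as 0 is deliberately not modeled.

```rust
/// Returns true when there is enough pending work to fill at least
/// `min_full_units` work units of `work_unit_size` nodes each.
/// Hypothetical model: the "parallelism threshold" is taken to be
/// work_unit_size * min_full_units.
fn should_go_parallel(pending_nodes: usize, work_unit_size: usize, min_full_units: usize) -> bool {
    pending_nodes >= work_unit_size * min_full_units
}

/// Number of completely filled work units for a given amount of pending work.
fn full_work_units(pending_nodes: usize, work_unit_size: usize) -> usize {
    pending_nodes / work_unit_size
}
```

With an assumed unit size of 16 and 20 pending nodes, a threshold of one unit goes parallel with only one full unit to hand out, while a threshold of two units stays sequential until 32 nodes are pending, at which point two threads each get a full unit.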

Flags: needinfo?(bholley)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)

This seems to improve Android and Windows. This just doubles the parallelism threshold but not the work unit size, so that we guarantee we have full work for two threads before we switch to parallel mode. Does that change seem objectionable, Bobby?

That seems like a reasonable change to make (it's worth testing it locally in a few configurations as well, on various workloads).

That said, I think it doesn't really bring us closer to understanding why parallelism doesn't seem to be working as we expect on Android. The e-core thing is a plausible theory, but Jamie indicated that disabling parallelism also doesn't regress devices with homogeneous cores, which isn't what I'd initially expect. So there might be another effect going on that could be useful to understand, and ideally we'd do some profiling and investigation on such a device.

Flags: needinfo?(bholley)
Severity: -- → S3

Also, the MotionMark 1.2 score seems to improve on the A54 if the value of the pref is 0.

Some comparison:
https://share.firefox.dev/3C1XB08 default settings
https://share.firefox.dev/3ozUutg stylo-threads=0

I don't much trust the actual times reported by those particular profiles, but whatever the profiler does manage to capture in the stylo threads is mostly futex_wait. That shows up a lot on desktop too. Is that expected?

With 0 stylo threads, the main thread just keeps busy and CPU usage is around 90%. With stylo threads we have multiple threads, but reported CPU usage is way lower, at least for some of the threads.

Since stylo is using so many threads, some of them do get run on little cores. Even limiting to just 2 threads isn't as good as 0, though. But I expect that the AndroidUI thread in the parent process, the main thread of the content process, and something else too get to use the big cores by default, so stylo might run mostly on little cores. This needs some more investigation.
And once we figure out how to run the main thread on the fastest possible core on devices which have a single very fast core, the difference here would be even larger, I think.

Do we always try to use all the available stylo threads even if there isn't much work to do? Does stylo always wait for all the threads to acknowledge to the main thread that they have finished their work? (In other words, do we always wake up all 6 threads?)

Summary: layout.css.stylo-threads=0 improves sp3 numbers on Android significantly → layout.css.stylo-threads=0 and/or layout.css.stylo-parallelism-threshold=0 improves sp3 numbers on Android significantly

Based on profiling using Android GPU Inspector, stylo threads are almost always run on little cores (on the A54). And while they run on other threads, the big core which runs the main thread is occasionally slowed down, and it takes a bit of time to ramp it up again.

Attachment #9335874 - Attachment description: WIP: Bug 1834977 - non-parallel stylo is faster on Android, so change the pref to layout.css.stylo-threads=0, r=emilio → WIP: Bug 1834977 - non-parallel stylo is faster on Android, so change the pref to layout.css.stylo-parallelism-threshold=0, r=emilio