Closed Bug 1399962 Opened 7 years ago Closed 6 years ago

Consider reducing the number of content processes for dual core CPUs

Categories

(Core :: DOM: Content Processes, enhancement, P3)


Tracking


RESOLVED WONTFIX
Performance Impact: high

People

(Reporter: ehsan.akhgari, Unassigned)

References

Details

(Keywords: perf, topperf)

Attachments

(1 file)

Please see the blocking bugs.  A user with a computer that has a dual-core (no hyper-threading) CPU is reporting performance issues that seem to mostly come from starvation issues.

Have we considered reducing the number of content processes we have on such systems to 2 or something like that?
Flags: needinfo?(mrbkap)
Flags: needinfo?(gkrizsanits)
Are these bugs e10s-multi only or a combination of e10s-multi + quantum DOM + stylo + other parallelization that is new to 57?

And yes this came up. I talked to Bill and the consensus was that we should determine the number of content processes and the number of quantum DOM main threads based on the number of cores/available memory. I also talked to Jeff from product and he wanted to back up our claims with numbers, before we start tuning things based on assumptions. So the current plan is to first land some probes, do some experiments and then tune what we have accordingly.

Currently I'm working all e10s-multi related things alone so it's all going quite slow, and my backlog is quite long. If we have evidence (or a good reason to believe) that this is a critical issue and we need a more urgent fix, we might want to change our strategy.
Flags: needinfo?(gkrizsanits)
Hi, I had filed the blocking bugs (both causing issues of high CPU usage, slow rendering, and choppy scrolling).

@mike was very kind to help out, and finally, after nearly 10 days of figuring out what caused these two, I got some info about one bug today (again, thanks to @mike):

https://bugzilla.mozilla.org/show_bug.cgi?id=1397092#c161
Flags: needinfo?(gkrizsanits)
Flags: needinfo?(ehsan)
These bugs affect only FF57; I tested with FF56b12 (released today) for confirmation, both in one-process mode.
With e10s off, FF56b12 is even faster, but FF57 still has the issues.

Maybe this answers:

>Are these bugs e10s-multi only or a combination of e10s-multi + quantum DOM + stylo + other parallelization that is new to 57?
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #1)
> Are these bugs e10s-multi only or a combination of e10s-multi + quantum DOM
> + stylo + other parallelization that is new to 57?

This is on Nightly, so the latter.  e10s-multi doesn't live separate from those other features (but also note that Quantum DOM scheduling isn't enabled yet, and I'm not sure what other new parallelization work has happened recently).  So, mostly e10s-multi + stylo.  :-)

> And yes this came up. I talked to Bill and the consensus was that we should
> determine the number of content processes and the number of quantum DOM main
> threads based on the number of cores/available memory. I also talked to Jeff
> from product and he wanted to back up our claims with numbers, before we
> start tuning things based on assumptions. So the current plan is to first
> land some probes, do some experiments and then tune what we have accordingly.

That seems fair.

> Currently I'm working all e10s-multi related things alone so it's all going
> quite slow, and my backlog is quite long. If we have evidence (or a good
> reason to believe) that this is a critical issue and we need a more urgent
> fix, we might want to change our strategy.

According to the profiles I have seen from shellye's dual-core systems, there are a *lot* of thread starvation issues going on, where threads just don't get scheduled on the CPU, sometimes for tens of milliseconds.  Each content process adds a few threads to the number of threads waiting to be scheduled, and reducing that number helps reduce the chances of starvation when there's a lot of competition for the system to schedule threads.  Right now I don't think we have anything more concrete, and due to the nature of thread starvation, I doubt we can get better evidence than that.

We have telemetry indicating that we have a fair number of users with 2 logical cores on their machines: https://sql.telemetry.mozilla.org/queries/3836 so I'd say that this is an important issue for us to figure out (for the *special case* of dual-core CPUs, even if that means acting before figuring out a general solution).  How that should stack up against the rest of your priorities is better judged by you.  :-)  It's sad that we now have one person working on e10s, to be honest.
Flags: needinfo?(gkrizsanits)
Flags: needinfo?(ehsan)
Also, shellye, *please* stop needinfoing people on every comment.  Please allow us to continue this discussion.  The reason I filed this bug separately was that it's orthogonal to the specific issue you were seeing before, so I'd appreciate if you let us focus on what this bug is about.  :-)
Flags: needinfo?(mrbkap)
Priority: -- → P3
There is one forgotten bug, #1066789. It's about implementing a heuristic that should help improve performance, instead of simply forcing the "dom.ipc.processCount" preference to "4", which in most cases just decreases performance, as nearly 70% of Firefox users are still on 2-thread CPUs ( https://hardware.metrics.mozilla.com/ ). What's more, let's not forget about the other Firefox processes running in the background, so add 3 more.
Severity: normal → major
Depends on: 1066789
Keywords: perf, topperf
OS: Unspecified → All
Hardware: Unspecified → All
Whiteboard: [qf]
(In reply to Virtual_ManPL [:Virtual] - (please needinfo? me - so I will see your comment/reply/question/etc.) from comment #6)
> There is one forgotten bug #1066789. It's about implanting heuristic, which
> should help improving performance, not just simply forcing
> "dom.ipc.processCount" preference with "4" value, which in most cases just
> decreases performance, as nearly 70% of Firefox users are still using 2
> threaded CPUs ( https://hardware.metrics.mozilla.com/ ). What's more, let's
> not forget about other Firefox processes in background, so +3 more.

It's not a forgotten thing, it's a resource issue. But I hope someone will finish the work. Anyway, that is the reason I'm working on bug 1409002: to have an easy way to run e10s-multi performance tests on various machines and make an informed decision about the right number of allowed processes. Once the Talos test is set up, it can also serve as an easy way to identify possible performance issues, since the profiler is included in the test. This should be combined with some telemetry probes around the message queues in each process at some point, and tuned together with the number of quantum DOM main threads per process. At least that was the plan.
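To make the idea concrete, a heuristic of the kind discussed here might look roughly like the sketch below. This is purely illustrative (the function name, thresholds, and memory cutoff are my assumptions, not anything that landed); it just shows "pick the process cap from logical cores and physical memory":

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: choose a content-process cap from the logical core
// count and the amount of physical memory. Dual-core (or failed-detection)
// machines get a smaller cap; others keep the default of 4
// (the dom.ipc.processCount default at the time of this bug).
int ChooseContentProcessCount(int aLogicalCores, uint64_t aMemoryBytes) {
  const uint64_t kFourGiB = 4ull * 1024 * 1024 * 1024;
  if (aLogicalCores <= 0) {
    return 1;  // Core detection failed; be conservative.
  }
  int cap = 4;                     // Current default process count.
  if (aLogicalCores < 4) {
    cap = 2;                       // Dual-core / no-HT machines.
  }
  if (aMemoryBytes < kFourGiB) {
    cap = std::min(cap, 2);        // Low-memory machines.
  }
  return std::max(1, cap);
}
```

The thresholds (fewer than 4 logical cores, 4 GiB of RAM) are placeholders; the whole point of the Talos test and telemetry probes is to replace guesses like these with measured values.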
Whiteboard: [qf] → [qf:f63][qf:p1]
Here's a first stab at this. I have a few concerns about this patch:

- I don't know what PR_GetNumberOfProcessors will return on automation (and
so what the effect of this patch will be there). We might need to fix bug
1409002 sooner rather than later.
- Intuitively, this bug should affect users with fewer than 3 cores more
than users with 4 or more cores. This patch is core-agnostic. I wonder if
it should just ask "is the number of cores less than 3" and throttle in that
case instead of being generic.
- As far as I can tell, PR_GetNumberOfProcessors doesn't take hyperthreading
into account. We currently don't seem to have an API to return the number of
"logical" cores available. Is that a big problem? It's easy to fix this on
Windows, I'm looking into Linux and OSX.
- What percentage of our users are going to hit this? Is this going to
effectively disable multi for most of our users? Does that mean that multi
actually *reduces* Firefox's responsiveness for a majority of them?

Review commit: https://reviewboard.mozilla.org/r/215792/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/215792/
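The core-agnostic throttling the patch describes could be sketched like this (hypothetical names, not the actual patch code): never allow more content processes than reported processors, and fall back to the pref value when PR_GetNumberOfProcessors()-style detection fails with a negative result:

```cpp
#include <algorithm>

// Sketch of core-agnostic clamping: aPrefCount is the configured
// dom.ipc.processCount value, aNumProcessors is what processor detection
// returned (may be -1 on failure, as PR_GetNumberOfProcessors can be).
int ClampProcessCount(int aPrefCount, int aNumProcessors) {
  if (aNumProcessors < 1) {
    return aPrefCount;  // Detection failed; keep the configured value.
  }
  // Never exceed the processor count, but always allow at least one process.
  return std::max(1, std::min(aPrefCount, aNumProcessors));
}
```

The alternative raised in the comment above, asking "is the number of cores less than 3?" and throttling only in that case, would replace the generic `std::min` with an explicit dual-core check.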
Comment on attachment 8945664 [details]
Bug 1399962 - Limit number of child processes by number of processors.

Jim and Jeff, your comments about this patch and/or answers to my questions in comment 10 (also in the commit message for this patch) would be greatly appreciated.

I think I need to fix the hyperthreading thing before requesting review and landing this.
Attachment #8945664 - Flags: feedback?(jmathies)
Attachment #8945664 - Flags: feedback?(jgriffiths)
My feedback is that I think taking HT into account is very, very important. The tweet from the other day was concerning a system with 2 cores and only 2 threads. Most modern Intel CPUs have HT, so they have 4 threads, and I suspect very different real-world performance characteristics.

And to your point, the group with only 2 cores is 68% of our current users according to the Hardware Report[1]. The hardware report doesn't report on threads.


[1] https://hardware.metrics.mozilla.com/#goto-cpu-and-memory
There is a hardware report issue [1] and a redash query[2] that might help with HT.

[1] https://github.com/mozilla/firefox-hardware-report/issues/60
[2] https://sql.telemetry.mozilla.org/queries/47219/source

I also created this google sheet from the data, in case you're like me and have a hard time understanding things unless they are in a 3d pie chart:

https://docs.google.com/spreadsheets/d/1jjV_aZh5xbhwbJHF7pXUX7eEOT9bX-l4RIheiOSGOx4/edit#gid=0
Comment on attachment 8945664 [details]
Bug 1399962 - Limit number of child processes by number of processors.

r- because we can't just look at physical cores.
Attachment #8945664 - Flags: feedback?(jgriffiths) → feedback-
Very nice visualization you have there.
Anyway, I think taking memory constraints into account should also be very important.

For example, I'm sure that on some Atom tablet with only 2GB of RAM, even though they have 4 (real) cores, the overhead of 4-5 processes might start to cause swap problems.
(In reply to mirh from comment #15)
> Anyway, I think like even taking into account memory constraints should be
> very important. 

I agree with this, but fixing this problem is out of scope for this particular bug. Let's track that work in a new bug (especially because memory is such a moving target as content processes use more or less memory).
(In reply to Blake Kaplan (:mrbkap) from comment #10)
> Created attachment 8945664 [details]
> Bug 1399962 - Limit number of child processes by number of processors.
> 
> Here's a first stab at this. I have a few concerns about this patch:
> 
> - I don't know what PR_GetNumberOfProcessors will return on automation (and
> so what the effect of this patch will be there). We might need to fix bug
> 1409002 sooner rather than later.

Detailing the hardware might help to answer this. I think it'll be different depending on what's running.

> - As far as I can tell, PR_GetNumberOfProcessors doesn't take hyperthreading
> into account. We currently don't seem to have an API to return the number of
> "logical" cores available. Is that a big problem? It's easy to fix this on
> Windows, I'm looking into Linux and OSX.

From what I'm reading on the web, GetSystemInfo's dwNumberOfProcessors returns logical cores. Testing locally I can confirm this on Win7.

> - What percentage of our users are going to hit this? Is this going to
> effectively disable multi for most of our users? Does that mean that multi
> actually *reduces* Firefox's responsiveness for a majority of them?

This should be available in telemetry data.
Comment on attachment 8945664 [details]
Bug 1399962 - Limit number of child processes by number of processors.

I think we need to understand the issue better and not guess. we should also consider experimenting with this using shield unless there's a pressing need to get something out there.

If we do ship this, I would suggest tying the logic to physical cores.
Attachment #8945664 - Flags: feedback?(jmathies) → feedback-
(In reply to Jim Mathies [:jimm] from comment #17)
> From what I'm reading on the web, GetSystemInfo's dwNumberOfProcessors
> returns logical cores. Testing locally I can confirm this on Win7.

Yikes, I managed to end up on the wrong MSDN page. Indeed, SYSTEM_INFO does return logical cores.

(In reply to Jim Mathies [:jimm] from comment #18)
> I think we need to understand the issue better and not guess. we should also
> consider experimenting with this using shield unless there's a pressing need
> to get something out there.

It seems like there have been several reports of Firefox 58 being pretty slow on machines with only two cores. I guess we could do a shield study to better understand the relationship between cores/processors to threads and... hangs? I'm not entirely sure what we would be tracking against.

In the meantime, though, it seems like it wouldn't be too much of a stretch to restrict the number of content processes for users with fewer than 3 cores.

> If we do ship this, I would suggest tieing the logic to physical cores.

Talking to Jeff, he had the opposite inclination. I think Overholt has a machine with only two processors handy. It would be useful to hear his experience with 4 vs 2 or 1 content processes. Andrew?
Flags: needinfo?(overholt)
(In reply to Blake Kaplan (:mrbkap) from comment #19)
> I think Overholt has a
> machine with only two processors handy. It would be useful to hear his
> experience with 4 vs 2 or 1 content processes. Andrew?

I set it to 8 the other day and general usage of Firefox seems ok (granted, I haven't done it 24/7 but it was more than 5 minutes).
Flags: needinfo?(overholt)
Blake, I filed bug 1437629 because it looks like I was only getting 4 content processes (I think my answer in comment 20 still holds, though).
(In reply to Blake Kaplan (:mrbkap) from comment #19)
> (In reply to Jim Mathies [:jimm] from comment #17)
...
> > If we do ship this, I would suggest tieing the logic to physical cores.
> 
> Talking to Jeff, he had the opposite inclination. I think Overholt has a
> machine with only two processors handy. It would be useful to hear his
> experience with 4 vs 2 or 1 content processes. Andrew?

My reasoning is, a 2-core, non-HT, low-RAM, spinning-disk system is much more likely to swap out and hang than a 2 physical/4 logical-core, large-RAM, SSD-equipped system, even if both systems have a 2-core CPU running at 2.3 GHz.

My hypothesis for a shield study would, roughly, be that decreasing the number of content processes in the test cohort to 1 results in improved performance ( hangs? ) vs the control. I would want to graph that performance metric using core configuration as the other axis so we can see if some types of cpu hang more than others. We should also be tracking RAM counts as this may contribute.

I think we also need some machine population analysis done looking for clumps of users on crappy machines, eg :

* lower than 4GB RAM
* spinning hard disks ( see bug #1403817 )
* <4 physical *and* logical cores ( no hyperthreads )

I suspect we should identify these machines in genpop and set the content process count to 1.
I just came across this bug and wanted to add my 2p because making changes on the number of available cores/threads is something we already do in the codebase and which involves some duplicate code we might want to get rid of. First of all PR_GetNumberOfProcessors() returns the number of logical processors on Windows, Mac (IIRC) and Linux so for a processor with two dual-threaded cores it'll return a count of 4. The JIT also relies on the number of cores to tune some functionality, and it has its own function which has a slightly different implementation but also returns the number of logical cores [1].

I think we should unify this functionality and provide functions to distinguish between physical cores and logical cores. 

Additionally we might want to audit the places where we're making these decisions as some fail to take into account that PR_GetNumberOfProcessors() can return -1 on failure. [2] and [3] for example will assume a negative number of threads in that case.

[1] https://searchfox.org/mozilla-central/rev/f5fb323246bf22a3a3b4185882a1c5d8a2c02996/js/src/threading/CpuCount.h#14
[2] https://searchfox.org/mozilla-central/rev/f5fb323246bf22a3a3b4185882a1c5d8a2c02996/dom/media/platforms/agnostic/AOMDecoder.cpp#74
[3] https://searchfox.org/mozilla-central/rev/f5fb323246bf22a3a3b4185882a1c5d8a2c02996/dom/media/platforms/agnostic/VPXDecoder.cpp#58
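A defensive wrapper along the lines proposed might look like this sketch (the name is hypothetical; here `std::thread::hardware_concurrency()`, which reports 0 on failure, stands in for PR_GetNumberOfProcessors(), which reports -1):

```cpp
#include <thread>

// Hypothetical unified helper: return the logical core count, never a
// non-positive value, so callers sizing thread pools or process counts
// can't end up with a negative or zero thread count on detection failure.
unsigned LogicalCoreCount() {
  // hardware_concurrency() may return 0 when the count is not computable.
  unsigned n = std::thread::hardware_concurrency();
  return n > 0 ? n : 1;  // Conservative fallback.
}
```

A physical-core variant would need platform-specific code (e.g. CPUID topology leaves, sysfs, or sysctl), which is part of why unifying this in one place is attractive.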
(In reply to Gabriele Svelto [:gsvelto] from comment #23)
> I just came across this bug and wanted to add my 2p because making changes
> on the number of available cores/threads is something we already do in the
> codebase and which involves some duplicate code we might want to get rid of.
> First of all PR_GetNumberOfProcessors() returns the number of logical
> processors on Windows, Mac (IIRC) and Linux so for a processor with two
> dual-threaded cores it'll return a count of 4. The JIT also relies on the
> number of cores to tune some functionality, and it has its own function
> which has a slightly different implementation but also returns the number of
> logical cores [1].

Yes, PR_GetNumberOfProcessors() indeed returns the number of logical cores, not physical cores, I believe.  We also rely on this for our implementation of Navigator.hardwareConcurrency.
Whiteboard: [qf:f63][qf:p1] → [qf:f64][qf:p1]
Whiteboard: [qf:f64][qf:p1] → [qf:p1:f64]
This seems a bit up in the air. From what I can tell:

  #1 - Jeff f-'d b/c we thought it was using physical cores instead of logical cores (comment 14)
  #2 - We then figured out that was wrong
  #3 - Jim f-'d b/c he thought we *should* be using physical cores (comment 18), and also we need more info
  #4 - Jeff had a somewhat thorough proposal for a shield study in comment 22
  #5 - Gabriele points out that we should probably be sharing all this logic in places that make decisions based on core count

#5 is more of a follow-up.

We need to decide whether physical cores or logical cores matter more (the status quo is logical cores, but our telemetry doesn't capture that, I guess). It's possible the shield study can help us with that. Jeff, do you know who can work on the shield study? I'm interested in bumping the process count to 8 on Windows Nightly in bug 1470280, but Blake has reservations because this was never resolved.
Flags: needinfo?(jgriffiths)
Just another semi-related thing: considering that most machines with fewer than 4 logical cores are definitely old and thus have less memory, this would help with OOM crashes too. On Win64 we have >90% of crashes due to commit space exhaustion, and reducing the number of processes should definitely help there.
(In reply to Gabriele Svelto [:gsvelto] from comment #26)
> Just another semi-related thing: considering most machines with less than 4
> logical cores are definitely old and thus have less memory then this would
> help with OOM crashes too. On Win64 we have >90% of crashes due to commit
> space exhaustion and reducing the number of processes should definitely help
> there.

Do we have telemetry to back up an increase in commit space related crashes when we cranked up to 4 content processes for Win64 users?
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #25)
...
>   #3 - Jim f-'d b/c he thought we *should* be using physical cores (comment
> 18), and also we need more info


We can get physical and logical cores from telemetry:

https://sql.telemetry.mozilla.org/queries/47219

I think I mostly care about physical.

I wanted to have a more nuanced analysis of hardware capabilities based on some combo of RAM, Cores, speed and HDD vs SSD, that got partly stalled because we just can't detect SSD status, see bug 1403817. 

Another factor is, we're actively looking at producing WIN10 / arm64 builds of Firefox and the telemetry will have to support these distinctly as well; the CPU architecture and GPUs will all be relatively exotic to us.
Flags: needinfo?(jgriffiths)
(In reply to Eric Rahm [:erahm] (please no mozreview requests) from comment #27)
> Do we have telemetry to back up an increase in commit space related crashes
> when we cranked up to 4 content processes for Win64 users?

No, I'm not even sure that we still have telemetry data for that period, but currently the overwhelming majority of OOM crash pings show machines running out of commit-space: https://sql.telemetry.mozilla.org/queries/52834#146985

I've done some further investigation on this phenomenon but it's way OT for this bug, shoot me an e-mail if you want more info.
AFAIK the only way the logical core count differs from the physical one is hyper-threading. And while that's really not like having a second physical core, as far as multitasking is concerned you are more than half-way there, IMO (as for raw performance, I think you maybe get a +30% increase in heavily multithreaded applications).

And TBH, it seems fairer to put a Northwood/Prescott Pentium 4 in the same bag as a modern-day Celeron than to pretend a mobile i7 (or at least those sold until a year ago) is on the same level as the latter.

Fission is moving us to a model where we aim to significantly scale up the number of content processes, to isolate web content running from different domains from one another.

So I don't think we're going to be doing what the bug summary suggests. I'm going to WONTFIX for now.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX

Well, pity.
I'd still hope you won't dump "modularization", unlike Chrome, which for the life of it could not even start with only a few processes.

[It's not my fight anymore, but] in Firefox, with enough tinkering with some parameters*, memory usage can easily decrease by 30%.

* dom.ipc.process, browser.tabs.remote.autostart, extensions.webextensions.remote, layers.gpu-process.enabled, media.rdd-process.enabled

Performance Impact: --- → P1
Whiteboard: [qf:p1:f64]