Open Bug 1778214 Opened 2 years ago Updated 2 years ago

When downloading a large file, the socket thread has a higher priority than the foreground content process, which becomes CPU starved on low core machines

Categories

(Core :: Networking, defect, P2)

Firefox 103
x86_64
Windows 10
defect

Tracking

()

Performance Impact medium

People

(Reporter: thee.chicago.wolf, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, perf:pageload, Whiteboard: [necko-triaged])

Attachments

(1 file)

This issue is probably related to bug 1740941 / bug 1714846.

STR:

  1. Open Task Manager and note FF CPU usage.

  2. Open a new tab and visit https://officecdn.microsoft.com/db/492350F6-3A01-4F97-B9C0-C7C6DDF67D60/media/en-US/ProPlus2019Retail.img to download a 4GB file.

  3. As it's downloading, check Task Manager for CPU usage. Then, in the same tab downloading ProPlus2019Retail.img, type some website into the URL bar and press Enter to navigate to it.

Expected result:

  1. FF will immediately navigate to the URL entered.

Actual result:

  1. FF will wait to load the page entered into the URL bar until the download completes or is cancelled.
Summary: When downloading a large file, URL bar navigation doesn't work until the download finishes → When downloading a large file, URL bar navigation doesn't work until the download finishes or is cancelled
Keywords: perf
Whiteboard: [fxperf]
Performance Impact: --- → ?
Whiteboard: [fxperf]

(In reply to Arthur K. [He/Him] from comment #0)

  1. FF will wait to load the page entered into the URL bar until the download completes or is cancelled.

I can't reproduce this. I followed the steps and then put xkcd.com in the address bar and hit enter, and the page loaded nearly instantaneously.

Can you reproduce on a clean profile? Does the profile where this is broken have Fission enabled (you can check in about:support)? What URL are you navigating to while the download is ongoing?

Flags: needinfo?(thee.chicago.wolf)

From STR step 2, you began the download of that huge IMG file and saw no slowed or stalled response (or high CPU util)? I admit I am on an old PC, but would capturing a perf profile help shed some light? I'm currently on 103 RC1 and, for me, I'm still seeing the super-high CPU util from bug 1714846, which seems to be the underlying cause. I also tried removing the download icon from the toolbar, thinking it might be the spinning download animation, but it didn't make any difference.

Flags: needinfo?(thee.chicago.wolf)

I tested on macOS and on a reasonably powerful (though 3-4 years old) MacBook Pro. If this is a Windows-only issue or more pronounced on less powerful hardware, I guess that could explain why I can't reproduce. Unfortunately, my Windows machine is even beefier than my MacBook...

It's possible bug 1742797 would help on nightly? If you can easily reproduce, it might be worth testing today's nightly to see if it's better there?

Flags: needinfo?(thee.chicago.wolf)

So I just uploaded a profile of this issue here: https://share.firefox.dev/3okV4Y6

The profile shows two things: what happens after initiating a download, and what happens when navigating to a new site while the download is in progress.

I started by 1) beginning a download of the 4GB img file and, while it was downloading, 2) entered CNN into the URL bar and pressed Enter. It took about 10-12 seconds but CNN did come up....sloooowly. So for all intents and purposes, maybe this issue is better or resolved in the later builds of 103. Again, I'm on an old PC so maybe this issue doesn't manifest on modern hardware.

(In reply to :Gijs (he/him) from comment #3)

I tested on macOS and on a reasonably powerful (though 3-4 years old) MacBook Pro. If this is a Windows-only issue or more pronounced on less powerful hardware, I guess that could explain why I can't reproduce. Unfortunately, my Windows machine is even beefier than my MacBook...

It's possible bug 1742797 would help on nightly? If you can easily reproduce, it might be worth testing today's nightly to see if it's better there?

I'll check back here when 104 b1 pushes out. I suspect too that the fix in bug 1742797 will help on my old clunker (Latitude E6500, 16GB RAM, 500GB SSD, i7-M640, don't judge me!).

Flags: needinfo?(thee.chicago.wolf)

(In reply to Arthur K. [He/Him] from comment #4)

So I just uploaded a profile of this issue here: https://share.firefox.dev/3okV4Y6

The profile contains very little activity from front-end code, most likely because you already eliminated that cause of high CPU use by removing the downloads icon from the toolbar (mentioned in comment 2).

The profile shows high CPU use in the "BackgroundThreadPool #15" and "Socket Thread" threads of the parent process. The "BackgroundThreadPool #15" thread wasn't sampled, so it's hard to say exactly what happened there; we just know how much CPU it used. If you can easily reproduce, you might want to capture another profile with "BackgroundThreadPool" added to the list of sampled threads.

The CPU only has 2 cores, and only 2 logical cores (although comment 5 says i7-M640, which should be a CPU with 4 logical cores; maybe hyper-threading has been disabled on that machine for some reason).

The Awake markers on the Socket Thread show a "Current Thread Priority" of 12. The Awake markers in the main thread of the cnn.com content process show a Current Thread Priority of 9, so the JS code of the cnn.com content process won't run until the Socket Thread of the parent runs out of work to do.
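
As a rough illustration of the starvation described above, here is a toy strict-priority scheduler in Python. It is only a sketch under stated assumptions: real schedulers apply dynamic boosts and other adjustments that this model ignores, and the second busy thread's priority (12) and all the work amounts are made up for the example. Only the priorities 12 (Socket Thread) and 9 (content main thread) come from the profile.

```python
def simulate(threads, cores, ticks):
    """Toy strict-priority scheduler.

    threads: list of (name, priority, work_units) tuples.
    Each tick, the `cores` highest-priority threads that still have work
    each run for one unit. There are no boosts and no aging.
    Returns a dict mapping thread name to the number of units it ran.
    """
    priority = {name: prio for name, prio, _work in threads}
    remaining = {name: work for name, _prio, work in threads}
    ran = {name: 0 for name in remaining}
    for _ in range(ticks):
        runnable = [name for name in remaining if remaining[name] > 0]
        # Highest priority first; ties keep their original order.
        runnable.sort(key=lambda name: -priority[name])
        for name in runnable[:cores]:
            ran[name] += 1
            remaining[name] -= 1
    return ran


# Priorities 12 and 9 are the values reported in the profile.
# The second busy thread's priority and all work amounts are assumptions.
THREADS = [
    ("Socket Thread", 12, 100),
    ("other busy parent thread", 12, 100),
    ("content main thread", 9, 50),
]

two_cores = simulate(THREADS, cores=2, ticks=100)
four_cores = simulate(THREADS, cores=4, ticks=100)
print(two_cores["content main thread"], four_cores["content main thread"])
```

In this model the content main thread gets no CPU time at all on 2 cores while the two higher-priority threads have work, and finishes all of its work on 4 cores.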

Keywords: perf:pageload
Summary: When downloading a large file, URL bar navigation doesn't work until the download finishes or is cancelled → When downloading a large file, the socket thread has a higher priority than the foreground content process, which becomes CPU starved on low core machines
Component: Downloads Panel → Networking
Product: Firefox → Core

(In reply to Florian Quèze [:florian] from comment #6)

(In reply to Arthur K. [He/Him] from comment #4)

So I just uploaded a profile of this issue here: https://share.firefox.dev/3okV4Y6

The profile contains very little activity from front-end code, most likely because you already eliminated that cause of high CPU use by removing the downloads icon from the toolbar (mentioned in comment 2).

The profile shows high CPU use in the "BackgroundThreadPool #15" and "Socket Thread" threads of the parent process. The "BackgroundThreadPool #15" thread wasn't sampled, so it's hard to say exactly what happened there; we just know how much CPU it used. If you can easily reproduce, you might want to capture another profile with "BackgroundThreadPool" added to the list of sampled threads.

It's super easy for me to repro so I did another profile with BackgroundThreadPool added to the "Add Custom Threads By Name" field. I hope that's what you meant. Have a look here: https://share.firefox.dev/3zsGunQ

I also turned on every other reporting option, which was WAY more intensive on the machine than the previous profile. Hopefully that captures everything you need to look at and more. Let me know if there's anything else I can do.

(In reply to Florian Quèze [:florian] from comment #6)

(In reply to Arthur K. [He/Him] from comment #4)

So I just uploaded a profile of this issue here: https://share.firefox.dev/3okV4Y6

The CPU only has 2 cores, and only 2 logical cores (although comment 5 says i7-M640, which should be a CPU with 4 logical cores; maybe hyper-threading has been disabled on that machine for some reason).

I turned HT back on and tried again but CPU util is still 50-60% on this old clunker.

(In reply to Arthur K. [He/Him] from comment #8)

I turned HT back on and tried again but CPU util is still 50-60% on this old clunker.

Turning on HT shouldn't reduce the CPU load, but I'm hoping it might make the cnn.com page load happen faster.

(In reply to Florian Quèze [:florian] from comment #9)

(In reply to Arthur K. [He/Him] from comment #8)

I turned HT back on and tried again but CPU util is still 50-60% on this old clunker.

Turning on HT shouldn't reduce the CPU load, but I'm hoping it might make the cnn.com page load happen faster.

You want me to do another profile now that HT is back on?

Also, 104.0b1 should push out later tonight so I'll test again with that since it seems bug 1742797 got fixed there. Maybe it'll help with this bug. I'll post my findings when I update.

(In reply to Arthur K. [He/Him] from comment #10)

(In reply to Florian Quèze [:florian] from comment #9)

(In reply to Arthur K. [He/Him] from comment #8)

I turned HT back on and tried again but CPU util is still 50-60% on this old clunker.

Turning on HT shouldn't reduce the CPU load, but I'm hoping it might make the cnn.com page load happen faster.

You want me to do another profile now that HT is back on?

I would be interested if you don't mind. This would help confirm that the bug initially reported only happens on dual core machines. Thanks for the new profile in comment 7 by the way!

(In reply to Florian Quèze [:florian] from comment #12)

(In reply to Arthur K. [He/Him] from comment #10)

(In reply to Florian Quèze [:florian] from comment #9)

(In reply to Arthur K. [He/Him] from comment #8)

I turned HT back on and tried again but CPU util is still 50-60% on this old clunker.

Turning on HT shouldn't reduce the CPU load, but I'm hoping it might make the cnn.com page load happen faster.

You want me to do another profile now that HT is back on?

I would be interested if you don't mind. This would help confirm that the bug initially reported only happens on dual core machines. Thanks for the new profile in comment 7 by the way!

With HT it for sure now loads CNN faster. Before, I was counting in my head and it took around 70-80 seconds (which you can likely see in the profile size / length). Now it's around 20-30 seconds I think: https://share.firefox.dev/3cHb9Vl

I think I'd originally turned HT off when Spectre / Meltdown / Heartbleed came out as people advised disabling it. I think Windows is pretty patched up even if the CPU microcode on this old CPU isn't (via BIOS).

Hope there's something useful here now.

And here's how we're looking for 104.0b1: https://share.firefox.dev/3zqYL3R

Sadly, the issue is still present and it seems bug 1742797 didn't help out.

Blocks: necko-perf
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

I captured a profile on the Dell 2018 reference laptop, which has a dual core CPU: https://share.firefox.dev/3PYM6vo Everything is pretty slow; some network requests are delayed by more than 8s.

(In reply to Florian Quèze [:florian] from comment #15)

I captured a profile on the Dell 2018 reference laptop, which has a dual core CPU: https://share.firefox.dev/3PYM6vo Everything is pretty slow; some network requests are delayed by more than 8s.

Are you basically seeing the same thing as my comment 14 profile?

Attached image profiler-memory-leak.png (deleted)

I saw that 105.0b1 was built so I decided to revisit this bug and capture a new perf profile. Something major changed between 104 and 105.0b1 with respect to the Profiler. I experienced a HUGE memory leak and FF pegged my CPU @ 100% solid after finishing a capture using my STR. I was unable to upload the perf profile because FF became non-responsive. A picture is worth a thousand words.

I managed to capture a perf profile: https://share.firefox.dev/3ABTmsm

Noticed that even after the download had finished, CPU was still pegged @ 100% until Profiler finished uploading. FF is stuck @ 1.36GB of RAM still used after the dust settled. This might be a new bug in Profiler.

The Performance Priority Calculator has determined this bug's performance priority to be P2. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: Windows
Impact on browser UI: Causes noticeable jank
Configuration: Specific but common
[x] Able to reproduce locally
[x] Bug affects multiple sites

Performance Impact: ? → medium

This is how we're looking for 106.0b1: https://share.firefox.dev/3QYe6PO

Just updated to the 1st 106 RC build. This issue seems to have improved a tad. After initiating the D/L, entering a new site in the URL bar and navigating to a new page seems to take only a few seconds now, whereas before it was around 15-20 seconds. CPU util is still high.

Sadly, perf profiler still seems busted as when I captured a perf profile it greatly slowed everything down. Here's how it looked for 106.0 RC1: https://share.firefox.dev/3Ewpe47

Still present in 109.0b2.

(In reply to Florian Quèze [:florian] from comment #6)

The CPU only has 2 cores, and only 2 logical cores (although comment 5 says i7-M640, which should be a CPU with 4 logical cores; maybe hyper-threading has been disabled on that machine for some reason).

The Awake markers on the Socket Thread show a "Current Thread Priority" of 12. The Awake markers in the main thread of the cnn.com content process show a Current Thread Priority of 9, so the JS code of the cnn.com content process won't run until the Socket Thread of the parent runs out of work to do.

Bas, Randell, I think I heard you both mention needing to adjust the priority of the Socket process similar to what we have done recently for the GPU process. Do you have thoughts about how this might improve bugs like this one, or make things worse?

Flags: needinfo?(rjesup)
Flags: needinfo?(bas)

(In reply to Florian Quèze [:florian] from comment #23)

(In reply to Florian Quèze [:florian] from comment #6)

The CPU only has 2 cores, and only 2 logical cores (although comment 5 says i7-M640, which should be a CPU with 4 logical cores; maybe hyper-threading has been disabled on that machine for some reason).

The Awake markers on the Socket Thread show a "Current Thread Priority" of 12. The Awake markers in the main thread of the cnn.com content process show a Current Thread Priority of 9, so the JS code of the cnn.com content process won't run until the Socket Thread of the parent runs out of work to do.

Bas, Randell, I think I heard you both mention needing to adjust the priority of the Socket process similar to what we have done recently for the GPU process. Do you have thoughts about how this might improve bugs like this one, or make things worse?

Hmpf, I guess in theory setting the foreground process status on the socket process as it gets enabled would just keep the situation the same as it is now. But obviously in this scenario that isn't ideal. I guess that the current foreground content process should also get the foreground status to address this? At the same time in general you want the parent process to have a higher priority than even the foreground content process. This is all somewhat problematic and indicative of being unable to properly propagate priorities between tasks and across processes and threads.

Flags: needinfo?(bas)

(In reply to Bas Schouten (:bas.schouten) from comment #24)

I guess that the current foreground content process should also get the foreground status to address this? At the same time in general you want the parent process to have a higher priority than even the foreground content process.

The foreground status applies a priority boost; we could give content processes a lower base priority and give them the same foreground boost as the parent.

To fix this bug, we would want the socket thread (whether on the parent or socket process) to have a slightly lower priority than the foreground content process. But I'm not sure that would be good for page load.
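
To make the ordering proposed above concrete, here is a small Python sketch. Every number in it is hypothetical (the base priorities, the size of the boost, and the socket thread value), and treating the foreground boost as a simple addition is a simplification. It only shows the relative ordering the proposal is after, not actual Firefox or Windows values.

```python
# All values below are hypothetical and chosen only to show the ordering.
FOREGROUND_BOOST = 2      # assumed size of the foreground boost
PARENT_BASE = 10          # assumed base priority of the parent process
CONTENT_BASE = 8          # assumed lower base priority for content processes
SOCKET_THREAD = 9         # assumed priority, just below foreground content


def effective_priority(base, foreground):
    """Base priority plus the foreground boost, when it applies."""
    return base + (FOREGROUND_BOOST if foreground else 0)


parent = effective_priority(PARENT_BASE, foreground=True)
foreground_content = effective_priority(CONTENT_BASE, foreground=True)
background_content = effective_priority(CONTENT_BASE, foreground=False)

# Desired ordering: parent > foreground content > socket thread.
print(parent, foreground_content, SOCKET_THREAD, background_content)
```

With these made-up numbers the parent stays above the foreground content process, and the socket thread sits slightly below the foreground content process, which is the ordering the comment describes. Whether that ordering is good for page load is the open question raised above.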

(In reply to Florian Quèze [:florian] from comment #25)

(In reply to Bas Schouten (:bas.schouten) from comment #24)

I guess that the current foreground content process should also get the foreground status to address this? At the same time in general you want the parent process to have a higher priority than even the foreground content process.

The foreground status applies a priority boost; we could give content processes a lower base priority and give them the same foreground boost as the parent.

To fix this bug, we would want the socket thread (whether on the parent or socket process) to have a slightly lower priority than the foreground content process. But I'm not sure that would be good for page load.

Dumb question - could we down-prio the relevant socket thread(s) for downloads relative to page loads, rather than down-prio'ing all socket operations, or are there technical limitations that make that difficult?

Alternatively, could a useful initial start be changing the process/thread priority based on whether there are/aren't ongoing page loads? Of course this only helps while not doing page loads at the same time as downloading, but it feels like it may be more straightforward to implement than splitting off downloads-related socket traffic to their own thread(s).
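
One possible reading of the second suggestion, sketched in Python: keep the socket thread at its current priority while any page load is in flight, and lower it otherwise. The class, its method names, and the lowered value of 8 are all hypothetical; only the value 12 comes from the profile. This is not an existing Firefox API.

```python
class SocketPriorityPolicy:
    """Hypothetical policy: pick the socket thread priority from page load activity."""

    def __init__(self, during_page_load=12, otherwise=8):
        # 12 is the priority seen in the profile; 8 is an assumed lower value.
        self.during_page_load = during_page_load
        self.otherwise = otherwise
        self._active_page_loads = 0

    def page_load_started(self):
        self._active_page_loads += 1

    def page_load_finished(self):
        # Never go below zero, even if notifications arrive unbalanced.
        self._active_page_loads = max(0, self._active_page_loads - 1)

    def socket_thread_priority(self):
        if self._active_page_loads > 0:
            return self.during_page_load
        return self.otherwise


policy = SocketPriorityPolicy()
print(policy.socket_thread_priority())  # download only: lowered priority
policy.page_load_started()
print(policy.socket_thread_priority())  # page load in flight: current priority
```

As the comment itself notes, a policy like this would not help when a page load and a download overlap, because the socket thread stays at its current priority for the duration of the load.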

Flags: needinfo?(florian)

(In reply to :Gijs (he/him) from comment #26)

Dumb question - could we down-prio the relevant socket thread(s) for downloads relative to page loads, rather than down-prio'ing all socket operations, or are there technical limitations that make that difficult?

Alternatively, could a useful initial start be changing the process/thread priority based on whether there are/aren't ongoing page loads? Of course this only helps while not doing page loads at the same time as downloading, but it feels like it may be more straightforward to implement than splitting off downloads-related socket traffic to their own thread(s).

I don't know enough about how the socket code works to answer these questions, but they are interesting questions!

Flags: needinfo?(florian)

There aren't separate socket threads for downloads, and largely the CPU use is driven by the incoming data arrival.

I suspect the main thing going on here is fighting for data on the wire; a 4GB file means downstream router queues will all be full, and depending on the equipment (especially in your house) this could lead to significant delays receiving documents from CNN. If there's a 1/2 second delay in the router, it may delay every round-trip (for DNS, for TLS negotiation, etc) by 1/2 second. A low-end CPU with few cores (2) exacerbates any issues, especially if you have a fast internet connection (i.e. it can push data at you faster than you can process it).
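
A back-of-the-envelope version of the round-trip argument above, in Python. The 1/2 second queueing delay is from the comment; the list of sequential round trips and the count of one per step are assumptions for illustration, since real page loads vary widely.

```python
def added_delay_seconds(round_trips_per_step, queue_delay_s):
    """Extra latency when every sequential round trip waits in a full router queue."""
    return sum(round_trips_per_step.values()) * queue_delay_s


# Assumed sequential steps before the first bytes of a document arrive,
# with an assumed one round trip each.
STEPS = {
    "DNS lookup": 1,
    "TCP handshake": 1,
    "TLS negotiation": 1,
    "HTTP request/response": 1,
}

print(added_delay_seconds(STEPS, queue_delay_s=0.5))  # 2.0
```

Under these assumptions a single document already arrives 2 seconds late, and a page that needs further sequential requests (redirects, subresources on new hosts) would pay the same penalty again for each of them.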

Flags: needinfo?(rjesup)