Closed Bug 1382886 Opened 7 years ago Closed 3 years ago

socket thread consuming 80% CPU doing SSL polling

Categories

(Core :: Security, defect, P3)

defect

Tracking

()

RESOLVED FIXED
94 Branch
Tracking Status
firefox57 --- wontfix
firefox94 --- fixed

People

(Reporter: dbaron, Assigned: dragana)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, power)

Crash Data

Attachments

(13 files, 1 obsolete file)

(deleted), image/png
Details
(deleted), image/png
Details
(deleted), text/plain
Details
(deleted), application/gzip
Details
(deleted), text/plain
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
I just noticed that the parent process of Firefox was consuming slightly more than 100% of a core -- most of this usage was on the socket transport thread. I hadn't noticed this before, so it seems like something recent changed. A profile -- https://perfht.ml/2tLxchC -- shows that it the time seems to be spent polling various SSL connections -- mostly things like the PR_EnterMonitor/PR_ExitMonitor in SSL_DataPending and also locking in nsNSSShutDownList::enterActivityState and nsNSSShutDownList::leaveActivityState. I restarted Firefox yesterday; I'm running a build with my own changes in it, based on https://hg.mozilla.org/mozilla-central/rev/e0b0865639ce plus my own patches as of Mon Jul 17 11:01:21 2017 -0700 (locally, 6fca9cb0b6ed). Debugging the socket transport thread shows that it seems to be polling the various connections that my open tabs would have open. I have a reasonably large session, but probably only 30-40 of the tabs are actually active (i.e., have been restored). I'll attach the log of the gdb session and some other information. I have the Firefox instance still running, so I could do further debugging.
Attached image screenshot of 'top' (deleted) —
Attached image screenshot of 'top -H' showing threads (deleted) —
Hi Honza, Would you take a look at this bug? Thanks!
Flags: needinfo?(honzab.moz)
It eventually stopped happening without restarting Firefox, so I no longer have something to debug.
W/o debugging this at realtime it's hard to say why this has happened. Let's keep it open. I think this could be a known issue described in bug 705755. There is a patch that was nearly r+'ed, but we were not sure if this issue still existed. I'll check if the symptoms are identical and resurrect that bug if positive.
Assignee: nobody → honzab.moz
Whiteboard: [necko-active]
Flags: needinfo?(honzab.moz)
My current instance of Nightly is exhibiting this. I'll leave it running for now but eventually I'm going to need that cpu core back. :-)
Now it has gone away :-(
I'm not sure if this helps, but here's a high-res profile that I took: https://perfht.ml/2vWOuhx
(In reply to Aaron Klotz [:aklotz] (a11y work receiving priority right now, please send interceptor reviews to dmajor or handyman) from comment #8) > Now it has gone away :-( It's actually a good news! If this disappears on itself shortly afterwards, it suggests it's the bug I mentioned for we which we more or less know what to do.
Is that going to be landed soon? I think it would be a good idea if we could fix that before we release "the new Firefox" ;-)
Flags: needinfo?(honzab.moz)
Re comment 9: please next time start logging [1] and add pipnss:5 module. BTW, do you have at least some idea on what site or how you can reproduce this? I merged the patch from bug 705755 but I can't trigger the short write path (likely cause of this bug). Thanks. [1] https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Flags: needinfo?(honzab.moz)
Depends on: 705755
(In reply to Honza Bambas (:mayhemer) from comment #12) > Re comment 9: please next time start logging [1] and add pipnss:5 module. Sure. If I see it again, I will. > > BTW, do you have at least some idea on what site or how you can reproduce > this? I merged the patch from bug 705755 but I can't trigger the short > write path (likely cause of this bug). Sadly I have no idea. My machine was already in that state when I sat down to start work in the morning.
Aaron, bug 705755 has landed, I'll ask you to keep an eye on this bug and if it doesn't happen in a considerably long period, we may duplicate this bug to it. Thanks.
Flags: needinfo?(aklotz)
Sounds good! Thanks for landing that!
Flags: needinfo?(aklotz)
Priority: -- → P1
Priority: P1 → P3
Whiteboard: [necko-active]
It's been more than 20 days and no report of this issue. Closing for now, in case this appears again please reopen if the symptoms are identical or open a new bug. Thanks.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
I just saw this in a recent nightly.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(In reply to Josh Matthews [:jdm] from comment #18) > I just saw this in a recent nightly. Can you be a bit more specific, like that you checked stack traces it's the same issue?
Yes, on macOS I saw that Firefox was sitting at 100% CPU usage in the system monitor. The Gecko profiler did not show any smoking guns, so I opened up Instruments.app and took a recording. The stacks showed 80% of time spent in the socket thread polling connections.
(In reply to Josh Matthews [:jdm] from comment #20) > Yes, on macOS I saw that Firefox was sitting at 100% CPU usage in the system > monitor. The Gecko profiler did not show any smoking guns, so I opened up > Instruments.app and took a recording. The stacks showed 80% of time spent in > the socket thread polling connections. You don't have by chance the stacktrace captured? I would be interested in finding out if NSS/PSM ssl layers were involved. Thanks!
I didn't save it. If I see it again I will.
However I'm pretty sure that the last meaningful stack frame as ssl_Poll.
Whiteboard: [necko-triaged]
I have been able to reproduce this quite consistently today on Nightly. It took me a while to figure what was going on, and this issue is the closest I can find which seems to be related to what I am experiencing. I am not alone in this, see: http://forums.mozillazine.org/viewtopic.php?p=14774720#p14774720 https://www.reddit.com/r/firefox/comments/7bnaor/bug_firefox_nightly_has_cpu_spikes_starting_with/ What I can tell so far: This happens when a new tab is created. The "Socket Thread" (on Linux) will use 100 %CPU. Usually it is temporary, this stops after 20-30 seconds. However, as I write this, it's still ongoing for minutes, and doesn't feel like it will stop actually unless maybe I quit Nightly. I have been trying to figure the minimum steps-to-reproduce, but it's difficult to narrow. So far I can reproduce easily though, I just right-click on a link on the front page of https://news.ycombinator.com/ to open the page in a new tab. This is not necessary though, I can also just paste directly a URL in a new tab. I do clear recent history, as I can't seem to be able to reproduce for the same link if I recently previously opened it. (The ongoing "Socket Thread" finally went away as I type this part of the comment). Since I can fairly easily reproduce it as of now, if there is anything I can do in helping to narrow the issue, just ask.
Hey rhill, What happens when you disable the thumbnailing service temporarily? I ask, because this sounds vaguely similar to bug 1412505. You can disable the thumbnailing service temporarily by creating a new pref, browser.pagethumbnails.capturing_disabled, and setting it to true, and then restarting the browser.
Flags: needinfo?(rhill)
:mconley, it didn't work. To be sure I got this right: I created a new entry in about:config named "browser.pagethumbnails.capturing_disabled" (without quotes), and of type "boolean", and set it to "true". I quit Nightly then launched. The issue was reproduced immediately by right-clicking to the top link on the front page of Hacker News. Then just now, I tried another link in there, and same result, a "Socket Thread" using 100% CPU for 20-30 seconds.
Flags: needinfo?(rhill)
Okay, thanks for testing my stab in the dark. :) Hey mayhemer, is there anything rhill can do here to help us make this actionable? Perhaps some network logging we can do?
Flags: needinfo?(honzab.moz)
Just to emphasize, clearing recent history is an important step on my side. I could not reproduce a second time the issue for the link I already had opened. But I reproduced immediately right after clearing recent history. The selected entries are: Browsing and Download, Cookies, Cache, Offline Web Data, Site Preferences.
I captured a log output from about:networking#logging when the 100% CPU "Socket Thread" issue is occurring. This was only a few seconds. Beware, the file is 216 MB uncompressed. The link visited was: https://techcrunch.com/2017/11/08/chia-network-cryptocurrency
Possibly the cause of Bug 1415691, which I just posted.
I filed yesterday Bug 1415387 with a mozregression result for the issue that started in Nightly the same day. The offending changeset has been backed out (see comment 6 in that bug).
I think symptoms as described from comment 24 down are a different bug. Awaiting confirmation.
I can't reproduce so far after Nightly updated to 58.0a1 (2017-11-09) (64-bit). I was still reproducing the issue right before the browser updated this morning.
Flags: needinfo?(honzab.moz)
I'm hitting this again running on https://hg.mozilla.org/mozilla-central/rev/e1d7427787f7 plus my local patches.
(In reply to David Baron :dbaron: ⌚️UTC-8 from comment #35) > Created attachment 8930267 [details] > "gdb profiling" until something decided to crash the process because I was > attached in gdb Thanks, but this is not very helpful w/o any stacks or attempts to figure out the code paths causing the loops.
I don't think this is a necko bug.
Assignee: honzab.moz → nobody
Component: Networking → Security
Whiteboard: [necko-triaged]

I know what the problem is:
Polling for WRITE has 2 meanings: 1) it is use to drive handshake 2) it is use when necko wants to write data
In times before 0RTT this was all correct and meant the same thing, because the client always needs to write first and it poll for WRITE until the socket is ready.

In case of 0RTT: necko polls for WRITE and it is able to write 0RTT -> necko does not have any more data to write, but it continues to poll for WRITE to drive handshake -> nss returns from poll immediately because it can write more 0RTT data. In this case polls for WRITE actually has 2 meaning.

I will try to change the behavior of necko and make it not poll for WRITE if it does not have data to write. Let's see how difficult it is, this is more than decade old code (probably 2 decates)....

Assignee: nobody → dd.mozilla
Status: REOPENED → ASSIGNED
Attached file Poll for READ during 0RTT (obsolete) (deleted) —

Polling for write will in most cases always return immediately which causes necko to basically busy wait for the handshake to finish. To avoid thiis we will poll for READ and call OnSocketWritable to drive the haandshake.

Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/23c30486969a Poll for READ during 0RTT r=necko-reviewers,valentin
Status: ASSIGNED → RESOLVED
Closed: 7 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch

Can we back this out. I want to resolve this in a different way. Sorry for making troubles.

Flags: needinfo?(aryx.bugmail)

Backed out as requested by Dragana.

Backout link

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 93 Branch → ---
Attached file Refactor EnsureNPNComplete (deleted) —

Extract Check0RttEnabled
The old code checks 0RTT state then does a DriveHandshake then checks 0RTT again. This is done in this way because before DriveHandshake is called for the first time 0RTT states are not set. DriveHandshake is sometimes called as a side effect by IsAlive() check. The new code makes this more complex and just calls DriveHandshaek before checking 0RTT.
Extract code for setting 0RTT telemetry values.
Remove some code that set timing because the same code is called a bit later again.

HandshakeDone will be called after a handshake is finished and also after the certificate verifications are done.
The code relies on HandshakeDone to signal that the handshake is done. When early-data is not available HandshakeDone is responsible for setting up a Http2 session if needed. There are 2 outcomes when early-data is used:

  1. early-data is accepted and transaction continues polling for read,
  2. early-data is rejected. In this case, the transaction is restarted as well as polling flags, i.e. the connection will stop polling for read and start polling for write.
    Another difference is that a transaction that is started during the early-data period will behave as a normal transaction, i.e. it will write data and continue polling for read to receive response. The special cases during early-data(mWaitingFor0RTTResponse==true) are removed from nsHttpConnection::OnSocketWritable().

EnsureNPNComplete is only responsible for driving handshake and checking the early-data availability. All logic for finishing a handshake (i.e. checking whether early-data is accepted and checking alpn value) has been moved to HandshakeDone.
The patch also extracts FinishNPNSetup that is responsible for the bookkeeping after a handshake is done or fails, e.g. resetting transactions if 0Rtt is used but handshake fails, updating timings and sending telemetry.

HandshakeDone needs to be dispatched so that it is not called inside nss locks. The side effect of this is that nsHttpConnection::OnSocketWritable() may be called in between HandshakeDone being dispatched and executed. Therefore we still need to keep CheckCanWrite0RTTData(). This can be fixed in a follow up patch.

Side cleanups:
Remove mNotTrustedMitmDetected - his was used for ESNI, but it is not used anymore

Attachment #9236155 - Attachment is obsolete: true

In this case necko should poll for read (not for write) and reset the poll flags when the handshake is done.
The other option is to inspect the resumption ticket before adding it to the nss socket and find out which alpn will be used and disable the early-data if the version is http/1.1 and the transaction cannot send early-data. This currently only works on Nightly. When we roll out the necko’s token cache we can consider making this change.

Additional changes:
Consolidate mEarlyDataNegotiated and mWaitingFor0RTTResponse into mEarlyDataState

Before this bug TLS handshake was only driven by forcing writes. SecurtyInfo was set during a write code path. That is not anymore true and the TLS handshake can be driven by reading from a socket. That causes an issue where the SecurtyInfo was not set in case a TLS handshake fails. This bug added the setting of the SecurtyInfo to the read code path, but that causes problems when the transaction is closed due to corrupted response.
This patch fixes this by moving the setting of SecurtyInfo to Close() function.

Do not call HandshakeDoneInternal if the connectioon has been closed between posting the HandshakeDoneInternal runnable aand executing it.

Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/fe6b8f41ca83 Refactor EnsureNPNComplete r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/632eefa7e529 Add a callback when the handshake is done. r=necko-reviewers,kershaw,keeler https://hg.mozilla.org/integration/autoland/rev/0c049a8ee4a8 Don’t use a special way to write 0RTT data. Reuse the standard code path r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/7f2260b49e60 Use HandshakeDone to signal that a handshake is done r=necko-reviewers,keeler,kershaw https://hg.mozilla.org/integration/autoland/rev/c922a30d444e Make sure that nsHttpConnection immediately knows that the handshake is done. r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/1db60c600b97 Remove a busy-wait if the early-data is negotiated and http/1.1 is used but the transaction cannot send early-data. r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/e0d76ea1bdd2 Fix fuzzing build, also we cannot set SecurityInfo during read r=necko-reviewers,kershaw
Status: RESOLVED → REOPENED
Crash Signature: [@ mozilla::net::nsHttpConnection::OnInputStreamReady] [@ mozilla::net::nsHttpConnection::FinishNPNSetup] [@ mozilla::net::nsHttpConnection::OnReadSegment] [@ mozilla::net::nsHttpNTLMAuth::GenerateCredentials] [@ mozilla::net::nsHttpConnection::TryTak…
Flags: needinfo?(dd.mozilla)
Resolution: FIXED → ---
Target Milestone: 94 Branch → ---

Also set mEarlyDataState to done earlier.

Attachment #9242325 - Attachment description: onsider that mTransport may be nullptr when Handshake done callback is called → Consider that mTransport may be nullptr when Handshake done callback is called
Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e9635ec4198f Refactor EnsureNPNComplete r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/f69657a7e597 Add a callback when the handshake is done. r=necko-reviewers,kershaw,keeler https://hg.mozilla.org/integration/autoland/rev/971bda7b5986 Don’t use a special way to write 0RTT data. Reuse the standard code path r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/f8d31e3ebc93 Use HandshakeDone to signal that a handshake is done r=necko-reviewers,keeler,kershaw https://hg.mozilla.org/integration/autoland/rev/18651e373837 Make sure that nsHttpConnection immediately knows that the handshake is done. r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/4150724f355f Remove a busy-wait if the early-data is negotiated and http/1.1 is used but the transaction cannot send early-data. r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/7f0065ff7b82 Fix fuzzing build, also we cannot set SecurityInfo during read r=necko-reviewers,kershaw https://hg.mozilla.org/integration/autoland/rev/40e64e1009be Consider that mTransport may be nullptr when Handshake done callback is called r=necko-reviewers,kershaw
Regressions: 1732257
Flags: needinfo?(dd.mozilla)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: