Closed Bug 336944 Opened 19 years ago Closed 19 years ago

100% CPU and delays with https/ssl activity

Categories

(Core :: Security, defect)

1.8 Branch
x86
OS/2
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: mrmazda, Assigned: KaiE)

Details

(Keywords: fixed1.8.1, perf, regression)

Attachments

(6 files)

Latest available build that this not happen in: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.8) Gecko/20060413 SeaMonkey/1.1a More details in newsgroup thread beginning with: http://groups.google.com/group/mozilla.dev.ports.os2/msg/f173e04eae4fb1c5?hl=en& Summary of news thread: 1-Happens on https://bugzilla.mozilla.org/ https://bugzilla.redhat.com/ https://bugzilla.novell.com/ https://qa.mandriva.com/ and elsewhere 2-SSL pop3 login delays 3-https login delays 4-Large bug queries frequently take 4+ times normal, hogging the CPU the whole time (2-3 minutes or more instead of ~30 seconds for http://tinyurl.com/msj4y) 5-Loading existing bug pages frequently hog CPU, taking inordinate time to finish (distinguishable from Bug 254185 or any like it)
This so far appears to be OS/2 specific...
Assignee: dveditz → mozilla
Several observations: - I ran the tinyurl search Felix quoted with my normal profile and the OS/2 nightly build Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.8) Gecko/20060504 SeaMonkey/1.1a I didn't see any particular delays there. - I then switched "security.OCSP.enabled" from 1 to 0. I noted some delay and pretty high CPU usage while loading the same URL with the same SeaMonkey build. - I today built a Firefox debug build off the trunk. With short bugzilla searches I didn't see any problems but when I use the tinyurl there is a long delay (10-15s) with 100% CPU before anything appears on screen with "security.OCSP.enabled" 0 or 1?! - (probably unrelated) The first time I tried to search at all I got an alert saying: "bugzilla.mozilla.org has sent an incorrect or unexpected message. Error Code: -12237" i.e. the BadServer message The timing could point to bug 111384 or bug 152426. I CC kaie and Julien, perhaps they have some hints.
Keywords: regression
Could you please clarify, do you see your problem with - OCSP turned on (1) - OCSP turned off (0) - regardless of OCSP being turned on or off ? Thanks. -12237 means: SSL_ERROR_RX_UNEXPECTED_CHANGE_CIPHER "SSL received an unexpected Change Cipher Spec record."
I'll take a look over the week-end on my OS/2 box. I don't have a working build environment for the full mozilla client unfortunately, but I do have one for NSPR/NSS, so if the problem is in that area, I might be able to make some progress.
OSCP on doesn't seem to cause any improvement here. One way I often get delay of around 30 seconds is from http://www.mozilla.org/ports/os2/ middle clicking first on the "Recently Changed Warpzilla Bugs" link to open in new tab, immediately followed by left clicking on " Current Open Warpzilla Bugs" link to open in current tab. I've also noticed when I get a really long delay, that if I try to close the tab instead of waiting for completion, it seems some files get left open if I shortly after close SM to restart it and reclaim squandered memory. When I try to xcopy /m /s backup my profile tree while SM is supposedly shut down, xcopy squawks about being unable to access files. The only way I know to fix that is a reboot.
Flags: blocking1.8.1?
How are NSPR-pollable-events implemented on os/2? We now do SSL on mutliple threads, and use pollable events to wake up the threads to process available data. Sound like the wake up is not working on os/2.
Kai, if you give me a pointer (lxr link?) where you do these calls I could also take a look on OS/2. Also, I think Felix had some other observations that might shed some light on the reasons behind this delay. (Was it the fast back cache pref?)
This bug makes trying to use eBay's login painful.
So how narrow can we get the regression window here?
There were no 1.8 branch OS/2 builds between 13 and 21 April. That period is when it happened: 13th OK, 21st not. So, maybe start with a fresh build from a 17th pull. ;-) I might be slightly off about the end date due to the bug 335142 difficulty. But I think I saw that bug along with this for about 2 days before Serge filed it.
I think this bug might impact Windows users as well. It is possible for PR_NewPollableEvent to fail on Windows. We've had bugs in the past where a user's personal firewall disallowed the creation of a connected socket pair because it thought the browser was trying to open a server socket. It is also fairly common that Win9x fails to open a connected socket pair because there may be no loopback TCP device configured! This is handled somewhat gracefully in nsSocketTransportService by making select() timeout frequently. So, I think this is something that needs to be addressed at the PSM-level as it may impact Windows users on a significant scale. Kai: you can simulate this behavior by making PR_NewPollableEvent return NULL.
Flags: blocking1.8.1? → blocking1.8.1+
-> kai
Assignee: mozilla → kengert
maybe that's the same bug bob saw, however he gets it combined with an ocsp error message
(In reply to comment #11) > > Kai: you can simulate this behavior by making PR_NewPollableEvent return NULL. Darin: PSM calls this only once during the session, very early during its init. The application actually crashes if I force this one time init to fail.
probably unrelated to this bug, but if it is possible for PR_NewPollableEvent to fail, we might want to move this singleton call to the time the PSM XPCOM module gets created, so we can fail gracefully.
The application must do its best to work even when PR_NewPollableEvent fails because it will fail on some Windows systems (and on OS/2). If the function is only needed for OCSP, then maybe we can just disable OCSP instead of all of SSL support.
Attached image OCSP error dialog screenshot (deleted) —
Is this related? I've seen this on http://www.familytv.com/ repeatedly.
Post-20060413 SM 1.9 builds (e.g. 2006041905) are just as bad as post-20060413 SM 1.8 branch builds. Minefield 20060525 and 20060419 are also bad.
Here's how I see the 100% CPU utilization. Not that I see this problem constantly on XP, and almost never (perhaps never) on Linux. 1. Set OCSP to validate certs that list an OCSP responder (middle option) 2. Check my mail using Gmail's SSL facility: https://mail.google.com/mail/ 3. Leave that tab, and open other tabs to do other work, often involving logins to other SSL sites (like bugzilla.m.o) I'll see the OCSP responder error a few times per hour, both when the SSL Gmail tab is active (on top), and also when others are active. If I'm working on other tabs, the SSL Gmail tab becomes active, and I see the error message. I don't understand why I see this OCSP error on Windows XP and not on Linux. It's the same server, with the same certificate. Regardless of the CPU utilization, why would I see OCSP errors on one machine, and not the other?
Update: I made sure I had the latest builds on both XP machines (two of them) and my Linux (Fedora Core 5) machine. I see the OCSP error on all 3 machines, but the 100% CPU utilization only occurs on the XP machines. On Linux, I'm running: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1a3) Gecko/20060621 BonEcho/2.0a3
This patch simluates an OCSP server error 1 out of 5 attempts. Using this patch on Windows and connecting to google mail over https, one quickly gets the reported error dialog with 100% cpu activity.
While the busy loop was active, I set breakpoints on all threads on the functions above NSPR layer. It appears the busy loop is at nsSocketTransportService::Poll. However, I had a problem with these breakpoints set, after a couple of "continue" in the debugger, the system got really unresponsive, and I was glad I could at least kill things after 10 minutes.
So we have two issues here: a) what causes this error message to show up? b) why is there a busy loop with 100% cpu while the error is shown? Regarding a), I have never been able to reproduce myself. Based on code inspection, I see only two causes that produce the shown error message: a1) The OCSP responder signals a response status of internalError (2), --Internal error in issuer as exaplained in http://www.ietf.org/rfc/rfc2560.txt a2) The http client callback implementation in Mozilla, that fetches the ocsp response on behalf on NSS fails with some internal error. When we implemented the callback, we used the same error code as used for a1). I intend to produce an engineering build of Firefox from 1.8 branch, that uses a different error message for a2), and have Bob try it. This is to find out, whether the frequent failures are caused by an internal failure, or indeed by a1). Will talk about b) later.
(In reply to comment #16) > The application must do its best to work even when PR_NewPollableEvent fails > because it will fail on some Windows systems (and on OS/2). If the function is According to Darin, the failure on Windows is due to firewalls denying the ability to create a socket.
Kai, we'd really like to get a fix for this in the beta1 timeframe because it's possible that this could impact many users.
While I'm still not able to reproduce myself, I'm making progress. In my simulated error case, I was able to hide the busy loop by adding PR_Sleep( 50 milliseconds ) to nsSocketTransportService::Poll. I hope this is the same busy loop as seen with the real failure. Darin, David: No calls to PR_NewPollableEvent seem to have failed in my test scenario. SocketTransport has a valid pollable event (mPollList[0].fd). Darin, would it help you if I produce trace output created using NSPR_LOG_MODULES="nsSocketTransport:5" ? But you could reproduce it yourself, as described in comment 21. Regarding the true origin of the error message: Right now I'm doing a new debug build for Bob who is able to reproduce, that will give me lots of trace info about the failed OCSP connection. I already know that the kicked off channel load is failing in nsNSSCallbacks.cpp, where OnStreamComplete is called with param 3 (nsresult) indicating a failure. As soon as Bob runs the new trace build, I hope to have full details about the failed connection, and hopefully this will enable me to reproduce the original cpu busy loop.
Darin: Might it be possible that nsSocketTransport is setting the pollable event more often than waiting for it? Wan-Teh proposed these should be balanced.
When Bob gets the error message, a plain HTTP POST request to http://ocsp.thawte.com:80/ failed with either NS_ERROR_MODULE_NETWORK:NS_ERROR_CONNECTION_REFUSED or NS_ERROR_MODULE_NETWORK:NS_ERROR_NET_RESET
Another approach to simulate the error seen be Bob: - use Win XP - edit file mozilla/netwerk/base/src/nsSocketTransport2.cpp - enable TEST_CONNECT_ERRORS - go to https://mail.google.com/mail/ - if an ocsp error shows up immediately, try again, until you are able to log in - click around in mail - sooner or later the error will show up. to force it, try to sign out and back in. - as soon as you see the OCSP error message while gmail is shown, open task manager, and you'll see the 100% cpu Darin, do you think this should be fixed at the nsSocketTransport level? We should solve the busy loop problem, but independently of it, do you think PSM / NSS should retry, if a the connection fails with CONN_REFUSED, CONN_TIMEOUT or CONN_RESET ?
running a debug build with NSPR_LOG_MODULES="nsSocketTransport:5" the application continuously produces the following series of lines: 1948[f443a0]: calling PR_Poll [active=2 idle=2] 1948[f443a0]: poll timeout: 65535 1948[f443a0]: nsSocketTransport::OnSocketReady [this=2e75658 outFlags=1] 1948[f443a0]: nsSocketInputStream::OnSocketReady [this=2e756f8 cond=0] 1948[f443a0]: nsSocketInputStream::Read [this=2e756f8 count=4096] 1948[f443a0]: calling PR_Read [count=4096] 1948[f443a0]: PR_Read returned [n=-1] 1948[f443a0]: nsSocketInputStream::AsyncWait [this=2e756f8] 1948[f443a0]: active [1] { handler=2ffdd10 condition=0 pollflags=6 } 1948[f443a0]: active [0] { handler=2e75658 condition=0 pollflags=5 } 1948[f443a0]: idle [1] { handler=2cd7f10 condition=0 pollflags=0 } 1948[f443a0]: idle [0] { handler=2ec8328 condition=0 pollflags=0 } I think I should check whether this is caused by the poll implementation of the SSL thread.
Attached patch Patch v1 (deleted) — Splinter Review
I believe I have a fix! In short, when one SSL socket is blocked (because of UI), for other SSL sockets the SSL thread implementation of the NSPR poll method returns readable or writeable, but the SSL thread implementation of read/write methods will immediately return -1 with would-block, because the SSL thread is not available. The patch fixes the problem, by not signaling readable/writeable when we know that we are not able to... This patch fixes the 100% cpu consumption issue for me. However, we should still discuss, whether we want an automatic retry for network failures on OCSP attempts. Darin or Bob, could you please review?
Attachment #227519 - Flags: superreview?(darin)
Attachment #227519 - Flags: review?(rrelyea)
(In reply to comment #31) > Created an attachment (id=227519) [edit] > Patch v1 > > I believe I have a fix! I can confirm that the test build Kai gave me still generates the errors, but does not consume 100% of the CPU. I'll keep testing that build.
Comment on attachment 227519 [details] [diff] [review] Patch v1 r=relyea
Attachment #227519 - Flags: review?(rrelyea) → review+
Attachment #227519 - Flags: superreview?(darin) → superreview+
fixed on trunk
Status: NEW → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Attachment #227519 - Flags: approval1.8.1?
Comment on attachment 227519 [details] [diff] [review] Patch v1 >Index: mozilla/security/manager/ssl/src/nsSSLThread.cpp >+ return PR_POLL_READ; >+ } >+ else >+ { nit: no need for else after return >+ } >+ else >+ break; >+ } >+ > default: > break; the "else break;" statement is a little odd. it isn't necessary since the code will fall-through to the default case. a=darin on behalf of drivers for the 1.8 branch
Attachment #227519 - Flags: approval1.8.1? → approval1.8.1+
Thanks for your comments Darin. I removed the first else-after-return. I removed the second else-break. In both locations I added a trailing break to the block, to protects us from future accidential fall-throughs. I will land this change on the trunk, and the combined patch on the 1.8 branch.
fixed1.8.1
Keywords: fixed1.8.1
*** Bug 340359 has been marked as a duplicate of this bug. ***
Felix, can you confirm that this is fixed on OS/2, too? Or do we need some extra work?
Haven't been able to find anyplace where this problem is not solved with OCSP set to 0. Does it need non-zero OCSP testing too?
Might be a good idea to switch on ocsp=1 at least for a while to be sure. Ralph Haase reports that he still finds similar problems on OS/2 with SM 1.1a. I will try to work with him to track down if this is the same bug or a different one.
I see no difference with OCSP=1
The error dialog is handled in bug 350512.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: