Closed Bug 92123 Opened 23 years ago Closed 23 years ago

Linux-Crash on AOL cert enrollment page - Trunk, M092 & N610 [@ libc.so.6 - ssl_DefSend]

Categories

(Core Graveyard :: Talkback Client, defect, P1)

Other
Linux
defect

Tracking

(Not tracked)

VERIFIED FIXED
mozilla0.9.4

People

(Reporter: ji, Assigned: bbaetz)

Details

(4 keywords)

Crash Data

Attachments

(1 file)

Build: branch 07/24 Linux build
OS: RH 6.2-J

Using the 07/24 Linux branch build with a new profile, the browser crashes when I try to get an AOL cert from https://certificates.netscape.com.

Steps to reproduce:
1. Launch the browser with a new profile.
2. Go to https://certificates.netscape.com
3. Click on Get the Cert.

The browser crashes and a talkback window comes up, but the browser window remains. However, the browser seems to be disconnected from the network: you can't go to any web site from that browser window. Callstack to follow.
Below is the callstack:

Incident ID: 33283866
Stack Signature: libc.so.6 + 0xb2c82 (0x404d9c82) de72ad1c
Bug ID:
Trigger Time: 2001-07-24 10:50:44
User Comments: Get AOL cert
Build ID: 2001072405
Product ID: Netscape6.10
Platform ID: LinuxIntel
Stack Trace:
  libc.so.6 + 0xb2c82 (0x404d9c82)
  libnspr4.so + 0x1ce78 (0x401b5e78)
  libnspr4.so + 0xc42a (0x401a542a)
  ssl_DefSend()
  ssl3_SendRecord()
  SSL3_SendAlert()
  ssl_SecureClose()
  ssl_Close()
  nsSSLIOLayerClose()
  libnspr4.so + 0xb5f7 (0x401a45f7)
  nsSocketTransport::CloseConnection()
  nsSocketTransport::Process()
  nsSocketTransportService::Run()
  nsThread::Main()
  libnspr4.so + 0x1f3ee (0x401b83ee)
  libpthread.so.0 + 0x5b85 (0x401ccb85)
Keywords: crash
Version: 1.01 → 2.0
Verified on Linux. Mac and Win32 are OK.
Adding keywords. I'll download builds going backward until I find the one that does not crash.
Keywords: nsbeta1, regression
This was working with the 7/23 Linux commercial build.
My mistake. This bug goes back to at least 7/13.
May 23 works for me. May 31 fails for me. I have no other builds in between.
Is this a Linux and JA build only?
I'm using the normal Linux branch build on Redhat 6.0. I used to not be able to reach https://certificates.netscape.com/NSEnroll.html , but now that the browser loads it, the crash bug has become visible.
The build I used is the English Linux branch build, and the system is RH 6.2 Japanese.
Getting certs with Linux from any of the sites on this page http://junruh.mcom.com/tests.html under "CMS 4.2 testing" works. The crash occurs only when clicking on the link at https://certificates.netscape.com
Priority: -- → P1
cc shadow
Changing summary.
Keywords: nsdogfood, pp
Summary: Browser crashes when trying to get a cert → Linux-Crash on AOL cert enrollment page
I just finished a simple test of this version of the browser with CMS and the Netscape root CA. I didn't see any problem with CMS. With the Netscape root CA the browser did not crash, but it hung: it displays "connecting to certificates.netscape.com" and nothing happens.
reproduced on Linux build 0725200105.0.9.2 ->javi P1 t->2.0
Assignee: ssaux → javi
Target Milestone: --- → 2.0
anyone know if certificates.netscape.com uses Keep-Alive connections?
btw, certificates.netscape.com is still running iCMS 4.1. I don't know how to tell if it's using keep-alives, but I can check if someone knows a way to determine that. In the CMS.cfg I see:

eeGateway.keepAliveOn=false

I'm running the 0724 build of N6.1 on Windows 2000 and am not seeing any problems with that site, fwiw.
javi: You may be able to use something like this in a debug build:

setenv NSPR_LOG_MODULES nsHttp:5
setenv NSPR_LOG_FILE foo.log

I just saw that in bug 90196, where the resulting log shows information about keep-alive. Not sure whether it will work in your case.
Using the Linux RTM candidate bits (installer) at: ftp://sweetlou/products/client/seamonkey/unix/linux/2.2/x86/2001-07-24-18-0.9.2/ I was unable to produce a crash on Red Hat Linux 6.2. However, I did experience a "hang" when going from "https://certificates.netscape.com" to "https://certificates.netscape.com/NSEnroll.html". However, if I go directly to the URL "https://certificates.netscape.com/NSEnroll.html", I am able to successfully enroll and obtain a user certificate. I verified this with Beomsuk on his machine, and he was able to produce the exact same behaviour.
target 2.1
Target Milestone: 2.0 → 2.1
Summary: this bug does not show up when you use CMS 4.2 SP2 (the current version). It happens when you use old versions of CMS.
adding nsenterprise to all P1, P2 PSM bugs with target milestone of 2.1
Keywords: nsenterprise
I think this might be a dup of bug 83747 because the stacks look identical. Although bug 83747 was logged first, this bug has actually been looked at, so I'll leave it up to QA to mark that bug a dup of this one. Adding Trunk, M092 & N610 [@ libc.so.6 - ssl_DefSend] to summary and topcrash keyword for tracking. The final N610 Linux build 2001072504 has been seeing this crash according to the latest Talkback data. Here are a couple of entries:

ssaux's crash:

Incident ID: 33328193
Stack Signature: libc.so.6 + 0xb0eb2 (0x404ceeb2) 86d987d7
Bug ID:
Trigger Time: 2001-07-25 10:43:41
User Comments: Reproducing bug 92123
Build ID: 2001072504
Product ID: Netscape6.10
Platform ID: LinuxIntel
Stack Trace:
  libc.so.6 + 0xb0eb2 (0x404ceeb2)
  libnspr4.so + 0x1ce78 (0x401b4e78)
  libnspr4.so + 0xc42a (0x401a442a)
  ssl_DefSend()
  ssl3_SendRecord()
  SSL3_SendAlert()
  ssl_SecureClose()
  ssl_Close()
  nsSSLIOLayerClose()
  libnspr4.so + 0xb5f7 (0x401a35f7)
  nsSocketTransport::CloseConnection()
  nsSocketTransport::Process()
  nsSocketTransportService::Run()
  nsThread::Main()
  libnspr4.so + 0x1f3ee (0x401b73ee)
  libpthread.so.0 + 0x4eca (0x401caeca)

and junruh's crash:

Incident ID: 33325439
Stack Signature: libc.so.6 + 0xdea32 (0x40551a32) f223bb1f
Bug ID:
Trigger Time: 2001-07-25 09:35:02
User Comments:
Build ID: 2001072504
Product ID: Netscape6.10
Platform ID: LinuxIntel
Stack Trace:
  libc.so.6 + 0xdea32 (0x40551a32)
  libnspr4.so + 0x1ce78 (0x401b7e78)
  libnspr4.so + 0xc42a (0x401a742a)
  ssl_DefSend()
  ssl3_SendRecord()
  SSL3_SendAlert()
  ssl_SecureClose()
  ssl_Close()
  nsSSLIOLayerClose()
  libnspr4.so + 0xb5f7 (0x401a65f7)
  nsSocketTransport::CloseConnection()
  nsSocketTransport::Process()
  nsSocketTransportService::Run()
  nsThread::Main()
  libnspr4.so + 0x1f3ee (0x401ba3ee)
  libpthread.so.0 + 0x760e (0x401d460e)
Keywords: topcrash
Summary: Linux-Crash on AOL cert enrollment page → Linux-Crash on AOL cert enrollment page - Trunk, M092 & N610 [@ libc.so.6 - ssl_DefSend]
The Win32 07/27 branch build and the Mac 07/26 branch build hang when clicking the Get the Cert icon on the https://certificates.netscape.com page; the status bar shows the browser transferring data from certificates.netscape.com forever. But the Win and Mac builds don't crash.
For the Win32 and Mac builds, when I see the hang, if I go back to the page by clicking the Stop and Back icons, clicking the Get the Cert icon again gets me to the enrollment page.
My memory from looking at talkback reports was that this crash was a SIGPIPE (not a SIGSEGV as most crashes are). I'm not behind the firewall now so I can't double-check. It's useful to include the crash reason when filing talkback bugs.
Trigger Type: Program Crash
Trigger Reason: SIGPIPE: Write on Pipe, with no one to read: (signal 13)

jay/shiva, let's get trigger reason added to the quick search report.
*** Bug 83747 has been marked as a duplicate of this bug. ***
I got this a couple of times, when reading mail over imap/ssl (see bug 92517). The first time was at shutdown, and the second time was while reading mail, after not having touched the computer for a while. In that case, I could keep using the product after talkback came up. Are we writing to a closed socket?
Mass assigning QA to ckritzer.
QA Contact: junruh → ckritzer
I'm not seeing this crash on Linux anymore. Marking WORKSFORME. Please re-open if this still crashes.
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
Still crashes, according to talkback. In fact, it's probably our #1 trunk topcrash on Linux. With recent NSPR build system changes (bug 88045), we now see the top of the stack more clearly. Looking at a report from the 2001-08-15 build, the top of the stack is: libc.so.6+ ... pt_Send() pl_DefSend() ssl_DefSend()
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
(Oops, I added this in the wrong bug.) Is there a web site that produces this crash? https://certificates.netscape.com/ does not anymore. Without a reproducible test case, this will *never* get fixed. (I've tried the 3 sites mentioned in the talkback reports and none of them cause my build from this morning to crash.) The fact that the crash on https://certificates.netscape.com/ went away without changing PSM code leads me to believe we're crashing in NSS because of a bug in a different part of the code. Perhaps trying to write to a closed socket that only happens on slow connections. Anyone in QA have a modem they could test with?
Recent user comments on this crash (some of which have URLs) are below. Numbers in parentheses are talkback report numbers.

(34159147) Comments: clicked on gaim link
(34156456) URL: https://secure.globalsign.net/cacert/root.cacert
(34156456) Comments: Adding a cert via https://secure.globalsign.net/cacert/root.cacert
(34144893) Comments: crash on file quit
(34134388) Comments: I wasn't even in the virtual desktop where mozilla was running. The mail/news app was up and was checking/filtering mail so I assume that's what did it....
(34130076) Comments: exit mozilla and terminate network connection at about the same time
(34104702) URL: my.yahoo.com
(34104702) Comments: Pressed refresh after leaving Mozilla unused for about 2 hours.
(34092247) URL: http://dr.dk/licens/vaerdat.htm
(34092190) URL: http://dr.dk/licens/vaerdat.htm
(34069916) Comments: 2001-08-13-08 linux 2.2
(34069457) Comments: 2001-08-13-08 on linux 2.2 tried selecting a newsgroup to download by using the download and sync window. on linux it hangs then i guess after waiting a while it crashes
(34030983) URL: https://jsecom11.sun.com:443/ECom/docs/CompleteRegister.jsp
(34030983) Comments: Submitting the form by clicking on "Register" button.
(33996736) URL: salon.com
(33996736) Comments: Just reading an article. I didn't even click on everything. Odd I don't think I've ever had a totally spontaneous crash before.
(33994589) URL: register.com
(33994589) Comments: submitting e-mail on register.com's webmail.
(33915983) URL: www.salon.com/....
(33915983) Comments: clicked on a link to read the next page of a story
(33897868) Comments: 2001080910 linx 2.2 idle had browser/mesger up and it crashed
(33844811) Comments: i think only one thread crashed or something. so maybe that's a dns problem. mozilla is still alive and a little bit well
I noticed that nowhere in this bug report does anyone mention the bit about "PSM or Netscape 6.1 detected: click here first to get a patch" appearing on the "https://certificates.netscape.com/NSEnroll.html" webpage - is that because this patch link is a recent addition somewhere? If no, why isn't it mentioned - is it inconsequential? If yes, has anyone tried it? ji, could you try this again after installing the patch?
That "patch" was added on July 4, 2001. The "patch" is actually just the GTE CyberTrust root certificate, which our internal CA chains to.
cc'ing wtc. dbaron observed that NSPR passes 0 (not MSG_NOSIGNAL) for the send flags, and does not abstract signal syscalls, so callers are likely to get SIGPIPE in socket reader abrupt termination situations. What to do? At the least, I think we need to prevent SIGPIPE from being raised, at the NSPR level if possible. /be
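The difference between passing 0 and MSG_NOSIGNAL as the send flags can be illustrated with a small standalone sketch. This is not NSPR code: the helper name send_to_closed_peer is hypothetical, a local socketpair() stands in for the network connection, and it assumes a Linux system where MSG_NOSIGNAL is available. With the flag set, the failed write is reported through errno as EPIPE rather than through a signal.

```c
#define _POSIX_C_SOURCE 200809L  /* for MSG_NOSIGNAL under strict C modes */
#include <assert.h>              /* used by the usage check shown below */
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not NSPR code): writes to a socket whose peer has
 * already closed, passing MSG_NOSIGNAL so that the kernel reports EPIPE
 * instead of raising SIGPIPE. Returns the errno value seen by send(),
 * 0 if send() unexpectedly succeeded, or -1 if the socket pair could not
 * be created. */
int send_to_closed_peer(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0)
        return -1;

    close(fds[1]);  /* the "reader" goes away abruptly */

    ssize_t n = send(fds[0], "x", 1, MSG_NOSIGNAL);
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    return err;
}
```

A caller could check the result with assert(send_to_closed_peer() == EPIPE). Replacing MSG_NOSIGNAL with 0 in the same sketch would instead deliver SIGPIPE to the process, which terminates it unless the signal is ignored, blocked, or handled.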
NSPR ignores SIGPIPE. Maybe that is not happening? It is possible that we are bitten by the fact that pthreads on Linux are really process clones. Maybe we need to ignore SIGPIPE for each thread.
That is what I did in a different pthread application I wrote. In Linux, each pthread has its own mask of signals being blocked and allowed. It is best to unblock the signals only in those threads where you want to handle the signal. However, all threads share the same signal handler code, which makes things difficult if you want to handle a signal in multiple threads, as you don't have access to thread local storage from within the signal handler...
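The per-thread signal mask described above can be sketched with pthread_sigmask(). The two function names are hypothetical and this is not code from Mozilla or NSPR; it only shows how one thread might block SIGPIPE for itself and confirm the result. Note that blocking is not the same as ignoring: a blocked signal stays pending rather than being discarded. A thread created afterwards inherits the mask of the thread that created it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* used by the usage check shown below */
#include <pthread.h>
#include <signal.h>
#include <stddef.h>

/* Blocks SIGPIPE for the calling thread only. Returns 0 on success or an
 * error number, as pthread_sigmask() does. */
int block_sigpipe_in_this_thread(void)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGPIPE);
    return pthread_sigmask(SIG_BLOCK, &set, NULL);
}

/* Returns 1 if SIGPIPE is blocked in the calling thread, 0 if it is not,
 * and -1 if the mask could not be read. */
int sigpipe_blocked_in_this_thread(void)
{
    sigset_t current;
    /* With a NULL new set, the call only reports the current mask. */
    if (pthread_sigmask(SIG_BLOCK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, SIGPIPE);
}
```

For example, assert(block_sigpipe_in_this_thread() == 0) followed by assert(sigpipe_blocked_in_this_thread() == 1) should hold in the calling thread. On older C libraries the program may need to be linked with the threads library.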
It could be that talkback is trapping the SIGPIPE and sending crash reports even though Mozilla would otherwise ignore it.
dbaron: Don't think so. It's being raised, and there's no current signal handler for SIGPIPE, so we'd crash without talkback, wouldn't we?
It sounds like a non-main thread got killed by the SIGPIPE, and that (rightly) triggered talkback. What wtc said: do we need to make sure each new thread ignores SIGPIPE? /be
brendan wrote: > What wtc said: do we need to make sure each new thread > ignores SIGPIPE? According to the LinuxThreads FAQ, this is not necessary. (http://pauillac.inria.fr/~xleroy/linuxthreads/faq.html#J) So what NSPR does should be sufficient.
Is it possible that Mozilla in Linux sets a signal handler for SIGCHLD only? I traced through InstallUnixSignalHandlers, it does essentially nothing on my system. Using gdb, I stopped at the first line of main, set breakpoints to functions sigaction and sigvec. It only stopped twice, in unix_rand.c, setting handlers for SIGCHLD only. Did I miss another system function that sets signal handlers, or do you think gdb was unable to behave correctly? gdb was unable to work with a breakpoint set at sigprocmask. However, I tried to set breakpoints on source locations where sigprocmask should be called, but it didn't stop. Do you have an idea, how we could test whether there is really a signal handler set?
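One way to test whether a handler is really set, without relying on debugger breakpoints, is to ask the kernel directly: calling sigaction() with a NULL new action only reports the current disposition. The sketch below is an illustration, not code from the tree; the function name sigpipe_disposition and its return codes are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* used by the usage check shown below */
#include <signal.h>
#include <stddef.h>

/* Reports the current disposition of SIGPIPE without changing it.
 * Returns 0 for the default action, 1 if the signal is ignored,
 * 2 if a handler function is installed, and -1 on error. */
int sigpipe_disposition(void)
{
    struct sigaction current;
    if (sigaction(SIGPIPE, NULL, &current) != 0)
        return -1;
    if (current.sa_flags & SA_SIGINFO)
        return 2;  /* a three-argument handler is installed */
    if (current.sa_handler == SIG_IGN)
        return 1;
    if (current.sa_handler == SIG_DFL)
        return 0;
    return 2;
}
```

Calling this at several points during startup (for example before and after a component initialises) would show which step last changed the disposition; after signal(SIGPIPE, SIG_IGN) a check such as assert(sigpipe_disposition() == 1) should pass.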
/netwerk/dns/src/unix-dns.c, line 699 -- CATCH_SIGNAL_DFL(SIGPIPE);
wtc, that looks like dead code, jwz's old child process dns helper jazz from the classic days. Cc'ing gordon and darin, who can best say whether this file should be cvs removed. /be
Wan-Teh helped me to confirm that I saw a debugger problem. During application init the following code is executed:

    struct sigaction sigact;
    int rv;

    sigact.sa_handler = SIG_IGN;
    sigemptyset(&sigact.sa_mask);
    sigact.sa_flags = 0;
    rv = sigaction(SIGPIPE, &sigact, 0);
    PR_ASSERT(0 == rv);
Ok, is something resetting SIG_DFL for SIGPIPE later on? /be
dbaron: you were right. If I take ns/fullsoft/tests/crasher.c, massage it so that it compiles, and then add the sigaction call from NSPR to main(), and change the crash test to raise(SIGPIPE), then:

a) If we initialise talkback before calling sigaction, then the program continues after the call to raise.
b) If we call sigaction before initialising talkback, then talkback catches it, and we appear to crash (I had to hack lots of stuff to get it to run at all, so I'm not surprised that the dialog didn't show up).

stracing the build shows that talkback calls sigaction for SIGPIPE. And of course NSPR is initialised before it starts registering components, so we get case b). This would also explain why I stopped seeing this: I stopped using the branch builds to use self-built trunk stuff (without talkback) after 6.1 was released.

talkback people - we need a way to convince talkback not to register a handler for SIGPIPE. Alternatively, we can reset the signal handler for SIGPIPE ourselves in the QFA component. We don't have to worry about portability in setting the signal handler because Linux is the only Unix system to use talkback. I can come up with the obvious patch for that if that's felt to be the best way to go.

Why did this wait until June to hit us? This should have been happening forever, shouldn't it?
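The ordering effect in cases a) and b) can be reproduced without talkback at all, since the last sigaction() call for a signal wins. The sketch below is a stand-in, not the crasher.c test or the actual patch: reporter_init and reporter_handler are hypothetical substitutes for talkback's initialisation, and ignore_sigpipe repeats the NSPR-style call quoted earlier in this bug.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>   /* used by the usage check shown below */
#include <signal.h>
#include <stddef.h>

static volatile sig_atomic_t reporter_hits = 0;

/* Stand-in for a crash reporter's handler; it only counts invocations. */
static void reporter_handler(int signo)
{
    (void)signo;
    reporter_hits++;
}

/* Same disposition change as the NSPR snippet quoted earlier. */
static int ignore_sigpipe(void)
{
    struct sigaction sigact;
    sigact.sa_handler = SIG_IGN;
    sigemptyset(&sigact.sa_mask);
    sigact.sa_flags = 0;
    return sigaction(SIGPIPE, &sigact, NULL);
}

/* Hypothetical stand-in for crash reporter initialisation. */
static int reporter_init(void)
{
    struct sigaction sigact;
    sigact.sa_handler = reporter_handler;
    sigemptyset(&sigact.sa_mask);
    sigact.sa_flags = 0;
    return sigaction(SIGPIPE, &sigact, NULL);
}

/* Returns how many times the reporter saw SIGPIPE for the given order. */
int hits_when(int reporter_first)
{
    reporter_hits = 0;
    if (reporter_first) {
        reporter_init();     /* case a): reporter first, then ignore */
        ignore_sigpipe();
    } else {
        ignore_sigpipe();    /* case b): ignore first, then reporter */
        reporter_init();
    }
    raise(SIGPIPE);
    return (int)reporter_hits;
}
```

Under these assumptions hits_when(1) returns 0 and hits_when(0) returns 1. Re-applying the ignore call after the reporter has initialised, which is the workaround discussed here, turns case b) back into case a).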
bbaetz: good detective work! We should report this bug to the talkback vendor and ask them for a fix or workaround. (The workaround is probably the obvious patch.)
shiva, can you check on this?
The workaround is the obvious patch, and when I get in to work I'll generate one (of course, I can't actually test that it works, but I'll add a raise(SIGPIPE), and check that talkback doesn't come up). The question of why this didn't hit us earlier still remains, though. Did something in PSM change so that they try writing to a closed socket? This should have hit non-PSM use, in theory.
It does hit non-PSM use. Out of the last 100 or so libc.so.6 stack signature crashes that I went through, I'd say about 40-60 were this crash, 14 were SIGPIPEs in libX11.so.6, and 7 were an IMAP-related SIGPIPE.
OK. Let's disable the signal, then. This will then be identical to the non-talkback builds. Patch once I get somewhere I can easily build commercial.
Taking. Patch to commercial tree attached; looking for r, sr. I've compiled the file I've modified, but I can't link it or test it, because I don't have the config stuff for that.
Assignee: javi → bbaetz
Status: REOPENED → NEW
Component: Client Library → Talkback
Product: PSM → Browser
Target Milestone: 2.1 → mozilla0.9.4
Version: 2.0 → other
Attached patch patch (deleted) — Splinter Review
Are we going to ignore SIGPIPE completely, or is there a specific case in which we want to ignore it?
Yes, we need to completely ignore SIGPIPE. That code was taken from the NSPR init code - we're just duplicating that behaviour.
r=syd
ccing shannon and jdunn.
So I've been told that we support talkback on HPUX as well. I'll have to copy the NSPR code for that as well (which is slightly different). I'll need testing that the file compiles, to check that I've included the correct headers.
wtc says that the HPUX specific code is not needed, since mozilla doesn't use that type of threading (it's incompatible with X, apparently). So my original patch can go in as is. Can I get an sr from someone please? darin?
Your patch is fine for all Unix flavors, including HP-UX. Make sure the indentation is right. (Hard for me to tell from a patch file.) r=wtc.
fyi I believe we also use talkback on solaris.
sr=darin
Status: NEW → RESOLVED
Closed: 23 years ago
QA Contact: ckritzer → chofmann
Resolution: --- → FIXED
Fix checked into the comm tree yesterday. QA assigning to default talkback QA. I guess to verify, just look at the talkback logs.
I'm not finding this exact stack trace in talkback for today. I'm marking this verified, reopen if it turns up again.
Status: RESOLVED → VERIFIED
*** Bug 89518 has been marked as a duplicate of this bug. ***
*** Bug 96269 has been marked as a duplicate of this bug. ***
Product: Core → Core Graveyard
Crash Signature: [@ libc.so.6 - ssl_DefSend]