Closed Bug 102227 Opened 23 years ago Closed 23 years ago

N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]

Categories

(Directory :: LDAP XPCOM SDK, defect)

x86
All
defect
Not set
major

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: leif, Assigned: leif)

References

Details

(Keywords: crash, topcrash, Whiteboard: [PDT+])

Crash Data

Attachments

(3 files, 1 obsolete file)

We have a few Talkback reports indicating that we are crashing on line 852 in nsLDAPConnection.cpp. The stack is:

    nsLDAPConnection::OnFound [d:\builds\seamonkey\mozilla\directory\xpcom\base\src\nsLDAPConnection.cpp, line 852]
    XPTC_InvokeByIndex [d:\builds\seamonkey\mozilla\xpcom\reflect\xptcall\src\md\win32\xptcinvoke.cpp, line 139]
    EventHandler [d:\builds\seamonkey\mozilla\xpcom\proxy\src\nsProxyEvent.cpp, line 515]
    PL_HandleEvent [d:\builds\seamonkey\mozilla\xpcom\threads\plevent.c, line 591]

The relevant code is:

    NS_IMETHODIMP
    nsLDAPConnection::OnFound(nsISupports *aContext,
                              const char* aHostName,
                              nsHostEnt *aHostEnt)
    {
        PRUint32 index = 0;
        PRNetAddr netAddress;
        char addrbuf[64];

        // Do we have a proper host entry? If not, set the internal DNS
        // status to indicate that host lookup failed.
        //
        if (!aHostEnt->hostEnt.h_addr_list ||
            !aHostEnt->hostEnt.h_addr_list[0]) {
            mDNSStatus = NS_ERROR_UNKNOWN_HOST;
            return NS_ERROR_UNKNOWN_HOST;
        }

        // Make sure our address structure is initialized properly
        //
        memset(&netAddress, 0, sizeof(netAddress));
        PR_SetNetAddr(PR_IpAddrAny, PR_AF_INET6, 0, &netAddress);

I can't think of any reason why we'd sometimes crash on this call to |memset()|, and I've not been able to reproduce it either. I'm kind of stumped about how to debug this problem: I don't understand how |netAddress| could fail to be correctly allocated on the stack. -- Leif
Status: NEW → ASSIGNED
From a talkback report:

x86 Registers:
EAX: 00060003 EBX: 60e32b60 ECX: 02a9afcc EDX: 606864b4
ESI: 02b0a954 EDI: 00000000 ESP: 0012fc28 EBP: 0012fc90
EIP: 6068332e cf PF af zf sf of IF df nt RF vm IOPL: 0
CS: 001b DS: 0023 SS: 0023 ES: 0023 FS: 0038 GS: 0000

cmp [eax],edi
60683330 0f84d9000000 je 6068340f
60683336 6a20 push 0x20
60683338 8d45e0 lea eax,[ebp-0x20]
6068333b 57 push edi
6068333c 50 push eax
6068333d e89a200000 call 606853dc
60683342 8d45e0 lea eax,[ebp-0x20]
60683345 50 push eax
60683346 57 push edi
60683347 6a17 push 0x17
60683349 6a01 push 0x1
6068334b ff15dc29dccc call dword ptr [ccdc29dc]
*** Bug 102567 has been marked as a duplicate of this bug. ***
I just ran into this on my linux box running a branch build. Talkback ID is 36186399.

x86 Registers:
EAX: 09fec8cc EBX: 41337130 ECX: 0000266e EDX: 41336998
ESI: 00000003 EDI: 09fece90 ESP: bffff1bc EBP: bffff298
EIP: 4132fd02 cf pf af zf sf of IF df nt RF vm IOPL: 0
CS: 0023 DS: 002b SS: 002b ES: 002b FS: 0000 GS: 0007

Code Around the PC:
4132fd02 833900 cmp dword ptr [ecx],0x0
4132fd05 7519 jnz 4132fd20
4132fd07 8b4508 mov eax,[ebp+0x8]
4132fd0a c7404c1e004b80 mov dword ptr [eax+0x4c],0x804b001e
4132fd11 b81e004b80 mov eax,0x804b001e
4132fd16 e945010000 jmp 4132fe60
4132fd1b 90 nop
4132fd1c 8d742600 lea esi,[esi]
4132fd20 6a6c push 0x6c
After looking at this some more, neither Mose nor I am convinced that the Talkback report is pointing at the correct line. In fact, we suspect the crasher might be at around line 845:

    if (!aHostEnt->hostEnt.h_addr_list || !aHostEnt->hostEnt.h_addr_list[0]) {

We've been able to reproduce a crasher on this exact line, where |aHostEnt->hostEnt.h_addr_list| is non-null but points into never-never land (or Uranus, as mose would say), and we segfault on the second half of the |if()| condition. It's still unclear how this structure is getting corrupted, or why. Does anyone have suggestions on whether a) I'm not testing the |aHostEnt| structure properly for "correctness", b) something could cause the DNS service (or possibly the proxy code) to corrupt the host data, or c) this is a corruption of the stack itself, making our |aHostEnt| point into the void somehow? Thanks! -- Leif
You might try adding assertions to nsDNSRequest::FireStop() to ascertain whether or not the hostent is corrupt at that point. I presume that aHostEnt is !nil, but I don't see a test for that.
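The missing null test mentioned above could look roughly like the following. This is a minimal standalone sketch, not the code from the patch: `HostEntSketch`, `nsHostEntSketch`, and `HasUsableAddress` are hypothetical stand-ins that model only the `hostEnt.h_addr_list` field quoted in this bug. A check of this kind only catches null pointers; it cannot detect a non-null pointer into memory that has already been released, which is the case reproduced in the previous comment.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real host entry types; only the
// field referenced in this bug (h_addr_list) is modelled.
struct HostEntSketch {
    char **h_addr_list;   // null-terminated array of address pointers
};

struct nsHostEntSketch {
    HostEntSketch hostEnt;
};

// Returns true only when the entry pointer itself is non-null and the
// entry carries at least one address.
// Limitation: a non-null pointer to freed memory passes the first two
// tests and still crashes (or worse) on the list[0] read.
bool HasUsableAddress(const nsHostEntSketch *aHostEnt)
{
    if (!aHostEnt) {
        return false;                       // the test that was missing
    }
    char **list = aHostEnt->hostEnt.h_addr_list;
    return list != nullptr && list[0] != nullptr;
}
```

In the real `OnFound`, a false result would correspond to setting `mDNSStatus` and returning `NS_ERROR_UNKNOWN_HOST`, as the existing check already does.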
OK, so I noticed that in my builds, the crash happens more often when there is an error dialog, after I select the error item. Additionally, just for grins, I tried recompiling nsLDAPConnection.cpp using PROXY_SYNC rather than PROXY_ASYNC. Interestingly, once when I saw the core dump with this PROXY_SYNC code, I saw an assertion from nsDNSRequest::Cancel: NS_ASSERTION(!PR_CLIST_IS_EMPTY(this), "request is not queue on lookup"); This makes me wonder whether ::Cancel is sometimes getting called after the lookup has already finished. Are those allowable semantics?
gordon: correct, aHostEnt is not nil. I tried adding the assertions you suggested, and the hostent is NOT corrupt just before the call to OnFound. So this may be proxy or xptcall or other event queue lossage of some sort.
OK, so I see what's going on here. The DNS service is calling OnFound back with a pointer to some private data. Then, it assumes that once OnFound returns, there's no need for the private data any more, and sets the nsCOMPtr holding it to nsnull. However, in the case of an asynchronous proxy, the call returns as soon as the event is queued, so the data may not have actually been used yet. So I think we can work around this in the short term by using a synchronous proxy (maybe I was mistaken when I thought it still dumped core before with the sync proxy, because it's not now). Long term, I'd propose that nsIDNSListener should hand back refcounted data directly, rather than just a pointer into a privately refcounted object. I'm still seeing the assertion I mentioned before with PROXY_SYNC; does anyone know what's up with this?
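The lifetime problem described in this comment can be sketched outside Mozilla with standard C++. Everything here is an assumption made for illustration: `LookupData`, `EventQueue`, and `DeferredEventSeesData` are hypothetical names, `std::shared_ptr` stands in for XPCOM refcounting, and a plain queue of closures stands in for the proxy event queue. The borrowed pointer is modelled with `std::weak_ptr` so the sketch can observe that the data is gone without actually dereferencing a dangling pointer.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <string>

// Hypothetical stand-in for the DNS service's privately owned lookup data.
struct LookupData {
    std::string address;
};

// Hypothetical stand-in for a proxy event queue that is drained later.
using EventQueue = std::queue<std::function<void()>>;

static void Drain(EventQueue &queue)
{
    while (!queue.empty()) {
        queue.front()();
        queue.pop();
    }
}

// Returns true if the listener's event still found the data alive.
//   eventHoldsReference: the event carries its own reference
//                        (the "hand back refcounted data" proposal).
//   runSynchronously:    the event runs before the service releases
//                        its reference (the PROXY_SYNC workaround).
bool DeferredEventSeesData(bool eventHoldsReference, bool runSynchronously)
{
    EventQueue queue;
    bool sawData = false;

    // The "DNS service" holds the only owning reference at first.
    auto owner = std::make_shared<LookupData>(LookupData{"192.0.2.1"});

    std::weak_ptr<LookupData> borrowed = owner;   // models the raw pointer
    std::shared_ptr<LookupData> kept;             // optional extra reference
    if (eventHoldsReference) {
        kept = owner;
    }

    // "Calling OnFound" through the proxy only queues an event.
    queue.push([&sawData, borrowed, kept] {
        sawData = !borrowed.expired();
    });
    kept.reset();   // the queued closure keeps its own copy, if any

    if (runSynchronously) {
        Drain(queue);           // listener runs before the call returns
    }

    owner.reset();              // service: "OnFound returned, drop the data"

    Drain(queue);               // async case: listener runs only now
    return sawData;
}
```

With both flags false the event finds the data already released, which mirrors the crash; setting either flag keeps the data alive for the listener. Which of these approaches the checked-in patch uses is not stated in this bug.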
The assertion is happening when the nsLDAPConnection destructor calls mDNSRequest->Cancel. It's not clear to me why this is happening, however: I added some logging, and nsLDAPConnection::OnStopLookup is getting called, and that function zeroes out mDNSRequest.
Keywords: crash, nsbranch+
Attached patch Possible fix, v1 (obsolete) (deleted) — Splinter Review
Comment on attachment 52290 [details] [diff] [review] Possible fix, v1 This patch is missing one part, posting a new one soon.
Attachment #52290 - Attachment is obsolete: true
Attached patch Potential fix, v2 (deleted) — Splinter Review
Requesting SR= and R= on the v2 patch. It's tested on all three platforms. -- Leif
Comment on attachment 52295 [details] [diff] [review] Potential fix, v2 sr=bienvenu
Attachment #52295 - Flags: superreview+
Checked in on trunk. Richi P.: can you maybe try a "trunk" build on Monday or so, and see if this fixes your problem? Thanks, -- Leif
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
I'm using build 2001100503 on win32 right now. Unfortunately, a lot has happened since I sent that bug report. One of the major changes is that I deleted my User profile and started from scratch (some changes a few weeks back caused Mozilla installers to **** on me). With this build, Mozilla doesn't seem to crash anymore when doing an LDAP lookup. I'll bang on it some more and see what happens. I'll also download a build on Monday and see if that makes any difference.
Sorry ... spoke too soon. It's still happening on 2001100503 win32 (I just noticed that the Platform heading for this bug report says Linux only). The behavior is erratic. Near as I can tell, one of three things happens:
1) I start Mozilla, compose a message, type in a few chars, and it SIGSEGVs (the win32 equivalent, at least).
2) I start Mozilla, do some stuff, compose a message, type in a few chars, and some entries in the personal dictionary show up, with an error entry at the bottom reporting problems with the LDAP server. I try a different sequence of letters and next thing I know, LDAP is working.
3) LDAP works fine.
Once LDAP lookup starts to work, though, I can't seem to make it break again without restarting Mozilla. Will check again on Monday.
What was the timestamp on the file you downloaded? The fix wasn't checked in until around 7pm, so I suspect you won't see the fix in any builds until earliest Saturday morning. -- Leif
Finally! On win32 mozilla 2001100610 (timestamp 06-Oct-2001 14:06), doing LDAP lookups isn't crashing like before. Of course, there's very little traffic on the LAN so the environment is unlike that when I experienced it before, but it looks good so far.
Requesting PDT for checkin on 0.9.4 branch. -- Leif
Whiteboard: PDT
Verified with 20011008 trunk build on Windows 2000. LDAP auto complete works fine against the following servers:
Hostname: 208.12.37.50 Base DN: dc=mcom,dc=com
Hostname: 208.12.36.22 Base DN: o=Airius.com
Hostname: 208.12.37.103 Base DN: o=mcom.com
QA Contact: olgac → yulian
Whiteboard: PDT → [PDT+]
pls check this into the branch - PDT+
Checked in on 0.9.4 branch -- Leif
*** Bug 103868 has been marked as a duplicate of this bug. ***
Re-open to get into the 0.9.5 branch.
Blocks: 101793
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Checked in on 0.9.5 branch
Status: REOPENED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
We still show four incidents on the Trunk as recently as 10-04. Can we check it in? Adding info for talkback tracking. This was a topcrasher on the branch. Changing platform to reflect that this was/is happening on Windows and Linux.
Keywords: topcrash
OS: Linux → All
Hardware: All → PC
Summary: Segfault in OnFound in nsLDAPConnection → N620 Trunk Segfault in OnFound in nsLDAPConnection [@ nsLDAPConnection::OnFound]
Tom, do you see this on the topcrash report for the 094 branch and 095 branch after 10-9? Thanks.
greer: re-read the comments in the bug, and you'll see that the fix wasn't checked in until late on 10/5, so it's not surprising that there are crashes on 10/4.
Talkback data shows no incidents with this signature after 10/9. Marking VERIFIED fixed.
Status: RESOLVED → VERIFIED
Crash Signature: [@ nsLDAPConnection::OnFound]