Closed Bug 86917 Opened 23 years ago Closed 22 years ago

socket transport should try all ip addresses returned by DNS service Trunk M098 [@ PR_GetIdentitiesLayer]

Categories

(Core :: Networking, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla0.9.9

People

(Reporter: gordon, Assigned: darin.moz)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(1 file, 5 obsolete files)

The DNS service returns multiple ip addresses for domains that provide them. Either the socket transport needs to cycle through these ip addresses if the first one fails, or we need to change the DNS service API to simply return a single ip address, but internally round-robin them.
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla1.0
i don't think the DNS api should change, since you'd still want to allow the socket transport to be able to try other ip addresses if the first one fails. how about just having the socket transport make a random selection from the set of resolved ip addresses?
It shouldn't just be a random selection. There should be some sort of iteration through the addresses if one fails.
*** Bug 66872 has been marked as a duplicate of this bug. ***
Try the following: 1. Visit www.google.com 2. Type out any search criterion and press enter Mozilla spends time with DNS resolution for www.google.com all over again. Isn't there the concept of DNS cache? This continuous DNS resolution makes Mozilla very, very slow... It took 8 seconds for Mozilla to load www.google.com, a page I visit near ten times a day!
manoj: what mozilla build are you using?
*** Bug 87971 has been marked as a duplicate of this bug. ***
*** Bug 89204 has been marked as a duplicate of this bug. ***
+mostfreq - kept forgetting to do that.
Keywords: mostfreq
Attached patch tiny patch for this bug (obsolete) (deleted) — Splinter Review
I made a patch. I think using getaddinfo() is better solution, but this patch may help people who want to fix this bug ASAP.
gordon/darin: Can you look at the Patch ??
Comment on attachment 41587 [details] [diff] [review] tiny patch for this bug could you please remove all #if 0'd code from the patch, and resubmit? thanks!
Comment on attachment 41587 [details] [diff] [review] tiny patch for this bug could you please remove all #if 0'd code from the patch, and resubmit? thanks!
Attachment #41587 - Flags: needs-work+
Attached patch remove #if from previous patch (obsolete) (deleted) — Splinter Review
Comment on attachment 49073 [details] [diff] [review] remove #if from previous patch some problems and nits: 1- fix indentation to 4 spaces 2- use nsnull instead of NULL 3- remove extraneous ";" after "PRNetAddrList *p, *next;;" 4- move "p = next" inside the for loop declaration. 5- why have you removed the line which initializes mNetAddress.ipv6.port? 6- please do not declare data structures prefixed with "PR" ... that prefix is reserved for use by NSPR. so, PRNetAddrList is wrong... nsNetAddrList would be much better. 7- "mNetAddressp" is not a very descriptive name. please resolve these issues, thanks!
> 1- fix indentation to 4 spaces > 2- use nsnull instead of NULL > 3- remove extraneous ";" after "PRNetAddrList *p, *next;;" > 6- please do not declare data structures prefixed with "PR" ... > that prefix is reserved for use by NSPR. so, PRNetAddrList is wrong... > nsNetAddrList would be much better. OK, I'll fix them. > 4- move "p = next" inside the for loop declaration. Do you mean "for(p = mNetAddressList; p; p = next){" ? I think my code is more easy to understand. Or how about this? for(p = mNetAddressList; p; (next = p->next, delete p, p = next)) ; > 5- why have you removed the line which initializes mNetAddress.ipv6.port? Becase there isn't mNetAddress now. > 7- "mNetAddressp" is not a very descriptive name. Could you suggest a good name ?
4,5- no problem 7- you have mNetAddressList... so how about just mNetAddress?
actually, better yet... why do you even need to store mNetAddressList? you could just store mNetAddress (which has a next pointer) and then free the first address while advancing to the next. then i think you could do away with mNetAddressList.
*** Bug 101033 has been marked as a duplicate of this bug. ***
*** Bug 100950 has been marked as a duplicate of this bug. ***
Attached patch modified according to darin's suggestion (obsolete) (deleted) — Splinter Review
Comment on attachment 52043 [details] [diff] [review] modified according to darin's suggestion >Index: nsSocketTransport.cpp >+ nsNetAddrList *p, *next; >+ for(p = mNetAddress; p; ){ >+ next = p->next; >+ delete p; >+ p = next; >+ } please indent using 4 spaces to be consistent with the rest of the file. >@@ -641,14 +647,13 @@ >+ if (mNetAddress == nsnull) { > // > // Initialize the port used for the connection... > // > // XXX: The list of ports must be restricted - see net_bad_ports_table[] in > // mozilla/network/main/mkconect.c > // >- mNetAddress.ipv6.port = PR_htons(((mProxyPort != -1 && !mProxyTransparent) ? mProxyPort : mPort)); how about removing the comments above this line, since they no longer apply. >@@ -672,7 +677,7 @@ > // > // The DNS lookup has finished... It has either failed or succeeded. > // >- if (NS_FAILED(mStatus) || !PR_IsNetAddrType(&mNetAddress, PR_IpAddrAny)) { >+ if (NS_FAILED(mStatus) || mNetAddress != nsnull) { the != nsnull is not needed. just test mNetAddress. >@@ -832,7 +837,8 @@ > // This is only done the first time doConnection(...) is called. > // > if (NS_SUCCEEDED(rv)) { >- status = PR_Connect(mSocketFD, &mNetAddress, gConnectTimeout); >+try_again: >+ status = PR_Connect(mSocketFD, &(mNetAddress->mNetAddress), gConnectTimeout); mNetAddress->mNetAddress is awkward... you need to come up with better names for these. how about: struct nsNetAddressList { PRNetAddr addr; nsNetAddressList *next; }; mNetAddressList->addr mNetAddressList->next >@@ -857,6 +863,12 @@ > // > else { > // Connection refused... >+ nsNetAddrList *p = mNetAddress; >+ mNetAddress = mNetAddress->next; >+ delete p; >+ if(mNetAddress){ >+ goto try_again; >+ } make sure you use a consistent coding style... put a space between the if and opening ( >+ nsNetAddrList *p = mNetAddress; >+ mNetAddress = mNetAddress->next; >+ delete p; >+ if(mNetAddress){ >+ goto try_again; >+ } the indentation here is really wacked. >+ nsNetAddrList *p = mNetAddress; >+ mNetAddress = mNetAddress->next; >+ delete p; >+ if(mNetAddress){ >+ goto try_again; >+ } again, cleanup the indentation. >+/****************/ >+ mNetAddress = new nsNetAddrList; >+ mNetAddress->next = nsnull; indentation! >Index: nsSocketTransport.h >+typedef struct nsNetAddrList{ >+ PRNetAddr mNetAddress; >+ struct nsNetAddrList *next; >+} nsNetAddrList; >+ the typedef is not needed. this is C++ ... just declare it as a struct. see suggestions above.
Attachment #52043 - Flags: needs-work+
Attached patch Yet another patch (obsolete) (deleted) — Splinter Review
Comment on attachment 52221 [details] [diff] [review] Yet another patch >Index: nsSocketTransport.cpp > else { > // Connection refused... >+ nsNetAddrList *p = mNetAddressList; >+ mNetAddressList = mNetAddressList->next; >+ delete p; how about writing a little member function to do this, something like: AdvanceToNextAddress() or... nsNetAddrList::Advance(&mNetAddressList); since this code fragment is repeated several times, it would make sense to put it in a little inline function. >+ if (mNetAddressList){ >+ goto try_again; >+ } no need for the extra braces here... please remove. how about a comment here, indicating what is going on here (even though it is completely obvious to anyone familar with the code). >+ for (p = aHostEnt->hostEnt.h_addr_list; *p; p++){ add a space before the opening { >+ *addrp = new nsNetAddrList; >+ (*addrp)->next = nsnull; >+ (*addrp)->addr.ipv6.port = PR_htons(((mProxyPort != -1 && !mProxyTransparent) ? mProxyPort : mPort)); >+ (*addrp)->addr.raw.family = PR_AF_INET6; >+ memcpy(&((*addrp)->addr.ipv6.ip), *p, sizeof((*addrp)->addr.ipv6.ip)); how about a comment above this data structure definition? >+struct nsNetAddrList { >+ PRNetAddr addr; >+ struct nsNetAddrList *next; >+}; >- PRNetAddr mNetAddress; >+ nsNetAddrList *mNetAddressList; please remove a space before the "*" so the m's line up. ok.. so, basically just a couple more minor nits... this patch is starting to look good. make these last modifications and then i think the patch should be good to go.
Attachment #52221 - Flags: needs-work+
*** Bug 104848 has been marked as a duplicate of this bug. ***
*** Bug 106237 has been marked as a duplicate of this bug. ***
*** Bug 108108 has been marked as a duplicate of this bug. ***
Blocks: 108108
This is still a problem. An example of this is if I try to access the web site: http://www.kame.net This address has both IPv4 and IPv6 addresses. If I use mozilla release 20011116 under Linux, I cannot connect to http://www.kame.net, because mozilla only tries to connect to the IPv6 address. Under Netscape 4, I can successfully connect to that web site, because it looks up the IPv4 address. I am on an IPv4-only network.
Bugs targeted at mozilla1.0 without the mozilla1.0 keyword moved to mozilla1.0.1 (you can query for this string to delete spam or retrieve the list of bugs I've moved)
Target Milestone: mozilla1.0 → mozilla1.0.1
+mozilla 1.0 - DNS has supported this for a long time...
Keywords: mozilla1.0
*** Bug 112426 has been marked as a duplicate of this bug. ***
-> me
Assignee: gordon → darin
Status: ASSIGNED → NEW
-> 0.9.8
Status: NEW → ASSIGNED
Target Milestone: mozilla1.0.1 → mozilla0.9.8
Btw, it would be nice to be able to force using v4 or v6 in browser somehow, maybe something like http4:// and http6://
I don't know if that really belongs at the protocol level, which is what the scheme in a URL indicates. Lets get a new bug for this one please.
*** Bug 114719 has been marked as a duplicate of this bug. ***
*** Bug 115793 has been marked as a duplicate of this bug. ***
*** Bug 68424 has been marked as a duplicate of this bug. ***
*** Bug 118619 has been marked as a duplicate of this bug. ***
What mechanism outside mozilla can I use to determine if a problem I'm seeing is a dup of this bug? On our network, folks tend to send url's via email and omit the domain, i.e. http://rdc11/scripts/cgi/BldToolWeb/home.pl instead of http://rdc11.pc.sas.com/scripts/cgi/BldToolWeb/home.pl I am running 0.9.7 on d5120.us.sas.com and can't view the first url with 0.9.7 I get The requested URL could not be retrieved While trying to retrieve the URL: http://rdc11.unx.sas.com/scripts/cgi/BldToolWeb/home.pl The following error was encountered: Unable to determine IP address from host name for rdc11.unx.sas.com The dnsserver returned: Name Error: The domain name does not exist. This means that: The cache was not able to resolve the hostname presented in the URL. Check if the address is correct. Your cache administrator is squid@unx.sas.com. Note that I did not use rdc11.unx.sas.com anywhere, nor is it in my proxy settings. The shorter URL works in IE and Netscape 4.7 When I do an nslookup : bash-2.01$ nslookup rdc11.pc.sas.com Name Server: nssrv01.unx.sas.com Address: 10.19.1.23 Name: rdc11.pc.sas.com Address: 10.28.10.70 I don't see more than one IP address. I read this bug and many that were marked dups of this, but I'm really a mozilla tester/user, not a developer, so I don't know the network intracacies (i.e. IPv6 vs IPv4 and how to know what responses Mozilla is getting from a server.)
-> 0.9.9
Target Milestone: mozilla0.9.8 → mozilla0.9.9
I know I'm stretching it, but shouldn't the severity of this bug be critical due to data loss?
I believe it should, since this makes causes some sites *inaccessible* in mozilla - those who have both A and AAAA records (eg, www.stealth.net,www.6bone.net) if you don't have v6 connectivity.
I suspect this fix is a right fix. Mozilla use IPv4 mapped IPv6 address. Use of IPv4 mapped IPv6 address is not recommended for serveral reasons. One reason is that many socket options are specific to IPv4 address only. One is security reason(see URL below) http://www.iijlab.net/i-d/draft-itojun-ipv6-transition-abuse-01.txt NetBSD disabled the use of IPv4 mapped IPv6 address by default. There is an socket option to enable it. Maybe other *BSD using KAME stack(www.kame.net) will disable it by default. And most important reason is Windows does not support IPv4 mapped IPv6 address. mozilla creates PF_INET6 socket, then try to connect to first address with this socket. if connection failed, try to connect to next address with this socket, not newly created socket. If IPv4 mapped IPv6 address is supported, this method will work. But IPv4 mapped IPv6 address is not supported, this is not a right method. Mozilla should create new socket with approprate PF_* and connect with the socket. I known this is an important issue, but I don't have time to work with...
i think you have it backwards: that document talks about "IPv6 over IPv4 tunnelling" which mozilla has nothing to do with. mozilla merely stores addresses in IPv6 format, which is a superset of the IPv4 format. we use the IPv6 format in order to have one data structure for representing IP addresses. as far as this bug being a dataloss bug, can anyone demonstrate that this bug causes dataloss in the real world? are there numbers of websites out there that mozilla fails to work with as a result of this bug? my feeling is that it isn't the case... and that's is why this bug hasn't gotten fixed yet. there are other much more critical bugs in mozilla, and those should and usually do get solved first. but, rest assured this will be fixed by mozilla 1.0 :-)
ok, looks like i missed Jakub's examples )-:
Yes, I can confirm "data loss" in a sense. If a given web site has both an A record (IPv4)and an A6 or AAAA record (IPv6), and the IPv6 is unreachable for any reason, the web siteis inaccessable, even if the IPv4 host is responsive. I just re-confirmed this with www.jp.freebsd.org,which has both AAAA and A records in DNS, but the AAAA host is not reachable.Also, take a look at section 3.5 of the w3's "Common User Agent Problems"document at http://www.w3.org/TR/cuap
Dataloss in the real world: +mostfreq by comment 8 of 47 and 13 dups. Affects mailnews as well as any other aspect of Mozilla for which name resolution is necessary. Comment 24 looks like this bug was almost fixed in 10/2001 - come on, we have 12 hours left before the tree closes for 0.9.8! :-)
*alright* time permitting, let's see if this can't make it in for 0.9.8 ... no promises though :-)
Target Milestone: mozilla0.9.9 → mozilla0.9.8
Attached patch revised patch (obsolete) (deleted) — Splinter Review
this new patch is a bit cleaner than the previous, and adds fewer memory allocations. i still need to setup a DNS server with which to test out this patch, and since it touches very critical code, i think it would be best to NOT try to force this in for mozilla 0.9.8. i'm going to wait until the mozilla 0.9.9 development cycle for this one.
Attachment #41587 - Attachment is obsolete: true
Attachment #49073 - Attachment is obsolete: true
Attachment #52043 - Attachment is obsolete: true
Attachment #52221 - Attachment is obsolete: true
Target Milestone: mozilla0.9.8 → mozilla0.9.9
Keywords: patch
Severity: normal → major
Priority: P3 → P2
rick: could you please sr= this patch? thx!
Comment on attachment 65153 [details] [diff] [review] revised patch r=gordon
Attachment #65153 - Flags: review+
Attachment #65153 - Flags: superreview+
the patch has been checked in. marking FIXED, but i would appreciate some help verifying that this patch is correct.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
This bug is not fixed. Here is a log. 3[50039f28]: nsSocketTransport::OnFound [www.netbsd.org:80 88b3d00] lookup succ eeded [FQDN=www.netbsd.org] 3[50039f28]: => 3ffe:8050:201:1860:290:27ff:feab:19a7 3[50039f28]: => ::ffff:204.152.186.171 3[50039f28]: nsSocketTransport: OnStopLookup(...) [www.netbsd.org:80 88b3d00]. Status = 0 Host is www.netbsd.org 2[50025f28]: nsSocketTransport: Entering Process() [host=www.netbsd.org:80 this =88b3d00], aSelectFlags=0, CurrentState=1. 2[50025f28]: nsSocketTransport: Transport [host=www.netbsd.org:80 this=88b3d00] is in WaitDNS state. 2[50025f28]: nsSocketTransport: Entering doResolveHost() [host=www.netbsd.org:8 0 88b3d00]. 2[50025f28]: nsSocketTransport: Leaving doResolveHost() [www.netbsd.org:80 88b3 d00] rv = 0 2[50025f28]: nsSocketTransport: Transport [host=www.netbsd.org:80 this=88b3d00] is in Closed state. 2[50025f28]: nsSocketTransport: Transport [host=www.netbsd.org:80 this=88b3d00] is in WaitConnect state. 2[50025f28]: nsSocketTransport: Entering doConnection() [host=www.netbsd.org:80 this=88b3d00], aSelectFlags=0. 2[50025f28]: nsSocketTransport: Connection Refused [www.netbsd.org:80 88b3d00]. PRErrorCode = ffffe896 2[50025f28]: nsSocketTransport: Leaving doConnection() [www.netbsd.org:80 88b3d 00]. rv = 804b000d. 2[50025f28]: connection failed... trying next address 2[50025f28]: nsSocketTransport: Leaving Process() [host=www.netbsd.org:80 this=88b3d00], mStatus = 80470007, CurrentState=3, mSelectFlags=0
taya: is that where the log ends? reopening
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
so, i setup a dummy bind instance w/ entries like so: foo IN A 192.168.1.10 foo IN A 192.168.1.11 foo IN A 10.169.106.21 the 192.168.1.10 and .11 addresses don't have any servers on port 80. only the last entry has a server on port 80. bind (or perhaps libc) seems to return the addresses in random order, so it looks like this is a pretty good testcase. a current mozilla trunk build crashes when using this dummy nameserver to connect to the host 'foo'; however, with this patch it now works. the problem was that the invalid socket wasn't getting removed from the select list properly. as a result PR_Poll ended up dereferencing a junk file descriptor pointer.
Attachment #65153 - Attachment is obsolete: true
hey darin, i'm a little confused why the following lines are needed: + // need to re-enter Process() asynchronously + mService->AddToWorkQ(this); + done = PR_TRUE; shouldn't just doing the 'continue' cause the next mNetAddress to be tried... at this point the old socketFD has been removed from the select_list, so it should be safe to just cycle through the mNetAddrList entries until a connection succeeds or it runs out of IP addresses... what have i forgotten ;-) ? -- rick
rick: there's a hidden crash waiting in what you propose... see the former patch ;-) the problem is that when the STS calls Process() it has not yet removed the socket from the select list because if Process() returns BASE_STREAM_WOULD_BLOCK the socket would have to remain on the select list. the STS optimizes for the case when the socket has to remain on the select list, which is IMO the correct optimization. this means that Process() must not return BASE_STREAM_WOULD_BLOCK if the socket is invalid. so, i need a state change to handle connection refused, and the state change can't happen synchronously cuz then i'd end up returning BASE_STREAM_WOULD_BLOCK from Process(). i need to have the STS call Process() again after it has removed my old dead socket from its select list.
I see the problem now... I thought that transport->Process() was *only* called from nsSocketTransportService::ProcessWorkQ() -- where the transport *is* removed from the select list prior to calling Process()... I didn't realize that nsSocketTransport::Process() was also called directly in Run().. Personally, i think it would be cleaner to change nsSocketTransportService::Run() to *always* remove the transport before calling nsSocketTransport::Process() and then only adding it back onto the select-list if NS_BASE_STREAM_WOULD_BLOCK is returned... But I completely understand your reluctance to do so ;-) -- rick
Comment on attachment 67844 [details] [diff] [review] the patch that should have been checked in sr=rpotts@netscape.com
Attachment #67844 - Flags: superreview+
gordon: can you r= this latest patch?
Comment on attachment 67844 [details] [diff] [review] the patch that should have been checked in r=gordon.
Attachment #67844 - Flags: review+
fixed-on-trunk
Status: REOPENED → RESOLVED
Closed: 23 years ago23 years ago
Resolution: --- → FIXED
Sorry for the spam, I just wanted to say this is working great on w2k build 2002020703 in Navigator and MailNews - even better than NC4's weird behavior of failing over to a second IP if the first one is unresponsive, but not trying the first one again (for the duration of the session anyway) if the second one suddenly becomes unresponsive.
*** Bug 123628 has been marked as a duplicate of this bug. ***
Updating this bug with the info from the duped bug 123628. This one was fixed on the Trunk after M098 so we see a few crashes in M098 with the matching signature. However, all crashes on the Trunk stopped in the 2/06 builds following the checkin. Marking verified fixed.
Status: RESOLVED → VERIFIED
Keywords: crash, topcrash
Summary: socket transport should try all ip addresses returned by DNS service → socket transport should try all ip addresses returned by DNS service Trunk M098 [@ PR_GetIdentitiesLayer]
*** Bug 125909 has been marked as a duplicate of this bug. ***
Please refer to bug 08108(http://bugzilla.mozilla.org/show_bug.cgi?id=108108). I found that the call stack like this: nsSocketTransport::Process in case eSocketState_WaitDNS ->nsSocketTransport::doResolveHost *** here we use "mNetAddrList.Init(1);" *** ->nsSocketTransportService::LookupHost (or lookup in cached dns service) in case eSocketState_WaitConnect ->nsSocketTransport::doConnection if failed ->nsSocketTransport::OnConnectionFailed ->mNetAddrList.GetNext(mNetAddress) *** how can we GetNext here??? ***
*** Bug 161388 has been marked as a duplicate of this bug. ***
As in comment 44 http://bugzilla.mozilla.org/show_bug.cgi?id=86917#c44, mozilla will fail to work on IPv4 addresses if the machine has no IPv6 mapped IPv4 address (::ffff:w.x.y.z) As I mentioned in bug 161388, if this address is toggled on and off mozilla will alternately fail and work. w3m works fine, when a machine has a AAAA, it uses IPv6, when the machine has only an A, it will use IPv4. While yes, as comment 45 says, the referenced document deals with the dangers of tunneled IPv6 over IPv4, some OSs will disable these mapped address to avoid the dangers illustrated in that document. When they disable mapped addresses, mozilla will not access any IPv4 server.
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Same here as #72, on FreeBSD 5.0-CURRENT #0: Fri Aug 2 02:29:30 GMT 2002 sysctl net.inet6.ip6.v6only=1 and i can't connect anywhere sysctl net.inet6.ip6.v6only=0 and it works again.
Darin: if you checked this into the trunk and any relevant branches at the time, and it was verified, then I'd like to move any new problems to new bugs. We already had one bug that reported a regression/new problem, so I want to keep the drift minimal. As I understand it, this bug had a fix that worked for multiple IPv4 addresses. There was also a bug that asked for retries if the "connection refused" to a port. If this was never meant to fix multiple IPv6 or mixed IPv4|IPv6 DNS configurations, please note that as well.
benc: this was meant to fix the problem for IPv6 as well as mixed IPv6|IPv4, and indeed it did fix that originally. however, the fix for bug 149943 regressed the mixed IPv6|IPv4 support. that said, i'd also prefer to deal with the regression in a new bug. i think there is one... see bug 161228.
*** Bug 166009 has been marked as a duplicate of this bug. ***
re-resolving as FIXED.
Status: REOPENED → RESOLVED
Closed: 23 years ago22 years ago
Resolution: --- → FIXED
I'm probabaly missing something, where this transferred to a new bug. But I'm seeing (with an IPv6 enabled 1.2.1) that if you go to a host with AAAA and A records, but it offers HTTP on both v4 and v6 addresses, and HTTPS only on the v4 record, that you will get a "connection refused" when you try to connect to it via https. It looks like this sort of problem was being addressed here, and moved to another bug. But, bug 161228 appears not to address things like this, and it's also marked resolved. Can someone point me to a bug relating to the problem I'm describing here, or tell me I need to open a new one (which I'd consider very unlikely)... Thanks.
I never did get any comments regarding my question on this bug a few months ago. Is anyone looking at this? Should I be looking somewhere else?
chris: please file a new bug, and be sure to report the port numbers being used for HTTP and HTTPS. sample URLs would do just fine. thx!
Crash Signature: [@ PR_GetIdentitiesLayer]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: