Closed
Bug 412457
Opened 17 years ago
Closed 9 years ago
should unescape hostname first, then perform IDNA
Categories
(Core :: Networking, defect)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla45
People
(Reporter: erik, Assigned: valentin)
References
(Blocks 2 open bugs)
Details
(Keywords: dev-doc-needed)
Attachments
(2 files, 4 obsolete files)
(deleted),
patch
|
Biesinger
:
review-
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mcmanus
:
review+
|
Details | Diff | Splinter Review |
Host names in URIs (and IRIs) may have %-escaped octets in them. Firefox (and other Mozilla software) should unescape the host name before applying the IDNA (Internationalized Domain Names in Applications) process. For example: <a href="http://十%2ecom/">blah</a> The above causes Firefox 2 to display an error message containing the string xn--%2ecom-9b5j. Other browsers like MSIE 7 and Opera 9 will convert the above domain name into xn--kkr.com and will succeed. The steps should be: 1. convert incoming HTML to Unicode 2. resolve NCRs like 十 3. parse URI/IRI to get host name 4. unescape host name (%2e becomes .) 5. perform IDNA's ToASCII 6. perform DNS lookup 7. perform HTTP request with Host: header
Reporter | ||
Comment 1•17 years ago
|
||
If step 4 yields invalid UTF-8 in some labels (label = between dots), skip step 5 or perform step 5 only on the valid UTF-8 labels.
Reporter | ||
Comment 2•17 years ago
|
||
On 2nd thought, if step 4 yields malformed UTF-8, abort the lookup, and produce an error message (or something). Firefox currently already returns an error if there are any %-escapes in the host name, so aborting on malformed UTF-8 would not be a step backwards.
Comment 3•16 years ago
|
||
Host name is unescaped just before NormalizeIDN() is called.
Reporter | ||
Comment 4•16 years ago
|
||
Thanks for working on this. With this patch, what happens when the host name contains an escaped slash (i.e. %2F)?
Comment 5•16 years ago
|
||
Sequence %2f in the hostname will be unescaped and will be part of the hostname. F.e. URL "http://a%2fb/c" is parsed as follows (see mHost): (gdb) p *url $4 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = { _vptr.nsISupports = 0x235ee88}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> = { _vptr.nsISupports = 0x235ef98}, <No data fields>}, <No data fields>}, <nsISerializable> = {<nsISupports> = { _vptr.nsISupports = 0x235efb8}, <No data fields>}, <nsIClassInfo> = {<nsISupports> = {_vptr.nsISupports = 0x235efd4}, <No data fields>}, mRefCnt = { mValue = 1}, _mOwningThread = {mThread = 0x9bd5550}, mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x9cf7478 "http://a/b/c", mLength = 12, mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort = -1, mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 3}, mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, mHost = {mPos = 7, mLen = 3}, mPath = {mPos = 10, mLen = 2}, mFilepath = { mPos = 10, mLen = 2}, mDirectory = {mPos = 10, mLen = 1}, mBasename = { mPos = 11, mLen = 1}, mExtension = {mPos = 13, mLen = -1}, mParam = { mPos = 12, mLen = -1}, mQuery = {mPos = 12, mLen = -1}, mRef = {mPos = 12, mLen = -1}, mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x23269a2 "", mLength = 0, mFlags = 1}, <No data fields>}, mParser = {mRawPtr = 0x9cc1eb8}, mFile = {mRawPtr = 0x0}, mHostA = 0x0, mHostEncoding = 1, mSpecEncoding = 0, mURLType = 2, mMutable = 1, mSupportsFileURL = 0, static gIDN = 0x9c9d1e8, static gCharsetMgr = 0x0, static gInitialized = 1, static gEscapeUTF8 = 1, static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0} Connecting to such host fails at net_IsValidHostName(). But something like this is possible: js> var newURI2 = ios.newURI("http://example.com%2fxxx/", null, null); js> var newURI3 = ios.newURI(newURI2.spec, null, null); js> newURI2.host example.com/xxx js> newURI3.host example.com Is this behaviour correct? Maybe there should be check if '/' is present after hostname is unescaped.
Reporter | ||
Comment 6•16 years ago
|
||
First of all, when you re-insert a host name back into a URL, you must first escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; is questionable in the http: case. (MSIE does not treat ; as a host/path terminator in http: nor ftp:.) However, see below. Secondly, I suppose there are a couple of different places to check for bad characters in the host name. One might be just before a DNS lookup, but if an HTTP request is going through a proxy, the host is not looked up in DNS. So, checking for bad characters after unescaping and running it through IDNA might be good.
Comment 7•16 years ago
|
||
(In reply to comment #6) > First of all, when you re-insert a host name back into a URL, you must first > escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; > is questionable in the http: case. (MSIE does not treat ; as a host/path > terminator in http: nor ftp:.) However, see below. > > Secondly, I suppose there are a couple of different places to check for bad > characters in the host name. One might be just before a DNS lookup, but if an > HTTP request is going through a proxy, the host is not looked up in DNS. So, > checking for bad characters after unescaping and running it through IDNA might > be good. I added the check after unescaping and IDNA. No character that is allowed in hostname by net_IsValidHostName() needs escaping, so there is no nsEscapeCount()...
Attachment #324616 -
Attachment is obsolete: true
Attachment #324833 -
Flags: review?(cbiesinger)
Attachment #324616 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 8•16 years ago
|
||
Looks good. I recommend adding a couple of tests: http://%e5%8d%81.com -> http://xn--kkr.com http://%80.com -> error (invalid UTF-8)
Comment 9•16 years ago
|
||
(In reply to comment #8) > Looks good. I recommend adding a couple of tests: > http://%e5%8d%81.com -> http://xn--kkr.com > http://%80.com -> error (invalid UTF-8) Added. Erik, can you do the review?
Attachment #324833 -
Attachment is obsolete: true
Attachment #325078 -
Flags: review?(cbiesinger)
Attachment #324833 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 10•16 years ago
|
||
I recommend adding a comment to each unit test. For example: // Make sure escaped dot (%2e) is unescaped before applying IDNA I also recommend adding unit tests for all of the host terminators :/;?# I.e. %3A %2F %3B %3F %23 The code looks good, but I think we should ask a regular Mozilla reviewer to take a look.
Comment 11•16 years ago
|
||
Should we really be unescaping %-escaped octets in URIs? In particular, the "." part in a URI must be literally given, according to the RFC, and the label part must be a set of "alphanum"s, not "escaped"s. This seems dangerous.
Comment 12•16 years ago
|
||
Nevermind, I see the new RFCs allow it.
Comment 13•16 years ago
|
||
The new RFCs don't treat a semicolon as a host delimiter, by the way -- the ; character is allowed in hostnames. How should %-encoded bytes be treated? IE7 seems to just send the raw bytes to the DNS resolution layer as far as Erik's testing seems to show. Opera decodes the %-encoded bytes as UTF-8 and reencodes them as IDN before doing the query. Nobody else seems to do anything useful. What should we do?
Reporter | ||
Comment 14•16 years ago
|
||
The last paragraph of section 3.2.2 in the following RFC seems to suggest that Opera's interpretation is fine: http://www.ietf.org/rfc/rfc3986.txt I find it somewhat strange to send %-escaped text in DNS, since %-escapes are for URIs, not DNS. I think it's fine to stop treating the semicolon as a host, port or path terminator, especially in http: and https:. However, for ftp: it may still be useful (though MSIE doesn't support it).
Comment 15•16 years ago
|
||
Attachment #325078 -
Attachment is obsolete: true
Attachment #325254 -
Flags: review?(cbiesinger)
Attachment #325078 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 16•16 years ago
|
||
> there should be only allowed characters in hostname after
> unescaping and applying IDNA
How about "after unescaping and attempting to apply IDNA. "\x80" is illegal in UTF-8, so IDNA fails, and 0x80 is illegal in DNS too."
Have you tried some actual hrefs in real HTML files? (I'm not familiar with the rest of the Mozilla code, so I'd like to be sure that real HTML files work properly, in addition to the JavaScript tests that you have written.)
Comment 17•16 years ago
|
||
(In reply to comment #16) > Have you tried some actual hrefs in real HTML files? (I'm not familiar with the > rest of the Mozilla code, so I'd like to be sure that real HTML files work > properly, in addition to the JavaScript tests that you have written.) Yes, of course I did also some tests with URLs in HTML files and it worked correctly.
Attachment #325254 -
Attachment is obsolete: true
Attachment #325403 -
Flags: review?(cbiesinger)
Attachment #325254 -
Flags: review?(cbiesinger)
Reporter | ||
Comment 18•16 years ago
|
||
Looks good. Thanks! By the way, MSIE 6 and 7 reject host names with %-escapes in the range %00-1F, and %2F, %40 and %5C. MSIE 6 rejects %25 under some circumstances and MSIE 7 rejects %3F. Firefox should probably be at least as careful as MSIE, if not much more careful.
Comment 19•16 years ago
|
||
looks like this is a duplicate of bug 309671, although now that there's a patch here it probably shouldn't be marked as such
Comment 20•16 years ago
|
||
Comment on attachment 325403 [details] [diff] [review] minor change in comment Don't you also have to change SetHost? (Hm, if you're adding the hostname validity check here we might be able to remove it from the other places where we currently have it)
Updated•16 years ago
|
Attachment #325403 -
Flags: review?(cbiesinger) → review-
Updated•14 years ago
|
Comment 21•13 years ago
|
||
Hello. I don't know if the issue I'm about to present should be a separate issue. When an e-mail message is displayed in Thunderbird, the text body might contain a URI with an international domain name, such as http://www.henriksørensen.dk/ I hope that your program will display the above URI with a LATIN SMALL LETTER O WITH STROKE. I am not fully confident with how text is displayed in different programs when sent from Bugzilla to you. When I click on that URI in Thunderbird, it is sent to Firefox. Firefox tries to display the page at that URI but Firefox says Server not found. Firefox tries to resolve the domain name. Here is output from the plugin HttpFox: Host www.henriks%c3%b8rensen.dk The error code is NS_ERROR_UNKNOWN_HOST. Here is output from HttpFox when I manually type the same URL in Firefox: Host www.xn--henriksrensen-hnb.dk In the latter case the international domain name is correctly interpreted and handled by Firefox. Shouldn't Firefox treat a URI equally whether the URI is sent from Thunderbird or the URI is typed manually? Then I changed which browser should handle HTTP URIs. I temporarily chose Safari. When I click on the URI in Thunderbird, the URI is sent to Safari. The correct host is found right away. I use Thunderbird and Firefox for Mac OS X.
Assignee | ||
Comment 22•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba2bb2369ba4
Assignee | ||
Comment 23•9 years ago
|
||
Assignee | ||
Updated•9 years ago
|
Assignee: michal.novotny → valentin.gosu
Assignee | ||
Comment 24•9 years ago
|
||
Comment on attachment 8686406 [details] [diff] [review] should unescape hostname first, then perform IDNA This patch is based on code that has been r+d by Honza in Bug 1023468, but enough things have changed that it warrants a new review. Changes: * We do percent decoding of hostnames in BuildNormalizedSpec and SetHost. * NormalizeIDN returns an nsresult. We abort the parsing if the call fails. * NS_UnescapeURL is modified, so when called with the esc_Host flag, it only unescapes ASCII characters that are allowed in the hostname, and non-ascii characters. This is mostly so that hostnames containing % are not rejected, since they are part of the ID of some addons. * I also change NS_UnescapeURL to only deref *(p+1) and *(p+2) only once. It's clearer this way.
Attachment #8686406 -
Flags: review?(mcmanus)
Assignee | ||
Comment 25•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c6a52809582
Updated•9 years ago
|
Attachment #8686406 -
Flags: review?(mcmanus) → review+
Assignee | ||
Comment 27•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/cc4f5b4bb374b95c36c818e666c2a087fe6f20ca Bug 412457 - should unescape hostname first, then perform IDNA r=mcmanus
Comment 28•9 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/cc4f5b4bb374
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox45:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
Assignee | ||
Comment 29•9 years ago
|
||
This needs a dev-doc as it changes the way URL are parsed. Previously percent encoded characters in the hostname wouldn't be decoded, and clicking on http://%65%78%61%6d%70%6c%65%2e%6f%72%67 would fail as the domain %65%78%61%6d%70%6c%65%2e%6f%72%67 didn't exist. Now however, the hostname is percent decoded, thus correctly loading http://example.org Note that for http://example.org%3a %3a wouldn't be decoded, as it's a character that isn't allowed in the hostname (the colon).
Blocks: url
Keywords: dev-doc-needed
You need to log in
before you can comment on or make changes to this bug.
Description
•