Closed Bug 412457 Opened 17 years ago Closed 9 years ago

should unescape hostname first, then perform IDNA

Categories

(Core :: Networking, defect)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla45
Tracking Status
firefox45 --- fixed
status2.0 --- wanted

People

(Reporter: erik, Assigned: valentin)

References

(Blocks 2 open bugs)

Details

(Keywords: dev-doc-needed)

Attachments

(2 files, 4 obsolete files)

Host names in URIs (and IRIs) may have %-escaped octets in them. Firefox (and
other Mozilla software) should unescape the host name before applying the IDNA
(Internationalized Domain Names in Applications) process. For example:

<a href="http://&#x5341;%2ecom/">blah</a>

The above causes Firefox 2 to display an error message containing the string
xn--%2ecom-9b5j. Other browsers like MSIE 7 and Opera 9 will convert the above
domain name into xn--kkr.com and will succeed.

The steps should be:

1. convert incoming HTML to Unicode
2. resolve NCRs like &#x5341;
3. parse URI/IRI to get host name
4. unescape host name (%2e becomes .)
5. perform IDNA's ToASCII
6. perform DNS lookup
7. perform HTTP request with Host: header
If step 4 yields invalid UTF-8 in some labels (label = between dots), skip step
5 or perform step 5 only on the valid UTF-8 labels.
On 2nd thought, if step 4 yields malformed UTF-8, abort the lookup, and produce
an error message (or something). Firefox currently already returns an error if
there are any %-escapes in the host name, so aborting on malformed UTF-8 would
not be a step backwards.
Attached patch fix + unit test (obsolete) (deleted) — Splinter Review
Host name is unescaped just before NormalizeIDN() is called.
Assignee: nobody → michal
Status: NEW → ASSIGNED
Attachment #324616 - Flags: review?(cbiesinger)
Thanks for working on this. With this patch, what happens when the host name contains an escaped slash (i.e. %2F)?
Sequence %2f in the hostname will be unescaped and will be part of the hostname. F.e. URL "http://a%2fb/c" is parsed as follows (see mHost):

(gdb) p *url
$4 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = {
          _vptr.nsISupports = 0x235ee88}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> = {
        _vptr.nsISupports = 0x235ef98}, <No data fields>}, <No data fields>}, <nsISerializable> = {<nsISupports> = {
      _vptr.nsISupports = 0x235efb8}, <No data fields>}, <nsIClassInfo> = {<nsISupports> = {_vptr.nsISupports = 0x235efd4}, <No data fields>}, mRefCnt = {
    mValue = 1}, _mOwningThread = {mThread = 0x9bd5550}, 
  mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, 
      mData = 0x9cf7478 "http://a/b/c", mLength = 12, 
      mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort = -1, 
  mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 3}, 
  mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, 
  mHost = {mPos = 7, mLen = 3}, mPath = {mPos = 10, mLen = 2}, mFilepath = {
    mPos = 10, mLen = 2}, mDirectory = {mPos = 10, mLen = 1}, mBasename = {
    mPos = 11, mLen = 1}, mExtension = {mPos = 13, mLen = -1}, mParam = {
    mPos = 12, mLen = -1}, mQuery = {mPos = 12, mLen = -1}, mRef = {mPos = 12, 
    mLen = -1}, 
  mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x23269a2 "", mLength = 0, mFlags = 1}, <No data fields>}, 
  mParser = {mRawPtr = 0x9cc1eb8}, mFile = {mRawPtr = 0x0}, mHostA = 0x0, 
  mHostEncoding = 1, mSpecEncoding = 0, mURLType = 2, mMutable = 1, 
  mSupportsFileURL = 0, static gIDN = 0x9c9d1e8, static gCharsetMgr = 0x0, 
  static gInitialized = 1, static gEscapeUTF8 = 1, 
  static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0}


Connecting to such host fails at net_IsValidHostName(). But something like this is possible:

js> var newURI2 = ios.newURI("http://example.com%2fxxx/", null, null);
js> var newURI3 = ios.newURI(newURI2.spec, null, null);
js> newURI2.host
example.com/xxx
js> newURI3.host
example.com

Is this behaviour correct? Maybe there should be check if '/' is present after hostname is unescaped.
First of all, when you re-insert a host name back into a URL, you must first escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; is questionable in the http: case. (MSIE does not treat ; as a host/path terminator in http: nor ftp:.) However, see below.

Secondly, I suppose there are a couple of different places to check for bad characters in the host name. One might be just before a DNS lookup, but if an HTTP request is going through a proxy, the host is not looked up in DNS. So, checking for bad characters after unescaping and running it through IDNA might be good.
Attached patch new patch, validity of hostname is checked (obsolete) (deleted) — Splinter Review
(In reply to comment #6)
> First of all, when you re-insert a host name back into a URL, you must first
> escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ;
> is questionable in the http: case. (MSIE does not treat ; as a host/path
> terminator in http: nor ftp:.) However, see below.
> 
> Secondly, I suppose there are a couple of different places to check for bad
> characters in the host name. One might be just before a DNS lookup, but if an
> HTTP request is going through a proxy, the host is not looked up in DNS. So,
> checking for bad characters after unescaping and running it through IDNA might
> be good.

I added the check after unescaping and IDNA. No character that is allowed in hostname by net_IsValidHostName() needs escaping, so there is no nsEscapeCount()...
Attachment #324616 - Attachment is obsolete: true
Attachment #324833 - Flags: review?(cbiesinger)
Attachment #324616 - Flags: review?(cbiesinger)
Looks good. I recommend adding a couple of tests:
http://%e5%8d%81.com -> http://xn--kkr.com
http://%80.com -> error (invalid UTF-8)
Attached patch added some tests (obsolete) (deleted) — Splinter Review
(In reply to comment #8)
> Looks good. I recommend adding a couple of tests:
> http://%e5%8d%81.com -> http://xn--kkr.com
> http://%80.com -> error (invalid UTF-8)

Added. Erik, can you do the review?
Attachment #324833 - Attachment is obsolete: true
Attachment #325078 - Flags: review?(cbiesinger)
Attachment #324833 - Flags: review?(cbiesinger)
I recommend adding a comment to each unit test. For example:

// Make sure escaped dot (%2e) is unescaped before applying IDNA

I also recommend adding unit tests for all of the host terminators :/;?#
I.e. %3A %2F %3B %3F %23

The code looks good, but I think we should ask a regular Mozilla reviewer to take a look.
Should we really be unescaping %-escaped octets in URIs? In particular, the "." part in a URI must be literally given, according to the RFC, and the label part must be a set of "alphanum"s, not "escaped"s. This seems dangerous.
Nevermind, I see the new RFCs allow it.
The new RFCs don't treat a semicolon as a host delimiter, by the way -- the ; character is allowed in hostnames.

How should %-encoded bytes be treated? IE7 seems to just send the raw bytes to the DNS resolution layer as far as Erik's testing seems to show. Opera decodes the %-encoded bytes as UTF-8 and reencodes them as IDN before doing the query. Nobody else seems to do anything useful. What should we do?
The last paragraph of section 3.2.2 in the following RFC seems to suggest that Opera's interpretation is fine:

http://www.ietf.org/rfc/rfc3986.txt

I find it somewhat strange to send %-escaped text in DNS, since %-escapes are for URIs, not DNS.

I think it's fine to stop treating the semicolon as a host, port or path terminator, especially in http: and https:. However, for ftp: it may still be useful (though MSIE doesn't support it).
Attached patch Added some tests and comments to unittest (obsolete) (deleted) — Splinter Review
Attachment #325078 - Attachment is obsolete: true
Attachment #325254 - Flags: review?(cbiesinger)
Attachment #325078 - Flags: review?(cbiesinger)
> there should be only allowed characters in hostname after
> unescaping and applying IDNA

How about "after unescaping and attempting to apply IDNA. "\x80" is illegal in UTF-8, so IDNA fails, and 0x80 is illegal in DNS too."

Have you tried some actual hrefs in real HTML files? (I'm not familiar with the rest of the Mozilla code, so I'd like to be sure that real HTML files work properly, in addition to the JavaScript tests that you have written.)
Attached patch minor change in comment (deleted) — Splinter Review
(In reply to comment #16)
> Have you tried some actual hrefs in real HTML files? (I'm not familiar with the
> rest of the Mozilla code, so I'd like to be sure that real HTML files work
> properly, in addition to the JavaScript tests that you have written.)

Yes, of course I did also some tests with URLs in HTML files and it worked correctly.
Attachment #325254 - Attachment is obsolete: true
Attachment #325403 - Flags: review?(cbiesinger)
Attachment #325254 - Flags: review?(cbiesinger)
Looks good. Thanks!

By the way, MSIE 6 and 7 reject host names with %-escapes in the range %00-1F, and %2F, %40 and %5C. MSIE 6 rejects %25 under some circumstances and MSIE 7 rejects %3F. Firefox should probably be at least as careful as MSIE, if not much more careful.
looks like this is a duplicate of bug 309671, although now that there's a patch here it probably shouldn't be marked as such
Comment on attachment 325403 [details] [diff] [review]
minor change in comment

Don't you also have to change SetHost?


(Hm, if you're adding the hostname validity check here we might be able to remove it from the other places where we currently have it)
Attachment #325403 - Flags: review?(cbiesinger) → review-
Hello. I don't know if the issue I'm about to present should be a separate issue.

When an e-mail message is displayed in Thunderbird, the text body might contain a URI with an international domain name, such as
http://www.henriksørensen.dk/

I hope that your program will display the above URI with a LATIN SMALL LETTER O WITH STROKE. I am not fully confident with how text is displayed in different programs when sent from Bugzilla to you.

When I click on that URI in Thunderbird, it is sent to Firefox. Firefox tries to display the page at that URI but Firefox says Server not found.

Firefox tries to resolve the domain name. Here is output from the plugin HttpFox:
Host	www.henriks%c3%b8rensen.dk

The error code is NS_ERROR_UNKNOWN_HOST.

Here is output from HttpFox when I manually type the same URL in Firefox:
Host	www.xn--henriksrensen-hnb.dk

In the latter case the international domain name is correctly interpreted and handled by Firefox.

Shouldn't Firefox treat a URI equally whether the URI is sent from Thunderbird  or the URI is typed manually?

Then I changed which browser should handle HTTP URIs. I temporarily chose Safari. When I click on the URI in Thunderbird, the URI is sent to Safari. The correct host is found right away. I use Thunderbird and Firefox for Mac OS X.
Assignee: michal.novotny → valentin.gosu
Comment on attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA

This patch is based on code that has been r+d by Honza in Bug 1023468, but enough things have changed that it warrants a new review.
Changes:
* We do percent decoding of hostnames in BuildNormalizedSpec and SetHost.
* NormalizeIDN returns an nsresult. We abort the parsing if the call fails.
* NS_UnescapeURL is modified, so when called with the esc_Host flag, it only unescapes ASCII characters that are allowed in the hostname, and non-ascii characters. This is mostly so that hostnames containing % are not rejected, since they are part of the ID of some addons.
* I also change NS_UnescapeURL to only deref *(p+1) and *(p+2) only once. It's clearer this way.
Attachment #8686406 - Flags: review?(mcmanus)
Attachment #8686406 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/cc4f5b4bb374
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla45
This needs a dev-doc as it changes the way URL are parsed.

Previously percent encoded characters in the hostname wouldn't be decoded, and clicking on http://%65%78%61%6d%70%6c%65%2e%6f%72%67 would fail as the domain %65%78%61%6d%70%6c%65%2e%6f%72%67 didn't exist.
Now however, the hostname is percent decoded, thus correctly loading http://example.org
Note that for http://example.org%3a %3a wouldn't be decoded, as it's a character that isn't allowed in the hostname (the colon).
Blocks: url
Keywords: dev-doc-needed
Blocks: 1142083
Blocks: 700999
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: