Closed Bug 412457 Opened 17 years ago Closed 9 years ago

should unescape hostname first, then perform IDNA

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla45

Tracking Flags:

Tracking

Status

firefox45

---

fixed

status2.0

---

wanted

People

(Reporter: erik, Assigned: valentin)

References

(Blocks 2 open bugs)

Details

(Keywords: dev-doc-needed)

Attachments

(2 files, 4 obsolete files)

fix + unit test 16 years ago Michal Novotny [:michal] (deleted), patch		Details \| Diff \| Splinter Review
new patch, validity of hostname is checked 16 years ago Michal Novotny [:michal] (deleted), patch		Details \| Diff \| Splinter Review
added some tests 16 years ago Michal Novotny [:michal] (deleted), patch		Details \| Diff \| Splinter Review
Added some tests and comments to unittest 16 years ago Michal Novotny [:michal] (deleted), patch		Details \| Diff \| Splinter Review
minor change in comment 16 years ago Michal Novotny [:michal] (deleted), patch	Biesinger : review-	Details \| Diff \| Splinter Review
should unescape hostname first, then perform IDNA 9 years ago Valentin Gosu [:valentin] (he/him) (deleted), patch	mcmanus : review+	Details \| Diff \| Splinter Review

Erik van der Poel

Reporter

Description

•

17 years ago

Host names in URIs (and IRIs) may have %-escaped octets in them. Firefox (and
other Mozilla software) should unescape the host name before applying the IDNA
(Internationalized Domain Names in Applications) process. For example:

<a href="http://&#x5341;%2ecom/">blah</a>

The above causes Firefox 2 to display an error message containing the string
xn--%2ecom-9b5j. Other browsers like MSIE 7 and Opera 9 will convert the above
domain name into xn--kkr.com and will succeed.

The steps should be:

1. convert incoming HTML to Unicode
2. resolve NCRs like &#x5341;
3. parse URI/IRI to get host name
4. unescape host name (%2e becomes .)
5. perform IDNA's ToASCII
6. perform DNS lookup
7. perform HTTP request with Host: header

Erik van der Poel

Reporter

Comment 1

•

17 years ago

If step 4 yields invalid UTF-8 in some labels (label = between dots), skip step
5 or perform step 5 only on the valid UTF-8 labels.

Erik van der Poel

Reporter

Comment 2

•

17 years ago

On 2nd thought, if step 4 yields malformed UTF-8, abort the lookup, and produce
an error message (or something). Firefox currently already returns an error if
there are any %-escapes in the host name, so aborting on malformed UTF-8 would
not be a step backwards.

Michal Novotny [:michal]

Comment 3

•

16 years ago

Attached patch fix + unit test (obsolete) (deleted) — Details — Splinter Review

Host name is unescaped just before NormalizeIDN() is called.

Assignee: nobody → michal

Status: NEW → ASSIGNED

Attachment #324616 - Flags: review?(cbiesinger)

Erik van der Poel

Reporter

Comment 4

•

16 years ago

Thanks for working on this. With this patch, what happens when the host name contains an escaped slash (i.e. %2F)?

Michal Novotny [:michal]

Comment 5

•

16 years ago

Sequence %2f in the hostname will be unescaped and will be part of the hostname. F.e. URL "http://a%2fb/c" is parsed as follows (see mHost):

(gdb) p *url
$4 = {<nsIFileURL> = {<nsIURL> = {<nsIURI> = {<nsISupports> = {
          _vptr.nsISupports = 0x235ee88}, <No data fields>}, <No data fields>}, <No data fields>}, <nsIStandardURL> = {<nsIMutable> = {<nsISupports> = {
        _vptr.nsISupports = 0x235ef98}, <No data fields>}, <No data fields>}, <nsISerializable> = {<nsISupports> = {
      _vptr.nsISupports = 0x235efb8}, <No data fields>}, <nsIClassInfo> = {<nsISupports> = {_vptr.nsISupports = 0x235efd4}, <No data fields>}, mRefCnt = {
    mValue = 1}, _mOwningThread = {mThread = 0x9bd5550}, 
  mSpec = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, 
      mData = 0x9cf7478 "http://a/b/c", mLength = 12, 
      mFlags = 5}, <No data fields>}, mDefaultPort = 80, mPort = -1, 
  mScheme = {mPos = 0, mLen = 4}, mAuthority = {mPos = 7, mLen = 3}, 
  mUsername = {mPos = 7, mLen = -1}, mPassword = {mPos = 7, mLen = -1}, 
  mHost = {mPos = 7, mLen = 3}, mPath = {mPos = 10, mLen = 2}, mFilepath = {
    mPos = 10, mLen = 2}, mDirectory = {mPos = 10, mLen = 1}, mBasename = {
    mPos = 11, mLen = 1}, mExtension = {mPos = 13, mLen = -1}, mParam = {
    mPos = 12, mLen = -1}, mQuery = {mPos = 12, mLen = -1}, mRef = {mPos = 12, 
    mLen = -1}, 
  mOriginCharset = {<nsACString_internal> = {<nsCSubstring_base> = {<No data fields>}, mData = 0x23269a2 "", mLength = 0, mFlags = 1}, <No data fields>}, 
  mParser = {mRawPtr = 0x9cc1eb8}, mFile = {mRawPtr = 0x0}, mHostA = 0x0, 
  mHostEncoding = 1, mSpecEncoding = 0, mURLType = 2, mMutable = 1, 
  mSupportsFileURL = 0, static gIDN = 0x9c9d1e8, static gCharsetMgr = 0x0, 
  static gInitialized = 1, static gEscapeUTF8 = 1, 
  static gAlwaysEncodeInUTF8 = 1, static gEncodeQueryInUTF8 = 0}


Connecting to such host fails at net_IsValidHostName(). But something like this is possible:

js> var newURI2 = ios.newURI("http://example.com%2fxxx/", null, null);
js> var newURI3 = ios.newURI(newURI2.spec, null, null);
js> newURI2.host
example.com/xxx
js> newURI3.host
example.com

Is this behaviour correct? Maybe there should be check if '/' is present after hostname is unescaped.

Erik van der Poel

Reporter

Comment 6

•

16 years ago

First of all, when you re-insert a host name back into a URL, you must first escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ; is questionable in the http: case. (MSIE does not treat ; as a host/path terminator in http: nor ftp:.) However, see below.

Secondly, I suppose there are a couple of different places to check for bad characters in the host name. One might be just before a DNS lookup, but if an HTTP request is going through a proxy, the host is not looked up in DNS. So, checking for bad characters after unescaping and running it through IDNA might be good.

Michal Novotny [:michal]

Comment 7

•

16 years ago

Attached patch new patch, validity of hostname is checked (obsolete) (deleted) — Details — Splinter Review

(In reply to comment #6)
> First of all, when you re-insert a host name back into a URL, you must first
> escape it (e.g. / -> %2F). The host name terminators are :/;?# although the ;
> is questionable in the http: case. (MSIE does not treat ; as a host/path
> terminator in http: nor ftp:.) However, see below.
> 
> Secondly, I suppose there are a couple of different places to check for bad
> characters in the host name. One might be just before a DNS lookup, but if an
> HTTP request is going through a proxy, the host is not looked up in DNS. So,
> checking for bad characters after unescaping and running it through IDNA might
> be good.

I added the check after unescaping and IDNA. No character that is allowed in hostname by net_IsValidHostName() needs escaping, so there is no nsEscapeCount()...

Attachment #324616 - Attachment is obsolete: true

Attachment #324833 - Flags: review?(cbiesinger)

Attachment #324616 - Flags: review?(cbiesinger)

Erik van der Poel

Reporter

Comment 8

•

16 years ago

Looks good. I recommend adding a couple of tests:
http://%e5%8d%81.com -> http://xn--kkr.com
http://%80.com -> error (invalid UTF-8)

Michal Novotny [:michal]

Comment 9

•

16 years ago

Attached patch added some tests (obsolete) (deleted) — Details — Splinter Review

(In reply to comment #8)
> Looks good. I recommend adding a couple of tests:
> http://%e5%8d%81.com -> http://xn--kkr.com
> http://%80.com -> error (invalid UTF-8)

Added. Erik, can you do the review?

Attachment #324833 - Attachment is obsolete: true

Attachment #325078 - Flags: review?(cbiesinger)

Attachment #324833 - Flags: review?(cbiesinger)

Erik van der Poel

Reporter

Comment 10

•

16 years ago

I recommend adding a comment to each unit test. For example:

// Make sure escaped dot (%2e) is unescaped before applying IDNA

I also recommend adding unit tests for all of the host terminators :/;?#
I.e. %3A %2F %3B %3F %23

The code looks good, but I think we should ask a regular Mozilla reviewer to take a look.

Hixie (not reading bugmail)

Comment 11

•

16 years ago

Should we really be unescaping %-escaped octets in URIs? In particular, the "." part in a URI must be literally given, according to the RFC, and the label part must be a set of "alphanum"s, not "escaped"s. This seems dangerous.

Hixie (not reading bugmail)

Comment 12

•

16 years ago

Nevermind, I see the new RFCs allow it.

Hixie (not reading bugmail)

Comment 13

•

16 years ago

The new RFCs don't treat a semicolon as a host delimiter, by the way -- the ; character is allowed in hostnames.

How should %-encoded bytes be treated? IE7 seems to just send the raw bytes to the DNS resolution layer as far as Erik's testing seems to show. Opera decodes the %-encoded bytes as UTF-8 and reencodes them as IDN before doing the query. Nobody else seems to do anything useful. What should we do?

Erik van der Poel

Reporter

Comment 14

•

16 years ago

The last paragraph of section 3.2.2 in the following RFC seems to suggest that Opera's interpretation is fine:

http://www.ietf.org/rfc/rfc3986.txt

I find it somewhat strange to send %-escaped text in DNS, since %-escapes are for URIs, not DNS.

I think it's fine to stop treating the semicolon as a host, port or path terminator, especially in http: and https:. However, for ftp: it may still be useful (though MSIE doesn't support it).

Michal Novotny [:michal]

Comment 15

•

16 years ago

Attached patch Added some tests and comments to unittest (obsolete) (deleted) — Details — Splinter Review

Attachment #325078 - Attachment is obsolete: true

Attachment #325254 - Flags: review?(cbiesinger)

Attachment #325078 - Flags: review?(cbiesinger)

Erik van der Poel

Reporter

Comment 16

•

16 years ago

> there should be only allowed characters in hostname after
> unescaping and applying IDNA

How about "after unescaping and attempting to apply IDNA. "\x80" is illegal in UTF-8, so IDNA fails, and 0x80 is illegal in DNS too."

Have you tried some actual hrefs in real HTML files? (I'm not familiar with the rest of the Mozilla code, so I'd like to be sure that real HTML files work properly, in addition to the JavaScript tests that you have written.)

Michal Novotny [:michal]

Comment 17

•

16 years ago

Attached patch minor change in comment (deleted) — Details — Splinter Review

(In reply to comment #16)
> Have you tried some actual hrefs in real HTML files? (I'm not familiar with the
> rest of the Mozilla code, so I'd like to be sure that real HTML files work
> properly, in addition to the JavaScript tests that you have written.)

Yes, of course I did also some tests with URLs in HTML files and it worked correctly.

Attachment #325254 - Attachment is obsolete: true

Attachment #325403 - Flags: review?(cbiesinger)

Attachment #325254 - Flags: review?(cbiesinger)

Erik van der Poel

Reporter

Comment 18

•

16 years ago

Looks good. Thanks!

By the way, MSIE 6 and 7 reject host names with %-escapes in the range %00-1F, and %2F, %40 and %5C. MSIE 6 rejects %25 under some circumstances and MSIE 7 rejects %3F. Firefox should probably be at least as careful as MSIE, if not much more careful.

Christian :Biesinger (don't email me, ping me on IRC)

Comment 19

•

16 years ago

looks like this is a duplicate of bug 309671, although now that there's a patch here it probably shouldn't be marked as such

Christian :Biesinger (don't email me, ping me on IRC)

Comment 20

•

16 years ago

Comment on attachment 325403 [details] [diff] [review]
minor change in comment

Don't you also have to change SetHost?


(Hm, if you're adding the hostname validity check here we might be able to remove it from the other places where we currently have it)

Christian :Biesinger (don't email me, ping me on IRC)

Updated

•

16 years ago

Attachment #325403 - Flags: review?(cbiesinger) → review-

Johnny Stenback (:jst)

Updated

•

14 years ago

status2.0: --- → wanted

Gunnar Vestergaard

Comment 21

•

13 years ago

Hello. I don't know if the issue I'm about to present should be a separate issue.

When an e-mail message is displayed in Thunderbird, the text body might contain a URI with an international domain name, such as
http://www.henriksørensen.dk/

I hope that your program will display the above URI with a LATIN SMALL LETTER O WITH STROKE. I am not fully confident with how text is displayed in different programs when sent from Bugzilla to you.

When I click on that URI in Thunderbird, it is sent to Firefox. Firefox tries to display the page at that URI but Firefox says Server not found.

Firefox tries to resolve the domain name. Here is output from the plugin HttpFox:
Host	www.henriks%c3%b8rensen.dk

The error code is NS_ERROR_UNKNOWN_HOST.

Here is output from HttpFox when I manually type the same URL in Firefox:
Host	www.xn--henriksrensen-hnb.dk

In the latter case the international domain name is correctly interpreted and handled by Firefox.

Shouldn't Firefox treat a URI equally whether the URI is sent from Thunderbird  or the URI is typed manually?

Then I changed which browser should handle HTTP URIs. I temporarily chose Safari. When I click on the URI in Thunderbird, the URI is sent to Safari. The correct host is found right away. I use Thunderbird and Firefox for Mac OS X.

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 22

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ba2bb2369ba4

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 23

•

9 years ago

Attached patch should unescape hostname first, then perform IDNA (deleted) — Details — Splinter Review

Valentin Gosu [:valentin] (he/him)

Assignee

Updated

•

9 years ago

Assignee: michal.novotny → valentin.gosu

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 24

•

9 years ago

Comment on attachment 8686406 [details] [diff] [review]
should unescape hostname first, then perform IDNA

This patch is based on code that has been r+d by Honza in Bug 1023468, but enough things have changed that it warrants a new review.
Changes:
* We do percent decoding of hostnames in BuildNormalizedSpec and SetHost.
* NormalizeIDN returns an nsresult. We abort the parsing if the call fails.
* NS_UnescapeURL is modified, so when called with the esc_Host flag, it only unescapes ASCII characters that are allowed in the hostname, and non-ascii characters. This is mostly so that hostnames containing % are not rejected, since they are part of the ID of some addons.
* I also change NS_UnescapeURL to only deref *(p+1) and *(p+2) only once. It's clearer this way.

Attachment #8686406 - Flags: review?(mcmanus)

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 25

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c6a52809582

Patrick McManus [:mcmanus]

Updated

•

9 years ago

Attachment #8686406 - Flags: review?(mcmanus) → review+

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 27

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/cc4f5b4bb374b95c36c818e666c2a087fe6f20ca
Bug 412457 - should unescape hostname first, then perform IDNA r=mcmanus

Carsten Book [:Tomcat]

Comment 28

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/cc4f5b4bb374

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox45: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla45

Valentin Gosu [:valentin] (he/him)

Assignee

Comment 29

•

9 years ago

This needs a dev-doc as it changes the way URL are parsed.

Previously percent encoded characters in the hostname wouldn't be decoded, and clicking on http://%65%78%61%6d%70%6c%65%2e%6f%72%67 would fail as the domain %65%78%61%6d%70%6c%65%2e%6f%72%67 didn't exist.
Now however, the hostname is percent decoded, thus correctly loading http://example.org
Note that for http://example.org%3a %3a wouldn't be decoded, as it's a character that isn't allowed in the hostname (the colon).

Blocks: url

Keywords: dev-doc-needed

Mathew Hodson

Updated

•

10 months ago

Blocks: 1142083

Mathew Hodson

Updated

•

10 months ago

Blocks: 700999

You need to log in before you can comment on or make changes to this bug.