Closed Bug 10373 Opened 25 years ago Closed 24 years ago

Non-ASCII url cannot be found

Categories

(Core :: Layout, defect, P2)

Tracking

VERIFIED FIXED

People

(Reporter: teruko, Assigned: waterson)

Details

(Whiteboard: [nsbeta2+])

Attachments

(4 files)

The above URL includes a Japanese directory name. When you go to the page, the DOS console says "http://babel/tests/browser/http-test/%3f%3f%3f loaded successfully", but Apprunner says "Not Found".

Steps to reproduce:
1. Go to http://babel/tests/browser/http-test
2. Click on "??"

"Not Found" is shown.

Tested with the 7-21-16-NECKO Win32 build and the 7-22-08-M9 Win32 build.
Priority: P3 → P2
Target Milestone: M9
The "loaded successfully" message is currently tied to document load status; it really has no clue as to what the HTTP status was. I am not sure what the other problem you are mentioning is... Is it expected to show the document? (I verified that 4.6 doesn't.) There is a whole section on URL encoding that is missing from the picture right now, but I want to confirm that this bug really is about encoding URLs.
As a maintainer of the Babel server cited above, I would like to offer some additional facts of relevance. (Sorry, teruko, I forgot to tell you about the Babel server's limitation described below.)

1. The directory name which ends the above URL is actually in Japanese and contains three 2-byte characters. Unfortunately the server is running on a US Windows NT4 and thus mangles multi-byte directory and file names. This is why you see 3 ? marks. Thus, even if you properly encode the URL, you will never find the directory.
2. Let me offer a sample on a Unix server which can handle 8-bit file names: http://kaze:8000/url/
3. In that directory, you will find 3 sub-directories. The 3rd from the top (if viewed with 4.6/4.7 or 5.0 under the Japanese (EUC-JP) encoding) will show in Japanese.
4. The first 2 directories are named using the escaped URLs, and the 2nd one actually represents the escaped URL version of the 3rd, Japanese one. If escaping is working correctly, you should see the escaped URL matching that of the 2nd one when the cursor is perched on the 3rd directory. (Compare the FTP URL below.)

Some issues:

A. Click on the 3rd directory and it fails. We seem to be escaping special characters like "%" but not 8-bit characters. (4.6/4.7 and 5.0 w/necko are pretty much the same here.) We should fix this in 5.0.
B. 4.6/4.7 actually escapes 8-bit path names in FTP URLs. For example, try the same 3-directory listing over the ftp protocol at ftp://kaze/pub/ -- you can get inside the 3rd directory under the Japanese (EUC-JP) encoding, and you can also see the escaped URL on the status bar when the cursor is on the 3rd directory. 5.0 w/necko does not escape 8-bit names and cannot get inside this directory. We should fix this in 5.0.

Question: are you planning on supporting escaping to both the native encoding and UTF-8 if a server indicates it can accept Unicode? I believe there is a recent IETF draft on url-i18n which discusses the UTF-8 implementation.
*** Bug 7399 has been marked as a duplicate of this bug. ***
*** Bug 7847 has been marked as a duplicate of this bug. ***
*** Bug 8333 has been marked as a duplicate of this bug. ***
*** Bug 8337 has been marked as a duplicate of this bug. ***
*** Bug 10429 has been marked as a duplicate of this bug. ***
Severity: major → blocker
Bug 10429 was going to be marked a blocker; since this is the bug it was duped against, this one becomes the blocker.
Status: NEW → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
This should work OK now... Please verify.
Whiteboard: waiting for new build with fix to verify
Can you international types verify this? I have verified that you can have spaces in your path, which was bug 10429.
Whiteboard: waiting for new build with fix to verify → waiting for reporter to verify
QA Contact: paulmac → teruko
Yes. This should go to teruko now.
Status: RESOLVED → REOPENED
I tested this in the 7-31-09 and 8-02-09 builds, using http://kaze/url/ as the test case. When I went to http://kaze/url/ and changed the charset to EUC-JP, the 3rd directory name, in Japanese, is displayed correctly. Then, when I click on the link with the Japanese directory name, the location bar displays http://kaze/url/.../ and "Not Found" shows up. This needs to be reopened.
Resolution: FIXED → ---
Clearing Fixed resolution due to reopen.
Status: REOPENED → ASSIGNED
Target Milestone: M9 → M10
This has a lot to do with the fact that nsIURI currently works with char* and not PRUnichar*. I would like to verify this again once we move the URL accessors to all-PRUnichar (sometime during M10). Marking it such.
*** Bug 12473 has been marked as a duplicate of this bug. ***
Blocks: 13449
Target Milestone: M10 → M12
Correcting target milestone.
Summary: Non-ASCII url cannot be found → [dogfood] Non-ASCII url cannot be found
PDT would like to know: does the 4.x product allow non-ASCII in URLs? Does a test case rely on allowing 8-bit?
In 4.x, we are able to resolve FTP URL links containing 8-bit characters, but not HTTP URL links. Please note that this is not the same as typing 8-bit characters into the Location window; we didn't support that in 4.x.
Kat's comment is true for Japanese, but not for Latin-1. In 4.x you can type http://babel/5x_tests/misc/montse/αϊινσϊ.jpg and it will show the file and the URL will be correct. We need this working for Latin-1
Sorry about that. We didn't escape multi-byte characters in 4.x, but did so for single-byte 8-bit characters in HTTP.
It turns out that with the current Mozilla build (10/20/99 Win32), a Latin-1 URL is also resolved to an existing page. It seems that both 4.x and Mozilla do nothing to a URL which contains single-byte 8-bit characters, i.e. they just pass them through, and it works. I think we should escape these single-byte 8-bit characters, however. It is multi-byte characters which are not supported at this point in HTTP or FTP URLs in Mozilla (or in 4.x).
Summary: [dogfood] Non-ASCII url cannot be found → Non-ASCII url cannot be found
And now Kat is right; it's the typing (another bug) which is not working, but if the file is selected it will show up. Removing dogfood.
Moving Assignee from gagan to warren since he is away.
Assignee: warren → momoi
What's the deal with this bug now? The link in the URL field above is stale (in 4.x too): http://babel/tests/browser/http-test/%3f%3f%3f

This one works fine: http://babel/5x_tests/misc/montse/αϊινσϊ.jpg although I don't think that should be a valid URL. Those characters should be escaped to work, shouldn't they (or is that not how things are in practice)?

Reassigning back to i18n.
What needs to happen on this bug:

1. Use the test case above. (Changed from the one on babel, which cannot process multi-byte URLs as it is a US NT4 machine.)
2. There are 2 URL links on this test page; the last one is in Japanese. When you view this page under the Japanese (EUC-JP) encoding, you should see the status bar at the bottom display the escaped URL. Currently it shows 3 dots, indicating that Mozilla is not able to escape multi-byte directory names. When the 3rd (JPN) link is properly escaped, the escaped part should look like the 2nd URL, which shows the escaped EUC-JP name used in the 3rd link. The 1st escaped example shows the escaped sequence of the same 3 characters in Shift_JIS.
3. Contrast this with the ftp protocol under 4.7: ftp://kaze/pub This page contains the same 3 directory names. Use 4.7 and move the cursor over the 3rd link; you see that 4.7 escapes it to the identical string you see for the 2nd URL.

Assigning this to ftang for an assessment as to what needs to be done. When we isolate the problem, please send it back to warren.
Assignee: momoi → ftang
I've determined that the URL parsing is pretty screwed up in this regard. The way I think it should work is:

- SetSpec and SetRelativePath should take escaped strings (like in the examples), and in the parsing process convert the escapes to their unescaped representation (held inside the nsStdURL object).
- We should probably assert if we see a non-ASCII character passed to SetSpec/SetRelativePath. The caller should have converted to UTF-8 before calling us (I think this is being done by the webshell).
- GetSpec should reconstruct and re-escape the URL string before returning it. Note that this is tricky, because if you have a slash in a filename (could happen) it should be careful to turn this into %2F rather than make it look like a directory delimiter.
- The nsIURI accessors should return the unescaped representation of things. That way if I say SetSpec("file:/c|/Program%20Files") and then call GetPath, I should get back "c|/Program Files".

The FTP protocol must be doing escaping by hand. This should be handled by nsStdURL. Cc'ing valeski & andreas.
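A minimal standalone sketch of the store-unescaped / re-escape-on-GetSpec idea; the class and method names are illustrative only, not the real nsStdURL interface:

    // Sketch only: stores the path unescaped; SetPath decodes %XX on the way in,
    // GetEscapedPath re-escapes on the way out.
    #include <cctype>
    #include <string>

    class SketchURL {
    public:
        // Takes an escaped string and stores the unescaped representation.
        void SetPath(const std::string& escaped) {
            mPath.clear();
            for (size_t i = 0; i < escaped.size(); ++i) {
                if (escaped[i] == '%' && i + 2 < escaped.size()) {
                    mPath += static_cast<char>(
                        std::stoi(escaped.substr(i + 1, 2), nullptr, 16));
                    i += 2;
                } else {
                    mPath += escaped[i];
                }
            }
        }

        std::string GetPath() const { return mPath; }   // unescaped, for accessors

        // Rebuilds the escaped form for GetSpec. Note: a real implementation must
        // escape a literal '/' inside a single file name as %2F; this sketch
        // treats every '/' as a directory delimiter.
        std::string GetEscapedPath() const {
            static const char* hex = "0123456789ABCDEF";
            std::string out;
            for (unsigned char c : mPath) {
                bool safe = std::isalnum(c) || c == '/' || c == '.' ||
                            c == '-' || c == '_';
                if (safe) {
                    out += static_cast<char>(c);
                } else {
                    out += '%';
                    out += hex[c >> 4];
                    out += hex[c & 0x0F];
                }
            }
            return out;
        }

    private:
        std::string mPath;   // always held unescaped inside the object
    };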
My two cents on this: yes, URL parsing is pretty much screwed up regarding escaping and multibyte chars.

- There was a task to convert the URL accessors to PRUnichar; see bug 13453. It is now marked invalid.
- nsStdURL does no escaping currently. Why not store the URL as we get it (escaped or not) from whoever is calling the nsStdURL functions? Who cares? Just note it on the URL, because one should never escape an already-escaped string or unescape an already-unescaped string, and it is a problem to definitely find out whether a URL is already escaped. I think we need a member variable for that. The constructors and SetSpec or SetPath (and the others) should take an additional parameter which tells whether the given string is already escaped, which is then stored in the member variable. The get-accessors (like GetPath or GetSpec) should take an additional parameter saying whether we want the spec/path escaped or unescaped, and do the appropriate thing on the fly (copy and convert, or just copy) by looking at the escape member variable.

The webshell would want to see the unescaped version, to present the user his native view of the URLs; internally (like sending the request to the server) we would use the escaped version of the URL. I don't think it's true that we always want to see the unescaped version.
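And a sketch of this alternative, with an "already escaped" member and accessors that convert on the fly; again illustrative only, not the actual nsStdURL API:

    // Sketch only: remember how the caller handed us the spec, convert on demand.
    #include <string>

    class FlaggedURL {
    public:
        void SetSpec(const std::string& spec, bool isEscaped) {
            mSpec = spec;          // store exactly what we were given
            mEscaped = isEscaped;
        }

        // The caller says which form it wants; we never escape an already-escaped
        // string or unescape an already-unescaped one.
        std::string GetSpec(bool wantEscaped) const {
            if (wantEscaped == mEscaped)
                return mSpec;
            return wantEscaped ? Escape(mSpec) : Unescape(mSpec);
        }

    private:
        static std::string Escape(const std::string& s) {
            static const char* hex = "0123456789ABCDEF";
            std::string out;
            for (unsigned char c : s) {
                if (c <= 0x20 || c >= 0x80 || c == '%') {
                    out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
                } else {
                    out += static_cast<char>(c);
                }
            }
            return out;
        }
        static std::string Unescape(const std::string& s) {
            std::string out;
            for (size_t i = 0; i < s.size(); ++i) {
                if (s[i] == '%' && i + 2 < s.size()) {
                    out += static_cast<char>(
                        std::stoi(s.substr(i + 1, 2), nullptr, 16));
                    i += 2;
                } else {
                    out += s[i];
                }
            }
            return out;
        }

        std::string mSpec;
        bool mEscaped = false;   // the "already escaped?" member variable
    };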
Assignee: ftang → bobj
1. Let's not mix URLs for different protocols in the same bug.
2. "ftp protocol data to ftp URL" conversion and "ftp URL to ftp protocol data" generation are done on the client side, not on the server side. So the client code should do the right thing with it: always URL-escape the ftp data before concatenating it into an ftp URL, and unescape it from the URL back into ftp protocol data.
3. HTTP URL generation is done on the server side. Therefore the client code has no control over bad URL generation (meaning a URL that contains bytes > 0x80). If an HTTP server does send the client a URL which contains bytes > 0x80, the client code should URL-escape it; but when the client accesses that URL, it should not unescape it.

Reassigning to bobj to find the right owner.
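To illustrate point 2, a small standalone example of escaping a listing entry on the client before concatenating it into an ftp URL; the helper is hypothetical, and the byte sequence is the EUC-JP encoding of the Japanese directory name from the kaze example:

    #include <cstdio>
    #include <string>

    // Escape 8-bit bytes and URL-special characters in one path segment.
    static std::string EscapeSegment(const std::string& name) {
        static const char* hex = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : name) {
            if (c >= 0x80 || c == ' ' || c == '/' || c == '%') {
                out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
            } else {
                out += static_cast<char>(c);
            }
        }
        return out;
    }

    int main() {
        // EUC-JP bytes of the Japanese directory name used in the kaze test case.
        std::string listingEntry = "\xC6\xFC\xCB\xDC\xB8\xEC";
        std::string url = "ftp://kaze/pub/" + EscapeSegment(listingEntry) + "/";
        std::printf("%s\n", url.c_str());   // ftp://kaze/pub/%C6%FC%CB%DC%B8%EC/
        return 0;
    }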
Status: NEW → ASSIGNED
Target Milestone: M12 → M13
Bulk move of all Necko (to be deleted component) bugs to new Networking component.
Assignee: bobj → erik
Status: ASSIGNED → NEW
Target Milestone: M13 → M14
Reassigned to erik for M14
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Fixed by Andreas' changes that I just checked in.
Status: RESOLVED → REOPENED
reopened because of backout
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Blocks: 24854
Change platform and OS to ALL
OS: Windows 95 → All
Hardware: PC → All
Keywords: beta1
Putting on PDT+ radar for beta1.
Whiteboard: waiting for reporter to verify → [PDT+]waiting for reporter to verify
The changes Warren spoke of are in again; can someone with access to this server take a look and see whether this is fixed?
I tested this in the 2000020808 Win32 build. The result was the same as before: when I went to http://kaze/url/ and changed the charset to EUC-JP, the 3rd directory name, in Japanese, is displayed correctly. Then, when I click on the link with the Japanese directory name, the location bar displays http://kaze/url/.../ and "Not Found" shows up.
The Japanese directory under the http test does not work in Nav4 and MSIE, so it's OK if it doesn't work in Mozilla. The Japanese directory under the ftp test works in Nav4, MSIE and Mozilla, but it is displayed wrong in Mozilla and MSIE, while it is displayed OK in Nav4. So, the only thing in this bug report that needs to be fixed is the display of FTP directory and file names.
I agree that if FTP folder & file names work OK, then we would at least have parity with our own earlier version, and that would be acceptable. ftang's comment on HTTP servers is well-taken.
One more thing: there are 2 files under the Japanese-named directory (3rd from the top) on the ftp server. The contents of these 2 files should display in Mozilla. Currently, I don't see the 2 files at all. It looks like the 2nd and 3rd directories are mixed up right now and thus sometimes (not always) show the name "testb.html", which does not exist under the 3rd directory but under the 2nd one.
FTP and non-ASCII file names are relatively minor aspects on the Net today. I believe we should remove the beta1 keyword and PDT+ approval.
Could someone please describe a little better what the URLs currently look like and what they should look like?
Removed beta1 and PDT+. Please re-enter beta1 if you would like the PDT team to re-consider.
Keywords: beta1
Whiteboard: [PDT+]waiting for reporter to verify
Jud, please re-assign this to whoever owns the FTP and/or URL code. I don't think this is a beta1-stopper.
Assignee: erik → valeski
Status: ASSIGNED → NEW
Target Milestone: M14 → M15
Moving to M16.
Target Milestone: M15 → M16
Whiteboard: [HELP WANTED]
I'm not following this. Can someone in i18n take this on?
Nominating to beta2.
Keywords: nsbeta2
On US Win95 on 4.72, if I type http://kaze:8000/url into the location bar and hit return:

(1a) With View|Character Set set to either Japanese (EUC-JP) or Japanese (Auto-Detect):
- the 3rd directory displays the kanji characters for "nihongo" correctly
- clicking on that link does NOT work and I get the Not Found error page
- the URL in the location bar displays: http://kaze:8000/url/“ú–{Œê/

(2a) With View|Character Set set to Western (ISO-8859-1):
- the 3rd directory displays Latin-1 garbage: ÆüËܸì/
- clicking on that link DOES work
- the URL in the location bar displays: http://kaze:8000/url/ÆüËܸì/

On US Win95 running the 2000042109 build, typing http://kaze:8000/url into the location bar and hitting return:

(1b) With View|Character Set set to either Japanese (EUC-JP) or Japanese (Auto-Detect):
- the 3rd directory displays the kanji characters for "nihongo" correctly
- clicking on that link does not work and I get the Not Found error page
- the URL in the location bar displays: http://kaze:8000/url/%C3%A6%C2%97%C2%A5%C3%A6%C2%9C%C2%AC%C3%A8%C2%AA%C2%9E/

(2b) With View|Character Set set to Western (ISO-8859-1):
- the 3rd directory displays Latin-1 garbage: ÆüËܸì/ (same as 4.72)
- clicking on that link DOES work
- the URL in the location bar displays: http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/

Additionally, if I paste the resulting URL from (2a) into the location bar, http://kaze:8000/url/ÆüËܸì/, Seamonkey also gets the Not Found page, and the location bar shows: http://kaze:8000/url/%C3%86%C3%BC%C3%8B%C3%9C%C2%B8%C3%AC/

Teruko, does 4.72 behave differently on Ja Windows?
Whoops, cut & paste error in the previous comment! Case (2b) fails to find the page and should read:

(2b) With View|Character Set set to Western (ISO-8859-1):
- the 3rd directory displays Latin-1 garbage: ÆüËܸì/ (same as 4.72)
- clicking on that link does NOT work and I get the Not Found error page
- the URL in the location bar displays: http://kaze:8000/url/%C3%83%C2%86%C3%83%C2%BC%C3%83%C2%8B%C3%83%C2%9C%C3%82%C2%B8%C3%83%C2%AC/
The ftp problem should be split off into a separate bug. As noted, the ftp links work, but are not displayed correctly (instead of Japanese kanji characters you see Latin1 garbage).
Assignee: valeski → ftang
Target Milestone: M16 → M17
Reassigning this back to ftang. I think we need to change the code in Layout, not in Necko.
Similar problem to bug 30460. Patch available at http://warp/u/ftang/tmp/illurl.txt
Status: NEW → ASSIGNED
One thing I forgot to say is that the patch depends on bug 37395. Per the ftang/waqar/troy meeting, we should move the URL-fixing code into the content sink so we don't need to convert/escape every time. We also agree to reassign this bug to Layout.
Assignee: ftang → troy
Status: ASSIGNED → NEW
Component: Networking → Layout
*** Bug 38133 has been marked as a duplicate of this bug. ***
With Troy's departure, this is at risk for M17. PDT team, is this required for beta2?
Assignee: troy → buster
Assignee: buster → waterson
Whiteboard: [HELP WANTED] → [nsbeta2+]
Putting on [nsbeta2+] radar for beta2 fix. Sending over to waterson.
It seems like the right thing to do in this case is to convert/escape/mangle the URL in the anchor tag itself. This would also make sure that the correct thing happens if someone changes the "href" attribute using L1 DOM APIs. attinasi and I talked about keeping the resolved version of the URL in the anchor tag to deal with some style & performance issues: maybe this could just be an extension (or precursor) to that work? Presumably we'd need to do this for other elements that have "href" properties as well. Comments?
Status: NEW → ASSIGNED
Some notes from Troy:

1) Frank's patch is expensive in terms of performance, especially because it includes dynamic allocation. We should be able to do much better.
2) We should be able to convert the URL to an immutable 8-bit ASCII string one time, probably at the time we process the attribute (or maybe lazily, the first time we actually use the attribute). We would cache this converted immutable string and hand that to necko.
3) Or, necko could just do the conversion on the fly, but that brings us right back to performance problems.
So I claim (2) is the right thing to do, and it should be done by the anchor tag. We need to be able to handle L1 DOM updates too, and the tag itself (not the content sink) is the only one that can do that.
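A rough, hypothetical sketch of option (2) -- caching the converted, escaped 8-bit href on the element and invalidating it on DOM updates. The names are not the real nsHTMLAnchorElement members, and the charset conversion itself is left to the converter passed in:

    #include <functional>
    #include <string>

    class AnchorSketch {
    public:
        using Converter = std::function<std::string(const std::wstring&)>;

        void SetHref(const std::wstring& value) {
            mHref = value;
            mCachedAscii.clear();   // an L1 DOM update invalidates the cached form
        }

        // First use converts and escapes; later uses hand back the cached string.
        const std::string& GetAsciiHref(const Converter& convertAndEscape) {
            if (mCachedAscii.empty() && !mHref.empty())
                mCachedAscii = convertAndEscape(mHref);
            return mCachedAscii;
        }

    private:
        std::wstring mHref;         // attribute value as set through the DOM
        std::string  mCachedAscii;  // converted + escaped, handed to necko
    };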
Can we canonicalize it when we ASCII-ize it, too? That would help in dealing with bug 29611 (which reports that we spend too much time determining a link's state due to conversions). I'm linking this bug to 29611 since it may help it.
Blocks: 29611
Yeah, we should absolutely canonicalize it then. (See comments above...)
Since ftang's 11/30/99 comment on the HTTP URL issue is on target, I would only add an additional server example to contrast different servers on this problem. The above URL, http://kaze:8000/url, points to a Netscape Enterprise server 2.01. No Netscape Enterprise server (up to the current 4.1) supports 8-bit path or file names, according to the server admin document. Thus it sends the 8-bit URL portion without escaping; how to deal with this has been documented by ftang already. An Apache server, on the other hand, supports 8-bit names and escapes them. See, for example, a nearly identical example on an Apache server: http://mugen/url There the Japanese name is escaped by the server properly -- do View Source on the page -- and the directory can be easily accessed.

On another issue: should we support UTF-8 URLs? IE has this as a default option setting, and the user can then turn the option off. After discussing this issue with Erik, I'm inclined to believe that UTF-8 URLs have problems and would not need to be supported at this juncture.
I should add that on the Apache server, Comm 4.x, IE5, and Netscape 6 all work OK with the Japanese path name.
Depends on: 40461
OK, I talked to ftang yesterday, and here's what he thinks the right thing to do is, as I understand it.

Problem: an anchor tag's "href" attribute can contain non-ASCII characters, but URLs can't. So how do we properly escape the non-ASCII characters in the URL so that an ASCII string results?

Solution: use the document's charset to return the href attribute to its original 8-bit encoding. Then URL-escape (e.g., " " --> "%20") the non-ASCII printables. Then call nsIURI::Resolve with the document's base URL.

I've implemented it, and sure enough, it seems to make this test case work. It leaves a bad taste in warren's mouth, but I burned my tongue a while back and can't taste a thing. Do we need to do this for *every* relative URI reference? Zoiks!

Anyway, I've implemented a layout-specific version of NS_MakeAbsoluteURI() (maybe I should call it NS_MakeAbsoluteURIWithHRef or something) that takes a charset ID and does the heavy lifting. It's in a new file which I'll attach to the bug. I'll also attach new diffs to nsHTMLAnchorElement.cpp. Comments on this approach?
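A minimal standalone sketch of the convert-then-escape-then-resolve sequence, assuming an ISO-8859-1 document (where Unicode code points below 0x100 map directly to bytes). The real code would use the charset converter for other encodings and nsIURI::Resolve for base resolution; the example URL here is made up:

    #include <cstdio>
    #include <string>

    // URL-escape 8-bit bytes and spaces, leaving the rest of the ASCII alone.
    static std::string EscapeNonAscii(const std::string& bytes) {
        static const char* hex = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : bytes) {
            if (c >= 0x80 || c == ' ') {
                out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
            } else {
                out += static_cast<char>(c);
            }
        }
        return out;
    }

    static std::string MakeAbsoluteWithCharset(const std::wstring& href,
                                               const std::string& baseURL) {
        // 1. Back to the document's 8-bit encoding (Latin-1 assumed here).
        std::string bytes;
        for (wchar_t ch : href)
            bytes += static_cast<char>(ch & 0xFF);
        // 2. URL-escape the non-ASCII printables.
        std::string escaped = EscapeNonAscii(bytes);
        // 3. Resolve against the document's base URL (naive concatenation here;
        //    the real code calls nsIURI::Resolve).
        return baseURL + escaped;
    }

    int main() {
        std::string abs =
            MakeAbsoluteWithCharset(L"caf\u00e9.jpg", "http://example.test/dir/");
        std::printf("%s\n", abs.c_str());   // http://example.test/dir/caf%E9.jpg
        return 0;
    }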
chris: this looks great. r=buster. what about other URLs, like <img src=...>? Do those need to be treated separately?
Looks good. Where do you want to put the "silly extra stuff"? If we put it in Necko, it will introduce a dependency on i18n (although maybe we have one already), but maybe this is right anyway. Any suggestions on how to specify the Necko APIs (in nsNetUtil.h) so that it is clear what you're supposed to be passing in? Maybe we just need a comment saying that the nsString is supposed to already have all this stuff done to it (charset-encoded, escaped). Or maybe we should eliminate it in favor of the new thing you wrote.
I am going to put the charset-specific version of NS_MakeAbsoluteURI() in layout; and I am going to change it to NS_MakeAbsoluteURIWithCharset() to avoid gratuitous name overloading.
I'd like to make it a method on nsIIOService provided we already depend on i18n. (I think we do for string bundles. - Gagan, Jud?)
is bug 40661 related?: ""files" datasource fails to open non-latin named directory"
No, that bug is a dup of bug 28787.
I had a quick look at the new MakeAbsoluteURIWithCharset method, and it uses Unicode conversion routines. However, Necko does not appear to use the Unicode converters. Necko may use string bundles, but they are in a separate DLL (strres.dll), while the Unicode converters are in uc*.dll. I don't know how important it is to keep Necko free of uc*.dll dependencies, but this is what I found after a quick look.
Erik: Thanks for the info. Is there any plan to bundle all (many) of the intl dlls into one as we did for necko (to improve startup time, and reduce clutter)?
We have discussed that, but I don't know of any concrete plans in that area. Frank?
I took a look at MakeAbsoluteURIWithCharset and it includes the following code at the end:

    static const PRInt32 kEscapeEverything = nsIIOService::url_Forced - 1;
    nsXPIDLCString escaped;
    ioservice->Escape(spec.GetBuffer(), kEscapeEverything, getter_Copies(escaped));

This usage of nsIIOService::url_Forced is certainly not what I wanted it to do, and I don't believe it does anything useful this way. This method is meant to escape a specific part of a URL, not a whole URL; there are different rules for every part.
Could you suggest an alternative?
Do we really need to escape this stuff? What characters that could possibly damage URL parsing can be expected from the charset conversion? Maybe you could use the old nsEscape functions in xpcom/io as a replacement.
This issue is also relevant for simple XLinks, XPath, etc. "src" attributes. When I implemented XPath and simple XLink, I asked ftang how to do this, but he could not convince me what the right thing to do is :) So what I do there is just grab the Unicode string, AssignWithConversion() to an nsCAutoString, and pass it to the URI-creating objects. This seems to work in basic cases, like spaces, but I do not think it is the correct way. What bugs me the most is this: suppose a document is in UTF-16 and a URL contains an illegal char that does not fit in ASCII. ftang said to 1) get the doc charset and use the nsIIOService ConvertAndEscape() function, but to me it seems like this cannot work for UTF-16. Or does the convert step automatically convert UTF-16 into UTF-8 (which fits into char*)? How would we then know what to send back to the server? I also seem to have trouble understanding how we escape multibyte UTF-8 URLs. If a two-byte character becomes two escaped URL chars, will the system still work? Do we send the server back the bits it gave us?
andreas: how is using the "old" nsEscape different from what I'm doing now? Are there really different rules for escaping different parts of a URI? That doesn't seem right.

heikki: nsIUnicodeEncoder takes a UCS-2 string as input and returns the result in an 8-bit string as output. Presumably it does the right thing on UTF-16 to round-trip the UTF-16 bytes.
The main reason for escaping is to hide certain special characters from the URL parser that would otherwise mislead it, like an @ in a username or a / in a filename or something similar. Depending on the position inside the URL, different characters are special to the parser, and that is what the new escape functions handle: I have to tell them which part of the URL I want to escape. Simply giving it every possible mask will not work. The old nsEscape stuff does not look at a specific part; it can be used to escape whole URLs, but it may escape too much or not enough.
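To illustrate why per-part escaping matters, here is a hypothetical example of part-specific "safe character" sets; the part names and sets are examples only, not the actual nsIIOService masks:

    #include <string>

    enum UrlPart { kUsername, kDirectory, kFileName, kQuery };

    // Which characters may stay unescaped depends on where in the URL they sit.
    static bool IsSafeInPart(unsigned char c, UrlPart part) {
        if (c >= 0x80 || c <= 0x20 || c == '%')
            return false;                       // always escape these
        switch (part) {
            case kUsername:                     // '@' or ':' would end the userinfo
                return c != '@' && c != ':' && c != '/';
            case kDirectory:                    // '/' stays, it separates segments
                return c != '?' && c != '#';
            case kFileName:                     // a literal '/' must become %2F
                return c != '/' && c != '?' && c != '#';
            case kQuery:                        // '&' and '=' stay meaningful
                return c != '#';
        }
        return false;
    }

    static std::string EscapePart(const std::string& s, UrlPart part) {
        static const char* hex = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : s) {
            if (IsSafeInPart(c, part)) {
                out += static_cast<char>(c);
            } else {
                out += '%'; out += hex[c >> 4]; out += hex[c & 0x0F];
            }
        }
        return out;
    }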
In the case of a UTF-16 document, it is not clear to me that the server is expecting a URL-encoded (%XX) UTF-16 URL. In fact, some people working on these issues in the standards bodies seem to be pushing UTF-8 as a convention in URLs. See ftp://ftp.ietf.org/internet-drafts/draft-masinter-url-i18n-05.txt. However, blindly converting *every* part of a URL to UTF-8 has bad consequences in today's Web, as Microsoft discovered. They do not convert the "query" part (the part after the question mark) to UTF-8. Also, they have a preference for the part before the question mark. In most versions, they convert that part to UTF-8 but in the Korean and Taiwanese versions they found that they had to set the default pref to "OFF" (i.e. no conversion to UTF-8), presumably because people in those countries were using servers that expect something other than UTF-8 in those parts of the URLs. For now, converting from PRUnichar to the doc's charset and then %-encoding is the best thing to do, in my opinion. We should continue to follow the emerging specs in this area, and possibly modify our implementation accordingly. UTF-16 documents are currently still quite rare, I think, but if we are really concerned about this, my suggestion is to make a special case for the UTF-16 doc charset and convert to UTF-8 (instead of UTF-16).
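If that suggestion were followed, the charset choice itself would amount to a small special case; this is a hypothetical helper, not an existing Necko API:

    #include <string>

    static std::string CharsetForUrlEscaping(const std::string& docCharset) {
        // %XX-encoded UTF-16 is not something today's servers expect, so UTF-16
        // documents would escape their URLs as UTF-8 instead; everything else
        // round-trips through the document's own charset.
        if (docCharset == "UTF-16" || docCharset == "UTF-16BE" ||
            docCharset == "UTF-16LE")
            return "UTF-8";
        return docCharset;
    }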
fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Not sure how to verify this problem. Could the reporter please check this in the latest build?
I verified this in the 2000-06-02-08 Win32, Mac, and Linux builds.
Status: RESOLVED → VERIFIED