Closed Bug 278161 Opened 20 years ago Closed 19 years ago

make file URLs always be in (escaped) UTF-8 regardless of the file system encoding (opening an link to a local non-ASCII file)

Categories

(Core :: Networking: File, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: jshin1987, Assigned: jshin1987)

References

Details

(Keywords: fixed1.8.1, intl)

Attachments

(3 files, 6 obsolete files)

Follow-up to bug 261929 When a local file is referred to in an html document in an encoding different from the local file system encoding, it cannot be opened because the encoding of the document (instead of the file system encoding) is used for the file URL (my comments - bug 261929 comment #19 and bug 263750 comment #17 - seem to have been incorrect) regardless of the value of 'network.standard-url.encode-utf8'. With a patch I'm gonna upload, whatever the document encoding may be, file URLs will be in escaped UTF-8 and the file system encoding will be used only when interacting with the local file system (as long as net_GetXXX helpers are used)
jshin: I think it is important to support file:// URLs that are also encoded using the filesystem charset for backwards compatibility. e.g., suppose someone has a file:// URL bookmarked, or saved somewhere. We should try to parse the unescaped file:// URL as UTF-8, and if that fails, we should fallback on the filesystem charset.
Summary: make file URLs always be in (escaped) UTF-8 regardless of the file system encoding → make file URLs always be in (escaped) UTF-8 regardless of the file system encoding
*** Bug 278166 has been marked as a duplicate of this bug. ***
do note that the gnomevfs code assumes, iirc, that file:// uris use the filesystem charset.
Attached patch first shot (obsolete) (deleted) — Splinter Review
haven't tested much yet (haven't yet compiled on Windows) darin, yes, I'm doing that in net_GetFileFromURLSpec.
(In reply to comment #3) > do note that the gnomevfs code assumes, iirc, that file:// uris use the > filesystem charset. On modern Linux where UTF-8 is the default, that wouldn't matter but I heard that one of the first things Japanese Linux users do with FC3, SuSE 9.x, etc is to switch back to ja_JP.EUC-JP (don't ask me why...). Anyway, how do we interact with gnomevfs? Do we pass a (file) URL to it? (execuse me for my ignorance) Thanks for the reference: 146 if (NS_SUCCEEDED(fileURI->SchemeIs("file", &isFile)) && isFile) { 147 gnome_vfs_get_file_info(spec.get(), &fileInfo, GNOME_VFS_FILE_INFO_DEFAULT); http://developer.gnome.org/doc/API/2.0/gnome-vfs-2.0/gnome-vfs-20-gnome-vfs-file-info-ops.html is silent about the first argument for gnome_vfs_get_file_info which is text_url. I doubt it expects the local file system encoding, however. It's more likely to expect IRI (in UTF-8)
Status: NEW → ASSIGNED
jshin: be careful... this is performance critical code. you might impact startup time significantly by asking for the UTF-16 file path first (in the XP_UNIX case at least). maybe we should optimize for the case where UTF-8 is the filesystem charset (except for Windows and OS/2).
jshin: see http://mail.gnome.org/archives/gnome-vfs-list/2004-March/msg00049.html - according to that URL, the URI passed to gnome-vfs does expect the filesystem encoding.
> the URI passed to gnome-vfs does expect the filesystem encoding. I don't like that ;-) However, we may just get away with using escaped UTF-8 and hope gnome-vfs does what we do here (try to interpret it as UTF-8 and see if it works. if it works, it's fine. if not, try file system charset.) ] Whenever we can detect the charset used for the URI type we try to ] convert it to/from utf8 automatically inside gnome-vfs. (In reply to comment #7) > jshin: be careful... this is performance critical code. you might impact > startup time significantly by asking for the UTF-16 file path first (in the Hmm... so I have to do that exercise once more .... To special-case UTF-8 for XP_UNIX, we have to make the codeset readily available by caching it rather early (xpcom init?)
maybe nsNativeCharsetUtils.h could expose an API to expose the current codeset. or maybe it could have a "IsNativeUTF8"-type function.
How about this in http://mail.gnome.org/archives/gnome-vfs-list/2004-March/msg00049.html ? It seems like we can get away with passing UTF-8 expecting gnome-vfs to take care of the encoding mismatch gracefully. > In the case of unknown encoding we try to use utf8 as much as possible, > on the grounds that thats where we want to go in the future, but as a > fallback we can also try latin1 or the local encoding. a new reference (because nsIconChannel.cpp has changed since) http://lxr.mozilla.org/seamonkey/source/modules/libpr0n/decoders/icon/gtk/nsIconChannel.cpp#240
well, that mail says: > >>o) If I give gnome-vfs a file: URI, how should non-ascii characters be > >>encoded? Should I give it escaped UTF-8, or unescaped UTF-8, or should I > >>use the filesystem's charset? > >>(the function I'm most interested in in this context is > >>gnome_vfs_get_file_info) > > > > You should use the filesystems charset (which is undefined, filenames > > are bytestrings). But sure... if UTF-8 works, use it :)
Attached patch second shot (obsolete) (deleted) — Splinter Review
I added NS_IsNativeUTF8(), but it might be still slow due to Lock/Unlock. (I may have to build an optimized build and measure the start-up time) Can I do without them ?
Attachment #171094 - Attachment is obsolete: true
You can check gInitialized before locking. That way you avoid entering the locks in most cases. Also, I think the NS_IsNativeUTF8 function should be available on all platforms.
(In reply to comment #14) > You can check gInitialized before locking. That way you avoid entering the > locks in most cases. I considered that, but I was afraid it's not thread-safe, which might be all right in this case... > Also, I think the NS_IsNativeUTF8 function should be available on all platforms. Is it for the sake of completeness/future-proofness or do you have any specific use in mind for now?
> I considered that, but I was afraid it's not thread-safe, which might be all > right in this case... Right, the only concern is if gInitialized switches from true to false. Then you'd just be reading a stale value for gIsNativeUTF8. But, that's ok since that value never changes post-initialization anyways. > Is it for the sake of completeness/future-proofness or do you have any > specific use in mind for now? I was just thinking that it would be good to have a cross-platform API even if the implementation is not interesting on some platforms, and even if the API is never called on other platforms. Are you concerned about implementing it on Windows?
(In reply to comment #16) > Right, the only concern is if gInitialized switches from true to false. Then > you'd just be reading a stale value for gIsNativeUTF8. How about false to true? LazyInit() is called twice (iconv_open is called twice), but it's closed only once... > I was just thinking that it would be good to have a cross-platform API even if > the implementation is not interesting on some platforms, and even if the API is > never called on other platforms. Are you concerned about implementing it on > Windows? No. I was just thinking that if it's not used, I wouldn't have to implement it. If you like the other way, I have no objection to adding it on all platforms. At least for now, that would return true on OSX and BeOS and false on Windows
> How about false to true? LazyInit() is called twice (iconv_open is called > twice), but it's closed only once... Hrm.. so LazyInit doesn't set gInitialized to true until after it sets gNativeIsUTF8. In other words, you don't have to worry about false to true. Like this: { if (!gInitialized) { Lock(); if (!gInitialized) LazyInit(); Unlock(); } return gNativeIsUTF8; } > No. I was just thinking that if it's not used, I wouldn't have to implement > it. If you like the other way, I have no objection to adding it on all > platforms. At least for now, that would return true on OSX and BeOS and false > on Windows Aren't there some cases where the multibyte codepage is UTF-8 on Windows?
(In reply to comment #18) > gNativeIsUTF8. In other words, you don't have to worry about false to true. > > Like this: .... Aha.. thanks. > > false on Windows > Aren't there some cases where the multibyte codepage is UTF-8 on Windows? For Indic scripts and other scripts for which there were no legacy Windows code pages, it might indeed be UTF-8 on Win 2k/XP (or it may be just ISCII-xx at least for Indic..[1] ). I have to check it out after changing my default locale to one of them. Hmm... in that case, if GetACP() is expensive (although not likely) and we use it somewhere, we may need to cache it by adding some real code. For now, I'll just use GetACP() http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/ unicode_81rn.asp http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_21bk.asp
Attached patch patch (obsolete) (deleted) — Splinter Review
NS_IsNativeUTF8 is now available on all platforms. I changed the locking as suggested by Darin. I haven't yet tested with gnomevfs (somehow 'smb' and 'sftp' don't work on FC3)
Attachment #178799 - Attachment is obsolete: true
> I haven't yet tested with gnomevfs (somehow 'smb' and > 'sftp' don't work on FC3) I'm sure the GnomeVFS people changed the behavior of the API again. It's been a royal pain in the neck trying to use that library :-(
(In reply to comment #21) > > I haven't yet tested with gnomevfs (somehow 'smb' and > > 'sftp' don't work on FC3) > > I'm sure the GnomeVFS people changed the behavior of the API again. It's been What should I do, then? The patch is ready for review except GnomeVFS issue.
OS: Linux → All
try whitelisting "ssh" for gnomevfs
smb: and sftp: are by default on, aren't they? Anyway, I've just added 'network.gnomevfs.supported-protocols' in about:config and set it to 'smb:,ssh:,sftp:', but 'smb://server/share' still doesn't work. With konqueror, it works fine. I may be doing something wrong.
yeah, sftp is, but ssh isn't. iirc, FC3 has no sftp, only ssh. Testing in konqueror is useless, it doesn't use gnomevfs. try nautilus.
I can't make ssh: work although smb works fine in mozilla after opening smb: link in nautilus. ssh: doesn't work in nautilus. Anyway, how can I test the code at http://lxr.mozilla.org/seamonkey/source/modules/libpr0n/decoders/icon/gtk/nsIconChannel.cpp#240 ? It's only about 'file://' url so that I don't have to wrestle with 'ssh:,sftp:,smb:'
moz-icon://file:///path/to/some/file
phew... Whether I applied the patch or not, moz-icon://file:///full/path/KoreanDirectoryInEUCKR doesn't work (I get just a dull grey icon) under ko_KR.EUC-KR locale. Regardless of the locale (UTF-8 or legacy), moz-icon://file:///full/path/DirectoryNameInUTF8 gives me the directory icon. Perhaps, in a separate bug, we have to modify nsIconChannel to check the return status of gnome_vfs_file_info and try again with the file name converted to the locale encoding. http://developer.gnome.org/doc/API/2.0/gnome-vfs-2.0/gnome-vfs-20-gnome-vfs-file-info-ops.html#gnome-vfs-get-file-info
Comment on attachment 178951 [details] [diff] [review] patch Asking for review.
Attachment #178951 - Flags: superreview?(brendan)
Attachment #178951 - Flags: review?(darin)
As for Windows, I found that on Windows XP (and I think the same is true on Win XP) I can only pick one of legacy code pages to use with 'A' APIs, which makes sense because 'A' APIs only work with legacy code pages. Therefore, at the moment, IsNativeUTF8 should return false on Windows.
Comment on attachment 178951 [details] [diff] [review] patch + rv = localFile->InitWithPath(NS_ConvertUTF8toUTF16(path)); + // In rare cases, a valid UTF-8 string can be valid as a native encoding + // (e.g. 0xC5 0x83 is valid both as UTF-8 and Windows-125x ) + if (NS_FAILED(rv)) + rv = localFile->InitWithNativePath(path); why would InitWithPath fail in such case?
Attachment #178951 - Flags: superreview?(brendan)
Attachment #178951 - Flags: review?(darin)
Summary: make file URLs always be in (escaped) UTF-8 regardless of the file system encoding → make file URLs always be in (escaped) UTF-8 regardless of the file system encoding (opening an link to a local non-ASCII file)
Attached patch update (possibly addressing cbie's comment) (obsolete) (deleted) — Splinter Review
I added |OpenNSPRFileDesc(PR_RDONLY,...)|. This will slow things down. Moreover, it's assumed that the file in question exists, which may not be valid. Any thought?
Attachment #178951 - Attachment is obsolete: true
I would say that assuming it's a UTF-8 URL unless the file exists is probably an OK assumption... what do others think? (Why not use Exists, btw?)
(In reply to comment #33) > I would say that assuming it's a UTF-8 URL unless the file exists is probably an > OK assumption... what do others think? > > (Why not use Exists, btw?) by a simple oversight (I knew it's there, but I couldn't find it. I was looking for it in nsILocalFile.idl instead of nsIFile.idl). If darin et al agree with us, I'll use // See if the file exists assuming its name is in UTF-8. // If not, fall back to the native file system encoding. PRBool exists; if (NS_FAILED(localFile->Exists(&exists)) || !exists) rv = localFile->InitWithNativePath(path);
Yes, the switch to UTF-8 file URLs is needed on Windows if we are ever to successfully switch over to support Unicode file paths. So, having a failover mechanism makes sense. However, this patch is wrong because there is no requirement for the file to exist when net_GetFileFromURLSpec is called. URLs and nsIFiles are pointers to files that may or may not exist. Therefore, the only time you can test for file existence to determine if this failover trick should be applied is when you are asked to open or create the file. So, nsFileChannel might be a good place for this since that is the code that opens a file URL. Of course, what about people who use nsIFileURL::GetFile. That's a problem, and I'm not sure what to do about that.
Blocks: 288154
*** Bug 281384 has been marked as a duplicate of this bug. ***
You can test this by loading a non-utf-8 page with an href to an iri: http://www.w3.org/International/tests/sec-iri-3
*** Bug 202657 has been marked as a duplicate of this bug. ***
Perhaps, we should just throw away this part in attachment 182846 [details] [diff] [review]. It's extremely low that a 'meaningful' filename (as opposed to random gibberish) encoded in a legacy encoding is a valid UTF-8 (there's a thread on the Unicode mailing list about this and the conclusion was that the chance is very low, IIRC). + // In rare cases, a valid UTF-8 string can be valid as a native + // encoding (e.g. 0xC5 0x83 is valid both as UTF-8 and Windows-125x ). + // Try to open it assuming it's UTF-8 first. If it fails, fall back to + // the native file system encoding. + PRFileDesc *file; + rv = localFile->OpenNSPRFileDesc(PR_RDONLY, 0, &file); + if (NS_FAILED(rv)) + rv = localFile->InitWithNativePath(path); + else + PR_Close(file); Darin, what do you think?
Attached patch patch with a 'less-than-bullet-proof' fallback (obsolete) (deleted) — Splinter Review
I realized that there are some temporary patches that had been added with the understanding that they'd be removed once this bug is fixed. I got rid of them as well as the wrong fallback routine in the previous patch. If we only need to worry about bookmarks for 'file urls', I guess this is good enough.
Attachment #182846 - Attachment is obsolete: true
> + // Try to open it assuming it's UTF-8 first. If it fails, fall back ... > Darin, what do you think? This sounds like a good plan to me.
(In reply to comment #41) > > + // Try to open it assuming it's UTF-8 first. If it fails, fall back > ... > > Darin, what do you think? > > This sounds like a good plan to me. Darin, I'm afraid I failed to make myself clear. What I meant in comment #41 is that it may not be so bad to get rid of the fallback code in attachment 182846 [details] [diff] [review] ( that is WRONG anyway : comment #35) and 'bite the bullet' (of causing a potentia l incompatibility). My reasoning for that is that there's very low chance that a valid UTF-8 string is a valid and 'non-gibberish' string in a legacy encoding. If the potential incompatibility is expected to only occur in bookmarks with fi le urls, I'm pretty sure that the risk is very low (unless somebody intentionall y makes a 'gibberish' file name that is both valid in UTF-8 and a legacy encodin g and bookmarks it to expose the problem. Could that be a security risk somehow? ) here's just a random idea: comment #35 > who use nsIFileURL::GetFile. That's a problem, and I'm not sure what to do > about that. Does the file corresponding to an nsIFileURL have to exist in this case? If so (which is not likely), can we modify net_GetFileFromURLSpec to have a 3rd optional argument for existence check (or rather legacy encoding fallback)? http://lxr.mozilla.org/seamonkey/ident?i=EnsureFile
Blocks: 228437
A nsIFile does not have to point to a file that exists. The same goes for nsIFileURL. I'm happy to leverage an IsUTF8 check to avoid having to code messy failover logic.
Attached patch patch updated to the trunk (deleted) — Splinter Review
Asking for r/sr. Thanks, Darin, for relieving me of that potentially messy stuff.
Attachment #209284 - Attachment is obsolete: true
Attachment #216317 - Flags: superreview?(darin)
Attachment #216317 - Flags: review?
Comment on attachment 216317 [details] [diff] [review] patch updated to the trunk >Index: netwerk/base/src/nsURLHelperUnix.cpp ... > // Escape the path with the directory mask >- if (NS_EscapeURL(ePath.get(), ePath.Length(), esc_Directory+esc_Forced, escPath)) >- escPath.Insert(prefix, 0); >- else >- escPath.Assign(prefix + ePath); >+ NS_EscapeURL(NS_ConvertUTF16toUTF8(ePath), >+ esc_Directory+esc_Forced+esc_AlwaysCopy, escPath); >+ escPath.Insert(prefix, 0); While the old code path was more complex, it was also a bit more efficient in the case where nothing needs to be escaped. I think you should change this to: nsAutoString path; rv = aFile->GetPath(path); ... NS_ConvertUTF16toUTF8 ePath(uPath); if (NS_EscapeURL(ePath.get(), ePath.Length(), esc_Directory+esc_Forced, escPath)) escPath.Insert(prefix, 0); else escPath.Assign(prefix + ePath); That way, you avoid one realloc in the case where no escaping was needed. This same nit applies to all of the files. >+ else >+ // if path is not in UTF-8, assuming it is encoded in the native charset >+ rv = localFile->InitWithNativePath(path); Please indent the comment This same nit applies to all of the files. >Index: netwerk/base/src/nsURLHelperWin.cpp >+ ePath.ReplaceChar(PRUnichar(0x5cu), PRUnichar(0x2fu)); It might be a little clearer if you used uppercase 'C' and 'F' in the above. Same for the OS2 file. r+sr=darin w/ nits picked
Attachment #216317 - Flags: superreview?(darin)
Attachment #216317 - Flags: superreview+
Attachment #216317 - Flags: review?
Attachment #216317 - Flags: review+
Attached patch patch for the branch (obsolete) (deleted) — Splinter Review
Darin, thanks for r/sr. patch with your concerns addressed got landed on the trunk. this patch is for 1.8branch and includes changes suggested in the review comment.
(In reply to comment #47) > you don't need the esc_AlwaysCopy anymore, do you? Thanks for pointing that out. I've just checked in the fix. > > nsURLHelperOSX.cpp needs no changes? are paths always UTF-8 there? Yes, it's always in UTF-8 (except that it is in NFKD instead of NFC, which I took care of in nsLocalFileOSX.cpp a couple of years ago). > + // XXX In rare cases, a valid UTF-8 string can be valid as a > native > > not sure that this should be an XXX comment. we don't consider that a bug, > right? Well, it depends. There's a very low chance that somebody may hit that problem during the transition period.
The USE_STDCONV case is missing a declaration for IsNativeUTF8().
Attachment #216670 - Flags: review?(jshin1987)
Comment on attachment 216670 [details] [diff] [review] Fix USE_STDCONV bustage (checked in, trunk) Sorry for the bustage and thanks for the fix
Attachment #216670 - Flags: review?(jshin1987) → review+
Comment on attachment 216670 [details] [diff] [review] Fix USE_STDCONV bustage (checked in, trunk) The bustage fix has been checked in on the trunk.
Attachment #216670 - Attachment description: Fix USE_STDCONV bustage → Fix USE_STDCONV bustage (checked in, trunk)
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Blocks: 193358
Blocks: 88292
To land this patch on 1.8 branch, we need to take care of bug 332108 (and other cases if there's any).
Attachment #216605 - Attachment is obsolete: true
Depends on: 333703
Blocks: 261929
Depends on: 332108
Comment on attachment 217263 [details] [diff] [review] branch patch updated to include STDCONV fix a=darin for 1.8.1
Attachment #217263 - Flags: approval-branch-1.8.1+
landed on the 1.8 branch along with the patch for bug 332108
Keywords: fixed1.8.1
*** Bug 345882 has been marked as a duplicate of this bug. ***
Unfortunately this amounts to a change in the Plugin API, without any way for plugins to detect it, except to **** around with the User Agent string.
Blocks: 211961
Depends on: 409796
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: