Closed Bug 158006 Opened 23 years ago Closed 22 years ago

save as, save link target as, save image as : non-ASCII filename/title/URL handling

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.4alpha

People

(Reporter: jshin1987, Assigned: jshin1987)

References

()

Details

(Keywords: intl)

Attachments

(1 file, 3 obsolete files)

This is to add an I18N twist to bug 115176, which is by itself pretty complicated. When 'save as' is used to save the html doc (with non-ASCII URL whether URL-encoded or not, non-ASCII title) in a browser window or 'save link target as' or 'save image as' is invoked on docs and images (with non-ASCII URL whether URL-encoded or not or non-ASCII title, non-ASCII link name) linked from the page, Mozilla suggests a filename which is not what users usually expect. This is a complex problem. Some aspects of this issue cannot be resolved by web clients alone and require coordination with popular http servers. Implication is that Mozilla.org may have to be involved with standardization process to deal with this issue. I'll get back with more details soon.
Status: UNCONFIRMED → NEW
Ever confirmed: true
agree with Jungshik. future for now
Status: NEW → ASSIGNED
Target Milestone: --- → Future
A real life example sent by KANG Jeong-hee : http://www.mogaha.go.kr/webapp/viewNoticeBoardServlet?serial=214012&board_list=54&page=1&sort=reg_date&listnum=10&main=1 In the page, there's a link to '국제포럼(020821).hwp' (followed by '(61481 bytes)' above the row in a light purple( with 02-3703-4895) and right of a skyblue cell. When 'save link target as' is selected, the filename suggested by Mozilla URL-encodes non-ASCII characters. The page is in EUC-KR, but there's no guarantee that the local filesystem uses the same encoding as the web page itself. As I wrote in my bug report, there are many factors to take into account. BTW, in this particular case, 'Content-Disposition:' header is not emitted by the server. Therefore, making Mozilla's C-D header handling compliant to RFC 2231 wouldn't help. (see bug 155949 and bug 162765 for RFC 2231 compliance)
Comment 2 is more relevant to bug 161242 than here.
In particular case mentioned in comment #2, fixing bug 155569 solved the problem. The page in comment #2 is in EUC-KR(so are the URL and the filename included in the URL in question) and I'm running Mozilla in ko_KR.UTF-8 locale. The filename suggested by Mozilla is the UTT-8(the local file system encoding) representation of the file name represented in EUC-KR(the document encoding). of course, this does not mean that we can close this bug because there are more complicated cases as discussed in bug 115176.
'save target as' and 'save image as' are handled by http://lxr.mozilla.org/seamonkey/source/xpfe/communicator/resources/content/contentAreaUtils.js Once I fix bug 162765 and expose necessary methods to Javascript, it should be easy to fix this bug for most cases (some other related bugs have been resolved over the last few months by Boris). On 'non-Unicode platforms', a mismatch between the local file system encoding and what Mozilla put filenames in may still remain, though.
attachment 107217 [details] [diff] [review] to bug 162765 makes Mozilla work well for standard-compliant cases and some non-standard-compliant cases as well(RFC 2231 has to be used, but some servers use RFC 2047-encoding). Unlike its C++ counterpart (that makes use of 'originCharset' of the current URI when untagged raw 8bit chars are encountered), Javascript (for 'save target as') doesn't. I have to figure out how to access 'originCharset' of the current URI with JS. BTW, Roy, can you assign this to me?
sure. thanks
Assignee: yokoyama → jshin
Status: ASSIGNED → NEW
QA Contact: ruixu → ylong
Blocks: 162765
Sorry for spamming. I got the dependency the other way around.
No longer blocks: 162765
Depends on: 162765
Attached patch a partial patch (obsolete) (deleted) — Splinter Review
This patch has to be applied along with my patches for bug 162765 and bug 191541. Still, it only works for RFC 2231 and RFC 2047 encoded filename parameters in C-D header. When raw 8bit characters are used as filename paramter in C-D header, this patch doesn't work. The key is to figure out how to get |originCharset| of the referring document in contentAreaUtil.js. Last December, I tinkered with this problem for a while and I believe I came up with a solution, but I couldn't find any trace of it on my disk. (my computer underwent a huge overhaul..). Anyway, I'm uploading this patch so that others can take a look (there are a lot of commented-out lines and debug output, which reflects the tentative nature of the patch) and help me out. Getting originCharset in nsExternalHelperAppService.cpp was easy, but I couldn't quite figure out in JS code. BTW, I always get UTF-8 as |documentCharset| in |getDocumentCharset| at the end of my patch regardless of the charset used in the refering document. For instance, http://jshin.net/moztest/download.html is explicitly marked as in EUC-KR, but when I try to save one of files with raw 8bit chars in C-D header, I get UTF-8 in JS code. I changed the locale under which Mozilla is launched (my primary locale is ko_KR.UTF-8), but nothing has changed. I tried ja_JP.EUC-JP, ko_KR.EUC-KR and C. Any idea as to what's going on?
Adding Naoki to CC because he wrote |nsTextToSubURI| apparently because he wanted to solve problems related to this (and I used it a couple of places to solve similar problems.) Another example to show that getting |originCharset| is important is as following: http://www.mogaha.go.kr/webapp/bbs/pub/view.action?bid=200&serial=268306&sc_param=common&listStatusStr=rO0ABXNyABpjb20uZ2VuMTI4LmRhdGEuTGlzdFN0YXR1c7sGCWct5HSSAgAESQAHbWF4U2l6ZUkABm9mZnNldEwACmNvbmRpdGlvbnN0ABNMamF2YS91dGlsL0hhc2hNYXA7TAAEc29ydHQAEkxqYXZhL2xhbmcvU3RyaW5nO3hwAAAADwAAAABzcgARamF2YS51dGlsLkhhc2hNYXAFB9rBwxZg0QMAAkYACmxvYWRGYWN0b3JJAAl0aHJlc2hvbGR4cD9AAAAAAABLdwgAAABlAAAAAHh0AAZzZXJpYWw%3D Sorry this URL is reaallly long. That document is in EUC-KR and there's a jpg file linked (look for '284943'). Now click on the image linked and it will get loaded. Mozilla gets the filename correctly in Korean and nicely puts it on the titlebar. However, 'Save Image as' is attempted, it comes up with the suggested file name as if 'escaped URL' is in ISO-8859-1. Why? Because |documentCharset| is assummed to be in ISO-8859-1. It seems like there are two possible solutions. One is to figure out a way to get charset of the refering document (|originCharset|) in nsContentAreaUtil.js and the other is to set 'charset' (nominal) of image documents to that of the referring document instead of the default(?) ISO-8859-1. I just hit upon another idea, but that's beyond the scope of this bug. On several places in Mozilla, ISO-8859-1 is taken as the last resort (or sometimes the default). This may have to be subjected to the localization and/or the configurability. Anyway, I'll discuss this in another forum.
I'm sorry it appears that my second solution in my prev. comment would not work because charset is obtained from this code + else if (document.commandDispatcher.focusedWindow) { + charset = document.commandDispatcher.focusedWindow.document.characterSet; + dump ("getDocumetnCharset: charset obtained from cmddispatch.fw :" + charset + "\n"); + } rather than this code + charset = aDocument.characterSet; + charsetSrc = aDocument.characterSetSource; + dump ("getDocumetnCharset: charset obtained from aDcoument :" + charset + "\n"); + dump ("getDocumetnCharset: charset source :" + charsetSrc + "\n");
The originCharset of a URI will often be UTF8 just because someone in Mozilla internals helpfully converted the unescaped URI string to UTF8 before creating the URI object.... or just passed in a null charset at URI object creation (a lot of people do that).
> The originCharset of a URI will often be UTF8 At least in |nsExternalHelperAppService::ExtractSuggestedFilename| (see attachment 117822 [details] [diff] [review]) and somehwer else in nsImageDocument.cpp(see http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&file=nsImageDocument.cpp&root=/cvsroot&subdir=mozilla/content/html/document/src&command=DIFF_FRAMESET&rev1=1.92&rev2=1.93) it helped me with guessing the encoding of the URL. In JS code, I think I need to figure out the equivalent of what I do in attachment 117822 [details] [diff] [review] (getting the URI of a channel and the originCharset of that URI). The following may be a bit off-topic in a sense. > internals helpfully converted the unescaped URI string to UTF8 before > creating the URI object.... Is that what originCharset is supposed to indicate? I thought it indicates the charset of the referrer (referring document) if it's 'well-defined' and 'well-known'. When a standalone image is loaded in its own window by clicking on the link, even getting originCharset the way described above (via channel) wouldn't help because the charset of an 'image-only document' is set to ISO-8859-1[1], which is why |charset=document.commandDispatcher.focusedWindow.document.characterSet| gives me ISO-8859-1 (comment #11) [1] This may have to do with the following or related code: http://lxr.mozilla.org/seamonkey/source/content/base/src/nsDocument.cpp#511 For 'image-only' documents, mCharacterSet seems never set. It may be argued that |mCharacterSet| of 'stand-alone non-textual' documents have to be set to charset of the referring document if it's well-defined. > internals helpfully converted the unescaped URI string to UTF8 before > creating the URI object.... or just passed in a null charset at URI > object creation That's why I want to have originCharset, instead. I suspect this is not set to UTF-8 if the referring document is explicitly set to charset other than UTF-8. For instance, http://jshin.net/moztest/download.html is in EUC-KR and I believe originCharset of http://jshin.net/moztest/random21.yyy is EUC-KR. Otherwise, my patch to 162765 wouldn't work, but it works.
Sorry the last paragraph in comment #13 had to be deleted. Pls, ignore it.
Target Milestone: Future → mozilla1.4alpha
Attached patch a working patch (obsolete) (deleted) — Splinter Review
It turned out that I was missing the obvious. Anyway, with this patch along with patches for bug 162765 and bug 191541, 'save target as and save image as' work fine. The only exception is when 'save image as' is loaded in a standalone browser pane. In that case, 'charset' of the focused window is always set to ISO-8859-1 (see my previous comment). Anyway, that has to be dealt with in a separate bug I'm gonna file if not filed yet.
Attachment #117838 - Attachment is obsolete: true
Comment on attachment 118012 [details] [diff] [review] a working patch asking r/sr. this may be a bit premature because bug 162765 that blocks this is not yet fixed. But I'm pretty sure it'll soon (a working patch is waiting for review).
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: review?(ben)
bug 198598 has just been filed for a problem with saving 'stand-alone' 'non-textual' document.
Comment on attachment 118012 [details] [diff] [review] a working patch >+function getCharsetforSave(aDocument) >+{ >+ if (aDocument) >+ charset = aDocument.characterSet; Too much indentation. >+ else if (document.commandDispatcher.focusedWindow) { >+ charset = document.commandDispatcher.focusedWindow.document.characterSet; >+ } >+ else { >+ charset = window._content.document.characterSet; >+ } >+ return charset; >+} >+ It's inconsistent to use { } for the second and third case but not for the first. sr=jag with those nits fixed.
Attachment #118012 - Flags: superreview?(jaggernaut)
Attachment #118012 - Flags: superreview+
Attachment #118012 - Flags: review?(neil)
Attachment #118012 - Flags: review?(ben)
Thank you jag for sr. I'll fix nits as you pointed out. BTW, could you take a look at my patch for bug 162765 if you can? This bug is blocked by it and I like to land both patches together before 1.4. Thanks again.
Comment on attachment 118012 [details] [diff] [review] a working patch >+ var dispHeader = this.mContentDisposition; Is it worth duplicating this variable? >+ // To make JS engine happy. >+ var dummy = new Object(); Yeah, I hate that too, but you should use var dummy = { value: null }; >+ try { >+ fileName = mhp.getParameter(dispHeader, "filename", charset, true, dummy); >+ } >+ catch (e) { >+ fileName = mhp.getParameter(dispHeader, "name", charset, true, dummy); > } OK, so you catch an exception for the filename, can getting the name throw an exception too? That sounds bad. >+function getCharsetforSave(aDocument) >+{ >+ if (aDocument) >+ charset = aDocument.characterSet; >+ else if (document.commandDispatcher.focusedWindow) { >+ charset = document.commandDispatcher.focusedWindow.document.characterSet; >+ } >+ else { >+ charset = window._content.document.characterSet; >+ } >+ return charset; >+} You forgot to declare charset. But you could just use if (...) return ...; if (...) return ...; return ...;
Attachment #118012 - Flags: review?(neil) → review-
Thanks for catching those things, Neil.
Attached patch a new patch per neil's and jag's comments (obsolete) (deleted) — Splinter Review
Thank you for review. I addressed concerns of both of you. Can you review this patch? Thanks.
Attachment #118012 - Attachment is obsolete: true
Comment on attachment 121757 [details] [diff] [review] a new patch per neil's and jag's comments realized that neil was not on CC. asking for r/sr explicitly. BTW, I'll change two statements below with 'return fileName.replace(...);'. + fileName = fileName.replace(/^"|"$/g, ""); + return fileName;
Attachment #121757 - Flags: superreview?(jaggernaut)
Attachment #121757 - Flags: review?(neil)
Comment on attachment 121757 [details] [diff] [review] a new patch per neil's and jag's comments >+ reuturn document.commandDispatcher.focusedWindow.document.characterSet; Silly typo :-) >+ } >+ else { Don't bother with else after return (twice).
Attachment #121757 - Flags: review?(neil) → review+
Comment on attachment 121757 [details] [diff] [review] a new patch per neil's and jag's comments + reuturn document.commandDispatcher.focusedWindow.document.characterSet; Typo. Combined with Neil's comment, this becomes: function getCharsetforSave(aDocument) { if (aDocument) return aDocument.characterSet; if (document.commandDispatcher.focusedWindow) return document.commandDispatcher.focusedWindow.document.characterSet; return window._content.document.characterSet; } Does commandDispatcher.focusedWindow always point to the window you want? Can we somehow get to this function while it points to another frame, or to the XUL window instead?
Thank you for review, neil and jag. > Does commandDispatcher.focusedWindow always point to the window you want? Can > we somehow get to this function while it points to another frame, or to the XUL > window instead? I guess we're safe as long as 'Save Target As/Save Image As/Save Frame As' are concerned. As for 'Save Page As', I'm not sure what it does for framed pages if the encoding of a page is different from the encoding of a focused frame. Do you think it may be better to put window._content.document.characterSet before commandDispatcher.fW? I haven't tested framed pages, but in non-frame cases, both are always the came. BTW, you may be interested in bug 199237, too. That bug has little to do with the question at hand, but is one of several 'save'-related bugs.
I did some experiments with a framed page (http://jshin.net/moztest/dl_frame.html) and found that trying aDocument.characterset,document.commandDispatcher.focusedWindow, and window._content.document.characterSet in that order is the 'right' thing to do. For right-click-activate menu items (saev link target as, save imgge as), document.commandDispatcher.focusedWindow always work giving us charset of the focused frame. For Save Frame As, both aDocument.characterset and d.c.f work. However, window._content.document.characterSet has the charset declared in the frameset html (instead of a focused frame). Now bug 162765 this bug depends on was fixed, let's land this one as well. jag, can I get sr?
Status: NEW → ASSIGNED
Could you attach a new patch without the typo (and perhaps with the changes made as suggested in comment 24 and comment 25)? Will sr when that's attached.
Attachment #121757 - Attachment is obsolete: true
Attachment #126311 - Flags: superreview?(jaggernaut)
Comment on attachment 126311 [details] [diff] [review] a new patch (with typo fixed and redundant 'else if' eliminated) carrying over r=Neil, sr=jag When editing an existing file, prefer to stay in that file's style. In this case, don't use {} for single line |if|s. No need for a new patch.
Attachment #126311 - Flags: superreview?(jaggernaut)
Attachment #126311 - Flags: superreview+
Attachment #126311 - Flags: review+
Thank you all. Fix checked in (with braces removed). Now moving on to bug 199237.
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Attachment #121757 - Flags: superreview?(jaggernaut)
*** Bug 119028 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: