Closed Bug 51852 Opened 24 years ago Closed 23 years ago

File -> Save Page As... doesn't honor Content-encoding when page has no-cache

Categories

(Core Graveyard :: File Handling, defect, P3)

x86
All
defect

Tracking

(Not tracked)

VERIFIED WORKSFORME
mozilla0.9.9

People

(Reporter: 1212mozilla, Assigned: law)

References


Details

(Keywords: helpwanted, Whiteboard: [rtm-])

I just ran into this problem with the latest nightly build. Go to the URL and try to save it as an HTML file. I tried to save it in my home directory as /home/steveo/netaddress.html. My text editor then complains about null characters, and the file ends up looking like:

?ÅZ[sÚH~?_ÑÑÖÆvÅFàK??!Yã?)_R?Tj÷ÅÕ?´j?Ô??©ùïû?n]'&]úrÎwî§i¾88`ìZ¨??F¡?"¶;¾¹r¥Ï??/®jõÓz½v\«ù?ÆéñÛíì1ö¹ß®]wû¬ç;5vpð®ZmNÕÌÓ¿??ÞU??Þà²û®°òåÍ?Þ5;`?©`ãP?}?pÆ}á+\?î?+<3îz?'Ô?eÈÆ/dè?^4m³(6»êÚìº}ÕmY#9¡Ås¤¯°`ËÚ°ñ¿e>¸syãc7¼Ã?!?Ä(yI/ÖÈÈñX?ã?C?)É81I??vÇqdì+l%g?ûËÅT?àCd¨øÐ{jÊ«.eÌî³{!?Å\?î?÷<ÍþXpC<??úô?µYäú¬:5lA(çîH?5?QÀè^,á(²Xçæzн´,_¨åý?üõuö<óçr??Hn´+ ²?Ãñ!?õs??:å"OÑÐ"ß_ÊÀ¼Ï?ð¡«X?)?°H(LV$jÍö??Íbßu´^ﳤÆ'z|újÞ¢@ú?#@?/?ì??©öY$¹Á?=RË?hE;?çH"e(Qïì6Õ£í©Zi?ÐJä~ËtêDÚd8#¹V?ýÎmïÓ?yÜ?Ä|"ZÖ|Î?`@&]Ǿ£55êB:q´»Wý³Zárû¨yìC×fµÏ`§w^?AgÕ¿ªùä~<?¹ê?Æêù?±9?E?YpVÅÍ?åú±¶?qìéiµ9÷bPh«À³?s-}`*Âk>é?ÅbQËu´ >8?Øïk?NþÃÜ6juL¥î?íú|îN¸?a-Ëmø9Usý?øz3Þµ®ä7×ó¸......

This page does not look like this when you use View -> Page Source. I have tried to save some other HTML documents and they save just fine. I have only seen this problem with this particular site. Netscape 4 saves the file just fine.
This happens on 2000-09-06-04 WinNT4 as well.
I see this when saving from the browser, but not when saving from the editor (they go through different code paths). I see it with the given URL, but not with a simple local file URL, and also not with other HTTP URLs. Is it possible that the Save As code is writing Unicode, or some strange encoding, instead of ASCII? Though the file written by the browser is quite a bit smaller (2760 bytes vs. 9008) than the one written by the editor, and I'd expect it to be larger if it were writing Unicode. The file doesn't seem to have a meta charset tag (at least, one doesn't show up in view source).
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Linux → All
The clue is the header Content-Encoding: gzip, so the browser is saving the gzipped form while the editor is saving the expanded form.
tenthumbs: What do you have to do to get the server to give that to you? My attempt to contact the server gives me Content-Type: text/html with no Content-Encoding at all:

[steveo@steveo steveo]$ telnet www.netaddress.com 80
Trying 204.68.24.100...
Connected to netaddress.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.netaddress.com
Accept: */*

HTTP/1.1 200 OK
Date: Mon, 11 Sep 2000 14:55:54 GMT
Server: Apache/1.2.5
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache
Content-Length: 9298
Content-Type: text/html

<!-- Net@ddress (generation 34FM.0700.4.03nwcst323) (C) USA.NET, Inc. -->
......(rest of document sent in plain text)
Send Accept-Encoding: gzip,deflate,compress,identity, which is what Mozilla sends.
telnet www.netaddress.com 80
Trying 204.68.24.100...
Connected to netaddress.com.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.netaddress.com
Accept: */*
Accept-Encoding: gzip

HTTP/1.1 200 OK
Date: Tue, 12 Sep 2000 17:20:50 GMT
Server: Apache/1.2.5
Expires: Thu, 01 Jan 1970 00:00:00 GMT
Pragma: no-cache
Cache-Control: no-cache
Content-Length: 2760
Content-encoding: gzip
Content-Type: text/html

followed by compressed page.......

The issue seems to be that Mozilla should save the page according to the Content-Type (in this case text/html) rather than by the Content-Encoding, as it currently does.
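For illustration, a minimal sketch of the decode-before-save behavior being asked for, written in modern Python rather than Mozilla's code; the URL and filename are just the ones from this report:

import gzip
import urllib.request

# Request the page the way Mozilla does, advertising gzip support.
req = urllib.request.Request(
    "http://www.netaddress.com/",
    headers={"Accept-Encoding": "gzip"},
)
with urllib.request.urlopen(req) as resp:
    body = resp.read()
    # Content-Encoding is a transformation layered on top of the media type;
    # undo it so the saved bytes match the Content-Type (text/html).
    if resp.headers.get("Content-Encoding", "").lower() == "gzip":
        body = gzip.decompress(body)

with open("netaddress.html", "wb") as f:
    f.write(body)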
Summary: File -> Save Page As... saves garbage → File -> Save Page As... saves with Content-Encoding rather than by Content-Type
related to bug 35956?
->Networking
Assignee: asa → gagan
Component: Browser-General → Networking
QA Contact: doronr → tever
Confirming that the resulting file is GZIP compressed. Linux build 2000100909. I'm guessing that this should be a simple fix. ->rtm
Keywords: rtm
cc'ing neeti, gordon. May be related to their cache bug.
How often would a regular user hit this bug? What do I have to do in the browser to cause the accept-encoding to be sent with a value that would generate an unusable result?
Whiteboard: [need info]
A user will encounter this bug any time they try to save a page from a server that uses compression. Mozilla always asks servers to use compression if they are able to; most servers at this point don't seem to, but I don't have statistics. I assume this will become more common in the future. To duplicate this bug, use www.netaddress.com, as it appears to gzip content before transmission. I don't know which other servers do this. I happened to discover this bug when I tried to save their login page and modify it to make checking my email on that site easier.
darin: The proposed fix for the other bug (saving with .gz extension) would also affect this one (in a good way hopefully)... letting you investigate.
This bug is related to bug 35956; however, it only shows up when the server sends a no-cache header. Otherwise, the cache would have saved the decoded content when the page was first viewed. When the second request is made for the page, the cache would provide the content decoded. The behavior of the cache in this regard is broken (as discussed in bug 35956). ->Future
Target Milestone: --- → Future
PDT marking [rtm-] since darin moved it to MFuture
Whiteboard: [need info] → [rtm-]
Updating summary based on Darin's comments. Another way to reproduce this bug is to right-click a link to http://www.netaddress.com/ and select "save link as". (The original steps to reproduce will stop working when bug 40867 and related bugs are fixed.)
Summary: File -> Save Page As... saves with Content-Encoding rather than by Content-Type → File -> Save Page As... doesn't honor Content-encoding when page has no-cache
*** Bug 57509 has been marked as a duplicate of this bug. ***
*** Bug 65361 has been marked as a duplicate of this bug. ***
We *should* save such pages gzipped by default, but apply the correct system naming conventions. See bug 68420. However, there could be an option in the filepicker to decompress it on the fly.
http://dmoz.org/ just installed mod_gzip on their servers. I save pages from that site *all the time* and this is getting annoying, to say the least.
I would also like to state that I disagree with Andreas Schneider. Files should never be saved compressed just because they have a gzip content-encoding. If pages have a content-type of text/html they should be saved as text/html no matter what the content-encoding is. The content encoding is only meant to make network transfer faster or more reliable. It does not express how the file is stored on the server and it should not reflect how the file is saved or viewed by the client. content-type and content-encoding are very different and should be treated independently. File type should be determined only by content-type, not by content-encoding.
Stephen: Content-Encoding is *not* meant to "make network transfer faster or more reliable". HTTP/1.1 has Transfer-Encoding for that (see bug 59464, bug 68519). HTTP/1.0 does not support transfer codings, but it is very clear about content codings. I'm not familiar with mod_gzip, but it seems to be an abuse of Content-Encoding. Some quotes from RFC 1945 (HTTP/1.0):

3.5: "Content coding values are used to indicate an encoding transformation that has been applied to a resource. Content codings are primarily used to allow a document to be compressed or encrypted without losing the identity of its underlying media type. Typically, the resource is stored in this encoding and only decoded before rendering or analogous usage."

7.2.1: "A Content-Type specifies the media type of the underlying data. A Content-Encoding may be used to indicate any additional content coding applied to the type, usually for the purpose of data compression, that is a property of the resource requested."

10.3: "The Content-Encoding is a characteristic of the resource identified by the Request-URI. Typically, the resource is stored with this encoding and is only decoded before rendering or analogous usage."
I eat my hat. I was not aware of transfer-encoding. IE, Netscape 4, and mod_gzip seem to treat content-encoding as a transfer-encoding as far as I know, although mod_gzip's page does mention that it supports "chunking" in transfer-encoding. For those who are interested, mod_gzip homepage: http://www.remotecommunications.com/apache/mod_gzip/
Any difference to bug 35956 here now? With the new cache a Cache-Control header shouldn't change behavior. I haven't verified this however.
Open Networking bugs, qa=tever -> qa to me.
QA Contact: tever → benc
*** Bug 87449 has been marked as a duplicate of this bug. ***
-> law (who has a similar bug)
Assignee: gagan → law
*** Bug 88369 has been marked as a duplicate of this bug. ***
-> xp apps. If this is actually a cache or http problem, please send it back.
Component: Networking → XP Apps
QA Contact: benc → sairuh
Happens for me on 0.9.2, for pages on www.livejournal.com, e.g. http://www.livejournal.com/users/bradfitz. Note that the page doesn't have no-cache, but has Cache-Control: private, proxy-revalidate.
I don't read those quotes from RFC 1945 (HTTP/1.0) the same way Andreas does. It seems to me that:

"The Content-Encoding is a characteristic of the resource identified by the Request-URI. Typically, the resource is stored with this encoding and is only decoded before rendering or analogous usage."

is telling us that the file is stored (somewhere in the network) using this encoding. Our browser is fetching that file and rendering the data. At that point, while the Content-Encoding is interesting, it is not necessarily relevant if we want to save the rendered version of the content we are looking at. As users we are only aware of the rendered version. We have no access (that I am aware of) to any information about the encoding, and when we ask for data to be saved locally we are not, therefore, in any reasonable sense asking for the same content-encoding to be used. We are asking for that file to be saved as we *see* it. This is *post* "rendering or analogous usage", as far as I am concerned. I don't really see any other possible understanding.

At the *very least* the output file needs to have some clear indication of the saved encoding, if any is used. A step up from that would be to not use any encoding when saving a rendered document (as per my understanding, above). And perhaps the best solution would be to offer the option of encoding in any of the potential schemes, defaulting to "not encoded", with the option to change that in the Save As dialogue.

[I think it is laudable to design and implement "correctly" to the letter of the RFC, or whatever, but I think it is important to realize that applications are for users, and their view of the situation is *more important* than the "actual correct situation" as defined in a theoretical definition. In this case what matters, for example, is what the user thinks the document he or she is saving is, not what it "really is", it seems to me.]
> At the *very least* the output file needs to have some clear indication of the > saved encoding, if any is used. This is exactly what we should do. I filed bug 90490 for it.
*** Bug 97773 has been marked as a duplicate of this bug. ***
HTTP 1.1 (RFC 2068) has additional comments about Content-Encoding which are relevant to this discussion (Mozilla is, after all, an HTTP 1.1 compliant browser):

"14.11 Content-Encoding: The Content-Encoding entity-header field is used as a modifier to the media-type. When present, its value indicates what additional content codings have been applied to the entity body, and thus what decoding mechanisms must be applied in order to obtain the media-type referenced by the Content-Type header field. Content-Encoding is primarily used to allow a document to be compressed without losing the identity of its underlying media type."

It goes on to repeat the HTTP 1.0 comment with some additional remarks:

"The content-coding is a characteristic of the entity identified by the Request-URI. Typically the entity-body is stored with this encoding and is only decoded before rendering or analogous usage. However, a non-transparent proxy MAY modify the content-coding if the new coding is known to be acceptable to the recipient, unless the "no-transform" cache-control directive is present in the message."

In my opinion, the second sentence of the above paragraph ought to be struck from the next revision of HTTP. It is making a statement about something which is outside the scope of the HTTP specification, namely how entities are stored on a server.

The first paragraph I quoted above makes it quite clear that in order to obtain the original media type (i.e. the Word or PDF or whatever kind of document) one must remove the encoding. So if the browser is passing the document to an external helper application (i.e. for rendering) or is storing it with the original suffix (which on most operating systems is associated with the media type), it should first remove the encoding. Storing the entity in the cache with the encoding in place is certainly acceptable and might even be preferable, since it would take up less space in the cache.

This seems like kind of a no-brainer to me. If the browser knows enough to remove the content-encoding before rendering the entity itself (i.e. for HTML files), then it should also know enough to remove the encoding before handing it off to another renderer or storing it for later rendering by another renderer. The last sentence of the second quoted paragraph above makes this acceptable, in that it allows processes through which the entity passes to modify the encoding.
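To make that rule concrete, a small hypothetical sketch in Python (not Mozilla code; the dispatch function and suffix map are invented for illustration): decode whenever the bytes are handed to something that expects the raw media type, while the cache is free to keep the encoded form.

import gzip
import subprocess
import tempfile

def hand_to_helper(body, content_type, content_encoding, helper_cmd):
    # Helper apps expect the bytes of the media type named by Content-Type,
    # so strip any content coding first.
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    # Give the temp file the suffix the OS associates with the media type.
    suffix = {"application/pdf": ".pdf", "text/html": ".html"}.get(content_type, "")
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(body)
    subprocess.run([helper_cmd, tmp.name])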
The RFC number I cited in my previous comments was wrong, I misread the RFC header. It should have been RFC 2616.
*** Bug 94596 has been marked as a duplicate of this bug. ***
*** Bug 86235 has been marked as a duplicate of this bug. ***
This happens with .php3 pages. It is also a problem with .jsp files, e.g. threads on www.storagereview.com.
spam: over to File Handling.
Component: XP Apps → File Handling
WRT bug #86235, the many duplicates of this one, and my comment on a duplicate bug report of mine: I do not get gzipped contents for every page I save. I get a gzipped Yahoo News page (e.g. http://news.yahoo.com) too, but pages from Yahoo Groups are NOT gzipped, nor compressed in any form I can identify, e.g. http://groups.yahoo.com/group/c64rmx/message/2275

$ stat 2275
  File: "2275"
  Size: 4417  Blocks: 16  Regular File
$ file 2275
2275: data
$ cat 2275 | gunzip
gunzip: stdin: not in gzip format
$ gunzip -t 2275
gunzip: 2275.gz: not in gzip format

Can anyone enlighten me please whether this is the same bug or something else?
*** Bug 108453 has been marked as a duplicate of this bug. ***
->0.99
Target Milestone: Future → mozilla0.9.9
*** Bug 104976 has been marked as a duplicate of this bug. ***
This is a very old, very annoying issue. Has anyone spoken to the mod_gzip people about this? Perhaps they used Content-Encoding based on some clause of the spec we're missing? Otherwise, maybe they'll switch to using Transfer-Encoding, and we can evangelize the major sites using older versions (er, actually, there don't seem to be any major sites running any form of encoding at all). The RFC seems to be worded poorly regarding content- and transfer-encoding, but I'm pretty sure we're doing the right thing as far as saving it gzipped, though we should append .gz to the file name (bug 90490). "Fixing" this by making Mozilla treat content-encoding as transfer-encoding could definitely limit flexibility in the future.
The FAQ of mod_gzip: http://www.remotecommunications.com/apache/mod_gzip/mod_gzip_faq.htm#q1800

""Content-Encoding" and "Transfer-Encoding" are both clearly defined in the public IETF Internet RFC's that govern the development and improvement of the HTTP protocol which is the 'language' of the World Wide Web itself. See [ Related Links ].

"Content-Encoding" was meant to apply to methods of encoding and/or compression that have been already applied to documents BEFORE they are requested. This is also known as 'pre-compressing pages'. The concept never really caught on because of the complex file maintenance burden it represents and there are few Internet sites that use pre-compressed pages of any description.

"Transfer-Encoding" was meant to apply to methods of encoding and/or compression used DURING the actual transmission of the data itself. In modern practice, however, and for all intents and purposes, the 2 are now one and the same. Since most HTTP content from major online sites is now dynamically generated the line has blurred between what is happening BEFORE a document is requested and WHILE it is being transmitted. Essentially, a dynamically generated HTML page doesn't even exist until someone asks for it so the original concept of all pages being 'static' and already present on the disk has quickly become an 'older' concept and the originally defined black-and-white line of separation between "Content-Encoding" and "Transfer-Encoding" has simply turned into a rather pale shade of gray.

Unfortunately, the ability for any modern Web or Proxy Server to supply 'Transfer-Encoding' in the form of compression is even less available than the spotty support for 'Content-Encoding'. Suffice it to say that regardless of the 2 different publicly defined 'Encoding' specifications, if the goal is to compress the requested content (static or dynamically generated) it really doesn't matter which of the 2 publicly defined 'Encoding' methods is used... the result is still the same. The user receives far fewer bytes than normal and everything is happening much faster on the client side."
This still seems to be current in 0.9.6 :(

$ telnet phpbb.sf.net 80
Trying 216.136.171.201...
Connected to phpbb.sf.net.
Escape character is '^]'.
GET /phpBB2/index.php HTTP/1.1
Host: phpbb.sourceforge.net
Accept-Encoding: gzip, deflate, compress;q=0.9

HTTP/1.1 200 OK
Date: Fri, 23 Nov 2001 17:25:14 GMT
Server: Apache/1.3.20 (Unix) PHP/4.0.6
X-Powered-By: PHP/4.0.6
Set-Cookie: phpbb2area51=cookiedeletedbyme expires=Sat, 23-Nov-02 17:25:24 GMT; path=/
Set-Cookie: phpbb2area51_sid=01a3f91999095979835a2fd2146721b7; path=/
Cache-Control: pre-check=0, post-check=0, max-age=0
Pragma: no-cache
Expires: Fri, 23 Nov 2001 17:25:24 GMT
Last-Modified: Fri, 23 Nov 2001 17:25:24 GMT
Content-Encoding: gzip
Vary: Accept-Encoding
Transfer-Encoding: chunked
Content-Type: text/html

The gzipped data is stored when I do File -> Save As... in Mozilla (0.9.6 milestone on Windows XP).
It is correct to save it gzipped; the only problem is that .gz should be added to the filename. mod_gzip is what's broken: it should be using Transfer-Encoding for the gzipping. It's using a T-E of chunked, which I don't understand the point of (it sounds to me like it duplicates the size data from Content-Length and the reliability TCP already has), but I would assume HTTP allows multiple T-E's to apply.
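For what it's worth, the point of the chunked coding is that it lets a server stream a dynamically generated body whose total size it does not know up front, which is why the response above carries no Content-Length at all. A minimal decoder sketch in Python, assuming the raw body bytes after the headers are already in hand (trailers and chunk extensions are ignored for brevity):

def dechunk(raw):
    # Decode a chunked-coded HTTP body (RFC 2616 section 3.6.1):
    # a series of hex-length-prefixed chunks ending with a zero-size chunk.
    out, pos = bytearray(), 0
    while True:
        eol = raw.index(b"\r\n", pos)
        size = int(raw[pos:eol].split(b";")[0], 16)  # chunk-size line, hex
        if size == 0:                                # last-chunk marker
            break
        start = eol + 2
        out += raw[start:start + size]
        pos = start + size + 2                       # skip chunk data + CRLF
    return bytes(out)

# Example: two chunks, "Moz" and "illa", then the zero-size terminator.
assert dechunk(b"3\r\nMoz\r\n4\r\nilla\r\n0\r\n\r\n") == b"Mozilla"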
In what format are any pages at http://groups.yahoo.com saved?
groups.yahoo.com pages are saved as gzipped HTML too. It's all fine and well that mod_gzip (and apparently the built-in gzip routines of PHP) are broken, but mod_gzip and PHP are the software that handles the majority of gzipped pages out there. IMO Mozilla should work around this. Will take a look at some RFCs when I have the time.
With "broken", do you mean the gzip encoded data are broken? If not, I still don't know what format pages at Yahoo!Groups are stored in, because they still cannot be gunzip'ped.
Actually, mod_gzip is not broken according to the available evidence. If Mozilla prefers content to be sent using Transfer-Encoding for compression, then it needs to send a TE header indicating that. What it sent was an Accept-Encoding header, indicating that it could handle a Content-Encoding of gzip, but no TE header. So naturally mod_gzip had to use Content-Encoding.

Whether it's strictly "correct" to save the file using a .gz extension is less important than doing what the user expects. The user doesn't care about the subtleties of content-encoding versus transfer-encoding. They just want to save the file and then open it using their favorite editor or viewer.

This problem not only affects what happens when a file is saved using File -> Save As, it also affects how files are stored before being passed to an external viewer (notably PDF viewers). In any such case, the encoding should absolutely be removed to produce the indicated content-type that the external viewer is expecting; otherwise the user is unable to view the content, which may be correct, but is entirely useless.
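To illustrate the distinction: a client that wants compression applied hop-by-hop has to advertise it with a TE header (RFC 2616 section 14.39), not Accept-Encoding. A hypothetical probe in Python, not something Mozilla sends today:

import http.client

conn = http.client.HTTPConnection("www.netaddress.com")
# http.client sends "Accept-Encoding: identity" by default, so with the TE
# header below the server is told: no content coding, transfer coding OK.
conn.request("GET", "/", headers={
    "TE": "gzip",        # transfer codings we are willing to decode
    "Connection": "TE",  # TE is hop-by-hop, so it must be listed here
})
resp = conn.getresponse()
print(resp.status, resp.getheader("Transfer-Encoding"), resp.getheader("Content-Encoding"))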
There's a bug somewhere on us supporting transfer encodings. Apache uses content-encodings even without mod_gzip, but not TE, last time I checked. Anyway, File -> Save should be WYSIWYG, while right-click -> Save Link As should use content conversions. http://lxr.mozilla.org/seamonkey/source/xpfe/components/xfer/src/nsStreamTransfer.cpp#152 needs to be moved to a higher level, I think.
I'd like to point out again that any time the user gets a result that they have no way to expect or predict, confusion follows. Certainly, if the user has a page open and asks to save it, he or she should get the page in whatever form it appears to be, and not with some unspecified compression. Further, if there is a link which appears to lead to a file of a particular type, that is the type a user reasonably expects it to be saved as when a Save Link As... operation is attempted.

There is *no* reason to use the same compression that the server used either to store or transmit such documents in either situation. The intent of the user is clear in both situations: save a file of a type that matches the user's expectations. As the content and transfer encoding are not available to the user through the browser (or even to the web site creator, in many cases, it would seem), there is no way users could reasonably expect their content to be saved in such forms. And as it happens, there is no guarantee that the user will even have a method of accessing a file saved with an arbitrary compression method. GZIP, for example, is not a standard Windows compression method.

Building obscure behaviour of this nature into Mozilla will ensure that casual, non-computer-geek users find it confusing and not worth the time or effort to use. Surely that is not what we want. The process of obtaining documents through the web *is* the process of rendering them (to the screen or to local storage) for a web browser. As such, the content and transfer encoding is informational only, and should not affect the display or storage of any files. That is left to the client machine.
Agreed with File -> Save Page. The problem with right-click -> Save Link is that Apache sets the Content-Encoding to gzip for a .tar.gz file, so we have people saving downloaded files and getting the file uncompressed. See bug 35956.
... which would seem to be a bug in Apache, right? So why are we looking to fix it with Mozilla? At the worst, we have a situation where those people are getting files that lack compression. If we change things so that we always save with the Content Encoding, we risk saving files in formats that users have no convenient way to access on their system. A mysteriously uncompressed file is still accessible, at least. And when Apache sets the right encoding, Mozilla will save the files correctly. I assume that other browsers must have the same problem with files marked in such a way, right?
yes and no... other browsers have hacks in place to work around such problems. we should try to do the same in order to provide a nice experience for mozilla users.
...and IE probably has fewer people downloading tgz files than Linux Mozilla does. (Note the summary of bug 35956.)
*** Bug 114285 has been marked as a duplicate of this bug. ***
Internet Explorer always saves the file with no Content-Encoding. We currently have the problem on our marketplaces that users upload attachments, which are stored in compressed form in the database. If a user wants to see such an attachment, we get a request from the browser saying that a content-encoding of gzip or deflate is allowed, so we send the file to the browser compressed. The only logical way is for the browser to remove the encoding and save the file uncompressed, or hand it uncompressed to the helper application. Reasons:

- The user wants to download myfile.doc and not a gzipped file (most users don't know anything about compressed HTTP and so on).
- There is no standard way of keeping the information about the content encoding inside the file (with gzip maybe a .gz extension, but not if it's deflated); a normal file carries no HTTP headers.
- The content-encoding should only be kept in proxies or caches, where the HTTP headers are preserved.

For now we have to scan all requests from Mozilla and Netscape browsers and remove the Accept-Encoding header. As a result, any NS or Mozilla user has to download Word files uncompressed, taking roughly three times as long.
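A sketch of the workaround described above, as hypothetical WSGI-style middleware in Python (the user-agent match list is invented; the real marketplace code is not shown in this bug): compress only when the client both advertises gzip and is not a browser known to save the encoded bytes verbatim.

import gzip

BROKEN_SAVERS = ("Netscape6", "Gecko")  # hypothetical signatures to skip

def maybe_compress(body, environ, response_headers):
    accepts_gzip = "gzip" in environ.get("HTTP_ACCEPT_ENCODING", "")
    is_broken = any(sig in environ.get("HTTP_USER_AGENT", "") for sig in BROKEN_SAVERS)
    if accepts_gzip and not is_broken:
        response_headers.append(("Content-Encoding", "gzip"))
        return gzip.compress(body)
    # Fall back to the uncompressed bytes: roughly triple the transfer
    # time for a typical Word attachment, as noted above.
    return body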
*** Bug 114849 has been marked as a duplicate of this bug. ***
This seems to be fixed by the changes for bug 114874. It now works for me at my.yahoo.com with WinME 2001121403, correctly saving the HTML, while 0.9.6 saves the gzipped version.
What do you get with http://groups.yahoo.com? With mozilla-2001121215_trunk-0_rh7 I still get binary data that can't be gunzipped.
http://groups.yahoo.com works for me in WinME 2001121403.
groups.yahoo.com deflates the content by default and does not gzip it...
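That would explain the undecodable Yahoo! Groups files above: gunzip only understands the gzip container, while servers sent "deflate" either as a zlib-wrapped stream (RFC 1950) or, often, as a bare deflate stream. A recovery sketch in Python, using the file name from the earlier comment:

import zlib

def inflate(data):
    try:
        return zlib.decompress(data)       # zlib-wrapped deflate (RFC 1950)
    except zlib.error:
        return zlib.decompress(data, -15)  # raw deflate stream, no header

with open("2275", "rb") as f:
    print(inflate(f.read())[:200])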
*** Bug 108688 has been marked as a duplicate of this bug. ***
closing WFM based on most recent comments
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
*** Bug 121001 has been marked as a duplicate of this bug. ***
Reopening. This behaviour is still occurring on 2002011921 Linux. See dupe bug 121001 for details.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Also confirming this behaviour on 2002011604 / Win2k
Just as a note.... The current behavior has nothing to do with the "no-cache" header. It was introduced as a result of fixing bug 116445. It's a bug completely separate from this one. As such, I'd recommend reclosing this, reopening bug 121001, and assigning that to ben@netscape.com
Ah, reclosing this then. Sorry for the noise.
Status: REOPENED → RESOLVED
Closed: 23 years ago
Resolution: --- → WORKSFORME
I am using build 2002022603 on Win98 and still see this bug as originally reported:

1) Go to www.netaddress.com
2) File -> Save Page As...
3) For "Save as type:", choose "Web Page, HTML only" -- it works correctly if saving "Web Page, complete"
4) Press Save
5) Open the saved file in your favorite text editor

The file is junk (gzipped).
That's because I checked in the fix for that problem on the afternoon of the 26th (about 13 hours after your build was shipped)
mass-verifying WorksForMe bugs. reopen only if this bug is still a problem with a *recent trunk build*. mail search string for bugspam: AchilleaMillefolium
Status: RESOLVED → VERIFIED
Product: Core → Core Graveyard