Closed Bug 549295 Opened 15 years ago Closed 15 years ago

new lines are inserted in downloaded html.

Categories

(Core :: DOM: Serializers, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -
blocking1.9.2 --- needed
status1.9.2 --- .7-fixed

People

(Reporter: michel, Assigned: laurent)

References

()

Details

(Keywords: regression)

Attachments

(4 files, 1 obsolete file)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) Moin Moin. From Firefox 3 something on Firefox started to manimuplate the downloaded html-text by inserting new-lines. As this effectively hinders one from removing unwanted html-parts by using search and replace, I do have to use the ie now to download the html text. Is there a possibility to save the html-text verbatim. The aforementioned hyperling points to an msd-page, where ff 3.6 inserts new-lines even in the title of a paragraph or in simple sentences: Search for: _________________________________________________________________ Language Identifier Constants and Strings _________________________________________________________________ which should be one line. _Tschuess, __Michael. Reproducible: Always Steps to Reproduce: 1.Open "http://msdn.microsoft.com/en-us/library/dd318693%28VS.85%29.aspx" 2.Press File/Save Page as 3.View the written html file. Actual Results: Unsolicited new-lines breaking the searchability. Expected Results: The original html text.
This seems to have been a deliberate change from 3.5 to 3.6 -- it's trying to word-wrap the HTML so that it is more readable, but in this case it makes things worse. I agree that there should be a way to force it to save exactly the bits that came over the network, at least.
Status: UNCONFIRMED → NEW
Component: General → File Handling
Ever confirmed: true
Product: Firefox → Core
QA Contact: general → file-handling
Reporter, are you saving as "Web page, complete" (as oppose to "Web page, HTML only", which does not change the source in any way) in both cases? Zack, there is such a way; save as "HTML only" as opposed to "complete".
I just tried, and Fx 3.5 and 3.6 seem to have the same behavior for me here when saving as "complete" (and in neither case do I see an obvious newline being inserted anywhere; certainly not in the "Language Identifier Constants and Strings" heading, as comment 0 seems to imply).
I tested it again and indeed 3.6 doesn't modify the file when saving as "HTML only" - but it does seem to reformat it when saving as "complete".
Yes, when saving as "complete" what's saved is a serialization of the DOM. This includes some reformatting in some cases, depends on what scripts did to the page, rewrites a bunch of URI attribute values to point to the saved copies of some subresources, etc, etc. So what exactly is the problem? I can't tell at this point...
Reformatting is definitely happening when the page is saved as "complete" - it's most obvious if you compare the block of "javascript:__doPostBack(..." table entries near the head, in this and the other attachment. I didn't do a 3.5 vs 3.6 comparison.
Attachment #429761 - Attachment mime type: application/octet-stream → text/plain
Well, the OP wants no reformatting, so I guess the live question is whether he's happy with using "HTML only" instead of "complete".
Wait. Is this all about reformatting in the source as opposed to visual? With "complete" there's no guarantee of what you get looking anything like the original source. The point is to preserve the DOM, not the source. So if that's all this is about, the bug is invalid. Comment 1 says there's a change from 3.5 to 3.6. Is there?
It looks to me like 3.6 is much more aggressive about inserting line breaks in reserialized DOM than 3.5 was. Search for "Ethiopia" in both this attachment and attachment 429765 [details] - 3.6 broke the line immediately after that, 3.5 didn't. 3.6 also will insert a line break in between "<a" and "href", and in the middle of style="..." attribute values. Whether that's a *bug* is disputable, I think.
Laurent, did we make purposeful changes of some sort here?
Moin Moin. > Wait. Is this all about reformatting in the source as opposed to visual? Yes, as sometimes You need to edit the text in an editor, which is verrrrry tedious if it has been reformatted, as the former versions of FF did not. > With "complete" there's no guarantee of what you get looking anything like the > original source. The point is to preserve the DOM, not the source. So if > that's all this is about, the bug is invalid. Well it used to be, as the name implies, so please, forgive me if I sticked to the description "Web page, complete" instead of an internal definition. However, I cant save the document twice (as complete page and then the original text.). Which solves the bug for me. But as I assume, that this behaviour might irritate other users as well (who just do not know bugzilla). Hence I suggest to add a checkbox in the files save dialog, upon choosing complete allowing to save the verbatim text using the same routine as in the html-only case (maybe saving it twice and overwriting the firstly saved file.). Or just addin a new entry like "Web page, complete, with verbatim HTML preservation" . _Tschuess, __Michael.
P.S.: I hope I am not too pushy here (or even obnoxious). If You do think, that this topic is a waste of time, then drop it please. P^2.S.: "However, I cant" should be "However, I can", of course ("http://www.inanime.com/imgx/ryoohki_eaten_carrot.jpg").
There's no way to preserve the original HTML when saving "complete" because you have to rewrite various attributes to point to local versions of the files. That said, if this changed from Firefox 3.5, would you be willing to narrow down the date range when this changed using nightly builds?
Actually, nevermind. I can reproduce in 3.6 locally; I'll hunt down the range.
Ok, this changed in the range in which bug 422403 landed.
Assignee: nobody → laurent
Blocks: 422403
blocking1.9.2: --- → ?
blocking2.0: --- → ?
Laurent, did that patch change the behavior of the serializer when neither OutputRaw nor OutputFormatted is set? Should webbrowserpersist now be setting OutputRaw to get the old behavior?
Uh, yeah. This is totally a regression from bug 422403. Old code flow in AppendText for HTML: if (inPre) { // appendconvertlf } else if (raw) { // append } else if (!format) { // if long lines, wrap, else append } else { // wrap } "long" in this context is 128 chars of text on the line. New code flow: if (inPre || raw) { // appendconvertlf } else if (format) { // append formatted wrapped } else if (wrap) { // wrap } else { // if has long lines, wrap, else appendconvertlf } But the new init code for the XHTML content serializer always sets OutputWrap if OutputRaw is not set. So for XHTML that last |else| is never hit. That's wrong, wrong, and again wrong. Especially since the claim is that this behavior is "to keep compatible behaviors for old callers of this interface", since it does nothing of the sort.
Now maybe the firefox ui should still pass OutputRaw here. But there's still a serializer bug.
Component: File Handling → Serializers
QA Contact: file-handling → dom-to-text
Not blocking on the branch, but please do request branch approval when you have a patch.
blocking1.9.2: ? → ---
Ok, I'm going to work on it.
Attached patch The patch (obsolete) (deleted) — Splinter Review
This patch just removed an assignement of nsIDocumentEncoder::OutputWrap to default flag. I also updated tests. Results are more consistent with the expected behavior. In fact, expected results are same results before the patch of bug 422403. I must admit I made a mistake :-) I still have to check some things before asking a review.
P.S.: It is warming to see, that others make errors similar to the ones I do make. Thank You for hunting it down. "http://www.welt.de/multimedia/archive/00262/Baer5_DW_Reise_Khut_262781g.jpg".
Attached patch patch v1.1 (deleted) — Splinter Review
a new patch, with more tests fixed, ready for a review.
Attachment #430827 - Attachment is obsolete: true
Attachment #434372 - Flags: superreview?(bzbarsky)
Attachment #434372 - Flags: review?(Olli.Pettay)
Blocks: 524975
Comment on attachment 434372 [details] [diff] [review] patch v1.1 sr=bzbarsky. Don't forget to backport to 1.9.2 as needed!
Attachment #434372 - Flags: superreview?(bzbarsky) → superreview+
Thank you Boris. As we just discussed here https://bugzilla.mozilla.org/show_bug.cgi?id=524975#c37 I will modify a bit the patch for the trunk, in order to remove the flag I added in the attachment 409324 [details] [diff] [review] (http://hg.mozilla.org/mozilla-central/rev/4ae32b827a7f) for an attempt to fix bug 524975.
Comment on attachment 434372 [details] [diff] [review] patch v1.1 Sorry for the delay.
Attachment #434372 - Flags: review?(Olli.Pettay) → review+
Attachment #434372 - Flags: approval1.9.2.3?
I have no problems taking this for 1.9.3, but I wouldn't hold the release for it.
blocking2.0: ? → -
landed http://hg.mozilla.org/mozilla-central/rev/16d33b6f570e (sorry Olli, I made a mistake in your name in the message of the commit :-( )
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Comment on attachment 434372 [details] [diff] [review] patch v1.1 Since the bug broke the old behavior of the html serializer, and causes issue for copy paste etc, The patch should be applied in the 1.9.2 branch
Attachment #434372 - Flags: approval1.9.2.4? → approval1.9.2.5?
blocking1.9.2: --- → needed
Comment on attachment 434372 [details] [diff] [review] patch v1.1 a=beltzner for mozilla-1.9.2 default Thanks for writing tests!
Attachment #434372 - Flags: approval1.9.2.5? → approval1.9.2.5+
Attachment #434372 - Flags: approval1.9.2.5+ → approval1.9.2.6+
patch checked in mozilla-1.9.2 http://hg.mozilla.org/releases/mozilla-1.9.2/rev/4c49ce3c8275 (gecko 1.9.2.6)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: