115328 - Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content

Reporter

Description

•

23 years ago

In mfcEmbed, load a page containing some JS that adds extra elements (e.g. banner adverts). Call nsIWebBrowserPersist::SaveDocument on webBrowser and it saves the modified DOM, not the original one. This means that the saved copy contains the elements added by the JS, and when the saved copy is loaded again, the JS runs once more and adds more elements, effectively doubling up everything. If possible, webBrowser should pull the original HTML from the cache for the current page, parse that into its own DOM and save from that.
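A toy model of the sequence in the description may make the doubling easier to see. This is only a sketch, not Mozilla code: `load` and `save_dom` are hypothetical stand-ins for page loading and for a SaveDocument-style serialization of the live DOM, and the script is modelled as an opaque tag whose only effect is to insert one banner right after itself.

```python
# Hypothetical stand-ins; none of these names come from the Mozilla codebase.
SCRIPT = "<script src='banner.js'></script>"   # assumed script that writes a banner
WRITTEN = "<div class='banner'>ad</div>"       # assumed output of that script


def load(source: str) -> str:
    """Return the live DOM as markup: the script stays in place and its
    output is inserted immediately after it, once per script occurrence."""
    return source.replace(SCRIPT, SCRIPT + WRITTEN)


def save_dom(live_dom: str) -> str:
    """Serialize the live (already modified) DOM verbatim."""
    return live_dom


original = "<body>" + SCRIPT + "<p>article</p></body>"

first_view = load(original)     # what the user sees online
saved = save_dom(first_view)    # the saved copy keeps script AND banner
second_view = load(saved)       # reopening the saved copy runs the script again

print(first_view.count(WRITTEN))
print(second_view.count(WRITTEN))
```

Saving `original` instead of `first_view` would avoid the second banner in this model, which is the behaviour the description asks for.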

Alex Weyers

Comment 1

•

23 years ago

Sample website: http://msn.espn.go.com/main.html Menus at top and bottom of page as well as flash content are added by document.write() statements.

Adam Lock

Reporter

Comment 2

•

23 years ago

*** Bug 118792 has been marked as a duplicate of this bug. ***

Adam Lock

Reporter

Updated

•

23 years ago

Target Milestone: --- → Future

Adam Lock

Reporter

Comment 3

•

23 years ago

Expand this bug to cover the general issue of how to get a fresh DOM from the cached data for the currently loaded URI. I think it could be pretty tricky, especially when frames and such are taken into account.

OS: Windows 2000 → All

Hardware: PC → All

Summary: webbrowser saves the modified DOM, not the original → Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache

Jason Kersey

Comment 4

•

23 years ago

This makes the save webpage complete feature completely useless to me. No chance of getting to this sooner? Also, it's in all builds, not just embedding ones.

Ben Goodger (use ben at mozilla dot org for email)

Updated

•

23 years ago

Blocks: 115634

Adam Lock

Reporter

Comment 5

•

23 years ago

There is not much I can do about this problem until there is some way of obtaining the original DOM (and sub-DOMs for framesets) for a given URI. To me that means either a copy of the original should be kept for such purposes, or it should be possible to reconstruct one from the cached data.

Johnny Stenback (:jst)

Comment 6

•

23 years ago

The original DOM is not accessible after it's loaded, we don't keep an extra copy of the DOM in memory for obvious reasons.

Jeremy M. Dolan

Comment 7

•

23 years ago

Bringing over dataloss keyword from dupe. Blah.

Keywords: dataloss

Boris Zbarsky [:bzbarsky]

Comment 8

•

23 years ago

*** Bug 137784 has been marked as a duplicate of this bug. ***

Jean-Marc Desperrier

Comment 9

•

23 years ago

Can't the solution used for bug 40867 be used here to get back the original content? Of course, you would still need to reparse the DOM from the copy in the cache.

Judson Valeski

Updated

•

23 years ago

Keywords: topembed

Judson Valeski

Comment 10

•

23 years ago

Given the complexity of the true fix here (after discussions with Adam), we're topembed minusing this.

Keywords: topembed → topembed-

Michael Dunn

Comment 11

•

23 years ago

Internal Reference: http://bugscape.netscape.com/show_bug.cgi?id=12989

Boris Zbarsky [:bzbarsky]

Comment 12

•

23 years ago

*** Bug 148614 has been marked as a duplicate of this bug. ***

Adam Lock

Reporter

Comment 13

•

22 years ago

*** Bug 154902 has been marked as a duplicate of this bug. ***

sairuh (rarely reading bugmail)

Comment 14

•

22 years ago

Another sample: doing a save "web page, complete" on the following page with frames also exhibits this bug: http://developer.apple.com/techpubs/macosx/Essentials/AquaHIGuidelines/index.html

Whiteboard: se-radar

Jesse Ruderman

Comment 15

•

22 years ago

Would Mozilla use the current DOM or a new DOM (from the cached html) when determining which images to save?

Jennifer Kobayashi

Comment 16

•

22 years ago

Requested downgrade of stopper status on proprietary embed client - request denied. Adding kw:topembed to open this up for review. Please see: http://bugscape.netscape.com/show_bug.cgi?id=12989 for more detailed information.

Keywords: topembed- → topembed

saari (gone)

Comment 17

•

22 years ago

topembed+ per EDT

Keywords: topembed → topembed+

Torben

Comment 18

•

22 years ago

*** Bug 179490 has been marked as a duplicate of this bug. ***

Boris Zbarsky [:bzbarsky]

Comment 19

•

22 years ago

*** Bug 182546 has been marked as a duplicate of this bug. ***

Brant Gurganus

Comment 20

•

22 years ago

By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and <http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss bugs are of critical or possibly higher severity. Only changing open bugs to minimize unnecessary spam. Keywords to trigger this would be crash, topcrash, topcrash+, zt4newcrash, dataloss.

Severity: normal → critical

Adam Lock

Reporter

Updated

•

22 years ago

Depends on: 191023

Juan José Mata

Comment 21

•

22 years ago

5/5 EDT triage: minusing topembed+ status. Dropping this from the radar to better focus on existing working set.

Keywords: topembed+ → topembed-

Boris Zbarsky [:bzbarsky]

Comment 22

•

21 years ago

*** Bug 218416 has been marked as a duplicate of this bug. ***

Matthias Versen [:Matti]

Comment 23

•

20 years ago

*** Bug 274745 has been marked as a duplicate of this bug. ***

timeless

Comment 24

•

20 years ago

*** Bug 283622 has been marked as a duplicate of this bug. ***

mansheier

Comment 25

•

19 years ago

*** Bug 299752 has been marked as a duplicate of this bug. ***

Benjamin Smedberg

Comment 26

•

19 years ago

Hrm, I don't want it to save the original HTML, I want it to save the current DOM exactly as I see it on the screen; but in that case I probably also want to remove all scripts from the page.
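The alternative raised in this comment (serialize what is on screen, minus the scripts) can be sketched with Python's standard `html.parser`. This is an illustration under assumptions, not how Mozilla's persist code works: `ScriptStripper` and `strip_scripts` are hypothetical names, and the sketch re-emits tags roughly as parsed rather than reproducing a real DOM serializer.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit markup as parsed, dropping every <script> element (hypothetical helper)."""

    def __init__(self):
        # Keep character references unconverted so they can be written back as-is.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def _emit(self, text):
        # Anything seen while inside a <script> element is discarded.
        if not self.in_script:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")


def strip_scripts(markup):
    parser = ScriptStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


live_dom = "<body><script src='banner.js'></script><div>ad</div><p>a &amp; b</p></body>"
print(strip_scripts(live_dom))
```

The generated `<div>` survives and the script that produced it does not, so reopening the result cannot add a second copy. Inline event handlers and `javascript:` URLs are not touched by this sketch.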

Boris Zbarsky [:bzbarsky]

Comment 27

•

19 years ago

If you remove all scripts, then the page may not work properly after being saved...

Benjamin Smedberg

Comment 28

•

19 years ago

Well, in most cases I don't want it to "work" at all: most of the time when I save a page it is a receipt or other record of what I am currently viewing: I don't want it to change and I don't want it to run scripts with the security privileges of a file: URL anyway (that's a different bug, I know). I can understand there might be other situations in which saving the original HTML with all its scripts is a good idea, but I don't want this behavior to be left undiscussed and treated as if there was only one true way.

mansheier

Comment 29

•

19 years ago

(In reply to comment #28)
> Well, in most cases I don't want it to "work" at all: most of the time when I save a page it is a receipt or other record of what I am currently viewing: I don't want it to change and I don't want it to run scripts with the security privileges of a file: URL anyway (that's a different bug, I know).
>
> I can understand there might be other situations in which saving the original HTML with all its scripts is a good idea, but I don't want this behavior to be left undiscussed and treated as if there was only one true way.

In my opinion, the discussion whether the JavaScript-implementation of Firefox is secure or not is another issue. It has nothing to do with saving the webpage.

The Internet Explorer uses its own Model for saving - that is ****. Opera does almost what I want - save the webpage exactly how it comes from the server, but Opera does not change the links between <noscript>-tags.

I think that "HTTrack Website Copier" is a good example of how to save a webpage. It leaves the files as they are and replaces all the links in a webpage with relative paths and saves the referenced files (for example images, css, scripts, ...). Since HTTrack is open source, maybe this is a possible way for Firefox too.
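The link-rewriting step described for a mirroring tool can be sketched as below. The on-disk layout (`host/path`, with `index.html` for directory URLs) and the names `local_path` and `relative_link` are assumptions made for illustration; this is not taken from HTTrack's actual implementation.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url):
    """Map an absolute URL to a path inside the mirror directory.
    Assumed layout: <host>/<path>, with index.html for directory URLs."""
    parts = urlsplit(url)
    path = parts.path
    if path == "":
        path = "/index.html"
    elif path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


def relative_link(page_url, href):
    """Rewrite one link found in page_url so it points at the mirrored copy."""
    target = local_path(urljoin(page_url, href))
    page_dir = posixpath.dirname(local_path(page_url))
    return posixpath.relpath(target, start=page_dir)


page = "http://example.com/news/today.html"
for href in ("/img/logo.png", "style.css", "http://cdn.example.net/a.js"):
    print(href, "->", relative_link(page, href))
```

Only the link attributes change; the saved files themselves, including any scripts, stay byte-for-byte as served, so script output is never baked into the saved HTML.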

Jesse Ruderman

Updated

•

19 years ago

Summary: Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache → Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content

Brian 'netdragon' Bober

Comment 30

•

19 years ago

I encountered this problem today at Eyewonder. I'll attach a testcase in a sec. We should save the files exactly as they come from the server. Saving generated content instead of the files as they are on the server is bug 120457. --> Reassigning

Boris Zbarsky [:bzbarsky]

Comment 31

•

19 years ago

Brian, saving as "web page, complete" modifies the content by definition. There is no feasible way to save the original non-generated content when doing a "web page, complete" save. Please read up on the code before making any more comments like that, ok?

Brian 'netdragon' Bober

Comment 32

•

19 years ago

Attached file: testcase (deleted)

Being feasible and being possible are two different things. It's not reasonable to have both the generated content and the original JS saved in the same file (plus it makes debugging pages a lot harder). I know that making additional requests of the server for the HTML file is dangerous, especially when it's a CGI request, but I don't see a reason why we can't store the original HTML file in memory until the frame is destroyed, and then just re-request any images, JS, etc. that go with it. Even with 10 pages open, we're only talking about an additional 400K of memory on average.
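The proposal in this comment (hold the original HTML per frame until the frame is destroyed) can be sketched as a small store. `SourceStore`, its method names and the frame ids are hypothetical, and the 40 KiB page size is an assumption chosen so that 10 pages come to the roughly 400K mentioned above.

```python
class SourceStore:
    """Keeps the original markup of each loaded frame (hypothetical sketch)."""

    def __init__(self):
        self._by_frame = {}

    def on_load(self, frame_id, source):
        # Remember the bytes exactly as they arrived from the server.
        self._by_frame[frame_id] = source

    def on_destroy(self, frame_id):
        # Release the copy as soon as the frame goes away.
        self._by_frame.pop(frame_id, None)

    def source_for_save(self, frame_id):
        # A save would serialize this instead of the live DOM.
        return self._by_frame[frame_id]

    def total_bytes(self):
        return sum(len(source) for source in self._by_frame.values())


store = SourceStore()
for frame_id in range(10):
    store.on_load(frame_id, b"x" * (40 * 1024))   # assumed 40 KiB per page

print(store.total_bytes())   # 10 * 40 KiB
store.on_destroy(0)
print(store.total_bytes())
```

The sketch leaves out the re-requesting of images and scripts mentioned in the comment; it only covers the in-memory copy of the HTML.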

Brian 'netdragon' Bober

Comment 33

•

19 years ago

Boris: I understand link URLs will be modified, but besides that, can't the DOM be left intact?

See also:
Bug 60426 - Allow users to choose between generated and source html in view-source
Bug 120457 - "Save As" should optionally allow to save generated content of a page

Boris Zbarsky [:bzbarsky]

Comment 34

•

19 years ago

> Boris: I understand link URLs will be modified, but besides that, can't the DOM be left intact?

The "web page, complete" mode is meant to be a "save what the user sees" mode and that's what it does. Doing that means saving the DOM. That's not likely to change unless the UI folks decide this option should have a completely different behavior in general.

Brian 'netdragon' Bober

Comment 35

•

19 years ago

Thanks for clarifying. So the conundrum is whether we want to save the scripts or remove them. Removing them would cause problems for user-initiated events, such as expanding divs, whereas leaving them can cause duplicated content. For testing problems with user-initiated events, leaving the scripts is useful, but could also be a security issue (comment #28) and can cause duplicated content when the DOM is written (the issue in this bug).

An additional option for saving the original page (with links modified) is RFE bug 271571 (which I reported in 2004 and forgot about) and isn't relevant to this bug.

"Workaround": I determined a quick way (based on Boris's description of our behavior) to get rid of the duplicated content in simple cases. If you look at the testcase, then delete the generated content with DOM Inspector (in this case, the DIV), then "Save Complete", you'll get a version without the data repeated. This will be very hard for some really complicated generated content, but will work well for pages where it is generated in only one place on the page.

Boris: Is bug 120457 already covered by our current behavior and therefore INVALID?

Boris: I think what we might be able to do (some day down the road) for all bugs regarding generated content, like this and bug 60426, is keep a list of page changes, sort of like an undo list, so that we can allow people to choose which generated content they want to remove from the page. This could be pref-disabled by default so as not to bloat memory usage. Do you think this would be possible in the back-end, and beyond that would it also be possible for a 3rd party extension to perform this function?
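The "undo list" idea at the end of this comment can be sketched as a log of script-made insertions that can be reverted one at a time before saving. Everything here is hypothetical (`Node`, `ChangeLog`, the bracketed text stand-ins for elements); it says nothing about whether the back-end could record such changes.

```python
class Node:
    """Minimal stand-in for a DOM node."""

    def __init__(self, text):
        self.text = text
        self.children = []

    def serialize(self):
        return self.text + "".join(child.serialize() for child in self.children)


class ChangeLog:
    """Records insertions made by scripts so they can be undone individually."""

    def __init__(self):
        self.entries = []   # (parent, node) pairs, in the order they happened

    def append_child(self, parent, node):
        parent.children.append(node)
        self.entries.append((parent, node))
        return len(self.entries) - 1   # id the user could pick from a list

    def revert(self, change_id):
        parent, node = self.entries[change_id]
        # Compare by identity so static content that looks the same is kept.
        parent.children = [c for c in parent.children if c is not node]


body = Node("body:")
body.children.append(Node("[article]"))                 # static content, not logged
log = ChangeLog()
banner_id = log.append_child(body, Node("[banner]"))    # script-generated content

before = body.serialize()
log.revert(banner_id)
after = body.serialize()
print(before)
print(after)
```

This automates the DOM Inspector workaround above: the generated node is removed, the script stays, and the saved page regenerates the content once when reopened.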

Adam Guthrie

Comment 36

•

19 years ago

*** Bug 305437 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Comment 37

•

18 years ago

*** Bug 364711 has been marked as a duplicate of this bug. ***

Phil Ringnalda (:philor)

Updated

•

18 years ago

Assignee: adamlock → nobody

QA Contact: adamlock → docshell

Dinar

Comment 41

•

15 years ago

Hello. Content produced by document.write() should not be saved a second time as regular HTML text outside of <script></script>. The saved page should match the original, and most people (nearly 98%) open the saved page with JavaScript turned on. If a page developer wants the document.write() content to be visible when the saved page is opened in a browser without scripting, let them duplicate the content in <noscript></noscript> or <noscript><iframe></iframe></noscript>; as far as I know, Firefox saves those even when it has JavaScript turned on.

Dinar

Comment 42

•

15 years ago

Correcting the end of my previous comment: it is not true that Firefox saves them even when JavaScript is turned on. It does not save an iframe inside <noscript> content when JavaScript is on.

BugBot [:suhaib / :marco/ :calixte]

Comment 43

•

2 years ago

In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.

Severity: critical → --

Kris Maglione [:kmag]

Comment 44

•

2 years ago

It still isn't clear what the desired behavior is here, but this bug has existed for over 20 years without causing any real harm, so...

Severity: -- → S4