Open
Bug 115328
Opened 23 years ago
Updated 2 years ago
Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content
Categories
(Core :: DOM: Navigation, defect)
Tracking
()
NEW
Future
People
(Reporter: adamlock, Unassigned)
References
Details
(Keywords: dataloss, topembed-, Whiteboard: se-radar)
Attachments
(1 file: text/html, deleted)
In mfcEmbed, load a page containing some JS that adds extra elements (e.g. banner
adverts). Call nsIWebBrowserPersist::SaveDocument on webBrowser and it saves the
modified DOM, not the original one.
This means that the saved copy contains the elements added by the JS, and when
the saved copy is loaded again, the JS runs once more, adding more elements and
effectively doubling up everything.
If possible, webBrowser should pull the original HTML from the cache for the
current page, parse that into its own DOM and save from that.
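The duplication described above can be reproduced with a toy model. This is only a sketch: the "DOM" is a plain list of strings, and `SCRIPT_NODE`, `run_scripts` and `save_document` are hypothetical names for illustration, not Mozilla APIs.

```python
# Toy model of the bug: the "DOM" is a list of node strings, and the page
# script is modelled as a rule that appends one banner per script node.
SCRIPT_NODE = "<script>addBanner()</script>"  # hypothetical page script


def run_scripts(dom):
    """Execute every script node once, as a browser would on page load."""
    result = list(dom)
    for node in dom:
        if node == SCRIPT_NODE:
            result.append("<div class='banner'>ad</div>")
    return result


def save_document(dom):
    """Serialize the live (already modified) DOM, scripts included."""
    return list(dom)


original = ["<p>article</p>", SCRIPT_NODE]
live = run_scripts(original)    # what the user sees in the browser
saved = save_document(live)     # what ends up on disk
reloaded = run_scripts(saved)   # what the user sees after reopening the file

banner_count = sum("banner" in node for node in reloaded)
```

Because the saved copy holds both the script and its earlier output, `reloaded` ends up with one more banner than `live`.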
Comment 1•23 years ago
Sample website:
http://msn.espn.go.com/main.html
Menus at top and bottom of page as well as flash content are added by
document.write() statements.
*** Bug 118792 has been marked as a duplicate of this bug. ***
Expand this bug to cover the general issue of how to get a fresh DOM from the
cached data for the currently loaded URI. I think it could be pretty tricky,
especially when frames and such are taken into account.
OS: Windows 2000 → All
Hardware: PC → All
Summary: webbrowser saves the modified DOM, not the original → Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache
Comment 4•23 years ago
This makes the save webpage complete feature completely useless to me. No
chance of getting to this sooner? Also, it's in all builds, not just embedding
ones.
There is not much I can do about this problem until there is some way of
obtaining the original DOM (and sub-DOMs for framesets) for a given URI. To me
that means either a copy of the original should be kept for such purposes, or it
should be possible to reconstruct one from the cached data.
Comment 6•23 years ago
The original DOM is not accessible after it's loaded; we don't keep an extra
copy of the DOM in memory, for obvious reasons.
Comment 8•23 years ago
*** Bug 137784 has been marked as a duplicate of this bug. ***
Comment 9•23 years ago
Can't the solution used for bug 40867 be used here to get back the original
content?
Of course, you would still need to reparse the DOM from the copy in the cache.
Comment 10•23 years ago
given the complexity of the true fix here (after discussions with Adam), we're
topembed minusing this.
Comment 11•23 years ago
Internal Reference:
http://bugscape.netscape.com/show_bug.cgi?id=12989
Comment 12•23 years ago
*** Bug 148614 has been marked as a duplicate of this bug. ***
Comment 13•22 years ago (Reporter)
*** Bug 154902 has been marked as a duplicate of this bug. ***
Comment 14•22 years ago
another sample: doing a save "web page, complete" on the following page with
frames also exhibits this bug:
http://developer.apple.com/techpubs/macosx/Essentials/AquaHIGuidelines/index.html
Whiteboard: se-radar
Comment 15•22 years ago
Would Mozilla use the current DOM or a new DOM (from the cached html) when
determining which images to save?
Comment 16•22 years ago
Requested downgrade of stopper status on proprietary embed client - request
denied. Adding kw:topembed to open this up for review. Please see:
http://bugscape.netscape.com/show_bug.cgi?id=12989
for more detailed information.
Comment 18•22 years ago
*** Bug 179490 has been marked as a duplicate of this bug. ***
Comment 19•22 years ago
*** Bug 182546 has been marked as a duplicate of this bug. ***
Comment 20•22 years ago
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity. Only changing open bugs to
minimize unnecessary spam. Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Comment 21•22 years ago
5/5 EDT triage: minusing topembed+ status. Dropping this from the radar to
better focus on existing working set.
Comment 22•21 years ago
*** Bug 218416 has been marked as a duplicate of this bug. ***
Comment 23•20 years ago
*** Bug 274745 has been marked as a duplicate of this bug. ***
Comment 24•20 years ago
*** Bug 283622 has been marked as a duplicate of this bug. ***
Comment 25•19 years ago
*** Bug 299752 has been marked as a duplicate of this bug. ***
Comment 26•19 years ago
Hrm, I don't want it to save the original HTML, I want it to save the current
DOM exactly as I see it on the screen; but in that case I probably also want to
remove all scripts from the page.
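A minimal sketch of the "save what I see, minus scripts" idea, assuming the serialized DOM is available as an HTML string. It uses Python's standard `html.parser`; `ScriptStripper` and `strip_scripts` are hypothetical names, and this is not how the Mozilla persist code works. Inline handlers such as `onclick` attributes are left untouched.

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit markup as parsed, dropping <script> elements and their bodies."""

    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.depth = 0  # > 0 while inside a <script> element

    def emit(self, text):
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.depth += 1
        else:
            self.emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.depth = max(0, self.depth - 1)
        else:
            self.emit(f"</{tag}>")

    def handle_data(self, data):
        self.emit(data)

    def handle_entityref(self, name):
        self.emit(f"&{name};")

    def handle_charref(self, name):
        self.emit(f"&#{name};")

    def handle_comment(self, data):
        self.emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.emit(f"<!{decl}>")


def strip_scripts(html):
    """Return html with every <script> element removed."""
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With the scripts gone, reopening the saved file cannot regenerate the content a second time, at the cost of losing any script-driven behaviour in the saved copy.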
Comment 27•19 years ago
If you remove all scripts, then the page may not work properly after being saved...
Comment 28•19 years ago
Well, in most cases I don't want it to "work" at all: most of the time when I
save a page it is a receipt or other record of what I am currently viewing: I
don't want it to change and I don't want it to run scripts with the security
privileges of a file: URL anyway (that's a different bug, I know).
I can understand there might be other situations in which saving the original
HTML with all its scripts is a good idea, but I don't want this behavior to be
left undiscussed and treated as if there was only one true way.
Comment 29•19 years ago
(In reply to comment #28)
> Well, in most cases I don't want it to "work" at all: most of the time when I
> save a page it is a receipt or other record of what I am currently viewing: I
> don't want it to change and I don't want it to run scripts with the security
> privileges of a file: URL anyway (that's a different bug, I know).
>
> I can understand there might be other situations in which saving the original
> HTML with all its scripts is a good idea, but I don't want this behavior to be
> left undiscussed and treated as if there was only one true way.
In my opinion, the discussion whether the JavaScript-implementation of Firefox
is secure or not is another issue. It has nothing to do with saving the webpage.
The Internet Explorer uses its own Model for saving - that is ****. Opera does
almost what I want - save the webpage exactly how it comes from the server, but
Opera does not change the links between <noscript>-tags. I think that "HTTrack
Website Copier" is a good example of how to save a webpage. It leaves the files
as they are and replaces all the links in a webpage with relative paths and
saves the referenced files (for example images, css, scripts, ...). Since
HTTrack is open source, maybe this is a possible way for Firefox too.
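The approach described here (leave the served HTML untouched except for resource URLs) could look roughly like the sketch below. It is an illustration only, not HTTrack's or Firefox's code: `localize`, `RESOURCE_ATTR` and the `page_files` directory name are assumptions, the regex handles only quoted `src`/`href` attributes on `img`, `script` and `link` tags, and file name collisions are ignored.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Quoted src/href attributes on tags that reference sub-resources.
RESOURCE_ATTR = re.compile(
    r'''(<(?:img|script|link)\b[^>]*?\b(?:src|href)\s*=\s*)(["'])(.*?)\2''',
    re.IGNORECASE,
)


def localize(html, base_url, files_dir="page_files"):
    """Rewrite resource URLs to relative local paths.

    Returns (rewritten_html, downloads), where downloads maps each absolute
    URL to the relative path the caller should save it under.
    """
    downloads = {}

    def rewrite(match):
        prefix, quote, url = match.groups()
        absolute = urljoin(base_url, url)
        name = posixpath.basename(urlsplit(absolute).path) or "index.html"
        local = f"{files_dir}/{name}"
        downloads[absolute] = local
        return f"{prefix}{quote}{local}{quote}"

    return RESOURCE_ATTR.sub(rewrite, html), downloads
```

Since script bodies are not executed or expanded, any `document.write()` output is absent from the saved file and is produced once when the copy is opened.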
Updated•19 years ago
Summary: Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache → Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content
Comment 30•19 years ago
I encountered this problem today at Eyewonder. I'll attach a testcase in a sec.
We should save the files exactly as they come from the server. Being able to
save generated content instead of how they are on the server is bug 120457
--> Reassigning
Comment 31•19 years ago
Brian, saving as "web page, complete" modifies the content by definition. There
is no feasible way to save the original non-generated content when doing a "web
page, complete" save. Please read up on the code before making any more
comments like that, ok?
Comment 32•19 years ago
Being feasible and being possible are two different things. It's not reasonable
to have both the generated content and the original JS saved in the same file
(plus it makes debugging pages a lot harder). I know that making additional
requests of the server for the HTML file is dangerous, especially when it's a CGI
request, but I don't see a reason why we can't store the original HTML file in
memory until the frame is destroyed, and then just re-request any images, JS,
etc. that go with it. Even with 10 pages open, we're only talking about an
additional 400K of memory on average.
Comment 33•19 years ago
Boris: I understand link URLs will be modified, but besides that, can't the DOM
be left intact?
See also:
Bug 60426 - Allow users to choose between generated and source html in view-source
Bug 120457 - "Save As" should optionally allow to save generated content of a page
Comment 34•19 years ago
> Boris: I understand link URLs will be modified, but besides that, can't the DOM
> be left intact?
The "web page, complete" mode is meant to be a "save what the user sees" mode
and that's what it does. Doing that means saving the DOM. That's not likely to
change unless the UI folks decide this option should have a completely different
behavior in general.
Comment 35•19 years ago
Thanks for clarifying. So the conundrum is whether we want to save the scripts,
or remove them. Removing them would cause problems for user-initiated events,
such as expanding divs, whereas leaving them can cause duplicated content. For
testing problems with user-initiated events, leaving the scripts is useful, but
could also be a security issue (comment #28) and can cause duplicated content
when the DOM is written (the issue in this bug). An additional option for saving
the original page (with links modified) is an RFE, bug 271571 (which I reported
in 2004 and forgot about), and isn't relevant to this bug.
"Workaround": I determined a quick way (based on Boris's description of our
behavior) to get rid of the duplicated content in simple cases. If you look at
the testcase, then delete the generated content with DOM inspector (in this
case, the DIV), then "Save Complete", you'll get a version without the data
repeated. This will be very hard for some really complicated generated content,
but will work well for pages with it only generated in one place on the page.
Boris: Is bug 120457 already covered by our current behavior and therefore INVALID?
Boris: I think what we might be able to (some day down the road) do for all bugs
regarding generated content, like this and bug 60426, is keep a list of page
changes, sort of like an undo list, so that we can allow people to choose which
generated content they want to remove from the page. This could be pref-disabled
by default so as not to bloat memory usage. Do you think this would be possible in
the back-end, and beyond that would it also be possible for a 3rd party
extension to perform this function?
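The "list of page changes" idea can be sketched as follows. This is a toy model under stated assumptions, not an existing Gecko facility: `Node`, `Document`, `script_insert` and the script identifiers are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    html: str
    generated_by: Optional[str] = None  # id of the script that inserted it


@dataclass
class Document:
    nodes: List[Node] = field(default_factory=list)
    change_log: List[Node] = field(default_factory=list)  # the "undo list"

    def parser_insert(self, html):
        """Node created by the HTML parser from the served source."""
        self.nodes.append(Node(html))

    def script_insert(self, script_id, html):
        """Node created by a script; recorded so it can be dropped on save."""
        node = Node(html, generated_by=script_id)
        self.nodes.append(node)
        self.change_log.append(node)

    def serialize(self, drop_from=()):
        """Serialize, omitting nodes generated by the listed scripts."""
        return "".join(
            n.html for n in self.nodes if n.generated_by not in drop_from
        )


doc = Document()
doc.parser_insert("<p>article</p>")
doc.parser_insert("<script src='banner.js'></script>")
doc.script_insert("banner.js", "<div>ad</div>")

as_seen = doc.serialize()
without_ads = doc.serialize(drop_from={"banner.js"})
```

Here `without_ads` keeps the script tag but drops its recorded output, which is the automated equivalent of deleting the generated DIV by hand before saving.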
Comment 36•19 years ago
*** Bug 305437 has been marked as a duplicate of this bug. ***
Comment 37•18 years ago
*** Bug 364711 has been marked as a duplicate of this bug. ***
Updated•18 years ago
Assignee: adamlock → nobody
QA Contact: adamlock → docshell
Comment 41•15 years ago
Hello. Content produced by document.write() should not be saved a second time, duplicated as regular HTML text outside of <script></script>. The saved page should match the original, since most people (nearly 98%) open the saved page with JavaScript turned on. If the page developer wants the document.write() content to be visible when the saved page is opened in a browser without scripting, let them duplicate the content in <noscript></noscript> or <noscript><iframe></iframe></noscript>; as far as I know, Firefox saves those even when it has JavaScript turned on.
Comment 42•15 years ago
"as far as I know, Firefox saves those even when it has JavaScript turned on" - no, that is not so: it does not save the iframe in noscript content when JavaScript is on.
Comment 43•2 years ago
In the process of migrating remaining bugs to the new severity system, the severity for this bug cannot be automatically determined. Please retriage this bug using the new severity system.
Severity: critical → --
Comment 44•2 years ago
It still isn't clear what the desired behavior is here, but this bug has existed for over 20 years without causing any real harm, so...
Severity: -- → S4