Closed Bug 596443 Opened 14 years ago Closed 14 years ago

SVG embedded by reference is not loading

Categories

(Core :: Networking: Cache, defect)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla2.0b8
blocking2.0: betaN+

People

(Reporter: jwatt, Assigned: bjarne)

Attachments

(3 files, 3 obsolete files)

bz and I have seen some issues with the loading of:

http://codedread.com/browser-tests/svg-image/html.html

bz tells me it's likely a cache issue.

Each row loads the same URL, using a different mechanism.

The issue is visible in the third and fourth /columns/. On a hard refresh the SVG documents embedded in the cells in the third column fail to load. On a soft refresh the SVG documents in both the third AND fourth columns fail to load.
blocking2.0: --- → ?
In particular, I was seeing the same problem, until I cleared the cache.  Then the problem went away.  I wish I'd saved my cache before clearing....

Jonathan _has_ saved his.  Bjarne, can you take a look, please?
Bjarne, can you look into this? Seems possibly related to some of the other intermittent cache load problems that have been reported.
Assignee: nobody → bjarne
blocking2.0: ? → betaN+
See bug #588804, comment #22.
Note that this page isn't https....

Jonathan, did you ever quit your browser, and does the problem still appear after restart?
See bug #588804, comment #23.
It's hard to find an explanation for why the hard reload fails. Were the reload-issues reproducible or did they just occur once?

I've checked the HTTP responses for the SVGs in question and the server does reply with Content-Length. So current thinking is that this issue is not caused by the problem identified in bug #588804.
When I hit this bug, they were reproducible until I cleared the cache.
Ok - thanks. This is also what I remember from looking at it with Jonathan.

When I did look at this with Jonathan, I also think we searched the about:cache for the SVGs but found only three of the four. Moreover, after a hard reload still only three SVGs were found. Jonathan - could you confirm or add details?

A crucial question is: What kind of cache-issue do we believe this to be?

As I understand the test, each row loads the same resource using four different mechanisms to refer to it. Theoretically, the first column triggers the actual load from the server and the three other columns load the resource from the cache once it is complete, right? Why would the second column show the content while the third and fourth do not (they should all be loaded from cache)? Do the third and fourth mechanisms rely on different notifications than the second to determine when the content is ready, and does the cache somehow get into a state where it fails to dispatch some notifications?
Mmmm...  bug #444165 just crossed my mind. Relevant? (assuming that the cache does something weird/wrong about notifications)
> A crucial question is: What kind of cache-issue do we believe this to be?

If we knew, that would make things a lot simpler... all we know is that there are all sorts of cases recently where a hard-reload will make things render right but anything else won't.  This, being a reproducible such case, seemed like a good candidate for maybe hunting down what's going on.

For what it's worth, I just tried loading this in my nightly build, and I see the blue circles in the _first_ column, with the other columns broken.  I'll see if I can quit, save the cache, and reproduce in a debug build....
(In reply to comment #10)
> > A crucial question is: What kind of cache-issue do we believe this to be?
> 
> If we knew, that would make things a lot simpler... all we know is that there

It's more of a rhetorical question really - to make us remember that there may be several qualitatively different issues in the cache which have surfaced with the recent changes in threading. :)  So - yes: if we knew this answer, things would be a lot simpler...

> are all sorts of cases recently where a hard-reload will make things render
> right but anything else won't.  This, being a reproducible such case, seemed
> like a good candidate for maybe hunting down what's going on.

I believe a hard reload in this bug actually renders worse? That is the confusing part for me, and IMO it makes this bug a potential victim of other issues than what (I believe) is causing bug #588804.

> For what it's worth, I just tried loading this in my nightly build, and I see
> the blue circles in the _first_ column, with the other columns broken.  I'll see
> if I can quit, save the cache, and reproduce in a debug build....

That would be really useful! Please keep us informed on this one...
Oh - just to make sure: I assume the conditions of bug #598243 do *not* apply to your latest finding?
I have nothing approaching full drives, no.
Unfortunately, my cache ended up getting cleared (because I forgot that I needed to debug this and needed to work around a problem elsewhere) before I restarted my browser....  I'll keep trying to reproduce this again.  :(
(In reply to comment #11)
> I believe a hard reload in this bug actually renders worse?

Clumsily phrased...  sorry! I meant to say that in this bug, a hard reload does not fix the issue; it also renders badly. If the cache is the culprit here, it is not because it returns wrong content but rather because it somehow interferes with loading the resource. The weird part is why the cache interferes only with the third column (the iframe) on a hard reload, while working (if I understand the description in comment #0 correctly) for columns 1, 2 and 4.

Looking at the logs, I see that the SVGs are requested conditionally, and that the server replies with a 304 (Not Modified). I'll read the codepath for this case more carefully.

Another observation is that one of the four requests is done with the LOAD_ONLY_IF_MODIFIED flag set. I see that imglib does this when validating an img-entry, and I see that the prefetch service uses this flag. Are any elements in this test likely to be prefetched?
No.  Those LOAD_ONLY_IF_MODIFIED loads must be from imagelib.

Question.  Does the server happen to be broken and send a response body with those 304s?
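(For context: as I understand it, LOAD_ONLY_IF_MODIFIED tells the caching channel not to deliver the body again when the cached copy is still valid; imagelib sets it when it already holds the decoded image and only needs to know whether the resource changed. A minimal sketch of setting the flag on an existing channel follows; this is not the actual imagelib code, and the helper name is made up.)

// Minimal sketch (not the actual imagelib code): OR LOAD_ONLY_IF_MODIFIED
// into a channel's existing load flags so the body is only delivered if the
// resource has changed since it was cached.  Error handling omitted.
#include "nsICachingChannel.h"
#include "nsIChannel.h"
#include "nsIRequest.h"

static void MarkOnlyIfModified(nsIChannel* aChannel)  // hypothetical helper
{
  nsLoadFlags flags = nsIRequest::LOAD_NORMAL;
  aChannel->GetLoadFlags(&flags);
  flags |= nsICachingChannel::LOAD_ONLY_IF_MODIFIED;
  aChannel->SetLoadFlags(flags);
}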
(In reply to comment #16)
> No.  Those LOAD_ONLY_IF_MODIFIED loads must be from imagelib.

Mmmm...   but the sequence of requests should be the intuitive one, right: first column 1, then 2, 3 and 4? Or is the tree built differently? (Or is this nondeterministic?)

> Question.  Does the server happen to be broken and send a response body with
> those 304s?

Funny you should ask because the log has a line for each 304 stating "this response should not contain a body". But I believe this line is always printed by nsHttpTransaction::HandleContentStart() ...  it doesn't mean anything.

None of the 304s has a Content-Length header and I see no evidence that any of them has content attached (but then I have not yet been able to reproduce this). Any particular reason for asking? (Bug #597918?)
For this particular testcase, the sequence of requests should be in DOM order, I think.  So row 1 in left-to-right order, then row 2 in the same order, etc.

> Any particular reason for asking?

Some servers screw up the 304 thing and then we fail the next request on that connection.  Doesn't seem like it's happening here, though.
(In reply to comment #18)
> Some servers screw up the 304 thing and then we fail the next request on that
> connection.  Doesn't seem like it's happening here, though.

I've been reading bug #595698 and suspect the problem he's experiencing is the same as seen here: static file, etag, 304 response, reproducible after hitting some weird state. Hard reloads don't resolve the issue because (quote from bug #595698, comment #0) :

"The file is static and not changed, so of course the server returns a 304 NOT MODIFIED, then Firefox dips into its cache and retrieves the bugged file, so it *never works again*."

Question 1: why does a soft reload break the 4th column while keeping the 1st and 2nd? 

Question 2: why does a hard reload only break the 3rd column (the iframe) and not the 4th (or 1st and 2nd for that matter)?

Answer 1: The 1st column contains an img-tag, and we've established that imglib adds the LOAD_ONLY_IF_MODIFIED flag to the request; hence, in our case, it should be immune to bad content once the image has been loaded correctly. This doesn't explain why 2nd and 4th column "survive" bad content though.

Big question: Why is it the iframes that have bad content? Is it a coincidence that 4 different cache-entries (all four SVGs) end up being corrupted at the same time (hard reload), but only for the iframe...?
> (quote from bug #595698, comment #0) :

As I said in that bug, I suspect his analysis is wrong.

> This doesn't explain why 2nd and 4th column "survive" bad content though.

Yeah, and I don't have a good explanation yet....

> Big question: Why is it the iframes that have bad content? 

An excellent question, yes.  Again, I don't have an answer.

> Is it a coincidence that 4 different cache-entries (all four SVGs) end up
> being corrupted 

Put that way, sounds pretty darned unlikely.  That's a really good point.
(In reply to comment #20)
> As I said in that bug, I suspect his analysis is wrong.

Agreed, but I think he is seeing the same issue as we do here, and IMO he phrased the 304-issue quite nicely... :)

> Put that way, sounds pretty darned unlikely.  That's a really good point.

Struck me as unlikely as well, yeah. Stepping through this in a debugger reveals a few things:

1) the img-tags trigger prefetch on startup
2) img-tags are loaded by imgLoader::ValidateEntry() on subsequent requests (backing up Answer 1 above)
3) the object- and embed-tags seem to be loaded by nsObjectLoadingContent::LoadObject() which calls nsHttpChannel::AsyncOpen() directly with |this| as the listener
4) the iframes seem to be loaded by nsDocShell::DoChannelLoad() via nsURILoader::OpenURI()

So, columns 2 and 4 are loaded using the same mechanism, whereas column 3 (the iframe-column) is loaded with a different mechanism. In particular, the channel-listeners are different.

If we knew how error-handling of nsObjectLoadingContent differs from that of nsDocShell/nsURILoader we might get a clue about what goes wrong. I suspect the use of nsObjectLoadingContent::Fallback() may have something to do with this, but I'm not sure about the details.
<object> will handle errors by becoming equivalent to a <span>, basically.
I guess for <embed> and <object>/<span> the question is: How would existing linked content be handled if a reload of the linked content fails....?
I'm not sure I follow the question...
Sorry - I'll try to be less diffuse: <object> and <embed> are the tags in columns 2 and 4 resp. In comments above we have been puzzled about why those columns seem to "survive" reloading. So, assuming these tags hold existing, valid, linked content, how do they deal with errors when reloading the linked content? (My guess is that they fall back to display whatever was there before.)
I covered that in comment 22.  If the load fails, for a wide range of values of "fail", the <embed> falls back to its alt text and the <object> acts just like a <span> (and shows its descendant DOM nodes).  They don't do any sort of "whatever was there before" thing.
Ahh - ok. Didn't pick up the implications of comment #22.  :(  My bad...

So, this implies that the reloads for columns 2 and 4 actually work, no? I.e. on a hard reload, the content returned from the cache (due to the 304 response) for columns 2 and 4 is correct, whereas for the iframe in column 3 it's wrong. Moreover, on a soft reload, the content from cache is ok for column 2, but not for 3 or 4.
On a hard reload, I don't believe we will send a request that can get a 304 response (we force bypassing the cache entirely on hard reloads).

One difference for the iframe, actually, is that the iframe load happens in a separate loadgroup and so probably does NOT pick up the "bypass the cache" flag!  So that could explain the behavior difference for hard reloads, at least.
(In reply to comment #28)
> On a hard reload, I don't believe we will send a request that can get a 304
> response (we force bypassing the cache entirely on hard reloads).

We do actually add If-Modified-Since and If-None-Match headers on a hard reload (Ctrl-R is a hard reload, right?). This allows the server to reply with a 304, and we return content from cache. We may of course discuss whether we *should* add these headers, but I think it's reasonable to do so since we simply check the validity of cached content with the server according to the HTTP spec.

The interesting bit IMO is why the <iframe> in some cases mucks up the content and in some cases propagates this to the <embed> tag.
> (Ctrl-R is a hard reload, right?)

No, Ctrl-Shift-R is.  Ctrl-R is a regular reload, which forces a GET even if the cache entry is valid but will accept a 304 response.  Ctrl-Shift-R will bypass the cache entirely.
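(In load-flag terms the difference is roughly the following; this is a sketch of how a caller could ask for each behaviour, not the actual nsDocShell reload path, and the helper name is invented.)

// Rough sketch of how the two reload flavours map onto necko load flags
// (not the actual nsDocShell reload code; the helper name is invented).
#include "nsIChannel.h"
#include "nsIRequest.h"

static void ApplyReloadFlags(nsIChannel* aChannel, bool aBypassCache)
{
  nsLoadFlags flags = nsIRequest::LOAD_NORMAL;
  aChannel->GetLoadFlags(&flags);
  if (aBypassCache) {
    // Ctrl-Shift-R: ignore any cached entry and fetch unconditionally, so
    // no If-Modified-Since/If-None-Match is sent and no 304 can happen.
    flags |= nsIRequest::LOAD_BYPASS_CACHE;
  } else {
    // Ctrl-R: always revalidate with the server; the request may be
    // conditional, and the server may legitimately answer 304.
    flags |= nsIRequest::VALIDATE_ALWAYS;
  }
  aChannel->SetLoadFlags(flags);
}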
Ahh... that clears up things! Using Ctrl-Shift-R I get assertions in the log which certainly should not be there and should be investigated.
The assertions I observe are results of this scenario:

Two requests R1 and R2 are created for the same uri, both in mode 2 (write-only) due to the hard reload. R1 is activated (nsCacheService::ActivateRequest()) and causes cache-entry A to be created and inserted in mActiveEntries for the uri. The actual write to the cache-entry is done on the cache-IO thread at some point later. R2 is then activated, finds A in mActiveEntries, dooms it (since R2 is in mode 2), creates cache-entry B and inserts B in mActiveEntries for the uri. Writing to this entry is also done on the cache-IO thread later.

Eventually the write to A happens. It binds the entry to the disk-device (nsDiskCacheDevice::BindEntry()), creating bindings and performing other book-keeping before finishing the write. Some time later the write to B happens. It starts binding to the disk-device but finds another binding with the same key (namely the binding for A) and asserts.

The assertion itself may not seem critical, and the action taken by nsDiskCacheDevice::BindEntry() (dooming A) seems reasonable and sufficient to resolve the situation. However, 1) the existence of the assertion indicates an assumption throughout the code that this should not happen, and we don't know if or how this is handled everywhere. And 2) I imagine that dooming A may be insufficient if the scenario were a little more complex, e.g. if there were read requests connected to A. (Maybe in this bug, the requests from the iframe are read/write requests binding to A? When A gets doomed, it's not clear what will happen to such requests.)

The (rather long) discussion in bug #548406 may provide useful background. The relevant parts start at bug #548406, comment #91 and revolve around whether we should allow the cache to bind an entry which has been doomed between activation and binding. I have been in favor of allowing this, for the reasons indicated in item (2) in the previous paragraph, but I may have to reconsider.

Whether fixing this book-keeping resolves the situation we see in this bug is not completely understood, but the assertions clearly point to something fishy in the disk-cache device which should be handled in any case.
I've spent quite some time setting up a test to trigger the assertions mentioned, and have discovered that there is a missing detail plus a minor flaw in the scenario above: a new request with the read-flag set must be attached to entry B, and the write to B must come before the write to A. With these changes, the problem is as described in comment #32 and the failing sequence is as follows:

1) Request R1 in mode 2 is created and creates entry A
2) Request R2 in mode 2 is created, dooms A and creates entry B
3) Request R3 in mode 3 (or 1) is created and is attached to entry B
4) The write to B happens, binding B to the disk-device
5) The write to A happens, tries to bind A to disk-device, asserts because B is already bound, and dooms B

R3 is important because without it, B would be deactivated (and its binding removed) immediately after step 4. If we swap modes on R2 and R3 (i.e. try to make R2 attach to entry A and R3 create B) we bump into bug #561313.
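To make the ordering concrete, here is a small standalone toy model of steps 1-5 above. It is plain C++ with no relation to the real nsCacheService/nsDiskCacheDevice code; the key, the entry names and the "one binding per key" table are placeholders that only mimic the book-keeping described.

// Toy model of the failing sequence (not Gecko code).  Two write-only
// requests create entries A and B for the same key; the deferred binds then
// run in the opposite order, and the second bind hits the asserting case.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

struct Entry {
  std::string name;
  bool doomed;
};

// One binding slot per cache key, standing in for the disk device's table.
static std::map<std::string, Entry*> gBindings;

static void BindEntry(const std::string& key, Entry* entry) {
  auto it = gBindings.find(key);
  if (it != gBindings.end() && it->second != entry) {
    // The real BindEntry() asserts here, then dooms the entry that was
    // already bound and lets the newcomer take the slot.
    std::cout << "collision on " << key << ": dooming already-bound "
              << it->second->name << " in favour of " << entry->name << "\n";
    it->second->doomed = true;
  }
  gBindings[key] = entry;
}

int main() {
  const std::string key = "http://example.org/circle.svg";  // placeholder URI

  Entry A{"A", false};    // step 1: R1 (write-only) creates A
  A.doomed = true;        // step 2: R2 (write-only) dooms A and creates B
  Entry B{"B", false};
  bool readerOnB = true;  // step 3: R3 (read mode) attaches itself to B

  BindEntry(key, &B);     // step 4: deferred write to B binds B first
  BindEntry(key, &A);     // step 5: deferred write to A collides and dooms B

  // R3 is now attached to a doomed entry, which is where things go wrong.
  assert(readerOnB && B.doomed);
  std::cout << "reader's entry doomed: " << std::boolalpha << B.doomed << "\n";
  return 0;
}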
Attached patch Testcase to trigger assertion (obsolete) (deleted)
Attached patch Patch v1 (obsolete) (deleted)
This is a patch which deals with the assertions mentioned above and the testcase. It also handles the testcase for bug #561313 and is passing basic local testing. Pushed to tryserver for better verification.
Attachment #483162 - Flags: review?(bzbarsky)
Attached patch Updated testcase for bug #561313 (deleted)
This is the testcase from bug #561313 updated to expect a different value for the second request. We may have to care about non-idempotent requests here, but I'll leave that for later.
Attached patch Updated unit-test (obsolete) (deleted)
This is an update of the previously submitted unit-test, now expecting a cached value from the third request.
Attachment #481490 - Attachment is obsolete: true
Comment on attachment 483162
Patch v1

I'd prefer doProcessPendingRequests to be non-optional.  Other than that, looks fine.
Attachment #483162 - Flags: review?(bzbarsky) → review+
Results from tryserver look ok, with one weird exception described here:

http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/bherland@mozilla.com-5ce83a7d727c/tryserver-macosx64-debug/tryserver_snowleopard-debug_test-xpcshell-build95.txt.gz

Here, the testcase from comment #37 for this bug (which is what makes it weird) receives responses in the wrong order and thus fails. However, the failure seems to come from the same problem as described in e.g. bug #602345, so I think it's ok.

You want me to make doProcessPendingRequests non-optional for a final version or should we just ask for approval and try to land it as it is?
> You want me to make doProcessPendingRequests non-optional for a final version

Yes.
Carried over r+ since this is a trivial change. Requesting approval for 2.0
Attachment #483162 - Attachment is obsolete: true
Attachment #483577 - Flags: review+
Attachment #483577 - Flags: approval2.0?
Keywords: checkin-needed
This is blocking. It does not need approval to land.
Attachment #483577 - Flags: approval2.0? → approval2.0+
Anything else I need to do now to get it landed...?  :)
No.  Do you not have push access?  If so, I can push this for you.
Please do...  (No - I don't have push-access)
Anything new here?
http://hg.mozilla.org/mozilla-central/rev/47bdba5b3d87
Keywords: checkin-needed
Can the unit test from comment 37 get checked in so that it runs automatically somehow?
Bjarne, if you convert the testcase in comment 37 into a patch that applies cleanly, with a commit message n' all, I can land that for you.
Attached patch Added metainformation (deleted)
Should apply cleanly now...  it may not have been reviewed so I'm being really conservative and requesting review...
Attachment #483165 - Attachment is obsolete: true
Attachment #486600 - Flags: review?(bzbarsky)
Attachment #486600 - Flags: review?(bzbarsky) → review+
Requesting check-in...
Keywords: checkin-needed
Resolving since this bug is apparently fixed. I assume only attachment 486600 needs check-in but not attachment 483164?
Status: NEW → RESOLVED
Closed: 14 years ago
OS: Mac OS X → All
Hardware: x86 → All
Resolution: --- → FIXED
Whiteboard: test needs landing
Target Milestone: --- → mozilla2.0b8
Your assumption is absolutely right.
http://hg.mozilla.org/mozilla-central/rev/15ec275d4d29
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: test needs landing
Depends on: 610654
The test fails intermittently: filed bug 610654
Would this patch be useful for the mozilla 1.9.2 branch? I ask because Thunderbird 3.1.x uses the disk cache for IMAP messages, and we're seeing some fairly strange corruptions, and I'm wondering if they might be caused by a bug like this.  Reportedly, the cache entry size corresponds exactly to the size of the message, but the contents are those of another message, generally the middle of another message, as if the cache entry has the wrong offset into the actual disk file. I'm hard pressed to figure out how the mailnews code could be doing this since we serialize access to a particular mailbox, meaning we'd only ever be caching one message at a time for a given mailbox.
It might, but you may also want to have a look at bug #604880 and/or bug #604897. Feel free to request approval for branches.
Thx, Bjarne.