Closed Bug 596443 Opened 14 years ago Closed 14 years ago

SVG embedded by reference is not loading

Categories

(Core :: Networking: Cache, defect)
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla2.0b8
blocking2.0: betaN+

People

(Reporter: jwatt, Assigned: bjarne)

Attachments

(3 files, 3 obsolete files)

bz and I have seen some issues with the loading of:

http://codedread.com/browser-tests/svg-image/html.html

bz tells me it's likely a cache issue.

Each row loads the same URL, using a different mechanism.

The issue is visible in the third and fourth /columns/. On a hard refresh the SVG documents embedded in the cells in the third column fail to load. On a soft refresh the SVG documents in both the third AND fourth columns fail to load.
blocking2.0: --- → ?
In particular, I was seeing the same problem, until I cleared the cache.  Then the problem went away.  I wish I'd saved my cache before clearing....

Jonathan _has_ saved his.  Bjarne, can you take a look, please?
Bjarne, can you look into this? Seems possibly related to some of the other intermittent cache load problems that have been reported.
Assignee: nobody → bjarne
blocking2.0: ? → betaN+
See bug #588804, comment #22.
Note that this page isn't https....

Jonathan, did you ever quit your browser, and does the problem still appear after restart?
See bug #588804, comment #23.
It's hard to find an explanation for why the hard reload fails. Were the reload-issues reproducible or did they just occur once?

I've checked the HTTP responses for the SVGs in question and the server does reply with Content-Length. So current thinking is that this issue is not caused by the problem identified in bug #588804.
When I hit this bug, they were reproducible until I cleared the cache.
Ok - thanks. This is also what I remember from looking at it with Jonathan.

When I did look at this with Jonathan, I also think we searched the about:cache for the SVGs but found only three of the four. Moreover, after a hard reload still only three SVGs were found. Jonathan - could you confirm or add details?

A crucial question is: What kind of cache-issue do we believe this to be?

As I understand the test, each row loads the same resource using four different mechanisms to refer to it. Theoretically, the first column triggers the actual load from the server and the three other columns load the resource from the cache once it is complete, right? Why would the second column show the content while the third and fourth do not (they should all be loaded from cache)? Do the third and fourth mechanisms rely on different notifications than the second to determine when the content is ready, and does the cache somehow get into a state where it fails to dispatch some notifications?
Mmmm...  bug #444165 just crossed my mind. Relevant? (assuming that the cache does something weird/wrong about notifications)
> A crucial question is: What kind of cache-issue do we believe this to be?

If we knew, that would make things a lot simpler... all we know is that there are all sorts of cases recently where a hard-reload will make things render right but anything else won't.  This, being a reproducible such case, seemed like a good candidate for maybe hunting down what's going on.

For what it's worth, I just tried loading this in my nightly build, and I see the blue circles in the _first_ column, with the other columns broken.  I'll see if I can quit, save the cache, and reproduce in a debug build....
(In reply to comment #10)
> > A crucial question is: What kind of cache-issue do we believe this to be?
> 
> If we knew, that would make things a lot simpler... all we know is that there

It's more of a rhetorical question really - to make us remember that there may be several qualitatively different issues in the cache which have surfaced with the recent changes in threading. :)  So - yes: if we knew this answer, things would be a lot simpler...

> are all sorts of cases recently where a hard-reload will make things render
> right but anything else won't.  This, being a reproducible such case, seemed
> like a good candidate for maybe hunting down what's going on.

I believe a hard reload in this bug actually renders worse? That is the confusing part for me, and IMO it makes this bug a potential victim of other issues than what (I believe) is causing bug #588804.

> For what it's worth, I just tried loading this in my nightly build, and I see
> the blue circles in the _first_ column, with the other columns broken.  I'll see
> if I can quit, save the cache, and reproduce in a debug build....

That would be really useful! Please keep us informed on this one...
Oh - just to make sure: I assume the conditions of bug #598243 do *not* apply to your latest finding?
I have nothing approaching full drives, no.
Unfortunately, my cache ended up getting cleared (because I forgot that I needed to debug this and needed to work around a problem elsewhere) before I restarted my browser....  I'll keep trying to reproduce this again.  :(
(In reply to comment #11)
> I believe a hard reload in this bug actually renders worse?

Clumsily phrased...  sorry! I meant to say that in this bug, a hard reload does not fix the issue; it also renders badly. If the cache is the culprit here, it is not because it returns wrong content but rather because it somehow interferes with loading the resource. The weird part is why the cache interferes only with the third column (the iframe) on a hard reload, while working (if I understand the description in comment #0 correctly) for columns 1, 2 and 4.

Looking at the logs, I see that the SVGs are requested conditionally, and that the server replies with a 304 (Not Modified). I'll read the codepath for this case more carefully.

Another observation is that one of the four requests is done with the LOAD_ONLY_IF_MODIFIED flag set. I see that imglib does this when validating an img-entry, and I see that the prefetch service uses this flag. Are any elements in this test likely to be prefetched?
No.  Those LOAD_ONLY_IF_MODIFIED loads must be from imagelib.

Question.  Does the server happen to be broken and send a response body with those 304s?
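(For context: as I understand it, LOAD_ONLY_IF_MODIFIED tells the caching channel not to deliver the body again when the cached copy is still valid; imagelib sets it when it already holds the decoded image and only needs to know whether the resource changed. A minimal sketch of setting the flag on an existing channel follows; this is not the actual imagelib code, and the helper name is made up.)

// Minimal sketch (not the actual imagelib code): OR LOAD_ONLY_IF_MODIFIED
// into a channel's existing load flags so the body is only delivered if the
// resource has changed since it was cached.  Error handling omitted.
#include "nsICachingChannel.h"
#include "nsIChannel.h"
#include "nsIRequest.h"

static void MarkOnlyIfModified(nsIChannel* aChannel)  // hypothetical helper
{
  nsLoadFlags flags = nsIRequest::LOAD_NORMAL;
  aChannel->GetLoadFlags(&flags);
  flags |= nsICachingChannel::LOAD_ONLY_IF_MODIFIED;
  aChannel->SetLoadFlags(flags);
}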
(In reply to comment #16)
> No.  Those LOAD_ONLY_IF_MODIFIED loads must be from imagelib.

Mmmm...   but the sequence of requests should be the intuitive one, right: first column 1, then 2, 3 and 4? Or is the tree built differently? (Or is this nondeterministic?)

> Question.  Does the server happen to be broken and send a response body with
> those 304s?

Funny you should ask because the log has a line for each 304 stating "this response should not contain a body". But I believe this line is always printed by nsHttpTransaction::HandleContentStart() ...  it doesn't mean anything.

None of the 304s has a Content-Length header and I see no evidence that any of them has content attached (but then I have not yet been able to reproduce this). Any particular reason for asking? (Bug #597918?)
For this particular testcase, the sequence of requests should be in DOM order, I think.  So row 1 in left-to-right order, then row 2 in the same order, etc.

> Any particular reason for asking?

Some servers screw up the 304 thing and then we fail the next request on that connection.  Doesn't seem like it's happening here, though.
(In reply to comment #18)
> Some servers screw up the 304 thing and then we fail the next request on that
> connection.  Doesn't seem like it's happening here, though.

I've been reading bug #595698 and suspect the problem he's experiencing is the same as seen here: static file, etag, 304 response, reproducible after hitting some weird state. Hard reloads don't resolve the issue because (quote from bug #595698, comment #0) :

"The file is static and not changed, so of course the server returns a 304 NOT MODIFIED, then Firefox dips into its cache and retrieves the bugged file, so it *never works again*."

Question 1: why does a soft reload break the 4th column while keeping the 1st and 2nd? 

Question 2: why does a hard reload only break the 3rd column (the iframe) and not the 4th (or 1st and 2nd for that matter)?

Answer 1: The 1st column contains an img-tag, and we've established that imglib adds the LOAD_ONLY_IF_MODIFIED flag to the request; hence, in our case, it should be immune to bad content once the image has been loaded correctly. This doesn't explain why 2nd and 4th column "survive" bad content though.

Big question: Why is it the iframes that have bad content? Is it a coincidence that 4 different cache-entries (all four SVGs) end up being corrupted at the same time (hard reload), but only for the iframe...?
> (quote from bug #595698, comment #0) :

As I said in that bug, I suspect his analysis is wrong.

> This doesn't explain why 2nd and 4th column "survive" bad content though.

Yeah, and I don't have a good explanation yet....

> Big question: Why is it the iframes that have bad content? 

An excellent question, yes.  Again, I don't have an answer.

> Is it a coincidence that 4 different cache-entries (all four SVGs) end up
> being corrupted 

Put that way, sounds pretty darned unlikely.  That's a really good point.
(In reply to comment #20)
> As I said in that bug, I suspect his analysis is wrong.

Agreed, but I think he is seeing the same issue as we do here, and IMO he phrased the 304-issue quite nicely... :)

> Put that way, sounds pretty darned unlikely.  That's a really good point.

Struck me as unlikely as well, yeah. Stepping through this in a debugger reveals a few things:

1) the img-tags trigger prefetch on startup
2) img-tags are loaded by imgLoader::ValidateEntry() on subsequent requests (backing up Answer 1 above)
3) the object- and embed-tags seem to be loaded by nsObjectLoadingContent::LoadObject() which calls nsHttpChannel::AsyncOpen() directly with |this| as the listener
4) the iframes seem to be loaded by nsDocShell::DoChannelLoad() via nsURILoader::OpenURI()

So, columns 2 and 4 are loaded using the same mechanism, whereas column 3 (the iframe-column) is loaded with a different mechanism. In particular, the channel-listeners are different.

If we knew how error-handling of nsObjectLoadingContent differs from that of nsDocShell/nsURILoader we might get a clue about what goes wrong. I suspect the use of nsObjectLoadingContent::Fallback() may have something to do with this, but I'm not sure about the details.
<object> will handle errors by becoming equivalent to a <span>, basically.
I guess for <embed> and <object>/<span> the question is: How would existing linked content be handled if a reload of the linked content fails....?
I'm not sure I follow the question...
Sorry - I'll try to be less diffuse: <object> and <embed> are the tags in columns 2 and 4 resp. In comments above we have been puzzled about why those columns seem to "survive" reloading. So, assuming these tags hold existing, valid, linked content, how do they deal with errors when reloading the linked content? (My guess is that they fall back to display whatever was there before.)
I covered that in comment 22.  If the load fails, for a wide range of values of "fail", the <embed> falls back to its alt text and the <object> acts just like a <span> (and shows its descendant DOM nodes).  They don't do any sort of "whatever was there before" thing.
Ahh - ok. Didn't pick up the implications of comment #22.  :(  My bad...

So, this implies that the reloads for columns 2 and 4 actually work, no? I.e. on a hard reload, the content returned from the cache (due to the 304 response) for columns 2 and 4 is correct, whereas for the iframe in column 3 it's wrong. Moreover, on a soft reload, the content from cache is ok for column 2, but not for 3 or 4.
On a hard reload, I don't believe we will send a request that can get a 304 response (we force bypassing the cache entirely on hard reloads).

One difference for the iframe, actually, is that the iframe load happens in a separate loadgroup and so probably does NOT pick up the "bypass the cache" flag!  So that could explain the behavior difference for hard reloads, at least.
(In reply to comment #28)
> On a hard reload, I don't believe we will send a request that can get a 304
> response (we force bypassing the cache entirely on hard reloads).

We do actually add If-Modified-Since and If-None-Match headers on a hard reload (Ctrl-R is a hard reload, right?). This allows the server to reply with a 304, and we return content from cache. We may of course discuss whether we *should* add these headers, but I think it's reasonable to do so since we simply check the validity of cached content with the server according to the HTTP spec.

The interesting bit IMO is why the <iframe> in some cases mucks up the content and in some cases propagates this to the <embed> tag.
> (Ctrl-R is a hard reload, right?)

No, Ctrl-Shift-R is.  Ctrl-R is a regular reload, which forces a GET even if the cache entry is valid but will accept a 304 response.  Ctrl-Shift-R will bypass the cache entirely.
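(In load-flag terms the difference is roughly the following; this is a sketch of how a caller could ask for each behaviour, not the actual nsDocShell reload path, and the helper name is invented.)

// Rough sketch of how the two reload flavours map onto necko load flags
// (not the actual nsDocShell reload code; the helper name is invented).
#include "nsIChannel.h"
#include "nsIRequest.h"

static void ApplyReloadFlags(nsIChannel* aChannel, bool aBypassCache)
{
  nsLoadFlags flags = nsIRequest::LOAD_NORMAL;
  aChannel->GetLoadFlags(&flags);
  if (aBypassCache) {
    // Ctrl-Shift-R: ignore any cached entry and fetch unconditionally, so
    // no If-Modified-Since/If-None-Match is sent and no 304 can happen.
    flags |= nsIRequest::LOAD_BYPASS_CACHE;
  } else {
    // Ctrl-R: always revalidate with the server; the request may be
    // conditional, and the server may legitimately answer 304.
    flags |= nsIRequest::VALIDATE_ALWAYS;
  }
  aChannel->SetLoadFlags(flags);
}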
Ahh... that clears up things! Using Ctrl-Shift-R I get assertions in the log which certainly should not be there and should be investigated.
The assertions I observe are results of this scenario:

Two requests R1 and R2 are created for the same uri, both in mode 2 (write-only) due to the hard reload. R1 is activated (nsCacheService::ActivateRequest()) and causes cache-entry A to be created and inserted in mActiveEntries for the uri. The actual write to the cache-entry is done on the cache-IO thread at some point later. R2 is then activated, finds A in mActiveEntries, dooms it (since R2 is in mode 2), creates cache-entry B and inserts B in mActiveEntries for the uri. Writing to this entry is also done on the cache-IO thread later.

Eventually the write to A happens. It binds the entry to the disk-device (nsDiskCacheDevice::BindEntry()), creating bindings and performing other book-keeping before finishing the write. Some time later the write to B happens. It starts binding to the disk-device but finds another binding with the same key (namely the binding for A) and asserts.

The assertion itself may not seem critical, and the action taken by nsDiskCacheDevice::BindEntry() (dooming A) seems reasonable and sufficient to resolve the situation. However, 1) the existence of the assertion indicates an assumption throughout the code that this should not happen, and we don't know if or how this is handled everywhere. And 2) I imagine that dooming A may be insufficient if the scenario were a little more complex, e.g. if there were read requests connected to A. (Maybe in this bug, the requests from the iframe are read/write requests binding to A? When A gets doomed, it's not clear what will happen to such requests.)

The (rather long) discussion in bug #548406 may provide useful background. The relevant parts start at bug #548406, comment #91 and revolve around whether we should allow the cache to bind an entry which has been doomed between activation and binding. I have been in favor of allowing this, for the reasons indicated in item (2) in the previous paragraph, but I may have to reconsider.

Whether fixing this book-keeping resolves the situation we see in this bug is not completely understood, but the assertions clearly point to something fishy in the disk-cache device which should be handled in any case.
I've spent quite some time setting up a test to trigger the assertions mentioned, and have discovered that there is a missing detail plus a minor flaw in the scenario above: a new request with the read-flag set must be attached to entry B, and the write to B must come before the write to A. With these changes, the problem is as described in comment #32 and the failing sequence is as follows:

1) Request R1 in mode 2 is created and creates entry A
2) Request R2 in mode 2 is created, dooms A and creates entry B
3) Request R3 in mode 3 (or 1) is created and is attached to entry B
4) The write to B happens, binding B to the disk-device
5) The write to A happens, tries to bind A to disk-device, asserts because B is already bound, and dooms B

R3 is important because without it, B would be deactivated (and its binding removed) immediately after step 4. If we swap modes on R2 and R3 (i.e. try to make R2 attach to entry A and R3 create B) we bump into bug #561313.
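To make the ordering concrete, here is a small standalone toy model of steps 1-5 above. It is plain C++ with no relation to the real nsCacheService/nsDiskCacheDevice code; the key, the entry names and the "one binding per key" table are placeholders that only mimic the book-keeping described.

// Toy model of the failing sequence (not Gecko code).  Two write-only
// requests create entries A and B for the same key; the deferred binds then
// run in the opposite order, and the second bind hits the asserting case.
#include <cassert>
#include <iostream>
#include <map>
#include <string>

struct Entry {
  std::string name;
  bool doomed;
};

// One binding slot per cache key, standing in for the disk device's table.
static std::map<std::string, Entry*> gBindings;

static void BindEntry(const std::string& key, Entry* entry) {
  auto it = gBindings.find(key);
  if (it != gBindings.end() && it->second != entry) {
    // The real BindEntry() asserts here, then dooms the entry that was
    // already bound and lets the newcomer take the slot.
    std::cout << "collision on " << key << ": dooming already-bound "
              << it->second->name << " in favour of " << entry->name << "\n";
    it->second->doomed = true;
  }
  gBindings[key] = entry;
}

int main() {
  const std::string key = "http://example.org/circle.svg";  // placeholder URI

  Entry A{"A", false};    // step 1: R1 (write-only) creates A
  A.doomed = true;        // step 2: R2 (write-only) dooms A and creates B
  Entry B{"B", false};
  bool readerOnB = true;  // step 3: R3 (read mode) attaches itself to B

  BindEntry(key, &B);     // step 4: deferred write to B binds B first
  BindEntry(key, &A);     // step 5: deferred write to A collides and dooms B

  // R3 is now attached to a doomed entry, which is where things go wrong.
  assert(readerOnB && B.doomed);
  std::cout << "reader's entry doomed: " << std::boolalpha << B.doomed << "\n";
  return 0;
}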
Attached patch Testcase to trigger assertion (obsolete) (deleted)
Attached patch Patch v1 (obsolete) (deleted)
This is a patch which deals with the assertions mentioned above and the testcase. It also handles the testcase for bug #561313 and is passing basic local testing. Pushed to tryserver for better verification.
Attachment #483162 - Flags: review?(bzbarsky)
Attached patch Updated testcase for bug #561313 (deleted)
This is the testcase from bug #561313 updated to expect a different value for the second request. We may have to care about non-idempotent requests here, but I'll leave that for later.
Attached patch Updated unit-test (obsolete) (deleted)
This is an update of the previously submitted unit-test, now expecting a cached value from the third request.
Attachment #481490 - Attachment is obsolete: true
Comment on attachment 483162
Patch v1

I'd prefer doProcessPendingRequests to be non-optional.  Other than that, looks fine.
Attachment #483162 - Flags: review?(bzbarsky) → review+
Results from tryserver look ok, with one weird exception described here:

http://stage.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/bherland@mozilla.com-5ce83a7d727c/tryserver-macosx64-debug/tryserver_snowleopard-debug_test-xpcshell-build95.txt.gz

Here, the testcase from comment #37 for this bug (which is what makes it weird) receives responses in the wrong order and thus fails. However, the failure seems to come from the same problem as described in e.g. bug #602345, so I think it's ok.

You want me to make doProcessPendingRequests non-optional for a final version or should we just ask for approval and try to land it as it is?
> You want me to make doProcessPendingRequests non-optional for a final version

Yes.
Carried over r+ since this is a trivial change. Requesting approval for 2.0
Attachment #483162 - Attachment is obsolete: true
Attachment #483577 - Flags: review+
Attachment #483577 - Flags: approval2.0?
Keywords: checkin-needed
This is blocking. It does not need approval to land.
Attachment #483577 - Flags: approval2.0? → approval2.0+
Anything else I need to do now to get it landed...?  :)
No.  Do you not have push access?  If so, I can push this for you.
Please do...  (No - I don't have push-access)
Anything new here?
http://hg.mozilla.org/mozilla-central/rev/47bdba5b3d87
Keywords: checkin-needed
Can the unit test from comment 37 get checked in so that it runs automatically somehow?
Bjarne, if you convert the testcase in comment 37 into a patch that applies cleanly, with a commit message n' all, I can land that for you.
Attached patch Added metainformation (deleted)
Should apply cleanly now...  it may not have been reviewed so I'm being really conservative and requesting review...
Attachment #483165 - Attachment is obsolete: true
Attachment #486600 - Flags: review?(bzbarsky)
Attachment #486600 - Flags: review?(bzbarsky) → review+
Requesting check-in...
Keywords: checkin-needed
Resolving since this bug is apparently fixed. I assume only attachment 486600 needs check-in but not attachment 483164?
Status: NEW → RESOLVED
Closed: 14 years ago
OS: Mac OS X → All
Hardware: x86 → All
Resolution: --- → FIXED
Whiteboard: test needs landing
Target Milestone: --- → mozilla2.0b8
Your assumption is absolutely right.
http://hg.mozilla.org/mozilla-central/rev/15ec275d4d29
Flags: in-testsuite+
Keywords: checkin-needed
Whiteboard: test needs landing
Depends on: 610654
The test fails intermittently: filed bug 610654
Would this patch be useful for the mozilla 1.9.2 branch? I ask because Thunderbird 3.1.x uses the disk cache for IMAP messages, and we're seeing some fairly strange corruptions, and I'm wondering if they might be caused by a bug like this.  Reportedly, the cache entry size corresponds exactly to the size of the message, but the contents are those of another message, generally the middle of another message, as if the cache entry has the wrong offset into the actual disk file. I'm hard pressed to figure out how the mailnews code could be doing this since we serialize access to a particular mailbox, meaning we'd only ever be caching one message at a time for a given mailbox.
It might, but you may also want to have a look at bug #604880 and/or bug #604897. Feel free to request approval for branches.
Thx, Bjarne.