Closed Bug 588804 Opened 14 years ago Closed 14 years ago

"File not found" error for (extant) HTTPS resource

Categories

(Core :: Networking: Cache, defect)

Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
blocking2.0: beta7+

People

(Reporter: shaver, Assigned: bjarne)

References

(Blocks 1 open bug)

Details

(Keywords: dogfood, regression)

Attachments

(6 files, 1 obsolete file)

"Firefox can't find the file at https://book.aircanada.com/pl/AConline/en/img/caption.gif"

I noticed that when I visited aircanada.com for the second time today, I was getting broken images.  When I tried to load just the image, I got a message that I think should only appear for file: URLs.  Anyone have ideas about how that could happen?
Restarted, it happened again on a later page.  I guess I should get a networking log of it?
If I shift-reload the "view image" URL (after the error message), it works, but then a simple reload gives me the error again.  HTTP logging shows we hit the cache in the broken case, so I'll try again with cache logging.  Adding some cache-y sorts to the CC list, too.
The cache log is not very informative: it just says "using the cached entry" and then shows no sign of an error (though it does say that it's closing the entry).

staralliance.gif (another image that was showing this behaviour) shows up in about:cache, but the file to which it refers does not exist, perhaps explaining this behaviour!

I'm also seeing it now for scripts as well, on the same site.

I wonder if this is bad out-of-cache-space handling?

Number of entries:  	12489
Maximum storage size: 	256000 KiB
Storage in use: 	255965 KiB
Adding Byron too, thought I did that before.
Component: Networking → Networking: Cache
QA Contact: networking → networking.cache
I can look into this.
Assignee: nobody → byronm
blocking2.0: --- → ?
Blocking, at least until we understand how widespread this is or what caused it. Do we know yet whether this is a regression? Sounds to me like it is; if so, we need the regression keyword.
blocking2.0: ? → betaN+
I had never seen it until the cache size change, so I'm gonna call it a regression.
Keywords: regression
FWIW, I've also seen this when using view-source on https: pages.  It'll show a file-not-found error page instead.  Clearing the cache fixes it... temporarily.  The problem reappears sometime later.
If you go to page info when you have broken-image icons in various places where they shouldn't be, e.g. for https://documents.google.com/images/doclist/cleardot.gif, the type, size and dimensions are all correct, but the "Media Preview" pane doesn't update at all.  If I try to Save As the image, I get ".... could not be saved, because the source file could not be read."  I see this sometimes when downloading files from bugzilla too, especially when trying to save a patch (even if I have the patch visible in the window, just hitting Control-S).
Blocks: http_cache
OS: Windows 7 → All
Hardware: x86 → All
I've agreed with Byron to take this over, since he is back in school now.

Hard to reproduce. If anyone knows actions which typically trigger this problem, please post them here. A log would also be useful, of course. I'm currently playing with a local Apache with throttled upload bandwidth to see if I can provoke some errors in the cache, but no luck so far.
Assignee: byronm → bjarne
My desktop computer, which has a profile/cache in this state saved against such a need, is due to arrive in California early next week, so I'll be able to repro then if needed.  I am not really up to sending the cache to others, but I would be more than willing to do a windbg session to gather more info!
I'm really not able to reproduce this situation...  :(  If anyone gets into it, it would also be very interesting to see the headers of a failing cache-entry (about:cache, click "List Cache Entries" for the disk cache, then click on the entry itself).

Http*s* seems to be important here (although I don't see why it should make a difference). Can we assume that this issue is seen only/primarily for resources loaded with http*s*?
(In reply to comment #11)
> My desktop computer, which has a profile/cache in this state saved against such
> a need, is due to arrive in California early next week, so I'll be able to
> repro then if needed.  I am not really up to sending the cache to others, but I
> would be more than willing to do a windbg session to gather more info!

It would be very useful if you would be able to use this machine to reproduce while logging. I'd appreciate a ping when you have time (and hardware) to try this.
I've just experienced this bug too, but on a release build without logging :-/ Anyway, the cache info is:


Cache entry information
key: 	https://bug594882.bugzilla.mozilla.org/attachment.cgi?id=474574
fetch count: 	16
last fetched: 	2010-09-16 02:07:11
last modified: 	2010-09-16 01:54:05
expires: 	1970-01-01 01:00:00
Data size: 	867
file on disk: 	/home/michal/.mozilla/firefox/6n4piytr.default/Cache/63C4C64Bd01
Security: 	This is a secure document.

Client:  	HTTP
request-method: 	GET
request-Accept-Encoding: 	gzip, deflate
response-head: 	HTTP/1.1 200 OK Server: Apache X-Backend-Server: pm-app-bugs03 Vary: Accept-Encoding Content-Type: text/plain; name="594882-sync-suspend-coutners.patch"; charset= Content-Encoding: gzip Strict-Transport-Security: max-age=2629744; includeSubDomains Date: Wed, 15 Sep 2010 23:54:05 GMT Content-Disposition: inline; filename="594882-sync-suspend-coutners.patch"
charset: 	ISO-8859-1
security-info:  ...
necko:classified:  	1


The interesting information is that although the data size is only 867 bytes it is stored on the disk. But in fact the data file on the disk doesn't exist.
(In reply to comment #14)
> The interesting information is that although the data size is only 867 bytes it
> is stored on the disk. But in fact the data file on the disk doesn't exist.

You mean that it got its own file? It wasn't stored in a bucket-file? What about the real resource: Is it 867 bytes or is it really larger? I observe it's gzipped: Is 867 the unzipped or the zipped length? And what about the Content-Length-header: Is it really not present?
An observation: 3.6 does not cache this attachment on disk. If memory-cache is enabled, it stores it there.
Another observation: (FF3.6.9) If I load the attachment with plain http, I receive a 301 redirecting to the https-url. The 301 *is* stored in the disk-cache though, with this piece of info

file on disk: /home/bjarne/.mozilla/firefox/qewu2n5s.default/Cache/ABEB3347d01

and that file doesn't exist, just like in Michal's case.

I wonder what happens if the plain http-url returns 200 and content on a subsequent request...  the cache-entry would then already exist and should be updated.
Using my trunk-build and loading the resource from above with plain http, the 301-response no longer appears in the cache. This seems to be caused by the STS-support, and makes some sense.

However (with my trunk-build) the resource loaded over https is now cached on disk. This triggers two questions: why has the policy changed from 3.6 to now allow the resource to be cached on disk? And how will this play with a cache from pre-STS times (or in a profile where STS is enabled and then disabled - see bug #594990)?
> Why has the policy changed from 3.6 now allowing the resource to be cached on
> disk?

See bug 531801.
(In reply to comment #19)
> > Why has the policy changed from 3.6 now allowing the resource to be cached on
> > disk?
> 
> See bug 531801.

Right - thanks. That clears up some of this.
(In reply to comment #17)
> Another observation: (FF3.6.9) If I load the attachment with plain http, I
> receive a 301 redirecting to the https-url. The 301 *is* stored in the
> disk-cache though, with this piece of info

This is reproducible on trunk with a non-sts host redirecting. Try e.g.

https://api-secure.recaptcha.net/challenge?k=6LckBgMAAAAAAHlkAGWVrdc3gorusTVlLjixmeFh

which redirects to google. The problem is that there is no content, hence no content is written to the cache-entry, which sc*** up some assumptions of the disk-cache. A similar effect can be provoked with xpcshell or in a debugger on any response in fact.

I believe I have a fair idea of the core of this problem by now, and it may very well also cause bug #596443. However, I need some time to phrase it and clean up the xpcshell-test which demonstrates it.
The core of the problem is that nsDiskCacheRecord::DataFile() is used as the primary indicator of whether an entry is stored in a separate file. If no data (as opposed to metadata) has been written to the entry (like in the case of 3xx-responses without content), this method returns 0, indicating that the entry is stored in a separate file, even when this is not the case.

Any cache-able response is subject to the same problem, including 200-responses: If a 200 response for whatever reason is cut off before FF starts saving the content to file, it is (falsely) reported by the disk-cache to be stored in a separate file. Moreover, all subsequent requests to this resource *may* result in empty content returned from the cache.
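To make the ambiguity concrete, here is a minimal sketch (illustrative only - the real nsDiskCacheRecord packs the data location into bitfields, and the mask/offset names below are made up):

// Sketch: DataFile() extracts a small selector from the packed
// data-location word; 0 means "data lives in a separate file",
// 1-3 select a block file (_CACHE_001_ .. _CACHE_003_).
PRUint32 DataFile() const
{
    return (mDataLocation & kFileSelectorMask) >> kFileSelectorOffset;
}

An entry whose data was never written still has an all-zero location word, so DataFile() also returns 0, and callers cannot tell it apart from an entry that genuinely has its own file.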

Why would a response be cut off before writing content to the cache?

I can imagine various reasons ranging from bugs in necko to hick-ups in the network itself. Much of this is out of our control so we must make the cache or necko resilient to this issue in any case (IMHO).

Which entries will have this problem?

Obviously the entry must be cache-able and not subject to validation at the server. An important factor here is that the Content-Length header must not be present in the response, because the cache validates entries where this value differs from the size of the data stored in the cache. But if we get a response with no Content-Length header, sufficient expiration-details to avoid validation with the server, and no content stored to the cache, we end up with a valid cache-entry with no data, which also claims to reside in a non-existent separate file. Examples of this are 3xx responses without content, like the one in comment #22. But it can also happen for 200-responses.
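For concreteness, a response of this shape (headers illustrative) hits exactly that path - cacheable, no Content-Length, and no body ever written to the entry:

HTTP/1.1 301 Moved Permanently
Location: https://example.org/resource
Cache-Control: max-age=3600
Date: Wed, 15 Sep 2010 23:54:05 GMT

(no Content-Length header, no entity body)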

How to fix?

The consequence for about:cache is that a filename for the entry is reported erroneously, and this is easily fixed by e.g. checking that the file exists before reporting it. However all subsequent retrievals of such entries return no data and claim that data is in a file, and we cannot check this for all resources everywhere in the code. So IMO this must be fixed in necko.

200 responses without data don't make sense (IMHO), so we can easily extend nsHttpChannel::CheckCache() to validate such entries. IMO this will handle the most severe cases.

3xx responses may or may not come with content, and the file-cache is not designed to cache the latter. However, the content of such responses is rarely used and probably even more rarely saved to a file by the user, so IMO we can get away here by fixing the about:cache code as indicated above.

A couple of open questions still remain, and I'll dig into these asap:

1) What actually happens on hard reload for such entries? I.e. how does the cache deal with replacing the content (assuming the content is actually loaded this time) in an entry which is believed to reside in a separate file that does not exist?

2) What happens if a cached 3xx response is validated and replaced with a 200 response? I.e. even if we fix about:cache not to report files wrongly, the cached resource may be replaced with a response carrying actual content while the file doesn't exist. (In fact, this is probably the same basic issue as above, but I'll leave it for now...)
Attached patch Testcase plus WIP-fix (obsolete) (deleted) - Splinter Review
Attaching this for comments, but will fix the about:cache issue and try to figure out the open questions before submitting a proper fix.
> The consequence for about:cache is that a filename for the entry is reported
> erroneously, and this is easily fixed by e.g. checking that the file exists
> before reporting it. However all subsequent retrievals of such entries return
> no data and claim that data is in a file, and we cannot check this for all
> resources everywhere in the code. So IMO this must be fixed in necko.

I am wondering if we can change the description and implementation of nsICacheEntryDescriptor::file to say that it throws and returns null if the file doesn't exist. Any thoughts?
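A minimal sketch of what I have in mind (GetFileInternal() is a hypothetical stand-in for the current lookup; this is not the actual tree code):

NS_IMETHODIMP
nsCacheEntryDescriptor::GetFile(nsIFile **result)
{
    nsCOMPtr<nsIFile> file;
    nsresult rv = GetFileInternal(getter_AddRefs(file)); // assumed helper
    if (NS_FAILED(rv))
        return rv;

    PRBool exists = PR_FALSE;
    if (NS_FAILED(file->Exists(&exists)) || !exists)
        return NS_ERROR_NOT_AVAILABLE; // file missing: throw to callers

    file.forget(result);
    return NS_OK;
}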
> What actually happens on hard reload for such entries? I.e. how does the
> cache deal with replacing the content (assuming the content is actually loaded
> this time) in an entry which is believed to reside in a separate file, but this
> file does not exist.

The cache seems to handle this well on Linux. There are, however, a number of file operations involved, and these have different implementations for Linux, OSX and Windows. I'll make a test and have the tryserver check this...
The open questions from comment #23 were inspired by bug #596443, and following the reasoning in bug #596443 comment #6, I'll leave the two open questions about reloading/replacing a broken cache-entry out of this.

This patch makes sure a cached cut-off response is validated at the server. It also checks for existence of any file reported by about:cache.

The test loads a resource twice. The first time it is called, the handler cuts off the response; the second time, it responds properly. If the entry is fetched from the cache on the second load, the test fails.
Attachment #476939 - Attachment is obsolete: true
Attachment #477699 - Flags: review?(michal.novotny)
Comment on attachment 477699 [details] [diff] [review]
Validate cached response with no content, test, avoid reporting non-existent file in about:cache

> +    PRBool fileExists = PR_FALSE;
>      rv = descriptor->GetFile(getter_AddRefs(cacheFile));
> -    if (NS_SUCCEEDED(rv)) {
> +    if (NS_SUCCEEDED(rv))
> +        rv = cacheFile->Exists(&fileExists);
> +
> +    if (NS_SUCCEEDED(rv) && fileExists) {
>          nsAutoString filePath;
>          cacheFile->GetPath(filePath);
>          APPEND_ROW("file on disk", NS_ConvertUTF16toUTF8(filePath));
>      }
>      else
>          APPEND_ROW("file on disk", "none");

We should show "none" only in the case when the data size is 0 or the data is stored in the cache block files. Your change can hide some real problems when the separate file should exist but is missing (due to a bug in the cache code, or simply because the user has deleted it manually). In this case we should show "missing" (and maybe it should be red and bold).
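Something like this, perhaps (a sketch based on the quoted hunk; distinguishing the block-file case is left to the existing GetFile() failure path):

rv = descriptor->GetFile(getter_AddRefs(cacheFile));
if (NS_SUCCEEDED(rv) && cacheFile) {
    PRBool exists = PR_FALSE;
    cacheFile->Exists(&exists);
    if (exists) {
        nsAutoString filePath;
        cacheFile->GetPath(filePath);
        APPEND_ROW("file on disk", NS_ConvertUTF16toUTF8(filePath));
    } else {
        // separate file should exist but doesn't - flag it loudly
        APPEND_ROW("file on disk", "missing");
    }
} else {
    // data size 0, or data stored in the cache block files
    APPEND_ROW("file on disk", "none");
}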


> +        } else if (size < 1) { // No data, no content-length. See bug #588804
> +            LOG(("Validating due to no cached data and no Content-Length header"));
> +            return NS_OK;

First of all, I don't see how this solves the problem described in comment #14. The size there is 867, so this code doesn't help at all. Maybe I'm missing something, but I don't think your comment #23 explains that problem.

Also, this change only fixes the case where there is no Content-Length and we have no data. It wouldn't solve the problem where we have partial data, i.e. corrupted images will not reload.
Attachment #477699 - Flags: review?(michal.novotny) → review-
(In reply to comment #28)
> We should show "none" only in the case when the data size is 0 or the data
> is stored in the cache block files. Your change can hide some real problems
> when the separate file should exist but is missing (due to a bug in the
> cache code, or simply because the user has deleted it manually). In this
> case we should show "missing" (and maybe it should be red and bold).

Maybe an assertion would be sufficient to flag real issues? I'm no expert on GUIs, but I believe a red and bold "missing" might be intimidating to a user.

> First of all, I don't see how this solves the problem described in comment
> #14. The size there is 867, so this code doesn't help at all. Maybe I'm
> missing something, but I don't think your comment #23 explains that problem.
> 
> Also, this change only fixes the case where there is no Content-Length and
> we have no data. It wouldn't solve the problem where we have partial data,
> i.e. corrupted images will not reload.

Agreed - this only fixes the issue when *no* data is written. The question is actually *how* do we know that partial data has been written when there is no Content-Length header. The cache doesn't really have any way of knowing how much data to expect, afaics.

Perhaps this should be split into two separate bugs: one to deal with no writes (which has the unfortunate side-effect described in comment #23), and one to deal with partially written data (which does *not* suffer from the side-effect, but which of course results in corrupt data)?
(In reply to comment #29)
> Agreed - this only fixes the issue when *no* data is written. The question is
> actually *how* do we know that partial data has been written when there is no
> Content-Length header. The cache doesn't really have any way of knowing how
> much data to expect, afaics.

Google Chrome seems to discard the cache entry if a chunked transfer doesn't stop cleanly (i.e. there must be a final empty chunk). We should probably do the same. This would eliminate partial entries due to network failure.
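Roughly this idea (a sketch - the hook and member names are assumptions for illustration, not actual necko code):

// When the transaction finishes, doom the entry if a chunked body
// ended without the terminating zero-length chunk.
void HypotheticalChannel::OnTransactionDone()
{
    if (mWasChunked && !mSawFinalEmptyChunk && mCacheEntry) {
        mCacheEntry->Doom();   // discard the partial entry
        mCacheEntry = nsnull;
    }
}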


> Perhaps this should be split into two separate bugs: one to deal with no
> writes (which has the unfortunate side-effect described in comment #23),
> and one to deal with partially written data (which does *not* suffer from
> the side-effect, but which of course results in corrupt data)?

What side-effect do you mean? The only problem I see is that reading from such entries returns no data. But reading data from partial cache entries is the same problem.

Btw, you wrote that "200 responses without data don't make sense". E.g. the main page on my server (http://michal.etc.cz/) is empty (not empty HTML, but just an empty file). I understand that this isn't a valid argument; anyway, such pages exist... :)
> Perhaps this should be split into two separate bugs: one to deal with no writes
> and one to deal with partially written data?

The problem in comment #14 doesn't fall into either of these bugs. This isn't a partial entry. The total size should be 867, so the whole response was received. But the cached data is simply missing.
(In reply to comment #30)

> Btw, you wrote that "200 responses without data don't make sense". E.g. the
> main page on my server (http://michal.etc.cz/) is empty (not empty HTML,
> but just an empty file). I understand that this isn't a valid argument;
> anyway, such pages exist... :)

A properly delimited 0-length 200 response might be sensible... consider if it were a CSV list of must-fix bugs for FF 4; 0 length would be ideal :)

Re the more important truncated-document bit, I would like to include truncated documents as "negative pipelining feedback" in my pipelining odyssey... For me I think it is as simple as checking !mResponseIsComplete in nsHttpTransaction::Done(), because I only want fully formed CL or chunked entries for pipelining... you probably need some check allowing EOF only if there was no CL or chunking on the transaction.
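Something along these lines (simplified sketch; the placement and surrounding logic are assumptions):

// Accept EOF as a valid terminator only when the response was
// delimited by neither Content-Length nor chunked encoding.
PRBool looksTruncated =
    !mResponseIsComplete &&
    (mChunkedDecoder || mContentLength >= 0);
if (looksTruncated) {
    // truncated CL/chunked response: record negative pipelining
    // feedback (and arguably don't reuse the cached entry)
}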
(In reply to comment #30)
> What side-effect do you mean? The only problem I see is that reading from such
> entries returns no data. But reading data from partial cache entries is the
> same problem.

The side-effect comes when no data is ever written to the entry. As described in comment #23, nsDiskCacheRecord::DataFile() (*note* ...File()) returns 0 in such cases. This is interpreted in e.g. about:cache as if the entry has data in its own file (load e.g. the url from comment #22 in Fx4 - not 3.6, which will cache it in memory, ref comment #19 - and look up the original url in the disk-cache: it should be a 302 with data stored in its own (non-existent) file). When you get partial data, this side-effect is not a problem.

> Btw, you wrote that "200 responses without data don't make sense". E.g. the
> main page on my server (http://michal.etc.cz/) is empty (not empty HTML,
> but just an empty file). I understand that this isn't a valid argument;
> anyway, such pages exist... :)

I see your point (and Patrick's point below). However, your homepage sets Content-Length, at least...  :)  Would we be willing not to re-use such pages from the cache if they have no Content-Length?

(In reply to comment #31)
> The total size should be 867, so the whole response was received.
> But the cached data is simply missing.

Comment #14 says that the file is missing. Was there no data visible either?

I agree that this is odd...  need to think more about this variation.
I see this pretty regularly on bugzilla.
Both RedHat Network and the management interface on our Zeus load balancers are almost unusable because of this.  I constantly have to shift+reload to be able to actually see the status indicators.
Ok - a few questions:

1) Are we still talking https? Or does this also happen over plain http?
2) Would any of you have the chance to dump info (headers, size, filename etc) for any such failing entry? (In particular if anything different from comment #14 comes up...) Also please dump stats for your disk-cache, btw.
3) justdave: is there a public URL which would allow me to reproduce?
I've been seeing this behaviour on two (unfortunately non-public) https sites.  Images aren't showing up sporadically until I shift-reload.  Not sure what other information I could provide that would be useful, but let me know.
I see this all the time when viewing source of bugzilla attachments.  Try clicking some small test cases and then using view-source?  I can dump things in here when it happens next, but it basically happens all the time!
(In reply to comment #36)
> 1) Are we still talking https? Or does this also happen over plain http?

Yes, https only; I haven't noticed this on any http sites.

> 2) Would any of you have the chance to dump info (headers, size, filename etc)
> for any such failing entry? (In particular if anything different from comment
> #14 comes up...) Also please dump stats for your disk-cache, btw.

If you can provide instructions (or a pointer to a page with existing instructions) on how to retrieve said info so I can supply it to you, sure.

> 3) justdave: is there a public URL which would allow me to reproduce?

Unfortunately, no.  These are all internal admin sites behind logins, except for Beltzner's example.
Here are some screenshots (not that they'll help much as far as data is concerned).

Here's what the Zeus interface looks like when you first load the page (assuming you've been there before, so the images are cached):
https://people.mozilla.com/~justdave/zeusfirst.png

Here's what it looks like (and what it's supposed to look like) after a shift+reload:
https://people.mozilla.com/~justdave/zeusshiftreloaded.png

Note the enabled/disabled indicators on each of the virtual servers, the tabs across the top, and the traffic level/status indicator on the top left.

An examination of the header traffic via LiveHTTPHeaders shows that Minefield is sending a request for the missing files to the server with an "If-Modified-Since" header, and the server is responding with "304 Not Modified".  The Media tab in the Page Info dialog shows the corresponding files with 0px x 0px listed as their sizes.
(In reply to comment #39)
> If you can provide instructions (or a pointer to a page with existing
> instructions) on how to retrieve said info so I can supply it to you, sure.

Sorry: Open a new tab and type "about:cache" in the url-field. Then click on "List Cache Entries" for the DiskDevice. Dump the summary-info (top of that page) for your disk-cache here, please. Then try to locate the entry for one of the missing resources (you need to know its url, or at least parts of it, in order to find it). If you find it, click on the entry and copy the content of the resulting page here.

(In reply to comment #40)
> An examination of the header traffic via LiveHTTPHeaders shows that Minefield
> is sending a request for the missing files to the server with an
> "If-Modified-Since" header, and the server is responding with "304 Not
> Modified".

That is interesting! Do you think you can get the info mentioned above for a missing entry, both before and after shift-reload?
Yeah, I've also been seeing cache problems only on https; I see it regularly on bugzilla attachments, and I needed to clear my cache to pay a bill this morning (an https site, requiring login).
Some follow-up questions: do you guys see this on 3.6 or 4? Does it seem to happen more frequently when the cache is close to full?
4. Cache isn't even half-full.
This url might work: https://mail.mozilla.com/zimbra/public/blank.html
Only in 4 for me.  BTW, I can reproduce it at will viewing this bug.  Look at the CC list at the top of the bug.  On initial load of the bug it's a bulleted list.  The title of the bug is normal text on a white background.  Upon shift-reloading the bug, the CC list is a scrollbox, and the bug title is on a gray box.  So it's affecting CSS, too.
(In reply to comment #41)
> > An examination of the header traffic via LiveHTTPHeaders shows that Minefield
> > is sending a request for the missing files to the server with an
> > "If-Modified-Since" header, and the server is responding with "304 Not
> > Modified".
> 
> That is interesting! Do you think you can get the info mentioned above for a
> missing entry, both before and after shift-reload?

Why is this interesting? IMO that's expected behavior. We have an invalid entry but with "correct" metadata, so we send a conditional request.
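I.e. an exchange roughly like this (illustrative), after which we happily serve the (missing) cached body again:

GET /attachment.cgi?id=474574 HTTP/1.1
Host: bug594882.bugzilla.mozilla.org
If-Modified-Since: Wed, 15 Sep 2010 23:54:05 GMT

HTTP/1.1 304 Not Modified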
I can consistently reproduce this on a test machine hosted inside mozillalabs.com, which is sitting behind a Zeus SSL terminator.

By pressing Enter in the location bar, I see GET http://mozillaball.mozillalabs.com requests happening, with 301 responses.

If I press the reload button, however, I see a GET https://mozillaball.mozillalabs.com with a 304 response.

I tcpdumped the initial HTTP request.  about:cache details for the entry are attached; there was no cache entry for http://mozillaball.mozillalabs.com.

The 304 request/response pair, as reported by the Web Console, follows.

The Last Modified date of the cache entry does NOT update if I press enter on the location bar; it DOES update if I press the reload button.
(In reply to comment #49)

> I can consistently reproduce this on a test machine hosted inside
> mozillalabs.com,

Brilliant! :)

> I tcpdumped the initial HTTP request.  about:cache details for the entry are
> attached;

Does the file /Users/michaelhanson/Library/Caches/Firefox/Profiles/fvfl3wpn.fres4.1/Cache/B9125F29d01 exist at this point?

> The Last Modified date of the cache entry does NOT update if I press enter on
> the location bar; it DOES update if I press the reload button.

Does anything else update in the cache-entry after reload? E.g. the "file on disk" field?

Any chance to get a nspr-log of this? See https://developer.mozilla.org/en/HTTP_Logging for details - please also add "cache:5" to the comma-separated NSPR_LOG_MODULES list...
(In reply to comment #44)
> 4. Cache isn't even half-full.

I think I've got it. It is actually caused by a full cache. Although we now have a 250MB cache size by default (and even much bigger via smart sizing), the limits for the block files _CACHE_00[1-3]_ haven't changed.

I.e. we have only:
8MB for files with size 1-1024
32MB for files with size 1025-4096
128MB for files with size 4097-16384

The problem arises when writing the data fails but the metadata is written successfully. HTTPS is just a red herring here: the metadata of an HTTPS response usually goes into _CACHE_002_ because of the large security-info. When the data is small enough to fit into _CACHE_001_ (a small patch, image, etc.) and that file is full, the problem comes up.
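The numbers follow from the block-file geometry (block counts derived from the limits above; entries occupy up to 4 blocks, which is where the size ranges come from):

_CACHE_001_:   8 MB / 256-byte blocks  = 32768 blocks -> entries of 1-1024 bytes
_CACHE_002_:  32 MB / 1024-byte blocks = 32768 blocks -> entries of 1025-4096 bytes
_CACHE_003_: 128 MB / 4096-byte blocks = 32768 blocks -> entries of 4097-16384 bytes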
Attached file nspr log (deleted)
(In reply to comment #50)
> Does the file /Users/michaelhanson/Library/Caches/Firefox/Profiles/fvfl3wpn.fres4.1/Cache/B9125F29d01 exist at this point?

It does not.

>> The Last Modified date of the cache entry does NOT update if I press enter on
>> the location bar; it DOES update if I press the reload button.

> Does anything else update in the cache-entry after reload? E.g. the "file on
> disk" field?

The fetch count, last fetched, last modified, and expires fields change; the Date field of the response-head changes.  The file on disk does not (and still does not exist).

> Any chance to get a nspr-log of this? See
> https://developer.mozilla.org/en/HTTP_Logging for details - please also add
> "cache:5" to the comma-separated NSPR_LOG_MODULES list...

Attached.
(In reply to comment #51)
> The problem arises when writing the data fails but the metadata is written
> successfully. HTTPS is just a red herring here: the metadata of an HTTPS
> response usually goes into _CACHE_002_ because of the large security-info.
> When the data is small enough to fit into _CACHE_001_ (a small patch, image,
> etc.) and that file is full, the problem comes up.

This sounds like a real issue, yes. And it can explain why such cache entries claim to be stored in a separate file (i.e. "response is cut off before content is written to cache" as discussed in comment #23). But wouldn't the else-block in nsDiskCacheDevice::DeactivateEntry() remove metadata if writing content fails?
Attachment #483600 - Attachment mime type: text/plain → application/x-gzip
(In reply to comment #53)
> This sounds like a real issue, yes. And it can explain why such cache entries
> claim to be stored in a separate file (i.e. "response is cut off before content
> is written to cache" as discussed in comment #23). But wouldn't the else-block
> in nsDiskCacheDevice::DeactivateEntry() remove metadata if writing content
> fails?

I haven't debugged it deeply yet. I just tried to simulate the full cache by failing the appropriate nsDiskCacheBlockFile::AllocateBlocks() call in a debugger, and can confirm that it creates the corrupted cache entry.
(In reply to comment #52)
> > Any chance to get a nspr-log of this? See
> > https://developer.mozilla.org/en/HTTP_Logging for details - please also add
> > "cache:5" to the comma-separated NSPR_LOG_MODULES list...
> 
> Attached.

Thanks! The (relevant) request/response pairs are:

- one request to http://... + 301-response redirecting to https
- another request to http://... + 301-response redirecting to https
- then the request with i-m-s header to https://...  +  the 304 response

(Not sure why there are two requests to the http-url, but never mind...)

So the https-url is stored in your cache, and Necko believes the content is stored in a separate file, which doesn't exist. IMO we have the situation described in comment #23, and Michal may very well have figured out (in his comments above) the reason *why* the content is not stored (or is "cut off", to use the term from comment #23).

Would you mind capturing another nspr-log while doing a hard reload, i.e. a shift-ctrl-r? It should bypass the cache completely and reload the url...
Attached file nspr log of force-reload (deleted)
(In reply to comment #55)
Log of force reload, attached.
Thanks again! And now the page looks ok? What does the cache-entry look like?
The page looks fine; the cache-entry has a fetch count of 1 (before I click on it); the dates look reasonable.

The file on disk, however, is still a path to a non-existent file: /Users/michaelhanson/Library/Caches/Firefox/Profiles/fvfl3wpn.fres4.1/Cache/B9125F29d01

That's surprising... my about:cache shows 12754 files, 1048576 KiB max, 119584 KiB in use.

I visited a couple more pages and saw that new entries were appearing in the cache - 22k and 83k files appeared. I checked one of them and it was gzipped CSS from Yahoo, so the cache is still functioning at some level.

I'm not sure whether it's relevant to Michal's observation, but sorting the directory by size, the smallest file is 16389 bytes.  No obvious pattern jumps out looking at the dates and sizes.
(In reply to comment #58)
> The file on disk, however, is still a path to a non-existent file:
> /Users/michaelhanson/Library/Caches/Firefox/Profiles/fvfl3wpn.fres4.1/Cache/B9125F29d01

That's OK. If it was caused by a full cache block file, then a force reload can't help; it won't free any space.

> That's surprising... my about:cache shows 12754 files, 1048576 KiB max, 119584
> KiB in use.

As I said, there are separate limits for small entries. So although only 10% of your cache is used, it isn't possible to cache entries of a certain size. This is a bug and needs to be fixed.
Attached patch proposed fix (deleted) - Splinter Review
Attached patch should fix the problem. Of course it doesn't solve the other issue that we stop caching small entries when cache block files are full. I'll file a follow-up bug for this.
Keywords: dogfood
(In reply to comment #60)
> Attached patch should fix the problem. Of course it doesn't solve the other
> issue that we stop caching small entries when cache block files are full.

Very nice catch! :)  I really cannot say anything about whether it resolves the issue seen here since I've not been able to reproduce it, but maybe others can.

I'm still a little puzzled as to why this only seems to happen when using https, though. It also seems to be connected to encoded streams, but heck...  your solution is valid in any case.

I've filed bug #604880 describing the particular issue you uncovered and made this bug depend on it. Bug #604880 should be trivial to get approved and landed, and we'll see if the issue seen here goes away. If it doesn't, there are other things to handle, but your fix would be valid in any case.

> I'll file a follow-up bug for this.

Bug number..?
Depends on: 604880
(In reply to comment #61)
> Bug number..?

604897
(In reply to comment #61)
> I've filed bug #604880 describing the particular issue you uncovered and made
> this bug depend on it. Bug #604880 should be trivial to get approved and
> landed, and we'll see if the issue seen here goes away. If it doesn't, there
> are other things to handle, but your fix would be valid in any case.

Thanks, that's a good idea. Anyway, anybody having the problem with view-source or broken images could check in a debugger whether it is caused by a full cache block file. Set a breakpoint at nsDiskCacheMap::FindRecord() and do some disk cache activity (e.g. a pageload, or viewing the details of some cache entry in about:cache?device=disk) that triggers the breakpoint. Then check whether any of the bitmaps is full, e.g. in gdb:

x/1024wx mBlockFile[0].mBitMap
x/1024wx mBlockFile[1].mBitMap
x/1024wx mBlockFile[2].mBitMap
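(Each dump prints the 32768-bit allocation bitmap as 1024 32-bit words; a completely full block file shows every word as 0xffffffff.)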
I also registered bug #605101 and bug #605108 for the two specific issues described in comment #23.
In case there are any problems with reproducibility, I have found that https://bug605048.bugzilla.mozilla.org/attachment.cgi?id=483877 consistently gives me the File not found page when I view source the first time, and then it works fine on reload.
FWIW, I'm seeing what looks like this when I log in to both bankofamerica.com and ba.com: missing content on the page and broken layout, and in the case of ba.com, forms are not showing up properly.
This is a major website compatibility hit; we need to pull it into an earlier beta.
blocking2.0: betaN+ → beta8+
There is a fair chance that bug #604880 will resolve this issue. Can we carry the blocking flag over to bug #604880 and get it landed?
Bug #604880 has landed - do you guys who experience this issue see any improvement?
Can't reproduce with my testcase any longer, which is good.
I am still seeing the problem in 20101021030744.  The file-on-disk entry continues to point to a non-existent file; force reload works once but the problem reoccurs after that point.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b8pre) Gecko/20101021 Firefox/4.0b8pre -- the problem seems less pronounced than it used to be, but I'm still seeing it.  The Zeus management GUI is no longer missing a lot of the icons it used to be missing, and the table frames are working again, but the status display and a few of the icons still don't show without a shift+reload.
(In reply to comment #74)
> the problem seems less pronounced than it used to be, but I'm still seeing it.

Did you try to clear the whole cache after upgrading FF? Also could you check the bitmaps as described in comment #63?
(In reply to comment #75)
> (In reply to comment #74)
> > the problem seems less pronounced than it used to be, but I'm still seeing it.
> 
> Did you try to clear the whole cache after upgrading FF?

I had not, and after doing so, the problem does seem to be completely gone.  The bank website I was having issues with and the Red Hat Network site are also no longer exhibiting this problem.

> Also could you check the bitmaps as described in comment #63?

No idea how to set a breakpoint, sorry.
Sounds like we need to ship a cache version increment with this, to make sure the cache is cleared.
Attached patch Bump disk-cache version (deleted) - Splinter Review
Unless anyone disagrees, I'll say Michal's patch for bug #604880 fixed this issue. This is a simple patch to bump the cache version, as suggested by Jason in comment #77.
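For reference, the bump itself is a one-liner of this shape (path, constant name and values are illustrative, not copied from the patch). Since the disk cache refuses a map with a stale version, existing (possibly corrupted) caches are discarded and rebuilt:

--- a/netwerk/cache/nsDiskCacheMap.h
+++ b/netwerk/cache/nsDiskCacheMap.h
-#define kCurrentVersion  0x00010012
+#define kCurrentVersion  0x00010013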
Attachment #485626 - Flags: review?(jduell.mcbugs)
blocking2.0: beta8+ → beta7+
Comment on attachment 485626 [details] [diff] [review]
Bump disk-cache version

Given that this is a version bump change only, and we're blocking beta7 on it, and Jason is on semi-vacation, I'm taking the privilege to r=jst this, even if I'm not a peer!
Attachment #485626 - Flags: review?(jduell.mcbugs) → review+
Cache version bump landed:

http://hg.mozilla.org/mozilla-central/rev/f91e0db149dd

Marking FIXED!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED