Closed Bug 548406 Opened 15 years ago Closed 14 years ago

xpcshell tests: assertion "mem cache leaking entries" in nsMemoryCacheDevice::Shutdown() causing abort test failed (with xpcshell return code: -6) test_setindexhandler.js, test_registerdirectory.js, test_sjs.js

Categories

(Core :: Networking, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 567680

People

(Reporter: bent.mozilla, Assigned: bjarne)

References

Details

Attachments

(5 files, 8 obsolete files)

(deleted), text/html
Details
(deleted), patch
Details | Diff | Splinter Review
(deleted), text/html
Details
(deleted), text/html
Details
(deleted), patch
Details | Diff | Splinter Review
This log shows the failure in test_registerdirectory.js and also a similar failure in test_setindexhandler.js:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267040039.1267041846.12372.gz
Linux mozilla-central debug test xpcshell on 2010/02/24 11:33:59
s: moz2-linux-slave23

TEST-UNEXPECTED-FAIL | /builds/slave/mozilla-central-linux-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_registerdirectory.js | test failed (with xpcshell return code: -6), see following log:
PROCESS-CRASH | /builds/slave/mozilla-central-linux-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_registerdirectory.js | application crashed (minidump found)
Thread 0 (crashed)
TEST-UNEXPECTED-FAIL | /builds/slave/mozilla-central-linux-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_setindexhandler.js | test failed (with xpcshell return code: -6), see following log:
PROCESS-CRASH | /builds/slave/mozilla-central-linux-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_setindexhandler.js | application crashed (minidump found)
Thread 0 (crashed)
Filling in the blanks between those first two:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1266974697.1266976971.23953.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1266997863.1267000195.427.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267000344.1267003031.10454.gz

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267050979.1267053202.20476.gz

and as bsmedberg noticed, what's going wrong is

###!!! ASSERTION: ### mem cache leaking entries?: 'mEntryCount == 0', file
/builds/moz2_slave/mozilla-central-linux-debug/build/netwerk/cache/src/nsMemoryCacheDevice.cpp,
Summary: Random test_registerdirectory.js failure → xpcshell tests: assertion in nsMemoryCacheDevice::Shutdown() causing abort test failed (with xpcshell return code: -6) test_setindexhandler.js, test_registerdirectory.js, test_sjs.js
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267078208.1267080399.6231.gz
Linux mozilla-central debug test xpcshell on 2010/02/24 22:10:08
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267079519.1267081572.9355.gz
Linux mozilla-central debug test xpcshell on 2010/02/24 22:31:59
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267100610.1267102578.22439.gz
inux mozilla-central debug test xpcshell on 2010/02/25 04:23:30
Hard to reproduce, but eventually...

This may certainly be caused by the cache-write thread introduced in bug #513074 shutting down after the memory-cache is shut down. See brief reasoning in bug #513074 starting at comment 16, 6th paragraph and suggestion in comment #27.

(I cannot clearly see now why this was necessary though - it was clearer to me while working on bug #513074...  Nevertheless, shutting down the cache-write thread earlier like early patches for bug #513074 leaks lots of nsStorageStream-objects.)

Theory is that, due to the sequence of shutdowns, there may be cache-entries waiting to be evicted on the cache-write thread while the memory-cache shuts down. This may cause imbalance and trigger the assertion. Removing the assertion is the quick-and-dirty solution, however, something tells me that this is not acceptable...  :) The proper way may be to add code to wait for the cache-write thread to finish. I'll try to come up with a nice way to do this (probably waiting for an event or something).
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267126395.1267129499.29979.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/25 11:33:15
s: moz2-darwin9-slave20
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267129246.1267131696.8324.gz
Linux mozilla-central debug test xpcshell on 2010/02/25 12:20:46
s: moz2-linux-slave26
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267136550.1267139194.961.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/25 14:22:30
s: moz2-darwin9-slave22
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267139662.1267141884.12062.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/25 15:14:22
s: moz2-darwin9-slave17
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267159713.1267161415.3974.gz
Linux mozilla-central debug test xpcshell on 2010/02/25 20:48:33
s: moz2-linux-slave21
Attached patch Removing the assertion (obsolete) (deleted) — — Splinter Review
I'll claim that the invariant which justifies the failing assertion is not true anymore due to the async cache-writes introduced in bug #513074 and the need to shut down the write-thread *after* shutting down the memory-cache. Thus, I propose to remove this assertion.

With the patch I've run the failing xpcshell-test ("test_registerdirectory.js") in a loop iterating 250+ times without seeing the crash. Without the patch I've seen it crash after 80-90 iterations.

Requesting review by bz. Feel free to change it (e.g. to biesi) if appropriate.
Assignee: nobody → bjarne
Status: NEW → ASSIGNED
Attachment #429118 - Flags: review?(bzbarsky)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267167798.1267170086.28853.gz
Linux mozilla-central debug test xpcshell on 2010/02/25 23:03:18
s: moz2-linux-slave21
Comment on attachment 429118 [details] [diff] [review]
Removing the assertion

I'd prefer christian review this, yes.

Why are we writing to _memory_ cache async, though?  That's a little odd to me...
Attachment #429118 - Flags: review?(bzbarsky) → review?(cbiesinger)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267167150.1267169383.26456.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/25 22:52:30
s: moz2-darwin9-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267207758.1267209759.32123.gz
Linux mozilla-central debug test xpcshell on 2010/02/26 10:09:18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267213851.1267217004.23020.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/26 11:50:51
s: moz2-darwin9-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267238803.1267240814.32333.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/26 18:46:43
s: moz2-darwin9-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267241581.1267243694.7786.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/26 19:33:01
s: moz2-darwin9-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267290798.1267292927.22956.gz
Linux mozilla-central debug test xpcshell on 2010/02/27 09:13:18
s: moz2-linux-slave21
(In reply to comment #16)
> Why are we writing to _memory_ cache async, though?  That's a little odd to
> me...

When you put it that way : Good question. :) I'm sure there is a reasonable explanation but biesi might be the only person who can explain it properly. Or maybe michal can (cc'ing him)?

Btw, do we need all these crash-logs? I'm uncertain about the established procedure (if any) here, but I believe the crash is sufficiently documented by now...?
Normal procedure is to post logs of random orange into the bug, partly as an incentive to fix it or back it out.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267305697.1267307787.29570.gz
Linux mozilla-central debug test xpcshell on 2010/02/27 13:21:37
s: moz2-linux-slave28

;)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267342759.1267344757.26137.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/27 23:39:19
s: moz2-darwin9-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267387919.1267390011.6253.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/28 12:11:59
s: moz2-darwin9-slave05
(In reply to comment #23)
> (In reply to comment #16)
> > Why are we writing to _memory_ cache async, though?  That's a little odd to
> > me...
> 
> When you put it that way : Good question. :) I'm sure there is a reasonable
> explanation but biesi might be the only person who can explain it properly. Or
> maybe michal can (cc'ing him)?

That's really good question. AFAIK there is no reason for doing this. In fact, why is there the check for STORE_ON_DISK_AS_FILE in nsHttpChannel::InstallCacheListener()? I overlooked this while doing the review...

>    nsCacheStoragePolicy policy;
>    rv = mCacheEntry->GetStoragePolicy(&policy);
>
>    if (!gHttpHandler->mCacheWriteThread ||
>         NS_FAILED(rv) ||
>         policy == nsICache::STORE_ON_DISK_AS_FILE) {
>        LOG(("nsHttpChannel::InstallCacheListener sync tee %p\n", tee.get()));
>        rv = tee->Init(mListener, out, nsnull);
>    } else {
>        LOG(("nsHttpChannel::InstallCacheListener async tee %p\n",
>                tee.get()));
>        rv = tee->InitAsync(mListener, gHttpHandler->mCacheWriteThread, out, nsnull);
>    }

I think that we should create a sync tee when the policy is nsICache::STORE_IN_MEMORY. We should also create a sync tee when the policy is nsICache::STORE_ANYWHERE and the disk cache is disabled, but we don't know this information since it isn't available through the cache API.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267402687.1267404431.14822.gz
Linux mozilla-central debug test xpcshell on 2010/02/28 16:18:07
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267421383.1267423657.5637.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/02/28 21:29:43
(In reply to comment #28)
> That's really good question. AFAIK there is no reason for doing this. In fact,
> why is there the check for STORE_ON_DISK_AS_FILE in
> nsHttpChannel::InstallCacheListener()?

The check is because some callers expect the full content to be available at the OnStopRequest() call and these will do SetCacheAsFile(TRUE) which set the STORE_ON_DISK_AS_FILE-flag. The async tee does not guarantee data to be available, so we use the sync tee for this. See nsICachingChannel.idl and e.g. nsDownloader::OnStartRequest()

> I think that we should create a sync tee when the policy is
> nsICache::STORE_IN_MEMORY.

I don't disagree - but we also need the existing check. So something like

if (STORE_ON_DISK_AS_FILE || STORE_IN_MEMORY)
  create sync-API
else
  create async-API
end

would address bz's concern, right?

IMO, the change above makes general sense, but it doesn't fix the issue we see here (I've reproduced the crash with the if-block above).

Without being an expert of how the necko-cache works, I'll risk an explanation based on general knowledge of why we see async writes for the memory cache : If no special cache-policy has been requested it's the cache itself which decides where to store the entry. Sometimes it will chose to store it in memory, sometimes on disk. (Maybe it also can move entries from memory to disk as a replacement-mechanism.) The point is that when we create the tee which writes the entry, we don't know where it will be stored, hence we end up with memory-cache entries being asynchronously stored.

I'd be happy if someone with better knowledge of necko could explain this better and/or potentially outline how we can work around it.

> We should also create a sync tee when the policy is
> nsICache::STORE_ANYWHERE and the disk cache is disabled, but we don't know this
> information since it isn't available through the cache API.

Maybe register new bug to change API and add this case to the if?

Bottom line : IMO Michals suggestion and bz's concern makes sense in general and there is a simple fix for it. However, it doesn't fix the issue of this bug, and IMHO the assertion in nsMemoryCacheDevice is not valid anymore and should be removed (or replaced if we can come up with another one).
Comment on attachment 429118 [details] [diff] [review]
Removing the assertion

still thinking about this bug, but this is the wrong fix.
Attachment #429118 - Flags: review?(cbiesinger) → review-
I think there's 3 options here:
a) Process pending events on the cache write thread before shutting down the cache
b) In InstallCacheListener, after creating the stream, check the entry's deviceID and only create an async stream if it's "disk"
c) expose the actual storage policy (DISK/MEMORY/DISK_AS_FILE) on nsICacheEntryInfo so that callers don't have to hardcode device IDs

I think my preference would be doing c) or a).
In fact b) is suboptimal because the offline cache is "offline" not "disk", yet writes his files to disk too.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267442783.1267444926.9786.gz
Linux mozilla-central debug test xpcshell on 2010/03/01 03:26:23
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267451408.1267453388.16435.gz
Linux mozilla-central debug test xpcshell on 2010/03/01 05:50:08
s: moz2-linux-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267476709.1267479486.8087.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/01 12:51:49  
s: moz2-darwin9-slave14
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267504492.1267507317.14345.gz
Linux mozilla-central debug test xpcshell on 2010/03/01 20:34:52
s: moz2-linux-slave20
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267511160.1267513688.7803.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/01 22:26:00
s: moz2-darwin9-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267548320.1267550731.25926.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/02 08:45:20
s: moz2-darwin9-slave15

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267548249.1267550153.23495.gz
Linux mozilla-central debug test xpcshell on 2010/03/02 08:44:09
s: mv-moz2-linux-ix-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267550282.1267552434.32431.gz
Linux mozilla-central debug test xpcshell on 2010/03/02 09:18:02
s: moz2-linux-slave27
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267569206.1267571516.934.gz
Linux mozilla-central debug test xpcshell on 2010/03/02 14:33:26
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267569574.1267572098.2915.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/02 14:39:34
s: moz2-darwin9-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267575646.1267578657.25452.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/02 16:20:46
s: moz2-darwin9-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267575646.1267578657.25452.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/02 16:20:46
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267580028.1267582146.5560.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/02 17:33:48
s: moz2-darwin9-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267580904.1267583370.9360.gz
Linux mozilla-central debug test xpcshell on 2010/03/02 17:48:24
s: moz2-linux-slave28
(In reply to comment #33)
> The check is because some callers expect the full content to be available at
> the OnStopRequest() call and these will do SetCacheAsFile(TRUE) which set the
> STORE_ON_DISK_AS_FILE-flag.

This is undocumented feature. There is IMO nowhere written that calling SetCacheAsFile(TRUE) ensures that all data is written to the file at OnStopRequest().

> Without being an expert of how the necko-cache works, I'll risk an explanation
> based on general knowledge of why we see async writes for the memory cache : If
> no special cache-policy has been requested it's the cache itself which decides
> where to store the entry. Sometimes it will chose to store it in memory,
> sometimes on disk. (Maybe it also can move entries from memory to disk as a
> replacement-mechanism.) The point is that when we create the tee which writes
> the entry, we don't know where it will be stored, hence we end up with
> memory-cache entries being asynchronously stored.

That's not how it works (see nsCacheService::EnsureEntryHasDevice()). I.e. if the disk cache is enabled and if the storage policy of the entry allows caching to the disk then the disk cache is used. Of course it's still possible, but unlikely, that something will fail (CreateDiskDevice(), MarkBinding(), etc...) and in this case the memory cache would be used. 

> c) expose the actual storage policy (DISK/MEMORY/DISK_AS_FILE) on
> nsICacheEntryInfo so that callers don't have to hardcode device IDs

I'm not sure I understand. We know the storage policy of the cache entry, but we don't know what device will be chosen in nsCacheService::EnsureEntryHasDevice(). And this is done after we create the tee. So we should either expose the expected resulting device on the entry, or directly the values of mEnableMemoryDevice, mEnableDiskDevice, mEnableOfflineDevice on the cache service.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267622254.1267624542.18125.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 05:17:34
s: moz2-linux-slave27

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267625078.1267627348.29587.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 06:04:38
s: moz2-linux-slave25


http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267625078.1267627348.29587.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 06:32:03
s: moz2-linux-slave26

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267632296.1267634992.24385.gz
WINNT 5.2 mozilla-central debug test xpcshell on 2010/03/03 08:04:56
s: win32-slave23
> There is IMO nowhere written that calling SetCacheAsFile(TRUE) ensures that all
> data is written to the file at OnStopRequest().

Yes, but if the one consumer SetCacheAsFile exists for expects that behavior (which it sounds like it does), then we can't just change it.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267672806.1267675030.27650.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 19:20:06
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267677648.1267680042.9968.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 20:40:48
s: moz2-linux-slave26
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267681854.1267684050.21262.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 21:50:54
s: moz2-linux-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267688776.1267690850.9905.gz
Linux mozilla-central debug test xpcshell on 2010/03/03 23:46:16
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267698904.1267700995.13880.gz
Linux mozilla-central debug test xpcshell on 2010/03/04 02:35:04
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267690319.1267692934.15968.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/04 00:11:59
s: moz2-darwin9-slave17
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267723751.1267726105.22845.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/04 09:29:11
s: moz2-darwin9-slave06
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267724716.1267727218.27050.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/04 09:45:16
s: moz2-darwin9-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267743043.1267746262.27295.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/04 14:50:43
s: moz2-darwin9-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267791849.1267793826.12159.gz
Linux mozilla-central debug test xpcshell on 2010/03/05 04:24:09
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267806185.1267808019.27525.gz
Linux mozilla-central debug test xpcshell on 2010/03/05 08:23:05
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267817362.1267819573.27267.gz
Linux mozilla-central debug test xpcshell on 2010/03/05 11:29:22
s: moz2-linux-slave16
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267820308.1267821959.1378.gz
WINNT 5.2 mozilla-central opt test xpcshell on 2010/03/05 12:18:28
s: win32-slave22
(In reply to comment #50)
> > c) expose the actual storage policy (DISK/MEMORY/DISK_AS_FILE) on
> > nsICacheEntryInfo so that callers don't have to hardcode device IDs
> 
> I'm not sure I understand. We know the storage policy of the cache entry, but
> we don't know what device will be chosen in
> nsCacheService::EnsureEntryHasDevice(). And this is done after we create the
> tee. So we should either expose the expected resulting device on the entry, or
> directly the values of mEnableMemoryDevice, mEnableDiskDevice,
> mEnableOfflineDevice on the cache service.

What I meant was expose a way to determine what device was chosen. This determination happens before the tee is created - it happens during openOutputStream, so it would be possible to use this.

(In reply to comment #52)
> > There is IMO nowhere written that calling SetCacheAsFile(TRUE) ensures that all
> > data is written to the file at OnStopRequest().
> 
> Yes, but if the one consumer SetCacheAsFile exists for expects that behavior
> (which it sounds like it does), then we can't just change it.

If you don't provide the guarantee that the file will have the full contents at onStopRequest, how is the stream listener supposed to know when the file does have the full contents?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267863398.1267865404.26796.gz
Linux mozilla-central debug test xpcshell on 2010/03/06 00:16:38
s: moz2-linux-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1267974911.1267976924.24188.gz
Linux mozilla-central debug test xpcshell on 2010/03/07 07:15:11
(In reply to comment #68)
> What I meant was expose a way to determine what device was chosen. This
> determination happens before the tee is created - it happens during
> openOutputStream, so it would be possible to use this.

There is a lazy initialization. The device is chosen at the first write. This is too late for us.

#0  nsCacheService::EnsureEntryHasDevice (this=0xf1155980, entry=0xeb79c460)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1396
#1  0x00d34fbf in nsCacheService::OpenOutputStreamForEntry (entry=0xeb79c460, mode=2, offset=0, result=0xeb638648)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1725
#2  0x00d300f6 in nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit (this=0xeb638640)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.cpp:594
#3  0x00d30755 in nsCacheEntryDescriptor::nsOutputStreamWrapper::EnsureInit (this=0xeb638640)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.h:146
#4  0x00d30275 in nsCacheEntryDescriptor::nsOutputStreamWrapper::Write (this=0xeb638640, buf=0xf75242d2 "\037\245\213\245", count=
    1, result=0xf26ff174) at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.cpp:641
#5  0x01db1a3d in nsInputStreamTeeWriteEvent::Run (this=0xeb6419e0) at /opt/moz/hg/src/xpcom/io/nsInputStreamTee.cpp:114
#6  0x01de2136 in nsThread::ProcessNextEvent (this=0xf3699290, mayWait=1, result=0xf26ff23c)
    at /opt/moz/hg/src/xpcom/threads/nsThread.cpp:527
(In reply to comment #74)
> (In reply to comment #68)
> > What I meant was expose a way to determine what device was chosen. This
> > determination happens before the tee is created - it happens during
> > openOutputStream, so it would be possible to use this.
> 
> There is a lazy initialization. The device is chosen at the first write. This
> is too late for us.
> 
> #0  nsCacheService::EnsureEntryHasDevice (this=0xf1155980, entry=0xeb79c460)
>     at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1396
> #1  0x00d34fbf in nsCacheService::OpenOutputStreamForEntry (entry=0xeb79c460,
> mode=2, offset=0, result=0xeb638648)
>     at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1725
> #2  0x00d300f6 in nsCacheEntryDescriptor::nsOutputStreamWrapper::LazyInit
> (this=0xeb638640)
>     at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.cpp:594
> #3  0x00d30755 in nsCacheEntryDescriptor::nsOutputStreamWrapper::EnsureInit
> (this=0xeb638640)
>     at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.h:146
> #4  0x00d30275 in nsCacheEntryDescriptor::nsOutputStreamWrapper::Write
> (this=0xeb638640, buf=0xf75242d2 "\037\245\213\245", count=
>     1, result=0xf26ff174) at
> /opt/moz/hg/src/netwerk/cache/src/nsCacheEntryDescriptor.cpp:641
> #5  0x01db1a3d in nsInputStreamTeeWriteEvent::Run (this=0xeb6419e0) at
> /opt/moz/hg/src/xpcom/io/nsInputStreamTee.cpp:114
> #6  0x01de2136 in nsThread::ProcessNextEvent (this=0xf3699290, mayWait=1,
> result=0xf26ff23c)
>     at /opt/moz/hg/src/xpcom/threads/nsThread.cpp:527

Seems to me like (b) and (c) will be hard to achieve, given the way these things work.

We *might* leave the sync/async write decision to nsInputStreamTee itself (for example in nsInputStreamTee::TeeSegment), but that would require access to nsCacheEntryDescriptor::nsOutputStreamWrapper::EnsureInit() , plus biesis (c)-solution from above. Is putting such responsibility on a simple class like nsInputStreamTee ok? And is exposing EnsureInit() ok?
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268043227.1268045314.4670.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/08 02:13:47
s: moz2-darwin9-slave02
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268054450.1268056554.5655.gz
Linux mozilla-central debug test xpcshell on 2010/03/08 05:20:50
The assertion isn't caused by the sequence of shutdowns. The problem is that in nsMemoryCacheDevice::BindEntry() we are binding an entry that was already doomed and deactivated in nsCacheService. Here is a commented log:

> -143038576[8cf3d58]: Opening entry for session 909be10, key http://localhost:4444/test_registerdirectory.js, mode 2, blocking 0
> -143038576[8cf3d58]: Created request 909be48
> -143038576[8cf3d58]: Activate entry for request 909be48
> -143038576[8cf3d58]: Active entry for request 909be48 is 0
> -143038576[8cf3d58]: mMemoryDevice: 0x0
> -143038576[8cf3d58]: Device search for request 909be48 returned 0
> -143038576[8cf3d58]: Added entry 91d52c0 to mActiveEntries

Open cache entry for a HTTP request.


> -143038576[8cf3d58]: nsHttpChannel::OnStartRequest [this=9084070 request=90496b8 status=0]
> -143038576[8cf3d58]: nsHttpChannel::OnStopRequest [this=9084070 request=90496b8 status=0]
> -143038576[8cf3d58]: nsHttpChannel::FinalizeCacheEntry [this=9084070]

HTTP request finishes while no data has been written to the cache yet.


> -143038576[8cf3d58]: Opening entry for session 90e7148, key http://localhost:4444/test_registerdirectory.js, mode 2, blocking 0
> -143038576[8cf3d58]: Created request 91d5900
> -143038576[8cf3d58]: Activate entry for request 91d5900
> -143038576[8cf3d58]: Active entry for request 91d5900 is 91d52c0
> -143038576[8cf3d58]: Dooming entry 91d52c0
> -143038576[8cf3d58]: Removed entry 91d52c0 from mActiveEntries

New HTTP request is started and the old cache entry is doomed and deactivated.


> -192058512[921d818]: XXXXXX nsMemoryCacheDevice::BindEntry ++mEntryCount, mEntryCount=1, entry=0x91d52c0

Now we're writing to the doomed cache for the first time. nsMemoryCacheDevice::BindEntry() obviously doesn't expect this situation and mEntryCount is wrongly increased.


> -143038576[8cf3d58]: nsHttpChannel::CloseCacheEntry [this=9084070] mStatus=0 mCacheAccess=2
> -143038576[8cf3d58]: Deactivating entry 91d52c0
> -143038576[8cf3d58]: nsMemoryCacheDevice::DeactivateEntry for entry 0x91d52c0
> -143038576[8cf3d58]: deleted doomed entry 0x91d52c0

Writing to the cache entry has finished. nsMemoryCacheDevice::DeactivateEntry() just deletes the entry.


I'm not sure yet what is the best solution. It seems to me safe and reasonable to fail in nsCacheService::EnsureEntryHasDevice() when entry has no device and is already doomed.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268107445.1268109570.23918.gz
Linux mozilla-central debug test xpcshell on 2010/03/08 20:04:05 moz2-linux-slave26
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268108160.1268110233.25209.gz
Linux mozilla-central debug test xpcshell on 2010/03/08 20:16:00
s: moz2-linux-slave16
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268110266.1268112795.30043.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/08 20:51:06
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268113662.1268115717.12827.gz
Linux mozilla-central debug test xpcshell on 2010/03/08 21:47:42
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268122277.1268124237.362.gz
Linux mozilla-central debug test xpcshell on 2010/03/09 00:11:17
(In reply to comment #80)
> The assertion isn't caused by the sequence of shutdowns.

You're absolutely right. I implemented (a) from comment #35 yesterday but it didn't resolve the issue. Plan was to spend some time today to figure out why...

> The problem is that in
> nsMemoryCacheDevice::BindEntry() we are binding an entry that was already
> doomed and deactivated in nsCacheService.

...  but you've already figured it out. :)

> I'm not sure yet what is the best solution. It seems to me safe and reasonable
> to fail in nsCacheService::EnsureEntryHasDevice() when entry has no device and
> is already doomed.

This would avoid calling BindEntry() which causes the imbalance; result would propagate to nsInputStreamTeeWriteEvent::Run() and abort the write-loop, which doesn't hurt. However, it would also return early from nsCacheEntryDescriptor::
nsOutputStreamWrapper::Write() and thus not call OnWrite() nor mOutput->Write(). Not calling OnWrite() means not updating DataSize() and I'm not certain what the consequence of this may be. Not calling mOutput->Write() is probably ok since it just means that we write this (doomed) entry to the storage-stream.

I made a quick patch implementing this idea and I cannot any longer reproduce the assertion locally. Waiting for results from tryserver...
> Not calling mOutput->Write()
> is probably ok since it just means that we write this (doomed) entry to the
> storage-stream.

Should be "...means that we *don't* write this (doomed) ..." of-course

Results from tryserver seem ok. Should we also try to address bz's concern from comment #16 here?
(In reply to comment #87)
> Should we also try to address bz's concern from comment #16 here?

Yes, but we can do it in separate bug.
Attached patch Early exit from nsCacheService::EnsureEntryHasDevice() (obsolete) (deleted) — — Splinter Review
Attachment #429118 - Attachment is obsolete: true
Attachment #431339 - Flags: review?(michal.novotny)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268155271.1268158011.19347.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/09 09:21:11
Comment on attachment 431339 [details] [diff] [review]
Early exit from nsCacheService::EnsureEntryHasDevice()

The review should do somebody else because I've suggested this change.

I think that the fix is correct. But I'm still trying to figure out if it makes sense in some situation to bind doomed entry. I don't understand the (!entry->IsDoomed()) condition in nsMemoryCacheDevice::BindEntry() and I think that it is wrong. All BindEntry() methods are called only from nsCacheService::EnsureEntryHasDevice() so with this patch the condition in nsMemoryCacheDevice::BindEntry() can be removed.
Attachment #431339 - Flags: review?(michal.novotny)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268179475.1268182013.24633.gz
Linux mozilla-central debug test xpcshell on 2010/03/09 16:04:35
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268186595.1268189117.10967.gz
Linux mozilla-central debug test xpcshell on 2010/03/09 18:03:15
s: moz2-linux-slave27
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268198922.1268201034.6481.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/09 21:28:42
s: moz2-darwin9-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268202641.1268204374.14450.gz
Linux mozilla-central debug test xpcshell on 2010/03/09 22:30:41
s: moz2-linux-slave25
> But I'm still trying to figure out if it makes
> sense in some situation to bind doomed entry. I don't understand the
> (!entry->IsDoomed()) condition in nsMemoryCacheDevice::BindEntry() and I think
> that it is wrong.

An observation is that the same condition exists in nsDiskCacheDevice::BindEntry(). nsOfflineCacheDevice::BindEntry() does not include this condition, but has comments about binding doomed entries. Hence, binding doomed entries seems to be a situation taken into account by the designers of this code, which suggests to me that we may have to allow it.

There may be an alternative solution : In nsCacheService::DoomEntry_Internal(), device->DoomEntry() is called iff (obviously) the device is found. I.e. for an unbound entry, nsMemoryCacheDevice::DoomEntry() is not called. Because nsMemoryCacheDevice::DoomEntry() calls EvictEntry(), which ultimately decreases mEntryCount, this is the other side of the imbalance we see. A way to resolve this is to increase mEntryCount in nsMemoryCacheDevice::BindEntry() iff the entry is not doomed. I've made a patch implementing this and can no longer reproduce this problem locally. Pushed to tryserver...
(In reply to comment #88)
> (In reply to comment #87)
> > Should we also try to address bz's concern from comment #16 here?
> 
> Yes, but we can do it in separate bug.

Should I file it?
Attached patch Proper accounting in BindEntry() (obsolete) (deleted) — — Splinter Review
Attachment #431339 - Attachment is obsolete: true
Attachment #431600 - Flags: review?(michal.novotny)
Comment on attachment 431600 [details] [diff] [review]
Proper accounting in BindEntry()

Results from tryserver are ok
Attachment #431600 - Flags: review?(michal.novotny) → review-
Comment on attachment 431600 [details] [diff] [review]
Proper accounting in BindEntry()

If the doomed entry is initialized then nsMemoryCacheDevice::OnDataSizeChange() is called on write and mTotalSize is increased. Entry's size is subtracted from mTotalSize in nsMemoryCacheDevice::EvictEntry() but this method isn't called for these entries.

I really don't see any reason why we should allow to bind doomed entries and write data to them. I'd guess that such situation didn't happen before the patch from bug #513074 landed so I'm really curious if the disk and offline cache can handle it correctly.
(In reply to comment #97)
> Should I file it?

Yes, please.
> (In reply to comment #97)
> > Should I file it?
> 
> Yes, please.

Bug 551447.

> I really don't see any reason why we should allow to bind doomed entries and
> write data to them. I'd guess that such situation didn't happen before the
> patch from bug #513074 landed so I'm really curious if the disk and offline
> cache can handle it correctly.

I don't see the reason either, but the observation is that there is explicit code to deal with it and this code was there before the patch for bug #513074 landed, so there must be something about it. :)

What does it mean that an entry is doomed? In my understanding, dooming a cache-entry means that it is no longer needed. What does it mean to bind an entry? In my understanding, it means to associate the entry with a device capable of storing it, i.e. it "gets" storage. From this, the term "bind a doomed entry" means (more or less) "find storage for an unneeded entry", which means that the entry can probably be discarded as you say.

> (From update of attachment 431600 [details] [diff] [review])
> If the doomed entry is initialized then nsMemoryCacheDevice::OnDataSizeChange()
> is called on write and mTotalSize is increased. Entry's size is subtracted from
> mTotalSize in nsMemoryCacheDevice::EvictEntry() but this method isn't called
> for these entries.

You're right - mTotalSize would end up wrong. But we could move the increase of mTotalSize into the if-block in nsMemoryCacheDevice::OnDataSizeChange(), no?
> From this, the term "bind a
> doomed entry" means (more or less) "find storage for an unneeded entry", which
> means that the entry can probably be discarded as you say.

For the record : I see no way to un-doom an entry, so a doomed entry will stay doomed and useless.
(In reply to comment #103)
> I don't see the reason either, but the observation is that there is explicit
> code to deal with it and this code was there before the patch for bug #513074
> landed, so there must be something about it. :)

On the other hand, if this didn't happen before the patch from bug #513074
landed then the code in the disk/offline cache is probably untested.

> What does it mean that an entry is doomed? In my understanding, dooming a
> cache-entry means that it is no longer needed. What does it mean to bind an
> entry? In my understanding, it means to associate the entry with a device
> capable of storing it, i.e. it "gets" storage. From this, the term "bind a
> doomed entry" means (more or less) "find storage for an unneeded entry", which
> means that the entry can probably be discarded as you say.

This is exactly how I see it.

> You're right - mTotalSize would end up wrong. But we could move the increase of
> mTotalSize into the if-block in nsMemoryCacheDevice::OnDataSizeChange(), no?

But in this case mTotalSize wouldn't contain correct information about occupied
memory.
(In reply to comment #104)
> For the record : I see no way to un-doom an entry, so a doomed entry will stay
> doomed and useless.

Yes, doomed entries can't be "undoomed". This is also stated in nsICacheEntryDescriptor.idl.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268263432.1268266102.15325.gz
Linux mozilla-central debug test xpcshell on 2010/03/10 15:23:52
s: moz2-linux-slave16
> On the other hand, if this didn't happen before the patch from bug #513074
> landed then the code in the disk/offline cache is probably untested.

Possibly, but why would code and comments about binding doomed entries be there in the first place? By design? My point is that if we change design-decisions in legacy code we better know what we are doing, or at least test and guard carefully against regressions and weird side-effects.

Example of side-effect : The current code allows doomed entries to be stored in the cache (although these entries are kept out of the inventory for the device). This means that it will always be possible to retrieve data from a cache-entry by clients (even when the entry is doomed), right?  AFAICS, the fix suggested in comment 80 will change this.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268333808.1268336014.16633.gz
Linux mozilla-central debug test xpcshell on 2010/03/11 10:56:48
s: moz2-linux-slave28
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268336680.1268339454.27873.gz
Linux mozilla-central debug test xpcshell on 2010/03/11 11:44:40
s: moz2-linux-slave07
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268341118.1268343406.6776.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/11 12:58:38
s: moz2-darwin9-slave14
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268380446.1268382442.3564.gz
Linux mozilla-central debug test xpcshell on 2010/03/11 23:54:06
s: moz2-linux-slave20
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268392384.1268394680.8641.gz
Linux mozilla-central debug test xpcshell on 2010/03/12 03:13:04
s: moz2-linux-slave20
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268455176.1268457630.27665.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/12 20:39:36
s: moz2-darwin9-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268458927.1268460891.2460.gz
Linux mozilla-central debug test xpcshell on 2010/03/12 21:42:07
s: moz2-linux-slave16
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268483031.1268485042.4392.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/13 04:23:51
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268506386.1268508269.18569.gz
Linux mozilla-central debug test xpcshell on 2010/03/13 10:53:06
s: moz2-linux-slave28
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268524413.1268527304.31742.gz
Linux mozilla-central debug test xpcshell on 2010/03/13 15:53:33
s: moz2-linux-slave24
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268555000.1268557014.25107.gz
Linux mozilla-central debug test xpcshell on 2010/03/14 00:23:20
s: moz2-linux-slave28
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268573199.1268575106.12148.gz
Linux mozilla-central debug test xpcshell on 2010/03/14 06:26:39
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268588709.1268590688.19829.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/14 10:45:09
s: moz2-darwin9-slave16
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268604203.1268606652.25020.gz
Linux mozilla-central debug test xpcshell on 2010/03/14 15:03:23
s: moz2-linux-slave27
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268604476.1268607110.26224.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/14 15:07:56
s: moz2-darwin9-slave16
(In reply to comment #96)
> nsDiskCacheDevice::BindEntry(). nsOfflineCacheDevice::BindEntry() does not
> include this condition, but has comments about binding doomed entries. Hence,

I couldn't find that comment. Where is it?

(In reply to comment #110)
> Example of side-effect : The current code allows doomed entries to be stored in
> the cache (although these entries are kept out of the inventory for the
> device). This means that it will always be possible to retrieve data from a
> cache-entry by clients (even when the entry is doomed), right?  AFAICS, the fix
> suggested in comment 80 will change this.

It is possible to read from doomed entries, but also it isn't guaranteed that writing to the cache must always succeed. E.g. it will fail when entry's size goes above half of storage capacity. So failing in nsCacheService::EnsureEntryHasDevice() would be just another failure that can occur.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268708758.1268710700.20768.gz
Linux mozilla-central debug test xpcshell on 2010/03/15 20:05:58
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268762743.1268764374.7179.gz
Linux mozilla-central debug test xpcshell on 2010/03/16 11:05:43
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268767516.1268769535.21471.gz
Linux mozilla-central debug test xpcshell on 2010/03/16 12:25:16
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268781518.1268783529.27773.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/16 16:18:38
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268789332.1268791710.15859.gz
Linux mozilla-central debug test xpcshell on 2010/03/16 18:28:52
s: moz2-linux-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268812953.1268815045.6245.gz
Linux mozilla-central debug test xpcshell on 2010/03/17 01:02:33
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268822256.1268824183.2035.gz
Linux mozilla-central debug test xpcshell on 2010/03/17 03:37:36
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268817722.1268820214.23803.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/17 02:22:02
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268826668.1268828434.21699.gz
Linux mozilla-central debug test xpcshell on 2010/03/17 04:51:08
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268844382.1268846101.13559.gz
Linux mozilla-central debug test xpcshell on 2010/03/17 09:46:22
s: moz2-linux-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268876435.1268878435.11513.gz
Linux mozilla-central debug test xpcshell on 2010/03/17 18:40:35
s: moz2-linux-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268877434.1268879533.14026.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/17 18:57:14
s: moz2-darwin9-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268915622.1268917987.30494.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/18 05:33:42
s: moz2-darwin9-slave20
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268952214.1268954697.15670.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/18 15:43:34
s: moz2-darwin9-slave22
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268978772.1268981518.15585.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/18 23:06:12
s: moz2-darwin9-slave02

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1268977606.1268980342.12260.gz
Linux mozilla-central debug test xpcshell on 2010/03/18 22:46:46
s: moz2-linux-slave20
> (In reply to comment #96)
> > nsDiskCacheDevice::BindEntry(). nsOfflineCacheDevice::BindEntry() does not
> > include this condition, but has comments about binding doomed entries. Hence,
> 
> I couldn't find that comment. Where is it?

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsDiskCacheDeviceSQL.cpp#1371 

> It is possible to read from doomed entries, but also it isn't guaranteed that
> writing to the cache must always succeed.

What if the request which created the entry is an async read-request?
Attached patch New doom-strategy in ActivateEntry() (obsolete) (deleted) — — Splinter Review
While working on bug #549767 this possible solution dawned on me. It passes about 2000 iterations of "test_setindexhandler.js" without the mem-cache problem, but I eventually get the "ACCESS_WRITE request should have been given a new entry" assertion in ProcessPendingRequests(). I'd like input to the approach while deciding how to deal with the new assertion...
Attachment #431600 - Attachment is obsolete: true
Attachment #433542 - Flags: review?(michal.novotny)
(In reply to comment #149)
> > (In reply to comment #96)
> > > nsDiskCacheDevice::BindEntry(). nsOfflineCacheDevice::BindEntry() does not
> > > include this condition, but has comments about binding doomed entries. Hence,
> > 
> > I couldn't find that comment. Where is it?
> 
> http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsDiskCacheDeviceSQL.cpp#1371 
> 
> > It is possible to read from doomed entries, but also it isn't guaranteed that
> > writing to the cache must always succeed.
> 
> What if the request which created the entry is an async read-request?

Mmmm...  I see now that this is proabbly not possible...  any async read-request would either find the entry in |activeEntries,| or by searching the device. In both cases the entry is bound, right?
Attached patch New doom-strategy in ActivateEntry() version 2 (obsolete) (deleted) — — Splinter Review
Updated patch to avoid the new assertion. It does 3500+ iterations locally without failing (i.e. it's *really* hard to reproduce the assertion, if at all possible). Pushed to tryserver...
Attachment #433542 - Attachment is obsolete: true
Attachment #433546 - Flags: review?(michal.novotny)
Attachment #433542 - Flags: review?(michal.novotny)
(In reply to comment #149)
> http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsDiskCacheDeviceSQL.cpp#1371 

That comment says nothing about binding doomed entries. All that it says is that dooming the entry removes appropriate row from the table, so when binding a new entry (not doomed) it is safe to use insert statement.

> > What if the request which created the entry is an async read-request?
> 
> Mmmm...  I see now that this is proabbly not possible...  any async
> read-request would either find the entry in |activeEntries,| or by searching
> the device. In both cases the entry is bound, right?

Yes, you can't get unbound entry when opening cache entry for read-only access.
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269039844.1269041974.4638.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/19 16:04:04
s: moz2-darwin9-slave41
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269041332.1269042850.7081.gz
WINNT 5.2 mozilla-central opt test xpcshell on 2010/03/19 16:28:52
s: win32-slave20
Attachment #433546 - Flags: review?(michal.novotny) → review-
Comment on attachment 433546 [details] [diff] [review]
New doom-strategy in ActivateEntry()  version 2

(In reply to comment #152)
> +  // which is pending on a thread somewhere. Now, if we doom the entry,
> +  // the question is what to do when/if the pending request is processed
> +  // and the entry tries to bind. See bug #548406. Current strategy is to
> +  // leave the entry and allow it to be bound later, but create a new
> +  // entry for this request.

We need to doom the old entry and remove it from mActiveEntries. I've got following assertion with this patch:

#0  0x008ee832 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x0051bd66 in kill () at ../sysdeps/unix/syscall-template.S:82
#2  0x01fd033c in NS_DebugBreak_P (aSeverity=1, aStr=0x228b500 "### nsCacheEntryHashTable::AddEntry - entry already used", aExpr=
    0x228b4c4 "((nsCacheEntryHashTableEntry *)hashEntry)->cacheEntry == 0", aFile=
    0x228b1b4 "/opt/moz/hg/src/netwerk/cache/src/nsCacheEntry.cpp", aLine=465) at /opt/moz/hg/src/xpcom/base/nsDebugImpl.cpp:341
#3  0x00ee571e in nsCacheEntryHashTable::AddEntry (this=0x8609320, cacheEntry=0xcabf428)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheEntry.cpp:464
#4  0x00eeb9f5 in nsCacheService::ActivateEntry (this=0x86092e8, request=0xeaf87e8, result=0xffbccc38)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1339
#5  0x00eeb2e5 in nsCacheService::ProcessRequest (this=0x86092e8, request=0xeaf87e8, calledFromOpenCacheEntry=1, result=0xcabf078)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1151
#6  0x00eeb63a in nsCacheService::OpenCacheEntry (session=0xc7ce618, key=..., accessRequested=2, blockingMode=0, listener=0x0, 
    result=0xcabf078) at /opt/moz/hg/src/netwerk/cache/src/nsCacheService.cpp:1236
#7  0x00eef16d in nsCacheSession::OpenCacheEntry (this=0xc7ce618, key=..., accessRequested=2, blockingMode=0, result=0xcabf078)
    at /opt/moz/hg/src/netwerk/cache/src/nsCacheSession.cpp:98
#8  0x00f490ac in nsHttpChannel::OpenCacheEntry (this=0xcabef68, offline=0, delayed=0xffbccebc)
    at /opt/moz/hg/src/netwerk/protocol/http/src/nsHttpChannel.cpp:1807
#9  0x00f43b75 in nsHttpChannel::Connect (this=0xcabef68, firstTime=1)
    at /opt/moz/hg/src/netwerk/protocol/http/src/nsHttpChannel.cpp:292
#10 0x00f5226b in nsHttpChannel::AsyncOpen (this=0xcabef68, listener=0x87fb598, context=0x0)
    at /opt/moz/hg/src/netwerk/protocol/http/src/nsHttpChannel.cpp:4491
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269108763.1269111182.19449.gz
Linux mozilla-central debug test xpcshell on 2010/03/20 11:12:43
s: moz2-linux-slave25
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269131905.1269133911.30658.gz
Linux mozilla-central debug test xpcshell on 2010/03/20 17:38:25
s: moz2-linux-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269132420.1269134516.31947.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/20 17:47:00
s: moz2-darwin9-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269153357.1269155518.4772.gz&fulltext=1
Summary: xpcshell tests: assertion in nsMemoryCacheDevice::Shutdown() causing abort test failed (with xpcshell return code: -6) test_setindexhandler.js, test_registerdirectory.js, test_sjs.js → xpcshell tests: assertion "mem cache leaking entries" in nsMemoryCacheDevice::Shutdown() causing abort test failed (with xpcshell return code: -6) test_setindexhandler.js, test_registerdirectory.js, test_sjs.js
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269188304.1269190346.8948.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/21 09:18:24
s: moz2-darwin9-slave40
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269202367.1269204377.5219.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/21 13:12:47
s: moz2-darwin9-slave15
Testcase from comment #165 is very useful!

(In reply to comment #159)
> We need to doom the old entry and remove it from mActiveEntries. I've got
> following assertion with this patch:
> 
> #0  0x008ee832 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
> #1  0x0051bd66 in kill () at ../sysdeps/unix/syscall-template.S:82
> #2  0x01fd033c in NS_DebugBreak_P (aSeverity=1, aStr=0x228b500 "###
>   [ snip ]

Yup - I have the same assertion deep in my logs...  (XPCOM_DEBUG_BREAK not set :( )

If I simply remove the unbound entries from mActiveEntries and call entry->DetachDescriptors(), the testcase produces no assertions and seems to leak only one object. Do you think it would help to put the old entry in some list to make sure it's removed at shutdown? (I'll give this a whack tomorrow I guess...)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269209517.1269211735.22127.gz
Linux mozilla-central debug test xpcshell on 2010/03/21 15:11:57
s: moz2-linux-slave20
(In reply to comment #167)
> leak only one object. Do you think it would help to put the old entry in some
> list to make sure it's removed at shutdown? (I'll give this a whack tomorrow I
> guess...)

I still do think that correct solution was the patch in comment #89. Boris, Christian, could you please have a look at it?
Yeah - I'm starting to think you're right...  the last suggested solution also relies on preventing the entry from binding to the device (detaching the descriptors in this case), which has been your preferred approach all the time.
OTOH, the last suggested suggestion is more specific in the sense that it only prevents force-write-requests-to-unbound-entry to bind, as opposed to preventing all doomed entries from binding (ref second paragraph in comment #91).

Maybe a strategy involving an early bind might make sense here (binding the entry prior to usage)?
(In reply to comment #171)
> OTOH, the last suggested suggestion is more specific in the sense that it only
> prevents force-write-requests-to-unbound-entry to bind, as opposed to
> preventing all doomed entries from binding (ref second paragraph in comment
> #91).

And why is this better?

> Maybe a strategy involving an early bind might make sense here (binding the
> entry prior to usage)?

That's not necessary. We could simply bind any unbound entry in nsCacheService::DoomEntry_Internal().
(In reply to comment #172)
> (In reply to comment #171)
> > OTOH, the last suggested suggestion is more specific in the sense that it only
> > prevents force-write-requests-to-unbound-entry to bind, as opposed to
> > preventing all doomed entries from binding (ref second paragraph in comment
> > #91).
> 
> And why is this better?

It's more "specific" because it affects only requests we know causes problems in this context (force-writes to unbound entries). Entries that are doomed for other reasons (whatever these may be - if any) are unaffected. Whether this is "better" or not can be subject to discussion.

> > Maybe a strategy involving an early bind might make sense here (binding the
> > entry prior to usage)?
> 
> That's not necessary. We could simply bind any unbound entry in
> nsCacheService::DoomEntry_Internal().

That's also a possibility, yes. My suggestion was triggered by this comment 

http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsCacheService.cpp#1318
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269252873.1269255079.18967.gz
Linux mozilla-central debug test xpcshell on 2010/03/22 03:14:33  
s: moz2-linux-slave26
(In reply to comment #155)
> (In reply to comment #149)
> > http://mxr.mozilla.org/mozilla-central/source/netwerk/cache/src/nsDiskCacheDeviceSQL.cpp#1371 
> 
> That comment says nothing about binding doomed entries. All that it says is
> that dooming the entry removes appropriate row from the table, so when binding
> a new entry (not doomed) it is safe to use insert statement.

IMO it advocates that the code is ok even if the entry is doomed, i.e. calling BindEntry() with a doomed entry is considered a possible scenario. If calling BindEntry() with a doomed entry was never meant to happen, why add this comment? (Moreover, why add the if-blocks in the other BindEntry() methods?)
Attached patch Improved doom-strategy in ActivateEntry() version 3 (obsolete) (deleted) — — Splinter Review
Adding the entries to a list and delete them at shutdown fixes the mentioned leak.

With this approach we allow ACCESS_WRITE requests to preempt pending, unbound entries (representing the same resource, obviously) without dooming them. "Preemt" in this context means "prevent them from binding to a device", and it makes the first actual access (it's lazy-initialized) to the entry fail. Such pending entries must have been created initially by WRITE-requests (since they are unbound), hence it makes perfect sense to preempt them upon receiving another ACCESS_WRITE request.

Since we don't doom these entries, other unbound cache-entries doomed for other reasons will not be affected. (The debate is really about whether such other reasons exist.)

An interesting point is how to deal with pending read-requests (if this is possible) to the entry being preempted. Maybe we can use Michals testcase to build up a number of combinations of such requests.
Attachment #433546 - Attachment is obsolete: true
Attachment #433930 - Flags: review?(michal.novotny)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269278269.1269280522.6499.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/22 10:17:49
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269280735.1269284588.19029.gz
Linux mozilla-central debug test xpcshell on 2010/03/22 10:58:55
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269290358.1269292463.9167.gz
Linux mozilla-central debug test xpcshell on 2010/03/22 13:39:18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269301769.1269303815.9117.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/22 16:49:29
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269314453.1269316749.11264.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/22 20:20:53
s: moz2-darwin9-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269358368.1269360159.13071.gz
Linux mozilla-central debug test xpcshell on 2010/03/23 08:32:48
s: moz2-linux-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269380475.1269382625.14940.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/23 14:41:15
s: moz2-darwin9-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269449499.1269452016.21481.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/24 09:51:39
s: moz2-darwin9-slave05
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269507207.1269509256.19562.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/25 01:53:27  
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269530264.1269532437.20261.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/25 08:17:44
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269533565.1269536159.1427.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/25 09:12:45
s: moz2-darwin9-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269535421.1269538207.8826.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/25 09:43:41
s: moz2-darwin9-slave23
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269555344.1269557448.6578.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/25 15:15:44
s: moz2-darwin9-slave07
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269561469.1269563408.8391.gz
Linux mozilla-central debug test xpcshell on 2010/03/25 16:57:49
s: mv-moz2-linux-ix-slave03
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269592353.1269594339.29833.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/26 01:32:33
s: moz2-darwin9-slave16
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269613875.1269615873.7440.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/26 07:31:15  
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269636611.1269638737.9952.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/26 13:50:11
s: moz2-darwin9-slave14
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269639525.1269641489.17354.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/26 14:38:45
s: moz2-darwin9-slave40
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269641655.1269643712.23002.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/26 15:14:15
s: moz2-darwin9-slave40
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269675721.1269677716.30513.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/27 00:42:01
s: moz2-darwin9-slave15
http://tinderbox.mozilla.org/addnote.cgi?log=Firefox/1269821268.1269822281.13315.gz
OS X 10.5.2 mozilla-central opt test xpcshell on 2010/03/28 17:07:48

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269820464.1269822446.13602.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/28 16:54:24

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269817938.1269819943.8410.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/28 16:12:18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269835033.1269837297.11128.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/28 20:57:13
s: moz2-darwin9-slave22
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269906099.1269908165.496.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/29 16:41:39
s: moz2-darwin9-slave07
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269951825.1269953822.5378.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 05:23:45
s: moz2-darwin9-slave39
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269977058.1269979076.14008.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 12:24:18
s: moz2-darwin9-slave14
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269978435.1269980522.17947.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 12:47:15
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269984612.1269986590.4679.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 14:30:12
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1269986073.1269988056.9653.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 14:54:33
s: moz2-darwin9-slave41
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270002569.1270004581.21601.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/30 19:29:29
s: moz2-darwin9-slave10
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270042027.1270044047.15620.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/31 06:27:07
s: moz2-darwin9-slave13
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270057585.1270059579.31457.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/31 10:46:25
s: moz2-darwin9-slave41
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270058702.1270060741.2396.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/03/31 11:05:02
s: moz2-darwin9-slave09
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270131544.1270133550.8793.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 07:19:04
s: moz2-darwin9-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270131544.1270133550.8793.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 07:19:04
s: moz2-darwin9-slave15
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270158548.1270160876.27103.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 14:49:08
s: moz2-darwin9-slave21
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270178856.1270180867.12191.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 20:27:36
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270181500.1270183564.17701.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 21:11:40
s: moz2-darwin9-slave05
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270190080.1270192090.3640.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/01 23:34:40
s: moz2-darwin9-slave43
OS X 10.5.2 mozilla-central debug test xpcshell [testfailed] Started 10:06, finished 10:45
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270227982.1270230272.29591.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270236713.1270238771.19315.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/02 12:31:53
s: moz2-darwin9-slave11
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270242311.1270244317.1638.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/02 14:05:11
s: moz2-darwin9-slave17
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270298408.1270301021.20844.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/03 05:40:08
s: moz2-darwin9-slave19
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270477098.1270479715.25830.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/05 07:18:18
s: moz2-darwin9-slave19

TEST-UNEXPECTED-FAIL | /builds/slave/mozilla-central-macosx-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_registerdirectory.js | test failed (with xpcshell return code: -6), see following log:
TEST-UNEXPECTED-FAIL | /builds/slave/mozilla-central-macosx-debug-unittest-xpcshell/build/xpcshell/tests/test_necko/test/test_sjs.js | test failed (with xpcshell return code: -6), see following log:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270565684.1270567699.8014.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/04/06 07:54:44
s: talos-r3-leopard-015
(log also shows Bug 523936)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270544559.1270546801.27084.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/06 02:02:39
s: moz2-darwin9-slave18
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270586103.1270588815.10200.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/04/06 13:35:03
s: talos-r3-leopard-022
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1270627802.1270630117.26727.gz
OS X 10.5.2 mozilla-central debug test xpcshell on 2010/04/07 01:10:02
s: moz2-darwin9-slave02
(In reply to comment #175)
> > That comment says nothing about binding doomed entries. All that it says is
> > that dooming the entry removes appropriate row from the table, so when binding
> > a new entry (not doomed) it is safe to use insert statement.
> 
> IMO it advocates that the code is ok even if the entry is doomed, i.e. calling
> BindEntry() with a doomed entry is considered a possible scenario. If calling
> BindEntry() with a doomed entry was never meant to happen, why add this
> comment? (Moreover, why add the if-blocks in the other BindEntry() methods?)

I think that you understand it wrong. IMHO It is there only to explain that
INSERT statement can be used, and that it isn't needed to use DELETE+INSERT or
UPDATE statement.
Attachment #433930 - Flags: review?(michal.novotny)
Comment on attachment 433930 [details] [diff] [review]
Improved doom-strategy in ActivateEntry()  version 3

> +   // Current strategy is to deactivate the entry and detach descriptors
> +   // (which will prevent it from being bound later). Entry is not doomed.

I can't see a reason for doing this. Also postponing the deletion of the entry to the cache shutdown isn't a good idea.

Again, I think that the fix should be either the one in comment #89 or the one mentioned at the end of comment #172.
Attached patch Implements suggested approach in comment 172 (obsolete) (deleted) — — Splinter Review
(In reply to comment #245)
> Again, I think that the fix should be either the one in comment #89 or the one
> mentioned at the end of comment #172.

Well, this one ensures that an entry about to be doomed is bound... (comment #172 as I understand it). Pushed to tryserver...
Attachment #439357 - Flags: review?(michal.novotny)
Comment on attachment 439357 [details] [diff] [review]
Implements suggested approach in comment 172

Hmmm..  the approach in comment #172 may be sound, but my implementation was not. Please forget this patch at the moment...
Attachment #439357 - Flags: review?(michal.novotny)
Attached patch Simple testcase showing some issues (deleted) — — Splinter Review
Please have a look at the attached test. It loads an URL three times, third time with the flag to bypass cache (forced reload). It expects to receive a cached result on the second load and a fresh result on the third load.

Note that these channel-loads are not serialized so that the first response is received before the second request is sent etc. This means cache-requests may/will be queued because the first cache-request has not been finalized before the second is created. Note also that the third channel-load will become an ACCESS_WRITE cache-request like in the test given in comment #165, and that the main difference from that test is the second channel-load.

Running this test as given produces the assertion seen in comment #159 and applying the fix from comment #89 doesn't change this. By removing the second channel-load we get the assertion about memory-cache leaking entries (the topic of this bug), which is handled by e.g. the fix from comment #89.

IMHO, we should try to find a solution which allows us to handle this situation also, i.e. a sequence with an unbound write-request followed by a read-request and then a force-write. IMO, we must be careful about what we do to the unbound write-request because we would like to serve the read-request, right? Or put another way : We don't want the clients of the second and third channel-loads to depend on how fast the first channel-load is written into cache?
The test looks fine to me, i.e. code like that should not assert.
Interestingly enough, disabling async cache-writing doesn't help here either...  we don't even get to nsHttpChannel::InstallCacheListener()
(In reply to comment #271)
> Running this test as given produces the assertion seen in comment #159 and
> applying the fix from comment #89 doesn't change this. By removing the second
> channel-load we get the assertion about memory-cache leaking entries (the topic
> of this bug), which is handled by e.g. the fix from comment #89.

These are 2 independent bugs, so we should fix this issue in a separate bug.

> IMHO, we should try to find a solution which allows us to handle this situation
> also, i.e. a sequence with an unbound write-request followed by a read-request
> and then a force-write. IMO, we must be careful about what we do to the unbound
> write-request because we would like to serve the read-request, right? Or put
> another way : We don't want the clients of the second and third channel-loads
> to depend on how fast the first channel-load is written into cache?

What do you mean by "unbound write request"? We have unbound entries but not unbound requests.
(In reply to comment #271)
> Please have a look at the attached test. It loads an URL three times, third
> time with the flag to bypass cache (forced reload). It expects to receive a
> cached result on the second load and a fresh result on the third load.

The expected result depends on how we fix this issue. Basically, I see 2 options:

1) In case we doom the entry because of write-force request we also have to fail all queued requests.
2) Fail the write-force request when there are some queued requests for the entry that should be doomed.

Option 1 makes more sense to me and in this case the result would be 1,2,3. In case of option 2 the result would be 1,1,3.
> The expected result depends on how we fix this issue.

Shouldn't we rather select how to fix the issue based on a predetermined expected result? :)

> Option 1 makes more sense to me and in this case the result would be 1,2,3. In
> case of option 2 the result would be 1,1,3.

AFAICS, option 1 would give 1,2,3 or 1,1,3 depending on timing between requests and speed of the cache-device, which IMO is completely unreasonable. Hence, option 2 may be preferred just because it's deterministic. In fact, I also intuitively like the 1,1,3 result better because there is nothing wrong with the two first requests, so they should behave like the third request doesn't exist.

> What do you mean by "unbound write request"? We have unbound entries but not
> unbound requests.

Sorry : "Write-request to unbound entry". More precisely : "write-request to an existing entry which is not bound yet".

A thought-experiment : Three XMLHttpRequests access a resource like the test does (first two requests are normal, third has a bypass-cache flag). Which result-value would we expect for the second XMLHttpRequest?
First of all, I think that we should move the discussion to the separate bug since it isn't related to this bug.


(In reply to comment #281)
> > The expected result depends on how we fix this issue.
> 
> Shouldn't we rather select how to fix the issue based on a predetermined
> expected result? :)
> 
> > Option 1 makes more sense to me and in this case the result would be 1,2,3. In
> > case of option 2 the result would be 1,1,3.
> 
> AFAICS, option 1 would give 1,2,3 or 1,1,3 depending on timing between requests
> and speed of the cache-device, which IMO is completely unreasonable. Hence,
> option 2 may be preferred just because it's deterministic. In fact, I also
> intuitively like the 1,1,3 result better because there is nothing wrong with
> the two first requests, so they should behave like the third request doesn't
> exist.

But I don't think the result must be deterministic in this case. The cache doesn't guarantee that you can always read back what you've written. And this is exactly the case. HTTP request #3 wants to bypass the cache (and the caller probably knows why he asked that) so a new content is expected and the cache should be updated. And in our case this means that HTTP request #2 won't use cache (which may be obsolete). This seems to me reasonable. Also please note that your testcase is artificial test that IMO doesn't reflect real scenarios.


> Sorry : "Write-request to unbound entry". More precisely : "write-request to an
> existing entry which is not bound yet".

Ah, OK. Anyway IMO the issue isn't limited to unbound entries. The test would assert even if the entry would be already bound.


> A thought-experiment : Three XMLHttpRequests access a resource like the test
> does (first two requests are normal, third has a bypass-cache flag). Which
> result-value would we expect for the second XMLHttpRequest?

I would expect the correct response from the server (i.e. options 1). If the server's response differs from the cache content then we return right response. If the response is the same then we just sent one inefficient request to the server.
(In reply to comment #282)
> First of all, I think that we should move the discussion to the separate bug
> since it isn't related to this bug.

Ok. Bug #561313 is created and I'm following up on remaining comments in it. However, IMO solutions to this bug and bug #561313 are likely to be related.

Now, back to the original problem...  :)  How do we get rid of the imbalance in nsMemoryCache?

We know it is caused by a pending write-request which binds to the device after it is doomed by a subsequent force-write request. 

The phrase "pending" comes from async cache-writes, "binds to device" comes from policy of current code which allows doomed entries to bind, "after" comes from lazy initialization of cache-entries, and "doomed by" comes from the chosen policy for force-write requests. Eliminating any of these would resolve the issue, no?
In fact, fixing bug #551447 (see comment #103) will probably also resolve the issue we see here since we would not get pending write-requests to memory-cache.

I cannot remember having seen an increase in disk-cache issues after the async cache-write patch landed, so this may indicate that the disk-cache actually handles this...?
(In reply to comment #283)
> We know it is caused by a pending write-request which binds to the device after
> it is doomed by a subsequent force-write request. 
> 
> The phrase "pending" comes from async cache-writes, "binds to device" comes
> from policy of current code which allows doomed entries to bind, "after" comes
> from lazy initialization of cache-entries, and "doomed by" comes from the
> chosen policy for force-write requests. Eliminating any of these would resolve
> the issue, no?


Actually, that's not correct. It doesn't have to be async cache-write that binds entry after it was doomed (see testcase in comment #165). And also it doesn't have to be write-force request that dooms the entry (see this new slightly changed testcase). That's why I think that this bug and bug #561313 aren't related. And fixing bug #551447 won't fix this bug since both testcases will still reproduce the issue.
(In reply to comment #285)
> Created an attachment (id=441094) [details]
> slightly modified version of attachment #433827 [details]

I'm glad you agree with me on the subject of artificial testcases...  ;)

> (In reply to comment #283)
> Actually, that's not correct. It doesn't have to be async cache-write that
> binds entry after it was doomed (see testcase in comment #165). And also it
> doesn't have to be write-force request that dooms the entry (see this new
> slightly changed testcase). That's why I think that this bug and bug #561313
> aren't related. And fixing bug #551447 won't fix this bug since both testcases
> will still reproduce the issue.

I must learn to express myself clearer.... I meant to say that the issue we see in *this* bug (occasional imbalanced counter in nsMemoryCache) is caused by a pending write-request which binds to the device after its entry has been doomed by a subsequent force-write request. Moreover, I meant (and still means) that fixing bug #551447 will most likely resolve the issue we see in this bug.

Given the available API it will be pretty difficult to prevent a determined programmer to mess up the cache, even from JS like you latest test demonstrates. It doesn't mean that it's impossible to prevent this, though, but perhaps this more general problem should be moved to a separate bug?
Attached file just another testcase (deleted) —
(In reply to comment #291)
> I must learn to express myself clearer.... I meant to say that the issue we see
> in *this* bug (occasional imbalanced counter in nsMemoryCache) is caused by a
> pending write-request which binds to the device after its entry has been doomed
> by a subsequent force-write request. Moreover, I meant (and still means) that
> fixing bug #551447 will most likely resolve the issue we see in this bug.

No, it won't. I've created another testcase that is more realistic. It requires elevated privileges so it can't be run on webpage, but such code can occur in any add-on and it shouldn't assert on shutdown. Just disable disk cache before running the test.

Another real scenario is even more simple:
1) disable disk cache
2) load page http://michal.etc.cz/bug548406/p2.html in 2 tabs or windows (there is a 5s delay before server starts responding)
3) wait until both pages load and do shift-reload in both tabs/windows quickly so that the second reload is done within the 5 second delay

These both testcases will cause assert on shutdown even when bug #551447 is fixed.

> Given the available API it will be pretty difficult to prevent a determined
> programmer to mess up the cache, even from JS like you latest test
> demonstrates. It doesn't mean that it's impossible to prevent this, though, but
> perhaps this more general problem should be moved to a separate bug?

Why is it difficult? IMO both solutions mentioned in comment #245 will fix it.
Attached patch More reliable book-keeping in nsMemoryCacheDevice (obsolete) (deleted) — — Splinter Review
This patch improves book-keeping in nsMemoryCacheDevice. I get no assertions for any of the testcases attached to this bug. Pushed to tryserver for more thorough verification.
Attachment #433930 - Attachment is obsolete: true
Attachment #439357 - Attachment is obsolete: true
Attachment #443598 - Flags: review?(michal.novotny)
Comment on attachment 443598 [details] [diff] [review]
More reliable book-keeping in nsMemoryCacheDevice

Patch passes tryserver on Linux and WinNT. One mochitest-plain failure on OSX : /tests/dom/tests/mochitest/ajax/offline/test_simpleManifest.html

Cannot reproduce failure locally. The test seems to focus on the application-cache which should not be affected by this patch. Will look closer tomorrow...
Comment on attachment 443598 [details] [diff] [review]
More reliable book-keeping in nsMemoryCacheDevice

The failing test only uses the offline-/application-cache so it should not be connected to this patch. Hence, I claim that it passes tryserver and awaits review.
Comment on attachment 443598 [details] [diff] [review]
More reliable book-keeping in nsMemoryCacheDevice

With the current code mTotalSize is decreased when some entry is doomed. This doesn't happen with your patch.


> -    if (entry->IsDoomed()) {
> -#ifdef DEBUG
> -        // XXX verify we've removed it from mMemCacheEntries & eviction list
> -#endif
> -        delete entry;
> -        CACHE_LOG_DEBUG(("deleted doomed entry 0x%p\n", entry));
> -        return NS_OK;
> -    }

Why don't delete the entry in DeactivateEntry() when it is doomed ?


> + // This is an entry which was doomed before it bound to a device
> + // (see bug #548406). The entry is actually in the doomed-list of
> + // nsCacheService at this point, so we steal it from that list and
> + // add it to our eviction-list. Since it is doomed, we put it in
> + // the list of entries which will be evicted first
> + NS_ASSERTION(!PR_CLIST_IS_EMPTY(entry),"entry is not on doom-list!");
> + PR_REMOVE_AND_INIT_LINK(entry);
> + PR_APPEND_LINK(entry, &mEvictionList[kQueueCount - 1]);

Maybe do PR_INSERT_LINK ?


> nsMemoryCacheDevice::EvictEntry(nsCacheEntry * entry, PRBool deleteEntry)

With your patch EvictEntry() is called always with deleteEntry==TRUE, so remove the argument.
Attachment #443598 - Flags: review?(michal.novotny) → review-
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1273812505.1273814775.6380.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/05/13 21:48:25

s: talos-r3-leopard-035
TEST-UNEXPECTED-FAIL | /Users/cltbld/talos-slave/mozilla-central-leopard-debug-u-xpcshell/build/xpcshell/tests/test_necko/test/test_registerdirectory.js | test failed (with xpcshell return code: -6), see following log:
(In reply to comment #377)
> (From update of attachment 443598 [details] [diff] [review])
> With the current code mTotalSize is decreased when some entry is doomed. This
> doesn't happen with your patch.

... and ...

> Why don't delete the entry in DeactivateEntry() when it is doomed ?

The general idea of the patch is to simplify bookkeeping in nsMemoryCacheDevice. So, I've removed various special cases and follow some simple steps/rules for removing entries from the cache :

doom entry => make sure entry isn't found when searching device
deactivate entry => increase mInactiveSize, put entry on list to be evicted
evict entry => consolidate stats and delete entry

Focus is on making sure these steps/rules are followed by all entries, which is done by ensuring entries always are on a relevant list (doom-list or active-list of nsCacheService, hashtable or some eviction-list by the device).

This is why mTotalSize is not changed when an entry is doomed (which I btw believe makes sense since the entry actually is kept by the device even if its doomed). It is also why entries are not deleted in DeactivateEntry : It's the job of EvictEntry, moreover EvictEntriesIfNecessary is called at the end of DeactivateEntry anyway.

> > + // This is an entry which was doomed before it bound to a device
> > + // (see bug #548406). The entry is actually in the doomed-list of
> > + // nsCacheService at this point, so we steal it from that list and
> > + // add it to our eviction-list. Since it is doomed, we put it in
> > + // the list of entries which will be evicted first
> > + NS_ASSERTION(!PR_CLIST_IS_EMPTY(entry),"entry is not on doom-list!");
> > + PR_REMOVE_AND_INIT_LINK(entry);
> > + PR_APPEND_LINK(entry, &mEvictionList[kQueueCount - 1]);
> 
> Maybe do PR_INSERT_LINK ?

Fair enough - we want to get rid of it as fast as possible... Fixed.

> > nsMemoryCacheDevice::EvictEntry(nsCacheEntry * entry, PRBool deleteEntry)
> 
> With your patch EvictEntry() is called always with deleteEntry==TRUE, so remove
> the argument.

Fair enough. Fixed.
Attachment #443598 - Attachment is obsolete: true
Attachment #445339 - Flags: review?(michal.novotny)
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274207063.1274209843.22344.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/05/18 11:24:23
s: talos-r3-leopard-015
cjones' changes in bug 564158 mean we get a real stack out of assertions on OS X now, which is nice:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274280731.1274283212.13418.gz
Rev3 MacOSX Leopard 10.5.8 mozilla-central debug test xpcshell on 2010/05/22 14:25:57
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1274563557.1274565767.1370.gz
I landed bug 567680 today, which seems to have *very* similar symptoms to this bug.  That might have made things better here, or might even have solved it completely.  Bjarne, maybe you want to look at that bug as well?

I think we should watch here if this happens after this point on mozilla-central again.
(In reply to comment #446)
> I landed bug 567680 today, which seems to have *very* similar symptoms to this
> bug.  That might have made things better here, or might even have solved it
> completely.  Bjarne, maybe you want to look at that bug as well?

Isn't the fix for bug #567680 very similar to the (rejected) fix suggested in comment #98?
Seems to have been fixed by bug 567680. Resolving as duplicate.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
No longer depends on: 513074
No longer blocks: 438871
Whiteboard: [orange]
Attachment #445339 - Flags: review?(michal.novotny)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: