Closed
Bug 1011771
Opened 11 years ago
Closed 11 years ago
Intermittent f02n0g08.png | application crashed [@ mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*)]
Categories
(Core :: Networking: Cache, defect)
Tracking
()
RESOLVED
FIXED
mozilla32
Tracking | Status | |
---|---|---|
firefox30 | --- | unaffected |
firefox31 | --- | unaffected |
firefox32 | --- | fixed |
firefox-esr24 | --- | unaffected |
People
(Reporter: KWierso, Assigned: michal)
References
(Blocks 1 open bug)
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(1 file, 2 obsolete files)
(deleted),
patch
|
michal
:
review+
michal
:
checkin+
|
Details | Diff | Splinter Review |
https://tbpl.mozilla.org/php/getParsedLog.php?id=39845205&tree=Mozilla-Inbound
Android 2.2 Tegra mozilla-inbound opt test plain-reftest-1 on 2014-05-16 14:17:36 PDT for push 505f38b0649b
slave: tegra-111
REFTEST TEST-START | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png | 186 / 2985 (6%)
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.html | 186 / 2985 (6%)
REFTEST TEST-PASS | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png | image comparison (==)
REFTEST TEST-UNEXPECTED-FAIL | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.html | This test left crash dumps behind, but we weren't expecting it to!
REFTEST INFO | Found unexpected crash dump file /mnt/sdcard/tests/reftest/profile/minidumps/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.dmp.
REFTEST INFO | Found unexpected crash dump file /mnt/sdcard/tests/reftest/profile/minidumps/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.extra.
REFTEST INFO | Loading a blank page
REFTEST TEST-END | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f01n2c08.png
REFTEST TEST-START | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png | 187 / 2985 (6%)
REFTEST TEST-LOAD | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.html | 187 / 2985 (6%)
INFO | automation.py | Application ran for: 0:02:42.719816
INFO | zombiecheck | Reading PID log: /tmp/tmpRSB3hXpidlog
Contents of /data/anr/traces.txt:
PROCESS-CRASH | http://10.26.84.17:30111/tests/image/test/reftest/pngsuite-filtering/f02n0g08.png | application crashed [@ mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*)]
Crash dump filename: /tmp/tmp60vTPM/60e7e9b5-6e3a-d375-49b88d3f-5ddec81e.dmp
Operating system: Android
0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
2 CPUs
Crash reason: SIGSEGV
Crash address: 0x5
Thread 21 (crashed)
0 libxul.so!mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*) [nsAutoPtr.h:505f38b0649b : 897 + 0x4]
r4 = 0x00340028 r5 = 0x5e9ff380 r6 = 0x5cd0d1c8 r7 = 0x00000001
r8 = 0x5cd0d1c4 r9 = 0x56e333ba r10 = 0x00000001 fp = 0x00000000
sp = 0x449c5dc8 lr = 0x55d93c45 pc = 0x55d92f62
Found by: given as instruction pointer in context
1 libxul.so!mozilla::net::CacheFileChunk::Release() [CacheFileChunk.cpp:505f38b0649b : 64 + 0x5]
r4 = 0x00000001 r5 = 0x00000000 r6 = 0x5cd0d1c8 r7 = 0x00000001
r8 = 0x5cd0d1c4 r9 = 0x56e333ba r10 = 0x00000001 fp = 0x00000000
sp = 0x449c5de8 pc = 0x55d93c45
Found by: call frame info
2 libxul.so!mozilla::RefPtr<mozilla::SharedThreadPool>::~RefPtr() + 0xd
r4 = 0x449c5e04 r5 = 0x00000000 r6 = 0x5cd0d1c8 r7 = 0x00000001
r8 = 0x5cd0d1c4 r9 = 0x56e333ba r10 = 0x00000001 fp = 0x00000000
sp = 0x449c5df8 pc = 0x55cecfaf
Found by: call frame info
3 libxul.so!mozilla::net::CacheFile::RemoveChunk(mozilla::net::CacheFileChunk*) [CacheFile.cpp:505f38b0649b : 1348 + 0x5]
r4 = 0x5ee581c0 r5 = 0x00000000 r6 = 0x5cd0d1c8 r7 = 0x00000001
r8 = 0x5cd0d1c4 r9 = 0x56e333ba r10 = 0x00000001 fp = 0x00000000
sp = 0x449c5e00 pc = 0x55d930cf
Found by: call frame info
4 libxul.so!mozilla::net::CacheFileChunk::Release() [CacheFileChunk.cpp:505f38b0649b : 64 + 0x5]
r4 = 0x00000001 r5 = 0x00000000 r6 = 0x5cd0d1c8 r7 = 0x00000001
r8 = 0x5cd0d1c4 r9 = 0x56e333ba r10 = 0x00000001 fp = 0x00000000
sp = 0x449c5e20 pc = 0x55d93c45
Found by: call frame info
Comment 1•11 years ago
|
||
michal/honza: new cache?
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → michal.novotny
Comment hidden (Legacy TBPL/Treeherder Robot) |
Comment 3•11 years ago
|
||
This also shows up in crash data in the wild, see https://crash-stats.mozilla.com/report/list?signature=mozilla%3A%3Anet%3A%3ACacheFile%3A%3ARemoveChunk%28mozilla%3A%3Anet%3A%3ACacheFileChunk%2A%29 for crash reports.
Updated•11 years ago
|
Blocks: cache2afterenable
Comment 4•11 years ago
|
||
Comment 5•11 years ago
|
||
FYI, there is a lot of crashes with mozilla::net::CacheFile::RemoveChunk in the stack, maybe check on all of them, some may give a clue.
Assignee | ||
Comment 6•11 years ago
|
||
The problem is that mRemovingChunk is accessed in CacheFileChunk::Release() after the refcnt is decreased. When the CacheFileChunk::Release() method is called on two threads at the same time, it is possible that the object is destroyed on one thread before we execute the condition on the second thread.
Assignee | ||
Comment 7•11 years ago
|
||
The patch fixes the problem described in comment #6. I've also renamed the member mRemovingChunk. The name was chosen at time when we didn't cache chunks, but now is its name incorrect and misleading.
Attachment #8426991 -
Flags: review?(honzab.moz)
Assignee | ||
Comment 8•11 years ago
|
||
Comment on attachment 8426991 [details] [diff] [review]
patch v1
It can still crash since the activeChunk information might be outdated in Release() so we would still call RemoveChunk() on a deleted object. Reposting Release() to a single thread seems to be the simplest solution for this problem. I'll create a new patch.
Attachment #8426991 -
Attachment is obsolete: true
Attachment #8426991 -
Flags: review?(honzab.moz)
Comment 9•11 years ago
|
||
Comment on attachment 8426991 [details] [diff] [review]
patch v1
Review of attachment 8426991 [details] [diff] [review]:
-----------------------------------------------------------------
(In reply to Michal Novotny (:michal) from comment #8)
> Comment on attachment 8426991 [details] [diff] [review]
> patch v1
>
> It can still crash since the activeChunk information might be outdated in
> Release() so we would still call RemoveChunk() on a deleted object.
> Reposting Release() to a single thread seems to be the simplest solution for
> this problem. I'll create a new patch.
Yes, it apparently can.
Can AddRef be called on a different thread then the thread you want to post Release() to? If yes, then posting is not a solution here.
::: netwerk/cache2/CacheFileChunk.cpp
@@ +61,5 @@
> delete (this);
> return 0;
> }
>
> + if (activeChunk && count == 1) {
so, here you are using a local var since |this| can already be freed, right? But you are passing |this| to RemoveChunk that happily dereferences it again. Do you really think that is a correct way of coding?
This code is all wrong. You should actually allow Put/Get to/from both the mChunks and mCachedChunks _and_ calls to AddRef() and Release() on chunks happen just an only under the file's lock. Chunk may remove itself from the hashtable when it reaches count 1 safely - under the lock.
Attachment #8426991 -
Flags: review-
Comment 10•11 years ago
|
||
Actually, Get/Put+AddRef and Release+Remove should both be strictly atomic.
Assignee | ||
Comment 11•11 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #9)
> ::: netwerk/cache2/CacheFileChunk.cpp
> @@ +61,5 @@
> > delete (this);
> > return 0;
> > }
> >
> > + if (activeChunk && count == 1) {
>
> so, here you are using a local var since |this| can already be freed, right?
> But you are passing |this| to RemoveChunk that happily dereferences it
> again. Do you really think that is a correct way of coding?
You should first try to read the code and think about it a bit before you post any comment. This would be OK if activeChunk would contain fresh information, because if the chunk is active it is in mChunks and this release was the last reference other than from HT. So it is ensured that the object exists.
So back to reposting Release(). This will ensure that the object is not destroyed on a different thread. This still won't ensure that anybody AddRefs() the chunk on the different thread but I can safely call RemoveChunk (which should be maybe renamed to DeactivateChunk or MaybeDeactivateChunk) which can grabs the CacheFile's lock and decide whether this is really the last reference.
> You should actually allow Put/Get to/from both the mChunks and mCachedChunks
I don't understand what are you trying to say with this part of the sentence.
> _and_ calls to AddRef() and Release() on chunks happen just an only under the
> file's lock.
This is a complete nonsense. You can never ensure that CacheFileChunk::AddRef() or CacheFileChunk::Release() is called always under the CacheFile's lock.
Comment 12•11 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #11)
> (In reply to Honza Bambas (:mayhemer) from comment #9)
> > ::: netwerk/cache2/CacheFileChunk.cpp
> > @@ +61,5 @@
> > > delete (this);
> > > return 0;
> > > }
> > >
> > > + if (activeChunk && count == 1) {
> >
> > so, here you are using a local var since |this| can already be freed, right?
> > But you are passing |this| to RemoveChunk that happily dereferences it
> > again. Do you really think that is a correct way of coding?
>
> You should first try to read the code and think about it a bit before you
> post any comment. This would be OK if activeChunk would contain fresh
> information, because if the chunk is active it is in mChunks and this
> release was the last reference other than from HT. So it is ensured that the
> object exists.
Then you can use mActiveChunk instead, right?
>
> So back to reposting Release(). This will ensure that the object is not
> destroyed on a different thread. This still won't ensure that anybody
> AddRefs() the chunk on the different thread but I can safely call
> RemoveChunk (which should be maybe renamed to DeactivateChunk or
> MaybeDeactivateChunk)
I'm in favor of that.
> which can grabs the CacheFile's lock and decide
> whether this is really the last reference.
>
>
> > You should actually allow Put/Get to/from both the mChunks and mCachedChunks
>
> I don't understand what are you trying to say with this part of the sentence.
>
>
> > _and_ calls to AddRef() and Release() on chunks happen just an only under the
> > file's lock.
I'll try to rephrase:
The problem here is that when Release() { if (--mRefCnt == 1) m[Cached]Chunks.Remove(); } is not atomic, someone can find the reference to the chunk in the hash table and do AddRef() in a way that you on one thread get to the point of deleting the object and on the other do (dead-object)->AddRef().
So, it should be like (pseudo-code):
GetChunk(**chunk)
{
autolock();
if (!m[Cached]Chunks.Get(index, chunk)) { // if found, ref++
refptr<Chunk> c = new Chunk(); // 1 ref
m[Cached]Chunks.Put(index, c); // 2 refs
c.forget(chunk); // 2 refs
}
}
and
Release()
{
autolock();
cnt = --mRefCnt;
if (cnt == 1) {
m[Cached]Chunks.Remove(this); // problem here is that you will reenter the Release(), but you could well post here
// on the other hand, you could just let m[Cached]Chunks don't ref the chunk, but then there is a chance (that will happen! :)) there will be an invalid ptr in the hashtables, so I would take that only as a last resort solution
}
if (cnt == 0)
delete this;
}
>
> This is a complete nonsense. You can never ensure that
> CacheFileChunk::AddRef() or CacheFileChunk::Release() is called always under
> the CacheFile's lock.
Of course, what I'm saying is that you need to atomize moving with the refcnt _and_ searching/removing in/from the hashtable. Sorry if that was not clear.
Comment 13•11 years ago
|
||
Other option is to make the hash tables use nsAutoPtr instead. So it will not alter the mRefCnt and will auto delete the handle when you reach mRefCnt == 0 where you will instead of |delete this| do mChunks.Remove(this). Moving the counter and access to the hash tables must be atomic.
Comment 14•11 years ago
|
||
Also FYI: autoptr1 = autoptr2 will move value from autoptr2 to autoptr1, so that the object won't go away.
Assignee | ||
Comment 15•11 years ago
|
||
I want to make sure you fully understand the life cycle of the chunk. The goal isn't to just simply remove the chunk from hashtables when the refcnt decreases to 1 (or zero if hashtables won't hold a strong reference to it). AFAICS this is what you're focused on with your proposal.
The chunk is "active" when it is actively used and during this time it _must_ be present in mChunks. When the chunk is inactive (all references other than the reference from mChunks were released) it is either put into mCachedChunks or destroyed. I.e. mChunks hashtable is driven by Release() and on the other hand _we_ decide when the cached chunk should be released, so once we release a cached chunk the Release() method must just decrement the RefCnt and it must not do anything else.
Also please note that when all releases of an active chunk are released that doesn't necessarily mean that such chunk will be released or cached. If it is dirty, we initiate writing of this chunk to disk and the chunk _must_ stay during this write in mChunks. The write process can fail synchronously or it can fail/succeed asynchronously. During initiation of this write process the chunk is addrefed/released several times and we must make sure we won't re-enter the CacheFile's lock.
Comment 16•11 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #15)
> I.e. mChunks hashtable is driven by Release()
> and on the other hand _we_ decide when the cached chunk should be released,
> so once we release a cached chunk the Release() method must just decrement
> the RefCnt and it must not do anything else.
>
I only don't follow this part. Can you please rephrase or something to make this more clear? I cannot answer before I fully understand. Thanks.
Assignee | ||
Comment 17•11 years ago
|
||
I just wanted to explain the difference between mChunks and mCachedChunks. Chunk _might_ be deactivated (removed from mChunks) when all references are released, i.e. the impulse comes always from CacheFileChunk::Release(). OTOH chunks from mCachedChunks are removed by impulse from CacheFile::ThrowMemoryCachedData(), CacheFile::CleanUpPreloadedChunks() etc. and in this case we must not try to deactivate the chunk in CacheFileChunk::Release().
Comment 18•11 years ago
|
||
I'll try.. Thanks for this outline BTW.
(In reply to Michal Novotny (:michal) from comment #15)
> I want to make sure you fully understand the life cycle of the chunk. The
> goal isn't to just simply remove the chunk from hashtables when the refcnt
> decreases to 1 (or zero if hashtables won't hold a strong reference to it).
> AFAICS this is what you're focused on with your proposal.
Yes, you are right on this.
>
> The chunk is "active" when it is actively used and during this time it
> _must_ be present in mChunks. When the chunk is inactive (all references
> other than the reference from mChunks were released) it is either put into
> mCachedChunks or destroyed. I.e. mChunks hashtable is driven by Release()
> and on the other hand _we_ decide when the cached chunk should be released,
I think with the proposal from comment 13 we still can have a control, maybe you got confused as I wrote that instead of delete this you will just do mChunks.Remove(Index()). No, you would still call mFile->RemoveChunk(this) of course that may do whatever it wants. Also, can addref the object again safely.
> so once we release a cached chunk the Release() method must just decrement
> the RefCnt and it must not do anything else.
>
> Also please note that when all releases of an active chunk are released that
> doesn't necessarily mean that such chunk will be released or cached.
Aha. Good point. Anyway, looking into CacheFile::RemoveChunk I can see the dirty write is started from there. And it happens under the lock. Note that my proposal means to NOT lock in CacheFileChunk::AddRef(). So, I presume the write process holds a hard ref (in this case it will go from 0 back to 1 - which in our case is perfectly alright) and will hold the chunk live again for the time of write, is that correct? If so, then I don't see why not to use comment 13 proposal.
> If it
> is dirty, we initiate writing of this chunk to disk and the chunk _must_
> stay during this write in mChunks. The write process can fail synchronously
> or it can fail/succeed asynchronously. During initiation of this write
> process the chunk is addrefed/released several times and we must make sure
> we won't re-enter the CacheFile's lock.
The sync fail would be bad if you addref/realease the chunk under the lock, right, and it can happen when dispatch to the IO thread fails, what easily can in case of shutdown... Could we just make it fail only asynchronously?
If this becomes too complicated then we may do something similar to CacheEntry referencing. There is a handle given to consumers that has a side refcoutner to make any before-end-of-life decisions on the object.
Comment 19•11 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #17)
> I just wanted to explain the difference between mChunks and mCachedChunks.
I know the difference, but this outline was good anyway.
> Chunk _might_ be deactivated (removed from mChunks) when all references are
> released, i.e. the impulse comes always from CacheFileChunk::Release(). OTOH
> chunks from mCachedChunks are removed by impulse from
> CacheFile::ThrowMemoryCachedData(), CacheFile::CleanUpPreloadedChunks() etc.
> and in this case we must not try to deactivate the chunk in
> CacheFileChunk::Release().
I don't much see an argument against the proposal.
"we must not try to deactivate" - hmm.. question is if the chunk is added a ref (and potentially released) when in mCachedChunks only. Is there such a case? Anyway, we can easily branch by checking if the chunk is in mChunks (by GetWeak()). If not, then it's not active, and we can just safely do nothing.
If the chunk is released from mCachedChunks (and then deleted) under the lock, we are still safe.
Assignee | ||
Comment 20•11 years ago
|
||
I've added a comment explaining why just re-posting CacheFileChunk::Release() from non-main thread to main thread fixes the problem.
Attachment #8428679 -
Flags: review?(honzab.moz)
Assignee | ||
Comment 21•11 years ago
|
||
Comment on attachment 8428679 [details] [diff] [review]
patch v2
https://tbpl.mozilla.org/?tree=Try&rev=80c2c58300bd
Comment 22•11 years ago
|
||
> https://tbpl.mozilla.org/?tree=Try&rev=80c2c58300bd
{based on 10 days old m-c)
Comment 23•11 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #6)
> The problem is that mRemovingChunk is accessed in CacheFileChunk::Release()
> after the refcnt is decreased. When the CacheFileChunk::Release() method is
> called on two threads at the same time, it is possible that the object is
> destroyed on one thread before we execute the condition on the second thread.
So this was happening:
T1 T2
Release() // last external ref
{
cnt = refcnt-- (= 1)
!removing && cnt == 1:
GetChunkLocked()
lock
AddRef(): refcnt++ (= 2)
unlock
Release()
{
cnt = refcnt-- (= 1)
!removing && cnt == 1:
lock
refcnt++ (= 2)
removing = true
mChunk.Remove() (= 1)
unlock
lock
refcnt-- (= 0)
delete this;
AddRef() or whatever access -> CRASH!
Hence having a single thread Release() call will fix this bug. Hopefully there is no other similar to this one just even more hidden... I still don't like even the patched code but I don't see any other problem with it right now.
And the idea for referring chunks in the hashtable with nsAutoPtr would not help here either.
Comment 24•11 years ago
|
||
Comment on attachment 8428679 [details] [diff] [review]
patch v2
Review of attachment 8428679 [details] [diff] [review]:
-----------------------------------------------------------------
Please add AssertOwnsLock to CacheFile::RemoveChunkInternal
::: netwerk/cache2/CacheFile.cpp
@@ +1325,5 @@
>
> SetError(rv);
> CacheFileIOManager::DoomFile(mHandle, nullptr);
> return rv;
> + } else {
else after return, btw
::: netwerk/cache2/CacheFileChunk.cpp
@@ +65,5 @@
> {
> + nsrefcnt count = mRefCnt - 1;
> + if (DispatchRelease()) {
> + // Redispatched to the main thread.
> + return count;
nit: maybe just return mRefCnt - 1 or just mRefCnt, you will save atomic (expensive) access when already on the main thread.
@@ +79,5 @@
> return 0;
> }
>
> + // We can safely access this chunk after decreasing mRefCnt since we re-post
> + // all calls to Release() from non-main thread to the main thread. I.e. no
from non-main threadS
or simply:
off main-thread
::: netwerk/cache2/CacheFileChunk.h
@@ +128,5 @@
> uint32_t mIndex;
> EState mState;
> nsresult mStatus;
> bool mIsDirty;
> + bool mActiveChunk;
please comment when this is actually set. as I understand, mActiveChunk == true <=> mChunk.contains(the_chunk), right? also a note about locking (that both these states change atomically)
Attachment #8428679 -
Flags: review?(honzab.moz) → review+
Comment 25•11 years ago
|
||
Michal, any reason not to update/land this asap?
Assignee | ||
Comment 27•11 years ago
|
||
Attachment #8428679 -
Attachment is obsolete: true
Attachment #8430059 -
Flags: review+
Attachment #8430059 -
Flags: checkin+
Comment 29•11 years ago
|
||
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
Updated•11 years ago
|
status-firefox30:
--- → unaffected
status-firefox31:
--- → unaffected
status-firefox32:
--- → fixed
status-firefox-esr24:
--- → unaffected
You need to log in
before you can comment on or make changes to this bug.
Description
•