Closed
Bug 1251130
Opened 9 years ago
Closed 9 years ago
Intermittent test_caches.html | application crashed [@ mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*)]
Categories
(Core :: Networking: Cache, defect)
Core
Networking: Cache
Tracking
()
RESOLVED
DUPLICATE
of bug 1257611
People
(Reporter: philor, Assigned: michal)
References
Details
(Keywords: intermittent-failure, Whiteboard: [necko-active])
Attachments
(1 obsolete file)
Reporter | ||
Comment 2•9 years ago
|
||
Necko put its test in dom/cache/test/mochitest/test_caches.html?
Naughty!
Comment 3•9 years ago
|
||
The test may be in dom/cache, but the stack is all necko:
14:26:46 INFO - 0 XUL!mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*) [RefPtr.h:3d625cdac138 : 36 + 0x0]
14:26:46 INFO - rax = 0x0000000113094500 rdx = 0x0000000000000000
14:26:46 INFO - rcx = 0x0000000000000003 rbx = 0x0000000000000003
14:26:46 INFO - rsi = 0x0000000000000000 rdi = 0x0000000100500044
14:26:46 INFO - rbp = 0x00000001170d3da0 rsp = 0x00000001170d3d40
14:26:46 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000000
14:26:46 INFO - r10 = 0x0000000000080000 r11 = 0xfffffffffbb433e0
14:26:46 INFO - r12 = 0x00000001080bc0a8 r13 = 0x000000010060f228
14:26:46 INFO - r14 = 0x0000000000000001 r15 = 0x00000001170d3dd0
14:26:46 INFO - rip = 0x000000010260224c
14:26:46 INFO - Found by: given as instruction pointer in context
14:26:46 INFO - 1 XUL!mozilla::net::CacheFileIOManager::ShutdownInternal() [CacheFileIOManager.cpp:3d625cdac138 : 1215 + 0x14]
14:26:46 INFO - rbx = 0x00000001170d3dd0 rbp = 0x00000001170d3e00
14:26:46 INFO - rsp = 0x00000001170d3db0 r12 = 0x0000000000000004
14:26:46 INFO - r13 = 0x00000001080bc0a8 r14 = 0x00007fff5fbfebd0
14:26:46 INFO - r15 = 0x00000001161dc680 rip = 0x0000000102603c3a
14:26:46 INFO - Found by: call frame info
14:26:46 INFO - 2 XUL!mozilla::net::ShutdownEvent::Run() [CacheFileIOManager.cpp:3d625cdac138 : 570 + 0xc]
14:26:46 INFO - rbx = 0x0000000117cee220 rbp = 0x00000001170d3e20
14:26:46 INFO - rsp = 0x00000001170d3e10 r12 = 0x0000000000000004
14:26:46 INFO - r13 = 0x00000001173e0110 r14 = 0x00007fff5fbfebd0
14:26:46 INFO - r15 = 0x00000001080bc0a8 rip = 0x0000000102621748
14:26:46 INFO - Found by: call frame info
14:26:46 INFO - 3 XUL!mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) [CacheIOThread.cpp:3d625cdac138 : 294 + 0x6]
14:26:46 INFO - rbx = 0x0000000000000020 rbp = 0x00000001170d3e70
14:26:46 INFO - rsp = 0x00000001170d3e30 r12 = 0x0000000000000004
14:26:46 INFO - r13 = 0x00000001173e0110 r14 = 0x0000000000000003
14:26:46 INFO - r15 = 0x00000001080bc0a8 rip = 0x0000000102614d8a
14:26:46 INFO - Found by: call frame info
14:26:46 INFO - 4 XUL!mozilla::net::CacheIOThread::ThreadFunc() [CacheIOThread.cpp:3d625cdac138 : 224 + 0x8]
14:26:46 INFO - rbx = 0x00000001173e0110 rbp = 0x00000001170d3ea0
14:26:46 INFO - rsp = 0x00000001170d3e80 r12 = 0x000000000000692b
14:26:46 INFO - r13 = 0x00000000000008ff r14 = 0x00000001170d3e87
14:26:46 INFO - r15 = 0x0000000000000000 rip = 0x0000000102614b88
14:26:46 INFO - Found by: call frame info
14:26:46 INFO - 5 XUL!mozilla::net::CacheIOThread::ThreadFunc(void*) [CacheIOThread.cpp:3d625cdac138 : 173 + 0x8]
14:26:46 INFO - rbx = 0x00000001173e0110 rbp = 0x00000001170d3ec0
14:26:46 INFO - rsp = 0x00000001170d3eb0 r12 = 0x000000000000692b
14:26:46 INFO - r13 = 0x00000000000008ff r14 = 0x00000001170d4000
14:26:46 INFO - r15 = 0x0000000000000000 rip = 0x00000001026149e4
14:26:46 INFO - Found by: call frame info
I suppose dom Cache could be doing something bad, but I think we need to investigate whats happening in this stack first.
Comment 4•9 years ago
|
||
Some shutdown related things changed in neck cache recently. For example, bug 1247432. It seems the change to use profile-change-net-teardown as the shutdown hook causes necko cache to shutdown before DOM cache API now. It didn't use to do that. Not sure if thats related, but seems possible.
In dom/cache the only thing that changed recently is bholley's NS_ProxyRelease() stuff in bug 1164581. That also touched the necko cache.
Comment 5•9 years ago
|
||
Honza, can you take a look at this crash? I won't be able to look again until next week after I get back from PTO.
Flags: needinfo?(honzab.moz)
Assignee | ||
Updated•9 years ago
|
Assignee: nobody → michal.novotny
Flags: needinfo?(michal.novotny)
Updated•9 years ago
|
Whiteboard: [necko-active]
Assignee | ||
Comment 7•9 years ago
|
||
It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is now called multiple times and what happens here is:
- MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and waits until condVar is notified
- IO Thread: ShutdownEvent::Run() notifies condVar
- MainThread: CacheFileIOManager::Shutdown() joins IO thread (via gInstance->mIOThread->Shutdown()) which means we process pending events on mainthread
- MainThread: CacheFileIOManager::Shutdown() is called again while we're still waiting for IO thread to be shut down and because gInstance is not null the whole shutdown process is executed again.
Updated•9 years ago
|
Severity: normal → critical
Assignee | ||
Comment 8•9 years ago
|
||
CacheFileIOManager::gInstance is now nulled out on cache IO thread in CacheFileIOManager::ShutdownInternal() while main thread is blocked on condVar in CacheFileIOManager::Shutdown(). I also did a cleanup of CacheFileIOManager usage in events. Now it is safe to check gInstance and then use it whenever on Cache IO thread without addrefing.
I'm not able to get clean try run. I tried several pushes but win7 seems to be broken (not by this patch).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0fbc500589f9
Attachment #8726816 -
Flags: review?(honzab.moz)
Comment 9•9 years ago
|
||
(In reply to Michal Novotny (:michal) from comment #7)
> It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is
> now called multiple times and what happens here is:
>
> - MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and
> waits until condVar is notified
> - IO Thread: ShutdownEvent::Run() notifies condVar
> - MainThread: CacheFileIOManager::Shutdown() joins IO thread (via
> gInstance->mIOThread->Shutdown()) which means we process pending events on
> mainthread
Hmm.. no, we don't. It's doing a PR_JoingThread() which doesn't loop.
> - MainThread: CacheFileIOManager::Shutdown() is called again while we're
> still waiting for IO thread to be shut down and because gInstance is not
> null the whole shutdown process is executed again.
Based on the above, this is impossible.
Assignee | ||
Comment 11•9 years ago
|
||
Then I'm out of ideas here. How can we be past condVar.wait() on the main thread when condVar was not yet notified on Cache IO thread? Any ideas?
Flags: needinfo?(michal.novotny) → needinfo?(honzab.moz)
Comment 12•9 years ago
|
||
You are right we can call Shutdown in the io manager twice now. But the problem is elsewhere than in reentrance of it. It may be a race condition on access to gInstance during this shutdown phase. Actually, something I was pointing out at a long time. Still, the patch may be fixing the issue and seems correct to me.
Flags: needinfo?(honzab.moz)
Comment 13•9 years ago
|
||
Comment on attachment 8726816 [details] [diff] [review]
fix
Review of attachment 8726816 [details] [diff] [review]:
-----------------------------------------------------------------
This should be uplifted, right? I'm a bit scared of this patch, also because we don't the true cause of this bug yet, but if you believe it's safe enough, please nominate to go along the shutdown bug. Thanks.
Attachment #8726816 -
Flags: review?(honzab.moz) → review+
Comment 14•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #13)
because we don't *know* the true cause of this bug yet
Comment 15•9 years ago
|
||
Assignee | ||
Comment 17•9 years ago
|
||
Comment on attachment 8726816 [details] [diff] [review]
fix
Approval Request Comment
[Feature/regressing bug #]: 1247432
[User impact if declined]: possible shutdown crash
[Describe test coverage new/current, TreeHerder]: this is tested by all existing tests that use disk cache
[Risks and why]: low - This patch cleans up shutdown procedure in the cache a bit. We are not sure what exactly causes these crashes (also bug 1254720 and bug 1248389 comment 30) but this change makes accessing CacheFileIOManager::gInstance safe at least on main thread and cache IO thread. I'll probably come with yet another clean up later, but for now we should land this on the same branches as bug 1247432 was landed.
[String/UUID change made/needed]: none
Attachment #8726816 -
Flags: approval-mozilla-beta?
Attachment #8726816 -
Flags: approval-mozilla-aurora?
Comment 18•9 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox48:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
status-firefox46:
--- → affected
Comment on attachment 8726816 [details] [diff] [review]
fix
Crash fix, taking it.
Attachment #8726816 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 21•9 years ago
|
||
Comment on attachment 8726816 [details] [diff] [review]
fix
Dropping the request. This is causes (at least) bug 1255690.
Candidate for a complete backout!
Attachment #8726816 -
Flags: approval-mozilla-beta?
Comment 22•9 years ago
|
||
Michal, please back this out ASAP. We need a better patch here...
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: FIXED → ---
Assignee | ||
Updated•9 years ago
|
Flags: needinfo?(michal.novotny)
Attachment #8726816 -
Flags: approval-mozilla-aurora+
Comment 23•9 years ago
|
||
Comment 24•9 years ago
|
||
This bug is not new:
https://crash-stats.mozilla.com/report/index/367db667-c96c-4497-8f7f-fc4b82160308
Fx 44, w/o the patch for bug 1247432.
Comment 25•9 years ago
|
||
And one more:
https://crash-stats.mozilla.com/report/index/ab06c015-bcb0-41be-8cbb-a9ec32160309
Tho, these two crashes are just ad only two I've found. both on OSX, where this crash bug also appeared.
Comment 26•9 years ago
|
||
try run with sync,cache2:5 logging:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef
let's try get some info from logs.
Comment 27•9 years ago
|
||
another try without the sync module:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=78362bdde9a3
Apparently this is a tight timing issue that sync logging is interfering with.
Comment 28•9 years ago
|
||
OK, I have a solution to this bug: let's turn on cache logging, and it will never crash again! :)
Comment 29•9 years ago
|
||
Merge of backout:
https://hg.mozilla.org/mozilla-central/rev/3a4549c08a31
Target Milestone: mozilla48 → ---
Comment 30•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #26)
> try run with sync,cache2:5 logging:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef
>
> let's try get some info from logs.
Managed to reproduce:
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/68c79c7068c2e5f0d4ba2b47b70325a6fcb427c9c277e8c0a1b82aa66e133f047e5f5307fe0f1ebba38662d6eb14d30fc11a02cdcb02eea896f7b1d3c45bf486
http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/3437c29f38a3681b9d3d1f542bda028ba5db5cd86d6c1b03409def6a2c242abfbfd94a73c5ed7272b02c5d493a9b780ee70ac278f02301cc550126bd1267d90e
Comment 31•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #30)
> (In reply to Honza Bambas (:mayhemer) from comment #26)
> > try run with sync,cache2:5 logging:
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef
> >
> > let's try get some info from logs.
>
> Managed to reproduce:
>
> http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/
> 68c79c7068c2e5f0d4ba2b47b70325a6fcb427c9c277e8c0a1b82aa66e133f047e5f5307fe0f1
> ebba38662d6eb14d30fc11a02cdcb02eea896f7b1d3c45bf486
>
nspr-pid=2336-uid=bbd4d41b-d0ac-4301-95aa-ca8612c9025c.log
> http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/
> 3437c29f38a3681b9d3d1f542bda028ba5db5cd86d6c1b03409def6a2c242abfbfd94a73c5ed7
> 272b02c5d493a9b780ee70ac278f02301cc550126bd1267d90e
nspr-pid=2013-uid=ab305ead-f338-43b4-9bfc-875ec11796d8.log
Michal, can you figure out something from it?
Flags: needinfo?(michal.novotny)
Comment 32•9 years ago
|
||
At least the theory that CacheFileIOManager::ShutdownInternal is called more than once is gone.
Comment 33•9 years ago
|
||
Based on Bug 1257379:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=68329b909126
added some more logging too.
Updated•9 years ago
|
Status: REOPENED → ASSIGNED
Flags: needinfo?(michal.novotny)
Updated•9 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → DUPLICATE
Updated•9 years ago
|
Attachment #8726816 -
Attachment is obsolete: true
You need to log in
before you can comment on or make changes to this bug.
Description
•