Closed Bug 1251130 Opened 9 years ago Closed 9 years ago

Intermittent test_caches.html | application crashed [@ mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*)]

Categories

(Core :: Networking: Cache, defect)

defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 1257611
Tracking Status
firefox46 --- affected
firefox47 --- affected
firefox48 --- affected

People

(Reporter: philor, Assigned: michal)

References

Details

(Keywords: intermittent-failure, Whiteboard: [necko-active])

Attachments

(1 obsolete file)

This is necko cache, not DOM Cache API.
Component: DOM → Networking: Cache
Necko put its test in dom/cache/test/mochitest/test_caches.html? Naughty!
The test may be in dom/cache, but the stack is all necko: 14:26:46 INFO - 0 XUL!mozilla::net::CacheFileHandles::GetAllHandles(nsTArray<RefPtr<mozilla::net::CacheFileHandle> >*) [RefPtr.h:3d625cdac138 : 36 + 0x0] 14:26:46 INFO - rax = 0x0000000113094500 rdx = 0x0000000000000000 14:26:46 INFO - rcx = 0x0000000000000003 rbx = 0x0000000000000003 14:26:46 INFO - rsi = 0x0000000000000000 rdi = 0x0000000100500044 14:26:46 INFO - rbp = 0x00000001170d3da0 rsp = 0x00000001170d3d40 14:26:46 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000000 14:26:46 INFO - r10 = 0x0000000000080000 r11 = 0xfffffffffbb433e0 14:26:46 INFO - r12 = 0x00000001080bc0a8 r13 = 0x000000010060f228 14:26:46 INFO - r14 = 0x0000000000000001 r15 = 0x00000001170d3dd0 14:26:46 INFO - rip = 0x000000010260224c 14:26:46 INFO - Found by: given as instruction pointer in context 14:26:46 INFO - 1 XUL!mozilla::net::CacheFileIOManager::ShutdownInternal() [CacheFileIOManager.cpp:3d625cdac138 : 1215 + 0x14] 14:26:46 INFO - rbx = 0x00000001170d3dd0 rbp = 0x00000001170d3e00 14:26:46 INFO - rsp = 0x00000001170d3db0 r12 = 0x0000000000000004 14:26:46 INFO - r13 = 0x00000001080bc0a8 r14 = 0x00007fff5fbfebd0 14:26:46 INFO - r15 = 0x00000001161dc680 rip = 0x0000000102603c3a 14:26:46 INFO - Found by: call frame info 14:26:46 INFO - 2 XUL!mozilla::net::ShutdownEvent::Run() [CacheFileIOManager.cpp:3d625cdac138 : 570 + 0xc] 14:26:46 INFO - rbx = 0x0000000117cee220 rbp = 0x00000001170d3e20 14:26:46 INFO - rsp = 0x00000001170d3e10 r12 = 0x0000000000000004 14:26:46 INFO - r13 = 0x00000001173e0110 r14 = 0x00007fff5fbfebd0 14:26:46 INFO - r15 = 0x00000001080bc0a8 rip = 0x0000000102621748 14:26:46 INFO - Found by: call frame info 14:26:46 INFO - 3 XUL!mozilla::net::CacheIOThread::LoopOneLevel(unsigned int) [CacheIOThread.cpp:3d625cdac138 : 294 + 0x6] 14:26:46 INFO - rbx = 0x0000000000000020 rbp = 0x00000001170d3e70 14:26:46 INFO - rsp = 0x00000001170d3e30 r12 = 0x0000000000000004 14:26:46 INFO - r13 = 0x00000001173e0110 r14 = 0x0000000000000003 14:26:46 INFO - r15 = 0x00000001080bc0a8 rip = 0x0000000102614d8a 14:26:46 INFO - Found by: call frame info 14:26:46 INFO - 4 XUL!mozilla::net::CacheIOThread::ThreadFunc() [CacheIOThread.cpp:3d625cdac138 : 224 + 0x8] 14:26:46 INFO - rbx = 0x00000001173e0110 rbp = 0x00000001170d3ea0 14:26:46 INFO - rsp = 0x00000001170d3e80 r12 = 0x000000000000692b 14:26:46 INFO - r13 = 0x00000000000008ff r14 = 0x00000001170d3e87 14:26:46 INFO - r15 = 0x0000000000000000 rip = 0x0000000102614b88 14:26:46 INFO - Found by: call frame info 14:26:46 INFO - 5 XUL!mozilla::net::CacheIOThread::ThreadFunc(void*) [CacheIOThread.cpp:3d625cdac138 : 173 + 0x8] 14:26:46 INFO - rbx = 0x00000001173e0110 rbp = 0x00000001170d3ec0 14:26:46 INFO - rsp = 0x00000001170d3eb0 r12 = 0x000000000000692b 14:26:46 INFO - r13 = 0x00000000000008ff r14 = 0x00000001170d4000 14:26:46 INFO - r15 = 0x0000000000000000 rip = 0x00000001026149e4 14:26:46 INFO - Found by: call frame info I suppose dom Cache could be doing something bad, but I think we need to investigate whats happening in this stack first.
Some shutdown related things changed in neck cache recently. For example, bug 1247432. It seems the change to use profile-change-net-teardown as the shutdown hook causes necko cache to shutdown before DOM cache API now. It didn't use to do that. Not sure if thats related, but seems possible. In dom/cache the only thing that changed recently is bholley's NS_ProxyRelease() stuff in bug 1164581. That also touched the necko cache.
Honza, can you take a look at this crash? I won't be able to look again until next week after I get back from PTO.
Flags: needinfo?(honzab.moz)
Michal?
Flags: needinfo?(honzab.moz) → needinfo?(michal.novotny)
Assignee: nobody → michal.novotny
Flags: needinfo?(michal.novotny)
Whiteboard: [necko-active]
It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is now called multiple times and what happens here is: - MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and waits until condVar is notified - IO Thread: ShutdownEvent::Run() notifies condVar - MainThread: CacheFileIOManager::Shutdown() joins IO thread (via gInstance->mIOThread->Shutdown()) which means we process pending events on mainthread - MainThread: CacheFileIOManager::Shutdown() is called again while we're still waiting for IO thread to be shut down and because gInstance is not null the whole shutdown process is executed again.
Blocks: 1247432
Severity: normal → critical
Attached patch fix (obsolete) (deleted) — Splinter Review
CacheFileIOManager::gInstance is now nulled out on cache IO thread in CacheFileIOManager::ShutdownInternal() while main thread is blocked on condVar in CacheFileIOManager::Shutdown(). I also did a cleanup of CacheFileIOManager usage in events. Now it is safe to check gInstance and then use it whenever on Cache IO thread without addrefing. I'm not able to get clean try run. I tried several pushes but win7 seems to be broken (not by this patch). https://treeherder.mozilla.org/#/jobs?repo=try&revision=0fbc500589f9
Attachment #8726816 - Flags: review?(honzab.moz)
(In reply to Michal Novotny (:michal) from comment #7) > It's a regression caused by bug 1247432. CacheFileIOManager::Shutdown() is > now called multiple times and what happens here is: > > - MainThread: CacheFileIOManager::Shutdown() dispatches ShutDownEvent and > waits until condVar is notified > - IO Thread: ShutdownEvent::Run() notifies condVar > - MainThread: CacheFileIOManager::Shutdown() joins IO thread (via > gInstance->mIOThread->Shutdown()) which means we process pending events on > mainthread Hmm.. no, we don't. It's doing a PR_JoingThread() which doesn't loop. > - MainThread: CacheFileIOManager::Shutdown() is called again while we're > still waiting for IO thread to be shut down and because gInstance is not > null the whole shutdown process is executed again. Based on the above, this is impossible.
Flags: needinfo?(michal.novotny)
Then I'm out of ideas here. How can we be past condVar.wait() on the main thread when condVar was not yet notified on Cache IO thread? Any ideas?
Flags: needinfo?(michal.novotny) → needinfo?(honzab.moz)
You are right we can call Shutdown in the io manager twice now. But the problem is elsewhere than in reentrance of it. It may be a race condition on access to gInstance during this shutdown phase. Actually, something I was pointing out at a long time. Still, the patch may be fixing the issue and seems correct to me.
Flags: needinfo?(honzab.moz)
Comment on attachment 8726816 [details] [diff] [review] fix Review of attachment 8726816 [details] [diff] [review]: ----------------------------------------------------------------- This should be uplifted, right? I'm a bit scared of this patch, also because we don't the true cause of this bug yet, but if you believe it's safe enough, please nominate to go along the shutdown bug. Thanks.
Attachment #8726816 - Flags: review?(honzab.moz) → review+
(In reply to Honza Bambas (:mayhemer) from comment #13) because we don't *know* the true cause of this bug yet
Comment on attachment 8726816 [details] [diff] [review] fix Approval Request Comment [Feature/regressing bug #]: 1247432 [User impact if declined]: possible shutdown crash [Describe test coverage new/current, TreeHerder]: this is tested by all existing tests that use disk cache [Risks and why]: low - This patch cleans up shutdown procedure in the cache a bit. We are not sure what exactly causes these crashes (also bug 1254720 and bug 1248389 comment 30) but this change makes accessing CacheFileIOManager::gInstance safe at least on main thread and cache IO thread. I'll probably come with yet another clean up later, but for now we should land this on the same branches as bug 1247432 was landed. [String/UUID change made/needed]: none
Attachment #8726816 - Flags: approval-mozilla-beta?
Attachment #8726816 - Flags: approval-mozilla-aurora?
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment on attachment 8726816 [details] [diff] [review] fix Crash fix, taking it.
Attachment #8726816 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8726816 [details] [diff] [review] fix Dropping the request. This is causes (at least) bug 1255690. Candidate for a complete backout!
Attachment #8726816 - Flags: approval-mozilla-beta?
Depends on: 1255690
Michal, please back this out ASAP. We need a better patch here...
Status: RESOLVED → REOPENED
Flags: needinfo?(michal.novotny)
Resolution: FIXED → ---
Flags: needinfo?(michal.novotny)
Attachment #8726816 - Flags: approval-mozilla-aurora+
And one more: https://crash-stats.mozilla.com/report/index/ab06c015-bcb0-41be-8cbb-a9ec32160309 Tho, these two crashes are just ad only two I've found. both on OSX, where this crash bug also appeared.
try run with sync,cache2:5 logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef let's try get some info from logs.
another try without the sync module: https://treeherder.mozilla.org/#/jobs?repo=try&revision=78362bdde9a3 Apparently this is a tight timing issue that sync logging is interfering with.
OK, I have a solution to this bug: let's turn on cache logging, and it will never crash again! :)
(In reply to Honza Bambas (:mayhemer) from comment #30) > (In reply to Honza Bambas (:mayhemer) from comment #26) > > try run with sync,cache2:5 logging: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=85ff56eb6fef > > > > let's try get some info from logs. > > Managed to reproduce: > > http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/ > 68c79c7068c2e5f0d4ba2b47b70325a6fcb427c9c277e8c0a1b82aa66e133f047e5f5307fe0f1 > ebba38662d6eb14d30fc11a02cdcb02eea896f7b1d3c45bf486 > nspr-pid=2336-uid=bbd4d41b-d0ac-4301-95aa-ca8612c9025c.log > http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/ > 3437c29f38a3681b9d3d1f542bda028ba5db5cd86d6c1b03409def6a2c242abfbfd94a73c5ed7 > 272b02c5d493a9b780ee70ac278f02301cc550126bd1267d90e nspr-pid=2013-uid=ab305ead-f338-43b4-9bfc-875ec11796d8.log Michal, can you figure out something from it?
Flags: needinfo?(michal.novotny)
At least the theory that CacheFileIOManager::ShutdownInternal is called more than once is gone.
No longer blocks: 1238707
Status: REOPENED → ASSIGNED
Flags: needinfo?(michal.novotny)
Depends on: 1257611
Status: ASSIGNED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → DUPLICATE
Attachment #8726816 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: