Closed
Bug 1267820
Opened 9 years ago
Closed 8 years ago
Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
Categories
(Core :: Networking: Cache, defect, P3)
RESOLVED
FIXED
mozilla52
People
(Reporter: KWierso, Assigned: mayhemer)
Details
(Keywords: intermittent-failure, Whiteboard: [necko-active])
Attachments
(1 file)
attachment 8792887: v1 (patch, deleted)
michal: review+
ritu: approval-mozilla-aurora+
ritu: approval-mozilla-beta+
Updated • 9 years ago
Component: XPCOM → Networking: Cache
Honza, do you or Michal have the cycles to take this on (either during this work week, or after?)
Flags: needinfo?(honzab.moz)
Whiteboard: [necko-next]
Comment 3 • 9 years ago
New leaks and regressions should be active... it's OK to play hot potato with the owner, using jduell as a backstop if need be.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
Comment 4 • 9 years ago (Assignee)
I have a strong suspicion this is caused by your predictor stuff, Nick. But I can take a look next week.
Flags: needinfo?(honzab.moz)
Yeah, that's possible (I'm not going to write it off entirely), though the predictor's use of the cache storage service hasn't changed at all, so it would have to be something more indirect (same for the other leaked items - no direct changes in usage of those types in the predictor). I do agree the timing is exceedingly suspicious, though.
Updated • 9 years ago
status-firefox48:
--- → affected
Updated • 8 years ago (Assignee)
Assignee: honzab.moz → hurley
Comment 39 • 8 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Updated • 8 years ago
Summary: Intermittent tc-M(7) leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...) → Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
Comment 47 • 8 years ago
:hurley, it has been almost 4 months and this is one of the top intermittents we have in all of Mozilla. What are the next steps here?
This got noticeably worse on Friday, August 5th, and as of Friday, August 12th it remains a top issue.
Flags: needinfo?(hurley)
Comment 48 • 8 years ago
(In reply to Joel Maher ( :jmaher) from comment #47)
> :hurley, it has been almost 4 months, this is one of the top intermittents
> we have in all of mozilla- what are the next steps here?
I was finally able to get a reproduction with refcounting logs yesterday (thanks to :swu), so I'm poring over those now trying to make sense of things. For something so common on CI, this had been impossible for me to reproduce locally or on a loaner from releng.
> This gets noticeably worse on Friday August 5th, and by Friday August 12th,
> it remains a top issue.
Yeah, that's the weird thing - I didn't change anything then, but I did notice the spike. I presume it's just that something outside my control started hitting the path that causes the leak more often, but I have zero proof of that.
Flags: needinfo?(hurley)
Comment 50 • 8 years ago
Just looking at what is leaking, I think the most likely cause of this is code trying to dispatch a runnable that holds onto a CacheStorageService after the thread service has shut down.
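To make that suspected mechanism concrete, here is a minimal toy model in C++ (hypothetical names, not actual Gecko/necko code): a runnable captures a strong reference to a service and is handed to an event target whose queue will never be drained again, so the reference is never dropped.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <queue>

struct Service {
    ~Service() { std::cout << "Service destroyed\n"; }
};

class EventTarget {
public:
    void Shutdown() { mShutDown = true; }      // the thread stops draining its queue
    void Dispatch(std::function<void()> aTask) {
        // Bug shape: the task (and every strong reference it captures) is
        // queued even though nothing will ever run or destroy it again.
        mQueue.push(std::move(aTask));
    }
private:
    bool mShutDown = false;
    std::queue<std::function<void()>> mQueue;  // never emptied after shutdown
};

int main() {
    // Stands in for a thread whose internal queue is never torn down.
    auto* target = new EventTarget();
    target->Shutdown();
    {
        auto service = std::make_shared<Service>();
        target->Dispatch([service] { /* would touch the cache service here */ });
    }
    // "Service destroyed" never prints: the only remaining reference sits in a
    // queue nobody drains, which is what the leak checker reports at exit.
    return 0;
}
```

In the real code the equivalent would be a dispatch made after the target has shut down, leaving the already-referenced runnable (and whatever it holds, e.g. the CacheStorageService) alive at process exit.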
Comment 51 • 8 years ago
I triggered this locally today once by running the dom/workers/test/serviceworkers mochitests. I'm not sure if it only triggers in this directory, but just mentioning it as a data point.
That being said, InterceptedChannel does have this:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/InterceptedChannel.h#71
But I don't see any InterceptedChannel in the set of leaked objects.
Comment 52 • 8 years ago
Indeed, I've tracked it down to somewhere in the chain of stuff under Predictor::UpdateCacheability. That does indeed have a runnable associated with it, but it's currently unclear to me exactly where things go off the rails (for example, if the runnable has no real work to do, the leak never happens, even though the runnable runs). I'm inspecting & commenting out larger and larger swaths of code to hopefully track down the bits that get us into trouble.
Comment 53 • 8 years ago (Assignee)
Nick, the connection of these leaks to the predictor is a theory, right? Have you considered carefully hard-disabling the predictor for a few days on m-c to see if these leaks really go away? Maybe we're chasing something that isn't there. There are almost two weeks until the next merge, so there's enough time to get data without letting this slip to m-a.
Flags: needinfo?(hurley)
Comment 54 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #53)
> Nick, connection of these leaks to the predictor is a theory, right? Was
> you considering careful hard-disabling of the predictor for few days on m-c
> and see if these leaks really go away? Maybe we trace something that isn't
> there. There are almost two weeks to the next merge, so enough time to get
> data and not getting it slipped to m-a.
These are specifically related to prefetch. I'm pretty comfortable in that evaluation. Prefetch is disabled on aurora, beta, and release, and none of these leaks show up on those branches - only on nightly.
Like I alluded to in comment 52, if I early-return from Predictor::UpdateCacheability, the leak goes away. Same if we never find the right thing to update cacheability of. Now it's just a game of bisecting which bit of the work triggered from UpdateCacheability is directly responsible for the leak so I can figure it out. It could be in the UpdateCacheability stack directly, or it could be a bug in the code that actually performs the prefetches (which only runs if UpdateCacheability runs).
Flags: needinfo?(hurley)
Comment 55 • 8 years ago
Do we have an xpcom shutdown observer that spins the event loop if necko work is still outstanding? That's how we handle these issues in other subsystems.
Comment 56 • 8 years ago
(In reply to Ben Kelly [:bkelly] from comment #55)
> Do we have an xpcom shutdown observer that spins the event loop if necko
> work is still outstanding? Thats how we handle these issues in other
> subsystems.
The work the predictor does directly is on the main thread, so as long as the main thread clears out its queue, that part is good. However, this code also interacts with the cache thread and the socket thread, and I don't know offhand if either of those have a shutdown observer to spin the loop. Honza, I know you'll know about the cache thread, and I suspect you'll have a good answer about the socket thread, too.
Flags: needinfo?(honzab.moz)
Comment 57 • 8 years ago (Assignee)
Now I don't follow. Is it good to have an observer that spins or is it bad? From my judgment it would do more harm than good. Also, how would that be related to leaks?
Comment 58 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #57)
> Now I don't follow. Is it good to have an observer that spins or is it bad?
> From my judgment it would do more harm than good. Also, how would that be
> related to leaks?
AFAIK it's the only way to safely shut down async work. It's what we do in our quota-based storage APIs:
https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449
It's related to leaks in that spinning the event loop there allows you to block shutdown until your async work is done.
Spinning is bad if it can be triggered directly from JS; it can break run-to-completion, etc. Generally you don't want to spin. However, spinning to clean up at shutdown seems like the one place it's necessary and acceptable to me.
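For illustration only, a sketch of that shutdown-observer pattern (the topic string, class name, and pending-operation counter are assumptions, not the actual quota-manager or necko code):

```cpp
#include <atomic>
#include <cstring>

#include "nsIObserver.h"
#include "nsThreadUtils.h"

// Hypothetical observer; registration with the observer service is omitted.
class MyShutdownBlocker final : public nsIObserver {
public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD Observe(nsISupports*, const char* aTopic, const char16_t*) override {
    if (!strcmp(aTopic, "profile-before-change")) {
      // Block shutdown until every async operation we started has completed
      // and released its references; this is the one place where spinning the
      // event loop is considered acceptable, per the discussion above.
      while (mPendingOperations > 0) {
        NS_ProcessNextEvent(nullptr, /* aMayWait = */ true);
      }
    }
    return NS_OK;
  }

  // Incremented when async work is dispatched, decremented from its
  // completion runnable back on the main thread.
  std::atomic<uint32_t> mPendingOperations{0};

private:
  ~MyShutdownBlocker() = default;
};

NS_IMPL_ISUPPORTS(MyShutdownBlocker, nsIObserver)
```

The key point is that the loop keeps processing main-thread events until the outstanding async work has reported completion, so nothing is left queued and referenced at exit.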
Comment 61 • 8 years ago (Assignee)
(In reply to Ben Kelly [:bkelly] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > Now I don't follow. Is it good to have an observer that spins or is it bad?
> > From my judgment it would do more harm than good. Also, how would that be
> > related to leaks?
>
> AFAIK its the only way to safely shutdown async work. Its what we do in our
> quota-based storage APIs:
>
> https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.
> cpp#2449
This might be needed for more complex stuff, yes (though it sounds to me a bit like a bad design).
>
> Its related to leaks in that spinning the event loop there allows you to
> block shutdown until your async work is done.
>
> Spinning is bad if it can be triggered directly from JS; can break
> run-to-completion, etc. Generally you don't want to spin. However,
> spinning to cleanup at shutdown seems the one place its necessary and
> acceptable to me.
Nothing like this is needed for the cache IO thread. XPCOM shutdown spins the loop for us later anyway (correctly!), so anything scheduled from the IO thread to run back on the main thread will execute. Before that we block (no spinning) waiting for the cache thread to finish, and we don't intend to change that, because there is no need to. Cache IO shutdown doesn't need to loop through the main thread to actually finish its work.
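Outside of Gecko, that ordering can be sketched roughly like this (illustrative names, not the actual cache shutdown code): block, without spinning, until the IO thread drains its own work, and let anything it posted back to the main thread run when the main loop is spun later during XPCOM shutdown.

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <thread>

int main() {
    std::queue<std::function<void()>> mainThreadQueue;

    // Cache IO thread: finishes its own work, then posts a final cleanup task
    // back to the main thread before exiting.
    std::thread cacheIoThread([&] {
        // ... flush entries, close files ...
        mainThreadQueue.push([] { std::cout << "release main-thread refs\n"; });
    });

    // Step 1: block (no event-loop spinning) until the IO thread is done.
    cacheIoThread.join();

    // Step 2: the main-thread loop is spun later during shutdown, so the task
    // the IO thread posted back still runs and its references are released.
    while (!mainThreadQueue.empty()) {
        mainThreadQueue.front()();
        mainThreadQueue.pop();
    }
    return 0;
}
```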
Flags: needinfo?(honzab.moz)
Comment 64 • 8 years ago
So I've narrowed the leak down to one line (within the predictor): http://searchfox.org/mozilla-central/rev/6b94deded39a868f248a492e74e05733d6c4ed48/netwerk/base/Predictor.cpp#2418
If I comment out that line, the leak goes away. If I leave it in, the leak stays. Even if I do *nothing* in the OnCacheEntryAvailable associated with that AsyncOpenURI, the leak stays. Even if I return ENTRY_NOT_WANTED in the OnCacheEntryCheck associated with that AsyncOpenURI, the leak stays.
This is starting to look more and more like a shutdown race as bkelly mentioned in comment 55 and comment 58. Honza, I know you said it didn't seem possible, but can you take a look at the line linked above and make sure I'm not doing anything stupid there, or something unexpected from the cache service's point of view? If the code in the predictor looks sane, I think we need to dig into timing in the cache, which I'll need your help with. I'll kick off a try run with cache logging enabled so we can get some logs from the leak in case we end up needing it.
Flags: needinfo?(honzab.moz)
Comment 65 • 8 years ago
Try run with cache logging mentioned above: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8 (I'll likely need to retrigger some mochitest-1 jobs before failures with logs show up)
Comment 66 • 8 years ago (Assignee)
Nick, thanks! Will look at the logs when available.
Comment 67 • 8 years ago (Assignee)
I think I know the cause. We probably schedule a timer (the purge timer) in the service after it has been shut down. I will have a patch soon.
Assignee: hurley → honzab.moz
Flags: needinfo?(honzab.moz)
Comment 68 • 8 years ago (Assignee)
As that seems unlikely (I already checked on that in the past and rejected it as a cause), I would like to see the log anyway.
Comment 69 • 8 years ago
An affected nspr log is at https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/03564317e8995fe4adfe3e9854c87aba8b8edda76aa5fea9398704250ab19dec5a24cbdcee1c894626d080eb00b38f21e11451a0f36a3318c69b5d8e1f857290 from the job at https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8&selectedJob=27474777
There are a few other jobs in the same push that have the failure, so if needed, those logs are available too.
Comment 76 • 8 years ago (Assignee)
Cause:
2016-09-19 18:57:42.255602 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:43.260031 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::Notify
2016-09-19 18:57:46.154352 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::ShutdownBackground
2016-09-19 18:57:46.154709 UTC - [Cache2 I/O]: D/cache2 mMemorySize=259780 (+172,-0)
2016-09-19 18:57:46.155910 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:46.155988 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
The timer is never notified, nor canceled and released. Hence the timer<->service cycle and the leak.
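A toy model of that cycle (hypothetical names, not the actual CacheStorageService code): the service arms its purge timer after shutdown, the timer keeps a strong reference back to the service as its callback, and since nothing ever fires or cancels the timer, neither object can be released.

```cpp
#include <iostream>
#include <memory>

struct Service;

struct Timer {
    std::shared_ptr<Service> callback;   // strong ref held until fired or canceled
};

struct Service : std::enable_shared_from_this<Service> {
    std::shared_ptr<Timer> purgeTimer;   // strong ref to the armed timer
    bool shutDown = false;

    void SchedulePurgeOverMemoryLimit() {
        // Bug shape: no check of `shutDown` here, so a call arriving after
        // shutdown re-arms the timer and creates the cycle.
        purgeTimer = std::make_shared<Timer>();
        purgeTimer->callback = shared_from_this();
    }
    ~Service() { std::cout << "Service destroyed\n"; }
};

int main() {
    auto service = std::make_shared<Service>();
    service->shutDown = true;                 // analogous to ShutdownBackground having run
    service->SchedulePurgeOverMemoryLimit();  // arrives after shutdown, as in the log above
    service.reset();
    // "Service destroyed" never prints: Service <-> Timer keep each other
    // alive, which is the kind of small fixed-size leak the harness reports.
    return 0;
}
```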
Nick, just FYI: using the --set-env try arg to set up logging is not the appropriate way. Use the MOZ_LOG var in testing/mochitest/runtests.py, which correctly handles overlapping logs produced by multiple processes that all see the same MOZ_LOG_FILE env var. That was the source of my initial confusion here.
Status: NEW → ASSIGNED
Comment 77 • 8 years ago (Assignee)
- no longer engage the purge timer after mShutdown = true (set in CacheStorageService::Shutdown, which is synchronized before CacheStorageService::ShutdownBackground)
- I also added locking to all the places where we work with the timer (except the one non-null check; a comment noting that it's intentional is already there)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5532246b1395 (leak checking try push)
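Roughly, the shape of the fix in the toy model above (a sketch mirroring the patch description, not the actual patch): take the lock, re-check the shutdown flag, and refuse to arm the purge timer once shutdown has started.

```cpp
#include <memory>
#include <mutex>

struct Service;

struct Timer {
    std::shared_ptr<Service> mCallback;   // strong ref, as in the toy model above
};

struct Service : std::enable_shared_from_this<Service> {
    std::mutex mLock;                     // the added locking around timer accesses
    std::shared_ptr<Timer> mPurgeTimer;
    bool mShutdown = false;

    void Shutdown() {
        std::lock_guard<std::mutex> guard(mLock);
        mShutdown = true;
        if (mPurgeTimer) {                // cancel and drop an already-armed timer
            mPurgeTimer->mCallback.reset();
            mPurgeTimer.reset();
        }
    }

    void SchedulePurgeOverMemoryLimit() {
        std::lock_guard<std::mutex> guard(mLock);
        if (mShutdown) {
            return;                       // the fix: never arm the purge timer after shutdown
        }
        mPurgeTimer = std::make_shared<Timer>();
        mPurgeTimer->mCallback = shared_from_this();
    }
};

int main() {
    auto service = std::make_shared<Service>();
    service->Shutdown();
    service->SchedulePurgeOverMemoryLimit();  // now a no-op: no cycle, no leak
    return 0;
}
```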
Attachment #8792887 - Flags: review?(michal.novotny)
Updated • 8 years ago
Attachment #8792887 - Flags: review?(michal.novotny) → review+
Comment 78 • 8 years ago (Assignee)
Leak's gone: https://treeherder.mozilla.org/#/jobs?repo=try&revision=be85031261f66afc059d4cc6bf438c556a92ad28
Keywords: checkin-needed
Comment 79 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #78)
> Leak's gone
I'm wishing *so* hard for emoji support in bugzilla right now. Woo, Honza!
Updated • 8 years ago
Comment 81 • 8 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a370db31b85
Fix HTTP cache2 leak after shutdown, don't setup purging timer after shutdown. r=michal
Keywords: checkin-needed
Comment 82 • 8 years ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 84 • 8 years ago
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(honzab.moz)
Comment 85 • 8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> Please request Aurora/Beta approval on this when you get a chance.
I'm not convinced we should do this - while this has been a latent bug, it's only (to our knowledge) triggered by predictor prefetch stuff, which is disabled everywhere but nightly - hence no leaks on aurora or beta in the OrangeFactor reports. I'm not sure it's worth the risk of uplifting a patch. Then again, I didn't write the patch, so take what I say with somewhere between a grain of salt and an entire salt lick :D
Comment 86 • 8 years ago
OrangeFactor shows this hitting on Aurora50 throughout that cycle, so consider me confused then :)
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&endday=2016-09-19&startday=2016-08-01&tree=mozilla-aurora
Comment 87 • 8 years ago (Assignee)
This is independent of the predictor/prefetch. The predictor only helped narrow it down because it apparently stresses the cache2 back end during shutdown much more than when it's not turned on. Will request approvals.
Flags: needinfo?(honzab.moz)
Comment 88 • 8 years ago (Assignee)
Comment on attachment 8792887 [details] [diff] [review]
v1
Approval Request Comment
[Feature/regressing bug #]: HTTP cache2
[User impact if declined]: none (it's just a leak not causing any secondary trouble), although, generally speaking, this frequent intermittent failure may well lead to something more important being overlooked
[Describe test coverage new/current, TreeHerder]: just landed on m-c
[Risks and why]: zero, the patch just doesn't allow a certain non-critical action after the shutdown mark
[String/UUID change made/needed]: none
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-aurora?
Hi Wes, could you please confirm this fixes the intermittent? That will help me decide whether to uplift this or not. Thanks!
Flags: needinfo?(wkocher)
Comment 90 • 8 years ago (Reporter)
The green line at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-09-21&tree=trunk would be the thing to track, but I think it's still probably too early to tell if this has been fixed. Leaving the ni to check back in a few days.
Ok. I'll wait for confirmation from Sheriffs before uplifting this one to Beta50.
Comment on attachment 8792887 [details] [diff] [review]
v1
Verified by Wes to have fixed the intermittent failure, Aurora51+, Beta50+
Attachment #8792887 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8792887 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 95 • 8 years ago (Reporter)
bugherder uplift
Comment 96 • 8 years ago (Reporter)
bugherder uplift