Closed
Bug 1267820
Opened 9 years ago
Closed 8 years ago
Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
Categories
(Core :: Networking: Cache, defect, P3)
RESOLVED
FIXED
mozilla52
People
(Reporter: KWierso, Assigned: mayhemer)
Details
(Keywords: intermittent-failure, Whiteboard: [necko-active])
Attachments
(1 file)
attachment 8792887: v1 (patch, deleted)
michal: review+
ritu: approval-mozilla-aurora+
ritu: approval-mozilla-beta+
Updated • 9 years ago
Component: XPCOM → Networking: Cache
Honza, do you or Michal have the cycles to take this on (either during this work week, or after?)
Flags: needinfo?(honzab.moz)
Whiteboard: [necko-next]
Comment 3 • 9 years ago
New leaks and regressions should be active... it's OK to play hot potato with the owner, using jduell as a backstop if need be.
Assignee: nobody → honzab.moz
Whiteboard: [necko-next] → [necko-active]
Comment 4 • 9 years ago (Assignee)
I have a strong suspicion this is caused by your predictor stuff, Nick. But I can take a look next week.
Flags: needinfo?(honzab.moz)
Yeah, that's possible (I'm not going to write it off entirely), though the predictor's use of the cache storage service hasn't changed at all, so it would have to be something more indirect (same for the other leaked items - no direct changes in usage of those types in the predictor). I do agree the timing is exceedingly suspicious, though.
Updated • 9 years ago
status-firefox48:
--- → affected
Updated • 8 years ago (Assignee)
Assignee: honzab.moz → hurley
Comment 39 • 8 years ago
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Updated • 8 years ago
Summary: Intermittent tc-M(7) leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...) → Intermittent leakcheck | default process: 2000 bytes leaked (CacheStorageService, CancelableRunnable, CondVar, Mutex, Runnable, ...)
Comment 47 • 8 years ago
:hurley, it has been almost 4 months and this is one of the top intermittents we have in all of Mozilla. What are the next steps here?
This got noticeably worse on Friday, August 5th, and as of Friday, August 12th it remains a top issue.
Flags: needinfo?(hurley)
Comment 48 • 8 years ago
(In reply to Joel Maher ( :jmaher) from comment #47)
> :hurley, it has been almost 4 months, this is one of the top intermittents
> we have in all of mozilla- what are the next steps here?
I was finally able to get a reproduction with refcounting logs yesterday (thanks to :swu), so I'm poring over those now trying to make sense of things. For something so common on CI, this had been impossible for me to reproduce locally or on a loaner from releng.
> This gets noticeably worse on Friday August 5th, and by Friday August 12th,
> it remains a top issue.
Yeah, that's the weird thing - I didn't change anything then, but I did notice the spike. I presume it's just that something outside my control started hitting the path that causes the leak more often, but I have zero proof of that.
Flags: needinfo?(hurley)
Comment 50 • 8 years ago
Just looking at what is leaking, I think the most likely cause of this is code trying to dispatch a runnable that holds onto a CacheStorageService after the thread service has shut down.
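To make that suspected mechanism concrete, here is a minimal toy model in C++ (hypothetical names, not actual Gecko/necko code): a runnable captures a strong reference to a service and is handed to an event target whose queue will never be drained again, so the reference is never dropped.

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <queue>

struct Service {
    ~Service() { std::cout << "Service destroyed\n"; }
};

class EventTarget {
public:
    void Shutdown() { mShutDown = true; }      // the thread stops draining its queue
    void Dispatch(std::function<void()> aTask) {
        // Bug shape: the task (and every strong reference it captures) is
        // queued even though nothing will ever run or destroy it again.
        mQueue.push(std::move(aTask));
    }
private:
    bool mShutDown = false;
    std::queue<std::function<void()>> mQueue;  // never emptied after shutdown
};

int main() {
    // Stands in for a thread whose internal queue is never torn down.
    auto* target = new EventTarget();
    target->Shutdown();
    {
        auto service = std::make_shared<Service>();
        target->Dispatch([service] { /* would touch the cache service here */ });
    }
    // "Service destroyed" never prints: the only remaining reference sits in a
    // queue nobody drains, which is what the leak checker reports at exit.
    return 0;
}
```

In the real code the equivalent would be a dispatch made after the target has shut down, leaving the already-referenced runnable (and whatever it holds, e.g. the CacheStorageService) alive at process exit.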
Comment 51 • 8 years ago
I triggered this locally today once by running the dom/workers/test/serviceworkers mochitests. I'm not sure if it only triggers in this directory, but just mentioning it as a data point.
That being said, InterceptedChannel does have this:
https://dxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/InterceptedChannel.h#71
But I don't see any InterceptedChannel in the set of leaked objects.
Comment 52 • 8 years ago
Indeed, I've tracked it down to somewhere in the chain of stuff under Predictor::UpdateCacheability. That does indeed have a runnable associated with it, but it's currently unclear to me exactly where things go off the rails (for example, if the runnable has no real work to do, the leak never happens, even though the runnable runs). I'm inspecting & commenting out larger and larger swaths of code to hopefully track down the bits that get us into trouble.
Comment 53 • 8 years ago (Assignee)
Nick, the connection of these leaks to the predictor is a theory, right? Have you considered carefully hard-disabling the predictor for a few days on m-c to see if these leaks really go away? Maybe we're chasing something that isn't there. There are almost two weeks until the next merge, so there's enough time to get data without letting this slip to m-a.
Flags: needinfo?(hurley)
Comment 54 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #53)
> Nick, connection of these leaks to the predictor is a theory, right? Was
> you considering careful hard-disabling of the predictor for few days on m-c
> and see if these leaks really go away? Maybe we trace something that isn't
> there. There are almost two weeks to the next merge, so enough time to get
> data and not getting it slipped to m-a.
These are specifically related to prefetch. I'm pretty comfortable in that evaluation. Prefetch is disabled on aurora, beta, and release, and none of these leaks show up on those branches - only on nightly.
Like I alluded to in comment 52, if I early-return from Predictor::UpdateCacheability, the leak goes away. Same if we never find the right thing to update cacheability of. Now it's just a game of bisecting which bit of the work triggered from UpdateCacheability is directly responsible for the leak so I can figure it out. It could be in the UpdateCacheability stack directly, or it could be a bug in the code that actually performs the prefetches (which only runs if UpdateCacheability runs).
Flags: needinfo?(hurley)
Comment 55 • 8 years ago
Do we have an xpcom shutdown observer that spins the event loop if necko work is still outstanding? That's how we handle these issues in other subsystems.
Comment 56 • 8 years ago
(In reply to Ben Kelly [:bkelly] from comment #55)
> Do we have an xpcom shutdown observer that spins the event loop if necko
> work is still outstanding? Thats how we handle these issues in other
> subsystems.
The work the predictor does directly is on the main thread, so as long as the main thread clears out its queue, that part is good. However, this code also interacts with the cache thread and the socket thread, and I don't know offhand if either of those have a shutdown observer to spin the loop. Honza, I know you'll know about the cache thread, and I suspect you'll have a good answer about the socket thread, too.
Flags: needinfo?(honzab.moz)
Comment 57 • 8 years ago (Assignee)
Now I don't follow. Is it good to have an observer that spins or is it bad? From my judgment it would do more harm than good. Also, how would that be related to leaks?
Comment 58 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #57)
> Now I don't follow. Is it good to have an observer that spins or is it bad?
> From my judgment it would do more harm than good. Also, how would that be
> related to leaks?
AFAIK it's the only way to safely shut down async work. It's what we do in our quota-based storage APIs:
https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.cpp#2449
It's related to leaks in that spinning the event loop there allows you to block shutdown until your async work is done.
Spinning is bad if it can be triggered directly from JS; it can break run-to-completion, etc. Generally you don't want to spin. However, spinning to clean up at shutdown seems like the one place it's necessary and acceptable to me.
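For illustration only, a sketch of that shutdown-observer pattern (the topic string, class name, and pending-operation counter are assumptions, not the actual quota-manager or necko code):

```cpp
#include <atomic>
#include <cstring>

#include "nsIObserver.h"
#include "nsThreadUtils.h"

// Hypothetical observer; registration with the observer service is omitted.
class MyShutdownBlocker final : public nsIObserver {
public:
  NS_DECL_ISUPPORTS

  NS_IMETHOD Observe(nsISupports*, const char* aTopic, const char16_t*) override {
    if (!strcmp(aTopic, "profile-before-change")) {
      // Block shutdown until every async operation we started has completed
      // and released its references; this is the one place where spinning the
      // event loop is considered acceptable, per the discussion above.
      while (mPendingOperations > 0) {
        NS_ProcessNextEvent(nullptr, /* aMayWait = */ true);
      }
    }
    return NS_OK;
  }

  // Incremented when async work is dispatched, decremented from its
  // completion runnable back on the main thread.
  std::atomic<uint32_t> mPendingOperations{0};

private:
  ~MyShutdownBlocker() = default;
};

NS_IMPL_ISUPPORTS(MyShutdownBlocker, nsIObserver)
```

The key point is that the loop keeps processing main-thread events until the outstanding async work has reported completion, so nothing is left queued and referenced at exit.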
Comment 61 • 8 years ago (Assignee)
(In reply to Ben Kelly [:bkelly] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > Now I don't follow. Is it good to have an observer that spins or is it bad?
> > From my judgment it would do more harm than good. Also, how would that be
> > related to leaks?
>
> AFAIK its the only way to safely shutdown async work. Its what we do in our
> quota-based storage APIs:
>
> https://dxr.mozilla.org/mozilla-central/source/dom/quota/ActorsParent.
> cpp#2449
This might be needed for more complex stuff, yes (though it sounds to me a bit like a bad design).
>
> Its related to leaks in that spinning the event loop there allows you to
> block shutdown until your async work is done.
>
> Spinning is bad if it can be triggered directly from JS; can break
> run-to-completion, etc. Generally you don't want to spin. However,
> spinning to cleanup at shutdown seems the one place its necessary and
> acceptable to me.
Nothing like this is needed for the cache IO thread. XPCOM shutdown spins the loop for us later anyway (correctly!), so anything scheduled from the IO thread to run back on the main thread will execute. Before that we block (no spinning) waiting for the cache thread to finish, and we don't intend to change that, because there is no need to. Cache IO shutdown doesn't need to loop through the main thread to actually finish its work.
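Outside of Gecko, that ordering can be sketched roughly like this (illustrative names, not the actual cache shutdown code): block, without spinning, until the IO thread drains its own work, and let anything it posted back to the main thread run when the main loop is spun later during XPCOM shutdown.

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <thread>

int main() {
    std::queue<std::function<void()>> mainThreadQueue;

    // Cache IO thread: finishes its own work, then posts a final cleanup task
    // back to the main thread before exiting.
    std::thread cacheIoThread([&] {
        // ... flush entries, close files ...
        mainThreadQueue.push([] { std::cout << "release main-thread refs\n"; });
    });

    // Step 1: block (no event-loop spinning) until the IO thread is done.
    cacheIoThread.join();

    // Step 2: the main-thread loop is spun later during shutdown, so the task
    // the IO thread posted back still runs and its references are released.
    while (!mainThreadQueue.empty()) {
        mainThreadQueue.front()();
        mainThreadQueue.pop();
    }
    return 0;
}
```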
Flags: needinfo?(honzab.moz)
Comment 64 • 8 years ago
So I've narrowed the leak down to one line (within the predictor): http://searchfox.org/mozilla-central/rev/6b94deded39a868f248a492e74e05733d6c4ed48/netwerk/base/Predictor.cpp#2418
If I comment out that line, the leak goes away. If I leave it in, the leak stays. Even if I do *nothing* in the OnCacheEntryAvailable associated with that AsyncOpenURI, the leak stays. Even if I return ENTRY_NOT_WANTED in the OnCacheEntryCheck associated with that AsyncOpenURI, the leak stays.
This is starting to look more and more like a shutdown race as bkelly mentioned in comment 55 and comment 58. Honza, I know you said it didn't seem possible, but can you take a look at the line linked above and make sure I'm not doing anything stupid there, or something unexpected from the cache service's point of view? If the code in the predictor looks sane, I think we need to dig into timing in the cache, which I'll need your help with. I'll kick off a try run with cache logging enabled so we can get some logs from the leak in case we end up needing it.
Flags: needinfo?(honzab.moz)
Comment 65 • 8 years ago
Try run with cache logging mentioned above: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8 (I'll likely need to retrigger some mochitest-1 jobs before failures with logs show up)
Comment 66 • 8 years ago (Assignee)
Nick, thanks! Will look at the logs when available.
Comment 67 • 8 years ago (Assignee)
I think I know the cause. We probably schedule a timer (the purge timer) in the service after it has been shut down. I will have a patch soon.
Assignee: hurley → honzab.moz
Flags: needinfo?(honzab.moz)
Comment 68 • 8 years ago (Assignee)
As that seems unlikely (I already checked on that in the past and rejected it as a cause), I would like to see the log anyway.
Comment 69 • 8 years ago
An affected nspr log is at https://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/03564317e8995fe4adfe3e9854c87aba8b8edda76aa5fea9398704250ab19dec5a24cbdcee1c894626d080eb00b38f21e11451a0f36a3318c69b5d8e1f857290 from the job at https://treeherder.mozilla.org/#/jobs?repo=try&revision=f56e3b5f35a8&selectedJob=27474777
There are a few other jobs in the same push that have the failure, so if needed, those logs are available too.
Comment 76 • 8 years ago (Assignee)
Cause:
2016-09-19 18:57:42.255602 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:43.260031 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::Notify
2016-09-19 18:57:46.154352 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::ShutdownBackground
2016-09-19 18:57:46.154709 UTC - [Cache2 I/O]: D/cache2 mMemorySize=259780 (+172,-0)
2016-09-19 18:57:46.155910 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
2016-09-19 18:57:46.155988 UTC - [Cache2 I/O]: D/cache2 CacheStorageService::SchedulePurgeOverMemoryLimit
The timer is never notified, nor canceled and released. Hence the timer<->service cycle and the leak.
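A toy model of that cycle (hypothetical names, not the actual CacheStorageService code): the service arms its purge timer after shutdown, the timer keeps a strong reference back to the service as its callback, and since nothing ever fires or cancels the timer, neither object can be released.

```cpp
#include <iostream>
#include <memory>

struct Service;

struct Timer {
    std::shared_ptr<Service> callback;   // strong ref held until fired or canceled
};

struct Service : std::enable_shared_from_this<Service> {
    std::shared_ptr<Timer> purgeTimer;   // strong ref to the armed timer
    bool shutDown = false;

    void SchedulePurgeOverMemoryLimit() {
        // Bug shape: no check of `shutDown` here, so a call arriving after
        // shutdown re-arms the timer and creates the cycle.
        purgeTimer = std::make_shared<Timer>();
        purgeTimer->callback = shared_from_this();
    }
    ~Service() { std::cout << "Service destroyed\n"; }
};

int main() {
    auto service = std::make_shared<Service>();
    service->shutDown = true;                 // analogous to ShutdownBackground having run
    service->SchedulePurgeOverMemoryLimit();  // arrives after shutdown, as in the log above
    service.reset();
    // "Service destroyed" never prints: Service <-> Timer keep each other
    // alive, which is the kind of small fixed-size leak the harness reports.
    return 0;
}
```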
Nick, just FYI: using the --set-env try arg to set up logging is not the appropriate way. Use the MOZ_LOG var in testing/mochitest/runtests.py, which correctly handles overlapping logs produced by multiple processes that all see the same MOZ_LOG_FILE env var. That was the source of my initial confusion here.
Status: NEW → ASSIGNED
Comment 77 • 8 years ago (Assignee)
- no longer engage the purge timer after mShutdown = true (set in CacheStorageService::Shutdown, which is synchronized before CacheStorageService::ShutdownBackground)
- I also added locking to all the places where we work with the timer (except the one non-null check; a comment noting that it's intentional is already there)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5532246b1395 (leak checking try push)
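Roughly, the shape of the fix in the toy model above (a sketch mirroring the patch description, not the actual patch): take the lock, re-check the shutdown flag, and refuse to arm the purge timer once shutdown has started.

```cpp
#include <memory>
#include <mutex>

struct Service;

struct Timer {
    std::shared_ptr<Service> mCallback;   // strong ref, as in the toy model above
};

struct Service : std::enable_shared_from_this<Service> {
    std::mutex mLock;                     // the added locking around timer accesses
    std::shared_ptr<Timer> mPurgeTimer;
    bool mShutdown = false;

    void Shutdown() {
        std::lock_guard<std::mutex> guard(mLock);
        mShutdown = true;
        if (mPurgeTimer) {                // cancel and drop an already-armed timer
            mPurgeTimer->mCallback.reset();
            mPurgeTimer.reset();
        }
    }

    void SchedulePurgeOverMemoryLimit() {
        std::lock_guard<std::mutex> guard(mLock);
        if (mShutdown) {
            return;                       // the fix: never arm the purge timer after shutdown
        }
        mPurgeTimer = std::make_shared<Timer>();
        mPurgeTimer->mCallback = shared_from_this();
    }
};

int main() {
    auto service = std::make_shared<Service>();
    service->Shutdown();
    service->SchedulePurgeOverMemoryLimit();  // now a no-op: no cycle, no leak
    return 0;
}
```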
Attachment #8792887 - Flags: review?(michal.novotny)
Updated • 8 years ago
Attachment #8792887 - Flags: review?(michal.novotny) → review+
Comment 78 • 8 years ago (Assignee)
Leak's gone: https://treeherder.mozilla.org/#/jobs?repo=try&revision=be85031261f66afc059d4cc6bf438c556a92ad28
Keywords: checkin-needed
Comment 79 • 8 years ago
(In reply to Honza Bambas (:mayhemer) from comment #78)
> Leak's gone
I'm wishing *so* hard for emoji support in bugzilla right now. Woo, Honza!
Updated • 8 years ago
Comment 81 • 8 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a370db31b85
Fix HTTP cache2 leak after shutdown, don't setup purging timer after shutdown. r=michal
Keywords: checkin-needed
Comment 82 • 8 years ago
bugherder
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 84 • 8 years ago
Please request Aurora/Beta approval on this when you get a chance.
Flags: needinfo?(honzab.moz)
Comment 85 • 8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> Please request Aurora/Beta approval on this when you get a chance.
I'm not convinced we should do this - while this has been a latent bug, it's only (to our knowledge) triggered by predictor prefetch stuff, which is disabled everywhere but nightly - hence no leaks on aurora or beta in the OrangeFactor reports. I'm not sure it's worth the risk of uplifting a patch. Then again, I didn't write the patch, so take what I say with somewhere between a grain of salt and an entire salt lick :D
Comment 86 • 8 years ago
OrangeFactor shows this hitting on Aurora50 throughout that cycle, so consider me confused then :)
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&endday=2016-09-19&startday=2016-08-01&tree=mozilla-aurora
Comment 87 • 8 years ago (Assignee)
This is independent of the predictor/prefetch. The predictor only helped narrow it down because it apparently stresses the cache2 back end during shutdown much more than when it's not turned on. Will request approvals.
Flags: needinfo?(honzab.moz)
Comment 88 • 8 years ago (Assignee)
Comment on attachment 8792887 [details] [diff] [review]
v1
Approval Request Comment
[Feature/regressing bug #]: HTTP cache2
[User impact if declined]: none (it's just a leak not causing any secondary trouble), although, generally speaking, this frequent intermittent failure may well lead to something more important being overlooked
[Describe test coverage new/current, TreeHerder]: just landed on m-c
[Risks and why]: zero, the patch just doesn't allow a certain non-critical action after the shutdown mark
[String/UUID change made/needed]: none
Attachment #8792887 - Flags: approval-mozilla-beta?
Attachment #8792887 - Flags: approval-mozilla-aurora?
Hi Wes, could you please confirm this fixes the intermittent? That will help me decide whether to uplift this or not. Thanks!
Flags: needinfo?(wkocher)
Comment 90 • 8 years ago (Reporter)
The green line at https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1267820&startday=2016-08-01&endday=2016-09-21&tree=trunk would be the thing to track, but I think it's still probably too early to tell if this has been fixed. Leaving the ni to check back in a few days.
Ok. I'll wait for confirmation from Sheriffs before uplifting this one to Beta50.
Comment on attachment 8792887 [details] [diff] [review]
v1
Verified by Wes to have fixed the intermittent failure, Aurora51+, Beta50+
Attachment #8792887 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Attachment #8792887 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 95 • 8 years ago (Reporter)
bugherder uplift
Comment 96 • 8 years ago (Reporter)
bugherder uplift