Closed Bug 1239504 Opened 9 years ago Closed 9 years ago

LSan detects ImageBridgeChild thread Profiler leaks when the update channel is set to beta

Categories

(Core :: Graphics: Layers, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking


Status: RESOLVED FIXED
Target Milestone: mozilla47
Tracking Status
e10s + ---
firefox45 + fixed
firefox46 + fixed
firefox47 --- fixed

People

(Reporter: mccr8, Assigned: tromey)

References

(Blocks 1 open bug)

Details

(Keywords: memory-leak)

Attachments

(1 file, 1 obsolete file)

RyanVM did a try run with e10s enabled for Aurora, except set as Release: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5ffa74919006&group_state=expanded&filter-searchStr=m-e10s

LSan detects a bunch of leaks in the profiler. The allocation stacks of the leaking objects are:

14:31:59 INFO - ==2039==ERROR: LeakSanitizer: detected memory leaks
14:31:59 INFO - Direct leak of 24640 byte(s) in 1 object(s) allocated from:
14:31:59 INFO - #0 0x4750d1 in __interceptor_malloc /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_malloc_linux.cc:74
14:31:59 INFO - #1 0x48dd7d in moz_xmalloc /builds/slave/try-l64-asan-00000000000000000/build/src/memory/mozalloc/mozalloc.cpp:83
14:31:59 INFO - #2 0x7f422377c9a6 in operator new /builds/slave/try-l64-asan-00000000000000000/build/src/obj-firefox/tools/profiler/../../dist/include/mozilla/mozalloc.h:186
14:31:59 INFO - #3 0x7f422377c9a6 in create /builds/slave/try-l64-asan-00000000000000000/build/src/obj-firefox/tools/profiler/../../dist/include/PseudoStack.h:230
14:31:59 INFO - #4 0x7f422377c9a6 in mozilla_sampler_register_thread(char const*, void*) /builds/slave/try-l64-asan-00000000000000000/build/src/tools/profiler/core/platform.cpp:1021
14:31:59 INFO - #5 0x7f421cd9a39e in profiler_register_thread /builds/slave/try-l64-asan-00000000000000000/build/src/obj-firefox/ipc/chromium/../../dist/include/GeckoProfilerImpl.h:225
14:31:59 INFO - #6 0x7f421cd9a39e in base::Thread::ThreadMain() /builds/slave/try-l64-asan-00000000000000000/build/src/ipc/chromium/src/base/thread.cc:150
14:31:59 INFO - #7 0x7f421cd9becc in ThreadFunc(void*) /builds/slave/try-l64-asan-00000000000000000/build/src/ipc/chromium/src/base/platform_thread_posix.cc:36
14:31:59 INFO - -----------------------------------------------------

The profiler_unregister_thread isn't working for these Chromium IPC threads for some reason.

Is the profiler disabled on beta but enabled on aurora? That could be the cause of this, somehow.
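For context, the stack above shows the PseudoStack being allocated when base::Thread::ThreadMain registers the thread with the profiler. The expected pairing is roughly the sketch below; the register/unregister names come from the trace, but the surrounding structure and stub bodies are illustrative, not the real ipc/chromium thread.cc code:

// Minimal sketch of the register/unregister pairing implied by the stack.
// Stubs stand in for the real profiler entry points.
#include <cstdio>

static void profiler_register_thread_stub(const char* aName, void* aStackTop) {
  std::printf("register %s (stack top %p): PseudoStack allocated\n", aName, aStackTop);
}

static void profiler_unregister_thread_stub() {
  std::printf("unregister: PseudoStack released\n");
}

static void ThreadMainSketch(const char* aName) {
  profiler_register_thread_stub(aName, &aName);
  // ... run the thread's message loop ...
  // If the thread is never shut down cleanly, execution never reaches the
  // line below, and the PseudoStack allocated at registration time is what
  // LSan ends up reporting.
  profiler_unregister_thread_stub();
}

int main() { ThreadMainSketch("ImageBridgeChild"); }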
Summary: Profiler leaks → Profiler leaks in e10s RELEASE_BUILD
This affects non-e10s as well.
[Tracking Requested - why for this release]: Merge day orange.
Tracking to make sure it doesn't happen on merge day
ASAN tests are going to get very angry on mozilla-beta next week when Gecko 45 gets uplifted there. Can we please find some time to investigate this soon?
Flags: needinfo?(bgirard)
Summary: Profiler leaks in e10s RELEASE_BUILD → Profiler leaks in RELEASE_BUILD
Given the timing, I wonder if this is a regression from bug 1193838 and/or bug 1103094.
By my count, this'll break 44% of the ASAN jobs we run on Beta in less than a week. Does anybody have cycles to look into this?
Severity: normal → critical
Flags: needinfo?(ttromey)
Flags: needinfo?(mconley)
(In reply to Andrew McCreight [:mccr8] from comment #0)
> Is the profiler disabled on beta but enabled on aurora? That could be the
> cause of this, somehow.

Yes, it is my understanding that Nightly (and maybe Aurora) have the --enable-profiling configuration flag set, and that beta / release do not.

BenWa, any ideas what's going on here?
Flags: needinfo?(mconley)
The theory about --enable-profiling is looking a bit weaker - RyanVM pushed to try to Nightly with mozconfig set up without --enable-profiling, and it came out green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ee50e5daa3fc
FWIW, I found the place that enables frame pointers for ASan builds: http://mxr.mozilla.org/mozilla-central/source/build/autoconf/frameptr.m4#37
We've got a theory that somehow removing the "a2" from the version number triggers this. Trying on a Nightly ASAN build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ef77863eda8
We have a winner. It's the update channel doing it. https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0e80124717f&group_state=expanded Of the areas of code that look at the update channel (and there are surprisingly few outside of the test suite), this particular ditty is looking particularly suspicious to me: http://mxr.mozilla.org/mozilla-aurora/source/xpcom/threads/BackgroundHangMonitor.cpp#552
Summary: Profiler leaks in RELEASE_BUILD → Profiler leaks when the update channel is set to beta
Doing things when the update channel is set to beta is particularly sketchy given that... we don't. I suppose it must be set that way for releases, but it isn't for on-push builds, and we run tests on on-push builds, but not on releases. I presume that has to change somehow for release promotion to work, but at the moment if you do something for update channel == beta, you do something completely untested.
Which might explain why this hasn't been caught before (the code block in question isn't new - it was added by rvitillo in bug 1133521 back in the Fx38 days) if there haven't been any Aurora-as-Beta uplift simulation runs in a while. Another possibility is that BHR has been seeing a lot of work for the e10s m8 cycle, so maybe it's one of those changes tickling this code in an unexpected way.
Some more Try runs (all on top of Aurora tip w/ the Aurora-as-Beta simulation patch applied):

1.) Remove the release channel override. Given the earlier results, we would expect this to be green.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ccf8a58a7621

2.) Make the BHR block in BackgroundHangMonitor.cpp always run on RELEASE_BUILD. IIUC, this should also be green as it was only being skipped when the channel is beta. Pushed on top of the patch from #1 as well.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=71b9e258bf9d

3.) Make the BHR block in BackgroundHangMonitor.cpp never run on RELEASE_BUILD. IIUC, this should also be orange if it's indeed responsible for causing the leaks. Pushed on top of the patch from #1 as well.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=19a3dd9c302a

One other important comment - if #1 is indeed green, the immediate severity of this bug is reduced, because our mozilla-beta CI builds have an update channel of "default" - so this shouldn't go orange on the next uplift in that case. Yay! However, this is a (scary) example of us not shipping what we're testing. Also, the long-term plan with release promotion is that these builds will eventually be set to "beta" for CI as well, so we can expect this problem to follow at that time if not resolved before then.
Only #2 is orange, so that confirms that running that block of code is what's causing the leaks. #1 is green, so that confirms that we should be OK on merge day at least. Downgrading the severity based on that.
Severity: critical → major
Keywords: mlk
Hey rvitillo, this block got added way back in Fx 39 in bug 1133521, but something appears to have tickled it such that it throws leak errors in our ASAN builds (see comment 0 for a leak trace). Any idea what we might be seeing here?
Blocks: 1133521
Flags: needinfo?(rvitillo)
I took a look at the thread register stuff. It looks fine. Removing ni? since the bug is with BHR. Let me know if we think the profiler API is being used properly but still leaking.
Flags: needinfo?(bgirard)
For whoever ends up investigating this, https://treeherder.mozilla.org/#/jobs?repo=try&revision=0b1cab01cffd on top of m-c confirms that this can be trivially reproduced by unconditionally running that block.
Flags: needinfo?(ttromey)
One recent change was making BHR run on 100% of Beta instead of 50% in bug 1241507. I don't know if that could be responsible.
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #16)
> Hey rvitillo, this block got added way back in Fx 39 in bug 1133521, but
> something appears to have tickled it such that it throws leak errors in our
> ASAN builds (see comment 0 for a leak trace).
>
> Any idea what we might be seeing here?

I don't see how that block could be connected to the failure of comment 0.
Flags: needinfo?(rvitillo)
I've pushed a sloppily instrumented build to see if I can see where the decrementing / deletion of the PseudoStack is occurring (or isn't occurring): https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7a1b1d45765
Flags: needinfo?(mconley)
So I'm not 100% sure from reading the log (http://archive.mozilla.org/pub/firefox/try-builds/mconley@mozilla.com-39629f1be86f49054882889c89ab996923adaff2/try-linux64-asan/try_ubuntu64-asan_vm_test-mochitest-e10s-browser-chrome-1-bm53-tests1-linux64-build38.txt.gz), but it looks like there are some PseudoStack deallocations that occur _after_ LeakSanitizer complains about leaks. So are those PseudoStack deallocations happening too late in the process shutdown, or is LeakSanitizer running too early? mccr8, have you ever seen anything like this?
Flags: needinfo?(mconley) → needinfo?(continuation)
Blocks: LSan
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #23)
> but it looks like there are some PseudoStack deallocations that occur
> _after_ LeakSanitizer complains about leaks.

Nice idea, but it is hard to say exactly what is happening. There's another "Suppressions used" block (which is generated by LSan) after the second set of PseudoStack traffic. It could be the case that the second set is stuff happening in the parent process, while the leak is in the child process. You'd have to log the pid in there to be sure. If you are going to do another try run, I'd also log the pointer value, because then you could hack up some crude refcount logging which might be useful.

> So are those PseudoStack deallocations happening too late in the process
> shutdown, or is LeakSanitizer running too early? mccr8, have you ever seen
> anything like this?

LeakSanitizer runs very late. I haven't seen anything like this. It is odd that this leak shows up in LSan runs but not regular debug runs. Though now that I'm looking at it, PseudoStack does not have MOZ_COUNT_CTOR and MOZ_COUNT_DTOR, and uses its own refcounting, so it might be leaking in debug builds, too.
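A crude standalone version of that kind of ctor/dtor logging (pid plus pointer) could look like the sketch below. The class name and the "@@@@" marker format are placeholders for illustration, not the real PseudoStack code; pairing NEW/DELETE lines by pid and address is what later lets a script spot objects that are constructed but never destroyed:

// Standalone sketch of pid + pointer lifetime logging (hypothetical names).
#include <cstdio>
#include <unistd.h>

struct PseudoStackLoggingSketch {
  PseudoStackLoggingSketch() {
    std::fprintf(stderr, "@@@@ NEW %d %p\n", (int)getpid(), static_cast<void*>(this));
  }
  ~PseudoStackLoggingSketch() {
    std::fprintf(stderr, "@@@@ DELETE %d %p\n", (int)getpid(), static_cast<void*>(this));
  }
};

int main() {
  PseudoStackLoggingSketch balanced;   // logs a NEW and a matching DELETE
  new PseudoStackLoggingSketch();      // deliberately leaked: logs NEW only
}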
Flags: needinfo?(continuation)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #23)
> but it looks like there are some PseudoStack deallocations that occur
> _after_ LeakSanitizer complains about leaks.
>
> So are those PseudoStack deallocations happening too late in the process
> shutdown, or is LeakSanitizer running too early? mccr8, have you ever seen
> anything like this?

It sounds similar to bug 1067547. That bug is also the reason that PseudoStack doesn't have MOZ_COUNT_CTOR.
(In reply to Tom Tromey :tromey from comment #25)
> It sounds similar to bug 1067547.
> That bug is also the reason that PseudoStack doesn't have MOZ_COUNT_CTOR.

Oh, right, I forgot about that! That does sound similar, but I don't think that can be an issue here because ASan should intercept all mallocs and frees, so there shouldn't be any issue with the profiler running before ASan.
My understanding is that RelEng intends to attempt to enable Release Promotion on Beta at some point during this cycle. If that's the case, this is going to permafail as soon as that change gets made. Can we please get someone to look into this soon?
Severity: major → critical
Flags: needinfo?(ttromey)
I'm afraid I never got too far into trying to investigate this. :( It seems as if BHR is not shutting down properly, or at the very least, initializing it causes us to leak profiler PseudoStacks on some IPC threads. But I have _no idea_ why.
I tried a short experiment which I think indicates there is a real leak here. I'd appreciate it if someone else took a look.

I pushed this try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ad4c9c39013

This has two patches, one to remove the "beta" check per comment #11, and one to add logging to the PseudoStack constructor and destructor (with pid, per comment #24).

I downloaded the resulting raw log file and ran it through this perl script:

%m = ();
while (<>) {
    chomp;
    next unless m/@@@@ (NEW|DELETE) (.*)$/;
    if ($1 eq "NEW") {
        die "double new " . $2 if defined $m{$2};
        $m{$2} = 1;
    } else {
        die "bad delete " . $2 unless defined $m{$2};
        delete $m{$2};
    }
}
foreach $key (keys %m) {
    print "leak " . $key . "\n";
}

This prints a number of leaks:

pokyo. zcat try_ubuntu64-asan_vm_test-mochitest-e10s-browser-chrome-1-bm54-tests1-linux64-build82.txt.gz | perl q.pl
leak 4527 0x62b00005b200
leak 5268 0x62b00002a200
[... 40 total ...]

Spot-checking the first one:

pokyo. zcat try_ubuntu64-asan_vm_test-mochitest-e10s-browser-chrome-1-bm54-tests1-linux64-build82.txt.gz | grep '4527 0x62b00005b200'
15:51:57 INFO - @@@@@@@@@@@@@@@@ NEW 4527 0x62b00005b200
Flags: needinfo?(ttromey)
(In reply to Tom Tromey :tromey from comment #29)
> I tried a short experiment which I think indicates there is a real leak here.
> I'd appreciate it if someone else took a look.

I've only got the faintest grasp of perl, but I think I understand what that script is doing (looking for NEW (pid) (address) and adding them to a hashmap, and removing them on DELETE, then dumping the leftovers).

So this looks compelling to me - I agree that there's likely some kind of leak here.
A few more comments -

First, we're not going to see this on ordinary debug builds, even with the MOZ_COUNT_CTOR, because BHR is disabled there (and many other places):
https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/moz.build#76

Second, I don't really understand the beta-disabling logic here:
https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/BackgroundHangMonitor.cpp#538

My current theory is that this code is trying to disable the monitor because either telemetryEnabled or clientID is false, and that this disabling code is buggy somehow. However my understanding of the intent of the code is that it only disables monitoring for some beta builds. That is, the changes introduced in bug 1133521 have an effect beyond what one would expect from the value of BHR_BETA_MOD. But, I don't know for sure, maybe this is intended as well.
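To make the concern about channel-gated code concrete, the general shape being discussed looks something like the sketch below. This is a paraphrase with made-up names, not the actual BackgroundHangMonitor.cpp logic; the point is the testing gap described in comment 11, where the "beta" branch never runs on the on-push CI builds:

// Illustrative sketch only; real names and conditions differ.
#include <cstdio>
#include <string>

static void DisableMonitoringSketch() {
  std::printf("background hang monitoring disabled\n");
}

static void MaybeDisableForBetaSketch(const std::string& aUpdateChannel,
                                      bool aTelemetryEnabled,
                                      bool aHaveClientId) {
  if (aUpdateChannel == "beta" && (!aTelemetryEnabled || !aHaveClientId)) {
    // Effectively untested: on-push CI builds use the "default" channel,
    // so this branch only runs on real beta / release-promotion builds.
    DisableMonitoringSketch();
  }
}

int main() {
  MaybeDisableForBetaSketch("default", true, true);  // what CI exercises
  MaybeDisableForBetaSketch("beta", false, false);   // what ships
}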
I added some more logging and found that the leaked PseudoStack was the one registered by ImageBridgeChild in the child process. Debugging here, I see that ImageBridgeChild::ShutDown is never called in the child process. This function seems to be the only spot where sImageBridgeChildThread is deleted ("seems" since I'm having trouble finding callers of GetThread to check those). Deleting the sImageBridgeChildThread also seems to be the only way to cause the thread to exit cleanly. I haven't yet checked to see what happens here without the problem BHR code.
If I back out my hacks to BackgroundHangMonitor and just enable my logging code, I still see a PseudoStack leak when running locally.
That makes sense, I guess. I think the actual ImageBridgeChild doesn't show up as a leak in LSan because a global variable refers to it, but presumably it does not contain a reference to the PseudoStack.
I'm not sure what a good way is to ignore this leak.
Depends on: 1215265
(In reply to Andrew McCreight [:mccr8] from comment #34)
> That makes sense, I guess. I think the actual ImageBridgeChild doesn't show
> up as a leak in LSan because a global variable refers to it, but presumably
> it does not contain a reference to the PseudoStack.

The PseudoStack is referred to via TLS though.
https://dxr.mozilla.org/mozilla-central/source/tools/profiler/core/platform.cpp#70
And lsan claims to check TLS
https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizerDesignDocument
So I'm not completely sure what is going on here. Perhaps lsan is wrong here, or maybe I've made some error in my analysis.

I tried hacking in a call to layers::ImageBridgeChild::ShutDown in the content process case of gfxPlatform::ShutdownLayersIPC. This worked for my simple case but glancing at bug 1215265 makes it seem like the reality is more complicated.

Maybe one workable hack would be to have ImageBridgeChild acquire a reference to the PseudoStack? It sounds like this would fool lsan.
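A rough sketch of what that hack could look like is below. The accessor and member names are placeholders rather than the actual patch (attachment 8717534); the only idea being illustrated is that storing the pointer on an object LSan can reach makes the PseudoStack reachable too:

// Illustrative sketch, not the landed patch: hold a non-owning pointer to the
// ImageBridge thread's PseudoStack on an object reachable from a global, so
// LSan no longer reports the PseudoStack as unreachable.
static void* GetCurrentThreadPseudoStackSketch() {
  // Stand-in for whatever profiler accessor the real code uses.
  return nullptr;
}

class ImageBridgeChildSketch {
public:
  void OnThreadStarted() {
    mThreadPseudoStack = GetCurrentThreadPseudoStackSketch();
  }
private:
  // Non-owning; never dereferenced or freed. Exists only so the PseudoStack
  // is reachable from a live object at leak-check time.
  void* mThreadPseudoStack = nullptr;
};

static ImageBridgeChildSketch* sImageBridgeSketch = nullptr;

int main() {
  sImageBridgeSketch = new ImageBridgeChildSketch();  // kept alive via global
  sImageBridgeSketch->OnThreadStarted();
}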
(In reply to Tom Tromey :tromey from comment #36)
> Maybe one workable hack would be to have ImageBridgeChild acquire a
> reference to the PseudoStack? It sounds like this would fool lsan.

That sounds like it is worth a try.
This works in my preliminary test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=19cee316ee5a I didn't run the real test, though. For that I assume I need the patch from the original try push, but I was wondering if I needed anything else as well.
Sorry, mixed myself up a bit. You don't even need to fiddle with the version number. Just adding |ac_add_options --enable-update-channel=beta| to build/mozconfig.common.override is all you should need to do to trigger it.
I rebased on mozilla-central, made the mozconfig change, and pushed this try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12bee4299f3d
Comment on attachment 8717534 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

This suppresses the leak report as far as we can tell. If this is ok I will update bug 1215265 to note that, when that patch goes in, this hack should be removed as well.
Attachment #8717534 - Flags: review?(continuation)
Comment on attachment 8717534 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

Review of attachment 8717534 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for fixing this! The patch should be reviewed by a layers graphics person.
Attachment #8717534 - Flags: review?(continuation) → review?(sotaro.ikeda.g)
Component: Gecko Profiler → Graphics: Layers
Summary: Profiler leaks when the update channel is set to beta → LSan detects ImageBridgeChild thread Profiler leaks when the update channel is set to beta
Comment on attachment 8717534 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

Looks good. Thanks!
Attachment #8717534 - Flags: review?(sotaro.ikeda.g) → review+
Comment on attachment 8717534 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

This doesn't compile on Windows.
https://treeherder.mozilla.org/logviewer.html#?job_id=16534958&repo=try
Flags: needinfo?(ttromey)
Attachment #8717534 - Flags: feedback-
Comment 45 was on top of Aurora, in case that ends up mattering.
Assignee: nobody → ttromey
Patch looks great for fixing the ASAN leaks, though!
Unified build bites again perhaps. In any case, I neglected to notice the "using base::Thread"; the simplest fix is to simply move the definition of ImageBridgeThread a bit lower in the file. Rebuilding here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bfc26368789a Will carry over the r+.
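For background on the "unified build bites again" remark: unified builds concatenate several .cpp files into one translation unit, so a using declaration from one file can silently satisfy name lookup in the next. A minimal self-contained illustration with hypothetical names (not the real ImageBridgeChild.cpp):

// Everything below pretends to be one unified translation unit built from two
// source files. In a non-unified (or differently chunked) build, the second
// "file" fails to compile because the `using` it relies on lives in the first
// one -- roughly the failure mode hit on the Windows push.
namespace base { class Thread {}; }

// ---- contents of an earlier file in the unified chunk ----
using base::Thread;

// ---- contents of a later file in the same chunk ----
// Resolves only via the `using` above; writing base::Thread explicitly, or
// moving the definition below a local `using base::Thread;`, removes the
// dependency on unification order.
static Thread* sIllustrativeImageBridgeThread = nullptr;

int main() { return sIllustrativeImageBridgeThread == nullptr ? 0 : 1; }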
Attachment #8717534 - Attachment is obsolete: true
Flags: needinfo?(ttromey)
Attachment #8717908 - Flags: review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Comment on attachment 8717908 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

Approval Request Comment
[Feature/regressing bug #]: Unknown.
[User impact if declined]: LSan try builds will fail due to an apparent PseudoStack leak.
[Describe test coverage new/current, TreeHerder]: LSan tests hit this constantly.
[Risks and why]: Low risk. The patch simply acquires a (non-owning and otherwise unused) reference to the thread's PseudoStack, to fool LSan into thinking that the object is not leaked.
[String/UUID change made/needed]: N/A.
Attachment #8717908 - Flags: approval-mozilla-aurora?
Comment on attachment 8717908 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

Approval Request Comment
[Feature/regressing bug #]: Unknown.
[User impact if declined]: LSan try builds will fail due to an apparent PseudoStack leak.
[Describe test coverage new/current, TreeHerder]: LSan tests hit this constantly.
[Risks and why]: Low risk. The patch simply acquires a (non-owning and otherwise unused) reference to the thread's PseudoStack, to fool LSan into thinking that the object is not leaked.
[String/UUID change made/needed]: N/A.
Attachment #8717908 - Flags: approval-mozilla-beta?
Comment on attachment 8717908 [details] [diff] [review]
avoid ASan leak report by holding PseudoStack in ImageBridgeChild

Taking it to improve the builds. Should be in 45 beta 6.
Attachment #8717908 - Flags: approval-mozilla-beta?
Attachment #8717908 - Flags: approval-mozilla-beta+
Attachment #8717908 - Flags: approval-mozilla-aurora?
Attachment #8717908 - Flags: approval-mozilla-aurora+
