Implement PHC on Linux
Categories
(Core :: Memory Allocator, task, P2)
Tracking
()
People
(Reporter: decoder, Assigned: n.nethercote)
References
(Depends on 1 open bug)
Details
Attachments
(3 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/plain
|
tdsmith
:
data-review+
|
Details |
(deleted),
text/x-phabricator-request
|
Details |
This bug is about porting the gwp-asan allocator parts to jemalloc (for a better overall description of gwp-asan, see the parent bug).
For now I'll assume that the client-side crash reporter integration also happens as part of that patch. If that isn't the case, please let me know and I'll file a separate bug to track that task. Thanks!
Assignee | ||
Comment 1•6 years ago
|
||
This is a work in progress. Plenty of work left to do.
- Lots of "njn:" comments that identify things that need fixing.
- There are link errors on Android.
- It produces lots of logging output to stderr at the moment.
- There are frequent crashes in FramePointerStackWalk(). (I wonder if this is
related to the logging output.) - Crashes in Windows content processes aren't correctly handled, because I
don't understand the IPC done in that case. - I don't know how to test it. Probably crash tests of some kind?
- No performance measurements yet.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a270ba88e894c0d7865c4487aa7781ea2e942234
is a representative try run. At least it's working reasonably well on Linux64
and Mac.
Anyway, I think it would be useful to get some preliminary feedback. glandium,
can you look at the replace-malloc parts; gsvelto, can you look at the crash
reporter parts. Thanks!
Assignee | ||
Comment 2•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Comment 3•6 years ago
|
||
A quick note that new and changed data collections are subject to Data Collection Review before they land.
Comment 4•5 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:njn, could you have a look please?
Reporter | ||
Comment 6•5 years ago
|
||
Do you just need more time to work on this or is this blocked on someone/something else? I'll be happy to help unblock if I can.
I wonder if it would appease the bots if you mark the review as "plan changes".
Assignee | ||
Comment 8•5 years ago
|
||
It's mostly a matter of time and attention. There is one technical challenge which is that on Windows I don't understand the way data is transferred from a content process to the parent process when a crash occurs. On Mac and Linux it was obvious and straightforward and I could make the necessary changes quite easily, but the Windows design is a real head-scratcher.
Comment 9•5 years ago
|
||
We discussed this offline but I forgot to add the dependency here: this also needs bug 1035892 to be fixed to work correctly on Mac. Unfortunately I'm stalled on that one due to an obscure race-condition triggered by the modernized exception handler.
Reporter | ||
Comment 10•5 years ago
|
||
Windows would be the highest priority for now, Mac and Linux are only P2 per project description and their support is not a requirement for this to land.
Jed, can you help :njn with comment 8 or can you forward the needinfo to someone else who can? Thanks!
Comment 11•5 years ago
|
||
I know almost nothing about the IPC situation in Breakpad, but I have general Breakpad-on-Windows experience. If that ends up making me the person who'd have to do the least ramp-up, I can give it a try.
Comment 12•5 years ago
|
||
I know how the crash reporter works on Linux, but that knowledge doesn't seem to be in short supply here. For Windows I could figure it out if I had to, but it would be better to start with someone who already has more of the background knowledge — so, probably gsvelto or dmajor, who are both already on this bug.
(Assuming this is about Breakpad proper; the mechanisms for sending crash annotations are more in IPC territory but I'd have to spend some time looking at the code.)
Assignee | ||
Comment 13•5 years ago
|
||
A while back I privately (via email) asked both gsvelto and dmajor about the Windows breakpad code. It appears nobody has much idea how that code works, unfortunately.
Here is a slightly longer explanation of the weirdness:
On a content process crash, I need to pass some extra data from the content process to the parent process. On Linux and Mac this is fairly straightforward and I have it working, but the way things work on Windows is really bizarre and I don't understand it. There are these two closely related types, ProtocolMessage and ClientInfo:
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/breakpad-client/windows/common/ipc_protocol.h#140-161
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/breakpad-client/windows/crash_generation/client_info.h#113-149
The latter has various comments saying "WARNING: Do not dereference these pointers as they are pointers in the address space of another process." I don't understand the point of these, and how they are supposed to work, and how data is passed from the child to the parent in general when a crash occurs.
Comment 14•5 years ago
|
||
Is the patch on phab the latest version of your changes? Maybe if I had in front of me the Linux source plus a live Windows build with GWP but no Breakpad integration, I might be able to fill in the gap.
Updated•5 years ago
|
Assignee | ||
Comment 15•5 years ago
|
||
dmajor: I just pushed the latest code. It has a bunch of fprintf statements in the Windows breakpad code that I was using to understand it. They might be helpful, because another part of the weirdness is that things happen in an unexpected order: the IPC struct gets set up at start-up and doesn't seem to get filled in at crash time.
Comment 16•5 years ago
|
||
Can you walk me through how to trigger the relevant behavior, what is supposed to happen, and what actually happens? In the console all I see is a flood of "GWP[n,n]" so I don't really know what to be looking for.
Assignee | ||
Comment 17•5 years ago
|
||
dmajor: as written, the patch does the checking and prints a bunch of stuff about it, but you won't see any unusual behaviour.
To trigger a GWP-relevant crash at startup, change the "#if 0" in xpcom/base/nsMemoryReporter.com to "#if 1". The output will show "IS GWP ALLOC" and the subsequent output is relevant to the IPC stuff.
Comment 18•5 years ago
|
||
njn: I'm having a lot of trouble wrapping my head around the patch. Partly it's because it's a pretty large size of WIP to keep in mental working set when I don't have the automatic familiarity of having written it. Also, I feel uneasy around the fixme comments and general WIP-ness, not knowing whether what I'm looking at will be the real thing.
To reduce the mental load I'd like to suggest splitting up the patch. In my view there are three logical pieces to this: (1) The allocator itself, with no regard for crash reporting. (In theory, this could stand on its own: Socorro would still hear about the unmapped accesses, it's just that we wouldn't know they came from GWP -- do I understand this right?) (2) crash reporting for Linux, and (3) crash reporting for Windows.
Even if Windows reporting is "the" goal from a project management perspective, I don't think that means the other bits can't be completed first, especially if doing so makes the Windows part easier. If you could polish and land parts 1 and 2, that would make the remaining portion that I need to think about more tractable on my end. What do you think?
Assignee | ||
Comment 19•5 years ago
|
||
I don't think there's much point landing any of this without having the Windows crash handling hooked up.
In terms of understanding things, it's basically just the changes under toolkit/crashreporter/breakpad-client/windows/ (and the code within that directory in general) that's causing me problems. All the rest of it can be thought of as producing a special value on some crashes, and I need to pass that special value from the crashing content process to the parent process. And the existing code for passing values from the crashing content process to the parent process is really weird.
If it's too hard, that's fine, I'll get back to it soon.
Comment 20•5 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #19)
I don't think there's much point landing any of this without having the
Windows crash handling hooked up.
It's not like we'd land it without Windows and then walk away. The intent was to enable better collaboration. It's so much easier for me to look at a small piece and know the rest is already tree-quality and landed, than to maintain a large diff that may change in final form.
If you'd rather just work on it, that's fine by me too. But frankly, if I were the reviewer, I would ask for logical pieces to be in separate patches even if you had the whole thing already finished.
Assignee | ||
Comment 22•5 years ago
|
||
I'm actively working on it. Plan is to ship on Linux64 first, to get something shipping, and then work through the Windows crash reporter weirdness immediately afterwards. (I realize that Windows is a much more desirable target due to the user population.) I'm still aiming for this to happen in Q2.
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 23•5 years ago
|
||
glandium, gsvelto: the latest version has everything in place, so please do a full review of your parts.
glandium: I'm seeing one problem both locally and on try pushes -- gtests are crashing part way through. I caught one crash while debugging and it looks like a PHC allocation is being given to mozjemalloc to deallocate, which leads to a MOZ_DIAGNOSTIC_ASSERT failure within mozjemalloc. I'm not sure if it's a problem with my code or some pre-existing flaw with the replace-malloc machinery. Oddly enough, the problem isn't manifesting anywhere else and other test suites such as mochitests are working fine.
Assignee | ||
Comment 24•5 years ago
|
||
The gtest failures are caused by the profiler's use of the jemalloc_replace_dynamic
function. That function has multiple issues and is a big problem for PHC.
- It's racy: multiple functions could call
jemalloc_replace_dynamic()
concurrently, which meansreplace_init_func()
could be called multiple times concurrently, which meansmemory_hooks.cpp:gMallocTable
could be overwritten concurrently. - It assumes we're swapping between the
gReplaceMallocTableDefault
and something else. If the starting table isn't the default one (e.g. it's the PHC table) then we can't switch back to the PHC one. Fixing this is hard; if we try to change the code to use the current table we get more races. - The init/uninit calls coming from the profiler aren't balanced; there can be multiple uninit calls for a single (or even zero) init calls.
I'm struggling to come up with a different design that is safe and sufficiently capable for both the profiler's current use case and PHC. Swapping out your allocator mid-execution in a thread-safe manner without locks is hard. I'm not sure how to move forward.
jesup: how important is the MOZ_PROFILER_MEMORY feature?
Assignee | ||
Comment 25•5 years ago
|
||
The MOZ_PROFILER_MEMORY stuff was added in bug 1480430.
Assignee | ||
Comment 26•5 years ago
|
||
It assumes we're swapping between the gReplaceMallocTableDefault and something else. If the starting table isn't the default one (e.g. it's the PHC table) then we can't switch back to the PHC one.
Putting this another way: at first glance the API makes it look like you can switch to a different replace-malloc and then switch back. But that's not true -- as soon as you call jemalloc_replace_dynamic()
you've lost whatever replace-malloc functionality you might have in place, and you can only switch back to the default (i.e. no replacement). Depending on what the original replace-malloc functionality did, this could stop the accumulation of data (e.g. with DMD) or could cause crashes (with PHC).
IMO this is a fundamentally broken design and isn't acceptable. We need to come up with a way to fix it properly, or failing that, we should remove it.
Comment 27•5 years ago
|
||
MOZ_PROFILER_MEMORY is a very important feature; it's already been used extensively in performance work (and caught a multi-GB JS memory regression immediately on being enabled). (Someone on the profiler team accidentally made it not-default recently; you need to turn on Memory in options now - this should be fixed sometime soon.)
The current design was constrained by a requirement that the ABI for replacement libs not change. If we relax that requirement, several other things are options.
Agreed, low-overhead switching it tough. Earlier patches changed the ABI, but didn't have the race condition. However, some of your other criticisms might still be valid. And to an extent, I'm not sure we need to be able to run both profiler-with-memory and DMD/etc at the same time; I'm ok with having such limitations (though best enforced in the code in some manner).
Assignee | ||
Comment 28•5 years ago
|
||
I'm not sure we need to be able to run both profiler-with-memory and DMD/etc at the same time
The plan for PHC is to initially run it for all Nightly users, all the time. We may then possibly extend it to all Beta users. So if those users ever want to use the profiler -- and I am sure that's a hard requirement -- then the profiler can't disable the current replace-malloc functionality when it wants to start counting allocations. So PHC is currently totally blocked by jemalloc_replace_dynamic
's flaws.
Assignee | ||
Comment 29•5 years ago
|
||
I've managed to fix the jemalloc_replace_dynamic
problems enough for PHC to run (see bug 1557907).
Assignee | ||
Comment 30•5 years ago
|
||
Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto
chutten: This patch adds five new annotations to crash reports. The vast majority of the time they won't be used. Once in a blue moon, PHC will detect a bad memory access and trigger a crash, in which case these annotations will be added. The annotations indicate the allocation's base address, one or two stack traces (allocation and free locations), the allocation's size, and what kind of allocation (NeverAllocated, Freed, or one of a couple of other possibilities).
This all seems pretty unobjectionable to me, and similar to other stuff we already have in crash reports. The only notable thing is that these annotations, when present, are likely to point directly at potential security vulnerabilities. So we should mark them as security-sensitive, like we already do with some other fields. But I think that happens on the Socorro side, is that right?
Comment 31•5 years ago
|
||
Nicholas, how are you going to retrieve and use that data? If the new fields are not placed in crash pings you will get a lot less of them. We usually get 20x to 30x more crash pings than crash reports so limiting yourself to crashreports will lead to a lot less samples. That being said if the annotations are security sensitive I believe that crash-pings are a no-no because IIRC the dataset is public. Chris certainly knows more though.
Assignee | ||
Comment 32•5 years ago
|
||
My current idea is that crash-stats would show those PHC fields to any user who has permissions to see security-sensitive fields.
Comment 33•5 years ago
|
||
Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto
Load balancing to :tdsmith
Comment 34•5 years ago
|
||
Comment on attachment 9053825 [details]
Bug 1523276 - Implement PHC, a probabilistic heap checker. r=glandium,gsvelto
Thanks for the collection review request. Would you mind filling out the data review request form, attaching it separately to this bug, and raising a data-review? flag on that attachment?
The form is here: https://raw.githubusercontent.com/mozilla/data-review/master/review.md
The process is described here: https://wiki.mozilla.org/Firefox/Data_Collection#Step_1:_Submit_Request
Also, just to confirm wrt crash pings vs reports, crash pings aren't public per se but they are accessible to all employees. Of course, crash report access is metered out through crash-stats.
Assignee | ||
Comment 35•5 years ago
|
||
Comment 36•5 years ago
|
||
If crash ping data isn't public then it would be worth considering it in addition to crash stats. As I mentioned in the comment above they tend to deliver over an order of magnitude more crashes than what we get on crash-stats.
Comment 37•5 years ago
|
||
Assignee | ||
Comment 38•5 years ago
|
||
Thanks, tdsmith! You did a great job of interpolating my responses. I only see one thing worth correcting:
- What populations will you measure?
All users.
Initially it will only be Nightly users on Linux. Soon afterward we will expand to Nightly users on Windows, and possibly Mac.
If things go well, we might expand to Beta users. Beyond that, it's conceivable we'd expand to Release users eventually, but it's far from certain.
Assignee | ||
Comment 39•5 years ago
|
||
Depends on D25021
Comment 40•5 years ago
|
||
Assignee | ||
Comment 41•5 years ago
|
||
Bug 1567065 is the follow-up bug to enable PHC on Linux64 Nightly builds.
Comment 42•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/3c6a53b1fb24
https://hg.mozilla.org/mozilla-central/rev/161dc1283019
Updated•5 years ago
|
Assignee | ||
Comment 43•5 years ago
|
||
(Retrospectively renaming this bug to account for the change in the tool's name, and to account for the fact that this bug eventually covered just the Linux implementation.)
Description
•