Closed
Bug 897769
Opened 11 years ago
Closed 11 years ago
Test/benchmark PERF_SAMPLE_STACK_USER on B2G
Categories
(Firefox OS Graveyard :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jld, Assigned: jld)
References
Details
(Keywords: perf, Whiteboard: [c=profiling p=3 s=2013.08.09])
Attachments
(1 file: deleted, text/x-csrc)
Part of the story for using ARM exception handling tables instead of the current frame pointer hacks is being able to get userspace stacks for perf_event profiling. There were Linux kernel changes in August 2012 to allow copying part of the sampled process's stack into the perf buffer as part of the sample so that a userland agent could perform table-driven stack unwinding instead of trying to embed that much complexity in the kernel.
So we'd need to backport that change to the older kernels we're using for B2G, and get an idea of how well it performs. Performance is, I think, the main item of missing information here.
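For orientation, a minimal sketch of how a userland agent might request stack copies through `perf_event_attr`. The helper name `init_stack_sample_attr`, the choice of the software CPU clock as the sampling event, and the extra `PERF_SAMPLE_IP`/`PERF_SAMPLE_TID` fields are assumptions for illustration, not taken from this bug. The register mask is architecture-specific (see `asm/perf_regs.h`) and has to be supplied by the caller; an unwinder would typically want at least the stack pointer, program counter, and link register.

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill in an attr that samples at freq_hz and asks the
 * kernel to copy up to stack_bytes of user stack into each sample. */
void init_stack_sample_attr(struct perf_event_attr *attr, uint64_t freq_hz,
                            uint32_t stack_bytes, uint64_t regs_mask)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof *attr);
    attr->size = sizeof *attr;
    attr->type = PERF_TYPE_SOFTWARE;        /* assumption: timer-based sampling */
    attr->config = PERF_COUNT_SW_CPU_CLOCK;
    attr->freq = 1;                         /* interpret sample_freq as Hz */
    attr->sample_freq = freq_hz;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                        PERF_SAMPLE_REGS_USER | PERF_SAMPLE_STACK_USER;
    attr->sample_regs_user = regs_mask;     /* arch-specific register bitmask */
    attr->sample_stack_user = stack_bytes;  /* maximum bytes copied per sample */
}
```

The filled-in struct would then be passed to the `perf_event_open` system call (invoked through `syscall()`, since glibc provides no wrapper). As far as I know, the kernel expects the stack size to be a multiple of 8 and below 64 KiB, so 512 and 32768 would both be acceptable values.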
Comment 1 • 11 years ago (Assignee)
Here's a small benchmark program running on my Keon, with no profiling active:
0m2.98s real 0m2.98s user 0m0.00s system
With perf running globally at 1 kHz, not copying the stack:
0m3.01s real 0m3.00s user 0m0.00s system
Copying 512 bytes of stack per sample:
0m3.01s real 0m3.00s user 0m0.00s system
Copying 32 KiB of stack (same as what the breakpad unwinder in Gecko does):
0m3.30s real 0m3.14s user 0m0.00s system
Copying up to 32 KiB of stack (and allocating that much buffer space per sample), but using only 1184 bytes[1]:
0m3.21s real 0m3.05s user 0m0.00s system
The "real" time includes fwrite()ing the full records to /dev/null, which appears to be slower than actually unwinding them will be[2]. The "user" time difference is the actual cost of the interrupt handler. The empty space in the last case shouldn't cost anything directly (it's not zeroed or otherwise written), but it presumably has cache and/or TLB effects, and increases profiler wakeups.
For something to measure this against, here's an example of the current in-kernel frame pointer unwinding: 1 kHz, -mapcs-frame, 102 stack frames:
0m3.05s real 0m3.04s user 0m0.00s system
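As a rough reading of these figures, the extra "user" time can be turned into an approximate per-sample cost. This is only a sketch: it assumes about one sample per millisecond of wall-clock time (1 kHz on a single busy core), uses the 3.00 s run with perf active but no stack copy as the baseline, and the function name is made up for illustration. The timings have 10 ms resolution, so the results are coarse.

```c
#include <assert.h>

/* Approximate extra interrupt-handler cost per sample, in microseconds.
 * Assumes samples ~= real_s * freq_hz, which is an estimate, not a count. */
double extra_cost_us_per_sample(double user_s, double baseline_user_s,
                                double real_s, double freq_hz)
{
    assert(real_s > 0.0 && freq_hz > 0.0);
    double samples = real_s * freq_hz;
    return (user_s - baseline_user_s) / samples * 1e6;
}
```

Under those assumptions, `extra_cost_us_per_sample(3.14, 3.00, 3.30, 1000)` comes to roughly 42 µs for the full 32 KiB copy, `extra_cost_us_per_sample(3.05, 3.00, 3.21, 1000)` to roughly 16 µs for the 1184-byte case, and `extra_cost_us_per_sample(3.04, 3.00, 3.05, 1000)` to roughly 13 µs for the in-kernel frame pointer walk.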
[1] The stack dump proceeds until the specified size limit is reached or an access fails, so if we're on the main stack then the area with the process's arguments and initial environment will be copied.
[2] My work in progress on bug 810526 has been getting 50-60 µs/sample on somewhat deeply nested stacks, of which a large minority was the Gecko profiler infrastructure. Additionally, there remains room for optimization, and it should be faster when it's not handling the "pop under bitmask" instructions needed for the frame pointers used for meta-profiling.
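To make the "allocated versus actually used" distinction from footnote [1] concrete, here is a sketch of reading the stack portion of a sample. It relies on the layout documented for `PERF_SAMPLE_STACK_USER` in upstream Linux: a `u64 size`, then `size` bytes of data, then a `u64 dyn_size` that is present only when `size` is nonzero. The struct and function names are hypothetical, and the backported kernel is assumed to use the same layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct stack_dump {
    const unsigned char *data; /* start of the copied stack bytes */
    uint64_t size;             /* space reserved in the record */
    uint64_t dyn_size;         /* bytes the kernel actually copied */
};

/* Parse the stack portion of a sample starting at p, with avail bytes left.
 * Returns the number of bytes consumed, or 0 if the record looks truncated
 * or inconsistent. */
size_t parse_stack_dump(const unsigned char *p, size_t avail,
                        struct stack_dump *out)
{
    uint64_t size, dyn_size = 0;
    size_t used = sizeof size;

    assert(p != NULL && out != NULL);
    if (avail < sizeof size)
        return 0;
    memcpy(&size, p, sizeof size);
    if (size != 0) {
        if (size > avail - used || avail - used - size < sizeof dyn_size)
            return 0;
        memcpy(&dyn_size, p + used + (size_t)size, sizeof dyn_size);
        if (dyn_size > size)
            return 0;
        used += (size_t)size + sizeof dyn_size;
    }
    out->data = p + sizeof size;
    out->size = size;
    out->dyn_size = dyn_size;
    return used;
}
```

In the "up to 32 KiB, using only 1184 bytes" run above, a parser along these lines would presumably see `size` equal to 32768 and `dyn_size` equal to 1184, and an unwinder would only need to look at the first `dyn_size` bytes.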
Comment 2 • 11 years ago (Assignee)
The kernel source: https://github.com/jld/gp-keon-kernel/compare/perf-stackcopy-gp
Comment 3 • 11 years ago (Assignee)
My small "benchmark" program. Creates a bunch of frames and runs a timing loop.
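The attachment itself is no longer available. The following is not the original program, only a hypothetical sketch of the same idea: recurse to build a chosen number of stack frames, then spin in a timing loop at the bottom so that every sample sees a deep stack. All names are invented for illustration, and `__attribute__((noinline))` assumes GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Volatile sink keeps the compiler from deleting the loop body. */
volatile uint64_t sink;

/* Busy loop that burns CPU time at the bottom of the stack. */
uint64_t timing_loop(uint64_t iterations)
{
    uint64_t acc = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        acc = acc * 31 + i;
        sink = acc;
    }
    return acc;
}

/* Recurse depth times, then run the timing loop. The volatile local is read
 * after the recursive call, which prevents tail-call elimination and so
 * keeps each frame live on the stack. */
__attribute__((noinline))
uint64_t run_nested(unsigned depth, uint64_t iterations)
{
    assert(depth <= 10000); /* stay well within a default-sized stack */
    volatile unsigned marker = depth;
    uint64_t result = (depth == 0) ? timing_loop(iterations)
                                   : run_nested(depth - 1, iterations);
    return result + (marker - depth); /* adds zero; forces the late read */
}
```

A driver would call something like `run_nested(100, n)` with `n` chosen so the loop runs for a few seconds, and would be run under `time` with and without the profiler active.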
Comment 4 • 11 years ago (Assignee)
I'm going to say that the answer is “yes, fast enough”. 1 kHz should be enough for most uses.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED