Closed Bug 1143598 Opened 10 years ago Closed 10 years ago

High Resolution Profiling

Categories

(DevTools :: Performance Tools (Profiler/Timeline), defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1147945

People

(Reporter: mbx, Unassigned)

References

Details

(Whiteboard: [leave open])

Attachments

(1 file)

The Gecko profiler maxes out at a 1 ms sample interval. JS applications like Shumway and j2me.js try very hard to optimize startup time, which is typically under 1 s. Having only ~1000 samples to work with is not very useful. Can we lower the sample interval? Even if it slows the application down considerably, a profile at that resolution would still be useful.
We could expose customProfilerOptions[0] as a preference for this, but I'm not sure how small an interval nsIProfiler can handle: https://github.com/mozilla/gecko-dev/blob/3bfd6947732ef41b734fa2be5b3e975a430575f5/browser/devtools/performance/modules/front.js#L409-L414
On my machine (Mac), the limit seems to be around 0.02ms.
I believe djvj is looking at increasing the frequency proper. I'm looking at decreasing the amount of work the sampler has to do per sample and the amount of space needed in the circular buffer for a single sample.
So can we make this a pref? Doing so would be easy, but I'd want to know what the minimum allowable value should be, and whether that depends on the other work ("increasing the frequency proper", and the circular buffer improvements).
Flags: needinfo?(kvijayan)
Did some preliminary investigation into instrumenting the profiler to time the cost of sample runs. There are two issues to deal with here:

1. Measure the cost of triggering an 'interrupt' on the victim thread (the thread being sampled). On Linux, this is the raw cost (to the victim) of receiving and handling a signal. On MacOS, this is the raw cost (to the victim) of being suspended; same on Windows.
2. Measure the average cost of performing a sample itself. This is easy to instrument within the sampler code.

The former (1) is hard to measure, because simply using TimeStamp::Now() from within the browser measures only the wall-clock time between an interrupt being requested (i.e. a tgkill on Linux, a thread_suspend on MacOS, etc.) and being handled in the victim thread. It doesn't delineate the runtime cost to the victim thread specifically, which is what we care about.

So to measure (1), I'm writing a small test program on each platform, using the different mechanisms. The test program works basically like this:

a. A victim thread and a sampler thread are started.
b. The victim thread takes the time (A), counts from 0 to 2 billion, then takes the time (B).
c. The sampler thread interrupts the victim thread a bunch of times.

In the control case (1) we instruct the sampler thread not to interrupt at all, and in the test case (2) we instruct it to interrupt the victim thread at some fixed rate, collecting times D1 = B1 - A1, D2 = B2 - A2, and S (the number of times the victim was sampled in the test case). We take the average over many runs of D1, D2, and S, and calculate (D2 - D1) / S to get the average cost of sampling.
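The control/test structure described above can be sketched in a few lines. This is an illustrative Python version, not the native test program: `sys._current_frames()` stands in for the native "interrupt and capture a stack" step, and the iteration count and interval are arbitrary (note that in CPython the sampler and victim also contend for the GIL, so absolute numbers will be pessimistic).

```python
# Sketch of the (D2 - D1) / S measurement: run the victim once without
# sampling (D1), once while a sampler thread repeatedly captures its stack
# (D2, S samples), and attribute the difference to sampling cost.
import sys
import threading
import time

def victim(iters, out):
    a = time.perf_counter()              # time (A)
    n = 0
    for _ in range(iters):               # count from 0 to N
        n += 1
    out.append(time.perf_counter() - a)  # D = B - A

def run(iters, sample_interval=None):
    """Run the victim once; if sample_interval is set, sample it concurrently."""
    out, samples = [], [0]
    t = threading.Thread(target=victim, args=(iters, out))
    t.start()
    if sample_interval is not None:
        while t.is_alive():
            frames = sys._current_frames()   # "interrupt" + capture all stacks
            if t.ident in frames:
                samples[0] += 1
            time.sleep(sample_interval)
    t.join()
    return out[0], samples[0]

if __name__ == "__main__":
    ITERS = 2_000_000
    d1, _ = run(ITERS)                          # control: no sampling
    d2, s = run(ITERS, sample_interval=0.0002)  # test: sample every ~200us
    if s:
        print(f"samples: {s}, avg cost per sample: {(d2 - d1) / s * 1e9:.0f} ns")
```

A single run is noisy, which is why the comment above averages D1, D2, and S over many runs before dividing.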
Just finished this on Linux. Did one run of 50 iterations, then two runs of 100. In summary, we're looking at about 4-5 microseconds of cost to the victim thread per interrupt. Averaged results follow:

LINUX:

Sampling cost in linux (run 1 - 50 samples):
  Runtime without sampling (ns): 1102521393.0
  Runtime with sampling (ns):    1140570975.58
  Difference in runtime (ns):    38049582.58
  Average number of samples:     7461.94
  Cost per sample (ns):          5099.15

Sampling cost in linux (run 2 - 100 samples):
  Runtime without sampling (ns): 1100818930.26
  Runtime with sampling (ns):    1133066985.73
  Difference in runtime (ns):    32248055.47
  Average number of samples:     7414.35
  Cost per sample (ns):          4349.41

Sampling cost in linux (run 3 - 100 samples):
  Runtime without sampling (ns): 1101129955.11
  Runtime with sampling (ns):    1133793193.81
  Difference in runtime (ns):    32663238.70
  Average number of samples:     7419.42
  Cost per sample (ns):          4402.40

CPU: 4-core Intel(R) Xeon(R) CPU E31225 @ 3.10GHz workstation (6144 KB cache).
So the situation on Mac is a bit worse. I wrote up the same test, but using Mach's thread_suspend primitive, and looked at performance. For a 100-iteration run, we got averages of:

  Unsampled:               9.274s
  Sampled:                 10.880s
  Average sample count:    7753.11
  Average cost per sample: 20 microseconds

CPU: 2.8GHz Intel Core i7 (MacBook Pro)

Ouch! I have some speculation for why thread_suspend()/thread_resume() costs so much on OSX. Consider what's happening: thread_suspend() tells the kernel to pull the thread off the execution queue and put it on the suspend queue, then wake up the sampler thread. Later on, thread_resume() tells the kernel to put the victim thread back on the schedulable queue. That means that in addition to whatever cost there is to suspend/resume thread state for the victim, the victim thread also loses the rest of its timeslice when it gets suspended, and can only resume execution at the next scheduler tick.

Using signals should perform better because a signal inherently conveys the full lifetime of the suspension activity to the kernel. With thread_suspend(), the kernel doesn't know WHEN thread_resume() is going to be called next, so it HAS to put the thread on the suspend queue until thread_resume() is called. With a signal, the kernel knows that if the thread is running, it can remain running after the signal handler has executed. In fact, outside of interrupting the thread and setting up the signal execution context, the kernel doesn't need to touch the thread's execution behaviour at all; returning from a signal handler to normal execution can be handled entirely in userland.

To test this theory, I tried using pthread_kill() to send signals to threads to interrupt them instead of using thread_suspend()/thread_resume(). With this, the cost per interrupt is only 6 microseconds on average.
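The signal-based alternative described above — the victim is interrupted, a handler runs, and the thread simply resumes without a suspend/resume round trip through the scheduler — can be illustrated with POSIX interval timers. This is a Unix-only Python sketch of the mechanism, not Gecko's implementation; SIGPROF fires on consumed CPU time and the handler records a "sample" while the interrupted code keeps running.

```python
# Minimal illustration of signal-driven sampling (Unix-only): SIGPROF is
# delivered to the busy thread, the handler records a sample, and execution
# resumes in userland -- the thread never leaves the run queue.
import signal

samples = []

def on_sigprof(signum, frame):
    # A real sampler would walk `frame` here; we just record the function name.
    samples.append(frame.f_code.co_name)

def busy_work():
    total = 0
    for i in range(3_000_000):
        total += i
    return total

signal.signal(signal.SIGPROF, on_sigprof)
# Fire every 10 ms of CPU time consumed by the process.
signal.setitimer(signal.ITIMER_PROF, 0.01, 0.01)
busy_work()
signal.setitimer(signal.ITIMER_PROF, 0, 0)   # disarm
print(f"took {len(samples)} samples")
```

Note that CPython only delivers signals on the main thread and between bytecodes, so this demonstrates the delivery model rather than its true cost; the native equivalent is pthread_kill() plus a SIGPROF handler on the victim thread.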
I talked to jmuizelaar about the pthread_kill vs thread_suspend issue, and he said there may be a crashiness issue with using pthread_kill on MacOS, pointing to bug 721025. That bug (still open) discusses some heavy crashiness on OSX back when the profiler used a pthread_kill implementation. The crashes mysteriously went away after the new MacOS backend landed, so people stopped talking about it. It may be that the issue wasn't pthread_kill-related at all and just coincidentally went away during that time. We should look into this.

With the current state of things, we can't really increase sampling frequencies on OSX. An overhead of 20us per sample implies that, for a 5% sampling overhead budget, we can interrupt at most once every 400us, which caps the sample rate at about 2.5kHz on OSX. And this is on a top-of-the-line MacBook Pro bought a few months ago. We'll need to look into shifting the MacOS implementation to use pthread_kill if we want to bump our sampling frequency there.
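The budget arithmetic above is just (per-sample cost) / (overhead fraction), written out here as a small check (the function name is ours, not from the profiler code):

```python
def max_sample_rate(cost_per_sample_s, overhead_budget):
    """Given a fixed per-sample overhead and the fraction of victim CPU time
    we are willing to spend on sampling, return (min interval, max rate)."""
    min_interval = cost_per_sample_s / overhead_budget
    return min_interval, 1.0 / min_interval

# OSX thread_suspend/resume: 20us per sample at a 5% overhead budget.
interval, rate = max_sample_rate(20e-6, 0.05)   # ~400us between samples, ~2.5kHz

# Signal-based delivery (6us per sample) at the same budget.
interval2, rate2 = max_sample_rate(6e-6, 0.05)  # ~120us between samples, ~8.3kHz
```

So moving OSX to pthread_kill would roughly triple the achievable rate within the same 5% budget.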
Ok, new report. I've measured two metrics across 3 platforms: overhead-of-sampling and overhead-of-tracing. The overhead-of-sampling strictly measures the cost of interrupting and then resuming a thread for the purpose of obtaining a trace of it at a given point in time. The overhead-of-tracing strictly measures the cost of obtaining a trace once the sampled thread is interrupted. Splitting these up gives us an idea of how much benefit we can expect from improvements to the part of this process we can control: the stack walking.

General results:

  PLATFORM                                            SamplingOverhead  TracingOverhead
  Linux workstation using signals                     6us               20us
  OSX MacBook Pro using thread suspend/resume         20us              20us
  OSX MacBook Pro using pthread_kill                  6us               20us
  Windows 8 RAZR laptop using thread suspend/resume   20us              20us

In general, across all platforms, we're taking about 20us to walk the jit stack, symbolicate, fill the profile buffer, and copy any pseudostack entries into the profile buffer. Also in general, we're taking about 6us for thread-signalling based sampling methods, and 20us for thread suspend/resume based ones.

On OSX, we may be able to use pthread_kill to move back to a thread-signalling based sampling method and away from the slower thread suspend/resume. On Windows, it's not clear that a good alternative method exists, so we may be stuck with the cost.

The major conclusion is that there _is_ value in reducing the tracing cost. If we can symbolicate our JS stacks lazily, we stand to improve sampling times significantly.
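The lazy-symbolication idea amounts to recording only cheap opaque identifiers on the hot sampling path and resolving them to names later, when the profile is actually read. A sketch of that split, with entirely hypothetical addresses and symbol names:

```python
# Sketch of lazy symbolication: at sample time store only cheap identifiers
# (here, integer "addresses"); resolve them to names once, off the hot path,
# when building the report.
from collections import Counter

# Hypothetical symbol table; the real one would come from the JIT/runtime.
SYMBOL_TABLE = {0x1000: "js::RunScript", 0x2000: "mandelbrot", 0x3000: "draw"}

raw_samples = []   # filled on the hot path -- appends only, no lookups

def take_sample(stack_addrs):
    raw_samples.append(tuple(stack_addrs))   # O(depth), no symbolication

def symbolicate(profile):
    # Deferred to report-generation time, where cost doesn't perturb samples.
    return Counter(
        " > ".join(SYMBOL_TABLE.get(a, hex(a)) for a in stack)
        for stack in profile
    )

take_sample([0x1000, 0x2000])
take_sample([0x1000, 0x2000])
take_sample([0x1000, 0x3000])
report = symbolicate(raw_samples)
print(report.most_common(1))   # [('js::RunScript > mandelbrot', 2)]
```

The trade-off (noted in the next comment) is that whatever backs the identifiers — JSScripts and their data — must stay alive until symbolication happens.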
(In reply to Kannan Vijayan [:djvj] from comment #8)
> If we can symbolicate our JS stacks lazily, we stand to improve sampling
> times significantly.

This would also give us more bang for our buck as far as buffer size goes. Sounds like a win/win! The only downside is keeping JSScripts (and their associated data) alive longer, which I believe :shu is working on mitigating (to a degree) in bug 1148202.
Whiteboard: [leave open]
Attached patch add-profiling-freq-pref.patch (deleted) — Splinter Review
Add a pref for changing the profiling frequency (and buffer size) from about:config.
Attachment #8586312 - Flags: review?(jsantell)
A lot of this has changed in the last few days -- we're moving defaults to actors/profiler.js, and the front (in browser/devtools/performance) is now being used rather than the globals in browser/devtools/shared/profiler/globals. I also have a patch for the buffer pref (bug 1147945); I can just add that there if it's easier?
I tried lowering the sampling interval in Instruments on OSX, and it looks like the lowest it can go is 40us. I tried that sampling interval and looked at the time between each pair of timestamps for a profile lasting several seconds. The median was roughly 44us.
A 40us interval is a sampling frequency of 25kHz, which is pretty nice. The thing with Instruments is that it has special abilities due to kernel integration. The kernel can interrupt threads a LOT more easily than userland can. I'm not sure exactly how they implement it, but if it were me, I'd take advantage of the following abilities:

1. If the thread is suspended (i.e. waiting for its next slice of time, or otherwise io-waiting), the kernel implicitly knows what the thread is up to and doesn't need to "do anything" to sample it. Our userspace profiler can't determine these states easily.

2. If the thread is active, suspending it is still pretty fast - the kernel can install a special interrupt handler for profiling interrupts that doesn't spill the entire register state. There's no need to save float regs and other state when profiling. Depending on how they're doing it (interrupts, or pause-and-observe), the kernel can completely bypass the scheduler.

Life would be nice if we had these abilities :( But we don't.
(In reply to Jordan Santell [:jsantell] [@jsantell] from comment #11)
> A lot of this has changed in the last few days -- we're moving defaults to
> actors/profiler.js, but the front (in browser/devtools/performance) is now
> being used rather than the globals in
> (browser/devtools/shared/profiler/globals), and I have a patch for the
> buffer pref (bug 1147945); I can just add that there if its easier?

Sure, that works too! If you do, mark this a dup of that.
Depends on: 1150252
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE
Attachment #8586312 - Flags: review?(jsantell)
Product: Firefox → DevTools