Closed Bug 503424 Opened 15 years ago Closed 15 years ago

TM/nanojit: add built-in support for compiled-trace and -guard profiling

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta1-fixed

People

(Reporter: n.nethercote, Assigned: jseward)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 12 obsolete files)

(deleted), patch
graydon
: review+
Details | Diff | Splinter Review
If we want to improve nanojit's codegen, it makes sense to focus on the areas where it will make a difference. So it would be useful to know which are the hot traces. I don't think we have that facility at the moment. This bug is a proposal to add a facility whereby each compiled trace has a counter that is incremented each time it is executed. Than at termination it would print out a list (and the code) of the hottest blocks (top 10, top 100, whatever), indicating what proportion of trace executions they account for. Then we can look at improving the codegen for those hot blocks. Valgrind has a facility like this and it has proven very useful.
Me want. It would also be nice to record at least the total time taken (and thus the average) by each trace.
+1 for the block counting. The timing stuff (I presume you mean rdtsc adding-up) could be useful too; but it could confuse the issue if we get tied up in delays due to cache misses, which are independent of the code generation quality, mostly.
Looks like the right thing to inspect.
Some preliminary design notes: - We need to add a counter for every trace. - We need a way of incrementing the counter each time a trace is executed. This code can be added (conditionally, of course) in Assembler::gen() at the end of the loop. It's better to increment at the start of a trace rather than the end so that the trace is still counted even if there's an early exit. - We need a way of iterating through all the traces at the end, to find the hot ones. - We need a way of printing out LIR and asm for the hot fragments at the end. So these cannot be disposed of along the way. Or perhaps the asm can be disposed of -- if the LIR is kept we could just re-assemble it in order to print out the native code. Because the LIR and asm code pages tend to get destroyed as we go (I think; I don't really understand the code memory handling very well) it seems best to store all the necessary stuff in a separate table. For each trace, it might contain: - the counter - a copy of the LIR - a copy of the asm (unless reassembling at the end is easy) - some info indicating where in the original JavaScript it came from, if possible Making copies of all that code will be slow, but this doesn't have to be fast, and then we wouldn't have to worry about holding onto LIR/asm code that we would normally dispose of (which would uglify our normal code paths).
(In reply to comment #4) > Some preliminary design notes: > > - We need to add a counter for every trace. > > - We need a way of incrementing the counter each time a trace is executed. > This code can be added (conditionally, of course) in Assembler::gen() at the > end of the loop. It's better to increment at the start of a trace rather than > the end so that the trace is still counted even if there's an early exit. I had assumed the generation of the counter would be in jstracer.cpp, but I think your proposal is better as it will then be available to all nanojit users. > - We need a way of iterating through all the traces at the end, to find the hot > ones. > > - We need a way of printing out LIR and asm for the hot fragments at the end. > So these cannot be disposed of along the way. Or perhaps the asm can be > disposed of -- if the LIR is kept we could just re-assemble it in order to > print out the native code. I often don't need to look at the LIR/asm--JS source location info is often enough. What I have sometimes wanted and never had is description of the trace in terms of the JS source lines it goes through. (I see that you already mention that below, too. I believe it is fairly easy because the tracer knows what JS source position it is at when it records each bytecode.) But I'm sure the LIR/asm will be useful sometimes as well. > Because the LIR and asm code pages tend to get destroyed as we go (I think; I > don't really understand the code memory handling very well) it seems best to > store all the necessary stuff in a separate table. For each trace, it might > contain: > > - the counter > - a copy of the LIR > - a copy of the asm (unless reassembling at the end is easy) > - some info indicating where in the original JavaScript it came from, if > possible > > Making copies of all that code will be slow, but this doesn't have to be fast, > and then we wouldn't have to worry about holding onto LIR/asm code that we > would normally dispose of (which would uglify our normal code paths). I think the copying is entirely reasonable. If it does have an excessive perf cost, which I doubt it will, it can be optional.
(In reply to comment #5) > I often don't need to look at the LIR/asm--JS source location info is often > enough. I assume you're mostly looking at tracing things -- for those working on NJ code generation, the LIR/asm is very important :)
Blocks: 504587
(In reply to comment #4) Keeping a copy of the LIR and asm around sounds big and complex. Maybe it's not necessary: all we need to store is an integer key which uniquely identifies the fragment. When profiling, print the code etc out in the normal TMFLAGS=blah way, along with the key. When the profiled code is running, inc counters in the usual way. When profiled run finished, just print the (key,counter) pairs. We can then look in the logfile to find the blocks. Not so convenient, but easier to implement.
(In reply to comment #7) Am working on a patch to do this. (commenting here so as to check i'm not duplicating work w/ anyone else -- yell if I am).
(In reply to comment #8) > > Am working on a patch to do this. (commenting here so as to check > i'm not duplicating work w/ anyone else -- yell if I am). That's what the "assigned" field is for :)
Assignee: general → jseward
Attached patch patch which does profiling, as outlined above (obsolete) (deleted) — Splinter Review
(In reply to comment #10) (description): Patch adds a simple profiler, which counts, for each fragment, every time the fragment passes the loopEntry point (gal told me this is the right place to put the counter increment; I have no idea). At the end of the run, the top 50 blocks are shown, one line for each block, including the percentage of total block runs, and the cumulative statistics too. This makes it very easy to see which the hot blocks are. This of course only works for debug builds. It is integrated with and controlled by the existing TMFLAGS= mechanism. To get profiling, you must do TMFLAGS=fragprofile. Each printed cost line has a "FragID=XXXXXX" part, where XXXXXX is a 6-digit number. This makes it possible to find the fragment in the normal debug printing, because those FragID numbers are printed there too. Hence the minimal useful invokation is TMFLAGS=fragprofile,minimal ... Code from both tracer and regexps is profiled. Only x86 is supported in this patch, although other archs should be trivial. It's only a matter of implementing Assembler::asm_inc_profctr to generate the load-inc-store insns. The implementation is mostly trivial. class Fragment gets two new fields, profHash (to identify the block) and profCounts, to hold the count. The only real difficulty is to extract those fields at the time the Fragment is destroyed. This is difficult because making Fragment::~Fragment virtual doesn't work, since there are some Fragments which are really VMFragments, but others within nanojit which are vanilla Fragments, and hence whose destruction is not observable outside nanojit. My least-worst solution (in the patch) is to have, in Fragmento, an optional finalizer callback used when destroying a Fragment. The profile machinery in jstracer.cpp uses this to harvest counts any time any Fragment is destroyed. Example output for trace-tests.js follows. ----------------- Per-fragment execution counts ------------------ Total count = 153260112 0: (32.63% 50004998) 32.63% 50004998 FragID=000893 1: (39.15% 60004996) 6.52% 9999998 FragID=005744 2: (45.68% 70004994) 6.52% 9999998 FragID=005746 3: (50.25% 77012463) 4.57% 7007469 FragID=007873 4: (54.77% 83933739) 4.52% 6921276 FragID=007779 5: (59.28% 90855015) 4.52% 6921276 FragID=007795 6: (63.80% 97776291) 4.52% 6921276 FragID=007836 7: (68.31% 104697567) 4.52% 6921276 FragID=007847 8: (72.83% 111618843) 4.52% 6921276 FragID=007859 9: (74.14% 113631756) 1.31% 2012913 FragID=007747 10: (75.46% 115644552) 1.31% 2012796 FragID=007814 11: (76.77% 117657286) 1.31% 2012734 FragID=007816 12: (78.08% 119660304) 1.31% 2003018 FragID=007737 13: (79.38% 121663210) 1.31% 2002906 FragID=007808 14: (80.61% 123536928) 1.22% 1873718 FragID=007727 15: (81.83% 125410583) 1.22% 1873655 FragID=007804 16: (82.95% 127124279) 1.12% 1713696 FragID=007731 17: (84.06% 128837912) 1.12% 1713633 FragID=007806 18: (85.15% 130493991) 1.08% 1656079 FragID=007821 19: (86.23% 132150062) 1.08% 1656071 FragID=007823 20: (87.31% 133805849) 1.08% 1655787 FragID=007757 21: (88.39% 135461605) 1.08% 1655756 FragID=007759 22: (89.33% 136901602) 0.94% 1439997 FragID=007724 23: (90.27% 138341599) 0.94% 1439997 FragID=007768 24: (91.21% 139781596) 0.94% 1439997 FragID=007784 25: (92.15% 141221593) 0.94% 1439997 FragID=007801 26: (93.08% 142661590) 0.94% 1439997 FragID=007828 27: (94.02% 144101587) 0.94% 1439997 FragID=007839 28: (94.35% 144601585) 0.33% 499998 FragID=001431 29: (94.59% 144961583) 0.23% 359998 FragID=007771 30: (94.82% 145321581) 0.23% 359998 FragID=007787 31: (95.06% 145681579) 0.23% 359998 FragID=007831 32: (95.29% 146041577) 0.23% 359998 FragID=007842 33: (95.52% 146401575) 0.23% 359998 FragID=007850 34: (95.76% 146761573) 0.23% 359998 FragID=007864 35: (95.99% 147121570) 0.23% 359997 FragID=007830 36: (96.23% 147481567) 0.23% 359997 FragID=007841 37: (96.44% 147805522) 0.21% 323955 FragID=007775 38: (96.65% 148129477) 0.21% 323955 FragID=007791 39: (96.86% 148453432) 0.21% 323955 FragID=007833 40: (97.08% 148777387) 0.21% 323955 FragID=007844 41: (97.29% 149101342) 0.21% 323955 FragID=007855 42: (97.50% 149425297) 0.21% 323955 FragID=007869 43: (97.71% 149749250) 0.21% 323953 FragID=007835 44: (97.92% 150073203) 0.21% 323953 FragID=007846 45: (98.13% 150397147) 0.21% 323944 FragID=007778 46: (98.34% 150721091) 0.21% 323944 FragID=007794 47: (98.55% 151045035) 0.21% 323944 FragID=007858 48: (98.77% 151368939) 0.21% 323904 FragID=007872 49: (98.88% 151540897) 0.11% 171958 FragID=007803
Attachment #389800 - Flags: review?(gal)
Nice! This will be extremely useful. It gives equal weight to every fragment... that's probably good enough to start with. Some nits about the output: > 0: (32.63% 50004998) 32.63% 50004998 FragID=000893 > 1: (39.15% 60004996) 6.52% 9999998 FragID=005744 > 2: (45.68% 70004994) 6.52% 9999998 FragID=005746 > 3: (50.25% 77012463) 4.57% 7007469 FragID=007873 I find it easier to read with the timing columns swapped, ie. self-time before cumulative time. Putting a header line giving each column a title would be good. (Then you wouldn't need "FragID=" on every line.) For the FragID, is the number in exactly the same form as that printed for the "minimal" output? I see stuff like this: Recording starting from trace-test.js:299@1610 which doesn't look quite the same as a 6-digit FragId. If it's in exactly the same form, it makes it much easier for Vim users to bounce between the table and the actual fragment (in Vim you can use '#' and '*' to do forward/backward searches for the exact word that the cursor is currently on). Also, you put "gal" for review rather than "gal@uci.edu"... not sure if that'll work?
Yeah, gal@uci.edu is my bugzilla account for historic reasons. :gal is the short-hand. Random style nit: func(arg, arg), not func( arg, arg ). Also not assert ( this ). Could we do the ranking with a script, and just output raw data? Or maybe write it to a file? The stats gathering and ranking code looks a bit out of place in jstracer.cpp (IMO). I don't particularly like the entanglement of tracer logging/verbose bits and nanojit. Can we disentangle this a bit? I wonder if we can just use regular lir to model the counters. It would minimally affect register allocation, but probably not noticably. If we are really worried about that we could add LIR_XINC or something like that. The finalizer hook is also a bit ugly. I wonder whether its necessary. In general, I think this functionality should be in nanojit. That would solve the entanglement problem and instead of the finalizer the code can be ifd-defed straight into the destructor.
Major update to patch. This not only profiles fragments but also exits, making it possible to see exactly where we're leaving trace. Also, for each fragment, it now shows the static exit count and the static instruction byte count, so as to give a bit of a handle on generated code bloat. Is WIP. Do not review. Works well enough to run v8-richards, but asserts (for debug builds) on trace-tests. Further debugging needed.
Attachment #389800 - Attachment is obsolete: true
Attachment #389800 - Flags: review?(gal)
Attached patch cleaned up & debugged version of previous patch (obsolete) (deleted) — Splinter Review
Cleaned up version of the patch in comment #14. Suitable for general use/testing. Survives trace-test.js. Changes: * don't assert when Fragments are recycled; handle correctly * also profile exits created by jsregexp (untested) * fix space leaks in result-printing code * minor readability changes in output printing
Attachment #392628 - Attachment is obsolete: true
Here's an example of what the output now looks like, for perfect.js. Recommended usage: TMFLAGS=fragprofile,minimal,recorder First there's a flat profile with one line per frag, containing start counts for the frag, plus cumulative info, number of static exits and static instructions for the frag, and the FragID so you can find it easily in the recorder (or any other phase) output. Then follows the per-exit count. They're self explanatory. Each exit is labelled with a GuardID, which is printed as part of LIR, so again the precise exits are findable in the debug output. This is a pretty boring example. v8-richards or v8-crypto produce more interesting results. ----------------- Per-fragment execution counts ------------------ Total count = 2686 0: (72.23% 1940) 72.23% 1940 (se 10, sib 405) FragID=000005 1: (90.80% 2439) 18.58% 499 (se 8, sib 322) FragID=000001 2: (100.00% 2686) 9.20% 247 (se 10, sib 405) FragID=000003 3: (100.00% 2686) 0.00% 0 (se 0, sib 0) FragID=000002 4: (100.00% 2686) 0.00% 0 (se 0, sib 0) FragID=000004 Total static instruction bytes (sib) = 1132, total static exits (se) = 28 Analysis by exit counts FragID=000005, total count 1940: GuardID=012 248 (12.78%) Looped (013) 1692 (87.22%) FragID=000001, total count 499: GuardID=009 1 ( 0.20%) Looped (010) 498 (99.80%) FragID=000003, total count 247: GuardID=012 1 ( 0.40%) Looped (013) 246 (99.60%) FragID=000002, total count 0: Looped (000) 0 ( 0.00%) FragID=000004, total count 0: Looped (008) 0 ( 0.00%)
Summary: TM/nanojit: add counters that count how many times each compiled trace is executed → TM/nanojit: add built-in support for compiled-trace and -guard profiling
>----------------- Per-fragment execution counts ------------------ > >Total count = 2686 > > 0: (72.23% 1940) 72.23% 1940 (se 10, sib 405) FragID=000005 > 1: (90.80% 2439) 18.58% 499 (se 8, sib 322) FragID=000001 May I suggest removing the text labels from the data rows and using column headers instead (the spacing is still bad in this version): Exec-Count Cum-Exec-Count Static-Exits Bytes Frag-ID (72.23% 1940) 72.23% 1940 10 405 000005 (90.80% 2439) 18.58% 499 8 322 000001 IMO we currently have a problem with hard-to-read, hard-to-machine-parse debugging output. Your example is better than most of what we have but I want it better yet. :-) I would also submit that the exit count stats are not in fact self-explanatory for someone who doesn't already kind of know what you are doing here.
All previous versions of patch under certain circumstances fail to harvest profile info from Fragments that are delete'd, giving nonsense profile data for some programs (not all). Found whilst chasing small testcases related to v8-richards. This version fixes it. Also incorporates output formatting suggestions per dmandelin in comment #16.
Attachment #392693 - Attachment is obsolete: true
Attached patch rebase previous patch to 31101:bb6086ec04b8 (obsolete) (deleted) — Splinter Review
Rebased to TM tip of about four hours back. Expect further breakage in days to come, given that Fragmento is scheduled for demolition.
Attachment #392922 - Attachment is obsolete: true
Attached patch rebase previous patch to 31574:12a9bea2d331 (obsolete) (deleted) — Splinter Review
Attachment #394128 - Attachment is obsolete: true
.. and add an extra assertion checking consistency of counts.
Attachment #394847 - Attachment is obsolete: true
Attached patch rebased following deletion of Fragmento (obsolete) (deleted) — Splinter Review
Attachment #394943 - Attachment is obsolete: true
Depends on: 512744
Attached patch Rebase des Tages (obsolete) (deleted) — Splinter Review
Attachment #396627 - Attachment is obsolete: true
(In reply to comment #13) > Could we do the ranking with a script, and just output raw data? Or maybe write > it to a file? The stats gathering and ranking code looks a bit out of place in > jstracer.cpp (IMO). I don't think this case is really doing *so* much formatting (basically collect, sort, and print the top 50) that it warrants depending on an external processing script (that itself is likely to bitrot). If we were doing histograms and graphs, sure; but this is just a tabular listing. We already have plenty of stat-tabulating code over in js/src: gc stats, scope stats, hashtable stats, etc. Some of it even does histograms :) > I don't particularly like the entanglement of tracer logging/verbose bits and > nanojit. Can we disentangle this a bit? I've filed bug 513339 and bug 513341 to try to pull some of the logging noise out of both jstracer and the rest of nanojit. I agree it's getting a bit overgrown, but gathering up all the chatter and depositing it somewhere is a different task from getting this profiler working in the first place. > instead of the finalizer the code can be ifd-defed > straight into the destructor. Fragments have no destructors anymore; they're arena-allocated and released en-masse. I think it's OK to keep the fragment book-keeping outside as the fragment topology and lifecycle is really only something TM knows about. We're *deleting* that knowledge from NJ presently, as it was insufficient anyway. Comments on the patch: - Unless I'm misreading, I think the finalizer function *pointer* stored in the tracemonitor is vestigial from an earlier version of this patch; nobody appears to store into it or call through it, so it should be removed. All finalization calls occur directly to js_FragProfiling_FragFinalizer. - avmplus::List<> is going away, use nanojit::Seq<> - Likewise avmplus::SortedMap<>, use nanojit::HashMap<> - Seems you're incrementally bubble-sorting the final ranked array? We already use std::sort() and qsort() elsewhere in spidermonkey, I'm sure it's OK to here. - Style: this being C++, you can loop-scope a loop variable. for (Foo* bar = ...; ...; ...) { ... } rather than Foo *bar; for (...; ...; ...) { ... } - Style: while (true) rather than while (1) - Style: void foo() rather than void foo(void) - Style: as gal said, foo(Bar *b) rather than foo( Bar *bar ) - As of njn's logging work, we're directing output through to js_LogController.printf. Same meaning, but it centralizes control over that stream. - On line 4823, you're adding a commented-out call to the finalizer. You either want to un-comment this or remove it, I'm not sure which. - Considering that there's no hash function involved -- they're just sequential identifier-numbers -- rename profHash and guardID in GuardPI, and rename profHash in VMFragment to fragID or fragmentID. Generally try to avoid two structures with same-named fields. - The use of a static profHashCtr is no good, we're possibly running multithreaded. If you want you can use JS_ATOMIC_INCREMENT, or else store a counter in the (per-thread) trace monitor and copy it into the VMFragment when you construct it. Otherwise nice work, sorry the lifecycles of everything have been changing underfoot.
(In reply to comment #24) > > - As of njn's logging work, we're directing output through to > js_LogController.printf. I can't take credit for that, I think it was jseward :)
(In reply to comment #24) Graydon, thanks for the comments. Should all be no problem to fix, but one thing I had a query on: > - The use of a static profHashCtr is no good, we're possibly running > multithreaded. If you want you can use JS_ATOMIC_INCREMENT, or else store a > counter in the (per-thread) trace monitor and copy it into the VMFragment when > you construct it. I'd love to use JS_ATOMIC_INCREMENT, but that's going to bork for the TR embedding, no? Could we just copy JS_ATOMIC_INCREMENT into the NJ tree so as to have an independent copy? I'd be happy to do that, and tidy up.
Perhaps then you can have the Fragment constructor take a fragID in verbose mode (default to 0) and move the profHashCtr out to jstracer.cpp; then you can use JS_ATOMIC_INCREMENT in the VMFragment constructor call to the Fragment constructor.
Atomic operations are very slow. Make sure you know what you are doing. That might skew measurements significantly. Also, we _definitively_ don't want atomic profiling updates in product builds ever (I overheard we might want to run more profiling all the time).
Yeah. The other option -- a per-TraceMonitor counter -- makes more sense to me anyway. We always have a TraceMonitor handy when we're constructing a VMFragment. Just pass one in.
Attached patch patch v9 (rebase only) (obsolete) (deleted) — Splinter Review
Attachment #396789 - Attachment is obsolete: true
Attachment #397852 - Attachment is obsolete: true
(In reply to comment #31) > Created an attachment (id=399823) [details] > rebased; w/ correct code size measurements, & addresses review comments Rebased w/ much help from Graydon, following deletion of Fragmento and of distinct loop-top labels. Also incorporates Graydon's fix for correct code size measurement as taken from bug 512744.
Attached patch updated to address all remaining review issues (obsolete) (deleted) — Splinter Review
Attachment #399823 - Attachment is obsolete: true
Attachment #400188 - Flags: review?(graydon)
(In reply to comment #33) > updated to address all remaining review issues Regarding this, from comment #24 > - Seems you're incrementally bubble-sorting the final ranked array? > We already use std::sort() and qsort() elsewhere in spidermonkey, > I'm sure it's OK to here. This is sort-of true. But the final ranked array is only N_TOP_BLOCKS in size, regardless of how many Fragments are processed. This avoids constructing an array with potentially thousands of entries, one for each Fragment, sorting it, and throwing away all but the top 50 entries. Since the array is of fixed size, the cost for processing N Fragments is O(50 x N), which isn't too bad given that this is non-production code. So I've left it as-is.
Attachment #400188 - Attachment is obsolete: true
Attachment #400188 - Flags: review?(graydon)
Comment on attachment 400791 [details] [diff] [review] rebased; made thread-safe; conform to house style; minor tidying Thanks for your patience during the review cycle. I'll land from here.
Attachment #400791 - Flags: review+
"I don't particularly like the entanglement of tracer logging/verbose bits and nanojit. Can we disentangle this a bit? I wonder if we can just use regular lir to model the counters. It would minimally affect register allocation, but probably not noticably. If we are really worried about that we could add LIR_XINC or something like that." The comment above was not addressed as far as I can tell. The checkin breaks x86-64 and sparc (dvander and leon will fix those quickly I am sure). Review from adobe for that many NJ changes would have been nice too. This isn't an r- comment. Just saying that the patch could be nicer. We can probably evolve it further as we go along. It seems useful enough to land it now instead of constantly rebasing it.
(In reply to comment #37) > "I don't particularly like the entanglement of tracer > logging/verbose bits and nanojit. Can we disentangle this a bit? It's less entangled than it was previously. Now, Fragment and GuardRecord have counter fields, and nanojit knows to add code to increment them if requested. You have to tell NJ where the loop label of a trace is, and give it a unique identifier for each new Fragment, but that's about it. The problem of extracting and presenting the results is all on the TM side. > I wonder if we can just use regular lir to model the counters. That leads to an inconsistency, in which the fragment-entry counter inc can be expressed in LIR, but the exit inc, which is part of the exit block, can't be. At least, not without generating contorted LIR which reads the condition just before an guard, and if true, increments the relevant exit counter. And that would lead to a situation where the output of each LIR_ov would be used twice, which IIUC would break at least the x86 backend. > The checkin breaks x86-64 and sparc I don't have a sparc box to test on, and up till now x86-64 hasn't been runnable from the tree. > Review from adobe for that many NJ changes would have been nice too. Yes, my bad there.
Phew! http://hg.mozilla.org/tracemonkey/rev/ce354a3e3b47 for initial checkin, http://hg.mozilla.org/tracemonkey/rev/72d4907df250 to fix the windows build (mistakenly breaking *every* browser build, because browser builds have different ... header visibility?), and so finally http://hg.mozilla.org/tracemonkey/rev/70f5a2e93b94 A little more multi-platform testing would be good in the future.
Whiteboard: fixed-in-tracemonkey
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Flags: wanted1.9.2?
Flags: wanted1.9.2? → wanted1.9.2+
js/src/trace-test/tests/basic/arith.js js/src/trace-test/tests/basic/testRegExpTest.js js/src/trace-test/tests/basic/testShiftLeft.js js/src/trace-test/tests/basic/testSideExitInConstructor.js
Flags: in-testsuite+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: