Closed Bug 631951 Opened 14 years ago Closed 14 years ago

JM: interpret (some?) scripts a few times before compiling them

Categories

(Core :: JavaScript Engine, defect)
Priority: Not set
Severity: normal
Tracking

Status: RESOLVED FIXED
Tracking Status: fennec 2.0+

People

(Reporter: n.nethercote, Assigned: billm)

References

Details

(Keywords: memory-footprint, Whiteboard: fixed-in-tracemonkey)

Attachments

(5 files, 7 obsolete files)

The basic idea here is that we should not compile all scripts (functions, more or less) on their first execution, in order to save memory. Confusingly, four bugs have been created on this topic.

- Bug 631581 has some space measurements and analysis showing that lots (eg. ~50%) of scripts are only called once, and ~10% of scripts contain loops that iterate at least once.

- Bug 631706 proposes compiling loopless scripts only after they've been called at least once. This requires a single 1-bit counter per function. This is the simpler of the two proposals and looks to reduce JM's memory usage by ~2x and the browser's by ~3--10%. I call this the "one-counter approach".

- Bug 631714 proposes compiling scripts only after they've been called at least M times, or contain a loop that has iterated at least N times. M and N are likely to be small, eg. 16. This looks to reduce JM's memory usage by ~5-10x and the browser's by ~5--20% (or more). I call this the "two-counter approach". Note that the one-counter approach is just a specialization of the two-counter approach with M=1 and N=0.

- Bug 631740 has an implementation of the two-counter approach and some SunSpider measurements with different values of M and N. It suggests that SunSpider performance isn't regressed unless large values of M and N are used (eg. M=4096 or N=512).

Both approaches should make normal browsing a shade faster as well.

This bug has been created to subsume the above four bugs and house all future discussion/patches/etc. This makes sense because it's a single idea we're contemplating. We are considering this for Firefox 4.0. Even though it's very late, it's potentially a very big win. Risks include:

- Introducing new correctness bugs, though it's not very complicated.
- Regressing performance in some cases, perhaps due to disturbing trace tuning, though SunSpider at least seems ok, and small values of M and N seem like they should only disturb programs that are very sensitive to tuning.

I'm arguing strongly that we should push to get the two-counter approach into Firefox 4.0, because the browser uses a lot of memory, the JS engine is responsible for a lot of that, and this change helps a great deal.
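To make the proposal concrete, here is a minimal sketch of the two-counter heuristic. All identifiers are illustrative, not taken from any of the attached patches, which differ in detail:

    // Sketch of the two-counter heuristic (illustrative names only).
    static const uint32_t M = 16;   // calls before compiling
    static const uint32_t N = 16;   // taken back-edges before compiling

    // Checked when the interpreter enters a script: interpret the first
    // M calls, compile on the next one.
    bool HotFromCalls(JSScript *script) {
        return ++script->callCount > M;
    }

    // Checked each time the interpreter takes a loop back-edge: compile
    // once any loop in the script has iterated more than N times.
    bool HotFromBackedges(JSScript *script) {
        return ++script->backedgeCount > N;
    }

    // The one-counter approach is the special case M=1, N=0: loopless
    // scripts compile on their second call, and any taken back-edge
    // compiles immediately.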
blocking2.0: --- → ?
Keywords: footprint
I hope we can fix this bug for the last beta, but right now, I wouldn't recommend delaying the release for an estimated 10% footprint reduction. I would suggest doing it in a dot release if we can't make the main one. We should keep discussing it, though, as more data and patches come in.
blocking2.0: ? → .x
tracking-fennec: --- → ?
in reply to https://bugzilla.mozilla.org/show_bug.cgi?id=631733#c9

> Nicholas Nethercote wrote
> Bug 631951 will greatly reduce JaegerMonkey's memory usage, if it makes it into
> Firefox 4.0 (I sure hope it does).

Thanks for this Nicholas. I think this really needs to be sorted out before the 4.0 release, as inevitably people will compare the memory usage to Chrome, Opera, and IE9, and it's not going to look good currently. I presume you saw my test with the AskMen page, where I tracked down JaegerMonkey as the culprit for doubling the memory usage of one tab between Beta 6 and Beta 7 on this AskMen survey: http://www.askmen.com/specials/2010_great_male_survey/ https://bugzilla.mozilla.org/show_bug.cgi?id=631733#c1

FF4 Beta 6 added 43MB when clicking a tab on that page; Beta 7 added 91MB. I just tested other browsers: IE9 added 25MB, Iron (Chrome) 8 added 15MB to load that tab, and Opera 11 had no change at all.
Beta 7 introduced JaegerMonkey, so that is no surprise. JaegerMonkey uses extra RAM to store its compiled code. This bug is about reducing that memory use.
(In reply to comment #7)
> Beta 7 introduced JaegerMonkey, so that is no surprise. JaegerMonkey uses extra
> RAM to store its compiled code. This bug is about reducing that memory use.

Sure, I realised that, which is why I used it as an example. I just posted to show how it now looks even worse compared to other browsers in memory usage.
orangezilla: the askmen one sounds like it's leaking. I opened bug 632234 to track cases like that. This bug is not for leak-related problems.
(In reply to comment #9)
> orangezilla: the askmen one sounds like it's leaking. I opened bug 632234 to
> track cases like that. This bug is not for leak-related problems.

I'm seeing this on the 30th-and-onwards builds (I'm not sure about before that); I often see similar 90-100MB jumps loading pages or tabs on multiple websites.
I did some measurements.

TL;DR
-----
The one-counter approach with a threshold of 16 looks like a good strategy. We should try it and measure it.

My measurement process
----------------------
- 64-bit opt build (with symbols, ie. compiled with -g) on Ubuntu 10.04
- Create a profile with the following pages open:
    nytimes.com
    theage.com.au
    www.techcrunch.com
    www.cad-comic.com/cad/
    aec.gov.au
    about:memory
- Open a fresh browser instance, hit "restore session", which opens a new window.
- Close the old window containing about:home.
- Wait until all pages have finished loading (as judged by when the spinning icons stop), then wait another five seconds just to be sure.
- Then refresh about:memory and read the js/mjit-code value.

I did this with cdleary's patch, for a bunch of JS_CALLS_BEFORE_COMPILE and JS_BACKEDGES_BEFORE_COMPILE values (which I'll abbreviate to 'C' and 'B').

The results
-----------
Here are the js/mjit-code values (in MB (not MiB)) for the different combinations:

           B=0     1     2     4     8    16    32    64  1000
  ----------------------------------------------------------------------
  C=    0 | 52.0    X     X     X     X     X     X     X     X
        1 | 36.7  36.8                                      36.6
        2 | 31.0        31.4
        4 | 28.8              28.8
        8 | 26.8                    26.8
       16 | 11.9                          12.6
       32 |  7.5                                 7.2
       64 |  6.4                                       6.2
     1000 |  1.9                                              2.0

First, consider the leftmost column, where C=N and B=0. This is the case where we delay compiling a script until it's been called N times, with the exception that we compile as soon as a back-edge is hit. (If we can assume that all functions that contain a loop execute at least one iteration of that loop, which is probably almost true, this is the same as the one-counter count-only-calls-to-loopless-scripts approach.) The amount of memory usage goes down clearly as C increases. Importantly, C=1 only gives a 1.4x reduction, so a 1-bit counter isn't good enough. C=16 gives a 4.4x reduction.

Now, consider the (C=0,B=N) row. It's marked with Xs because it doesn't make sense, except for the (C=0,B=0) case -- there's no point counting loop iterations if we compile a function as soon as it's called.

Next, consider the (C=N,B=N) diagonal. The results are almost the same as the leftmost (C=N,B=0) column. (If everything was deterministic, you'd expect the diagonal values to always be less than or equal to the leftmost column; this isn't true due to noise in the measurements.) This indicates that counting loop iterations does very little to reduce memory use. This makes a certain amount of sense: there must not be many scripts that are both (a) called only a few times and (b) contain loops with a small number of iterations. I tried (C=1,B=1000) as an additional check, for little difference.

From a speed point of view, we probably want C+B to be as small as possible -- that minimizes the risk that we get some kind of unexpected time blow-out.

Effect on total browser usage
-----------------------------
Now, (C=16,B=0) looks like a sweet-spot. It reduces mjit memory usage by 4.4x, or 40MB. What does that do to total browser usage? Well, it depends how you measure it.

For the above workload, about:memory reported "memory in use" values (which are more important than "memory mapped" values) in the range 227--316MB. Yes, the variation was that high. (In fact, it seemed to generally increase as I kept restarting Firefox, which was worrying.) But about:memory only measures the heap (ie. malloc/new/new[]). js/mjit-code itself measures a mixture of (1) heap space (for JITScripts and IC data), which is included in about:memory's numbers, and (2) mmap'd executable code. So it's hard to compare "memory in use" with js/mjit-code.

If I look at 'top' or /proc/pid/status, I get total virtual memory figures around 900MB and RSS figures around 350MB (again with lots of variation). So that's about a 5% improvement on the VM total figure, but that includes heaps of stuff that'll never be touched, like thread stacks and big chunks of code and data sections. My previous measurements with Massif lead me to believe that halving that VM total gives roughly the actual memory touched by the browser, in which case the improvement is roughly 10%. And this workload isn't particularly JS-heavy; opening 20 techcrunch.com tabs would be much heavier.

So it's probably easier to just focus on the mjit numbers: we know that they can be big, so let's reduce them as much as possible while minimizing performance hits on things like SunSpider, and also while minimizing risk.

What now?
---------
I think that implementing the simpler one-counter approach and measuring it for a range of C values is the next step. I believe dvander has already made progress there.
That data looks like memory usage is simply independent of B for that test corpus, right?
(In reply to comment #12)
> That data looks like memory usage is simply independent of B for that test
> corpus, right?

Yes. That's why I'm suggesting we implement just the C counter and measure that.
Attached patch one-counter attempt (obsolete) (deleted) — Splinter Review
Great data, Nick. Today I hacked up some really dumb version of the one-counter approach, as a fallback in case we can't get something fancier into the tree in time. So I tried your workload & steps with C=0, 1, and 16, and with Bill's patch in the about:compartments bug to track cumulative memory.

C=0:  js/mjit-code             46,237,592
      js/mjit-code-cumulative  47,421,256
C=1:  js/mjit-code             31,895,272
      js/mjit-code-cumulative  31,895,480
C=16: js/mjit-code             27,979,880
      js/mjit-code-cumulative  28,333,672

These results aren't as good as yours with Chris's patch, so I'll try that tomorrow to see how it differs. My attempt here was to get absolutely no SunSpider loss with -m -j -p, and Bill said that with the two-counter patch, they only tested -m. However, with this one, C=0 and C=16 make no difference to SunSpider, so we can probably try cranking it up more.

Anyway, the only weird special case is that if (interpMode != JSINTERP_NORMAL), i.e. profiling or recording a trace, we have to method-JIT aggressively or else we lose about 1-2% on SunSpider. I'll investigate more tomorrow as I'm running out of time today.
I tried dvander's patch and got even worse results:

C=0:    56.3 MB
C=1:    45.7 MB
C=16:   42.6 MB
C=1000: 42.1 MB

That C=1000 number made me really suspicious, so I tried removing the |!script->hasLoops| part of the test in CanMethodJIT() and got these numbers:

C=16:   13.0 MB
C=1000:  1.6 MB

Much closer to comment 11. I can see two possibilities: (1) Most of the functions that are called a small number of times contain a loop that executes at most once. This doesn't seem that likely. (2) There's a problem with the computation of script->hasLoops. Maybe there's another possibility I haven't thought of?
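For reference, here's how I read the test being discussed -- a sketch reconstructed from this comment, not quoted from the patch (everything except script->hasLoops and CanMethodJIT is illustrative):

    // In CanMethodJIT(): delay compiling cold scripts, except that the
    // !script->hasLoops clause compiles any script containing a loop on
    // its very first call.
    if (script->callCount < COMPILE_THRESHOLD && !script->hasLoops)
        return Compile_Skipped;   // keep interpreting for now

Removing the !script->hasLoops clause subjects loopy scripts to the same call-count delay, which would explain the much larger memory reduction.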
Probably more likely is there's a difference between cdleary's patch and dvander's patch that I haven't accounted for.
I worked out a possible cause. dvander's patch is conservative in one sense: if a script is compiled, all nested scripts within it are also compiled. So if you have a loop at the top-level of a file, the top-level script will be compiled, and so will every function within it. (Although, I don't understand how cdleary's patch avoids that problem...)
Hmm, my analysis in comment 11 assumes cdleary's patch is correct. But it looks like js_MonitorBranch() is only called if JS_TRACER isn't defined. That might explain why the B parameter didn't seem to do anything. Also, another likely difference between the two patches: cdleary's patch changes RunScript() so that a script is always interpreted once.
dvander said he gets a 20% SS regression with cdleary's patch with (C=1,B=0). So I think that patch is too rough to be considered further, and the data in comment 11 is suspect.
FWIW, here are SunSpider numbers with dvander's patch applied with a threshold of 16:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                  | compiled (may overestimate)        |
---------------------------------------------------------------
|   68.354   68.244 (1.002x) |   44.763   44.760 (------) | 3d-cube
|   40.237   39.372 (1.022x) |   25.189   23.932 (1.053x) | 3d-morph
|   63.760   63.691 (1.001x) |   37.220   37.218 (------) | 3d-raytrace
|   25.287   26.026 (0.972x) |   11.101   11.064 (1.003x) | access-binary-
|   91.818   91.799 (------) |   85.809   85.809 (------) | access-fannkuc
|   29.094   29.004 (1.003x) |   16.514   16.513 (------) | access-nbody
|   36.156   36.136 (1.001x) |   29.335   29.334 (------) | access-nsieve
|    7.378    7.366 (1.002x) |    3.253    3.253 (------) | bitops-3bit-bi
|   36.514   36.501 (------) |   31.742   31.742 (------) | bitops-bits-in
|   15.820   15.756 (1.004x) |   12.018   12.015 (------) | bitops-bitwise
|   38.001   37.981 (1.001x) |   32.968   32.967 (------) | bitops-nsieve-
|   16.961   16.920 (1.002x) |   12.874   12.870 (------) | controlflow-re
|   23.147   23.073 (1.003x) |   11.792   11.790 (------) | crypto-md5
|   19.218   19.140 (1.004x) |    8.495    8.493 (------) | crypto-sha1
|   68.855   71.106 (0.968x) |   21.930   21.443 (1.023x) | date-format-to
|   51.923   73.529 (0.706x) |    9.617    9.184 (1.047x) | date-format-xp
|   41.606   41.481 (1.003x) |   27.330   27.327 (------) | math-cordic
|   22.768   22.719 (1.002x) |    6.308    6.306 (------) | math-partial-s
|   31.517   31.479 (1.001x) |   26.589   26.588 (------) | math-spectral-
|   47.093   47.317 (0.995x) |   34.563   34.563 (------) | regexp-dna
|   25.947   25.643 (1.012x) |    9.254    9.240 (1.002x) | string-base64
|   60.810   60.734 (1.001x) |   32.705   32.703 (------) | string-fasta
|   95.105   94.880 (1.002x) |   17.269   17.264 (------) | string-tagclou
|   98.241   98.578 (0.997x) |   12.626   12.625 (------) | string-unpack-
|   38.630   38.565 (1.002x) |    8.977    8.975 (------) | string-validat
-------
| 1094.255 1117.052 (0.980x) |  570.252  567.989 (1.004x) | all

date-format-xparb is greatly affected. An interesting thing is that part of the extra time is in the interpreter, but part of it is in trace compilation. Without dvander's patch, the trace JIT doesn't get invoked at all on date-format-xparb; with the patch, it does get invoked quite a lot. It looks like an example of trace tuning perturbation.
Kraken numbers, not much change there:

---------------------------------------------------------------
| millions of instructions executed                           |
| total                  | compiled (may overestimate)        |
---------------------------------------------------------------
|  3499.890  3499.606 (------) |  3331.079  3331.056 (------) | ai-astar
|  1811.866  1804.391 (1.004x) |  1324.528  1324.436 (------) | audio-beat-det
|  1327.256  1318.438 (1.007x) |  1164.311  1162.656 (1.001x) | audio-dft
|  1717.309  1709.697 (1.004x) |  1249.322  1249.240 (------) | audio-fft
|  2348.611  2339.535 (1.004x) |  1624.658  1624.575 (------) | audio-oscillat
|  5586.936  5586.894 (------) |  4754.496  4754.495 (------) | imaging-gaussi
|  1951.339  1957.331 (0.997x) |   719.356   721.186 (0.997x) | imaging-darkro
|  4356.213  4354.941 (------) |  3531.506  3531.476 (------) | imaging-desatu
|   640.816   640.328 (1.001x) |    10.150    10.146 (------) | json-parse-fin
|   445.112   432.845 (1.028x) |     5.246     5.111 (1.027x) | json-stringify
|  1131.289  1130.140 (1.001x) |   727.716   727.701 (------) | stanford-crypt
|   665.764   664.899 (1.001x) |   369.794   369.778 (------) | stanford-crypt
|  1164.924  1160.898 (1.003x) |   597.261   597.220 (------) | stanford-crypt
|   497.284   496.624 (1.001x) |   200.364   200.350 (------) | stanford-crypt
-------
| 27144.618 27096.572 (1.002x) | 19609.791 19609.431 (------) | all
tracking-fennec: ? → 2.0next+
If there is a shot at saving >20MB and not regressing SS noticeably, Mobile would be very interested in this bug for Fennec 4.0 (blocking-fennec: 2.0+). We are getting killed (literally) by excessive memory use on Android.
Marked blocking fennec
tracking-fennec: 2.0next+ → 2.0+
As I mentioned in bug 631706, Type Inference will need something like this as well: see bug 621077. Of course, TI is post-Firefox 4 anyway.
tracking-fennec: 2.0+ → ---
tracking-fennec: --- → 2.0+
Attached patch papered over profiler problem (deleted) — Splinter Review
Today Bill found that the reason this patch regresses V8, and his and Chris's patch regressed SunSpider, is the profiler.

The profiler aborts when re-entering the interpreter. The recorder does too, but it decomposes complex ops (like CALL with fun_call) into imacros. The profiler doesn't do this. However, we're missing a profiler-abort in EnterMethodJIT. Since we never interpret anything, we never abort the profiler. Bill says this is a bug, but it's not clear how to easily fix it. And our patches to not JIT everything started aborting the profiler, and preventing loops from tracing.

To make this patch not totally regress V8, I've narrowed down where the profiler would abort in the interpreter.

At present, Bill says he has a WIP patch that tunes the profiler a little more and uses backedge counting to lazily method-JIT. That approach seems like a promising alternative. I'll have some more data on delaying nested call compilation in a few minutes.
Attachment #510499 - Attachment is obsolete: true
Attached patch delay call IC compilation (obsolete) (deleted) — Splinter Review
Applies on top of the previous patch and delays call IC compilation using the same heuristics. Setting the delay to 16 regressed SunSpider by 10%; at 1, "only" a 6-7ms (3-4%) regression. At 1, Nick's browser test case got 4MB better than with the non-IC-delay patch (27MB from 31MB). At 16, it's 5MB better (22MB from 27MB).

I'm looking forward to seeing what Bill has, since backedge counting should be more accurate and flexible. It makes more sense to compile a call graph from inside a loop than functions that contain loops. This patch was intended to be the simplest, least regression-prone thing possible, but it's clear these heuristics are tricky no matter what.

Anecdotally: I currently have 13 tabs open on my laptop, using 1.1GB of memory. The JIT code accounts for 27MB of this. On my desktop, with three tabs, the JIT is using 6MB of memory out of 700MB. On Dave Mandelin's computer, I think he said he had five tabs open; it was using 400MB, 7MB of which was JIT. Completely eliminating the JIT wouldn't make a dent in those numbers. We should make sure any memory savings will really be high-impact before taking any sort of non-negligible hit on hard-won benchmarks.
(In reply to comment #26)
> Completely eliminating the JIT wouldn't make a dent in those numbers. We should
> make sure any memory savings will really be high impact before taking any sort
> of non-negligible hit on hard-won benchmarks.

+1. I think we should recap what level of awesome improvement we're looking for here, and whether we're at all on track to hit that.
(In reply to comment #26)
> Anecdotally: I currently have 13 tabs open on my laptop, using 1.1GB of memory.
> The JIT code accounts for 27MB of this. On my desktop, with three tabs, the JIT
> is using 6MB of memory out of 700MB. On Dave Mandelin's computer, I think he
> said he had five tabs open, it was using 400MB, 7MB of which was JIT.

What measure are you using for the total -- something from about:memory? Or something like total VM usage as reported by 'top' or a similar program? You have to be careful with memory measurements; throwing around numbers without being clear what they are will lead to confusion.

> Completely eliminating the JIT wouldn't make a dent in those numbers. We should
> make sure any memory savings will really be high impact before taking any sort
> of non-negligible hit on hard-won benchmarks.

Because JM throws away old code periodically, these numbers don't mean much if we don't have an idea of how recently all those tabs were loaded. Peak memory usage is an important measure. My Massif measurements have repeatedly shown JM code to be among the top handful of sources of memory use. Go read bug 615199 or my old blog posts again if you need reminding.

I can see this patch is looking more complicated than it did at first, and as a result it might not make 4.0. That doesn't mean it isn't worthwhile.

cdleary asks what kind of improvement we're looking for here -- re-read comment 11. A 4.4x relative reduction, saving ~40MB for that simple workload, would be excellent. As I said, that would give a ~5% reduction in total memory use and a ~10% reduction in terms of memory actually touched by the browser. I consider a 5% to 10% reduction huge; how many reductions of that size did we get while reducing our SunSpider time from ~800ms to ~300ms?

With SunSpider we had clear agreement that the goal was worthwhile, and chasing all those 1% wins (as well as the occasional 5% win) was worth the effort. It's obvious that the value of memory reduction is less clear in everyone's minds.

Look back at your anecdotal stats. You're all averaging something close to 100MB per tab. I think that's unreasonable. Our memory usage is way higher than in 3.6. Firefox already has a reputation for being a memory hog. People are going to notice, especially those on mobile.
(In reply to comment #26)
> Completely eliminating the JIT wouldn't make a dent in those numbers. We should
> make sure any memory savings will really be high impact before taking any sort
> of non-negligible hit on hard-won benchmarks.

Shouldn't lazy compilation actually win us some performance? According to bug 631581 comment 4 and bug 615199 comment 20 we should actually see speed improvements *and* memory improvements. If we don't, then something else is eating up our gains, i.e. there is a hidden performance regression affecting the scripts that should be faster if interpreted instead of compiled. Either that, or those estimates made oversimplifying assumptions.
The 8472, currently used benchmarks don't have any scripts that run few enough iterations to be worth interpreting. So on benchmarks, lazy compilation is pure lose; the only question is how much.
That may be true for benchmarks, but real webpages have lots of run-once/run-rarely scripts (think of all the ads loaded with JavaScript, social network widgets, Google Analytics, registering event handlers with JS libraries, etc.) that run right when the page is loaded. As far as I can see those benchmarks are very heavy on loops and light on closures, while bug 631581 states that, when tested with real webpages, most scripts (50%) are loop-less and an even larger percentage does not run loops often. In other words, this could be a real performance gain even if it does not show up in synthetic benchmarks. And I don't think that a real optimization should be dropped because it doesn't show up in bad benchmarks.

As I mentioned in another bug, lazy compilation could get us performance boosts even in those benchmarks if compiling happens on a separate thread (assuming a multi-core system). The JS thread itself can still run in interpreted mode while it's waiting for the compile to finish. But that's something for the future.
> that may be true for benchmarks,

Yes, and the comment you were responding to was specifically about benchmarks. We all agree the benchmarks are crap. Too bad people trust them, so regressing them significantly is bad. Note that this is different from "improving other stuff while not regressing the benchmark", which we all agree is a good thing.
(In reply to comment #28)
> Because JM throws away old code periodically, these numbers don't mean much if
> we don't have an idea of how recently all those tabs are loaded.
>
> Peak memory usage is an important measure.

I thought the first part of http://blog.mozilla.com/jseward/2011/01/27/profiling-the-browsers-virtual-memory-behaviour made a good point on the extent of the utility of 'peak memory usage'. With that understanding, it seems like peak memory usage shouldn't be a primary motivation post-bug 617656. (Obviously, it's not worthless either, as even short-term high memory usage leads to cache misses and swapping; but then let's have that be the subject of measurement.)

Nevertheless, my informal browser-time-spent-in-the-JS-engine measurements also indicate that lazy-mjitting would decrease latency on all but the most JS-intensive pages, so I really would like to see lazy-mjitting landed.
As long as we're shipping 32-bit builds peak memory (or more precisely address space, which is even worse) usage matters, because we will OOM and crash as soon as we go over 2GB at most (usually closer to 1.5GB, due to fragmentation) on Windows....
(In reply to comment #34)
> As long as we're shipping 32-bit builds peak memory (or more precisely address
> space, which is even worse) usage matters, because we will OOM and crash as
> soon as we go over 2GB at most (usually closer to 1.5GB, due to fragmentation)
> on Windows....

Yes, sorry, I was unclear; I was implicitly referring to *mjit* peak memory usage, which I am assuming bug 617656 fixes. The rest of the browser definitely matters.
(In reply to comment #29)
> Shouldn't lazy compilation actually win us some performance?

(In reply to comment #33)
> I really would like to see lazy-mjitting landed.

The performance win from lazy JITing on non-JS-intensive pages is not perceivable, as far as I understand -- we're talking 10-30ms during page load. We want lazy-mjitting; I never said it wasn't worthwhile. It definitely is, and it'll be even more important as JM evolves. I'm saying we should be very careful about throwing away benchmark points in 4.0, and I'm concerned with 100MB/tab in a long-running session and very little of it being JIT code.
(In reply to comment #35) On second thought, though, mjit peak usage could matter if the browser was at an already-high usage level and then the mjit does a burst of allocations. So decreasing peak memory usage would help there. Of course, what you really want for that problem is not a % less peak memory usage but for the malloc OOM handler to purge mjit code and other caches. I've seen this proposed several times before.
(In reply to comment #28)
> With Sunspider we had clear agreement that the goal was worthwhile, and
> chasing all those 1% wins (as well as the occasional 5% win) was worth the
> effort. It's obvious that the value of memory reduction is less clear in
> everyone's minds.

Yes. We need a better understanding of these tradeoffs. As a gut response, I would say that, right now, a 10% reduction in total memory would be worth a 1% regression in SunSpider performance. But not 5%.

For my personal browser usage, browser memory reduction has no value: AFAICT, I never experience any problems from it. That is probably a part of why memory reduction is less appreciated than maybe it deserves. Presumably netbooks, mobile devices, and such could really benefit. Maybe we should get some low-memory devices, try some stuff, and see what gets painful.

> Look back at your anecdotal stats. You're all averaging something close to
> 100MB per tab.

That's not right. In the example of my browser above, I had 10-15 tabs open (I think), not just 5. At the moment, I have 18 tabs open, and about:memory shows 200MB allocated with malloc, 300MB working set, and 7MB jitcode. A little while ago, it was 17MB jitcode; the aging algorithm must have reduced it. On this machine I have 8GB and currently 5.5GB free.

> I think that's unreasonable. Our memory usage is way higher than in 3.6.

That's expected due to gfx and JIT work, and the memory increase is not that out of proportion with the change in the price of memory over that period. Also, Fx4's performance is so much better in every practical aspect that I would not even consider using 3.6. We should reduce memory, but 3.6 is not the right point of comparison.

> Firefox already has a reputation for being a memory hog. People
> are going to notice, especially those on mobile.

If I read the gist of your comment right, you are trying to persuade people that reducing memory usage is important, and you want to try to move ahead on this patch, but if it doesn't make the Fx4 release because it gets too hard, that's OK, we can do it later. That's my position, anyway. I also think that if we do miss the release on this, we should get it in a service release soon after.

There is another consideration, though: if other parts of the browser are also really close to getting some substantial reductions, it could be that we are able to reduce total memory usage more. And if we could cut total usage by, say, 30% from our current place, then we might want to hold the release a little bit for that. I don't have that big picture, though.
(In reply to comment #38)
> For my personal browser usage, browser memory reduction has no value: AFAICT, I
> never experience any problems from it. That is probably a part of why memory
> reduction is less appreciated than maybe it deserves. Presumably netbooks,
> mobile devices, and such could really benefit. Maybe we should get some
> low-memory devices, try some stuff, and see what gets painful.

Please grab some Android devices and use Fennec. We can OOM on the Nexus S fairly easily.

I understand that it's not a sure thing that fixing this bug will help our situation on mobile, but I'd like to test it a bit. Do we have consensus on which patch/approach we should be considering? We can try to get some numbers from running on Android.
An easy test to figure out the maximum gain for mobile is just to run with mjit off and see what it does to about:memory or your OOM cases. If you still crash, it's not enough. If you don't, we should probably let you tune the threshold somehow, since you have some SunSpider lead to burn.
(In reply to comment #39)
> Please grab some Android devices and use Fennec. We can OOM on the Nexus S
> fairly easily.
>
> I understand that it's not a sure thing that fixing this bug will help our
> situation on mobile, but I'd like to test it a bit. Do we have consensus on
> which patch/approach we should be considering? We can try to get some numbers
> from running on Android.

I think Fennec almost certainly should use a pretty aggressive approach here. Even if it doesn't fix your OOM cases it seems like pure win. We'll let you know when we have a patch worth trying. In the meantime, shaver's suggestion should give you a pretty good estimate of the magnitude of the savings.
Attached patch unified backedge/call counting patch (obsolete) (deleted) — Splinter Review
This patch incorporates code from Chris's patch and David's patch. It works with the tracer and the profiler. It counts backedges on a per-loop basis, and it also counts calls. It compiles a function when it has been called enough or when any of its loops have been traversed enough. However, it still has the problem with profiler aborts that David mentioned above. I haven't yet decided how to fix this. I'll post performance data next.
Attached file performance data for patch in previous comment (obsolete) (deleted) —
This file shows the performance and memory data for 3 different configurations. In all three configurations, compilation happens if we traverse a loop once. It also happens if we call a function at least N times, where N is 10, 100, and infinity.

I measured performance on our usual benchmarks, as well as some that I have gathered over the course of the profiler work. For memory usage, I opened three tabs: Gmail, Zimbra, and TechCrunch. Then I used the patch in bug 625305 to measure cumulative mjit memory usage. This counts how many bytes have been allocated for mjit code, and ignores any releases. The goal is to ignore blips caused by Brian's patch that occasionally throws out mjit code during GC.

In the N=10 config, the SunSpider score is basically the same. Memory goes from 92 MB to 48 MB. For N=100, SunSpider regresses by 0.5% and memory goes down to 17 MB. For N=infinity, SunSpider regresses by 1.2% and memory goes down to 8 MB. Other benchmarks don't seem to be affected too much. The biggest V8 change is 0.5%.

The biggest SunSpider regression is in date-format-tofte. As far as I can tell, it's caused by the profiler-abort problem that David mentioned above. If I could somehow fix this, we might actually be saving time on SS even in the N=infinity case. I'll look into this. Right now, I want to get the patch out so we can start worrying over how risky it is.
Attachment #511214 - Attachment is patch: false
Awesome results! I would definitely take a 0.5% regression for that.
Great stuff. Some comments/questions:

- Why did you choose to store the backedge table in the compartment? I would have thought the JSScript was the obvious spot. Having said that, adding a vector to every JSScript makes them significantly bigger and there are lots of JSScripts -- was that why?
- In backedgeCount() you return 0 if the lookup fails. Will it ever fail?
- There isn't a single comment in the entire patch! One or two wouldn't hurt, eg. on CALLS_BEFORE_COMPILE and BACKEDGES_BEFORE_COMPILE.
- This:

      #if (defined(JS_TRACER) && defined(JS_METHODJIT)) || defined(JS_METHODJIT)

  can be simplified to this:

      #if defined(JS_METHODJIT)

I'll do some measurements of my own.
Oh, you'll need to handle OOM better here:

    backEdgeTable = js_new<BackEdgeTable>();
    JS_ASSERT(backEdgeTable);
    if (!backEdgeTable->init(32))
        JS_ASSERT(0);
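For illustration, the OOM handling could look something like this -- a sketch only, assuming the usual fallible-initialization pattern and that js_delete is available as the counterpart to js_new:

    // Report allocation failure to the caller instead of asserting;
    // JS_ASSERT compiles away in release builds, which would leave a null
    // pointer or an uninitialized table behind.
    backEdgeTable = js_new<BackEdgeTable>();
    if (!backEdgeTable)
        return false;
    if (!backEdgeTable->init(32)) {
        js_delete(backEdgeTable);
        backEdgeTable = NULL;
        return false;
    }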
Yes, the hash table is in the compartment to save memory. backedgeCount() will return 0 if you compiled because of calls, in which case you may not have hit any back edges. The patch is still kinda preliminary. I'm mainly looking for feedback on the approach.
(In reply to comment #47) > Yes, the hash table is in the compartment to save memory. backedgeCount() will > return 0 if you compiled because of calls, in which case you may not have hit > any back edges. Fair enough. Both of those explanations would make good comments :) > The patch is still kinda preliminary. I'm mainly looking for feedback on the > approach. I'm not an expert on all this interpreter/JM interaction, but it looks pretty clean.
I think this approach is better than counting calls alone, and compiling after only one or two backedges should have a very low chance of regressing.

On my x86 VM, using Nick's test workload, I use 34.2M of cumulative JIT code memory before this patch. With the patch, with C=10,B=1, I get 20.4M. With C=10,B=10, I get 14.9M. So something in this workload might be pretty loopy.
blocking2.0: .x → ---
Err, that second one should read B=100.
(In reply to comment #33)
> I thought the first part of
> http://blog.mozilla.com/jseward/2011/01/27/profiling-the-browsers-virtual-memory-behaviour
> made a good point on the extent of the utility of 'peak memory usage'.

When I do page-level Massif runs on Linux64, I see that typically around half the taken space is due to (a) dlopen'd code and data segments, and (b) thread stacks. Lots of that space will never even be touched. So the very first distinction should be between "pages that are mapped but never touched" and "pages that are mapped and touched". The former case can be almost completely ignored (except for the address space exhaustion issue on 32-bit) and it could be maybe 1/4 or 1/3 of total VM space, at a guess.

The latter case -- which includes every byte written by JM -- is where things get interesting. Ideally you'd then focus on a smaller set of hotter pages. Working out what those hot pages are is difficult. Julian's VM-sim tool is a great first attempt at measuring that, and it's telling us that there are basically two things causing page faults: GC, and *IC purging*. Fiddling with the order that ICs are purged may help with that; creating fewer ICs will *definitely* help with that.
Also, JITted code space tends to hurt more than normal code or data, because you write it and then execute it, which tends to stuff up the caches. Code that's patched is even worse.
I repeated my experiments from comment 11. My results are much less positive:

           B=0     1     2     4     8    16    32    64  1000   inf
  ----------------------------------------------------------------------------
  C=    0 | 53.1    X     X     X     X     X     X     X     X     X
        1 | 52.7  53.2  53.4  53.0  52.9  52.7
        2 | 44.3  43.0  40.6  37.4
        4 | 37.5  32.1  31.8
        8 | 32.8  31.9  29.5  28.6
       16 | 39.5  39.7  32.5  22.0  19.7  19.0  18.0
       32 | 32.0  20.9  13.9   9.4
       64 | 32.0  10.6   8.0
     1000 | 39.3  39.5   2.2   2.2
      inf | 39.3  39.2  37.1  35.8  31.8  19.8  11.7   8.5   2.1   2.0

Even going to (C=inf,B=1) only drops code size from 53.1 to 39.3 MB. Around (C=16,B=16) I start getting good results.

I talked to billm on IRC about this; he's getting much better results than me on the same workload. Something weird is going on. I tried a fresh profile and both opt and debug builds, and got the same results with all variations.
The fact that I get no change with (C=1,B=1) is suspicious. It suggests that there are close to zero functions that are called only once, which contradicts both intuition and dmandelin's earlier measurements.
Attached patch fix off-by-one errors in unified patch (obsolete) (deleted) — Splinter Review
There are off-by-one errors in both count checks:

    script->incCallCount() < CALLS_BEFORE_COMPILE
    cx->compartment->incBackEdgeCount(pc) < BACKEDGES_BEFORE_COMPILE

Because the counts are incremented before comparing, the comparisons should be <= rather than <. This means that setting either threshold to 1 is the same as setting it to 0. That explains why my (C=1,B=1) results have been identical to (C=0,B=0), which is progress of a sort.
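Spelled out, the corrected checks would look like this (a sketch, not a quote from the fixed patch; the "return Interpret" is illustrative). Because inc*() returns the already-incremented count, <= lets a threshold of 1 interpret the first call or back-edge and compile on the second:

    // Stay in the interpreter while the count is at or below the threshold.
    if (script->incCallCount() <= CALLS_BEFORE_COMPILE)
        return Interpret;
    ...
    if (cx->compartment->incBackEdgeCount(pc) <= BACKEDGES_BEFORE_COMPILE)
        return Interpret;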
(In reply to comment #55) > Created attachment 511293 [details] [diff] [review] > fix off-by-one errors in unified patch This patch also removes some of billm's extra accounting code.
Adjusting the results in comment 53 to account for the off-by-one error:

           B=0     1     3     7    15    31    63   999   inf
  ---------------------------------------------------------------------
  C=    0 | 53.1    X     X     X     X     X     X     X     X
        1 | 44.3  43.0  40.6  37.4
        3 | 37.5  32.1  31.8
        7 | 32.8  31.9  29.5  28.6
       15 | 39.5  32.5  22.0  19.7  19.0  18.0
       31 | 32.0  20.9  13.9   9.4
       63 | 32.0  10.6   8.0
      999 | 39.3   2.2   2.2
      inf | 39.3  37.1  35.8  31.8  19.8  11.7   8.5   2.1   2.0

That looks more reasonable, though the code size drops more slowly than I would expect/hope as B and C increase. For example, if we assume that all scripts are the same size (quite an assumption, but bear with me):

- Going from (C=0,B=0) to (C=1,B=0) only dropped us from 53.1 to 44.3. That indicates that only 83% of scripts are executed more than once. dmandelin measured that number to be only 50%.

- Going from (C=0,B=0) to (C=inf,B=0) only dropped us from 53.1 to 39.3. That indicates that 74% of scripts have at least one taken back-edge. dmandelin measured that number to be only 47%.

Now, dmandelin was measuring script counts, not code size, and he had a different workload. Still, I'm a bit surprised.
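(The arithmetic behind those percentages, for the record: 44.3/53.1 ≈ 0.83 and 39.3/53.1 ≈ 0.74 -- the fraction of compiled code that survives each threshold, read as a fraction of scripts under the equal-size assumption.)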
I experimented with http://www.valgrind.org/njn/test.html (if you want to try it, download a local copy; it's big and the server is slow). The script it uses contains many (over 40,000) functions, stored in an array. It loops over the array and calls each function once.

Now, with (C=1,B=1) I thought that none of the 40,000 functions would be compiled, but they all are. With (C=1,B=20000) half of them are. So it appears that once a loop gets compiled, all functions called from within it are also compiled. Is this intentional? It might be related to ICs (see comment 26).

It might also explain why the numbers drop off more slowly than dmandelin's measurements suggested -- some functions are compiled even though they don't hit the C threshold because they are called within a compiled loop.
(In reply to comment #58)
> It might be related to ICs (see comment 26).

I tried applying dvander's patch from comment 26 on top of billm's patch. (I'm not even sure if that makes sense, but I thought I'd give it a try :) It fixes the all-functions-called-in-a-compiled-loop-are-compiled problem -- with (C=1,B=0) nothing in my test.html example is compiled, as I originally expected.

Furthermore, on my 5 tab workload, the numbers were a lot better:

                   old    new
                   ---    ---
  (C=0,B=0)       53.1   55.0
  (C=1,B=0)       44.3   34.8
  (C=1,B=1)       43.0   33.4
  (C=15,B=0)      39.5   25.0
  (C=15,B=15)     22.0   14.6
  (C=inf,B=0)     39.3   18.4
  (C=inf,B=15)    19.8    6.9

I'm not sure why the (C=0,B=0) result went up, but the other results are good, and much more in line with dmandelin's measurements.

I also tried SunSpider:

  (C=0,B=0)          266ms
  (C=15,B=0)         270ms
  (C=15,B=15)        261ms
  (C=1000,B=1000)    322ms
  (C=inf,B=inf)     1275ms

The margin of error was sizeable (eg. 5%) but reasonable values of B and C didn't seem to hurt.
(Oh, and FWIW, techcrunch.com accounts for about 75% of the code in my 5 site workload!)
Attachment #511214 - Attachment is obsolete: true
Attached patch new patch (obsolete) (deleted) — Splinter Review
I don't know what happened with the last patch I posted; I'm no longer able to reproduce the memory numbers I was getting on techcrunch. Now I'm getting numbers more in line with Nick's.

This patch combines the previous patch with David's don't-always-jit-for-call-ICs patch. It includes a workaround for a problem that happens when BACKEDGES > 1. I'm getting pretty good performance and memory results with this at (15,15). However, I'll hold off posting anything until Nick can verify that it actually saves memory.
Attached patch new new patch (obsolete) (deleted) — Splinter Review
Oops. This one's better.
Attachment #511569 - Attachment is obsolete: true
Attached file perf data for calls=15, backedges=15 (deleted) —
This is the performance data for the patch above. Surprisingly, the patch improves all the benchmarks, at least in the aggregate. I tried increasing the call threshold to infinity, and SunSpider performance got much worse, mostly due to controlflow-recursive (not surprisingly).
Attachment #511585 - Attachment is patch: false
new new stats for the new new patch:

           B=0     1     2     4     8    16    32    64   inf
  ----------------------------------------------------------------------
  C=    0 | 54.7    X     X     X     X     X     X     X     X
        1 | 33.0  31.6  30.1  29.4  27.9
        2 | 29.1  27.6  26.0
        4 | 28.2  25.4  22.6
        8 | 24.6  22.6  21.9  16.1  16.1  14.3
       16 | 23.1  21.6  17.9  14.8  12.5  12.9   9.5
       32 | 21.4  20.0  18.5  13.2  10.7   8.7
       64 | 22.3  18.7   9.5   5.6
      inf | 17.5   0.3

I reckon (C=16,B=16) still looks like something of a sweet-spot. Or we could do (C=2,B=0) if we want to sneak it into 4.0 :)
Attached file cg_diff output for v8-raytrace (deleted) —
I did some SS and V8bench runs, too:

                    SS      V8bench
  (C=0,B=0)       267ms      3360
  (C=1,B=1)       255ms      3671
  (C=8,B=8)       268ms      2970
  (C=16,B=16)     255ms      3181
  (C=32,B=32)     264ms      3136
  (C=64,B=64)     285ms      2365

The margin of error for SS was around 5%. At this point I don't trust the score-based version of V8bench much at all. So I did some Cachegrind runs on the time-based version in SunSpider, along with SunSpider itself.

---------------------------------------------------------------
| millions of instructions executed                           |
| total                  | compiled (may overestimate)        |
---------------------------------------------------------------
|   68.383   68.942 (0.992x) |   44.772   44.716 (1.001x) | 3d-cube
|   40.462   40.502 (0.999x) |   25.405   25.401 (------) | 3d-morph
|   63.799   62.534 (1.020x) |   37.219   36.921 (1.008x) | 3d-raytrace
|   25.301   25.413 (0.996x) |   11.100   11.095 (------) | access-binary-
|   91.820   91.821 (------) |   85.809   85.805 (------) | access-fannkuc
|   29.080   29.017 (1.002x) |   16.512   16.478 (1.002x) | access-nbody
|   36.150   36.094 (1.002x) |   29.335   29.333 (------) | access-nsieve
|    7.380    7.401 (0.997x) |    3.253    3.251 (1.001x) | bitops-3bit-bi
|   36.515   36.537 (0.999x) |   31.742   31.741 (------) | bitops-bits-in
|   15.822   15.838 (0.999x) |   12.018   12.017 (------) | bitops-bitwise
|   37.988   37.941 (1.001x) |   32.967   32.966 (------) | bitops-nsieve-
|   16.958   16.932 (1.002x) |   12.874   12.870 (------) | controlflow-re
|   23.165   23.835 (0.972x) |   11.792   11.731 (1.005x) | crypto-md5
|   19.246   19.298 (0.997x) |    8.493    8.478 (1.002x) | crypto-sha1
|   68.826   69.146 (0.995x) |   21.937   21.894 (1.002x) | date-format-to
|   51.999   50.445 (1.031x) |    9.677    9.581 (1.010x) | date-format-xp
|   41.593   41.513 (1.002x) |   27.329   27.323 (------) | math-cordic
|   22.742   22.801 (0.997x) |    6.304    6.291 (1.002x) | math-partial-s
|   31.517   31.573 (0.998x) |   26.588   26.578 (------) | math-spectral-
|   47.118   46.916 (1.004x) |   34.563   34.555 (------) | regexp-dna
|   25.937   26.075 (0.995x) |    9.252    9.235 (1.002x) | string-base64
|   60.829   60.814 (------) |   32.704   32.735 (0.999x) | string-fasta
|   95.109   94.667 (1.005x) |   17.258   17.075 (1.011x) | string-tagclou
|   98.966   99.128 (0.998x) |   12.628   12.605 (1.002x) | string-unpack-
|   38.649   38.643 (------) |    8.977    8.966 (1.001x) | string-validat
-------
| 1095.367 1093.839 (1.001x) |  570.521  569.653 (1.002x) | all

---------------------------------------------------------------
| millions of instructions executed                           |
| total                  | compiled (may overestimate)        |
---------------------------------------------------------------
| 1197.162 1202.137 (0.996x) | 1035.788 1035.707 (------) | v8-crypto
| 1145.612 1146.711 (0.999x) |  842.931  842.762 (------) | v8-deltablue
|  894.296  878.813 (1.018x) |  453.731  453.343 (1.001x) | v8-earley-boye
|  503.383  467.835 (1.076x) |  237.959  235.706 (1.010x) | v8-raytrace
|  877.875  887.177 (0.990x) |  371.943  371.146 (1.002x) | v8-regexp
| 1143.424 1144.157 (0.999x) | 1115.515 1115.470 (------) | v8-richards
|  429.687  430.731 (0.998x) |  101.069  101.232 (0.998x) | v8-splay
-------
| 6191.442 6157.563 (1.006x) | 4158.939 4155.369 (1.001x) | all

v8-raytrace is a big win. I've attached the cg_diff output; it looks like there are some inefficient GetProp stub calls happening when the method JIT is on.
(Oh, the Cachegrind runs were with (C=16,B=16).)
Attached patch cleaned-up patch (deleted) — Splinter Review
This is a cleaned-up version of the previous patch. The environment variables have been replaced with hard-coded constants. There's a new about:config option, javascript.options.methodjit_always. This makes it easy to turn the mjit tuning on or off without rebuilding. I also added the code that we need to handle exceptions and OOMs. It's running on tryserver now.
Attachment #510951 - Attachment is obsolete: true
Attachment #511211 - Attachment is obsolete: true
Attachment #511293 - Attachment is obsolete: true
Attachment #511571 - Attachment is obsolete: true
Attachment #511838 - Flags: review?(dvander)
Comment on attachment 511838 [details] [diff] [review]
cleaned-up patch

>+    if (BackEdgeMap::Ptr p = backEdgeTable.lookup(pc))
>+        return p->value;
>+    else
>+        return 0;
>+}

You can drop the "else" here (no else-after-return).

>+    if (BackEdgeMap::AddPtr p = backEdgeTable.lookupForAdd(pc)) {
>+        p->value++;
>+        return p->value;
>+    } else {
>+        backEdgeTable.add(p, pc, 1);
>+        return 1;
>+    }
>+}

Same.
Attachment #511838 - Flags: review?(dvander) → review+
http://hg.mozilla.org/tracemonkey/rev/f569d49576bb I left the second conditional as-is because it uses p, which is scoped to the if statement.
Whiteboard: fixed-in-tracemonkey
(In reply to comment #69)
> http://hg.mozilla.org/tracemonkey/rev/f569d49576bb
>
> I left the second conditional as-is because it uses p, which is scoped to the
> if statement.

Just use NULL not p in the else clause and lose the else.

/be
(In reply to comment #70)
> Just use NULL not p in the else clause and lose the else.
>
> /be

Strangely, I don't think that's correct. We can get back an AddPtr with valid, useful data and still follow the else branch via a magic C++ AddPtr-to-bool conversion. If you're truly a virulent else-hater, I could lift out the AddPtr definition. But this seems uglier to me. Also, I've seen the same pattern used elsewhere.
Oh, I'm an else-after-return hater, but new hate target: falsy non-null pointer-like magic C++ objects! Luke, is this really righteous?

/be
Righteous? I dunno. Recall that, even on a miss, we want to convey the miss information to 'add' so the add is more efficient. So that implies an API of the form:

    HashTable::Hit_or_miss_info info = ht.lookupForAdd(...);
    if (... somehow test success from info ...)
        ...
    else
        ht.add(info, ...)

On the other hand, forgetting the add-on-miss case for a second, it's certainly nice to be able to write:

    if (HashTable::Ptr p = ht.lookup(...))
        ... p->value

which wants a 'truthy' Ptr type. Putting the two together you get a truthy Hit_or_miss_info type (= HashTable::AddPtr). The verbose alternative would be:

    MissInfo missInfo;
    if (HashTable::Ptr p = ht.lookupForAdd(..., &missInfo))
        ...
    else
        ht.add(missInfo, ...)

which I rather dislike. A less verbose alternative would be for AddPtr (which derives from Ptr) to hide the boolean conversion and force use of the (equivalent) found() member:

    HashTable::AddPtr p = ht.lookupForAdd(....);
    if (p.found())
        ...
    else
        ht.add(p, ....)

In comment 71 billm said this was ugly, but I don't know, I could dig it. I guess it's a matter of how terse you like it. Truthy AddPtr does seem to encourage breaking the no-else-after-return rule... I've over-analyzed it too much to have a strong opinion atm.
Sure, you want two results, a pointer for add and a not-found result. But the bool conversion operator gets the latter from the former (via a load and null test), right? Hiding that just makes for mischief. EIBTI. JS forbids Object toBoolean resulting in anything but true, and C and C++ for plain old pointer types want only null to be falsy, so the magic conversion rubs me the wrong way.

/be
(In reply to comment #74) Filed bug 633690 for further discussion.
This patch broke the --jitflags option to jit_tests.py. For example, if I use --jitflags=j,mj, I get "-j -m -j" passed to both invocations of each test. It's something to do with the 'extend' call; unfortunately my non-existent Python is preventing me from getting further than that.
(In reply to comment #76)
> my non-existent
> Python is preventing me from getting further than that.

Turns out this line:

    t.jitflags = self.jitflags

in Test.copy is just copying a reference to jitflags rather than cloning the list. That's the problem. The attached patch fixes it.
Attachment #512095 - Flags: review?(wmccloskey)
Attachment #512095 - Flags: review?(wmccloskey) → review+
Depends on: 633929
I'm not sure how the counters are stored, but if reducing the number of bits for counters is important, try using half a bit. :) [See "Probabilistic Counter Updates for Predictor Hysteresis and Stratification".] The basic idea is that you can use a somewhat-random source (it doesn't have to be security-grade random) to conditionally set a bit (or increment). So instead of having 4 bits to count up to 15, set a 1-bit counter 1/16 of the time.
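A minimal sketch of what that could look like, assuming a cheap xorshift PRNG; none of these names exist in the patch:

    // 32-bit xorshift: fast and statistically adequate here; not cryptographic.
    static uint32_t prngState = 0x9e3779b9;   // any nonzero seed

    static inline bool OneInSixteen() {
        prngState ^= prngState << 13;
        prngState ^= prngState >> 17;
        prngState ^= prngState << 5;
        return (prngState & 0xf) == 0;   // true with probability ~1/16
    }

    // A 1-bit "hot" flag that behaves, in expectation, like counting to 16:
    // each execution sets it with probability 1/16, so it takes ~16
    // executions on average before the script is considered hot.
    if (!script->hotBit && OneInSixteen())
        script->hotBit = true;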
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Now that we're not method-jitting everything immediately, have we made the appropriate changes to all the JM tests written when JM jitted everything? According to njn, we only added '|jit-test| mjitalways' flags to a handful of the tests (http://hg.mozilla.org/mozilla-central/rev/f569d49576bb). It would be very bad if our test coverage just dropped dramatically. (We had similar issues when we changed the tracer's parameters, if I recall correctly.)
Bug 635155 is open to get full test coverage back again.
Assignee: nnethercote → wmccloskey