Bug 558971: Parser arena allocation overhead is too high
Opened 15 years ago • Closed 15 years ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
People: Reporter: cdleary, Assigned: cdleary
Whiteboard: fixed-in-tracemonkey
Attachments: 6 files
Description
When parsing large payloads (like the attached GMail script), the fixed-size arena blocks are probably too small for the number of parse nodes we create.
They're allocated out of the context's temp pool: we take a watermark, allocate whatever fixed-size blocks we need (keeping a free list of parse nodes for recycling), then free back to the watermark when the parser is destroyed.
As noted in bug 421435 (stagnant for two years), the blocks in the temp pool are fixed size. I have to read up on arena allocation theory, but it seems to make sense to use some kind of vector-like allocation strategy for the parse-node arena, doubling the size of each new chunk.
I should probably try to collect some stats on the productions that are creating parse nodes as well.
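For concreteness, a minimal sketch of the doubling strategy described above (class and field names are hypothetical, not the actual temp-pool code): chunks are never moved, so parse-node pointers stay stable, while the number of underlying mallocs grows only logarithmically with the number of nodes.

#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical growing arena: chunks are never reallocated (so node
// pointers stay stable), but each new chunk doubles in size, keeping the
// number of malloc calls logarithmic in the number of nodes. Assumes
// fixed-size, suitably aligned requests (e.g. sizeof(ParseNode)).
class GrowingArena {
    struct Chunk { char *base; std::size_t size; std::size_t used; };
    std::vector<Chunk> chunks;
    std::size_t nextSize;

  public:
    explicit GrowingArena(std::size_t initialChunkSize) : nextSize(initialChunkSize) {}

    void *allocate(std::size_t bytes) {
        if (chunks.empty() || chunks.back().used + bytes > chunks.back().size) {
            while (nextSize < bytes)             // guard oversized requests
                nextSize *= 2;
            char *base = static_cast<char *>(std::malloc(nextSize));
            if (!base)
                return nullptr;
            chunks.push_back(Chunk{base, nextSize, 0});
            nextSize *= 2;                       // vector-like doubling
        }
        Chunk &c = chunks.back();
        void *p = c.base + c.used;
        c.used += bytes;
        return p;
    }

    // "Free back to the watermark": here we simply drop everything when the
    // parser goes away; the real pool releases back to a saved mark.
    ~GrowingArena() {
        for (Chunk &c : chunks)
            std::free(c.base);
    }
};

Because the parser also keeps a free list of recycled nodes, only the most recently added chunk ever needs to hand out fresh space.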
Comment 1 (Assignee) • 15 years ago
Forgot to note that in the GMail parse under Shark, 10% of the time was getting eaten up by vm_fault -- only 85% of the time was spent in userspace.
Comment 2 • 15 years ago
One of the problems of a doubling vector is that parse nodes get reallocated (potentially). I have argued for using the VM more before. Allocate a really large block (32MB) and use it. If you run OOM, throw it away, allocate 64MB, parse again and so on. For the vast majority of cases 32MB should do just fine. As long as you don't touch most of it, we pay the same price for a small mmap as we would for a large one. Unmap when done. As long as the underlying OS doesn't commit memory eagerly, this should work in most places. Windows CE might be an issue, but it looks like we don't really ship there anyway. Let's run some statistics. How much space do we need anyway? 32MB is probably aiming way too high.
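For reference, a minimal sketch of the reserve-big-and-rely-on-lazy-commit idea under POSIX (function names are hypothetical, MAP_NORESERVE is a Linux-specific hint, and Windows would want VirtualAlloc with MEM_RESERVE instead):

#include <sys/mman.h>
#include <cstddef>

// Reserve a large address range up front. Pages are only backed by real
// memory when first touched, so an untouched 32 MiB reservation costs
// little more than a small one on an OS with lazy commit.
void *reserveParseArena(std::size_t bytes) {
    void *p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Unmap when done; the whole range goes back to the OS in one call.
void releaseParseArena(void *p, std::size_t bytes) {
    munmap(p, bytes);
}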
Comment 3 (Assignee) • 15 years ago
With a 4K arena allocation size in the temp pool, we see the following butt-kicking speedups on Linux/64:
$ python compare_bench.py ~/tm/results/parsemark_tm_4Kpool_2010_04_12.txt ~/tm/results/parsemark_tm_2010_04_12.txt
tetris_combined.js: faster: 4.19ms < baseline 5.26ms (+20.34%)
gravity_combined.js: faster: 24.67ms < baseline 39.08ms (+36.87%)
ga.js: faster: 3.45ms < baseline 3.67ms ( +5.99%)
twitter_combined.js: faster: 20.30ms < baseline 27.81ms (+27.00%)
jquery.min.js: faster: 5.23ms < baseline 6.84ms (+23.54%)
zimbra_combined.js: faster: 127.30ms < baseline 192.73ms (+33.95%)
jsgb_combined.js: faster: 9.79ms < baseline 13.42ms (+27.05%)
280slides_combined.js: faster: 3.63ms < baseline 4.06ms (+10.59%)
v8_combined.js: faster: 22.43ms < baseline 27.83ms (+19.40%)
dojo.js: faster: 8.27ms < baseline 9.74ms (+15.09%)
processing_twitch.js: Meh.
gmail_combined.js: faster: 97.48ms < baseline 146.14ms (+33.30%)
effectgames_galaxy.js: faster: 16.14ms < baseline 23.94ms (+32.58%)
ball_pool_combined.js: faster: 16.13ms < baseline 20.48ms (+21.24%)
yui-min.js: Meh.
MochiKit.js: faster: 16.54ms < baseline 20.94ms (+21.01%)
pipio_combined.js: faster: 82.18ms < baseline 127.90ms (+35.75%)
sunspider_combined.js: faster: 13.28ms < baseline 14.84ms (+10.51%)
Average speedup: 23.39%
Will have numbers for OS X 64b soon.
Comment 4 • 15 years ago
Chris, can you file dependent bugs on any other remaining uses of the arena code? I have seen similar badness elsewhere (JS stacks, but Luke might have fixed all of that by now).
Comment 5 (Assignee) • 15 years ago
OS X Snow Leopard 64b numbers:
gravity_combined.js: faster: 39.26ms < baseline 41.37ms ( +5.11%)
processing-0.8.min.js: faster: 18.22ms < baseline 18.40ms ( +0.96%)
ga.js: Meh.
twitter_combined.js: faster: 29.63ms < baseline 30.40ms ( +2.52%)
jquery.min.js: Meh.
gmail_combined.js: faster: 155.83ms < baseline 175.48ms (+11.20%)
tetris_combined.js: Meh.
280slides_combined.js: Meh.
v8_combined.js: faster: 33.26ms < baseline 33.89ms ( +1.86%)
dojo.js: Meh.
processing_twitch.js: Meh.
jsgb_combined.js: Meh.
zimbra_combined.js: faster: 204.31ms < baseline 234.79ms (+12.98%)
effectgames_galaxy.js: faster: 24.98ms < baseline 25.74ms ( +2.95%)
ball_pool_combined.js: faster: 25.25ms < baseline 25.79ms ( +2.09%)
yui-min.js: Meh.
MochiKit.js: faster: 24.96ms < baseline 25.39ms ( +1.70%)
pipio_combined.js: faster: 125.87ms < baseline 142.85ms (+11.89%)
sunspider_combined.js: Meh.
Average speedup: 5.33%
Comment 6 • 15 years ago
Luke is not using arenas for stacks, he's using the VM mapping thing to avoid address instability.
Let's use bigger chunks! Make our hay while the sun doth shine, to quote G&S. No need to do more if profiling doesn't support it.
/be
Comment 7 • 15 years ago
How does 16K compare -- I'm not kidding.
/be
Comment 8 (Assignee) • 15 years ago
(In reply to comment #2)
> One of the problems of a doubling vector is that parse nodes get reallocated
> (potentially).
Right. I was thinking about this, and I'm not sure why non-contiguous arena chunks would be much worse than a contiguous allocation space if the base pool size were sufficiently large. You always add a new chunk equal to the total size of the current space; i.e. 1 -> [1, 1] -> [1, 1, 2] -> [1, 1, 2, 4], etc. We're using a recycled-node list anyway, so we'd only ever be doling out of the last allocated chunk.
> If you run OOM, throw it away, allocate 64MB,
> parse again and so on. For the vast majority of cases 32MB should do just fine.
That seems like awful behavior for scripts that require 32MB + sizeof(ParseNode).
> As long as you don't touch most of it, we pay the same price for a small mmap as
> we would for a large one. Unmap when done. As long as the underlying OS doesn't
> commit memory eagerly, this should work in most places.
I like it, but I'm not a portability expert. Do we do this elsewhere?
Comment 9 • 15 years ago
Comment 2 is off the mark, especially on OOM handling -- we just need bigger chunks. Measure up to a large size, gnuplot the results, and attach the image. Internal fragmentation will not be bad as you say; we recycle in the non-terminal chunks.
/be
Comment 10 • 15 years ago
> > As long as you don't touch most of it, we pay the same price for a small mmap as
> > we would for a large one. Unmap when done. As long as the underlying OS doesn't
> > commit memory eagerly, this should work in most places.
>
> I like it, but I'm not a portability expert. Do we do this elsewhere?
It's the year 2010 of the lord. If your OS doesn't have an MMU and virtual memory, go get an OS for grownups. I am pretty sure it's true for every platform we run on.
Comment 11 • 15 years ago
The issue is OOM handling, not VM -- although check with the Mobile folks before grabbing great heaps of VM (page table mappings take up whole pages and TLB entries for running processes, on some OSes).
But really, a doubling vector with stable virtual addressing is overkill here. Leary will do the science and vindicate my claim, I'm sure.
/be
Comment 12 (Assignee) • 15 years ago
Commencing with the science! http://imgs.xkcd.com/comics/science_montage.png
Comment 13 (Assignee) • 15 years ago
Small scripts are relatively unaffected, as one would expect. Looks like 4KiB may be a sweet spot. 16KiB is better, but maybe not by enough. (It's hard to tell the whole system effects by just looking at parsing in isolation.)
Comment 14 (Assignee) • 15 years ago
Busier plot, but perhaps easier to see the important trends.
Comment 15 (Assignee) • 15 years ago
I ripped the lazy parts out of V8 and forced it to eagerly generate an AST, and we fare much better now (win one, lose a bunch):
$ python parsemark.py /home/cdleary/v8/v8-stable/v8 /home/cdleary/tm/parse-tests/ -q -b /home/cdleary/tm/results/pool_size/tm_4K_pool_size.json
tetris_combined.js: Meh.
gravity_combined.js: faster: 20.93ms < baseline 22.37ms ( +6.46%)
ga.js: faster: 2.25ms < baseline 2.33ms ( +3.55%)
twitter_combined.js: Meh.
jquery.min.js: Meh.
zimbra_combined.js: faster: 93.95ms < baseline 115.92ms (+18.95%)
jsgb_combined.js: Meh.
280slides_combined.js: Meh.
v8_combined.js: faster: 17.71ms < baseline 18.43ms ( +3.92%)
dojo.js: faster: 6.17ms < baseline 6.57ms ( +6.15%)
processing_twitch.js: Meh.
gmail_combined.js: faster: 62.81ms < baseline 91.00ms (+30.97%)
effectgames_galaxy.js: faster: 12.89ms < baseline 13.80ms ( +6.63%)
ball_pool_combined.js: Meh.
yui-min.js: Meh.
MochiKit.js: faster: 12.85ms < baseline 14.10ms ( +8.90%)
pipio_combined.js: faster: 66.95ms < baseline 73.58ms ( +9.01%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 12.60ms (-64.67%)
Average speedup: 2.99%
As opposed to before:
$ python compare_bench.py /home/cdleary/tm/results/parsemark_v8_eager_2010_04_13.json /home/cdleary/tm/results/parsemark_tm_2010_04_12.txt
tetris_combined.js: faster: 3.69ms < baseline 5.26ms (+29.85%)
gravity_combined.js: faster: 20.93ms < baseline 39.08ms (+46.44%)
ga.js: faster: 2.25ms < baseline 3.67ms (+38.69%)
twitter_combined.js: faster: 18.81ms < baseline 27.81ms (+32.36%)
jquery.min.js: faster: 4.30ms < baseline 6.84ms (+37.13%)
zimbra_combined.js: faster: 93.95ms < baseline 192.73ms (+51.25%)
jsgb_combined.js: faster: 10.22ms < baseline 13.42ms (+23.85%)
280slides_combined.js: faster: 3.23ms < baseline 4.06ms (+20.44%)
v8_combined.js: faster: 17.71ms < baseline 27.83ms (+36.36%)
dojo.js: faster: 6.17ms < baseline 9.74ms (+36.65%)
processing_twitch.js: faster: 2.25ms < baseline 2.79ms (+19.35%)
gmail_combined.js: faster: 62.81ms < baseline 146.14ms (+57.02%)
effectgames_galaxy.js: faster: 12.89ms < baseline 23.94ms (+46.16%)
ball_pool_combined.js: faster: 14.29ms < baseline 20.48ms (+30.22%)
yui-min.js: faster: 1.24ms < baseline 1.79ms (+30.73%)
MochiKit.js: faster: 12.85ms < baseline 20.94ms (+38.63%)
pipio_combined.js: faster: 66.95ms < baseline 127.90ms (+47.65%)
sunspider_combined.js: SLOWER: 20.75ms > baseline 15.12ms (-37.24%)
Average speedup: 32.53%
If we think it's worth the time, I'll instrument to get some production statistics and see if the big wins/losses stand out somehow.
Comment 16 • 15 years ago
(In reply to comment #14)
> Created an attachment (id=438840) [details]
> Perf increase versus pool size.
Eggsalad. Now we need to see the same plot on ARM.
Comment 17 (Assignee) • 15 years ago
This set of data is a lot noisier -- big error bars.
Comment 18 (Assignee) • 15 years ago
Comment 19 (Assignee) • 15 years ago
Okay, I'm going with 4K - ARENA_HEADER_SIZE for now, as it demonstrates a parsing speedup and doesn't have a significant effect on sunspider scores on x64 or ARM.
On my Linux x64 box, I found the following relationship on parsemark (from slow to fast; <~ means approximately equal, but strictly less than):
1K-40 <~ 1K < 4K <~ 4K - HEADER_SIZE <~ 16K <~ 16K - HEADER_SIZE
On OS X i386:
16K - HEADER_SIZE < 1K < 1K - HEADER_SIZE < 4K <~ 4K - HEADER_SIZE
That being said, I'm still looking at exactly what size mallocs we're performing as a result of arena chunks being pulled, seeing which VM pages those end up corresponding to, and checking for thrash with a little bit of printf-based observation.
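For concreteness, the arithmetic behind the 4K - ARENA_HEADER_SIZE choice, with illustrative constants only (the 40-byte header merely echoes the "1K-40" variant measured above; the real value comes from the arena implementation):

#include <cstddef>

// Illustrative numbers only: 4 KiB pages and a ~40-byte per-arena header.
constexpr std::size_t PAGE_BYTES = 4096;
constexpr std::size_t ARENA_HEADER_BYTES = 40;

// Request a payload that leaves room for the header, so the underlying
// malloc asks for exactly one page instead of slightly more than a page,
// which the allocator may round up to a much larger size class.
constexpr std::size_t TEMP_POOL_CHUNK_BYTES = PAGE_BYTES - ARENA_HEADER_BYTES;

static_assert(TEMP_POOL_CHUNK_BYTES + ARENA_HEADER_BYTES == PAGE_BYTES,
              "header plus payload should fill one page exactly");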
Attachment #440898 - Flags: review?(sayrer)
Comment 20 • 15 years ago
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
I claim we should take this trivial patch to win now. But also file a follow up to see if we can observe consistent wins when multiplying a constant by pointer size.
Attachment #440898 - Flags: review?(sayrer)
Attachment #440898 - Flags: review?
Attachment #440898 - Flags: review+
Comment 21 • 15 years ago
Comment on attachment 440898 [details] [diff] [review]
Temp pool chunk size increase.
that sure is an increase.
Attachment #440898 - Flags: review? → review+
Comment 22 (Assignee) • 15 years ago
What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a cursory analysis, I see that we can reduce 5761 malloc/frees out of the temp pool to ~13 with the doubling approach. Note that we'd get screwed on the larger arenas where chunk utilization is low if the system's vmem doesn't support a mapped-but-not-wired state.
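A small sketch of what sizing off the system page size might look like (the helper name and fallback value are hypothetical):

#include <unistd.h>
#include <cstddef>

// Derive the temp-pool chunk size from the real page size at runtime
// instead of hard-coding 4096, still leaving room for the arena header.
std::size_t tempPoolChunkBytes(std::size_t arenaHeaderBytes) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;                 // conservative fallback if sysconf fails
    return static_cast<std::size_t>(page) - arenaHeaderBytes;
}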
imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how many programs like imacros really exist and require performance, but might as well fix it.)
We also have a few oddball arena frees when the arenas are < 50% utilized, some as low as 1%. Still investigating those.
Comment 23 • 15 years ago
(In reply to comment #22)
> What'd be really nice is if we could use sysconf's PAGE_SIZE value. Just from a
> cursory analysis, I see that we can reduce 5761 malloc/frees out of the temp
> pool to ~13 with the doubling approach.
Can we hang onto the last arena for a while (GC-based time caching, see gcEmptyArenaPoolLifespan).
> imacros.js also shows we thrash the regexp pool a bjillion times. (Not sure how
> many programs like imacros really exist and require performance, but might as
> well fix it.)
Definitely -- on file?
/be
Comment 24 (Assignee) • 15 years ago
(In reply to comment #23)
> Can we hang onto the last arena for a while (GC-based time caching, see
> gcEmptyArenaPoolLifespan).
Yeah, I've been calling this the "second" arena, since the fakey one before it is called "first" and is a member structure of the pool. So long as API clients take marks and release to them instead of calling |pool.free()|, the GC should reap the "second" arena when it reaches a ripe old age.
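For illustration, a toy single-chunk pool showing that mark/release discipline (names, fields, and the GC hook are all made up, not the real JSArenaPool API): release rewinds to the watermark but keeps the chunk alive, and a GC-driven sweep frees it only after it has sat empty long enough.

#include <cstddef>
#include <cstdlib>

// Toy single-chunk pool; assumes node-sized, suitably aligned requests.
struct MiniPool {
    char *base;
    std::size_t size;
    std::size_t used;      // bump pointer; a "watermark" is a saved value of this
    unsigned idleTicks;    // GC ticks since the pool last held live data

    explicit MiniPool(std::size_t bytes)
      : base(static_cast<char *>(std::malloc(bytes))), size(bytes), used(0), idleTicks(0) {}
    ~MiniPool() { std::free(base); }

    std::size_t mark() const { return used; }

    void *allocate(std::size_t bytes) {
        if (!base || used + bytes > size)
            return nullptr;            // the real pool would chain another arena
        void *p = base + used;
        used += bytes;
        idleTicks = 0;
        return p;
    }

    // Clients release back to a mark instead of freeing the pool outright.
    void release(std::size_t watermark) { used = watermark; }

    // Called from a (hypothetical) GC tick: free the chunk only once it has
    // sat empty for `lifespan` consecutive ticks -- the "ripe old age".
    void maybeReap(unsigned lifespan) {
        if (used == 0 && base && ++idleTicks >= lifespan) {
            std::free(base);
            base = nullptr;
            size = 0;
        }
    }
};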
> > imacros.js also shows we thrash the regexp pool
> on file?
Bug 561286.
Comment 25 • 15 years ago
http://hg.mozilla.org/tracemonkey/rev/55da1c595209
new work to new bugs.
Whiteboard: fixed-in-tracemonkey
Comment 26 • 15 years ago
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED