Closed Bug 948445 Opened 11 years ago Closed 11 years ago

investigate alternative implementations for ProtoAndIfaceCache memory usage

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla31

People

(Reporter: froydnj, Assigned: froydnj)

References

(Blocks 1 open bug)

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files, 4 obsolete files)

use a two-level caching scheme for ProtoAndIfaceArray 11 years ago Nathan Froyd [:froydnj] (deleted), patch		Details \| Diff \| Splinter Review
use a two-level caching scheme for ProtoAndIfaceArray 11 years ago Nathan Froyd [:froydnj] (deleted), patch	bzbarsky : feedback+	Details \| Diff \| Splinter Review
use different caching schemes for ProtoAndIfaceArray depending on the global kind 11 years ago Nathan Froyd [:froydnj] (deleted), patch		Details \| Diff \| Splinter Review
use different caching schemes for ProtoAndIfaceArray depending on the global kind 11 years ago Nathan Froyd [:froydnj] (deleted), patch	bzbarsky : review+	Details \| Diff \| Splinter Review
add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around 11 years ago Nathan Froyd [:froydnj] (deleted), patch		Details \| Diff \| Splinter Review
make nsGlobalWindow pass in USE_FLAT_PROTOTYPE_CACHE when creating globals 11 years ago Nathan Froyd [:froydnj] (deleted), patch		Details \| Diff \| Splinter Review
add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around 11 years ago Nathan Froyd [:froydnj] (deleted), patch		Details \| Diff \| Splinter Review

Nathan Froyd [:froydnj]

Assignee

Description

•

11 years ago

From bug 946781 comment 20: I wonder if the proto-iface cache really needs to be an array; I can't imagine that many members are really going to be hit in your typical scope. Some webpages, sure, but not all of them. Going for a balanced tree or something to reduce memory usage in the average case might be worth investigating. From bug 946781 comment 22 and comment 23: The perf of lookups here is critical for the performance of JS-wrapping. One of the big speedups of WebIDL bindings over XPConnect ones is that we made the lookup of the prototype object not take approximately forever. That said, some sort of balanced tree might not be too bad. It would still be way better than the xpconnect setup. Also, it should be simple to test the performance here: something like createElement or new Event or some such ought to be worst-cases for wrapping performance. This bug is about investigating those alternatives.

Nathan Froyd [:froydnj]

Assignee

Comment 1

•

11 years ago

Assuming I'm counting the size of std::map's entries correctly (I don't see a good way to report their memory explicitly, so I'm resorting to counting), implementing the cache with a std::map makes our xpconnect tree for a newly-opened browser look like: └───0.95 MB (01.36%) -- xpconnect ├──0.62 MB (00.89%) ── scopes ├──0.25 MB (00.36%) ── runtime ├──0.05 MB (00.07%) ── js-component-loader └──0.03 MB (00.04%) ── proto-iface-cache Compare this to the array-based cache reported in bug 946781 comment 20: ├───3.07 MB (04.09%) -- xpconnect │ ├──2.06 MB (02.75%) ── proto-iface-cache │ ├──0.75 MB (01.01%) ── scopes │ └──0.25 MB (00.34%) -- (2 tiny) │ ├──0.25 MB (00.34%) ── runtime │ └──0.00 MB (00.00%) ── js-component-loader ~60x size reduction...I'll take it! Now for the performance benchmarking side of things.

Nathan Froyd [:froydnj]

Assignee

Comment 2

•

11 years ago

So, I'm completely new to DOM benchmarking. bz, does this look like a reasonable benchmark for wrapping performance? var n = 100000 var start = Date.now(); for (var i = 0; i < n; ++i) { e = new CustomEvent("ev", {}); } var end = Date.now(); document.write("Time to wrap " + n + " Events: " + (end - start));

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 3

•

11 years ago

The way to tell is to profile. ;) And what profiling will tell you is that this case is dominated by the unforgeable attribute and some other event suck. For example, new Event is 1.5x faster than new CustomEvent, and createElement is about 8x faster. createTextNode is 16x faster. So start with createTextNode, I guess: (function() { var n = 100000 var start = Date.now(); for (var i = 0; i < n; ++i) { var t = document.createTextNode("x"); } var end = Date.now(); document.write("Time to wrap " + n + " TextNodes: " + (end - start)); })();

Flags: needinfo?(bzbarsky)

Nathan Froyd [:froydnj]

Assignee

Comment 4

•

11 years ago

Thanks for clarifying. With that benchmark in hand, the numbers are roughly: 1M TextNodes, array cache: 2554ms 1M TextNodes, std::map cache: 2751ms This is on x86-64 Linux, FWIW. So switching over to std::map (or similar red-black tree) slows us down by ~8%, in exchange for an order-of-magnitude less memory overhead for the cache. (Counting the nodes correctly gives us a ~20x improvement in memory usage for the caches in a freshly started browser.)

Boris Zbarsky [:bzbarsky]

Comment 5

•

11 years ago

It's worth measuring the memory for the cache in a browser that actually loaded something interesting (and hence might have created several different objects). In fact, it's worth testing the performance of the testcase in such a scenario (where presumably the std::map contains more stuff, right?)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 6

•

11 years ago

Yeah, we should try a benchmark where we've first enumerated everything on Window.

Nathan Froyd [:froydnj]

Assignee

Comment 7

•

11 years ago

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #6) > Yeah, we should try a benchmark where we've first enumerated everything on > Window. Looks like enumerating everything on Window, then doing the TextNode benchmark costs us another 5-7%. So the performance is ~15% worse than what we have now after enumerating everything.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 8

•

11 years ago

I think that's worth it but bz may disagree. We could also cache the last interface lookup or something if rewrapping the same iface over and over again is common.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 9

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #7) > Looks like enumerating everything on Window, then doing the TextNode > benchmark costs us another 5-7%. So the performance is ~15% worse than what > we have now after enumerating everything. Sounds rather horrible. We're trying to make element/textnode creation faster (need to for example get rid of one extra addref/release).

Boris Zbarsky [:bzbarsky]

Comment 10

•

11 years ago

As far as performance goes, in the benchmark above, each createTextNode call is about 300ns for me today, if I'm measuring right. So an 8% hit is about 24ns. A 15% hit is about 45ns. It's a 2.6GHz CPU, so those are about 60-100 CPU cycles, which is not implausible for a binary tree search, I guess (though the 60 feels a bit high for a nearly-empty tree). How does this change affect the "more real-world" (sic) test at <http://dromaeo.com/?dom-mod>? Also, what is the memory usage once you've enumerated everything on Window? That was the other question I had in comment 5.

Boris Zbarsky [:bzbarsky]

Comment 11

•

11 years ago

Another thought: If we can define a frequency-ordering on these things, we might be able to keep using an array but only size it big enough for what we've had to use in the scope so far. This would require not storing protos and interface objects in a single array, of course (though you have that constraint anyway with the std::map approach, I expect).

Nathan Froyd [:froydnj]

Assignee

Comment 12

•

11 years ago

(In reply to Boris Zbarsky [:bz] from comment #10) > Also, what is the memory usage once you've enumerated everything on Window? > That was the other question I had in comment 5. Worst case memory usage, where the # of prototypes and interfaces is N: array cache: N*wordsize, rounded up to pagesize; 8K on a 32-bit platform (or it will be very shortly) std::map cache: ~N*(6*wordsize), where 6*wordsize is about the size of a std::map tree node in libstdc++ (key, value, parent, left child, right child pointers, and an enum for red/black). Mac and Windows <map> implementations have similar implementations. I'm assuming Android does too. So pretty bad memory usage regression in the worst case.

Boris Zbarsky [:bzbarsky]

Comment 13

•

11 years ago

We're definitely over 1024 entries in this list, so I agree that we're at 8K on 32-bit and 16K on 64-bit.. 6*wordsize, if it gets heap-allocated by jemalloc would get rounded up to 32 bytes on 32-bit and would land at 48 bytes on 64-bit. So if those nodes are all malloced, we're actually looking at N*8*wordsize on 32-bit, or about a 4x regression over the current state (and getting worse for a bit as we add more things, until we bump up to 12K for the array). I wonder whether we can get some data on real pages as to many nodes the tree would have for them. But no matter what the case of about:blank and other nearly-empty (but not entirely!) globals is worth making small if we can...

Boris Zbarsky [:bzbarsky]

Comment 14

•

11 years ago

And one other note. A 24KB regression (that's our worst case on 32-bit; on 64-bit it's more like 32KB) in the size of this data structure on something like gmail is sort of a drop in a bucket. So the real worst case are tiny pages that manage to populate a lot of this data structure, which basically requires something like enumerating the window, I expect.

Nathan Froyd [:froydnj]

Assignee

Comment 15

•

11 years ago

Attached patch use a two-level caching scheme for ProtoAndIfaceArray (obsolete) (deleted) — Details — Splinter Review

Something like this seems like a decent compromise: we get memory savings because we lazily allocate portions of the cache, but lookups remain fast no matter how many slots in the cache we've filled up. Testing locally with the TextNode benchmark shows the difference between the two to be in the noise (~0.5% or less) regardless of whether we've enumerated everything on Window or not. Testing single lookups might show up the extra mallocs that we're doing now. Memory reduction in a newly-started browser is smaller, about 7-8x for the proto-iface-caches line item in about:memory: ├───1.35 MB (01.93%) -- xpconnect │ ├──0.79 MB (01.13%) ── scopes │ └──0.56 MB (00.80%) -- (3 tiny) │ ├──0.29 MB (00.42%) ── proto-iface-cache │ ├──0.21 MB (00.31%) ── runtime │ └──0.05 MB (00.08%) ── js-component-loader Further reductions could be gained if we did a more intelligent sorting of the IDs as bz suggested in comment 11. (Yes, I know there's a memory leak in the patch; a few other things could stand to be cleaned up as well...)

Nicholas Nethercote [inactive]

Updated

•

11 years ago

Assignee: nobody → nfroyd

Whiteboard: [MemShrink] → [MemShrink:P2]

Nathan Froyd [:froydnj]

Assignee

Comment 16

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #15) > Something like this seems like a decent compromise: we get memory > savings because we lazily allocate portions of the cache, but lookups > remain fast no matter how many slots in the cache we've filled up. > Testing locally with the TextNode benchmark shows the difference between > the two to be in the noise (~0.5% or less) regardless of whether we've > enumerated everything on Window or not. Sadly, these numbers are totally bogus because I was testing a opt/debug build...for all the measurements I took yesterday. Doh. A opt/no-debug build says that the difference between the current flat scheme and this two-level scheme is ~15%, which is still too much. I'm going to try splitting things so that the cache has explicit |get|, |set|, and |has| methods, which will at least eliminate a branch from |get| and ideally make things a bit faster.

Nathan Froyd [:froydnj]

Assignee

Comment 17

•

11 years ago

Attached patch use a two-level caching scheme for ProtoAndIfaceArray (obsolete) (deleted) — Details — Splinter Review

Here's a better version that actually benchmarks well (on the createTextNode benchmark) against the current scheme. Somewhat surprisingly, this version is about 5% *faster* than what we currently have in the tree (opt build, x86-64 Linux); I don't have a good explanation for this yet. The code, however, is somewhat uglier. I'm also not sure about the JSAPI bits that I'd need to get HasEntry correct. I see that GCC is choosing to not inline it, presumably because JS::Heap<JSObject*>::set is getting called in the middle, which seems suboptimal. I want to break this up into two pieces (rename ProtoAndIfaceArray, then reimplement) for actual review, but I'd like feedback on whether this is going in the right direction.

Attachment #8345494 - Attachment is obsolete: true

Attachment #8346784 - Flags: feedback?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 18

•

11 years ago

Comment on attachment 8346784 [details] [diff] [review] use a two-level caching scheme for ProtoAndIfaceArray I don't see why we can't just replace HasEntry with GetEntryIfExists() and have it still return JSObject*. In general, I'm ok with this. This cuts our memory usage by a factor of presumably somewhere close to 16 in the about:blank case, and very slightly increases it in the "enumerate the window" case?

Attachment #8346784 - Flags: feedback?(bzbarsky) → feedback+

Wayne Mery (:wsmwk)

Updated

•

11 years ago

Summary: investigate alternative implementations for ProtoAndIfaceCache → investigate alternative implementations for ProtoAndIfaceCache memory usage

Nathan Froyd [:froydnj]

Assignee

Comment 19

•

11 years ago

After looking at some CC logs, it looks like there are two distinct cases for these caches: - Window, where the cache might have a lot of entries and access speed is (relatively) crucial; - Everything else (e.g. ContentFrameMessageManager, BackstagePass) Maybe it's worthwhile implementing two different cache strategies? I have no idea how much plumbing would have to be done so we know which kind of cache we should be using and then point when the cache should be allocated, though.

Boris Zbarsky [:bzbarsky]

Comment 20

•

11 years ago

It could be. How many non-Window globals do we have floating around? I guess one per worker, if nothing else...

Bobby Holley (:bholley)

Comment 21

•

11 years ago

Well, every JSM has a BackstagePass. There are something like ~100 system globals out there, right?

Nathan Froyd [:froydnj]

Assignee

Comment 22

•

11 years ago

(In reply to Boris Zbarsky [:bz] from comment #20) > It could be. How many non-Window globals do we have floating around? I > guess one per worker, if nothing else... The aforementioned CC log says that we have 213 globals, 200 of which are non-windows. And some of those 13 Windows are ChromeWindows.

Boris Zbarsky [:bzbarsky]

Comment 23

•

11 years ago

OK. We could obviously flag somehow what sort of thing we have (e.g. bit-tagging the pointer in the private slot?). It's not clear that the result would be faster than what we've already tried in this bug, though...

Boris Zbarsky [:bzbarsky]

Comment 24

•

11 years ago

Also, note that the global already has 100-200 reserved slots for the js builtin version of this...

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 25

•

11 years ago

So what's the path forward here (if any)?

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Assignee

Comment 26

•

11 years ago

The current status is this: Benchmarking says we lose a non-trivial amount on a benchmark like comment 3. I think the numbers were in the 5-15% range; I'd have to try redoing those. That was with a two-level page-table-esque scheme. I haven't tried more exotic approaches like separate cache implementations for Window-like globals and everything else. I can re-benchmark and try something more exotic this week. Ultimately, everything is going to be adding instructions (including branches) on top of the single memory read we have now. So the real question is how much speed we are willing to sacrifice for the memory gains. There's disagreement here, but I think smaug and/or bz have the final say in this.

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Assignee

Comment 27

•

11 years ago

(In reply to Nathan Froyd (:froydnj) from comment #26) > I think the numbers were in the 5-15% range; I'd have to try redoing > those. That was with a two-level page-table-esque scheme. Re-benchmarking locally confirms this. The numbers that I'm getting aren't particularly stable, but it looks like 10-15% is in the right ballpark. > I haven't tried > more exotic approaches like separate cache implementations for Window-like > globals and everything else. I can re-benchmark and try something more > exotic this week. This is the next task.

Nathan Froyd [:froydnj]

Assignee

Comment 28

•

11 years ago

Attached patch use different caching schemes for ProtoAndIfaceArray depending on the global kind (obsolete) (deleted) — Details — Splinter Review

OK, so here's a version that does the separate implementations for window-like globals and everything else. Microbenchmarking shows it to be about as fast as the current implementation, and about:memory confirms that we save a good bit of space (~2.3MB on a newly-opened browser on a mostly-empty profile on x86-64 Linux). We could squeeze out a little more space by using std::map (see comment 15) or mozilla::SplayTree (less overhead than std::map) for our non-window-like cache if desired.

Attachment #8346784 - Attachment is obsolete: true

Attachment #8389296 - Flags: review?(bugs)

Nathan Froyd [:froydnj]

Assignee

Comment 29

•

11 years ago

Comment on attachment 8389296 [details] [diff] [review] use different caching schemes for ProtoAndIfaceArray depending on the global kind Review of attachment 8389296 [details] [diff] [review]: ----------------------------------------------------------------- Erm. One small mistake, let's fix that first.

Attachment #8389296 - Flags: review?(bugs)

Nathan Froyd [:froydnj]

Assignee

Comment 30

•

11 years ago

Attached patch use different caching schemes for ProtoAndIfaceArray depending on the global kind (deleted) — Details — Splinter Review

Let's try that again. Please see comment 28 for the details.

Attachment #8389296 - Attachment is obsolete: true

Attachment #8389357 - Flags: review?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 31

•

11 years ago

Comment on attachment 8389357 [details] [diff] [review] use different caching schemes for ProtoAndIfaceArray depending on the global kind Boris, could you review this. My queue is getting out of bounds.

Attachment #8389357 - Flags: review?(bugs) → review?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 32

•

11 years ago

Comment on attachment 8389357 [details] [diff] [review] use different caching schemes for ProtoAndIfaceArray depending on the global kind The ProtoAndIfaceArray name might want changing too, since it's no longer an array. Followup ok for that. >+ Array<Page*, kProtoAndIfaceCacheCount / kPageSize + 1> mPages; This will overallocate by 1 when kProtoAndIfaceCacheCount % kPageSize == 0. Does it make sense to do kProtoAndIfaceCacheCount / kPageSize + size_t(bool(kProtoAndIfaceCacheCount % kPageSize)) or something? >@@ -2392,17 +2547,17 @@ CreateGlobal(JSContext* aCx, T* aObject, nsWrapperCache* aCache, Afaik, this is used for workers. Do we want the WindowLike cache there? >@@ -427,17 +427,17 @@ CreateGlobalObject(JSContext *cx, const JSClass *clasp, nsIPrincipal *principal, And this is used for windows right now as far as I know... Did you make sure that the right sorts of caches are being created for windows? r=me with those nits dealt with and questions answered.

Attachment #8389357 - Flags: review?(bzbarsky) → review+

Nathan Froyd [:froydnj]

Assignee

Comment 33

•

11 years ago

(In reply to Boris Zbarsky [:bz] from comment #32) > >+ Array<Page*, kProtoAndIfaceCacheCount / kPageSize + 1> mPages; > > This will overallocate by 1 when kProtoAndIfaceCacheCount % kPageSize == 0. > > Does it make sense to do kProtoAndIfaceCacheCount / kPageSize + > size_t(bool(kProtoAndIfaceCacheCount % kPageSize)) or something? Doh, probably. I had this fixed at one point and must have lost it in rebasing. Do you have an opinion about using std::map or friends? > >@@ -2392,17 +2547,17 @@ CreateGlobal(JSContext* aCx, T* aObject, nsWrapperCache* aCache, > > Afaik, this is used for workers. Do we want the WindowLike cache there? I suspect not, since there are so many fewer interfaces available in workers? Followup? > >@@ -427,17 +427,17 @@ CreateGlobalObject(JSContext *cx, const JSClass *clasp, nsIPrincipal *principal, > > And this is used for windows right now as far as I know... Did you make > sure that the right sorts of caches are being created for windows? It looks like windows go through CreateGlobal; at least that's what WindowBinding.cpp has.

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 34

•

11 years ago

> Do you have an opinion about using std::map or friends? I don't.... > Followup? Sure. > at least that's what WindowBinding.cpp has. But does anything call that code so far? I expect WindowBinding::Wrap is dead code at the moment...

Flags: needinfo?(bzbarsky)

Nicholas Nethercote [inactive]

Comment 35

•

11 years ago

FWIW, this cache accounts for ~2 MiB of the increases in start-up memory consumption we've seen over the past year or two.

Nathan Froyd [:froydnj]

Assignee

Comment 36

•

11 years ago

Attached patch add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around (obsolete) (deleted) — Details — Splinter Review

Boris was right: windows weren't using the plain flat cache in attachment 8389357 [review]. That's because they were still going through the XPConnect bits for creating global objects. In the absence of WebIDL Window, then, we have to have a way for XPConnect to create prototype caches in the style that Window would like. This patch implements that as an additional flag to initClassesWithNewWrappedGlobal. It's pretty straightforward: add flag, pass it down to where it needs to go, and fix up callers where appropriate. Advice on the name welcome. This patch will be folded into the previous one prior to landing.

Attachment #8394238 - Flags: review?(bobbyholley)

Nathan Froyd [:froydnj]

Assignee

Comment 37

•

11 years ago

Attached patch make nsGlobalWindow pass in USE_FLAT_PROTOTYPE_CACHE when creating globals (deleted) — Details — Splinter Review

Assuming bholley is happy with the XPConnect patch, this is the bit that makes nsGlobalWindow actually get the cache kind it would like. I verified that we do indeed get WindowLike caches with this patch. Need to re-do the performance numbers, though...

Attachment #8394241 - Flags: review?(bzbarsky)

Nathan Froyd [:froydnj]

Assignee

Comment 38

•

11 years ago

Attached patch add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around (deleted) — Details — Splinter Review

Whoops, let's attach the patch without the busted debugging bits.

Attachment #8394238 - Attachment is obsolete: true

Attachment #8394238 - Flags: review?(bobbyholley)

Attachment #8394322 - Flags: review?(bobbyholley)

Bobby Holley (:bholley)

Comment 39

•

11 years ago

Comment on attachment 8394322 [details] [diff] [review] add nsIXPConnect::USE_FLAT_PROTOTYPE_CACHE and propagate it around Review of attachment 8394322 [details] [diff] [review]: ----------------------------------------------------------------- I'm not really wild about adding all this goop, especially since we're going to have to rip it out in a few weeks. Can't we just strcmp js::GetObjectClass with "Window" and "ChromeWindow" in AllocateProtoAndIfaceCache and be done with it?

Nathan Froyd [:froydnj]

Assignee

Comment 40

•

11 years ago

(In reply to Bobby Holley (:bholley) from comment #39) > I'm not really wild about adding all this goop, especially since we're going > to have to rip it out in a few weeks. Can't we just strcmp > js::GetObjectClass with "Window" and "ChromeWindow" in > AllocateProtoAndIfaceCache and be done with it? This is why we ask for reviews. :) That works for me...Boris, WDYT?

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Comment 41

•

11 years ago

Worksforme as well.

Flags: needinfo?(bzbarsky)

Boris Zbarsky [:bzbarsky]

Updated

•

11 years ago

Attachment #8394241 - Flags: review?(bzbarsky)

Bobby Holley (:bholley)

Updated

•

11 years ago

Attachment #8394322 - Flags: review?(bobbyholley)

Nicholas Nethercote [inactive]

Updated

•

11 years ago

Blocks: 986323

Nathan Froyd [:froydnj]

Assignee

Updated

•

11 years ago

Blocks: 987457

Nathan Froyd [:froydnj]

Assignee

Updated

•

11 years ago

Blocks: 987460

Nathan Froyd [:froydnj]

Assignee

Comment 42

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/5ad139206674

Flags: in-testsuite-

Ryan VanderMeulen [:RyanVM]

Comment 43

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/5ad139206674

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla31

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML

You need to log in before you can comment on or make changes to this bug.