Closed Bug 35817 Opened 25 years ago Closed 21 years ago

optimize in-mem RDF for small # of props leading to large # of targets

Categories

(Core Graveyard :: RDF, defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: waterson, Assigned: mozilla)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [nsbeta3-])

Attachments

(2 files, 7 obsolete files)

The current in-memory datasource implementation is not well tuned for large datasources. It needs the following:

1. an arena-based fixed-size allocator from which Assertion objects are allocated.
2. each "source" should have a hashtable of "properties" that lead out of it if there are more than (say) ten distinct properties.
3. each triple <source, property, target> should be hashed for quick lookup if there are more than (say) ten distinct targets for a <source, property> pair.

Right now, a lot of the operations degenerate into linear-time lookups, which kills performance in large datasources.
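Points (2) and (3) both amount to the same pattern: keep a cheap linear structure while the arc count is small, and convert to a hash table once a threshold is crossed. A minimal sketch of that pattern, with illustrative names and std::string keys standing in for RDF nodes (this is not the Mozilla code):

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sketch: an arc set that stays a plain vector while small,
// then converts to a hash table once the entry count crosses kThreshold.
class ArcSet {
public:
    static constexpr std::size_t kThreshold = 10;

    void Add(const std::string& key, const std::string& value) {
        if (mHash) {
            (*mHash)[key] = value;
            return;
        }
        mList.emplace_back(key, value);
        if (mList.size() > kThreshold)
            ConvertToHash();
    }

    // Linear scan below the threshold, O(1) expected lookup above it.
    const std::string* Find(const std::string& key) const {
        if (mHash) {
            auto it = mHash->find(key);
            return it == mHash->end() ? nullptr : &it->second;
        }
        for (const auto& entry : mList)
            if (entry.first == key)
                return &entry.second;
        return nullptr;
    }

    bool IsHashed() const { return mHash != nullptr; }

private:
    void ConvertToHash() {
        mHash.reset(new std::unordered_map<std::string, std::string>);
        for (const auto& entry : mList)
            (*mHash)[entry.first] = entry.second;
        mList.clear();
    }

    std::vector<std::pair<std::string, std::string>> mList;
    std::unique_ptr<std::unordered_map<std::string, std::string>> mHash;
};
```

The tradeoff rjc later reports (slower insertions, more memory) comes from exactly this conversion step plus the extra bookkeeping per arc set.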
Right now, the entire list of newsgroups for a server gets put into the subscribe datasource, which is an in-memory datasource. That's a big in-memory datasource for a server like news.mcom.com. Before waterson spends time on this, we should make sure this work would be worth his time. I also have to make sure I'll continue to use the in-memory datasource, and not a custom one.
Status: NEW → ASSIGNED
Target Milestone: --- → M18
Use PLHash for this, not nsHashTable. Stealing from waterson, as this hurts the subscribe dialog.
Assignee: waterson → sspitzer
Status: ASSIGNED → NEW
Notes from IRC conversation with waterson:

<waterson> anyway, if you're going to add a hashtable to remember the properties in in-memory datasource
<waterson> it shouldn't be too hard
<waterson> have you looked thru that code at all?
<sspitzer> yes, I was debugging through it
<waterson> ok
<sspitzer> when trying to figure out why my HasAssertion() wasn't doing what I thought it would.
<sspitzer> I'll get the HasAssertion() riddle figured out
<waterson> so there are two top-level hashtables keyed by the "source"
<sspitzer> and then, when I get time, add the hashtable.
<waterson> and "target" respectively
<sspitzer> ok...
<waterson> when you assert
<waterson> it creates an Assertion object
<waterson> that gets added to two linked lists
<waterson> the "forward" links, which comes from the source-indexed table
<waterson> and
<waterson> the "backwards" links, which comes from the target-indexed table
<sspitzer> (this is good, keep going...)
<waterson> so when doing GetTarget, for example
<waterson> it looks up the source in the source-indexed table
<waterson> and just walks through the list comparing properties until it finds a match
<waterson> and returns the target
<waterson> so
<waterson> there are two cases to optimize
<waterson> source with many properties
<waterson> source with few properties, but many targets
<waterson> your case is the latter
<waterson> so you'd want to index on both property and target.
<sspitzer> gotcha.
<sspitzer> any need to make both fast?
<sspitzer> how about other in memory datasources, like bookmarks?
<sspitzer> or, are we "good enough" for them?
<waterson> well, we could make both go fast if we used a tree structure instead of hashtables
<sspitzer> isn't there some tree datastructures in xpcom/ds?
<waterson> there are
<waterson> ideally, we'd make both cases go fast
<waterson> because there are some graphs where you do have a lot of properties
<waterson> on the same source
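The structure waterson walks through above can be sketched roughly like this (illustrative names, with std::string standing in for nsIRDFNode; this is not the actual nsInMemoryDataSource code). The linear property walk in GetTarget is the part this bug wants to speed up:

```cpp
#include <string>
#include <unordered_map>

// Each Assertion sits on two linked lists: the forward chain reached
// from the source-indexed table, and the backward chain reached from
// the target-indexed table.
struct Assertion {
    std::string source, property, target;
    Assertion* next = nullptr;     // forward chain (same source)
    Assertion* invNext = nullptr;  // backward chain (same target)
};

class InMemoryGraph {
public:
    ~InMemoryGraph() {
        // Every assertion lives on exactly one forward chain, so deleting
        // along the forward chains frees each node once. (The real code
        // arena-allocates Assertions instead; see point 1 in the bug.)
        for (auto& bucket : mForwardArcs) {
            Assertion* as = bucket.second;
            while (as) { Assertion* dead = as; as = as->next; delete dead; }
        }
    }

    void Assert(const std::string& s, const std::string& p,
                const std::string& t) {
        Assertion* as = new Assertion{s, p, t};
        as->next = mForwardArcs[s];   // prepend to the forward chain
        mForwardArcs[s] = as;
        as->invNext = mReverseArcs[t];  // prepend to the backward chain
        mReverseArcs[t] = as;
    }

    // GetTarget: look up the source, then walk the chain comparing
    // properties -- the linear scan described above.
    const Assertion* GetTarget(const std::string& s,
                               const std::string& p) const {
        auto it = mForwardArcs.find(s);
        if (it == mForwardArcs.end()) return nullptr;
        for (Assertion* as = it->second; as; as = as->next)
            if (as->property == p) return as;
        return nullptr;
    }

private:
    std::unordered_map<std::string, Assertion*> mForwardArcs;
    std::unordered_map<std::string, Assertion*> mReverseArcs;
};
```

With one property fanning out to thousands of targets (the subscribe-dialog case), every duplicate check during Assert and every HasAssertion-style lookup walks that whole chain.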
Status: NEW → ASSIGNED
David, can you make the call on whether we want to try this for beta3?
Assignee: sspitzer → bienvenu
Status: ASSIGNED → NEW
I'll defer to Chris and/or Alec. Is this still an issue, do you think?
I think rjc is working on this under the guise of another bug.
Assignee: bienvenu → rjc
Nominating for PR3 to get this on the radar that a decision needs to be made since there appears to be some sentiment that it's necessary and some confusion over whether it's being worked on.
Keywords: nsbeta3, perf
triage team: nsbeta3+, very important to mail
Whiteboard: [nsbeta3+]
I've been playing around with this bug for a while now and think it's time to suggest that it's marked "nsbeta3-"... I have a 90% rewrite of RDF's nsInMemoryDataSource which I've been playing with for a while... it uses "smart" arcs (simple arrays which convert over to hash tables when a size boundary is crossed)... however, it's significantly slower (about 50%) than the current implementation in most cases (especially for insertions) and uses more memory.
Whiteboard: [nsbeta3+]
nsbeta3- for hard P3 bug we'll have to live with.
Whiteboard: [nsbeta3-]
This milestone has passed.
Target Milestone: M18 → ---
Blocks: 77460
Hey Chris. Is this bug still valid? And if so, who should it be assigned to?
Taking for now, and FUTURE-ing. Will pull it back if I get inspired.
Assignee: rjc → waterson
Keywords: helpwanted
Target Milestone: --- → Future
Status: NEW → ASSIGNED
Chris, mind if I take this bug from you? <grin> I've felt inspired lately.
Assignee: waterson → rjc
Status: ASSIGNED → NEW
New interface for RDF's in-memory data source
Attached patch Small diffs to nsRDFContainer.cpp and makefiles (obsolete) (deleted) — Splinter Review
Attached patch nsInMemoryDataSource.cpp diffs - magic herein (obsolete) (deleted) — Splinter Review
nsInMemoryDataSource.cpp diffs - basically, for RDF SEQs which pass a size focal point, add a new assertion into the graph (invisible to the outside world) which is actually a PLDHash on the associated "properties" arc to all the values. Reading in a largish bookmark file (2300 URLs + 100 folders) I see a 20-30% improvement in graph population.
nsInMemoryDataSource.cpp - entire file for those who don't enjoy applying large diffs
Waterson/Tingley, care to run with these changes for a bit? (Note that these changes are primarily for RDF SEQ usage. A more aggressive hack job for really large graphs which don't use RDF SEQs derives from these changes without too much of a brain spasm.)
Status: NEW → ASSIGNED
Attached patch nsInMemoryDataSource.cpp diffs - magic herein (obsolete) (deleted) — Splinter Review
Oopsy daisy.
Attachment #57287 - Attachment is obsolete: true
Oopsy daisy.
Attachment #57289 - Attachment is obsolete: true
Yeah, I'll give it a spin over the weekend. Is it that much harder to keep the mutation threshold internally, rather than requiring this extra interface and the external comparison against nextVal? Generalizing this will probably require keeping some sort of per-source count inside the datasource itself... Actually, as long as the slow duplicate-checking code is in there that walks all the arcs when Asserting, we could probably get the count for free -- keep track of how many properties we check, and mutate if we see k of them.
In an earlier version (I have about eight on my desktop at the moment <whee>), I did have conversion going on entirely internally. However, you don't really get a count for free... you have to walk over the *entire* assertion list (which the code doesn't really do unless it happens to have a complete mismatch) taking certain conditions into account (for example, you wouldn't want to create a hash{propertyArc} if something was just using lots of #child arcs to indicate containment, as that's a more pathological case -- the more aggressive "brain spasm" case I alluded to) or maintain internal counts (which seemed bogus, maintaining consistency just for the focal conversion point). Having all the decision logic internal regarding containers also meant another trip across the graph (via calling back out into RDFContainerUtils::IsASeq, for example), which was another added expense. All that said, being even more aggressive might end up needing internal counts. Time will tell. Truthfully, the extra interface is probably needed anyway... for example, it would be beneficial to be able to have the XUL template builder "push" what it knows about containment down through RDF; doing so would allow the in-memory datasource to better know what to optimize for.
Oops, you're right, no free counts. I didn't get as much of a chance to test this over the weekend as I had planned, because I'm a lazy beast. Couple random things:

- do we really want to pre-allocate mObservers every time? any idea on the bloat tradeoff? The cost of checking |if (mObservers)| seems pretty low, and we'd still have to check |mObservers->Count() != 0| if we always allocated it.
- if the approach we're taking at this point is to rely on external hinting as to what a "hot" resource is, there's no reason the API should pretend to be tailored to containers, as there's nothing in the code (other than the name) that makes it so. I'd just change nsIRDFInMemoryDataSource to nsIRDFOptimizedDataSource (or something) and EnsureFastContainment -> OptimizeForSource() (or something), and then go through the code looking for spots where we can make use of it (certain spots in the XUL template code come to mind).
- it looks like there's a frequently repeated chunk of code that checks for the existence of the hash and pulls out either the hashed assertion chain or the "normal" one, depending on state. This should probably be factored out to clean things up...

By the way, what generated these diffs? I had to convert \r to \n before they would apply...
> - do we really want to pre-allocate mObservers every time?

I certainly consider it a fine idea. Bloat should be minimal; verification would be a fine idea, though.

> there's no reason the API should pretend to be tailored to containers

Ah, but that's EXACTLY what these changes are. It (currently) IS specifically for RDF_SEQ containment and not applicable (as is) for other data sets such as a node with lots of attributes or other ways of specifying containment. Again, #child is an example... you wouldn't want to turn a linked list of

    hash{parent node} -> [#child] -> [target #1]
                     \-> [#child] -> [target #2]
                     \-> [#child] -> [target #3]

into just a linked list again (same thing) off of a secondary hash:

    hash{parent node} -> hash{#child} -> [target #1]
                                     \-> [target #2]
                                     \-> [target #3]

which the code currently would do. This is the pathological case. What should really happen is more like:

    hash{parent node} -> [#child] -> hash[target nodes]

> - it looks like there's a frequently repeated chunk of code

True, although the frequency is not huge (my opinion)... a few lines of duplication. As you'll notice, the in-memory datasource has other issues like this as well, such as with enumerating/updating its internal linked lists. Cleanup can happen after the fact; the current goal is performance.

> - By the way, what generated these diffs? I had to convert \r to \n before they would apply...

Latest beta of MacCVS. :) [What?!?! You don't use a Mac. <grin>] Sorry for the inconvenience.
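The distinction rjc draws above can be sketched as follows (hypothetical names, std::string standing in for RDF nodes): hashing on the property alone still leaves a linear target chain when one hot property like #child fans out to many targets, whereas hashing the targets of the hot <source, property> pair makes membership checks constant-time.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Layout A: a property-keyed hash whose values are plain target lists.
// This is the "secondary hash" that doesn't actually help for #child.
using PropertyHash =
    std::unordered_map<std::string, std::vector<std::string>>;

// Layout B: for a hot <source, property> pair, hash the targets directly.
using TargetHash = std::unordered_set<std::string>;

// With layout A, a single hot property still forces a walk over every
// target -- the pathological case described above.
bool HasTargetLinear(const PropertyHash& arcs, const std::string& prop,
                     const std::string& target) {
    auto it = arcs.find(prop);
    if (it == arcs.end())
        return false;
    for (const auto& t : it->second)  // O(number of targets)
        if (t == target)
            return true;
    return false;
}

// With layout B, membership is an expected O(1) hash probe.
bool HasTargetHashed(const TargetHash& targets, const std::string& target) {
    return targets.count(target) != 0;
}
```

Both functions return the same answers; only the lookup cost differs, which is why "hash{parent node} -> [#child] -> hash[target nodes]" is the shape that matters.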
> The cost of checking |if (mObservers)| seems pretty low, and we'd still have to check |mObservers->Count() != 0| if we always allocated it.

BTW, I forgot to add that this should be changed as well; the count of the # of observers only changes at two points: AddObserver and RemoveObserver. Instead of calling mObservers->Count() all the time, it should only be done at those two points and cached as a member variable of the in-memory datasource. :)
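A minimal sketch of that caching idea (illustrative class, with void* standing in for nsIRDFObserver; this is not the actual patch): the count is recomputed only where the observer list mutates, so the hot notification path is a plain member read.

```cpp
#include <cstdint>
#include <vector>

class DataSourceSketch {
public:
    void AddObserver(void* obs) {
        mObservers.push_back(obs);
        // The count only changes here and in RemoveObserver,
        // so cache it instead of calling Count() everywhere.
        mNumObservers = static_cast<uint32_t>(mObservers.size());
    }

    void RemoveObserver(void* obs) {
        for (auto it = mObservers.begin(); it != mObservers.end(); ++it) {
            if (*it == obs) {
                mObservers.erase(it);
                break;
            }
        }
        mNumObservers = static_cast<uint32_t>(mObservers.size());
    }

    // Hot path (called on every Assert/Unassert notification):
    // a member read, no function call through the observer array.
    bool HasObservers() const { return mNumObservers != 0; }

private:
    std::vector<void*> mObservers;
    uint32_t mNumObservers = 0;
};
```

The cost is the 4-byte member that tingley later asks about quantifying; the win is skipping a Count() call on every graph mutation.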
Blocks: 53521
Sorry it took me so long to get around to this. My big concern here is that we're bloating the Assertion object: there tend to be a lot of these. Any way we could do this without increasing its size? For example, using a |union|?
Attached patch RDF optimizations with |union| usage (obsolete) (deleted) — Splinter Review
> My big concern here is that we're bloating the Assertion object:
> there tend to be a lot of these. Any way we could do this without
> increasing its size? For example, using a |union|?

Using a |union| is a fine idea; I was just lazy and didn't. However, there are still a few extra bytes of bloat due to a new boolean flag which can't be inside the |union|.

Waterson: here's a new diff, care to review?
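The layout being discussed looks roughly like this (a sketch with stand-in types and illustrative names, not the actual patch): the hash-table pointer is only live for the special "hash" assertions, so it can overlap the target/inverse-link fields, and only the discriminating flag adds size.

```cpp
#include <cstddef>

struct Node;       // stand-in for nsIRDFNode
struct HashTable;  // stand-in for PLDHashTable

struct AssertionSketch {
    Node* mSource;
    Node* mProperty;
    union {
        struct {
            Node* mTarget;              // normal assertion: target node
            AssertionSketch* mInvNext;  // backward-chain link
        } as;
        HashTable* mPropertyHash;       // "hash" assertion: table of arcs
    } u;
    AssertionSketch* mNext;
    bool mHashEntry;  // the discriminating flag that can't join the union
};
```

Only one arm of the union is valid at a time, selected by mHashEntry -- which is exactly why tingley's review asks for an NS_ASSERTION guarding accesses to u: code that sets as.mInvNext on a hash-flagged assertion would silently clobber the table pointer.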
Attachment #57283 - Attachment is obsolete: true
Attachment #57284 - Attachment is obsolete: true
Attachment #57292 - Attachment is obsolete: true
Attachment #57293 - Attachment is obsolete: true
Oops, shame that it compiles on Mac but not on Linux. :(
Darn COMPtr inside of class problem.
Yeah, here's the love. ;)
Attachment #61376 - Attachment is obsolete: true
Comment on attachment 61395 [details] [diff] [review]
RDF optimizations with |union| usage, tweaked for Linux

Weird name. Focal reviews on the brain? ;-) (How about RDF_SEQ_MAX_LIST or RDF_SEQ_LIST_LIMIT?)

>+#define RDF_SEQ_MUTATION_FOCALPOINT 8

Couple places where you left commented-out or #if 0'd code in the patch. Just remove?

>+/*
>+ if (doomed && (doomed->mFlags & HASH_ENTRY))
>+ {
>+ // rjc: just forward thinking; if/when we have
>+ // multi-level hash-assertions, we'll have to
>+ // handle this case; no big deal, though
>+ }
>+*/

And...

>+#if 0
>+ // Implementation methods
>+ Assertion*
>+ GetArcs(PLDHashTable *hash, nsIRDFNode* u) {
>+ PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u, PL_DHASH_LOOKUP);
>+ return PL_DHASH_ENTRY_IS_BUSY(hdr)
>+ ? NS_REINTERPRET_CAST(Entry*, hdr)->mAssertions
>+ : nsnull; }
>+
>+ void
>+ SetArcs(PLDHashTable *hash, nsIRDFNode* u, Assertion* as) {
>+ PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u,
>+ as ? PL_DHASH_ADD : PL_DHASH_REMOVE);
>+ if (as && hdr) {
>+ Entry* entry = NS_REINTERPRET_CAST(Entry*, hdr);
>+ entry->mNode = u;
>+ entry->mAssertions = as;
>+ } }
>+#endif

And...

>+// PL_DHashTableOperate(root->mForwardHash,
>+// aProperty, PL_DHASH_REMOVE);
>+ PL_DHashTableRawRemove(root->u.hash.mForwardHash, hdr);
Attachment #61395 - Flags: superreview+
Sure, "RDF_SEQ_LIST_LIMIT" it is. I'll also remove the commented out chunks. :)
Comment on attachment 61395 [details] [diff] [review]
RDF optimizations with |union| usage, tweaked for Linux

r=tingley with a bunch of nits.

- 2) If/when RDF containers become commonplace, consider implementing
- a special case for them to improve access time to individual
- elements.

Rather than removing this outright, I'd replace it with a comment about the other case... something like "2) Optimize lookups for datasources which have a small number of properties fanning out to a large number of targets."

+ PLDHashTable* mForwardHash;

Maybe it's just me, but I have trouble keeping mForwardHash and mForwardArcs straight when reading this code. mForwardHash is mapping from a source to a set of properties; would mProperties, mPropertyHash, or even mForwardProps be more descriptive? The same issue with the assertion-level |DeleteForwardArcsEntry()| routine (which at the very least should match the name of the hash table it's clearing entries for).

+ Assertion* mInvNext;

I get the willies when I see code that blindly accesses a union without regard to the type -- as is the case with the LockedAssert() code that sets this field. Admittedly, it should be impossible for an assertion that's using |u.mForwardHash| to end up having its mInvNext set (I hope), but it still makes me uneasy. I dunno, pull it out of the union, add |NS_ASSERTION(!mHashEntry, "ack!");|, whatever; anything to help me sleep better :) Perhaps I'm overreacting.

+ nsIRDFService *mRDFService;

Can we use a single static here instead of adding 4 bytes per datasource?

+ nsresult rv = rv = nsServiceManager::GetService(kRDFServiceCID,

Typo.

- <SETTING><NAME>MWFTP_Post_password</NAME><VALUE>0</VALUE></SETTING>
+ <SETTING><NAME>MWFTP_Post_password</NAME><VALUE>265rt2cn65mdaci^Qp'^Z&#140;^E^E @&#204;&#191;&#255;&#216;&#240;</VALUE></SETTING>

No wonder no one understands the mac build system :)

+ PL_DHashTableEnumerate(aAssertion->u.hash.mForwardHash,
+ DeleteForwardArcsEntry, &aAllocator);

Indentation.
* the union declaration and several other spots in the code (especially the NS_ADDREF calls) use hard tabs; nothing else in the file does.
* match the bracing style in the patch to nsInMemoryDataSource.cpp to the rest of the file.

Was there ever any quantification of the bloat tradeoff (now an extra 4 bytes for mNumObservers + allocating mObservers every time)? Hopefully the number of datasources is small enough that saving the extra function call/pointer check is worth it.

We'll need to open a bug eventually for optimizing the other case as well (one/few properties --> many targets).
Attachment #61395 - Flags: review+
Added the suggested comment at the top o' the file, cleaned up indentation (my editor is messing with me) and formatting a bit, renamed mForwardHash to mPropertyHash (although once the next set of RDF optimizations is written, that name too will be incorrect; however, it's fine for this round), and renamed DeleteForwardArcsEntry to DeletePropertyHashEntry, etc.

> nsIRDFService *mRDFService;
> Can we use a single static here instead of adding 4 bytes per datasource?

Please open another bug for that, and assign it to me, Chris, or yourself. :)

> We'll need to open a bug eventually for optimizing the other case as well
> (one/few properties --> many targets).

Either that, or we can just keep this bug open. (I was planning on the latter.)
Summary: optimize access time for InMemoryDataSource::HasAssertion → optimize in-mem RDF for small # of props leading to large # of targets
>> nsIRDFService *mRDFService;
>> Can we use a single static here instead of adding 4 bytes per datasource?
>
> Please open another bug for that, and assign it to me, Chris, or yourself. :)

Why wait? "How hard can it be?" - Homer J. Simpson

/be
Brendan: Why wait? rjc: Why now?
Because it's easy, later means never too often, and we are getting very serious about < 0 tolerance for bloat regressions. Why did you make it a member, anyway? That effort was nearly the same as making it a static, it seems to me. We are already too far down the slippery slope. /be
Brendan: Since you are around and have free time this Sunday evening, here ya go then: review this and I'll check it in to get rid of the unused reference.
r=mcafee for unused RDF service patch
Comment on attachment 63781 [details] [diff] [review] Remove unused RDF service reference Cool -- I didn't realize it was unused. sr=brendan@mozilla.org. Recording mcafee's r= too. /be
Attachment #63781 - Flags: superreview+
Attachment #63781 - Flags: review+
Indeed, the RDF service reference was just a vestigial remnant of earlier hacking on the in-memory datasource which slipped through the cracks. Thanks for the review(s), everybody.
tever is not RDF QA anymore
QA Contact: tever → nobody
Seems like someone forgot to actually close this bug, the last two diffs seem to be in the trunk.
Status: ASSIGNED → RESOLVED
Closed: 21 years ago
Keywords: helpwanted
Resolution: --- → FIXED
Product: Core → Core Graveyard