Closed
Bug 35817
Opened 25 years ago
Closed 21 years ago
optimize in-mem RDF for small # of props leading to large # of targets
Categories
(Core Graveyard :: RDF, defect, P3)
Tracking
(Not tracked)
Status: RESOLVED FIXED
Target Milestone: Future
People
(Reporter: waterson, Assigned: mozilla)
References
(Blocks 1 open bug)
Details
(Keywords: perf, Whiteboard: [nsbeta3-])
Attachments
(2 files, 7 obsolete files)
(deleted), patch - tingley: review+, waterson: superreview+
(deleted), patch - brendan: review+, brendan: superreview+
The current in-memory datasource implementation is not well tuned for large
datasources. It needs the following:
1. an arena-based fixed size allocator from which Assertion objects are allocated.
2. each "source" should have a hashtable of "properties" that lead out of it if
there are more than (say) ten distinct properties.
3. each triple <source, property, target> should be hashed for quick lookup if
there are more than (say) ten distinct targets for a <source, property> pair.
Right now, a lot of the operations degenerate into linear-time lookups, which
kills performance in large datasources.
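To make the proposal concrete, here is a minimal standalone C++ sketch of items 1-3; the names (AssertionArena, SourceEntry, kThreshold) and the use of std containers are assumptions for illustration, not the actual nsInMemoryDataSource types:
#include <cstddef>
#include <unordered_map>
#include <vector>

class nsIRDFNode;  // opaque stand-in for the real XPCOM interface

struct Assertion {
    nsIRDFNode* source;
    nsIRDFNode* property;
    nsIRDFNode* target;
    Assertion*  next;  // chain of assertions sharing a source
};

// (1) Arena-based fixed-size allocator: Assertions come out of large
// slabs, so allocation is a pointer bump and teardown frees whole slabs.
class AssertionArena {
    std::vector<Assertion*> slabs;
    std::size_t used = 0;
    static const std::size_t kSlabSize = 1024;
public:
    Assertion* Allocate() {
        if (slabs.empty() || used == kSlabSize) {
            slabs.push_back(new Assertion[kSlabSize]);
            used = 0;
        }
        return &slabs.back()[used++];
    }
    ~AssertionArena() { for (Assertion* s : slabs) delete[] s; }
};

// (2) Past ~10 distinct properties, a source indexes its outgoing arcs
// by property instead of keeping one flat list; (3) would hash the
// targets under each <source, property> pair the same way.
static const std::size_t kThreshold = 10;

struct SourceEntry {
    Assertion* list = nullptr;                                   // few props
    std::unordered_map<nsIRDFNode*, Assertion*>* byProperty = nullptr;  // many
};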
Comment 1 • 25 years ago
Right now, the entire list of newsgroups for a server gets put into the
subscribe datasource, which is an in-memory datasource.
That's a big in-memory datasource for a server like news.mcom.com.
Before waterson spends time on this, we should make sure this work would be
worth his time.
I also have to make sure I'll continue to use the in-memory datasource, and not
a custom one.
Reporter
Updated • 25 years ago
Status: NEW → ASSIGNED
Target Milestone: --- → M18
Comment 2 • 25 years ago
Use PLHash for this, not nsHashTable.
Stealing from waterson, as this hurts the subscribe dialog.
Assignee: waterson → sspitzer
Status: ASSIGNED → NEW
Comment 3 • 25 years ago
Notes from an IRC conversation with waterson:
<waterson> anyway, if you're going to add a hashtable to remember the properties
in in-memory datasource
<waterson> it shouldn't be too hard
<waterson> have you looked thru that code at all?
<sspitzer> yes, I was debugging through it
<waterson> ok
<sspitzer> when trying to figure out why my HasAssertion() wasn't doing what I
thought it would.
<sspitzer> I'll get the HasAssertion() riddle figured out
<waterson> so there are two top-level hashtables keyed by the "source"
<sspitzer> and then, when I get time, add the hashtable.
<waterson> and "target" respectively
<sspitzer> ok...
<waterson> when you assert
<waterson> it creates an Assertion object
<waterson> that gets added to two linked lists
<waterson> the "forward" links, which comes from the source-indexed table
<waterson> and
<waterson> the "backwards" links, which comes from the target-indexed table
<sspitzer> (this is good, keep going...)
<waterson> so when doing GetTarget, for example
<waterson> it looks up the source in the source-indexed table
<waterson> and just walks through the list comparing properties until it finds a
match
<waterson> and returns the target
<waterson> so
<waterson> there are two cases to optimize
<waterson> source with many properties
<waterson> source with few properties, but many targets
<waterson> your case is the latter
<waterson> so you'd want to index on both property and target.
<sspitzer> gotcha.
<sspitzer> any need to make both fast?
<sspitzer> how about other in memory datasources, like bookmarks?
<sspitzer> or, are we "good enough" for them?
<waterson> well, we could make both go fast if we used a tree structure instead
of hashtables
<sspitzer> isn't there some tree datastructures in xpcom/ds?
<waterson> there are
<waterson> ideally, we'd make both cases go fast
<waterson> because there are some graphs where you do have a lot of properties
<waterson> on the same source
Status: NEW → ASSIGNED
Comment 4 • 24 years ago
David, can you make the call on whether we want to try this for beta3?
Assignee: sspitzer → bienvenu
Status: ASSIGNED → NEW
Comment 5 • 24 years ago
I'll defer to Chris and/or Alec. Is this still an issue, do you think?
Reporter
Comment 6 • 24 years ago
I think rjc is working on this under the guise of another bug.
Assignee: bienvenu → rjc
Comment 7 • 24 years ago
Nominating for PR3 to get this on the radar: a decision needs to be made,
since there appears to be some sentiment that it's necessary and some
confusion over whether it's being worked on.
Assignee
Comment 9 • 24 years ago
I've been playing around with this bug for a while now and think it's time to
suggest that it's marked "nsbeta3-"...
I have a 90% rewrite of RDF's nsInMemoryDataSource which I've been playing with
for a while... it uses "smart" arcs (simple arrays which convert over to hash
tables when a size boundary is crossed)... however, it's significantly slower
(about 50%) than the current implementation in most cases (especially for
insertions) and uses more memory.
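For illustration, a "smart" arc set of the kind described might look like the following standalone sketch; the names and the boundary value are assumptions, and this is not the abandoned rewrite itself:
#include <cstddef>
#include <unordered_set>
#include <vector>

template <typename T>
class SmartArcSet {
    static const std::size_t kBoundary = 8;  // assumed crossover point
    std::vector<T> mArray;                   // used while the set is small
    std::unordered_set<T>* mHash = nullptr;  // allocated after crossover
public:
    void Add(const T& v) {
        if (mHash) { mHash->insert(v); return; }
        mArray.push_back(v);
        if (mArray.size() > kBoundary)  // mutate: copy the array into a hash
            mHash = new std::unordered_set<T>(mArray.begin(), mArray.end());
    }
    bool Contains(const T& v) const {
        if (mHash) return mHash->count(v) != 0;  // O(1) after crossover
        for (const T& x : mArray)                // O(n), but n stays small
            if (x == v) return true;
        return false;
    }
    ~SmartArcSet() { delete mHash; }
};
The comment above notes the catch with this shape: the mutation step and the extra indirection make insertions slower and cost memory, which is why the rewrite was shelved.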
Whiteboard: [nsbeta3+]
Comment 12 • 24 years ago
Hey Chris. Is this bug still valid? And if so, who should it be assigned to?
Reporter
Comment 13 • 24 years ago
Taking for now, and FUTURE-ing. Will pull it back if I get inspired.
Reporter
Updated • 24 years ago
Status: NEW → ASSIGNED
Assignee
Comment 14 • 23 years ago
Chris, mind if I take this bug from you? <grin> I've felt inspired lately.
Assignee: waterson → rjc
Status: ASSIGNED → NEW
Assignee
Comment 15 • 23 years ago
New interface for RDF's in-memory data source
Assignee
Comment 16 • 23 years ago
Assignee
Comment 17 • 23 years ago
nsInMemoryDataSource.cpp diffs - basically, for RDF SEQs which pass a size
focal point, add a new assertion into the graph (invisible to the outside
world) which is actually a PLDHash mapping the associated "property" arcs to
their values.
Reading in a largish bookmark file (2300 URLs + 100 folders), I see a 20-30%
improvement in graph population.
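In other words (a hedged sketch with assumed names; the real patch uses PLDHash, as the excerpts later in this bug show), the hidden assertion carries a per-property index:
#include <unordered_map>

class nsIRDFResource;
struct Assertion;  // the usual <source, property, target> plus list links

// Hung off a source as an assertion that outside callers never see:
// once the SEQ passes the focal point, lookups by property become a
// hash probe instead of a walk over the whole assertion chain.
struct HiddenPropertyIndex {
    std::unordered_map<nsIRDFResource*, Assertion*> mChains;
};

inline Assertion*
LookupByProperty(const HiddenPropertyIndex& aIndex, nsIRDFResource* aProperty)
{
    auto it = aIndex.mChains.find(aProperty);
    return (it != aIndex.mChains.end()) ? it->second : nullptr;
}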
Assignee
Comment 18 • 23 years ago
nsInMemoryDataSource.cpp - entire file for those who don't enjoy applying large
diffs
Assignee
Comment 19 • 23 years ago
Waterson/Tingley, care to run with these changes for a bit?
(Note that these changes are primarily for RDF SEQ usage. A more aggressive hack
job for really large graphs which don't use RDF SEQs derives from these changes
without too much of a brain spasm.)
Status: NEW → ASSIGNED
Comment 22 • 23 years ago
Yeah, I'll give it a spin over the weekend.
Is it that much harder to keep the mutation threshold internally, rather than
requiring this extra interface and the external comparison against nextVal?
Generalizing this will probably require keeping some sort of per-source count
inside the datasource itself...
Actually, as long as the slow duplicate-checking code is in there that walks all
the arcs when Asserting, we could probably get the count for free -- keep track
of how many properties we check, and mutate if we see k of them.
Assignee
Comment 23 • 23 years ago
In an earlier version (I have about eight on my desktop at the moment <whee>), I
did have conversion going on entirely internally.
However, you don't really get a count for free... you have to walk over the
*entire* assertion list (which the code doesn't really do unless it happens to
have a complete mismatch) taking certain conditions into account (for example,
you wouldn't want to create a hash{propertyArc} if something was just using lots
of #child arcs to indicate containment, as that's a more pathological case --
the more aggressive "brain spasm" case I alluded to), or else maintain internal
counts (which seemed bogus, maintaining consistency just for the focal
conversion point). Having all the decision logic internal regarding containers
also meant another trip across the graph (via calling back out into
RDFContainerUtils::IsASeq, for example), which was another added expense.
All that said, being even more aggressive might end up needing internal counts.
Time will tell. Truthfully, the extra interface is probably needed anyway...
for example, it would be beneficial to be able to have the XUL template builder
"push" what it knows about containment down through RDF... doing so would allow
the in-memory datasource to better know what to optimize for.
Comment 24 • 23 years ago
Oops, you're right, no free counts.
I didn't get as much of a chance to test this over the weekend as I had planned,
because I'm a lazy beast.
Couple random things:
- do we really want to pre-allocate mObservers every time? any idea on the
bloat tradeoff? The cost of checking |if (mObservers)| seems pretty low, and
we'd still have to check |mObservers->Count() != 0| if we always allocated it.
- if the approach we're taking at this point is to rely on external hinting as
to what a "hot" resource is, there's no reason the API should pretend to be
tailored to containers, as there's nothing in the code (other than the name)
that makes it so. I'd just change nsIRDFInMemoryDataSource to
nsIRDFOptimizedDataSource (or something) and EnsureFastContainment ->
OptimizeForSource() (or something), and then go through the code looking for
spots where we can make use of it (certain spots in the XUL template code come
to mind).
- it looks like there's a frequently repeated chunk of code that checks for the
existence of the hash and pulls out either the hashed assertion chain or the
"normal" one, depending on state. this should probably be factored out to clean
things up...
By the way, what generated these diffs? I had to convert \r to \n before they
would apply...
Assignee
Comment 25 • 23 years ago
> - do we really want to pre-allocate mObservers every time?
I certainly consider it a fine idea; bloat should be minimal, though verifying
that would be worthwhile.
> there's no reason the API should pretend to be tailored to containers
Ah, but that's EXACTLY what these changes are. It (currently) IS specifically
for RDF_SEQ containment and not applicable (as is) for other data sets such as a
node with lots of attributes or other ways of specifying containment. Again,
#child is an example... you wouldn't want to turn a linked list of
    hash{parent node} -> [#child] -> [target #1]
                     \-> [#child] -> [target #2]
                     \-> [#child] -> [target #3]
into just a linked list again (same thing) off of a secondary hash:
    hash{parent node} -> hash{#child} -> [target #1]
                                     \-> [target #2]
                                     \-> [target #3]
which the code currently would do. This is the pathological case. What should
really happen is more like:
    hash{parent node} -> [#child] -> hash[target nodes]
> - it looks like there's a frequently repeated chunk of code
True, although the frequency is not huge (my opinion)... a few lines of
duplication. As you'll notice, the in-memory datasource has other issues like
this as well, such as with enumerating/updating its internal linked lists.
Cleanup can happen after the fact; the current goal is performance.
> - By the way, what generated these diffs? I had to convert \r to \n before
> they would apply...
Latest beta of MacCVS. :) [What?!?! You don't use a Mac. <grin>] Sorry for
the inconvenience.
Assignee
Comment 26 • 23 years ago
> The cost of checking |if (mObservers)| seems pretty low, and we'd still have
> to check |mObservers->Count() != 0| if we always allocated it.
BTW, I forgot to add that this should be changed as well; the count of the
observers only changes at two points: AddObserver & RemoveObserver. Instead of
calling mObservers->Count() all the time, it should only be done at those two
points and cached as a member variable of the in-memory datasource. :)
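A minimal sketch of that suggestion, with hypothetical standalone types in place of the real XPCOM ones:
#include <algorithm>
#include <vector>

class nsIRDFObserver;

class InMemoryDataSource {
    std::vector<nsIRDFObserver*> mObservers;
    int mNumObservers = 0;  // cached; only touched at the two points below
public:
    void AddObserver(nsIRDFObserver* aObserver) {
        mObservers.push_back(aObserver);
        ++mNumObservers;
    }
    void RemoveObserver(nsIRDFObserver* aObserver) {
        auto it = std::find(mObservers.begin(), mObservers.end(), aObserver);
        if (it != mObservers.end()) {
            mObservers.erase(it);
            --mNumObservers;
        }
    }
    bool HasObservers() const {
        return mNumObservers != 0;  // hot paths test this, never Count()
    }
};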
Reporter
Comment 27 • 23 years ago
Sorry it took me so long to get around to this. My big concern here is that
we're bloating the Assertion object: there tend to be a lot of these. Any way we
could do this without increasing its size? For example, using a |union|?
Assignee
Comment 28 • 23 years ago
> My big concern here is that we're bloating the Assertion object:
> there tend to be a lot of these. Any way we could do this without
> increasing its size? For example, using a |union|?
Using a |union| is a fine idea; I was just lazy and didn't. However, there are
still a few extra bytes of bloat due to a new boolean flag which can't go
inside the |union|.
Waterson: Here's a new diff, care to review?
Attachment #57283 - Attachment is obsolete: true
Attachment #57284 - Attachment is obsolete: true
Attachment #57292 - Attachment is obsolete: true
Attachment #57293 - Attachment is obsolete: true
Assignee
Comment 29 • 23 years ago
Oops, shame that it compiles on Mac but not on Linux. :(
Assignee
Comment 30 • 23 years ago
Darn nsCOMPtr-inside-a-class problem.
Assignee
Comment 31 • 23 years ago
Yeah, here's the love. ;)
Attachment #61376 - Attachment is obsolete: true
Reporter
Comment 32 • 23 years ago
Comment on attachment 61395
RDF optimizations with |union| usage, tweaked for Linux
>+#define RDF_SEQ_MUTATION_FOCALPOINT 8
Weird name. Focal reviews on the brain? ;-) (How about RDF_SEQ_MAX_LIST or
RDF_SEQ_LIST_LIMIT?)
Couple places where you left commented-out or #if 0'd code in the patch. Just
remove?
>+/*
>+ if (doomed && (doomed->mFlags & HASH_ENTRY))
>+ {
>+ // rjc: just forward thinking; if/when we have
>+ // multi-level hash-assertions, we'll have to
>+ // handle this case; no big deal, though
>+ }
>+*/
And...
>+#if 0
>+ // Implementation methods
>+ Assertion*
>+ GetArcs(PLDHashTable *hash, nsIRDFNode* u) {
>+ PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u, PL_DHASH_LOOKUP);
>+ return PL_DHASH_ENTRY_IS_BUSY(hdr)
>+ ? NS_REINTERPRET_CAST(Entry*, hdr)->mAssertions
>+ : nsnull; }
>+
>+ void
>+ SetArcs(PLDHashTable *hash, nsIRDFNode* u, Assertion* as) {
>+ PLDHashEntryHdr* hdr = PL_DHashTableOperate(hash, u,
>+ as ? PL_DHASH_ADD : PL_DHASH_REMOVE);
>+ if (as && hdr) {
>+ Entry* entry = NS_REINTERPRET_CAST(Entry*, hdr);
>+ entry->mNode = u;
>+ entry->mAssertions = as;
>+ } }
>+#endif
And...
>+// PL_DHashTableOperate(root->mForwardHash,
>+// aProperty, PL_DHASH_REMOVE);
>+ PL_DHashTableRawRemove(root->u.hash.mForwardHash, hdr);
>+
Attachment #61395 - Flags: superreview+
Assignee
Comment 33 • 23 years ago
Sure, "RDF_SEQ_LIST_LIMIT" it is. I'll also remove the commented out chunks. :)
Comment 34 • 23 years ago
Comment on attachment 61395
RDF optimizations with |union| usage, tweaked for Linux
r=tingley with a bunch of nits.
- 2) If/when RDF containers become commonplace, consider implementing
- a special case for them to improve access time to individual
- elements.
Rather than removing this outright, I'd replace it with a comment about the
other case... something like
2) Optimize lookups for datasources which have a small number of properties
fanning out to a large number of targets.
+ PLDHashTable* mForwardHash;
maybe it's just me, but I have trouble keeping mForwardHash and mForwardArcs
straight when reading this code. mForwardHash is mapping from a source to a
set of properties; would mProperties, mPropertyHash, or even mForwardProps be
more descriptive?
The same issue with the assertion-level |DeleteForwardArcsEntry()| routine
(which at the very least should match the name of the hash table it's
clearing entries for).
+ Assertion* mInvNext;
I get the willies when I see code that blindly accesses a union without
regards to the type -- as is the case with the LockedAssert() code that
sets this field. Admittedly, it should be impossible for an assertion
that's using |u.mForwardHash| to end up having its mInvNext set (I hope),
but it still makes me uneasy. I dunno, pull it out of the union, add
|NS_ASSERTION(!mHashEntry, "ack!");|, whatever; anything to help me sleep
better :) Perhaps I'm overreacting.
+ nsIRDFService *mRDFService;
Can we use a single static here instead of adding 4 bytes per datasource?
+ nsresult rv = rv = nsServiceManager::GetService(kRDFServiceCID,
typo.
-
<SETTING><NAME>MWFTP_Post_password</NAME><VALUE>0</VALUE></SETTING>
+
<SETTING><NAME>MWFTP_Post_password</NAME><VALUE>265rt2cn65mdaci^Qp'^ZŒ^E^E
@Ì¿ÿØð</VALUE></SETTING>
no wonder no one understands the mac build system :)
+ PL_DHashTableEnumerate(aAssertion->u.hash.mForwardHash,
+ DeleteForwardArcsEntry, &aAllocator);
indentation.
* the union declaration and several other spots in the code (especially the
NS_ADDREF calls) are using hard tabs; nothing else in the file does.
* match the bracing style in the nsInMemoryDataSource.cpp patch to the rest of
the file.
Was there ever any quantification on the bloat tradeoff (now extra 4 bytes for
mNumObservers + allocating mObservers every time)? Hopefully the number of
datasources is small enough that saving the extra function call/pointer check
is worth it.
We'll need to open a bug eventually for optimizing the other case as well
(one/few properties --> many targets).
Attachment #61395 - Flags: review+
Assignee
Comment 35 • 23 years ago
Added the suggested comment at top o' file, cleaned up indentation (my editor
is messing with me) and formatting a bit, renamed mForwardHash to be
mPropertyHash (although once the next set of RDF optimizations is written, that
name too will be incorrect; however, it's fine for this round), and renamed
DeleteForwardArcsEntry to be DeletePropertyHashEntry, etc.
> nsIRDFService *mRDFService;
> Can we use a single static here instead of adding 4 bytes per datasource?
Please open another bug for that, and assign it to me, Chris, or yourself. :)
> We'll need to open a bug eventually for optimizing the other case as well
> (one/few properties --> many targets).
Either that, or we can just keep this bug open.
(I was planning on the latter.)
Assignee
Updated • 23 years ago
Summary: optimize access time for InMemoryDataSource::HasAssertion → optimize in-mem RDF for small # of props leading to large # of targets
Comment 36 • 23 years ago
>> nsIRDFService *mRDFService;
>> Can we use a single static here instead of adding 4 bytes per datasource?
>
>Please open another bug for that, and assign it to me, Chris, or yourself. :)
Why wait?
"How hard can it be?" - Homer J. Simpson
/be
Assignee
Comment 37 • 23 years ago
Brendan: Why wait?
rjc: Why now?
Comment 38 • 23 years ago
Because it's easy, later means never too often, and we are getting very serious
about < 0 tolerance for bloat regressions. Why did you make it a member,
anyway? That effort was nearly the same as making it a static, it seems to me.
We are already too far down the slippery slope.
/be
Assignee
Comment 39 • 23 years ago
Brendan: Since you are around and have free time this Sunday evening, here ya
go then; review this and I'll check it in to get rid of the unused reference.
Comment 40 • 23 years ago
r=mcafee for unused RDF service patch
Comment 41 • 23 years ago
Comment on attachment 63781
Remove unused RDF service reference
Cool -- I didn't realize it was unused. sr=brendan@mozilla.org. Recording
mcafee's r= too.
/be
Attachment #63781 - Flags: superreview+
Attachment #63781 - Flags: review+
Assignee
Comment 42 • 23 years ago
Indeed, the RDF service reference was just a vestigial remnant of earlier
hacking on the in-memory datasource that slipped through the cracks. Thanks
for the review(s), everybody.
Comment 44 • 21 years ago
Seems like someone forgot to actually close this bug; the last two diffs appear
to have landed on the trunk.
Updated • 6 years ago
Product: Core → Core Graveyard