Parse without creating atoms
Categories
(Core :: JavaScript Engine, task, P2)
Tracking
()
People
(Reporter: mgaudet, Assigned: djvj)
References
Details
Attachments
(33 obsolete files)
Currently when we are parsing, we freely allocate atoms.
If we want a truly GC-free parse, we need a story to avoid this from happening, deferring the allocation of actual atoms until the end of parsing.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
So I've been investigating this for a little while. After initially trying to figure out impacts by inspection, I quickly realized it was not a particularly sustainable approach, and returned the the approach I ended up deploying for understanding the impact of scope creation. This approach has me encapsulate the existing JSAtom*
inside another type, ParseAtom
(and similarly, to help mesh with the existing code, a ParsePropertyName
).
I then proceeded to go through and attempt to plumb this new facade class through the engine, with the hopes of finding a fixed point minimal interface and set of impacts. The idea would be that if I could complete the facade work, I could then proceed to switch out the contents of the facade, replacing JSAtoms
managed by the GC with a parser-local atomization, that could then be exported to the Atoms table on main thread once the parse was done.
I have yet to complete this work. It's been quite slow going as it turns out (perhaps unsurprisingly) there's a lot of crosscutting complexity in this task.
I wanted to update this a bit to call out some of the challenges that will have to be surmounted to make this successful.
- Stream Crossings: There’s a couple places where we need to convert previously allocated atoms to ParseAtoms. More worryingly there are places where we need to somehow interact with machinery expressed in the language of JSAtoms when all we have is a parse atom… this is not good; these are barriers to future success. Some examples include:
BindingName
: Used as part of the scope data creation story, is a tagged pointer to a JSAtom — indicates we’ll need a translation story.- Modules exist in a liminal state between parse and non-parse worlds, and so need special handling I think.
- At least one hinky issue with self hosted code:
IsExtendedUnclonedSelfHostedFunctionName
- Character type handling: In order to sucessfully handle all possible JS programs we have to recapitulate our input character set support, which is a fair amount of legwork I think. So far I've managed to avoid it mostly, but I believe it will be a constraint on the final design.
- We'll need to support string concatenation, for methods like
prefixAccessorName
.
Reporter | ||
Comment 2•5 years ago
|
||
Oh, one other challenge I should mention is the cx->names()
system; interconvert between ImmutablePropertyNamePtr
and ParsePropertyNames.
Reporter | ||
Comment 3•5 years ago
|
||
It turns out that JS allows arbitrary BigInts as property names, which comes with its own set of challenges for name representation in the parser (See Bug 1605835, and Anba's comment here):
// With numeric separators.
{
let o = {
1_2_3n: "123",
};
assertEq(o[123], "123");
}
Assignee | ||
Comment 4•5 years ago
|
||
I've taken on this atom-related work, and have been looking at it for a couple of days now.
Atomized names and strings are referenced in the frontend in a few key places:
-
The Token structure carries unioned
name_
andatom_
fields that (depending on the token type), either contain a PropertyName* or a JSAtom*. These retrieved and passed around. -
The RecyclableNameMap structure maps JSAtom* to some template-specified type. This is used as the basis for a number of internal name-tables, such as closure name sets, and to track the indexes of atoms in the script's atom table.
The API of this map, if we don't want to change it too much, makes it so that any replacement for JSAtom* be able to carry all of the information that JSAtom* currenty does - namely: the actual contents of the string, the hash, and the length.
-
When constructing JSAtom*s, the parser currently attempts to avoid construction by looking the string up in the per-zone atom cache, which is free of synchronization issues. This likely avoids a lot of unnecessary lookups against the global atoms table (requiring locks to be taken).
-
Parser atoms need to be able to be easily compared to JSAtoms - this functionality is used in places where names are looked up on scope chains and environments that have already been materialized in the GC heap.
-
Some atoms derived from text will be encoded with unicode escapes, or other non-ASCII characters. In these cases, we cannot represent atoms simply with a pointer into the text and a length - proper hashing and length computations will require the decoded string.
The overall design these requirements suggests a ParserAtom
internal atom representation with the following capabilities:
-
Ability to multiplex a JSAtom*, a const char* into the source text, and a char* to a heap-allocated string (probably non-ascii string decoded from source).
-
Carries length and hash directly, to allow for easy comparisons against existing JSAtoms and other ParserAtoms.
Assignee | ||
Comment 5•5 years ago
|
||
This is an initial sketch of the ParseAtom (named ParserAtom in the source, but I'll change that) structure, which compiles.
A ParseAtom is a "fat" structure with 1 pointer (contents) and two 32-bit words (length+flags and hash). Flags are used to indicate whether the ParseAtom is internally a pointer to the source text, a pointer to a parser-allocated heap region, or a pre-existing JSAtom*.
Constructors aren't written but should be trivial. The core comparison methods to allow for easy comparing with JSAtom* are also done.
Next step is to add some constructors and try seeing how much breaks when I replace RecyclableNameMap's usage of JSAtom* with ParserAtom*.
Assignee | ||
Comment 6•5 years ago
|
||
After talking with Matthew about his previous stab at this, it became apparent that a parse-internal interning table is needed. Updated patch with that added.
Assignee | ||
Comment 7•5 years ago
|
||
Updated WIP patch for ParserAtom impl.
Assignee | ||
Comment 8•5 years ago
|
||
WIP patch for changing uses of JSAtom* to ParserAtomId instead.
Assignee | ||
Comment 9•5 years ago
|
||
Matthew - question for you. I'm running into a bit of an issue with the fact that ds::InlineTable seems to demand that its key types be pointers, or at least, can understand "X == nullptr", "nullptr == X" and "X = nullptr" and implement them "appropriately". Did you run into this? I could use to bounce some thoughts off on how to deal with this best.
Fix the impl of ds::InlineTable? Change ParserAtomId to somehow be a pointer-style thing? etc.
Reporter | ||
Comment 10•5 years ago
|
||
On chat.m.o I said
Hrm. If it wasn't for the key = nullptr I'd have imagined just defining operator=(nullptr_t) would have sufficed (and fixing the yoda conditions)
definitely didn't run into this, didn't get far enough. What table is this?
Further research suggests this is the RecyclableNameMap
Honestly, it seems like a problem in InlineTable
; I wonder if you could add a 'Sentinal' value parameter to the templates, plumb in through, and use that in lieu. You'd still need to extend ParserAtomId
to support sentinal comparisons, but maybe not too bad to do?
Assignee | ||
Comment 11•5 years ago
|
||
Talked with Ted and looked closer at this data-structure. InlineTable is used via two types that are built on top of it: InlineSet and InlineMap.
InlineSet is not actually used by anyone, and is dead code: https://searchfox.org/mozilla-central/search?q=InlineSet&case=false®exp=false&path=
InlineMap is used in two places only - the frontend in NameCollections.h, and in Ion: https://searchfox.org/mozilla-central/search?q=symbol:T_js%3A%3AInlineMap&redirect=false
It shouldn't be too hard to fix up this data structure's APIs to be a bit more sensibly structured.
Assignee | ||
Comment 12•5 years ago
|
||
Small patch to make CompilationInfo available from within the TokenStream. The ParserAtoms table will need to hang off of this, so the TokenStream needs access to it.
Assignee | ||
Comment 13•5 years ago
|
||
This adds the actual ParserAtom table. We may want to go in and rename everything 'StencilAtom' and such before landing, but this gets the rough table structure in. The TokenStream now creates ParserAtoms alongside JSAtoms, and seems to work.
Next step: verify some basic correctness of my table vs. the corresponding JSAtoms, then proceed to porting over usage of UsedNamesTracker to ParserAtomIds instead of JSAtom*.
Reporter | ||
Comment 14•5 years ago
|
||
Assignee | ||
Comment 15•5 years ago
|
||
Thanks for the comments. Generally agree with most of your suggestions. This is very draft quality code and I'm putting it up for bookmarking purposes. Keeping things clean when I'm not sure if I'm gonna have to dirty it up again seems like wasted effort. Prior to landing/review, I'm going to need to go in and organize things in a way that makes sense for reviewers.
Assignee | ||
Comment 16•5 years ago
|
||
Oh, the testing function was used briefly to sanity-check the code, but I removed it in the last bit. The entry format evolved after that and I figured a cross-reference test against actual parses would be good for now.
The thing with the test code is that for it to be meaningful, I'd want a proper set of "text with atoms" of various sorts, covering corner cases and such. That's not something I want to invest time into right now when I don't have a working end-to-end prototype yet. Fixing any lingering correctness issues is not high on the priority list because it's reasonably well-defined and deterministic. There's more uncertainty around the right implementation approach and architecture, so I want to resolve those first.
Assignee | ||
Comment 17•5 years ago
|
||
Ok, finally some draft of this rewritten to split lookup and interning into two different steps. A bunch of the usage code is commented out, but rather than writing the high-level interning APIs ahead of time, I'll add them as necessary to this patch as I work on transitioning the parser to using this.
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 18•5 years ago
|
||
Fixed up parser atoms table which does not do allocation during lookup.
Assignee | ||
Comment 19•5 years ago
|
||
Cleaned up some unused code and moved some of the implementation into cpp files to avoid large rebuilds more often.
Assignee | ||
Comment 20•5 years ago
|
||
More cleanups, and a mechanism for interning atoms straight from existing JSAtom pointers. This is mostly to make it easy to obtain a ParserAtomId at random places in the parser, to allow for incremental transition of the parser code to using ParserAtomIds where JSAtom*s currently exist.
Assignee | ||
Comment 21•5 years ago
|
||
While making downstream changes, discovered that parser atoms need a way of being compared against well-known names (e.g. "arguments", "eval", etc.). In the old system, well-known names were kept in JSContext::names
.
This updated patch contains a parallel mechanism for adding and tracking well-known names in the parser atoms table.
Assignee | ||
Comment 22•5 years ago
|
||
Assignee | ||
Comment 23•5 years ago
|
||
Depends on D70893
Assignee | ||
Updated•5 years ago
|
Updated•5 years ago
|
Comment 24•5 years ago
|
||
Comment 25•5 years ago
|
||
bugherder |
Assignee | ||
Comment 26•5 years ago
|
||
Temporary export of current work (3 patches in an HG export file), just to avoid consequences from any local-machine mishaps.
Assignee | ||
Comment 27•5 years ago
|
||
Updated patch queue for saving. Converted two BinAST implementation files, along with builtin/Reflect.cpp and ModuleObject.cpp.
As part of that, went back and added some support functionality to the base ParserAtomsTable patch - adding methods to obtain the character data from a ParserAtomEntry.
Also greatly simplified some of the code in EitherParser and ParserSharedBase to allow lifting/lowering of JSAtoms from/to ParserAtoms.
Also added a parallel implementation of FindReservedWord in TokenStream.cpp that operates on a ParserAtomEntry* intead of a JSLinearString*.. simplifying some implementation code in the process.
Still not actually compiling yet, but I'm making much further progress each time without having to revisit underlying base patches.. and the changes to underlying base patches are becoming less extensive each time I do have to revisit them.
Assignee | ||
Comment 28•5 years ago
|
||
Update - worked more on this bug. Discovered late last week that for -reasons-, I need access to compilationInfo from TokenStreamAnyChars, which doesn't have TokenStreamCharsBase as a base class. Worked a bit on rejigging the base patches to move the compilationInfo
field over to TokenStreamAnyChars, before realizing that the main create-atom methods hung off of the "CharsBase" arm of the inheritance hierarchy and not the "TokenStream" arm.
Now evaluating which approach makes more sense and just trying to understand how the C++ class inheritance hierarchy used for TokenStreams practically interacts with all the places where atoms are needed to be created.
Assignee | ||
Comment 29•5 years ago
|
||
Updated patch for adding parser atoms table, this time representing ParserAtomIds as simple a simple wrapper around a pointer-to-ParserAtomEntry, instead of a uint32_t index into a table of them.
In the near term, this is the best way to get around the issues surfaced in bug 1633841. It throws a wrench in the serialization plans, but we can push that question back a little bit.
The main problem this solves is the following:
-
The "TokenStreamCharsShared" inheritance lineage needs the ability to create atoms.
-
The "TokenStreamAnyChars" inheritance lineage needs the ability to inspect atoms.
With the index-based representation of ParserAtoms, both of those uses required access to the atoms table, which was difficult to plumb through the two lineages. The atoms table would naturally belong in the "TokenStreamCharsShared" lineage, because that's where atoms were created. However, there was no easy way to access that from "TokenStreamAnyChars".
With a pointer-based representation, this issue goes away. Anybody with a ParserAtomId implicitly has access to the contents.
Assignee | ||
Comment 30•5 years ago
|
||
A bunch more progress - some backtracking to add more stuff to base patches, but mostly "gunshot translation" of the atoms code (all stowed away into the tip patch of the queue in case it all goes wrong somehow).
Bookmarking now because this is starting to touch FunctionFlags and it's been a while since I rebased and I know stuff related to flags has landed recently.
Assignee | ||
Comment 31•5 years ago
|
||
Updated patch. Includes a small patch for bug 1635505 that hasn't landed yet because Autoland says it can't rebase to tip but I suspect is really because of closed tree... since this applies to mozilla-central tip I just tested.
Other than that, more progress on conversion of parser code. This reaches a milestone where I ended up hitting BinAST tokenizer code again on the compile failures.
Assignee | ||
Comment 32•5 years ago
|
||
Updated bookmark patch queue so I don't lose in-progress work.
A bunch more code translated. Still not compiling yet.
Assignee | ||
Comment 33•5 years ago
|
||
More progress, this time through a good amount of Parser.cpp.
Ran into issues with atomization relating to BigInt and Numeric property names needing to be normalized, which requires going through the VM-based path for creating them.
All of this logic is currently written against the VM proper, assuming first-class objects like BigInt and JSAtom. Need to find a solution for this eventually, but for now I'm adding translation points around them and moving forward.
Assignee | ||
Comment 34•5 years ago
|
||
Still plugging away at BytecodeEmitter and related code.
Assignee | ||
Comment 35•5 years ago
|
||
Assignee | ||
Comment 36•5 years ago
|
||
Updated ParserAtoms implementation with a bunch of updates.
- Much simplified entry structure with fully owned contents using UniquePtr.
- No repr/inheap fields, using mozilla::Variant of UniquePtrs
- A bunch of fill-ins elsewhere in the code (e.g. Latin1-related handling) to clean up this patch.
- Canonicalization of atom representation regardless of input encoding.
- Movement of well-known atoms to their own shared/permanent table)
The changes eliminate a known deficiency in the previous patches - all atoms table initialization starts with the pre-seeding of the table with all the well-known atoms, on every parse. The shell performance (e.g. in jit-tests) is noticeably faster now.
Should be in good shape to get reviewed shortly.
Assignee | ||
Comment 37•5 years ago
|
||
Assignee | ||
Comment 38•4 years ago
|
||
Bookmark of patch queue.
The base patch has been updated with post-review updates and is almost good to land (with minimal changes to decouple it from the parser for now).
The rest of the patches have been rebased to new changes on tip, and to the relatively extensive changes on the base patch.
The tip patch (SCOPEDATA) is in the middle of teasing apart the XDR, Parser, and other uses of the scope data. The implementation of this code is threaded across the VM and the parser.
So some code in Scope.cpp needs to have dual implementations (or generic ones), and some code which is used by parser exclusively has to have its types rejigged.
Updated•4 years ago
|
Assignee | ||
Comment 39•4 years ago
|
||
Updated bookmark patch queue.
The Scope::Data migration is complete and pseudo-separated (it can't land separately, but the diffs relating to it on a separate patch).
Likewise FoldConstants has been migrated over. One lingering issue has been documented with a FIXME/TODO, and that is folding of chains of string-concatenations. The old code leveraged JSString ropes to linearize the construction of the full concatenated string, and my implementation is quadratic.
This can be fixed in the perf phase by using a local vector and a multi-string concat implementation for parser-atoms, likely with fewer allocations (since the rope method allocates a discarded rope-string for each concat operation).. but it's unlikely to be a problem in the first place.
Some backsplash work from the Scope::Data changes, which related to the BytecodeEmitter, were also done.
Also did some migration of ForOfEmitter and FunctionEmitter.
It seems to be miscellaneous stuff, and then ModuleBuilder, which remain now.
BigInt handling also remains to be addressed.
Assignee | ||
Comment 40•4 years ago
|
||
Fixed up NameFunctions and a few other miscellaneous issues in other files that cropped up. Initial stabs at fixing Modules code.. paused while Ted works on deferring modules. Rebased a bunch and removed BinAST diffs since they don't apply anymore.
ParseContext and ObjectEmitter are next.
Assignee | ||
Comment 41•4 years ago
|
||
More conversion work. ObjectEmitter.cpp, ParseContext.cpp, PropOpEmitter.cpp, and SharedContext.cpp
Special notes: got to turn ScriptStencil::trace into a no-op function today. Will remove it from the GC-traced set at some point.
Miscellaneous backfill work in Scope.cpp, Parser.cpp, jsnum.cpp, TokenStream.cpp, Stencil.cpp, ParseNode.cpp to accommodate changes in the primary files.
After splitting out the ParserAtomsTable initial patch and landing it, the patch queue had shrunk to ~350k, but we're back to ~430k now. It'll be nice when at some point this thing actually compiles.
Assignee | ||
Comment 42•4 years ago
|
||
The compilation process gets to the linker with this patch. Fails at the linker stage with some missing specializations. I already fixed some of those, but there are some remaining with various '::trace' methods.
I put this off while pushing forward on the main patch, but I need to go through and dislodge all the tracing hooks from the ParserAtom analogue structures.
Right now, something is instantiating those trace functions, and it's likely some container type that is traced that now holds references to a ParserAtom-holding thing. Just need to find out where.
Other than that, there's one more task I want to complete on this patch: remove ParserAtomId and ParserNameId, and replace them with ParserAtom and ParserName, which are simple type aliases for pointers to zero-overhead typewrappers around ParserAtomEntry.
Assignee | ||
Comment 43•4 years ago
|
||
Finally got something that builds last evening. It failed every test on a bunch of trivial asserts and minor issues I'm fixing.
I started work on removing the ParserAtomId and ParserNameId types from the patch, and overall clean up the patch for some sort of landing eventually, but didn't get far.
Assignee | ||
Comment 44•4 years ago
|
||
Updated patch. I postponed the ParserAtomId and ParserNameId type removal and focused on getting some tests passing. Found some basic memory and logic bugs.
Now, we pass a majority of tests, and fail on <40% of tests. I expect there's some reasonably common code-path where there are bugs left, given the frequency of failures.
Assignee | ||
Comment 45•4 years ago
|
||
This fixes the last jit-test failure. Haven't tested perf yet but this should be correct. Running a final local jit-test on a unified patch now, but just pushing this up because I'm paranoid about losing work :)
Next is to push to try and push up a phabricator draft review while I work on reworking the ParserAtomId
and ParserNameId
types.
Assignee | ||
Comment 46•4 years ago
|
||
Assignee | ||
Comment 47•4 years ago
|
||
Tested the latest working patch against raptor. Results are here: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=eb8a9e4ecf788a40c2751120c0d8ac42fa7c67ff&framework=10&selectedTimeRange=172800
Some pretty significant slowdowns across a number of top sites. I expect one of the significant culprits is that parser-atoms always heap-allocate their content buffers, while js-atoms can use inline strings. Going to profile to try to pinpoint the issue.
Adding inline variants to ParserAtoms should be relatively straightforward, and cut down significantly on allocations related to small atoms.
Assignee | ||
Comment 48•4 years ago
|
||
Comment 49•4 years ago
|
||
Comment 50•4 years ago
|
||
Backed out for bustages on TokenStream.cpp CLOSED TREE
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=313249424&repo=autoland
Backout link: https://hg.mozilla.org/integration/autoland/rev/c9689902561bc4277430bbdb68e5030e0c53ab42
Updated•4 years ago
|
Comment 51•4 years ago
|
||
Comment on attachment 9166000 [details]
Bug 1592105 - Part 1 - Adaptor glue to allow for parser to transition to internal atoms representation. r?mgaudet,tcampbell
Revision D84865 was moved to bug 1660798. Setting attachment 9166000 [details] to obsolete.
Comment 52•4 years ago
|
||
Comment on attachment 9162373 [details]
Bug 1592105 - Part 2 - Convert uses of JSAtom* and PropertyName* to ParserAtomId and ParserNameId. r?mgaudet,tcampbell
Revision D82826 was moved to bug 1660798. Setting attachment 9162373 [details] to obsolete.
Assignee | ||
Updated•4 years ago
|
Comment 53•4 years ago
|
||
This is effectively complete by Bug 1660798. There are a few nits around snapshotting, but they'll be covered by parent bug.
Updated•4 years ago
|
Description
•