Closed
Bug 1024056
Opened 10 years ago
Closed 9 years ago
Design and implement lexical analyzer to be used in any parser code
Categories
(Core :: XPCOM, enhancement)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla42
Tracking | Status | |
---|---|---|
firefox42 | --- | fixed |
People
(Reporter: mayhemer, Assigned: mayhemer)
References
(Blocks 1 open bug)
Details
Attachments
(1 file, 7 obsolete files)
(deleted),
patch
|
mayhemer
:
review+
|
Details | Diff | Splinter Review |
I on and on see manual walk through of strings when anything from HTTP headers to urls is being parsed which is unsafe, hard to maintain, easy to break, hard to review.
I tend to have a helper class that encapsulates some good approach for this. We could then (if syntax is more complicated) build simple recursive syntax analyzers using it.
I have a good experience with lexical analyzers. It's easy to implement and easy to use.
Other option would be to have regex C++ impl with results, but that is much more complicated to implement correctly and not always what can also be easily used.
Assignee | ||
Comment 1•10 years ago
|
||
As Valentin pointed out in a private mail, to let this work with string, better place is to have it XPCOM.
Going to take this.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Component: MFBT → XPCOM
Assignee | ||
Comment 2•10 years ago
|
||
So, few hours of coding and the first simple working lexical analyzer is alive! Oh, how I love this job from time to time :)
Comment 3•10 years ago
|
||
I'm not sure what this is for, or what it's designed to handle, but you might be interested in these Unicode documents:
UAX 29: Unicode Text Segmentation
http://www.unicode.org/reports/tr29/
UAX 44: Unicode Character Database
http://www.unicode.org/reports/tr44/
And other reports that might be of use to you:
http://www.unicode.org/reports/
Assignee | ||
Comment 4•10 years ago
|
||
Some API improvements + real-life example in a test of non-recursive parse and added a lot of comments.
Attachment #8519356 -
Attachment is obsolete: true
Assignee | ||
Comment 5•9 years ago
|
||
I am still getting in my hands new patches that use strstr() and strchr(). I really want to do better. This was intended to base our new URL parser, hence it was blocked on unicode capabilities. But I would like to start using it for at least ASCII inputs for now and extend in a followup.
I think the headers has good comments and the included test shows how to use the parser.
Attachment #8526908 -
Attachment is obsolete: true
Attachment #8614806 -
Flags: review?(dougt)
Assignee | ||
Comment 6•9 years ago
|
||
Remove the AString dependency and we could potentially move this to MFBT?
Updated•9 years ago
|
Attachment #8614806 -
Flags: review?(dougt) → review?(nfroyd)
Assignee | ||
Comment 7•9 years ago
|
||
review ping
Nathan, this is not that complicated and I know about few places/patches that would like to use it. Thanks.
Flags: needinfo?(nfroyd)
Comment 8•9 years ago
|
||
Comment on attachment 8614806 [details] [diff] [review]
v1
Review of attachment 8614806 [details] [diff] [review]:
-----------------------------------------------------------------
Apologies for the delay.
Where do you see this being used outside of netwerk/? grepping for strchr, for instance, doesn't yield very many possible use cases outside of netwerk/.
Giving this f+ so I can see a revised patch, but there's nothing major in the comments below. Should be a quick r+ once comments are addressed.
::: xpcom/string/Lexan.h
@@ +10,5 @@
> +#include "nsString.h"
> +
> +/**
> + * This is a simple implementation of a lexical analyzer or maybe better
> + * called a tokenizer.
This is bikeshed territory, but I think it would be better called Tokenizer. (It took me a while to work out that Lexan is a LEXical ANalyzer.)
@@ +24,5 @@
> + */
> + enum TokenType {
> + TOKEN_UNKNOWN,
> + TOKEN_ERROR,
> + TOKEN_NUMBER,
Might be a good idea to rename this to TOKEN_INTEGER (and similar for all the *Number things throughout the patch) so people don't start thinking this is a number in, say, the JavaScript sense.
@@ +47,5 @@
> + public:
> + Token() : mType(TOKEN_UNKNOWN), mChar(0), mNumber(0) {}
> +
> + // Static constructors of tokens by type and value
> + static Token Word(nsACString const &aWord);
Gecko style would be |const nsACString& aWord|. Please fix this instance and so on throughout the patch.
@@ +89,5 @@
> + * and value to aToken result and shift the cursor past this just parsed token. Each call
> + * to Next() reads another token from the input and shifts the cursor.
> + * Returns false if we have passed the end of the input.
> + */
> + bool Next(Token &aToken);
I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.
@@ +121,5 @@
> + /**
> + * Check whitespace character is present, either a specified one by aChar argument
> + * or any from the mWhitespaces list.
> + */
> + bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }
Having a default argument here seems redundant with |CheckChar|. Why not make this function take no arguments and therefore always check its argument against mWhitespaces? Callers can use CheckChar if they want.
@@ +128,5 @@
> + * cursor position and return true. Otherwise false.
> + */
> + bool CheckChar(char const aChar) { return Check(Token::Char(aChar)); }
> + /**
> + * This is a custumizable version of CheckChar. aClassifier is a function called with
Nit: "customizable"
@@ +153,5 @@
> +
> + /**
> + * Returns the read cursor position back as it was before the last call of any parsing
> + * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> + * Works only once.
This last sentence really means "works only once per read token", right? i.e. something like:
l.CheckChar(...)
Rollback()
l.CheckChar(...)
Rollback()
is allowed, but not:
l.CheckChar(...)
Rollback()
Rollback()
correct? Please update the documentation to reflect that. If you could make Rollback() assert in the incorrect usage case, that would be even better.
@@ +161,5 @@
> + /**
> + * Start the process of recording. Based on aInclude value the begining of the recorded
> + * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> + * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> + * operation call.
What does "so as the read position was before the last parsing operation call" mean?
@@ +174,5 @@
> +
> +protected:
> + // true if we have already read the EOF token.
> + bool HasInput() const;
> + // Main prasing function, it doesn't shift the read cursor, just returns the next
Nit: "parsing"
@@ +193,5 @@
> +
> + // true iff we have already read the EOF token
> + bool mPastEof : 1;
> + // true iff the last Check*() call has returned false, reverts to true on Rollback() call
> + bool mHasFailed : 1;
I doubt we'll (a) be allocating a lot of these and (b) be allocating them anywhere but the stack. So we might as well make these non-bitfields to make the accesses more efficient.
::: xpcom/tests/TestLexan.cpp
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
Please fix the modeline (tab-width: 8, c-basic-offset: 2).
@@ +26,5 @@
> + if (xpcom.failed())
> + return 1;
> +
> +#define CHECK(a) \
> + MOZ_ASSERT(a); \
New tests going in to xpcom/ have typically been gtests, not straight cpptests. You can either:
1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same behavior; or
2) Convert everything to a gtest (doing a search-and-replace will get you 90% of the way there).
@@ +42,5 @@
> + "\r\n"
> + "This is the body"));
> +
> + r = l.CheckWord("HTTP");
> + CHECK(r);
Whichever method you choose to modify CHECK, above, it might reduce a bit of visual clutter to convert things to:
CHECK(l.CheckWord("HTTP"));
(would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )
@@ +112,5 @@
> +
> +
> + // Synthetic code-specific test
> +
> + Lexan l1(NS_LITERAL_CSTRING("test123 ,15 \t*\r\n%xx,-15\r\r"));
gtests would also enable separating out individual Lexan tests into functions...
Attachment #8614806 -
Flags: review?(nfroyd) → feedback+
Updated•9 years ago
|
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 9•9 years ago
|
||
(In reply to Nathan Froyd [:froydnj] [away 10 July through 19 July] from comment #8)
> Comment on attachment 8614806 [details] [diff] [review]
> v1
>
> Review of attachment 8614806 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Apologies for the delay.
>
> Where do you see this being used outside of netwerk/?
I want to expose this for any consumer. It can be used for parsing user input, file content. That's not limited to network.
> grepping for strchr,
> for instance, doesn't yield very many possible use cases outside of netwerk/.
If you insist on moving this to netwerk then let me know. True is that the primary target is that module.
>
> Giving this f+ so I can see a revised patch, but there's nothing major in
> the comments below. Should be a quick r+ once comments are addressed.
>
> ::: xpcom/string/Lexan.h
> @@ +10,5 @@
> > +#include "nsString.h"
> > +
> > +/**
> > + * This is a simple implementation of a lexical analyzer or maybe better
> > + * called a tokenizer.
>
> This is bikeshed territory, but I think it would be better called Tokenizer.
> (It took me a while to work out that Lexan is a LEXical ANalyzer.)
Lexan is shorter although cryptic. And this class is just the useful and universal subset of what lexical analyzers do. Tokenizer is the name.
> @@ +89,5 @@
> > + * and value to aToken result and shift the cursor past this just parsed token. Each call
> > + * to Next() reads another token from the input and shifts the cursor.
> > + * Returns false if we have passed the end of the input.
> > + */
> > + bool Next(Token &aToken);
>
> I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.
It's disputable. But consumers may always use 'unused <<'.
>
> @@ +121,5 @@
> > + /**
> > + * Check whitespace character is present, either a specified one by aChar argument
> > + * or any from the mWhitespaces list.
> > + */
> > + bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }
>
> Having a default argument here seems redundant with |CheckChar|. Why not
> make this function take no arguments and therefore always check its argument
> against mWhitespaces? Callers can use CheckChar if they want.
I have to remember why I made it this way exactly. Looks like some leftover as the API had been evolving.
> @@ +153,5 @@
> > +
> > + /**
> > + * Returns the read cursor position back as it was before the last call of any parsing
> > + * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> > + * Works only once.
>
> This last sentence really means "works only once per read token", right?
> i.e. something like:
>
> l.CheckChar(...)
> Rollback()
> l.CheckChar(...)
> Rollback()
>
> is allowed, but not:
>
> l.CheckChar(...)
> Rollback()
> Rollback()
>
> correct? Please update the documentation to reflect that.
I thought the sentence "Works only once" made it clear. But I'll think of a better description.
> If you could
> make Rollback() assert in the incorrect usage case, that would be even
> better.
Yup (should be easy).
>
> @@ +161,5 @@
> > + /**
> > + * Start the process of recording. Based on aInclude value the begining of the recorded
> > + * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> > + * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> > + * operation call.
>
> What does "so as the read position was before the last parsing operation
> call" mean?
Probably my inefficient english. I'll rephrase. probably just s/so as/as if/ ?
> New tests going in to xpcom/ have typically been gtests, not straight
> cpptests. You can either:
>
> 1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same
> behavior; or
> 2) Convert everything to a gtest (doing a search-and-replace will get you
> 90% of the way there).
The test is anyway incompilable. I'll try to build a gtest. Thanks for the tip.
> CHECK(l.CheckWord("HTTP"));
>
> (would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )
Hmm.. having a local var is IMO a bit better. It can help putting conditional breakpoints. But this class is relatively simple, so probably your approach is clearer.
Assignee | ||
Comment 10•9 years ago
|
||
See comment 9
Attachment #8614806 -
Attachment is obsolete: true
Attachment #8634051 -
Flags: review?(nfroyd)
Comment 11•9 years ago
|
||
Comment on attachment 8634051 [details] [diff] [review]
v2
Review of attachment 8634051 [details] [diff] [review]:
-----------------------------------------------------------------
r=me with the changes below.
::: xpcom/string/Tokenizer.cpp
@@ +8,5 @@
> +
> +#include "nsUnicharUtils.h"
> +#include "mozilla/CheckedInt.h"
> +
> +static const char* sWhitespaces = " \t";
Please make this |static const char sWhitespaces[] = " \t";|, as that makes it take up slightly less space in binary form.
@@ +20,5 @@
> + mRecord = mRollback = mCursor;
> + aSource.EndReading(mEnd);
> +}
> +
> +bool Tokenizer::Next(Token &aToken)
Two style nits: this should be formatted as:
bool
Tokenizer::Next(Token& aToken)
and so on throughout the patch. The header file also looks like it has some |Token &aToken| style nits; please fix those up as well.
::: xpcom/string/Tokenizer.h
@@ +15,5 @@
> + *
> + * It is limited only to ASCII input for now. UTF-8 or any other input
> + * encoding must yet be implemented.
> + */
> +class Tokenizer {
Please wrap all this in a |namespace mozilla|.
@@ +62,5 @@
> +
> + TokenType Type() const { return mType; }
> + char AsChar() const { return mChar; }
> + nsCString AsString() const { return mWord; }
> + int64_t AsInteger() const { return mInteger; }
These As* methods should include asserts that the token is of the correct type.
@@ +70,5 @@
> + Tokenizer(const nsACString& aSource);
> +
> + /**
> + * Some methods are collecting the input as it is being parsed to obtain
> + * a substring between particular syntax bounderies defined by descent.
What does "defined by descent" refer to here?
::: xpcom/string/moz.build
@@ +35,5 @@
> 'nsXPIDLString.h',
> 'string-template-def-char.h',
> 'string-template-def-unichar.h',
> 'string-template-undef.h',
> + 'Tokenizer.h',
Apologies for not noticing this previously: can you put this someplace other than xpcom/string/? xpcom/ds/ looks like it has some tokenizer-related stuff already, so putting it there would be fine.
Also, this should go in EXPORTS.mozilla.
::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +14,5 @@
> + c == '_' ||
> + c == '-';
> +}
> +
> +TEST(Tokenizer, HTTPResponse)
Thanks for separating these out into individual tests!
Attachment #8634051 -
Flags: review?(nfroyd) → review+
Assignee | ||
Comment 12•9 years ago
|
||
Nathan, thanks for the reviews!
Attachment #8634051 -
Attachment is obsolete: true
Attachment #8637850 -
Flags: review+
Assignee | ||
Updated•9 years ago
|
Keywords: checkin-needed
Assignee | ||
Comment 14•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134
There is just one test added (gtest under xpcom), and this new class is not integrated in the code anywhere but I don't know how to run only gtests on try.
Keywords: checkin-needed
Comment 15•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134
>
> There is just one test added (gtest under xpcom), and this new class is not
> integrated in the code anywhere but I don't know how to run only gtests on
> try.
the try run shows that this change cause a bustage and so i can not check this in.
bustage:
/builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/tests/gtest/TestTokenizer.cpp:14:27: error: suggest parentheses around '&&' within '||' [-Werror=parentheses]
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Assignee | ||
Comment 16•9 years ago
|
||
linux build fixed, try is closed. will ask after try's done.
Attachment #8637850 -
Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
Attachment #8638627 -
Flags: review+
Assignee | ||
Comment 17•9 years ago
|
||
Assignee | ||
Comment 18•9 years ago
|
||
Attachment #8638627 -
Attachment is obsolete: true
Attachment #8638642 -
Flags: review+
Assignee | ||
Comment 19•9 years ago
|
||
- fixed implicit constructor
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fadd4b6e20a2
Attachment #8638642 -
Attachment is obsolete: true
Attachment #8639251 -
Flags: review+
Comment 21•9 years ago
|
||
Keywords: checkin-needed
Comment 22•9 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
status-firefox42:
--- → fixed
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
You need to log in
before you can comment on or make changes to this bug.
Description
•