Bug 1024056 (Closed): Opened 10 years ago, Closed 9 years ago

Design and implement lexical analyzer to be used in any parser code

Categories

(Core :: XPCOM, enhancement)

Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla42

Tracking Status
firefox42 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 7 obsolete files)

I keep seeing manual walks through strings whenever anything from HTTP headers to URLs is being parsed, which is unsafe, hard to maintain, easy to break, and hard to review. I'd like to have a helper class that encapsulates a good approach to this. We could then (when the syntax is more complicated) build simple recursive-descent syntax analyzers on top of it. I have good experience with lexical analyzers: they are easy to implement and easy to use. Another option would be a C++ regex implementation with capture results, but that is much more complicated to implement correctly and not always as easy to use.
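As a rough illustration of what such a helper buys over raw strchr()/pointer walking, here is a minimal checker-style lexer sketch. The class and method names (SimpleLexer, ReadWord, ReadInteger, CheckChar, SkipWhites) are invented for this illustration and are not the API that later landed in this bug.

// Illustrative sketch only: SimpleLexer and its methods are invented for this
// example; they are not the class added by this bug's patches.
#include <cctype>
#include <cstdint>
#include <string>

class SimpleLexer
{
public:
  explicit SimpleLexer(const std::string& aInput) : mInput(aInput), mPos(0) {}

  // Consume a run of alphabetic characters; returns false if none is present.
  bool ReadWord(std::string& aWord)
  {
    size_t start = mPos;
    while (mPos < mInput.size() &&
           std::isalpha(static_cast<unsigned char>(mInput[mPos]))) {
      ++mPos;
    }
    if (mPos == start) {
      return false;
    }
    aWord = mInput.substr(start, mPos - start);
    return true;
  }

  // Consume a run of digits as an unsigned integer; returns false if none is
  // present.  (Overflow checking is omitted to keep the sketch short.)
  bool ReadInteger(uint64_t& aValue)
  {
    size_t start = mPos;
    uint64_t value = 0;
    while (mPos < mInput.size() &&
           std::isdigit(static_cast<unsigned char>(mInput[mPos]))) {
      value = value * 10 + static_cast<uint64_t>(mInput[mPos] - '0');
      ++mPos;
    }
    if (mPos == start) {
      return false;
    }
    aValue = value;
    return true;
  }

  // Consume one expected character; the cursor is left untouched on mismatch.
  bool CheckChar(char aChar)
  {
    if (mPos < mInput.size() && mInput[mPos] == aChar) {
      ++mPos;
      return true;
    }
    return false;
  }

  // Skip any spaces and tabs.
  void SkipWhites()
  {
    while (mPos < mInput.size() &&
           (mInput[mPos] == ' ' || mInput[mPos] == '\t')) {
      ++mPos;
    }
  }

private:
  std::string mInput;
  size_t mPos;
};

// Usage: parse "Content-Length: 1234" without strchr() or manual index math.
bool ParseContentLength(const std::string& aLine, uint64_t& aLength)
{
  SimpleLexer l(aLine);
  std::string first, second;
  if (!l.ReadWord(first) || !l.CheckChar('-') || !l.ReadWord(second) ||
      first != "Content" || second != "Length" || !l.CheckChar(':')) {
    return false;
  }
  l.SkipWhites();
  return l.ReadInteger(aLength);
}

The point is that bounds handling lives in one place and the call sites read like the grammar they parse.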
Blocks: url
As Valentin pointed out in a private mail, for this to work with our string classes, a better place for it is XPCOM. Going to take this.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Component: MFBT → XPCOM
Attached patch draft1 - working :) (obsolete) (deleted) — Splinter Review
So, a few hours of coding and the first simple working lexical analyzer is alive! Oh, how I love this job from time to time :)
I'm not sure what this is for, or what it's designed to handle, but you might be interested in these Unicode documents:

UAX 29: Unicode Text Segmentation
http://www.unicode.org/reports/tr29/

UAX 44: Unicode Character Database
http://www.unicode.org/reports/tr44/

And other reports that might be of use to you:
http://www.unicode.org/reports/
Attached patch draft1.1 (obsolete) (deleted) — Splinter Review
Some API improvements, a real-life example of a non-recursive parse in the test, and a lot of added comments.
Attachment #8519356 - Attachment is obsolete: true
Attached patch v1 (obsolete) (deleted) — Splinter Review
I keep getting new patches in my hands that use strstr() and strchr(). I really want us to do better. This was intended as the base for our new URL parser, hence it was blocked on Unicode capabilities. But I would like to start using it for at least ASCII input now and extend it in a followup. I think the header has good comments and the included test shows how to use the parser.
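For contrast, the pattern these patches keep reintroducing looks roughly like the sketch below: hand-rolled strchr()/strtol() scanning with ad-hoc error handling. This is a generic illustration, not code taken from any particular patch.

// Generic sketch of the hand-rolled scanning this bug wants to retire; it is
// not taken from any specific Mozilla patch.
#include <stdlib.h>
#include <string.h>

// Pull the port number out of "host:port"; returns -1 on malformed input.
static long ExtractPort(const char* aHostPort)
{
  const char* colon = strchr(aHostPort, ':');
  if (!colon || !colon[1]) {
    return -1;  // no colon, or nothing after it
  }
  char* end = nullptr;
  long port = strtol(colon + 1, &end, 10);
  if (*end != '\0' || port < 0 || port > 65535) {
    return -1;  // trailing junk or an out-of-range value
  }
  return port;
}

Every such helper re-implements bounds and error handling slightly differently, which is exactly the review burden a shared parser class removes.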
Attachment #8526908 - Attachment is obsolete: true
Attachment #8614806 - Flags: review?(dougt)
Remove the AString dependency and we could potentially move this to MFBT?
Attachment #8614806 - Flags: review?(dougt) → review?(nfroyd)
Review ping. Nathan, this is not that complicated, and I know of a few places/patches that would like to use it. Thanks.
Flags: needinfo?(nfroyd)
Comment on attachment 8614806 [details] [diff] [review]
v1

Review of attachment 8614806 [details] [diff] [review]:
-----------------------------------------------------------------

Apologies for the delay.

Where do you see this being used outside of netwerk/?  grepping for strchr,
for instance, doesn't yield very many possible use cases outside of netwerk/.

Giving this f+ so I can see a revised patch, but there's nothing major in the
comments below.  Should be a quick r+ once comments are addressed.

::: xpcom/string/Lexan.h
@@ +10,5 @@
> +#include "nsString.h"
> +
> +/**
> + * This is a simple implementation of a lexical analyzer or maybe better
> + * called a tokenizer.

This is bikeshed territory, but I think it would be better called Tokenizer.
(It took me a while to work out that Lexan is a LEXical ANalyzer.)

@@ +24,5 @@
> + */
> +  enum TokenType {
> +    TOKEN_UNKNOWN,
> +    TOKEN_ERROR,
> +    TOKEN_NUMBER,

Might be a good idea to rename this to TOKEN_INTEGER (and similar for all the
*Number things throughout the patch) so people don't start thinking this is a
number in, say, the JavaScript sense.

@@ +47,5 @@
> + public:
> +    Token() : mType(TOKEN_UNKNOWN), mChar(0), mNumber(0) {}
> +
> +    // Static constructors of tokens by type and value
> +    static Token Word(nsACString const &aWord);

Gecko style would be |const nsACString& aWord|.  Please fix this instance and
so on throughout the patch.

@@ +89,5 @@
> + * and value to aToken result and shift the cursor past this just parsed token. Each call
> + * to Next() reads another token from the input and shifts the cursor.
> + * Returns false if we have passed the end of the input.
> + */
> +  bool Next(Token &aToken);

I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.

@@ +121,5 @@
> +  /**
> +   * Check whitespace character is present, either a specified one by aChar argument
> +   * or any from the mWhitespaces list.
> +   */
> +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }

Having a default argument here seems redundant with |CheckChar|.  Why not make
this function take no arguments and therefore always check its argument
against mWhitespaces?  Callers can use CheckChar if they want.

@@ +128,5 @@
> +   * cursor position and return true. Otherwise false.
> +   */
> +  bool CheckChar(char const aChar) { return Check(Token::Char(aChar)); }
> +  /**
> +   * This is a custumizable version of CheckChar. aClassifier is a function called with

Nit: "customizable"

@@ +153,5 @@
> +
> +  /**
> +   * Returns the read cursor position back as it was before the last call of any parsing
> +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> +   * Works only once.

This last sentence really means "works only once per read token", right?
i.e. something like:

  l.CheckChar(...)
  Rollback()
  l.CheckChar(...)
  Rollback()

is allowed, but not:

  l.CheckChar(...)
  Rollback()
  Rollback()

correct?  Please update the documentation to reflect that.  If you could make
Rollback() assert in the incorrect usage case, that would be even better.

@@ +161,5 @@
> +  /**
> +   * Start the process of recording. Based on aInclude value the begining of the recorded
> +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> +   * operation call.

What does "so as the read position was before the last parsing operation call"
mean?

@@ +174,5 @@
> +
> +protected:
> +  // true if we have already read the EOF token.
> +  bool HasInput() const;
> +  // Main prasing function, it doesn't shift the read cursor, just returns the next

Nit: "parsing"

@@ +193,5 @@
> +
> +  // true iff we have already read the EOF token
> +  bool mPastEof : 1;
> +  // true iff the last Check*() call has returned false, reverts to true on Rollback() call
> +  bool mHasFailed : 1;

I doubt we'll (a) be allocating a lot of these and (b) be allocating them
anywhere but the stack.  So we might as well make these non-bitfields to make
the accesses more efficient.

::: xpcom/tests/TestLexan.cpp
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */

Please fix the modeline (tab-width: 8, c-basic-offset: 2).

@@ +26,5 @@
> +  if (xpcom.failed())
> +    return 1;
> +
> +#define CHECK(a) \
> +  MOZ_ASSERT(a); \

New tests going in to xpcom/ have typically been gtests, not straight
cpptests.  You can either:

1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same
behavior; or
2) Convert everything to a gtest (doing a search-and-replace will get you 90%
of the way there).

@@ +42,5 @@
> +    "\r\n"
> +    "This is the body"));
> +
> +  r = l.CheckWord("HTTP");
> +  CHECK(r);

Whichever method you choose to modify CHECK, above, it might reduce a bit of
visual clutter to convert things to:

  CHECK(l.CheckWord("HTTP"));

(would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

@@ +112,5 @@
> +
> +
> +  // Synthetic code-specific test
> +
> +  Lexan l1(NS_LITERAL_CSTRING("test123 ,15 \t*\r\n%xx,-15\r\r"));

gtests would also enable separating out individual Lexan tests into
functions...
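To make the "works only once per parsed token" contract from the Rollback() comment concrete, a one-level rollback guard could look roughly like the sketch below. The member names are invented for the illustration; the actual patch may structure this differently.

// Sketch of a single-slot Rollback() with the assert suggested in the review.
// Member names here are illustrative, not taken from the actual patch.
#include "mozilla/Assertions.h"

class RollbackSketch
{
public:
  bool CheckChar(char aChar)
  {
    RememberRollbackPoint();
    if (mCursor < mEnd && *mCursor == aChar) {
      ++mCursor;
      return true;
    }
    return false;
  }

  void Rollback()
  {
    // Only one rollback per parsing operation makes sense; a second call in
    // a row has nothing new to restore, so assert on it.
    MOZ_ASSERT(mCanRollback, "Rollback() called twice in a row");
    mCursor = mRollbackPoint;
    mCanRollback = false;
  }

private:
  void RememberRollbackPoint()
  {
    // Every parsing method (Next, Check*, Skip*) snapshots the cursor first.
    mRollbackPoint = mCursor;
    mCanRollback = true;
  }

  const char* mCursor = nullptr;
  const char* mEnd = nullptr;
  const char* mRollbackPoint = nullptr;
  bool mCanRollback = false;
};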
Attachment #8614806 - Flags: review?(nfroyd) → feedback+
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] [away 10 July through 19 July] from comment #8)
> Comment on attachment 8614806 [details] [diff] [review]
> v1
> 
> Review of attachment 8614806 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Apologies for the delay.
> 
> Where do you see this being used outside of netwerk/?

I want to expose this to any consumer.  It can be used for parsing user input
or file content; it's not limited to networking.

> grepping for strchr,
> for instance, doesn't yield very many possible use cases outside of netwerk/.

If you insist on moving this to netwerk/ then let me know.  It's true that the
primary target is that module.

> 
> Giving this f+ so I can see a revised patch, but there's nothing major in
> the comments below.  Should be a quick r+ once comments are addressed.
> 
> ::: xpcom/string/Lexan.h
> @@ +10,5 @@
> > +#include "nsString.h"
> > +
> > +/**
> > + * This is a simple implementation of a lexical analyzer or maybe better
> > + * called a tokenizer.
> 
> This is bikeshed territory, but I think it would be better called Tokenizer.
> (It took me a while to work out that Lexan is a LEXical ANalyzer.)

Lexan is shorter, although cryptic.  And this class is just the useful and
universal subset of what lexical analyzers do.  Tokenizer it is, then.

> @@ +89,5 @@
> > + * and value to aToken result and shift the cursor past this just parsed token. Each call
> > + * to Next() reads another token from the input and shifts the cursor.
> > + * Returns false if we have passed the end of the input.
> > + */
> > +  bool Next(Token &aToken);
> 
> I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.

It's disputable, but consumers may always use 'unused <<'.

> 
> @@ +121,5 @@
> > +  /**
> > +   * Check whitespace character is present, either a specified one by aChar argument
> > +   * or any from the mWhitespaces list.
> > +   */
> > +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }
> 
> Having a default argument here seems redundant with |CheckChar|.  Why not
> make this function take no arguments and therefore always check its argument
> against mWhitespaces?  Callers can use CheckChar if they want.

I have to remember why exactly I made it this way.  It looks like a leftover
from how the API had been evolving.

> @@ +153,5 @@
> > +
> > +  /**
> > +   * Returns the read cursor position back as it was before the last call of any parsing
> > +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> > +   * Works only once.
> 
> This last sentence really means "works only once per read token", right?
> i.e. something like:
> 
>   l.CheckChar(...)
>   Rollback()
>   l.CheckChar(...)
>   Rollback()
> 
> is allowed, but not:
> 
>   l.CheckChar(...)
>   Rollback()
>   Rollback()
> 
> correct?  Please update the documentation to reflect that.

I thought the sentence "Works only once" made it clear, but I'll think of a
better description.

> If you could
> make Rollback() assert in the incorrect usage case, that would be even
> better.

Yup (should be easy).

> 
> @@ +161,5 @@
> > +  /**
> > +   * Start the process of recording. Based on aInclude value the begining of the recorded
> > +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> > +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> > +   * operation call.
> 
> What does "so as the read position was before the last parsing operation
> call" mean?

Probably my clumsy English; I'll rephrase.  Probably just s/so as/as if/?

> New tests going in to xpcom/ have typically been gtests, not straight
> cpptests.  You can either:
> 
> 1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same
> behavior; or
> 2) Convert everything to a gtest (doing a search-and-replace will get you
> 90% of the way there).

The test isn't compilable anyway, so I'll try to build a gtest.  Thanks for
the tip.

> CHECK(l.CheckWord("HTTP"));
> 
> (would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

Hmm... having a local var is IMO a bit better; it helps with putting
conditional breakpoints.  But this class is relatively simple, so your
approach is probably clearer.
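For reference, the gtest shape being agreed on here would look roughly like the sketch below. Only calls already shown in this bug (CheckWord, CheckChar) are used; the test name, the include path, and the abbreviated body are assumptions, not the full TestTokenizer.cpp that eventually landed.

// Abbreviated sketch of the gtest conversion; the include path assumes the
// EXPORTS.mozilla location requested later in this bug.
#include "gtest/gtest.h"
#include "mozilla/Tokenizer.h"

using mozilla::Tokenizer;

TEST(Tokenizer, HTTPResponseStatusLine)
{
  Tokenizer t(NS_LITERAL_CSTRING("HTTP/1.0 304 Not modified\r\n"));

  // ASSERT_TRUE replaces the local `r` plus CHECK(r) pattern of the cpptest
  // and stops the test at the first failing expectation.
  ASSERT_TRUE(t.CheckWord("HTTP"));
  ASSERT_TRUE(t.CheckChar('/'));
}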
Attached patch v2 (obsolete) (deleted) — Splinter Review
Attachment #8614806 - Attachment is obsolete: true
Attachment #8634051 - Flags: review?(nfroyd)
Comment on attachment 8634051 [details] [diff] [review]
v2

Review of attachment 8634051 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: xpcom/string/Tokenizer.cpp
@@ +8,5 @@
> +
> +#include "nsUnicharUtils.h"
> +#include "mozilla/CheckedInt.h"
> +
> +static const char* sWhitespaces = " \t";

Please make this |static const char sWhitespaces[] = " \t";|, as that makes it
take up slightly less space in binary form.

@@ +20,5 @@
> +  mRecord = mRollback = mCursor;
> +  aSource.EndReading(mEnd);
> +}
> +
> +bool Tokenizer::Next(Token &aToken)

Two style nits: this should be formatted as:

bool
Tokenizer::Next(Token& aToken)

and so on throughout the patch.  The header file also looks like it has some
|Token &aToken| style nits; please fix those up as well.

::: xpcom/string/Tokenizer.h
@@ +15,5 @@
> + *
> + * It is limited only to ASCII input for now. UTF-8 or any other input
> + * encoding must yet be implemented.
> + */
> +class Tokenizer {

Please wrap all this in a |namespace mozilla|.

@@ +62,5 @@
> +
> +    TokenType Type() const { return mType; }
> +    char AsChar() const { return mChar; }
> +    nsCString AsString() const { return mWord; }
> +    int64_t AsInteger() const { return mInteger; }

These As* methods should include asserts that the token is of the correct
type.

@@ +70,5 @@
> +  Tokenizer(const nsACString& aSource);
> +
> +  /**
> +   * Some methods are collecting the input as it is being parsed to obtain
> +   * a substring between particular syntax bounderies defined by descent.

What does "defined by descent" refer to here?

::: xpcom/string/moz.build
@@ +35,5 @@
>      'nsXPIDLString.h',
>      'string-template-def-char.h',
>      'string-template-def-unichar.h',
>      'string-template-undef.h',
> +    'Tokenizer.h',

Apologies for not noticing this previously: can you put this someplace other
than xpcom/string/?  xpcom/ds/ looks like it has some tokenizer-related stuff
already, so putting it there would be fine.  Also, this should go in
EXPORTS.mozilla.

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +14,5 @@
> +    c == '_' ||
> +    c == '-';
> +}
> +
> +TEST(Tokenizer, HTTPResponse)

Thanks for separating these out into individual tests!
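On the sWhitespaces nit, the difference between the two forms is concrete: the pointer version stores the string bytes plus a pointer that needs a load-time relocation, while the array version stores only the bytes. A minimal before/after following the review comment:

// Before (as quoted in the review): a pointer to a string literal.  The
// binary then carries both the " \t\0" bytes and a pointer slot that the
// dynamic linker has to relocate at load time:
//
//   static const char* sWhitespaces = " \t";
//
// After (as requested): the characters live directly in the array, with no
// separate pointer and no relocation entry.
static const char sWhitespaces[] = " \t";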
Attachment #8634051 - Flags: review?(nfroyd) → review+
Attached patch v2.1 (obsolete) (deleted) — Splinter Review
Nathan, thanks for the reviews!
Attachment #8634051 - Attachment is obsolete: true
Attachment #8637850 - Flags: review+
Keywords: checkin-needed
Please provide a Try link :)
Keywords: checkin-needed
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134

There is just one test added (a gtest under xpcom), and this new class is not
integrated in the code anywhere yet, but I don't know how to run only gtests
on try.
Keywords: checkin-needed
(In reply to Honza Bambas (:mayhemer) from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134
> 
> There is just one test added (a gtest under xpcom), and this new class is
> not integrated in the code anywhere yet, but I don't know how to run only
> gtests on try.

The try run shows that this change causes bustage, so I cannot check this in.

Bustage:

/builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/tests/gtest/TestTokenizer.cpp:14:27: error: suggest parentheses around '&&' within '||' [-Werror=parentheses]
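The warning comes from an unparenthesized mix of && and || in the test's character classifier. A sketch of the likely fix follows; the classifier shape matches the TestTokenizer.cpp hunk quoted in the review above, but the exact original expression is an assumption.

// Sketch of the fix for -Werror=parentheses; gcc fires on an unparenthesized
// mix of && and ||, e.g.  return c >= 'a' && c <= 'z' || c == '_';
// Parenthesizing each range check keeps the same logic and silences it:
static bool HttpHeaderCharacter(char c)
{
  return (c >= 'a' && c <= 'z') ||
         (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9') ||
         c == '_' ||
         c == '-';
}

Since && already binds tighter than ||, the original expression computed the intended result; the parentheses only make the grouping explicit, which is what the warning asks for.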
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Attached patch v2.2 (obsolete) (deleted) — Splinter Review
Linux build fixed. Try is closed at the moment; I will ask for checkin after the try run is done.
Attachment #8637850 - Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
Attachment #8638627 - Flags: review+
Attached patch v2.2 [correct reviewer] (obsolete) (deleted) — Splinter Review
Attachment #8638627 - Attachment is obsolete: true
Attachment #8638642 - Flags: review+
Attached patch v2.3 (deleted) — Splinter Review
Attachment #8638642 - Attachment is obsolete: true
Attachment #8639251 - Flags: review+
Finally green!
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Blocks: 1188983
Depends on: 1190980
No longer depends on: 1190980