Closed Bug 1024056 - Opened 10 years ago, Closed 9 years ago

Design and implement lexical analyzer to be used in any parser code

Categories

Core :: XPCOM
Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla42
Tracking Status: firefox42 --- fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 7 obsolete files)

Again and again I see manual walking of strings whenever anything from HTTP headers to URLs is being parsed.  That is unsafe, hard to maintain, easy to break, and hard to review.

I'd like to have a helper class that encapsulates a good approach to this.  We could then (when the syntax is more complicated) build simple recursive syntax analyzers on top of it.

I have good experience with lexical analyzers.  They are easy to implement and easy to use.

The other option would be a C++ regex implementation with match results, but that is much more complicated to implement correctly and not always as easy to use.
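
Just to sketch the kind of API I have in mind (class and method names below are purely illustrative, nothing is final), parsing e.g. a Content-Length header would turn from a manual pointer walk into something like:

  // Hypothetical tokenizer-based parsing; names are illustrative only.
  Tokenizer p(aHeaderLine);
  int64_t length = -1;
  if (p.CheckWord("Content-Length") && p.CheckChar(':')) {
    p.SkipWhites();                     // eat optional spaces/tabs
    Tokenizer::Token t;
    if (p.Next(t) && t.Type() == Tokenizer::TOKEN_NUMBER) {
      length = t.AsNumber();
    }
  }

Every step either consumes a whole token or leaves the read cursor where it was, so there is no pointer arithmetic to get wrong.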
Blocks: url
As Valentin pointed out in a private mail, for this to work with our string classes the better place for it is XPCOM.

Going to take this.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Component: MFBT → XPCOM
Attached patch draft1 - working :) (obsolete) (deleted) — Splinter Review
So, a few hours of coding and the first simple working lexical analyzer is alive!  Oh, how I love this job from time to time :)
I'm not sure what this is for, or what it's designed to handle, but you might be interested in these Unicode documents:

UAX 29: Unicode Text Segmentation
http://www.unicode.org/reports/tr29/

UAX 44: Unicode Character Database
http://www.unicode.org/reports/tr44/

And other reports that might be of use to you:
http://www.unicode.org/reports/
Attached patch draft1.1 (obsolete) (deleted) — Splinter Review
Some API improvements + a real-life example of a non-recursive parse in the test, plus a lot of added comments.
Attachment #8519356 - Attachment is obsolete: true
Attached patch v1 (obsolete) (deleted) — Splinter Review
I keep getting new patches in my hands that use strstr() and strchr().  I really want us to do better.  This class was intended as the base for our new URL parser, hence it was blocked on Unicode capabilities.  But I would like to start using it for at least ASCII inputs now and extend it in a followup.

I think the header has good comments and the included test shows how to use the parser.
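
To give a rough idea, the test exercises it along these lines (simplified and partly from memory, see the attached test for the real code):

  Lexan l(NS_LITERAL_CSTRING("HTTP/1.0 304 Not modified\r\n"
                             "Content-Length: 16\r\n"
                             "\r\n"
                             "This is the body"));

  bool r = l.CheckWord("HTTP");   // literal word token
  r = r && l.CheckChar('/');      // single-character token
  // ... and so on, token by token (numbers, whitespace, EOLs),
  // until the whole response is consumed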
Attachment #8526908 - Attachment is obsolete: true
Attachment #8614806 - Flags: review?(dougt)
Remove the AString dependency and we could potentially move this to MFBT?
Attachment #8614806 - Flags: review?(dougt) → review?(nfroyd)
review ping

Nathan, this is not that complicated, and I know about a few places/patches that would like to use it.  Thanks.
Flags: needinfo?(nfroyd)
Comment on attachment 8614806 [details] [diff] [review]
v1

Review of attachment 8614806 [details] [diff] [review]:
-----------------------------------------------------------------

Apologies for the delay.

Where do you see this being used outside of netwerk/?  grepping for strchr, for instance, doesn't yield very many possible use cases outside of netwerk/.

Giving this f+ so I can see a revised patch, but there's nothing major in the comments below.  Should be a quick r+ once comments are addressed.

::: xpcom/string/Lexan.h
@@ +10,5 @@
> +#include "nsString.h"
> +
> +/**
> + * This is a simple implementation of a lexical analyzer or maybe better
> + * called a tokenizer.

This is bikeshed territory, but I think it would be better called Tokenizer.  (It took me a while to work out that Lexan is a LEXical ANalyzer.)

@@ +24,5 @@
> +   */
> +  enum TokenType {
> +    TOKEN_UNKNOWN,
> +    TOKEN_ERROR,
> +    TOKEN_NUMBER,

Might be a good idea to rename this to TOKEN_INTEGER (and similar for all the *Number things throughout the patch) so people don't start thinking this is a number in, say, the JavaScript sense.

@@ +47,5 @@
> +  public:
> +    Token() : mType(TOKEN_UNKNOWN), mChar(0), mNumber(0) {}
> +
> +    // Static constructors of tokens by type and value
> +    static Token Word(nsACString const &aWord);

Gecko style would be |const nsACString& aWord|.  Please fix this instance and so on throughout the patch.

@@ +89,5 @@
> +   * and value to aToken result and shift the cursor past this just parsed token.  Each call
> +   * to Next() reads another token from the input and shifts the cursor.
> +   * Returns false if we have passed the end of the input.
> +   */
> +  bool Next(Token &aToken);

I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.
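
i.e. something like:

  MOZ_WARN_UNUSED_RESULT bool Next(Token& aToken);

and the same for the Check*/Skip* family, so callers can't silently ignore a parse failure.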

@@ +121,5 @@
> +  /**
> +   * Check whitespace character is present, either a specified one by aChar argument
> +   * or any from the mWhitespaces list.
> +   */
> +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }

Having a default argument here seems redundant with |CheckChar|.  Why not make this function take no arguments and therefore always check its argument against mWhitespaces?  Callers can use CheckChar if they want.
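
i.e. after that change a caller would write something like:

  while (l.CheckWhite()) {
    // consumes any run of characters from mWhitespaces
  }

and use l.CheckChar(' ') when one specific character is required.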

@@ +128,5 @@
> +   * cursor position and return true.  Otherwise false.
> +   */
> +  bool CheckChar(char const aChar) { return Check(Token::Char(aChar)); }
> +  /**
> +   * This is a custumizable version of CheckChar.  aClassifier is a function called with

Nit: "customizable"

@@ +153,5 @@
> +
> +  /**
> +   * Returns the read cursor position back as it was before the last call of any parsing
> +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> +   * Works only once.

This last sentence really means "works only once per read token", right?  i.e. something like:

l.CheckChar(...)
Rollback()
l.CheckChar(...)
Rollback()

is allowed, but not:

l.CheckChar(...)
Rollback()
Rollback()

correct?  Please update the documentation to reflect that.  If you could make Rollback() assert in the incorrect usage case, that would be even better.
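
Something along these lines would do (just a sketch; mCanRollback is a made-up member name that Next()/Check*()/Skip*() would set and Rollback() would clear):

  void
  Lexan::Rollback()
  {
    MOZ_ASSERT(mCanRollback,
               "Rollback() called twice without a parsing call in between");
    mCanRollback = false;
    mCursor = mRollback;
  }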

@@ +161,5 @@
> +  /**
> +   * Start the process of recording.  Based on aInclude value the begining of the recorded
> +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> +   * operation call.

What does "so as the read position was before the last parsing operation call" mean?

@@ +174,5 @@
> +
> +protected:
> +  // true if we have already read the EOF token.
> +  bool HasInput() const;
> +  // Main prasing function, it doesn't shift the read cursor, just returns the next

Nit: "parsing"

@@ +193,5 @@
> +
> +  // true iff we have already read the EOF token
> +  bool mPastEof : 1;
> +  // true iff the last Check*() call has returned false, reverts to true on Rollback() call
> +  bool mHasFailed : 1;

I doubt we'll (a) be allocating a lot of these and (b) be allocating them anywhere but the stack.  So we might as well make these non-bitfields to make the accesses more efficient.

::: xpcom/tests/TestLexan.cpp
@@ +1,1 @@
> +/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */

Please fix the modeline (tab-width: 8, c-basic-offset: 2).

@@ +26,5 @@
> +  if (xpcom.failed())
> +    return 1;
> +
> +#define CHECK(a) \
> +  MOZ_ASSERT(a); \

New tests going in to xpcom/ have typically been gtests, not straight cpptests.  You can either:

1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same behavior; or
2) Convert everything to a gtest (doing a search-and-replace will get you 90% of the way there).

@@ +42,5 @@
> +    "\r\n"
> +    "This is the body"));
> +
> +  r = l.CheckWord("HTTP");
> +  CHECK(r);

Whichever method you choose to modify CHECK, above, it might reduce a bit of visual clutter to convert things to:

CHECK(l.CheckWord("HTTP"));

(would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

@@ +112,5 @@
> +
> +
> +  // Synthetic code-specific test
> +
> +  Lexan l1(NS_LITERAL_CSTRING("test123 ,15  \t*\r\n%xx,-15\r\r"));

gtests would also enable separating out individual Lexan tests into functions...
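
e.g. roughly (just a sketch, not a full conversion):

  #include "gtest/gtest.h"
  #include "Lexan.h"  // or whatever the final header path ends up being

  TEST(Lexan, HTTPResponse)
  {
    Lexan l(NS_LITERAL_CSTRING("HTTP/1.0 304 Not modified\r\n"));
    ASSERT_TRUE(l.CheckWord("HTTP"));
    ASSERT_TRUE(l.CheckChar('/'));
    // ... one TEST() per logical scenario instead of one big main()
  }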
Attachment #8614806 - Flags: review?(nfroyd) → feedback+
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] [away 10 July through 19 July] from comment #8)
> Comment on attachment 8614806 [details] [diff] [review]
> v1
> 
> Review of attachment 8614806 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Apologies for the delay.
> 
> Where do you see this being used outside of netwerk/?  

I want to expose this to any consumer.  It can be used for parsing user input or file content; that's not limited to networking.

> grepping for strchr,
> for instance, doesn't yield very many possible use cases outside of netwerk/.

If you insist on moving this to netwerk, let me know.  It's true that the primary target is that module.

> 
> Giving this f+ so I can see a revised patch, but there's nothing major in
> the comments below.  Should be a quick r+ once comments are addressed.
> 
> ::: xpcom/string/Lexan.h
> @@ +10,5 @@
> > +#include "nsString.h"
> > +
> > +/**
> > + * This is a simple implementation of a lexical analyzer or maybe better
> > + * called a tokenizer.
> 
> This is bikeshed territory, but I think it would be better called Tokenizer.
> (It took me a while to work out that Lexan is a LEXical ANalyzer.)

Lexan is shorter, although cryptic.  And this class is really just the useful, universal subset of what lexical analyzers do.  Tokenizer it is.

> @@ +89,5 @@
> > +   * and value to aToken result and shift the cursor past this just parsed token.  Each call
> > +   * to Next() reads another token from the input and shifts the cursor.
> > +   * Returns false if we have passed the end of the input.
> > +   */
> > +  bool Next(Token &aToken);
> 
> I think all of these bool-returning methods should be MOZ_WARN_UNUSED_RESULT.

It's debatable.  But consumers can always use 'unused <<'.

> 
> @@ +121,5 @@
> > +  /**
> > +   * Check whitespace character is present, either a specified one by aChar argument
> > +   * or any from the mWhitespaces list.
> > +   */
> > +  bool CheckWhite(char const aChar = 0) { return Check(Token::Whitespace(aChar)); }
> 
> Having a default argument here seems redundant with |CheckChar|.  Why not
> make this function take no arguments and therefore always check its argument
> against mWhitespaces?  Callers can use CheckChar if they want.

I'll have to remember why exactly I made it this way.  It looks like a leftover from when the API was still evolving.

> @@ +153,5 @@
> > +
> > +  /**
> > +   * Returns the read cursor position back as it was before the last call of any parsing
> > +   * method of Lexan (Next, Check*, Skip*) so that the last operation can be repeated.
> > +   * Works only once.
> 
> This last sentence really means "works only once per read token", right? 
> i.e. something like:
> 
> l.CheckChar(...)
> Rollback()
> l.CheckChar(...)
> Rollback()
> 
> is allowed, but not:
> 
> l.CheckChar(...)
> Rollback()
> Rollback()
> 
> correct?  Please update the documentation to reflect that.  

I thought the sentence "Works only once" made it clear.  But I'll think of a better description.

> If you could
> make Rollback() assert in the incorrect usage case, that would be even
> better.

Yup (should be easy).

> 
> @@ +161,5 @@
> > +  /**
> > +   * Start the process of recording.  Based on aInclude value the begining of the recorded
> > +   * sub-string is at the current position (EXCLUDE_LAST) or at the position of the last
> > +   * parsed token (INCLUDE_LAST) so as the read position was before the last parsing
> > +   * operation call.
> 
> What does "so as the read position was before the last parsing operation
> call" mean?

Probably my inefficient English.  I'll rephrase; probably just s/so as/as if/?

> New tests going in to xpcom/ have typically been gtests, not straight
> cpptests.  You can either:
> 
> 1) Make this MOZ_RELEASE_ASSERT, so opt and debug builds get the same
> behavior; or
> 2) Convert everything to a gtest (doing a search-and-replace will get you
> 90% of the way there).

The test doesn't compile as it is anyway.  I'll try to turn it into a gtest.  Thanks for the tip.

> CHECK(l.CheckWord("HTTP"));
> 
> (would be particularly clear with gtests: ASSERT_TRUE(l.CheckWord("HTTP")) )

Hmm.. having a local var is IMO a bit better; it helps with setting conditional breakpoints.  But this class is relatively simple, so your approach is probably clearer.
Attached patch v2 (obsolete) (deleted) — Splinter Review
See comment 9
Attachment #8614806 - Attachment is obsolete: true
Attachment #8634051 - Flags: review?(nfroyd)
Comment on attachment 8634051 [details] [diff] [review]
v2

Review of attachment 8634051 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the changes below.

::: xpcom/string/Tokenizer.cpp
@@ +8,5 @@
> +
> +#include "nsUnicharUtils.h"
> +#include "mozilla/CheckedInt.h"
> +
> +static const char* sWhitespaces = " \t";

Please make this |static const char sWhitespaces[] = " \t";|, as that makes it take up slightly less space in binary form.

@@ +20,5 @@
> +  mRecord = mRollback = mCursor;
> +  aSource.EndReading(mEnd);
> +}
> +
> +bool Tokenizer::Next(Token &aToken)

Two style nits: this should be formatted as:

bool
Tokenizer::Next(Token& aToken)

and so on throughout the patch.  The header file also looks like it has some |Token &aToken| style nits; please fix those up as well.

::: xpcom/string/Tokenizer.h
@@ +15,5 @@
> + *
> + * It is limited only to ASCII input for now. UTF-8 or any other input
> + * encoding must yet be implemented.
> + */
> +class Tokenizer {

Please wrap all this in a |namespace mozilla|.

@@ +62,5 @@
> +
> +    TokenType Type() const { return mType; }
> +    char AsChar() const { return mChar; }
> +    nsCString AsString() const { return mWord; }
> +    int64_t AsInteger() const { return mInteger; }

These As* methods should include asserts that the token is of the correct type.
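
e.g. (sketch; I'm guessing at the enumerator names here):

  char AsChar() const
  {
    MOZ_ASSERT(mType == TOKEN_CHAR, "Not a character token");
    return mChar;
  }

and similarly for AsString()/AsInteger().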

@@ +70,5 @@
> +  Tokenizer(const nsACString& aSource);
> +
> +  /**
> +   * Some methods are collecting the input as it is being parsed to obtain
> +   * a substring between particular syntax bounderies defined by descent.

What does "defined by descent" refer to here?

::: xpcom/string/moz.build
@@ +35,5 @@
>      'nsXPIDLString.h',
>      'string-template-def-char.h',
>      'string-template-def-unichar.h',
>      'string-template-undef.h',
> +    'Tokenizer.h',

Apologies for not noticing this previously: can you put this someplace other than xpcom/string/?  xpcom/ds/ looks like it has some tokenizer-related stuff already, so putting it there would be fine.

Also, this should go in EXPORTS.mozilla.

::: xpcom/tests/gtest/TestTokenizer.cpp
@@ +14,5 @@
> +         c == '_' ||
> +         c == '-';
> +}
> +
> +TEST(Tokenizer, HTTPResponse)

Thanks for separating these out into individual tests!
Attachment #8634051 - Flags: review?(nfroyd) → review+
Attached patch v2.1 (obsolete) (deleted) — Splinter Review
Nathan, thanks for the reviews!
Attachment #8634051 - Attachment is obsolete: true
Attachment #8637850 - Flags: review+
Keywords: checkin-needed
Please provide a Try link :)
Keywords: checkin-needed
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134

There is just one test added (gtest under xpcom), and this new class is not integrated in the code anywhere but I don't know how to run only gtests on try.
Keywords: checkin-needed
(In reply to Honza Bambas (:mayhemer) from comment #14)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=cd54f6b2b134
> 
> There is just one test added (gtest under xpcom), and this new class is not
> integrated in the code anywhere but I don't know how to run only gtests on
> try.

The try run shows that this change causes a bustage, so I cannot check this in.
Bustage:
 /builds/slave/try-lx-d-000000000000000000000/build/src/xpcom/tests/gtest/TestTokenizer.cpp:14:27: error: suggest parentheses around '&&' within '||' [-Werror=parentheses]
Flags: needinfo?(honzab.moz)
Keywords: checkin-needed
Attached patch v2.2 (obsolete) (deleted) — Splinter Review
Linux build fixed; try is closed.  Will ask for checkin after the try run is done.
Attachment #8637850 - Attachment is obsolete: true
Flags: needinfo?(honzab.moz)
Attachment #8638627 - Flags: review+
Attached patch v2.2 [correct reviewer] (obsolete) (deleted) — Splinter Review
Attachment #8638627 - Attachment is obsolete: true
Attachment #8638642 - Flags: review+
Attached patch v2.3 (deleted) — Splinter Review
- fixed implicit constructor

https://treeherder.mozilla.org/#/jobs?repo=try&revision=fadd4b6e20a2
Attachment #8638642 - Attachment is obsolete: true
Attachment #8639251 - Flags: review+
Finally green!
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/397fd842ab54
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Blocks: 1188983
Depends on: 1190980
No longer depends on: 1190980