Closed Bug 464314 Opened 16 years ago Closed 16 years ago

view-source link-browsing doesn't decode entities

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.2a1

People

(Reporter: dao, Assigned: cbartley)

References

(Blocks 1 open bug)

Details

(Keywords: fixed1.9.1)

Attachments

(3 files, 2 obsolete files)

View Source linkification now correctly decodes entities 16 years ago Curtis Bartley [:cbartley] (deleted), patch	mrbkap : review-	Details \| Diff \| Splinter Review
Simple manual testcase for entity decoding 16 years ago Curtis Bartley [:cbartley] (deleted), text/html		Details
Unit test for view source entity decoding patch (reference only) 16 years ago Curtis Bartley [:cbartley] (deleted), patch		Details \| Diff \| Splinter Review
View Source linkification now correctly decodes entities v2 16 years ago Curtis Bartley [:cbartley] (deleted), patch	mrbkap : review+ mrbkap : superreview+	Details \| Diff \| Splinter Review
View Source linkification now correctly decodes entities v3 16 years ago Curtis Bartley [:cbartley] (deleted), patch	mrbkap : review+ mrbkap : superreview+	Details \| Diff \| Splinter Review

Dão Gottwald [:dao]

Reporter

Description

•

16 years ago

If you view the source of http://example.com/ and it contains a link to "/?foo=1&bar=2", this will lead to view-source:http://example.com/?foo=1&bar=2 instead of view-source:http://example.com/?foo=1&bar=2.

Dão Gottwald [:dao]

Reporter

Updated

•

16 years ago

OS: Windows XP → All

Hardware: PC → All

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

16 years ago

Assignee: nobody → cbartley

Curtis Bartley [:cbartley]

Assignee

Comment 1

•

16 years ago

This bug pretty seriously impacts view source linkification so I think we should absolutely want to fix it for 3.1. I have a patch that is very nearly ready for review.

Flags: blocking1.9.1?

Curtis Bartley [:cbartley]

Assignee

Comment 2

•

16 years ago

Attached patch View Source linkification now correctly decodes entities (obsolete) (deleted) — Details — Splinter Review

I had two choices here: 1. Modify the parser code to return both raw source and source with entities decoded. or 2. Duplicate some entity decoding logic in nsViewSourceHTML.cpp Neither solution is ideal. Duplicating logic is not good (DRY is your friend) but making already crufty parser code even more crufty doesn't seem like a great alternative. I opted for duplicating the logic because I had more confidence that I could get solid and correct code that way, and I believed I could do it faster. (Faster here being a relative term, but I digress.) The new code could also be structured in a way that would make it easy to unit test, providing another level of confidence. I in fact do have some simple unit tests for this new code, but I'm not including them in this patch because I was never able to modify the build system to build the unit tests reliably. I may upload the unit tests as another patch for reference.

Attachment #352879 - Flags: review?(mrbkap)

Curtis Bartley [:cbartley]

Assignee

Comment 3

•

16 years ago

Attached file Simple manual testcase for entity decoding (deleted) — Details

Simple manual testcase for attachment 352879 [details] [diff] [review]: View Source linkification now correctly decodes entities

Curtis Bartley [:cbartley]

Assignee

Comment 4

•

16 years ago

Attached patch Unit test for view source entity decoding patch (reference only) (deleted) — Details — Splinter Review

Unit tests for the patch "attachment 352879 [details] [diff] [review]: View Source linkification now correctly decodes entities". Included only for reference, I'm not requesting review. A complete patch would include makefile changes necessary to build the unit test. These makefile changes have turned out to be much more difficult than I anticipated, by which I mean I still haven't figured them out.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Comment 5

•

16 years ago

Can we abstract out the entity decoding code so that it lives only in one place? Or is it too entwined with the parser innards?

Curtis Bartley [:cbartley]

Assignee

Comment 6

•

16 years ago

It's pretty entwined. See static nsresult ConsumeEntity(nsScannerSharedSubstring& aString, nsScanner& aScanner, PRBool aIECompatible, PRInt32 aFlag) in parser/htmlparser/src/nsHTMLTokens.cpp.

Curtis Bartley [:cbartley]

Assignee

Comment 7

•

16 years ago

(In reply to comment #6) > It's pretty entwined. See ... I think it may be helpful if I elaborate on my thinking. Normally the parser generates tokens with entities decoded to characters for their text content. However, when parsing for view source, the parser stores the raw text in the tokens, since that's what we want to see in the view source window. With linkification, however, there are cases where we need both kinds of text -- when we are displaying the value of an HREF attribute on an anchor, we want to display the undecoded source text in the view source window, but we want to use the decoded text to build the link we're going to attach to that token. There are a couple of approaches that could be used here. 1. Modify the parser to return both kinds of text when the view source flag is set. This would really only need to be done on attribute value tokens. 2. Leave the parser as-is and decode any entities in the attribute values separately when building URLs in nsViewSourceHTML.cpp. For approach 2, there are a couple of variations: a) Re-use the parser logic by calling into it at the right place to simply decode the string containing the attribute value b) Write duplicate parsing logic to identify entities in the attribute value, and then call into the parser logic at the right place to convert the entities to unicode c) Duplicate the parsing logic *and* the numeric entity decoding, but rely on parser logic to decode symbolic entities. d) Duplicate all the logic including numeric and symbolic entity decoding. Now to be fair, there's no real reason to duplicate the symbolic entity decoding since that's already been nicely factored out into the nsHTMLEntities::EntityToUnicode() function, and there's no reason not to call it directly. So we can exclude d) as an option. Now to step back for a second -- approach 1 requires modifying the parser, and modifying it in a way that feels pretty hacky. Since the parser already has special casing inside it specifically for view source, maybe that's not such a big deal. Nevertheless modifying the parser is a riskier change and would require me to invest a fair amount of time in understanding some portion of the parser code so that I could confidently make the changes. So this approach seemed like more risk, more work, and more hacky. Feeling the heat of an impending deadline, I decided not to take this approach. I discounted approaches 2a and 2b because after a cursory inspection of the parsing code that handles entities, it didn't look like there was an easy way to call into this code. It would be necessary at the very least to set up a dummy scanner instance, and not only would this probably be ugly, but it would also introduce a dependency from outside into parser internals that looked like they weren't ever designed to be used that way. So I ended up using approach 2c. This approach involves duplicating the fairly simple entity parsing code and numeric entity conversion, but relies on the existing symbolic entity conversion capability provided by nsHTMLEntities::EntityToUnicode(). In an ideal situation I'd prefer not to duplicate logic like this at all. If the existing parser logic had been better factored out or if I felt comfortable doing refactoring of the existing parser logic (scary!) to make it easier to call, then I'd be all for that. However, given the constraints here it seemed like duplicating some logic that was actually fairly simple seemed like the lesser of evils. One advantage of writing new logic in this case is it's probably a lot easier to code review since it's self-contained, pretty aggressively documented, and easy to test in isolation. Also a note on if-statement style. I usually write: if (condition) { printf("hello, world"); } but if (condition) return "hello, world"; That is to say, if my if-statement is lacking braces, it is: a) a return statement (or a break) b) all on the same line I won't ever write if (condition) return "hello, world"; If I felt the need to put the return on a separate line, I'd always use braces. I use this one line return style because I think in many cases it helps readability (see the IsTokenValueTrimmableCharacter() function in attachment 346379 [details] [diff] [review] for an example) and usually doesn't hurt in other cases. That said, I don't think it will make any real difference in this patch, so I'm not making an impassioned argument to keep the style here. All I want to do is communicate that I put some thought into the issue, it's not just me being sloppy. Anyway, I apologize if this comment comes off as being pedantic. Based on comments here and on IRC (back before Christmas), it seems like there is some question about why I did the things the way I did. I figured it wouldn't hurt to elaborate a priori rather than answer them as part of the code review dialog.

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Blocks: 472846

Blake Kaplan (:mrbkap) (inactive)

Comment 8

•

16 years ago

Comment on attachment 352879 [details] [diff] [review] View Source linkification now correctly decodes entities I'm not happy with the duplication of logic here, but I agree that this is probably better than trying to shoehorn the parser stuff into being usable from anywhere. >diff -r 055cb77c977c parser/htmlparser/src/nsViewSourceHTML.cpp >@@ -995,18 +996,20 @@ PRBool CViewSourceHTML::IsUrlAttribute(c >+ const nsAString& baseSpec = TrimTokenValue(attrValue); >+ nsString expandedBaseSpec; This wants to be an nsAutoString to avoid unnecessary mallocs. >@@ -1071,18 +1074,21 @@ nsresult CViewSourceHTML::CreateViewSour >+ nsString expandedLinkUrl; Ditto. >@@ -1200,8 +1206,154 @@ nsresult CViewSourceHTML::SetBaseURI(con >+void CViewSourceHTML::ExpandEntities(const nsAString& textIn, nsString& textOut) { >+ Nits: Put the { on its own line for function declarations and nix the extra empty line here. >+ nsAString::const_iterator iter, end; >+ textIn.BeginReading(iter); >+ textIn.EndReading(end); >+ >+ // The outer loop treats the input as a sequence of pairs of runs. The first >+ // run of each pair is just a run of regular characters. The second run is >+ // something that looks like it might be an entity reference, e.g. "&". >+ // Any regular run may be empty, and the entity run can be skipped at the end >+ // of the text. Apparent entities that can't be translated are copied >+ // verbatim. In particular this allows for raw ampersands in the input. >+ // Special care is taken to handle the end of the text at any point inside >+ // the loop. >+ while (iter != end) { >+ Nit: nix the extra newline here. >+ // Copy characters to textOut until but not including the first ampersand. >+ for (; iter != end; ++iter) { >+ PRUnichar ch = *iter; >+ if (ch == kAmpersand) break; Here and below: put the |break;| on its own line. Otherwise, it's too easy to miss it when reading the loop. >+ } >+ } >+ } >+ Nit: Nix this extra blank line. >+ } >+ >+} >+ >+PRInt32 CViewSourceHTML::ExpandEntity(const nsAString::const_iterator& start, >+ const nsAString::const_iterator& end) { Same nit about indentation and brace here. >+ // Decode the entity to a Unicode character. >+ switch (entityType) { >+ case TYPE_ID: >+ return nsHTMLEntities::EntityToUnicode(entityBody); Need to imitate ConsumeEntity here and only return entities < 256 or ending with ;. r- on account of that. Looks pretty good otherwise, though.

Attachment #352879 - Flags: review?(mrbkap) → review-

Johnny Stenback (:jst)

Comment 9

•

16 years ago

Sounds like this is something that significantly limits the usability of the linkification we do in view source, which is one of the new features in 3.1 that's been talked about a fair bit, so we should probably fix this. The patch looks close to ready, and looks safe (due to the duplication of code rather than invasive changes to the parser). Blocking.

Flags: blocking1.9.1? → blocking1.9.1+

Priority: -- → P2

Target Milestone: --- → mozilla1.9.1

Curtis Bartley [:cbartley]

Assignee

Comment 10

•

16 years ago

Attached patch View Source linkification now correctly decodes entities v2 (obsolete) (deleted) — Details — Splinter Review

Note that I rewrote a substantial chunk of the code from the previous patch. The new CopyPossibleEntity() function subsumes both the parsing and expansion of individual entities (or entity like things). I think the code is cleaner now, and it allows for missing semicolons on eight-bit entities mirroring the parser code. My point-by-point comments are below, surrounded by "***"s. -------- (From update of attachment 352879 [details] [diff] [review]) I'm not happy with the duplication of logic here, but I agree that this is probably better than trying to shoehorn the parser stuff into being usable from anywhere. >diff -r 055cb77c977c parser/htmlparser/src/nsViewSourceHTML.cpp >@@ -995,18 +996,20 @@ PRBool CViewSourceHTML::IsUrlAttribute(c >+ const nsAString& baseSpec = TrimTokenValue(attrValue); >+ nsString expandedBaseSpec; This wants to be an nsAutoString to avoid unnecessary mallocs. *** done *** >@@ -1071,18 +1074,21 @@ nsresult CViewSourceHTML::CreateViewSour >+ nsString expandedLinkUrl; Ditto. *** done *** >@@ -1200,8 +1206,154 @@ nsresult CViewSourceHTML::SetBaseURI(con >+void CViewSourceHTML::ExpandEntities(const nsAString& textIn, nsString& textOut) { >+ Nits: Put the { on its own line for function declarations and nix the extra empty line here. *** done *** >+ nsAString::const_iterator iter, end; >+ textIn.BeginReading(iter); >+ textIn.EndReading(end); >+ >+ // The outer loop treats the input as a sequence of pairs of runs. The first >+ // run of each pair is just a run of regular characters. The second run is >+ // something that looks like it might be an entity reference, e.g. "&". >+ // Any regular run may be empty, and the entity run can be skipped at the end >+ // of the text. Apparent entities that can't be translated are copied >+ // verbatim. In particular this allows for raw ampersands in the input. >+ // Special care is taken to handle the end of the text at any point inside >+ // the loop. >+ while (iter != end) { >+ Nit: nix the extra newline here. *** done *** >+ // Copy characters to textOut until but not including the first ampersand. >+ for (; iter != end; ++iter) { >+ PRUnichar ch = *iter; >+ if (ch == kAmpersand) break; Here and below: put the |break;| on its own line. Otherwise, it's too easy to miss it when reading the loop. *** done *** >+ } >+ } >+ } >+ Nit: Nix this extra blank line. *** done *** >+ } >+ >+} >+ >+PRInt32 CViewSourceHTML::ExpandEntity(const nsAString::const_iterator& start, >+ const nsAString::const_iterator& end) { Same nit about indentation and brace here. *** ExpandEntity() has been superceded by the new CopyPossiblEntity(). *** >+ // Decode the entity to a Unicode character. >+ switch (entityType) { >+ case TYPE_ID: >+ return nsHTMLEntities::EntityToUnicode(entityBody); Need to imitate ConsumeEntity here and only return entities < 256 or ending with ;. *** Done *** r- on account of that. Looks pretty good otherwise, though.

Attachment #352879 - Attachment is obsolete: true

Attachment #357198 - Flags: review?(mrbkap)

Blake Kaplan (:mrbkap) (inactive)

Updated

•

16 years ago

Attachment #357198 - Flags: superreview+

Attachment #357198 - Flags: review?(mrbkap)

Attachment #357198 - Flags: review+

Blake Kaplan (:mrbkap) (inactive)

Comment 11

•

16 years ago

Comment on attachment 357198 [details] [diff] [review] View Source linkification now correctly decodes entities v2 My only comment is to go through and avoid |if (blah) return| all on one line. Also, I like |// Empty| comments in empty loop bodies, but feel free to ignore that :-).

Curtis Bartley [:cbartley]

Assignee

Comment 12

•

16 years ago

Attached patch View Source linkification now correctly decodes entities v3 (deleted) — Details — Splinter Review

> (From update of attachment 357198 [details] [diff] [review]) > My only comment is to go through and avoid |if (blah) return| all on one line. Done. > Also, I like |// Empty| comments in empty loop bodies, but feel free to ignore > that :-). Done. [Am I supposed to keep asking for review every time I upload a modified patch?]

Attachment #357198 - Attachment is obsolete: true

Attachment #357299 - Flags: review?(mrbkap)

Blake Kaplan (:mrbkap) (inactive)

Updated

•

16 years ago

Attachment #357299 - Flags: superreview+

Attachment #357299 - Flags: review?(mrbkap)

Attachment #357299 - Flags: review+

Blake Kaplan (:mrbkap) (inactive)

Comment 13

•

16 years ago

(In reply to comment #12) > [Am I supposed to keep asking for review every time I upload a modified patch?] Usually, if a reviewer stamps the patch, you don't have to request re-review.

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Keywords: checkin-needed

Dão Gottwald [:dao]

Reporter

Comment 14

•

16 years ago

http://hg.mozilla.org/mozilla-central/rev/6a1b8efcecc8

Status: NEW → RESOLVED

Closed: 16 years ago

Keywords: checkin-needed

Resolution: --- → FIXED

Whiteboard: baking

Target Milestone: mozilla1.9.1 → mozilla1.9.2a1

Curtis Bartley [:cbartley]

Assignee

Comment 15

•

16 years ago

The patch applies cleanly to 1.9.1 as of a few minutes ago.

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Keywords: checkin-needed

Whiteboard: baking → [needs 191 landing]

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Keywords: checkin-needed

Whiteboard: [needs 191 landing]

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Attachment #357299 - Flags: approval1.9.1?

:Gavin Sharp [email: gavin@gavinsharp.com]

Updated

•

16 years ago

Attachment #357299 - Flags: approval1.9.1?

:Gavin Sharp [email: gavin@gavinsharp.com]

Comment 16

•

16 years ago

Comment on attachment 357299 [details] [diff] [review] View Source linkification now correctly decodes entities v3 This patch doesn't need approval, because it's fixing a blocker.

Dão Gottwald [:dao]

Reporter

Updated

•

16 years ago

Keywords: checkin-needed

Curtis Bartley [:cbartley]

Assignee

Updated

•

16 years ago

Whiteboard: [needs 191 landing]

Dão Gottwald [:dao]

Reporter

Comment 17

•

16 years ago

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/83d09b43b31f

Keywords: checkin-needed → fixed1.9.1

Whiteboard: [needs 191 landing]

You need to log in before you can comment on or make changes to this bug.