Closed Bug 1460233 Opened 6 years ago Closed 6 years ago

Percent-encode "fallback" code points in URL's query state

Categories

(Core :: Networking, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: annevk, Assigned: hsivonen)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

The URL parser takes an encoding. This can result code points that cannot be encoded in that encoding to end up as &%23...; in the query of the URL.

We should not just encode the # when doing this encoding, but also encode & and ; as %26 and %3B respectively.

See https://github.com/whatwg/url/pull/386 for the change to the URL Standard.

See https://github.com/whatwg/encoding/issues/139 for additional rationale.

See https://github.com/w3c/web-platform-tests/pull/10915 for test coverage.
Assignee: nobody → valentin.gosu
Priority: -- → P3
Whiteboard: [necko-triaged]
This makes us look bad on https://wpt.fyi/results/encoding , because the test harness depends on the ampersand to be escaped.
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> This makes us look bad on https://wpt.fyi/results/encoding , because the
> test harness depends on the ampersand to be escaped.

I'll try to prioritize this.

Could you provide some feedback regarding my approach?
We currently encode the entire query [1], making it difficult to differentiate between `&` and `;` that are in the original string vs ones produced by the encoder.
We could do this character by character (assuming the perf hit isn't too big) - or we could try do that only when there are instances of &# in the output of the encoder. Or is there a simpler approach I'm not seeing?

Thanks!

[1] https://searchfox.org/mozilla-central/rev/3fdc491e118c5cdfbaf6e2d52f3466d2b27ad1de/netwerk/base/nsStandardURL.cpp#120
Flags: needinfo?(hsivonen)
Priority: P3 → P2
(In reply to Valentin Gosu [:valentin] from comment #2)
> Could you provide some feedback regarding my approach?
> We currently encode the entire query [1], making it difficult to
> differentiate between `&` and `;` that are in the original string vs ones
> produced by the encoder.
> We could do this character by character (assuming the perf hit isn't too
> big) - or we could try do that only when there are instances of &# in the
> output of the encoder. Or is there a simpler approach I'm not seeing?

I haven't measured the perf effect of doing it character by character. The mozilla::Encoder API isn't really designed for that perf-wise, though.

Scanning for pre-existing & and ; and then passing everything in between those characters to mozilla::Encoder should produce equivalent results to doing it character by character, but the scanning would introduce its own overhead.

The return tuple has a boolean that says whether there were replacements, so it doesn't make sense to examine the output to see if there were. So one option would be to first try to encode the whole string and if there were replacements, encoding it *again* piece-wise.

AFAICT, the only way to avoid examining a given input character more than once would be to run the encoder in the WithoutReplacement mode and implement the formatting of unmappable code points in the caller, i.e. copy and paste https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into the caller with different output in place of & and ; and running the usual URL escaping on the other output segments. (If the encoding library was to provide that functionality, it would *also* have to provide the URL percent escaping functionality for all the rest of the output. Otherwise, the percent signs produced by the encoding library would be re-escaped by the URL layer.)
Flags: needinfo?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #3)
> AFAICT, the only way to avoid examining a given input character more than
> once would be to run the encoder in the WithoutReplacement mode and
> implement the formatting of unmappable code points in the caller, i.e. copy
> and paste
> https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into
> the caller with different output in place of & and ; and running the usual
> URL escaping on the other output segments.

This is probably the right way to implement this. (Which is rather sad in terms of what it means for the utility of the NCR generation capability of encoding_rs: The capability of encoding_rs is then useful only for form submission. That is, it covers only one of the two use cases.)
Blocks: encoding
Blocks: 1410139
Sorry about taking the bug, but this is blocking me from seeing the actual encoding correctness situation on WPT.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=41f3b77e16ea560097e6ca233346f1e7ab916b1f
Assignee: valentin.gosu → hsivonen
Status: NEW → ASSIGNED
No problem - I had tried to do it myself a few weeks ago, but I was unfamiliar with how EncodeFromUTF8WithoutReplacement works so my progress was very slow.
Thanks for working on this!
Try run to check that compilers on different platforms are OK with an assertion being unreachable code:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe8e2a439b04e78e18c9451af452f2d0092ed703
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/948a4673220c
Percent-encode ampersand and colon when replacing unmappable code points in URL query state. r=valentin
https://hg.mozilla.org/mozilla-central/rev/948a4673220c
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: