Closed
Bug 1460233
Opened 6 years ago
Closed 6 years ago
Percent-encode "fallback" code points in URL's query state
Categories
(Core :: Networking, enhancement, P2)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla64
Tracking | Status | |
---|---|---|
firefox64 | --- | fixed |
People
(Reporter: annevk, Assigned: hsivonen)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [necko-triaged])
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
Details |
The URL parser takes an encoding. This can result code points that cannot be encoded in that encoding to end up as &%23...; in the query of the URL. We should not just encode the # when doing this encoding, but also encode & and ; as %26 and %3B respectively. See https://github.com/whatwg/url/pull/386 for the change to the URL Standard. See https://github.com/whatwg/encoding/issues/139 for additional rationale. See https://github.com/w3c/web-platform-tests/pull/10915 for test coverage.
Updated•6 years ago
|
Assignee: nobody → valentin.gosu
Priority: -- → P3
Whiteboard: [necko-triaged]
Assignee | ||
Comment 1•6 years ago
|
||
This makes us look bad on https://wpt.fyi/results/encoding , because the test harness depends on the ampersand to be escaped.
Comment 2•6 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #1) > This makes us look bad on https://wpt.fyi/results/encoding , because the > test harness depends on the ampersand to be escaped. I'll try to prioritize this. Could you provide some feedback regarding my approach? We currently encode the entire query [1], making it difficult to differentiate between `&` and `;` that are in the original string vs ones produced by the encoder. We could do this character by character (assuming the perf hit isn't too big) - or we could try do that only when there are instances of &# in the output of the encoder. Or is there a simpler approach I'm not seeing? Thanks! [1] https://searchfox.org/mozilla-central/rev/3fdc491e118c5cdfbaf6e2d52f3466d2b27ad1de/netwerk/base/nsStandardURL.cpp#120
Flags: needinfo?(hsivonen)
Priority: P3 → P2
Assignee | ||
Comment 3•6 years ago
|
||
(In reply to Valentin Gosu [:valentin] from comment #2) > Could you provide some feedback regarding my approach? > We currently encode the entire query [1], making it difficult to > differentiate between `&` and `;` that are in the original string vs ones > produced by the encoder. > We could do this character by character (assuming the perf hit isn't too > big) - or we could try do that only when there are instances of &# in the > output of the encoder. Or is there a simpler approach I'm not seeing? I haven't measured the perf effect of doing it character by character. The mozilla::Encoder API isn't really designed for that perf-wise, though. Scanning for pre-existing & and ; and then passing everything in between those characters to mozilla::Encoder should produce equivalent results to doing it character by character, but the scanning would introduce its own overhead. The return tuple has a boolean that says whether there were replacements, so it doesn't make sense to examine the output to see if there were. So one option would be to first try to encode the whole string and if there were replacements, encoding it *again* piece-wise. AFAICT, the only way to avoid examining a given input character more than once would be to run the encoder in the WithoutReplacement mode and implement the formatting of unmappable code points in the caller, i.e. copy and paste https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into the caller with different output in place of & and ; and running the usual URL escaping on the other output segments. (If the encoding library was to provide that functionality, it would *also* have to provide the URL percent escaping functionality for all the rest of the output. Otherwise, the percent signs produced by the encoding library would be re-escaped by the URL layer.)
Flags: needinfo?(hsivonen)
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #3) > AFAICT, the only way to avoid examining a given input character more than > once would be to run the encoder in the WithoutReplacement mode and > implement the formatting of unmappable code points in the caller, i.e. copy > and paste > https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into > the caller with different output in place of & and ; and running the usual > URL escaping on the other output segments. This is probably the right way to implement this. (Which is rather sad in terms of what it means for the utility of the NCR generation capability of encoding_rs: The capability of encoding_rs is then useful only for form submission. That is, it covers only one of the two use cases.)
Assignee | ||
Comment 5•6 years ago
|
||
Let's see how this goes with WPT: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9eaea538c1445982412a00c49fa4d18ed02e2057
Assignee | ||
Comment 6•6 years ago
|
||
Sorry about taking the bug, but this is blocking me from seeing the actual encoding correctness situation on WPT. https://treeherder.mozilla.org/#/jobs?repo=try&revision=41f3b77e16ea560097e6ca233346f1e7ab916b1f
Assignee: valentin.gosu → hsivonen
Status: NEW → ASSIGNED
Comment 7•6 years ago
|
||
No problem - I had tried to do it myself a few weeks ago, but I was unfamiliar with how EncodeFromUTF8WithoutReplacement works so my progress was very slow. Thanks for working on this!
Assignee | ||
Comment 8•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a60492039b3a6153e2b0947e3571509a07b430b2
Assignee | ||
Comment 9•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7cc1f5627eb58d13a85b94d3f7fb71430a0b43ba
Assignee | ||
Comment 11•6 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6e6796ee9b20d8be134028f16dfbed34e46cc7c4
Assignee | ||
Comment 12•6 years ago
|
||
Spec change: https://github.com/whatwg/url/pull/386 MozReview-Commit-ID: Fa84kCNghtU
Assignee | ||
Comment 13•6 years ago
|
||
Try run to check that compilers on different platforms are OK with an assertion being unreachable code: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe8e2a439b04e78e18c9451af452f2d0092ed703
Comment 14•6 years ago
|
||
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/948a4673220c Percent-encode ampersand and colon when replacing unmappable code points in URL query state. r=valentin
Comment 15•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/948a4673220c
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
status-firefox64:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
You need to log in
before you can comment on or make changes to this bug.
Description
•