Closed Bug 1460233 Opened 6 years ago Closed 6 years ago

Percent-encode "fallback" code points in URL's query state

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla64

Tracking Flags:

Tracking

Status

firefox64

---

fixed

People

(Reporter: annevk, Assigned: hsivonen)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

Bug 1460233 - Percent-encode ampersand and colon when replacing unmappable code points in URL query state. 6 years ago Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11) (deleted), text/x-phabricator-request		Details

Anne (:annevk)

Reporter

Description

•

6 years ago

The URL parser takes an encoding. This can result code points that cannot be encoded in that encoding to end up as &%23...; in the query of the URL.

We should not just encode the # when doing this encoding, but also encode & and ; as %26 and %3B respectively.

See https://github.com/whatwg/url/pull/386 for the change to the URL Standard.

See https://github.com/whatwg/encoding/issues/139 for additional rationale.

See https://github.com/w3c/web-platform-tests/pull/10915 for test coverage.

Valentin Gosu [:valentin] (he/him)

Updated

•

6 years ago

Assignee: nobody → valentin.gosu

Priority: -- → P3

Whiteboard: [necko-triaged]

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 1

•

6 years ago

This makes us look bad on https://wpt.fyi/results/encoding , because the test harness depends on the ampersand to be escaped.

Valentin Gosu [:valentin] (he/him)

Comment 2

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #1)
> This makes us look bad on https://wpt.fyi/results/encoding , because the
> test harness depends on the ampersand to be escaped.

I'll try to prioritize this.

Could you provide some feedback regarding my approach?
We currently encode the entire query [1], making it difficult to differentiate between `&` and `;` that are in the original string vs ones produced by the encoder.
We could do this character by character (assuming the perf hit isn't too big) - or we could try do that only when there are instances of &# in the output of the encoder. Or is there a simpler approach I'm not seeing?

Thanks!

[1] https://searchfox.org/mozilla-central/rev/3fdc491e118c5cdfbaf6e2d52f3466d2b27ad1de/netwerk/base/nsStandardURL.cpp#120

Flags: needinfo?(hsivonen)

Priority: P3 → P2

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 3

•

6 years ago

(In reply to Valentin Gosu [:valentin] from comment #2)
> Could you provide some feedback regarding my approach?
> We currently encode the entire query [1], making it difficult to
> differentiate between `&` and `;` that are in the original string vs ones
> produced by the encoder.
> We could do this character by character (assuming the perf hit isn't too
> big) - or we could try do that only when there are instances of &# in the
> output of the encoder. Or is there a simpler approach I'm not seeing?

I haven't measured the perf effect of doing it character by character. The mozilla::Encoder API isn't really designed for that perf-wise, though.

Scanning for pre-existing & and ; and then passing everything in between those characters to mozilla::Encoder should produce equivalent results to doing it character by character, but the scanning would introduce its own overhead.

The return tuple has a boolean that says whether there were replacements, so it doesn't make sense to examine the output to see if there were. So one option would be to first try to encode the whole string and if there were replacements, encoding it *again* piece-wise.

AFAICT, the only way to avoid examining a given input character more than once would be to run the encoder in the WithoutReplacement mode and implement the formatting of unmappable code points in the caller, i.e. copy and paste https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into the caller with different output in place of & and ; and running the usual URL escaping on the other output segments. (If the encoding library was to provide that functionality, it would *also* have to provide the URL percent escaping functionality for all the rest of the output. Otherwise, the percent signs produced by the encoding library would be re-escaped by the URL layer.)

Flags: needinfo?(hsivonen)

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 4

•

6 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #3)
> AFAICT, the only way to avoid examining a given input character more than
> once would be to run the encoder in the WithoutReplacement mode and
> implement the formatting of unmappable code points in the caller, i.e. copy
> and paste
> https://github.com/hsivonen/encoding_rs/blob/master/src/lib.rs#L4679 into
> the caller with different output in place of & and ; and running the usual
> URL escaping on the other output segments.

This is probably the right way to implement this. (Which is rather sad in terms of what it means for the utility of the NCR generation capability of encoding_rs: The capability of encoding_rs is then useful only for form submission. That is, it covers only one of the two use cases.)

Anne (:annevk)

Reporter

Updated

•

6 years ago

Blocks: encoding

Anne (:annevk)

Reporter

Updated

•

6 years ago

Blocks: 1410139

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 5

•

6 years ago

Let's see how this goes with WPT:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9eaea538c1445982412a00c49fa4d18ed02e2057

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 6

•

6 years ago

Sorry about taking the bug, but this is blocking me from seeing the actual encoding correctness situation on WPT.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=41f3b77e16ea560097e6ca233346f1e7ab916b1f

Assignee: valentin.gosu → hsivonen

Status: NEW → ASSIGNED

Valentin Gosu [:valentin] (he/him)

Comment 7

•

6 years ago

No problem - I had tried to do it myself a few weeks ago, but I was unfamiliar with how EncodeFromUTF8WithoutReplacement works so my progress was very slow.
Thanks for working on this!

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 8

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a60492039b3a6153e2b0947e3571509a07b430b2

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 9

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7cc1f5627eb58d13a85b94d3f7fb71430a0b43ba

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 11

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6e6796ee9b20d8be134028f16dfbed34e46cc7c4

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 12

•

6 years ago

Attached file Bug 1460233 - Percent-encode ampersand and colon when replacing unmappable code points in URL query state. (deleted) — Details

Spec change: https://github.com/whatwg/url/pull/386

MozReview-Commit-ID: Fa84kCNghtU

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 13

•

6 years ago

Try run to check that compilers on different platforms are OK with an assertion being unreachable code:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fe8e2a439b04e78e18c9451af452f2d0092ed703

Pulsebot

Comment 14

•

6 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/948a4673220c
Percent-encode ampersand and colon when replacing unmappable code points in URL query state. r=valentin

Cosmin Sabou [:CosminS]

Comment 15

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/948a4673220c

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox64: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla64

You need to log in before you can comment on or make changes to this bug.