Open Bug 1842062 Opened 1 year ago Updated 11 months ago

What is the correct way to serialize a URL's path if it was parsed without one? When do we emit a slash?

Categories

(Core :: DOM: Networking, defect, P2)

People

(Reporter: twisniewski, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

While fixing bug 1347459 I've noticed that many WPTs appear to have inconsistent expectations about whether or not a URL with an empty path should end up serialized with a slash for the path or an empty string.

Gecko seems to always serialize an empty path as a slash, here in nsStandardURL.cpp:

  // path must always start with a "/"
  if (mPath.mLen <= 0) {
    LOG(("setting path=/"));
    mDirectory.mPos = mFilepath.mPos = mPath.mPos = i;
    mDirectory.mLen = mFilepath.mLen = mPath.mLen = 1;
    // basename must exist, even if empty (bug 113508)
    mBasename.mPos = i + 1;
    mBasename.mLen = 0;
    buf[i++] = '/';
  } else {

However, I suspect that we should not be adding the slash, based on the spec text:

The URL path serializer takes a URL url and then runs these steps. They return an ASCII string.
- If url has an opaque path, then return url’s path.
- Let output be the empty string.
- For each segment of url’s path: append U+002F (/) followed by segment to output.
- Return output.
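For reference, the four steps above can be sketched in a few lines of C++. This is only a minimal model of the spec text, not Gecko code: the names UrlPath, OpaquePath, PathSegments, SerializePath and CheckPathExamples are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Hypothetical model of a URL's path: either one opaque string,
// or a list of zero or more segments.
using OpaquePath = std::string;
using PathSegments = std::vector<std::string>;
using UrlPath = std::variant<OpaquePath, PathSegments>;

// Sketch of the URL path serializer steps quoted above.
std::string SerializePath(const UrlPath& path) {
  // Step 1: an opaque path is returned unchanged.
  if (const auto* opaque = std::get_if<OpaquePath>(&path)) {
    return *opaque;
  }
  // Steps 2-4: prefix every segment with "/" and concatenate.
  std::string output;
  for (const std::string& segment : std::get<PathSegments>(path)) {
    output += '/';
    output += segment;
  }
  return output;
}

// Usage example: the cases that matter for this bug.
void CheckPathExamples() {
  // An empty list serializes to the empty string, with no slash.
  assert(SerializePath(PathSegments{}) == "");
  // A list holding one empty segment serializes to a lone slash.
  assert(SerializePath(PathSegments{""}) == "/");
  assert(SerializePath(PathSegments{"a", "b"}) == "/a/b");
}
```

Under this reading, whether a slash appears depends entirely on whether the parsed path is an empty list or a list containing one empty segment; the serializer itself never adds a slash on its own.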

And yet, if I change that code to just not emit the slash, we fail many other WPTs, such as:

Setting <http://example.net>.protocol = 'https:foo : bar' Stuff after the first ':' is ignored - assert_equals: expected "https://example.net/" but got "https://example.net"

Setting <https://example.net>.search = '' - assert_equals: expected "https://example.net/" but got "https://example.net"

Setting <https://example.net>.hash = 'main' - assert_equals: expected "https://example.net/#main" but got "https://example.net#main"

Setting <file:///var/log/system.log>.href = 'http://0300.168.0xF0' - assert_equals: expected "http://192.168.0.240/" but got "http://192.168.0.240"

And yet other tests expect no slashes, which is why we are presently failing them:

Setting <ssh://me@example.net>.protocol = 'http' Can’t switch from non-special scheme to special - assert_equals: expected "ssh://me@example.net" but got "ssh://me@example.net/"

The WPT expectations don't appear to be based on some simple logic, such as only adding a slash if there is a hash or query, because there are tests expecting a URL with a query/hash to sometimes have the slash and other times not.

So I wonder if this is just the WPTs being written to match inconsistent serialization behavior in Blink/Webkit, or if I'm just reading the spec naively and missing some context which might make sense of this.

:kershaw, what do you think?

Flags: needinfo?(kershaw)
The URL path serializer takes a URL url and then runs these steps. They return an ASCII string.
- If url has an opaque path, then return url’s path.
- Let output be the empty string.
- For each segment of url’s path: append U+002F (/) followed by segment to output.
- Return output.

It looks to me like the problem here is how to determine whether a URL has an opaque path.
If it has an opaque path, we should return the opaque path without adding /.
If not, we should add /.

And yet, if I change that code to just not emit the slash, we fail many other WPTs, such as:

Setting <http://example.net>.protocol = 'https:foo : bar' Stuff after the first ':' is ignored - assert_equals: expected "https://example.net/" but got "https://example.net"

Setting <https://example.net>.search = '' - assert_equals: expected "https://example.net/" but got "https://example.net"

Setting <https://example.net>.hash = 'main' - assert_equals: expected "https://example.net/#main" but got "https://example.net#main"

Setting <file:///var/log/system.log>.href = 'http://0300.168.0xF0' - assert_equals: expected "http://192.168.0.240/" but got "http://192.168.0.240"

See the spec here:

A special URL’s path is always a list, i.e., it is never opaque. 

The URLs listed above are all special URLs, so they don't have an opaque path. This means we should always add /.

And yet other tests expect no slashes, which is why we are presently failing them:

Setting <ssh://me@example.net>.protocol = 'http' Can’t switch from non-special scheme to special - assert_equals: expected "ssh://me@example.net" but got "ssh://me@example.net/"

Apparently, the URL ssh://me@example.net has an opaque path me@example.net, so we should not add /.

Flags: needinfo?(kershaw)

Well well:

A URL has an opaque path if its path is a URL path segment.

A URL path segment is an ASCII string. It commonly refers to a directory or a file, but has no predefined meaning.

But thankfully, based on that note about special URLs, any URL that has a special scheme is considered non-opaque, so at least there's that much.

So I guess we'll just have to start by assuming that is the basis for opaqueness, and see if it's good enough to at least pass the tests.

Wouldn't ssh://me@example.net be non-opaque because ssh is non-special?

Flags: needinfo?(kershaw)
Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

(In reply to Ed Guloien [:edgul] from comment #3)

Wouldn't ssh://me@example.net be non-opaque because ssh is non-special?

Sorry, I don't quite understand your question. Could you explain more?

Flags: needinfo?(kershaw) → needinfo?(edgul)

Thanks for asking for followup, my previous phrasing was grossly insufficient. I'll try to be a little more comprehensive.

Lines of interest from the spec for determining opaqueness:

  1. A special URL’s path is always a list, i.e., it is never opaque.
  2. A URL’s path is either a URL path segment or a list of zero or more URL path segments, usually identifying a location. It is initially « ».
  3. A URL path segment is an ASCII string. It commonly refers to a directory or a file, but has no predefined meaning.
  4. A URL has an opaque path if its path is a URL path segment.

Also note that host serialization is handled before path serialization. [5]

So for ssh://me@example.com:

  • ssh is obviously not special, so we can’t decide right away about opaqueness with [1] alone.
  • me@example.com would be considered part of the host section, not the path section [5].
  • There is no path string [3], so the URL’s path must be a zero-length list of URL path segments [2] (not a URL path segment). We cannot determine opaqueness with [4].
  • This part is implicit, since the spec doesn’t outright say it, but I think it’s safe to assume that if [4] does not apply, the path is non-opaque. By this algorithm, ssh://me@example.com would be non-opaque.
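To make the reasoning above concrete, here is a rough C++ sketch of the opaqueness check as I read points [1] to [4]. Everything in it (UrlRecord, HasOpaquePath, IsSpecialScheme) is a hypothetical name for illustration, the record keeps only the scheme and the path, and the list of special schemes is the one from the URL Standard (ftp, file, http, https, ws, wss).

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

using OpaquePath = std::string;                  // [3]/[4]: a single URL path segment
using PathSegments = std::vector<std::string>;   // [2]: a list of zero or more segments

// Hypothetical URL record holding only what the reasoning needs.
struct UrlRecord {
  std::string scheme;
  // [2]: the path is initially the empty list.
  std::variant<OpaquePath, PathSegments> path = PathSegments{};
};

// Special schemes as listed in the URL Standard; lowercase input is assumed.
bool IsSpecialScheme(const std::string& scheme) {
  static const char* const kSpecial[] = {"ftp", "file", "http",
                                         "https", "ws", "wss"};
  for (const char* special : kSpecial) {
    if (scheme == special) {
      return true;
    }
  }
  return false;
}

// [4]: a URL has an opaque path if its path is a single segment, not a list.
bool HasOpaquePath(const UrlRecord& url) {
  const bool opaque = std::holds_alternative<OpaquePath>(url.path);
  // [1]: a special URL's path is always a list, so this must never fire.
  assert(!(opaque && IsSpecialScheme(url.scheme)));
  return opaque;
}
```

With this model, ssh://me@example.com would be UrlRecord{"ssh", PathSegments{}} and HasOpaquePath returns false for it, whereas something like mailto:user@example.com, whose path is the single string after the colon, would be UrlRecord{"mailto", OpaquePath{"user@example.com"}} and return true.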

To further add to the complexity, having a non-opaque path does not mean we always just add / to the path when serializing. The instructions below indicate that for ssh://me@example.com, which has no path segments (so the for-each doesn’t even run), we would just return the empty path. Also, as an aside, I'd like to draw attention to the 3rd step, because it tripped me up on my first reading: the / is appended before the segment in question. (I'm not fully aware of the impact, but it almost looks like we would never end up with a bare / unless it's possible to have a list of segments where the last element is the empty string.)

The URL path serializer takes a URL url and then runs these steps. They return an ASCII string.
* If url has an opaque path, then return url’s path.
* Let output be the empty string.
* For each segment of url’s path: append U+002F (/) followed by segment to output.
* Return output.

So to me, one of the concerns in question looks like an implementation issue, not a WPT issue. Namely:

Setting <ssh://me@example.net>.protocol = 'http' Can’t switch from non-special scheme to special - assert_equals: expected "ssh://me@example.net" but got "ssh://me@example.net/"
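As a sanity check on that expectation, here is a simplified sketch of whole-URL serialization in C++. It is not the full spec serializer (no password, port, query, or opaque path), and SimpleUrl and SerializeUrl are hypothetical names. It also relies on an assumption that is my reading of the parser and is not stated in this bug: a special URL written without a path ends up with a path list holding one empty segment, while a non-special URL with a host and nothing after it keeps the empty list.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified URL record.
struct SimpleUrl {
  std::string scheme;
  std::string username;                  // empty means no credentials
  std::optional<std::string> host;
  std::vector<std::string> path;         // list of segments, possibly empty
  std::optional<std::string> fragment;
};

std::string SerializeUrl(const SimpleUrl& url) {
  std::string out = url.scheme + ":";
  // The host (with any username) is written before the path.
  if (url.host) {
    out += "//";
    if (!url.username.empty()) {
      out += url.username + "@";
    }
    out += *url.host;
  }
  // Path serializer for a list path: "/" followed by each segment.
  for (const std::string& segment : url.path) {
    out += '/';
    out += segment;
  }
  if (url.fragment) {
    out += "#" + *url.fragment;
  }
  return out;
}

void CheckUrlExamples() {
  // Assumed parse result: non-special scheme, host present, empty path list.
  SimpleUrl ssh{"ssh", "me", std::string("example.net"), {}, std::nullopt};
  assert(SerializeUrl(ssh) == "ssh://me@example.net");

  // Assumed parse result: special scheme, path list with one empty segment.
  SimpleUrl https{"https", "", std::string("example.net"),
                  std::vector<std::string>{""}, std::string("main")};
  assert(SerializeUrl(https) == "https://example.net/#main");
}
```

If that assumption about the parsed path holds, both groups of WPT expectations fall out of the same serializer, and the difference lies in what the parser stores rather than in the serialization step.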

I have yet to apply the same thinking to some of the earlier mentioned examples, so this may still be incomplete. Please highlight anything I am missing.

Flags: needinfo?(edgul)

So looking at WebKit's code (since they pass the WPTs), they treat an opaque path simply as one where the URL's scheme is non-special, and its protocol (including the colon-forward-slash-forward-slash) is not followed by a third slash. Emulating this in our code lets us pass the 16 or so WPT URL sub-tests related directly to ssh without breaking others.
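A rough sketch of that heuristic, as described in the comment above, might look like the following. This is not WebKit's or Gecko's actual code: LooksOpaqueHeuristic and IsSpecialScheme are hypothetical names, the input is assumed to be an already-lowercased spec string, and returning false for inputs that lack "//" after the scheme is an arbitrary choice of this sketch, since the description does not cover them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Special schemes as listed in the URL Standard; lowercase input is assumed.
bool IsSpecialScheme(const std::string& scheme) {
  static const char* const kSpecial[] = {"ftp", "file", "http",
                                         "https", "ws", "wss"};
  for (const char* special : kSpecial) {
    if (scheme == special) {
      return true;
    }
  }
  return false;
}

// Heuristic: non-special scheme, and "scheme://" is not followed by a
// third slash. When true, no "/" would be added for an empty path.
bool LooksOpaqueHeuristic(const std::string& spec) {
  const std::size_t colon = spec.find(':');
  if (colon == std::string::npos) {
    return false;  // no scheme at all
  }
  if (IsSpecialScheme(spec.substr(0, colon))) {
    return false;  // special URLs never have an opaque path
  }
  if (spec.compare(colon + 1, 2, "//") != 0) {
    return false;  // not covered by the description; arbitrary choice here
  }
  const std::size_t afterSlashes = colon + 3;
  return afterSlashes >= spec.size() || spec[afterSlashes] != '/';
}

void CheckHeuristicExamples() {
  assert(LooksOpaqueHeuristic("ssh://me@example.net"));
  assert(!LooksOpaqueHeuristic("ssh:///some/path"));
  assert(!LooksOpaqueHeuristic("https://example.net"));
}
```

Note that this is a string-level shortcut and not the spec's definition of an opaque path, so it should be read as a way to match the test expectations rather than as a model of the standard.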

So, nsStandardURL is always used for special URLs, while for non-special ones we usually use nsSimpleURI (where we don't normally add a / to the serialization).
The problem with ssh is that we defined it to be parsed with nsStandardURL because we wanted it to have a host.
We'll probably remove that once we fix bug 1603699.
