Closed Bug 548516 Opened 15 years ago Closed 1 year ago

data: URL charset detection is too brittle

Categories

(Core :: Networking, defect, P3)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED

People

(Reporter: shadow2531, Unassigned)

References

(Blocks 1 open bug, )

Details

(Keywords: student-project, Whiteboard: [necko-backlog])

Attachments

(3 files, 3 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (.NET CLR 3.5.30729) Build Identifier: Load the test URL. The resulting text/plain page should be treated as UTF-8 and you should see √ in the page. Instead, you see √ because the page isn't treated as UTF-8. This is because charset=utf-8 is ignored when unknown headers like foo=bar are present. Safari and Opera handle this fine and treat the page as UTF-8. For an example of why this might be important to fix (besides matching Opera and Safari), see <http://lists.w3.org/Archives/Public/uri/2010Feb/0058.html>. Reproducible: Always
Blocks: 144766
Interestingly, the status bar shows the check mark when you hover over the link.
The adressbar also displays the checkmark correctly.
Blocks: 532230
Michael, do you want to just give this a shot? The relevant code is at http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l190 and the right fix is probably to just use nsIMIMEHeaderParam instead of reinventing the wheel. Alternately, I'm happy to guide whoever wants to fix this through doing it; I just won't be able to get to it myself in the near future.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: student-project
This computer doesn't have the resources to build FF in a practical amount of time. However, I could recreate the parse function in C++ (while still using char* etc.), test it out, pull the latest code, adapt my code to use any Mozilla str functions and create a patch blindly for someone to try. (I'd probably produce an example JS version first though) However, I noticed that white-space handling of the parts separated by ';' differs between browsers. For example, each browser can handle each of these parts differently: "data: text/plain ; charset = utf-8 ; base4 ," Personally, I'd trim all that so that it's like: "text/plain" "charset = utf-8" -> "charset " : " utf-8" -> "charset" : "utf-8" "base64" Further, because raw spaces get resolved to %20, some browsers handle it in some parts and not others: "data:%20text/plain%20;%20charset%20=%20utf-8%20;%20base4%20," , which I would personally make give the same result as the previous example by completely percent-decoding each of the values before trimming, or at least by trimming percent-encoded white-space (no '+' to ' ' of course) I can do some more testing and ask on the URI list what's supposed to happen as far as trimming goes. > nsIMIMEHeaderParam <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src/nsMIMEHeaderParamImpl.cpp>?
> <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src > /nsMIMEHeaderParamImpl.cpp>? That's the implementation; the idea would be to just reuse that component for parsing here. The %20/space thing is a mess. :( That needs to be specced, I think.
(In reply to comment #5) > > <http://mxr.mozilla.org/mozilla-central/source/netwerk/mime/src > > /nsMIMEHeaderParamImpl.cpp>? > > That's the implementation; the idea would be to just reuse that component for > parsing here. Understood. Will look to see if I can find usage examples to see how well it'd work. > The %20/space thing is a mess. :( That needs to be specced, I think. Please review <http://lists.w3.org/Archives/Public/uri/2010Feb/0097.html> when you have a chance. The spacing issue could perhaps be as a different bug, handled later on.
(In reply to comment #7) > So you actually want me to look over > http://shadow2531.com/opera/testcases/datauri/data_uri_rules.html yes? Yes, but only if you have time and feel like it.
I did skim it and it looks ok at first glance, but the devil's obviously in the details... I haven't checked those.
(In reply to comment #9) > I did skim it and it looks ok at first glance, but the devil's obviously in the > details... I haven't checked those. Thanks. I will gather some data URI tests and how they behave in different browsers to support those rules. As for parseURI <http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l190>, I was able to recreate it locally so I can test simple fixes. I'll post when I have something worthwhile.
We have a volunteer!
Assignee: nobody → may.gup
Status: NEW → ASSIGNED
The comment I made in <http://lists.w3.org/Archives/Public/uri/2010Mar/0004.html> may be useful.
Sorry for the delay. I have been able to extract the parameters in the URI as described in the comment above (comment 12), but what am I supposed to do with the values ? Ideally it should offer to Save/Open the data, but I dunno where to go from here, please guide..
I don't follow the question. What are you trying to do with which values, and how is it related to this bug?
Any progress? Comment 13 seems to suggest that code to fix this bug (data URL parsing is too brittle) is fixed but probably not committed anywhere and this bug is missing the patch. It seems to me that comment 13 is already hinting towards an another bug (feature request) that should be opened for actually using the disposition and filename information parsed from the data URL. It definitely is an another bug and this one should be fixed first.
The problem with the test URI <data:text/plain;charset=utf-8;foo=bar,%E2%88%9A>, is that Firefox detects the charset as "utf-8;foo=bar" At <http://hg.mozilla.org/mozilla-central/file/4e1b68ecf126/netwerk/protocol/data/src/nsDataHandler.cpp#l210>, if I change: if (charset) contentCharset = charset + sizeof("charset=") - 1; to: if (charset) { char *test = (char *) strchr(charset, ';'); if (test) { *test = '\0'; } contentCharset = charset + sizeof("charset=") - 1; if (test) { *test = ';'; } } , that should fix it so the charset is just "utf-8". I have yet to actually test that though.
I needed a percent-decoding function. Found NS_UnescapeURL and used it, but I'm not sure that's correct. nsCString::Trim() supports a 4th argument that say to ignore quotes. It defaults to false, but I don't really know what ignore quotes means in this case. Still lots of testing to do.
Attachment #537204 - Flags: feedback?(bzbarsky)
Which kind of %-decoding did you want? There are several different types...
(In reply to comment #18) > Which kind of %-decoding did you want? There are several different types... Just need to convert all the %HH to the bytes they represent (and then possibly convert to utf-8). Something that works like javascript's decodeURIComponent(). For example, '+' should be left alone and not converted to a space. For example, if I was going to do this in JS, I'd use decodeURIComponent().
Yeah, then NS_UnescapeURL should work. I'll try to look at the patch in more detail on Monday; is there something you want feedback on in particular?
(In reply to comment #20) > Yeah, then NS_UnescapeURL should work. Thanks. > I'll try to look at the patch in more detail on Monday; O.K. > is there something you want feedback on in particular? Mainly want to make sure that I can't get stuck in the loop and want to make sure the char*'s I use are never out of bounds etc. Besides that, just wondering what things you require to be changed. Also, if there are existing data URI tests I can run to look for regression, I'd like to run them. And, eventually add more tests.
Also, the way I have it now, if there's more than one charset=, the last non-empty one (empty ones are currently ignored) is honored. This can be changed to allow empty ones to override the previous too. Or, it could be changed so that once a non-empty (or empty too if required) charset= is found, the rest after that are ignored. I like the current way I have it though.
(In reply to comment #22) > Also, the way I have it now, if there's more than one charset=, the last > non-empty one (empty ones are currently ignored) is honored. This can be > changed to allow empty ones to override the previous too. Or, it could be > changed so that once a non-empty (or empty too if required) charset= is > found, the rest after that are ignored. I like the current way I have it > though. Testing with Opera and Firefox, it looks like the first charset= (empty or not) gets honored and any charset= after that are ignored. So, I'll change to match that.
Attachment #537204 - Attachment is obsolete: true
Attachment #537720 - Flags: feedback?(bzbarsky)
Attachment #537204 - Flags: feedback?(bzbarsky)
(In reply to comment #23) > Testing with Opera and Firefox, it looks like the first charset= (empty or > not) gets honored and any charset= after that are ignored. So, I'll change > to match that. Safari agrees with this too. However, Chrome differs in that it honors the first non-empty one.
(In reply to comment #25) > Safari agrees with this too. However, Chrome differs in that it honors the > first non-empty one. I wouldn't loose too much sleep over handling invalid instances if they do not interoperate anyway.
(In reply to comment #26) > (In reply to comment #25) > > Safari agrees with this too. However, Chrome differs in that it honors the > > first non-empty one. > > I wouldn't loose too much sleep over handling invalid instances if they do > not interoperate anyway. Understood. It's easy to change how it's handled, so no biggie. I think doing it the Firefox, Opera, Safari way makes sense as then I'm not changing current behavior with that part.
So _is_ there a reason you're not using the mime header param code?
(In reply to comment #28) > So _is_ there a reason you're not using the mime header param code? I think re-using that code in new places is a bad idea, unless we clean it up a lot. For instance, it would be bad if the non-compliant RFC2047 handling leaks out into new places.
Well, I don't think we should be reinventing the param-parsing wheel here unless we think the data: syntax is different enough from normal param parsing that we need to special-case it.
(In reply to comment #28) > So _is_ there a reason you're not using the mime header param code? Yeh, meant to talk about that. In <http://hg.mozilla.org/mozilla-central/file/8904812b90a7/netwerk/mime/nsMIMEHeaderParamImpl.cpp>: I'm looking for a function that accepts the data URI (well the part before "," after that part is percent-decoded) that parses it like an HTTP header and produces a container (name, value) of all the resulting params. The container would be like vector<pair<string, string>> (no map, want to retain order), where the values are already unfolded (in the case of header folding), rfc2231-decoded/percent-decoded/mime-decoded, converted to UTF-8, trimmed and sanitized etc. and duplicate headers are already taken care of according to the spec. Then, in ParseURI, I'd use that function (GetParamsFromDataURIHeader() for example) to get that container and analyze it to see what name/values are there. The problem is, looking at nsMIMEHeaderParamImpl.cpp, I don't see any function like that and just see functions that look like they do bits and pieces. So, I'd have to write the function I need by using all those functions. The problem with that is I don't know anything about those functions, at all. Also, it'd probably be best to stick to rfc2231 support strictly so the old mime QP stuff doesn't have to be supported. Any idea on how I could make such a function that I need?
> and just see functions that look like they do bits and pieces mime header param gives you functions that let you say "here's a header value; what's the value of this named parameter, if it has one at all?" So you can't get a set of name/value pairs, but for any given name you can ask for the value. You have a known set of names here you need to ask about, so this should work, I'd think. Strict RFC 2231 support is a bigger sticking point; there's no way to skip rfc 2047 processing for mime header param right now without, as you say, gluing bits and pieces together. If that's the only issue, I would prefer that we just exposed an API like getParameter that only does the RFC 2231 stuff. Would that be sufficient here?
Oh, so, you're saying to keep parsing pretty much like I am now and just support stuff like "filename=utf-8''%E2%88%9A.txt" where once I find filename, I'd do getParameter("utf-8''%E2%88%9A.txt") to get the string "\xE2\x88\x9A.txt"? If so, I guess I'd want getParameter to work similar to how mimeConverter in <http://hg.mozilla.org/comm-central/file/8f6268c4dc0a/mailnews/compose/src/nsSmtpUrl.cpp#l224> works where you can also specify the encoding to convert to (just in case it's not utf-8 after it's RFC2231-decoded). Or, do you mean to really support filename*= where there could be multiple of them togther separated by "\r\n " (well, "%0D%0A%20" in the URI before the part before ',' is percent-decoded)? If that's the case, I'd need to modify the parsing to look for '=' that's not preceded by '*' etc. It doesn't seem like the header folding part is needed. > If that's the only issue, I would prefer that we just exposed an API like getParameter that only does the RFC 2231 stuff. That sounds better I think now that I know more of what you mean.
(In reply to comment #32) > ... > If that's the only issue, I would prefer that we just exposed an API like > getParameter that only does the RFC 2231 stuff. Would that be sufficient > here? ... The proposed patch for bug 610054 would probably help. It would take out quoted-pair, but we'd still carry over all other parsing quirks.
> Oh, so, you're saying to keep parsing pretty much like I am now No, I'm saying you grab the part between ':' and ',', call it "header", say, and then do things like: nsString filename; headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, nsnull, filename); After which |filename| will be the value of the filename parameter, if there was one, decoded, etc, etc and converted to UTF-16. It'll handle header folding and the like. IF you want to pass a fallback encoding you can pass something other than EmptyCString() there.
(In reply to comment #35) > > Oh, so, you're saying to keep parsing pretty much like I am now > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > and then do things like: You can't do that, because the first ";" might be part of a quoted-string. I don't believe the current code is reusable for data: URIs (or even the Link header field, which I am looking at). To make things reusable, we need to decouple tokenizing the value from decoding the individual values. And we'll either need a very generic tokenizer, or tokenizers for each type of value.
You mean the first ',', right? OK, point taken. That's really unfortunate, though. :( So I guess the question is whether it's worth writing such a generic tokenizer that would start at some char* and walk forward until it reaches end of data or a given delimiter. Seems like all we need there is a tokenizer that knows how to skip quoted stuff; it doesn't need to do the various decoding and whatnot, right?
(In reply to comment #37) > You mean the first ',', right? OK, point taken. Si. > That's really unfortunate, though. :( > > So I guess the question is whether it's worth writing such a generic > tokenizer that would start at some char* and walk forward until it reaches > end of data or a given delimiter. Seems like all we need there is a > tokenizer that knows how to skip quoted stuff; it doesn't need to do the > various decoding and whatnot, right? Having something like that would be very useful. There are still instances where parsing can't be generalized like that entirely, such as with the Link header, I believe. But that wouldn't make the generic code less useful. For header fields, a generic pattern might be something like: thing = [ name ] [ "=" ] [ value ] value = RFC2616token / RFC2616quotedstring headerfield = thing *( ";" [thing]) However, it's not clear how "forgiving" this needs to be. For instance, if we'd be forgiving and allow ";" inside RFC2616token, then we couldn't detect the delimiter in data URIs. If we do not allow "," there, we could, but then we probably would regress parsing of filenames in C-D (where people may be putting commas into non-quoted parameters).
(In reply to comment #35) > > Oh, so, you're saying to keep parsing pretty much like I am now > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > and then do things like: > > nsString filename; > headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, > nsnull, filename); > > After which |filename| will be the value of the filename parameter, if there > was one, decoded, etc, etc and converted to UTF-16. It'll handle header > folding and the like. IF you want to pass a fallback encoding you can pass > something other than EmptyCString() there. Ah, O.K. How would you search for base64 then? Just use "base64" for the second argument? What about for the content-type, how would you search for that with that function? (In reply to comment #36) > (In reply to comment #35) > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > and then do things like: > > You can't do that, because the first "," might be part of a quoted-string. But, everything between ":" and "," can (according to Opera, Safari and the code I wrote and the idea I was going with) and should be percent-encoded. So, that "," would be %2C at the time the substring between ":" and "," is gotten.
(In reply to comment #39) > (In reply to comment #35) > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > and then do things like: > > > > nsString filename; > > headerParam->GetParameter(header, "filename", EmptyCString(), PR_FALSE, > > nsnull, filename); > > > > After which |filename| will be the value of the filename parameter, if there > > was one, decoded, etc, etc and converted to UTF-16. It'll handle header > > folding and the like. IF you want to pass a fallback encoding you can pass > > something other than EmptyCString() there. > > Ah, O.K. How would you search for base64 then? Just use "base64" for the > second argument? What about for the content-type, how would you search for > that with that function? I don't think it makes sense to try that function. We need to refactor all of this to make it reusable for "data:". > (In reply to comment #36) > > (In reply to comment #35) > > > > Oh, so, you're saying to keep parsing pretty much like I am now > > > > > > No, I'm saying you grab the part between ':' and ',', call it "header", say, > > > and then do things like: > > > > You can't do that, because the first "," might be part of a quoted-string. > > But, everything between ":" and "," can (according to Opera, Safari and the > code I wrote and the idea I was going with) and should be percent-encoded. > So, that "," would be %2C at the time the substring between ":" and "," is > gotten. I can be percent-encoded, but doesn't have to. You can't rely on it.
> I can be percent-encoded, but doesn't have to. You can't rely on it. "," in a param in a data URI isn't allowed now or it'll break things. Seems like that all UAs that improved support to support "," in params could require that everything between ":" and "," is percent-encoded. Of course, it shouldn't be hard to find the first non-quoted comma anyway.
(In reply to comment #41) > > I can be percent-encoded, but doesn't have to. You can't rely on it. > > "," in a param in a data URI isn't allowed now or it'll break things. Seems > like that all UAs that improved support to support "," in params could > require that everything between ":" and "," is percent-encoded. > > Of course, it shouldn't be hard to find the first non-quoted comma anyway. Well, if we do all of this let's try to get it right :-)
(In reply to comment #41) > Of course, it shouldn't be hard to find the first non-quoted comma anyway. In a header, is there anything that closes a quoted value besides " ? In a quoted value, can " appear if it's escaped by \ ? If so, I assume that \ can appear if it's escaped by \. Can things be quoted with ' instead?
(In reply to comment #43) > (In reply to comment #41) > > Of course, it shouldn't be hard to find the first non-quoted comma anyway. > > In a header, is there anything that closes a quoted value besides " ? No. But it must be a " that isn't quoted itself. > In a quoted value, can " appear if it's escaped by \ ? Yes. > If so, I assume that \ can appear if it's escaped by \. Yes. > Can things be quoted with ' instead? No. See... http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p1-messaging-14.html#rfc.section.1.2.2.p.10
> Just use "base64" for the second argument? Yes. > What about for the content-type, how would you search for that with that > function? Use null as the second argument.
(In reply to comment #46) > > Just use "base64" for the second argument? > > Yes. > > > What about for the content-type, how would you search for that with that > > function? > > Use null as the second argument. O.K. Then if the result ends up in an nsString, I'd need to convert it to an nsCString utf-8-wise. Searching, I think that's "nsCString test = NS_ConvertUTF16toUTF8(result)". Is that correct? Given that and my function, I should have enough to try some things out.
(In reply to comment #46) > > Just use "base64" for the second argument? > > Yes. Does that mean that base64=anything will also trigger base64 handling?
(In reply to comment #48) > Does that mean that base64=anything will also trigger base64 handling? That's why I think that re-using code that doesn't do exactly what we need s dangerous...
If there's a way to differentiate between "base64" not found and base64="", then I think if (found && isempty) would work. I see it returns NS_OK. I'm assuming that only happens if a base64 param is found. Another issue though. If Firefox's original code (and what I tried to follow), only the first charset= is honored. Is that what happens in header parsers too or will it be the last duplicate that gets honored? (In reply to comment #49) > (In reply to comment #48) > > Does that mean that base64=anything will also trigger base64 handling? > > That's why I think that re-using code that doesn't do exactly what we need s > dangerous... For my original proposal, name=%HH (utf-8-based %HH) and disposition=attachment|inline is all that's needed to support getting a filename and "attachment" mode.
> nsCString test = NS_ConvertUTF16toUTF8(result) NS_ConvertUTF16toUTF8 test(result); should work. > Does that mean that base64=anything will also trigger base64 handling? Well, you can check the value... but "base64=;" might not be distinguishable from "base64;"; would need to check what's actually returned. > Is that what happens in header parsers too or will it be the last duplicate > that gets honored? The "normal" header content-type parser is.... weird. I wouldn't worry about it for the moment, unless you really care about comma-separate content-types. That code might be changing too.
Just to make my stance here clear, I would much rather switch from the current behavior to whatever (possibly suboptimal) behavior we get from using mime header param than write a bunch of custom code that's still suboptimal due to lack of a spec but needs to be maintained. Once there's a sane data: spec (e.g. one which answers your question about multiple charset params), we can think about how best to follow it.
(In reply to comment #52) > Just to make my stance here clear, I would much rather switch from the > current behavior to whatever (possibly suboptimal) behavior we get from > using mime header param than write a bunch of custom code that's still > suboptimal due to lack of a spec but needs to be maintained. > > Once there's a sane data: spec (e.g. one which answers your question about > multiple charset params), we can think about how best to follow it. I agree that a revision to RFC 2397 would be useful. That being said; it just inherits the media type syntax from somewhere else, and the answer is that multiple params with the same name are invalid. A new spec might make that clearer, but it's unlikely to recommend a specific behavior for this case.
> the answer is that multiple params with the same name are invalid. Which doesn't say what you _do_ with it. A typical problem. > but it's unlikely to recommend a specific behavior for this case. Then it's broken by design, imo.
(In reply to comment #51) > > nsCString test = NS_ConvertUTF16toUTF8(result) > > NS_ConvertUTF16toUTF8 test(result); > > should work. > > > Does that mean that base64=anything will also trigger base64 handling? > > Well, you can check the value... but "base64=;" might not be distinguishable > from "base64;"; would need to check what's actually returned. > > > Is that what happens in header parsers too or will it be the last duplicate > > that gets honored? > > The "normal" header content-type parser is.... weird. I wouldn't worry > about it for the moment, unless you really care about comma-separate > content-types. That code might be changing too. O.K. Thanks. Gonna try to play with the code.
Wow, this works surprisingly good just for a thrown-together first example. And it makes the code so so much simpler. Of course, I didn't do any lowercasing, trimming, stripchr stuff on the content type and charset yet, but that won't make it any more complex really. The "base64=" triggering base64 handling doesn't seem as bad now. Either does the handling of multiple charset=. I'll mess with it more to add the trimming and lowercasing of the content type and charset etc. later.
> The "base64=" triggering base64 handling doesn't seem as bad now. Either > does the handling of multiple charset=. In this case, it's the last charset= that gets honored. (Which matches subject header parsing in mailto URIs btw)
> And it makes the code so so much simpler. Victory! ;)
This one should behave pretty much the same as the currently released code. So, the use of getParameter() shouldn't affect compatibility really. The only difference is that "base64=" will trigger base64 handling ("base64=something" will not though) and the last charset= is honored instead of the first. For valid data URIs, I don't think this matters. I added the code for filename= and disposition= and commented it out to show how easy it is to add to ParseURI(). Of course, I have no idea how to modify the data URI loading code to take advantage of that for data URIs that are loaded directly. (I could make it moz-filename and moz-disposition for now also.) A question I have about the original code is its use of StripWhitespace() on the contentType and contentCharset strings. That function strips whites-space even in the middle of the string. It doesn't just trim leading and trailing white-space. For example, "t e x t / p l a i n" will become "text/plain" and "charset=u t f - 8" will set the charset to "utf-8". Is that really needed? It seems like the strings should just be trimmed instead and any newlines in the middle converted to spaces or removed. Also, for my FindFirstOfNotQuoted function, do you see any errors in it? Also, I just stuck it in nsDataHandler.cpp. Not sure if that'd be the best place for it or not. Also, should I be using any fallback encodings or anything for getParameter()?
Attachment #542534 - Attachment is obsolete: true
> Of course, I have no idea how to modify the data URI loading code to take > advantage of that Once bug 589292 lands that becomes very clear. > Is that really needed? Almost certainly not! > Also, for my FindFirstOfNotQuoted function, do you see any errors in it? Going to defer to Julian on this one. > Also, should I be using any fallback encodings or anything for getParameter()? Let's not and see if anyone notices. I hope we can just get away with requiring UTF-8 here.
Attachment #543063 - Attachment is obsolete: true
Attachment #537720 - Flags: feedback?(bzbarsky) → feedback-
Someone made some alternative proposals to this at <http://lists.w3.org/Archives/Public/uri/2012Mar/0000.html>.
Whiteboard: [necko-backlog]
Blocks: 1392241
Priority: -- → P1
Priority: P1 → P3
Assignee: may.gup → nobody
Status: ASSIGNED → NEW
Severity: normal → S3

The testcase now renders fine as of bug 1845006 landing, and we now align much more closely with the current spec (with further caveats being considered in bug 1845005).

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: